Data Quality and Cleaning

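“Garbage in, garbage out”: data quality sets the upper bound on model quality. The goal of data cleaning is to identify and repair problem samples that would mislead model training.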

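Data Quality Dimensions

Dimension	Description	Detection method
Completeness	Whether key fields have missing values	`df.isnull().sum()`
Accuracy	Whether labels/content are correct	Manual spot checks, cross-validation
Consistency	Whether the same entity is represented the same way across records	Compare after normalization
Uniqueness	Whether duplicate samples exist	Hash / similarity-based deduplication
Timeliness	Whether the data is out of date	Timestamp checks
Relevance	Whether the data is relevant to the task objective	Perplexity filtering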
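
Three of the dimensions above (completeness, uniqueness, timeliness) can be checked mechanically. The following is a minimal standard-library sketch, not a definitive implementation: the function `profile_records`, the field names `text`, `label`, `ts`, and the 365-day staleness cutoff are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def profile_records(records, key_fields, timestamp_field, now, max_age_days=365):
    """Count missing values, exact duplicates, and stale rows in a list of dicts."""
    missing = {f: 0 for f in key_fields}
    seen = set()
    duplicates = 0
    stale = 0
    cutoff = now - timedelta(days=max_age_days)  # hypothetical staleness rule

    for rec in records:
        # Completeness: a key field that is absent, None, or empty counts as missing
        for f in key_fields:
            if rec.get(f) in (None, ""):
                missing[f] += 1

        # Uniqueness: a row whose key-field values were already seen is a duplicate
        fingerprint = tuple(rec.get(f) for f in key_fields)
        if fingerprint in seen:
            duplicates += 1
        else:
            seen.add(fingerprint)

        # Timeliness: rows older than the cutoff count as stale
        ts = rec.get(timestamp_field)
        if ts is not None and ts < cutoff:
            stale += 1

    return {"missing": missing, "duplicates": duplicates, "stale": stale}


# Hypothetical sample data
records = [
    {"text": "good sample", "label": 1, "ts": datetime(2024, 5, 1)},
    {"text": "good sample", "label": 1, "ts": datetime(2024, 5, 2)},  # duplicate
    {"text": "", "label": 0, "ts": datetime(2020, 1, 1)},             # empty text, stale
    {"text": "no label", "label": None, "ts": datetime(2024, 4, 1)},  # missing label
]
report = profile_records(records, ["text", "label"], "ts", now=datetime(2024, 6, 1))
print(report)
```

Accuracy, consistency, and relevance usually need human review or a model, so they are left out of this sketch.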

Text Data Cleaning

Basic Cleaning Workflow

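import re
import unicodedata
from ftfy import fix_text

def clean_text(text: str) -> str | None:
    """Return the cleaned text, or None if the sample should be dropped."""
    # 1. Fix encoding problems (mojibake, bad escapes)
    text = fix_text(text)

    # 2. Unicode normalization
    text = unicodedata.normalize("NFKC", text)

    # 3. Strip control characters (tab, LF, and CR are kept)
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)

    # 4. Collapse runs of whitespace. This also turns newlines into spaces,
    #    so any line-based check has to run before this step.
    text = re.sub(r"\s+", " ", text).strip()

    # 5. Minimum-length filter (in characters)
    if len(text) < 20:
        return None

    # 6. Maximum-length filter (in characters)
    if len(text) > 100_000:
        return None

    return text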
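
To make steps 2 to 4 concrete, here is a small sketch that runs only the standard-library parts of the workflow on a hand-made string (the `ftfy` step is omitted so the snippet has no third-party dependency). The input string is invented for illustration.

```python
import re
import unicodedata

# Fullwidth "Data", an ideographic space, a circled digit one, a NUL byte,
# two newlines, and the "fi" ligature followed by "le"
raw = "\uff24\uff41\uff54\uff41\u3000\u2460\x00\n\n\ufb01le"

# Step 2: NFKC folds compatibility characters into their plain equivalents
step2 = unicodedata.normalize("NFKC", raw)

# Step 3: drop control characters such as NUL (tab, LF, and CR survive)
step3 = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", step2)

# Step 4: collapse every run of whitespace, newlines included, into one space
step4 = re.sub(r"\s+", " ", step3).strip()

print(repr(step4))  # 'Data 1 file'
```

NFKC is deliberately lossy: the circled digit becomes a plain `1` and the ligature becomes two letters. That is usually what a training corpus wants, but it is worth checking on data where such distinctions carry meaning.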

Deduplication

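import hashlib
from datasketch import MinHash, MinHashLSH

# Exact deduplication: hash each text
def exact_dedup(texts: list[str]) -> list[str]:
    seen = set()
    result = []
    for t in texts:
        # MD5 serves only as a fingerprint here, not as a security measure
        h = hashlib.md5(t.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            result.append(t)
    return result

# Fuzzy deduplication: MinHash LSH (suited to large datasets)
def fuzzy_dedup(texts: list[str], threshold: float = 0.85) -> list[str]:
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_texts = []

    for i, text in enumerate(texts):
        m = MinHash(num_perm=128)
        # Whitespace tokenization; text without spaces (e.g. Chinese)
        # needs a tokenizer or character n-grams instead
        for word in text.split():
            m.update(word.encode())

        # Keep the text only if no already-kept text is a likely near-duplicate
        if not lsh.query(m):
            lsh.insert(str(i), m)
            unique_texts.append(text)

    return unique_texts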
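
MinHash estimates the Jaccard similarity of the token sets it is fed, and the LSH `threshold` is expressed on that scale. The sketch below computes the exact Jaccard similarity with the standard library so the meaning of a value like 0.85 is easier to judge; the helper `jaccard` and the example sentences are illustrative assumptions, not part of `datasketch`.

```python
def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity of the whitespace-token sets of two texts."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the lazy cat"
doc3 = "an entirely different sentence about data cleaning"

print(round(jaccard(doc1, doc2), 3))  # 0.778: 7 shared tokens out of 9 distinct
print(jaccard(doc1, doc3))            # 0.0: no shared tokens
```

The first pair differs by a single word yet scores about 0.78, below a 0.85 threshold. On short texts one changed token moves the score a lot, so the threshold is best tuned on a sample of the real corpus.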

Quality Filtering (Using LLM Pretraining Data as an Example)

def quality_filter(text: str) -> bool:
    """Return True to keep the text."""
    words = text.split()

    # 1. Too few words
    if len(words) < 50:
        return False

    # 2. Alphabetic-character ratio too low (likely garbled or mostly code)
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    if alpha_ratio < 0.7:
        return False

    # 3. Too many repeated lines
    lines = text.splitlines()
    unique_lines = set(lines)
    if len(unique_lines) / max(len(lines), 1) < 0.7:
        return False

    # 4. Special-symbol ratio too high (ads, scrambled text)
    punct_ratio = sum(c in "!?#@$%^&*" for c in text) / len(text)
    if punct_ratio > 0.1:
        return False

    return True

Fixing Label Quality

Confident Learning

Automatically find samples whose labels are likely wrong:

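from cleanlab.classification import CleanLearning
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train: feature matrix and (possibly noisy) labels, defined elsewhere
# cleanlab detects label errors automatically
cl = CleanLearning(clf=RandomForestClassifier())
cl.fit(X_train, y_train)

# Find suspect samples
label_issues = cl.get_label_issues()
print(f"Found {label_issues['is_label_issue'].sum()} suspect labels")

# Inspect the suspect samples (assumes X_train is a NumPy array)
suspect = X_train[label_issues['is_label_issue'].to_numpy()]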
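
The core idea behind confident learning can be sketched without any library: compute a per-class confidence threshold, then flag samples where the model is confident about a class other than the given label. The code below is a simplified illustration of that idea, not cleanlab's actual implementation. The function name `flag_suspect_labels` and the probabilities are made up, and `pred_probs` is assumed to hold out-of-sample (for example cross-validated) predicted probabilities.

```python
def flag_suspect_labels(pred_probs, labels):
    """Return indices whose confidently predicted class differs from the given label.

    Assumes every class appears at least once in `labels`.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class k
    # over the samples that carry the label k
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))

    suspects = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        # Classes whose probability reaches that class's threshold
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            best = max(confident, key=lambda k: p[k])
            if best != y:
                suspects.append(i)
    return suspects


# Hypothetical out-of-sample probabilities for a two-class problem
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model strongly predicts class 1
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
labels = [0, 0, 0, 1, 1, 1]

print(flag_suspect_labels(pred_probs, labels))  # [2]
```

Using a per-class threshold rather than a fixed 0.5 cutoff keeps a class the model is systematically less sure about from being flagged wholesale.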

Majority Voting

Take the majority label across several annotators:

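from collections import Counter

def majority_vote(annotations: list[list[int | None]]) -> list[int | None]:
    """annotations[i][j] = label from annotator j for sample i (None = not annotated)"""
    result = []
    for sample_annotations in annotations:
        votes = Counter(a for a in sample_annotations if a is not None)
        # Ties go to the label seen first; a sample with no votes yields None
        # instead of raising IndexError on an empty Counter
        result.append(votes.most_common(1)[0][0] if votes else None)
    return result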
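
A majority label alone hides how contested it was. As a hedged extension of the same idea, the sketch below also returns the fraction of annotators who agreed with the winning label, so low-agreement samples can be sent for review. The function name `vote_with_agreement` and the example annotations are assumptions for illustration.

```python
from collections import Counter

def vote_with_agreement(sample_annotations):
    """Return (majority label, fraction of non-missing votes that chose it)."""
    votes = Counter(a for a in sample_annotations if a is not None)
    if not votes:
        return None, 0.0  # nobody annotated this sample
    label, count = votes.most_common(1)[0]
    return label, count / sum(votes.values())

annotations = [
    [1, 1, 1],     # unanimous
    [0, 1, 1],     # two out of three
    [0, 1, None],  # tie between the two annotators who answered
]
for row in annotations:
    print(vote_with_agreement(row))
```

In the tied row the winner is simply the label that appeared first, so an agreement of 0.5 is a signal that the sample needs adjudication, not a trustworthy label.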

Image Data Cleaning

from PIL import Image
import numpy as np

def is_valid_image(path: str) -> bool:
    try:
        # Check file integrity
        with Image.open(path) as img:
            img.verify()

        # Reopen: the image object cannot be reused after verify()
        with Image.open(path) as img:
            arr = np.array(img)
            width, height = img.size

        # Filter out all-black / all-white images
        mean = arr.mean()
        if mean < 5 or mean > 250:
            return False

        # Filter out images that are too small
        if width < 64 or height < 64:
            return False

        # Filter out extreme aspect ratios
        ratio = width / height
        if ratio > 10 or ratio < 0.1:
            return False

        return True
    except Exception:
        # Unreadable or corrupt files are treated as invalid
        return False

Data Cleaning Pipeline

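Raw data
  ↓
Format standardization (encoding repair, type conversion)
  ↓
Exact deduplication (MD5 hash)
  ↓
Fuzzy deduplication (MinHash LSH, threshold 0.85)
  ↓
Quality filtering (rule-based)
  ↓
Label review (confident learning / manual spot checks)
  ↓
Class balancing (oversampling / undersampling)
  ↓
Train/validation/test split (stratify)
  ↓
Versioned storage (DVC / S3)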
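
One way to wire such stages together is to treat each stage as a function from a list of samples to a list of samples and log how many samples survive each one. This is a minimal sketch under that assumption: `run_pipeline` and the three stand-in stages are hypothetical and much simpler than the real stages described in this document.

```python
from typing import Callable

Stage = Callable[[list[str]], list[str]]

def run_pipeline(samples: list[str], stages: list[tuple[str, Stage]]) -> list[str]:
    """Apply each stage in order and print how many samples survive it."""
    for name, stage in stages:
        before = len(samples)
        samples = stage(samples)
        print(f"{name}: {before} -> {len(samples)}")
    return samples

# Stand-in stages, for illustration only
def normalize(samples: list[str]) -> list[str]:
    return [" ".join(s.split()) for s in samples]

def dedup_exact(samples: list[str]) -> list[str]:
    return list(dict.fromkeys(samples))  # keeps the first occurrence, preserves order

def min_length(samples: list[str]) -> list[str]:
    return [s for s in samples if len(s) >= 5]

raw = ["hello   world", "hello world", "hi", "data  cleaning"]
cleaned = run_pipeline(
    raw,
    [("normalize", normalize), ("exact dedup", dedup_exact), ("length filter", min_length)],
)
print(cleaned)
```

Logging the per-stage counts makes it easy to spot a stage that drops far more data than expected, which is often the first sign of a mis-set threshold.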
