Feature Engineering


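Feature engineering turns raw tables, text, and time series into a numeric matrix that a model can consume. When combined with the data splits described in Model Evaluation, fit every transformer on the training fold only, then use it to transform the validation and test data. Fitting on anything else causes leakage, a major source of optimistically biased scores.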
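The fit-on-train, transform-on-test rule can be sketched as follows. The data here is synthetic and the variable names are illustrative assumptions, not part of the original notes. Wrapping the scaler in a pipeline lets cross-validation refit it inside each fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)  # statistics come from the training split only
X_te_s = scaler.transform(X_te)      # reuse them; never call fit on test data

print("train mean after scaling:", X_tr_s.mean(axis=0).round(3))
print("test mean after scaling:", X_te_s.mean(axis=0).round(3))

# Inside cross-validation, a pipeline refits the scaler on each training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print("fold accuracies:", scores.round(3))
```

The test-set mean after scaling is generally not exactly zero, which is expected: the scaler only knows the training statistics.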


StandardScaler / RobustScaler

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler
 
X = np.array([[100.0, 0.001], [120.0, 0.002], [900.0, 0.1]])  # the last row is an outlier in both columns
 
z = StandardScaler().fit_transform(X)
r = RobustScaler().fit_transform(X)
 
print("Standardized row0:", z[0].round(3))
print("Robust row0:", r[0].round(3))

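RobustScaler centers each column on its median and scales by the interquartile range, so it is less sensitive to outliers than StandardScaler. Tree models are usually insensitive to monotonic rescaling, while linear models, SVMs, and neural networks typically need scaled inputs.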
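To make the quantile-based scaling concrete, here is a minimal sketch that reproduces RobustScaler by hand, assuming its default settings (median centering and the 25th to 75th percentile range). The names `manual` and `auto` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[100.0, 0.001], [120.0, 0.002], [900.0, 0.1]])

# Per-column median and interquartile range (IQR)
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q3 - q1)

auto = RobustScaler().fit_transform(X)

print("manual:\n", manual.round(3))
print("matches RobustScaler:", np.allclose(manual, auto))
```

Because the median and IQR move little when a single extreme value is added, the scaled values of the ordinary rows stay in a similar range.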


Binning and Polynomial Features

from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures
 
X = [[0.1], [1.0], [2.5], [4.0], [10.0]]
 
kbd = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="quantile")
X_bin = kbd.fit_transform(X)  # one-hot bin membership, one column per bin
print(X_bin)
 
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform([[1, 2], [3, 4]])
print(X_poly)

Categorical: One-Hot + Hashing for High Cardinality

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
 
df = pd.DataFrame({"city": ["bj", "sh", "bj", "gz"], "price": [1, 2, 3, 4]})
 
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
city_oh = ohe.fit_transform(df[["city"]])
print(ohe.categories_)
 
# For high-cardinality columns, consider FeatureHasher (sparse output) or target encoding (leaks very easily; fit it inside each CV fold)
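As a rough sketch of the hashing option, the example below hashes the city column into a fixed number of columns. The `n_features=8` value and the `city=` token prefix are illustrative assumptions; a real high-cardinality column would use a much larger width to limit collisions.

```python
from sklearn.feature_extraction import FeatureHasher

cities = ["bj", "sh", "bj", "gz"]

# Each sample is a list of string tokens; the output width is fixed up front
hasher = FeatureHasher(n_features=8, input_type="string")
X_hash = hasher.transform([[f"city={c}"] for c in cities])

print(X_hash.shape)
print(X_hash.toarray())
```

FeatureHasher is stateless, so there is no vocabulary to fit and unseen categories need no special handling. The trade-off is that distinct categories can collide in the same column and the mapping cannot be inverted.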

Missing Values: SimpleImputer

import numpy as np
from sklearn.impute import SimpleImputer
 
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
 
imp = SimpleImputer(strategy="median")
print(imp.fit_transform(X))

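For categorical columns, use strategy="most_frequent", or a constant placeholder plus a missing-indicator column, so the model can learn that missingness is itself a signal.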
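Both options can be sketched briefly. The first part uses `add_indicator=True` to append a missing-indicator column to a numeric array; the second fills a categorical column with a constant placeholder. The placeholder string "missing" and the toy arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric: median imputation plus an indicator column for features that had gaps
X_num = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
num_imp = SimpleImputer(strategy="median", add_indicator=True)
X_num_out = num_imp.fit_transform(X_num)
print(X_num_out)  # two imputed columns, then one indicator column

# Categorical: a constant placeholder keeps "was missing" as its own category
X_cat = np.array([["bj"], [np.nan], ["bj"], ["sh"]], dtype=object)
cat_imp = SimpleImputer(strategy="constant", fill_value="missing")
X_cat_out = cat_imp.fit_transform(X_cat)
print(X_cat_out)
```

Only features that contained missing values during fit get an indicator column, so the output width depends on the training data.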


ColumnTransformer: Separate Paths for Numeric and Categorical Columns

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
 
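# Toy data, large enough that the training split always contains both classes
df = pd.DataFrame({
    "age": [22, 45, 30, 52, 27, 61, 35, 48],
    "city": ["A", "B", "A", "B", "A", "C", "B", "A"],
    "label": [0, 1, 0, 1, 0, 1, 1, 0],
})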
 
num_cols = ["age"]
cat_cols = ["city"]
 
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)
 
clf = Pipeline([("prep", preprocess), ("lr", LogisticRegression(max_iter=200))])
 
X = df.drop(columns=["label"])
y = df["label"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)
clf.fit(X_tr, y_tr)
print("acc:", clf.score(X_te, y_te))

Text: TF-IDF (Classic Baseline)

from sklearn.feature_extraction.text import TfidfVectorizer
 
corpus = [
    "Redis is an in-memory key-value store",
    "MongoDB stores JSON-like documents",
]
vec = TfidfVectorizer(stop_words="english", max_features=8)
X_tfidf = vec.fit_transform(corpus)
print(vec.get_feature_names_out())
print(X_tfidf.toarray().round(2))

Modern pipelines often add pretrained sentence embeddings on top (see Deep Learning).


Time Features (Illustrative)

import numpy as np
import pandas as pd
 
ts = pd.to_datetime(["2024-01-05", "2024-06-20", "2024-12-31"])
feat = pd.DataFrame({
    "month": ts.month,
    "dow": ts.dayofweek,
    "sin_m": np.sin(2 * np.pi * ts.month / 12),
    "cos_m": np.cos(2 * np.pi * ts.month / 12),
})
print(feat)

Relation to Automatic Feature Learning

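  • Tree ensembles (see Ensemble Learning) are robust to monotonic transformations; linear models and neural networks usually depend heavily on scaling.
  • Deep models can learn representations end to end, but business-rule features and compliance fields are still often added by hand.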
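The robustness of trees to monotonic transformations can be checked with a small experiment. This is a sketch on synthetic data with a single decision tree rather than an ensemble; the data, the log transform, and the depth limit are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic, strictly positive features so that log is well defined
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 100.0, size=(200, 2))
y = (X[:, 0] + 3 * X[:, 1] > 150).astype(int)

X_log = np.log(X)  # strictly increasing transform, preserves sample ordering

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_log = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_log, y)

# Splits depend only on the ordering of values, so the training data is
# partitioned the same way in both feature spaces
same = np.array_equal(tree_raw.predict(X), tree_log.predict(X_log))
print("identical predictions on the training data:", same)
```

On unseen points the two trees can still disagree very near a split, because each threshold is placed midway between neighboring training values and the midpoint differs between the raw and log scales.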

Related Links