Feature Engineering
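Feature engineering turns raw tables, text, and time series into a numeric matrix the model can consume. When combined with the splits described in Model Evaluation, fit every transformer on the training fold only, then use it to transform the validation and test data. Otherwise information leaks from the held-out data, which is the main cause of optimistically biased scores.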
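As a minimal sketch of the fit-on-train-only rule, the snippet below scales a synthetic dataset both manually and through a Pipeline inside cross-validation. The data, the sizes, and the names `X_tr_s`, `X_te_s`, and `pipe` are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely illustrative (assumption, not from the notes).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = (X[:, 0] + rng.normal(scale=5.0, size=200) > 50.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Correct: the mean and std are estimated from the training split only.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)  # test rows reuse the training statistics

# Leaky variant to avoid: StandardScaler().fit(X) would also see the test rows.

# Inside cross-validation, a Pipeline refits the scaler on each training fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
scores = cross_val_score(pipe, X, y, cv=5)
print("CV accuracy per fold:", scores.round(3))
```

Wrapping the transformer and the model in one Pipeline is usually the simplest way to keep this rule intact, because the splitter never hands held-out rows to `fit`.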
StandardScaler / RobustScaler
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler
X = np.array([[100.0, 0.001], [120.0, 0.002], [900.0, 0.1]])  # the last row is an outlier in both columns
z = StandardScaler().fit_transform(X)
r = RobustScaler().fit_transform(X)
print("Standardized row0:", z[0].round(3))
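print("Robust row0:", r[0].round(3))
RobustScaler centers each column on its median and scales by the interquartile range, so it is less sensitive to outliers than StandardScaler. Tree models are largely insensitive to monotonic rescaling, while linear models, SVMs, and neural networks usually need scaled inputs.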
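To make the quartile-based scaling concrete, here is a small sketch that recomputes it by hand and compares it with RobustScaler under its default settings (centering on the median, quantile range 25 to 75). The variable names `median`, `q1`, `q3`, `manual`, and `sk` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[100.0, 0.001], [120.0, 0.002], [900.0, 0.1]])

# Per-column median and interquartile range (IQR = Q3 - Q1).
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q3 - q1)

# RobustScaler with default parameters should give the same matrix.
sk = RobustScaler().fit_transform(X)
print("manual == RobustScaler:", np.allclose(manual, sk))
print(manual.round(3))
```

Because the median and the quartiles barely move when a single extreme row is added, the non-outlier rows keep a sensible scale, whereas the mean and standard deviation used by StandardScaler are pulled toward the outlier.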
Binning and polynomial features
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures
X = [[0.1], [1.0], [2.5], [4.0], [10.0]]
kbd = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="quantile")
X_bin = kbd.fit_transform(X)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform([[1, 2], [3, 4]])
print(X_poly)
Categorical features: one-hot encoding and hashing for high cardinality
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({"city": ["bj", "sh", "bj", "gz"], "price": [1, 2, 3, 4]})
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
city_oh = ohe.fit_transform(df[["city"]])
print(ohe.categories_)
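# For high-cardinality columns, consider FeatureHasher (sparse output) or target encoding (very leak-prone; it must be fitted inside each CV fold)
Missing values: SimpleImputer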
import numpy as np
from sklearn.impute import SimpleImputer
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
imp = SimpleImputer(strategy="median")
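print(imp.fit_transform(X))
For categorical columns, use strategy="most_frequent", or a constant placeholder plus a missing-indicator column, which lets the model learn that missingness itself can be a signal.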
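One way to obtain the missing-indicator column is the `add_indicator=True` option of SimpleImputer. The sketch below reuses the numeric array from the example above for simplicity; the name `X_out` is illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

# add_indicator=True appends one 0/1 column per feature that had
# missing values during fit (here only the first column).
imp = SimpleImputer(strategy="median", add_indicator=True)
X_out = imp.fit_transform(X)
print(X_out)
# Columns: imputed col 0, imputed col 1, "col 0 was missing" flag
```

The imputed value and the flag travel together, so a downstream model can treat a filled-in row differently from a row where the value was actually observed.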
ColumnTransformer: separate paths for numeric and categorical columns
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
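# Toy data: enough rows that both classes appear in the training split
df = pd.DataFrame({
    "age": [22, 45, 30, 51, 27, 38, 60, 24],
    "city": ["A", "B", "A", "B", "A", "B", "B", "A"],
    "label": [0, 1, 0, 1, 0, 1, 1, 0],
})
num_cols = ["age"]
cat_cols = ["city"]
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)
# The Pipeline fits the preprocessing on the training split only
clf = Pipeline([("prep", preprocess), ("lr", LogisticRegression(max_iter=200))])
X = df.drop(columns=["label"])
y = df["label"]
# stratify=y keeps both classes in the training and test splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)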
clf.fit(X_tr, y_tr)
print("acc:", clf.score(X_te, y_te))
Text: TF-IDF (a classic baseline)
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    "Redis is an in-memory key-value store",
    "MongoDB stores JSON-like documents",
]
vec = TfidfVectorizer(stop_words="english", max_features=8)
X_tfidf = vec.fit_transform(corpus)
print(vec.get_feature_names_out())
print(X_tfidf.toarray().round(2))
Modern pipelines often add pretrained sentence embeddings on top of this baseline (see Deep Learning).
Time features (illustrative)
import numpy as np
import pandas as pd
ts = pd.to_datetime(["2024-01-05", "2024-06-20", "2024-12-31"])
feat = pd.DataFrame({
    "month": ts.month,
    "dow": ts.dayofweek,
    "sin_m": np.sin(2 * np.pi * ts.month / 12),
    "cos_m": np.cos(2 * np.pi * ts.month / 12),
})
print(feat)
Versus automatic feature learning
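- Tree ensembles (see Ensemble Learning) are robust to monotonic transformations; linear models and neural networks usually require careful scaling.
- Deep models can learn representations end to end, but business-rule features and compliance fields are still often added by hand.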
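The robustness of trees to monotonic transformations can be checked with a small experiment. The sketch below is illustrative only: the grid-based synthetic data, the labeling rule, and the names `tree_raw`, `tree_log`, and `agreement` are assumptions. It fits the same decision tree on raw and log-transformed features and compares the training-set predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic, well-separated positive feature values (illustrative assumption).
rng = np.random.default_rng(0)
x1 = rng.permutation(np.linspace(0.1, 10.0, 200))
x2 = rng.permutation(np.linspace(0.1, 10.0, 200))
X = np.column_stack([x1, x2])
y = (x1 * x2 > 5.0).astype(int)

# A strictly increasing transform preserves the ordering within each feature.
X_log = np.log(X)

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_log = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_log, y)

# Splits depend only on the ordering of values, so both trees should
# partition the training rows the same way.
agreement = np.mean(tree_raw.predict(X) == tree_log.predict(X_log))
print("fraction of identical predictions:", agreement)
```

A linear model fitted on `X` and on `X_log` would generally give different decision boundaries, which is why scaling and transformation choices matter more for that family.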