Ensemble Learning
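Ensemble methods combine multiple base learners: Bagging reduces variance by averaging models trained on bootstrap samples, Boosting fits learners sequentially so that each one corrects the errors (residuals) of its predecessors, and Stacking blends base-model predictions with a meta-learner. On tabular data, RandomForest, HistGradientBoosting, and XGBoost/LightGBM/CatBoost are often strong baselines. For evaluation see Model Evaluation; for feature preprocessing see Feature Engineering.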
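As a rough illustration of the variance-reduction claim, the sketch below compares a single decision tree with a bagged ensemble of the same tree under 5-fold cross-validation. The dataset (`load_breast_cancer`) and the names `tree`, `bag`, `tree_scores`, and `bag_scores` are assumptions made for this example, not part of the original notes; the bagged model typically shows a higher mean and a smaller spread across folds, but this is not guaranteed on every dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: a bundled binary-classification dataset, chosen only for illustration.
X, y = load_breast_cancer(return_X_y=True)

# A single fully grown tree: low bias, high variance.
tree = DecisionTreeClassifier(random_state=0)

# 50 copies of the same tree, each fit on a bootstrap sample; predictions are aggregated.
# The base estimator is passed positionally, which works across sklearn versions.
bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=50,
    random_state=0,
    n_jobs=-1,
)

tree_scores = cross_val_score(tree, X, y, cv=5)
bag_scores = cross_val_score(bag, X, y, cv=5)

# Report mean accuracy and the fold-to-fold spread for each model.
print(f"single tree: mean={tree_scores.mean():.3f} std={tree_scores.std():.3f}")
print(f"bagged trees: mean={bag_scores.mean():.3f} std={bag_scores.std():.3f}")
```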
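Random Forest (Bagging + Random Feature Subsets)

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Stratified hold-out split keeps class proportions equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,    # number of trees
    max_depth=None,      # no depth limit
    min_samples_leaf=1,
    random_state=42,
    n_jobs=-1,           # use all CPU cores
)
rf.fit(X_tr, y_tr)
# Built-in round() works whether score() returns a Python float or a NumPy scalar.
print("test acc:", round(rf.score(X_te, y_te), 3))

To enable the out-of-bag (OOB) estimate, pass oob_score=True: RandomForestClassifier(..., oob_score=True).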
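# Each tree is scored on the training samples left out of its bootstrap sample.
rf_oob = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=42, n_jobs=-1
).fit(X_tr, y_tr)
print("OOB:", round(rf_oob.oob_score_, 3))

Gradient Boosted Trees (Native sklearn)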
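from sklearn.datasets import load_iris
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

hg = HistGradientBoostingClassifier(
    learning_rate=0.05,  # shrinkage applied to each boosting step
    max_iter=200,        # number of boosting iterations
    max_depth=6,
    random_state=42,
)
# Mean accuracy over 5 cross-validation folds.
print("cv acc:", cross_val_score(hg, X, y, cv=5).mean().round(3))

The HistGradientBoosting* estimators are histogram-based: they handle missing values natively and train faster on large datasets. Main tuning knobs: learning_rate, max_iter (the number of boosting iterations; for multiclass problems each iteration builds one tree per class), max_depth, l2_regularization, and min_samples_leaf.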
AdaBoost (Adaptive Boosting, Conceptual Example)

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps as weak learners
    n_estimators=80,
    learning_rate=0.8,  # shrinks the contribution of each weak learner
    random_state=42,
)
print("cv acc:", cross_val_score(ada, X, y, cv=5).mean().round(3))

Voting Ensembles (Voting)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

vclf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=300)),
        ("rf", RandomForestClassifier(n_estimators=80, random_state=42)),
        # probability=True is required so that SVC exposes predict_proba.
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)
print("cv acc:", cross_val_score(vclf, X, y, cv=5).mean().round(3))

voting="hard" takes a majority vote over the predicted labels; voting="soft" averages the predicted class probabilities, so every base classifier must implement predict_proba.
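To make the soft-voting rule concrete, the sketch below averages the base models' class probabilities by hand and checks the result against VotingClassifier. The two-model ensemble and the names `soft`, `manual_proba`, and `manual_pred` are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

soft = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=300)),
        ("rf", RandomForestClassifier(n_estimators=80, random_state=42)),
    ],
    voting="soft",
).fit(X_tr, y_tr)

# Soft voting by hand: average the per-class probabilities of the fitted
# base models (equal weights), then pick the class with the highest mean.
manual_proba = np.mean([est.predict_proba(X_te) for est in soft.estimators_], axis=0)
manual_pred = soft.classes_[np.argmax(manual_proba, axis=1)]

print("probabilities match:", np.allclose(manual_proba, soft.predict_proba(X_te)))
print("predictions match:", bool((manual_pred == soft.predict(X_te)).all()))
```

Hard voting would instead call predict on each base model and take the most frequent label per sample, which discards the confidence information that soft voting uses.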
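Stacking (Use CV to Prevent Leakage)

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=80, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=200),  # meta-learner
    cv=5,               # folds used to produce out-of-fold meta-features
    passthrough=False,  # meta-learner sees base-model predictions only
)
stack.fit(X_tr, y_tr)
# Built-in round() works whether score() returns a Python float or a NumPy scalar.
print("stack test acc:", round(stack.score(X_te, y_te), 3))

The meta-features are out-of-fold predictions on the training data: each sample is predicted by base models that did not see it during fitting, which reduces overfitting of the second-level model. passthrough=True appends the original features to the meta-features, at the cost of higher dimensionality.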
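The out-of-fold idea can be sketched by hand with cross_val_predict. This is a simplified illustration under stated assumptions, not a reproduction of StackingClassifier's internals; the names `base_models`, `oof_blocks`, `meta_tr`, `meta_te`, and `meta_model` are hypothetical, and the shuffled StratifiedKFold is a choice made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

base_models = [
    RandomForestClassifier(n_estimators=80, random_state=0),
    SVC(probability=True, random_state=0),
]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probabilities: every training row is predicted by a model
# that was fit on the other four folds, so no row leaks its own label.
oof_blocks = [
    cross_val_predict(m, X_tr, y_tr, cv=cv, method="predict_proba")
    for m in base_models
]
meta_tr = np.hstack(oof_blocks)  # shape: (n_train, n_models * n_classes)

# The meta-learner is trained on the out-of-fold meta-features only.
meta_model = LogisticRegression(max_iter=200).fit(meta_tr, y_tr)

# For inference, refit each base model on the full training set and
# build the test-time meta-features from its predictions.
meta_te = np.hstack([m.fit(X_tr, y_tr).predict_proba(X_te) for m in base_models])
print("manual stack test acc:", round(meta_model.score(meta_te, y_te), 3))
```

Had the meta-features been in-sample predictions instead, the meta-learner would see overly confident inputs during training and could overweight whichever base model memorizes the training set best.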
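Tree Ensemble Tuning Cheat Sheet
| Symptom | Things to try |
|---|---|
| Excellent on training data, poor on validation (overfitting) | Reduce max_depth, increase min_samples_leaf, subsample rows or features (subsample / max_features) |
| Underfitting | Increase the number of trees (n_estimators / max_iter), slightly increase max_depth |
| Slow training | Use HistGradientBoosting, set n_jobs=-1, shrink the hyperparameter search space |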
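One way to spot the first symptom in the table is to compare training and validation accuracy directly. The sketch below does this for an unconstrained and a more constrained random forest; the dataset, the two configurations, and the names `configs` and `gaps` are assumptions for illustration, and the size of the gap will vary by dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Assumption: a bundled dataset chosen only to make the example runnable.
X, y = load_breast_cancer(return_X_y=True)

# Two hypothetical settings: default (unconstrained) versus shallower trees
# with larger leaves, as suggested in the overfitting row of the table.
configs = {
    "unconstrained": dict(max_depth=None, min_samples_leaf=1),
    "regularized": dict(max_depth=4, min_samples_leaf=5, max_features="sqrt"),
}

gaps = {}
for name, params in configs.items():
    rf = RandomForestClassifier(
        n_estimators=100, random_state=42, n_jobs=-1, **params
    )
    # return_train_score=True also reports accuracy on each training fold.
    res = cross_validate(rf, X, y, cv=5, return_train_score=True)
    train_acc = res["train_score"].mean()
    val_acc = res["test_score"].mean()
    gaps[name] = train_acc - val_acc
    print(f"{name}: train={train_acc:.3f} val={val_acc:.3f} gap={gaps[name]:.3f}")
```

A large train/validation gap points toward the overfitting row; low scores on both point toward the underfitting row.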
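Third-Party Libraries
XGBoost, LightGBM, and CatBoost are common in production (native handling of categorical features and missing values, GPU training). Their APIs are close to sklearn's: call fit / predict directly, or plug them into an sklearn Pipeline through each library's sklearn-compatible wrapper class.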