Ensemble Learning


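Ensemble methods combine multiple base learners: Bagging reduces variance, Boosting corrects the residual errors of earlier learners step by step, and Stacking blends the base predictions with a meta-learner. On tabular data, RandomForest, HistGradientBoosting, and XGBoost/LightGBM/CatBoost are often strong baselines. For evaluation see Model Evaluation; for feature preprocessing see Feature Engineering.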
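To make the "corrects residuals step by step" idea concrete, here is a minimal hand-rolled sketch of boosting for regression. It assumes squared-error loss and synthetic data; the names learning_rate, n_rounds, and mse_history are illustrative and not part of any library API. It is not how scikit-learn implements gradient boosting internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumption: noisy sine curve).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # shrinkage applied to every correction
n_rounds = 50        # number of boosting rounds

# Start from a constant prediction (the mean), then repeatedly fit a
# shallow tree to the current residuals and add a damped correction.
pred = np.full_like(y, y.mean())
mse_history = [float(np.mean((y - pred) ** 2))]
for _ in range(n_rounds):
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    pred += learning_rate * tree.predict(X)
    mse_history.append(float(np.mean((y - pred) ** 2)))

print("train MSE before boosting:", round(mse_history[0], 4))
print("train MSE after boosting: ", round(mse_history[-1], 4))
```

The training error shrinks round by round because each tree only has to explain what the previous ones missed. A small learning_rate slows this down on purpose, which usually generalizes better.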


Random Forest (Bagging + random feature subsets)

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Stratified split keeps the class proportions equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=None,        # grow each tree until its leaves are pure
    min_samples_leaf=1,
    random_state=42,
    n_jobs=-1,             # use all CPU cores
)
rf.fit(X_tr, y_tr)
# Built-in round() works whether score() returns a Python float or a NumPy scalar.
print("test acc:", round(rf.score(X_te, y_te), 3))

Enable the out-of-bag (OOB) estimate with RandomForestClassifier(..., oob_score=True):

# Each tree is scored on the training rows left out of its bootstrap sample.
rf_oob = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=42, n_jobs=-1
).fit(X_tr, y_tr)
print("OOB:", round(rf_oob.oob_score_, 3))
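The OOB estimate works because each bootstrap sample leaves out some of the training rows. The following sketch simulates a single bootstrap draw with NumPy and measures how many rows are never drawn. The sample size and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# One bootstrap sample: draw n row indices with replacement.
boot_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn are "out of bag" for the tree trained on this sample.
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[boot_idx] = True
oob_fraction = 1.0 - in_bag.mean()

print("OOB fraction:", round(float(oob_fraction), 3))
print("1/e:         ", round(float(np.exp(-1)), 3))
```

For large n the out-of-bag fraction approaches 1/e, so roughly a third of the training rows act as a free validation set for each tree.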

Gradient-boosted trees (native scikit-learn)

from sklearn.datasets import load_iris
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
 
X, y = load_iris(return_X_y=True)
hg = HistGradientBoostingClassifier(
    learning_rate=0.05,
    max_iter=200,
    max_depth=6,
    random_state=42,
)
print("cv acc:", cross_val_score(hg, X, y, cv=5).mean().round(3))

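The HistGradientBoosting* estimators are histogram-based, support missing values natively, and train faster on large datasets. Key parameters to tune: learning_rate, max_iter (number of boosting iterations, i.e. tree count), max_depth, l2_regularization, and min_samples_leaf.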


AdaBoost (adaptive boosting, conceptual example)

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
 
X, y = load_iris(return_X_y=True)
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=80,
    learning_rate=0.8,
    random_state=42,
)
print("cv acc:", cross_val_score(ada, X, y, cv=5).mean().round(3))

Voting ensemble (Voting)

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
 
X, y = load_iris(return_X_y=True)
vclf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=300)),
        ("rf", RandomForestClassifier(n_estimators=80, random_state=42)),
        ("svc", SVC(probability=True)),
    ],
    voting="soft",
)
print("cv acc:", cross_val_score(vclf, X, y, cv=5).mean().round(3))

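hard takes a majority vote over the predicted labels; soft averages the predicted class probabilities (every base classifier must implement predict_proba).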
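The difference between the two modes can be reproduced by hand. The sketch below fits both variants and recomputes their predictions from the fitted base models. It swaps GaussianNB in for the SVC used above to keep the example light; that substitution, and the names manual_soft and manual_hard, are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
estimators = [
    ("lr", LogisticRegression(max_iter=300)),
    ("rf", RandomForestClassifier(n_estimators=80, random_state=42)),
    ("nb", GaussianNB()),
]
soft = VotingClassifier(estimators=estimators, voting="soft").fit(X, y)
hard = VotingClassifier(estimators=estimators, voting="hard").fit(X, y)

# Soft voting by hand: average the class-probability matrices, then take the argmax.
probas = np.stack([est.predict_proba(X) for est in soft.estimators_])  # (3, n, n_classes)
manual_soft = soft.classes_[probas.mean(axis=0).argmax(axis=1)]

# Hard voting by hand: each model casts one vote per sample, the majority wins.
votes = np.stack([est.predict(X) for est in hard.estimators_])  # (3, n)
n_classes = len(hard.classes_)
manual_hard = np.array([np.bincount(col, minlength=n_classes).argmax() for col in votes.T])

print("soft matches:", np.array_equal(manual_soft, soft.predict(X)))
print("hard matches:", np.array_equal(manual_hard, hard.predict(X)))
```

Soft voting lets a confident model outweigh two lukewarm ones, which is why it tends to help only when the base classifiers produce reasonably calibrated probabilities.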


Stacking (use CV to prevent leakage)

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
 
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
 
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=80, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=200),
    cv=5,
    passthrough=False,
)
stack.fit(X_tr, y_tr)
print("stack test acc:", round(stack.score(X_te, y_te), 3))

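The meta-features are out-of-fold predictions on the training data, which reduces overfitting in the second-level model. passthrough=True appends the original features to the meta-features (higher dimensionality).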
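The out-of-fold construction can be sketched by hand with cross_val_predict. This roughly mirrors what StackingClassifier does with cv=5 and passthrough=False, but it is an illustrative approximation, not the library's internal code. GaussianNB replaces the SVC from the example above, and names such as meta_tr and meta_te are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

base_models = [RandomForestClassifier(n_estimators=80, random_state=0), GaussianNB()]

# Level 1: every training row gets probabilities from a model that never saw it.
meta_tr = np.hstack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba") for m in base_models
])

# Level 2: the meta-learner is trained on those out-of-fold probabilities.
meta_learner = LogisticRegression(max_iter=200).fit(meta_tr, y_tr)

# At prediction time the base models are refit on the full training set.
fitted = [clone(m).fit(X_tr, y_tr) for m in base_models]
meta_te = np.hstack([m.predict_proba(X_te) for m in fitted])

print("meta-feature shape:", meta_tr.shape)
print("manual stack test acc:", round(meta_learner.score(meta_te, y_te), 3))
```

If the meta-learner were trained on in-sample predictions instead, it would see overly optimistic base outputs and learn to trust them too much, which is the leakage that the cv argument guards against.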


Tuning cheat sheet for tree ensembles

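Symptom | Things to try
Excellent training score, poor validation score | Reduce max_depth, increase min_samples_leaf, subsample rows or features (subsample / max_features)
Underfitting | Increase the number of trees (n_estimators / max_iter), slightly increase max_depth
Slow training | HistGradientBoosting, n_jobs=-1, a smaller search space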
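The overfitting row of the table can be explored with a small grid search that reports training and validation accuracy side by side. The grid values below are arbitrary assumptions chosen to keep the run short, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# A deliberately small search space over the regularizing parameters.
param_grid = {
    "max_depth": [2, 4, None],
    "min_samples_leaf": [1, 5, 10],
    "max_features": ["sqrt", None],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    cv=5,
    return_train_score=True,  # needed to compare train and validation scores
)
search.fit(X, y)

best = search.best_index_
print("best params:", search.best_params_)
print("train acc:", round(search.cv_results_["mean_train_score"][best], 3))
print("valid acc:", round(search.cv_results_["mean_test_score"][best], 3))
```

A large gap between the two scores points to the first row of the table; two low scores point to the underfitting row.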

Third-party libraries

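XGBoost / LightGBM / CatBoost are common in production (categorical features, missing values, GPU). Their APIs are close to scikit-learn's: call fit / predict directly, or plug them into a scikit-learn Pipeline (through the corresponding wrapper classes).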


Related links