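The Feature Engineering Toolbox: Selection, Construction, and Embedding (Statistical Feature Selection, Recursive Feature Elimination, VarianceThreshold, Feature Hashing, Embedding-Based Dimensionality Reduction) (Part 8)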



WHCIS · Published 2025-02-19 04:47:51

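1. Feature Engineering: The Big Picture

1.1 Three Levels of Feature Engineering

Feature selection: filter the key signals out of a large pool of features
Feature construction: create new features from domain knowledge
Feature embedding: map high-dimensional features into a low-dimensional space

1.2 Where the Scikit-learn Toolchain Fits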

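2. Statistical Feature Selection: SelectKBest

2.1 The Math in Depth

Chi-squared test (classification tasks):
$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$
where $O_i$ is the observed frequency and $E_i$ is the expected frequency.
Mutual information (captures nonlinear relationships):
$I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$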
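
To make the two scores concrete, here is a minimal sketch that computes the chi-squared statistic by hand and compares it with scikit-learn's `chi2`, then scores the same features with `mutual_info_classif`. The toy count matrix is made up for illustration; observed and expected frequencies follow the way `chi2` defines them (per-class feature sums versus class share times feature total).

```python
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

# Toy data (invented for illustration): 6 samples, 2 non-negative count features
X = np.array([[3, 1], [4, 0], [5, 1], [0, 1], [1, 0], [0, 1]])
y = np.array([1, 1, 1, 0, 0, 0])

# Observed frequency O: per-class sum of each feature
classes = np.unique(y)
observed = np.array([X[y == c].sum(axis=0) for c in classes])

# Expected frequency E: class share times the feature's total count
class_prob = np.array([(y == c).mean() for c in classes])
expected = np.outer(class_prob, X.sum(axis=0))

# chi^2 = sum over classes of (O - E)^2 / E, one value per feature
chi2_manual = ((observed - expected) ** 2 / expected).sum(axis=0)
chi2_sklearn, p_values = chi2(X, y)

# Mutual information, treating both features as discrete
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)

print("manual chi2 :", chi2_manual)
print("sklearn chi2:", chi2_sklearn)
print("mutual info :", mi)
```

The first feature separates the two classes while the second is distributed identically in both, so both scores rank the first feature higher.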

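2.2 Parameters and Hands-On Example

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Create a synthetic dataset with 20 features
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5)

# Keep the 10 features with the highest F-test scores
selector = SelectKBest(score_func=f_classif, k=10)
X_new = selector.fit_transform(X, y)

# Plot the feature scores; the dashed line marks the 10th-highest score (the cutoff)
plt.bar(range(20), selector.scores_)
plt.axhline(y=np.sort(selector.scores_)[-10], color='r', linestyle='--')
plt.title("Feature Scores Ranking")
plt.show()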

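2.3 Best Practices

Choose k with cross-validation:

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Keeping the selector inside the pipeline refits it on every training fold,
# so the held-out fold never influences which features are chosen
pipe = Pipeline([
    ('select', SelectKBest(f_classif)),
    ('model', LogisticRegression())
])

param_grid = {'select__k': [5, 10, 15]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)  # X, y as generated in section 2.2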

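3. Recursive Feature Elimination: RFE

3.1 Algorithm Flow

3.2 Core Parameters

Parameter	Description	Recommended value
n_features_to_select	Target number of features	Determine by cross-validation
step	Number of features removed per iteration	1, or 5% of the total number of features
estimator	Base model	Must expose coef_ or feature_importances_ after fitting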
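
As a rough illustration of how these parameters interact, the sketch below runs plain `RFE` on synthetic data. RFE fits the estimator, ranks features by `coef_` (or `feature_importances_`), drops the `step` weakest ones, and repeats until `n_features_to_select` remain. The dataset sizes and the choice of `LogisticRegression` are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, of which 5 are informative
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Remove one feature per iteration until 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=5, step=1)
rfe.fit(X, y)

selected = rfe.support_     # boolean mask of the surviving features
ranking = rfe.ranking_      # 1 = selected; larger = eliminated earlier
X_selected = rfe.transform(X)

print("selected feature indices:", selected.nonzero()[0])
print("elimination ranking:", ranking)
```

With `step=1`, each eliminated feature gets its own rank, so the ranking records the order in which features were dropped.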

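3.3 Advanced Technique: RFECV

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

estimator = RandomForestClassifier()
selector = RFECV(estimator, step=1, cv=5)
selector.fit(X, y)

print("Optimal number of features : %d" % selector.n_features_)

# grid_scores_ is gone from recent scikit-learn releases; use cv_results_ instead
mean_scores = selector.cv_results_['mean_test_score']
plt.plot(range(1, len(mean_scores) + 1), mean_scores)
plt.show()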

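4. Variance Threshold Filtering: VarianceThreshold

4.1 The Math and Choosing a Threshold

Variance formula:
$\text{Var}(X) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$

Strategies for choosing the threshold:

Plot a histogram of the feature variances
Drop features whose variance is close to 0
Rule of thumb for Boolean features: drop those that take the same value in more than 80% of samples, i.e. threshold = 0.8 × (1 − 0.8) = 0.16. Variance depends on each feature's scale, so put features on a comparable scale before applying a single threshold.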
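
The strategy above can be sketched as follows: compute each feature's variance with the formula from this section, then apply the Boolean-feature threshold p(1 − p) with p = 0.8. The four synthetic columns and their proportions are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.normal(size=n),                     # continuous, variance near 1
    (rng.random(n) < 0.5).astype(float),    # balanced binary, variance near 0.25
    (rng.random(n) < 0.02).astype(float),   # almost always 0, variance near 0.02
    np.ones(n),                             # constant, variance 0
])

# Population variance (divide by n), matching Var(X) above
variances = X.var(axis=0)

# Bernoulli variance at p = 0.8: drops binary features that are more than 80% one value
p = 0.8
threshold = p * (1 - p)

selector = VarianceThreshold(threshold=threshold).fit(X)
kept = selector.get_support()

print("variances:", variances.round(4))
print("kept     :", kept)
```

Only the continuous and the balanced binary columns survive; the rare-event and constant columns fall below the threshold.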

4.2 Hands-On: Handling Mixed Data

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold

# Generate continuous features, then append constant ones
X, _ = make_classification(n_samples=1000, n_features=5, n_informative=3, n_redundant=1)
X = np.hstack([X, np.zeros((X.shape[0], 2))])  # add two constant features

selector = VarianceThreshold(threshold=0.01)
X_new = selector.fit_transform(X)

print("Number of original features:", X.shape[1])
print("Number of features after filtering:", X_new.shape[1])

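5. Feature Hashing: FeatureHasher

5.1 How the Hashing Trick Works

Hash function properties:

Deterministic: the same input always yields the same output
Uniform: outputs are evenly distributed
Collision-resistant: different inputs should produce different outputs as far as possible

Size of the hash space:
$\text{feature dimension} = 2^{n} \times \text{number of hash tables}$
where $n$ is the number of hash bits per table. scikit-learn's FeatureHasher uses a single table, so its output dimension is simply n_features.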
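
To show the mechanics, here is a from-scratch sketch of the hashing trick. It is not how `FeatureHasher` is implemented (scikit-learn uses MurmurHash3); the helper name `hash_features`, the `key=value` token format, and the use of MD5 are assumptions picked for readability and determinism.

```python
import hashlib
import numpy as np

def hash_features(record, n_features=8):
    """Map a dict of categorical fields to a fixed-length vector (illustrative only)."""
    vec = np.zeros(n_features)
    for key, value in record.items():
        token = f"{key}={value}".encode("utf-8")
        # Take 64 bits of an MD5 digest as a deterministic hash value
        digest = int.from_bytes(hashlib.md5(token).digest()[:8], "big")
        index = digest % n_features                       # column for this token
        sign = 1.0 if (digest >> 63) & 1 == 0 else -1.0   # top bit picks the sign
        vec[index] += sign
    return vec

record = {"user": "A", "action": "click", "timestamp": "morning"}
v1 = hash_features(record)
v2 = hash_features(record)

print(v1)
```

No vocabulary is stored anywhere: the column for a token depends only on the token itself, which is why the same record always maps to the same vector.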

5.2 Advanced Use: Processing Streaming Data

from sklearn.feature_extraction import FeatureHasher

# Simulate a real-time data stream
data_stream = [
    {'user': 'A', 'action': 'click', 'timestamp': 'morning'},
    {'user': 'B', 'action': 'purchase', 'timestamp': 'afternoon'},
    {'user': 'C', 'action': 'view', 'timestamp': 'night'}
]

# FeatureHasher is stateless, so transform() works on each new batch without fitting
hasher = FeatureHasher(n_features=8, input_type='dict')
X = hasher.transform(data_stream)

print("Hashed matrix example:\n", X.toarray())

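5.3 Strategies for Handling Collisions

Signed hashing: hash each feature name with two functions, one choosing the column index and one choosing the sign (±1), so that colliding features tend to cancel out rather than accumulate
Multiple hash tables: combine several hash functions
Bloom filter: detect features that are likely to collide ahead of time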
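
A quick way to see how table size and signed hashing affect collisions is to hash a set of distinct tokens and count how many distinct columns they land in. The sketch below uses `FeatureHasher`'s `alternate_sign` option for the signed variant; the 200 synthetic tokens and the two table sizes are arbitrary choices for illustration.

```python
from sklearn.feature_extraction import FeatureHasher

# 200 distinct tokens, one per row, so every row has exactly one non-zero entry
rows = [{f"user_{i}": 1} for i in range(200)]

def hashed(n_features, alternate_sign=True):
    hasher = FeatureHasher(n_features=n_features, input_type="dict",
                           alternate_sign=alternate_sign)
    return hasher.transform(rows)

def occupied_columns(X):
    # Number of distinct columns that received at least one token
    return len(set(X.nonzero()[1]))

small = occupied_columns(hashed(2**6))    # 64 columns: collisions are unavoidable
large = occupied_columns(hashed(2**12))   # 4096 columns: collisions become rare

# Without signs every entry is +1; with signs the entries mix +1 and -1
unsigned_total = hashed(2**6, alternate_sign=False).sum()
signed_total = hashed(2**6, alternate_sign=True).sum()

print("distinct columns used:", small, "vs", large)
print("sum of entries, unsigned vs signed:", unsigned_total, signed_total)
```

Growing the table is the most direct remedy; signed hashing does not prevent collisions but keeps colliding counts from piling up in one direction.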

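6. Truncated SVD: TruncatedSVD

6.1 Mathematical Foundations

Given a matrix $A_{m \times n}$, its singular value decomposition is:
$A = U \Sigma V^T$
where $\Sigma$ is the diagonal matrix of singular values, and the columns of $U$ and $V$ are the left and right singular vectors.

Truncation:
$A_k = U_k \Sigma_k V_k^T$
Keep only the k largest singular values.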
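
The truncation formula can be checked numerically with plain NumPy. The sketch below builds $A_k$ from the top k singular triplets of a small random matrix (the matrix shape and k = 2 are arbitrary) and confirms that the Frobenius reconstruction error equals the root of the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

# Thin SVD: A = U @ diag(s) @ Vt, with s sorted in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and their vectors
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The reconstruction error comes entirely from the discarded singular values
error = np.linalg.norm(A - A_k, ord="fro")
expected_error = np.sqrt(np.sum(s[k:] ** 2))

print("singular values:", s.round(3))
print("error:", error, "expected:", expected_error)
```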

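6.2 Comparison with PCA

Property	TruncatedSVD	PCA
Input matrix	Accepts sparse matrices	Designed for dense matrices
Centering	None	Centers the data automatically
Computation	Decomposes the data matrix directly	Equivalent to decomposing the covariance matrix
Memory footprint	Lower	Higher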
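
The centering row is the key difference, and it can be verified directly: running `TruncatedSVD` on manually centered data should give the same singular values as `PCA` on the raw data. This is a minimal sketch on random data with a non-zero mean; the shapes, the mean of 5, and the solver choices are assumptions for the demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(200, 10))   # data with a clearly non-zero mean
X_centered = X - X.mean(axis=0)

# PCA centers internally; TruncatedSVD works on whatever it is given
pca = PCA(n_components=3, svd_solver="full").fit(X)
svd_centered = TruncatedSVD(n_components=3, algorithm="arpack").fit(X_centered)
svd_raw = TruncatedSVD(n_components=3, algorithm="arpack").fit(X)

print("PCA             :", pca.singular_values_.round(3))
print("SVD (centered)  :", svd_centered.singular_values_.round(3))
print("SVD (uncentered):", svd_raw.singular_values_.round(3))
```

On uncentered data the first singular value is dominated by the mean, which is why the two methods can disagree on data that has not been centered.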

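6.3 Hands-On: Reducing Text Vectors

import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Build a small text corpus
corpus = [
    'Natural language processing is an important direction of artificial intelligence',
    'Deep learning has driven the development of computer vision',
    'Feature engineering determines the upper limit of a model'
]

# Generate TF-IDF vectors (the default tokenizer splits on word boundaries,
# so unsegmented Chinese text would need a custom tokenizer)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Reduce to a 2-dimensional space
svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X)

# Visualize the documents
plt.scatter(X_reduced[:,0], X_reduced[:,1])
for i, txt in enumerate(corpus):
    plt.annotate(txt[:10], (X_reduced[i,0], X_reduced[i,1]))
plt.show()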

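7. End-to-End Case Study: E-Commerce User Behavior Analysis

7.1 Data Processing Flow

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.pipeline import Pipeline

processing_pipe = Pipeline([
    ('hashing', FeatureHasher(n_features=2**18)),
    ('variance', VarianceThreshold(threshold=0.001)),
    ('svd', TruncatedSVD(n_components=100)),
    ('selection', SelectKBest(k=50))
])

# raw_data: list of per-event dicts; labels: the target (assumed name), which SelectKBest requires
X_processed = processing_pipe.fit_transform(raw_data, labels)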
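
For readers who want to run the flow end to end, here is a scaled-down sketch on synthetic events. Everything about the data is invented: the field names, the 300 events, and the label (whether the event is a purchase, which is trivially derivable from the input and only serves to give `SelectKBest` a target). The step sizes are shrunk to fit the toy data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
users = [f"u{i}" for i in range(30)]
actions = ["click", "view", "purchase"]
periods = ["morning", "afternoon", "night"]

# Synthetic event log: one dict of categorical fields per event
raw_data = [
    {"user": str(rng.choice(users)),
     "action": str(rng.choice(actions)),
     "period": str(rng.choice(periods))}
    for _ in range(300)
]
# Toy target, invented for illustration
labels = np.array([int(r["action"] == "purchase") for r in raw_data])

processing_pipe = Pipeline([
    ("hashing", FeatureHasher(n_features=2**8, input_type="dict")),
    ("variance", VarianceThreshold(threshold=0.001)),   # drops never-used hash columns
    ("svd", TruncatedSVD(n_components=10, random_state=0)),
    ("selection", SelectKBest(f_classif, k=5)),
])

# The supervised selection step needs the labels
X_processed = processing_pipe.fit_transform(raw_data, labels)
print(X_processed.shape)
```

The order matters: hashing produces a sparse matrix, the variance filter and `TruncatedSVD` both accept sparse input, and only the final dense, low-dimensional output reaches the supervised selector.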

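7.2 Performance Optimization Tips

Memory management:
- Store data in scipy.sparse formats
- Set dtype=np.float32 to halve memory use relative to float64, at the cost of some precision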
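
The effect of both tips can be measured directly. The sketch below compares the footprint of a mostly-zero dense array with its CSR equivalent, then with a float32 copy. The matrix shape and the roughly 1% density are arbitrary assumptions, and `csr_bytes` is a helper defined here, not a SciPy function.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# A 1000 x 500 matrix with at most 5000 non-zero entries (about 1% density)
dense = np.zeros((1000, 500), dtype=np.float64)
rows = rng.integers(0, 1000, size=5000)
cols = rng.integers(0, 500, size=5000)
dense[rows, cols] = 1.0

csr64 = sparse.csr_matrix(dense)
csr32 = csr64.astype(np.float32)

def csr_bytes(m):
    """Bytes used by the three arrays that make up a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

print("dense float64:", dense.nbytes, "bytes")
print("CSR float64  :", csr_bytes(csr64), "bytes")
print("CSR float32  :", csr_bytes(csr32), "bytes")
```

Note that float32 only shrinks the value array; the index arrays of a sparse matrix stay the same size.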

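Parallel computation:

from joblib import parallel_backend

# The context only affects steps that parallelize through joblib (estimators with n_jobs)
# large_data / large_labels: assumed names for the full dataset and its target
with parallel_backend('loky', n_jobs=4):
    processing_pipe.fit(large_data, large_labels)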

8. Tool Selection Decision Tree

9. Frequently Asked Questions

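Q1: How do you handle a mix of categorical and numeric features?

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# numeric_features / categorical_features: lists of DataFrame column names
# FeatureHasher expects one dict per row, so convert the categorical columns first
to_records = FunctionTransformer(lambda df: df.to_dict(orient='records'))
preprocessor = ColumnTransformer([
    ('num', VarianceThreshold(), numeric_features),
    ('cat', make_pipeline(to_records, FeatureHasher()), categorical_features)
])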

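Q2: How does feature selection relate to regularization?

Feature selection: a hard filter (features are either kept or dropped)
Regularization: a soft constraint (coefficients are shrunk)
Recommended combination: run a coarse feature selection first, then use L1 regularization for fine-grained pruning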
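
One possible way to wire up the recommended combination is shown below: a univariate filter keeps 20 of 50 features, then an L1-penalized logistic regression zeroes out some of the survivors. The dataset, k = 20, and C = 0.05 are assumptions for this sketch; in practice both would be tuned by cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

pipe = Pipeline([
    ("coarse", SelectKBest(f_classif, k=20)),     # hard filter: 50 -> 20 features
    ("scale", StandardScaler()),                  # L1 penalties are scale-sensitive
    ("l1", LogisticRegression(penalty="l1", solver="liblinear", C=0.05)),
])
pipe.fit(X, y)

coef = pipe.named_steps["l1"].coef_.ravel()
n_kept_by_filter = int(pipe.named_steps["coarse"].get_support().sum())
n_nonzero = int(np.count_nonzero(coef))

print("features after the coarse filter:", n_kept_by_filter)
print("non-zero L1 coefficients:", n_nonzero)
```

A smaller C strengthens the penalty and drives more coefficients to exactly zero.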

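10. Further Learning Resources

Deepening the theory:
- "Machine Learning" (《机器学习》) by Zhou Zhihua: Chapter 11, Feature Selection and Sparse Learning
- "Deep Learning" (《深度学习》): Chapter 9, the applied machine learning workflow
Hands-on practice:
- Kaggle competitions focused on feature engineering
- The example gallery in the official Scikit-learn documentation
Tool extensions:
- Feature-engine: a library dedicated to feature engineering
- Category Encoders: advanced categorical encoding methods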

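from sklearn import compose, ensemble
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import RFECV, VarianceThreshold
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer

# FeatureHasher expects one dict per row, so convert the selected column(s) first
to_records = FunctionTransformer(lambda df: df.to_dict(orient='records'))

preprocess = compose.make_column_transformer(
    (make_pipeline(to_records, FeatureHasher(n_features=16)), ['category_feature']),
    remainder=VarianceThreshold(0.1)
)

final_pipe = Pipeline([
    ('preprocess', preprocess),
    ('dim_reduce', TruncatedSVD(50)),  # needs at least 50 columns after preprocessing
    ('select', RFECV(ensemble.RandomForestClassifier())),
    ('model', ensemble.GradientBoostingClassifier())
])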
