Lasso Regression and Ridge Regression
Machine Learning
We all know Occam's Razor:
From a set of solutions, take the one that is the simplest.
This principle is applied in the regularization of linear models in Machine Learning. L1 regularization (also known as LASSO) tends to shrink the weights of a linear model to 0, while L2 regularization (known as Ridge) tends to keep the overall complexity as low as possible by minimizing the norm of the model's weight vector. One of Kydavra's selectors uses Lasso to select the best features. So let's see how to apply it.
Using Kydavra LassoSelector
If you still haven't installed Kydavra, just type the following in the command line:
pip install kydavra
Next, we need to import the model, create the selector, and apply it to our data:
from kydavra import LassoSelector

selector = LassoSelector()
selected_cols = selector.select(df, 'target')
The select function takes as parameters the pandas data frame and the name of the target column. It also has a default parameter 'cv' (set to 5 by default), which represents the number of folds used in cross-validation. LassoSelector() takes the following parameters:
- alpha_start (float, default = 0): the starting value of alpha.
- alpha_finish (float, default = 2): the final value of alpha. These two parameters define the search space of the algorithm.
- n_alphas (int, default = 300): the number of alphas that will be tested during the search.
- extend_step (int, default = 20): if the algorithm deduces that the optimal value of alpha is alpha_start or alpha_finish, it extends the search range by extend_step, ensuring that it does not get stuck and eventually finds the optimal value.
- power (int, default = 2): used in the formula 10^-power, which defines the maximal absolute value of a weight still treated as 0.
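The search behavior these parameters describe can be sketched as follows. This is an illustration of the described mechanics, not Kydavra's actual implementation; in particular, the exact rule for extending the range (and clipping alpha at 0) is an assumption:

```python
import numpy as np

def alpha_grid(alpha_start=0.0, alpha_finish=2.0, n_alphas=300):
    """Candidate alphas tested during the search."""
    return np.linspace(alpha_start, alpha_finish, n_alphas)

def extend_if_on_boundary(best_alpha, alpha_start, alpha_finish, extend_step=20):
    """If the optimum lands on an edge of the range, widen the range
    so the search does not get stuck on a boundary (assumed rule)."""
    if best_alpha == alpha_finish:
        alpha_finish += extend_step
    elif best_alpha == alpha_start:
        alpha_start = max(0.0, alpha_start - extend_step)
    return alpha_start, alpha_finish

alphas = alpha_grid()
print(len(alphas), alphas[0], alphas[-1])    # 300 candidates from 0.0 to 2.0
print(extend_if_on_boundary(2.0, 0.0, 2.0))  # optimum on the edge: range widened
```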
So, after finding the optimal value of alpha, the algorithm simply keeps the features whose weights are higher than 10^-power.
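The cut-off step amounts to a simple threshold on the absolute weights. A minimal sketch with illustrative, made-up coefficients:

```python
import numpy as np

power = 2                 # the default from LassoSelector above
threshold = 10 ** -power  # weights with absolute value below this count as 0

# Hypothetical Lasso weights for four features (not real fitted values)
coefs = np.array([0.5, 0.004, -0.2, 0.0])
features = np.array(['Total Volume', 'type', 'year', '4046'])

selected = features[np.abs(coefs) > threshold]
print(list(selected))  # ['Total Volume', 'year']
```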
Let's see an example:
To show its performance I chose the Avocado Prices dataset.
After a bit of cleaning, we train a model on the following features:
'Total Volume', '4046', '4225', '4770', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type', 'year'
The LinearRegression model has a mean absolute error of 0.2409683103736682.
When LassoSelector is applied to this dataset, it chooses the following features:
'type', 'year'
Using only these features, we get an MAE = 0.24518692823037008.
Quite a good result (keep in mind that we are using only 2 features).
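The before/after comparison can be reproduced with scikit-learn. Since the Avocado Prices dataset is not bundled here, the sketch below uses synthetic data as a stand-in, so the exact numbers will differ:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 9 features, only 2 actually informative, mirroring
# the situation where LassoSelector keeps just 'type' and 'year'.
X, y = make_regression(n_samples=500, n_features=9, n_informative=2,
                       noise=10.0, shuffle=False, random_state=42)

def mean_mae(features):
    """5-fold cross-validated mean absolute error of a linear model."""
    scores = cross_val_score(LinearRegression(), features, y,
                             scoring='neg_mean_absolute_error', cv=5)
    return -scores.mean()

mae_all = mean_mae(X)         # all 9 features
mae_two = mean_mae(X[:, :2])  # only the 2 informative ones
print(round(mae_all, 3), round(mae_two, 3))
```

As in the article, the two-feature model stays close to the full model because the dropped features carry almost no signal.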
Note: it is sometimes recommended to apply Lasso on scaled data. In this case, applied to the scaled data, the selector didn't throw away any features. You are invited to experiment with both scaled and unscaled values.
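Why scaling matters here: the 10^-power cut-off compares raw weight magnitudes, and a feature measured on a huge scale gets a tiny weight even when it is informative. A sketch using scikit-learn's Lasso directly (synthetic data, not the avocado dataset):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
# Two informative features on very different scales, plus pure noise.
X = np.column_stack([
    rng.normal(0.0, 1e4, n),  # huge scale (think 'Total Volume')
    rng.normal(0.0, 1.0, n),  # unit scale
    rng.normal(0.0, 1.0, n),  # uninformative noise
])
y = 1e-4 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0.0, 0.1, n)

threshold = 10 ** -2  # the 10^-power cut-off described above

coef_raw = Lasso(alpha=0.2, max_iter=10_000).fit(X, y).coef_
coef_scaled = Lasso(alpha=0.2, max_iter=10_000).fit(
    StandardScaler().fit_transform(X), y).coef_

# On raw data the first feature's weight (~1e-4) falls below the cut-off
# even though it is informative; after scaling it easily clears it.
print(np.abs(coef_raw) > threshold)     # [False  True False]
print(np.abs(coef_scaled) > threshold)  # [ True  True False]
```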
Bonus
This module also has a plotting function. After applying the select function, you can see why the selector kept some features and not others. To plot, just type:
selector.plot_process()
The dotted lines are features that were thrown away because their weights were too close to 0. The central vertical dotted line marks the optimal value of alpha found by the algorithm.
The plot_process() function has the following parameters:
- eps (float, default = 5e-3): the length of the path.
- title (string, default = 'Lasso coef Plot'): the title of the plot.
- save (boolean, default = False): if set to True, it will try to save the plot.
- file_path (string, default = None): if the save parameter was set to True, the plot will be saved using this path.
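If you want to see what such a plot looks like without kydavra at hand, a similar coefficient-path figure can be sketched with scikit-learn and matplotlib. This only mimics plot_process(); the threshold and the "optimal alpha" line here are illustrative placeholders:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen and save to a file
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       noise=5.0, shuffle=False, random_state=0)
# eps plays the same "length of the path" role as in plot_process()
alphas, coefs, _ = lasso_path(X, y, n_alphas=100, eps=5e-3)

threshold = 10 ** -2                   # the 10^-power cut-off from above
best_alpha = alphas[len(alphas) // 2]  # placeholder for the found optimum

for i, path in enumerate(coefs):
    kept = abs(path[-1]) > threshold  # weight at the least-penalized end
    plt.plot(alphas, path, linestyle='-' if kept else ':',
             label=f'feature {i}')
plt.axvline(best_alpha, color='k', linestyle=':')
plt.xlabel('alpha'); plt.ylabel('coefficient')
plt.title('Lasso coef Plot'); plt.legend()
plt.savefig('lasso_coef_plot.png')
```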
Conclusion
LassoSelector is a selector that uses the LASSO algorithm to select the most useful features. Sometimes it is useful to scale the features first, so we highly recommend trying both ways.

If you tried kydavra we invite you to share your impression by filling out this form.
Made with ❤ by Sigmoid.
Useful links: