Machine Learning: KNN (K-Nearest Neighbor)

KNN: Implementing the KNN Model and Evaluation Metrics (in Python)

@硬train一发 · Published 2023-03-12 22:49:43

1. The KNN (K-Nearest Neighbor) Method

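The core idea of the KNN algorithm is this: if most of the K samples nearest to a given sample in feature space belong to one class, then that sample is assigned to the same class and is assumed to share the characteristics of the samples in that class. The classification decision depends only on the classes of the one or few nearest samples, so each decision involves only a very small number of neighboring samples. Because KNN relies on a limited set of nearby samples rather than on discriminating between class regions, it tends to be better suited than region-based methods to sample sets whose class regions intersect or overlap heavily.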

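k-nearest neighbor (kNN) learning is a commonly used supervised learning method. It works as follows: given a test sample, find the k training samples closest to it under some distance metric, then make a prediction based on the information from these k "neighbors".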
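
The mechanism described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation given later in this article: the function name `knn_predict`, the toy data, and the choice of Euclidean distance are all assumptions made for the example.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with that point's label,
    # then sort so the closest points come first.
    by_distance = sorted(
        (dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and count them.
    k_labels = [label for _, label in by_distance[:k]]
    return Counter(k_labels).most_common(1)[0][0]


# Hypothetical toy data: two small clusters labelled "a" and "b".
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["a", "a", "a", "b", "b", "b"]

prediction = knn_predict(points, labels, query=(6.2, 6.4), k=3)
print(prediction)  # prints "b": the three nearest points all belong to cluster "b"
```

Note that there is no training step beyond storing the data; all of the work happens at prediction time.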

An example
The idea behind the KNN algorithm (figure from https://www.codenong.com/cs106094018/)

2. Code Implementation

2.1. metrics.py: metrics for measuring model performance
Covers both classification and regression metrics.
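
For reference, the regression metrics defined in this file match the standard formulas below. The notation is introduced here for illustration and does not appear in the code: $n$ is the number of samples, $y_i$ a true value, $\hat{y}_i$ the corresponding prediction, and $\bar{y}$ the mean of the true values.

```latex
\begin{aligned}
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right| \\
R^2 &= 1-\frac{\mathrm{MSE}}{\mathrm{Var}(y)}
     = 1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}
\end{aligned}
```

The two forms of $R^2$ agree because `np.var` computes the population variance by default (dividing by $n$), so the factor $1/n$ cancels.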

import numpy as np
from math import sqrt


# Classification accuracy
def accuracy_score(y_true, y_predict):
    """Compute the accuracy between y_true (y_test) and y_predict."""

    assert y_true.shape[0] == y_predict.shape[0], \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum(y_true == y_predict) / len(y_true)


# The next four functions are evaluation metrics for regression models
def mean_squared_error(y_true, y_predict):
    """Compute the MSE between y_true and y_predict."""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum((y_true - y_predict) ** 2) / len(y_true)


def root_mean_squared_error(y_true, y_predict):
    """Compute the RMSE between y_true and y_predict."""
    return sqrt(mean_squared_error(y_true, y_predict))


def mean_absolute_error(y_true, y_predict):
    """Compute the MAE between y_true and y_predict."""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum(np.absolute(y_true - y_predict)) / len(y_true)


def r2_score(y_true, y_predict):
    """Compute the R squared score between y_true and y_predict."""
    return 1 - mean_squared_error(y_true, y_predict) / np.var(y_true)


# Metrics for evaluating binary classification (labels are 0 and 1)
def TN(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 0) & (y_predict == 0))


def FP(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 0) & (y_predict == 1))


def FN(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 1) & (y_predict == 0))


def TP(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    return np.sum((y_true == 1) & (y_predict == 1))


def confusion_matrix(y_true, y_predict):
    return np.array([
        [TN(y_true, y_predict), FP(y_true, y_predict)],
        [FN(y_true, y_predict), TP(y_true, y_predict)]
    ])


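def precision_score(y_true, y_predict):
    """Precision = TP / (TP + FP); 0.0 if nothing was predicted positive."""
    tp = TP(y_true, y_predict)
    fp = FP(y_true, y_predict)
    # By default NumPy integers divided by zero give nan or inf with a warning
    # rather than raising, so check the denominator instead of using try/except.
    if tp + fp == 0:
        return 0.0
    return tp / (tp + fp)

def recall_score(y_true, y_predict):
    """Recall = TP / (TP + FN); 0.0 if there are no positive samples."""
    tp = TP(y_true, y_predict)
    fn = FN(y_true, y_predict)
    if tp + fn == 0:
        return 0.0
    return tp / (tp + fn)

def f1_score(y_true, y_predict):
    """F1 = harmonic mean of precision and recall; 0.0 if both are zero."""
    precision = precision_score(y_true, y_predict)
    recall = recall_score(y_true, y_predict)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def TPR(y_true, y_predict):
    """True positive rate = TP / (TP + FN); identical to recall."""
    tp = TP(y_true, y_predict)
    fn = FN(y_true, y_predict)
    if tp + fn == 0:
        return 0.0
    return tp / (tp + fn)

def FPR(y_true, y_predict):
    """False positive rate = FP / (FP + TN); 0.0 if there are no negative samples."""
    fp = FP(y_true, y_predict)
    tn = TN(y_true, y_predict)
    if fp + tn == 0:
        return 0.0
    return fp / (fp + tn)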

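The code above is a set of functions for evaluating machine learning models, covering metrics for both regression and classification. Each function does the following:

accuracy_score(y_true, y_predict): computes the accuracy of a classification model.
mean_squared_error(y_true, y_predict): computes the mean squared error of a regression model.
root_mean_squared_error(y_true, y_predict): computes the root mean squared error of a regression model.
mean_absolute_error(y_true, y_predict): computes the mean absolute error of a regression model.
r2_score(y_true, y_predict): computes the R² score of a regression model.
TN(y_true, y_predict): counts the true negatives of a binary classifier.
FP(y_true, y_predict): counts the false positives of a binary classifier.
FN(y_true, y_predict): counts the false negatives of a binary classifier.
TP(y_true, y_predict): counts the true positives of a binary classifier.
confusion_matrix(y_true, y_predict): builds the confusion matrix of a binary classifier.
precision_score(y_true, y_predict): computes the precision of a binary classifier.
recall_score(y_true, y_predict): computes the recall of a binary classifier.
f1_score(y_true, y_predict): computes the F1 score of a binary classifier.
TPR(y_true, y_predict): computes the true positive rate of a binary classifier.
FPR(y_true, y_predict): computes the false positive rate of a binary classifier.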
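
To make the binary classification metrics concrete, here is a small worked example in plain Python. It is a sketch that mirrors the definitions in metrics.py rather than calling them, and the ten labels are hypothetical values chosen so the counts are easy to check by hand.

```python
# Hypothetical labels for ten samples (1 = positive, 0 = negative).
y_true    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_predict = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

pairs = list(zip(y_true, y_predict))

# The four cells of the confusion matrix.
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # 4
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # 2
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # 1
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # 3

# Same layout as confusion_matrix(): rows are true labels, columns are predictions.
matrix = [[tn, fp],
          [fn, tp]]

accuracy = (tp + tn) / len(pairs)                    # 7 / 10 = 0.7
precision = tp / (tp + fp)                           # 3 / 5  = 0.6
recall = tp / (tp + fn)                              # 3 / 4  = 0.75 (also the TPR)
f1 = 2 * precision * recall / (precision + recall)   # 0.9 / 1.35, about 0.667
fpr = fp / (fp + tn)                                 # 2 / 6, about 0.333

print(matrix)
print(accuracy, precision, recall, round(f1, 3), round(fpr, 3))
```

Accuracy alone hides the difference between the two kinds of error here: the model misses one of four positives but also raises two false alarms among six negatives, which is what precision, recall, and FPR separate out.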

2.2. KNN.py: wrapping the code in the style of the sklearn library

import numpy as np
from math import sqrt
from collections import Counter
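# Assumes metrics.py sits next to this file and main.py is run from the same
# directory; use `from .metrics import accuracy_score` if both live in a package.
from metrics import accuracy_score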

class KNNClassifier:

    def __init__(self, k):
        """Initialize the kNN classifier."""
        assert k >= 1, "k must be valid"
        self.k = k
        # A leading underscore marks these attributes as private by convention
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        """Fit the kNN classifier on the training set X_train and y_train."""

        assert X_train.shape[0] == y_train.shape[0], "the size of X_train must be equal to the size of y_train"
        assert self.k <= X_train.shape[0], "the size of X_train must be at least k"
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        """Given the samples X_predict, return the vector of predicted labels."""

        assert self._X_train is not None and self._y_train is not None,\
            "must fit before predict!"
        assert X_predict.shape[1] == self._X_train.shape[1],\
            "the feature number of X_predict must be equal to X_train"  # must have the same number of features as the training set

        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        """Given a single sample x, return its predicted label."""
        assert x.shape[0] == self._X_train.shape[1],\
            "the feature number of x must be equal to X_train"

        distances = [sqrt(np.sum((x - x_train) ** 2)) for x_train in self._X_train]
        nearest = np.argsort(distances)

        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)
        return votes.most_common(1)[0][0]

    def score(self, X_test, y_test):
        """Compute the accuracy of the current model on the test set X_test, y_test."""
        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "KNN(k=%d)" % self.k

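The code above implements a k-nearest neighbor classifier (KNNClassifier), in which:

The constructor takes the value of k, the number of nearest neighbors the model consults.
The fit() method "trains" on the training set, taking the training data X_train and the corresponding labels y_train. For kNN this simply stores the data.
The predict() method predicts labels for a new set of samples X_predict and returns the result vector.
The score() method predicts on a test set and computes the classifier's accuracy.
The _predict() method is a helper for predict() that predicts the label of a single sample. Finally, the __repr__() method returns a string describing the classifier's k value.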
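
One detail of the voting step in _predict() is worth illustrating: what happens when two classes receive the same number of votes. The sketch below uses hypothetical neighbor labels to show the behavior of `Counter.most_common`, which in Python 3.7+ orders elements with equal counts by first appearance. Since the labels passed to the counter are ordered from nearest to farthest, a tie goes to the tied class that appears earliest in that ordering.

```python
from collections import Counter

# Labels of the k = 4 nearest neighbors, ordered from nearest to farthest
# (hypothetical values chosen to produce a 2-2 tie).
topK_y = [1, 0, 0, 1]

votes = Counter(topK_y)
winner = votes.most_common(1)[0][0]

print(votes.most_common())  # [(1, 2), (0, 2)]: equal counts keep first-seen order
print(winner)               # 1, the class of the nearest neighbor
```

For binary problems, choosing an odd k avoids such ties altogether.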

2.3. main.py: testing the code

import KNN
import numpy as np

# Use simulated data
raw_data_x = [[3.393533211,2.331273381],
              [3.110073483,1.781539638],
              [1.343808831,3.368360954],
              [3.582294042,4.679179110],
              [2.280362439,2.866990263],
              [7.423436942,4.696522875],
              [5.745051997,3.533989803],
              [9.172168622,2.511101045],
              [7.792783481,3.422088941],
              [7.939820817,0.791637231]
             ]

raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
# Convert x and y into NumPy arrays
X_train = np.array(raw_data_x)
y_train = np.array(raw_data_y)

# predict() expects a 2-D array, so reshape the single sample to shape (1, 2)
x = np.array([8.093607318, 3.365731514])
X_predict = x.reshape(1, -1)

if __name__ == '__main__':
    knn_clf = KNN.KNNClassifier(6)
    knn_clf.fit(X_train, y_train)
    y_predict = knn_clf.predict(X_predict)
    print(y_predict[0])
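
To see why the classifier makes the prediction it does for this sample, the vote can be traced by hand. The following is a standalone sketch using only the standard library and the same data and k = 6 as main.py; it does not use the KNNClassifier class, and the variable names `ranked` and `votes` are introduced here for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8

raw_data_x = [[3.393533211, 2.331273381],
              [3.110073483, 1.781539638],
              [1.343808831, 3.368360954],
              [3.582294042, 4.679179110],
              [2.280362439, 2.866990263],
              [7.423436942, 4.696522875],
              [5.745051997, 3.533989803],
              [9.172168622, 2.511101045],
              [7.792783481, 3.422088941],
              [7.939820817, 0.791637231]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
x = [8.093607318, 3.365731514]
k = 6

# Sort (distance, label) pairs so the nearest training points come first.
ranked = sorted(zip((dist(p, x) for p in raw_data_x), raw_data_y))

for distance, label in ranked[:k]:
    print(f"{distance:.3f} -> class {label}")

# Count the labels among the k nearest neighbors.
votes = Counter(label for _, label in ranked[:k])
print(votes)  # Counter({1: 5, 0: 1})
```

All five class-1 points lie closer to the sample than any class-0 point, so with k = 6 the vote is five to one in favor of class 1.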