K近邻算法（一） python实现，手写数字识别(from机器学习实战)

2019-11-08 02:12:17

字体：大中小

来源：转载

供稿：网友

k近邻法（KNN）算法概述

算法概述

优点精度高、对异常值不敏感、无数据输入假定。缺点计算复杂度高、空间复杂度高。试用数据范围数值型和标称型

工作原理：将新数据的每个特征与样本集中数据对应特征进行比较，计算之间的距离值，选取样本数据集中前k个最相似的数据。

伪代码： 1. 计算已知类别数据集中的点与当前点之间的距离 2. 按照距离递增次序排序 3. 选取与当前点距离最小的k个点 4. 确定前k个点所在类别的出现频率 5. 返回前k个点出现频率最高的类别作为当前点的预测分类

python实现

from numpy import *import Operator

引入numpy包。

def createDataSet(): group = array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]]) labels = ['A', 'A', 'B', 'B'] return group, labels def classify0(inX, dataSet, labels, k): dataSetSize = dataSet.shape[0]#4l diffMat = tile(inX, (dataSetSize,1)) - dataSet#inx按4*1重复排列再与dataset做差 sqDiffMat = diffMat**2#每个元素平方 sqDistances = sqDiffMat.sum(axis=1)#一个框里的都加起来 distances = sqDistances**0.5#加起来之后每个开根号 sortedDistIndicies = distances.argsort() #返回的是数组值从小到大的索引值，最小为0 classCount={} for i in range(k):#即0到k-1.最后得到的classCount是距离最近的k个点的类分布，即什么类出现几次如{'A': 1, 'B': 2} voteIlabel = labels[sortedDistIndicies[i]]#返回从近到远第i个点所对应的类 classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1#字典模式，记录类对应出现次数.这里.get(a,b)就是寻找字典classcount的a对应的值，如果没找到a，就显示b sortedClassCount = sorted(classCount.iteritems(),key=operator.itemgetter(1), reverse=True) return sortedClassCount[0][0]

注1：这里的operator.itemgetter(1)是指定义一个函数，这个函数可以获取它括号里的第1个元素如

a=[3,4,2,6]b=operator.itemgetter(1)b(a)Out[116]: 4c=operator.itemgetter(0)c(a)Out[118]: 3

如果operator.itemgetter(1，2)的话指的是用多级排序，以1为首，2为辅

注2：这里sorted(iterable,cmp=None,key=None,reverse=False)

参数解释：

（1）iterable指定要排序的list或者iterable，不用多说；

（2）cmp为函数，指定排序时进行比较的函数，可以指定一个函数或者lambda函数，如：

students为类对象的list，没个成员有三个域，用sorted进行比较时可以自己定cmp函数，例如这里要通过比较第三个数据成员来排序，代码可以这样写：

students = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]sorted(students, key=lambda student : student[2])Out[121]: [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

（3）key为函数，指定取待排序元素的哪一项进行排序，函数用上面的例子来说明，key指定的lambda函数功能是去元素student的第三个域（即：student[2]）（4）reverse=False是升序，True是降。默认升序。

故综合起来sorted(classCount.iteritems(),key=operator.itemgetter(1), reverse=True)是以第1个元素为排序标准，将classCount进行降序排列。如classCount为{‘A’: 1, ‘B’: 2}，则sortedClassCount[(‘B’, 2), (‘A’, 1)]，则最后返回值sortedClassCount[0][0]为“B”

注3：另外代码在使用前应加一句 group,labels=createDataSet() 因为group,labels是内置变量，在外面直接使用的话要再定义一遍。

最后输入如classify0([1,0], group, labels, 2)来使用

手写数字识别

这里写图片描述

问题描述：训练集对应0~9的数字，每个数字有200个这样的txt，要进行识别，这里我们先把矩阵转换为向量，再比较一个新的向量跟已有数据集向量的距离，找最为接近的k个训练实例，对应的最多的类就是我们识别出的数字。

def img2vector(filename):#把32*32矩阵处理成1*1024向量 returnVect = zeros((1,1024)) fr = open(filename) for i in range(32): lineStr = fr.readline() for j in range(32): returnVect[0,32*i+j] = int(lineStr[j]) return returnVectdef handwritingClassTest(): hwLabels = [] trainingFileList = listdir('trainingDigits') #load the training set,list of filename m = len(trainingFileList) trainingMat = zeros((m,1024)) #形成array,共有m个元素，每个是1*1024向量 for i in range(m): fileNameStr = trainingFileList[i] #如'0_1.txt' fileStr = fileNameStr.split('.')[0] #take off “.txt”即用.分割成2个元素，取前一个 classNumStr = int(fileStr.split('_')[0]) hwLabels.append(classNumStr) trainingMat[i,:] = img2vector('trainingDigits/%s' % fileNameStr)#厉害了 testFileList = listdir('testDigits') #iterate through the test set errorCount = 0.0 mTest = len(testFileList) for i in range(mTest): fileNameStr = testFileList[i] fileStr = fileNameStr.split('.')[0] #take off .txt classNumStr = int(fileStr.split('_')[0]) vectorUnderTest = img2vector('testDigits/%s' % fileNameStr) classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3) PRint "the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr) if (classifierResult != classNumStr): errorCount += 1.0 print "/nthe total number of errors is: %d" % errorCount print "/nthe total error rate is: %f" % (errorCount/float(mTest))

出来结果为

......the classifier came back with: 9, the real answer is: 9the classifier came back with: 9, the real answer is: 9the classifier came back with: 9, the real answer is: 9the classifier came back with: 9, the real answer is: 9the classifier came back with: 9, the real answer is: 9the classifier came back with: 9, the real answer is: 9the classifier came back with: 9, the real answer is: 9the total number of errors is: 11the total error rate is: 0.011628

值得注意的是，这段代码一定要在已经导入os的情况下才能用，即必须最前面有 from os import listdir

上一篇：Python3.5配置django1.10连接MySQL5.7

下一篇：Python基础知识实例讲解