1, Download the data

gemfield@ThinkPad-X1C:~/learning/gemfield_data$ ls -l

total 53672

-rw-rw-r-- 1 gemfield gemfield 7840016 Dec 30 14:17 t10k-images-idx3-ubyte

-rw-rw-r-- 1 gemfield gemfield 10008 Dec 30 14:16 t10k-labels-idx1-ubyte

-rw-rw-r-- 1 gemfield gemfield 47040016 Dec 30 14:43 train-images-idx3-ubyte

-rw-rw-r-- 1 gemfield gemfield 60008 Dec 30 14:42 train-labels-idx1-ubyte
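The sizes in this listing can be checked against the idx layout: an image file is a 16-byte header (four big-endian 32-bit integers) followed by one byte per pixel, and a label file is an 8-byte header (two integers) followed by one byte per label. The sketch below recomputes the four sizes; the helper name `expected_idx_size` is made up for this illustration.

```python
import struct

def expected_idx_size(items, header_fmt, bytes_per_item):
    """Return header size plus payload size, in bytes."""
    return struct.calcsize(header_fmt) + items * bytes_per_item

# images: 4-integer header, 28*28 bytes per image
train_images = expected_idx_size(60000, '>iiii', 28 * 28)
test_images = expected_idx_size(10000, '>iiii', 28 * 28)
# labels: 2-integer header, 1 byte per label
train_labels = expected_idx_size(60000, '>ii', 1)
test_labels = expected_idx_size(10000, '>ii', 1)

print(train_images, train_labels, test_images, test_labels)
```

If a downloaded file has a different size, it is probably still gzip-compressed or was truncated during the download.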

2, Convert the MNIST idx format into Python NumPy ndarrays

Gemfield wrapped the conversion in a single function; the code is shown below:

import numpy as np
import timeit
from sklearn import svm
import struct

TRAIN_ITEMS = 60000
TEST_ITEMS = 10000


def loadMnistData():
    mnist_data = []
    for img_file, label_file, items in zip(
            ['gemfield_data/train-images-idx3-ubyte', 'gemfield_data/t10k-images-idx3-ubyte'],
            ['gemfield_data/train-labels-idx1-ubyte', 'gemfield_data/t10k-labels-idx1-ubyte'],
            [TRAIN_ITEMS, TEST_ITEMS]):
        # read both files fully and close them right away
        with open(img_file, 'rb') as f:
            data_img = f.read()
        with open(label_file, 'rb') as f:
            data_label = f.read()
        # struct format: > means big endian, i means 32-bit integer, so iiii is 4 integers
        fmt = '>iiii'
        offset = 0
        magic_number, img_number, height, width = struct.unpack_from(fmt, data_img, offset)
        print('magic number is {}, image number is {}, height is {} and width is {}'.format(magic_number, img_number, height, width))
        # skip over the 4 header integers read above
        offset += struct.calcsize(fmt)
        # 28x28
        image_size = height * width
        # B means unsigned char
        fmt = '>{}B'.format(image_size)
        # gemfield is short on memory, so load at most `items` images (never more than the file holds)
        if items > img_number:
            items = img_number
        images = np.empty((items, image_size))
        for i in range(items):
            images[i] = np.array(struct.unpack_from(fmt, data_img, offset))
            # scale 0~255 into [0, 1)
            images[i] = images[i] / 256
            offset += struct.calcsize(fmt)
        # struct format: > means big endian, i means 32-bit integer, so ii is 2 integers
        fmt = '>ii'
        offset = 0
        magic_number, label_number = struct.unpack_from(fmt, data_label, offset)
        print('magic number is {} and label number is {}'.format(magic_number, label_number))
        # skip over the 2 header integers read above
        offset += struct.calcsize(fmt)
        # B means unsigned char
        fmt = '>B'
        # same cap as for the images
        if items > label_number:
            items = label_number
        labels = np.empty(items)
        for i in range(items):
            labels[i] = struct.unpack_from(fmt, data_label, offset)[0]
            offset += struct.calcsize(fmt)
        mnist_data.append((images, labels.astype(int)))
    return mnist_data
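The per-image `struct.unpack_from` loop is easy to follow but slow for tens of thousands of images. As a possible alternative (not what Gemfield's function does), `np.frombuffer` can view the pixel bytes directly. This is a minimal sketch under the same idx layout; the names `parse_idx_images` and `parse_idx_labels` are hypothetical, and the demo feeds in hand-built bytes rather than the real MNIST files.

```python
import struct
import numpy as np

def parse_idx_images(raw, max_items=None):
    """Parse idx3-ubyte bytes into a float array of shape (n, height * width)."""
    header = '>iiii'
    _magic, count, height, width = struct.unpack_from(header, raw, 0)
    n = count if max_items is None else min(max_items, count)
    pixels = np.frombuffer(raw, dtype=np.uint8,
                           count=n * height * width,
                           offset=struct.calcsize(header))
    # same scaling as the loop version: 0~255 into [0, 1)
    return pixels.reshape(n, height * width) / 256

def parse_idx_labels(raw, max_items=None):
    """Parse idx1-ubyte bytes into an int array of shape (n,)."""
    header = '>ii'
    _magic, count = struct.unpack_from(header, raw, 0)
    n = count if max_items is None else min(max_items, count)
    labels = np.frombuffer(raw, dtype=np.uint8, count=n,
                           offset=struct.calcsize(header))
    return labels.astype(int)

# tiny synthetic files: two 2x2 images and two labels
raw_images = struct.pack('>iiii', 2051, 2, 2, 2) + bytes([0, 128, 255, 64, 1, 2, 3, 4])
raw_labels = struct.pack('>ii', 2049, 2) + bytes([7, 3])
demo_images = parse_idx_images(raw_images)
demo_labels = parse_idx_labels(raw_labels)
print(demo_images.shape, demo_labels)
```

With the real files, `raw` would be the bytes returned by reading each file in `'rb'` mode, and `max_items` plays the role of TRAIN_ITEMS or TEST_ITEMS.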

3, Use svm from the sklearn module

The code is shown below:

def forwardWithSVM():
    # the timer starts before loading, so the reported train cost includes data loading
    start_time = timeit.default_timer()
    training_data, test_data = loadMnistData()
    # train
    print('gemfield start the training...')
    clf = svm.SVC()
    clf.fit(training_data[0], training_data[1])
    train_time = timeit.default_timer()
    print('gemfield train cost {}'.format(train_time - start_time))
    # test
    print('Begin the test...')
    predictions = [int(a) for a in clf.predict(test_data[0])]
    num_correct = sum(int(a == y) for a, y in zip(predictions, test_data[1]))
    print("%s of %s values correct." % (num_correct, len(test_data[1])))
    test_time = timeit.default_timer()
    print('gemfield test cost {}'.format(test_time - train_time))


if __name__ == '__main__':
    forwardWithSVM()
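Because the timer in `forwardWithSVM` starts before `loadMnistData()` is called, the reported train cost also covers reading and converting the files. One way to time each phase separately is a small context manager, sketched below. The helper `timed` and the stand-in workloads are assumptions for illustration; in the real script the bodies would be the `loadMnistData()` and `clf.fit(...)` calls.

```python
import timeit
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent inside the with-block under results[label]."""
    start = timeit.default_timer()
    try:
        yield
    finally:
        results[label] = timeit.default_timer() - start

timings = {}
with timed('load', timings):
    data = [i * i for i in range(1000)]  # stand-in for loadMnistData()
with timed('train', timings):
    total = sum(data)                    # stand-in for clf.fit(...)

for label, seconds in timings.items():
    print('gemfield {} cost {}'.format(label, seconds))
```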

4, Run and test

With TRAIN_ITEMS set to 10, the results are as follows (the accuracy is barely better than random guessing):

gemfield@ThinkPad-X1C:~/learning$ python3 svm.py

magic number is 2051, image number is 60000, height is 28 and width is 28

magic number is 2049 and label number is 60000

magic number is 2051, image number is 10000, height is 28 and width is 28

magic number is 2049 and label number is 10000

gemfield start the training...

gemfield train cost 0.5457151120062917

Begin the test...

1135 of 10000 values correct.

gemfield test cost 0.1183781250147149

With TRAIN_ITEMS set to 100, the results are as follows (wow, the accuracy more than doubled):

gemfield@ThinkPad-X1C:~/learning$ python3 svm.py

magic number is 2051, image number is 60000, height is 28 and width is 28

magic number is 2049 and label number is 60000

magic number is 2051, image number is 10000, height is 28 and width is 28

magic number is 2049 and label number is 10000

gemfield start the training...

gemfield train cost 0.5702047150116414

Begin the test...

2563 of 10000 values correct.

gemfield test cost 1.0582328418968245

With TRAIN_ITEMS set to 1000, the results are as follows (a passing grade at last):

gemfield@ThinkPad-X1C:~/learning$ python3 svm.py

magic number is 2051, image number is 60000, height is 28 and width is 28

magic number is 2049 and label number is 60000

magic number is 2051, image number is 10000, height is 28 and width is 28

magic number is 2049 and label number is 10000

gemfield start the training...

gemfield train cost 1.7026840379694477

Begin the test...

8267 of 10000 values correct.

gemfield test cost 9.486829451052472

With TRAIN_ITEMS set to 10000, the results are as follows (setting it to 60000 would take too long to run on gemfield's PC):

gemfield@ThinkPad-X1C:~/learning$ python3 svm.py

magic number is 2051, image number is 60000, height is 28 and width is 28

magic number is 2049 and label number is 60000

magic number is 2051, image number is 10000, height is 28 and width is 28

magic number is 2049 and label number is 10000

gemfield start the training...

gemfield train cost 46.46685998595785

Begin the test...

9214 of 10000 values correct.

gemfield test cost 57.5887356440071
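To compare the four runs at a glance, the correct counts from the logs above can be turned into accuracy percentages. The sketch below only uses the numbers already printed; the `accuracy` helper is a name chosen for this example.

```python
def accuracy(num_correct, num_total):
    """Fraction of test samples classified correctly."""
    return num_correct / num_total

# (TRAIN_ITEMS, correct predictions out of 10000), copied from the logs above
runs = [(10, 1135), (100, 2563), (1000, 8267), (10000, 9214)]

summary = {n: round(accuracy(correct, 10000) * 100, 2) for n, correct in runs}
for n, percent in summary.items():
    print('TRAIN_ITEMS={:>5}: {:.2f}% correct'.format(n, percent))
```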

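5, What is the core idea of SVM? How does it fundamentally differ from a neural network?

Hold on, that comes later.

At the very least, the time SVM spends on forward (prediction) grows markedly with the amount of training data: in the runs above, the test cost rose from about 0.12 s with 10 training samples to about 57.6 s with 10000.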
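One likely reason for this growth: `svm.SVC()` uses an RBF kernel by default, and a kernel SVM scores a sample by summing kernel values over all of its support vectors, so more training data usually means more support vectors and more work per prediction. The sketch below shows that sum for a single binary classifier. The support vectors, coefficients, intercept, and `gamma` are made-up values, not taken from a trained model, and sklearn's handling of the ten digit classes (it combines pairwise classifiers) is left out.

```python
import numpy as np

def rbf_decision(x, support_vectors, dual_coef, intercept, gamma):
    """Binary kernel-SVM score: sum_i coef_i * exp(-gamma * ||sv_i - x||^2) + intercept."""
    # one squared distance per support vector, so cost grows with their number
    sq_dist = np.sum((support_vectors - x) ** 2, axis=1)
    return float(np.dot(dual_coef, np.exp(-gamma * sq_dist)) + intercept)

# hypothetical model with two support vectors, one per class
support_vectors = np.array([[0.0, 0.0], [1.0, 1.0]])
dual_coef = np.array([1.0, -1.0])

score_a = rbf_decision(np.array([0.0, 0.0]), support_vectors, dual_coef, 0.0, gamma=1.0)
score_b = rbf_decision(np.array([1.0, 1.0]), support_vectors, dual_coef, 0.0, gamma=1.0)
print(score_a > 0, score_b > 0)  # the sign of the score selects the class
```

A neural network's forward pass, by contrast, has a cost fixed by its architecture rather than by how many training samples it saw.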
