Saturday, July 13, 2019

Building a Fruit Classifier in Python

Source code: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Solving%20A%20Simple%20Classification%20Problem%20with%20Python.ipynb

Fruit dataset: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/fruit_data_with_colors.txt

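We start by importing the pandas and matplotlib.pyplot modules, which handle the data and draw the statistical charts, respectively. We then use the read_table function to read fruit_data_with_colors.txt, store the dataset in the fruits variable, and call head() to display the first five rows. The code is as follows: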

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')
print(fruits.head())

Output:



To see the dimensions of the dataset, read its shape attribute. The code is as follows:

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')
print(fruits.shape)

Output: the dataset has 59 rows and 7 columns.
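For readers who do not have the data file at hand, the loading step can be tried on an in-memory sample. This is only a minimal sketch: the three rows are made-up values that mimic a tab-separated layout with seven columns, and `sample_text` and `sample` are names introduced here for illustration. It uses `pd.read_csv` with `sep='\t'`, which should parse tab-separated text the same way `read_table` does with its default separator.

```python
import io
import pandas as pd

# Made-up rows that mimic the tab-separated layout of the fruit file
sample_text = (
    "fruit_label\tfruit_name\tfruit_subtype\tmass\twidth\theight\tcolor_score\n"
    "1\tapple\tgranny_smith\t192\t8.4\t7.3\t0.55\n"
    "2\tmandarin\tmandarin\t86\t6.2\t4.7\t0.80\n"
    "3\torange\tturkey_navel\t190\t7.5\t8.1\t0.74\n"
)

# Parse the tab-separated text from memory instead of from a file
sample = pd.read_csv(io.StringIO(sample_text), sep="\t")

print(sample.head())   # first rows (all three in this tiny sample)
print(sample.shape)    # (number of rows, number of columns)
```

The shape tuple is always ordered as rows first, columns second, so the sample reports 3 rows and 7 columns.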

To find out how many kinds of fruit the dataset contains, select the fruit_name column and call unique() to drop the duplicate values.

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')
print(fruits['fruit_name'].unique())

Output: there are four kinds of fruit.
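If the number of rows per fruit is also of interest, `value_counts()` is a natural companion to `unique()`. The sketch below uses a hypothetical list of names rather than the real column, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical fruit names standing in for the fruit_name column
names = pd.Series(
    ["apple", "apple", "mandarin", "orange", "lemon", "orange", "apple"]
)

kinds = names.unique()          # distinct values, in order of first appearance
counts = names.value_counts()   # number of rows for each distinct value

print(kinds)
print(counts)
```

These per-category counts are the same numbers that a count plot draws as bar heights.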

Next we draw a count plot of the number of samples per fruit, using the seaborn library.

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

import seaborn as sns
sns.countplot(x='fruit_name', data=fruits)
plt.show()

Output:

To examine the spread of each variable, we draw box-and-whisker plots.

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

fruits.drop('fruit_label', axis=1).plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False, figsize=(9,9),
    title='Box Plot for each input variable')
plt.savefig('fruits_box')
plt.show()

Output:
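Separately from the figure, the numbers behind a box plot can be computed directly. The following is a rough sketch on hypothetical mass values (not taken from the dataset). It assumes the common 1.5 × IQR rule for flagging outliers, which is also matplotlib's default whisker setting.

```python
import pandas as pd

# Hypothetical mass values in grams, not taken from the dataset
mass = pd.Series([80, 120, 150, 160, 170, 180, 200, 360])

q1 = mass.quantile(0.25)       # lower edge of the box
median = mass.quantile(0.50)   # line inside the box
q3 = mass.quantile(0.75)       # upper edge of the box
iqr = q3 - q1                  # interquartile range (box height)

# Points beyond 1.5 * IQR from the box edges are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = mass[(mass < lower_fence) | (mass > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```

In this toy series only the 360 g value falls outside the fences, so it would appear as a single point above the upper whisker.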


Next, we use histograms to show the distribution of each input variable:

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

# Only the numeric columns are plotted, each split into 30 bins
fruits.drop('fruit_label' ,axis=1).hist(bins=30, figsize=(9,9))
plt.suptitle("Histogram for each numeric input variable")
plt.savefig('fruits_hist')
plt.show()

Output:
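To see what a histogram actually counts, the binning can be reproduced with NumPy. This is a small sketch on hypothetical color scores, not dataset values. The choice of four bins over the range 0.5 to 1.0 is an assumption made to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical color scores, not taken from the dataset
color_score = np.array([0.55, 0.59, 0.60, 0.69, 0.70, 0.72, 0.74, 0.79, 0.81, 0.93])

# Four equal-width bins between 0.5 and 1.0
counts, edges = np.histogram(color_score, bins=4, range=(0.5, 1.0))

print(edges)    # bin boundaries
print(counts)   # number of values falling in each bin
```

Each bar in a histogram has the height of one entry in `counts`, and the counts always add up to the number of samples.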

Now let's look at the relationships between the variables:

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']
# Color each point by its fruit label using the 'gnuplot' colormap
cmap = plt.get_cmap('gnuplot')
scatter = scatter_matrix(X, c=y, marker='o', s=40, hist_kwds={'bins': 15}, figsize=(9, 9), cmap=cmap)
plt.suptitle('Scatter-matrix for each input variable')
plt.savefig('fruits_scatter_matrix')
plt.show()

Output:
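A scatter matrix shows pairwise relationships visually. A correlation matrix gives a numeric summary of the same pairs. The sketch below is one possible way to compute it, using a hypothetical `toy` DataFrame with invented measurements rather than the real data.

```python
import pandas as pd

# Hypothetical measurements, not taken from the dataset
toy = pd.DataFrame({
    "mass":   [80, 120, 150, 180, 200],
    "width":  [5.8, 6.5, 7.1, 7.4, 7.9],
    "height": [4.3, 7.0, 7.2, 7.5, 9.0],
})

# Pairwise Pearson correlation (the default method of DataFrame.corr)
corr = toy.corr()

print(corr.round(2))
```

Values close to 1 indicate that two variables rise together, which in a scatter matrix shows up as points lying near an upward-sloping line.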

Training a logistic regression classifier and measuring its accuracy:

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

# Feature columns used as model inputs; fruit_label is the target
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(logreg.score(X_test, y_test)))

Output:
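The scaling step in the code above calls `fit_transform` on the training split but only `transform` on the test split. The sketch below imitates that by hand with NumPy on hypothetical single-feature data. It assumes the default 0-to-1 target range of `MinMaxScaler`, that is, (x - min) / (max - min) with min and max taken from the training data only.

```python
import numpy as np

# Hypothetical single-feature data (mass in grams), not taken from the dataset
train = np.array([[80.0], [120.0], [160.0], [200.0]])
test = np.array([[100.0], [230.0]])

# "fit": learn the per-column minimum and maximum from the training split only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# "transform": apply the same training statistics to both splits
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test split reuses the training minimum and maximum, a test value larger than anything seen in training (230 g here) is mapped above 1. This is expected and keeps information about the test data from leaking into the preprocessing.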

For other related statistical analyses, see: http://www.atyun.com/14092.html
