Sunday, July 14, 2019

Which of the Python Fruit Classifiers Works Best?

This article continues the previous post: Building a Fruit Classifier in Python.

The fruit classifiers here are all built with machine-learning techniques, so every algorithm below splits the data into a training set and a test set. Compare the scores yourself: the classifier with the highest test accuracy is the one best suited to this fruit dataset.
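
If you want to run the whole comparison in one place, a minimal sketch like the following (my own loop, not from the original post, but using the same fruit_data_with_colors.txt file, features, and train/test split as every example below) trains each classifier covered in this article and prints its test accuracy:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

fruits = pd.read_table('fruit_data_with_colors.txt')
X = fruits[['mass', 'width', 'height', 'color_score']]
y = fruits['fruit_label']

# Same split and min-max scaling as the individual examples below
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'K-NN': KNeighborsClassifier(),
    'LDA': LinearDiscriminantAnalysis(),
    'Gaussian NB': GaussianNB(),
    'SVM': SVC(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print('{:<20} test accuracy: {:.2f}'.format(name, model.score(X_test, y_test)))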

Computing the Logistic Regression score:
import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(logreg.score(X_test, y_test)))

Output:

Decision Tree:
import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier().fit(X_train, y_train)

print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Output:

Setting a maximum tree depth helps avoid overfitting (a depth-comparison sketch follows the output below):

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


from sklearn.tree import DecisionTreeClassifier
clf2 = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf2.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf2.score(X_test, y_test)))
Output:
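
To see how the depth limit trades training accuracy against test accuracy, a small sketch like this can be appended to the block above (it reuses the same scaled X_train, X_test, y_train, y_test; the depths and the added random_state are just illustrative choices of mine):

from sklearn.tree import DecisionTreeClassifier

# Assumes X_train, X_test, y_train, y_test already exist as in the block above
for depth in [1, 2, 3, 4, 5, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print('max_depth={}: train {:.2f}, test {:.2f}'.format(
        depth, tree.score(X_train, y_train), tree.score(X_test, y_test)))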

K-Nearest Neighbors (K-NN):
import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'
     .format(knn.score(X_test, y_test)))

Output:
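
The number of neighbours is the main knob for K-NN (the default is 5). A quick sketch of my own, assuming the same scaled training and test sets as above, shows how the test accuracy changes with k:

from sklearn.neighbors import KNeighborsClassifier

# Assumes X_train, X_test, y_train, y_test from the block above
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print('k={:2d}: test accuracy {:.2f}'.format(k, knn.score(X_test, y_test)))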


Linear Discriminant Analysis (LDA):
import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print('Accuracy of LDA classifier on training set: {:.2f}'
     .format(lda.score(X_train, y_train)))
print('Accuracy of LDA classifier on test set: {:.2f}'
     .format(lda.score(X_test, y_test)))

Output:


Gaussian Naive Bayes:
import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
print('Accuracy of GNB classifier on training set: {:.2f}'
     .format(gnb.score(X_train, y_train)))
print('Accuracy of GNB classifier on test set: {:.2f}'
     .format(gnb.score(X_test, y_test)))

Output:


Support Vector Machine (SVM):

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.svm import SVC
svm = SVC()
svm.fit(X_train, y_train)
print('Accuracy of SVM classifier on training set: {:.2f}'
     .format(svm.score(X_train, y_train)))
print('Accuracy of SVM classifier on test set: {:.2f}'
     .format(svm.score(X_test, y_test)))

Output:
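
Accuracy alone hides which fruits get confused with which. As a follow-up sketch of my own (K-NN is only an example here; substitute whichever classifier scored best for you, and reuse the scaled train/test split from the blocks above), scikit-learn's confusion_matrix and classification_report break the result down per class:

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier

# Assumes X_train, X_test, y_train, y_test from the blocks above
clf = KNeighborsClassifier().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))      # rows: true labels, columns: predicted labels
print(classification_report(y_test, y_pred)) # precision, recall, F1 per fruit label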

Saturday, July 13, 2019

Learning Python Branching with a Passion-Fruit Data List

You can change the value assigned to the 果重 (fruit weight) variable:
果重 = 89
大小規格='大(L)' if 果重 > 80 else '中(M)' if 果重 > 60 else '小(S)'
print(大小規格)

Output:
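
The one-line conditional expression above reads from left to right; written out with an ordinary if/elif/else it does exactly the same thing, which may be easier to follow:

果重 = 89
if 果重 > 80:
    大小規格 = '大(L)'
elif 果重 > 60:
    大小規格 = '中(M)'
else:
    大小規格 = '小(S)'
print(大小規格)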

Building a Fruit Classifier in Python

Source code: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Solving%20A%20Simple%20Classification%20Problem%20with%20Python.ipynb

Fruit dataset source: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/fruit_data_with_colors.txt

We first import pandas and matplotlib.pyplot, which handle the data and draw the charts respectively. We then use the read_table function to read fruit_data_with_colors.txt, store the dataset in the fruits variable, and call head() to display the first five rows:

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')
print(fruits.head())

Output:



To get an overview of the dataset's dimensions, read its shape attribute:

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')
print(fruits.shape)

Output: 59 rows and 7 columns in total.
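
If you also want a quick look at each column's type and basic statistics, dtypes and describe() give a short summary (an optional sketch of my own, on the same file):

import pandas as pd

fruits = pd.read_table('fruit_data_with_colors.txt')
print(fruits.dtypes)     # column names and their data types
print(fruits.describe()) # count, mean, std, min, quartiles, max for the numeric columns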

To find out how many kinds of fruit the dataset contains, read the fruit_name column and use unique() to filter out duplicate values.

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')
print(fruits['fruit_name'].unique())

Output: the dataset contains four kinds of fruit.

Next we draw a count plot of the fruits using the seaborn library.

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

import seaborn as sns
sns.countplot(x='fruit_name', data=fruits)
plt.show()

Output:

To examine the spread of the data, we draw box plots of each input variable.

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

fruits.drop('fruit_label', axis=1).plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False, figsize=(9,9),
    title='Box Plot for each input variable')
plt.savefig('fruits_box')
plt.show()

Output:


Next we use histograms to show the distribution of each input variable:
import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

fruits.drop('fruit_label', axis=1).hist(bins=30, figsize=(9,9))
plt.suptitle("Histogram for each numeric input variable")
plt.savefig('fruits_hist')
plt.show()

Output:

Let's look at the pairwise relationships between the variables with a scatter matrix:

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']
cmap = plt.get_cmap('gnuplot')
scatter = scatter_matrix(X, c=y, marker='o', s=40, hist_kwds={'bins': 15}, figsize=(9,9), cmap=cmap)
plt.suptitle('Scatter-matrix for each input variable')
plt.savefig('fruits_scatter_matrix')
plt.show()

Output:

Computing the logistic regression classification score:

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(logreg.score(X_test, y_test)))

Output:

For other related statistical analyses, see: http://www.atyun.com/14092.html

Sunday, July 7, 2019

Learning Python Loops with a Passion-Fruit Data List

Reference 1: How to loop with indexes in Python
Reference 2: Looking at Python's List Data Structure through Passion-Fruit Grading

Python has only two loop statements, while and for, and we start with while. As in Reference 2, the PassionFruit data is stored in a list. We first create a counter variable i and set it to zero; len() returns the number of items in the list, so as long as i is smaller than the list's length we use it as an index to read each element of PassionFruit, incrementing i after every pass.

PassionFruit=['適用範圍','改良種','統一編碼','2030200000515','品名代碼','51']

i=0

while i<len(PassionFruit):
    print(PassionFruit[i])
    i+=1

Output:

The second approach uses a for loop with range() to generate the numbers from 0 to len()-1. Each number is assigned to i in turn, so i can again be used as an index to read the PassionFruit data.

PassionFruit=['適用範圍','改良種','統一編碼','2030200000515','品名代碼','51']

for i in range(len(PassionFruit)):
    print(PassionFruit[i])

The third approach is the most common one: iterate over the elements of the list directly.
PassionFruit=['適用範圍','改良種','統一編碼','2030200000515','品名代碼','51']

for item in PassionFruit:
    print(item)

The fourth approach goes back to using the index, this time to print a numbered label alongside each element.
PassionFruit=['適用範圍','改良種','統一編碼','2030200000515','品名代碼','51']

for i in range(len(PassionFruit)):
    print("PassionFruit {}: {}".format(i + 1, PassionFruit[i]))

Output:

The last approach uses enumerate(), which pairs each element with its index: on every pass num receives the element's position (starting from 0) and item receives the element itself.
PassionFruit=['適用範圍','改良種','統一編碼','2030200000515','品名代碼','51']

for num, item in enumerate(PassionFruit):
    print("{}: {}".format(num, item))

Output:
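
enumerate() also accepts a start argument, so the 1-based numbering from the fourth example can be reproduced without computing i + 1 yourself (a small sketch along the same lines as the examples above):

PassionFruit=['適用範圍','改良種','統一編碼','2030200000515','品名代碼','51']

# start=1 makes num begin at 1 instead of the default 0
for num, item in enumerate(PassionFruit, start=1):
    print("PassionFruit {}: {}".format(num, item))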