Sunday, July 14, 2019

Which of the Python Fruit Classifiers Works Best?

This article continues the previous post: Building a Fruit Classifier in Python.

The fruit classifiers here are all built with machine-learning techniques, so every algorithm below splits the data into a training set and a test set. Compare the scores yourself: the classifier with the highest test accuracy is the one best suited to this fruit dataset.
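
If you want to run the whole comparison in one place, a minimal sketch like the following (my own loop, not from the original post, but using the same fruit_data_with_colors.txt file, features, and train/test split as every example below) trains each classifier covered in this article and prints its test accuracy:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

fruits = pd.read_table('fruit_data_with_colors.txt')
X = fruits[['mass', 'width', 'height', 'color_score']]
y = fruits['fruit_label']

# Same split and min-max scaling as the individual examples below
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'K-NN': KNeighborsClassifier(),
    'LDA': LinearDiscriminantAnalysis(),
    'Gaussian NB': GaussianNB(),
    'SVM': SVC(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print('{:<20} test accuracy: {:.2f}'.format(name, model.score(X_test, y_test)))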

Computing the Logistic Regression score:
import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(logreg.score(X_test, y_test)))

Output:

Decision Tree:
import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier().fit(X_train, y_train)

print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Output:

Setting a maximum tree depth helps avoid overfitting (a depth-comparison sketch follows the output below):

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


from sklearn.tree import DecisionTreeClassifier
clf2 = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf2.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf2.score(X_test, y_test)))
Output:
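
To see how the depth limit trades training accuracy against test accuracy, a small sketch like this can be appended to the block above (it reuses the same scaled X_train, X_test, y_train, y_test; the depths and the added random_state are just illustrative choices of mine):

from sklearn.tree import DecisionTreeClassifier

# Assumes X_train, X_test, y_train, y_test already exist as in the block above
for depth in [1, 2, 3, 4, 5, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print('max_depth={}: train {:.2f}, test {:.2f}'.format(
        depth, tree.score(X_train, y_train), tree.score(X_test, y_test)))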

K-Nearest Neighbors (K-NN):
import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'
     .format(knn.score(X_test, y_test)))

Output:
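
The number of neighbours is the main knob for K-NN (the default is 5). A quick sketch of my own, assuming the same scaled training and test sets as above, shows how the test accuracy changes with k:

from sklearn.neighbors import KNeighborsClassifier

# Assumes X_train, X_test, y_train, y_test from the block above
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print('k={:2d}: test accuracy {:.2f}'.format(k, knn.score(X_test, y_test)))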


Linear Discriminant Analysis (LDA):
import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print('Accuracy of LDA classifier on training set: {:.2f}'
     .format(lda.score(X_train, y_train)))
print('Accuracy of LDA classifier on test set: {:.2f}'
     .format(lda.score(X_test, y_test)))

Output:


Gaussian Naive Bayes:
import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
print('Accuracy of GNB classifier on training set: {:.2f}'
     .format(gnb.score(X_train, y_train)))
print('Accuracy of GNB classifier on test set: {:.2f}'
     .format(gnb.score(X_test, y_test)))

Output:


Support Vector Machine (SVM):

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.svm import SVC
svm = SVC()
svm.fit(X_train, y_train)
print('Accuracy of SVM classifier on training set: {:.2f}'
     .format(svm.score(X_train, y_train)))
print('Accuracy of SVM classifier on test set: {:.2f}'
     .format(svm.score(X_test, y_test)))

Output:
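
Accuracy alone hides which fruits get confused with which. As a follow-up sketch of my own (K-NN is only an example here; substitute whichever classifier scored best for you, and reuse the scaled train/test split from the blocks above), scikit-learn's confusion_matrix and classification_report break the result down per class:

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier

# Assumes X_train, X_test, y_train, y_test from the blocks above
clf = KNeighborsClassifier().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))      # rows: true labels, columns: predicted labels
print(classification_report(y_test, y_pred)) # precision, recall, F1 per fruit label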

Saturday, July 13, 2019

Learning Python Branching with a Passion-Fruit Data List

You can change the value assigned to the 果重 (fruit weight) variable:
果重 = 89
大小規格='大(L)' if 果重 > 80 else '中(M)' if 果重 > 60 else '小(S)'
print(大小規格)

Output:
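
The one-line conditional expression above reads from left to right; written out with an ordinary if/elif/else it does exactly the same thing, which may be easier to follow:

果重 = 89
if 果重 > 80:
    大小規格 = '大(L)'
elif 果重 > 60:
    大小規格 = '中(M)'
else:
    大小規格 = '小(S)'
print(大小規格)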

Building a Fruit Classifier in Python

Source code: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Solving%20A%20Simple%20Classification%20Problem%20with%20Python.ipynb

Fruit dataset source: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/fruit_data_with_colors.txt

We first import pandas and matplotlib.pyplot, which handle the data and draw the charts respectively. We then use the read_table function to read fruit_data_with_colors.txt, store the dataset in the fruits variable, and call head() to display the first five rows:

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')
print(fruits.head())

Output:



To get an overview of the dataset's dimensions, read its shape attribute:

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')
print(fruits.shape)

Output: 59 rows and 7 columns in total.
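
If you also want a quick look at each column's type and basic statistics, dtypes and describe() give a short summary (an optional sketch of my own, on the same file):

import pandas as pd

fruits = pd.read_table('fruit_data_with_colors.txt')
print(fruits.dtypes)     # column names and their data types
print(fruits.describe()) # count, mean, std, min, quartiles, max for the numeric columns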

To find out how many kinds of fruit the dataset contains, read the fruit_name column and use unique() to filter out duplicate values.

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')
print(fruits['fruit_name'].unique())

Output: the dataset contains four kinds of fruit.

Next we draw a count plot of the fruits using the seaborn library.

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

import seaborn as sns
sns.countplot(x='fruit_name', data=fruits)
plt.show()

Output:

To examine the spread of the data, we draw box plots of each input variable.

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

fruits.drop('fruit_label', axis=1).plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False, figsize=(9,9),
    title='Box Plot for each input variable')
plt.savefig('fruits_box')
plt.show()

Output:


Next we use histograms to show the distribution of each input variable:
import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

fruits.drop('fruit_label', axis=1).hist(bins=30, figsize=(9,9))
plt.suptitle("Histogram for each numeric input variable")
plt.savefig('fruits_hist')
plt.show()

Output:

Let's look at the pairwise relationships between the variables with a scatter matrix:

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']
cmap = plt.get_cmap('gnuplot')
scatter = scatter_matrix(X, c=y, marker='o', s=40, hist_kwds={'bins': 15}, figsize=(9,9), cmap=cmap)
plt.suptitle('Scatter-matrix for each input variable')
plt.savefig('fruits_scatter_matrix')
plt.show()

Output:

Computing the logistic regression classification score:

import pandas as pd
import matplotlib.pyplot as plt
fruits = pd.read_table('fruit_data_with_colors.txt')

from pandas.plotting import scatter_matrix
from matplotlib import cm
feature_names = ['mass', 'width', 'height', 'color_score']
X = fruits[feature_names]
y = fruits['fruit_label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(logreg.score(X_test, y_test)))

Output:

For other related statistical analyses, see: http://www.atyun.com/14092.html

Sunday, July 7, 2019

Learning Python Loops with a Passion-Fruit Data List

Reference 1: How to loop with indexes in Python
Reference 2: Looking at Python's List Data Structure through Passion-Fruit Grading

Python has only two loop statements, while and for, and we start with while. As in Reference 2, the PassionFruit data is stored in a list. We first create a counter variable i and set it to zero; len() returns the number of items in the list, so as long as i is smaller than the list's length we use it as an index to read each element of PassionFruit, incrementing i after every pass.

PassionFruit=['適用範圍','改良種','統一編碼','2030200000515','品名代碼','51']

i=0

while i<len(PassionFruit):
    print(PassionFruit[i])
    i+=1

Output:

The second approach uses a for loop with range() to generate the numbers from 0 to len()-1. Each number is assigned to i in turn, so i can again be used as an index to read the PassionFruit data.

PassionFruit=['適用範圍','改良種','統一編碼','2030200000515','品名代碼','51']

for i in range(len(PassionFruit)):
    print(PassionFruit[i])

The third approach is the most common one: iterate over the elements of the list directly.
PassionFruit=['適用範圍','改良種','統一編碼','2030200000515','品名代碼','51']

for item in PassionFruit:
    print(item)

The fourth approach goes back to using the index, this time to print a numbered label alongside each element.
PassionFruit=['適用範圍','改良種','統一編碼','2030200000515','品名代碼','51']

for i in range(len(PassionFruit)):
    print("PassionFruit {}: {}".format(i + 1, PassionFruit[i]))

Output:

The last approach uses enumerate(), which pairs each element with its index: on every pass num receives the element's position (starting from 0) and item receives the element itself.
PassionFruit=['適用範圍','改良種','統一編碼','2030200000515','品名代碼','51']

for num, item in enumerate(PassionFruit):
    print("{}: {}".format(num, item))

Output:
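
enumerate() also accepts a start argument, so the 1-based numbering from the fourth example can be reproduced without computing i + 1 yourself (a small sketch along the same lines as the examples above):

PassionFruit=['適用範圍','改良種','統一編碼','2030200000515','品名代碼','51']

# start=1 makes num begin at 1 instead of the default 0
for num, item in enumerate(PassionFruit, start=1):
    print("PassionFruit {}: {}".format(num, item))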