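Fruit data source: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/fruit_data_with_colors.txt
We start by importing the pandas and matplotlib.pyplot modules, which handle the data and draw the statistical charts respectively. We then call read_table to read fruit_data_with_colors.txt, store the dataset in the fruits variable, and call head() to display its first five rows. The code is as follows: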
    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the tab-separated fruit dataset into a DataFrame
    fruits = pd.read_table('fruit_data_with_colors.txt')
    # Show the first five rows
    print(fruits.head())
Output:
To get the dimensions of the dataset, read its shape attribute. The code is as follows:
    import pandas as pd
    import matplotlib.pyplot as plt

    fruits = pd.read_table('fruit_data_with_colors.txt')
    # shape is a (rows, columns) tuple
    print(fruits.shape)
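Output: the dataset has 59 rows and 7 columns.
To find out how many kinds of fruit the dataset contains, select the fruit_name column and call unique() to drop the duplicate values.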
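As a small illustration of what read_table and shape do, the sketch below parses a tab-separated string instead of the real file. The rows and values are made up for the example and are not taken from fruit_data_with_colors.txt; only a subset of the column names is used.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample; the values are invented for illustration
sample = (
    "fruit_label\tfruit_name\tmass\twidth\n"
    "1\tapple\t180\t8.0\n"
    "2\tlemon\t120\t6.1\n"
    "1\tapple\t175\t7.6\n"
)

# read_table splits on tabs by default and uses the first line as the header
df = pd.read_table(io.StringIO(sample))

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = df.shape
print(n_rows, n_cols)
print(list(df.columns))
```

The header line is not counted as a row, so the row count reflects data records only.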
    import pandas as pd
    import matplotlib.pyplot as plt

    fruits = pd.read_table('fruit_data_with_colors.txt')
    # List the distinct fruit names
    print(fruits['fruit_name'].unique())
Output: the dataset contains four kinds of fruit.
Next we draw a count plot for each fruit, this time using the seaborn library.
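For readers who also want the number of distinct values or the row count per value, the following minimal sketch shows unique() next to nunique() and value_counts(). The toy DataFrame is hypothetical and does not reproduce the real dataset.

```python
import pandas as pd

# Hypothetical toy rows, NOT taken from fruit_data_with_colors.txt
toy = pd.DataFrame({
    'fruit_name': ['apple', 'apple', 'lemon', 'orange', 'lemon', 'apple'],
})

names = toy['fruit_name'].unique()          # distinct values, in order of first appearance
n_kinds = toy['fruit_name'].nunique()       # how many distinct values there are
counts = toy['fruit_name'].value_counts()   # number of rows for each distinct value

print(list(names))
print(n_kinds)
print(counts.to_dict())
```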
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    fruits = pd.read_table('fruit_data_with_colors.txt')
    # Draw one bar per fruit name; passing x= and data= works with current seaborn releases
    sns.countplot(x='fruit_name', data=fruits)
    plt.show()
Output:
To examine the spread of each variable, we draw box plots.
    import pandas as pd
    import matplotlib.pyplot as plt

    fruits = pd.read_table('fruit_data_with_colors.txt')
    # Drop the class label, then draw one box plot per numeric column
    fruits.drop('fruit_label', axis=1).plot(
        kind='box', subplots=True, layout=(2, 2),
        sharex=False, sharey=False, figsize=(9, 9),
        title='Box Plot for each input variable')
    plt.savefig('fruits_box')
    plt.show()
Output:
Next we use histograms to show how the values of each input variable are distributed:
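A box plot summarizes each column by its quartiles. The sketch below computes those numbers by hand for a single made-up column, assuming the common 1.5 × IQR whisker rule (matplotlib's default); the mass values are hypothetical.

```python
import pandas as pd

# Hypothetical mass values, invented for illustration
mass = pd.Series([80, 120, 150, 160, 170, 180, 200, 360])

q1 = mass.quantile(0.25)      # lower edge of the box
median = mass.quantile(0.5)   # line inside the box
q3 = mass.quantile(0.75)      # upper edge of the box
iqr = q3 - q1                 # interquartile range (height of the box)

# Points beyond 1.5 * IQR from the box are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = mass[(mass < lower_fence) | (mass > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```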
    import pandas as pd
    import matplotlib.pyplot as plt

    fruits = pd.read_table('fruit_data_with_colors.txt')
    # Drop the class label, then draw a 30-bin histogram for each numeric column
    fruits.drop('fruit_label', axis=1).hist(bins=30, figsize=(9, 9))
    # plt.suptitle replaces the original pylab call; pylab is no longer needed
    plt.suptitle("Histogram for each numeric input variable")
    plt.savefig('fruits_hist')
    plt.show()
Output:
Now let's look at the pairwise relationships among the variables:
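To make the bins argument concrete, the sketch below uses numpy.histogram on a handful of made-up values. It assumes the usual equal-width binning between the minimum and maximum value; the numbers are hypothetical.

```python
import numpy as np

# Hypothetical measurements, invented for illustration
values = np.array([1, 2, 2, 3, 9])

# Split the range [1, 9] into 4 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=4)

print(edges)    # bin boundaries
print(counts)   # number of values falling into each bin
```

Every bin is half-open except the last one, which also includes its right edge, so the maximum value is counted in the final bin.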
    import pandas as pd
    import matplotlib.pyplot as plt
    # pandas.tools.plotting was removed; scatter_matrix now lives in pandas.plotting
    from pandas.plotting import scatter_matrix

    fruits = pd.read_table('fruit_data_with_colors.txt')
    feature_names = ['mass', 'width', 'height', 'color_score']
    X = fruits[feature_names]
    y = fruits['fruit_label']

    # Color each point by its fruit label
    cmap = plt.get_cmap('gnuplot')
    scatter = scatter_matrix(X, c=y, marker='o', s=40,
                             hist_kwds={'bins': 15}, figsize=(9, 9), cmap=cmap)
    plt.suptitle('Scatter-matrix for each input variable')
    plt.savefig('fruits_scatter_matrix')
    plt.show()
Output:
Finally, we train a logistic regression classifier and report its accuracy:
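A scatter matrix shows these relationships visually; a numeric counterpart is the Pearson correlation matrix returned by DataFrame.corr(). The sketch below uses a hypothetical four-row table whose values are invented so that mass and width rise together.

```python
import pandas as pd

# Hypothetical values, NOT taken from the fruit dataset
toy = pd.DataFrame({
    'mass': [100, 150, 200, 250],
    'width': [6.0, 7.0, 8.0, 9.0],          # rises linearly with mass
    'color_score': [0.9, 0.7, 0.8, 0.6],    # tends to fall as mass rises
})

# Pairwise Pearson correlation between all numeric columns
corr = toy.corr()
print(corr.round(2))
```

Values close to 1 or -1 correspond to point clouds that line up along a diagonal in the scatter matrix, while values near 0 correspond to shapeless clouds.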
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.linear_model import LogisticRegression

    fruits = pd.read_table('fruit_data_with_colors.txt')
    feature_names = ['mass', 'width', 'height', 'color_score']
    X = fruits[feature_names]
    y = fruits['fruit_label']

    # Split into training and test sets (fixed seed for reproducibility)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit the scaler on the training set only, then apply it to both sets
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Train the classifier and report accuracy on both sets
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)
    print('Accuracy of Logistic regression classifier on training set: {:.2f}'
          .format(logreg.score(X_train, y_train)))
    print('Accuracy of Logistic regression classifier on test set: {:.2f}'
          .format(logreg.score(X_test, y_test)))
Output:
For other statistical analyses, see: http://www.atyun.com/14092.html
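The scaling step in the code above fits MinMaxScaler on the training set and reuses the same minimum and maximum for the test set. The sketch below reproduces that behavior by hand on a single hypothetical feature column; the numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature training and test data
X_train = np.array([[100.0], [200.0], [300.0]])
X_test = np.array([[150.0], [400.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learns min=100, max=300
X_test_scaled = scaler.transform(X_test)         # reuses the training min and max

# The same transformation written out: (x - train_min) / (train_max - train_min)
train_min = X_train.min(axis=0)
train_max = X_train.max(axis=0)
manual = (X_test - train_min) / (train_max - train_min)

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
print(manual.ravel())
```

Because the test set is scaled with the training statistics, test values outside the training range can fall outside [0, 1], as the second test value does here.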