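Week twelve. Most of my courses have their final exams in about a month. In between I still have to prepare for two mathematical modeling competitions and the CET-6 English exam. On the research side, Prof. Liu recently introduced me to a researcher at the Chinese Academy of Sciences. After reading one of his papers, I get the impression that he understands ML fairly deeply. I'll see whether I can get involved in more projects.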

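Saturday

Today I spent almost the whole day on machine learning.

In the morning I read a paper.

The paper uses RF (random forest) to trace the sources of water samples.
I wrote reading notes on it.

In the afternoon, back in the dorm, I started studying RF.

I knew a little about it before but had never used it in practice.
I wrote a set of study notes on Random Forest, covering:
- An introduction to ensemble methods
- The basics and characteristics of random forests
- Related background knowledge
- Feature importance in random forests (this shows up in the hands-on project below)
- The RF implementation in sklearn, mainly the hyperparameters that can be set at instantiation to tune the model
While studying I came across a passage on hyperparameter tuning, which I excerpted at the end of the study notes.

In the evening I worked through a hands-on project on Kaggle, as follows: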

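Hands-On Notes on Machine Learning Classification Algorithms

The project comes from Kaggle.
It walks through the whole process of building machine learning models on the Titanic dataset: data preprocessing (including regular-expression-based text extraction), visualization, creating new features, building several classification models, and evaluating model performance.

References

Kaggle hands-on project

Dataset

In this notebook, I will go through the whole process of creating a machine learning model on the famous Titanic dataset, which is used by many people all over the world.
It provides information on the fate of the passengers on the Titanic, summarized according to economic status (class), sex, age, and survival.
In this challenge, we are asked to predict whether a passenger on the Titanic would have survived or not.

Hands-On Walkthrough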

# linear algebra
import numpy as np 

# data processing
import pandas as pd 

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

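Getting the Data

test_df = pd.read_csv("test.csv")
train_df = pd.read_csv("train.csv")

Data Exploration/Analysis

The training set has 891 examples and 11 features plus the target variable (Survived). Of the 12 columns, 2 are floats, 5 are integers, and 5 are objects. Below is a short description of each feature: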

survival: Survival

PassengerId: Unique Id of a passenger.

pclass: Ticket class

sex: Sex

Age: Age in years

sibsp: # of siblings / spouses aboard the Titanic

parch: # of parents / children aboard the Titanic

ticket: Ticket number

fare: Passenger fare

cabin: Cabin number

embarked: Port of Embarkation

train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

train_df.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

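As we can see above, 38% of the passengers in the training set survived. Passenger ages range from 0.4 to 80 years. On top of that, we can already spot some features that contain missing values, such as Age.

train_df.head(15)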

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S
12	13	0	3	Saundercock, Mr. William Henry	male	20.0	0	0	A/5. 2151	8.0500	NaN	S
13	14	0	3	Andersson, Mr. Anders Johan	male	39.0	1	5	347082	31.2750	NaN	S
14	15	0	3	Vestrom, Miss. Hulda Amanda Adolfina	female	14.0	0	0	350406	7.8542	NaN	S

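From the table above, we can note a few things. First, we need to convert many of the features into numeric ones later on, so that the machine learning algorithms can process them.

Furthermore, we can see that the features have widely different ranges, which we will need to convert to roughly the same scale.

We can also spot more features that contain missing values (NaN = not a number), which we need to deal with.

Let's take a more detailed look at what data is actually missing:

total = train_df.isnull().sum().sort_values(ascending=False)  # ascending=True (the default) sorts in ascending order; False sorts in descending order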

percent_1 = train_df.isnull().sum()/train_df.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)

missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])

missing_data

	Total	%
Cabin	687	77.1
Age	177	19.9
Embarked	2	0.2
PassengerId	0	0.0
Survived	0	0.0
Pclass	0	0.0
Name	0	0.0
Sex	0	0.0
SibSp	0	0.0
Parch	0	0.0
Ticket	0	0.0
Fare	0	0.0

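Embarked has only 2 missing values, which are easy to fill. Age is trickier, with 177 missing values. Cabin needs further investigation, but we may want to drop it from the dataset, since 77% of its values are missing.

train_df.columns.values

array(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype=object)

Above you can see the 11 features plus the target variable (Survived).

Which features could contribute to a high survival rate?

To me it would make sense if everything except PassengerId, Ticket, and Name were correlated with a high survival rate.

Age and Sex: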

survived = 'survived'
not_survived = 'not survived'

fig, axes = plt.subplots(nrows=1, 
                         ncols=2,
                         figsize=(10, 4)
                        )

women = train_df[train_df['Sex']=='female']
men = train_df[train_df['Sex']=='male']

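# Note: distplot is deprecated in recent seaborn releases; histplot is the suggested replacement.
ax = sns.distplot(women[women['Survived']==1].Age.dropna(),
                  bins=18,          # bins: int or list controlling how the histogram is divided; 18 means 18 bins
                  label=survived,
                  ax=axes[0],       # which subplot to draw on
                  kde=False         # hist and kde toggle the histogram and the kernel density estimate (both default to True)
                 )
ax = sns.distplot(women[women['Survived']==0].Age.dropna(), bins=40, label=not_survived, ax=axes[0], kde=False)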

ax.legend()
ax.set_title('Female')

ax = sns.distplot(men[men['Survived']==1].Age.dropna(), bins=18, label = survived, ax = axes[1], kde = False)
ax = sns.distplot(men[men['Survived']==0].Age.dropna(), bins=40, label = not_survived, ax = axes[1], kde = False)

ax.legend()
ax.set_title('Male')

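You can see that men have a high probability of survival when they are between 18 and 30 years old, which is also somewhat true for women, but not fully. For women, the survival chances are higher between 14 and 40.

For men, the probability of survival is very low between the ages of 5 and 18, but that is not true for women. Another thing to note is that infants also have a slightly higher probability of survival.

Since there seem to be certain ages with increased odds of survival, and because I want every feature to be on roughly the same scale, I will create age groups later on.

Embarked, Pclass and Sex:

FacetGrid = sns.FacetGrid(train_df, row='Embarked', height=4.5, aspect=1.6)  # 'size' was renamed to 'height' in seaborn 0.9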

FacetGrid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette=None,  order=None, hue_order=None )

FacetGrid.add_legend()

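Embarked seems to be correlated with survival, depending on sex.

Women who embarked at port Q or port S have a higher chance of survival.

Men have a high survival probability if they embarked at port C, but a low one if they embarked at port Q or S.

Pclass also seems to be correlated with survival. We will generate another plot of it below.

Pclass:

sns.barplot(x='Pclass', y='Survived', data=train_df)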

<AxesSubplot:xlabel='Pclass', ylabel='Survived'>

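Here we see clearly that Pclass affects a person's chance of survival, especially if that person is in class 1.

We will create another Pclass plot below.

grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', height=2.2, aspect=1.6)  # 'size' was renamed to 'height' in seaborn 0.9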

grid.map(plt.hist, 'Age', alpha=.5, bins=20)

grid.add_legend();

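The plot above confirms our assumption about Pclass 1, but we can also spot a high probability that a person in Pclass 3 will not survive.

SibSp and Parch

SibSp and Parch make more sense as a combined feature that shows the total number of relatives a person has on the Titanic. I will create it below, along with a feature that indicates whether someone is alone or not.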

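data = [train_df, test_df]
for dataset in data:
    dataset['relatives'] = dataset['SibSp'] + dataset['Parch']
    # Note: despite its name, not_alone is 1 for passengers travelling alone
    # (relatives == 0) and 0 otherwise; the outputs below rely on this encoding.
    dataset.loc[dataset['relatives'] > 0, 'not_alone'] = 0
    dataset.loc[dataset['relatives'] == 0, 'not_alone'] = 1
    dataset['not_alone'] = dataset['not_alone'].astype(int)

train_df['not_alone'].value_counts()

1    537
0    354
Name: not_alone, dtype: int64

# factorplot was renamed to catplot in seaborn 0.9; kind='point' keeps the original point plot
axes = sns.catplot(x='relatives', y='Survived', data=train_df, kind='point', aspect=2.5)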

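Here we can see that you had a high probability of survival with 1 to 3 relatives, but a lower one if you had fewer than 1 or more than 3 (except for some cases with 6 relatives).

Data Preprocessing

First, I will drop PassengerId from the training set, because it does not contribute to a person's survival probability. I will not drop it from the test set, since it is required for the submission.

train_df = train_df.drop(['PassengerId'], axis=1)

train_df.head()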

	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	relatives	not_alone
0	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	1	0
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	1	0
2	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	0	1
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	1	0
4	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	0	1

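Missing Data

Cabin:

As a reminder, we have to deal with Cabin (687), Embarked (2), and Age (177).

At first I thought we would have to delete the Cabin variable, but then I found something interesting. A cabin number looks like "C123", and the letter refers to the deck.

Therefore we will extract these letters and create a new feature that contains a person's deck. Afterwards we will convert the feature into a numeric variable. Missing cabins are first filled with the placeholder "U0", so they end up with deck code 8; any letter that is not in the mapping is converted to zero.

The actual decks of the Titanic range from A to G.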

import re
deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}
data = [train_df, test_df]

for dataset in data:
    dataset['Cabin'] = dataset['Cabin'].fillna("U0")
    # This is a regular expression (not regularization): it grabs the first run of letters, e.g. "C123" -> "C"
    dataset['Deck'] = dataset['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
    dataset['Deck'] = dataset['Deck'].map(deck)
    dataset['Deck'] = dataset['Deck'].fillna(0)
    dataset['Deck'] = dataset['Deck'].astype(int)

# we can now drop the cabin feature
train_df = train_df.drop(['Cabin'], axis=1)
test_df = test_df.drop(['Cabin'], axis=1)
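Since the regular-expression step in the loop above is easy to misread, here is a minimal standalone sketch of what it does. The cabin strings are illustrative sample values (an assumption for this example; "B57 B59" in particular is hypothetical), not rows read from the dataset files.

```python
import re

# Illustrative cabin strings; "U0" is the placeholder used for missing cabins.
cabins = ["C123", "E46", "B57 B59", "U0"]

# One or more ASCII letters; search() returns the first match in the string.
letters = re.compile(r"([a-zA-Z]+)")
decks = [letters.search(c).group() for c in cabins]

print(decks)  # ['C', 'E', 'B', 'U']
```

Compiling the pattern once outside the loop, as done here, also avoids recompiling it for every row.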

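Age:

Now we can tackle the missing values of the Age feature.

I will create an array of random integers drawn from the range between the mean age minus one standard deviation and the mean age plus one standard deviation, with one number for each missing value (is_null).

data = [train_df, test_df]

# Compute the statistics once, from the training set only
# (the original used test_df["Age"].std() for the standard deviation)
mean = train_df["Age"].mean()
std = train_df["Age"].std()

for dataset in data:
    is_null = dataset["Age"].isnull().sum()
    # draw is_null random integers in [mean - std, mean + std)
    rand_age = np.random.randint(int(mean - std), int(mean + std), size=is_null)
    # fill NaN values in Age column with random values generated
    age_slice = dataset["Age"].copy()
    age_slice[np.isnan(age_slice)] = rand_age
    # The original wrote train_df["Age"].astype(int) here, which copied training
    # ages into the test set; the test_df output shown later comes from that run.
    dataset["Age"] = age_slice.astype(int)

train_df["Age"].isnull().sum()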

Embarked:

Since the Embarked feature has only 2 missing values, we will simply fill them with the most common value.

train_df['Embarked'].describe()

count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object

common_value = 'S'
data = [train_df, test_df]

for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].fillna(common_value)

Converting Features

train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Survived   891 non-null    int64  
 1   Pclass     891 non-null    int64  
 2   Name       891 non-null    object 
 3   Sex        891 non-null    object 
 4   Age        891 non-null    int32  
 5   SibSp      891 non-null    int64  
 6   Parch      891 non-null    int64  
 7   Ticket     891 non-null    object 
 8   Fare       891 non-null    float64
 9   Embarked   891 non-null    object 
 10  relatives  891 non-null    int64  
 11  not_alone  891 non-null    int32  
 12  Deck       891 non-null    int32  
dtypes: float64(1), int32(3), int64(5), object(4)
memory usage: 80.2+ KB

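Above you can see that Fare is a float and that we have to deal with 4 categorical features: Name, Sex, Ticket, and Embarked. Let's investigate and transform them one after another.

Fare:

Convert Fare from float to an integer type using the pandas astype() function: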

data = [train_df, test_df]

for dataset in data:
    dataset['Fare'] = dataset['Fare'].fillna(0)
    dataset['Fare'] = dataset['Fare'].astype(int)

Name:

We will use the Name feature to extract each passenger's title, so that we can build a new feature out of it.

data = [train_df, test_df]
titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

for dataset in data:
    # extract titles
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
    # replace titles with a more common title or as Rare
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr',\
                                            'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    # convert titles into numbers
    dataset['Title'] = dataset['Title'].map(titles)
    # filling NaN with 0, to get safe
    dataset['Title'] = dataset['Title'].fillna(0)

train_df = train_df.drop(['Name'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
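To make the title extraction more concrete, here is a small sketch that applies the same pattern to three names taken from the head() output earlier. It assumes only pandas; the variable names are illustrative.

```python
import pandas as pd

# Three names as they appear in the training data shown earlier.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# A space, then a run of letters (captured), then a literal dot.
extracted = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

print(extracted.tolist())  # ['Mr', 'Miss', 'Master']
```

The raw-string prefix (r"...") keeps the backslash in `\.` from being interpreted by Python before the regex engine sees it.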

test_df

	PassengerId	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked	relatives	not_alone	Deck	Title
0	892	3	male	22	0	0	330911	7	Q	0	1	8	1
1	893	3	female	38	1	0	363272	7	S	1	0	8	3
2	894	2	male	26	0	0	240276	9	Q	0	1	8	1
3	895	3	male	35	0	0	315154	8	S	0	1	8	1
4	896	3	female	35	1	1	3101298	12	S	2	0	8	3
...	...	...	...	...	...	...	...	...	...	...	...	...	...
413	1305	3	male	19	0	0	A.5. 3236	8	S	0	1	8	1
414	1306	1	female	44	0	0	PC 17758	108	C	0	1	3	5
415	1307	3	male	42	0	0	SOTON/O.Q. 3101262	7	S	0	1	8	1
416	1308	3	male	34	0	0	359309	8	S	0	1	8	1
417	1309	3	male	18	1	1	2668	22	C	2	0	8	4

418 rows × 13 columns

Sex:

Convert the Sex feature into numeric.

genders = {"male": 0, "female": 1}
data = [train_df, test_df]

for dataset in data:
    dataset['Sex'] = dataset['Sex'].map(genders)

Ticket:

train_df['Ticket'].describe()

count        891
unique       681
top       347082
freq           7
Name: Ticket, dtype: object

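Since the Ticket attribute has 681 unique tickets, it would be tricky to convert them into useful categories, so we will drop it from the dataset.

train_df = train_df.drop(['Ticket'], axis=1)
test_df = test_df.drop(['Ticket'], axis=1)

Embarked:

Convert the Embarked feature into numeric.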

ports = {"S": 0, "C": 1, "Q": 2}
data = [train_df, test_df]

for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].map(ports)

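Creating Categories:

We will now create categories within the following features:

Age:

Now we need to convert the Age feature. First we convert it from float to integer. Then we recode every age into an age group (the code below overwrites the Age column in place rather than adding a separate AgeGroup column). Note that it is important to pay attention to how you form these groups, since you don't want, for example, 80% of your data to fall into group 1.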

data = [train_df, test_df]
for dataset in data:
    dataset['Age'] = dataset['Age'].astype(int)
    dataset.loc[ dataset['Age'] <= 11, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 11) & (dataset['Age'] <= 18), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 18) & (dataset['Age'] <= 22), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 22) & (dataset['Age'] <= 27), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 27) & (dataset['Age'] <= 33), 'Age'] = 4
    dataset.loc[(dataset['Age'] > 33) & (dataset['Age'] <= 40), 'Age'] = 5
    dataset.loc[(dataset['Age'] > 40) & (dataset['Age'] <= 66), 'Age'] = 6
    dataset.loc[ dataset['Age'] > 66, 'Age'] = 6  # ages above 66 share group 6 with the previous bin

# let's see how it's distributed
train_df['Age'].value_counts()

6    162
4    160
5    146
3    136
2    123
1     96
0     68
Name: Age, dtype: int64
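As a side note, the chain of .loc assignments above can also be expressed with pandas' cut() function. The following is only a sketch under assumptions: the sample ages are made up, and the outer edges -1 and 200 are arbitrary bounds chosen to cover every plausible age.

```python
import pandas as pd

# Made-up ages that hit several of the boundaries used above.
ages = pd.Series([2, 11, 12, 18, 22, 30, 40, 41, 70])

# Right-inclusive bin edges matching the thresholds 11, 18, 22, 27, 33, 40.
edges = [-1, 11, 18, 22, 27, 33, 40, 200]
age_groups = pd.cut(ages, bins=edges, labels=False)

print(age_groups.tolist())  # [0, 0, 1, 1, 2, 4, 5, 6, 6]
```

With labels=False, cut() returns the integer bin index directly, which matches the 0 to 6 codes produced by the loop.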

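Fare:

For the Fare feature, we need to do the same as with the Age feature. But it isn't that easy, because if we cut the range of the fare values into a few equally wide categories, 80% of the values would fall into the first category. Fortunately, we can use the pandas qcut() function (it belongs to pandas, not sklearn) to see how to form the categories.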
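A minimal sketch of how qcut() suggests bin edges is shown here. It uses the first twelve fares from the head(15) output earlier rather than the full column, so the edges it prints are not the ones used in the notebook.

```python
import pandas as pd

# First twelve fares from the head(15) output shown earlier.
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583,
                   51.8625, 21.075, 11.1333, 30.0708, 16.7, 26.55])

# q=4 asks for quartiles; retbins=True also returns the edges that were used.
fare_codes, fare_edges = pd.qcut(fares, q=4, labels=False, retbins=True)

print(fare_edges)                               # five edges delimiting four bins
print(fare_codes.value_counts().sort_index())   # each bin holds the same number of fares
```

Unlike equal-width cutting, quantile-based cutting puts roughly the same number of rows into each category, which avoids the lopsided first bin described above.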

train_df.head(10)

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked	relatives	not_alone	Deck	Title
0	0	3	0	2	1	0	7	0	1	0	8	1
1	1	1	1	5	1	0	71	1	1	0	3	3
2	1	3	1	3	0	0	7	0	0	1	8	2
3	1	1	1	5	1	0	53	0	1	0	3	3
4	0	3	0	5	0	0	8	0	0	1	8	1
5	0	3	0	6	0	0	8	2	0	1	8	1
6	0	1	0	6	0	0	51	0	0	1	5	1
7	0	3	0	0	3	1	21	0	4	0	8	4
8	1	3	1	3	0	2	11	0	2	0	8	3
9	1	2	1	1	1	0	30	1	1	0	8	3

data = [train_df, test_df]

for dataset in data:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[(dataset['Fare'] > 31) & (dataset['Fare'] <= 99), 'Fare']   = 3
    dataset.loc[(dataset['Fare'] > 99) & (dataset['Fare'] <= 250), 'Fare']   = 4
    dataset.loc[ dataset['Fare'] > 250, 'Fare'] = 5
    dataset['Fare'] = dataset['Fare'].astype(int)

train_df['Fare'].describe()

count    891.000000
mean       1.523008
std        1.250743
min        0.000000
25%        0.000000
50%        1.000000
75%        2.000000
max        5.000000
Name: Fare, dtype: float64

Creating new Features

I will add two new features to the dataset, which I compute from other features.

Age times Class

data = [train_df, test_df]
for dataset in data:
    dataset['Age_Class']= dataset['Age']* dataset['Pclass']

Fare per Person


for dataset in data:
    dataset['Fare_Per_Person'] = dataset['Fare']/(dataset['relatives']+1)
    dataset['Fare_Per_Person'] = dataset['Fare_Per_Person'].astype(int)

# Let's take a last look at the training set, before we start training the models.
train_df.head(20)

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked	relatives	not_alone	Deck	Title	Age_Class	Fare_Per_Person
0	0	3	0	2	1	0	0	0	1	0	8	1	6	0
1	1	1	1	5	1	0	3	1	1	0	3	3	5	1
2	1	3	1	3	0	0	0	0	0	1	8	2	9	0
3	1	1	1	5	1	0	3	0	1	0	3	3	5	1
4	0	3	0	5	0	0	1	0	0	1	8	1	15	1
5	0	3	0	6	0	0	1	2	0	1	8	1	18	1
6	0	1	0	6	0	0	3	0	0	1	5	1	6	3
7	0	3	0	0	3	1	2	0	4	0	8	4	0	0
8	1	3	1	3	0	2	1	0	2	0	8	3	9	0
9	1	2	1	1	1	0	2	1	1	0	8	3	2	1
10	1	3	1	0	1	1	2	0	2	0	7	2	0	0
11	1	1	1	6	0	0	2	0	0	1	3	2	6	2
12	0	3	0	2	0	0	1	0	0	1	8	1	6	1
13	0	3	0	5	1	5	2	0	6	0	8	1	15	0
14	0	3	1	1	0	0	0	0	0	1	8	2	3	0
15	1	2	1	6	0	0	2	0	0	1	8	3	12	2
16	0	3	0	0	4	1	2	2	5	0	8	4	0	0
17	1	2	0	2	0	0	1	0	0	1	8	1	4	1
18	0	3	1	4	1	0	2	0	1	0	8	3	12	1
19	1	3	1	5	0	0	0	1	0	1	8	3	15	0

Building Machine Learning Models

X_train = train_df.drop("Survived", axis=1)  # features of the training set
Y_train = train_df["Survived"]               # target of the training set
X_test  = test_df.drop("PassengerId", axis=1).copy()

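Stochastic Gradient Descent (SGD)

sklearn.linear_model.SGDClassifier

This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing strength schedule (the learning rate). SGD allows minibatch (online/out-of-core) learning via the partial_fit method. For best results with the default learning rate schedule, the data should have zero mean and unit variance.

# Stochastic Gradient Descent (SGD)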

sgd = linear_model.SGDClassifier(max_iter=5, tol=None)

sgd.fit(X_train, Y_train)

Y_pred = sgd.predict(X_test)

sgd.score(X_train, Y_train)

acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)


print(round(acc_sgd,2,), "%")

78.79 %
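The note about zero mean and unit variance, and the mention of partial_fit, can be illustrated with a short sketch. It runs on synthetic data generated here (an assumption, not the Titanic features) and is not what the notebook above does.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1.0, 10.0, 100.0]
y = (X[:, 0] + X[:, 1] / 10.0 + X[:, 2] / 100.0 > 0).astype(int)

# Standardize first, then fit the SGD classifier on the scaled features.
model = make_pipeline(StandardScaler(),
                      SGDClassifier(max_iter=1000, tol=1e-3, random_state=0))
model.fit(X, y)
train_acc = model.score(X, y)

# Mini-batch (online) variant: feed the data in chunks via partial_fit.
scaler = StandardScaler().fit(X)
online = SGDClassifier(random_state=0)
for start in range(0, len(X), 50):
    batch = slice(start, start + 50)
    online.partial_fit(scaler.transform(X[batch]), y[batch], classes=np.array([0, 1]))
online_acc = online.score(scaler.transform(X), y)

print(train_acc, online_acc)
```

Putting the scaler inside a pipeline means the same scaling is applied automatically at prediction time.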

Random Forest

# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)

random_forest.fit(X_train, Y_train)

Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)

acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)

print(round(acc_random_forest,2,), "%")

92.48 %

Logistic Regression

# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)

Y_pred = logreg.predict(X_test)

acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
print(round(acc_log,2,), "%")

81.59 %

KNN

# KNN
knn = KNeighborsClassifier(n_neighbors = 3)

knn.fit(X_train, Y_train)

Y_pred = knn.predict(X_test)

acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
print(round(acc_knn,2,), "%")

85.41 %

Gaussian Naive Bayes

# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)

Y_pred = gaussian.predict(X_test)

acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
print(round(acc_gaussian,2,), "%")

77.55 %

Perceptron

# Perceptron
perceptron = Perceptron(max_iter=5)
perceptron.fit(X_train, Y_train)

Y_pred = perceptron.predict(X_test)

acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
print(round(acc_perceptron,2,), "%")

78.23 %

Linear SVC

# Linear SVC
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)

Y_pred = linear_svc.predict(X_test)

acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
print(round(acc_linear_svc,2,), "%")

81.14 %

Decision Tree

# Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)

Y_pred = decision_tree.predict(X_test)

acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
print(round(acc_decision_tree,2,), "%")

92.48 %

Which is the best Model ?

results = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 
              'Decision Tree'],
    'Score': [acc_linear_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_decision_tree]})

result_df = results.sort_values(by='Score', ascending=False)

result_df = result_df.set_index('Score')

result_df.head(9)

	Model
Score
92.48	Random Forest
92.48	Decision Tree
85.41	KNN
81.59	Logistic Regression
81.14	Support Vector Machines
78.79	Stochastic Gradient Descent
78.23	Perceptron
77.55	Naive Bayes

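As we can see, the Random Forest classifier comes in first place (tied with the decision tree). Note that all of these scores are measured on the training data itself. So first, let's check how random forest performs when we use cross-validation.

K-Fold Cross-Validation

K-fold cross-validation is generally used for model tuning, that is, for finding the hyperparameter values that give the best generalization performance. Once they are found, the model is retrained on the full training set, and an independent test set is used for the final evaluation of its performance.

A benefit of K-fold cross-validation is that it samples without replacement: in every iteration each sample belongs to either the training part or the validation part, and over the K iterations each sample is used for validation exactly once.

K-fold cross-validation randomly splits the training data into K subsets called folds. Let's imagine we split our data into 4 folds (K = 4). Our random forest model would be trained and evaluated 4 times, using a different fold for evaluation each time while training on the remaining 3 folds.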
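The four-fold scenario just described can be written out by hand to show what happens in each iteration. This is a sketch on synthetic data from make_classification (an assumption, standing in for the Titanic features), with arbitrary sizes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

kfold = KFold(n_splits=4, shuffle=True, random_state=0)
fold_scores = []
for train_idx, valid_idx in kfold.split(X):
    # Train on three folds, evaluate on the held-out fold.
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[valid_idx], y[valid_idx]))

print(fold_scores)  # four accuracies, one per held-out fold
```

cross_val_score wraps this loop in a single call; for classifiers, passing an integer cv makes it use stratified folds, which keep the class ratio similar in every fold.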

The code below performs K-fold cross-validation on our random forest model using 10 folds (K = 10). It therefore outputs an array with 10 different scores.

from sklearn.model_selection import cross_val_score
rf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(rf, X_train, Y_train, cv=10, scoring = "accuracy")


print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

Scores: [0.77777778 0.84269663 0.74157303 0.84269663 0.88764045 0.85393258
 0.82022472 0.78651685 0.82022472 0.86516854]
Mean: 0.823845193508115
Standard Deviation: 0.04207611389315941

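This looks much more realistic than before. Our model has a mean accuracy of 82% with a standard deviation of 4%. The standard deviation shows how precise the estimate is.

This means that in our case the accuracy of our model can vary by about ±4%.

I think the accuracy is still really good, and since random forest is an easy-to-use model, we will try to increase its performance even further in the following section.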

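Random Forest

What is Random Forest?

Random forest is a supervised learning algorithm. As the name suggests, it creates a forest and makes it random. The "forest" it builds is an ensemble of decision trees, most of the time trained with the bagging method. The general idea of bagging is that a combination of learning models improves the overall result.

To put it simply: random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.

One big advantage of random forest is that it can be used for both classification and regression problems, which make up the majority of current machine learning systems. With a few exceptions, a random forest classifier has all the hyperparameters of a decision tree classifier and also all the hyperparameters of a bagging classifier, to control the ensemble itself.

The random forest algorithm brings extra randomness into the model when it grows the trees. Instead of searching for the best feature among all features when splitting a node, it searches for the best feature among a random subset of features. This creates greater diversity among the trees, which generally results in a better model. So when you grow a tree in a random forest, only a random subset of the features is considered for splitting a node. You can even make trees more random by using random thresholds for each feature rather than searching for the best possible thresholds (as a normal decision tree does).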
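The two kinds of randomness described above map onto two scikit-learn estimators: RandomForestClassifier considers a random feature subset at each split (max_features), and ExtraTreesClassifier additionally draws split thresholds at random. The sketch below compares them on synthetic data (an assumption); it is not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Random feature subsets at each split, best threshold within the subset.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
# Random feature subsets and randomly drawn thresholds.
et = ExtraTreesClassifier(n_estimators=100, max_features="sqrt", random_state=0)

rf_acc = cross_val_score(rf, X, y, cv=5).mean()
et_acc = cross_val_score(et, X, y, cv=5).mean()

print("random forest:", rf_acc)
print("extra trees:  ", et_acc)
```

Which of the two does better depends on the data, so it is worth checking both with cross-validation rather than assuming.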

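Feature Importance

Another great quality of random forest is that it makes it very easy to measure the relative importance of each feature. Sklearn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). It computes this score automatically for each feature after training and scales the results so that the sum of all importances is equal to 1. We will look at this below: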

importances = pd.DataFrame({'feature':X_train.columns,
                            'importance':np.round( random_forest.feature_importances_  , 3)})

importances = importances.sort_values('importance',ascending=False).set_index('feature')

importances.head(15)

	importance
feature
Title	0.213
Sex	0.167
Age_Class	0.094
Deck	0.078
Age	0.076
Fare	0.071
Pclass	0.070
relatives	0.055
Embarked	0.053
SibSp	0.044
Fare_Per_Person	0.043
Parch	0.022
not_alone	0.013

importances.plot.bar()

<AxesSubplot:xlabel='feature'>
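The scaling property mentioned earlier, that the importances add up to 1, can be checked directly. The sketch below uses synthetic data (an assumption) in which only the first two columns carry signal.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the two informative features in columns 0 and 1.
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
feature_scores = forest.feature_importances_

print(feature_scores.round(3))
print("sum:", feature_scores.sum())  # scaled so the importances sum to 1
```

Because the scores are relative, a low value means a feature was used less than the others in this particular model, not that it carries no information at all.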

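Conclusion:

not_alone and Parch do not play a significant role in our random forest classifier's prediction process. Because of that I will drop them from the dataset and train the classifier again. We could also remove more or fewer features, but this would need a more detailed investigation of the features' effect on our model. I think it's fine to remove only not_alone and Parch.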

train_df  = train_df.drop("not_alone", axis=1)
test_df  = test_df.drop("not_alone", axis=1)

train_df  = train_df.drop("Parch", axis=1)
test_df  = test_df.drop("Parch", axis=1)

Training random forest again:

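# Random Forest

# Rebuild the feature matrices so that the dropped columns are actually excluded
# (the original code reused the earlier X_train and X_test at this point)
X_train = train_df.drop("Survived", axis=1)
X_test = test_df.drop("PassengerId", axis=1).copy()

random_forest = RandomForestClassifier(n_estimators=100, oob_score = True)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)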

random_forest.score(X_train, Y_train)

acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
print(round(acc_random_forest,2,), "%")

92.48 %

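Our random forest model predicts as well as it did before. A general rule is that the more features you have, the more likely your model is to suffer from overfitting, and vice versa. But I think our data looks fine for now and doesn't have too many features.

There is also another way to evaluate a random forest classifier, which is probably much more accurate than the score we used before.

What I am talking about is using the out-of-bag (OOB) samples to estimate the generalization accuracy. I will not go into detail here about how it works.

Just note that the out-of-bag estimate is about as accurate as using a test set of the same size as the training set. Therefore, using the OOB error estimate removes the need for a set-aside test set.

print("oob score:", round(random_forest.oob_score_, 4)*100, "%")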

oob score: 82.04 %
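For intuition about where out-of-bag samples come from: each tree is trained on a bootstrap sample (drawn with replacement), so some rows are never drawn for that tree and can serve as its private validation set. The simulation below is a sketch using only NumPy; the tree count of 200 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 891   # size of the Titanic training set
n_trees = 200     # arbitrary number of simulated bootstrap samples

oob_fractions = []
for _ in range(n_trees):
    # Draw n_samples row indices with replacement, as bagging does for one tree.
    drawn = rng.integers(0, n_samples, size=n_samples)
    # Rows that were never drawn are "out of bag" for this tree.
    oob_fractions.append(1.0 - np.unique(drawn).size / n_samples)

mean_oob = float(np.mean(oob_fractions))
print(round(mean_oob, 3))  # close to 1/e, about 0.368
```

So roughly a third of the rows are unseen by any given tree, and the OOB score averages each row's predictions over the trees that did not train on it.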

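Hyperparameter Tuning

Now we can start tuning the hyperparameters of the random forest.

The model can be tuned by adjusting the following hyperparameters:

min_samples_leaf
min_samples_split
n_estimators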

# Random Forest
random_forest = RandomForestClassifier(criterion = "gini", 
                                       min_samples_leaf = 1, 
                                       min_samples_split = 10,   
                                       n_estimators=100, 
                                       max_features='sqrt',  # 'auto' meant 'sqrt' for classifiers and was removed in newer scikit-learn
                                       oob_score=True, 
                                       random_state=1, 
                                       n_jobs=-1)

random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)

print("oob score:", round(random_forest.oob_score_, 4)*100, "%")

oob score: 83.61 %
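The values above were set by hand. One common way to search such values systematically is a grid search with cross-validation; the sketch below shows the idea on synthetic data with a deliberately tiny grid. Both the data and the grid values are assumptions, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data set.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# A small, illustrative grid over the three hyperparameters listed above.
param_grid = {
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
    "n_estimators": [50, 100],
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

The grid grows multiplicatively with every added value, so in practice it helps to start coarse and refine around the best region.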

Submission

submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_prediction
    })
submission.to_csv('submission.csv', index=False)

Summary

This project deepened my machine learning knowledge significantly, and I strengthened my ability to apply concepts that I learned from textbooks, blogs, and various other sources to a different type of problem. This project had a heavy focus on the data preparation part, since this is what data scientists spend most of their time on.

I started with the data exploration, where I got a feeling for the dataset, checked for missing data, and learned which features are important. During this process I used seaborn and matplotlib for the visualizations. During the data preprocessing part, I filled in missing values, converted features into numeric ones, grouped values into categories, and created a few new features. Afterwards I trained 8 different machine learning models, picked one of them (random forest), and applied cross-validation to it. Then I explained how random forest works, took a look at the importance it assigns to the different features, and tuned its performance by optimizing its hyperparameter values. Lastly I took a look at its confusion matrix and computed the model's precision, recall, and F-score, before submitting my predictions on the test set to the Kaggle leaderboard.

Of course there is still room for improvement, such as more extensive feature engineering: comparing and plotting the features against each other, and identifying and removing the noisy features. Another thing that could improve the overall result on the Kaggle leaderboard would be more extensive hyperparameter tuning on several machine learning models. Of course you could also do some ensemble learning.
