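- This paper uses RF (Random Forest) to trace the source of water samples
- I wrote a set of reading notes on it
- I had a little prior knowledge of RF, but no hands-on practice yet
- I also wrote Random Forest study notes, which cover:
- An introduction to ensemble algorithms
- The basics and characteristics of random forests
- Related background knowledge
- Feature importance in random forests (this shows up in the hands-on part below)
- The RF implementation in sklearn, mainly the hyperparameters that can be adjusted at instantiation to improve the model
- While studying I came across a passage about hyperparameter tuning, which I excerpted at the end of the study notes.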
# linear algebra
Getting the Data
test_df = pd.read_csv("test.csv")
Data Exploration/Analysis
The training set has 891 samples and 11 features plus the target variable (Survived). Of the 12 columns, 2 are floats, 5 are integers and 5 are objects. Below is a short description of each feature:
survival: Survival
PassengerId: Unique Id of a passenger.
pclass: Ticket class
sex: Sex
Age: Age in years
sibsp: # of siblings / spouses aboard the Titanic
parch: # of parents / children aboard the Titanic
ticket: Ticket number
fare: Passenger fare
cabin: Cabin number
embarked: Port of Embarkation
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
train_df.describe()
 | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
train_df.head(15)
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S |
14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NaN | S |
The table above also shows several features that contain missing values (NaN = Not a Number), which we need to handle.
total = train_df.isnull().sum().sort_values(ascending=False)  # ascending: whether to sort in ascending order (default True)
 | Total | % |
Cabin | 687 | 77.1 |
Age | 177 | 19.9 |
Embarked | 2 | 0.2 |
PassengerId | 0 | 0.0 |
Survived | 0 | 0.0 |
Pclass | 0 | 0.0 |
Name | 0 | 0.0 |
Sex | 0 | 0.0 |
SibSp | 0 | 0.0 |
Parch | 0 | 0.0 |
Ticket | 0 | 0.0 |
Fare | 0 | 0.0 |
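Embarked has only 2 missing values, which are easy to fill. The Age feature is trickier, since it has 177 missing values. The Cabin feature needs further investigation, but it looks like we may want to drop it from the dataset, since 77% of its values are missing.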
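The cell that builds the Total / % table is only partly shown here, so the following is a minimal sketch of one way to compute such a summary, not necessarily the original code. `toy_df` and its values are hypothetical stand-ins for `train_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df (not the real Titanic data).
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# Count missing values per column, largest first.
total = toy_df.isnull().sum().sort_values(ascending=False)
# Express the counts as a percentage of all rows, rounded to one decimal.
percent = (total / len(toy_df) * 100).round(1)

# Put both Series side by side, using the same column labels as the table above.
missing_data = pd.concat([total, percent], axis=1, keys=["Total", "%"])
print(missing_data)
```

Sorting first keeps the worst columns at the top, which makes it easy to spot candidates for dropping (like Cabin) versus imputing (like Age).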
train_df.columns.values
array(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype=object)
Above you can see the 11 features plus the target variable (Survived).
Age and Sex:
survived = 'survived'

Embarked, Pclass and Sex:
FacetGrid = sns.FacetGrid(train_df, row='Embarked', height=4.5, aspect=1.6)

sns.barplot(x='Pclass', y='Survived', data=train_df)
<AxesSubplot:xlabel='Pclass', ylabel='Survived'>

We will create another Pclass plot below.
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', height=2.2, aspect=1.6)

The plot above confirms our assumption about Pclass 1, but we can also see that passengers in Pclass 3 had a high probability of not surviving.
SibSp and Parch
data = [train_df, test_df]
train_df['not_alone'].value_counts()
1 537
0 354
Name: not_alone, dtype: int64
axes = sns.factorplot('relatives','Survived',

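Here we can see that passengers with 1 to 3 relatives had a high probability of survival, while those with no relatives or with more than 3 had a lower one (except for some cases with 6 relatives).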
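The cell that creates `relatives` and `not_alone` is truncated in this post, so here is a minimal sketch that reproduces the values visible in the tables, under the assumption that `relatives` is simply `SibSp + Parch`. Note that the tables show `not_alone = 1` for passengers with zero relatives, and the sketch follows that encoding. `toy_df` is a hypothetical frame built from four rows of the `head()` output.

```python
import pandas as pd

# SibSp and Parch for rows 0, 2, 7 and 8 of the head() output shown earlier.
toy_df = pd.DataFrame({"SibSp": [1, 0, 3, 0], "Parch": [0, 0, 1, 2]})

# Assumed definition: total number of family members aboard.
toy_df["relatives"] = toy_df["SibSp"] + toy_df["Parch"]

# Flag follows the tables in this post: 1 when the passenger has no relatives.
toy_df["not_alone"] = (toy_df["relatives"] == 0).astype(int)

print(toy_df)
```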
train_df = train_df.drop(['PassengerId'], axis=1)
train_df.head()
 | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | relatives | not_alone |
0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 | 0 |
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | 0 |
2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | 1 |
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 | 0 |
4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 | 1 |
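A cabin number such as C123 starts with a letter that refers to the deck. We will therefore extract these letters and create a new feature containing each passenger's deck. Afterwards we convert the feature into a numeric variable. Missing values get a numeric code as well (in the output further down, passengers without a recorded cabin show Deck 8).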
import re
# we can now drop the cabin feature
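Only the first line of the deck-extraction cell survives here, so the following is a sketch of how the step could look rather than the original code. The letter-to-number mapping and the `"U"` placeholder for unknown cabins are assumptions, chosen because they reproduce the Deck values visible in the tables (C to 3, E to 5, G to 7, missing to 8). `extract_deck` is a hypothetical helper.

```python
import re
import pandas as pd

# Cabin values copied from the head() output earlier; None marks a missing cabin.
cabins = pd.Series(["C85", None, "C123", "E46", "G6"])

# Assumed mapping; "U" (unknown) is a placeholder for missing cabins.
deck_map = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}

def extract_deck(cabin: str) -> str:
    """Return the first letter found in a cabin string, or "U" if there is none."""
    match = re.search(r"[A-Za-z]", cabin)
    return match.group() if match else "U"

deck = (
    cabins.fillna("U0")   # give missing cabins a parsable placeholder
    .map(extract_deck)    # keep only the deck letter
    .map(deck_map)        # letter -> number
    .fillna(0)            # letters outside the mapping become 0
    .astype(int)
)
print(deck.tolist())
```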
data = [train_df, test_df]
train_df["Age"].isnull().sum()
Since the Embarked feature has only 2 missing values, we will fill them with the most common value.
train_df['Embarked'].describe()
count 889
unique 3
top S
freq 644
Name: Embarked, dtype: object
common_value = 'S'
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Name 891 non-null object
3 Sex 891 non-null object
4 Age 891 non-null int32
5 SibSp 891 non-null int64
6 Parch 891 non-null int64
7 Ticket 891 non-null object
8 Fare 891 non-null float64
9 Embarked 891 non-null object
10 relatives 891 non-null int64
11 not_alone 891 non-null int32
12 Deck 891 non-null int32
dtypes: float64(1), int32(3), int64(5), object(4)
memory usage: 80.2+ KB
Above you can see that Fare is a float and that we have to deal with 4 categorical features: Name, Sex, Ticket and Embarked. Let's investigate and convert them one after another.
data = [train_df, test_df]
data = [train_df, test_df]
train_df = train_df.drop(['Name'], axis=1)
test_df
 | PassengerId | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | relatives | not_alone | Deck | Title |
0 | 892 | 3 | male | 22 | 0 | 0 | 330911 | 7 | Q | 0 | 1 | 8 | 1 |
1 | 893 | 3 | female | 38 | 1 | 0 | 363272 | 7 | S | 1 | 0 | 8 | 3 |
2 | 894 | 2 | male | 26 | 0 | 0 | 240276 | 9 | Q | 0 | 1 | 8 | 1 |
3 | 895 | 3 | male | 35 | 0 | 0 | 315154 | 8 | S | 0 | 1 | 8 | 1 |
4 | 896 | 3 | female | 35 | 1 | 1 | 3101298 | 12 | S | 2 | 0 | 8 | 3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | 1305 | 3 | male | 19 | 0 | 0 | A.5. 3236 | 8 | S | 0 | 1 | 8 | 1 |
414 | 1306 | 1 | female | 44 | 0 | 0 | PC 17758 | 108 | C | 0 | 1 | 3 | 5 |
415 | 1307 | 3 | male | 42 | 0 | 0 | SOTON/O.Q. 3101262 | 7 | S | 0 | 1 | 8 | 1 |
416 | 1308 | 3 | male | 34 | 0 | 0 | 359309 | 8 | S | 0 | 1 | 8 | 1 |
417 | 1309 | 3 | male | 18 | 1 | 1 | 2668 | 22 | C | 2 | 0 | 8 | 4 |
418 rows × 13 columns
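The cell that derives the Title column from Name is truncated here. As a hedged sketch, a title can be pulled out with a regular expression and then mapped to a number. The `title_map` codes are an assumption that matches the rows visible in this post (Mr = 1, Miss = 2, Mrs = 3, Master = 4); how rarer titles were grouped is not shown, so `"Rare": 5` is a guess.

```python
import pandas as pd

# Names copied from the head() output earlier in the post.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Palsson, Master. Gosta Leonard",
])

# A title is the word that follows a space and ends with a period.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Assumed codes; titles outside the map fall back to 0.
title_map = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
title_codes = titles.map(title_map).fillna(0).astype(int)

print(titles.tolist(), title_codes.tolist())
```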
genders = {"male": 0, "female": 1}
train_df['Ticket'].describe()
count 891
unique 681
top 347082
freq 7
Name: Ticket, dtype: object
train_df = train_df.drop(['Ticket'], axis=1)
ports = {"S": 0, "C": 1, "Q": 2}
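The `genders` and `ports` dictionaries come from the post; how they are applied is cut off, so this sketch assumes a plain `Series.map`. `toy_df` is a hypothetical three-row frame.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

genders = {"male": 0, "female": 1}
ports = {"S": 0, "C": 1, "Q": 2}

# map() replaces each label with its number; labels missing from the dict become NaN.
toy_df["Sex"] = toy_df["Sex"].map(genders)
toy_df["Embarked"] = toy_df["Embarked"].map(ports)

print(toy_df)
```

Because unmapped labels turn into NaN, it is worth filling missing Embarked values before applying the mapping.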
Creating Categories:
We will now create categories within the following features:
data = [train_df, test_df]
# let's see how it's distributed
6 162
4 160
5 146
3 136
2 123
1 96
0 68
Name: Age, dtype: int64
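The binning code itself is not shown in full, so the bin edges below are hypothetical. They were picked because they reproduce the Age codes that appear for the same passengers before and after binning in this post (for example 22 becomes 2 and 38 becomes 5). This sketch uses `pd.cut`, which may differ from the approach in the original cell.

```python
import pandas as pd

# Raw ages of rows 0-3 and 6-9 from the first head() output.
ages = pd.Series([22, 38, 26, 35, 54, 2, 27, 14])

# Hypothetical bin edges; each interval is (left, right].
bin_edges = [-1, 11, 18, 22, 27, 33, 40, 200]

# labels=False returns the integer index of the bin instead of an interval.
age_group = pd.cut(ages, bins=bin_edges, labels=False)

print(age_group.tolist())
```

Roughly equal group sizes, like the counts above, are the goal: a bin holding most of the data would carry little information.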
train_df.head(10)
 | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | relatives | not_alone | Deck | Title |
0 | 0 | 3 | 0 | 2 | 1 | 0 | 7 | 0 | 1 | 0 | 8 | 1 |
1 | 1 | 1 | 1 | 5 | 1 | 0 | 71 | 1 | 1 | 0 | 3 | 3 |
2 | 1 | 3 | 1 | 3 | 0 | 0 | 7 | 0 | 0 | 1 | 8 | 2 |
3 | 1 | 1 | 1 | 5 | 1 | 0 | 53 | 0 | 1 | 0 | 3 | 3 |
4 | 0 | 3 | 0 | 5 | 0 | 0 | 8 | 0 | 0 | 1 | 8 | 1 |
5 | 0 | 3 | 0 | 6 | 0 | 0 | 8 | 2 | 0 | 1 | 8 | 1 |
6 | 0 | 1 | 0 | 6 | 0 | 0 | 51 | 0 | 0 | 1 | 5 | 1 |
7 | 0 | 3 | 0 | 0 | 3 | 1 | 21 | 0 | 4 | 0 | 8 | 4 |
8 | 1 | 3 | 1 | 3 | 0 | 2 | 11 | 0 | 2 | 0 | 8 | 3 |
9 | 1 | 2 | 1 | 1 | 1 | 0 | 30 | 1 | 1 | 0 | 8 | 3 |
data = [train_df, test_df]
train_df['Fare'].describe()
count 891.000000
mean 1.523008
std 1.250743
min 0.000000
25% 0.000000
50% 1.000000
75% 2.000000
max 5.000000
Name: Fare, dtype: float64
Creating new Features
Age times Class
data = [train_df, test_df]
Fare per Person
for dataset in data:
# Let's take a last look at the training set, before we start training the models.
 | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | relatives | not_alone | Deck | Title | Age_Class | Fare_Per_Person |
0 | 0 | 3 | 0 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 8 | 1 | 6 | 0 |
1 | 1 | 1 | 1 | 5 | 1 | 0 | 3 | 1 | 1 | 0 | 3 | 3 | 5 | 1 |
2 | 1 | 3 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 8 | 2 | 9 | 0 |
3 | 1 | 1 | 1 | 5 | 1 | 0 | 3 | 0 | 1 | 0 | 3 | 3 | 5 | 1 |
4 | 0 | 3 | 0 | 5 | 0 | 0 | 1 | 0 | 0 | 1 | 8 | 1 | 15 | 1 |
5 | 0 | 3 | 0 | 6 | 0 | 0 | 1 | 2 | 0 | 1 | 8 | 1 | 18 | 1 |
6 | 0 | 1 | 0 | 6 | 0 | 0 | 3 | 0 | 0 | 1 | 5 | 1 | 6 | 3 |
7 | 0 | 3 | 0 | 0 | 3 | 1 | 2 | 0 | 4 | 0 | 8 | 4 | 0 | 0 |
8 | 1 | 3 | 1 | 3 | 0 | 2 | 1 | 0 | 2 | 0 | 8 | 3 | 9 | 0 |
9 | 1 | 2 | 1 | 1 | 1 | 0 | 2 | 1 | 1 | 0 | 8 | 3 | 2 | 1 |
10 | 1 | 3 | 1 | 0 | 1 | 1 | 2 | 0 | 2 | 0 | 7 | 2 | 0 | 0 |
11 | 1 | 1 | 1 | 6 | 0 | 0 | 2 | 0 | 0 | 1 | 3 | 2 | 6 | 2 |
12 | 0 | 3 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | 8 | 1 | 6 | 1 |
13 | 0 | 3 | 0 | 5 | 1 | 5 | 2 | 0 | 6 | 0 | 8 | 1 | 15 | 0 |
14 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 8 | 2 | 3 | 0 |
15 | 1 | 2 | 1 | 6 | 0 | 0 | 2 | 0 | 0 | 1 | 8 | 3 | 12 | 2 |
16 | 0 | 3 | 0 | 0 | 4 | 1 | 2 | 2 | 5 | 0 | 8 | 4 | 0 | 0 |
17 | 1 | 2 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | 8 | 1 | 4 | 1 |
18 | 0 | 3 | 1 | 4 | 1 | 0 | 2 | 0 | 1 | 0 | 8 | 3 | 12 | 1 |
19 | 1 | 3 | 1 | 5 | 0 | 0 | 0 | 1 | 0 | 1 | 8 | 3 | 15 | 0 |
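The two feature-creation cells are truncated, but the table above is consistent with `Age_Class = Age * Pclass` and `Fare_Per_Person = Fare / (relatives + 1)` truncated to an integer. The sketch below assumes exactly that and checks it against rows 0, 1, 6 and 7 of the table; `toy_df` is a hypothetical frame holding those rows.

```python
import pandas as pd

# Rows 0, 1, 6 and 7 of the table above (Age and Fare are already binned).
toy_df = pd.DataFrame({
    "Pclass": [3, 1, 1, 3],
    "Age": [2, 5, 6, 0],
    "Fare": [0, 3, 3, 2],
    "relatives": [1, 1, 0, 4],
})

# Interaction feature: age group multiplied by ticket class.
toy_df["Age_Class"] = toy_df["Age"] * toy_df["Pclass"]

# Fare shared across the passenger and their relatives, truncated to an integer.
toy_df["Fare_Per_Person"] = (toy_df["Fare"] / (toy_df["relatives"] + 1)).astype(int)

print(toy_df)
```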
Building Machine Learning Models
X_train = train_df.drop("Survived", axis=1)  # features of the training set
# Stochastic Gradient Descent (SGD)
78.79 %
Random Forest
# Random Forest
92.48 %
Logistic Regression
# Logistic Regression
81.59 %
# KNN
85.41 %
Gaussian Naive Bayes
# Gaussian Naive Bayes
77.55 %
Perceptron
# Perceptron
78.23 %
Linear SVC
# Linear SVC
81.14 %
# Decision Tree
92.48 %
Which is the best Model ?
results = pd.DataFrame({
Score | Model |
92.48 | Random Forest |
92.48 | Decision Tree |
85.41 | KNN |
81.59 | Logistic Regression |
81.14 | Support Vector Machines |
78.79 | Stochastic Gradient Descent |
78.23 | Perceptron |
77.55 | Naive Bayes |
from sklearn.model_selection import cross_val_score
print("Scores:", scores)
Scores: [0.77777778 0.84269663 0.74157303 0.84269663 0.88764045 0.85393258
0.82022472 0.78651685 0.82022472 0.86516854]
Mean: 0.823845193508115
Standard Deviation: 0.04207611389315941
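The gap between the 92.48 % above and the cross-validated mean of about 82 % arises because the first number is measured on the same data the model was trained on. The following self-contained sketch illustrates the effect with `cross_val_score`; it uses synthetic data from `make_classification` as a stand-in for `X_train` / `Y_train`, so the numbers will not match the ones in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic training data (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)

# Accuracy on the data the model was fitted on is optimistic.
train_acc = rf.fit(X, y).score(X, y)

# 10-fold cross-validation: every fold is scored on data the model has not seen.
scores = cross_val_score(rf, X, y, cv=10, scoring="accuracy")

print("Training accuracy:", round(train_acc, 4))
print("CV mean:", round(scores.mean(), 4), "CV std:", round(scores.std(), 4))
```

The standard deviation indicates how much the estimate varies from fold to fold; in the output above it is roughly 4 percentage points.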
importances = pd.DataFrame({'feature':X_train.columns,
importances.head(15)
feature | importance |
Title | 0.213 |
Sex | 0.167 |
Age_Class | 0.094 |
Deck | 0.078 |
Age | 0.076 |
Fare | 0.071 |
Pclass | 0.070 |
relatives | 0.055 |
Embarked | 0.053 |
SibSp | 0.044 |
Fare_Per_Person | 0.043 |
Parch | 0.022 |
not_alone | 0.013 |
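The cell that builds this table is cut off after its first line. A minimal sketch of the usual pattern, reading `feature_importances_` from a fitted forest and sorting it, is given below. It runs on synthetic data, and the feature names `f0` ... `f4` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 3 informative features and 2 pure-noise features.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ is normalized so that the values sum to 1.
importances = (
    pd.DataFrame({"feature": feature_names,
                  "importance": rf.feature_importances_.round(3)})
    .sort_values("importance", ascending=False)
    .set_index("feature")
)
print(importances)
```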
importances.plot.bar()

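Deciding which features to drop would really require a more detailed investigation of each feature's effect on the model, but I think removing only not_alone and Parch is fine.
train_df = train_df.drop("not_alone", axis=1)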
Training random forest again:
# Random Forest
92.48 %
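Note that the out-of-bag (OOB) estimate is about as accurate as using a test set of the same size as the training set. Using the OOB error estimate therefore removes the need for a set-aside test set.
print("oob score:", round(random_forest.oob_score_, 4)*100, "%")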
oob score: 82.04 %
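For readers who have not used the OOB score before: each tree is trained on a bootstrap sample, so the rows a tree never saw can serve as its private validation set. A minimal sketch on synthetic data (a stand-in for the Titanic features, so the number will differ) shows that `oob_score=True` must be set at construction for `oob_score_` to exist after fitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# oob_score=True asks the forest to score each sample using only
# the trees whose bootstrap sample did not contain it.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

print("training accuracy:", round(rf.score(X, y) * 100, 2), "%")
print("oob score:", round(rf.oob_score_ * 100, 2), "%")
```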
- min_samples_leaf,
- min_samples_split
- n_estimators
# Random Forest
oob score: 83.61 %
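The tuning code is not included in this post. One common way to search over the three hyperparameters listed above is `GridSearchCV`; whether that matches what was actually run here is an assumption. The grid below is deliberately tiny and hypothetical so that the sketch finishes quickly, and it again uses synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical, very small grid over the hyperparameters named above.
param_grid = {
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
    "n_estimators": [25, 50],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 4))
```

A real search would use a wider grid and more folds, which multiplies the run time accordingly.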
Confusion Matrix
from sklearn.model_selection import cross_val_predict
array([[490, 59],
[ 93, 249]], dtype=int64)
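Precision and Recall:
- Precision is the ratio of correctly predicted positives to all samples predicted as positive.
- Recall is the ratio of correctly predicted positives to all samples that are actually positive.
- The F1-score takes both precision and recall into account and is computed as their harmonic mean. Because precision and recall trade off against each other, a single metric that reflects both is very useful.
from sklearn.metrics import precision_score, recall_score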
Precision: 0.8084415584415584
Recall: 0.7280701754385965
When our model predicts that a passenger survived, it is correct 81% of the time (precision). The recall tells us that it correctly identified 73% of the people who actually survived.
You can combine precision and recall into one score, which is called the F-score. The F-score is the harmonic mean of precision and recall, which gives much more weight to low values. As a result, the classifier only gets a high F-score if both recall and precision are high.
from sklearn.metrics import f1_score
F-Score: 0.7661538461538462
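As a cross-check, the three metrics can be recomputed by hand from the confusion matrix printed earlier. The sketch assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted.
cm = np.array([[490, 59],
               [93, 249]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # 249 / 308
recall = tp / (tp + fn)      # 249 / 342
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print("Precision:", precision)
print("Recall:", recall)
print("F-Score:", f1)
```

The results agree with the values reported by `precision_score`, `recall_score` and `f1_score` above.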
Precision-Recall Curve
from sklearn.metrics import precision_recall_curve
def plot_precision_and_recall(precision, recall, threshold):

def plot_precision_vs_recall(precision, recall):

submission = pd.DataFrame({
This project deepened my machine learning knowledge significantly, and I strengthened my ability to apply concepts that I learned from textbooks, blogs and various other sources to a different type of problem. The project had a heavy focus on data preparation, since this is where data scientists spend most of their time.
I started with the data exploration, where I got a feeling for the dataset, checked for missing data and learned which features are important. During this process I used seaborn and matplotlib for the visualizations. During the data preprocessing part, I filled in missing values, converted features into numeric ones, grouped values into categories and created a few new features. Afterwards I trained 8 different machine learning models, picked one of them (random forest) and applied cross-validation to it. Then I explained how random forest works, took a look at the importance it assigns to the different features and tuned its performance by optimizing its hyperparameter values. Lastly I took a look at its confusion matrix and computed the model's precision, recall and F-score, before submitting my predictions on the test set to the Kaggle leaderboard.
Of course there is still room for improvement, such as more extensive feature engineering: comparing and plotting the features against each other, and identifying and removing the noisy ones. Another thing that could improve the overall result on the Kaggle leaderboard would be more extensive hyperparameter tuning on several machine learning models. You could also try some ensemble learning.