Here we use a decision tree to analyse the Titanic passenger data and predict each passenger's chance of survival.

Prepare the data

import pandas as pd
# Note: this URL may no longer be reachable; the dataset has since moved.
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')

Inspect the first few rows

titanic.head()

Inspect the column summary

titanic.info()

Select the features. For this disaster, we expect sex, age and pclass to be the key factors in whether a passenger survived.

X = titanic[['pclass', 'age', 'sex']]
y = titanic['survived']
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
age       633 non-null float64
sex       1313 non-null object
dtypes: float64(1), object(2)
memory usage: 30.9+ KB

The age column has only 633 non-null values, so it needs to be filled in; sex and pclass are not numeric columns, so they need to be converted.

For the missing ages we fill with the mean (or median), the strategy that biases the model the least.

X = X.copy()  # work on a copy to avoid SettingWithCopyWarning on a slice
X['age'] = X['age'].fillna(X['age'].mean())
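A minimal sketch of the two imputation strategies on a toy Series (hypothetical values, not from the Titanic set), showing that either choice leaves no missing entries:

```python
import pandas as pd

# Toy ages with missing entries (illustrative only)
age = pd.Series([22.0, 38.0, None, 26.0, None, 35.0])

mean_filled = age.fillna(age.mean())      # fill with the mean (30.25 here)
median_filled = age.fillna(age.median())  # fill with the median (30.5 here)

print(mean_filled.isna().sum())  # → 0, no missing values remain
```

The median is the more robust choice when the column contains outliers, since a few extreme values can pull the mean far from the typical passenger.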

Split the dataset

# sklearn.cross_validation was removed; use sklearn.model_selection instead
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

Transform the features: every categorical feature is expanded into its own one-hot columns, while numeric features are kept as-is.

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
print(vec.feature_names_)

['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']
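To make the one-hot expansion concrete, here is a minimal sketch on two hand-written records (hypothetical values, not rows from the dataset): string-valued keys become `key=value` indicator columns, while the numeric `age` passes through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

# Two illustrative records with one numeric and two categorical fields
records = [
    {'pclass': '1st', 'age': 29.0, 'sex': 'female'},
    {'pclass': '3rd', 'age': 2.0,  'sex': 'male'},
]

vec = DictVectorizer(sparse=False)
features = vec.fit_transform(records)

print(vec.feature_names_)
# ['age', 'pclass=1st', 'pclass=3rd', 'sex=female', 'sex=male']
```

Only the category values actually seen during `fit_transform` get columns, which is why the full training set above produces all three `pclass` levels.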

Apply the same transformation to the test set

X_test = vec.transform(X_test.to_dict(orient='records'))

Train a decision tree model

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_predict = dtc.predict(X_test)
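After fitting, a `DecisionTreeClassifier` also exposes `feature_importances_`, which can show how much each column (e.g. sex vs. age) drove the splits. A minimal sketch on a tiny synthetic dataset (hypothetical, not the Titanic data), where column 0 perfectly determines the label and column 1 is noise:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Column 0 perfectly separates the classes; column 1 is irrelevant noise
X_toy = np.array([[0, 5], [0, 3], [1, 4], [1, 6]])
y_toy = np.array([0, 0, 1, 1])

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_toy, y_toy)

print(clf.feature_importances_)  # column 0 carries all the importance
```

On the real model, pairing `dtc.feature_importances_` with `vec.feature_names_` would reveal which one-hot columns matter most.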

Evaluate

from sklearn.metrics import classification_report
print(dtc.score(X_test, y_test))
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_predict, target_names=['died', 'survived']))
