Titanic Feature Engineering

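1. Add a Title feature by extracting each passenger's form of address from the name and consolidating the titles into six categories. ['Capt', 'Col', 'Major', 'Dr', 'Rev'] map to Officer; these are occupational titles (military, medical, clergy). ['Don', 'Sir', 'the Countess', 'Dona', 'Lady'] map to Royalty; "the Countess" is an earl's wife, and all of these are honorifics that signal high social standing or nobility. ['Mme', 'Ms', 'Mrs'] map to Mrs and ['Mlle', 'Miss'] map to Miss; these titles tell us little beyond the passenger being female. ['Mr'] maps to Mr, and ['Master', 'Jonkheer'] map to Master.

    Title_Dict = {}
    Title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
    Title_Dict.update(dict.fromkeys(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
    Title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
    Title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
    Title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
    Title_Dict.update(dict.fromkeys(['Master', 'Jonkheer'], 'Master'))
    # Assumed extraction step (not shown in the original post):
    # names look like "Braund, Mr. Owen Harris".
    all_data['Title'] = all_data['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
    all_data['Title'] = all_data['Title'].map(Title_Dict)

2. Combine Parch and SibSp into FamilySize, a feature describing the number of family members. Then bin FamilySize by survival rate: sizes in [2,4] form one group, sizes in (4,7] form another, and sizes greater than 7 form a third. The result is the FamilyLabel feature.

3. Add a Deck feature: first fill missing Cabin values with 'Unknown', then take the first letter of Cabin as the passenger's deck.

4. Add a TicketGroup feature that counts how many passengers share each ticket number. Bin TicketGroup into three classes by survival rate: [2,4] is one class, (4,8] together with 1 is another, and greater than 8 is the third.

5. Data cleaning

1) Age has 263 missing values, which is a lot. Build a random forest model on the three features Sex, Title, and Pclass and use it to fill in the missing ages.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    age_df = all_data[['Age', 'Pclass', 'Sex', 'Title']]
    age_df = pd.get_dummies(age_df)
    # as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
    known_age = age_df[age_df.Age.notnull()].to_numpy(dtype=float)
    unknown_age = age_df[age_df.Age.isnull()].to_numpy(dtype=float)
    y = known_age[:, 0]   # Age is the first column
    X = known_age[:, 1:]  # Pclass plus the Sex and Title dummies
    rfr = RandomForestRegressor(random_state=0, n_estimators=100, n_jobs=-1)
    rfr.fit(X, y)
    predictedAges = rfr.predict(unknown_age[:, 1:])
    all_data.loc[all_data.Age.isnull(), 'Age'] = predictedAges

2) Embarked has 2 missing values. Both passengers with missing Embarked have Pclass 1 and Fare 80. Since the median Fare of passengers with Embarked C and Pclass 1 is about 80, fill the missing values with C.

    # Median Fare for every (Pclass, Embarked) combination
    all_data.groupby(by=["Pclass", "Embarked"]).Fare.median()
    all_data['Embarked'] = all_data['Embarked'].fillna('C')

3) Fare has 1 missing value. The passenger with missing Fare has Embarked S and Pclass 3, so fill it with the median Fare of passengers with Embarked S and Pclass 3.

    all_data.groupby(by=["Pclass", "Embarked"]).Fare.median()
    fare = all_data[(all_data['Embarked'] == "S") & (all_data['Pclass'] == 3)].Fare.median()
    all_data['Fare'] = all_data['Fare'].fillna(fare)

4)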
Put passengers who share a surname into the same group, and from the groups with more than one member extract the women and children and the adult men separately. In the vast majority of women-and-children groups the mean survival rate is either 1 or 0; that is, the women and children of a group either all survived or all perished.

    import pandas as pd
    import seaborn as sns

    all_data['Surname'] = all_data['Name'].apply(lambda x: x.split(',')[0].strip())
    Surname_Count = dict(all_data['Surname'].value_counts())
    all_data['FamilyGroup'] = all_data['Surname'].apply(lambda x: Surname_Count[x])
    Female_Child_Group = all_data.loc[(all_data['FamilyGroup'] >= 2) & ((all_data['Age'] <= 12) | (all_data['Sex'] == 'female'))]
    Male_Adult_Group = all_data.loc[(all_data['FamilyGroup'] >= 2) & (all_data['Age'] > 12) & (all_data['Sex'] == 'male')]

    # Number of women-and-children groups at each mean survival rate
    Female_Child = pd.DataFrame(Female_Child_Group.groupby('Surname')['Survived'].mean().value_counts())
    Female_Child.columns = ['GroupCount']
    sns.barplot(x=Female_Child.index, y=Female_Child["GroupCount"]).set_xlabel('AverageSurvived')

The general pattern is that women and children survive at a high rate and adult men at a low rate, so we pick out the anomalous groups that break this pattern and handle them separately. Women-and-children groups with a survival rate of 0 are marked as perished groups, and adult-male groups with a survival rate of 1 are marked as survived groups. The assumption is that women and children in a perished group are unlikely to have survived, while adult men in a survived group are likely to have survived.

    Female_Child_List = Female_Child_Group.groupby('Surname')['Survived'].mean()
    Dead_List = set(Female_Child_List[Female_Child_List == 0].index)
    print(Dead_List)
    Male_Adult_List = Male_Adult_Group.groupby('Surname')['Survived'].mean()
    Survived_List = set(Male_Adult_List[Male_Adult_List == 1].index)
    print(Survived_List)

So that samples in these two kinds of anomalous groups are classified correctly, apply a penalizing modification to the Age, Title, and Sex of the test-set samples that belong to them.

    # copy() avoids modifying a view of all_data (SettingWithCopyWarning)
    train = all_data.loc[all_data['Survived'].notnull()].copy()
    test = all_data.loc[all_data['Survived'].isnull()].copy()
    in_dead = test['Surname'].isin(Dead_List)
    in_survived = test['Surname'].isin(Survived_List)
    test.loc[in_dead, 'Sex'] = 'male'
    test.loc[in_dead, 'Age'] = 60
    test.loc[in_dead, 'Title'] = 'Mr'
    test.loc[in_survived, 'Sex'] = 'female'
    test.loc[in_survived, 'Age'] = 5
    test.loc[in_survived, 'Title'] = 'Miss'

5) Select the features, convert them to numeric variables, and split the data into training and test sets.

    all_data = pd.concat([train, test])
    all_data = all_data[['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Title', 'FamilyLabel', 'Deck', 'TicketGroup']]
    all_data = pd.get_dummies(all_data)
    train = all_data[all_data['Survived'].notnull()]
    test = all_data[all_data['Survived'].isnull()].drop('Survived', axis=1)
    # as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
    X = train.to_numpy(dtype=float)[:, 1:]
    y = train.to_numpy(dtype=float)[:, 0]
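
The FamilyLabel, Deck, and TicketGroup features in steps 2 to 4 are described in prose only. A minimal sketch of how they might be computed follows, run on a made-up toy frame rather than the real data. Several details are assumptions that the post does not state: FamilySize is taken as SibSp + Parch + 1 (counting the passenger), a family of size 1 is placed in the same bin as (4,7] by analogy with the TicketGroup rule, and the bin codes 2, 1, 0 and the helper names `family_label` and `ticket_label` are hypothetical.

```python
import pandas as pd

# Toy stand-in for all_data; the values are invented for illustration.
all_data = pd.DataFrame({
    "SibSp":  [1, 0, 3, 5, 0],
    "Parch":  [0, 0, 2, 2, 0],
    "Cabin":  ["C85", None, "E46", None, "B28"],
    "Ticket": ["A1", "A1", "B2", "C3", "D4"],
})

# Assumption: FamilySize includes the passenger (SibSp + Parch + 1).
all_data["FamilySize"] = all_data["SibSp"] + all_data["Parch"] + 1

def family_label(size):
    """Bin FamilySize; the codes 2/1/0 are hypothetical labels."""
    if 2 <= size <= 4:
        return 2
    if 4 < size <= 7 or size == 1:  # size 1 here is an assumption
        return 1
    return 0  # greater than 7

all_data["FamilyLabel"] = all_data["FamilySize"].apply(family_label)

# Deck: fill missing Cabin with 'Unknown', then keep the first letter.
all_data["Deck"] = all_data["Cabin"].fillna("Unknown").str[0]

# TicketGroup: number of passengers sharing each ticket number, then binned.
ticket_count = all_data["Ticket"].value_counts()
shared = all_data["Ticket"].map(ticket_count)

def ticket_label(count):
    """Bin the shared-ticket count; the codes 2/1/0 are hypothetical labels."""
    if 2 <= count <= 4:
        return 2
    if 4 < count <= 8 or count == 1:
        return 1
    return 0  # greater than 8

all_data["TicketGroup"] = shared.apply(ticket_label)
print(all_data[["FamilySize", "FamilyLabel", "Deck", "TicketGroup"]])
```

Passengers with no Cabin end up on deck 'U', the first letter of 'Unknown', so the missing-cabin case becomes its own category once `pd.get_dummies` is applied.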