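Introduction

On April 15, 1912, the Titanic, billed as unsinkable, sank after colliding with an iceberg. Because there were not enough lifeboats, 1,502 of the 2,224 passengers and crew lost their lives. The disaster itself is history, but can we find patterns in the Titanic's historical data? In this article we walk through a flexible use of pandas for data analysis.

Titanic passenger data

We downloaded part of the Titanic passenger data from the Kaggle website. It mainly contains the following fields: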

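Variable | Meaning | Values
survival | Survived or not | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
sex | Sex |
Age | Age |
sibsp | Number of siblings/spouses aboard |
parch | Number of parents/children aboard |
ticket | Ticket number |
fare | Fare |
cabin | Cabin number |
embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton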

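The downloaded file is a CSV file. Next, let's see how to analyze it with pandas.

Analyzing the data with pandas

Importing dependencies

This article mainly uses pandas and matplotlib, so we start with the following common setup: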

from numpy.random import randn
import numpy as np
np.random.seed(123)
import os
import matplotlib.pyplot as plt
import pandas as pd
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4)
pd.options.display.max_rows = 20

Reading and analyzing the data

pandas provides a read_csv method that makes it easy to read CSV data and convert it into a DataFrame:

path = '../data/titanic.csv'
df = pd.read_csv(path)
df

Here is the data we read in. Note that this file contains no survival column:

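(index) | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S
5 | 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.2250 | NaN | S
6 | 898 | 3 | Connolly, Miss. Kate | female | 30.0 | 0 | 0 | 330972 | 7.6292 | NaN | Q
7 | 899 | 2 | Caldwell, Mr. Albert Francis | male | 26.0 | 1 | 1 | 248738 | 29.0000 | NaN | S
8 | 900 | 3 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | female | 18.0 | 0 | 0 | 2657 | 7.2292 | NaN | C
9 | 901 | 3 | Davies, Mr. John Samuel | male | 21.0 | 2 | 0 | A/4 48871 | 24.1500 | NaN | S
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
408 | 1300 | 3 | Riordan, Miss. Johanna Hannah"" | female | NaN | 0 | 0 | 334915 | 7.7208 | NaN | Q
409 | 1301 | 3 | Peacock, Miss. Treasteall | female | 3.0 | 1 | 1 | SOTON/O.Q. 3101315 | 13.7750 | NaN | S
410 | 1302 | 3 | Naughton, Miss. Hannah | female | NaN | 0 | 0 | 365237 | 7.7500 | NaN | Q
411 | 1303 | 1 | Minahan, Mrs. William Edward (Lillian E Thorpe) | female | 37.0 | 1 | 0 | 19928 | 90.0000 | C78 | Q
412 | 1304 | 3 | Henriksson, Miss. Jenny Lovisa | female | 28.0 | 0 | 0 | 347086 | 7.7750 | NaN | S
413 | 1305 | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S
414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C
415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S
416 | 1308 | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S
417 | 1309 | 3 | Peter, Master. Michael J | male | NaN | 1 | 1 | 2668 | 22.3583 | NaN | C

418 rows × 11 columns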
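
To try read_csv without the data file at hand, the following minimal sketch parses a small in-memory CSV. The two sample rows are copied from the output above, the names csv_text and sample are made up for illustration, and the empty Age field in the second row is an assumption added to show how a missing value is parsed.

```python
import io

import pandas as pd

# Hypothetical two-row sample using a subset of the article's columns.
# The second row deliberately leaves Age empty.
csv_text = """PassengerId,Pclass,Name,Sex,Age
892,3,"Kelly, Mr. James",male,34.5
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,
"""

# read_csv accepts any file-like object, not only a path on disk.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                 # (number of rows, number of columns)
print(sample.dtypes)                # Age is parsed as a float column
print(sample["Age"].isna().sum())   # the empty field becomes NaN
```

Quoted fields such as the Name column may contain commas without breaking the column layout.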

Calling the describe method on df shows basic statistics:

(statistic) | PassengerId | Pclass | Age | SibSp | Parch | Fare
count | 418.000000 | 418.000000 | 332.000000 | 418.000000 | 418.000000 | 417.000000
mean | 1100.500000 | 2.265550 | 30.272590 | 0.447368 | 0.392344 | 35.627188
std | 120.810458 | 0.841838 | 14.181209 | 0.896760 | 0.981429 | 55.907576
min | 892.000000 | 1.000000 | 0.170000 | 0.000000 | 0.000000 | 0.000000
25% | 996.250000 | 1.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800
50% | 1100.500000 | 3.000000 | 27.000000 | 0.000000 | 0.000000 | 14.454200
75% | 1204.750000 | 3.000000 | 39.000000 | 1.000000 | 0.000000 | 31.500000
max | 1309.000000 | 3.000000 | 76.000000 | 8.000000 | 9.000000 | 512.329200
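
The count row above is lower for Age (332) and Fare (417) than for the other columns because describe only counts non-missing values. The sketch below illustrates this on a made-up four-row frame. The name toy and its values are assumptions for illustration, not the article's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Fare": [7.25, 8.05, np.nan, 90.0],
})

stats = toy.describe()       # count, mean, std, min, quartiles, max
missing = toy.isna().sum()   # number of NaN values per column

print(stats.loc["count"])    # non-missing values per column
print(missing)
```

Comparing the count row with the total number of rows is a quick way to spot columns that need cleaning.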

To look at the port where each passenger embarked, select the column like this:

df['Embarked'][:10]
0    Q
1    S
2    Q
3    S
4    S
5    S
6    Q
7    S
8    C
9    S
Name: Embarked, dtype: object

Use value_counts to count how often each value occurs:

embark_counts = df['Embarked'].value_counts()
embark_counts[:10]

S    270
C    102
Q     46
Name: Embarked, dtype: int64

The result shows that 270 passengers embarked at port S (Southampton), 102 at port C (Cherbourg), and 46 at port Q (Queenstown).
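
As a small aside, value_counts can also report proportions or include missing values. The sketch below uses a made-up Series named ports; the values, including the NaN entry, are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical port codes with one missing entry.
ports = pd.Series(["S", "C", "S", "Q", "S", np.nan], name="Embarked")

counts = ports.value_counts()                    # NaN is excluded by default
shares = ports.value_counts(normalize=True)      # fractions of non-missing values
with_missing = ports.value_counts(dropna=False)  # NaN gets its own row

print(counts)
print(shares)
print(with_missing)
```

By default the counts are sorted in descending order, which is why the most common port appears first.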

In the same way, we can count the Age values:

age_counts = df['Age'].value_counts()
age_counts.head(10)

The 10 most common ages are:

24.0    17
21.0    17
22.0    16
30.0    15
18.0    13
27.0    12
26.0    12
25.0    11
23.0    11
29.0    10
Name: Age, dtype: int64

Compute the mean age:

df['Age'].mean()
30.272590361445783

Some records actually have no age. We can fill them with the mean:

clean_age1 = df['Age'].fillna(df['Age'].mean())
clean_age1.value_counts()

The mean, 30.27, now appears 86 times:

30.27259    86
24.00000    17
21.00000    17
22.00000    16
30.00000    15
18.00000    13
26.00000    12
27.00000    12
25.00000    11
23.00000    11
            ..
36.50000     1
40.50000     1
11.50000     1
34.00000     1
15.00000     1
7.00000      1
60.50000     1
26.50000     1
76.00000     1
34.50000     1
Name: Age, Length: 80, dtype: int64

Using the mean as a stand-in for the age may not be a good idea. Another approach is to drop the missing values:

clean_age2 = df['Age'].dropna()
age_counts = clean_age2.value_counts()
ageset = age_counts.head(10)
ageset

24.0    17
21.0    17
22.0    16
30.0    15
18.0    13
27.0    12
26.0    12
25.0    11
23.0    11
29.0    10
Name: Age, dtype: int64
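
To see why filling with the mean can distort the data, the following sketch compares both strategies on a made-up Series. The name ages and its values are assumptions for illustration. Filling keeps every row but piles values onto the mean, which shrinks the spread; dropping keeps the spread but loses rows.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan], name="Age")

filled = ages.fillna(ages.mean())   # replace NaN with the mean of the known values
dropped = ages.dropna()             # discard the rows with NaN

print(len(filled), len(dropped))    # row counts after each strategy
print(filled.std(), dropped.std())  # mean filling reduces the standard deviation
```

Which strategy fits depends on the analysis. For counting the most common ages, as in this article, dropping avoids an artificial peak at the mean.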

Visualization and matrix transformation

Visualization is very helpful in data analysis. Let's show the top 10 ages obtained above as a bar chart:

import seaborn as sns
sns.barplot(x=ageset.index, y=ageset.values)

Next, let's do a matrix transformation. First, we filter out the rows in which Age or Sex is missing:

cframe = df[df.Age.notnull() & df.Sex.notnull()]
cframe

(index) | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S
5 | 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.2250 | NaN | S
6 | 898 | 3 | Connolly, Miss. Kate | female | 30.0 | 0 | 0 | 330972 | 7.6292 | NaN | Q
7 | 899 | 2 | Caldwell, Mr. Albert Francis | male | 26.0 | 1 | 1 | 248738 | 29.0000 | NaN | S
8 | 900 | 3 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | female | 18.0 | 0 | 0 | 2657 | 7.2292 | NaN | C
9 | 901 | 3 | Davies, Mr. John Samuel | male | 21.0 | 2 | 0 | A/4 48871 | 24.1500 | NaN | S
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
403 | 1295 | 1 | Carrau, Mr. Jose Pedro | male | 17.0 | 0 | 0 | 113059 | 47.1000 | NaN | S
404 | 1296 | 1 | Frauenthal, Mr. Isaac Gerald | male | 43.0 | 1 | 0 | 17765 | 27.7208 | D40 | C
405 | 1297 | 2 | Nourney, Mr. Alfred (Baron von Drachstedt")" | male | 20.0 | 0 | 0 | SC/PARIS 2166 | 13.8625 | D38 | C
406 | 1298 | 2 | Ware, Mr. William Jeffery | male | 23.0 | 1 | 0 | 28666 | 10.5000 | NaN | S
407 | 1299 | 1 | Widener, Mr. George Dunton | male | 50.0 | 1 | 1 | 113503 | 211.5000 | C80 | C
409 | 1301 | 3 | Peacock, Miss. Treasteall | female | 3.0 | 1 | 1 | SOTON/O.Q. 3101315 | 13.7750 | NaN | S
411 | 1303 | 1 | Minahan, Mrs. William Edward (Lillian E Thorpe) | female | 37.0 | 1 | 0 | 19928 | 90.0000 | C78 | Q
412 | 1304 | 3 | Henriksson, Miss. Jenny Lovisa | female | 28.0 | 0 | 0 | 347086 | 7.7750 | NaN | S
414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C
415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S

332 rows × 11 columns
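
The filter above combines two boolean masks with &, so a row is kept only when both Age and Sex are present. The sketch below shows the same idea on a made-up frame and compares it with dropna(subset=...), which should give the same rows. The name toy and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: only the first one has both Age and Sex.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan],
    "Sex": ["male", "female", None, None],
})

# Element-wise AND of two boolean Series; True only where both values exist.
mask = toy["Age"].notnull() & toy["Sex"].notnull()
kept = toy[mask]

# Equivalent shortcut: drop rows with a missing value in either column.
same = toy.dropna(subset=["Age", "Sex"])

print(kept)
print(kept.equals(same))
```

Note that the original index labels are preserved after filtering, which is why the filtered output above skips labels such as 408 and 410.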

Next, use groupby to group the rows by Age and Sex:

by_sex_age = cframe.groupby(['Age', 'Sex'])
by_sex_age.size()

Age    Sex
0.17   female    1
0.33   male      1
0.75   male      1
0.83   male      1
0.92   female    1
1.00   female    3
2.00   female    1
       male      1
3.00   female    1
5.00   male      1
                ..
60.00  female    3
60.50  male      1
61.00  male      2
62.00  male      1
63.00  female    1
       male      1
64.00  female    2
       male      1
67.00  male      1
76.00  female    1
Length: 115, dtype: int64
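
The result of size() on a two-key groupby is a Series with a two-level index, one entry per (Age, Sex) pair that actually occurs. A minimal sketch on made-up data follows; the name toy and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical passengers; there is no 30-year-old male in this sample.
toy = pd.DataFrame({
    "Age": [21.0, 21.0, 21.0, 24.0, 24.0, 30.0],
    "Sex": ["male", "male", "female", "male", "female", "female"],
})

sizes = toy.groupby(["Age", "Sex"]).size()

print(sizes)
print(sizes.loc[(21.0, "male")])   # look up one group with a tuple key
```

Combinations that never occur, such as (30.0, "male") here, are simply absent from the result rather than listed with a count of 0.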

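Use unstack to pivot the Sex level of the row index into columns. The code for this step is not shown; the later snippets refer to the result as agg_counts, which was presumably built with something like by_sex_age.size().unstack().fillna(0):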

Age / Sex | female | male
0.17 | 1.0 | 0.0
0.33 | 0.0 | 1.0
0.75 | 0.0 | 1.0
0.83 | 0.0 | 1.0
0.92 | 1.0 | 0.0
1.00 | 3.0 | 0.0
2.00 | 1.0 | 1.0
3.00 | 1.0 | 0.0
5.00 | 0.0 | 1.0
6.00 | 0.0 | 3.0
... | ... | ...
58.00 | 1.0 | 0.0
59.00 | 1.0 | 0.0
60.00 | 3.0 | 0.0
60.50 | 0.0 | 1.0
61.00 | 0.0 | 2.0
62.00 | 0.0 | 1.0
63.00 | 1.0 | 1.0
64.00 | 2.0 | 1.0
67.00 | 0.0 | 1.0
76.00 | 1.0 | 0.0

79 rows × 2 columns
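
A minimal sketch of this pivot on made-up data is shown below. It uses unstack(fill_value=0), which is one way to fill the missing combinations and keeps integer counts; the float values in the table above suggest the article filled the gaps after unstacking instead. The names toy and agg are assumptions for illustration.

```python
import pandas as pd

# Hypothetical passengers; there is no 30-year-old male in this sample.
toy = pd.DataFrame({
    "Age": [21.0, 21.0, 21.0, 24.0, 24.0, 30.0],
    "Sex": ["male", "male", "female", "male", "female", "female"],
})

# Count each (Age, Sex) pair, then move the Sex level into the columns.
# fill_value=0 replaces combinations that never occur.
agg = toy.groupby(["Age", "Sex"]).size().unstack(fill_value=0)

print(agg)
print(agg.loc[30.0, "male"])   # a combination that was absent before unstacking
```

After the pivot there is one row per age and one column per sex, which makes it easy to sum across columns.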

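We add up the counts for each age, then use argsort to get the row positions that sort the totals in ascending order:

indexer = agg_counts.sum(axis=1).argsort()
indexer.tail(10)

Age
58.0    37
59.0    31
60.0    29
60.5    32
61.0    34
62.0    22
63.0    38
64.0    27
67.0    26
76.0    30
dtype: int64

The values are row positions in agg_counts, not ages. Taking the rows at the last 10 positions gives the 10 ages with the largest totals: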

count_subset = agg_counts.take(indexer.tail(10))
count_subset

Age / Sex | female | male
29.0 | 5.0 | 5.0
25.0 | 1.0 | 10.0
23.0 | 5.0 | 6.0
26.0 | 4.0 | 8.0
27.0 | 4.0 | 8.0
18.0 | 7.0 | 6.0
30.0 | 6.0 | 9.0
22.0 | 10.0 | 6.0
21.0 | 3.0 | 14.0
24.0 | 5.0 | 12.0

The ranking above can be obtained more simply with nlargest, which returns the 10 largest totals directly:

agg_counts.sum(axis=1).nlargest(10)

Age
21.0    17.0
24.0    17.0
22.0    16.0
30.0    15.0
18.0    13.0
26.0    12.0
27.0    12.0
23.0    11.0
25.0    11.0
29.0    10.0
dtype: float64
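
The two approaches select the same rows but in opposite order: argsort with tail yields ascending totals, while nlargest yields descending totals. A minimal sketch on a made-up count table follows; the name agg and its values are assumptions for illustration and contain no ties.

```python
import pandas as pd

# Hypothetical per-age counts; row totals are 4, 5, 2 and 3.
agg = pd.DataFrame(
    {"female": [3, 1, 0, 2], "male": [1, 4, 2, 1]},
    index=pd.Index([21.0, 24.0, 30.0, 18.0], name="Age"),
)

totals = agg.sum(axis=1)

# Approach 1: positions that sort the totals ascending, then the last two.
indexer = totals.argsort()
top2_a = agg.take(indexer.tail(2).to_numpy())

# Approach 2: pick the labels of the two largest totals directly.
top2_b = agg.loc[totals.nlargest(2).index]

print(top2_a)   # smallest of the top two first
print(top2_b)   # largest first
```

When the original columns are needed for plotting, indexing with the labels from nlargest keeps them, whereas nlargest on its own returns only the totals.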

Apply stack to count_subset to prepare it for plotting:

stack_subset = count_subset.stack()
stack_subset

Age   Sex
29.0  female     5.0
      male       5.0
25.0  female     1.0
      male      10.0
23.0  female     5.0
      male       6.0
26.0  female     4.0
      male       8.0
27.0  female     4.0
      male       8.0
18.0  female     7.0
      male       6.0
30.0  female     6.0
      male       9.0
22.0  female    10.0
      male       6.0
21.0  female     3.0
      male      14.0
24.0  female     5.0
      male      12.0
dtype: float64
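stack_subset.name = 'total'
stack_subset = stack_subset.reset_index()
stack_subset

(index) | Age | Sex | total
0 | 29.0 | female | 5.0
1 | 29.0 | male | 5.0
2 | 25.0 | female | 1.0
3 | 25.0 | male | 10.0
4 | 23.0 | female | 5.0
5 | 23.0 | male | 6.0
6 | 26.0 | female | 4.0
7 | 26.0 | male | 8.0
8 | 27.0 | female | 4.0
9 | 27.0 | male | 8.0
10 | 18.0 | female | 7.0
11 | 18.0 | male | 6.0
12 | 30.0 | female | 6.0
13 | 30.0 | male | 9.0
14 | 22.0 | female | 10.0
15 | 22.0 | male | 6.0
16 | 21.0 | female | 3.0
17 | 21.0 | male | 14.0
18 | 24.0 | female | 5.0
19 | 24.0 | male | 12.0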
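
The stack and reset_index steps turn a wide table (one column per sex) into a long table (one row per age and sex), which is the layout that plotting functions with a hue argument typically expect. A minimal sketch on made-up counts follows; the names wide and long_form and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per age, one column per sex.
wide = pd.DataFrame(
    {"female": [3, 1], "male": [1, 4]},
    index=pd.Index([21.0, 24.0], name="Age"),
)
wide.columns.name = "Sex"   # becomes the name of the new index level

# stack moves the column labels into an inner row-index level.
long_form = wide.stack()
long_form.name = "total"    # names the value column created by reset_index

# reset_index turns both index levels into ordinary columns.
long_form = long_form.reset_index()

print(long_form)
```

Naming the Series before calling reset_index avoids ending up with a value column labelled 0.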

Plot the result:

sns.barplot(x='total', y='Age', hue='Sex',  data=stack_subset)

The examples in this article are available at: https://github.com/ddean2009/...

This article is also published at http://www.flydean.com/01-pandas-titanic/
