About Python: a painstakingly compiled guide to the eight lifecycle stages of pandas, Python's data-analysis workhorse

This article walks through eight stages of the pandas data-processing lifecycle and summarizes how the pandas framework handles data at each one.


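The eight stages are the DataFrame object, structure information, data cleaning, preprocessing, extraction, filtering, aggregation, and statistics; together they cover the pandas operations used most often.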

First, import the third-party (non-standard-library) packages. Besides pandas, the numpy scientific computing library usually accompanies data-analysis work.

# Import the pandas library under the alias pd.
import pandas as pd
# Import the numpy library under the alias np.
import numpy as np

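1. The DataFrame object

In pandas data analysis, extraction, aggregation, statistics, and similar operations are mostly carried out on DataFrame objects.

There are two ways to initialize a DataFrame. The first is to read an Excel or CSV file directly, which returns the data as a DataFrame.

# Read the CSV file; read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed.
dataframe_csv = pd.read_csv('./data.csv')
# Read the Excel file; read_excel also returns a DataFrame (reading .xlsx needs an engine such as openpyxl).
dataframe_xlsx = pd.read_excel('./data.xlsx')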
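To try the file-reading path without a file on disk, the sketch below feeds read_csv an in-memory buffer. The CSV content is made up for illustration and merely stands in for ./data.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for ./data.csv.
csv_text = "编程语言,已诞生多少年\nJava,23\nPython,20\nC++,28\n"

# read_csv accepts a path or any file-like object and returns a DataFrame.
dataframe_csv = pd.read_csv(io.StringIO(csv_text))

print(type(dataframe_csv).__name__)
print(dataframe_csv.shape)
```

The first line of the buffer becomes the column names, so the result has three rows and two columns.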

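The second is to build the data yourself and initialize a DataFrame directly from a Python object such as a dictionary. In the examples that follow, the column 编程语言 holds the programming language and 已诞生多少年 holds the number of years since its release.

# Create a DataFrame with two columns, `编程语言` and `已诞生多少年`.
dataframe = pd.DataFrame({"编程语言": ['Java', 'Python', 'C++'],
                          "已诞生多少年": [23, 20, 28]},
                         columns=['编程语言', '已诞生多少年'])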

2. DataFrame structure information

The DataFrame's built-in methods and attributes report the data's dimensions, column names, dtypes, and so on.

# Create a DataFrame with two columns, `编程语言` and `已诞生多少年`.
dataframe = pd.DataFrame({"编程语言": ['Java', 'Python', 'C++'],
                          "已诞生多少年": [23, 20, 28]},
                         columns=['编程语言', '已诞生多少年'])

dataframe.info()

Displays basic information about the table: the number of rows and columns, the column names, the dtypes, and the memory usage.

dataframe.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 2 columns):
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   编程语言    3 non-null      object
#  1   已诞生多少年  3 non-null      int64
# dtypes: int64(1), object(1)
# memory usage: (value depends on the pandas version and platform)

dataframe.columns

Returns the names of all columns of the DataFrame as an Index object.

print('Column names: {0}'.format(dataframe.columns))
# Column names: Index(['编程语言', '已诞生多少年'], dtype='object')

dataframe['column name'].dtype

Returns the dtype of a single column of the DataFrame.

print('dtype of column 编程语言: {0}'.format(dataframe['编程语言'].dtype))
# dtype of column 编程语言: object

dataframe.shape

The shape attribute (an attribute, not a function) gives the number of rows and columns as a tuple.

print('dataframe shape: {0}'.format(dataframe.shape))
# dataframe shape: (3, 2)

dataframe.values

The values attribute returns the full data content as a numpy array.

# Import the pprint function from the pprint module.
from pprint import pprint
pprint('dataframe values: {0}'.format(dataframe.values))
# "dataframe values: [['Java' 23]\n ['Python' 20]\n ['C++' 28]]"

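3. Data cleaning

Data cleaning normalizes the data in a DataFrame: filling missing values, removing duplicates, converting columns to a consistent dtype, and so on.

dataframe.fillna()

# Fill every missing value with 0 (returns a new DataFrame).
dataframe.fillna(value=0)
# Fill missing values in one column with that column's mean (returns a new Series).
dataframe['已诞生多少年'].fillna(dataframe['已诞生多少年'].mean())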
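The sample table has no missing values, so fillna has nothing to do there. A minimal sketch with one NaN injected on purpose (the data is made up) shows mean filling and the fact that the result must be assigned back:

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value.
df = pd.DataFrame({"编程语言": ["Java", "Python", "C++"],
                   "已诞生多少年": [23.0, np.nan, 28.0]})

# mean() skips NaN by default: (23 + 28) / 2.
mean_years = df["已诞生多少年"].mean()

# fillna returns a new Series; df is not modified yet.
filled = df["已诞生多少年"].fillna(mean_years)
missing_before_assign = int(df["已诞生多少年"].isna().sum())

# Assign the result back to keep the change.
df["已诞生多少年"] = filled
missing_after_assign = int(df["已诞生多少年"].isna().sum())

print(mean_years, missing_before_assign, missing_after_assign)
```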

map(str.strip)

# Strip leading and trailing whitespace from one column, then assign the result back to that column.
dataframe['编程语言'] = dataframe['编程语言'].map(str.strip)
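map(str.strip) calls str.strip on every element, so a missing value in the column would raise a TypeError. A hedged alternative is the .str accessor, which passes missing values through; the sample strings below are invented:

```python
import pandas as pd

# Illustrative column with stray whitespace and one missing value.
languages = pd.Series([" Java ", "Python  ", None, "  C++"])

# .str.strip() trims each string and leaves the missing entry missing.
cleaned = languages.str.strip()

print(cleaned.dropna().tolist())
print(int(cleaned.isna().sum()))
```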

dataframe.astype

# Convert the dtype of one column (returns a new Series; assign it back to keep the change).
dataframe['已诞生多少年'].astype('int')
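A typical reason to call astype is a numeric column that was read as text. The sketch below assumes such a column (the string values are made up) and converts it with an explicit int64 so the result does not depend on the platform's default integer size:

```python
import pandas as pd

# Illustrative column whose numbers arrived as strings.
df = pd.DataFrame({"已诞生多少年": ["23", "20", "28"]})

# astype returns a new Series; assign it back to replace the column.
df["已诞生多少年"] = df["已诞生多少年"].astype("int64")

print(df["已诞生多少年"].dtype)
print(df["已诞生多少年"].sum())
```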

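dataframe.rename

# Rename a column (returns a new DataFrame; assign it to keep the change).
dataframe.rename(columns={'已诞生多少年': '语言年龄'})

dataframe.drop_duplicates

# Drop repeated values within one column (returns a new Series; the DataFrame itself is unchanged).
dataframe['编程语言'].drop_duplicates()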
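Calling drop_duplicates on a single column only deduplicates that Series. To drop whole rows while judging duplicates by one column, the DataFrame method takes a subset argument. A small sketch with an invented repeated row:

```python
import pandas as pd

# Illustrative data in which Java appears twice.
df = pd.DataFrame({"编程语言": ["Java", "Python", "Java"],
                   "已诞生多少年": [23, 20, 24]})

# Keep the first row for each distinct value of 编程语言.
deduped = df.drop_duplicates(subset=["编程语言"], keep="first")

print(deduped)
```

Passing keep="last" would retain the later Java row instead.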

dataframe.replace

# Replace a given value in one column (returns a new Series).
dataframe['编程语言'].replace('Java', 'C#')

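4. Data preprocessing

Data preprocessing refers to the processing applied to data before the main processing step.

For example, for most areal geophysical survey data, an irregularly distributed survey grid is first interpolated onto a regular grid before any transformation or enhancement, which makes the data easier to compute on.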

Merging data

pandas offers four ways to merge DataFrame objects: merge, append, join, and concat. Each produces a different result. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat takes its place.

The following examples use three of the more common ones (append, concat, and join) to show how DataFrame objects are merged.

First, append merges the contents of two DataFrame objects.

# Create a DataFrame with two columns, `编程语言` and `已诞生多少年`.
dataframeA = pd.DataFrame({"编程语言": ['Java', 'Python', 'C++'],
                           "已诞生多少年": [23, 20, 28]}, columns=['编程语言', '已诞生多少年'])
# Create a second DataFrame with the same two columns.
dataframeB = pd.DataFrame({"编程语言": ['Scala', 'C#', 'Go'],
                           "已诞生多少年": [23, 20, 28]}, columns=['编程语言', '已诞生多少年'])
# Append dataframeB to dataframeA (requires pandas < 2.0, where DataFrame.append still exists).
res = dataframeA.append(dataframeB)
# Print the result of the append operation.
print(res)
#      编程语言  已诞生多少年
# 0    Java      23
# 1  Python      20
# 2     C++      28
# 0   Scala      23
# 1      C#      20
# 2      Go      28

Next, concat merges the contents of the same two DataFrame objects.

# Concatenate the two DataFrames.
res = pd.concat([dataframeA, dataframeB])
# Print the result of the concat operation.
print(res)
#      编程语言  已诞生多少年
# 0    Java      23
# 1  Python      20
# 2     C++      28
# 0   Scala      23
# 1      C#      20
# 2      Go      28

concat produces the same result as append here: both stack the rows vertically.
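Both results above repeat the index labels 0 to 2. If a fresh running index is wanted, pd.concat accepts ignore_index=True. A minimal sketch using the same kind of data:

```python
import pandas as pd

dataframeA = pd.DataFrame({"编程语言": ["Java", "Python", "C++"],
                           "已诞生多少年": [23, 20, 28]})
dataframeB = pd.DataFrame({"编程语言": ["Scala", "C#", "Go"],
                           "已诞生多少年": [23, 20, 28]})

# Stack the rows and renumber the index from 0.
stacked = pd.concat([dataframeA, dataframeB], ignore_index=True)

print(stacked.index.tolist())
```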

Finally, join merges the structure and contents of two DataFrame objects horizontally.

# Create a DataFrame with two columns, `编程语言` and `已诞生多少年`.
dataframeC = pd.DataFrame({"编程语言": ['Java', 'Python', 'C++'],
                           "已诞生多少年": [23, 20, 28]}, columns=['编程语言', '已诞生多少年'])
# Create a DataFrame with one column, `历史体现`, and three rows.
dataframeD = pd.DataFrame({"历史体现": ['A', 'A', 'A']})
# Join the two DataFrames on their index.
res = dataframeC.join(dataframeD, on=None)
# Print the result of the join operation.
print(res)
#      编程语言  已诞生多少年 历史体现
# 0    Java      23    A
# 1  Python      20    A
# 2     C++      28    A

After the join, dataframeD has been added as an extra column, and each row is filled with the value A from the row that shares its index label.
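merge is listed above but not demonstrated. A minimal sketch, assuming a second table keyed by 编程语言 (its contents are invented for illustration), shows how the how argument changes which rows survive:

```python
import pandas as pd

left = pd.DataFrame({"编程语言": ["Java", "Python", "C++"],
                     "已诞生多少年": [23, 20, 28]})
# Hypothetical lookup table; Go has no match on the left, C++ none on the right.
right = pd.DataFrame({"编程语言": ["Python", "Java", "Go"],
                      "历史体现": ["A", "B", "C"]})

# Inner merge keeps only keys present in both tables.
inner = left.merge(right, on="编程语言", how="inner")

# Left merge keeps every row of `left`; unmatched rows get NaN.
left_joined = left.merge(right, on="编程语言", how="left")

print(inner)
print(left_joined)
```

Unlike join, which aligns on the index by default, merge matches rows on the values of the key column.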

Setting an index

Setting an index on a DataFrame is straightforward: pass the name of the column to use as the index to the set_index function.

# Create a DataFrame with two columns, `编程语言` and `已诞生多少年`.
dataframeE = pd.DataFrame({"编程语言": ['Java', 'Python', 'C++'],
                           "已诞生多少年": [23, 20, 28]}, columns=['编程语言', '已诞生多少年'])
# set_index returns a new DataFrame indexed by `编程语言`; dataframeE itself keeps its default integer index.
res = dataframeE.set_index('编程语言')
# Print the re-indexed result.
print(res)
#         已诞生多少年
# 编程语言
# Java        23
# Python      20
# C++         28
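Because set_index returns a new object, the result has to be captured to be useful. A short sketch of a label lookup on the new index and of reset_index, which undoes the change:

```python
import pandas as pd

df = pd.DataFrame({"编程语言": ["Java", "Python", "C++"],
                   "已诞生多少年": [23, 20, 28]})

# Capture the re-indexed copy; df itself is untouched.
indexed = df.set_index("编程语言")

# Look a row up by its new label.
python_years = int(indexed.loc["Python", "已诞生多少年"])

# reset_index moves the index back into an ordinary column.
restored = indexed.reset_index()

print(python_years)
print(list(restored.columns))
```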

Sorting data

A DataFrame is sorted either by its index or by one or more specified columns, and the whole table is reordered accordingly.

# Sort dataframeE by its index.
res = dataframeE.sort_index()
# Print the result.
print(res)
#      编程语言  已诞生多少年
# 0    Java      23
# 1  Python      20
# 2     C++      28
# Sort dataframeE by the column `已诞生多少年` in descending order.
res = dataframeE.sort_values(by=['已诞生多少年'], ascending=False)
# Print the result.
print(res)
#      编程语言  已诞生多少年
# 2     C++      28
# 0    Java      23
# 1  Python      20

sort_index sorts by the DataFrame's current index, while sort_values sorts in ascending or descending order by the values of one or more specified columns.
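When sorting by several columns, ascending can be a list with one flag per column. The sketch below (data chosen to contain ties) sorts by years descending and breaks ties by name ascending:

```python
import pandas as pd

df = pd.DataFrame({"编程语言": ["Scala", "Java", "C#", "Python"],
                   "已诞生多少年": [23, 23, 20, 20]})

# Primary key: 已诞生多少年 descending; tie-breaker: 编程语言 ascending.
res = df.sort_values(by=["已诞生多少年", "编程语言"],
                     ascending=[False, True])

print(res["编程语言"].tolist())
print(res.index.tolist())
```

The original index labels travel with the rows; reset_index(drop=True) renumbers them if needed.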

Grouping data

In preprocessing, grouping mainly means tagging the rows that belong together with a label so that they can be classified later.

Simple grouping can be done with functions from numpy; here numpy's where function applies the condition.

# Create a new column `分组标记（高龄/低龄）`: `高` if `已诞生多少年` is greater than or equal to 23, otherwise `低`.
dataframeE['分组标记（高龄/低龄）'] = np.where(dataframeE['已诞生多少年'] >= 23, '高', '低')
# Print dataframeE.
print(dataframeE)
#      编程语言  已诞生多少年 分组标记（高龄/低龄）
# 0    Java      23           高
# 1  Python      20           低
# 2     C++      28           高

Slightly more complex cases can combine several conditions to find the matching rows and label them.

# Create a new column `分组标记（高龄/低龄,是否是Java）`: `高/是` if `已诞生多少年` is greater than or equal to 23
# and `编程语言` equals `Java`, otherwise `低/否`.
dataframeE['分组标记（高龄/低龄,是否是Java）'] = np.where(
    (dataframeE['已诞生多少年'] >= 23) & (dataframeE['编程语言'] == 'Java'), '高/是', '低/否')
# Print dataframeE.
print(dataframeE)
#      编程语言  已诞生多少年 分组标记（高龄/低龄） 分组标记（高龄/低龄,是否是Java）
# 0    Java      23           高                 高/是
# 1  Python      20           低                 低/否
# 2     C++      28           高                 低/否
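np.where handles a two-way split. For more than two bands, pd.cut is one alternative (it is not used in the original article); the bin edges, the column name `age_band`, and the labels below are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"编程语言": ["Java", "Python", "C++"],
                   "已诞生多少年": [23, 20, 28]})

# Bins are right-inclusive by default: (0, 21], (21, 25], (25, 100].
df["age_band"] = pd.cut(df["已诞生多少年"],
                        bins=[0, 21, 25, 100],
                        labels=["low", "mid", "high"])

print(df["age_band"].astype(str).tolist())
```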

5. Extracting data

Extraction pulls out the data that meets a requirement. A DataFrame can be indexed by label, by position, or by a mix of the two.

loc selects by index label and iloc selects by integer position. With the default integer index used here, labels and positions coincide.

Almost every extraction can be done with these two indexers, loc and iloc.

Extract the row whose index label is 2.

# Select the row with index label 2.
res = dataframeE.loc[2]
# Print the result of the operation.
print(res)
# 编程语言                   C++
# 已诞生多少年                  28
# 分组标记（高龄/低龄）              高
# 分组标记（高龄/低龄,是否是Java）    低/否
# Name: 2, dtype: object

Extract all rows with index labels 0 through 1.

# Select the rows with index labels 0 and 1 (a loc slice includes both endpoints).
res = dataframeE.loc[0:1]
# Print the result of the operation.
print(res)
#      编程语言  已诞生多少年 分组标记（高龄/低龄） 分组标记（高龄/低龄,是否是Java）
# 0    Java      23           高                 高/是
# 1  Python      20           低                 低/否

Extract the region made up of the first two rows and the first two columns.

# Note: unlike the loc slice above, an iloc slice excludes its end position.
# Select the first two rows and the first two columns.
res = dataframeE.iloc[:2, :2]
# Print the result of the operation.
print(res)
#      编程语言  已诞生多少年
# 0    Java      23
# 1  Python      20
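The difference between loc and iloc is easier to see once the index is no longer 0, 1, 2. The sketch below uses made-up index labels 10, 20, 30 to contrast a label slice with a position slice:

```python
import pandas as pd

# Illustrative frame with non-default index labels.
df = pd.DataFrame({"编程语言": ["Java", "Python", "C++"],
                   "已诞生多少年": [23, 20, 28]},
                  index=[10, 20, 30])

# loc slices by label and includes both endpoints: labels 10 and 20.
by_label = df.loc[10:20]

# iloc slices by position and excludes the end: position 0 only.
by_position = df.iloc[0:1]

print(by_label["编程语言"].tolist())
print(by_position["编程语言"].tolist())
```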

Extract the rows that match a condition, here the rows whose value in one column belongs to a given set.

# Select the rows where the value in the column `编程语言` is either `Java` or `C++`.
res = dataframeE.loc[dataframeE['编程语言'].isin(['Java', 'C++'])]
# Print the result of the operation.
print(res)
#    编程语言  已诞生多少年 分组标记（高龄/低龄） 分组标记（高龄/低龄,是否是Java）
# 0  Java      23           高                 高/是
# 2   C++      28           高                 低/否

6. Filtering data

Filtering is the last extraction step applied to the original data in the processing lifecycle; rows are selected with logical conditions.

The examples below use the three common logical operations on a DataFrame, AND, OR, and NOT, to filter the data.

# Create a DataFrame with two columns, `编程语言` and `已诞生多少年`.
dataframeF = pd.DataFrame({"编程语言": ['Java', 'Python', 'C++'],
                           "已诞生多少年": [23, 20, 28]}, columns=['编程语言', '已诞生多少年'])
# AND: older than 25 years and named C++.
res = dataframeF.loc[(dataframeF['已诞生多少年'] > 25) & (dataframeF['编程语言'] == 'C++'), ['编程语言', '已诞生多少年']]
# Print the result of the operation.
print(res)
#   编程语言  已诞生多少年
# 2  C++      28
# OR: older than 23 years or named Java.
res = dataframeF.loc[(dataframeF['已诞生多少年'] > 23) | (dataframeF['编程语言'] == 'Java'), ['编程语言', '已诞生多少年']]
# Print the result of the operation.
print(res)
#    编程语言  已诞生多少年
# 0  Java      23
# 2   C++      28
# NOT: every language except Java.
res = dataframeF.loc[(dataframeF['编程语言'] != 'Java'), ['编程语言', '已诞生多少年']]
# Print the result of the operation.
print(res)
#      编程语言  已诞生多少年
# 1  Python      20
# 2     C++      28
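NOT can also be written with the ~ operator on a boolean mask, which helps when the condition is not a simple comparison. A small sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({"编程语言": ["Java", "Python", "C++"],
                   "已诞生多少年": [23, 20, 28]})

# ~ inverts a boolean mask: rows whose language is NOT in the list.
not_java = df.loc[~df["编程语言"].isin(["Java"])]

# Masks combine with & and |; each condition needs its own parentheses.
mask = (df["已诞生多少年"] > 21) & ~df["编程语言"].isin(["Java"])
older_non_java = df.loc[mask]

print(not_java["编程语言"].tolist())
print(older_non_java["编程语言"].tolist())
```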

7. Aggregating data

Aggregation usually means grouping by one or more columns with groupby and then counting the rows in each group with count.

# Group by `编程语言` and count the non-null values in every other column.
res = dataframeF.groupby('编程语言').count()
# Print the result of the operation.
print(res)
#         已诞生多少年
# 编程语言
# C++          1
# Java         1
# Python       1
# Group by `编程语言` and count a single column.
res = dataframeF.groupby('编程语言')['已诞生多少年'].count()
# Print the result of the operation.
print(res)
# 编程语言
# C++       1
# Java      1
# Python    1
# Name: 已诞生多少年, dtype: int64
# Group by two columns and count.
res = dataframeF.groupby(['编程语言', '已诞生多少年'])['已诞生多少年'].count()
# Print the result of the operation.
print(res)
# 编程语言    已诞生多少年
# C++     28        1
# Java    23        1
# Python  20        1
# Name: 已诞生多少年, dtype: int64
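With one row per language, every count above is 1. Grouping becomes more informative when a key repeats and several statistics are requested at once. The sketch below adds a hypothetical `typing` column (not in the original data) and uses agg:

```python
import pandas as pd

# Illustrative data; the `typing` column and the Ruby row are invented.
df = pd.DataFrame({"编程语言": ["Java", "Python", "C++", "Ruby"],
                   "typing": ["static", "dynamic", "static", "dynamic"],
                   "已诞生多少年": [23, 20, 28, 22]})

# One row per group, one column per aggregate.
summary = df.groupby("typing")["已诞生多少年"].agg(["count", "mean", "max"])

print(summary)
```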

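8. Statistics

Statistics here follows the same idea as in mathematics: first sample the data, then compute indicators such as the standard deviation and covariance on the sample.

# Draw two rows at random from the DataFrame, sampling without replacement.
res = dataframeF.sample(n=2, replace=False)
# Print the result of the operation (one possible result; the rows vary between runs).
print(res)
#      编程语言  已诞生多少年
# 0    Java      23
# 1  Python      20

Each run draws two rows at random from the DataFrame.

To sample with replacement, set the replace parameter to True.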
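When a sample needs to be repeatable, sample accepts a random_state seed. The sketch below (the seed values are arbitrary) also shows that sampling with replacement may draw more rows than the table holds:

```python
import pandas as pd

df = pd.DataFrame({"编程语言": ["Java", "Python", "C++"],
                   "已诞生多少年": [23, 20, 28]})

# The same seed yields the same rows on every call.
first = df.sample(n=2, replace=False, random_state=42)
second = df.sample(n=2, replace=False, random_state=42)

# With replacement, n may exceed the number of rows.
oversampled = df.sample(n=5, replace=True, random_state=0)

print(first.equals(second))
print(len(oversampled))
```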

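# Compute the covariance of the numeric columns.
# numeric_only=True (available since pandas 1.5) skips the string column; pandas 2.0+ would otherwise fail trying to convert it.
res = dataframeF.cov(numeric_only=True)
# Print the result of the operation.
print(res)
#            已诞生多少年
# 已诞生多少年  16.333333
# Compute the correlation of the numeric columns.
res = dataframeF.corr(numeric_only=True)
# Print the result of the operation.
print(res)
#         已诞生多少年
# 已诞生多少年     1.0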
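The standard deviation is mentioned above but not computed. A minimal sketch on the same numbers, selecting the numeric columns explicitly with select_dtypes so that it does not rely on the numeric_only argument:

```python
import pandas as pd

df = pd.DataFrame({"编程语言": ["Java", "Python", "C++"],
                   "已诞生多少年": [23, 20, 28]})

years = df["已诞生多少年"]

# Sample statistics (ddof=1 by default).
variance = years.var()
std = years.std()

# Keep only numeric columns before computing the covariance matrix.
cov = df.select_dtypes(include="number").cov()

print(round(variance, 6), round(std, 6))
print(cov)
```

With a single numeric column, the only entry of the covariance matrix is that column's variance, which is why it matches the 16.333333 shown above.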

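That covers the full data-processing lifecycle in Python with pandas, along with the common techniques used at each stage.

Thanks for reading and for your continued support. Python集中营 will keep working to bring you better content!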
