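About Python: What are good ways to sort more than 100 GB of data in Python?

Learning the pandas sort methods is a great way to start with or practice basic data analysis in Python. Data analysis is most commonly done with spreadsheets, SQL, or pandas. One of the main strengths of pandas is that it can handle a large amount of data and offers high-performance data manipulation capabilities.

In this tutorial, you'll learn how to use .sort_values() and .sort_index(), which will enable you to sort the data in a DataFrame efficiently.

By the end of this tutorial, you'll know how to:

Sort a pandas DataFrame by the values of one or more columns
Use the ascending parameter to change the sort order
Sort a DataFrame by its index using .sort_index()
Organize missing data while sorting values
Sort a DataFrame in place using inplace set to True

To follow this tutorial, you'll need a basic understanding of pandas DataFrames and some familiarity with reading data from files.

Getting Started With Pandas Sort Methods

As a quick reminder, a DataFrame is a data structure with labeled axes for both rows and columns. You can sort a DataFrame by row or column values as well as by row or column index.

Both rows and columns have indices, which are numerical representations of where the data is located in your DataFrame. You can retrieve data from specific rows or columns using the DataFrame's index positions. By default, index numbers start from zero. You can also manually assign your own index.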
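
As a minimal sketch of these ideas, the snippet below builds a tiny DataFrame by hand (the three rows are made-up values, not taken from the EPA dataset), looks up data by index position with .iloc, and then assigns a custom index:

```python
import pandas as pd

# Hypothetical sample rows for illustration only
cars = pd.DataFrame(
    {"make": ["Audi", "BMW", "Volvo"], "city08": [17, 14, 19]}
)

# The default row index is 0, 1, 2, ...
default_index = list(cars.index)

# Retrieve the first row and the second column by position
first_row = cars.iloc[0]
second_column = cars.iloc[:, 1]

# Assign your own row labels in place of the default index
cars.index = ["a", "b", "c"]
row_b = cars.loc["b"]
```

Position-based lookups with .iloc keep working after the index is replaced, while label-based lookups with .loc use the new labels.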

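Preparing the Dataset

In this tutorial, you'll work with fuel economy data compiled by the US Environmental Protection Agency (EPA) on vehicles made between 1984 and 2021. The EPA fuel economy dataset is great because it contains many different types of information that you can sort on, from textual to numeric data types. The dataset contains eighty-three columns in total.

To follow along, you'll need to have the pandas Python library installed. The code in this tutorial was executed using pandas 1.2.0 and Python 3.9.1.

Note: The whole fuel economy dataset is around 18 MB. Reading the entire dataset into memory could take a minute or two. Limiting the number of rows and columns helps performance, but it will still take a few seconds to download the data.

For analysis purposes, you'll look at MPG (miles per gallon) data on vehicles by make, model, year, and other vehicle attributes. You can specify which columns to read into your DataFrame. For this tutorial, you'll need only a subset of the available columns.

Here are the commands to read the relevant columns of the fuel economy dataset into a DataFrame and to display the first five rows: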

>>> import pandas as pd

>>> column_subset = [
...     "id",
...     "make",
...     "model",
...     "year",
...     "cylinders",
...     "fuelType",
...     "trany",
...     "mpgData",
...     "city08",
...     "highway08"
... ]

>>> df = pd.read_csv(
...     "https://www.fueleconomy.gov/feg/epadata/vehicles.csv",
...     usecols=column_subset,
...     nrows=100
... )

>>> df.head()
   city08  cylinders fuelType  ...  mpgData            trany  year
0      19          4  Regular  ...        Y     Manual 5-spd  1985
1       9         12  Regular  ...        N     Manual 5-spd  1985
2      23          4  Regular  ...        Y     Manual 5-spd  1985
3      10          8  Regular  ...        N  Automatic 3-spd  1985
4      17          4  Premium  ...        N     Manual 5-spd  1993
[5 rows x 10 columns]

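By calling .read_csv() with the dataset URL, you can load the data into a DataFrame. Narrowing down the columns results in faster load times and less memory use. To further limit memory consumption and to get a quick feel for the data, you can specify how many rows to load using nrows.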
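
If you'd like to try usecols and nrows without downloading anything, the following sketch reads from an in-memory string instead of the EPA URL. The CSV content is a hypothetical stand-in with made-up rows:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real vehicles.csv file
csv_text = """id,make,model,year,city08,highway08
1,Audi,100,1993,17,22
2,BMW,740i,1993,14,21
3,Volvo,240,1993,19,25
4,Ford,Thunderbird,1993,16,24
"""

# Load only three of the six columns and only the first two rows
small_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["make", "city08", "highway08"],
    nrows=2,
)
```

The resulting DataFrame has two rows and three columns, in the order in which the columns appear in the file.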

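Getting Familiar With .sort_values()

You use .sort_values() to sort values in a DataFrame along either axis (columns or rows). Typically, you want to sort the rows in a DataFrame by the values of one or more columns:

The figure above shows the results of using .sort_values() to sort the DataFrame's rows based on the values in the highway08 column. This is similar to how you would sort data in a spreadsheet using a column.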
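
A small sketch of the same idea in code, using a hand-built DataFrame whose values are hypothetical rather than real EPA figures:

```python
import pandas as pd

# Hypothetical rows for illustration; the values are not from the EPA data
toy = pd.DataFrame(
    {
        "make": ["Dodge", "Ferrari", "Subaru", "Audi"],
        "highway08": [33, 14, 23, 22],
    }
)

# Rows are reordered by highway08; each row keeps its original index label
by_highway = toy.sort_values("highway08")
```

The original toy DataFrame is left untouched, and the index labels of by_highway show where each row came from.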

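Getting Familiar With .sort_index()

You use .sort_index() to sort a DataFrame by its row index or column labels. The difference from using .sort_values() is that you're sorting the DataFrame based on its row index or column names, not by the values in these rows or columns:

The row index of the DataFrame is outlined in blue in the figure above. An index isn't considered a column, and you typically have only a single row index. The row index can be thought of as the row numbers, which start from zero.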
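
The difference can be sketched with a small, hypothetical DataFrame whose row index is deliberately out of order:

```python
import pandas as pd

# Hypothetical rows with a deliberately shuffled row index
shuffled = pd.DataFrame(
    {"make": ["Volvo", "Audi", "BMW"], "city08": [19, 17, 14]},
    index=[2, 0, 1],
)

# Sort by the row index labels, not by any column's values
by_row_index = shuffled.sort_index()

# Sort by the column labels instead (axis=1)
by_column_label = shuffled.sort_index(axis=1)
```

In both calls the data values are only rearranged, never compared: the ordering comes entirely from the labels.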

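Sorting Your DataFrame on a Single Column

To sort the DataFrame based on the values in a single column, you'll use .sort_values(). By default, this returns a new DataFrame sorted in ascending order. It doesn't modify the original DataFrame.

Sorting by a Column in Ascending Order

To use .sort_values(), you pass a single argument to the method containing the name of the column you want to sort by. In this example, you sort the DataFrame by the city08 column, which represents city MPG for fuel-only cars: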

>>> df.sort_values("city08")
    city08  cylinders fuelType  ...  mpgData            trany  year
99       9          8  Premium  ...        N  Automatic 4-spd  1993
1        9         12  Regular  ...        N     Manual 5-spd  1985
80       9          8  Regular  ...        N  Automatic 3-spd  1985
47       9          8  Regular  ...        N  Automatic 3-spd  1985
3       10          8  Regular  ...        N  Automatic 3-spd  1985
..     ...        ...      ...  ...      ...              ...   ...
9       23          4  Regular  ...        Y  Automatic 4-spd  1993
8       23          4  Regular  ...        Y     Manual 5-spd  1993
7       23          4  Regular  ...        Y  Automatic 3-spd  1993
76      23          4  Regular  ...        Y     Manual 5-spd  1993
2       23          4  Regular  ...        Y     Manual 5-spd  1985
[100 rows x 10 columns]

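This sorts your DataFrame using the column values from city08, showing the vehicles with the lowest MPG first. By default, .sort_values() sorts your data in ascending order. Although you didn't specify a name for the argument you passed to .sort_values(), you actually used the by parameter, which you'll see in the next example.

Changing the Sort Order

Another parameter of .sort_values() is ascending. By default, .sort_values() has ascending set to True. If you want the DataFrame sorted in descending order, then you can pass False to this parameter: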

>>> df.sort_values(
...     by="city08",
...     ascending=False
... )
    city08  cylinders fuelType  ...  mpgData            trany  year
9       23          4  Regular  ...        Y  Automatic 4-spd  1993
2       23          4  Regular  ...        Y     Manual 5-spd  1985
7       23          4  Regular  ...        Y  Automatic 3-spd  1993
8       23          4  Regular  ...        Y     Manual 5-spd  1993
76      23          4  Regular  ...        Y     Manual 5-spd  1993
..     ...        ...      ...  ...      ...              ...   ...
58      10          8  Regular  ...        N  Automatic 3-spd  1985
80       9          8  Regular  ...        N  Automatic 3-spd  1985
1        9         12  Regular  ...        N     Manual 5-spd  1985
47       9          8  Regular  ...        N  Automatic 3-spd  1985
99       9          8  Premium  ...        N  Automatic 4-spd  1993
[100 rows x 10 columns]

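By passing False to ascending, you reverse the sort order. Now your DataFrame is sorted in descending order by the average MPG measured in city conditions. The vehicles with the highest MPG values are in the first rows.

Choosing a Sorting Algorithm

It's worth noting that pandas lets you choose different sorting algorithms to use with both .sort_values() and .sort_index(). The available algorithms are quicksort, mergesort, and heapsort. For more information on these algorithms, check out Sorting Algorithms in Python.

The algorithm used by default when sorting on a single column is quicksort. To change this to a stable sorting algorithm, use mergesort. You can do that with the kind parameter in .sort_values() or .sort_index(), like this: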

>>> df.sort_values(
...     by="city08",
...     ascending=False,
...     kind="mergesort"
... )
    city08  cylinders fuelType  ...  mpgData            trany  year
2       23          4  Regular  ...        Y     Manual 5-spd  1985
7       23          4  Regular  ...        Y  Automatic 3-spd  1993
8       23          4  Regular  ...        Y     Manual 5-spd  1993
9       23          4  Regular  ...        Y  Automatic 4-spd  1993
10      23          4  Regular  ...        Y     Manual 5-spd  1993
..     ...        ...      ...  ...      ...              ...   ...
69      10          8  Regular  ...        N  Automatic 3-spd  1985
1        9         12  Regular  ...        N     Manual 5-spd  1985
47       9          8  Regular  ...        N  Automatic 3-spd  1985
80       9          8  Regular  ...        N  Automatic 3-spd  1985
99       9          8  Premium  ...        N  Automatic 4-spd  1993
[100 rows x 10 columns]

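Using kind, you set the sorting algorithm to mergesort. The previous output used the default quicksort algorithm. Looking at the index values, you can see that the rows with equal city08 values are in a different order. This is because quicksort is not a stable sorting algorithm, but mergesort is.

Note: In pandas, kind is ignored when you sort on more than one column or label.

When you sort multiple records that have the same key, a stable sorting algorithm keeps those records in their original relative order after sorting. For that reason, using a stable sorting algorithm is necessary if you plan to perform multiple sorts.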
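
To make the point about multiple sorts concrete, here is a minimal sketch using a small made-up DataFrame (the rows are hypothetical, not EPA data). It sorts by a secondary key, then by a primary key with a stable algorithm, and compares the result with a single multi-column sort:

```python
import pandas as pd

# Hypothetical data for illustration
cars = pd.DataFrame(
    {
        "make": ["Ford", "Audi", "Ford", "Audi", "Ford"],
        "city08": [16, 17, 21, 14, 18],
    }
)

# Sort by the secondary key first, then by the primary key with a stable
# algorithm, so rows with the same make stay ordered by city08
two_pass = (
    cars.sort_values("city08", kind="mergesort")
    .sort_values("make", kind="mergesort")
)

# A single multi-column sort gives the same row order
one_pass = cars.sort_values(["make", "city08"])
```

Because the second pass is stable, rows that share a make keep the city08 ordering established by the first pass. With an unstable algorithm, that ordering would not be guaranteed.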

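Sorting Your DataFrame on Multiple Columns

In data analysis, it's common to want to sort your data based on the values of multiple columns. Imagine you have a dataset with people's first and last names. It would make sense to sort by last name and then by first name, so that people with the same last name are arranged alphabetically according to their first names.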
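
The name example might look like the following sketch. The people DataFrame and its column names are invented for illustration:

```python
import pandas as pd

# Hypothetical people, made up for this example
people = pd.DataFrame(
    {
        "first_name": ["Maria", "Alan", "Zoe", "Ben"],
        "last_name": ["Smith", "Jones", "Jones", "Smith"],
    }
)

# Sort by last name first; first name breaks ties within a last name
by_name = people.sort_values(by=["last_name", "first_name"])
```

Both people named Jones come before both people named Smith, and within each last name the first names are in alphabetical order.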

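In the first example, you sorted your DataFrame on a single column named city08. From an analysis standpoint, the MPG in city conditions is an important factor that could determine the desirability of a car. In addition to the MPG in city conditions, you may also want to look at MPG for highway conditions. To sort by two keys, you can pass a list of column names to by: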

>>> df.sort_values(
...     by=["city08", "highway08"]
... )[["city08", "highway08"]]
    city08  highway08
80       9         10
47       9         11
99       9         13
1        9         14
58      10         11
..     ...        ...
9       23         30
10      23         30
8       23         31
76      23         31
2       23         33
[100 rows x 2 columns]

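By specifying a list of the column names city08 and highway08, you sort the DataFrame on two columns using .sort_values(). The next example will explain how to specify the sort order and why it's important to pay attention to the list of column names you use.

Sorting by Multiple Columns in Ascending Order

To sort the DataFrame on multiple columns, you must provide a list of column names. For example, to sort by make and model, you should create the following list and then pass it to .sort_values():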

>>> df.sort_values(
...     by=["make", "model"]
... )[["make", "model"]]
          make               model
0   Alfa Romeo  Spider Veloce 2000
18        Audi                 100
19        Audi                 100
20         BMW                740i
21         BMW               740il
..         ...                 ...
12  Volkswagen      Golf III / GTI
13  Volkswagen           Jetta III
15  Volkswagen           Jetta III
16       Volvo                 240
17       Volvo                 240
[100 rows x 2 columns]

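Now your DataFrame is sorted in ascending order by make. If there are two or more identical makes, then it's sorted by model. The order in which the column names are specified in your list corresponds to how your DataFrame will be sorted.

Changing the Column Sort Order

Since you're sorting using multiple columns, you can specify the order by which your columns get sorted. If you want to change the logical sort order from the previous example, then you can change the order of the column names in the list you pass to the by parameter: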

>>> df.sort_values(
...     by=["model", "make"]
... )[["make", "model"]]
             make        model
18           Audi          100
19           Audi          100
16          Volvo          240
17          Volvo          240
75          Mazda          626
..            ...          ...
62           Ford  Thunderbird
63           Ford  Thunderbird
88     Oldsmobile     Toronado
42  CX Automotive        XM v6
43  CX Automotive       XM v6a
[100 rows x 2 columns]

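Your DataFrame is now sorted by the model column in ascending order, then sorted by make if there are two or more of the same model. You can see that changing the order of columns also changes the order in which the values get sorted.

Sorting by Multiple Columns in Descending Order

Up to this point, you've sorted only in ascending order on multiple columns. In the next example, you'll sort in descending order based on the make and model columns. To sort in descending order, set ascending to False: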

>>> df.sort_values(
...     by=["make", "model"],
...     ascending=False
... )[["make", "model"]]
          make               model
16       Volvo                 240
17       Volvo                 240
13  Volkswagen           Jetta III
15  Volkswagen           Jetta III
11  Volkswagen      Golf III / GTI
..         ...                 ...
21         BMW               740il
20         BMW                740i
18        Audi                 100
19        Audi                 100
0   Alfa Romeo  Spider Veloce 2000
[100 rows x 2 columns]

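The values in the make column are in reverse alphabetical order, and the values in the model column are in descending order for any cars with the same make. With textual data, the sort is case sensitive, meaning capitalized text will appear first in ascending order and last in descending order.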
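
The case-sensitive behavior can be seen in a short sketch with made-up labels. The second call is an addition that goes beyond the original text: it assumes pandas 1.1.0 or later, where .sort_values() accepts a key argument, and uses it to sort without regard to case:

```python
import pandas as pd

# Hypothetical mixed-case labels for illustration
labels = pd.DataFrame({"make": ["bmw", "Audi", "volvo", "Ford"]})

# Default sort is case sensitive: uppercase letters sort before lowercase
case_sensitive = labels.sort_values("make")

# Assumption: pandas 1.1.0 or later, where sort_values() accepts key.
# Lowercasing the values before comparison makes the sort case insensitive.
case_insensitive = labels.sort_values("make", key=lambda s: s.str.lower())
```

The key function receives the whole column as a Series and must return a Series of the same length. It affects only the comparison, so the original capitalization is preserved in the result.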

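Sorting by Multiple Columns With Different Sort Orders

You might be wondering if it's possible to sort using multiple columns and to have those columns use different ascending arguments. With pandas, you can do this with a single method call. If you want to sort some columns in ascending order and some columns in descending order, then you can pass a list of Booleans to ascending.

In this example, you sort your DataFrame by the make, model, and city08 columns, with the first two columns sorted in ascending order and city08 sorted in descending order. To do so, you pass a list of column names to by and a list of Booleans to ascending: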

>>> df.sort_values(
...     by=["make", "model", "city08"],
...     ascending=[True, True, False]
... )[["make", "model", "city08"]]
          make               model  city08
0   Alfa Romeo  Spider Veloce 2000      19
18        Audi                 100      17
19        Audi                 100      17
20         BMW                740i      14
21         BMW               740il      14
..         ...                 ...     ...
11  Volkswagen      Golf III / GTI      18
15  Volkswagen           Jetta III      20
13  Volkswagen           Jetta III      18
17       Volvo                 240      19
16       Volvo                 240      18
[100 rows x 3 columns]

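Now your DataFrame is sorted by make and model in ascending order, but with the city08 column in descending order. This is helpful because it groups the cars in a categorical order and shows the highest MPG cars first.

Sorting Your DataFrame on Its Index

Before sorting on the index, it's a good idea to know what an index represents. A DataFrame has an .index property, which by default is a numerical representation of its rows' locations. You can think of the index as the row numbers. It helps in quick row lookup and identification.

Sorting by Index in Ascending Order

You can sort a DataFrame based on its row index with .sort_index(). Sorting by column values like you did in the previous examples reorders the rows in your DataFrame, so the index becomes disorganized. This can also happen when you filter a DataFrame or when you drop or add rows.

To illustrate the use of .sort_index(), start by creating a new sorted DataFrame using .sort_values():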

>>> sorted_df = df.sort_values(by=["make", "model"])
>>> sorted_df
    city08  cylinders fuelType  ...  mpgData            trany  year
0       19          4  Regular  ...        Y     Manual 5-spd  1985
18      17          6  Premium  ...        Y  Automatic 4-spd  1993
19      17          6  Premium  ...        N     Manual 5-spd  1993
20      14          8  Premium  ...        N  Automatic 5-spd  1993
21      14          8  Premium  ...        N  Automatic 5-spd  1993
..     ...        ...      ...  ...      ...              ...   ...
12      21          4  Regular  ...        Y     Manual 5-spd  1993
13      18          4  Regular  ...        N  Automatic 4-spd  1993
15      20          4  Regular  ...        N     Manual 5-spd  1993
16      18          4  Regular  ...        Y  Automatic 4-spd  1993
17      19          4  Regular  ...        Y     Manual 5-spd  1993
[100 rows x 10 columns]

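You've created a DataFrame that's sorted using multiple values. Notice how the row index is in no particular order. To get your new DataFrame back to the original order, you can use .sort_index():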

>>> sorted_df.sort_index()
    city08  cylinders fuelType  ...  mpgData            trany  year
0       19          4  Regular  ...        Y     Manual 5-spd  1985
1        9         12  Regular  ...        N     Manual 5-spd  1985
2       23          4  Regular  ...        Y     Manual 5-spd  1985
3       10          8  Regular  ...        N  Automatic 3-spd  1985
4       17          4  Premium  ...        N     Manual 5-spd  1993
..     ...        ...      ...  ...      ...              ...   ...
95      17          6  Regular  ...        Y  Automatic 3-spd  1993
96      17          6  Regular  ...        N  Automatic 4-spd  1993
97      15          6  Regular  ...        N  Automatic 4-spd  1993
98      15          6  Regular  ...        N     Manual 5-spd  1993
99       9          8  Premium  ...        N  Automatic 4-spd  1993
[100 rows x 10 columns]

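Now the index is in ascending order. Just like in .sort_values(), the default argument for ascending in .sort_index() is True, and you can change to descending order by passing False. Sorting on the index has no impact on the data itself because the values are unchanged.

This is particularly useful when you've assigned a custom index with .set_index(). If you want to set a custom index using the make and model columns, then you can pass a list to .set_index():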

>>> assigned_index_df = df.set_index(
...     ["make", "model"]
... )
>>> assigned_index_df
                                  city08  cylinders  ...            trany  year
make        model                                    ...
Alfa Romeo  Spider Veloce 2000        19          4  ...     Manual 5-spd  1985
Ferrari     Testarossa                 9         12  ...     Manual 5-spd  1985
Dodge       Charger                   23          4  ...     Manual 5-spd  1985
            B150/B250 Wagon 2WD       10          8  ...  Automatic 3-spd  1985
Subaru      Legacy AWD Turbo          17          4  ...     Manual 5-spd  1993
                                  ...        ...  ...              ...   ...
Pontiac     Grand Prix                17          6  ...  Automatic 3-spd  1993
            Grand Prix                17          6  ...  Automatic 4-spd  1993
            Grand Prix                15          6  ...  Automatic 4-spd  1993
            Grand Prix                15          6  ...     Manual 5-spd  1993
Rolls-Royce Brooklands/Brklnds L       9          8  ...  Automatic 4-spd  1993
[100 rows x 8 columns]

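Using this method, you replace the default integer-based row index with two axis labels. This is considered a MultiIndex or a hierarchical index. Your DataFrame is now indexed by more than one key, which you can sort on with .sort_index():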

>>> assigned_index_df.sort_index()
                               city08  cylinders  ...            trany  year
make       model                                  ...
Alfa Romeo Spider Veloce 2000      19          4  ...     Manual 5-spd  1985
Audi       100                     17          6  ...  Automatic 4-spd  1993
           100                     17          6  ...     Manual 5-spd  1993
BMW        740i                    14          8  ...  Automatic 5-spd  1993
           740il                   14          8  ...  Automatic 5-spd  1993
                               ...        ...  ...              ...   ...
Volkswagen Golf III / GTI          21          4  ...     Manual 5-spd  1993
           Jetta III               18          4  ...  Automatic 4-spd  1993
           Jetta III               20          4  ...     Manual 5-spd  1993
Volvo      240                     18          4  ...  Automatic 4-spd  1993
           240                     19          4  ...     Manual 5-spd  1993
[100 rows x 8 columns]

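First, you assign a new index to your DataFrame using the make and model columns, then you sort the index using .sort_index(). You can read more about using .set_index() in the pandas documentation.

Sorting by Index in Descending Order

For the next example, you'll sort your DataFrame by its index in descending order. Remember from sorting your DataFrame with .sort_values() that you can reverse the sort order by setting ascending to False. This parameter also works with .sort_index(), so you can sort your DataFrame in reverse order like this: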

>>> assigned_index_df.sort_index(ascending=False)
                               city08  cylinders  ...            trany  year
make       model                                  ...
Volvo      240                     18          4  ...  Automatic 4-spd  1993
           240                     19          4  ...     Manual 5-spd  1993
Volkswagen Jetta III               18          4  ...  Automatic 4-spd  1993
           Jetta III               20          4  ...     Manual 5-spd  1993
           Golf III / GTI          18          4  ...  Automatic 4-spd  1993
                               ...        ...  ...              ...   ...
BMW        740il                   14          8  ...  Automatic 5-spd  1993
           740i                    14          8  ...  Automatic 5-spd  1993
Audi       100                     17          6  ...  Automatic 4-spd  1993
           100                     17          6  ...     Manual 5-spd  1993
Alfa Romeo Spider Veloce 2000      19          4  ...     Manual 5-spd  1985
[100 rows x 8 columns]

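Now your DataFrame is sorted by its index in descending order. One difference between using .sort_index() and .sort_values() is that .sort_index() has no by parameter since it sorts a DataFrame on the row index by default.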
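
Although there's no by parameter, .sort_index() does accept a level parameter that plays a similar role for a MultiIndex. This goes slightly beyond the text above, so treat the following as a sketch on hypothetical rows rather than part of the original walkthrough:

```python
import pandas as pd

# Hypothetical rows for illustration
cars = pd.DataFrame(
    {
        "make": ["Volvo", "Audi", "BMW", "Audi"],
        "model": ["240", "A4", "740i", "100"],
        "city08": [19, 18, 14, 17],
    }
).set_index(["make", "model"])

# Default: sort by every index level, outermost level (make) first
by_all_levels = cars.sort_index()

# Sort by the model level first; remaining levels are used as tie-breakers
by_model = cars.sort_index(level="model")
```

Here level names an index level rather than a column, which is why it can't be used to sort on ordinary column values.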

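Exploring Advanced Index-Sorting Concepts

There are many cases in data analysis where you want to sort on a hierarchical index. You've already seen how you can use make and model in a MultiIndex. For this dataset, you could also use the id column as an index.

Setting the id column as the index could be helpful in linking related datasets. For example, the EPA's emissions dataset also uses id to represent vehicle record IDs. This links the emissions data to the fuel economy data. Sorting the index of both datasets in DataFrames could speed up other methods such as .merge(). To learn more about combining data in pandas, check out Combining Data in Pandas With merge(), .join(), and concat().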
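
A rough sketch of linking two datasets through a shared id index follows. Both DataFrames are hypothetical stand-ins, and the score column is an invented name, not a real field of the EPA emissions data:

```python
import pandas as pd

# Hypothetical stand-ins for the fuel economy and emissions datasets
fuel = pd.DataFrame(
    {"id": [3, 1, 2], "city08": [23, 19, 9]}
).set_index("id").sort_index()

emissions = pd.DataFrame(
    {"id": [2, 3, 1], "score": [5.0, 7.0, 6.0]}
).set_index("id").sort_index()

# Combine the two frames on their shared, sorted id index
combined = fuel.merge(emissions, left_index=True, right_index=True)
```

Each row of combined holds the fuel economy and emissions values for the same vehicle id. Whether sorting the indices first gives a measurable speedup depends on the data size, so it's worth timing on your own data.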

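Sorting the Columns of Your DataFrame

You can also sort the columns of your DataFrame by their labels, which rearranges the values within each row. Using .sort_index() with the optional parameter axis set to 1 will sort the DataFrame by the column labels. The sorting algorithm is applied to the axis labels instead of to the actual data. This can be helpful for visual inspection of the DataFrame.

Working With the DataFrame axis

When you use .sort_index() without passing any explicit arguments, it uses axis=0 as a default argument. The axis of a DataFrame refers to either the index (axis=0) or the columns (axis=1). You can use both axes for indexing and selecting data in a DataFrame as well as for sorting the data.

Using Column Labels to Sort

You can also use the column labels of a DataFrame as the sorting key for .sort_index(). Setting axis to 1 sorts the columns of your DataFrame based on the column labels: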

>>> df.sort_index(axis=1)
    city08  cylinders fuelType  ...  mpgData            trany  year
0       19          4  Regular  ...        Y     Manual 5-spd  1985
1        9         12  Regular  ...        N     Manual 5-spd  1985
2       23          4  Regular  ...        Y     Manual 5-spd  1985
3       10          8  Regular  ...        N  Automatic 3-spd  1985
4       17          4  Premium  ...        N     Manual 5-spd  1993
..     ...        ...      ...  ...      ...              ...   ...
95      17          6  Regular  ...        Y  Automatic 3-spd  1993
96      17          6  Regular  ...        N  Automatic 4-spd  1993
97      15          6  Regular  ...        N  Automatic 4-spd  1993
98      15          6  Regular  ...        N     Manual 5-spd  1993
99       9          8  Premium  ...        N  Automatic 4-spd  1993
[100 rows x 10 columns]

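The columns of your DataFrame are sorted from left to right in ascending alphabetical order. If you want to sort the columns in descending order, then you can use ascending=False: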

>>> df.sort_index(axis=1, ascending=False)
    year            trany mpgData  ... fuelType cylinders  city08
0   1985     Manual 5-spd       Y  ...  Regular         4      19
1   1985     Manual 5-spd       N  ...  Regular        12       9
2   1985     Manual 5-spd       Y  ...  Regular         4      23
3   1985  Automatic 3-spd       N  ...  Regular         8      10
4   1993     Manual 5-spd       N  ...  Premium         4      17
..   ...              ...     ...  ...      ...       ...     ...
95  1993  Automatic 3-spd       Y  ...  Regular         6      17
96  1993  Automatic 4-spd       N  ...  Regular         6      17
97  1993  Automatic 4-spd       N  ...  Regular         6      15
98  1993     Manual 5-spd       N  ...  Regular         6      15
99  1993  Automatic 4-spd       N  ...  Premium         8       9
[100 rows x 10 columns]

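Using axis=1 in .sort_index(), you sorted the columns of your DataFrame in both ascending and descending order. This could be more useful in other datasets, such as one in which the column labels correspond to months of the year. In that case, it would make sense to arrange your data in ascending or descending order by month.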
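
The month scenario could look like the sketch below. The data is invented, and the column labels use a year-month string format (an assumption made here so that alphabetical order matches chronological order, which plain month names would not):

```python
import pandas as pd

# Hypothetical monthly totals with the month columns out of order
monthly = pd.DataFrame(
    {
        "2021-03": [30, 31],
        "2021-01": [10, 11],
        "2021-02": [20, 21],
    },
    index=["north", "south"],
)

# Reorder the columns by label so the months run chronologically
chronological = monthly.sort_index(axis=1)
newest_first = monthly.sort_index(axis=1, ascending=False)
```

Only the column order changes. Each value stays under its own label and in its own row.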

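Working With Missing Data When Sorting in Pandas

Oftentimes, real-world data has many imperfections. While pandas has several methods you can use to clean your data before sorting, sometimes it's nice to see which data is missing while you're sorting. You can do that with the na_position parameter.

The subset of the fuel economy data used for this tutorial doesn't have missing values. To illustrate the use of na_position, you'll first need to create some missing data. The following piece of code creates a new column based on the existing mpgData column, mapping True where mpgData equals Y and NaN where it doesn't: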

>>> df["mpgData_"] = df["mpgData"].map({"Y": True})
>>> df
    city08  cylinders fuelType  ...            trany  year mpgData_
0       19          4  Regular  ...     Manual 5-spd  1985     True
1        9         12  Regular  ...     Manual 5-spd  1985      NaN
2       23          4  Regular  ...     Manual 5-spd  1985     True
3       10          8  Regular  ...  Automatic 3-spd  1985      NaN
4       17          4  Premium  ...     Manual 5-spd  1993      NaN
..     ...        ...      ...  ...              ...   ...      ...
95      17          6  Regular  ...  Automatic 3-spd  1993     True
96      17          6  Regular  ...  Automatic 4-spd  1993      NaN
97      15          6  Regular  ...  Automatic 4-spd  1993      NaN
98      15          6  Regular  ...     Manual 5-spd  1993      NaN
99       9          8  Premium  ...  Automatic 4-spd  1993      NaN
[100 rows x 11 columns]

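Now you have a new column named mpgData_ that contains both True and NaN values. You'll use this column to see what effect na_position has when you use the two sort methods. To find out more about using .map(), you can read Pandas Project: Make a Gradebook With Python & Pandas.

Understanding the na_position Parameter in .sort_values()

.sort_values() accepts a parameter named na_position, which helps to organize missing data in the column you're sorting on. If you sort on a column with missing data, then the rows with the missing values will appear at the end of your DataFrame. This happens regardless of whether you're sorting in ascending or descending order.

Here's what your DataFrame looks like when you sort on the column with missing data: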

>>> df.sort_values(by="mpgData_")
    city08  cylinders fuelType  ...            trany  year mpgData_
0       19          4  Regular  ...     Manual 5-spd  1985     True
55      18          6  Regular  ...  Automatic 4-spd  1993     True
56      18          6  Regular  ...  Automatic 4-spd  1993     True
57      16          6  Premium  ...     Manual 5-spd  1993     True
59      17          6  Regular  ...  Automatic 4-spd  1993     True
..     ...        ...      ...  ...              ...   ...      ...
94      18          6  Regular  ...  Automatic 4-spd  1993      NaN
96      17          6  Regular  ...  Automatic 4-spd  1993      NaN
97      15          6  Regular  ...  Automatic 4-spd  1993      NaN
98      15          6  Regular  ...     Manual 5-spd  1993      NaN
99       9          8  Premium  ...  Automatic 4-spd  1993      NaN
[100 rows x 11 columns]

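To change that behavior and have the missing data appear first in your DataFrame, you can set na_position to first. The na_position parameter accepts only the values last, which is the default, and first. Here's how to use na_position in .sort_values():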

>>> df.sort_values(
...     by="mpgData_",
...     na_position="first"
... )
    city08  cylinders fuelType  ...            trany  year mpgData_
1        9         12  Regular  ...     Manual 5-spd  1985      NaN
3       10          8  Regular  ...  Automatic 3-spd  1985      NaN
4       17          4  Premium  ...     Manual 5-spd  1993      NaN
5       21          4  Regular  ...  Automatic 3-spd  1993      NaN
11      18          4  Regular  ...  Automatic 4-spd  1993      NaN
..     ...        ...      ...  ...              ...   ...      ...
32      15          8  Premium  ...  Automatic 4-spd  1993     True
33      15          8  Premium  ...  Automatic 4-spd  1993     True
37      17          6  Regular  ...  Automatic 3-spd  1993     True
85      17          6  Regular  ...  Automatic 4-spd  1993     True
95      17          6  Regular  ...  Automatic 3-spd  1993     True
[100 rows x 11 columns]

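Now any missing data from the column you used for sorting is shown at the top of your DataFrame. This is most helpful when you're first starting to analyze your data and are unsure if there are missing values.

Understanding the na_position Parameter in .sort_index()

.sort_index() also accepts na_position. Your DataFrame typically won't have NaN values as part of its index, so this parameter is less useful in .sort_index(). However, it's good to know that if your DataFrame does have NaN in either the row index or a column name, then you can quickly identify this using .sort_index() and na_position.

By default, this parameter is set to last, which places NaN values at the end of the sorted result. To change that behavior and have the missing data first in your DataFrame, set na_position to first.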
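
Since the tutorial's dataset has no NaN in its index, here is a sketch on a small hypothetical DataFrame whose row index contains one missing label:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose row index contains a missing label
frame = pd.DataFrame({"value": [1, 2, 3]}, index=[2.0, np.nan, 1.0])

# Default: the row with the NaN label is placed last
nan_last = frame.sort_index()

# na_position="first" moves the row with the NaN label to the top
nan_first = frame.sort_index(na_position="first")
```

In both results, the rows with real labels are in ascending order. Only the position of the NaN-labeled row changes.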

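Using Sort Methods to Modify Your DataFrame

In all the examples you've seen so far, both .sort_values() and .sort_index() have returned DataFrame objects when you called those methods. That's because sorting in pandas doesn't work in place by default. In general, this is the most common and preferred way to analyze your data with pandas since it creates a new DataFrame instead of modifying the original. This allows you to preserve the state of the data from when you read it from your file.

However, you can modify the original DataFrame directly by specifying the optional parameter inplace with the value of True. Many pandas methods include the inplace parameter. Below, you'll see a few examples of using inplace=True to sort your DataFrame in place.

Using .sort_values() In Place

With inplace set to True, you modify the original DataFrame, so the sort methods return None. Sort your DataFrame by the values of the city08 column like the very first example, but with inplace set to True: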

>>> df.sort_values("city08", inplace=True)

Notice how calling .sort_values() doesn't return a DataFrame. Here's what the original df looks like:

>>> df
    city08  cylinders fuelType  ...            trany  year mpgData_
99       9          8  Premium  ...  Automatic 4-spd  1993      NaN
1        9         12  Regular  ...     Manual 5-spd  1985      NaN
80       9          8  Regular  ...  Automatic 3-spd  1985      NaN
47       9          8  Regular  ...  Automatic 3-spd  1985      NaN
3       10          8  Regular  ...  Automatic 3-spd  1985      NaN
..     ...        ...      ...  ...              ...   ...      ...
9       23          4  Regular  ...  Automatic 4-spd  1993     True
8       23          4  Regular  ...     Manual 5-spd  1993     True
7       23          4  Regular  ...  Automatic 3-spd  1993     True
76      23          4  Regular  ...     Manual 5-spd  1993     True
2       23          4  Regular  ...     Manual 5-spd  1985     True
[100 rows x 11 columns]

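In the df object, the values are now sorted in ascending order based on the city08 column. Your original DataFrame has been modified, and the changes will persist. It's generally a good idea to avoid using inplace=True for analysis because the changes to your DataFrame can't be undone.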
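
The contrast between the two styles can be sketched on a small hypothetical DataFrame. The names original, sorted_copy, and working are chosen here for illustration:

```python
import pandas as pd

# Hypothetical rows for illustration
original = pd.DataFrame({"city08": [19, 9, 23]})

# Without inplace, a new sorted DataFrame is returned and the original
# keeps its row order, so the unsorted data is still available
sorted_copy = original.sort_values("city08")

# With inplace=True, the method returns None and reorders the object itself
working = original.copy()
result = working.sort_values("city08", inplace=True)
```

Assigning the return value of a sort to a new name, as with sorted_copy, keeps the unsorted data around without having to read the file again.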

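Using .sort_index() In Place

The next example illustrates that inplace also works with .sort_index().

Since the index was created in ascending order when you read your file into a DataFrame, you can modify your df object again to get it back to its initial order. Use .sort_index() with inplace set to True to modify the DataFrame: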

>>> df.sort_index(inplace=True)
>>> df
    city08  cylinders fuelType  ...            trany  year mpgData_
0       19          4  Regular  ...     Manual 5-spd  1985     True
1        9         12  Regular  ...     Manual 5-spd  1985      NaN
2       23          4  Regular  ...     Manual 5-spd  1985     True
3       10          8  Regular  ...  Automatic 3-spd  1985      NaN
4       17          4  Premium  ...     Manual 5-spd  1993      NaN
..     ...        ...      ...  ...              ...   ...      ...
95      17          6  Regular  ...  Automatic 3-spd  1993     True
96      17          6  Regular  ...  Automatic 4-spd  1993      NaN
97      15          6  Regular  ...  Automatic 4-spd  1993      NaN
98      15          6  Regular  ...     Manual 5-spd  1993      NaN
99       9          8  Premium  ...  Automatic 4-spd  1993      NaN
[100 rows x 11 columns]

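Now your DataFrame has been modified again using .sort_index(). Since your DataFrame still has its default index, sorting it in ascending order puts the data back into its original order.

If you're familiar with Python's list.sort() method and built-in sorted() function, then the inplace parameter available in the pandas sort methods might feel very similar. For more information, you can check out How to Use sorted() and sort() in Python.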
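
For comparison, this is how the same distinction looks with plain Python lists. The MPG values are made up for the example:

```python
# Hypothetical MPG values for illustration
mpg = [19, 9, 23]

# sorted() returns a new list and leaves the original untouched,
# much like .sort_values() without inplace
new_list = sorted(mpg)
untouched = list(mpg)

# list.sort() reorders the list itself and returns None,
# much like .sort_values(inplace=True)
returned = mpg.sort()
```

As with inplace=True in pandas, list.sort() returns None, so assigning its result to a variable is a common source of bugs.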

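Conclusion

You now know how to use two core methods of the pandas library: .sort_values() and .sort_index(). With this knowledge, you can perform basic data analysis with a DataFrame. While there are a lot of similarities between these two methods, seeing the differences between them makes it clear which one to use for different analytical tasks.

In this tutorial, you've learned how to:

Sort a pandas DataFrame by the values of one or more columns
Use the ascending parameter to change the sort order
Sort a DataFrame by its index using .sort_index()
Organize missing data while sorting values
Sort a DataFrame in place using inplace set to True

These methods are a big part of being proficient with data analysis. They'll help you build a strong foundation on which you can perform more advanced pandas operations. If you want to see some examples of more advanced uses of the pandas sort methods, then the pandas documentation is a great resource.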
