The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!
In the previous chapter, we looked in detail at NumPy and its ndarray object, which provides efficient storage and manipulation of dense typed arrays in Python. Here we'll build on this knowledge by looking in detail at the data structures provided by the Pandas library. Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.
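To make the description above concrete, here is a minimal sketch of a DataFrame with row labels, column labels, mixed column types, and one missing value. The column names, row labels, and values are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: names, labels, and values are invented for illustration.
df = pd.DataFrame(
    {
        "name": ["alpha", "beta", "gamma"],  # text column
        "score": [1.5, np.nan, 3.0],         # float column with one missing entry
        "passed": [True, False, True],       # boolean column
    },
    index=["r1", "r2", "r3"],                # row labels
)

print(df.dtypes)                     # each column keeps its own type
print(df.loc["r3", "name"])          # look up a single cell by row and column label
print(df["score"].isna().sum())      # count the missing values in one column
```

Each column carries its own dtype, and lookups use the labels rather than integer positions.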
As we saw, NumPy's ndarray data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks. While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data or working with missing data) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings and pivots). Each of these is an important part of analyzing the less structured data available in many forms in the world around us. Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.
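As a rough illustration of the contrast drawn above, the sketch below computes a per-group mean twice: once with plain NumPy masks and once with a Pandas groupby. The keys and values are invented for this example, and the NumPy version is only one of several ways it could be written.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group keys and values are made up for illustration.
keys = np.array(["a", "b", "a", "b", "a"])
values = np.array([1.0, 2.0, np.nan, 4.0, 6.0])

# NumPy alone: build one boolean mask per group and handle NaN explicitly.
numpy_means = {
    str(k): float(np.nanmean(values[keys == k])) for k in np.unique(keys)
}

# Pandas: the labels live alongside the data, and mean() skips NaN by default.
df = pd.DataFrame({"key": keys, "value": values})
pandas_means = df.groupby("key")["value"].mean()

print(numpy_means)
print(pandas_means)
```

Both approaches give the same numbers, but the Pandas version states the grouping directly instead of spelling out the masking logic.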
In this chapter, we will focus on the mechanics of using Series, DataFrame, and related structures effectively. We will use examples drawn from real datasets where appropriate, but these examples are not necessarily the focus.
Installing Pandas on your system requires NumPy to be installed. If you build the library from source, you also need the appropriate tools to compile the C and Cython sources on which Pandas is built. Details on this installation can be found in the Pandas documentation. If you followed the advice outlined in the Preface and used the Anaconda stack, you already have Pandas installed.
Once Pandas is installed, you can import it and check the version:
import pandas
pandas.__version__
'0.18.1'
Just as we generally import NumPy under the alias np, we will import Pandas under the alias pd:
import pandas as pd
This import convention will be used throughout the remainder of this book.
As you read through this chapter, don't forget that IPython gives you the ability to quickly explore the contents of a package (by using the tab-completion feature) as well as the documentation of various functions (using the ? character). Refer back to Help and Documentation in IPython if you need a refresher on this.
For example, to display all the contents of the pandas namespace, you can type:
In [3]: pd.<TAB>
And to display Pandas's built-in documentation, you can use this:
In [4]: pd?
More detailed documentation, along with tutorials and other resources, can be found at http://pandas.pydata.org/.
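Outside IPython, tab completion and the ? shortcut are not available. The sketch below shows one rough plain-Python approximation using the built-in dir() and the standard-library inspect module. It is an assumption about what a reader might want in a script, not something the IPython features themselves do.

```python
import inspect

import pandas as pd

# Rough stand-in for pd.<TAB>: list the public names in the pandas namespace.
public_names = [name for name in dir(pd) if not name.startswith("_")]
print(len(public_names), public_names[:5])

# Rough stand-in for pd?: show the first line of the package docstring, if any.
doc = inspect.getdoc(pd) or ""
print(doc.splitlines()[0] if doc else "(no docstring available)")
```

For the full text in an interactive session, the built-in help(pd) prints the complete docstring.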