The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!
This chapter, along with Chapter 3, outlines techniques for effectively loading, storing, and manipulating in-memory data in Python. The topic is very broad: datasets can come from a wide range of sources and in a wide range of formats, including collections of documents, collections of images, collections of sound clips, collections of numerical measurements, or nearly anything else. Despite this apparent heterogeneity, it will help us to think of all data fundamentally as arrays of numbers.
For example, images (particularly digital images) can be thought of as simply two-dimensional arrays of numbers representing pixel brightness across the area. Sound clips can be thought of as one-dimensional arrays of intensity versus time. Text can be converted in various ways into numerical representations, perhaps binary digits representing the frequency of certain words or pairs of words. No matter what the data are, the first step in making them analyzable will be to transform them into arrays of numbers. (We will discuss some specific examples of this process later in Feature Engineering.)
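As a rough illustration of the idea above, the sketch below builds a tiny array for each kind of data mentioned. The pixel values, the sampled sine wave, and the three-word vocabulary are all made up for this example; real images, audio, and text would be loaded from files and encoded with more care.

```python
import numpy as np

# A tiny 2x3 grayscale "image": each entry is a pixel brightness (0-255).
# The values are invented for illustration.
image = np.array([[0, 128, 255],
                  [64, 192, 32]], dtype=np.uint8)

# A short "sound clip": intensity sampled at five evenly spaced times.
times = np.linspace(0.0, 1.0, 5)
clip = np.sin(2 * np.pi * times)

# A toy text encoding: count how often each vocabulary word appears.
# The vocabulary and sentence are hypothetical.
vocabulary = ["data", "array", "python"]
words = "python array data array".split()
counts = np.array([words.count(w) for w in vocabulary])

print(image.ndim, image.shape)   # two-dimensional
print(clip.ndim, clip.shape)     # one-dimensional
print(counts)                    # one count per vocabulary word
```

In each case the result is an ordinary NumPy array, so the same tools apply regardless of where the data came from.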
For this reason, efficient storage and manipulation of numerical arrays is absolutely fundamental to the process of doing data science. We'll now take a look at the specialized tools that Python has for handling such numerical arrays: the NumPy package, and the Pandas package (discussed in Chapter 3).
This chapter will cover NumPy in detail. NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data buffers. In some ways, NumPy arrays are like Python's built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size. NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.
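To make the comparison with lists a little more concrete, here is a minimal sketch that stores the same thousand integers in a list and in an array. The size of 1000 and the choice of a 64-bit integer dtype are arbitrary assumptions for the example.

```python
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# Every element of the array shares one dtype, so the data sits in a
# single dense buffer: 1000 values * 8 bytes each.
print(np_array.dtype, np_array.nbytes)

# Doubling every element: a list needs an explicit loop,
# while the array supports the operation directly.
doubled_list = [2 * x for x in py_list]
doubled_array = 2 * np_array

# Both approaches produce the same values.
print(doubled_array.tolist() == doubled_list)
```

The fixed dtype is what allows the compact storage; a list, by contrast, can hold objects of different types at the cost of extra bookkeeping per element.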
If you followed the advice outlined in the Preface and installed the Anaconda stack, you already have NumPy installed and ready to go. If you're more the do-it-yourself type, you can go to http://www.numpy.org/ and follow the installation instructions found there. Once you do, you can import NumPy and double-check the version:
In [1]: import numpy
        numpy.__version__
Out[1]: '1.11.1'
For the pieces of the package discussed here, I'd recommend NumPy version 1.8 or later. By convention, you'll find that most people in the SciPy/PyData world will import NumPy using np as an alias:
In [2]: import numpy as np
Throughout this chapter, and indeed the rest of the book, you'll find that this is the way we will import and use NumPy.
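If you would rather check the recommended version in code than by eye, something like the following sketch could work. It assumes the version string begins with a numeric "major.minor" pair, which holds for released versions such as '1.11.1'; the variable names are chosen for this example.

```python
import numpy as np

# Take the leading "major.minor" part of the version string,
# e.g. "1.11.1" -> (1, 11). Assumes both parts are plain integers.
major, minor = (int(part) for part in np.__version__.split(".")[:2])

# (1, 8) is the minimum version recommended in the text above.
meets_recommendation = (major, minor) >= (1, 8)
print(np.__version__, meets_recommendation)
```

Comparing tuples of integers avoids the pitfall of comparing version strings directly, where '1.11' would sort before '1.8'.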
As you read through this chapter, don't forget that IPython gives you the ability to quickly explore the contents of a package (by using the tab-completion feature), as well as the documentation of various functions (using the ? character; refer back to Help and Documentation in IPython).
For example, to display all the contents of the numpy namespace, you can type this:
In [3]: np.<TAB>
And to display NumPy's built-in documentation, you can use this:
In [4]: np?
More detailed documentation, along with tutorials and other resources, can be found at http://www.numpy.org.
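The tab-completion and ? features shown above are specific to IPython. In a plain Python script or interpreter, a rough approximation uses the built-in dir function and docstrings, as in this sketch. The prefix "lin" and the choice of np.linspace are arbitrary examples.

```python
import numpy as np

# Rough equivalent of typing np.lin<TAB>:
# names in the numpy namespace that start with "lin".
matches = [name for name in dir(np) if name.startswith("lin")]
print(matches)

# Rough equivalent of np.linspace?: the first line of the docstring.
summary = np.linspace.__doc__.strip().splitlines()[0]
print(summary)
```

Calling help(np.linspace) prints the full documentation rather than just the first line.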