The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!
截至当前我们已经介绍了 NumPy 的一些基础知识;在接下来的几节中,我们将深入探讨 NumPy 在 Python 数据科学如此重要的原因。一句话概括,它提供了一个简单和灵活的接口来优化数组数据的计算。
Up until now, we have been discussing some of the basic nuts and bolts of NumPy; in the next few sections, we will dive into the reasons that NumPy is so important in the Python data science world. Namely, it provides an easy and flexible interface to optimized computation with arrays of data.
NumPy 数组的操作可以很快(在正确使用的情况下)也可以很慢(在使用错误的情况下)。使其变快的关键在于使用基于 NumPy 的 universal functions (ufuncs) 实现的向量化的操作。
Computation on NumPy arrays can be very fast, or it can be very slow. The key to making it fast is to use vectorized operations, generally implemented through NumPy's universal functions (ufuncs). This section motivates the need for NumPy's ufuncs, which can be used to make repeated calculations on array elements much more efficient. It then introduces many of the most common and useful arithmetic ufuncs available in the NumPy package.
默认的 Python 实现(CPython)的某些操作慢的要死。这是动态解释性语言所导致的:由于类型是动态的,序列操作就无法像 C Fortran 那样编译成为高效的机器语言来执行。近年来对于这个问题出现了很多的尝试:比较著名的方案有 PyPy,一个 Python 语言的动态编译器;Cython,一个将 Python 语言转换为 C 语言执行的项目;Numba,将 Python 转换为 LLVM 字节码执行的想。这些项目各有优劣,但可以肯定的是它们都无法达到目前标准 CPython 的普及程度。
Python's default implementation (known as CPython) does some operations very slowly. This is in part due to the dynamic, interpreted nature of the language: the fact that types are flexible, so that sequences of operations cannot be compiled down to efficient machine code as in languages like C and Fortran. Recently there have been various attempts to address this weakness: well-known examples are the PyPy project, a just-in-time compiled implementation of Python; the Cython project, which converts Python code to compilable C code; and the Numba project, which converts snippets of Python code to fast LLVM bytecode. Each of these has its strengths and weaknesses, but it is safe to say that none of the three approaches has yet surpassed the reach and popularity of the standard CPython engine.
Python 在有大量琐碎的操作被重复调用的时候变得更为迟缓,例如一个遍历数组所有元素的循环。假设我们有一个值的数组,我们想计算每个的倒数。一个简单的方法如下所示:
The relative sluggishness of Python generally manifests itself in situations where many small operations are being repeated – for instance looping over arrays to operate on each element. For example, imagine we have an array of values and we'd like to compute the reciprocal of each. A straightforward approach might look like this:
import numpy as np
np.random.seed(0)
def compute_reciprocals(values):
output = np.empty(len(values))
for i in range(len(values)):
output[i] = 1.0 / values[i]
return output
values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)
array([ 0.16666667, 1. , 0.25 , 0.25 , 0.125 ])
对于一个背景是 C 或者 Java 的程序员来说,这个实现很常见。但是如果我们测试大输入的情况,我们会发现这个操作慢的要死。我们利用 IPython 的 %timeit
的魔法命令对其进行测试(详见性能与计时)。
This implementation probably feels fairly natural to someone from, say, a C or Java background.
But if we measure the execution time of this code for a large input, we see that this operation is very slow, perhaps surprisingly so!
We'll benchmark this with IPython's %timeit
magic (discussed in Profiling and Timing Code):
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)
1 loop, best of 3: 230 ms per loop
它需要几秒钟来计算这些百万次操作和存储结果!在连手机都有了Giga-FLOPS(即,每秒数十亿个数字操作)级别处理速度的今天,这慢的荒谬至极。事实证明,这里的瓶颈不是操作本身,而是 CPython 在循环的每个周期必须进行的类型检查和函数调度。每次计算倒数时,Python 首先检查对象的类型,并查找适合该类型的函数来执行计算。如果我们是在编译型的语言中,类型在代码执行之前就已知,那么计算速度就会快得多。
It takes several seconds to compute these million operations and to store the result! When even cell phones have processing speeds measured in Giga-FLOPS (i.e., billions of numerical operations per second), this seems almost absurdly slow. It turns out that the bottleneck here is not the operations themselves, but the type-checking and function dispatches that CPython must do at each cycle of the loop. Each time the reciprocal is computed, Python first examines the object's type and does a dynamic lookup of the correct function to use for that type. If we were working in compiled code instead, this type specification would be known before the code executes and the result could be computed much more efficiently.
对于许多类型的操作,NumPy 提供了执行这种针对静态类型的编译型的操作,这被称为向量化运算。这可以通过简单地对整个数组执行操作来实现,这相当于对每个元素应用操作。这种向量化方法旨在将循环推入编译层,这是 NumPy 高效运算的基础。
比较以下两个的结果:
For many types of operations, NumPy provides a convenient interface into just this kind of statically typed, compiled routine. This is known as a vectorized operation. This can be accomplished by simply performing an operation on the array, which will then be applied to each element. This vectorized approach is designed to push the loop into the compiled layer that underlies NumPy, leading to much faster execution.
Compare the results of the following two:
print(compute_reciprocals(values))
print(1.0 / values)
[ 0.16666667 1. 0.25 0.25 0.125 ] [ 0.16666667 1. 0.25 0.25 0.125 ]
看一看对 big_array
的执行速度,与 Python 循环相比,我们看到了数量级的飞跃:
Looking at the execution time for our big array, we see that it completes orders of magnitude faster than the Python loop:
%timeit (1.0 / big_array)
100 loops, best of 3: 4.14 ms per loop
NumPy中的向量化操作通过 ufuncs 实现,其主要目的是加速 NumPy 数组中的重复操作。Ufuncs 非常灵活 - 除了能直接对标量和数组进行操作外,也可以用于两个数组:
Vectorized operations in NumPy are implemented via ufuncs, whose main purpose is to quickly execute repeated operations on values in NumPy arrays. Ufuncs are extremely flexible – before we saw an operation between a scalar and an array, but we can also operate between two arrays:
np.arange(5) / np.arange(1, 6)
array([0, 0, 0, 0, 0])
当然 ufunc 不只适用于一维数组,它们同样支持多维数组:
And ufunc operations are not limited to one-dimensional arrays–they can also act on multi-dimensional arrays as well:
x = np.arange(9).reshape((3, 3))
2 ** x
array([[ 1, 2, 4], [ 8, 16, 32], [ 64, 128, 256]])
与等价的 Python 循环相比使用 ufuncs 的向量化运算基本上都是要快一些,在数组规模比较庞大的时情况下更是如此。当你在 Python 中看到了循环你就应当考虑它是否应当替换为一个向量化的操作。
Computations using vectorization through ufuncs are nearly always more efficient than their counterpart implemented using Python loops, especially as the arrays grow in size. Any time you see such a loop in a Python script, you should consider whether it can be replaced with a vectorized expression.
x = np.arange(4)
print("x =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2) # floor division
('x =', array([0, 1, 2, 3])) ('x + 5 =', array([5, 6, 7, 8])) ('x - 5 =', array([-5, -4, -3, -2])) ('x * 2 =', array([0, 2, 4, 6])) ('x / 2 =', array([0, 0, 1, 1])) ('x // 2 =', array([0, 0, 1, 1]))
这里展示了取反运算,指数运算 **
以及取余 %
:
There is also a unary ufunc for negation, and a **
operator for exponentiation, and a %
operator for modulus:
print("-x = ", -x)
print("x ** 2 = ", x ** 2)
print("x % 2 = ", x % 2)
('-x = ', array([ 0, -1, -2, -3])) ('x ** 2 = ', array([0, 1, 4, 9])) ('x % 2 = ', array([0, 1, 0, 1]))
当然,这些运算你可以自由的组合,标准的运算优先级也都支持:
In addition, these can be strung together however you wish, and the standard order of operations is respected:
-(0.5*x + 1) ** 2
array([-1. , -2.25, -4. , -6.25])
其实这些运算符就是对 NumPy 内建的函数进行了操作符重载;例如 +
就是 add
函数而已:
Each of these arithmetic operations are simply convenient wrappers around specific functions built into NumPy; for example, the +
operator is a wrapper for the add
function:
np.add(x, 2)
array([2, 3, 4, 5])
该表格展示了 NumPy 支持的操作符:
The following table lists the arithmetic operators implemented in NumPy:
Operator | Equivalent ufunc | Description |
---|---|---|
+ |
np.add |
Addition (e.g., 1 + 1 = 2 ) |
- |
np.subtract |
Subtraction (e.g., 3 - 2 = 1 ) |
- |
np.negative |
Unary negation (e.g., -2 ) |
* |
np.multiply |
Multiplication (e.g., 2 * 3 = 6 ) |
/ |
np.divide |
Division (e.g., 3 / 2 = 1.5 ) |
// |
np.floor_divide |
Floor division (e.g., 3 // 2 = 1 ) |
** |
np.power |
Exponentiation (e.g., 2 ** 3 = 8 ) |
% |
np.mod |
Modulus/remainder (e.g., 9 % 4 = 1 ) |
当然还有布尔运算符以及位运算符;我们将在Comparisons, Masks, and Boolean Logic做进一步介绍。
Additionally there are Boolean/bitwise operators; we will explore these in Comparisons, Masks, and Boolean Logic.
x = np.array([-2, -1, 0, 1, 2])
abs(x)
array([2, 1, 0, 1, 2])
对应的 NumPy ufunc 是 np.absolute
,也可以通过 np.abs
调用:
The corresponding NumPy ufunc is np.absolute
, which is also available under the alias np.abs
:
np.absolute(x)
array([2, 1, 0, 1, 2])
np.abs(x)
array([2, 1, 0, 1, 2])
Ufunc 同样支持复数,其绝对值返回复数的标量:
This ufunc can also handle complex data, in which the absolute value returns the magnitude:
x = np.array([3 - 4j, 4 - 3j, 2 + 0j, 0 + 1j])
np.abs(x)
array([ 5., 5., 2., 1.])
theta = np.linspace(0, np.pi, 3)
然后我们可以计算它们的一些三角函数:
Now we can compute some trigonometric functions on these values:
print("theta = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))
('theta = ', array([ 0. , 1.57079633, 3.14159265])) ('sin(theta) = ', array([ 0.00000000e+00, 1.00000000e+00, 1.22464680e-16])) ('cos(theta) = ', array([ 1.00000000e+00, 6.12323400e-17, -1.00000000e+00])) ('tan(theta) = ', array([ 0.00000000e+00, 1.63312394e+16, -1.22464680e-16]))
数值按照机器的精度计算,因此有些 0 值可能出现误差(理解浮点数的同学都明白这种情况)。当然,反三角函数同样支持:
The values are computed to within machine precision, which is why values that should be zero do not always hit exactly zero. Inverse trigonometric functions are also available:
x = [-1, 0, 1]
print("x = ", x)
print("arcsin(x) = ", np.arcsin(x))
print("arccos(x) = ", np.arccos(x))
print("arctan(x) = ", np.arctan(x))
('x = ', [-1, 0, 1]) ('arcsin(x) = ', array([-1.57079633, 0. , 1.57079633])) ('arccos(x) = ', array([ 3.14159265, 1.57079633, 0. ])) ('arctan(x) = ', array([-0.78539816, 0. , 0.78539816]))
x = [1, 2, 3]
print("x =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3, x))
('x =', [1, 2, 3]) ('e^x =', array([ 2.71828183, 7.3890561 , 20.08553692])) ('2^x =', array([ 2., 4., 8.])) ('3^x =', array([ 3, 9, 27]))
Numpy 也支持对数操作。其中 np.log
是自然对数,如果你想要计算 2 的对数或者 10 的对数,可以这么做:
The inverse of the exponentials, the logarithms, are also available.
The basic np.log
gives the natural logarithm; if you prefer to compute the base-2 logarithm or the base-10 logarithm, these are available as well:
x = [1, 2, 4, 10]
print("x =", x)
print("ln(x) =", np.log(x))
print("log2(x) =", np.log2(x))
print("log10(x) =", np.log10(x))
('x =', [1, 2, 4, 10]) ('ln(x) =', array([ 0. , 0.69314718, 1.38629436, 2.30258509])) ('log2(x) =', array([ 0. , 1. , 2. , 3.32192809])) ('log10(x) =', array([ 0. , 0.30103 , 0.60205999, 1. ]))
一些特别的方法可以给予小数值更高的精确度:
There are also some specialized versions that are useful for maintaining precision with very small input:
x = [0, 0.001, 0.01, 0.1]
print("exp(x) - 1 =", np.expm1(x))
print("log(1 + x) =", np.log1p(x))
('exp(x) - 1 =', array([ 0. , 0.0010005 , 0.01005017, 0.10517092])) ('log(1 + x) =', array([ 0. , 0.0009995 , 0.00995033, 0.09531018]))
如果 x
非常小,这些函数将会为数值提供相对于 np.log
和 np.exp
更高的精确度。
When x
is very small, these functions give more precise values than if the raw np.log
or np.exp
were to be used.
NumPy 还包含其他大量的 ufuncs,例如双曲三角函数,位操作运算,比较运算,弧度转换,舍入,取余等等。文档中对其进行了详细的介绍。
NumPy has many more ufuncs available, including hyperbolic trig functions, bitwise arithmetic, comparison operators, conversions from radians to degrees, rounding and remainders, and much more. A look through the NumPy documentation reveals a lot of interesting functionality.
另外,scipy.speical
包含了一些其他的特定领域使用的 ufuncs。如果你想要找到一些其他晦涩的数学运算函数,可能会在 scipy.special
中找得到。当然还有大量其他函数,下面展示几个用于统计的函数:
Another excellent source for more specialized and obscure ufuncs is the submodule scipy.special
.
If you want to compute some obscure mathematical function on your data, chances are it is implemented in scipy.special
.
There are far too many functions to list them all, but the following snippet shows a couple that might come up in a statistics context:
from scipy import special
# Gamma functions (generalized factorials) and related functions
x = [1, 5, 10]
print("gamma(x) =", special.gamma(x))
print("ln|gamma(x)| =", special.gammaln(x))
print("beta(x, 2) =", special.beta(x, 2))
('gamma(x) =', array([ 1.00000000e+00, 2.40000000e+01, 3.62880000e+05])) ('ln|gamma(x)| =', array([ 0. , 3.17805383, 12.80182748])) ('beta(x, 2) =', array([ 0.5 , 0.03333333, 0.00909091]))
# Error function (integral of Gaussian)
# its complement, and its inverse
x = np.array([0, 0.3, 0.7, 1.0])
print("erf(x) =", special.erf(x))
print("erfc(x) =", special.erfc(x))
print("erfinv(x) =", special.erfinv(x))
('erf(x) =', array([ 0. , 0.32862676, 0.67780119, 0.84270079])) ('erfc(x) =', array([ 1. , 0.67137324, 0.32219881, 0.15729921])) ('erfinv(x) =', array([ 0. , 0.27246271, 0.73286908, inf]))
There are many, many more ufuncs available in both NumPy and scipy.special
.
Because the documentation of these packages is available online, a web search along the lines of "gamma function python" will generally find the relevant information.
对于大规模的运算,可能需要指定结果存储位置。与其将结果放到一个临时数组中,最好可以直接执行结果的最终存储位置。对于所有的 ufuncs 都可以通过指定参数 out
来完成这个任务:
For large calculations, it is sometimes useful to be able to specify the array where the result of the calculation will be stored.
Rather than creating a temporary array, this can be used to write computation results directly to the memory location where you'd like them to be.
For all ufuncs, this can be done using the out
argument of the function:
x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)
[ 0. 10. 20. 30. 40.]
这甚至可以用于数组视图。例如:
This can even be used with array views. For example, we can write the results of a computation to every other element of a specified array:
y = np.zeros(10)
np.power(2, x, out=y[::2])
print(y)
[ 2. 0. 4. 0. 8. 0. 16. 0. 32. 0.]
如果我们写成 y[::2] = 2 ** x
,那么首先会创建一个临时数组存储 2 ** x
的运算结果,然后再将数据拷贝给 y
。当数组比较小的时候这没什么,但是对于一个超大数组,内存的使用会因为 out
的使用得到显著的优化。
If we had instead written y[::2] = 2 ** x
, this would have resulted in the creation of a temporary array to hold the results of 2 ** x
, followed by a second operation copying those values into the y
array.
This doesn't make much of a difference for such a small computation, but for very large arrays the memory savings from careful use of the out
argument can be significant.
对于二元 ufuncs,可以直接执行一个有意思的聚合操作。例如,reduce
函数,重复的对数组的元素执行一个操作,直到只剩下最后一个值。对 add
的结果执行 reduce
就是一个 sum
的操作:
For binary ufuncs, there are some interesting aggregates that can be computed directly from the object.
For example, if we'd like to reduce an array with a particular operation, we can use the reduce
method of any ufunc.
A reduce repeatedly applies a given operation to the elements of an array until only a single result remains.
For example, calling reduce
on the add
ufunc returns the sum of all elements in the array:
x = np.arange(1, 6)
np.add.reduce(x)
15
类似地,对 multiply
执行 reduce
:
Similarly, calling reduce
on the multiply
ufunc results in the product of all array elements:
np.multiply.reduce(x)
120
如果我们想要保留所以计算的中间过程,可以采用 accumulate
:
If we'd like to store all the intermediate results of the computation, we can instead use accumulate
:
np.add.accumulate(x)
array([ 1, 3, 6, 10, 15])
np.multiply.accumulate(x)
array([ 1, 2, 6, 24, 120])
当然在 NumPy 中也有特性的函数可以完成以上的动作 (np.sum
, np.prod
, np.cumsum
, np.cumprod
), 我们会在聚合: 最小值,最大值以及其他中讲述。
x = np.arange(1, 6)
np.multiply.outer(x, x)
array([[ 1, 2, 3, 4, 5], [ 2, 4, 6, 8, 10], [ 3, 6, 9, 12, 15], [ 4, 8, 12, 16, 20], [ 5, 10, 15, 20, 25]])
ufunc.at
与 ufunc.reduceat
方法(在高级索引详述), 也是非常有用的函数。
Ufuncs 一些其他的特性还包含长度,维度不同的数组的运算:有被称为 broadcasting。这部分非常重要,我们会花一个章节的内容进行讲述(见高级索引)。
The ufunc.at
and ufunc.reduceat
methods, which we'll explore in Fancy Indexing, are very helpful as well.
Another extremely useful feature of ufuncs is the ability to operate between arrays of different sizes and shapes, a set of operations known as broadcasting. This subject is important enough that we will devote a whole section to it (see Computation on Arrays: Broadcasting).
更多有关 ufuncs 的信息(包含所有的函数列表) 可以在 NumPy 与 SciPy 找到。
通过 IPython 的 tab 自动补全和帮助(?
)同样可以看到这些文档内容。使用方法在IPython 的帮助文档中有做介绍。
More information on universal functions (including the full list of available functions) can be found on the NumPy and SciPy documentation websites.
Recall that you can also access information directly from within IPython by importing the packages and using IPython's tab-completion and help (?
) functionality, as described in Help and Documentation in IPython.