The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!
虽然我们的数据可以通过一个的包含同构数值的数组来表示,但有时情况并非如此。本节演示使用 NumPy 的结构化数组和记录数组,其为异构数据提供了高效存储方式。虽然这里展示的例子对简单的操作是有用的,但像这样的场景经常由 Pandas 的 Dataframe
来处理,我们将在第三章介绍。
While often our data can be well represented by a homogeneous array of values, sometimes this is not the case. This section demonstrates the use of NumPy's structured arrays and record arrays, which provide efficient storage for compound, heterogeneous data. While the patterns shown here are useful for simple operations, scenarios like this often lend themselves to the use of Pandas Dataframe
s, which we'll explore in Chapter 3.
import numpy as np
想象一下,我们有一个人不同类别的数据(例如姓名,年龄和体重),我们希望将这些值用 Python 存储起来。可以将它们存储在三个独立的数组中:
Imagine that we have several categories of data on a number of people (say, name, age, and weight), and we'd like to store these values for use in a Python program. It would be possible to store these in three separate arrays:
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
但这看起来有点笨拙,这样存储的话无法告诉我们三个数组是相关的;如果我们可以使用单个结构来存储所有这些数据将是更自然的。NumPy 可以通过结构化数组来处理它,这是具有复合数据类型的数组。
But this is a bit clumsy. There's nothing here that tells us that the three arrays are related; it would be more natural if we could use a single structure to store all of this data. NumPy can handle this through structured arrays, which are arrays with compound data types.
在之前我们介绍过可以像这样创建一个数组:
Recall that previously we created a simple array using an expression like this:
x = np.zeros(4, dtype=int)
我们可以用类似的方式创建一个包含复合数据类型的结构化数组:
We can similarly create a structured array using a compound data type specification:
# Use a compound data type for structured arrays
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
print(data.dtype)
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
'U10'
表示"最长为 10 的 Unicode 字符串",'i4'
表示"4 个字节(32 位)的整数",'f8'
则表示"8 个字节(64 位)的浮点类型"。我们后面会介绍创建这种数组的其他参数。这里我们创建了一个空的数组,我们可以为其填充数据:
Here 'U10'
translates to "Unicode string of maximum length 10," 'i4'
translates to "4-byte (i.e., 32 bit) integer," and 'f8'
translates to "8-byte (i.e., 64 bit) float."
We'll discuss other options for these type codes in the following section.
Now that we've created an empty container array, we can fill the array with our lists of values:
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
[('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0) ('Doug', 19, 61.5)]
现在正如我们所愿,所有数据都被存储在一个地方了。使用结构化数组时,你可以用索引也可用名称去访问数据:
As we had hoped, the data is now arranged together in one convenient block of memory.
The handy thing with structured arrays is that you can now refer to values either by index or by name:
# Get all names
data['name']
array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype='<U10')
# Get first row of data
data[0]
('Alice', 25, 55.0)
# Get the name from the last row
data[-1]['name']
'Doug'
还可以使用布尔掩码,借此你可以使用一些更高级的功能,比如按照年龄过滤数据:
Using Boolean masking, this even allows you to do some more sophisticated operations such as filtering on age:
# Get names where age is under 30
data[data['age'] < 30]['name']
array(['Alice', 'Doug'], dtype='<U10')
注意,如果你想做一些比这些更复杂的操作,你应该考虑下一章中讨论的 Pandas。之后我们会看到 Pandas 提供了一个 Dataframe
对象,它是建立在 NumPy 数组之上构建的,但是它提供了各种有用的数据操作方式。
Note that if you'd like to do any operations that are any more complicated than these, you should probably consider the Pandas package, covered in the next chapter.
As we'll see, Pandas provides a Dataframe
object, which is a structure built on NumPy arrays that offers a variety of useful data manipulation functionality similar to what we've shown here, as well as much, much more.
np.dtype({'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
数值类型也可以通过 Python 的 type
或者 NumPy 的 dtype
来指定:
For clarity, numerical types can be specified using Python types or NumPy dtype
s instead:
np.dtype({'names':('name', 'age', 'weight'),
'formats':((np.str_, 10), int, np.float32)})
dtype([('name', '<U10'), ('age', '<i8'), ('weight', '<f4')])
一个复合类型也可以用下面的 tuple 列表形式指定:
A compound type can also be specified as a list of tuples:
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])
dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])
如果你觉得给数据列命名没什么用,你可以直接用逗号分隔的类型来声明:
If the names of the types do not matter to you, you can specify the types alone in a comma-separated string:
np.dtype('S10,i4,f8')
dtype([('f0', 'S10'), ('f1', '<i4'), ('f2', '<f8')])
缩短的字符串格式代码看起来有点混乱,但它们就是为了简化操作而来。第一个(可选)字符是 <
或者 >
,这意味着“小端”或“大端”,用于指定字节的排列顺序。下一个字符指定数据的类型:字符,字节,整型,浮点型等(请参见下表)。最后一个或多个字符表示对象的大小(以字节为单位)。
The shortened string format codes may seem confusing, but they are built on simple principles.
The first (optional) character is <
or >
, which means "little endian" or "big endian," respectively, and specifies the ordering convention for significant bits.
The next character specifies the type of data: characters, bytes, ints, floating points, and so on (see the table below).
The last character or characters represents the size of the object in bytes.
Character | Description | Example |
---|---|---|
'b' |
Byte | np.dtype('b') |
'i' |
Signed integer | np.dtype('i4') == np.int32 |
'u' |
Unsigned integer | np.dtype('u1') == np.uint8 |
'f' |
Floating point | np.dtype('f8') == np.int64 |
'c' |
Complex floating point | np.dtype('c16') == np.complex128 |
'S' , 'a' |
String | np.dtype('S5') |
'U' |
Unicode string | np.dtype('U') == np.str_ |
'V' |
Raw data (void) | np.dtype('V') == np.void |
还可以定义更高级的复合类型。例如,你可以创建一个类型,其中每个元素包含一个数组或矩阵。在这里,我们将创建一个数据类型 mat
,它是一个 $3\times 3$ 的浮点型矩阵:
It is possible to define even more advanced compound types.
For example, you can create a type where each element contains an array or matrix of values.
Here, we'll create a data type with a mat
component consisting of a $3\times 3$ floating-point matrix:
tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
X = np.zeros(1, dtype=tp)
print(X[0])
print(X['mat'][0])
(0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]) [[ 0. 0. 0.] [ 0. 0. 0.] [ 0. 0. 0.]]
X
数组中的每一个元素包含一个 id
和一个 $3\times 3$ 的矩阵。那为什么不用一个简单的多维数组或者是一个 Python 字典呢?其原因资源 NumPy 的 dtype
直接对应一个 C 结构体,那么其结构可以直接被 C 代码访问。如果你在为 C 或者 Fortran 编写 Python 结构,你可能需要这样的结构。
Now each element in the X
array consists of an id
and a $3\times 3$ matrix.
Why would you use this rather than a simple multidimensional array, or perhaps a Python dictionary?
The reason is that this NumPy dtype
directly maps onto a C structure definition, so the buffer containing the array content can be accessed directly within an appropriately written C program.
If you find yourself writing a Python interface to a legacy C or Fortran library that manipulates structured data, you'll probably find structured arrays quite useful!
NumPy 提供了 np.recarray
它和结构化数组很像,但是其字段可用以属性的方式(而不是字典的方式)进行访问。来看我们之前记录年龄的那个例子:
NumPy also provides the np.recarray
class, which is almost identical to the structured arrays just described, but with one additional feature: fields can be accessed as attributes rather than as dictionary keys.
Recall that we previously accessed the ages by writing:
data['age']
array([25, 45, 37, 19], dtype=int32)
如果我们用记录数组来存储,我们就可以用更少的输入来访问它了:
If we view our data as a record array instead, we can access this with slightly fewer keystrokes:
data_rec = data.view(np.recarray)
data_rec.age
array([25, 45, 37, 19], dtype=int32)
记录数组的弊端在于其访问效率要慢一些,甚至是我们用字典方式访问也依然要慢一些:
The downside is that for record arrays, there is some extra overhead involved in accessing the fields, even when using the same syntax. We can see this here:
%timeit data['age']
%timeit data_rec['age']
%timeit data_rec.age
1000000 loops, best of 3: 241 ns per loop 100000 loops, best of 3: 4.61 µs per loop 100000 loops, best of 3: 7.27 µs per loop
当然,是否用额外的开销来换取更便捷的访问方式取决于你自己。
Whether the more convenient notation is worth the additional overhead will depend on your own application.
结构化和记录数组的这一节有意被放在在本章结尾,因为它很好地引入我们将要介绍的下一个包:Pandas。像这里讨论的结构化数组在某些情况下是很好的知道,特别是如果你使用NumPy数组映射到二进制数据格式在C,Fortran或另一种语言。对于结构化数据的日常使用,Pandas包是一个更好的选择,我们将在下面的章节中详细讨论它。
This section on structured and record arrays is purposely at the end of this chapter, because it leads so well into the next package we will cover: Pandas. Structured arrays like the ones discussed here are good to know about for certain situations, especially in case you're using NumPy arrays to map onto binary data formats in C, Fortran, or another language. For day-to-day use of structured data, the Pandas package is a much better choice, and we'll dive into a full discussion of it in the chapter that follows.