Performance between C-contiguous and Fortran-contiguous array operations

python numpy multidimensional-array scipy

1281 views

2 answers

Below, I compared the performance of sum operations on C-contiguous and Fortran-contiguous arrays (C vs. Fortran memory order). I set axis=0 so that numbers are added up column-wise. I was surprised that the Fortran-contiguous array is actually slower than its C counterpart. Doesn't a Fortran-contiguous array store each column contiguously in memory, and shouldn't it therefore be better at column-wise operations?

import numpy as np
a = np.random.standard_normal((10000, 10000))
c = np.array(a, order='C')
f = np.array(a, order='F')

In a Jupyter notebook, run

%timeit c.sum(axis=0)
10 loops, best of 3: 84.6 ms per loop
%timeit f.sum(axis=0)
10 loops, best of 3: 137 ms per loop
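For context, a quick sketch (using a smaller array than in the question) confirming that the two copies really differ only in memory layout, not in contents:

```python
import numpy as np

# A smaller array than in the question, just to check the layouts quickly.
a = np.random.standard_normal((1000, 1000))
c = np.array(a, order='C')
f = np.array(a, order='F')

# The flags attribute reports which memory layout each copy uses.
print(c.flags['C_CONTIGUOUS'], c.flags['F_CONTIGUOUS'])  # True False
print(f.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # False True

# Both copies hold identical values; only the memory order differs.
print(np.allclose(c.sum(axis=0), f.sum(axis=0)))  # True
```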
Author: Nicholas Source Posted: 08.11.2019 11:21

Answers (2)


3 upvotes

I think the answer lies in the implementation of np.sum(). For example:

import numpy as np

A = np.random.standard_normal((10000,10000))
C = np.array(A, order='C')
F = np.array(A, order='F')

Benchmarking in IPython:

In [7]: %timeit C.sum(axis=0)
10 loops, best of 3: 101 ms per loop

In [8]: %timeit C.sum(axis=1)
10 loops, best of 3: 149 ms per loop

In [9]: %timeit F.sum(axis=0)
10 loops, best of 3: 149 ms per loop

In [10]: %timeit F.sum(axis=1)
10 loops, best of 3: 102 ms per loop

So it's behaving exactly the opposite of what you'd expect. But let's try out another function:

In [17]: %timeit np.amax(C, axis=0)
1 loop, best of 3: 173 ms per loop

In [18]: %timeit np.amax(C, axis=1)
10 loops, best of 3: 70.4 ms per loop

In [13]: %timeit np.amax(F,axis=0)
10 loops, best of 3: 72 ms per loop

In [14]: %timeit np.amax(F,axis=1)
10 loops, best of 3: 168 ms per loop

Sure, it's apples to oranges. But np.amax(), like sum, works along an axis and returns a vector with one element per row/column, and it behaves as one would expect.

In [25]: C.strides
Out[25]: (80000, 8)

In [26]: F.strides
Out[26]: (8, 80000)

This tells us that the arrays are in fact packed in row order and column order respectively, and that looping in the matching direction should be a lot faster. Unless, for example, sum adds row by row as it travels down the columns to produce the column sums (axis=0). But without a means of peeking inside the .pyd I'm just speculating.
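As a sanity check on those numbers, the strides follow directly from the shape and item size; here is a sketch with a reduced shape (the (80000, 8) above is just 10000 columns × 8 bytes):

```python
import numpy as np

# Smaller than the 10000x10000 array above, but the arithmetic is the same.
A = np.random.standard_normal((1000, 1000))
C = np.array(A, order='C')
F = np.array(A, order='F')

rows, cols = A.shape
itemsize = A.itemsize  # 8 bytes for float64

# C order: stepping one row skips a full row of items; one column skips one item.
print(C.strides == (cols * itemsize, itemsize))  # True -> (8000, 8)
# Fortran order: the roles are swapped.
print(F.strides == (itemsize, rows * itemsize))  # True -> (8, 8000)
```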

EDIT:

From percusse's link: http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.reduce.html

Reduces a's dimension by one, by applying ufunc along one axis.

Let a.shape = (N_0, ..., N_i, ..., N_{M-1}). Then ufunc.reduce(a, axis=i)[k_0, ..., k_{i-1}, k_{i+1}, ..., k_{M-1}] = the result of iterating j over range(N_i), cumulatively applying ufunc to each a[k_0, ..., k_{i-1}, j, k_{i+1}, ..., k_{M-1}].

So in pseudocode, when calling F.sum(axis=0):

for j in rows:        # iterate over the reduced axis (axis=0)
    for i in cols:
        acc[i] = acc[i] + F[j, i]

So it actually iterates over whole rows while calculating the column sums, which slows down considerably when the array is in column-major order. Behaviour like this would explain the difference.
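A sketch of that reading of the docs: ndarray.sum(axis=0) is np.add.reduce over axis 0, and the reduction can be reproduced by accumulating one row at a time (the loop below is an illustration, not the actual C implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((300, 200))

# sum(axis=0) is add.reduce along axis 0, as in the quoted documentation.
print(np.allclose(np.add.reduce(a, axis=0), a.sum(axis=0)))  # True

# The reduction walks index j over the reduced axis, accumulating row slices.
acc = a[0].copy()
for j in range(1, a.shape[0]):
    acc += a[j]  # each step reads a whole row: contiguous in C order, strided in F order
print(np.allclose(acc, a.sum(axis=0)))  # True
```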

eric's link provides the implementation, for anybody curious enough to dig through a large amount of code to find the reason.

Author: Tapio Friberg Posted: 20.08.2016 09:40

1 upvote

That's expected. If you check the result of

%timeit f.sum(axis=1)

it also gives a timing similar to that of c. Similarly,

%timeit c.sum(axis=1)

is slower.


Some explanation: suppose you have the following structure

|1| |6|
|2| |7|
|3| |8|
|4| |9|
|5| |10|

As Eric mentioned, these operations work via reduce. Say we ask for the column sums. The intuitive mechanism, in which each column is accessed once, summed, and recorded, is not what happens. In fact it is the opposite: each row is accessed in turn and the function (here, addition) is applied, essentially as if you had two arrays a, b and executed

a += b

That's a very informal way of repeating what the documentation of reduce states rather cryptically. This requires rows to be accessed contiguously even though we are performing a column sum: [1,6] + [2,7] + [3,8] + ... Hence, the direction of iteration is determined by the operation's implementation, not by the array's layout.
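The 5x2 example above can be traced in code. This sketch computes the column sums exactly as described, accumulating row by row (a += b), and matches sum(axis=0):

```python
import numpy as np

# The two columns |1..5| and |6..10| from above, stored in Fortran order.
m = np.array([[1, 6],
              [2, 7],
              [3, 8],
              [4, 9],
              [5, 10]], order='F')

# Column sum computed the way reduce does it: accumulate row by row.
acc = np.zeros(2, dtype=m.dtype)
for row in m:        # rows [1,6], [2,7], ... are strided reads in F order
    acc += row
print(acc)           # [15 40]
print((acc == m.sum(axis=0)).all())  # True
```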

Author: percusse Posted: 20.08.2016 02:53