Python multiprocessing: fast computation of a big matrix
Hi everyone, I am trying to perform the following computation.
I have a 2D n*d numpy array A; each of its rows is a d-dimensional
datapoint x. I want to compute an n*n matrix B whose elements are
B[i, j] = f(A[i], A[j]), where f(p, q) is a symmetric function applied to
every possible pair of datapoints, e.g. sum(p*q).
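For concreteness, here is a minimal sketch of the plain elementwise version I am comparing against (the sizes and the random A are just placeholders; f is the sum(p*q) example above):

import numpy as np

def f(p, q):
    # example symmetric function: sum of elementwise products (a dot product)
    return np.sum(p * q)

n, d = 1000, 100  # small test sizes
A = np.random.rand(n, d)
B = np.empty((n, n))
for i in range(n):
    for j in range(n):
        B[i, j] = f(A[i], A[j])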
Since n and d can be very large (n ~ 200000, d ~ 1000), this computation is
very slow, so I am trying to speed it up using multiprocessing.Pool().
This is the code I have so far; I am intentionally ignoring the
symmetry property (B[i,j] == B[j,i]) for simplicity.
import functools
from multiprocessing import Pool, cpu_count
p = Pool(cpu_count())
for i, x in enumerate(A):
    # fix one argument of f to row x, then map over all rows of A
    B[i] = p.map(functools.partial(f, x), A)
I am fixing one parameter of f() to one row of A and applying map() over
the iterable numpy array A, so that I get the answer one row at a time.
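Just to spell out what that partial does (a tiny illustration of standard-library behaviour, separate from the code above):

import functools
def f(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))
g = functools.partial(f, [1, 2, 3])  # g(q) is f([1, 2, 3], q)
print(g([4, 5, 6]))  # prints 32, same as f([1, 2, 3], [4, 5, 6])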
The problem is speed: this is even slower than my original
elementwise computation when I tried it with (n, d) = (1000, 100).
After reading many other related posts, I guess this is because the
function f() is being pickled and unpickled back and forth for every call,
which induces huge overhead.
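One variant I am considering, to cut the per-call pickling, is to send only a row index to each worker and compute a whole row of B there. A rough sketch, assuming a Unix fork start method so the children inherit a module-level A; f_row and the chunksize value are my own placeholders:

from multiprocessing import Pool, cpu_count
import numpy as np

A = np.random.rand(1000, 100)  # module-level so forked workers inherit it

def f_row(i):
    # compute row i of B inside the worker; only the integer i is pickled
    return np.array([np.sum(A[i] * A[j]) for j in range(A.shape[0])])

if __name__ == "__main__":
    with Pool(cpu_count()) as pool:
        rows = pool.map(f_row, range(A.shape[0]), chunksize=64)
    B = np.vstack(rows)

(For the sum(p*q) example specifically, B is just the Gram matrix A @ A.T, which a single BLAS-backed NumPy call computes far faster than any Python-level loop, with or without multiprocessing.)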
Am I guessing right? Is there a better way of doing this?
Thanks in advance.