Overview
Matrix-matrix multiplication is a fundamental kernel, one which can achieve high efficiency in both theory and practice. First, some caveats and assumptions:
- Here we deal with dense matrices: matrices with few enough zeros that storing them as 2D arrays is justified
- I distinguish between a matrix and an array. The first is a mathematical object, a rectangular arrangement of numbers usually indexed by an integer pair (i,j) starting from 1. The second is a computer data structure, which can be used to hold a matrix, and it may be indexed starting from 0, 1, or anything convenient.
- A load-store analysis shows that the memory-reference-to-flop ratio for matrix-matrix multiplication is O(1/n), and hence it should be implementable at near-peak performance on a cache-based serial computer.
- Few modern applications really need dense matrix-matrix multiplication. It is more of a toy; the best that can be said for it is that if you cannot achieve 85% of theoretical peak performance with it on a machine, then the machine is flawed in some way: OS, compiler, or hardware.
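The O(1/n) ratio in the load-store analysis above follows from a simple count (a sketch, assuming a perfect cache so each matrix element is moved between memory and cache only once):

```latex
\underbrace{3n^2}_{\text{read } A,\, B,\, C}
  \;+\; \underbrace{n^2}_{\text{write } C}
  \;=\; 4n^2 \text{ memory references},
\qquad
2n^3 \text{ flops},
\qquad
\frac{4n^2}{2n^3} \;=\; \frac{2}{n} \;=\; O\!\left(\frac{1}{n}\right).
```

Since the ratio shrinks as n grows, a large enough matrix multiply does enough arithmetic per word moved to hide memory traffic behind floating-point work.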
Now we deal with the practical implementation of the operation C = C + A*B and develop a performance model for its parallel versions. Temporarily assume that matrices are n x n, and that the number of processes p evenly divides n.
A brief note before beginning: for large matrices there are "reduced order" algorithms (Strassen, Winograd) which use extra memory but compute the product in fewer than 2n³ flops. We will use the standard algorithm, however, because the reduced-order techniques can always be applied on a single process in our versions. We will consider 1D layouts (where the primary partitioning is by block rows or block columns) and 2D layouts (where the matrix is partitioned by both rows and columns).
Read full article from B673: Fall 2003, Parallel Matrix-Matrix Multiplication