Cluster and Grid Computing

Assignment 2:

Parallel Matrix Multiply

In the lecture I merely said write a parallel matrix multiplication program: C = A * B. Here I have added detail to make it easier for you.

The program should be able to accept any size square matrix and utilize any number of processors, although I will accept a fixed number of processors.

Some comments on implementation and submission:

1. Only one process (e.g., a process with rank = 0 acting as the master) should read the data/matrix (or assign values to matrix A and B) and then it should automatically share this among the worker processes.

2. You can divide the work among processes in a number of ways. Some suggested ways are: (a) treat computations of all elements of each row of the resulting matrix C as one work unit, (b) treat computations of all elements of each column of the resulting matrix C as one work unit, (c) computations of each element of the resulting matrix C as one work unit, (d) create work units based on the number of processes and the size of matrix and then assign them different worker process.

3. Try to do assign the work units to worker processes such that the load/work distribution among them is balanced. The master process should collect the results from worker processes and strores in matrix C.

4. Testing and Evaluation:
    (a) Take the size of square matrix as an input to the program by reading from the keyboard at runtime.
    (b) Test your program to make sure that it works in parallel!
    (c) When the size of matrix (N) is larger than 5, it is NOT be feasible for you to input elements of matrix A and B from the keyboard. In such cases, simply assign some random values to elements matrix A and B.
(d) Run your program for three different problem sizes: matrix size: N = 100, 200, 300, 400, 500 on different combinations of processes: 4, 8, 10, and 12 processes.
   (e) Note the processing time by making use of appropriate MPI calls and draw a graph for varying values of matrix size with "Number of CPUs/nodes on X column" and "the time taken " on Y column.

That is, use randomly generated matricies in a range of sizes and produce a report on the time taken for a range of different sized matricies in a parallel programmin environment using MPI.
.
4. Submission:

Write a report containing (a) details on your design and implementation and the way you did load balancing, (b) performance graphs and (c) program code
Please email your report to Associate Lecturer only (by deadline). Please mention your name and student number on the first page of your report.
Please bring a printed report and handover during the demo.

Old Description: You are welcome to implement as indicated below if you like.

A better way of expressing the problem is: Write a function "square_dgemm", similar to the LAPACK "dgemm" routine but limited to square matrices, to compute the product of two matrices using MPI to distribute the task onto a number of processors. Note, only one processor should read the data/matrix and then the program should automatically share this among the processors.

There is a harness routine matmul.c in http://www.hpc.unimelb.edu.au/cs/assignment2/ which provides checks on correctness and scalability. There are also serial basic and dlocked matrix multiiply sub-routines in the directory which may give assistance with methods of dealing with the odd bits left when the number of processors does not divide the matrix size.