You're given a test harness and an empty function. Implement single-core dense matrix multiplication
C = A * B
and then make it as fast as you can on one core. The harness builds the inputs, times your code, and checks the answer, so you can concentrate on the loops and the memory access pattern.
| File | What it is |
|---|---|
matmul.h |
The function you implement, and the array layout. Read it first. |
matmul_student.c |
The file you edit. Starts empty. |
harness.c |
Builds inputs, times matmul, checks the result. Don't edit. |
timer.h |
Wall-clock timer. |
tests/ |
Stored answers (a hash and a checksum per case). |
Makefile |
Build and run. |
You need a C compiler and make.
On Isambard 3, load a compiler module first (for example module load gcc, or
the Arm/Cray compiler of your choice). Locally, gcc or clang is fine; on
Windows use WSL.
make # build ./matmul from matmul_student.c
make run # build and run all the casesUntil you've written anything, every case reports FAIL (the stub does
nothing). When your code is correct you'll see something like:
size iters time (s) GFLOP/s result
---- ----- -------- ------- ------
256 20 0.0050 6.32 PASS
512 5 0.0360 7.40 PASS
1024 1 0.3350 6.41 PASS
2048 1 2.6800 6.41 PASS
All cases passed.
Open matmul.h. There's one function:
void matmul(const double *A, const double *B, double *C,
int M, int N, int K);The matrices are row-major flat arrays of double:
A is M x K A[(size_t)i * K + k]
B is K x N B[(size_t)k * N + j]
C is M x N C[(size_t)i * N + j]
C is zeroed before the call. The test cases are square (M = N = K), but
it's worth writing it for general M, N, K.
A rough order to work in:
- Get the plain triple loop correct so every case passes. Note the GFLOP/s.
- Change the loop order (try
i, k, j) and see what happens to the speed. The difference is all about howBis walked through memory. - Block the loops so the data you reuse stays in cache.
- Arrange the inner loop so the compiler can vectorise it.
- Try the compiler flags below.
The inputs are small integers, so the exact product fits in a double
whatever order you add the terms in. That means a reordered, blocked, or
vectorised version gives the same answer as the simple loop and still passes
the check, so you don't need to worry about floating-point rounding.
make run prints, per case, the time for one multiply and the achieved
GFLOP/s (2 * M * N * K / time). Higher is better; use it to compare attempts.
Small matrices run too quickly to time reliably, so the harness repeats them a few times by default. To set the repeat count yourself:
./matmul -i 50 # 50 repeats per caseThe matrices are allocated once and reused across repeats, so this gives you a longer, steadier measurement without using any more memory.
To time a single size without checking the answer:
./matmul 2000 # one 2000 x 2000 multiply
./matmul 2000 -i 5 # five of themThe built-in sizes run from 256 up to 2048. The smaller ones take a few seconds at most with the plain triple loop; 2048 is the heavy one (about 96 MB, and the naive loop on it is slow) and is there as a stretch case.
The default is -O2. Override CFLAGS to try others without editing the
Makefile:
make clean
make CFLAGS="-O3 -march=native -funroll-loops -std=c11 -Wall" # x86 (local)
make CFLAGS="-O3 -mcpu=native -funroll-loops -std=c11 -Wall" # Isambard 3
make run-O3 turns on more aggressive vectorisation; -march=native / -mcpu=native
lets the compiler use your CPU's full SIMD width. Compare the GFLOP/s.
The answers in tests/ are stored as an FNV hash plus a checksum per case, so
there's nothing large to commit and no worked implementation in the repo.
To regenerate them (for example after changing the sizes), put a working
matmul in matmul_student.c, then:
make gen # writes the hashes for every case in CASES[] to tests/To change the sizes, seeds, or default repeat counts, edit CASES[] near the
top of harness.c. Memory is 3 * n * n doubles, about 24 MB at n = 1024
and 96 MB at n = 2048.