Skip to content

JosephMoore25/SCC-matmul

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Matmul optimisation exercise

You're given a test harness and an empty function. Implement single-core dense matrix multiplication

C = A * B

and then make it as fast as you can on one core. The harness builds the inputs, times your code, and checks the answer, so you can concentrate on the loops and the memory access pattern.

Files

File What it is
matmul.h The function you implement, and the array layout. Read it first.
matmul_student.c The file you edit. Starts empty.
harness.c Builds inputs, times matmul, checks the result. Don't edit.
timer.h Wall-clock timer.
tests/ Stored answers (a hash and a checksum per case).
Makefile Build and run.

Building and running

You need a C compiler and make.

On Isambard 3, load a compiler module first (for example module load gcc, or the Arm/Cray compiler of your choice). Locally, gcc or clang is fine; on Windows use WSL.

make          # build ./matmul from matmul_student.c
make run      # build and run all the cases

Until you've written anything, every case reports FAIL (the stub does nothing). When your code is correct you'll see something like:

  size   iters      time (s)       GFLOP/s  result
  ----   -----      --------       -------  ------
   256      20        0.0050          6.32  PASS
   512       5        0.0360          7.40  PASS
  1024       1        0.3350          6.41  PASS
  2048       1        2.6800          6.41  PASS

All cases passed.

What to implement

Open matmul.h. There's one function:

void matmul(const double *A, const double *B, double *C,
            int M, int N, int K);

The matrices are row-major flat arrays of double:

A is M x K    A[(size_t)i * K + k]
B is K x N    B[(size_t)k * N + j]
C is M x N    C[(size_t)i * N + j]

C is zeroed before the call. The test cases are square (M = N = K), but it's worth writing it for general M, N, K.

A rough order to work in:

  1. Get the plain triple loop correct so every case passes. Note the GFLOP/s.
  2. Change the loop order (try i, k, j) and see what happens to the speed. The difference is all about how B is walked through memory.
  3. Block the loops so the data you reuse stays in cache.
  4. Arrange the inner loop so the compiler can vectorise it.
  5. Try the compiler flags below.

The inputs are small integers, so the exact product fits in a double whatever order you add the terms in. That means a reordered, blocked, or vectorised version gives the same answer as the simple loop and still passes the check, so you don't need to worry about floating-point rounding.

Timing

make run prints, per case, the time for one multiply and the achieved GFLOP/s (2 * M * N * K / time). Higher is better; use it to compare attempts.

Small matrices run too quickly to time reliably, so the harness repeats them a few times by default. To set the repeat count yourself:

./matmul -i 50        # 50 repeats per case

The matrices are allocated once and reused across repeats, so this gives you a longer, steadier measurement without using any more memory.

To time a single size without checking the answer:

./matmul 2000         # one 2000 x 2000 multiply
./matmul 2000 -i 5    # five of them

The built-in sizes run from 256 up to 2048. The smaller ones take a few seconds at most with the plain triple loop; 2048 is the heavy one (about 96 MB, and the naive loop on it is slow) and is there as a stretch case.

Compiler flags

The default is -O2. Override CFLAGS to try others without editing the Makefile:

make clean
make CFLAGS="-O3 -march=native -funroll-loops -std=c11 -Wall"   # x86 (local)
make CFLAGS="-O3 -mcpu=native -funroll-loops -std=c11 -Wall"    # Isambard 3
make run

-O3 turns on more aggressive vectorisation; -march=native / -mcpu=native lets the compiler use your CPU's full SIMD width. Compare the GFLOP/s.

Regenerating the reference files

The answers in tests/ are stored as an FNV hash plus a checksum per case, so there's nothing large to commit and no worked implementation in the repo.

To regenerate them (for example after changing the sizes), put a working matmul in matmul_student.c, then:

make gen        # writes the hashes for every case in CASES[] to tests/

To change the sizes, seeds, or default repeat counts, edit CASES[] near the top of harness.c. Memory is 3 * n * n doubles, about 24 MB at n = 1024 and 96 MB at n = 2048.

About

A small test rig for implementing matmul - used for SCC training in understanding standard optimisations

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors