[WIP] implement all benchmarks on triton#43

Open: dshaaban01 wants to merge 3 commits into spcl:main from dshaaban01:pr2

@dshaaban01

This PR implements all npbench benchmarks in Triton. It should be stacked on top of #42.

dshaaban01 changed the title from "implement all benchmarks on triton" to "[WIP] implement all benchmarks on triton" on Dec 17, 2025
dshaaban01 marked this pull request as draft on December 17, 2025, 17:36
dshaaban01 marked this pull request as ready for review on December 17, 2025, 17:36
ThrudPrimrose added a commit to ThrudPrimrose/npbench that referenced this pull request May 28, 2026
54 triton/torch kernels with @triton.autotune across the npbench
suite, plus TritonFramework registration. Will be followed by an
autotune-shrink commit and numeric-failure triage.

Source: github.com/spcl/pull/43 (dshaaban01:pr2)

# Conflicts:
#	requirements.txt
ThrudPrimrose added a commit to ThrudPrimrose/npbench that referenced this pull request May 28, 2026
…sweep

PR spcl#43's per-kernel get_configs() helpers expand via itertools.product
into 32-60 triton.Config entries, e.g. gemm sweeps [32,64]^3 × [1,2,4,8]
num_warps. Running the full sweep on every S-preset call dwarfs the
per-call kernel work and produces noisy timings for early benchmarks.
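
To make the size of such a sweep concrete, here is a minimal sketch of the gemm case described above. It assumes the helper builds a Cartesian product over three block dimensions and `num_warps`. The key names `BLOCK_M`, `BLOCK_N`, `BLOCK_K` are hypothetical, and plain dicts stand in for `triton.Config` so the snippet runs without Triton installed.

```python
import itertools

# Values taken from the commit message: [32, 64]^3 x [1, 2, 4, 8] num_warps.
BLOCK_SIZES = [32, 64]
NUM_WARPS = [1, 2, 4, 8]


def get_configs():
    """Hypothetical stand-in for a per-kernel get_configs() helper.

    Returns plain dicts instead of triton.Config objects; the key names
    are assumptions chosen for illustration.
    """
    return [
        {"BLOCK_M": bm, "BLOCK_N": bn, "BLOCK_K": bk, "num_warps": warps}
        for bm, bn, bk, warps in itertools.product(
            BLOCK_SIZES, BLOCK_SIZES, BLOCK_SIZES, NUM_WARPS
        )
    ]


configs = get_configs()
print(len(configs))  # 32 = 2 * 2 * 2 * 4
```

Every entry is compiled and timed on first use of a kernel, which is why a 32-entry sweep can dominate a small S-preset run.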

Add a one-shot monkey-patch in TritonFramework.__init__ that wraps
triton.runtime.autotuner.Autotuner.__init__ to slice the `configs`
argument to the first N entries when NPBENCH_TRITON_AUTOTUNE_SIZE is
unset or 'small' (default cap N = 4, override via NPBENCH_TRITON_AUTOTUNE_N).
Set NPBENCH_TRITON_AUTOTUNE_SIZE=full to disable the cap and run the
upstream PR spcl#43 sweeps as-is.
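
The capping logic described above could look roughly like the following sketch. It is not the code from the commit: `FakeAutotuner` is a hypothetical stand-in for `triton.runtime.autotuner.Autotuner` (whose real constructor takes more arguments), `install_config_cap` and `autotune_cap` are illustrative names, and treating every value other than `full` as `small` is an assumption.

```python
import functools
import inspect
import os


def autotune_cap(environ=os.environ):
    """Return the maximum number of configs to keep, or None for no cap."""
    size = environ.get("NPBENCH_TRITON_AUTOTUNE_SIZE", "small")
    if size == "full":
        return None
    # Assumption: any value other than 'full' behaves like 'small'.
    return int(environ.get("NPBENCH_TRITON_AUTOTUNE_N", "4"))


def install_config_cap(autotuner_cls, environ=os.environ):
    """Wrap autotuner_cls.__init__ once so its `configs` argument is sliced."""
    if getattr(autotuner_cls, "_npbench_capped", False):
        return  # one-shot: never wrap twice
    orig_init = autotuner_cls.__init__
    sig = inspect.signature(orig_init)

    @functools.wraps(orig_init)
    def capped_init(self, *args, **kwargs):
        # Bind by name so `configs` is found whether passed
        # positionally or as a keyword.
        bound = sig.bind(self, *args, **kwargs)
        cap = autotune_cap(environ)
        if cap is not None and bound.arguments.get("configs"):
            bound.arguments["configs"] = list(bound.arguments["configs"])[:cap]
        return orig_init(*bound.args, **bound.kwargs)

    autotuner_cls.__init__ = capped_init
    autotuner_cls._npbench_capped = True


class FakeAutotuner:
    """Hypothetical stand-in for triton.runtime.autotuner.Autotuner."""

    def __init__(self, fn, configs):
        self.fn = fn
        self.configs = configs


env = {}  # stand-in for os.environ, so the demo does not touch the real one
install_config_cap(FakeAutotuner, environ=env)
tuner = FakeAutotuner(fn=None, configs=list(range(32)))
print(len(tuner.configs))  # 4
```

Because the wrapper reads the environment at construction time, the cap can be changed between kernel imports without re-installing the patch.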

Kernel files are not modified; the patch runs before any *_triton.py
import because TritonFramework is created before Test.run() imports the
kernel module.

Also: add `torch` to requirements.txt (triton needs torch as the array
backend; PR spcl#43 added triton to env.yml but not requirements.txt).
ThrudPrimrose added a commit to ThrudPrimrose/npbench that referenced this pull request May 28, 2026
…rnels

After merging PR spcl#43 (triton) into extended, ran an S-preset dynamic
sweep on 19 triton kernels (28 PASS / 9 FAIL / 1 CRASH) and launched two
static-review agents in parallel: one over all 54 *_triton.py files, one
over all 53 *_jax.py files against their *_numpy.py reference.

Findings consolidated into tests/TRIAGE_TRITON_JAX.md:

- 15 unique triton bugs in 11 kernels: precision-eroding casts (gemm
  unconditionally downcasts fp64 inputs to fp32 before tl.dot; doitgen
  hardcodes an fp64 accumulator); mask/bounds issues (jacobi_1d zeros
  boundary cells via other=0.0; softmax NaN on entirely-OOB rows;
  azimint_hist div-by-zero pre-flagged); atomic-ordering / fp-non-repro
  (nbody energy, covariance mean, atax/bicg/azimint_naive/azimint_hist);
  init/logic bugs (mandelbrot1 records last-active iteration not the
  escape iteration); autotune configs that may OOM (jacobi_1d 2048-elem
  blocks, azimint_naive 1024-elem blocks on consumer GPUs).

- 11 jax kernels flagged: dtype promotion without `jax_enable_x64`
  (mandelbrot1, mandelbrot2, nbody); algorithm divergence in reduction
  order or loop semantics (mandelbrot2 mask vs filter, seidel_2d
  per-element divide, correlation/covariance missing symmetric fill,
  durbin roll/flip vs slicing, azimint_naive masked-mean); cond/scalar
  bugs (contour_integral); twiddle indexing (stockham_fft); missing
  in-place rebind (nbody vel -= mean).
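
Two of the findings above can be reproduced in miniature without a GPU. The sketch below is illustrative only and is not taken from the kernels: the first part emulates an fp64-to-fp32 input downcast (as reported for gemm) using `struct`, and the second shows what a missing symmetric fill (as reported for correlation/covariance) does to the result. The function names and the assumption that the reference fills the lower triangle by mirroring the upper one are mine.

```python
import struct


def to_fp32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot(a, b):
    """Plain dot product accumulated in fp64."""
    return sum(x * y for x, y in zip(a, b))


# Finding 1: downcasting fp64 inputs to fp32 discards low-order bits.
# Only the input rounding is modelled here, not tl.dot itself.
a = [1.0 + 1e-9] * 4
b = [1.0] * 4
exact = dot(a, b)
downcast = dot([to_fp32(x) for x in a], [to_fp32(y) for y in b])
print(exact - downcast)  # nonzero: the 1e-9 perturbation is lost in fp32


def covariance(data, symmetric_fill=True):
    """Column covariance of `data` (a list of rows), upper triangle first."""
    n, m = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(m)]
    cov = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            s = sum((row[i] - means[i]) * (row[j] - means[j]) for row in data)
            cov[i][j] = s / (n - 1)
            if symmetric_fill:
                cov[j][i] = cov[i][j]  # mirror into the lower triangle
    return cov


# Finding 2: without the mirror step the lower triangle stays zero.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 7.0]]
full = covariance(data)
upper_only = covariance(data, symmetric_fill=False)
print(full[1][0] == full[0][1], upper_only[1][0])
```

Either discrepancy is enough to fail an element-wise comparison against an fp64 NumPy reference at tight tolerances.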

Planted brief TODO comments pointing at the triage doc in the kernels
that fail dynamic validation today: gemm, mandelbrot1, jacobi_1d,
azimint_naive, azimint_hist (triton); mandelbrot1, correlation (jax).
Fixes are deferred per user direction — this commit is diagnosis only.
2 participants