Add GPTQ quantized linear layer support for Qwen2 (#3650) #3660

Open
astorise wants to merge 1 commit into huggingface:main from astorise:claude/quant-gptq

@astorise

Summary

Adds support for loading GPTQ-quantized checkpoints, as requested in #3650.

  • candle-gptq-kernels: a new crate with fused dequantize+GEMM CUDA and
    Metal kernels for GPTQ 4-bit linear layers (CUDA kernel includes a
    tensor-core path adapted from the Marlin project — see
    candle-gptq-kernels/kernels/marlin/ATTRIBUTION.md for license/provenance).
    Kept as its own crate rather than folded into candle-kernels so it's only
    built for checkpoints that actually use GPTQ; see the crate's README for the
    full rationale.
  • candle_transformers::quantized_gptq: portable CPU dequantize-at-load path
    (gptq_linear/GptqConfig) that works on any backend without requiring the
    gptq-cuda/gptq-metal feature.
  • candle_transformers::models::gptq_qwen2: a Qwen2 model that routes every
    attention/MLP projection through the GPTQ path above.
  • examples/quantized-gptq-qwen2: end-to-end example that downloads a GPTQ
    checkpoint from the Hub (default Qwen/Qwen2-0.5B-Instruct-GPTQ-Int4) and
    runs text generation through it.
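
For readers unfamiliar with the format, here is a minimal sketch of what a CPU dequantize step for 4-bit GPTQ weights can look like. It is not the code in quantized_gptq. It assumes the common AutoGPTQ-style layout (qweight packed eight values per 32-bit word along the input dimension, qzeros packed along the output dimension, low nibble first) and ignores act-order (g_idx). The function names and the `zero_offset` parameter are hypothetical; `zero_offset` stands in for the per-format zero-point adjustment that some checkpoint versions need.

```rust
/// Extract the `idx`-th 4-bit value (idx in 0..8) from a packed word, low nibble first.
fn unpack_nibble(word: u32, idx: usize) -> u32 {
    (word >> (4 * idx)) & 0xF
}

/// Dequantize packed 4-bit weights into a row-major [in_features, out_features] matrix.
///
/// Assumed layout (hypothetical helper, not the PR's API):
/// - `qweight`: [in_features / 8, out_features], packed along the input dimension
/// - `qzeros`:  [n_groups, out_features / 8], packed along the output dimension
/// - `scales`:  [n_groups, out_features]
/// Rows are assigned to groups sequentially (no act-order / g_idx).
fn dequantize_4bit(
    qweight: &[u32],
    qzeros: &[u32],
    scales: &[f32],
    in_features: usize,
    out_features: usize,
    group_size: usize,
    zero_offset: i32,
) -> Vec<f32> {
    let mut out = vec![0f32; in_features * out_features];
    for i in 0..in_features {
        let g = i / group_size;
        for j in 0..out_features {
            // Quantized weight for input row i, output column j.
            let q = unpack_nibble(qweight[(i / 8) * out_features + j], i % 8) as i32;
            // Zero point for group g, output column j.
            let z = unpack_nibble(qzeros[g * (out_features / 8) + j / 8], j % 8) as i32
                + zero_offset;
            out[i * out_features + j] = scales[g * out_features + j] * (q - z) as f32;
        }
    }
    out
}
```

Dequantizing once at load time trades memory for portability: the result is an ordinary dense matrix that any backend can multiply, which is consistent with the description of this path as working without the gptq-cuda/gptq-metal features.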

Why split into multiple PRs

This is the first of three PRs addressing #3650 (GPTQ, then AWQ, then
block-wise FP8). The original branch implemented all three formats together,
but several commits ended up mixing CUDA/Metal/CI work across formats, which
made the diff hard to review and hard for CI to validate per-format. This PR
is the GPTQ-only slice, reconstructed to be independently buildable and
testable so reviewers/CI can validate GPTQ in isolation before the AWQ and
FP8 PRs land on top of it. The two follow-up PRs:

  • AWQ support + a QuantizedLinear/QuantMethod abstraction unifying
    GPTQ/AWQ behind one Module impl, generalizing this PR's example into a
    single quantized-qwen2 example that auto-detects the format from
    config.json.
  • Block-wise FP8 (DeepSeek-V3-style) support, extending the same
    QuantizedLinear enum with an Fp8 variant.
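
As a rough illustration of the planned format auto-detection, the sketch below maps a quantization method string to an enum. The `QuantMethod` name comes from the description above, but its variants, the `from_config_value` helper, and the accepted strings ("gptq", "awq", "fp8") are assumptions for illustration; the follow-up PRs may define this differently, and parsing config.json itself is left out.

```rust
/// Quantization formats named in this PR series (variant names are assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QuantMethod {
    Gptq,
    Awq,
    Fp8,
}

impl QuantMethod {
    /// Map a method string, as it might appear in a checkpoint's config.json,
    /// to a variant. Returns None for anything unrecognized.
    /// Hypothetical helper, not part of the PR.
    fn from_config_value(value: &str) -> Option<Self> {
        match value.trim().to_ascii_lowercase().as_str() {
            "gptq" => Some(Self::Gptq),
            "awq" => Some(Self::Awq),
            "fp8" => Some(Self::Fp8),
            _ => None,
        }
    }
}
```

Returning `Option` rather than panicking lets a caller fall back to the unquantized loading path when the string is missing or unknown.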

Test plan

  • cargo test -p candle-transformers --lib quantized_gptq — dequantize
    roundtrip unit test
  • cargo test --manifest-path candle-gptq-kernels/Cargo.toml --no-default-features — kernel crate builds/tests standalone (CPU)
  • cargo clippy -p candle-transformers -p candle-examples --example quantized-gptq-qwen2 --all-targets -- -D warnings
  • cargo fmt --check
  • Manually ran cargo run --example quantized-gptq-qwen2 --release against
    Qwen/Qwen2-0.5B-Instruct-GPTQ-Int4
  • CI: cargo test -p candle-transformers --features cuda,gptq-cuda and
    cargo test --manifest-path candle-gptq-kernels/Cargo.toml (tensor-core
    kernel correctness) on CUDA; --features metal GPTQ kernel test on
    Metal — added as new steps in ci_cuda.yaml/ci_metal.yaml.
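
The roundtrip unit test itself is not reproduced in this description. As a sketch of the property such a test typically checks, the helpers below quantize a value to 4 bits with a scale and zero point and then dequantize it; for in-range inputs the reconstruction error should be at most half the scale. The function names and the round-to-nearest scheme are assumptions, not the PR's test code.

```rust
/// Quantize `x` to a 4-bit code using round-to-nearest, clamped to 0..=15.
/// Hypothetical helper for illustration only.
fn quantize_value(x: f32, scale: f32, zero: i32) -> u8 {
    let q = (x / scale).round() as i32 + zero;
    q.clamp(0, 15) as u8
}

/// Inverse mapping: recover an approximate f32 from a 4-bit code.
fn dequantize_value(q: u8, scale: f32, zero: i32) -> f32 {
    scale * (q as i32 - zero) as f32
}

/// Roundtrip error for a single value.
fn roundtrip_error(x: f32, scale: f32, zero: i32) -> f32 {
    (dequantize_value(quantize_value(x, scale, zero), scale, zero) - x).abs()
}
```

With scale 0.25 and zero point 8, the representable range is -2.0 to 1.75, so a value such as 0.3 should come back within 0.125 of the original, while values outside the range saturate at the nearest end.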

Closes #3650 (GPTQ portion; AWQ and FP8 to follow in separate PRs).

CPU dequantization (quantized_gptq::gptq_linear) plus fused dequant+GEMM
CUDA/Metal kernels (candle-gptq-kernels), including a real tensor-core
(WMMA mma.sync) GEMM path bound via the vendored Marlin kernel for 4-bit
weights. Wired in behind opt-in gptq-cuda/gptq-metal features on
candle-transformers.

Adds candle_transformers::models::gptq_qwen2, a Qwen2 variant that routes
every attention/MLP projection through gptq_linear, and an end-to-end
quantized-gptq-qwen2 example that downloads a real AutoGPTQ/GPTQModel
checkpoint from the Hugging Face Hub and runs generation through it.

This is the GPTQ-only slice of the broader GPTQ/AWQ/FP8 quantization work
for issue huggingface#3650, split out to keep each format's PR independently
reviewable and end-to-end testable in CI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014pmFaDvgxprkJxEFq6XDFL
@astorise force-pushed the claude/quant-gptq branch from 3846b08 to 41688f5 on June 27, 2026 04:03
Development

Successfully merging this pull request may close these issues.

[feature] Weight quantization kernels: GPTQ/Marlin, AWQ, block-wise FP8
