Add GPTQ quantized linear layer support for Qwen2 (#3650) by astorise · Pull Request #3660 · huggingface/candle

astorise · 2026-06-27T03:40:03Z

Summary

Adds support for loading GPTQ-quantized checkpoints, as requested in #3650.

- candle-gptq-kernels: a new crate with fused dequantize+GEMM CUDA and
  Metal kernels for GPTQ 4-bit linear layers (the CUDA kernel includes a
  tensor-core path adapted from the Marlin project; see
  candle-gptq-kernels/kernels/marlin/ATTRIBUTION.md for license/provenance).
  Kept as its own crate rather than folded into candle-kernels so it's only
  built for checkpoints that actually use GPTQ; see the crate's README for the
  full rationale.
- candle_transformers::quantized_gptq: portable CPU dequantize-at-load path
  (gptq_linear/GptqConfig) that works on any backend without requiring the
  gptq-cuda/gptq-metal feature.
- candle_transformers::models::gptq_qwen2: a Qwen2 model that routes every
  attention/MLP projection through the GPTQ path above.
- examples/quantized-gptq-qwen2: end-to-end example that downloads a GPTQ
  checkpoint from the Hub (default Qwen/Qwen2-0.5B-Instruct-GPTQ-Int4)
  and runs text generation through it.
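
For readers unfamiliar with the format, the following is a minimal sketch of what a CPU dequantize-at-load step for 4-bit GPTQ weights can look like. It is not the code in quantized_gptq: the function names, the row-major tensor layouts, and the `scale * (q - zero)` convention are assumptions made for illustration. It also assumes no activation reordering (no g_idx), and some checkpoint formats store zero points offset by one, which this sketch does not handle.

```rust
/// Pack eight 4-bit values into one u32, lowest nibble first.
/// (Hypothetical helper, used here only to build test inputs.)
fn pack_nibbles(vals: [u8; 8]) -> u32 {
    vals.iter()
        .enumerate()
        .fold(0u32, |acc, (i, &v)| acc | (u32::from(v & 0xF) << (4 * i)))
}

/// Dequantize a GPTQ-style 4-bit weight matrix to f32.
///
/// Assumed row-major layouts (illustrative, not taken from the PR):
/// - qweight: [in_features / 8, out_features], 8 input rows packed per u32
/// - qzeros:  [n_groups, out_features / 8], 8 output columns packed per u32
/// - scales:  [n_groups, out_features]
/// Returns a dense [in_features, out_features] matrix.
fn dequantize_4bit(
    qweight: &[u32],
    qzeros: &[u32],
    scales: &[f32],
    in_features: usize,
    out_features: usize,
    group_size: usize,
) -> Vec<f32> {
    assert!(in_features % 8 == 0 && out_features % 8 == 0);
    assert!(group_size > 0 && in_features % group_size == 0);
    let n_groups = in_features / group_size;
    assert_eq!(qweight.len(), in_features / 8 * out_features);
    assert_eq!(qzeros.len(), n_groups * out_features / 8);
    assert_eq!(scales.len(), n_groups * out_features);

    let mut out = vec![0f32; in_features * out_features];
    for i in 0..in_features {
        let g = i / group_size; // quantization group of this input row
        for o in 0..out_features {
            // Extract the 4-bit weight and its group's 4-bit zero point.
            let q = (qweight[(i / 8) * out_features + o] >> (4 * (i % 8))) & 0xF;
            let z = (qzeros[g * (out_features / 8) + o / 8] >> (4 * (o % 8))) & 0xF;
            let s = scales[g * out_features + o];
            out[i * out_features + o] = s * (q as f32 - z as f32);
        }
    }
    out
}
```

A fused kernel performs the same unpack-and-scale arithmetic inside the GEMM inner loop instead of materializing the dense matrix, which is why a dequantize-at-load path like this can serve as a portable fallback at the cost of more memory.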

Why split into multiple PRs

This is the first of three PRs addressing #3650 (GPTQ, then AWQ, then
block-wise FP8). The original branch implemented all three formats together,
but several commits ended up mixing CUDA/Metal/CI work across formats, which
made the diff hard to review and hard for CI to validate per-format. This PR
is the GPTQ-only slice, reconstructed to be independently buildable and
testable so reviewers/CI can validate GPTQ in isolation before the AWQ and
FP8 PRs land on top of it. The two follow-up PRs:

- AWQ support + a QuantizedLinear/QuantMethod abstraction unifying
  GPTQ/AWQ behind one Module impl, generalizing this PR's example into a
  single quantized-qwen2 example that auto-detects the format from
  config.json.
- Block-wise FP8 (DeepSeek-V3-style) support, extending the same
  QuantizedLinear enum with an Fp8 variant.
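
As a rough illustration of the planned format auto-detection, the sketch below maps a quantization method string to an enum. The enum shape, the function name, and the assumption that the string comes from the `quantization_config.quant_method` field of config.json are all hypothetical here; the follow-up PRs may structure this differently.

```rust
/// Hypothetical tag for the quantization format of a checkpoint.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QuantMethod {
    Gptq,
    Awq,
    Fp8,
}

impl QuantMethod {
    /// Map a method string (assumed to be read from config.json by the
    /// caller) to a variant, ignoring ASCII case. Unknown strings yield
    /// None so the caller can fall back to an unquantized path or
    /// report an error.
    fn from_config_str(s: &str) -> Option<Self> {
        match s.trim().to_ascii_lowercase().as_str() {
            "gptq" => Some(Self::Gptq),
            "awq" => Some(Self::Awq),
            "fp8" => Some(Self::Fp8),
            _ => None,
        }
    }
}
```

Returning an Option rather than panicking keeps the decision about unsupported formats with the example binary rather than the library.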

Test plan

- cargo test -p candle-transformers --lib quantized_gptq: dequantize
  roundtrip unit test
- cargo test --manifest-path candle-gptq-kernels/Cargo.toml --no-default-features: kernel crate builds/tests standalone (CPU)
- cargo clippy -p candle-transformers -p candle-examples --example quantized-gptq-qwen2 --all-targets -- -D warnings
- cargo fmt --check
- Manually ran cargo run --example quantized-gptq-qwen2 --release against
  Qwen/Qwen2-0.5B-Instruct-GPTQ-Int4
- CI: cargo test -p candle-transformers --features cuda,gptq-cuda and
  cargo test --manifest-path candle-gptq-kernels/Cargo.toml (tensor-core
  kernel correctness) on CUDA; --features metal GPTQ kernel test on
  Metal. Both are added as new steps in ci_cuda.yaml/ci_metal.yaml.

Closes #3650 (GPTQ portion; AWQ and FP8 to follow in separate PRs).

CPU dequantization (quantized_gptq::gptq_linear) plus fused dequant+GEMM CUDA/Metal kernels (candle-gptq-kernels), including a real tensor-core (WMMA mma.sync) GEMM path bound via the vendored Marlin kernel for 4-bit weights. Wired in behind opt-in gptq-cuda/gptq-metal features on candle-transformers.

Adds candle_transformers::models::gptq_qwen2, a Qwen2 variant that routes every attention/MLP projection through gptq_linear, and an end-to-end quantized-gptq-qwen2 example that downloads a real AutoGPTQ/GPTQModel checkpoint from the Hugging Face Hub and runs generation through it.

This is the GPTQ-only slice of the broader GPTQ/AWQ/FP8 quantization work for issue huggingface#3650, split out to keep each format's PR independently reviewable and end-to-end testable in CI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014pmFaDvgxprkJxEFq6XDFL

astorise force-pushed the claude/quant-gptq branch from 3846b08 to 41688f5 on June 27, 2026 04:03

astorise wants to merge 1 commit into huggingface:main from astorise:claude/quant-gptq
