Add GPTQ quantized linear layer support for Qwen2 (#3650) #3660
Open
astorise wants to merge 1 commit into
CPU dequantization (quantized_gptq::gptq_linear) plus fused dequant+GEMM CUDA/Metal kernels (candle-gptq-kernels), including a real tensor-core (WMMA mma.sync) GEMM path bound via the vendored Marlin kernel for 4-bit weights. Wired in behind opt-in gptq-cuda/gptq-metal features on candle-transformers.

Adds candle_transformers::models::gptq_qwen2, a Qwen2 variant that routes every attention/MLP projection through gptq_linear, and an end-to-end quantized-gptq-qwen2 example that downloads a real AutoGPTQ/GPTQModel checkpoint from the Hugging Face Hub and runs generation through it.

This is the GPTQ-only slice of the broader GPTQ/AWQ/FP8 quantization work for issue huggingface#3650, split out to keep each format's PR independently reviewable and end-to-end testable in CI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014pmFaDvgxprkJxEFq6XDFL
Summary
Adds support for loading GPTQ-quantized checkpoints, as requested in #3650.

- candle-gptq-kernels: a new crate with fused dequantize+GEMM CUDA and Metal kernels for GPTQ 4-bit linear layers (the CUDA kernel includes a tensor-core path adapted from the Marlin project; see candle-gptq-kernels/kernels/marlin/ATTRIBUTION.md for license/provenance). Kept as its own crate rather than folded into candle-kernels so it's only built for checkpoints that actually use GPTQ; see the crate's README for the full rationale.
- candle_transformers::quantized_gptq: portable CPU dequantize-at-load path (gptq_linear/GptqConfig) that works on any backend without requiring the gptq-cuda/gptq-metal feature.
- candle_transformers::models::gptq_qwen2: a Qwen2 model that routes every attention/MLP projection through the GPTQ path above.
- examples/quantized-gptq-qwen2: end-to-end example that downloads a GPTQ checkpoint from the Hub (default Qwen/Qwen2-0.5B-Instruct-GPTQ-Int4) and runs text generation through it.
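
For reviewers less familiar with the format, the idea behind a dequantize-at-load path can be sketched in plain Rust. This is an illustration only, not the code in this PR: the function names `unpack_int4` and `dequantize_column` are hypothetical, and the sketch assumes eight 4-bit values packed per 32-bit word (lowest nibble first), one scale and one already-unpacked zero point per group, and no activation reordering. Real checkpoints may differ in zero-point offset conventions and in how zeros are packed.

```rust
/// Unpack eight 4-bit values from one 32-bit word, lowest nibble first.
/// (Assumed packing order for this sketch.)
fn unpack_int4(word: u32) -> [u8; 8] {
    let mut out = [0u8; 8];
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = ((word >> (4 * i)) & 0xF) as u8;
    }
    out
}

/// Dequantize one weight column.
///
/// `packed`     : packed 4-bit weights along the input dimension
/// `scales`     : one scale per group
/// `zeros`      : one (already unpacked) zero point per group
/// `group_size` : number of consecutive input rows sharing a scale/zero
fn dequantize_column(
    packed: &[u32],
    scales: &[f32],
    zeros: &[u8],
    group_size: usize,
) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 8);
    for (word_idx, &word) in packed.iter().enumerate() {
        for (nibble_idx, &q) in unpack_int4(word).iter().enumerate() {
            let row = word_idx * 8 + nibble_idx;
            let group = row / group_size;
            // w = scale * (q - zero)
            out.push(scales[group] * (q as f32 - zeros[group] as f32));
        }
    }
    out
}
```

Calling `dequantize_column(&[0x7654_3210, 0xFEDC_BA98], &[0.5, 2.0], &[8, 8], 8)` yields 16 weights, with the first eight rows using group 0 and the last eight using group 1. A fused dequant+GEMM kernel performs the same arithmetic per element but avoids materializing the full-precision matrix.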
Why split into multiple PRs
This is the first of three PRs addressing #3650 (GPTQ, then AWQ, then
block-wise FP8). The original branch implemented all three formats together,
but several commits ended up mixing CUDA/Metal/CI work across formats, which
made the diff hard to review and hard for CI to validate per-format. This PR
is the GPTQ-only slice, reconstructed to be independently buildable and
testable so reviewers/CI can validate GPTQ in isolation before the AWQ and
FP8 PRs land on top of it. The two follow-up PRs:
- QuantizedLinear/QuantMethod abstraction unifying GPTQ/AWQ behind one Module impl, generalizing this PR's example into a single quantized-qwen2 example that auto-detects the format from config.json.
- QuantizedLinear enum with an Fp8 variant.

Test plan
- cargo test -p candle-transformers --lib quantized_gptq: dequantize roundtrip unit test
- cargo test --manifest-path candle-gptq-kernels/Cargo.toml --no-default-features: kernel crate builds/tests standalone (CPU)
- cargo clippy -p candle-transformers -p candle-examples --example quantized-gptq-qwen2 --all-targets -- -D warnings
- cargo fmt --check
- cargo run --example quantized-gptq-qwen2 --release against Qwen/Qwen2-0.5B-Instruct-GPTQ-Int4
- cargo test -p candle-transformers --features cuda,gptq-cuda and cargo test --manifest-path candle-gptq-kernels/Cargo.toml (tensor-core kernel correctness) on CUDA; --features metal GPTQ kernel test on Metal. Both added as new steps in ci_cuda.yaml/ci_metal.yaml.

Closes #3650 (GPTQ portion; AWQ and FP8 to follow in separate PRs).