Support for glm_moe_dsa architecture (GLM-5.2 DeepSeek Sparse Attention) #111

Description

@basedagent

Request

GLM-5.2 (z-ai/glm-5.2) uses the glm_moe_dsa architecture (GlmMoeDsaForCausalLM), which combines Mixture-of-Experts with DeepSeek Sparse Attention (DSA). This is distinct from the existing glm4_moe and glm4_moe_lite architectures currently supported in SwiftLM.

Why this matters

GLM-5.2 is a frontier MoE model (~308GB in 3.5bpw MLX format, ~384GB in mxfp4). On a 128GB M5 Max the weights do not fit in RAM, which makes SwiftLM's --stream-experts SSD expert streaming a natural fit: only the active experts (~40B params per token) need to be in memory, and the rest can be streamed from NVMe.
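As a rough sanity check on those numbers, here is a back-of-the-envelope sketch. It assumes decimal gigabytes and ignores quantization scales, KV cache, and activations; the helper name `weight_footprint_gb` is made up for illustration.

```python
def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB (ignores quantization scales/metadata)."""
    return params * bits_per_weight / 8 / 1e9

# Figures quoted in this issue.
model_gb = 308.0        # 3.5bpw MLX checkpoint
ram_gb = 128.0          # M5 Max unified memory
active_params = 40e9    # ~40B params touched per token

active_gb = weight_footprint_gb(active_params, 3.5)
oversubscription = model_gb / ram_gb
min_share_on_ssd = 1 - ram_gb / model_gb  # lower bound: assumes all RAM holds weights

print(f"active weights per token: ~{active_gb:.1f} GB")
print(f"model size vs RAM: {oversubscription:.2f}x")
print(f"share of weights that must stay on SSD: at least {min_share_on_ssd:.0%}")
```

Under these assumptions the per-token working set is about 17.5 GB, far below 128GB, while well over half of the checkpoint has to live on disk at any moment.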

Current state

  • mlx-lm 0.31.3 now supports glm_moe_dsa (merged in commit d711c5f)
  • transformers v5.11.0 also supports GlmMoeDsa
  • The MLX 3.5bpw quantized model is available at avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw
  • SwiftLM's LLMModelFactory.swift supports glm4_moe and glm4_moe_lite but not glm_moe_dsa

What's needed

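Add glm_moe_dsa as a supported architecture in LLMModelFactory.swift, mapping to the appropriate Swift MLX model class. DSA differs from standard MHA/GQA: as publicly described for DeepSeek-V3.2, a lightweight indexer scores earlier tokens and each query attends only to the top-k selected tokens, rather than following a fixed sliding-window or global-token pattern. The attention layer therefore needs its own implementation and cannot simply reuse the glm4_moe one.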
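To illustrate the shape of the change, here is a minimal sketch of model-type dispatch, written in Python for brevity. It is not SwiftLM's actual code: the registry, the `resolve` helper, and the class-name strings are all hypothetical stand-ins. The only things taken from the issue are the `model_type` values and the `GlmMoeDsaForCausalLM` architecture name, read from a Hugging Face style `config.json`.

```python
import json
from typing import Callable, Dict

Builder = Callable[[dict], str]

def make_registry() -> Dict[str, Builder]:
    # Hypothetical stand-in for the factory's model_type -> model class table.
    # Builders return a name here instead of constructing a real model.
    return {
        "glm4_moe": lambda cfg: "GLM4MoE",
        "glm4_moe_lite": lambda cfg: "GLM4MoELite",
    }

def resolve(registry: Dict[str, Builder], config: dict) -> str:
    """Look up the builder for config['model_type'] or fail loudly."""
    model_type = config.get("model_type")
    if model_type not in registry:
        raise ValueError(f"unsupported model_type: {model_type!r}")
    return registry[model_type](config)

# Relevant fields of the checkpoint's config.json (other fields omitted).
config = json.loads(
    '{"model_type": "glm_moe_dsa", "architectures": ["GlmMoeDsaForCausalLM"]}'
)

registry = make_registry()
try:
    resolve(registry, config)
    supported_before = True
except ValueError:
    supported_before = False  # today's behaviour: architecture not recognised

# The requested change: register the new type (placeholder class name).
registry["glm_moe_dsa"] = lambda cfg: "GlmMoeDsa"
resolved_after = resolve(registry, config)
print(supported_before, resolved_after)
```

The registration itself is a one-line change; the real work is the model class behind it, in particular the sparse attention layer.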

Context

We're building an autonomous agent fleet (Based Agent Systems) that runs GLM-5.2 as the canonical reasoner via OpenRouter. Local MLX inference would eliminate API costs and reduce latency. SwiftLM's SSD expert streaming is the only viable path for running a 308GB MoE model on 128GB RAM.
