Request
GLM-5.2 (z-ai/glm-5.2) uses the glm_moe_dsa architecture (GlmMoeDsaForCausalLM), which combines Mixture-of-Experts with DeepSeek Sparse Attention (DSA). This is distinct from the existing glm4_moe and glm4_moe_lite architectures currently supported in SwiftLM.
Why this matters
GLM-5.2 is a frontier MoE model (~308GB in 3.5bpw MLX format, ~384GB in mxfp4). On a 128GB M5 Max the model does not fit in RAM, which makes SwiftLM's --stream-experts SSD expert streaming the natural fit: only the active experts (~40B params per token) need to be in memory, and the rest are streamed from NVMe on demand.
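To make the sizes concrete, here is a back-of-the-envelope sketch using only the figures quoted above. It assumes a uniform 3.5 bits per weight (the dynamic quantization will vary by layer) and treats all ~40B active parameters as weights that must be resident. It ignores the KV cache, activations, and the fact that consecutive tokens route to different experts, so the real working set will be larger.

```python
def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


MODEL_GB = 308.0       # 3.5bpw MLX checkpoint size quoted in this issue
RAM_GB = 128.0         # unified memory on the target machine
ACTIVE_PARAMS = 40e9   # ~40B active parameters per token
BITS_PER_WEIGHT = 3.5  # assumption: uniform average across all tensors

active_gb = weights_gb(ACTIVE_PARAMS, BITS_PER_WEIGHT)
shortfall_gb = MODEL_GB - RAM_GB

print(f"active weights per token: ~{active_gb:.1f} GB")
print(f"checkpoint exceeds RAM by: ~{shortfall_gb:.0f} GB")
```

Under these assumptions the per-token active set is roughly an order of magnitude smaller than available RAM, which is why streaming the inactive experts from SSD looks feasible here.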
Current state
- mlx-lm 0.31.3 now supports glm_moe_dsa (merged in commit d711c5f)
- transformers v5.11.0 also supports GlmMoeDsa
- The MLX 3.5bpw quantized model is available at avlp12/GLM-5.2-Alis-MLX-Dynamic-3.5bpw
- SwiftLM's LLMModelFactory.swift supports glm4_moe and glm4_moe_lite but not glm_moe_dsa
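The gap can be illustrated with a small check against a checkpoint's config.json. This is only a sketch: the SUPPORTED set mirrors the two GLM architectures named above rather than SwiftLM's full registry, `unsupported_reason` is a hypothetical helper, and the example config is reduced to the two identifiers given in this issue (the real file was not inspected).

```python
import json

# Assumption: only the GLM MoE architectures named in this issue are listed.
SUPPORTED = {"glm4_moe", "glm4_moe_lite"}


def unsupported_reason(config_json: str, supported=SUPPORTED):
    """Return None if the model_type is supported, else a short explanation."""
    config = json.loads(config_json)
    model_type = config.get("model_type")
    if model_type in supported:
        return None
    return (
        f"unsupported model_type {model_type!r} "
        f"(architectures: {config.get('architectures')})"
    )


# Hypothetical, minimal config using the identifiers from this issue.
example = json.dumps(
    {"model_type": "glm_moe_dsa", "architectures": ["GlmMoeDsaForCausalLM"]}
)
print(unsupported_reason(example))
```

Adding "glm_moe_dsa" to the set makes the check pass, but that alone is not enough: the identifier also has to map to a model class that implements the DSA attention path.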
What's needed
Add glm_moe_dsa as a supported architecture in LLMModelFactory.swift, mapping to the appropriate Swift MLX model class. The DSA attention pattern differs from standard MHA/GQA: as described for DeepSeek-V3.2, a lightweight indexer scores the preceding tokens and each query attends only to the top-k highest-scoring ones rather than to the full context.
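For a feel of what the new model class has to do differently, here is a toy sketch of the token-selection idea behind DeepSeek Sparse Attention as published with DeepSeek-V3.2: score the earlier tokens, keep the top-k, and run ordinary softmax attention over that subset only. It is plain Python for one query and one head. The index scores are passed in directly, whereas in the real model a learned indexer produces them, and nothing here reflects how mlx-lm or GLM-5.2 actually implements it.

```python
import math


def topk_sparse_attention(q, keys, values, index_scores, k):
    """Attend from one query to only the k highest-scoring tokens.

    q: query vector; keys/values: one vector per earlier token;
    index_scores: one relevance score per earlier token (assumed given).
    Returns (sorted selected token indices, attention output vector).
    """
    # Token selection: keep the k positions with the highest index score.
    selected = sorted(
        range(len(keys)), key=lambda i: index_scores[i], reverse=True
    )[:k]

    # Scaled dot-product logits over the selected tokens only.
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in selected]

    # Numerically stable softmax.
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)

    # Weighted sum of the selected value vectors.
    dim = len(values[0])
    out = [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]
    return sorted(selected), out


picked, out = topk_sparse_attention(
    q=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]],
    values=[[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [9.0, 9.0]],
    index_scores=[0.9, 0.1, 0.8, 0.2],
    k=2,
)
print(picked, out)
```

Because the set of attended tokens changes per query, a port likely cannot reuse a fixed attention mask and needs the selection step inside the attention layer itself.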
Context
We're building an autonomous agent fleet (Based Agent Systems) that runs GLM-5.2 as the canonical reasoner via OpenRouter. Local MLX inference would eliminate API costs and reduce latency. SwiftLM's SSD expert streaming is the only viable path for running a 308GB MoE model on 128GB RAM.