Recommended max_kv_size for MLX deployment to prevent OOM on Apple Silicon
Issue
When MiniMax-M2.5 (4-bit MLX variant) runs on Apple Silicon via mlx_lm.server as a long-running server, the unbounded KV cache can exhaust unified memory and trigger a kernel panic.
Environment
- Mac Studio M3 Ultra, 512GB unified memory
- Model: MiniMax-M2.5-MLX-4bit (~120GB on disk)
- Runtime: mlx_lm.server v0.30.7
- Use case: Backend for Claude Code via Claude Code Router, serving coding assistance
What happened
During long coding sessions, M2.5's 80 MoE layers accumulate substantial KV cache per request. With mlx_lm.server's default LRU cache of 10 entries and no size limit, total memory usage grew past 400GB, exhausting the 512GB unified memory pool. This triggered a SoC Watchdog kernel panic (hardware hard reboot), not just a standard OOM kill.
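To make the growth concrete, here is a rough sketch of how KV cache size scales with context length under standard (grouped-query) attention. The layer count of 80 is the figure quoted above; `kv_heads=8`, `head_dim=128`, and 2 bytes per element (fp16) are placeholder assumptions, not values read from the MiniMax-M2.5 config, so the result is illustrative and will not necessarily match the measured per-request figure. The helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for standard (grouped-query) attention."""
    # Factor of 2: one key tensor and one value tensor are stored per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


# layers=80 is the figure from this post; kv_heads and head_dim are
# placeholder assumptions. Read the real values from the model's config.
per_token = kv_cache_bytes(1, layers=80, kv_heads=8, head_dim=128)
per_request = kv_cache_bytes(65536, layers=80, kv_heads=8, head_dim=128)

print(f"{per_token} bytes per token")
print(f"{per_request / 2**30:.1f} GiB for a 65536-token context")
```

The cost is linear in token count, so without a cap each cached conversation keeps growing for as long as the session does.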
Suggestion
Could the model card include recommended deployment parameters for MLX users? For example:
mlx_lm.server --model MiniMax-M2.5-MLX-4bit --max-kv-size 65536
Note: --max-kv-size is not yet supported in mlx_lm.server v0.30.7 (requires PR #884), but once merged, this would be critical for safe deployment.
Recommended values
| Variant | Recommended max_kv_size | Rationale |
|---|---|---|
| 4-bit (512GB machine) | 65536 | ~15GB KV per request × 10 LRU entries ≈ 150GB total KV cache |
| 8-bit (512GB machine) | 32768 | Higher per-layer memory, needs tighter limit |
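
The arithmetic behind the 4-bit row can be sketched as a simple budget check. The inputs are the figures from this post (512GB machine, ~120GB of weights, ~15GB of KV per request, 10 LRU entries); treating the on-disk model size as its in-memory footprint is an assumption, and the function name `memory_budget_gb` is hypothetical. Activations, the OS, and other processes are not counted.

```python
def memory_budget_gb(total_gb, model_gb, kv_per_request_gb, lru_entries):
    """Worst-case memory use when every LRU slot holds a full-size KV cache."""
    kv_total = kv_per_request_gb * lru_entries
    used = model_gb + kv_total
    return {
        "kv_total_gb": kv_total,
        "used_gb": used,
        "free_gb": total_gb - used,
    }


# Figures from this post; model_gb assumes weights occupy roughly
# their on-disk size once loaded.
budget = memory_budget_gb(total_gb=512, model_gb=120,
                          kv_per_request_gb=15, lru_entries=10)
print(budget)
```

Running the same check with a larger per-request KV figure shows how quickly the free margin shrinks, which is the reason for suggesting a tighter limit on the 8-bit variant.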
This would help Apple Silicon users avoid catastrophic OOM scenarios. Thank you for open-sourcing this incredible model!