Recommended max_kv_size for MLX deployment to prevent OOM on Apple Silicon

#14
by Neo2025new - opened

Issue

When MiniMax-M2.5 (4-bit MLX variant) runs on Apple Silicon via mlx_lm.server in a long-running deployment, the unbounded KV cache can exhaust unified memory and trigger a kernel panic.

Environment

  • Mac Studio M3 Ultra, 512GB unified memory
  • Model: MiniMax-M2.5-MLX-4bit (~120GB on disk)
  • Runtime: mlx_lm.server v0.30.7
  • Use case: Backend for Claude Code via Claude Code Router, serving coding assistance

What happened

During long coding sessions, M2.5's 80 MoE layers accumulate substantial KV cache per request. With mlx_lm.server's default LRU cache of 10 entries and no size limit, total memory usage grew past 400GB, exhausting the 512GB unified memory pool. This triggered a SoC Watchdog kernel panic (hardware hard reboot), not just a standard OOM kill.
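To illustrate the failure mode, here is a minimal sketch of a cache that limits the number of entries but not their size. It is not mlx_lm.server's actual implementation: the class name `PromptCacheLRU`, the 200,000-token session length, and the per-token cost (derived from the ~15GB per 65,536 tokens estimate for the 4-bit variant) are all assumptions for illustration. The optional per-entry cap models what a max_kv_size limit would do.

```python
from collections import OrderedDict

# Assumption: ~15GB of KV cache per 65,536 tokens (the 4-bit estimate in this post).
GB_PER_TOKEN = 15 / 65536


class PromptCacheLRU:
    """Hypothetical LRU of per-session KV caches, tracked by token count."""

    def __init__(self, max_entries=10, max_tokens_per_entry=None):
        self.max_entries = max_entries
        self.max_tokens_per_entry = max_tokens_per_entry
        self.entries = OrderedDict()  # session key -> cached token count

    def put(self, key, num_tokens):
        # A per-entry cap keeps only the most recent tokens of a session.
        if self.max_tokens_per_entry is not None:
            num_tokens = min(num_tokens, self.max_tokens_per_entry)
        self.entries[key] = num_tokens
        self.entries.move_to_end(key)
        # Evict least recently used sessions once the entry limit is exceeded.
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)

    def total_tokens(self):
        return sum(self.entries.values())

    def total_gb(self):
        return self.total_tokens() * GB_PER_TOKEN


uncapped = PromptCacheLRU(max_entries=10)
capped = PromptCacheLRU(max_entries=10, max_tokens_per_entry=65536)

# Hypothetical workload: 12 long coding sessions of 200,000 tokens each.
for i in range(12):
    uncapped.put(f"session-{i}", 200_000)
    capped.put(f"session-{i}", 200_000)

print(round(uncapped.total_gb()))  # about 458 GB across 10 retained entries
print(round(capped.total_gb()))    # 150 GB across 10 retained entries
```

Under these assumptions, the entry limit alone does not bound memory: ten retained sessions can still hold several hundred gigabytes, while a per-entry token cap gives a fixed worst case.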

Suggestion

Could the model card include recommended deployment parameters for MLX users? For example:

mlx_lm.server --model MiniMax-M2.5-MLX-4bit --max-kv-size 65536

Note: --max-kv-size is not yet supported in mlx_lm.server v0.30.7 (it requires PR #884). Once that PR is merged, the flag would be critical for safe deployment.

Recommended values

| Variant | Recommended max_kv_size | Rationale |
|---|---|---|
| 4-bit (512GB machine) | 65536 | ~15GB KV per request × 10 LRU entries ≈ 150GB total |
| 8-bit (512GB machine) | 32768 | Higher per-layer memory, needs tighter limit |
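The arithmetic behind the 4-bit row can be checked with a short sketch. The function name `memory_plan` is hypothetical, and the calculation assumes that resident weights are roughly equal to the ~120GB on-disk size and ignores activations, the OS, and other processes, so the real headroom would be smaller.

```python
def memory_plan(total_memory_gb, weights_gb, kv_per_request_gb, lru_entries):
    """Worst-case memory estimate when every LRU entry holds a full KV cache."""
    kv_total_gb = kv_per_request_gb * lru_entries
    used_gb = weights_gb + kv_total_gb
    return {
        "kv_total_gb": kv_total_gb,
        "used_gb": used_gb,
        "headroom_gb": total_memory_gb - used_gb,
    }


# Figures from this post: 512GB machine, ~120GB model, ~15GB KV per request, 10 LRU entries.
plan = memory_plan(total_memory_gb=512, weights_gb=120, kv_per_request_gb=15, lru_entries=10)
print(plan)  # {'kv_total_gb': 150, 'used_gb': 270, 'headroom_gb': 242}
```

With a 65536-token limit, the worst case stays well below the 512GB pool even when all 10 cache entries are full.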

This would help Apple Silicon users avoid catastrophic OOM scenarios. Thank you for open-sourcing this incredible model!
