Recommended max_kv_size for MLX deployment to prevent OOM on Apple Silicon

#14

by Neo2025new - opened about 11 hours ago

Neo2025new

about 11 hours ago

Issue

When running MiniMax-M2.5 (4-bit MLX variant) on Apple Silicon via mlx_lm.server in long-running server deployments, the unbounded KV cache can cause out-of-memory kernel panics.

Environment

Mac Studio M3 Ultra, 512GB unified memory
Model: MiniMax-M2.5-MLX-4bit (~120GB on disk)
Runtime: mlx_lm.server v0.30.7
Use case: Backend for Claude Code via Claude Code Router, serving coding assistance

What happened

During long coding sessions, M2.5's 80 MoE layers accumulate substantial KV cache per request. With mlx_lm.server's default LRU cache of 10 entries and no size limit, total memory usage grew past 400GB, exhausting the 512GB unified memory pool. This triggered a SoC Watchdog kernel panic (hardware hard reboot), not just a standard OOM kill.
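For anyone who wants to estimate the per-request figure themselves, here is a rough sketch of the usual KV cache size formula for an unquantized cache. The function name is made up for illustration. Apart from the 80 layers mentioned above, the architecture values in the example call (`num_kv_heads`, `head_dim`, 2 bytes per element) are placeholders and not M2.5's actual config, so the result will not necessarily match the numbers in this post. Substitute the values from the model's `config.json`.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_element=2):
    """Approximate KV cache size in bytes for one cached sequence.

    Assumes a standard, unquantized attention cache in which every layer
    stores one key tensor and one value tensor per token.
    """
    # Factor of 2: keys and values are both cached.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_element


# PLACEHOLDER architecture values, for illustration only.
per_request = kv_cache_bytes(
    num_layers=80,      # layer count quoted in this post
    num_kv_heads=8,     # assumption, not taken from the model config
    head_dim=128,       # assumption, not taken from the model config
    num_tokens=65536,   # context length to budget for
)
print(f"{per_request / 1e9:.1f} GB per cached sequence")
```

The cost grows linearly with the number of cached tokens, which is why a cap on tokens per cache entry translates directly into a cap on memory.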

Suggestion

Could the model card include recommended deployment parameters for MLX users? For example:

mlx_lm.server --model MiniMax-M2.5-MLX-4bit --max-kv-size 65536

Note: --max-kv-size is not yet supported in mlx_lm.server v0.30.7 (it requires PR #884), but once that PR is merged, this flag would be critical for safe deployment.

Recommended values

| Variant | Recommended `max_kv_size` | Rationale |
| --- | --- | --- |
| 4-bit (512GB machine) | 65536 | ~15GB KV per request × 10 LRU entries ≈ 150GB total KV cache |
| 8-bit (512GB machine) | 32768 | Higher per-layer weight memory leaves less room for KV cache; needs a tighter limit |
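The arithmetic behind the 4-bit row can be sketched as below. This is a back-of-the-envelope check only. It assumes the ~120GB on-disk size roughly equals the resident weight size, and it ignores memory used by macOS, activations, and other processes. The function name `kv_budget` is hypothetical.

```python
def kv_budget(total_gb, weights_gb, kv_per_request_gb, lru_entries):
    """Return (worst-case KV cache in GB, remaining headroom in GB).

    Worst case assumes every LRU entry holds a cache filled to max_kv_size.
    """
    worst_case_kv = kv_per_request_gb * lru_entries
    headroom = total_gb - weights_gb - worst_case_kv
    return worst_case_kv, headroom


# Numbers taken from this post: 512GB machine, ~120GB weights,
# ~15GB KV per request at max_kv_size=65536, 10 LRU cache entries.
worst_case_kv, headroom = kv_budget(512, 120, 15, 10)
print(f"worst-case KV cache: {worst_case_kv} GB, headroom: {headroom} GB")
```

With these inputs the worst-case cache is bounded at about 150GB, leaving roughly 242GB unaccounted for, compared with the unbounded growth past 400GB described above.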

This would help Apple Silicon users avoid catastrophic OOM scenarios. Thank you for open-sourcing this incredible model!

Manuun1

about 5 hours ago

Upvote!
