How to use mlboydaisuke/VoxCPM2-CoreAI with VoxCPM:

```python
import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("mlboydaisuke/VoxCPM2-CoreAI")

wav = model.generate(
    text="VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.",
    prompt_wav_path=None,               # optional: path to a prompt speech for voice cloning
    prompt_text=None,                   # optional: reference text
    cfg_value=2.0,                      # LM guidance on LocDiT, higher for better adherence to the prompt, but maybe worse
    inference_timesteps=10,             # LocDiT inference timesteps, higher for better result, lower for fast speed
    normalize=True,                     # enable external TN tool
    denoise=True,                       # enable external Denoise tool
    retry_badcase=True,                 # enable retrying mode for some bad cases (unstoppable)
    retry_badcase_max_times=3,          # maximum retrying times
    retry_badcase_ratio_threshold=6.0,  # maximum length restriction for bad case detection (simple but effective), it could be adjusted for slow pace speech
)

sf.write("output.wav", wav, 16000)
print("saved: output.wav")
```
VoxCPM2 2B - Core AI (on-device, 48 kHz)
OpenBMB VoxCPM2 (2B) converted to Apple Core AI, running fully on-device on iPhone (A19 Pro / iPhone 17 Pro) and Mac, with no network access required. It is the 2B, 48 kHz successor to VoxCPM-0.5B-CoreAI.
A tokenizer-free diffusion TTS (speech is modeled as continuous features rather than discrete audio tokens; the text side still uses a tokenizer): a MiniCPM4 28-layer text-semantic LM + an 8-layer residual acoustic LM drive a 12-layer LocDiT flow-matching diffusion head, decoded by a 48 kHz AudioVAE. Five Core AI bundles + a few host-side projections.
What's inside
| dir | contents |
|---|---|
| macos/ | JIT .aimodel bundles (Mac): int8 base/res decode + prefill, fp16 feat_decoder / feat_encoder / vocoder |
| ios/ | AOT .aimodelc bundles (iOS h18p, GPU): same five + the two int8 prefill bundles |
| voxcpm2_host_glue/ | embed table + projections / FSQ-512 / stop-head / fusion (.bin + manifest) |
| tokenizer/ | the VoxCPM2 tokenizer (Llama fast) |
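
A quick sanity check after downloading the repo can confirm that the four top-level directories from the table are present. This is a minimal sketch using only the directory names listed above; the helper name `missing_dirs` is hypothetical, and it does not inspect the bundles inside each directory.

```python
from pathlib import Path

# Top-level directories taken from the "What's inside" table.
EXPECTED_DIRS = ("macos", "ios", "voxcpm2_host_glue", "tokenizer")


def missing_dirs(root):
    """Return the expected directory names that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

An empty list means all four directories exist; anything else names what is missing from the local copy.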
The backbone LMs are weight-only int8 (the size driver); the diffusion + VAE stay fp16 (the continuous-feedback path is quant-sensitive; this is the same split mlx-community uses).
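
To illustrate what weight-only int8 means, here is a minimal sketch of symmetric per-row quantization. The scheme (one scale per weight row, symmetric range of ±127) is an assumption chosen for illustration; the card does not state which scheme the conversion scripts use, and the function names are hypothetical.

```python
def quantize_row_int8(row):
    """Symmetric int8 quantization of one weight row (assumed scheme).

    Returns the integer codes and the float scale needed to restore them.
    """
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Restore approximate float weights; activations are left untouched."""
    return [c * scale for c in codes]


row = [0.50, -1.27, 0.003, 0.9]
codes, scale = quantize_row_int8(row)
restored = dequantize_row(codes, scale)
# The rounding error per weight is at most half a quantization step.
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

Each weight is stored in one byte instead of two for fp16, which is why quantizing the large backbone LMs dominates the size saving. Small weights such as `0.003` collapse to zero, which hints at why a path that feeds its own output back in could be more sensitive to this kind of error.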
On-device numbers (iPhone 17 Pro, int8 + prefill + streaming)
- RTF 1.19, first-audio 0.65 s, 48 kHz, ~4.9 GB resident (increased-memory entitlement).
- Streaming starts after the first ~0.65 s; the 2B model is ~4× the 0.5B, so RTF sits just above realtime.
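
The numbers above can be turned into rough latency estimates. The sketch below assumes the usual definition of RTF (synthesis time divided by audio duration) and an idealized constant generation rate; both are assumptions, and the helper names are hypothetical rather than part of any library.

```python
RTF = 1.19          # from the on-device numbers above
FIRST_AUDIO = 0.65  # seconds until the first audio is available


def synthesis_time(audio_seconds, rtf=RTF):
    """Total compute time for a clip, assuming RTF = synthesis time / audio duration."""
    return audio_seconds * rtf


def min_start_delay(audio_seconds, rtf=RTF, first_audio=FIRST_AUDIO):
    """Smallest playback delay that avoids an underrun.

    Idealized model: audio second t is ready at time rtf * t, so starting
    playback at delay s is gapless when s + t >= rtf * t for every t up to
    the clip length, i.e. s >= (rtf - 1) * audio_seconds.
    """
    return max(first_audio, (rtf - 1.0) * audio_seconds)
```

Under this model a 10 s utterance takes about 11.9 s to synthesize and needs roughly 1.9 s of buffering for gapless playback, while a 2 s utterance can start playing as soon as the first audio arrives. Real behavior depends on chunk sizes and thermal state, which this sketch ignores.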
Use it
Runs through coreai-kit VoxCPM2TTS, wired into the coreai-model-zoo coreai-audio app ("Voice 2B" tab).
Conversion + gates + export scripts: coreai-model-zoo/conversion/voxcpm/ (*_v2.py).
```swift
let tts = try await VoxCPM2TTS(paths: .standard(artifactsRoot: root, lm: .int8))
let wav = try await tts.synthesize("On device speech synthesis, running entirely on your iPhone.") // 48 kHz Float PCM
```
Verification
Reimplemented in exportable Core AI overlays and gated end-to-end against the official model: backbone / feat_decoder / feat_encoder cos 1.0, full chain magspec 0.996; every exported bundle engine-gated cos ≥ 0.9999.
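
A cosine-similarity gate of the kind described above can be sketched as follows. This is an illustration of the idea, not the project's gate script: the function names are hypothetical, the inputs are assumed to be flattened output tensors of equal length, and only the 0.9999 threshold comes from the text.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_gate(reference, candidate, threshold=0.9999):
    """True when the exported output matches the reference closely enough."""
    return cosine_similarity(reference, candidate) >= threshold


reference = [0.1, -0.4, 0.25, 0.8]
close = [0.1001, -0.3999, 0.2502, 0.7998]  # small numeric drift
flipped = [0.1, 0.4, 0.25, 0.8]            # one sign error
```

Here `passes_gate(reference, close)` succeeds and `passes_gate(reference, flipped)` fails, which is the behavior wanted from a gate: it tolerates small numeric drift from quantization or fp16 but catches a structural mismatch.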
License
Apache-2.0 (commercial OK), inherited from openbmb/VoxCPM2. Not affiliated with OpenBMB or Apple. Community port.