PrimeTTS — on-device zh-TW + English TTS

Taiwan-Mandarin + English text-to-speech built for on-device use (contact-centre, GPS, transit): one frontend handles Chinese, English, and code-mix with no language routing, and reads entities correctly — phone numbers, emails, addresses, prices, dates, temperatures, %, serials.

Two models to know:

	PrimeTTS v2.1 — flagship	PrimeTTS v1 — leanest CPU
Folder	`v21_mbistft_16k/`	`v1b_16k/` · `v1b_8k/`
Architecture	MB-iSTFT-VITS (end-to-end, multi-speaker)	FastSpeech + Snake-HiFiGAN (+ pitch refiner)
Params	37.9M	~5.0M (16 kHz) / 4.09M (8 kHz)
Voices	3 selectable — Xinran ♀, Anchen ♂, Bowen ♂	1 — young ♀ zh-TW
Sample rate	16 kHz	16 kHz / 8 kHz
Held-out CER	0.059–0.069 (zh/mix/en, per voice; Xinran 0.059)	0.11–0.15 (zh)
Best on	Jetson Nano GPU (also any CPU)	pure CPU — Nano RTF 0.35 (8 kHz, 1 thread)

Pick v2.1 for multiple voices; pick v1 when the budget is CPU-only and tight.

The full family (all MB-iSTFT-VITS except v1; all 16 kHz except the 8 kHz `v1b_8k/` build; single Xinran voice unless noted):

model	folder	params (deploy)	CER	use when
v2	`v2_mbistft_16k/`	34.7M (17.5M)	0.027	you want the cleanest single Xinran voice
v2.1	`v21_mbistft_16k/`	37.9M (~18M)	0.059	you want a choice of 3 voices
V2 Lite	`v2lite_mbistft_16k/`	24.8M (17.5M)	0.041	a lighter, still-good single voice for tighter GPU budgets
v1	`v1b_16k/`,`v1b_8k/`	~5M	0.11–0.15	pure-CPU, real-time on a Jetson Nano

(v3_4.6M/ and the top-level *.onnx are legacy 24 kHz variants, kept for provenance.) V2 Lite uses the exact same ONNX I/O + frontend as v2 — it's a drop-in, smaller replacement.

🔊 Live demo: https://huggingface.co/spaces/Luigi/PrimeTTS-vs-Inflect-Nano-v1 — pick a model, pick a voice, type text.

PrimeTTS v2.1 (`v21_mbistft_16k/`)

End-to-end MB-iSTFT-VITS (VAE + normalizing flow + adversarial multi-band iSTFT head; conv-only, no LSTM) with 3 selectable Taiwan-Mandarin voices, chosen by an integer sid input (0 = Xinran ♀, 1 = Anchen ♂, 2 = Bowen ♂). 37.9M generator params, 16 kHz, gin_channels=256 speaker conditioning.

Quality (36 held-out zh / code-mix / en sentences, X-ASR normalized CER):

voice (`sid`)	CER	note
Xinran ♀ (0)	0.059	flagship voice, cleanest teacher
Anchen ♂ (1)	0.069	slight accent
Bowen ♂ (2)	0.066	slight accent

On-device deployment (measured, Jetson Nano gen-1 / Tegra X1)

Same runtime profile as the single-voice v2 (identical architecture). RTF = compute-time ÷ audio-time (lower is faster; < 1.0 = real-time).

Tier	Runtime	Precision	RTF	Notes
GPU	RapidSpeech.cpp ggml-CUDA, 1 CPU thread	fp32	0.42 (2.4× RT)	launch-bound floor on Maxwell (sm_53, no CUDA-graph replay)
CPU (default)	onnxruntime, 4 threads	fp32	0.52 (1.9× RT)	full quality, 117 MB
CPU	onnxruntime, 2 threads	fp32	0.77 (1.3× RT)	fewer cores, leaves headroom

Both CPU tiers are full-fidelity and need no GPU. On this ARMv8.0 Cortex-A57, fp32 is the fast format. int8 is not a speed lever: static int8 shifts the voice, and dynamic int8 preserves it but runs slower than fp32, because this core has no int8 dot-product instructions and no FP16 arithmetic. fp16 weights are cast back to fp32 (no speedup), and XNNPACK performs about the same as MLAS. The only on-device speed lever is a smaller or faster architecture, not quantization.
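The RTF figures in the table follow directly from the definition above. A minimal timing sketch, assuming a hypothetical zero-argument `synthesize` callable that returns a 1-D waveform (the helper names are illustrative and not part of this repo, and best-of-N timing is an assumption about methodology, not a description of how the table was measured):

```python
import time
import numpy as np

def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """RTF = compute time / audio time; values below 1.0 are faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return compute_seconds / audio_seconds

def measure_rtf(synthesize, sample_rate: int = 16000, runs: int = 3) -> float:
    """Time a hypothetical `synthesize()` callable and return the best RTF over `runs`."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        wav = np.asarray(synthesize()).reshape(-1)       # 1-D samples
        elapsed = time.perf_counter() - start
        best = min(best, real_time_factor(elapsed, len(wav) / sample_rate))
    return best

# The "x RT" column is simply the reciprocal of RTF.
print(round(1 / real_time_factor(0.52, 1.0), 1))         # 1.9
```

In practice `synthesize` would wrap the `sess.run(...)` call from the Quickstart, with one warm-up run discarded before timing.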

Files

v21_mbistft_16k/primetts_v21_3voice.onnx        3-voice fp32 (117 MB) — full quality, all runtimes

Quickstart

```shell
pip install onnxruntime numpy soundfile g2pw g2p_en cn2an
huggingface-cli download Luigi/PrimeTTS --local-dir PrimeTTS
```

```python
import sys; sys.path.insert(0, "PrimeTTS/scripts")
import numpy as np, onnxruntime as ort, soundfile as sf
import frontend_bopomofo as F                       # g2pw bopomofo + g2p_en, one pass

sess = ort.InferenceSession("PrimeTTS/v21_mbistft_16k/primetts_v21_3voice.onnx",
                            providers=["CPUExecutionProvider"])
o = F.text_to_ids("您好,歡迎使用 PrimeTTS。Thank you for calling.")
blank = lambda s: np.array([[0] + [v for x in s for v in (x, 0)]], np.int64)   # add_blank=true
sid = 0                                             # 0 Xinran ♀ · 1 Anchen ♂ · 2 Bowen ♂
wav = sess.run(None, {
    "x": blank(o["phone_ids"]), "tone": blank(o["tone_ids"]), "lang": blank(o["lang_ids"]),
    "x_lengths": np.array([2*len(o["phone_ids"])+1], np.int64),
    "sid": np.array([sid], np.int64),
    "noise_scale": np.array([0.667], np.float32),
    "length_scale": np.array([1.0], np.float32)})[0].reshape(-1)
sf.write("out.wav", wav, 16000)
```
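The `blank` lambda in the Quickstart packs the `add_blank` interleaving into one line. Spelled out as a named function (the name `add_blank` and the sample ids below are illustrative), it shows why `x_lengths` is `2*len(phone_ids)+1`:

```python
import numpy as np

def add_blank(ids, blank_id=0):
    """Interleave `blank_id` around every id: [a, b] -> [0, a, 0, b, 0]."""
    out = [blank_id]
    for i in ids:
        out.extend((i, blank_id))
    return np.array([out], dtype=np.int64)    # shape (1, 2*len(ids) + 1)

phone_ids = [12, 7, 31]                        # made-up ids, for illustration only
x = add_blank(phone_ids)
print(x.tolist())                              # [[0, 12, 0, 7, 0, 31, 0]]
print(x.shape[1] == 2 * len(phone_ids) + 1)    # True
```

The same interleaving is applied to the phone, tone, and language id sequences so that the three inputs stay aligned position by position.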

PrimeTTS v1 (`v1b_16k/`, `v1b_8k/`) — tiny, CPU-only

FastSpeech-style acoustic model (no attention: depthwise gated Conv-FFN + external durations + length regulator + BiGRU + postnet) with a 97K-param frame-pitch refiner that turns per-phoneme pitch into the per-frame F0 contour that carries the Mandarin tones (ablating it costs +18% relative zh-CER), and a Snake-HiFiGAN vocoder. Torch-free ONNX; runs real-time on a Jetson Nano CPU. One young-female zh-TW voice across zh / en / code-mix.

	flagship `v1b_16k/`	leanest `v1b_8k/`
Params	~5.0M (3.56M acoustic + 1.43M vocoder)	4.09M (3.56M acoustic + 0.53M vocoder)
Sample rate	16 kHz (0–8 kHz band)	8 kHz (telephone band)
Jetson Nano RTF	(heavier)	0.35 (1 thread)
`.gguf` (ggml)	—	`inflect_combined_v1b.gguf`

Pipeline is encoder → numpy length-regulator → decoder → vocoder:

```python
import sys; sys.path.insert(0, "PrimeTTS/scripts")
import json, numpy as np, onnxruntime as ort, soundfile as sf
import frontend_bopomofo as F
from synth_from_text import host_regulate

D = "PrimeTTS/v1b_16k"                               # or v1b_8k for the leanest Nano RTF
meta = json.load(open(f"{D}/meta.json"))
enc = ort.InferenceSession(f"{D}/acoustic_encoder.onnx", providers=["CPUExecutionProvider"])
dec = ort.InferenceSession(f"{D}/acoustic_decoder.onnx", providers=["CPUExecutionProvider"])
voc = ort.InferenceSession(f"{D}/vocoder.onnx",          providers=["CPUExecutionProvider"])
o = F.text_to_ids("您好,歡迎使用 PrimeTTS。Thank you for calling.")
ph, tn, lg = (np.array([o[k]], np.int64) for k in ("phone_ids", "tone_ids", "lang_ids"))
cond, dur, pitch = enc.run(None, {"phone": ph, "tone": tn, "lang": lg, "speaker": np.zeros(1, np.int64)})
reg = host_regulate(cond, dur, pitch, meta["abs_frame_bins"], meta["max_frames"])
mel = dec.run(None, {k: reg[k] for k in ["frames","frame_meta","local_ctx_raw","abs_pos","pitch_frame","frame_mask"]})[0]
wav = voc.run(None, {"mel": mel.astype(np.float32)})[0].reshape(-1)
sf.write("out.wav", wav, meta["sample_rate"])
```
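The length-regulator step between encoder and decoder is, at its core, a repeat of each phone's features by its duration in frames. The sketch below shows only that core idea; it is not the repo's `host_regulate`, which also builds the `frame_meta`, `abs_pos`, `pitch_frame`, and `frame_mask` inputs. The function name, argument shapes, and rounding rule are assumptions made for illustration.

```python
import numpy as np

def length_regulate(cond, durations, max_frames=None):
    """Repeat each phone's feature vector `durations[i]` times along the time axis.

    cond:      (T_phone, C) per-phone features
    durations: (T_phone,) frame counts (rounded and clamped to >= 0 here)
    returns:   (T_frame, C) frame-level features and a (T_frame,) phone index
    """
    d = np.maximum(np.rint(np.asarray(durations)).astype(np.int64), 0)
    phone_index = np.repeat(np.arange(len(d)), d)     # which phone owns each frame
    if max_frames is not None:
        phone_index = phone_index[:max_frames]        # truncate over-long utterances
    return np.asarray(cond)[phone_index], phone_index

cond = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
frames, idx = length_regulate(cond, [2, 0, 3])
print(idx.tolist())      # [0, 0, 2, 2, 2]
print(frames.shape)      # (5, 2)
```

Doing this step in numpy on the host keeps the data-dependent output length out of the ONNX graphs, so the encoder and decoder can be exported separately.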

scripts/synth_long.py adds punctuation auto-chunking for long text.
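As a rough illustration of punctuation-based chunking, here is a minimal sketch. It is not the code in `scripts/synth_long.py`; the function name `chunk_text`, the punctuation set, and the `max_chars` budget are all assumptions.

```python
import re

# Split after full-width punctuation, or after ASCII punctuation followed by whitespace
# (so decimals such as 3.5 and email addresses are not cut in the middle).
_SPLIT = re.compile(r"(?<=[。!?;,])|(?<=[.!?;,])(?=\s)")

def chunk_text(text, max_chars=40):
    """Split at punctuation, then greedily pack pieces into chunks of <= max_chars.

    A single piece longer than max_chars is kept whole rather than cut mid-clause.
    """
    pieces = [p for p in _SPLIT.split(text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        if current.strip() and len(current) + len(piece) > max_chars:
            chunks.append(current.strip())
            current = piece
        else:
            current += piece
    if current.strip():
        chunks.append(current.strip())
    return chunks

for chunk in chunk_text("您好,歡迎使用 PrimeTTS。Thank you for calling. Please hold.", max_chars=20):
    print(chunk)
```

Each chunk would then be synthesized on its own and the resulting waveforms concatenated, which keeps every utterance within the model's frame limit.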

Shared frontend

g2pw (Taiwan bopomofo + polyphone disambiguation) + g2p_en (arpabet) merge into one phone sequence with per-phone language ids — zh / en / code-mix in a single pass, 88-symbol table. Entity normalization (scripts/text_norm.py) reads numbers / dates / prices / emails / addresses / serials and spells acronyms (VIP → V-I-P), applied identically in training and inference. Both model families consume the same frontend_bopomofo.text_to_ids() output (phone / tone / lang ids).
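To make the acronym rule concrete (VIP → V-I-P), here is a small sketch. It is not the implementation in `scripts/text_norm.py`; the function name, the 2 to 5 letter limit, and the regex are assumptions chosen so that acronyms embedded in Chinese text are caught while mixed-case names such as PrimeTTS are left alone.

```python
import re

# Runs of 2-5 capital letters that are not part of a longer ASCII word.
# ASCII lookarounds are used instead of \b because CJK characters count as
# word characters in Python regexes, so \b would miss "的VIP會員".
_ACRONYM = re.compile(r"(?<![A-Za-z])[A-Z]{2,5}(?![A-Za-z])")

def spell_acronyms(text):
    """Rewrite bare acronyms letter by letter, e.g. VIP -> V-I-P."""
    return _ACRONYM.sub(lambda m: "-".join(m.group()), text)

print(spell_acronyms("您的VIP會員已開通"))     # 您的V-I-P會員已開通
print(spell_acronyms("Welcome to PrimeTTS"))  # Welcome to PrimeTTS
```

Whatever the exact rule, running the same normalizer at training and inference time is what keeps the model from seeing text forms it was never trained on.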

Reproduce from this repo

Everything needed to rebuild both models is here: the frontend, entity normalizer, aligner, corpus-gen and text-selection scripts, the eval sets + scorer, the export scripts, and the v1 trainer.

scripts/    frontend_bopomofo.py · text_norm.py · align_durations_v4.py · build_corpus_v3.py
            gen_codemix*.py · gen_entity_texts.py · select_diverse_text.py · asr_filter.py
            synth_from_text.py · synth_long.py · export_8k.py · export_onnx_primetts_v21.py
            xasr_offline.py · assess_big.py · rebuild_voice.sh · symbol_table.json
data/       codemix_v2.txt · entity_texts.jsonl · voxcpm_texts.jsonl   (corpus text sources)
eval_big.jsonl · eval_entity.jsonl                                     (held-out eval sets)
inflect_nano/  the v1 trainer (acoustic.py + vocoder.py), forked from Inflect-Nano-v1
configs/    zhtw_mb_istft_16k_v21b.json                                (v2.1 3-voice training config)

Common recipe (both models): teacher corpus → ASR/CER gate → phone-level align → train → export. The three levers that matter for a tiny model: phone-level alignment (espeak phoneme-CTC + torchaudio.forced_align — sub-syllable boundaries separate speech from fluent babble), broad coverage + diverse code-mix, and the teacher (a student's language is only as good as its teacher's).
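The ASR/CER gate in this recipe reduces to a character-level edit distance between the prompt text and the ASR transcript of the synthesized clip. A minimal sketch follows; the function names are illustrative, the whitespace-only normalization is an assumption (the repo's scorer applies its own normalization), and the 0.05 default mirrors the threshold this README quotes for the v2.1 teacher gate.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programme)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution or match
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edits needed / reference length (whitespace ignored)."""
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    if not ref:
        raise ValueError("reference text is empty")
    return edit_distance(ref, hyp) / len(ref)

def passes_gate(reference, hypothesis, threshold=0.05):
    """Keep a teacher clip only if its transcript is close enough to the prompt."""
    return cer(reference, hypothesis) < threshold

print(cer("歡迎使用", "歡迎使用"))            # 0.0
print(passes_gate("歡迎使用", "歡迎是用"))    # False (1 of 4 characters wrong)
```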

v1 (inflect_nano/ trainer, all in-repo):

Generate corpus text — scripts/gen_codemix_v2.py, gen_entity_texts.py, select_diverse_text.py.
Synthesize with the teacher (VoxCPM2 cloning a CC0 zh-TW reference), gate with asr_filter.py.
Align — scripts/align_durations_v4.py. Train acoustic + vocoder (inflect_nano/). Export — scripts/export_8k.py.
One-shot: scripts/rebuild_voice.sh (swap in your own ~10 s reference clip).

v2.1 (MB-iSTFT-VITS; trainer is the upstream repo — see credits):

Synthesize the corpus with a VibeVoice-Large teacher across the 3 zh-capable voices.
CER-gate the teacher audio (X-ASR normalized CER < 0.05) — not voice-similarity — so only intelligible clips train the model. (This is the single most important QC step; ungated multi-voice teacher audio is the main failure mode.)
Train the 3-speaker MB-iSTFT-VITS (configs/zhtw_mb_istft_16k_v21b.json, n_speakers=3, gin_channels=256), warm-started from the single-voice v2 with fresh speaker-conditioning layers.
Export to ONNX — scripts/export_onnx_primetts_v21.py (opset 17, dynamo=False; the tiny gen-head iSTFT n_fft=16, hop=4 is replaced by an exact irFFT + overlap-add matrix, verified vs torch.istft).
Score — scripts/xasr_offline.py + assess_big.py on eval_big.jsonl.
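The export step replaces the small iSTFT (n_fft=16, hop=4) with matrix operations. The numpy sketch below illustrates the idea under stated assumptions: it is not the code in `scripts/export_onnx_primetts_v21.py`, the periodic Hann window and the squared-window normalization are assumptions, and the check is against a plain numpy round trip rather than `torch.istft`.

```python
import numpy as np

N_FFT, HOP = 16, 4                     # values quoted in the export step

def irfft_basis(n_fft=N_FFT):
    """Return (cos_basis, sin_basis), each (n_fft//2 + 1, n_fft), such that
    irfft(re + 1j*im) == re @ cos_basis + im @ sin_basis for every frame."""
    eye = np.eye(n_fft // 2 + 1)
    cos_basis = np.fft.irfft(eye, n=n_fft, axis=-1)        # response to unit real bins
    sin_basis = np.fft.irfft(1j * eye, n=n_fft, axis=-1)   # response to unit imaginary bins
    return cos_basis, sin_basis

def istft_matmul(re, im, window, hop=HOP):
    """Inverse STFT from real/imag parts of shape (frames, bins) using only matmuls and adds."""
    cos_basis, sin_basis = irfft_basis(len(window))
    frames = (re @ cos_basis + im @ sin_basis) * window    # (frames, n_fft)
    n_frames, n_fft = frames.shape
    out = np.zeros(hop * (n_frames - 1) + n_fft)
    norm = np.zeros_like(out)
    for t in range(n_frames):                               # overlap-add
        out[t * hop : t * hop + n_fft] += frames[t]
        norm[t * hop : t * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)

# Round trip: analyse a random signal, resynthesise, compare away from the edges.
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
window = np.hanning(N_FFT + 1)[:-1]                         # periodic Hann (an assumption)
starts = range(0, len(x) - N_FFT + 1, HOP)
spec = np.fft.rfft(np.stack([x[s:s + N_FFT] * window for s in starts]), axis=-1)
y = istft_matmul(spec.real, spec.imag, window)
print(np.allclose(y[N_FFT:-N_FFT], x[N_FFT:-N_FFT]))        # True
```

Because the two basis matrices are constants, the inverse transform exports as ordinary matrix multiplications, which sidesteps the need for an FFT operator in the exported graph.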

Findings & lessons (what building tiny on-device zh/en TTS actually taught us)

Transferable lessons from taking this from a babbling 5M model to a shippable family. Full analysis in docs/zh-en-tts-arch-survey-2026.md and docs/streaming-arch-design.md.

A tiny model's quality is bounded by its inputs, not its parameter count. Held-out Mandarin CER fell 0.88 → 0.06 at a fixed ~5M purely from phone-level forced alignment + broad character coverage — no architecture change. Sub-syllable (not character) boundaries are the difference between intelligible speech and fluent babble. Gate on resynth CER, not on how balanced the duration histogram looks.
CER-gate the teacher audio, never voice-similarity alone. Our first multi-speaker attempt trained on teacher clips filtered only for the right voice; four of the "voices" were speakers that can't actually pronounce Mandarin (teacher CER 0.45–0.79), and the student faithfully learned garbled speech. Filtering on intelligibility (teacher X-ASR CER < 0.05) fixed it.
Deterministic (FastSpeech-class) models mean-regress prosody; distributional (VITS/flow) models don't. This is the wall that caps a tiny deterministic model at "intelligible but flat" — and why the flagship is a VITS, not a bigger FastSpeech.
On a launch-bound GPU (Maxwell sm_53, no CUDA-graph replay), RTF is set by kernel count, not FLOPs. A smaller VITS is a smaller download but not faster (~0.42 RTF floor regardless of params). The lever for speed is an architecture with fewer, larger kernels (flow-matching + Vocos measured ~0.18) — a different axis from size.
On an ARMv8.0 CPU (Cortex-A57): fp32 is the fast format. No int8 dot-product and no fp16 arithmetic, so int8 either breaks the voice (static) or runs slower than fp32 (dynamic), fp16 casts to fp32, and XNNPACK ≈ MLAS. The only CPU speed lever is a smaller/faster architecture — quantization is a download-size option.
"Lighter" and "faster" are different goals. VITS deploy size is dominated by flow + decoder + encoder, which don't shrink with hidden_channels; below ~17M deploy, quality craters. V2 Lite (17.5M) is the practical quality floor for this arch — there is no free "smaller and still good."

Credits & licenses

v2.1 architecture: MB-iSTFT-VITS (Kawamura et al., Apache-2.0) · Jetson-Nano ggml-CUDA runtime: RapidSpeech.cpp (mbistft-vits arch)
v2.1 teacher: VibeVoice-Large (Microsoft, MIT), 3 zh-capable presets (via the MIT community repo) — synthesized / AI-generated voices; mark as such in products
v1 base / trainer: owensong/Inflect-Nano-v1 (Apache-2.0) · v1 teacher: openbmb/VoxCPM2 · v1 reference voice: Mozilla Common Voice zh-TW (CC0 / public domain)
Gate ASR: Breeze-ASR-25 (MediaTek Research) · Whisper · Aligner: facebook/wav2vec2-lv-60-espeak-cv-ft + torchaudio.forced_align · Eval: sherpa-onnx X-ASR

This repository: Apache-2.0.
