BitCPM-8B - Apple Core AI (1.58-bit ternary, on-device)
The zoo's first 1.58-bit ternary LLM and first sub-int8 packed-GEMM Metal kernel, running
fully on-device on iPhone (and Mac) through Apple Core AI (.aimodel / .aimodelc, iOS 27 /
macOS 27).
An 8B model whose transformer weights are just {-1, 0, +1}, so it generates at a 3–4B-class footprint and speed while keeping 8B-class quality.
- Base: openbmb/BitCPM-CANN-8B (MiniCPM4-8B architecture, quantization-aware trained to ternary), Apache-2.0.
- Zoo + conversion code: https://github.com/john-rocky/coreai-model-zoo
On-device (iPhone 17 Pro, A19 Pro; CoreAIChat pipelined GPU engine, greedy)
| bundle | decode | prefill | resident | load |
|---|---|---|---|---|
| gpu-pipelined/ (AOT h18p) | 17 tok/s | 13 tok/s | ~2.1 GB | 9 s cold |
Headroom is ~4.3 GB, with no jetsam terminations. An int4 8B would need ~5–6 GB resident; the ternary weight stream is the win. On M4 Max the same graph decodes at 62.7 tok/s and is token-identical to the torch ternary reference (3/3 probe prompts, greedy).
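As a rough sanity check on why the weight stream dominates, the sketch below compares weights-only storage for the ternary layout (a 2-bit code plus one fp16 scale per 256-weight block, as described under "What's ternary here") against a hypothetical int4 baseline. The nominal 8e9 weight count and the int4 group size of 32 are illustrative assumptions, not figures from this model card. The result ignores the higher-precision embedding and LM head, the KV cache, and runtime buffers, so it is not expected to reproduce the resident numbers in the table.

```swift
// Back-of-envelope weight footprint. All inputs are illustrative assumptions.
let nominalWeights = 8.0e9   // assumed: nominal "8B" parameter count

// Ternary storage: 2-bit code per weight + one fp16 scale per 256-weight block.
let ternaryBitsPerWeight = 2.0 + 16.0 / 256.0

// Hypothetical int4 baseline: 4-bit code + one fp16 scale per 32-weight group.
let int4BitsPerWeight = 4.0 + 16.0 / 32.0

// Convert a weight count and storage cost into decimal gigabytes.
func gigabytes(_ weights: Double, bitsPerWeight: Double) -> Double {
    return weights * bitsPerWeight / 8.0 / 1.0e9
}

let ternaryGB = gigabytes(nominalWeights, bitsPerWeight: ternaryBitsPerWeight)
let int4GB = gigabytes(nominalWeights, bitsPerWeight: int4BitsPerWeight)
print("ternary: \(ternaryBitsPerWeight) bits/weight, \(ternaryGB) GB")
print("int4:    \(int4BitsPerWeight) bits/weight, \(int4GB) GB")
```

Under these assumptions the ternary weights take a little under half the bytes of the int4 baseline, which is also roughly the ratio of bytes that must be read per decoded token.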
What's ternary here
BitCPM-CANN-8B ships its ternary weights as TQ2_0 (2 bits/weight): per 256-element block along
the reduction axis, each weight is a code in {-1, 0, +1} times one fp16 scale. ("1.58-bit" is the
information content of a three-valued code, log2(3) ≈ 1.58; the stored form uses 2 bits.) The 224
transformer linears (q/k/v/o + gate/up/down × 32 layers) run a custom Metal kernel that packs 16
ternary codes into one uint32 and does a sign-add/subtract matvec with the per-block scale, with no
codebook. The embedding (Q4_K) and LM head (Q6_K) stay higher-precision. OpenBMB reports
95.7–97.2% of full-precision quality retained.
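The packing and the sign-add/subtract dot product can be illustrated on the CPU. This is a minimal sketch, not the zoo's Metal kernel: the function names `packTernary` and `ternaryDot` are hypothetical, the 2-bit field encoding (0 for -1, 1 for 0, 2 for +1) and the little-endian field order within a word are assumptions, and scales are held as `Float` rather than fp16.

```swift
// Pack ternary codes (-1, 0, +1) into 2-bit fields, 16 codes per UInt32.
// Assumed field encoding for illustration: 0 -> -1, 1 -> 0, 2 -> +1.
func packTernary(_ codes: [Int8]) -> [UInt32] {
    precondition(codes.count % 16 == 0, "length must be a multiple of 16")
    var words = [UInt32](repeating: 0, count: codes.count / 16)
    for (i, c) in codes.enumerated() {
        precondition(c >= -1 && c <= 1, "codes must be ternary")
        let field = UInt32(Int(c) + 1)                 // 0, 1 or 2
        words[i / 16] |= field << UInt32(2 * (i % 16))
    }
    return words
}

// Dot product of one packed weight row with an activation vector.
// `scales` holds one scale per 256-weight block along the reduction axis.
func ternaryDot(words: [UInt32], scales: [Float], x: [Float]) -> Float {
    let blockSize = 256
    precondition(x.count == words.count * 16)
    precondition(x.count == scales.count * blockSize)
    var total: Float = 0
    for b in 0..<scales.count {
        var acc: Float = 0
        for i in (b * blockSize)..<((b + 1) * blockSize) {
            let field = (words[i / 16] >> UInt32(2 * (i % 16))) & 0b11
            if field == 2 {
                acc += x[i]                            // code +1: add
            } else if field == 0 {
                acc -= x[i]                            // code -1: subtract
            }                                          // code 0: skip
        }
        total += scales[b] * acc                       // one multiply per block
    }
    return total
}

// Usage: one 256-weight block with three non-zero codes.
var codes = [Int8](repeating: 0, count: 256)
codes[0] = 1
codes[1] = -1
codes[17] = 1
var x = [Float](repeating: 0, count: 256)
x[0] = 3
x[1] = 2
x[5] = 100   // meets a zero code, so it contributes nothing
x[17] = 4
let packed = packTernary(codes)
let y = ternaryDot(words: packed, scales: [0.5], x: x)
print(y)     // 0.5 * (3 - 2 + 4)
```

The inner loop contains no weight multiplications: each code either adds, subtracts, or skips an activation, and the only multiply is the per-block scale. A full matvec applies `ternaryDot` once per output row.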
Run
In the zoo's CoreAIChat app (Model → "BitCPM-8B 1.58bit"), or via Foundation Models:
import FoundationModels
import CoreAILanguageModels
let model = try await CoreAILanguageModel(resourcesAt: bundleURL) // gpu-pipelined/ AOT h18p
let session = LanguageModelSession(model: model)
print(try await session.respond(to: "The capital of France is")) // -> "Paris."
This is a decode-only static bundle: set COREAI_CHUNK_THRESHOLD=1 so that prefill runs as pipelined
S=1 steps. The chat format is ChatML, with <|im_end|> (token id 73440) as the EOS token. The
gpu-pipelined/ bundle is AOT-compiled for the h18p GPU
(xcrun coreai-build compile … --preferred-compute gpu --architecture h18p); the custom-Metal-kernel
graph survives AOT compilation, with outputs bit-identical to the source IR. The ANE is not
supported because the kernel is GPU-only.
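For callers that need to build the prompt text themselves, the ChatML layout can be sketched as below. The `ChatTurn` type and `chatMLPrompt` function are hypothetical helpers, and this shows only the generic ChatML framing; whether this model expects a default system turn or a BOS token is not stated here, so the tokenizer's own chat template should be treated as authoritative.

```swift
// One chat turn; roles follow the usual ChatML names (system, user, assistant).
struct ChatTurn {
    let role: String
    let content: String
}

// Render turns as ChatML and leave an open assistant turn for generation.
// Generation is expected to stop when the model emits <|im_end|>.
func chatMLPrompt(_ turns: [ChatTurn]) -> String {
    var prompt = ""
    for turn in turns {
        prompt += "<|im_start|>\(turn.role)\n\(turn.content)<|im_end|>\n"
    }
    prompt += "<|im_start|>assistant\n"
    return prompt
}

let prompt = chatMLPrompt([
    ChatTurn(role: "user", content: "What is the capital of France?")
])
print(prompt)
```

A higher-level session API may apply the template itself, in which case this formatting should not be applied a second time.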
License
Apache-2.0, inherited from openbmb/BitCPM-CANN-8B. This repository redistributes a converted Core AI artifact only.