BitCPM-8B → Apple Core AI (1.58-bit ternary, on-device)

The zoo's first 1.58-bit ternary LLM and first sub-int8 packed-GEMM Metal kernel, running fully on-device on iPhone (and Mac) through Apple Core AI (.aimodel / .aimodelc, iOS 27 / macOS 27).

An 8B model whose transformer weights are just {-1, 0, +1} — so it generates at a 3–4B-class footprint and speed while keeping 8B-class quality.

Base: openbmb/BitCPM-CANN-8B (MiniCPM4-8B architecture, quantization-aware trained to ternary), Apache-2.0.
Zoo + conversion code: https://github.com/john-rocky/coreai-model-zoo

On-device (iPhone 17 Pro, A19 Pro — CoreAIChat pipelined GPU engine, greedy)

| bundle | decode | prefill | resident memory | load time |
|---|---|---|---|---|
| `gpu-pipelined/` (AOT h18p) | 17 tok/s | 13 tok/s | ~2.1 GB | 9 s cold |

Memory headroom is ~4.3 GB, with no jetsam terminations. An int4 8B model would need ~5–6 GB resident, so the smaller ternary weight stream is the win. On M4 Max the same graph decodes at 62.7 tok/s and is token-identical to the torch ternary reference (3/3 probe prompts, greedy).
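The footprint gap can be sanity-checked with weight-only arithmetic. The sketch below is a rough estimate, not a measurement: it assumes the TQ2_0 layout described under "What's ternary here" (2-bit codes plus one fp16 scale per 256 weights), a hypothetical int4 baseline with one fp16 scale per 32 weights, and a round 8e9 weight count rather than the model's exact size.

```swift
import Foundation

// Assumed TQ2_0 layout: 2-bit code per weight plus one 16-bit scale per 256 weights.
let ternaryBitsPerWeight = 2.0 + 16.0 / 256.0   // 2.0625
// Hypothetical int4 baseline: 4-bit code per weight plus one 16-bit scale per 32 weights.
let int4BitsPerWeight = 4.0 + 16.0 / 32.0       // 4.5
// Round illustrative weight count, not the exact parameter count of the model.
let weightCount = 8.0e9

/// Weight-only storage in decimal gigabytes for a given bits-per-weight figure.
func gigabytes(_ bitsPerWeight: Double) -> Double {
    weightCount * bitsPerWeight / 8.0 / 1.0e9
}

print(log2(3.0))                        // information content of one ternary weight, about 1.58 bits
print(gigabytes(ternaryBitsPerWeight))  // about 2.06 GB
print(gigabytes(int4BitsPerWeight))     // about 4.5 GB
```

These are weight-only figures. Resident memory also covers the KV cache, activations, and the higher-precision embedding and LM head, so the numbers will not line up exactly with the measured values above.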

What's ternary here

BitCPM-CANN-8B ships its ternary weights as TQ2_0 (2 bits/weight): within each 256-element block along the reduction axis, every weight is a code in {-1, 0, +1} multiplied by one fp16 scale shared by the block. The 224 transformer linears (q/k/v/o + gate/up/down, 7 per layer × 32 layers) run a custom Metal kernel that packs 16 ternary codes into one uint32 and computes the matvec by adding or subtracting activations according to each code's sign, with the per-block scale applied; no codebook lookup is involved. The embedding (Q4_K) and LM head (Q6_K) stay at higher precision. Quality retained vs full precision: 95.7–97.2% (OpenBMB).
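To make the packing and the sign-add/subtract idea concrete, here is a minimal CPU reference sketch in Swift. It is not the zoo's Metal kernel: the function names, the bit encoding (stored value = code + 1, code i at bits 2i..2i+1), and the use of `Float` instead of fp16 for scales are all assumptions chosen for illustration.

```swift
// Hypothetical CPU reference for illustration only, not the zoo's Metal kernel.
let codesPerWord = 16   // 16 two-bit codes fill one UInt32
let blockSize = 256     // weights that share one scale

/// Packs ternary codes (-1, 0, +1) into UInt32 words, 2 bits per code.
/// Assumed encoding: stored value = code + 1, code i of a word at bits 2i..2i+1.
func packTernary(_ codes: [Int8]) -> [UInt32] {
    precondition(codes.count % codesPerWord == 0)
    var words = [UInt32](repeating: 0, count: codes.count / codesPerWord)
    for (i, c) in codes.enumerated() {
        precondition((-1...1).contains(c))
        words[i / codesPerWord] |= UInt32(c + 1) << (2 * (i % codesPerWord))
    }
    return words
}

/// Dot product of one packed ternary row with x: add x[j] for +1,
/// subtract it for -1, skip 0, and multiply each block's sum by its scale.
func ternaryDot(words: [UInt32], scales: [Float], x: [Float]) -> Float {
    precondition(x.count % blockSize == 0)
    precondition(words.count * codesPerWord == x.count)
    precondition(scales.count == x.count / blockSize)
    let wordsPerBlock = blockSize / codesPerWord
    var total: Float = 0
    for b in 0..<scales.count {
        var acc: Float = 0
        for w in 0..<wordsPerBlock {
            let word = words[b * wordsPerBlock + w]
            let base = b * blockSize + w * codesPerWord
            for k in 0..<codesPerWord {
                switch (word >> (2 * k)) & 0b11 {
                case 2: acc += x[base + k]   // code +1
                case 0: acc -= x[base + k]   // code -1
                default: break               // code 0 contributes nothing
                }
            }
        }
        total += scales[b] * acc             // one multiply per 256 weights
    }
    return total
}

// One 256-weight block with three non-zero codes.
var codes = [Int8](repeating: 0, count: blockSize)
codes[0] = 1; codes[1] = -1; codes[17] = 1
var x = [Float](repeating: 0, count: blockSize)
x[0] = 3; x[1] = 5; x[17] = 4
let y = ternaryDot(words: packTernary(codes), scales: [0.5], x: x)
print(y)   // (3 - 5 + 4) * 0.5 = 1.0
```

The inner loop contains no multiplications; the only multiply is the per-block scale. A GPU kernel would presumably vectorize the unpacking, but the arithmetic per block would be the same under these assumptions.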

Run

In the zoo's CoreAIChat app (Model → "BitCPM-8B 1.58bit"), or via Foundation Models:

```swift
import FoundationModels
import CoreAILanguageModels

let model = try await CoreAILanguageModel(resourcesAt: bundleURL)   // gpu-pipelined/ AOT h18p
let session = LanguageModelSession(model: model)
print(try await session.respond(to: "The capital of France is"))   // -> "Paris."
```

This is a decode-only static bundle: set COREAI_CHUNK_THRESHOLD=1 so that prefill runs as pipelined S=1 steps. The chat template is ChatML, and the EOS token is <|im_end|> (id 73440).

The gpu-pipelined/ bundle is AOT-compiled for the h18p GPU (xcrun coreai-build compile … --preferred-compute gpu --architecture h18p). A graph containing a custom Metal kernel survives AOT compilation, with outputs bit-identical to the source IR. ANE is not supported (GPU-only kernel).
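For readers driving the model below the session API, the ChatML framing and EOS handling mentioned above could look roughly like the following. This is a sketch under assumptions: `ChatTurn`, `chatMLPrompt`, and `truncateAtEOS` are hypothetical helpers, the layout follows the generic ChatML convention rather than the model's exact shipped template, and the source does not say whether `LanguageModelSession` already applies the template itself.

```swift
/// One chat turn. Roles follow the usual ChatML convention.
struct ChatTurn {
    let role: String      // "system", "user", or "assistant"
    let content: String
}

/// Renders turns in generic ChatML and opens an assistant turn for generation.
/// The model's real template may differ in details (assumption).
func chatMLPrompt(_ turns: [ChatTurn]) -> String {
    var prompt = ""
    for turn in turns {
        prompt += "<|im_start|>\(turn.role)\n\(turn.content)<|im_end|>\n"
    }
    return prompt + "<|im_start|>assistant\n"
}

/// Token id the model card gives for <|im_end|>.
let eosTokenID = 73440

/// Drops the EOS token and everything generated after it.
func truncateAtEOS(_ ids: [Int]) -> [Int] {
    Array(ids.prefix { $0 != eosTokenID })
}

let prompt = chatMLPrompt([ChatTurn(role: "user", content: "The capital of France is")])
print(prompt)
```

If the session API handles templating internally, only the plain user string should be passed to `respond(to:)` and a helper like this is unnecessary.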

License

Apache-2.0, inherited from openbmb/BitCPM-CANN-8B. This repository redistributes a converted Core AI artifact only.
