BitCPM-8B → Apple Core AI (1.58-bit ternary, on-device)

The zoo's first 1.58-bit ternary LLM and first sub-int8 packed-GEMM Metal kernel, running fully on-device on iPhone (and Mac) through Apple Core AI (.aimodel / .aimodelc, iOS 27 / macOS 27).

An 8B model whose transformer weights are just {-1, 0, +1}, so it generates at a 3–4B-class footprint and speed while keeping 8B-class quality.

On-device (iPhone 17 Pro, A19 Pro; CoreAIChat pipelined GPU engine, greedy)

| bundle | decode | prefill | resident | load |
|---|---|---|---|---|
| gpu-pipelined/ (AOT h18p) | 17 tok/s | 13 tok/s | ~2.1 GB | 9 s cold |

Headroom is ~4.3 GB, with no jetsam terminations. An int4 8B model would need ~5–6 GB resident; the smaller ternary weight stream is the win. On M4 Max the same graph decodes at 62.7 tok/s and is token-identical to the torch ternary reference (3/3 probe prompts, greedy).
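As a rough, weights-only illustration of why the ternary stream is smaller, the sketch below computes effective bits per weight for a block format that stores one scale per block. It uses the layout described under "What's ternary here" (2-bit codes plus one fp16 scale per 256 weights). Several inputs are assumptions made for illustration: a nominal 8e9 parameter count, every weight treated as if it used the same format (the higher-precision embedding and LM head are ignored), and an int4 layout with one fp16 scale per 32 weights. The helper names are hypothetical. KV cache, activations, and runtime overhead are not modeled, so these numbers are not resident-memory figures.

```swift
/// Effective bits per weight when each block of `blockSize` weights stores
/// `codeBits` per weight plus one shared scale of `scaleBits`.
func effectiveBitsPerWeight(codeBits: Double, scaleBits: Double, blockSize: Double) -> Double {
    return codeBits + scaleBits / blockSize
}

/// Weight storage in gigabytes (10^9 bytes) for `parameters` weights.
func weightGigabytes(parameters: Double, bitsPerWeight: Double) -> Double {
    return parameters * bitsPerWeight / 8 / 1e9
}

// Assumed nominal parameter count; the exact count is not stated in this card.
let nominalParameters = 8e9

// Ternary layout from this card: 2-bit codes, one fp16 scale per 256 weights.
let ternaryBits = effectiveBitsPerWeight(codeBits: 2, scaleBits: 16, blockSize: 256)

// Assumed int4 layout for comparison: 4-bit codes, one fp16 scale per 32 weights.
let int4Bits = effectiveBitsPerWeight(codeBits: 4, scaleBits: 16, blockSize: 32)

let ternaryGB = weightGigabytes(parameters: nominalParameters, bitsPerWeight: ternaryBits)
let int4GB = weightGigabytes(parameters: nominalParameters, bitsPerWeight: int4Bits)
```

Under these assumptions the ternary format comes to 2.0625 bits per weight against 4.5 for the assumed int4 layout, so each decoded token streams less than half as many weight bytes.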

What's ternary here

BitCPM-CANN-8B ships its ternary weights as TQ2_0 (2 bits/weight): per 256-element block along the reduction axis, each weight is a code in {-1, 0, +1} times one fp16 scale. The 224 transformer linears (q/k/v/o + gate/up/down × 32 layers) run a custom Metal kernel that packs 16 ternary codes into one uint32 and does a sign-add/subtract matvec with the per-block scale, with no codebook. The embedding (Q4_K) and LM head (Q6_K) stay higher-precision. Quality retained vs full precision: 95.7–97.2% (OpenBMB).
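The packing and the sign-add/subtract accumulation can be sketched on the CPU in plain Swift. This is an illustration of the idea, not the shipped Metal kernel: the 2-bit code assignment (0 for -1, 1 for 0, 2 for +1), the lane order inside the word, and the function names are all assumptions, and the real TQ2_0 bit layout may differ.

```swift
// Sketch only: code assignment, lane order, and names are assumptions,
// not the layout used by the shipped kernel.

/// Packs 16 ternary weights (each -1, 0, or +1) into one UInt32, 2 bits per weight.
func packTernary16(_ weights: [Int8]) -> UInt32 {
    precondition(weights.count == 16, "expected exactly 16 weights")
    var word: UInt32 = 0
    for (i, w) in weights.enumerated() {
        precondition(w >= -1 && w <= 1, "weights must be ternary")
        let code = UInt32(Int(w) + 1)   // -1, 0, +1 -> 0, 1, 2
        word |= code << (2 * i)         // lane i occupies bits 2i and 2i+1
    }
    return word
}

/// Dot product of one block of packed ternary weights with activations `x`.
/// The inner loop only adds or subtracts; the block scale is applied once at the end.
/// A 256-element block corresponds to 16 packed words.
func ternaryBlockDot(packed: [UInt32], scale: Float, x: [Float]) -> Float {
    precondition(packed.count * 16 == x.count, "16 weights per packed word")
    var acc: Float = 0
    for (wordIndex, word) in packed.enumerated() {
        for lane in 0..<16 {
            let code = (word >> (2 * lane)) & 0b11
            let xi = x[wordIndex * 16 + lane]
            if code == 2 {
                acc += xi               // weight +1
            } else if code == 0 {
                acc -= xi               // weight -1
            }
            // code == 1 is weight 0 and contributes nothing
        }
    }
    return acc * scale
}
```

For example, weights [+1, -1, 0, +1] followed by twelve zeros, activations [1, 2, 3, 4] followed by twelve zeros, and a scale of 0.5 give (1 - 2 + 4) × 0.5 = 1.5. Because no multiply happens per weight, the work per block is a run of adds and subtracts followed by one multiply by the scale.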

Run

In the zoo's CoreAIChat app (Model → "BitCPM-8B 1.58bit"), or via Foundation Models:

import FoundationModels
import CoreAILanguageModels
let model = try await CoreAILanguageModel(resourcesAt: bundleURL)   // gpu-pipelined/ AOT h18p
let session = LanguageModelSession(model: model)
print(try await session.respond(to: "The capital of France is"))   // -> "Paris."

Decode-only static bundle: set COREAI_CHUNK_THRESHOLD=1 (prefill runs as pipelined S=1 steps). The chat format is ChatML; the EOS token is <|im_end|> (id 73440). The gpu-pipelined/ bundle is AOT-compiled for the h18p GPU (xcrun coreai-build compile … --preferred-compute gpu --architecture h18p); a custom-Metal-kernel graph survives AOT (outputs bit-identical to the source IR). ANE is not supported (GPU-only kernel).
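For callers that build prompts or handle token ids themselves, the ChatML framing and the EOS cutoff can be sketched as below. `ChatMessage`, `chatMLPrompt`, and `trimAtEOS` are hypothetical helper names, not part of CoreAIChat or Foundation Models. Whether the model's own chat template also adds a BOS token or a default system prompt is not stated here, so treat this as the generic ChatML layout only.

```swift
// Sketch only: these helpers are hypothetical and not part of any Apple framework.

struct ChatMessage {
    let role: String      // "system", "user", or "assistant"
    let content: String
}

/// Renders messages in generic ChatML and leaves an open assistant turn
/// for the model to complete.
func chatMLPrompt(_ messages: [ChatMessage]) -> String {
    var prompt = ""
    for message in messages {
        prompt += "<|im_start|>\(message.role)\n\(message.content)<|im_end|>\n"
    }
    prompt += "<|im_start|>assistant\n"
    return prompt
}

/// Token id of <|im_end|>, as stated in this card.
let eosTokenID = 73440

/// Drops the first EOS token and everything generated after it.
func trimAtEOS(_ tokens: [Int]) -> [Int] {
    guard let end = tokens.firstIndex(of: eosTokenID) else { return tokens }
    return Array(tokens[..<end])
}
```

A single user message "Hi" renders as `<|im_start|>user`, a newline, `Hi<|im_end|>`, a newline, then `<|im_start|>assistant` and a final newline.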

License

Apache-2.0, inherited from openbmb/BitCPM-CANN-8B. This repository redistributes a converted Core AI artifact only.
