Pulsar 16B
Table of Contents
- Model Overview
- Key Characteristics
- Quick Start
- Reasoning Control
- Tool Calling
- Training & Fine-Tuning
- Evaluation & Benchmarks
- Languages
- Safety & Limitations
- Model Information
- Citation
Model Overview
Pulsar 16B is a model based on NVIDIA-Nemotron-3-Nano-30B-A3B-BF16, developed by Multiverse Computing. The original model has ~31.6B parameters and is part of the Nemotron model family. It supports long-context inference up to 1M tokens and is designed for general-purpose language modeling tasks.
This version applies model compression techniques to significantly reduce parameter count and deployment requirements while maintaining compatibility with the Nemotron Hybrid Mamba2-Transformer with MoE architecture. The resulting model achieves approximately 50% compression, reducing the parameter count from ~31.6B to 16.15B and lowering memory requirements.
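As a rough check on the figures above, the following sketch computes the parameter reduction and a weights-only memory estimate. It assumes 2 bytes per parameter for BF16 and ignores activations, cache state, and runtime overhead, so real deployment memory will be higher. The helper names are illustrative.

```python
# Parameter counts taken from the model overview (in billions).
BASE_PARAMS_B = 31.6
PULSAR_PARAMS_B = 16.15

# Assumption: BF16 stores each parameter in 2 bytes.
BYTES_PER_PARAM_BF16 = 2


def reduction_pct(base_b: float, compressed_b: float) -> float:
    """Percentage of parameters removed by compression."""
    return (1.0 - compressed_b / base_b) * 100.0


def weight_memory_gb(params_b: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_b * bytes_per_param


print(f"Parameter reduction: {reduction_pct(BASE_PARAMS_B, PULSAR_PARAMS_B):.1f}%")
print(f"Base weights (BF16): ~{weight_memory_gb(BASE_PARAMS_B):.1f} GB")
print(f"Pulsar weights (BF16): ~{weight_memory_gb(PULSAR_PARAMS_B):.1f} GB")
```

Under these assumptions the reduction comes out just under 49%, which is consistent with the rounded 50% figure.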
Key Characteristics
| Characteristic | Description |
|---|---|
| Base model | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16. 31.6B total parameters, 3.6B activated per forward pass (11.34% activation ratio). NVIDIA Open Model License. |
| Pulsar-16B-BF16 (this model) | 16.15B total parameters, 3.1B activated per forward pass (19.28% activation ratio) after CompactifAI compression. |
| 📐 Architecture | Hybrid Mamba2-Transformer with MoE (same family as the base checkpoint). |
| 🛠️ Tool calling | Yes. Same tool-call structure and format as Nemotron-3-Nano-30B-A3B-BF16. See Tool Calling. |
| 🗜️ Compression | CompactifAI (proprietary compression technology) |
| Primary language | English |
Quick Start
This model can be loaded with the Transformers API. Use trust_remote_code=True. Recommended approach: AutoModelForCausalLM with apply_chat_template. This configuration has been tested with Transformers 4.57.6.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MultiverseComputingCAI/Pulsar-16B-BF16"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda" if torch.cuda.is_available() else "auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Write a haiku about GPUs"},
]
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    tokenized_chat,
    max_new_tokens=1024,
    temperature=1.0,
    top_p=1.0,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0]))
Alternatively you can use the pipeline API with trust_remote_code=True; the pipeline returns the full conversation structure, so extract the assistant message from outputs[0]["generated_text"] as needed.
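The extraction step described above can be sketched as follows. This is a minimal helper, assuming the pipeline returns a list whose first element holds the conversation as a list of role/content dictionaries under generated_text. The function name and the sample data are illustrative; the sample is hand-written and is not real model output.

```python
from typing import Any


def last_assistant_message(outputs: list[dict[str, Any]]) -> str:
    """Return the content of the final assistant turn from a chat pipeline result."""
    conversation = outputs[0]["generated_text"]
    # Walk backwards so the most recent assistant turn wins.
    for message in reversed(conversation):
        if message.get("role") == "assistant":
            return message["content"]
    raise ValueError("no assistant message found in pipeline output")


# Hand-written stand-in for a pipeline result (not real model output).
sample_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "Write a haiku about GPUs"},
            {"role": "assistant", "content": "Silicon rivers flow"},
        ]
    }
]
print(last_assistant_message(sample_outputs))
```

With a real pipeline, the same helper would be applied to the value returned by calling the pipeline on the messages list.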
vLLM Serving
Installation
pip install -U "vllm>=0.12.0"
Reasoning parser (NVIDIA)
Pulsar 16B uses the same Nemotron v3 reasoning tags as the base model. NVIDIA provides the vLLM plugin as nano_v3_reasoning_parser.py on the base Hugging Face repo (not specific to Pulsar). Direct download:
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py
You can keep any local filename; the vllm serve flags below assume the file is in the current directory as nano_v3_reasoning_parser.py. If you mirror an identical copy under the Pulsar model repo, use that URL instead.
Serve
vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
--served-model-name model \
--max-num-seqs 8 \
--tensor-parallel-size 1 \
--port 8000 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3
Note: The NeMo container nvcr.io/nvidia/nemo:25.11.nemotron_3_nano comes with mamba_ssm and causal-conv1d pre-installed.
Thinking (Reasoning) Control
Pulsar 16B supports a hybrid reasoning mode: the model can either think step-by-step before answering (reasoning mode) or reply directly (non-reasoning mode). The behaviour is controlled via the enable_thinking flag in the chat template.
This section provides a brief overview of reasoning control in Pulsar 16B. For comprehensive details please see the official Nemotron-3 Nano-30B model card at: https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b/modelcard
Transformers API
Pass enable_thinking through apply_chat_template:
Thinking ON (default)
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,  # default; can be omitted
)
Thinking OFF
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,
)
When thinking is ON the model opens a <think> block before the answer.
output = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Split on </think> to separate reasoning from the final answer
if "</think>" in output:
    reasoning, answer = output.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "").strip()
    answer = answer.strip()
else:
    # No closing tag found: treat the whole output as the answer
    reasoning = ""
    answer = output.strip()
vLLM
Server-level default
Set the default for all requests at startup with --default-chat-template-kwargs.
Requires recent versions of vLLM.
Thinking OFF for all requests
vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
--served-model-name model \
--reasoning-parser-plugin nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3 \
--trust-request-chat-template \
--default-chat-template-kwargs '{"enable_thinking": false}' \
...
Thinking ON for all requests (default if flag is omitted)
vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
--served-model-name model \
--reasoning-parser-plugin nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3 \
--trust-request-chat-template \
--default-chat-template-kwargs '{"enable_thinking": true}' \
...
Per-request override
--trust-request-chat-template is required to allow per-request overrides.
Individual requests can override the server default by passing chat_template_kwargs in the request body. This works regardless of the server-level default.
Thinking ON/OFF for one request
import requests

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "model",
    "messages": [{"role": "user", "content": "Solve: x² - 5x + 6 = 0"}],
    "max_tokens": 1024,
    "temperature": 1.0,
    "chat_template_kwargs": {"enable_thinking": True},  # set to False to turn thinking off
})
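To make the ON/OFF switch explicit, the request body can be built by a small helper. The sketch below only constructs the JSON payload and does not contact a server. The function name is hypothetical, and the model name matches the --served-model-name value used in the serve commands in this card.

```python
import json


def build_chat_payload(prompt: str, enable_thinking: bool, max_tokens: int = 1024) -> dict:
    """Build an OpenAI-compatible chat request body with a per-request thinking override."""
    return {
        "model": "model",  # matches --served-model-name in the serve command
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 1.0,
        # Forwarded to the chat template; overrides the server-level default.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


thinking_off = build_chat_payload("Solve: x² - 5x + 6 = 0", enable_thinking=False)
print(json.dumps(thinking_off, indent=2, ensure_ascii=False))
```

The resulting dictionary can be passed as the json argument of requests.post against the /v1/chat/completions endpoint.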
Tool Calling
Pulsar 16B emits tool calls in the following format:
<tool_call>
<function=get_weather>
<parameter=city>Paris</parameter>
<parameter=unit>celsius</parameter>
</function>
</tool_call>
When serving (e.g., with vLLM), you must use the qwen3_coder tool parser.
vllm serve <model_path> \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--trust-remote-code
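To illustrate the tag format shown above, here is a minimal regex-based sketch that turns raw model text into dictionaries. It is not the qwen3_coder parser's implementation and makes simplifying assumptions: well-formed tags, no nesting, and parameter values kept as plain strings. The function name is illustrative.

```python
import re

# Patterns for the tag format shown above.
_TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
_FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
_PARAMETER_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract tool calls as {"name": ..., "arguments": {...}} dictionaries.

    Parameter values are kept as stripped strings; no type coercion is attempted.
    """
    calls = []
    for block in _TOOL_CALL_RE.findall(text):
        for name, body in _FUNCTION_RE.findall(block):
            arguments = {
                key: value.strip() for key, value in _PARAMETER_RE.findall(body)
            }
            calls.append({"name": name, "arguments": arguments})
    return calls


raw = """<tool_call>
<function=get_weather>
<parameter=city>Paris</parameter>
<parameter=unit>celsius</parameter>
</function>
</tool_call>"""

print(parse_tool_calls(raw))
```

When serving with vLLM, the tool parser does this work on the server side; a client-side parser like this is mainly useful for inspecting raw Transformers output.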
Training & Fine-Tuning
Base Model: NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
The base model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. The model's reasoning capabilities can be configured through a flag in the chat template. See the original model card for details.
CompactifAI Compression
CompactifAI was applied to produce a smaller, more efficient model (16B parameters) while aiming to preserve reasoning and tool-use capabilities. Supervised fine-tuning (SFT) was also applied to improve capabilities.
Evaluation & Benchmarks
| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B | gpt-oss-20b | Qwen3-14B | Ministral-3-14B-Instruct-2512 |
|---|---|---|---|---|---|
| AIME | 87.66 | 87.22 | 87.66 | 76.00 | 33.00 |
| GPQA | 74.04 | 71.41 | 68.99 | 63.63 | 56.45 |
| IFBench | 72.31 | 70.79 | 68.46 | 39.20 | 32.80 |
| MMLU-Pro | 78.90 | 74.78 | 76.65 | 85.01 | 70.09 |
| LiveCodeBench | 71.11 | 68.04 | 64.65 | 66.35 | 29.84 |
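One way to read the table above is as score retention relative to the uncompressed baseline. The sketch below computes per-benchmark deltas and retention from the Nemotron 3 Nano 30B A3B and Pulsar 16B columns. The retention metric is an illustrative choice, not one reported in this card.

```python
# Scores copied from the benchmark table: (baseline, Pulsar 16B).
SCORES = {
    "AIME": (87.66, 87.22),
    "GPQA": (74.04, 71.41),
    "IFBench": (72.31, 70.79),
    "MMLU-Pro": (78.90, 74.78),
    "LiveCodeBench": (71.11, 68.04),
}


def retention_pct(baseline: float, compressed: float) -> float:
    """Compressed score as a percentage of the baseline score."""
    return compressed / baseline * 100.0


for name, (baseline, pulsar) in SCORES.items():
    delta = pulsar - baseline
    print(f"{name:<14} delta={delta:+.2f}  retention={retention_pct(baseline, pulsar):.1f}%")

mean_retention = sum(retention_pct(b, p) for b, p in SCORES.values()) / len(SCORES)
print(f"Mean retention: {mean_retention:.1f}%")
```

On these five benchmarks the mean retention comes out at roughly 97%, with the largest drop on MMLU-Pro.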
Quantizations
| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B (BF16) | Pulsar 16B (fp8) | Pulsar 16B (nvfp4) |
|---|---|---|---|---|
| AIME | 87.66 | 87.22 | 86.67 | 82.00 |
| GPQA | 74.04 | 71.41 | 70.61 | 71.11 |
| IFBench | 72.31 | 70.79 | 69.60 | 69.90 |
| MMLU-Pro | 78.90 | 74.78 | 74.76 | 74.19 |
| LiveCodeBench | 71.11 | 68.04 | 68.68 | 65.60 |
Performance
- Framework: guidellm
- Inference: vLLM 0.18.0
- GPU: NVIDIA L40S
- Decode: temperature: 0.0, top_p: 1.0
- Measure Window: Each phase lasts 3 minutes (excluding ramp-up and cool-down periods).
- Workload shape: 8k/16k workload as in the original model's card.
Long Context
Pulsar 16B preserves most of the baseline's long-context behavior after compression: it matches the Nemotron-3-Nano-30B-A3B baseline on NIAH and stays within about 4.5 points of it on the other long-context benchmarks in the table below. Results are reported for LongBench v1, AA-LCR, NIAH, and RULER at context lengths up to 256k.
| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B |
|---|---|---|
| LongBench v1 | 31.84 | 29.84 |
| AA-LCR | 33.67 | 29.33 |
| NIAH (@100K) | 100.00 | 100.00 |
| RULER (@128K) | 95.02 | 94.20 |
| RULER (@256K) | 92.02 | 87.74 |
Evaluation Methodology
Benchmark scores were obtained with the following setups. Methodology varies by benchmark family.
Inference:
- Backend: vLLM 0.18.0
- Nemotron models: temp: 1.0, top_p: 1.0
- GPT-OSS-20B: temp: 1.0, top_p: 1.0, reasoning_effort: high
- Qwen3-14B: temp: 0.6, top_p: 0.95, top_k: 20, min_p: 0.0
- Ministral-3-14B-Instruct-2512: temp: 0.15
| Benchmark | Framework | Repeats | Other |
|---|---|---|---|
| MMLU-Pro | NeMo-Skills | 1 | |
| AIME25 | NeMo-Skills | 10 | |
| GPQA:d | NeMo-Skills | 5 | |
| LiveCodeBench | NeMo-Skills | 3 | |
| IFBench | NeMo-Skills | 5 | |
| LongBench v1 | lm-evaluation-harness | 1 | |
| AA-LCR | EvalScope 1.4.1 | 3 | Judge: Qwen/Qwen3-235B-A22B-Instruct-2507. judge_score_type: pattern. judge_args → generation_config: top_p 0.8, top_k 20, min_p 0.0, temperature 0.7. |
| NIAH | EvalScope 1.4.1 | 1 | Judge: qwen/qwen3-235b-a22b-2507. judge_model_args: {} (no extra judge settings in YAML). |
| RULER | NeMo-Skills (+ RULER) | 1 | |
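For readers reproducing the setup, the per-model decoding settings listed under Inference can be kept in one place and merged into each request. The sketch below is only an illustration: the dictionary keys and the function name are hypothetical labels, and whether a given server accepts every field (for example top_k, min_p, or reasoning_effort) as a top-level request field is an assumption that depends on the backend version.

```python
# Decoding settings copied from the Inference list in this section.
# The keys are hypothetical labels for each model family.
SAMPLING = {
    "nemotron": {"temperature": 1.0, "top_p": 1.0},
    "gpt-oss-20b": {"temperature": 1.0, "top_p": 1.0, "reasoning_effort": "high"},
    "qwen3-14b": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "ministral-3-14b-instruct-2512": {"temperature": 0.15},
}


def request_body(family: str, served_name: str, prompt: str) -> dict:
    """Merge a family's decoding settings into an OpenAI-compatible chat request body."""
    body = {
        "model": served_name,
        "messages": [{"role": "user", "content": prompt}],
    }
    body.update(SAMPLING[family])
    return body


print(request_body("qwen3-14b", "model", "What is 2 + 2?"))
```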
Languages
- Primary language: English
- Other languages: Spanish
The model was trained mainly on English data with additional Spanish data. Languages other than English and Spanish have not been systematically evaluated.
Safety & Limitations
Known Limitations
- English-centric training data (inherited from base model).
- Tool calling depends on correct schema and tool design; exact parity with the original model is not guaranteed.
- Compression may affect some behaviors; evaluate for your use case.
Recommendations
- Validate tool outputs before running them
- Human oversight for critical use
- Task-specific eval before production
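As one possible reading of the validation recommendation above, a parsed tool call can be checked against a registry of known tools before anything is executed. The sketch below is an assumption-laden illustration: the registry layout, the get_weather entry, and the function name are hypothetical, and a production system would typically validate argument types and values as well.

```python
# Hypothetical tool registry: name -> required and optional parameter names.
TOOL_SPECS = {
    "get_weather": {"required": {"city"}, "optional": {"unit"}},
}


def validate_tool_call(call: dict, specs: dict = TOOL_SPECS) -> list[str]:
    """Return a list of problems; an empty list means the call passed these checks."""
    spec = specs.get(call.get("name"))
    if spec is None:
        return [f"unknown tool: {call.get('name')!r}"]

    provided = set(call.get("arguments", {}))
    missing = spec["required"] - provided
    unexpected = provided - spec["required"] - spec["optional"]

    problems = []
    if missing:
        problems.append(f"missing parameters: {sorted(missing)}")
    if unexpected:
        problems.append(f"unexpected parameters: {sorted(unexpected)}")
    return problems


call = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
print(validate_tool_call(call))
```

A call would only be dispatched when the returned list is empty; anything else can be logged or routed to human review.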
Model Information
| Field | Value |
|---|---|
| Model name | Pulsar 16B |
| Based on | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| Version | v1.5.0 |
| Release date | TBD |
| Developed by | Multiverse Computing |
| License | Apache 2.0 |
| Contact | business@multiversecomputing.com |
Citation
If you use this model, please cite the base model and Pulsar 16B:
@misc{nemotron3nanoTR,
title = {NVIDIA Nemotron 3 Nano Technical Report},
author = {{NVIDIA}},
year = {2025},
url = {https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf}
}
@misc{nemotron3nanoslim16b,
title = {Pulsar 16B: Model developed from NVIDIA Nemotron-3-Nano-30B-A3B},
author = {Multiverse Computing},
year = {2026},
url = {https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16},
note = {Model developed based on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 using CompactifAI technology}
}
Built by Multiverse Computing · Report an issue · Discord


