Pulsar 16B

Optimized for Fast and Efficient Inference · Reduced Memory Footprint

Model Overview
Key Characteristics
Quick Start
Thinking (Reasoning) Control
Tool Calling
Training & Fine-Tuning
Evaluation & Benchmarks
Languages
Safety & Limitations
Model Information
Citation

Model Overview

Pulsar 16B is a compressed model derived from NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 and developed by Multiverse Computing. The base model has ~31.6B parameters and is part of the Nemotron model family. It supports long-context inference up to 1M tokens and is designed for general-purpose language modeling tasks.

This version applies model compression techniques to significantly reduce parameter count and deployment requirements while maintaining compatibility with the Nemotron Hybrid Mamba2-Transformer with MoE architecture. The resulting model achieves roughly 50% compression, reducing the parameter count from ~31.6B to 16.15B (about 49% fewer parameters) and lowering memory requirements.
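
As a quick sanity check on the figures above, the sketch below works out the parameter reduction and a rough weights-only memory estimate. It assumes every weight is stored in BF16 at 2 bytes per parameter and ignores the KV cache, activations, and any tensors kept at other precisions, so treat the result as a ballpark rather than a deployment requirement. The function names are illustrative only.

```python
# Parameter counts quoted in this card, in billions.
BASE_PARAMS_B = 31.6
PULSAR_PARAMS_B = 16.15

# Assumption: all weights are stored in BF16 (2 bytes per parameter).
BYTES_PER_PARAM_BF16 = 2


def reduction_pct(base_b: float, compressed_b: float) -> float:
    """Percentage of parameters removed by compression."""
    return (1.0 - compressed_b / base_b) * 100.0


def weights_gb(params_b: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Approximate weights-only footprint in GB (1 GB = 1e9 bytes)."""
    # Billions of parameters times bytes per parameter gives billions of bytes.
    return params_b * bytes_per_param


if __name__ == "__main__":
    print(f"Parameter reduction: {reduction_pct(BASE_PARAMS_B, PULSAR_PARAMS_B):.1f}%")
    print(f"Base weights (BF16):   ~{weights_gb(BASE_PARAMS_B):.1f} GB")
    print(f"Pulsar weights (BF16): ~{weights_gb(PULSAR_PARAMS_B):.1f} GB")
```

Under these assumptions the BF16 weights shrink from roughly 63 GB to roughly 32 GB.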

Key Characteristics

Characteristic	Description
Base model	nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16. 31.6B total parameters, 3.6B activated per forward pass (11.34% activation ratio). NVIDIA Open Model License.
Pulsar-16B-BF16 (this model)	16.15B total parameters, 3.1B activated per forward pass (19.28% activation ratio) after CompactifAI compression.
📐 Architecture	Hybrid Mamba2-Transformer with MoE (same family as the base checkpoint).
🛠️ Tool calling	Yes. Same tool-call structure and format as Nemotron-3-Nano-30B-A3B-BF16. See Tool Calling.
🗜️ Compression	CompactifAI (proprietary compression technology)
Primary language	English

Quick Start

This model can be loaded with the Transformers API. Use trust_remote_code=True. Recommended approach: AutoModelForCausalLM with apply_chat_template. This configuration has been tested with Transformers 4.57.6.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MultiverseComputingCAI/Pulsar-16B-BF16"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda" if torch.cuda.is_available() else "auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Write a haiku about GPUs"},
]

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    tokenized_chat,
    max_new_tokens=1024,
    temperature=1.0,
    top_p=1.0,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0]))

Alternatively, you can use the pipeline API with trust_remote_code=True. The pipeline returns the full conversation structure, so extract the assistant message from outputs[0]["generated_text"] as needed.
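
A minimal sketch of the pipeline route follows. It assumes the text-generation pipeline is called with a list of chat messages, in which case outputs[0]["generated_text"] is the conversation as a list of role/content dicts. The helper names last_assistant_message and run_pipeline_example are hypothetical and not part of Transformers.

```python
from typing import Any


def last_assistant_message(pipeline_outputs: list[dict[str, Any]]) -> str:
    """Return the content of the final assistant turn from a chat pipeline result."""
    conversation = pipeline_outputs[0]["generated_text"]
    # Walk backwards so the most recent assistant turn wins.
    for message in reversed(conversation):
        if message.get("role") == "assistant":
            return message["content"]
    raise ValueError("no assistant message found in pipeline output")


def run_pipeline_example() -> str:
    """Load the model through the pipeline API and return the assistant reply."""
    # Imported lazily so the helper above works without loading Transformers.
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model="MultiverseComputingCAI/Pulsar-16B-BF16",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )
    messages = [{"role": "user", "content": "Write a haiku about GPUs"}]
    outputs = pipe(messages, max_new_tokens=1024)
    return last_assistant_message(outputs)
```

Calling run_pipeline_example() downloads and loads the full checkpoint, so it is kept separate from the lightweight extraction helper.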

vLLM Serving

Installation

pip install -U "vllm>=0.12.0"

Reasoning parser (NVIDIA)

Pulsar 16B uses the same Nemotron v3 reasoning tags as the base model. NVIDIA provides the vLLM plugin as nano_v3_reasoning_parser.py on the base Hugging Face repo (not specific to Pulsar). Direct download:

wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py

You can keep any local filename; the vllm serve flags below assume the file is in the current directory as nano_v3_reasoning_parser.py. If you mirror an identical copy under the Pulsar model repo, use that URL instead.

Serve

vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --port 8000 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3

Note: The NeMo container nvcr.io/nvidia/nemo:25.11.nemotron_3_nano comes with mamba_ssm and causal-conv1d pre-installed.

Thinking (Reasoning) Control

Pulsar 16B supports a hybrid reasoning mode: the model can either think step-by-step before answering (reasoning mode) or reply directly (non-reasoning mode). The behaviour is controlled via the enable_thinking flag in the chat template.

This section gives a brief overview of reasoning control in Pulsar 16B. For full details, see the official Nemotron 3 Nano 30B A3B model card: https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b/modelcard

Transformers API

Pass enable_thinking through apply_chat_template:

Thinking ON (default)

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,   # default — can be omitted
)

Thinking OFF

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,
)

When thinking is ON, the model opens a <think> block before the answer, so the decoded output can be split on the closing </think> tag:

output = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Split on </think> to separate reasoning from the final answer
if "</think>" in output:
    reasoning, answer = output.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "").strip()
    answer = answer.strip()
else:
    # No reasoning block (e.g. thinking OFF): keep both names defined
    reasoning = ""
    answer = output.strip()

vLLM

Server-level default

Set the default for all requests at startup with --default-chat-template-kwargs.

Requires recent versions of vLLM.

Thinking OFF for all requests

vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --trust-request-chat-template \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  ...

Thinking ON for all requests (default if flag is omitted)

vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --trust-request-chat-template \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  ...

Per-request override

--trust-request-chat-template is required to allow per-request overrides.

Individual requests can override the server default by passing chat_template_kwargs in the request body. This works regardless of the server-level default.

Thinking ON/OFF for one request

import requests

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "model",
    "messages": [{"role": "user", "content": "Solve: x² - 5x + 6 = 0"}],
    "max_tokens": 1024,
    "temperature": 1.0,
    # Set to False to turn thinking off for this request only
    "chat_template_kwargs": {"enable_thinking": True},
})
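
With a reasoning parser enabled, vLLM returns the reasoning text separately from the final answer in the response message. The sketch below shows one way to read both. The exact field name is an assumption here: it has been reasoning_content in some vLLM releases and reasoning in others, so the hypothetical helper split_response checks both rather than relying on one.

```python
from typing import Any, Optional, Tuple


def split_response(body: dict[str, Any]) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer) from a /v1/chat/completions response body."""
    message = body["choices"][0]["message"]
    # Assumption: the reasoning field name depends on the vLLM version in use.
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    # content can be null in the JSON body, so fall back to an empty string.
    answer = message.get("content") or ""
    return reasoning, answer
```

Applied to the request above, this would be used as reasoning, answer = split_response(response.json()). When thinking is OFF, reasoning is expected to be None.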

Tool Calling

Pulsar 16B emits tool calls in the following format:

<tool_call>
<function=get_weather>
<parameter=city>Paris</parameter>
<parameter=unit>celsius</parameter>
</function>
</tool_call>
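
When working with raw generated text (for example, output decoded directly from Transformers, where no server-side parser is involved), the format above can be read with a couple of regular expressions. This is a minimal sketch, not the parser vLLM uses: parse_tool_calls is a hypothetical helper, it returns every parameter value as a string, and it does not handle malformed or nested tags.

```python
import re

# Matches one <tool_call> block and captures the function name and its body.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# Matches one <parameter=name>value</parameter> pair inside a function body.
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract tool calls as {"name": ..., "arguments": {...}} dicts."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        # Values are kept as stripped strings; no type conversion is attempted.
        arguments = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


if __name__ == "__main__":
    sample = (
        "<tool_call>\n"
        "<function=get_weather>\n"
        "<parameter=city>Paris</parameter>\n"
        "<parameter=unit>celsius</parameter>\n"
        "</function>\n"
        "</tool_call>"
    )
    print(parse_tool_calls(sample))
```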

When serving (e.g., with vLLM), you must use the qwen3_coder tool parser.

vllm serve <model_path> \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code
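
Once the server is running with the tool parser enabled, tools are passed through the OpenAI-compatible chat completions request. The sketch below builds such a request and decodes the returned calls. The get_weather schema is a made-up example matching the format shown earlier, the helper names are hypothetical, and "model" assumes the server was started with --served-model-name model.

```python
import json
from typing import Any

# Hypothetical tool definition in the OpenAI-compatible "tools" format.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}


def build_tool_request(prompt: str, model: str = "model") -> dict[str, Any]:
    """Build a chat completions payload that offers the weather tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [GET_WEATHER_TOOL],
        "tool_choice": "auto",
        "max_tokens": 1024,
    }


def extract_tool_calls(body: dict[str, Any]) -> list[tuple[str, dict]]:
    """Return (function name, decoded arguments) pairs from a response body."""
    message = body["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # In the OpenAI format, arguments arrive as a JSON-encoded string.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

The payload can be sent with requests.post("http://localhost:8000/v1/chat/completions", json=build_tool_request("What's the weather in Paris?")), and the parsed JSON response passed to extract_tool_calls.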

Training & Fine-Tuning

Base Model: NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

The base model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. The model's reasoning capabilities can be configured through a flag in the chat template. See the original model card for details.

CompactifAI Compression

CompactifAI was applied to produce a smaller, more efficient model (16B parameters) while aiming to preserve reasoning and tool-use capabilities. Supervised fine-tuning was then applied to improve capabilities.

Evaluation & Benchmarks

Benchmark	Nemotron 3 Nano 30B A3B	Pulsar 16B	gpt-oss-20b	Qwen3-14B	Ministral-3-14B-Instruct-2512
AIME	87.66	87.22	87.66	76.00	33.00
GPQA	74.04	71.41	68.99	63.63	56.45
IFBench	72.31	70.79	68.46	39.20	32.80
MMLU-Pro	78.90	74.78	76.65	85.01	70.09
LiveCodeBench	71.11	68.04	64.65	66.35	29.84
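
To put the table in relative terms, the short sketch below computes how much of the base model's score Pulsar 16B retains on each benchmark. The numbers are copied from the first two score columns above; retention_pct is an illustrative helper, and a simple score ratio is only one possible way to summarize the gap.

```python
# Scores copied from the table above (base model vs. Pulsar 16B).
BASE_SCORES = {
    "AIME": 87.66,
    "GPQA": 74.04,
    "IFBench": 72.31,
    "MMLU-Pro": 78.90,
    "LiveCodeBench": 71.11,
}
PULSAR_SCORES = {
    "AIME": 87.22,
    "GPQA": 71.41,
    "IFBench": 70.79,
    "MMLU-Pro": 74.78,
    "LiveCodeBench": 68.04,
}


def retention_pct(base: dict[str, float], compressed: dict[str, float]) -> dict[str, float]:
    """Compressed score as a percentage of the base score, per benchmark."""
    return {name: compressed[name] / base[name] * 100.0 for name in base}


if __name__ == "__main__":
    for name, pct in retention_pct(BASE_SCORES, PULSAR_SCORES).items():
        print(f"{name:<14} {pct:5.1f}% of base score")
```

On these figures, retention ranges from about 95% (MMLU-Pro) to about 99.5% (AIME).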

Quantizations

Benchmark	Nemotron 3 Nano 30B A3B	Pulsar 16B (BF16)	Pulsar 16B (fp8)	Pulsar 16B (nvfp4)
AIME	87.66	87.22	86.67	82.00
GPQA	74.04	71.41	70.61	71.11
IFBench	72.31	70.79	69.60	69.90
MMLU-Pro	78.90	74.78	74.76	74.19
LiveCodeBench	71.11	68.04	68.68	65.60

Performance

Framework: guidellm
Inference: vLLM 0.18.0
GPU: NVIDIA L40s
Decode: temperature: 0.0, top_p: 1.0
Measure Window: Each phase lasts 3 minutes (excluding ramp-up and cool-down periods).
Workload shape: 8k/16k workload as in the original model's card.

Long Context

Pulsar 16B preserves most of the base model's long-context behavior after compression. It matches the Nemotron-3-Nano-30B-A3B baseline on NIAH at 100K and stays within one point on RULER at 128K, with larger gaps on AA-LCR (29.33 vs. 33.67) and RULER at 256K (87.74 vs. 92.02). Results are reported for LongBench v1, AA-LCR, NIAH, and RULER at up to 256K context.

Benchmark	Nemotron 3 Nano 30B A3B	Pulsar 16B
LongBench v1	31.84	29.84
AA-LCR	33.67	29.33
NIAH (@100K)	100.00	100.00
RULER (@128K)	95.02	94.20
RULER (@256K)	92.02	87.74

Evaluation Methodology

Benchmark scores were obtained with the following setups. Methodology varies by benchmark family.

Inference:

Backend: vLLM 0.18.0
Nemotron models: temp: 1.0, top_p: 1.0
GPT-OSS-20B: temp: 1.0, top_p: 1.0, reasoning_effort: high
Qwen3-14B: temp: 0.6, top_p: 0.95, top_k: 20, min_p: 0.0
Ministral-3-14B-Instruct-2512: temp: 0.15

Benchmark	Framework	Repeats	Other
MMLU-Pro	NeMo-Skills	1
AIME25	NeMo-Skills	10
GPQA:d	NeMo-Skills	5
LiveCodeBench	NeMo-Skills	3
IFBench	NeMo-Skills	5
LongBench v1	lm-evaluation-harness	1
AA-LCR	EvalScope 1.4.1	3	Judge: `Qwen/Qwen3-235B-A22B-Instruct-2507`. `judge_score_type`: `pattern`. `judge_args` → `generation_config`: `top_p` 0.8, `top_k` 20, `min_p` 0.0, `temperature` 0.7.
NIAH	EvalScope 1.4.1	1	Judge: `qwen/qwen3-235b-a22b-2507`. `judge_model_args`: `{}` (no extra judge settings in YAML).
RULER	NeMo-Skills (+ RULER)	1

Languages

Primary language: English
Other languages: Spanish

Trained mainly on English with added Spanish. No systematic evaluation for languages outside English and Spanish.

Safety & Limitations

Known Limitations

English-centric training data (inherited from base model).
Tool calling depends on correct schema and tool design; exact parity with the original model is not guaranteed.
Compression may affect some behaviors; evaluate for your use case.

Recommendations

Validate tool outputs before running them
Human oversight for critical use
Task-specific eval before production

Model Information

Field	Value
Model name	Pulsar 16B
Based on	NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Version	v1.5.0
Release date	TBD
Developed by	Multiverse Computing
License	Apache 2.0
Contact	business@multiversecomputing.com

Citation

If you use this model, please cite the base model and Pulsar 16B:

@misc{nemotron3nanoTR,
  title         = {NVIDIA Nemotron 3 Nano Technical Report},
  author        = {{NVIDIA}},
  year          = {2025},
  url           = {https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf}
}
@misc{nemotron3nanoslim16b,
  title         = {Pulsar 16B: Model developed from NVIDIA Nemotron-3-Nano-30B-A3B},
  author        = {Multiverse Computing},
  year          = {2026},
  url           = {https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16},
  note          = {Model developed based on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 using CompactifAI technology}
}

Built by Multiverse Computing