Pulsar 16B

Powered by CompactifAI

Optimized for Fast and Efficient Inference · Reduced Memory Footprint


Model Overview

Pulsar 16B is a compressed model developed by Multiverse Computing, based on NVIDIA-Nemotron-3-Nano-30B-A3B-BF16. The base model has ~31.6B parameters and belongs to the Nemotron model family. It supports long-context inference up to 1M tokens and is designed for general-purpose language modeling tasks.

This version applies model compression techniques to significantly reduce parameter count and deployment requirements while maintaining compatibility with the Nemotron Hybrid Mamba2-Transformer with MoE architecture. The resulting model achieves roughly 50% compression, reducing the parameter count from ~31.6B to 16.15B and lowering memory requirements.


Key Characteristics

| Characteristic | Description |
|---|---|
| Base model | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16. 31.6B total parameters, 3.6B activated per forward pass (11.34% activation ratio). NVIDIA Open Model License. |
| Pulsar-16B-BF16 (this model) | 16.15B total parameters, 3.1B activated per forward pass (19.28% activation ratio) after CompactifAI compression. |
| 📐 Architecture | Hybrid Mamba2-Transformer with MoE (same family as the base checkpoint). |
| 🛠️ Tool calling | Yes. Same tool-call structure and format as Nemotron-3-Nano-30B-A3B-BF16. See Tool Calling. |
| 🗜️ Compression | CompactifAI (proprietary compression technology) |
| Primary language | English |
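
The headline figures above can be sanity-checked with simple arithmetic. The sketch below uses the rounded parameter counts from the table, so its ratios differ slightly from the stated percentages, which were presumably computed from unrounded counts. The memory estimate is an assumption of this example: it counts only BF16 weights at 2 bytes per parameter and ignores activations, caches, and runtime overhead.

```python
# Rounded figures from the table above, in billions of parameters.
BASE_TOTAL_B, BASE_ACTIVE_B = 31.6, 3.6
PULSAR_TOTAL_B, PULSAR_ACTIVE_B = 16.15, 3.1

BYTES_PER_BF16_PARAM = 2  # BF16 stores each parameter in 16 bits


def reduction(base_total: float, compressed_total: float) -> float:
    """Fraction of parameters removed by compression."""
    return 1.0 - compressed_total / base_total


def activation_ratio(active: float, total: float) -> float:
    """Share of the parameters that are active in one forward pass."""
    return active / total


def bf16_weight_gb(total_b: float) -> float:
    """Weight-only footprint in GB (1 GB = 1e9 bytes); excludes caches and overhead."""
    return total_b * BYTES_PER_BF16_PARAM


if __name__ == "__main__":
    print(f"Parameter reduction: {reduction(BASE_TOTAL_B, PULSAR_TOTAL_B):.1%}")
    print(f"Base activation ratio: {activation_ratio(BASE_ACTIVE_B, BASE_TOTAL_B):.2%}")
    print(f"Pulsar activation ratio: {activation_ratio(PULSAR_ACTIVE_B, PULSAR_TOTAL_B):.2%}")
    print(f"BF16 weights: {bf16_weight_gb(BASE_TOTAL_B):.1f} GB -> {bf16_weight_gb(PULSAR_TOTAL_B):.1f} GB")
```

With these rounded inputs the parameter reduction comes out slightly under 49%, consistent with the roughly 50% compression figure.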

Quick Start

Load the model with the Transformers API, passing trust_remote_code=True. The recommended approach is AutoModelForCausalLM together with apply_chat_template. This configuration has been tested with Transformers 4.57.6.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MultiverseComputingCAI/Pulsar-16B-BF16"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda" if torch.cuda.is_available() else "auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Write a haiku about GPUs"},
]

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    tokenized_chat,
    max_new_tokens=1024,
    do_sample=True,  # temperature/top_p only take effect when sampling is enabled
    temperature=1.0,
    top_p=1.0,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0]))

Alternatively, you can use the pipeline API with trust_remote_code=True. The pipeline returns the full conversation structure, so extract the assistant message from outputs[0]["generated_text"] as needed.
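
A minimal sketch of the pipeline route follows. The helper name last_assistant_message is hypothetical, and the pipeline arguments (torch_dtype="auto", device_map="auto") are assumptions chosen for illustration rather than settings confirmed by this card.

```python
MODEL_ID = "MultiverseComputingCAI/Pulsar-16B-BF16"


def last_assistant_message(generated_text):
    """Return the content of the final assistant turn in a chat-style output."""
    for message in reversed(generated_text):
        if message.get("role") == "assistant":
            return message.get("content", "")
    return ""


def main():
    # Imported here so the helper above can be reused without loading Transformers.
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model=MODEL_ID,
        torch_dtype="auto",      # assumption: let Transformers pick the checkpoint dtype
        device_map="auto",       # assumption: automatic device placement
        trust_remote_code=True,  # required by this model
    )
    messages = [{"role": "user", "content": "Write a haiku about GPUs"}]
    outputs = pipe(messages, max_new_tokens=1024)
    # generated_text holds the whole conversation; keep only the reply.
    print(last_assistant_message(outputs[0]["generated_text"]))


if __name__ == "__main__":
    main()
```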

vLLM Serving

Installation

pip install -U "vllm>=0.12.0"

Reasoning parser (NVIDIA)

Pulsar 16B uses the same Nemotron v3 reasoning tags as the base model. NVIDIA provides the vLLM plugin as nano_v3_reasoning_parser.py on the base Hugging Face repo (not specific to Pulsar). Direct download:

wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py

You can keep any local filename; the vllm serve flags below assume the file is in the current directory as nano_v3_reasoning_parser.py. If you mirror an identical copy under the Pulsar model repo, use that URL instead.

Serve

vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --port 8000 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3

Note: The NeMo container nvcr.io/nvidia/nemo:25.11.nemotron_3_nano comes with mamba_ssm and causal-conv1d pre-installed.


Thinking (Reasoning) Control

Pulsar 16B supports a hybrid reasoning mode: the model can either think step-by-step before answering (reasoning mode) or reply directly (non-reasoning mode). The behaviour is controlled via the enable_thinking flag in the chat template.

This section provides a brief overview of reasoning control in Pulsar 16B. For comprehensive details please see the official Nemotron-3 Nano-30B model card at: https://build.nvidia.com/nvidia/nemotron-3-nano-30b-a3b/modelcard


Transformers API

Pass enable_thinking through apply_chat_template:

Thinking ON (default)

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,   # default — can be omitted
)

Thinking OFF

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,
)

When thinking is ON, the model opens a <think> block before the answer. Split the decoded output on </think> to separate the reasoning from the final answer:

output = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Split on </think> to separate reasoning from the final answer
if "</think>" in output:
    reasoning, answer = output.split("</think>", 1)
    reasoning = reasoning.replace("<think>", "").strip()
    answer = answer.strip()
else:
    # No reasoning block (e.g. thinking OFF): the whole output is the answer
    reasoning = ""
    answer = output.strip()

vLLM

Server-level default

Set the default for all requests at startup with --default-chat-template-kwargs.

Requires recent versions of vLLM.

Thinking OFF for all requests

vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --trust-request-chat-template \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  ...

Thinking ON for all requests (default if flag is omitted)

vllm serve MultiverseComputingCAI/Pulsar-16B-BF16 \
  --served-model-name model \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --trust-request-chat-template \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  ...

Per-request override

--trust-request-chat-template is required to allow per-request overrides.

Individual requests can override the server default by passing chat_template_kwargs in the request body. This works regardless of the server-level default.

Thinking ON/OFF for one request

import requests

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "model",
    "messages": [{"role": "user", "content": "Solve: x² - 5x + 6 = 0"}],
    "max_tokens": 1024,
    "temperature": 1.0,
    "chat_template_kwargs": {"enable_thinking": True},  # set to False to turn thinking off for this request
})
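
The request above does not show how to read the reply. The sketch below wraps the same request and separates reasoning from the answer. It assumes the server was started with the reasoning parser as shown earlier, and that the parsed reasoning is exposed on the message as either reasoning or reasoning_content (the field name depends on the vLLM version, so both are treated as assumptions). The function names are hypothetical.

```python
API_URL = "http://localhost:8000/v1/chat/completions"  # matches --port 8000 above


def build_payload(prompt, enable_thinking):
    """Build a chat-completions body with a per-request thinking override."""
    return {
        "model": "model",  # the --served-model-name used above
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "temperature": 1.0,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def split_message(response_json):
    """Return (reasoning, answer) from a chat-completions response dict."""
    message = response_json["choices"][0]["message"]
    # Assumption: the reasoning parser fills one of these two fields.
    reasoning = message.get("reasoning") or message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning.strip(), answer.strip()


def ask(prompt, enable_thinking=True):
    """Send one request and return (reasoning, answer)."""
    import requests  # imported here so the helpers above work without it

    response = requests.post(API_URL, json=build_payload(prompt, enable_thinking), timeout=600)
    response.raise_for_status()
    return split_message(response.json())


if __name__ == "__main__":
    reasoning, answer = ask("Solve: x² - 5x + 6 = 0", enable_thinking=False)
    print(answer)
```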

Tool Calling

Pulsar 16B emits tool calls in the following format:

<tool_call>
<function=get_weather>
<parameter=city>Paris</parameter>
<parameter=unit>celsius</parameter>
</function>
</tool_call>
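
For raw model output (for example from the Transformers API, where no server-side parser runs), the format above can be parsed client-side. The following is an illustrative sketch, not the parser vLLM uses: the function name is hypothetical, parameter values are kept as raw strings, and no type conversion against a tool schema is attempted.

```python
import re

# Matches one <tool_call> block and captures the function name and its body.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# Matches one <parameter=name>value</parameter> pair inside a function body.
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)

EXAMPLE = """<tool_call>
<function=get_weather>
<parameter=city>Paris</parameter>
<parameter=unit>celsius</parameter>
</function>
</tool_call>"""


def parse_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in the text."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        arguments = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


if __name__ == "__main__":
    print(parse_tool_calls(EXAMPLE))
```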

When serving (e.g., with vLLM), you must use the qwen3_coder tool parser.

vllm serve <model_path> \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --trust-remote-code
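
Once the server runs with the tool parser enabled, tool calls arrive in the standard OpenAI-compatible tool_calls field. The sketch below builds such a request and reads the result. The get_weather schema is a hypothetical tool made up for illustration, and the model name "model" assumes the server was started with --served-model-name model.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}


def build_tool_request(prompt):
    """Chat-completions body that offers the tool and lets the model decide."""
    return {
        "model": "model",  # assumption: --served-model-name model
        "messages": [{"role": "user", "content": prompt}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }


def extract_tool_calls(response_json):
    """Return (name, arguments) pairs; arguments arrive as a JSON string."""
    message = response_json["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

POST the body returned by build_tool_request to /v1/chat/completions and pass the decoded JSON response to extract_tool_calls. Validate the arguments before executing any tool.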

Training & Fine-Tuning

Base Model: NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

The base model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. The model's reasoning capabilities can be configured through a flag in the chat template. See the original model card for details.

CompactifAI Compression

CompactifAI was applied to produce a smaller, more efficient model (16B parameters) while aiming to preserve reasoning and tool-use capabilities. Supervised fine-tuning (SFT) was then applied to improve capabilities.


Evaluation & Benchmarks

Combined benchmark chart

| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B | gpt-oss-20b | Qwen3-14B | Ministral-3-14B-Instruct-2512 |
|---|---|---|---|---|---|
| AIME | 87.66 | 87.22 | 87.66 | 76.00 | 33.00 |
| GPQA | 74.04 | 71.41 | 68.99 | 63.63 | 56.45 |
| IFBench | 72.31 | 70.79 | 68.46 | 39.20 | 32.80 |
| MMLU-Pro | 78.90 | 74.78 | 76.65 | 85.01 | 70.09 |
| LiveCodeBench | 71.11 | 68.04 | 64.65 | 66.35 | 29.84 |
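
One way to read the table is as score retention, i.e. the Pulsar 16B score divided by the base-model score. The sketch below computes this from the two relevant columns. The unweighted mean is only an illustrative summary chosen for this example, not a metric reported by this card.

```python
# (Nemotron 3 Nano 30B A3B, Pulsar 16B) scores copied from the table above.
SCORES = {
    "AIME": (87.66, 87.22),
    "GPQA": (74.04, 71.41),
    "IFBench": (72.31, 70.79),
    "MMLU-Pro": (78.90, 74.78),
    "LiveCodeBench": (71.11, 68.04),
}


def retention(base_score: float, compressed_score: float) -> float:
    """Compressed score as a fraction of the base score."""
    return compressed_score / base_score


def mean_retention(scores: dict) -> float:
    """Unweighted mean retention across benchmarks (illustrative summary only)."""
    values = [retention(base, compressed) for base, compressed in scores.values()]
    return sum(values) / len(values)


if __name__ == "__main__":
    for name, (base, compressed) in SCORES.items():
        print(f"{name:14s} {retention(base, compressed):.1%}")
    print(f"{'Mean':14s} {mean_retention(SCORES):.1%}")
```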

Quantizations

Quantization results

| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B (BF16) | Pulsar 16B (fp8) | Pulsar 16B (nvfp4) |
|---|---|---|---|---|
| AIME | 87.66 | 87.22 | 86.67 | 82.00 |
| GPQA | 74.04 | 71.41 | 70.61 | 71.11 |
| IFBench | 72.31 | 70.79 | 69.60 | 69.90 |
| MMLU-Pro | 78.90 | 74.78 | 74.76 | 74.19 |
| LiveCodeBench | 71.11 | 68.04 | 68.68 | 65.60 |

Performance

Performance results

  • Framework: guidellm
  • Inference: vLLM 0.18.0
  • GPU: NVIDIA L40S
  • Decode: temperature 0.0, top_p 1.0
  • Measurement window: each phase lasts 3 minutes (excluding ramp-up and cool-down periods).
  • Workload shape: 8k/16k workload as in the original model's card.

Long Context

Pulsar 16B preserves strong long-context behavior after compression, tracking the Nemotron-3-Nano-30B-A3B baseline closely across retrieval-heavy and full-suite long-context evaluations. Results are reported for LongBench v1, AA-LCR, NIAH, and RULER groupings up to 256k context.

Long-context benchmark results

| Benchmark | Nemotron 3 Nano 30B A3B | Pulsar 16B |
|---|---|---|
| LongBench v1 | 31.84 | 29.84 |
| AA-LCR | 33.67 | 29.33 |
| NIAH (@100K) | 100.00 | 100.00 |
| RULER (@128K) | 95.02 | 94.20 |
| RULER (@256K) | 92.02 | 87.74 |

Evaluation Methodology

Benchmark scores were obtained with the following setups. Methodology varies by benchmark family.

Inference:

  • Backend: vLLM 0.18.0
  • Nemotron models: temp: 1.0, top_p: 1.0
  • GPT-OSS-20B: temp: 1.0, top_p: 1.0, reasoning_effort: high
  • Qwen3-14B: temp: 0.6, top_p: 0.95, top_k: 20, min_p: 0.0
  • Ministral-3-14B-Instruct-2512: temp: 0.15

| Benchmark | Framework | Repeats | Other |
|---|---|---|---|
| MMLU-Pro | NeMo-Skills | 1 | |
| AIME25 | NeMo-Skills | 10 | |
| GPQA:d | NeMo-Skills | 5 | |
| LiveCodeBench | NeMo-Skills | 3 | |
| IFBench | NeMo-Skills | 5 | |
| LongBench v1 | lm-evaluation-harness | 1 | |
| AA-LCR | EvalScope 1.4.1 | 3 | Judge: Qwen/Qwen3-235B-A22B-Instruct-2507. judge_score_type: pattern. judge_args generation_config: top_p 0.8, top_k 20, min_p 0.0, temperature 0.7. |
| NIAH | EvalScope 1.4.1 | 1 | Judge: qwen/qwen3-235b-a22b-2507. judge_model_args: {} (no extra judge settings in YAML). |
| RULER | NeMo-Skills (+ RULER) | 1 | |
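
To reproduce the decoding settings listed under Inference against an OpenAI-compatible endpoint, they can be kept as plain data and merged into each request. This is only a sketch: the dictionary keys are hypothetical labels, and passing top_k, min_p, and reasoning_effort as top-level request fields is an assumption about the serving stack. This card does not describe how the evaluation frameworks passed these settings.

```python
# Decoding settings copied from the Inference list above.
SAMPLING = {
    "nemotron": {"temperature": 1.0, "top_p": 1.0},
    "gpt-oss-20b": {"temperature": 1.0, "top_p": 1.0, "reasoning_effort": "high"},
    "qwen3-14b": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "ministral-3-14b-instruct-2512": {"temperature": 0.15},
}


def request_body(model_key: str, served_name: str, prompt: str) -> dict:
    """Build a chat-completions body using the settings for one model family."""
    body = {
        "model": served_name,
        "messages": [{"role": "user", "content": prompt}],
    }
    body.update(SAMPLING[model_key])  # settings not listed stay at server defaults
    return body


if __name__ == "__main__":
    print(request_body("qwen3-14b", "model", "Hello"))
```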

Languages

  • Primary language: English
  • Other languages: Spanish

Trained mainly on English with added Spanish. No systematic evaluation for languages outside English and Spanish.

Safety & Limitations

Known Limitations

  • English-centric training data (inherited from base model).
  • Tool calling depends on correct schema and tool design; exact parity with the original model is not guaranteed.
  • Compression may affect some behaviors; evaluate for your use case.

Recommendations

  • Validate tool outputs before running them
  • Human oversight for critical use
  • Task-specific eval before production

Model Information

| Field | Value |
|---|---|
| Model name | Pulsar 16B |
| Based on | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| Version | v1.5.0 |
| Release date | TBD |
| Developed by | Multiverse Computing |
| License | Apache 2.0 |
| Contact | business@multiversecomputing.com |

Citation

If you use this model, please cite the base model and Pulsar 16B:

@misc{nemotron3nanoTR,
  title         = {NVIDIA Nemotron 3 Nano Technical Report},
  author        = {{NVIDIA}},
  year          = {2025},
  url           = {https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf}
}
@misc{nemotron3nanoslim16b,
  title         = {Pulsar 16B: Model developed from NVIDIA Nemotron-3-Nano-30B-A3B},
  author        = {Multiverse Computing},
  year          = {2026},
  url           = {https://huggingface.co/MultiverseComputingCAI/Pulsar-16B-BF16},
  note          = {Model developed based on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 using CompactifAI technology}
}

Built by Multiverse Computing · Report an issue · Discord
