# voyage-4-nano-ONNX

ONNX exports of [voyageai/voyage-4-nano](https://huggingface.co/voyageai/voyage-4-nano) for efficient CPU/GPU inference.

## Model Details
| Property | Value |
|---|---|
| Original Model | voyageai/voyage-4-nano |
| Architecture | Qwen3-based embedding model |
| Embedding Dimension | 2048 |
| Context Length | 32000 tokens |
| ONNX Opset | 17 |
## Model Variants

| File | Precision | Size | Description |
|---|---|---|---|
| `model_fp32.onnx` | FP32 | ~1.3 GB | Full precision, highest accuracy |
| `model_int8.onnx` | INT8 | ~331 MB | Dynamic quantization, fastest inference |

**Cosine similarity (FP32 vs. INT8):** ~0.98
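You can reproduce this number with a quick check. The sketch below (assuming both ONNX files sit in the working directory) embeds the same text with each variant and compares the results; since the outputs are L2-normalized, the dot product is the cosine similarity:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("voyageai/voyage-4-nano")
fp32 = ort.InferenceSession("model_fp32.onnx")
int8 = ort.InferenceSession("model_int8.onnx")

inputs = tokenizer("What is machine learning?", return_tensors="np")
feed = {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]}

# Same input through both variants; embeddings are unit-length
e_fp32 = fp32.run(["embeddings"], feed)[0][0]
e_int8 = int8.run(["embeddings"], feed)[0][0]
print(f"Cosine similarity: {np.dot(e_fp32, e_int8):.4f}")  # expect ~0.98
```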
## Usage

### Python with ONNX Runtime
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("voyageai/voyage-4-nano")
session = ort.InferenceSession("model_fp32.onnx")

# Prepare input (using the recommended query prompt)
prompt = "Represent the query for retrieving supporting documents: "
text = prompt + "What is machine learning?"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)

# Run inference
outputs = session.run(
    ["embeddings"],
    {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]},
)

embeddings = outputs[0]  # Shape: (batch_size, 2048)
print(f"Embedding shape: {embeddings.shape}")
print(f"L2 norm: {np.linalg.norm(embeddings, axis=1)}")  # Should be ~1.0
```
### Input/Output Specification

**Inputs:**

- `input_ids`: `int64[batch_size, sequence_length]` - Tokenized input IDs
- `attention_mask`: `int64[batch_size, sequence_length]` - Attention mask

**Outputs:**

- `embeddings`: `float32[batch_size, 2048]` - L2-normalized embeddings

**Note:** Position IDs are generated internally; you only need to provide `input_ids` and `attention_mask`.
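You can confirm these names and shapes directly from the ONNX graph (a quick check, assuming `model_fp32.onnx` is present):

```python
import onnxruntime as ort

session = ort.InferenceSession("model_fp32.onnx")

# Print the declared name, element type, and shape of each input and output
for inp in session.get_inputs():
    print("input: ", inp.name, inp.type, inp.shape)
for out in session.get_outputs():
    print("output:", out.name, out.type, out.shape)
```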
## Processing Pipeline

The ONNX model includes the complete embedding pipeline (steps 3 and 4 are sketched in NumPy after this list):

1. **Transformer Backbone** - Qwen3 model with bidirectional attention
2. **Projection Layer** - Linear projection from 1024 to 2048 dimensions
3. **Mean Pooling** - Weighted average using the attention mask
4. **L2 Normalization** - Unit-length embeddings for cosine similarity
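For reference, this is what the pooling and normalization steps compute. The ONNX graph already does this internally, so the function is purely illustrative (`hidden_states` and `attention_mask` are hypothetical arrays standing in for the backbone output and the tokenizer mask):

```python
import numpy as np

def pool_and_normalize(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean pooling over valid tokens, followed by L2 normalization.

    hidden_states:  float32[batch, seq_len, dim] - projected token embeddings
    attention_mask: int64[batch, seq_len]        - 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(np.float32)    # [batch, seq_len, 1]
    summed = (hidden_states * mask).sum(axis=1)            # sum over valid tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)         # number of valid tokens
    pooled = summed / counts                               # mean pooling
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)  # unit length
```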
## Query vs Document Encoding

Following the original model's design:

- **Queries**: Prepend `"Represent the query for retrieving supporting documents: "`
- **Documents**: Encode directly, without a prompt
```python
# Query encoding
query_text = "Represent the query for retrieving supporting documents: " + query

# Document encoding
doc_text = document  # No prompt needed
```
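Putting the two conventions together, a minimal retrieval loop might look like this (`encode` is a hypothetical helper reusing `tokenizer` and `session` from the Usage section above):

```python
import numpy as np

QUERY_PROMPT = "Represent the query for retrieving supporting documents: "

def encode(texts: list[str], is_query: bool = False) -> np.ndarray:
    """Embed a batch of texts, prepending the query prompt when needed."""
    if is_query:
        texts = [QUERY_PROMPT + t for t in texts]
    inputs = tokenizer(texts, return_tensors="np", padding=True, truncation=True)
    return session.run(
        ["embeddings"],
        {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]},
    )[0]

docs = ["Machine learning is a field of AI.", "Paris is the capital of France."]
doc_emb = encode(docs)                                    # documents: no prompt
query_emb = encode(["What is machine learning?"], is_query=True)

# Unit-length embeddings: cosine similarity is a plain dot product
scores = query_emb @ doc_emb.T
print(docs[int(np.argmax(scores[0]))])
```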
## Vespa Configuration

To use this model with Vespa, configure it as a `hugging-face-embedder` in `services.xml`:
<container id="default" version="1.0">
<component id="voyage-nano" type="hugging-face-embedder">
<transformer-model url="https://huggingface.co/thomasht86/voyage-4-nano-ONNX/resolve/main/model_int8.onnx"/>
<tokenizer-model url="https://huggingface.co/thomasht86/voyage-4-nano-ONNX/resolve/main/tokenizer.json"/>
<!-- Model outputs pooled & normalized embeddings directly -->
<transformer-output>embeddings</transformer-output>
<pooling-strategy>none</pooling-strategy>
<normalize>false</normalize>
<!-- Model doesn't use token_type_ids -->
<transformer-token-type-ids/>
<!-- Context length -->
<max-tokens>32000</max-tokens>
<!-- Query/document prepend instructions -->
<prepend>
<query>Represent the query for retrieving supporting documents: </query>
</prepend>
</component>
...
</container>
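With the Vespa CLI installed and this `services.xml` inside an application package, deployment is then a single command (the timeout is illustrative):

```
vespa deploy --wait 300
```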
Use the embedder in your schema (`.sd` file):
```
schema doc {
    document doc {
        field myTextField type string {
            indexing: summary | index
        }
    }
    field embedding type tensor<float>(x[2048]) {
        indexing: input myTextField | embed voyage-nano | attribute | index
        attribute {
            distance-metric: prenormalized-angular
        }
        index: hnsw
    }
}
```
**Note:** Use `model_fp32.onnx` instead of `model_int8.onnx` for higher accuracy if resources permit.
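A query against this schema could then look like the sketch below. It assumes a hypothetical rank profile named `semantic` that declares `query(q)` and ranks by `closeness`; the `embed` query expression asks Vespa to compute the query embedding (applying the configured prepend instruction) from the `text` parameter:

```
rank-profile semantic {
    inputs {
        query(q) tensor<float>(x[2048])
    }
    first-phase {
        expression: closeness(field, embedding)
    }
}
```

```
vespa query \
  'yql=select * from doc where {targetHits:10}nearestNeighbor(embedding, q)' \
  'input.query(q)=embed(voyage-nano, @text)' \
  'text=What is machine learning?' \
  'ranking=semantic'
```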
## Binary Quantization (Optional)

For maximum memory savings, you can binarize the embeddings using Vespa's built-in `binarize` and `pack_bits` converters. This reduces the 2048-dimensional float vector (2048 × 4 bytes = 8192 bytes) to 256 int8 values (256 bytes), a 32x compression.

Schema with binarized embeddings:
```
schema doc {
    document doc {
        field myTextField type string {
            indexing: summary | index
        }
    }
    field embedding type tensor<float>(x[2048]) {
        indexing: input myTextField | embed voyage-nano | attribute
        attribute: paged
    }
    field embedding_binary type tensor<int8>(x[256]) {
        indexing: input myTextField | embed voyage-nano | binarize | pack_bits | attribute | index
        attribute {
            distance-metric: hamming
        }
        index {
            hnsw {
                max-links-per-node: 16
                neighbors-to-explore-at-insert: 200
            }
        }
    }
}
```
Rank profile for two-phase ranking (binary retrieval + full-precision reranking):
```
rank-profile binary_retrieval {
    inputs {
        query(q) tensor<float>(x[2048])
        query(q_binary) tensor<int8>(x[256])
    }
    function unpack_to_float() {
        expression: 2 * unpack_bits(attribute(embedding_binary), float) - 1
    }
    function dot_product() {
        expression: sum(query(q) * unpack_to_float)
    }
    first-phase {
        expression: closeness(field, embedding_binary)
    }
    second-phase {
        expression: dot_product
    }
}
```
Query example:
```
vespa query \
  'yql=select * from doc where {targetHits:100}nearestNeighbor(embedding_binary, q_binary)' \
  'input.query(q)=[0.1, -0.2, ...]' \
  'input.query(q_binary)=[hex dump of binarized query]' \
  'ranking=binary_retrieval'
```
Python helper for binarizing query vectors:
```python
import numpy as np

def binarize_embedding(embedding: np.ndarray) -> str:
    """Binarize a float embedding to a hex string for a Vespa query."""
    return np.packbits(np.where(embedding > 0, 1, 0)).astype(np.int8).tobytes().hex()

# Usage
query_embedding = outputs[0][0]  # Shape: (2048,)
query_binary_hex = binarize_embedding(query_embedding)
```
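To sanity-check the round trip, you can mirror Vespa's `unpack_bits` reconstruction (`2 * bit - 1`) in NumPy and confirm that the binarized vector preserves the sign pattern of the full-precision embedding (a sketch reusing `binarize_embedding` and `query_embedding` from above):

```python
import numpy as np

def unpack_to_float(hex_str: str) -> np.ndarray:
    """Reverse of binarize_embedding: packed bits -> {-1, +1} floats."""
    packed = np.frombuffer(bytes.fromhex(hex_str), dtype=np.uint8)
    return 2.0 * np.unpackbits(packed).astype(np.float32) - 1.0

recovered = unpack_to_float(binarize_embedding(query_embedding))
assert np.all((recovered > 0) == (query_embedding > 0))  # signs survive the round trip

# Cosine between the full and binarized vectors indicates how much signal remains
cos = np.dot(query_embedding, recovered) / (
    np.linalg.norm(query_embedding) * np.linalg.norm(recovered)
)
print(f"Cosine(full, binarized): {cos:.3f}")
```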
See the Vespa binarization guide for more details.
## Performance Notes

- The FP32 model provides the highest accuracy
- The INT8 model provides the fastest inference with acceptable accuracy (~0.98 cosine similarity vs. FP32)
## License

This ONNX conversion follows the original model's license. See [voyageai/voyage-4-nano](https://huggingface.co/voyageai/voyage-4-nano) for license details.
## Credits

- **Original Model**: Voyage AI - [voyageai/voyage-4-nano](https://huggingface.co/voyageai/voyage-4-nano)
- **ONNX Conversion**: thomasht86
## Citation

If you use this model, please cite the original Voyage AI model:

```bibtex
@misc{voyage4nano,
  author = {Voyage AI},
  title  = {voyage-4-nano},
  year   = {2024},
  url    = {https://huggingface.co/voyageai/voyage-4-nano}
}
```