voyage-4-nano-ONNX

ONNX exports of voyageai/voyage-4-nano for efficient CPU/GPU inference.

Model Details

| Property | Value |
|----------|-------|
| Original Model | voyageai/voyage-4-nano |
| Architecture | Qwen3-based embedding model |
| Embedding Dimension | 2048 |
| Context Length | 32000 tokens |
| ONNX Opset | 17 |

Model Variants

| File | Precision | Size | Description |
|------|-----------|------|-------------|
| model_fp32.onnx | FP32 | ~1.3 GB | Full precision, highest accuracy |
| model_int8.onnx | INT8 | ~331 MB | Dynamic quantization, fastest inference |

Cosine Similarity (FP32 vs INT8): ~0.98

Usage

Python with ONNX Runtime

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("voyageai/voyage-4-nano")
session = ort.InferenceSession("model_fp32.onnx")

# Prepare input (using recommended query prompt)
prompt = "Represent the query for retrieving supporting documents: "
text = prompt + "What is machine learning?"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)

# Run inference
outputs = session.run(
    ["embeddings"],
    {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]}
)

embeddings = outputs[0]  # Shape: (batch_size, 2048)
print(f"Embedding shape: {embeddings.shape}")
print(f"L2 norm: {np.linalg.norm(embeddings, axis=1)}")  # Should be ~1.0

Input/Output Specification

Inputs:

  • input_ids: int64[batch_size, sequence_length] - Tokenized input IDs
  • attention_mask: int64[batch_size, sequence_length] - Attention mask

Outputs:

  • embeddings: float32[batch_size, 2048] - L2-normalized embeddings

Note: Position IDs are generated internally - you only need to provide input_ids and attention_mask.
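
To double-check these names and shapes against the file you downloaded, you can inspect the session directly; a minimal sketch with ONNX Runtime (either model file works):

import onnxruntime as ort

session = ort.InferenceSession("model_int8.onnx")

# Declared graph inputs: expect input_ids and attention_mask, both int64
for inp in session.get_inputs():
    print("input: ", inp.name, inp.type, inp.shape)

# Declared graph outputs: expect a single float32 "embeddings" output
for out in session.get_outputs():
    print("output:", out.name, out.type, out.shape)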

Processing Pipeline

The ONNX model includes the complete embedding pipeline:

  1. Transformer Backbone - Qwen3 model with bidirectional attention
  2. Projection Layer - Linear projection from 1024 to 2048 dimensions
  3. Mean Pooling - Weighted average using attention mask
  4. L2 Normalization - Unit-length embeddings for cosine similarity
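
Steps 3 and 4 happen inside the exported graph, so no extra post-processing is needed. Purely for illustration, here is a numpy sketch of that pooling and normalization math, assuming a hypothetical hidden_states tensor like the one the graph computes internally:

import numpy as np

def mean_pool_and_normalize(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Attention-mask-weighted mean pooling followed by L2 normalization.

    hidden_states:  (batch, seq_len, 2048) float array (after the projection layer)
    attention_mask: (batch, seq_len) array with 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)    # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(axis=1)                      # sum over non-padded tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    pooled = summed / counts                                         # mean pooling
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)    # unit-length embeddings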

Query vs Document Encoding

Following the original model's design:

  • Queries: Prepend "Represent the query for retrieving supporting documents: "
  • Documents: Encode directly without prompt

# Query encoding
query_text = "Represent the query for retrieving supporting documents: " + query

# Document encoding
doc_text = document  # No prompt needed
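
Putting the two together, a small retrieval sketch that reuses the tokenizer and session from the Usage section above (the example texts are made up):

import numpy as np

query = "Represent the query for retrieving supporting documents: " + "What is machine learning?"
documents = [
    "Machine learning is a branch of AI that learns patterns from data.",
    "The Eiffel Tower is located in Paris.",
]

def encode(texts):
    batch = tokenizer(texts, return_tensors="np", padding=True, truncation=True)
    return session.run(
        ["embeddings"],
        {"input_ids": batch["input_ids"], "attention_mask": batch["attention_mask"]},
    )[0]

query_emb = encode([query])[0]   # (2048,)
doc_embs = encode(documents)     # (2, 2048)

# Embeddings are L2-normalized, so a dot product equals cosine similarity
scores = doc_embs @ query_emb
print(scores)  # the first document should score highest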

Vespa Configuration

To use this model with Vespa, configure it as a hugging-face-embedder in services.xml:

<container id="default" version="1.0">
    <component id="voyage-nano" type="hugging-face-embedder">
        <transformer-model url="https://huggingface.co/thomasht86/voyage-4-nano-ONNX/resolve/main/model_int8.onnx"/>
        <tokenizer-model url="https://huggingface.co/thomasht86/voyage-4-nano-ONNX/resolve/main/tokenizer.json"/>
        <!-- Model outputs pooled & normalized embeddings directly -->
        <transformer-output>embeddings</transformer-output>
        <pooling-strategy>none</pooling-strategy>
        <normalize>false</normalize>
        <!-- Model doesn't use token_type_ids -->
        <transformer-token-type-ids/>
        <!-- Context length -->
        <max-tokens>32000</max-tokens>
        <!-- Query/document prepend instructions -->
        <prepend>
            <query>Represent the query for retrieving supporting documents: </query>
        </prepend>
    </component>
    ...
</container>

Use in your schema (.sd file):

schema doc {

    document doc {

        field myTextField type string {
            indexing: summary | index
        }

    }

    field embedding type tensor<float>(x[2048]) {
        indexing: input myTextField | embed voyage-nano | attribute | index
        attribute {
            distance-metric: prenormalized-angular
        }
        index: hnsw
    }

}
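
With this setup, Vespa can also embed query text on the fly via the embed function in the query API. A sketch, assuming a rank profile named semantic (not shown in the schema above) that ranks by closeness:

rank-profile semantic {
    inputs {
        query(q) tensor<float>(x[2048])
    }
    first-phase {
        expression: closeness(field, embedding)
    }
}

The query prepend configured in services.xml is applied when the query text is embedded, so the instruction prefix does not need to be added manually:

vespa query \
    'yql=select * from doc where {targetHits:10}nearestNeighbor(embedding, q)' \
    'input.query(q)=embed(voyage-nano, @text)' \
    'text=What is machine learning?' \
    'ranking=semantic'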

Note: Use model_fp32.onnx instead of model_int8.onnx for higher accuracy if resources permit.

Binary Quantization (Optional)

For maximum memory savings, you can binarize the embeddings using Vespa's built-in binarize and pack_bits converters. This reduces the 2048-dimensional float vector to 256 int8 values (32x compression).
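
The arithmetic: 2048 float32 values occupy 2048 × 4 = 8192 bytes, while 2048 sign bits packed 8 per byte occupy 256 bytes. A quick numpy check:

import numpy as np

embedding = np.random.randn(2048).astype(np.float32)   # stand-in for a full-precision embedding
packed = np.packbits(embedding > 0).astype(np.int8)    # 1 bit per dimension, 8 dimensions per byte

print(embedding.nbytes, packed.nbytes)                 # 8192 256
print(embedding.nbytes // packed.nbytes)               # 32x compression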

Schema with binarized embeddings:

schema doc {

    document doc {

        field myTextField type string {
            indexing: summary | index
        }

    }

    field embedding type tensor<float>(x[2048]) {
        indexing: input myTextField | embed voyage-nano | attribute
        attribute: paged
    }

    field embedding_binary type tensor<int8>(x[256]) {
        indexing: input myTextField | embed voyage-nano | binarize | pack_bits | attribute | index
        attribute {
            distance-metric: hamming
        }
        index {
            hnsw {
                max-links-per-node: 16
                neighbors-to-explore-at-insert: 200
            }
        }
    }

}

Rank profile for two-phase ranking (binary retrieval + full-precision reranking):

rank-profile binary_retrieval {
    inputs {
        query(q) tensor<float>(x[2048])
        query(q_binary) tensor<int8>(x[256])
    }
    function unpack_to_float() {
        expression: 2 * unpack_bits(attribute(embedding_binary), float) - 1
    }
    function dot_product() {
        expression: sum(query(q) * unpack_to_float)
    }
    first-phase {
        expression: closeness(field, embedding_binary)
    }
    second-phase {
        expression: dot_product
    }
}
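
For intuition, the unpack_to_float expression maps each stored bit back to -1 or +1 before the dot product with the full-precision query. A numpy equivalent, using random vectors as stand-ins:

import numpy as np

q = np.random.randn(2048).astype(np.float32)            # full-precision query vector
doc = np.random.randn(2048).astype(np.float32)          # full-precision document embedding
doc_packed = np.packbits(doc > 0).astype(np.int8)       # what pack_bits stores (256 x int8)

# Equivalent of: 2 * unpack_bits(attribute(embedding_binary), float) - 1
doc_unpacked = 2.0 * np.unpackbits(doc_packed.view(np.uint8)).astype(np.float32) - 1.0

# Equivalent of the second-phase expression: sum(query(q) * unpack_to_float)
score = float(np.sum(q * doc_unpacked))
print(score)

Only the signs of the document embedding survive packing, while the query stays in full precision; that asymmetry is what makes the second phase a better estimate than the hamming-distance first phase alone.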

Query example:

vespa query \
    'yql=select * from doc where {targetHits:100}nearestNeighbor(embedding_binary, q_binary)' \
    'input.query(q)=[0.1, -0.2, ...]' \
    'input.query(q_binary)=[hex dump of binarized query]' \
    'ranking=binary_retrieval'

Python helper for binarizing query vectors:

import numpy as np

def binarize_embedding(embedding: np.ndarray) -> str:
    """Binarize a float embedding to hex string for Vespa query."""
    return np.packbits(np.where(embedding > 0, 1, 0)).astype(np.int8).tobytes().hex()

# Usage
query_embedding = outputs[0][0]  # Shape: (2048,)
query_binary_hex = binarize_embedding(query_embedding)

See Vespa binarization guide for more details.

Performance Notes

  • FP32 model provides the highest accuracy
  • INT8 model provides the fastest inference with a small accuracy trade-off (~0.98 cosine similarity against the FP32 embeddings)
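
A rough way to reproduce both numbers on your own hardware (single run, no warmup; timings will vary):

import time
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("voyageai/voyage-4-nano")
inputs = tokenizer(["What is machine learning?"], return_tensors="np", padding=True, truncation=True)
feeds = {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]}

embeddings, timings = {}, {}
for name in ("model_fp32.onnx", "model_int8.onnx"):
    session = ort.InferenceSession(name)
    start = time.perf_counter()
    embeddings[name] = session.run(["embeddings"], feeds)[0][0]
    timings[name] = time.perf_counter() - start

# Outputs are L2-normalized, so the dot product is the cosine similarity
print("cosine similarity:", float(np.dot(embeddings["model_fp32.onnx"], embeddings["model_int8.onnx"])))
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.1f} ms")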

License

This ONNX conversion follows the original model's license. See voyageai/voyage-4-nano for license details.

Citation

If you use this model, please cite the original Voyage AI model:

@misc{voyage4nano,
  author = {Voyage AI},
  title = {voyage-4-nano},
  year = {2024},
  url = {https://huggingface.co/voyageai/voyage-4-nano}
}