Tokenizers
A tokenizer converts text into tensors, which are the inputs to a model. It normalizes and splits text, applies the tokenization algorithm, adds special tokens, and decodes output ids back into text.
```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
tokenizer("Sphinx of black quartz, judge my vow.", return_tensors="pt")
{
    'input_ids': tensor([[     2, 235277,  82913,    576,   2656,  30407, 235269,  11490,    970,  29871, 235265]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
```

This guide covers loading, encoding, decoding, batch processing, and the available tokenizer backends.
Load a tokenizer
Load a tokenizer with the AutoTokenizer class or a model-specific tokenizer class.
AutoTokenizer.from_pretrained() reads the model config, resolves the correct tokenizer class, and returns an instance of it. You don’t need to know the tokenizer class beforehand. Most tokenizers resolve to a subclass of TokenizersBackend, a fast Rust-based tokenizer from the Tokenizers library.
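If you already know the model, you can also instantiate the model-specific class directly. A minimal sketch, assuming GemmaTokenizer is importable from the top-level transformers namespace as described in the Backends section below:

```py
from transformers import GemmaTokenizer

# equivalent to the AutoTokenizer call when the checkpoint resolves to GemmaTokenizer
tokenizer = GemmaTokenizer.from_pretrained("google/gemma-2-2b")
```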
Loading with AutoTokenizer is the recommended approach.
```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
```

Encode and decode
The TokenizersBackend.__call__() method encodes text or a batch of text into input_ids, attention_mask, and other model inputs. It also controls padding, truncation, and special token insertion.
```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
tokenizer("Sphinx of black quartz, judge my vow.", return_tensors="pt")
{
    'input_ids': tensor([[     2, 235277,  82913,    576,   2656,  30407, 235269,  11490,    970,  29871, 235265]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
}
```

TokenizersBackend.encode() is similar but only returns the input_ids.
```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
tokenizer.encode("Sphinx of black quartz, judge my vow.")
[2, 235277, 82913, 576, 2656, 30407, 235269, 11490, 970, 29871, 235265]
```

TokenizersBackend.decode() converts a single sequence or batch of tokenized input_ids back to text.
```py
outputs = tokenizer("Sphinx of black quartz, judge my vow.", return_tensors="pt")
tokenizer.decode(outputs["input_ids"])
['<bos>Sphinx of black quartz, judge my vow.']
```

TokenizersBackend.decode() preserves the exact tokenization spacing. Set clean_up_tokenization_spaces=True to remove spaces before punctuation, and skip_special_tokens=True to strip special tokens from the output.
```py
tokenizer.decode(outputs["input_ids"], skip_special_tokens=True)
['Sphinx of black quartz, judge my vow.']
```

Special tokens
Special tokens mark structural boundaries in a sequence, like the beginning-of-sequence or padding positions. Each model defines its own set of special tokens. The tokenizer adds them when you call it.
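To see which special tokens a checkpoint defines, you can inspect the tokenizer's special token attributes. A minimal sketch, assuming the special_tokens_map property behaves as in earlier Transformers releases (the exact tokens vary by model):

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
# dictionary of named special tokens, e.g. bos_token, eos_token, pad_token
tokenizer.special_tokens_map
```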
tokenizer.encode("Sphinx of black quartz, judge my vow.")
[2, 235277, 82913, 576, 2656, 30407, 235269, 11490, 970, 29871, 235265]
tokenizer.decode(outputs["input_ids"])
['<bos>Sphinx of black quartz, judge my vow.']Register additional named special tokens with the extra_special_tokens argument. Multimodal models use them as placeholders for images, video, or audio.
```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "google/gemma-3-4b-pt",
    extra_special_tokens={"image_token": "<image>"}
)
```

Batch processing
Batch processing tokenizes multiple sequences in a single call. TokenizersBackend handles large batches quickly because its Rust implementation parallelizes tokenization across threads.
```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
# without padding, the sequences tokenize to different lengths, so they are returned as Python lists
tokenizer(
    [
        "Sphinx of black quartz, judge my vow.",
        "Pack my box with five dozen liquor jugs.",
        "How vexingly quick daft zebras jump!"
    ]
)
```

Batch processing requires all sequences to share the same length before they can be packed into a tensor. Padding and truncation are strategies to handle varying-length sequences.
Padding
Padding adds special padding tokens so shorter sequences match the longest sequence in a batch. The attention mask marks padding positions as 0 so the model ignores them. Set padding=True to pad to the longest sequence in the batch, or set padding="max_length" and max_length to pad to a fixed size.
```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
tokenizer(
    [
        "Sphinx of black quartz, judge my vow.",
        "Pack my box with five dozen liquor jugs.",
        "How vexingly quick daft zebras jump!"
    ],
    return_tensors="pt",
    padding=True,
)
{
    'input_ids': tensor([
        [     2, 235277,  82913,    576,   2656,  30407, 235269,  11490,    970,  29871, 235265],
        [     0,      2,   6519,    970,   3741,    675,   4105,  25955,  42184, 225789, 235265],
        [     0,      2,   2299,  73378,  17844,   4320, 224463,   4949,  48977,   9902, 235341]
    ]),
    'attention_mask': tensor([
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    ])
}
```

Large language models pad on the left side so that generation, which continues from the rightmost token, isn't disrupted by padding tokens sitting between the prompt and the newly generated tokens.
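The padding side can be controlled explicitly with the padding_side argument (or the tokenizer.padding_side attribute). A minimal sketch, assuming these behave as in earlier Transformers releases:

```py
from transformers import AutoTokenizer

# pad on the left so generated tokens directly follow the prompt
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b", padding_side="left")
tokenizer(
    [
        "Sphinx of black quartz, judge my vow.",
        "How vexingly quick daft zebras jump!"
    ],
    return_tensors="pt",
    padding=True,
)
```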
Truncation
Truncation clips tokens so a sequence fits within a maximum length. Set truncation=True and specify max_length to enable it.
Padding and truncation work together. Short sequences gain padding tokens while long sequences lose trailing tokens. Together, they produce a packed rectangular tensor.
```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
tokenizer(
    [
        "Sphinx of black quartz, judge my vow.",
        "Pack my box with five dozen liquor jugs.",
        "How vexingly quick daft zebras jump!"
    ],
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=5
)
{
    'input_ids': tensor([
        [    2, 235277,  82913,    576,   2656],
        [    2,   6519,    970,   3741,    675],
        [    2,   2299,  73378,  17844,   4320]
    ]),
    'attention_mask': tensor([
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]
    ])
}
```

Backends
Each model tokenizer is defined in a single file and supports four tokenization backends.
| backend | implementation | description |
|---|---|---|
| TokenizersBackend | Tokenizers | default for most models |
| SentencePieceBackend | SentencePiece | models requiring SentencePiece |
| PythonBackend | Python | models requiring specialized custom tokenizers |
| MistralCommonBackend | mistral-common | Mistral and Pixtral models |
All backends inherit from PreTrainedTokenizerBase and share the same APIs for encoding, decoding, padding, truncation, saving, and loading. The difference is which tokenization pipeline runs underneath.
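As a quick illustration of the shared base class, any loaded tokenizer passes an isinstance check against it. A minimal sketch:

```py
from transformers import AutoTokenizer, PreTrainedTokenizerBase

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
# every backend subclasses the same base class, so the shared API is always available
isinstance(tokenizer, PreTrainedTokenizerBase)
True
```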
AutoTokenizer selects the best available backend when you call from_pretrained().
1. It reads the tokenizer_config.json file for the tokenizer_class field.
2. The registry matches tokenizer_class to a class name. The resolved class inherits from one of the four backends. For example, GemmaTokenizer inherits from TokenizersBackend, and SiglipTokenizer inherits from SentencePieceBackend.

Some models, like GLM, map directly to TokenizersBackend because the tokenizer.json file fully describes the pipeline. GemmaTokenizer exists as a subclass since it defines additional model-specific settings in Python that tokenizer.json doesn't capture.

```py
TOKENIZER_MAPPING_NAMES = OrderedDict(
    [
        ("gemma2", "GemmaTokenizer" if is_tokenizers_available() else None),
        ("glm", "TokenizersBackend" if is_tokenizers_available() else None),
        (
            "mistral",
            "MistralCommonBackend"
            if is_mistral_common_available()
            else ("TokenizersBackend" if is_tokenizers_available() else None),
        ),
        ("siglip", "SiglipTokenizer" if is_sentencepiece_available() else None),
        ...
    ]
)
```

When a backend like mistral-common isn't installed, AutoTokenizer falls back to TokenizersBackend.
Check which backend a tokenizer is using with the backend property.
```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
tokenizer.backend
'tokenizers'
```

Inspect the tokenizer architecture
Inspect a tokenizer’s internal components (normalizer, pre-tokenizer, model, decoder) with the _tokenizer attribute.
```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
print(tokenizer._tokenizer.normalizer)
print(tokenizer._tokenizer.pre_tokenizer)
print(tokenizer._tokenizer.model)
print(tokenizer._tokenizer.decoder)
```

Resources
- The Tokenization in Transformers v5 post discusses the motivation behind the new tokenization backends.
- Review the migration guide for an overview of the tokenization changes.