# BoSentencePiece - Tibetan SentencePiece Tokenizer
A SentencePiece tokenizer trained on Tibetan text using the Unigram language model algorithm.
## Model Details
| Parameter | Value |
|---|---|
| Model Type | Unigram |
| Vocabulary Size | 20,000 |
| Character Coverage | 100% |
| Alphabet Size | 232 characters |
| Max Token Length | 16 |
## Training Data
- Total Sentences: 2,302,054
- Total Characters: 228,818,663
- Skipped Sentences (exceeded max length): 1,318
## Special Tokens
| Token | ID | Description |
|---|---|---|
| `<unk>` | 0 | Unknown token |
| `<s>` | 1 | Beginning of sequence |
| `</s>` | 2 | End of sequence |
| `<pad>` | -1 | Padding token |
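
These IDs can be checked directly with the SentencePiece processor. The sketch below assumes the `sentencepiece.model` file from this repository is in the working directory; the accessors used are standard SentencePiece methods.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("sentencepiece.model")

# Built-in accessors for special token IDs; values should match the table above.
print(sp.unk_id(), sp.bos_id(), sp.eos_id(), sp.pad_id())  # expected: 0 1 2 -1
print(sp.id_to_piece(sp.unk_id()))                         # expected: <unk>
```

A pad ID of -1 is SentencePiece's way of marking a special token that is not enabled in the trained model.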
## Normalization
- Normalizer: NMT NFKC
- Add Dummy Prefix: Yes
- Remove Extra Whitespaces: Yes
- Escape Whitespaces: Yes
## Tokenization Settings
- Split by Unicode script: ✅
- Split by number: ✅
- Split by whitespace: ✅
- Split digits: ❌
- Byte fallback: ❌
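
The settings above correspond to standard SentencePiece trainer options. The following is a minimal training sketch, not the exact command used for this model; the corpus path and model prefix are placeholders.

```python
import sentencepiece as spm

# Hypothetical corpus file: one Tibetan sentence per line.
spm.SentencePieceTrainer.train(
    input="tibetan_sentences.txt",
    model_prefix="sentencepiece",
    model_type="unigram",
    vocab_size=20000,
    character_coverage=1.0,
    max_sentencepiece_length=16,
    normalization_rule_name="nmt_nfkc",
    add_dummy_prefix=True,
    remove_extra_whitespaces=True,
    split_by_unicode_script=True,
    split_by_number=True,
    split_by_whitespace=True,
    split_digits=False,
    byte_fallback=False,
    unk_id=0,
    bos_id=1,
    eos_id=2,
    pad_id=-1,
)
```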
## Usage

### With Transformers
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OpenPecha/BoSentencePiece")

text = "བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།"
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode
encoded = tokenizer.encode(text)
print(encoded)

# Decode
decoded = tokenizer.decode(encoded)
print(decoded)
```
### With SentencePiece Directly
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("sentencepiece.model")

text = "བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།"
tokens = sp.encode_as_pieces(text)
print(tokens)
```
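
The processor can also round-trip text through integer IDs using the standard `encode_as_ids` and `decode_ids` methods, shown here as a self-contained example:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("sentencepiece.model")

text = "བོད་སྐད་ཀྱི་ཚིག་གྲུབ་འདི་ཡིན།"

# Encode to integer IDs and decode back to the original text
ids = sp.encode_as_ids(text)
print(ids)
print(sp.decode_ids(ids))
```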
## Files

- `sentencepiece.model` - SentencePiece binary model
- `sentencepiece.vocab` - Vocabulary file with scores
- `tokenizer.json` - Hugging Face tokenizer format
- `tokenizer_config.json` - Tokenizer configuration
## License
Apache 2.0