---
language: en
license: mit
tags:
- pretrained
- causal-lm
- fineweb-edu
- custom-architecture
---

# tiny-edu-166m (ParchmentLM)

A 166M-parameter decoder-only transformer pretrained from scratch on roughly 4B tokens of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).

## Architecture (ParchmentLM)

Custom decoder-only transformer:
- **Parameters:** 166M
- **Layers:** 12
- **Hidden size:** 768
- **Attention heads:** 12
- **FFN:** SwiGLU (hidden=2048)
- **Context length:** 1024
- **Positional encoding:** RoPE (base=10000)
- **Normalization:** RMSNorm
- **Tokenizer:** cl100k_base (100277 tokens), the same encoding GPT-4 uses

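To make the RoPE entry above concrete, here is a minimal pure-Python sketch of a rotary position embedding with `base=10000`. It is not the model's actual code. It assumes a head dimension of 64 (hidden size 768 split evenly across 12 heads) and rotates adjacent pairs of dimensions; the real implementation may pair dimensions differently (for example, split halves). The names `rope_rotate` and `dot` are illustrative.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate adjacent pairs of `vec` by position-dependent angles (RoPE)."""
    dim = len(vec)
    assert dim % 2 == 0, "head dimension must be even"
    out = list(vec)
    for i in range(dim // 2):
        # Pair i rotates with frequency base^(-2i/dim).
        theta = position * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

HEAD_DIM = 768 // 12  # assumed: hidden size divided evenly across heads

# Toy query/key vectors; the values carry no meaning.
q = [0.01 * (i + 1) for i in range(HEAD_DIM)]
k = [0.02 * (HEAD_DIM - i) for i in range(HEAD_DIM)]

# Attention scores depend only on the relative offset (3 in both cases).
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 103), rope_rotate(k, 100))
print(abs(score_near - score_far) < 1e-9)
```

The two scores agree because each pair is rotated by an angle proportional to its position, so the query-key dot product depends only on the difference between positions.
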
## Training

- **Dataset:** FineWeb-Edu 10BT sample
- **Tokens seen:** ~4B
- **Steps:** 30,000
- **Optimizer:** AdamW (lr=3e-4, cosine decay to 3e-5)
- **Hardware:** Single A100 40GB

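The learning-rate schedule and token budget listed above can be sketched as follows. This is an illustration under stated assumptions, not the training script: it assumes no warmup phase (none is mentioned), a decay spanning all 30,000 steps, and sequences packed to the full 1024-token context. The function name `cosine_lr` is hypothetical.

```python
import math

MAX_LR = 3e-4         # peak learning rate from the list above
MIN_LR = 3e-5         # final learning rate from the list above
TOTAL_STEPS = 30_000  # training steps from the list above

def cosine_lr(step, max_lr=MAX_LR, min_lr=MIN_LR, total_steps=TOTAL_STEPS):
    """Cosine decay from max_lr at step 0 to min_lr at total_steps.

    Assumption: no warmup, since the card does not mention one.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Rough token budget per optimizer step, using the approximate 4B total.
tokens_per_step = 4e9 / TOTAL_STEPS
sequences_per_step = tokens_per_step / 1024  # assumes full 1024-token sequences

for step in (0, 15_000, 30_000):
    print(f"step {step:>6}: lr = {cosine_lr(step):.2e}")
print(f"~{tokens_per_step:,.0f} tokens per step, ~{sequences_per_step:.0f} sequences per step")
```

Under these assumptions each step processes roughly 133k tokens, or about 130 full-length sequences; the actual batch size is not stated in this card.
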
## Installation

```bash
pip install transformers tiktoken
```

> **Note:** `tiktoken` is required because the tokenizer wraps OpenAI's cl100k_base encoding,
> so token IDs match exactly the vocabulary the model was trained on. PyTorch must also be
> installed, since the usage example returns PyTorch tensors.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True is needed because the architecture and tokenizer are custom.
repo_id = "SlitherCode/tiny-edu-166m"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# Sample up to 200 new tokens from a short prompt.
inputs = tokenizer("The history of mathematics", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8, top_k=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

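For readers unfamiliar with the `temperature` and `top_k` arguments used above, the following pure-Python sketch shows what they do to a single next-token distribution. It is a simplified illustration, not the `transformers` implementation; the function `sample_top_k` and the toy logits are hypothetical.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one token ID from raw logits with temperature and top-k filtering."""
    # Keep only the indices of the top_k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperatures below 1 sharpen the distribution; above 1 flatten it.
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights, shifted by the max for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

# Toy example: a five-token vocabulary, keeping only the three most likely tokens.
rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
draws = [sample_top_k(toy_logits, temperature=0.8, top_k=3, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

Tokens outside the top `k` are never sampled, and lowering the temperature concentrates more probability on the highest-scoring token.
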
## License

- **Model weights:** MIT
- **Training data:** FineWeb-Edu, available under the Open Data Commons Attribution License (ODC-By 1.0)