tiny-edu-166m (ParchmentLM)

A 166M-parameter decoder-only transformer pretrained from scratch on about 4B tokens of FineWeb-Edu.

Architecture (ParchmentLM)

Custom decoder-only transformer:

  • Parameters: 166M
  • Layers: 12
  • Hidden size: 768
  • Attention heads: 12
  • FFN: SwiGLU (hidden=2048)
  • Context length: 1024
  • Positional encoding: RoPE (base=10000)
  • Normalization: RMSNorm
  • Tokenizer: cl100k_base (vocabulary of 100,277 tokens), the same encoding used by GPT-4
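
As a sanity check on the sizes listed above, the following is a rough back-of-the-envelope parameter estimate, not the model's actual code. It assumes details this card does not state: input and output embeddings are tied, linear layers have no biases, attention uses four square projections (Q, K, V, output), and each layer has two RMSNorm weight vectors. The function name estimate_params is hypothetical.

```python
# Shapes taken from the architecture list above.
VOCAB_SIZE = 100_277
HIDDEN = 768
LAYERS = 12
FFN_HIDDEN = 2048


def estimate_params(tied_embeddings=True):
    """Rough parameter count; every structural choice here is an assumption."""
    embedding = VOCAB_SIZE * HIDDEN          # token embedding table
    attention = 4 * HIDDEN * HIDDEN          # Q, K, V and output projections, no biases
    swiglu = 3 * HIDDEN * FFN_HIDDEN         # gate, up and down projections
    norms = 2 * HIDDEN                       # two RMSNorm weight vectors per layer
    per_layer = attention + swiglu + norms
    final_norm = HIDDEN                      # RMSNorm before the output head
    total = embedding + LAYERS * per_layer + final_norm
    if not tied_embeddings:
        total += VOCAB_SIZE * HIDDEN         # separate output projection
    return total


print(f"{estimate_params() / 1e6:.1f}M parameters (estimated)")
```

Under these assumptions the estimate comes to roughly 162M, in the neighborhood of the stated 166M. With untied embeddings it would be about 239M, which suggests the embedding matrix is shared with the output head. The remaining gap depends on details not listed here.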

Training

  • Dataset: FineWeb-Edu 10BT sample
  • Tokens seen: ~4B
  • Steps: 30,000
  • Optimizer: AdamW (lr=3e-4, cosine decay to 3e-5)
  • Hardware: Single A100 40GB
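
The learning-rate schedule and token budget above can be sketched as follows. This is a minimal illustration, not the training script: it assumes the cosine decay spans all 30,000 steps, and the warmup_steps parameter is hypothetical (the card does not mention warmup, so it defaults to 0). The function name lr_at is also an assumption.

```python
import math

PEAK_LR = 3e-4
MIN_LR = 3e-5
TOTAL_STEPS = 30_000
TOKENS_SEEN = 4_000_000_000   # "~4B", so derived figures are approximate
CONTEXT_LENGTH = 1024


def lr_at(step, warmup_steps=0):
    """Cosine decay from PEAK_LR to MIN_LR, with optional linear warmup (assumed)."""
    if warmup_steps > 0 and step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)   # clamp to [0, 1]
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


tokens_per_step = TOKENS_SEEN / TOTAL_STEPS
sequences_per_step = tokens_per_step / CONTEXT_LENGTH

print(f"lr at step 0:      {lr_at(0):.2e}")
print(f"lr at step 15000:  {lr_at(15_000):.2e}")
print(f"lr at step 30000:  {lr_at(30_000):.2e}")
print(f"tokens per step:   {tokens_per_step:,.0f}")
print(f"sequences per step (at 1024 tokens): {sequences_per_step:.0f}")
```

Dividing the token budget by the step count gives on the order of 133K tokens per optimizer step, or about 130 full-length sequences. How that batch was assembled (for example through gradient accumulation) is not stated in this card.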

Installation

pip install transformers tiktoken

Note: tiktoken is required because the tokenizer wraps OpenAI's cl100k_base encoding, which ensures the token IDs match exactly those of the vocabulary the model was trained on.

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True is needed because the architecture and tokenizer are custom code in the repo.
tokenizer = AutoTokenizer.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)
model     = AutoModelForCausalLM.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)

# Tokenize a prompt and sample up to 200 new tokens.
inputs = tokenizer("The history of mathematics", return_tensors="pt")
out    = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8, top_k=50)

# Decode the prompt plus the generated continuation back to text.
print(tokenizer.decode(out[0], skip_special_tokens=True))
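
To illustrate what the temperature=0.8 and top_k=50 settings in the call above do, here is a simplified pure-Python sketch of top-k sampling with temperature for a single step. It is an illustration of the idea, not the transformers implementation. The function name sample_top_k and the toy logits are hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Pick one token index from the top_k highest logits after temperature scaling."""
    k = min(top_k, len(logits))
    # Indices of the k largest logits; everything else is excluded from sampling.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights, shifted by the maximum for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


# Toy example: a five-token vocabulary, sampling restricted to the top 2.
toy_logits = [0.1, 2.5, -1.0, 1.7, 0.3]
print(sample_top_k(toy_logits, temperature=0.8, top_k=2))
```

With top_k=2 only indices 1 and 3 can ever be returned in the toy example, and lowering the temperature makes index 1 increasingly likely.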

License

Model weights: MIT. Training data: This work uses the FineWeb-Edu dataset, available under the Open Data Commons Attribution License (ODC-By 1.0).
