
ParchmentLM 166M Instruct

A 166M parameter instruction-tuned language model trained entirely from scratch — custom architecture, real pretraining data, and full SFT pipeline — for under $55 in cloud compute.

This is a proof-of-concept demonstrating the full LLM development pipeline: architecture design, pretraining on real web data, supervised fine-tuning, and deployment. It is not intended for production use.

Model Details

  • Developed by: Pranay Narula (SlitherCode)
  • Model type: ParchmentLM — a custom decoder-only transformer architecture
  • Language: English
  • License: MIT
  • Base model: SlitherCode/tiny-edu-166m (pretrained from scratch)

Architecture

ParchmentLM is a custom LLaMA-style architecture with the following components:

Component | Details
--- | ---
Parameters | ~166M
Layers | 12
Attention heads | 12
Hidden size | 768
FFN size | 3072
Context length | 1024 tokens
Positional encoding | RoPE
Normalization | RMSNorm (pre-norm)
Activation | SwiGLU
Attention | FlashAttention (via PyTorch scaled_dot_product_attention)
Tokenizer | tiktoken cl100k_base (vocab size 100,277)
Weight tying | Yes (input embeddings = output projection)
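To make two of the listed components concrete, here is a minimal pure-Python sketch of RMSNorm and a SwiGLU feed-forward in a pre-norm residual block, operating on plain lists. It is an illustration only, not ParchmentLM's actual code: the function names, the epsilon value, and the toy input are hypothetical, and a real implementation would use tensors.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root-mean-square, then apply a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # eps value is an assumption
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def pre_norm_ffn_block(x, norm_weight, w_gate, w_up, w_down):
    """Pre-norm residual connection: x + FFN(RMSNorm(x))."""
    h = rms_norm(x, norm_weight)
    out = swiglu_ffn(h, w_gate, w_up, w_down)
    return [a + b for a, b in zip(x, out)]

# Toy input: after RMSNorm with unit gain, the mean square is ~1.
normed = rms_norm([3.0, 4.0], [1.0, 1.0])
```

Pre-norm means the normalization is applied to the input of each sub-layer while the residual path carries the unnormalized activations.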

Chat Template (ParchmentLM format)

system
You are a helpful assistant<|endoftext|>
user
{user message}<|endoftext|>
assistant
{assistant response}<|endoftext|>

<|endoftext|> (token ID 100257) serves as both the turn separator and stop token.
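The format above can be sketched as a small string builder. This is a hedged illustration: the helper names are hypothetical, and the exact whitespace (a newline after each role name and after each <|endoftext|>) is an assumption read off the layout shown here. The chat template shipped with the tokenizer remains the authoritative version.

```python
EOT = "<|endoftext|>"  # turn separator and stop token, per the model card

def build_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the ParchmentLM layout.

    Assumed layout per turn: role name, newline, content, EOT, newline.
    """
    parts = [f"{m['role']}\n{m['content']}{EOT}\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("assistant\n")
    return "".join(parts)

def extract_response(generated_text):
    """Keep only the text before the first stop token."""
    return generated_text.split(EOT)[0].strip()

example = build_prompt([
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is the capital of France?"},
])
```

Because the same token ends every turn, cutting the generated text at the first <|endoftext|> is enough to isolate a single assistant reply.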

Training

Stage 1 — Pretraining

  • Dataset: FineWeb-Edu 10BT sample (HuggingFaceFW/fineweb-edu)
  • Tokens trained on: ~4B
  • Infrastructure: Modal, single A100-40GB
  • Throughput: ~75,000 tokens/sec
  • Duration: ~14.8 hours
  • Cost: ~$46
  • Optimizer: AdamW (β1=0.9, β2=0.95, weight decay=0.1)
  • Learning rate: 3e-4 with cosine decay to 3e-5, 2000 step warmup
  • Batch size: 16 × 8 grad accum × 1024 seq len ≈ 131k tokens/step
  • Precision: bfloat16
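The learning-rate and batch-size settings above can be sketched as a schedule function. This is a minimal sketch under stated assumptions: the warmup is taken to be linear, the total step count is derived from ~4B tokens divided by the tokens per step, and the names are hypothetical. The actual training script may differ.

```python
import math

PEAK_LR = 3e-4
MIN_LR = 3e-5
WARMUP_STEPS = 2000
TOKENS_PER_STEP = 16 * 8 * 1024  # micro-batch x grad accum x sequence length
# Assumption: ~4B training tokens, so roughly 30.5k optimizer steps.
TOTAL_STEPS = round(4e9 / TOKENS_PER_STEP)

def lr_at(step, total_steps=TOTAL_STEPS):
    """Linear warmup to PEAK_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine

if __name__ == "__main__":
    for s in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(s, f"{lr_at(s):.2e}")
```

Decaying to a floor of 3e-5 rather than zero keeps the final steps from being wasted on a vanishing learning rate.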

Stage 2 — Supervised Fine-Tuning

  • Datasets:
  • Total SFT examples: ~17k
  • Loss: Completion-only (prompt and padding tokens masked to -100)
  • Pad token: <|endofprompt|> (token ID 83285), kept distinct from <|endoftext|> so that masking padding does not also mask EOT, which stays a learnable stop signal
  • Epochs: 8
  • Learning rate: 1e-4 with cosine decay
  • Batch size: 16 × 2 grad accum
  • Duration: ~38 minutes
  • Cost: ~$1.50
  • Infrastructure: Modal, single A100-40GB
  • Precision: bfloat16
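The completion-only loss described above can be sketched as a label-building step. This is an illustration under assumptions, not the project's training code: the function name and the toy token IDs for prompt and completion are hypothetical, while the pad and EOT IDs are the ones quoted in this card. Next-token shifting is left to the training loop.

```python
IGNORE_INDEX = -100   # label value ignored by cross-entropy losses that use this convention
PAD_ID = 83285        # <|endofprompt|>, as stated in the model card
EOT_ID = 100257       # <|endoftext|>, as stated in the model card

def build_labels(input_ids, prompt_len, pad_id=PAD_ID):
    """Copy input_ids into labels, masking prompt positions and padding."""
    labels = []
    for pos, tok in enumerate(input_ids):
        if pos < prompt_len or tok == pad_id:
            labels.append(IGNORE_INDEX)
        else:
            labels.append(tok)
    return labels

# Toy example: 3 prompt tokens, 2 completion tokens plus EOT, 2 pad tokens.
input_ids = [11, 12, 13, 21, 22, EOT_ID, PAD_ID, PAD_ID]
labels = build_labels(input_ids, prompt_len=3)
```

Because the pad token differs from <|endoftext|>, the final EOT keeps its label and the model is trained to emit it at the end of a response.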

Total training cost: ~$55, including many SFT iterations (one pretraining run at ~$46 plus a single SFT run at ~$1.50 would come to about $47.50).
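As a sanity check, the figures quoted in the two training stages can be cross-checked with a few lines of arithmetic. This sketch only uses the approximate numbers from this card, and the variable names are hypothetical. The implied hourly rate is derived here and is not a quoted price.

```python
# Approximate figures quoted in the training sections of this card.
THROUGHPUT_TOK_PER_SEC = 75_000
PRETRAIN_HOURS = 14.8
PRETRAIN_COST_USD = 46.0
SFT_MINUTES = 38
SFT_COST_USD = 1.50

# Tokens seen during pretraining = throughput x wall-clock seconds.
pretrain_tokens = THROUGHPUT_TOK_PER_SEC * PRETRAIN_HOURS * 3600

# GPU hours and cost for one pretraining run plus one SFT run.
gpu_hours = PRETRAIN_HOURS + SFT_MINUTES / 60
single_pass_cost = PRETRAIN_COST_USD + SFT_COST_USD

# Derived, not quoted: average cost per GPU hour during pretraining.
implied_hourly_rate = PRETRAIN_COST_USD / PRETRAIN_HOURS

if __name__ == "__main__":
    print(f"pretraining tokens: {pretrain_tokens / 1e9:.2f}B")
    print(f"GPU hours (one pass): {gpu_hours:.1f}")
    print(f"cost (one pass): ${single_pass_cost:.2f}")
    print(f"implied rate: ${implied_hourly_rate:.2f}/hour")
```

The throughput and duration are consistent with the ~4B token figure, and the gap between the single-pass cost and the ~$55 total corresponds to the additional SFT iterations mentioned above.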

Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and set the pad token that was used during SFT.
tokenizer = AutoTokenizer.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)
tokenizer.pad_token = "<|endofprompt|>"

# Load the instruction-tuned weights and switch to inference mode.
model = AutoModelForCausalLM.from_pretrained("SlitherCode/tiny-edu-166M-instruct", trust_remote_code=True)
model.eval()

PAD_ID = tokenizer.convert_tokens_to_ids("<|endofprompt|>")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Render the chat template as text with the generation prompt appended.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
input_len = inputs["input_ids"].shape[1]

# Greedy decoding; generation stops when the EOS token is produced.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=PAD_ID,
    )

# Decode only the newly generated tokens and cut at the first stop token.
raw = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=False)
response = raw.split("<|endoftext|>")[0].strip()
print(response)
# The capital of France is Paris.

Note: For arithmetic, use the format "47 + 83 =" rather than "What is 47 + 83?" to match the training distribution.
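A small pre-processing helper can rewrite natural-language arithmetic questions into the format the note recommends. This is a hypothetical convenience function, not part of the model or its repository. It only handles the "What is A + B?" and "What is A - B?" patterns with non-negative integers and leaves everything else untouched.

```python
import re

# Matches e.g. "What is 47 + 83?" (case-insensitive, optional question mark).
_ARITH_QUESTION = re.compile(
    r"^\s*what\s+is\s+(\d+)\s*([+\-])\s*(\d+)\s*\??\s*$",
    re.IGNORECASE,
)

def to_training_format(question):
    """Rewrite 'What is A + B?' as 'A + B =' and return other inputs unchanged."""
    match = _ARITH_QUESTION.match(question)
    if match is None:
        return question
    left, op, right = match.groups()
    return f"{left} {op} {right} ="

if __name__ == "__main__":
    print(to_training_format("What is 47 + 83?"))
```

The rewritten string would then be passed as the user message in place of the original question.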

Evaluation

Informal evaluation on held-out questions:

Question | Response | Correct?
--- | --- | ---
What is the capital of France? | The capital of France is Paris. | ✓
What is the capital of Germany? | The capital of Germany is Berlin. | ✓
Who wrote Romeo and Juliet? | Romeo and Juliet was written by William Shakespeare. | ✓
12 + 5 = | 17 | ✓
900 - 345 = | 700 | ✗ (correct: 555; off by 145)
2790 + 6698 = | 9648 | ✗ (correct: 9488)
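The arithmetic rows above can be checked mechanically. The sketch below is a hypothetical scoring helper, not the evaluation script used for this card. It recomputes the expected value for each "A + B =" or "A - B =" prompt and compares it with the model response recorded in the table.

```python
import re

# (prompt, model response) pairs taken from the evaluation table.
ROWS = [
    ("12 + 5 =", "17"),
    ("900 - 345 =", "700"),
    ("2790 + 6698 =", "9648"),
]

def expected_value(prompt):
    """Compute the true answer for a prompt of the form 'A + B =' or 'A - B ='."""
    match = re.fullmatch(r"\s*(\d+)\s*([+\-])\s*(\d+)\s*=\s*", prompt)
    if match is None:
        raise ValueError(f"unsupported prompt: {prompt!r}")
    left, op, right = int(match.group(1)), match.group(2), int(match.group(3))
    return left + right if op == "+" else left - right

def score(rows):
    """Return (prompt, got, want, is_correct) for each row."""
    results = []
    for prompt, response in rows:
        want = expected_value(prompt)
        got = int(response.strip())
        results.append((prompt, got, want, got == want))
    return results

if __name__ == "__main__":
    for prompt, got, want, ok in score(ROWS):
        print(f"{prompt} got {got}, expected {want}, {'ok' if ok else 'wrong'}")
```

Exact-match scoring like this is strict by design: a response that is close but not equal to the true value still counts as incorrect.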

Limitations:

  • Reliable arithmetic only up to ~2-3 digit operands
  • Tends to hallucinate on out-of-distribution factual questions
  • No safety filtering or alignment
  • Will not stop gracefully on prompts with no clear answer (creative writing, open-ended tasks)
  • Undertrained relative to model capacity: 4B training tokens, versus the ~300B tokens that models of this size typically see

Compute & Environmental Impact

  • Hardware: NVIDIA A100-40GB (via Modal)
  • Cloud provider: Modal (AWS us-east-1 region)
  • Total GPU hours: ~15.5 hours
  • Total cost: ~$55 USD

Citation

If you use this model or find this project useful, a link back to the repository is appreciated.

@misc{narula2025parchmentlm,
  author = {Pranay Narula},
  title = {ParchmentLM 166M Instruct: Full LLM Pipeline From Scratch},
  year = {2025},
  url = {https://huggingface.co/SlitherCode/tiny-edu-166M-instruct}
}