TinyGPT2-IT

A 95M parameter instruction-tuned language model trained from scratch on a single consumer GPU


Overview

TinyGPT2-IT is an instruction-tuned variant of TinyGPT2, a modern GPT architecture built from scratch using PyTorch. The base model was pretrained on ~6.7B tokens from OpenWebText, then refined with supervised fine-tuning (SFT) on Stanford Alpaca's 52K instruction-response pairs.

The entire pipeline (pretraining, fine-tuning, and inference) runs on a single NVIDIA RTX 3070 Ti (8 GB VRAM).

This model uses a custom architecture, so loading it through transformers requires passing trust_remote_code=True to from_pretrained.


Architecture

| Component | Detail |
|---|---|
| Parameters | ~95M |
| Layers | 12 transformer blocks |
| Attention | Grouped Query Attention (12 query heads, 4 KV groups) |
| Embedding dim | 768 |
| FFN hidden dim | 2048 |
| Position encoding | Rotary Position Embeddings (RoPE) |
| Normalization | RMSNorm |
| Context window | 512 tokens |
| Vocabulary | 50,304 (GPT-2 tiktoken + PAD token) |
| Weight tying | Token embedding ↔ LM head |
| KV Cache | Supported for efficient generation |
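
As a rough sanity check, the ~95M figure can be reproduced from the numbers in the table above. This is a back-of-the-envelope sketch, not the author's own accounting. It assumes bias-free linear layers, a two-matrix feed-forward block (one up projection, one down projection), two RMSNorm layers per block plus a final one, and a head dimension of 768 / 12 = 64. None of these details are stated in this card, and the function name `estimate_params` is hypothetical.

```python
def estimate_params(vocab=50_304, dim=768, layers=12,
                    q_heads=12, kv_groups=4, ffn_hidden=2048):
    """Estimate the parameter count under the assumptions listed above."""
    head_dim = dim // q_heads            # 64 (assumed: dim split evenly across query heads)
    kv_dim = kv_groups * head_dim        # 256: K and V are shared within each group

    # Token embedding; the LM head reuses it (weight tying), so count it once.
    embedding = vocab * dim

    # Attention: Q and output projections are dim x dim,
    # K and V projections are dim x kv_dim (the GQA saving).
    attention = dim * dim + 2 * dim * kv_dim + dim * dim

    # Feed-forward: assumed up + down projection, no gate, no biases.
    ffn = 2 * dim * ffn_hidden

    # Assumed two RMSNorm weight vectors per block, plus one final norm.
    norms_per_block = 2 * dim
    final_norm = dim

    # RoPE is parameter-free, so it adds nothing here.
    return embedding + layers * (attention + ffn + norms_per_block) + final_norm


total = estimate_params()
print(f"{total / 1e6:.1f}M parameters")  # 95.3M parameters
```

Under these assumptions the embedding accounts for roughly 38.6M parameters and the 12 blocks for roughly 56.6M. A different feed-forward design (for example a gated variant with a third matrix) would give a noticeably larger total, so treat the breakdown as illustrative only.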

Training

Stage 1: Pretraining

| Setting | Value |
|---|---|
| Dataset | OpenWebText (~6.7B tokens) |
| Optimizer | AdamW (fused) |
| Effective batch | 262K tokens/step |
| Precision | bfloat16 + torch.compile |
| Hardware | NVIDIA RTX 3070 Ti (8 GB) |
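
To put the batch size in perspective, the arithmetic below relates the effective batch to the 512-token context window. It is a sketch under stated assumptions: "262K" is read as 2^18 = 262,144 tokens, the data is assumed to be seen exactly once, and the micro-batch of 8 sequences is a hypothetical value chosen only to illustrate gradient accumulation on an 8 GB card. The card does not state the actual micro-batch size or step count.

```python
import math

TOKENS_PER_STEP = 2 ** 18        # 262,144 (assumed exact value behind "262K")
CONTEXT_WINDOW = 512             # tokens per sequence, from the architecture table
DATASET_TOKENS = 6_700_000_000   # ~6.7B tokens, from the card


def sequences_per_step(tokens_per_step=TOKENS_PER_STEP, context=CONTEXT_WINDOW):
    """Number of full-length sequences that make up one optimizer step."""
    return tokens_per_step // context


def steps_for_one_pass(dataset_tokens=DATASET_TOKENS, tokens_per_step=TOKENS_PER_STEP):
    """Optimizer steps needed to see every token once (assumed single pass)."""
    return math.ceil(dataset_tokens / tokens_per_step)


def grad_accum_steps(micro_batch_sequences, tokens_per_step=TOKENS_PER_STEP,
                     context=CONTEXT_WINDOW):
    """Micro-batches to accumulate before each optimizer step."""
    return sequences_per_step(tokens_per_step, context) // micro_batch_sequences


print(sequences_per_step())   # 512
print(steps_for_one_pass())   # 25559
print(grad_accum_steps(8))    # 64, for a hypothetical micro-batch of 8 sequences
```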

Stage 2: Supervised Fine-Tuning (SFT)

| Setting | Value |
|---|---|
| Dataset | Stanford Alpaca (52K instructions) |
| Epochs | 3 |
| Loss masking | Response-only (instruction tokens are masked) |
| Final train loss | 1.91 |
| Final val loss | 1.98 |
| Final val perplexity | 7.26 |
| Tokens processed | ~72M |
| Prompt format | ### Instruction: ... ### Response: ... |
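
Response-only loss masking is commonly implemented by giving every prompt position a label that the loss function ignores. The sketch below shows that idea in plain Python. It is not the training code for this model: the helper name `build_labels` and the token IDs are made up for illustration, and -100 is used because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, response_ids, ignore_index=IGNORE_INDEX):
    """Concatenate prompt and response; mask the prompt part in the labels.

    Returns (input_ids, labels) of equal length. Positions labelled
    ignore_index contribute nothing to the loss, so only response tokens
    are learned. The usual one-token shift between inputs and targets is
    assumed to happen later, inside the loss computation.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Hypothetical token IDs, for illustration only.
prompt_ids = [11, 12, 13, 14]      # "### Instruction: ... ### Response:"
response_ids = [21, 22, 23]        # the answer tokens

input_ids, labels = build_labels(prompt_ids, response_ids)
print(input_ids)  # [11, 12, 13, 14, 21, 22, 23]
print(labels)     # [-100, -100, -100, -100, 21, 22, 23]
```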

Usage

Quick Start

from transformers import AutoModelForCausalLM
import tiktoken
import torch

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "NotShrirang/tinygpt2-it",
    trust_remote_code=True,
)
model.eval()

# Tokenize
enc = tiktoken.get_encoding("gpt2")
prompt = "### Instruction:\nWhat is the capital of France?\n\n### Response:\n"
input_ids = torch.tensor([enc.encode(prompt)])

# Generate
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7, top_k=40)

print(enc.decode(output[0].tolist()))
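
The Quick Start prints the full decoded sequence. If only the answer is wanted, a small string helper can strip the template. This is a sketch resting on two assumptions that the card does not confirm: that the decoded text still contains the prompt, and that the model may keep generating past its answer into a new `### Instruction:` block or an `<|endoftext|>` marker. The helper name `extract_response` is hypothetical.

```python
RESPONSE_MARKER = "### Response:"
# Assumed stop markers; adjust if the model's outputs look different.
STOP_MARKERS = ("### Instruction:", "<|endoftext|>")


def extract_response(decoded: str) -> str:
    """Return the text after the first '### Response:' marker.

    Falls back to the whole string when the marker is absent, and cuts the
    answer at the first stop marker if the model ran on past its reply.
    """
    _, marker, tail = decoded.partition(RESPONSE_MARKER)
    if not marker:
        return decoded.strip()
    for stop in STOP_MARKERS:
        tail = tail.split(stop, 1)[0]
    return tail.strip()


decoded = (
    "### Instruction:\nWhat is the capital of France?\n\n"
    "### Response:\nThe capital of France is Paris."
)
print(extract_response(decoded))  # The capital of France is Paris.
```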

Prompt Format

This model expects instructions in the following template:

### Instruction:
{your instruction here}

### Response:

For instructions with additional context:

### Instruction:
{your instruction here}

### Input:
{additional context}

### Response:
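
Both templates can be produced by one small helper. The sketch below follows the whitespace used in the Quick Start prompt (a newline after each header and a blank line between sections); the function name `build_prompt` and its parameters are hypothetical and not part of the model's code.

```python
from typing import Optional


def build_prompt(instruction: str, context: Optional[str] = None) -> str:
    """Format an instruction, with optional context, in the template above."""
    parts = [f"### Instruction:\n{instruction.strip()}"]
    if context:
        # The "### Input:" section is only included when context is given.
        parts.append(f"### Input:\n{context.strip()}")
    # End with the response header so the model continues from there.
    parts.append("### Response:\n")
    return "\n\n".join(parts)


print(build_prompt("What is the capital of France?"))
print(build_prompt("Summarize the text.", "TinyGPT2-IT is a 95M parameter model."))
```

The first call yields the same string as the `prompt` variable in the Quick Start.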

Example Outputs

Factual Q&A

>>> What is the capital of France?
The capital of France is Paris.

Explanation

>>> Explain what machine learning is in simple terms.
Machine learning is a branch of computer science that focuses on using algorithms to
identify patterns in data. These algorithms are used to analyze large amounts of data
and make predictions about future trends.

Creative

>>> Write a motivational quote.
"The only way to make a difference is to be bold and courageous."

Limitations

  • Small model: 95M parameters is far below production LLMs; expect factual errors, repetition, and limited reasoning.
  • Short context: the 512-token window limits the length of conversations and documents.
  • Training data: pretrained on web text and fine-tuned on synthetic Alpaca data, which may contain biases or inaccuracies.
  • Not safety-aligned: no RLHF/DPO has been applied to this checkpoint; the model may produce harmful or inappropriate content.
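
Because the prompt and the generated tokens share the 512-token window, it can help to compute how many new tokens still fit before calling generate. The helper below is a sketch with a hypothetical name; how the model itself behaves when the window is exceeded is not documented in this card, so the check is deliberately conservative.

```python
CONTEXT_WINDOW = 512  # from the architecture table


def generation_budget(prompt_tokens: int, requested_new_tokens: int,
                      context_window: int = CONTEXT_WINDOW) -> int:
    """Clamp max_new_tokens so prompt + output stays inside the window."""
    remaining = context_window - prompt_tokens
    if remaining <= 0:
        raise ValueError(
            f"Prompt uses {prompt_tokens} tokens; nothing left of the "
            f"{context_window}-token window for generation."
        )
    return min(requested_new_tokens, remaining)


print(generation_budget(20, 128))    # 128: the request fits as-is
print(generation_budget(450, 128))   # 62: clamped to the space that is left
```

With the Quick Start code, `prompt_tokens` would be `input_ids.shape[1]` and the result would be passed as `max_new_tokens`.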

Model Family

| Model | Params | Description | Link |
|---|---|---|---|
| TinyGPT | 51M | Standard GPT, TinyStories | GitHub |
| TinyGPT-MoE | 85M | Mixture of Experts, TinyStories | GitHub |
| Wikipedia-MoE | 135M | 8-expert MoE, Wikipedia/C4 | GitHub |
| TinyGPT2 | 95M | RoPE + GQA + RMSNorm, OpenWebText | GitHub |
| TinyGPT2.1 | 183M | Scaled TinyGPT2, FineWeb-Edu | GitHub |
| TinyGPT2-IT | 95M | Instruction-tuned (this model) | You are here |
| TinyGPT2-DPO | 95M | DPO-aligned with Anthropic HH-RLHF | GitHub |

Citation

@misc{tinygpt2-it,
  author       = {Shrirang Mahajan},
  title        = {TinyGPT2-IT: Instruction-Tuned 95M Parameter Language Model},
  year         = {2025},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/NotShrirang/tinygpt2-it}
}

License

This model is released under the GPL-3.0 License.
