---
tags:
- reinforcement-learning
- game-theory
- codenames
- neurips-2025
- graph-neural-networks
- preference-learning
- llm-distillation
license: mit
---

# Codenames: Graph-Based RL with LLM-Guided Preference Distillation
This repository contains trained **Codenames agents** developed for the **NeurIPS 2025 MindGames Workshop**.
The system combines a structured graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, targeting improved risk calibration and decision robustness.

---

## Overview

The approach integrates:

- **Graph Neural Networks** for structured board and history representation
- **Proximal Policy Optimization (PPO)** for policy learning
- **Role-conditioned decoding** for spymaster and operative behaviors
- **Rollout-grounded preference learning** using large language models
- **Supervised fine-tuning (SFT)** and **Direct Preference Optimization (DPO)** for teacher alignment
- **Knowledge distillation** from the aligned teacher back into a compact policy

The objective is to improve strategic consistency and reduce catastrophic failures such as assassin selections, while keeping inference efficient enough for interactive play.
---

## Game Configuration

- **Game**: Codenames
- **Board size**: 25 words
- **Roles**: Spymaster and Operative
- **Evaluation games**: 600 full episodes
- **Opponents**: Scripted baseline agents
---

## Policy Architecture

### Graph-Based State Encoder
- Heterogeneous graph with **30–40 nodes**
- Node types include:
  - Word nodes with semantic and state features
  - Historical clue nodes
  - Global summary node
- Node feature dimension: **35**
- Encoder:
  - 3 Graph Attention layers
  - 6 attention heads
  - Hidden size 192
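As a rough illustration of this layout, the sketch below assembles a node-feature matrix for a mid-game state. The 32+3 feature split and the helper names are assumptions for illustration, not the repository's actual schema; the GAT stack (3 layers, 6 heads, hidden size 192) would then run over these nodes and an edge set.

```python
# Illustrative sketch of the heterogeneous node-feature construction.
# The 32-d embedding + 3 state features split is an assumed layout.

BOARD_SIZE = 25
FEATURE_DIM = 35

def word_node(embedding_32, revealed, team_id):
    """Word node: 32-d semantic embedding + state features (assumed split)."""
    assert len(embedding_32) == 32
    return list(embedding_32) + [float(revealed), float(team_id), 0.0]

def clue_node(embedding_32, count):
    """Historical clue node: clue embedding + its number + a type flag."""
    return list(embedding_32) + [float(count), 0.0, 1.0]

def global_node():
    """Global summary node: zeros here; would carry score/turn features."""
    return [0.0] * FEATURE_DIM

def build_node_features(word_embs, clue_history):
    nodes = [word_node(e, revealed=0, team_id=0) for e in word_embs]
    nodes += [clue_node(e, c) for e, c in clue_history]
    nodes.append(global_node())
    return nodes

words = [[0.0] * 32 for _ in range(BOARD_SIZE)]
clues = [([0.0] * 32, 2) for _ in range(5)]  # 5 past clues, count 2 each
feats = build_node_features(words, clues)
print(len(feats), len(feats[0]))  # 31 nodes (25 words + 5 clues + 1 global), 35 features
```

With longer clue histories the node count grows toward the upper end of the stated 30–40 range.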
### Role Conditioning
- Shared policy trunk
- Role-conditioned action decoding:
  - Clue generation and constraint handling for the spymaster
  - Guess selection and stopping decisions for the operative
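A minimal sketch of this dispatch, with hypothetical head functions standing in for the real clue and guess decoders:

```python
# Role-conditioned decoding sketch; the head functions are hypothetical
# stand-ins, not the repository's actual decoders.

STOP = "STOP"

def propose_clue(z, board):
    # Hypothetical clue head: pick a legal clue (not a board word) and a count.
    candidates = [w for w in ["ocean", "metal"] if w not in board]
    return (candidates[0], 2)

def choose_guess_or_stop(z, board):
    # Hypothetical guess head: score unrevealed words plus an explicit STOP.
    scores = {w: z * (i + 1) for i, w in enumerate(board)}
    scores[STOP] = 0.5
    return max(scores, key=scores.get)

def decode_action(z, role, board):
    """Shared trunk output `z` is routed to a role-specific head."""
    if role == "spymaster":
        return propose_clue(z, board)
    if role == "operative":
        return choose_guess_or_stop(z, board)
    raise ValueError(f"unknown role: {role}")

board = ["apple", "river", "crane"]
print(decode_action(1.0, "spymaster", board))  # ('ocean', 2)
print(decode_action(1.0, "operative", board))  # 'crane'
```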
### Model Size
- Total parameters: **~6.8M**
- Enables fast inference under competitive constraints

---

## Training Pipeline

Training follows a multi-stage curriculum:

1. **Graph PPO Pretraining**
   - PPO with clip ratio 0.2
   - Discount factor γ = 0.99
   - GAE λ = 0.95
   - Trained against scripted Codenames agents
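The PPO pieces named above can be sketched in plain Python (single-trajectory GAE and the clipped surrogate; real training batches this in PyTorch):

```python
# Sketch of the PPO components above: clip 0.2, gamma = 0.99, lambda = 0.95.

GAMMA, LAM, CLIP = 0.99, 0.95, 0.2

def gae(rewards, values, last_value):
    """Generalized Advantage Estimation over one trajectory."""
    advantages, acc = [], 0.0
    values = values + [last_value]
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + GAMMA * values[t + 1] - values[t]
        acc = delta + GAMMA * LAM * acc
        advantages.append(acc)
    return advantages[::-1]

def clipped_objective(ratio, advantage):
    """PPO clipped surrogate for a single sample."""
    clipped = max(min(ratio, 1 + CLIP), 1 - CLIP)
    return min(ratio * advantage, clipped * advantage)

adv = gae([0.0, 0.0, 1.0], [0.1, 0.2, 0.3], last_value=0.0)
print([round(a, 3) for a in adv])          # [0.808, 0.755, 0.7]
print(clipped_objective(1.5, 1.0))         # ratio clipped to 1.2 -> 1.2
```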
2. **Preference Generation via Rollouts**
   - ~800 intermediate states sampled
   - Candidate actions proposed by:
     - Llama 3.1 Instruct
     - Qwen 2.5 Instruct
   - Each proposal evaluated using multiple stochastic rollouts
   - Higher-return actions labeled preferred
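The labeling step can be sketched as follows; `rollout_return` is a stand-in for simulating a game to completion, and the candidate actions are illustrative:

```python
# Rollout-grounded preference labeling sketch: each candidate action is
# scored by the mean return of several stochastic rollouts, and the
# higher-return action of the pair is labeled "chosen".

import random

def rollout_return(state, action, rng):
    # Stand-in for simulating the game to completion from (state, action).
    return rng.gauss(action["bias"], 0.1)

def label_pair(state, a, b, n_rollouts=8, seed=0):
    rng = random.Random(seed)
    ra = sum(rollout_return(state, a, rng) for _ in range(n_rollouts)) / n_rollouts
    rb = sum(rollout_return(state, b, rng) for _ in range(n_rollouts)) / n_rollouts
    return (a, b) if ra >= rb else (b, a)  # (chosen, rejected)

safe = {"name": "clue: OCEAN 2", "bias": 1.0}
risky = {"name": "clue: METAL 4", "bias": 0.2}
chosen, rejected = label_pair({}, safe, risky)
print(chosen["name"])  # the higher mean-return action
```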
3. **Teacher Alignment**
   - Supervised fine-tuning (SFT) on chosen actions
   - Direct Preference Optimization (DPO) against a frozen reference model
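The DPO objective used here can be written out directly (the log-probabilities below are made-up numbers, and β = 0.1 is an assumed value, not necessarily the one used in training):

```python
# DPO loss sketch: increases the policy's margin over a frozen reference
# on chosen vs. rejected actions. Inputs are log-probabilities.

import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))"""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen action more than the reference does:
loss = dpo_loss(pi_chosen=-1.0, pi_rejected=-3.0, ref_chosen=-1.5, ref_rejected=-2.5)
print(round(loss, 4))  # ≈ 0.6444, below the zero-margin value ln 2 ≈ 0.6931
```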
4. **Policy Distillation**
   - Aligned teacher generates (state, role) → action labels
   - Graph policy trained via cross-entropy imitation
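A minimal sketch of the imitation loss for a single example, assuming the teacher provides hard action labels:

```python
# Distillation sketch: the graph policy imitates the aligned teacher's
# action labels via softmax cross-entropy. Real training is batched.

import math

def cross_entropy(logits, target_index):
    """CE between student logits and the teacher's hard action label."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]

student_logits = [2.0, 0.5, -1.0]  # student scores over candidate actions
teacher_label = 0                  # teacher's chosen action index
print(round(cross_entropy(student_logits, teacher_label), 4))  # ≈ 0.241
```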
5. **PPO Refinement**
   - PPO resumes using environment rewards
   - Stabilizes the policy after distillation

---

## Evaluation Results

Evaluation uses **600 full games** against scripted opponents.

| Agent | Win Rate | Assassin Rate |
|-------|----------|---------------|
| Graph PPO | 44.8% | 12.6% |
| PPO + Distillation | 52.9% | 6.9% |

- Distillation yields an **8.1 point** absolute win-rate improvement
- Assassin-triggered losses are reduced by **45%** (12.6% → 6.9%)
- Improvements arise primarily from **better risk calibration**, not increased guessing aggressiveness
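These two headline figures follow directly from the table:

```python
# Deriving the reported deltas from the evaluation table above.
ppo_win, distill_win = 44.8, 52.9
ppo_assassin, distill_assassin = 12.6, 6.9

win_gain = distill_win - ppo_win                               # absolute points
assassin_drop = (ppo_assassin - distill_assassin) / ppo_assassin  # relative

print(round(win_gain, 1))          # 8.1
print(round(100 * assassin_drop))  # 45
```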
---

## Repository Contents

### Policy Checkpoints
- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`

### Teacher Models
- `sft_model/` – supervised fine-tuned teacher
- `dpo_model/` – preference-aligned teacher

### Configuration and Logs
- `master_config.json`
- `evaluation_results.json`

---

## Usage

### Load Policy

```python
import torch
from policy import GraphPolicy

policy = GraphPolicy(...)
state_dict = torch.load("policy_models/policy_after_distill.pt", map_location="cpu")
policy.load_state_dict(state_dict)
policy.eval()
```
### Loading the Fine-tuned LLM

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT or DPO teacher (swap "./sft_model" for "./dpo_model")
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference
prompt = "..."  # a Codenames state/role prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 🎓 Research Context

This work targets the **NeurIPS 2025 MindGames Workshop**, making three points:

- Language models provide useful strategic priors when grounded by rollouts
- Graph-based representations enable structured reasoning in semantic games
- Distillation transfers high-level reasoning into efficient, deployable agents

### Key Innovations

1. **Heterogeneous Graph Representation**: Structured graph encoding of Codenames board states and clue history
2. **Rollout-Grounded Preference Labeling**: Scoring LLM action proposals by simulated returns
3. **Multi-scale Representation**: Word-level, turn-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
## 📄 License

MIT License – see the LICENSE file for details.

## 🙏 Acknowledgments

- Built for the **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU
|