SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? • arXiv:2410.03859 • Published Oct 4, 2024
OpenThoughts: Data Recipes for Reasoning Models • arXiv:2506.04178 • Published Jun 4, 2025
LongCodeBench: Evaluating Coding LLMs at 1M Context Windows • arXiv:2505.07897 • Published May 12, 2025
EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities • arXiv:2409.16165 • Published Sep 24, 2024
Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration • arXiv:2412.15701 • Published Dec 20, 2024
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces • arXiv:2601.11868 • Published Jan 17
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents • arXiv:2602.22124 • Published Feb 25
SWE-chat: Coding Agent Interactions From Real Users in the Wild • arXiv:2604.20779 • Published 23 days ago
ProgramBench: Can Language Models Rebuild Programs From Scratch? • arXiv:2605.03546 • Published 10 days ago
CodeClash: Benchmarking Goal-Oriented Software Engineering • arXiv:2511.00839 • Published Nov 2, 2025
SWE-smith: Scaling Data for Software Engineering Agents • arXiv:2504.21798 • Published Apr 30, 2025
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? • arXiv:2310.06770 • Published Oct 10, 2023
InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback • arXiv:2306.14898 • Published Jun 26, 2023
DevBench: A Comprehensive Benchmark for Software Development • arXiv:2403.08604 • Published Mar 13, 2024
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents • arXiv:2207.01206 • Published Jul 4, 2022