SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? • arXiv:2410.03859 • Published Oct 4, 2024
OpenThoughts: Data Recipes for Reasoning Models • arXiv:2506.04178 • Published Jun 4, 2025
LongCodeBench: Evaluating Coding LLMs at 1M Context Windows • arXiv:2505.07897 • Published May 12, 2025
EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities • arXiv:2409.16165 • Published Sep 24, 2024
Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration • arXiv:2412.15701 • Published Dec 20, 2024
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces • arXiv:2601.11868 • Published Jan 17
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents • arXiv:2602.22124 • Published Feb 25
SWE-chat: Coding Agent Interactions From Real Users in the Wild • arXiv:2604.20779 • Published 23 days ago
ProgramBench: Can Language Models Rebuild Programs From Scratch? • arXiv:2605.03546 • Published 10 days ago
CodeClash: Benchmarking Goal-Oriented Software Engineering • arXiv:2511.00839 • Published Nov 2, 2025
SWE-smith: Scaling Data for Software Engineering Agents • arXiv:2504.21798 • Published Apr 30, 2025
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? • arXiv:2310.06770 • Published Oct 10, 2023
InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback • arXiv:2306.14898 • Published Jun 26, 2023
DevBench: A Comprehensive Benchmark for Software Development • arXiv:2403.08604 • Published Mar 13, 2024
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents • arXiv:2207.01206 • Published Jul 4, 2022