Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency Paper • 2506.08343 • Published Jun 10, 2025 • 54
TESTEVAL: Benchmarking Large Language Models for Test Case Generation Paper • 2406.04531 • Published Jun 6, 2024 • 1
Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning Paper • 2509.13755 • Published Sep 17, 2025 • 19
ContextBench: A Benchmark for Context Retrieval in Coding Agents Paper • 2602.05892 • Published Feb 5 • 4
ExecVerify: White-Box RL with Verifiable Stepwise Rewards for Code Execution Reasoning Paper • 2603.11226 • Published Mar 11
TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks Paper • 2605.22535 • Published 2 days ago • 3
CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale Paper • 2502.16645 • Published Feb 23, 2025 • 21