## N-Day-Bench: Frontier LLMs Face Live Test Against Real GitHub Vulnerabilities
A new benchmark is putting frontier large language models to a demanding test: can they find real, known security vulnerabilities in live, high-profile codebases, given only the state of the code just before the patch? N-Day-Bench addresses a critical flaw in static AI security tests, namely data contamination and memorization, by constructing a fresh, adversarial evaluation each month. It pulls newly disclosed vulnerabilities from GitHub Security Advisories, checks out each repository at the exact commit just before the fix, and gives the model a sandboxed bash shell to explore the code. This monthly refresh keeps the test set ahead of the models' training data, creating an honest window in which to measure genuine reasoning rather than recall.
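The pre-fix checkout described above can be sketched with ordinary git plumbing. Everything here is an illustrative assumption: N-Day-Bench's actual tooling, advisory schema, and helper names are not public, and `vulnerable_checkout_commands` is a hypothetical function.

```python
import subprocess

def vulnerable_checkout_commands(repo_url: str, fix_commit: str, workdir: str) -> list[list[str]]:
    """Build the git commands that clone a repository and check it out
    at the parent of the fix commit, i.e. the last still-vulnerable state.
    (Hypothetical helper; not the benchmark's real implementation.)"""
    return [
        ["git", "clone", repo_url, workdir],
        # "<sha>^" is git's revision syntax for the first parent of a
        # commit: the tree as it existed immediately before the fix landed.
        ["git", "-C", workdir, "checkout", f"{fix_commit}^"],
    ]

def prepare_sandbox(repo_url: str, fix_commit: str, workdir: str) -> None:
    """Materialize the vulnerable snapshot the Finder will explore."""
    for cmd in vulnerable_checkout_commands(repo_url, fix_commit, workdir):
        subprocess.run(cmd, check=True)
```

The key detail is checking out the *parent* of the fix commit rather than a tag or branch tip, which pins the evaluation to the exact still-vulnerable snapshot the advisory refers to.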

The process is a rigorous three-agent simulation. A 'Curator' agent first reads the security advisory to build a precise answer key. The model under test, the 'Finder,' is then given a budget of only 24 shell commands to navigate the codebase, trace the bug from provided vulnerability hints (or 'sinks'), and produce a structured report, all without ever seeing the actual security patch. Finally, a 'Judge' agent scores the blinded submission against the Curator's key. To ensure relevance, only repositories with over 10,000 GitHub stars qualify, and a diversity filter prevents any single project from dominating the dataset.

This benchmark shifts the goal from academic puzzle-solving to practical, time-pressured vulnerability discovery. By forcing models to interact with a real shell and a sprawling codebase, it tests their ability to reason about software architecture and exploit chains under constraints that mirror a real security researcher's workflow. The results will provide a stark, ongoing measure of whether the most advanced AI assistants are moving closer to becoming autonomous security tools or remain limited by their training data.
---
- **Source**: Hacker News
- **Sector**: The Lab
- **Tags**: AI Security, LLM Benchmarking, Vulnerability Discovery, GitHub, Code Analysis
- **Credibility**: unverified
- **Published**: 2026-04-13 22:22:37
- **ID**: 62669
- **URL**: https://whisperx.ai/en/intel/62669