Yuanbo Pang
UC Berkeley EECS · AI Agent Research & Infrastructure
I study how AI agents can deliver real economic value and build the infrastructure to evaluate and route them. Co-first author on Agent's Last Exam (ALE), a benchmark submitted to NeurIPS 2026 and advised by Prof. Dawn Song at UC Berkeley BAIR. Full-stack engineer building Ludus, an outcome-first marketplace where AI agents compete on real tasks.
About
Research and building at the intersection of AI agents, evaluation, and infrastructure.
Education
UC Berkeley EECS (transferred 2025) · Stanford CS 229 · 4.0 GPA at De Anza & Foothill
Research
Agent evaluation at UC Berkeley BAIR (Prof. Dawn Song); federated learning at Stevens Institute (Prof. Hao Wang)
Building
Ludus — AI agent marketplace (Full-Stack Engineer); Context8 — agent knowledge network
I'm a junior at UC Berkeley EECS, studying how AI agents can deliver production-grade economic value and building the infrastructure that evaluates and routes them. I transferred to Berkeley in 2025 from De Anza & Foothill with a 4.0 GPA, and took CS 229 (Machine Learning) at Stanford in Summer 2025.
My current research is at UC Berkeley BAIR, where I serve as Engineering Co-Lead and co-first author on Agent's Last Exam (ALE), a benchmark project advised by Prof. Dawn Song and submitted to NeurIPS 2026. It evaluates AI agents across 90%+ of non-physical industries. Earlier, I worked on personalized federated learning and differential privacy with Prof. Hao Wang at Stevens Institute of Technology.
Outside of research, I'm building Ludus, an outcome-first marketplace where multiple AI agents execute the same task in parallel and users pay only for the result they keep. I also independently built Context8 (Oct 2025 – Jan 2026), a self-evolving knowledge network for AI agents.
Selected Publication
Agent's Last Exam (ALE): A Benchmark for Production-Grade AI Agent Evaluation
Yuanbo Pang (co-first author), et al. Submitted to NeurIPS 2026, Datasets and Benchmarks Track. Advised by Prof. Dawn Song, UC Berkeley BAIR.
ALE is built from 1,000+ real-world tasks submitted by 300+ industry experts, spanning 55 sub-fields and 13 industry clusters. As the primary builder, I filtered these noisy submissions down to ~300 tasks using parallel agents alongside my own review, then engineered each surviving task into a standardized, execution-ready benchmark under a unified evaluation architecture covering 90%+ of non-physical industries.
Selected Coursework
Data Structures (UC Berkeley)
Discrete Mathematics & Probability Theory (UC Berkeley)
Designing Information Devices and Systems I (UC Berkeley)
Machine Learning (Stanford, Summer 2025)
Featured Projects
Recent work across AI agent infrastructure, production full-stack systems, and robotics perception.
NeurIPS 2026 benchmark submission
Benchmark infrastructure for long-horizon AI agent workflows across 55 sub-fields and 13 industry clusters, built with executable task harnesses and deliverable-based scoring.
UC Berkeley small-object detection
Benchmarked UAV detection methods on 1,600 manually annotated DJI frames and evaluated optical-flow and background-compensation pipelines for low-false-positive tracking.
Experience
Research, building, and selected awarded projects.
Full-Stack Engineer
Ludus
- Building an outcome-first marketplace for AI agent work: multiple agents execute the same task in parallel, and users pay only for the result they keep.
- Integrated 9 task agents so far: Claude Code, Codex CLI, GitHub Copilot, Gemini CLI, Hermes, Factory Droid, Forge (Claude/Codex modes), and Perplexity Agent.
- Designed the full stack: React/TypeScript frontend on Cloudflare, Node.js/Hono backend on Railway, Neon Postgres, and Docker-based agent execution on VM nodes with remote dispatch.
Engineering Co-Lead — Agent's Last Exam (ALE)
UC Berkeley BAIR · Advised by Prof. Dawn Song
- Co-first author on a benchmark submitted to NeurIPS 2026 that evaluates whether AI agents can deliver production-grade economic value across 90%+ of non-physical industries.
- As the primary builder, filtered ~1,000 noisy expert-submitted tasks down to ~300 by orchestrating parallel agents alongside my own review.
- Engineered each surviving task into a standardized, execution-ready benchmark under a unified evaluation architecture covering 55 sub-fields across 13 industry clusters.
Software Engineering Intern
Geopogo
- Sole engineer on an AI architectural rendering platform shipped to production.
- Built end-to-end: Google Gemini API integration, RunwayML text-to-video pipeline, React/TypeScript frontend, FastAPI backend, Firebase auth, and Vercel CI/CD.
- Designed a responsive chat interface with drag-and-drop image upload and real-time previews, plus a credit-based subscription system.
Independent Project — Context8
context8.org
- Built solo: a self-evolving Stack Overflow for AI agents. A community vote/feedback loop turns one agent's solved bug into reusable knowledge for all agents.
- Two months after shipping, Andrew Ng's Context Hub and Evomap launched in adjacent directions, which supports the underlying thesis that agents need persistent, shared experience.
Research Intern
Prof. Hao Wang Lab · Stevens Institute of Technology
- Investigated personalized federated learning systems with differential privacy guarantees.
- Implemented training pipelines in PyTorch and TensorFlow, focusing on protecting local data while accelerating on-device model personalization.
Student Ambassador
Fetch.ai Innovation Lab
- Mentored 20+ teams at Cal Hacks 11.0 (UC Berkeley) and SF Hacks on full-stack and agent development.
- Scouted early-stage startups at Bay Area Founders Club demo summits (5,000+ startups, 1,000+ VCs) on behalf of Fetch.ai and reported shortlists to the Innovation Lab lead for investment follow-up.
AI Mario Level Generator
Cambridge, MA
- Built a sketch-to-playable-level pipeline: LLaVA 1.5 for sketch interpretation, OpenCV for layout extraction, H100 GPU inference on Modal.
- FastAPI backend + React frontend; live demo available.
Stud.ai
Cambridge, MA
- Chrome extension + AI agent that turns assignment rubrics into step-by-step timelines and auto-schedules work blocks on students' calendars.
- FastAPI backend with uAgents framework for autonomous task handling.
GetResearch
Davis, CA
- Platform connecting students with research opportunities and professors, with real-time project listings and a streamlined application flow.
- PropelAuth + PostgreSQL, FastAPI backend, React frontend, deployed on AWS.
Skills & Tools
The toolkit I use to research, evaluate, and ship AI agent systems.
Programming Languages
Machine Learning & AI
AI Coding Agents
Backend & Data
Frontend & Infrastructure
Areas of Focus
Get In Touch
I'm always open to discussing new opportunities, research collaborations, or interesting projects. Feel free to reach out!
Let's Build Something Amazing
Whether you're working on AI agent evaluation, infrastructure for the agent ecosystem, or a research collaboration in privacy-preserving ML, I'd love to hear from you.
Send Me an Email