Yuanbo Pang

UC Berkeley EECS · AI Agent Research & Infrastructure

I study how AI agents can deliver real economic value, and build the infrastructure to evaluate and route them. Co-first author on Agent's Last Exam (ALE), a NeurIPS 2026 benchmark advised by Prof. Dawn Song at UC Berkeley BAIR. Full-stack engineer building Ludus, an outcome-first marketplace where AI agents compete on real tasks.

About

Research and building at the intersection of AI agents, evaluation, and infrastructure.

Education

UC Berkeley EECS (transferred 2025) · Stanford CS 229 · 4.0 GPA at De Anza & Foothill

Research

Agent evaluation at UC Berkeley BAIR (Prof. Dawn Song); federated learning at Stevens Institute (Prof. Hao Wang)

Building

Ludus — AI agent marketplace (Full-Stack Engineer); Context8 — agent knowledge network

I'm a junior at UC Berkeley EECS, studying how AI agents can deliver production-grade economic value and building the infrastructure that evaluates and routes them. I transferred into Berkeley in 2025 from De Anza & Foothill with a 4.0 GPA, and previously took CS 229 (Machine Learning) at Stanford in Summer 2025.

My current research is at UC Berkeley BAIR, where I serve as Engineering Co-Lead and co-first author on Agent's Last Exam (ALE) — a NeurIPS 2026 benchmark advised by Prof. Dawn Song that evaluates AI agents across 90%+ of non-physical industries. Earlier, I worked on personalized federated learning and differential privacy with Prof. Hao Wang at Stevens Institute of Technology.

Outside of research, I'm building Ludus, an outcome-first marketplace where multiple AI agents execute the same task in parallel and users pay only for the result they keep. I also independently built Context8 (Oct 2025 – Jan 2026), a self-evolving knowledge network for AI agents.

Selected Publication

Agent's Last Exam (ALE): A Benchmark for Production-Grade AI Agent Evaluation

Yuanbo Pang (co-first author), et al. Submitted to NeurIPS 2026, Datasets and Benchmarks Track. Advised by Prof. Dawn Song, UC Berkeley BAIR.

ALE is a benchmark of 1,000+ real-world tasks across 55 sub-fields and 13 industry clusters, contributed by 300+ industry experts. As the primary builder, I filtered the noisy expert submissions down to ~300 tasks using parallel agents, then engineered each into a standardized, execution-ready benchmark under a unified evaluation architecture covering 90%+ of non-physical industries.

Selected Coursework

CS 61B

Data Structures (UC Berkeley)

CS 70

Discrete Mathematics & Probability Theory (UC Berkeley)

EECS 16A

Designing Information Devices and Systems I (UC Berkeley)

CS 229

Machine Learning (Stanford, Summer 2025)

Featured Projects

Recent work across AI agent infrastructure, production full-stack systems, and robotics perception.

Ludus

AI agent marketplace

Outcome-first marketplace where multiple coding and research agents execute the same task in parallel, with users paying only for accepted results.

AI Agents · React · Hono · Postgres · Docker · VM Dispatch

Context8

Agent knowledge system

Self-evolving Stack Overflow for AI agents where community votes and feedback turn one agent's solved bug into reusable knowledge for future agents.

AI Agents · Knowledge Network · Full Stack

Agent's Last Exam

NeurIPS 2026 benchmark submission

Benchmark infrastructure for long-horizon AI agent workflows across 55 sub-fields and 13 industry clusters, built with executable task harnesses and deliverable-based scoring.

Benchmark · Evaluation · Agent Harnesses · VM Workflows

AI Mario Level Generator

HackMIT 2025 Modal Sponsor Prize

Sketch-to-playable-level pipeline using LLaVA 1.5, H100 GPU inference on Modal, OpenCV layout extraction, FastAPI, and React.

Computer Vision · LLaVA · Modal · FastAPI · React

UAV Perception Research

UC Berkeley small-object detection

Benchmarked UAV detection methods against 1,600 manually annotated DJI frames and evaluated optical-flow/background-compensation pipelines for low-false-positive tracking.

Robotics · UAV Detection · Optical Flow · OpenCV

Geopogo AI Rendering

Production AI rendering platform

Production architectural rendering platform integrating Gemini image reasoning, RunwayML text-to-video workflows, Firebase auth, and Vercel CI/CD.

React · FastAPI · Gemini API · RunwayML · CI/CD

Experience

Research, building, and selected awarded projects.

Full-Stack Engineer

Ludus

Apr 2026 – Present
  • Building an outcome-first marketplace for AI agent work: multiple agents execute the same task in parallel, and users pay only for the result they keep.
  • Integrated 9 task agents and growing — Claude Code, Codex CLI, GitHub Copilot, Gemini CLI, Hermes, Factory Droid, Forge (Claude/Codex modes), and Perplexity Agent.
  • Designed the full stack: React/TypeScript frontend on Cloudflare, Node.js/Hono backend on Railway, Neon Postgres, and Docker-based agent execution on VM nodes with remote dispatch.
AI Agents · Marketplace · TypeScript · Hono · Postgres · Cloudflare

Engineering Co-Lead — Agent's Last Exam (ALE)

UC Berkeley BAIR · Advised by Prof. Dawn Song

NeurIPS 2026 submission · co-first author
Jan 2026 – Present
  • Co-first author on a NeurIPS 2026 benchmark evaluating whether AI agents can deliver production-grade economic value across 90%+ of non-physical industries.
  • As the primary builder, filtered ~1,000 noisy expert-submitted tasks to ~300 by orchestrating parallel agents alongside my own review.
  • Engineered each surviving task into a standardized, execution-ready benchmark under a unified evaluation architecture covering 55 sub-fields across 13 industry clusters.
Benchmark · AI Agent Evaluation · Research · NeurIPS 2026

Software Engineering Intern

Geopogo

Sep 2025 – Dec 2025
  • Sole engineer on an AI architectural rendering platform shipped to production.
  • Built end-to-end: Google Gemini API integration, RunwayML text-to-video pipeline, React/TypeScript frontend, FastAPI backend, Firebase auth, Vercel CI/CD.
  • Designed a responsive chat interface with drag-and-drop image upload, real-time previews, and a credit-based subscription system.
React · TypeScript · Gemini API · RunwayML · Firebase · FastAPI · Vercel

Independent Project — Context8

context8.org

Oct 2025 – Jan 2026
  • Built solo: a self-evolving Stack Overflow for AI agents. A community vote/feedback loop turns one agent's solved bug into reusable knowledge for all agents.
  • Two months after shipping, Andrew Ng's Context Hub and Evomap launched in adjacent directions, validating the underlying thesis that agents need persistent, shared experience.
AI Agents · Knowledge Network · Solo Project

Research Intern

Prof. Hao Wang Lab · Stevens Institute of Technology

Jun 2024 – Feb 2026
  • Investigated personalized federated learning systems with differential privacy guarantees.
  • Implemented training pipelines in PyTorch and TensorFlow, focusing on protecting local data while accelerating on-device model personalization.
Federated Learning · Differential Privacy · PyTorch · TensorFlow · Research

Student Ambassador

Fetch.ai Innovation Lab

Sep 2024 – Oct 2025
  • Mentored 20+ teams at Cal Hacks 11.0 (UC Berkeley) and SF Hacks on full-stack and agent development.
  • Scouted early-stage startups at Bay Area Founders Club demo summits (5,000+ startups, 1,000+ VCs) on behalf of Fetch.ai; reported shortlists to the Innovation Lab lead for investment follow-up.
AI Agents · Mentorship · Deal Sourcing

AI Mario Level Generator

Cambridge, MA

Modal Sponsor Prize — HackMIT 2025
Sep 2025
  • Built a sketch-to-playable-level pipeline: LLaVA 1.5 for sketch interpretation, OpenCV for layout extraction, H100 GPU inference on Modal.
  • FastAPI backend + React frontend; live demo available.
Computer Vision · LLaVA · Modal · FastAPI · Game Dev

Stud.ai

Cambridge, MA

"Smartest AI Agent" Award — HackMIT 2024
Sep 2024
  • Chrome extension + AI agent that turns assignment rubrics into step-by-step timelines and auto-schedules work blocks on students' calendars.
  • FastAPI backend with uAgents framework for autonomous task handling.
AI Agents · Chrome Extension · FastAPI · uAgents

GetResearch

Davis, CA

Best Use of .Tech Domain — HackDavis 2024
Apr 2024
  • Platform connecting students with research opportunities and professors, with real-time project listings and a streamlined application flow.
  • PropelAuth + PostgreSQL, FastAPI backend, React frontend, deployed on AWS.
React · FastAPI · PostgreSQL · AWS

Skills & Tools

The toolkit I use to research, evaluate, and ship AI agent systems.

Programming Languages

Python · TypeScript · JavaScript · C++ · Java · Rust

Machine Learning & AI

PyTorch · TensorFlow · scikit-learn · LLaVA · Gemini API · OpenAI API · Anthropic API · Evaluation Harnesses · Computer Vision

AI Coding Agents

Claude Code · Codex CLI · Cursor · Gemini CLI · GitHub Copilot

Backend & Data

FastAPI · Node.js · Hono · Neon Postgres · PostgreSQL · SQL · REST APIs · Firebase

Frontend & Infrastructure

React · Next.js · Tailwind CSS · Cloudflare Workers · Vercel · Railway · Docker · Linux · Git · CI/CD · Testing · Modal

Areas of Focus

AI Agent Evaluation
Agent Infrastructure
LLM Applications
Federated Learning
Differential Privacy
Full-Stack Development
Production AI Systems
Robotics & Vision
System Design

Get In Touch

I'm always open to discussing new opportunities, research collaborations, or interesting projects. Feel free to reach out!

Let's Build Something Amazing

Whether you're working on AI agent evaluation, infrastructure for the agent ecosystem, or a research collaboration in privacy-preserving ML, I'd love to hear from you.
