LLM Eval Dashboard
Overview
Built an eval-driven development workflow with a web dashboard for tracking LLM output quality. The tool runs test suites against prompt versions, scores outputs using LLM-as-judge and heuristic checks, and visualizes regressions over time.
The Problem
Every time we tweaked a prompt, we had no systematic way to know if it got better or worse. We'd eyeball a few examples, ship it, and find out days later that edge cases had regressed. We needed the LLM equivalent of a test suite with CI.
Architecture
- Test suites — YAML files defining input-output pairs with metadata (category, difficulty, expected behavior).
- Runner — Python CLI that executes test suites against any OpenAI-compatible API. Supports parallel execution and caching.
- Scorers — Pluggable scoring functions: exact match, regex, LLM-as-judge (with customizable rubrics), and custom Python functions.
- Dashboard — Next.js app that reads eval results from a SQLite database. Shows pass rates, score distributions, and diff views between prompt versions.
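The pipeline above can be sketched in a few lines. This is an illustrative mockup, not the project's actual code: a plain dict stands in for a parsed YAML test suite, `run_suite`, `SCORERS`, and `fake_model` are hypothetical names, and the model call is stubbed where the real runner would hit an OpenAI-compatible API.

```python
import re

# A parsed test suite; in the real tool this comes from a YAML file
# with per-case metadata (category, difficulty, expected behavior).
SUITE = {
    "suite": "greeting-prompts",
    "cases": [
        {"input": "Say hello", "expected": "hello", "scorer": "contains"},
        {"input": "Give a 3-digit number", "expected": r"^\d{3}$", "scorer": "regex"},
    ],
}

# Pluggable scorers: each maps (output, expected) -> score in [0, 1].
# An LLM-as-judge scorer would slot in as just another callable.
SCORERS = {
    "exact": lambda out, exp: float(out.strip() == exp),
    "contains": lambda out, exp: float(exp.lower() in out.lower()),
    "regex": lambda out, exp: float(re.fullmatch(exp, out.strip()) is not None),
}

def run_suite(suite, model_fn):
    """Run every case through model_fn and score the outputs."""
    results = []
    for case in suite["cases"]:
        output = model_fn(case["input"])
        score = SCORERS[case["scorer"]](output, case["expected"])
        results.append({"input": case["input"], "score": score})
    pass_rate = sum(r["score"] for r in results) / len(results)
    return results, pass_rate

# Stand-in for a call to an OpenAI-compatible API.
fake_model = lambda prompt: "Hello there!" if "hello" in prompt else "123"

results, pass_rate = run_suite(SUITE, fake_model)
print(pass_rate)  # → 1.0 (both cases pass against the fake model)
```

Keeping scorers as simple callables is what makes the custom-Python-function scorer trivial to support: any `(output, expected) -> float` can be registered.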
Key Learnings
- LLM-as-judge is noisy but directionally useful. Individual scores vary run-to-run, but aggregate trends across 50+ test cases are reliable.
- The hardest part is writing good test cases. We spent more time curating the eval set than building the tooling. Good evals require real edge cases from production.
- Version everything. Prompts, model configs, and eval sets all need versioning. We use git hashes as prompt version IDs.
- Fast feedback loops change behavior. Once evals run in under 2 minutes, engineers actually use them before merging prompt changes.
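Content-based version IDs like the git hashes mentioned above can be computed without shelling out to git, since git hashes a blob as the SHA-1 of `b"blob <size>\0"` plus the file bytes. A minimal sketch, assuming a hypothetical `prompt_version_id` helper (not the project's actual implementation):

```python
import hashlib

def prompt_version_id(prompt_text: str) -> str:
    """Content-addressed version ID using git's blob hashing scheme.

    Matches `git hash-object` on a file with identical bytes, so the ID
    lines up with what git reports for the committed prompt file.
    """
    data = prompt_text.encode("utf-8")
    header = f"blob {len(data)}\0".encode("utf-8")
    return hashlib.sha1(header + data).hexdigest()[:12]

v1 = prompt_version_id("You are a helpful assistant.")
v2 = prompt_version_id("You are a helpful assistant!")
print(v1 != v2)  # any byte-level edit to the prompt yields a new version ID
```

Because the ID is a pure function of the prompt bytes, eval results stored against it remain unambiguous even if a prompt is later reverted or duplicated across branches.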
Results
- Caught 3 major regressions before they shipped, within the first month of use
- Eval suite grew to 200+ test cases across 8 product areas
- Reduced prompt iteration cycle from days to hours by giving engineers immediate quality feedback