LLM Eval Dashboard
Overview
Built an eval-driven development workflow with a web dashboard for tracking LLM output quality. The tool runs test suites against prompt versions, scores outputs using LLM-as-judge and heuristic checks, and visualizes regressions over time.
The Problem
Every time we tweaked a prompt, we had no systematic way to know if it got better or worse. We'd eyeball a few examples, ship it, and find out days later that edge cases had regressed. We needed the LLM equivalent of a test suite with CI.
Architecture
- Test suites — YAML files defining input-output pairs with metadata (category, difficulty, expected behavior).
- Runner — Python CLI that executes test suites against any OpenAI-compatible API. Supports parallel execution and caching.
- Scorers — Pluggable scoring functions: exact match, regex, LLM-as-judge (with customizable rubrics), and custom Python functions.
- Dashboard — Next.js app that reads eval results from a SQLite database. Shows pass rates, score distributions, and diff views between prompt versions.
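The pipeline above can be sketched in a few lines. This is an illustrative mockup, not the project's actual code: a plain dict stands in for a parsed YAML test suite, `run_suite`, `SCORERS`, and `fake_model` are hypothetical names, and the model call is stubbed where the real runner would hit an OpenAI-compatible API.

```python
import re

# A parsed test suite; in the real tool this comes from a YAML file
# with per-case metadata (category, difficulty, expected behavior).
SUITE = {
    "suite": "greeting-prompts",
    "cases": [
        {"input": "Say hello", "expected": "hello", "scorer": "contains"},
        {"input": "Give a 3-digit number", "expected": r"^\d{3}$", "scorer": "regex"},
    ],
}

# Pluggable scorers: each maps (output, expected) -> score in [0, 1].
# An LLM-as-judge scorer would slot in as just another callable.
SCORERS = {
    "exact": lambda out, exp: float(out.strip() == exp),
    "contains": lambda out, exp: float(exp.lower() in out.lower()),
    "regex": lambda out, exp: float(re.fullmatch(exp, out.strip()) is not None),
}

def run_suite(suite, model_fn):
    """Run every case through model_fn and score the outputs."""
    results = []
    for case in suite["cases"]:
        output = model_fn(case["input"])
        score = SCORERS[case["scorer"]](output, case["expected"])
        results.append({"input": case["input"], "score": score})
    pass_rate = sum(r["score"] for r in results) / len(results)
    return results, pass_rate

# Stand-in for a call to an OpenAI-compatible API.
fake_model = lambda prompt: "Hello there!" if "hello" in prompt else "123"

results, pass_rate = run_suite(SUITE, fake_model)
print(pass_rate)  # → 1.0 (both cases pass against the fake model)
```

Keeping scorers as simple callables is what makes the custom-Python-function scorer trivial to support: any `(output, expected) -> float` can be registered.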
Key Learnings
- LLM-as-judge is noisy but directionally useful. Individual scores vary run-to-run, but aggregate trends across 50+ test cases are reliable.
- The hardest part is writing good test cases. We spent more time curating the eval set than building the tooling. Good evals require real edge cases from production.
- Version everything. Prompts, model configs, and eval sets all need versioning. We use git hashes as prompt version IDs.
- Fast feedback loops change behavior. Once evals run in under 2 minutes, engineers actually use them before merging prompt changes.
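Content-based version IDs like the git hashes mentioned above can be computed without shelling out to git, since git hashes a blob as the SHA-1 of `b"blob <size>\0"` plus the file bytes. A minimal sketch, assuming a hypothetical `prompt_version_id` helper (not the project's actual implementation):

```python
import hashlib

def prompt_version_id(prompt_text: str) -> str:
    """Content-addressed version ID using git's blob hashing scheme.

    Matches `git hash-object` on a file with identical bytes, so the ID
    lines up with what git reports for the committed prompt file.
    """
    data = prompt_text.encode("utf-8")
    header = f"blob {len(data)}\0".encode("utf-8")
    return hashlib.sha1(header + data).hexdigest()[:12]

v1 = prompt_version_id("You are a helpful assistant.")
v2 = prompt_version_id("You are a helpful assistant!")
print(v1 != v2)  # any byte-level edit to the prompt yields a new version ID
```

Because the ID is a pure function of the prompt bytes, eval results stored against it remain unambiguous even if a prompt is later reverted or duplicated across branches.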
Results
- Caught 3 major regressions before they shipped, within the first month of use
- Eval suite grew to 200+ test cases across 8 product areas
- Reduced prompt iteration cycle from days to hours by giving engineers immediate quality feedback