
LLM Eval Dashboard

2 min read · Evals · Next.js · Python · OpenAI

Overview

Built an eval-driven development workflow with a web dashboard for tracking LLM output quality. The tool runs test suites against prompt versions, scores outputs using LLM-as-judge and heuristic checks, and visualizes regressions over time.

The Problem

Every time we tweaked a prompt, we had no systematic way to know if it got better or worse. We'd eyeball a few examples, ship it, and find out days later that edge cases had regressed. We needed the LLM equivalent of a test suite with CI.

Architecture

  • Test suites — YAML files defining input-output pairs with metadata (category, difficulty, expected behavior).
  • Runner — Python CLI that executes test suites against any OpenAI-compatible API. Supports parallel execution and caching.
  • Scorers — Pluggable scoring functions: exact match, regex, LLM-as-judge (with customizable rubrics), and custom Python functions.
  • Dashboard — Next.js app that reads eval results from a SQLite database. Shows pass rates, score distributions, and diff views between prompt versions.
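A minimal sketch of what the pluggable scorer interface described above could look like. The registry mechanism, function names, and signatures here are assumptions for illustration, not the project's actual API: each scorer maps an (output, expected) pair to a score in [0, 1], and test suites reference scorers by name.

```python
import re
from typing import Callable

# Hypothetical scorer registry: YAML test cases reference scorers by name.
SCORERS: dict[str, Callable[[str, str], float]] = {}

def scorer(name: str):
    """Register a scoring function under a name usable from test suites."""
    def register(fn: Callable[[str, str], float]):
        SCORERS[name] = fn
        return fn
    return register

@scorer("exact_match")
def exact_match(output: str, expected: str) -> float:
    # Strict comparison after trimming whitespace.
    return 1.0 if output.strip() == expected.strip() else 0.0

@scorer("regex")
def regex_match(output: str, pattern: str) -> float:
    # Pass if the expected pattern appears anywhere in the output.
    return 1.0 if re.search(pattern, output) else 0.0

def run_case(output: str, expected: str, scorer_name: str) -> float:
    """Score a single model output with the named scorer."""
    return SCORERS[scorer_name](output, expected)
```

An LLM-as-judge scorer would register the same way, calling the judge model with a rubric and parsing its verdict into a score, which is what keeps the runner agnostic to how scoring happens.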

Key Learnings

  • LLM-as-judge is noisy but directionally useful. Individual scores vary run-to-run, but aggregate trends across 50+ test cases are reliable.
  • The hardest part is writing good test cases. We spent more time curating the eval set than building the tooling. Good evals require real edge cases from production.
  • Version everything. Prompts, model configs, and eval sets all need versioning. We use git hashes as prompt version IDs.
  • Fast feedback loops change behavior. Once evals run in under 2 minutes, engineers actually use them before merging prompt changes.
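The "noisy but directionally useful" point can be made concrete: individual judge scores are unreliable, so regressions are flagged on aggregate pass rates between prompt versions rather than per-case scores. The function names and the delta threshold below are illustrative assumptions, not the project's code.

```python
def pass_rate(scores: list[float], threshold: float = 0.5) -> float:
    """Fraction of test cases scoring at or above the pass threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

def compare_versions(baseline: list[float], candidate: list[float],
                     min_delta: float = 0.02) -> dict:
    """Flag a regression when the candidate prompt's aggregate pass rate
    drops by more than min_delta relative to the baseline prompt.

    min_delta absorbs run-to-run judge noise; the right value depends
    on suite size (assumed here, not taken from the project)."""
    base, cand = pass_rate(baseline), pass_rate(candidate)
    delta = cand - base
    return {
        "baseline": base,
        "candidate": cand,
        "delta": delta,
        "regression": delta < -min_delta,
    }
```

Comparing only aggregates is why 50+ cases per area matters: with a handful of cases, judge noise alone can swing the pass rate past any reasonable threshold.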

Results

  • In the first month of use, caught 3 major regressions before they shipped
  • Eval suite grew to 200+ test cases across 8 product areas
  • Reduced prompt iteration cycle from days to hours by giving engineers immediate quality feedback