A monitoring system and orchestrator for LLM-as-a-Judge and LLMs-as-a-Jury.
The idea: instead of having a human reviewer check every output from your own or third-party LLMs, another LLM takes over that job — according to previously defined metrics and rules. It scales, it's reproducible, and it makes evaluation criteria explicit rather than implicit.
Instead of writing a new script for every use case that runs judgments against some LLM and generates reports, JudgeForge puts all of that into a flexible web app.
JudgeForge is a self-hosted platform that comes with everything needed to test the quality of bot responses regularly and continuously.
So:
Four containers, clear separation of concerns:
| Container | Role |
|---|---|
| postgres | Single source of truth. All tables, all data. |
| n8n | Judge engine. Reads pending items, calls the LLM, writes scores back. |
| api | Fastify + Drizzle ORM. Ingest, CRUD, analytics, import. |
| web | React + Vite via Nginx. Dashboard, explorer, management. |
The data model is generic (inspired by Langfuse): metrics (typed: numeric, boolean, categorical, text) → polymorphic scores per item and metric. Plus runs, run_items, rubrics, datasets, human_reviews.
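To make the shape of that model more concrete, here is a minimal TypeScript sketch of typed metrics and polymorphic score values. The names (`Metric`, `ScoreValue`, `isValidScore`, `categories`) are assumptions for illustration, not the actual JudgeForge schema.

```typescript
// Hypothetical types: names and fields are illustrative, not the real schema.
type MetricType = "numeric" | "boolean" | "categorical" | "text";

interface Metric {
  name: string;
  type: MetricType;
  categories?: string[]; // only meaningful for categorical metrics
}

// One score per item and metric; the shape of the value depends on the metric type.
type ScoreValue =
  | { type: "numeric"; value: number } // 0..1 scale
  | { type: "boolean"; value: boolean } // pass/fail
  | { type: "categorical"; value: string }
  | { type: "text"; value: string };

// Checks that a score value matches the metric it is recorded against.
function isValidScore(metric: Metric, score: ScoreValue): boolean {
  if (score.type !== metric.type) return false;
  switch (score.type) {
    case "numeric":
      return score.value >= 0 && score.value <= 1;
    case "categorical":
      // Without a declared category list, any label is accepted.
      return metric.categories === undefined || metric.categories.indexOf(score.value) !== -1;
    default:
      return true;
  }
}
```

A discriminated union keeps the "one score table, many value shapes" idea type-safe on the application side, whatever the storage layout looks like.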
The n8n side follows a sub-workflow pattern: a main workflow (schedule + claim) fans out into N sub-calls, one per judge. This allows multi-judge parallelization without touching the orchestration workflow.
What's the point of using n8n here? Why not hardcode everything directly in the software?
→ JudgeForge doesn't want to own the implementation of things like LangChain integrations, LLM credentials, endpoints, etc. — n8n keeps all of that up to date anyway. It also means you can observe, debug, and adjust the workflows there if you want to.
Future versions of JudgeForge will keep revisiting this architectural decision and adjust if needed.
Three ingest paths:
- POST /api/ingest with API key header: automated, e.g. directly from the application being evaluated

Each item lands as pending in the database. The judge scheduler claims items every minute (FOR UPDATE SKIP LOCKED, so multiple workers can run in parallel) and evaluates them.
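The automated path and the claim step could look roughly like the sketch below. The payload fields, the `x-api-key` header name, and the table and column names in the SQL are all assumptions; only the `/api/ingest` route and the `FOR UPDATE SKIP LOCKED` clause come from the description above.

```typescript
// Hypothetical payload and header name; check the API docs for the real contract.
interface IngestItem {
  question: string;
  answer: string;
  reference?: string; // gold answer, if available
  context?: string;
}

interface IngestRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Builds the HTTP request without sending it, so it can be handed to fetch().
function buildIngestRequest(baseUrl: string, apiKey: string, item: IngestItem): IngestRequest {
  const base = baseUrl.replace(/\/+$/, ""); // tolerate a trailing slash
  return {
    url: `${base}/api/ingest`,
    init: {
      method: "POST",
      headers: { "content-type": "application/json", "x-api-key": apiKey },
      body: JSON.stringify(item),
    },
  };
}

// Claim query sketch: table, column, and status names are assumptions.
// SKIP LOCKED makes concurrent workers skip rows that another worker already locked.
const CLAIM_PENDING_SQL = `
  UPDATE items
     SET status = 'judging'
   WHERE id IN (
     SELECT id
       FROM items
      WHERE status = 'pending'
      ORDER BY created_at
      LIMIT $1
        FOR UPDATE SKIP LOCKED
   )
  RETURNING id;
`;
```

In an application this would be used as `const { url, init } = buildIngestRequest(...)` followed by `await fetch(url, init)`. Because locked rows are skipped rather than waited on, two workers running the claim query at the same time receive disjoint sets of items.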
Metrics are the core of the platform — this is where you define what gets evaluated. Each metric has:
- a type: numeric (0..1 scale), boolean (pass/fail), categorical, text
- requires_reference: does the judge need a gold answer to compare against?
- requires_context: does it need additional context?
- judge_instructions: evaluation guide injected directly into the prompt; the judge references it in its reasoning
- explanation: plain-language description, automatically shows up in the app's FAQ

New metrics are picked up by the judge immediately; they attach to the default rubric automatically.
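As a rough illustration of how these fields might feed into a judge call, the sketch below assembles a prompt from a metric definition. The `name` and `type` fields, the `JudgeInput` shape, and the prompt layout are assumptions; the real prompt lives in the n8n workflow.

```typescript
// Hypothetical shapes: the requires_* / judge_instructions / explanation fields
// come from the text, everything else is assumed for illustration.
interface MetricDefinition {
  name: string;
  type: "numeric" | "boolean" | "categorical" | "text";
  requires_reference: boolean;
  requires_context: boolean;
  judge_instructions: string;
  explanation: string;
}

interface JudgeInput {
  question: string;
  answer: string;
  reference?: string;
  context?: string;
}

// Assembles a judge prompt; refuses to build one if a required input is missing.
function buildJudgePrompt(metric: MetricDefinition, input: JudgeInput): string {
  if (metric.requires_reference && !input.reference) {
    throw new Error(`Metric "${metric.name}" needs a gold answer`);
  }
  if (metric.requires_context && !input.context) {
    throw new Error(`Metric "${metric.name}" needs context`);
  }
  const parts = [
    `Metric: ${metric.name} (${metric.type})`,
    `Instructions: ${metric.judge_instructions}`,
    `Question: ${input.question}`,
    `Answer: ${input.answer}`,
  ];
  if (metric.requires_reference) parts.push(`Gold answer: ${input.reference}`);
  if (metric.requires_context) parts.push(`Context: ${input.context}`);
  return parts.join("\n");
}
```

Failing early when a gold answer or context is missing avoids spending a judge call on an item that cannot be scored for that metric.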
The most interesting feature: multiple judges in parallel per question. Each judge is individually configurable (name, model, provider). Scores are aggregated per metric (strategies: mean, majority, min/max — whichever makes sense for the metric type).
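The aggregation strategies named above can be sketched in a few lines. Function names and the tie-breaking rule are assumptions, not JudgeForge's actual implementation.

```typescript
// Illustrative aggregation helpers for a panel of judges.
type NumericStrategy = "mean" | "min" | "max";

// Aggregates numeric scores (0..1) from several judges into one value.
function aggregateNumeric(values: number[], strategy: NumericStrategy): number {
  if (values.length === 0) throw new Error("no scores to aggregate");
  switch (strategy) {
    case "mean":
      return values.reduce((sum, v) => sum + v, 0) / values.length;
    case "min":
      return Math.min(...values);
    case "max":
      return Math.max(...values);
  }
}

// Majority vote for boolean or categorical scores.
// Assumption: on a tie, the value that appears first in the input wins.
function majority<T extends string | boolean>(values: T[]): T {
  let best: T | undefined;
  let bestCount = 0;
  for (const candidate of values) {
    const count = values.filter((v) => v === candidate).length;
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }
  if (best === undefined) throw new Error("no scores to aggregate");
  return best;
}
```

Mean suits numeric metrics, majority suits boolean and categorical ones, and min or max gives a deliberately pessimistic or optimistic reading of the panel.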
Why this is useful: if one judge consistently deviates from the rest of the panel on a metric, either the metric definition has a problem or the provider is biased. The panel makes that visible.
Three ways to activate the panel for an item:
Multi-provider in the sub-workflow: a switch node routes by judge_provider to four branches (Azure / Mistral / Anthropic / Gemini), each branch has its own LLM node. Model hot-swap via judge_config — no workflow edit needed.
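The routing itself happens in an n8n switch node, but its logic can be mirrored in code to show the idea. The config shape, the lowercase provider keys, and the example model names are assumptions; only `judge_provider` and the four providers come from the text.

```typescript
// Hypothetical judge configuration, as it might be stored in judge_config.
interface JudgeConfig {
  name: string;
  judge_provider: string;
  model: string; // changed in config only, so the workflow stays untouched
}

type Branch = "Azure" | "Mistral" | "Anthropic" | "Gemini";

// Mirrors the switch node: pick a branch by provider, pass the model through.
function routeJudge(judge: JudgeConfig): { branch: Branch; model: string } {
  switch (judge.judge_provider) {
    case "azure":
      return { branch: "Azure", model: judge.model };
    case "mistral":
      return { branch: "Mistral", model: judge.model };
    case "anthropic":
      return { branch: "Anthropic", model: judge.model };
    case "gemini":
      return { branch: "Gemini", model: judge.model };
    default:
      throw new Error(`Unknown judge_provider: ${judge.judge_provider}`);
  }
}
```

Because the model name travels with the config rather than being fixed in a branch, swapping models is a data change and not a workflow edit.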
Every re-judge writes a new score row — no overwriting. Each score entry carries:
- judge_prompt_hash (djb2): detects when the system prompt has changed
- metric_schema_version: bumped on structural metric edits
- pending_trigger: why was it re-evaluated? (gold_edit, context_edit, manual_requeue)

The explorer has a history view per item: a line chart across all re-judges, with vertical markers for schema bumps (yellow) and prompt changes (amber). Makes it immediately obvious whether a metric change shifted the scores.
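For reference, djb2 is a small non-cryptographic string hash. The sketch below shows the classic additive variant and how consecutive score rows could be compared to place the history markers. Hashing UTF-16 code units, the hex encoding, the `ScoreRow` shape, and `findMarkers` are assumptions; JudgeForge's exact variant may differ.

```typescript
// Classic djb2: hash = hash * 33 + charCode, kept in unsigned 32-bit range.
function djb2(text: string): number {
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    hash = ((hash << 5) + hash + text.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Assumption: the hash is stored as a hex string.
function promptHash(systemPrompt: string): string {
  return djb2(systemPrompt).toString(16);
}

// Hypothetical score-row shape, reduced to the versioning fields.
interface ScoreRow {
  judge_prompt_hash: string;
  metric_schema_version: number;
  pending_trigger: string | null; // e.g. "gold_edit", "context_edit", "manual_requeue"
}

// Returns the indices in a chronological history where a marker would be drawn.
function findMarkers(history: ScoreRow[]): { promptChanges: number[]; schemaBumps: number[] } {
  const promptChanges: number[] = [];
  const schemaBumps: number[] = [];
  for (let i = 1; i < history.length; i++) {
    const prev = history[i - 1];
    const curr = history[i];
    if (prev === undefined || curr === undefined) continue;
    if (curr.judge_prompt_hash !== prev.judge_prompt_hash) promptChanges.push(i);
    if (curr.metric_schema_version !== prev.metric_schema_version) schemaBumps.push(i);
  }
  return { promptChanges, schemaBumps };
}
```

A cheap hash is enough here because the goal is change detection between versions of a prompt, not collision resistance.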
/docs

Self-contained: no external DB service, no manual seeding.
```bash
# Create volumes
docker volume create judgeforge_pgdata
docker volume create judgeforge_n8ndata

# Clone repo, set secrets
cp .env.example .env  # APP_PASSWORD, INGEST_API_KEY, SESSION_SECRET, DATABASE_URL
docker compose up -d --build
```
On first start: Postgres creates the DB, the API runs migrations and seed (idempotent). Then set up n8n once manually: owner account, Postgres credential (reachable internally via service name postgres), LLM credential, import the workflow.
Update:
```bash
git pull && docker compose up -d --build
```
Ports (all configurable via .env): web 8080, API 3000, n8n 5678, Postgres 5432.
For private repos on company servers: use an SSH deploy key with port 443 instead of 22 — typical corporate firewalls block outbound port 22.
The DATABASE_URL in .env determines which database the API runs against. Switching = change the URL, restart the stack.
H@ppy H@cking 🤖⚖️