The Permanent Record

Originated as spec work for a client relationship in Assessment and Development.

Completed Scantron test sheets pinned to crossing clotheslines, like records hung on the line
vinsonconsulting/permanent-record (private)

Two AI models graded the same year of a user's work, then worked through their results together to determine the sources of variance, ending with a joint interview of the user to fill knowledge gaps and produce a final grade. This is the vector-search layer that fed them the passages and let every grade cite its source; LobeChat was the venue for model-to-model discussion and the user interview.

Claude 3.5 Sonnet, Gemini 1.5 Pro, Cloudflare Vectorize, Qdrant, Gemini Embeddings, LobeChat, Neon, Bun, TypeScript

Context

The client need was multi-evaluator grading at scale. Anyone who works with LLMs already cross-references them: a second model to check the first, catch a hallucination, or take over when one stalls. The Permanent Record pushes that habit to its end. Independent evaluators disagree, and the disagreement compounds when several grade the same work; averaging reports an agreement that does not exist and discards the reasons the evaluators diverged. So the system surfaces the disagreement first, reconciles it through structured dialogue against a shared rubric, and keeps a human in the loop to interrogate the evaluators and answer back. The output is a consensus with a reasoning trail you can audit, not a number you can only assert.

Approach

A three-round multi-agent system, adapted from a flat-structure peer-review model for AI evaluators:

  1. Blind Independence. Each evaluator grades against a shared rubric with no access to peer outputs. Variance is captured, not suppressed.
  2. Cross-Validation. Each evaluator sees its peers’ grades and rationales and can defend or revise its own. Differences of evidence get corrected; differences of judgment remain.
  3. Collaborative Synthesis. The evaluators reconcile through structured dialogue under a supervisor agent that manages turn-taking, producing a final assessment with a reasoning trail attached to every contested point.
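
The three rounds above can be sketched as a small orchestration function. This is a minimal illustration under stated assumptions, not the project's code: the names `Evaluator`, `runThreeRounds`, `spread`, and the stub scores are all hypothetical, evaluators are modelled as plain synchronous functions so the sketch runs without any model API (real model calls would be asynchronous), and "disagreement" is assumed to mean highest score minus lowest.

```typescript
// All names here are hypothetical; they illustrate the three rounds only.
type Grade = { evaluator: string; score: number; rationale: string };

interface Evaluator {
  name: string;
  // Round 1: grade against the rubric with no peer output in view.
  gradeBlind(work: string, rubric: string): Grade;
  // Round 2: see peer grades and rationales, then defend or revise.
  review(own: Grade, peers: Grade[]): Grade;
}

interface Outcome {
  blindSpread: number;   // disagreement captured after round 1
  revisedSpread: number; // disagreement left after round 2
  finalScore: number;
  trail: string[];       // reasoning attached to the result
}

// Highest minus lowest score: one simple, assumed measure of disagreement.
function spread(grades: Grade[]): number {
  const scores = grades.map((g) => g.score);
  return Math.max(...scores) - Math.min(...scores);
}

function runThreeRounds(
  work: string,
  rubric: string,
  evaluators: Evaluator[],
  // Round 3: a supervisor reconciles whatever disagreement survives.
  supervise: (grades: Grade[]) => { finalScore: number; trail: string[] },
): Outcome {
  // Round 1: every evaluator grades independently.
  const blind = evaluators.map((e) => ({ e, grade: e.gradeBlind(work, rubric) }));
  // Round 2: each evaluator reviews its own grade against everyone else's.
  const revised = blind.map(({ e, grade }) =>
    e.review(grade, blind.filter((b) => b.e !== e).map((b) => b.grade)),
  );
  // Round 3: hand the revised grades to the supervisor.
  const { finalScore, trail } = supervise(revised);
  return {
    blindSpread: spread(blind.map((b) => b.grade)),
    revisedSpread: spread(revised),
    finalScore,
    trail,
  };
}

// Stub evaluators standing in for model calls (illustrative scores only).
const stub = (name: string, first: number, second: number): Evaluator => ({
  name,
  gradeBlind: () => ({ evaluator: name, score: first, rationale: "blind read" }),
  review: (own, peers) => ({
    ...own,
    score: second,
    rationale: `revised after seeing ${peers.length} peer grade(s)`,
  }),
});

const outcome = runThreeRounds(
  "work under review",
  "shared rubric",
  [stub("a", 0.75, 0.5), stub("b", 0.25, 0.5)],
  (grades) => ({
    // Stub supervisor: the revised scores already agree, so take the shared value.
    finalScore: Math.max(...grades.map((g) => g.score)),
    trail: grades.map((g) => `${g.evaluator}: ${g.rationale}`),
  }),
);
```

Recording the spread after round 1 and again after round 2 keeps the disagreement visible as data instead of letting the final score absorb it.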

The rubric does the load-bearing work: it fixes what each grade means and requires cited evidence, so the models disagree against a standard instead of drifting. The system is built as a serverless RAG stack on Cloudflare’s edge, with vector search over indexed conversation archives and structured metadata feeding the multi-agent process.
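
To make the "cited evidence" requirement concrete, here is a minimal sketch of retrieval plus a citation check. It uses an in-memory cosine-similarity search as a stand-in for the vector index; it does not use or imitate the Cloudflare Vectorize or Qdrant APIs. The types `Passage` and `CitedGrade`, the functions `topK` and `isGrounded`, and the sample data are all hypothetical.

```typescript
// Hypothetical shapes; an in-memory stand-in for a vector index.
type Passage = { id: string; text: string; vector: number[]; source: string };
type CitedGrade = { criterion: string; score: number; citations: string[] };

const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);

// Cosine similarity between two embeddings; 0 if either vector is all zeros.
function cosine(a: number[], b: number[]): number {
  const denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
  return denom === 0 ? 0 : dot(a, b) / denom;
}

// Return the k passages closest to the query embedding, best match first.
function topK(query: number[], passages: Passage[], k: number): Passage[] {
  return [...passages]
    .sort((p, q) => cosine(query, q.vector) - cosine(query, p.vector))
    .slice(0, k);
}

// A grade counts as grounded only if it cites at least one passage
// and every citation points at a passage that was actually retrieved.
function isGrounded(grade: CitedGrade, retrieved: Passage[]): boolean {
  const known = new Set(retrieved.map((p) => p.id));
  return grade.citations.length > 0 && grade.citations.every((id) => known.has(id));
}

// Illustrative archive with toy two-dimensional embeddings.
const archive: Passage[] = [
  { id: "p1", text: "passage one", vector: [1, 0], source: "conversation-a" },
  { id: "p2", text: "passage two", vector: [0, 1], source: "conversation-b" },
  { id: "p3", text: "passage three", vector: [1, 1], source: "conversation-c" },
];

const retrieved = topK([1, 0], archive, 2);
const grounded = isGrounded({ criterion: "clarity", score: 0.5, citations: ["p1"] }, retrieved);
const ungrounded = isGrounded({ criterion: "clarity", score: 0.5, citations: ["p2"] }, retrieved);
```

Rejecting a grade whose citations fall outside the retrieved set is one way to keep evaluators arguing over the same evidence; the source does not say how the project enforces this.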

Outcome

In pilot, grade variance across evaluators narrowed from 0.26 to 0.04, roughly a 6.5x reduction, with the remaining gaps resolved on the record rather than averaged away. The deliverable is that record: the reasoning trail behind the consensus, every contested point and how it closed. The system is now in client testing as a containerized LobeHub distribution.

The pattern generalizes past grading. Any decision where independent LLM evaluators need a defensible, auditable consensus benefits from the same three rounds: surface the disagreement, reconcile it through structured dialogue, output a result with a reasoning trail. That applies to candidate screening, vendor selection, document review, and high-stakes prompt validation, anywhere averaging would hide the disagreement instead of resolving it.

The methodology, and the deliberately hard case it was proven on first, are written up in Forcing Consensus.