{
  "title": "Recovering LLM-Persona Accuracies from Unlabeled Votes",
  "version": "1.0.0",
  "doi": "10.5281/zenodo.20498700",
  "doi_url": "https://doi.org/10.5281/zenodo.20498700",
  "zenodo_record": "https://zenodo.org/records/20498700",
  "record_id": "20498700",
  "publication_date": "2026-06-01",
  "resource_type": {
    "title": "Journal article",
    "type": "publication",
    "subtype": "article"
  },
  "creators": [
    {
      "name": "Daniel Ari Friedman",
      "affiliation": null,
      "orcid": "0000-0001-6232-9096"
    }
  ],
  "description": "<p>Algebraic (NTQR) evaluation infers how accurate a group of noisy classifiers was on a finite test using only their responses &mdash; no answer key. We test this end to end on real large language models. Three trader \"personas\" (optimistic, neutral, pessimistic), instantiated as system prompts, each make a binary bullish/bearish call on the same 64 market scenarios; we run the identical trio through six locally-hosted models via Ollama. For each model we recover per-persona, per-label accuracy with ErrorIndependentEvaluation (unsupervised) and score it against the authored ground truth (supervised), which is used only as a check. On the five models whose three judges all varied (mistral:latest, gemma4:latest, gemma3:4b, gemma2:2b, granite4.1:3b), the unsupervised algebra recovered persona accuracies to a mean absolute error of 0.012, within the 0.102 sampling-noise floor across all six per-label accuracy terms, with no labels -- including a persona's genuinely poor bullish accuracy of 0.57, recovered as 0.59. The other model collapsed at least one persona into a constant classifier (a judge that voted one way on all 64 scenarios), which makes the error-independent algebra unsolvable. The central, non-obvious result: inter-judge disagreement does not imply evaluability. Aggregate disagreement separated this run only because the unevaluable model(s) collapsed to 0.00; the five evaluable models spanned 0.03&ndash;0.23. What gates evaluation is a per-judge condition &mdash; every judge must vary (and answer) &mdash; not an ensemble one. We formalize this as a label-free evaluability diagnostic (a judge whose modal-vote fraction reaches 1.0 is a constant classifier; an unparseable vote is an abstention) that predicted exactly which models would be evaluable, before any solve and without ground truth. This is a concrete instance of the safety property the NTQR logic promises: it warns you when an ensemble is not good enough to be evaluated. A scenario bootstrap puts a 95% CI of [0.000, 0.038] on the recovery MAE (well inside the 0.102 noise floor), and a deterministic synthetic study generalizes the recovery beyond the finite set of real evaluable models &mdash; error falls like 1/&radic;Q (slope -0.58, stable across ensembles) &mdash; while mapping two honest limits: the built-in failure alarm catches anti-correlated judges with no false positives yet can miss positively-correlated (shared-training) errors, and the two-solution tie-break inverts once judges are no longer clearly better than random &mdash; exactly where simple majority-voting evaluation, though biased, is the more robust fallback. --- Associated artifacts GitHub release: v1.0.0 (https://github.com/docxology/ntqr_llm/releases/tag/v1.0.0) DOI: https://doi.org/10.5281/zenodo.20498699 Zenodo: https://zenodo.org/records/20498699 PDF SHA-256: e1196698427f9fe04d1f3071705adb6e5459983649c78d7f5d074756e989148b</p>",
  "keywords": [
    "algebraic evaluation",
    "NTQR",
    "unsupervised evaluation",
    "evaluation on unlabeled data",
    "LLM-as-judge",
    "error-independent evaluation",
    "ensemble evaluability",
    "constant classifier",
    "AI safety warning light",
    "reproducible research",
    "answer-key-free recovery",
    "local large language models"
  ],
  "files": [
    {
      "name": "Friedman_2026_Recovering_e1196698.pdf",
      "size_bytes": 2270020,
      "checksum": "md5:585821d541c33c086dcb3ab988d19b5b",
      "download_url": "https://zenodo.org/api/records/20498700/files/Friedman_2026_Recovering_e1196698.pdf/content"
    }
  ],
  "related_resources": [
    {
      "type": "repository",
      "url": "https://github.com/docxology/ntqr_llm"
    }
  ],
  "github_repo": "docxology/ntqr_llm",
  "github_release_url": "https://github.com/docxology/ntqr_llm/releases/tag/v1.0.0",
  "release_tag": "v1.0.0",
  "release_name": "Recovering LLM-Persona Accuracies from Unlabeled Votes (v1.0.0)",
  "pdf_sha256": "e1196698427f9fe04d1f3071705adb6e5459983649c78d7f5d074756e989148b",
  "pairing_confidence": "strong",
  "pairing_evidence": [
    "github_release_mentions_doi",
    "github_release_mentions_zenodo_record",
    "zenodo_related_identifier_mentions_release",
    "github_repo_self_linked",
    "title_overlap"
  ],
  "checked_at": "2026-06-04T20:45:04Z"
}