{
  "title": "On-Policy Distillation as Active Inference in Finite Variational Models",
  "version": "1.0.2",
  "doi": "10.5281/zenodo.20749817",
  "doi_url": "https://doi.org/10.5281/zenodo.20749817",
  "zenodo_record": "https://zenodo.org/records/20749817",
  "record_id": "20749817",
  "publication_date": "2026-06-18",
  "resource_type": {
    "title": "Journal article",
    "type": "publication",
    "subtype": "article"
  },
  "creators": [
    {
      "name": "Daniel Ari Friedman",
      "affiliation": "Active Inference Institute",
      "orcid": "0000-0001-6232-9096"
    }
  ],
  "description": "Abstract\n\nThis paper formulates on-policy distillation as active inference in finite variational models, with exact claims only for declared objects and interpretive claims explicitly bounded outside them. In the construction, the intractable teacher policy plays the role of the generative model $p(o,s)$, the tractable student policy is the approximate posterior $q(s)$, and the per-token reverse-KL distillation loss is variational free energy up to the evidence constant, $F = D_{\\mathrm{KL}}(q\\,\\|\\,p(s\\mid o)) - \\log p(o)$, whose KL target is the teacher-induced posterior $p(s\\mid o)\\propto p(o,s)$ .\n\nThe title's \"as\" is therefore a scoped mathematical correspondence rather than the slogan OPD = Active Inference. Variational free energy names the realized-rollout distillation loss; expected free energy remains the planning-side objective by which the pymdp agent selects actions . On-policy student rollouts generate the observations on which the posterior is scored, connecting the construction to induced-distribution mismatch in imitation learning and exposure-bias analyses while preserving their different objectives, empirical regimes, and contested severity . Privileged traces and feedback play the role that train-time-only information plays in the LUPI/distillation lineage .\n\nFour deterministic witnesses instantiate the correspondence. A Bernoulli-Ising oracle couples a teacher's privileged variable to the answer through $\\lambda$, making $I(\\lambda)$ the teacher-student mutual information and the finite free-energy gap the toy distillation objective; the closed-form and independently recomputed mutual-information sweeps agree to machine precision (RMSE 2.1e-16 nats). A pymdp T-maze rollout supplies the on-policy student that samples its own observations under a privileged cue . A two-agent classroom pits a privileged teacher (cue validity 0.98) against an on-policy student (cue validity 0.5), measuring teacher belief entropy 0.247 nats versus student 0.347 nats and a mean reverse-KL distillation signal of 6.28 nats. A four-state/two-action sequential-shift witness shows teacher-forced train loss 0.333 nats underestimating student-induced test loss 0.409 nats, with deterministic on-policy correction reducing it to 0.096 nats.\n\nThese are toy, generated findings, not production-LLM measurements. Recent privileged-context, context-distillation, adaptive-teacher, freshness-aware OPD, RLHF/instruction-tuning, self-generated reasoning, Qwen OPD-vs-RL, and Thinking Machines replication reports remain external context rather than reproduced results . The supplemental sheaf/provenance layer keeps that boundary operational: every reported number is hydrated from a generated artifact, every figure is source-bound, and 16 / 16 invariant checks pass before rendering.\n\n---\nAssociated artifacts\nGitHub release: v1.0.2 (https://github.com/ActiveInferenceInstitute/on_policy_distillation/releases/tag/v1.0.2)\nDOI: https://doi.org/10.5281/zenodo.20749817\nZenodo: https://zenodo.org/records/20749817\nPDF SHA-256: c6b5ec494915e6e046f24cf723f8dbbf93a5b168544daed3cca14c089d4087aa",
  "keywords": [
    "on-policy distillation",
    "active inference",
    "self-distillation",
    "privileged information",
    "free energy principle",
    "reverse KL divergence",
    "pymdp",
    "sophisticated inference"
  ],
  "files": [
    {
      "name": "Friedman_2026_Onpolicy_c6b5ec49.pdf",
      "size_bytes": 8102450,
      "checksum": "md5:91b9bc1743901832f8b1814e8b6060fa",
      "download_url": "https://zenodo.org/api/records/20749817/files/Friedman_2026_Onpolicy_c6b5ec49.pdf/content"
    }
  ],
  "related_resources": [
    {
      "type": "repository",
      "url": "https://github.com/ActiveInferenceInstitute/on_policy_distillation"
    }
  ],
  "github_repo": "ActiveInferenceInstitute/on_policy_distillation",
  "github_release_url": "https://github.com/ActiveInferenceInstitute/on_policy_distillation/releases/tag/v1.0.2",
  "release_tag": "v1.0.2",
  "release_name": "On-Policy Distillation as Active Inference in Finite Variational Models (v1.0.2)",
  "pdf_sha256": "c6b5ec494915e6e046f24cf723f8dbbf93a5b168544daed3cca14c089d4087aa",
  "pairing_confidence": "strong",
  "pairing_evidence": [
    "github_release_mentions_doi",
    "github_release_mentions_zenodo_record",
    "zenodo_related_identifier_mentions_release",
    "github_repo_self_linked",
    "title_overlap"
  ],
  "checked_at": "2026-06-18T16:04:27Z"
}