
How We Validate Research at The Encoded Human Project

Pearl (AI Research Engine) · Eric Whitney, DO · March 25, 2026 · 3,608 words

Generated by Pearl — 3/26/2026


Epistemic ceiling: FACT / INTERPRETATION

Published by Eric Whitney, DO — Primary Investigator, The Encoded Human Project

AI Disclosed: This article was written with Pearl, the project's AI research partner. The validation architecture described here governs Pearl's own outputs.


The Problem We Are Solving

Artificial intelligence can fabricate with complete confidence.

Not occasionally. Not rarely. Structurally. A large language model generating a research claim has no internal mechanism that distinguishes "I retrieved this from a confirmed source" from "I assembled this from pattern — it sounds right." Both arrive with identical fluency. Both are delivered without hesitation. One is grounded. One is not.

This is not a criticism of AI. It is a description of the architecture. Understanding it is the first requirement for using AI responsibly in research.

The scale of the problem is now quantified. A 2025 PMC preprint study (PMC12583397) tested three leading AI models on retracted stem cell literature. DeepSeek fabricated citations in 88% of cases where it failed to identify retracted content. Grok 3 fabricated in 63% of unrecognized cases. Even ChatGPT-4o — the strongest performer — failed to recognize the retraction status of approximately 38% of papers it retrieved. A separate analysis of eight AI search platforms found over 60% of responses contained incorrect or misleading citations. These are not edge cases. They are the baseline behavior of the systems most researchers rely on.

Meanwhile, the source material itself is degrading. A January 2026 study in the BMJ (BMJ 2026;392:e087581) trained a machine learning model on known paper mill products and screened 2.6 million cancer research papers. It flagged approximately 261,000 — nearly 10% — as bearing textual similarities to paper mill output. The Retraction Watch database now contains tens of thousands of retracted papers. Wiley retracted more than 11,300 Hindawi articles in 2024 alone. The STM Integrity Hub intercepts approximately 1,000 suspected paper mill submissions every month — and that covers only the 40 publishers currently connected.

The problem, therefore, is not merely that AI hallucinates. It is that AI hallucinates into a literature that is itself increasingly contaminated. A validation system must account for both failure modes simultaneously: the fabrication risk of the AI doing the synthesis, and the integrity risk of the sources it retrieves.

The Encoded Human Project uses AI extensively — for knowledge synthesis, hypothesis generation, literature review, and cross-domain pattern recognition. Pearl, our AI research partner, has indexed 519,000+ research chunks across eight physiological operations and three densities of analysis. The scale of synthesis that becomes possible through AI is genuinely unprecedented.

But scale amplifies both signal and error. A system capable of connecting findings across 500,000 entries is equally capable of generating a citation that doesn't exist, citing a paper that has been retracted, attributing a study to the wrong author, or stating a probabilistic finding as if it were causal. Without a rigorous validation architecture, the output is unreliable regardless of how sophisticated the synthesis.

This document describes exactly how we catch those errors — before they reach you.


The Five Error Types We Are Hunting

Not all errors are equal. Our validation system is designed around five distinct error tiers, each requiring different detection methods:

Tier 1 — Hallucination

The most dangerous error. Fabricated information: an author name that doesn't exist, a paper that was never published, a statistic that was invented. T1 errors are structurally indistinguishable from accurate claims in the text. They require external verification — not internal review.

Example from our own work: A letter to the editor generated by Pearl cited "Frosst et al., Circulation, 2009." Frosst is a real researcher. The 1995 Nature Genetics paper is real. The 2009 Circulation citation does not exist. The citation was doing real argumentative work in the letter, but the paper behind it did not exist.

Tier 2 — Source Corruption

The citation is real. The paper exists. But the paper itself has been retracted, is a known paper mill product, or has been substantially contradicted by subsequent research. T2 errors are invisible to any system that only checks whether a reference exists — they require checking whether the reference is still trustworthy.

Example: A paper with 500 citations that has been contradicted by 50 subsequent studies is fundamentally different from one with 500 supporting citations. Traditional citation counts are blind to this distinction.

Tier 3 — Staleness

Internal inconsistency introduced by editing. A number stated in section 2 doesn't match section 4. A claim revised in one location wasn't updated in another. T3 errors are invisible to any single-pass review.

Tier 4 — Framing

Technically true but misleading. A real paper cited to support a claim that goes beyond what the evidence actually demonstrates. The mechanism is confirmed. The inference from the mechanism to the conclusion is overstated.

Example from our own work: An early draft of our microbiome ancestral encoding document stated that C-section delivery produces specific mental health outcomes. The citations were real. The mechanism was confirmed. The causal inference was not — the correct framing is that C-section delivery raises the biological vulnerability load that other systems must compensate for. Probabilistic, not deterministic. T4 errors are the most common failure mode in research synthesis and the hardest to catch algorithmically. They require judgment.

Tier 5 — Omission

Something important is missing. The argument is internally coherent but incomplete — a contradicting finding not acknowledged, a known limitation not stated, a methodological constraint not disclosed.
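For readers who think in data structures, the taxonomy can be written down as a small lookup. This is an illustrative sketch only, not the project's implementation: the names `ErrorTier`, `TIERS_BY_GATE`, and `gates_for` are assumptions, and the gate assignments are copied from the "Catches" lines in the pipeline description later in this article.

```python
from enum import Enum


class ErrorTier(Enum):
    T1 = "Hallucination"
    T2 = "Source Corruption"
    T3 = "Staleness"
    T4 = "Framing"
    T5 = "Omission"


# Gates 1-6 and the tiers each one targets, per the article's "Catches" lines.
# Gate 7 (Synthesis Judge) integrates all tiers and is left out of this map.
TIERS_BY_GATE = {
    1: {ErrorTier.T1},
    2: {ErrorTier.T2},
    3: {ErrorTier.T1, ErrorTier.T3},
    4: {ErrorTier.T4},
    5: {ErrorTier.T4, ErrorTier.T5},
    6: {ErrorTier.T3, ErrorTier.T5},
}


def gates_for(tier):
    """Return the gate numbers that target a given error tier, in order."""
    return sorted(gate for gate, tiers in TIERS_BY_GATE.items() if tier in tiers)
```

Calling `gates_for(ErrorTier.T4)` returns the two gates that look for framing errors, which makes it easy to see that every tier other than T2 is covered by more than one gate.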


The Seven-Agent Validation Pipeline

Every research document produced by The Encoded Human Project passes through a seven-agent validation pipeline before publication. Each agent is a separate AI instance with a discrete task. No single agent sees the whole problem. The architecture is deliberately decomposed because the errors that slip through monolithic review are precisely the errors that emerge from review fatigue — the same system that generated the claim cannot reliably catch its own blind spots.

Gate 1 — Reference Verifier

Catches: T1 (Hallucination)

Mechanically verifies every reference in the document. Web searches each PMID, DOI, author name, journal, and publication year independently against CrossRef, PubMed, arXiv, and Google Scholar. Returns a confirmed/unconfirmed status for every citation with specific failure notes.

This is the most straightforward check. It answers one question: does this paper actually exist in the form it is cited?
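The comparison step can be sketched as a pure function. The sketch assumes the external lookup has already happened and returned either a record or nothing; the function name `verify_reference` and the dictionary fields are hypothetical, and no real database API is called here. The example reuses the Frosst citation described earlier, compared against the 1995 paper as the nearest real record.

```python
def verify_reference(cited, record):
    """Compare a cited reference against a record returned by an external lookup.

    `record` is None when the lookup found nothing.
    Returns (status, failure_notes).
    """
    if record is None:
        return "unconfirmed", ["no matching record found"]
    notes = []
    for field in ("author", "journal", "year"):
        cited_value = str(cited.get(field, "")).strip().lower()
        found_value = str(record.get(field, "")).strip().lower()
        if cited_value != found_value:
            notes.append(
                f"{field}: cited {cited.get(field)!r}, found {record.get(field)!r}"
            )
    return ("confirmed" if not notes else "unconfirmed"), notes


# The citation as generated, and the real paper it appears to have drifted from.
cited = {"author": "Frosst", "journal": "Circulation", "year": 2009}
record = {"author": "Frosst", "journal": "Nature Genetics", "year": 1995}
status, notes = verify_reference(cited, record)
```

Here the author matches but the journal and year do not, so the citation comes back unconfirmed with two failure notes. A real author name is exactly what makes this kind of error look plausible on a casual read.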

Gate 2 — Source Integrity Auditor

Catches: T2 (Source Corruption)

This is where we depart from most AI-assisted research systems. Confirming a citation exists is necessary but insufficient. Gate 2 asks: should this paper still be trusted?

Gate 2 runs as a separate agent — not a second pass in Gate 1. The distinction matters architecturally: Gate 1 asks whether a paper exists; Gate 2 asks whether it should be trusted. These are epistemically distinct questions requiring different databases, different tool calls, and different failure modes. Collapsing them into a single agent would allow a real but retracted paper to pass Gate 1's existence check without triggering Gate 2's integrity audit.

This agent performs three checks per citation:

  • Retraction scan. Cross-references every cited paper against the Retraction Watch database and PubPeer (post-publication peer review comments). A paper that has been retracted, issued an expression of concern, or flagged for data integrity issues is caught here.

  • Citation-context analysis. Following the methodology pioneered by Scite's Smart Citations system (which has analyzed 1.4+ billion citation statements across 200+ million sources), this agent evaluates how a key paper has been cited by subsequent research — not just how many times. Has the finding been predominantly supported, contradicted, or merely mentioned? A paper with a high contradiction ratio triggers a flag regardless of its citation count.

  • Source quality signal. Checks whether the cited journal has been flagged for paper mill infiltration, delisted from major indexes, or identified by the Problematic Paper Screener's "tortured phrase" detection system.

When Gate 2 flags a source as compromised, the citation is quarantined. It cannot proceed through the pipeline until the investigator reviews the flag, confirms whether the concern is material, and either replaces the source, adds an appropriate caveat, or clears the flag with documented justification.
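A minimal sketch of how the three checks might combine into a quarantine decision follows. Everything named here is an assumption for illustration: the `IntegritySignals` fields stand in for results already fetched from the external databases, and `CONTRADICTION_THRESHOLD` is a placeholder value, since this article does not state the ratio at which a flag is raised.

```python
from dataclasses import dataclass


@dataclass
class IntegritySignals:
    retracted: bool              # retraction database hit
    expression_of_concern: bool  # expression of concern or integrity flag
    supporting: int              # citation statements that support the finding
    contrasting: int             # citation statements that contradict it
    journal_flagged: bool        # source-quality signal for the journal


CONTRADICTION_THRESHOLD = 0.25   # assumed placeholder, not a documented value


def contradiction_ratio(signals):
    """Share of judged citation statements (supporting + contrasting) that contradict."""
    judged = signals.supporting + signals.contrasting
    return signals.contrasting / judged if judged else 0.0


def audit(signals):
    """Return the checks that raised a flag; any flag quarantines the citation."""
    flags = []
    if signals.retracted or signals.expression_of_concern:
        flags.append("retraction scan")
    if contradiction_ratio(signals) > CONTRADICTION_THRESHOLD:
        flags.append("citation context")
    if signals.journal_flagged:
        flags.append("source quality")
    return flags
```

With hypothetical counts of 30 supporting and 20 contrasting statements, the ratio is 0.4 and the citation is quarantined on citation context alone, even though the paper is not retracted and its raw citation count looks healthy.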

Gate 3 — Claim Verifier

Catches: T1 (Hallucination), T3 (Staleness)

Checks every specific statistic, percentage, effect size, and quantitative claim against its stated source. The question is not whether the mechanism is plausible — it is whether this specific number appears in this specific paper.

Additionally, cross-checks internal consistency: does the same statistic appear identically everywhere it is cited within the document? If a figure was revised during editing, has it been updated in every location?
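The internal-consistency half of this gate can be sketched as a grouping problem. The sketch assumes an earlier step has already extracted each statistic as a `(label, value, location)` tuple; that extraction step, the function name `find_stale_statistics`, and the sample values are all hypothetical.

```python
from collections import defaultdict


def find_stale_statistics(mentions):
    """Group mentions by label and keep labels whose value differs by location.

    mentions: iterable of (label, value, location) tuples.
    Returns {label: [(value, location), ...]} for inconsistent labels only.
    """
    by_label = defaultdict(list)
    for label, value, location in mentions:
        by_label[label].append((value, location))
    return {
        label: seen
        for label, seen in by_label.items()
        if len({value for value, _ in seen}) > 1
    }


# Hypothetical mentions: one figure was revised in section 2 but not section 4.
mentions = [
    ("sample size", "n=120", "section 2"),
    ("sample size", "n=112", "section 4"),
    ("follow-up period", "12 months", "section 2"),
    ("follow-up period", "12 months", "section 5"),
]
stale = find_stale_statistics(mentions)
```

Only the sample size is reported, along with both locations, so the editor can see which occurrence needs updating.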

Gate 4 — Citation-Claim Mapper (The Judgment Layer)

Catches: T4 (Framing)

This is the most important agent. It reads every claim alongside its cited evidence and evaluates proportionality: is the inference the document is drawing consistent with what the cited paper actually demonstrates?

Gate 4 runs only on claims tagged FACT or HYPOTHESIS — not INTERPRETATION or SPECULATION. INTERPRETATION claims are already flagged as framework-dependent readings; adversarial review of an interpretation produces a counter-interpretation, not a verification result. SPECULATION is already disclosed as provisional. Restricting Gate 4 to FACT and HYPOTHESIS ensures the framing check applies where it matters most: where the document is making its strongest truth claims.

Gate 4 evaluates three dimensions:

  • Causal vs. correlational. Does the source demonstrate causation, or only correlation? If correlation, does the document's language imply causation?
  • Population scope. Does the source study the same population the document is making claims about? A finding in mouse models being cited to support a human clinical claim is a T4 flag.
  • Effect magnitude. Does the document's framing accurately represent the effect size? A statistically significant but clinically marginal finding cited as if it were transformative is a T4 flag.

T4 pause gate: When Gate 4 flags a framing error, the pipeline pauses. The specific overclaims are sent to the human investigator for review. Nothing proceeds until the framing is corrected or explicitly accepted with appropriate qualification. This gate is non-negotiable.
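The scoping rule and the pause behavior can be sketched together. The judgment itself is not something a few lines of code can capture, so the sketch takes it as a callable; `run_gate4`, the claim dictionaries, and the toy keyword check used in the example are all assumptions for illustration.

```python
GATE4_SCOPE = {"FACT", "HYPOTHESIS"}


def run_gate4(claims, has_framing_error):
    """Apply the framing check to in-scope claims only.

    claims: list of dicts with 'text' and 'badge' keys.
    has_framing_error: callable standing in for the agent's judgment.
    Returns (paused, flagged_claims); paused is True if anything was flagged.
    """
    in_scope = [claim for claim in claims if claim["badge"] in GATE4_SCOPE]
    flagged = [claim for claim in in_scope if has_framing_error(claim)]
    return bool(flagged), flagged


claims = [
    {"text": "C-section delivery produces specific mental health outcomes",
     "badge": "FACT"},
    {"text": "Through this framework, we read X as what produces Y",
     "badge": "INTERPRETATION"},
]

# Toy stand-in for the agent: treat the verb "produces" as causal language.
paused, flagged = run_gate4(claims, lambda claim: "produces" in claim["text"])
```

The FACT claim is flagged and the pipeline pauses. The INTERPRETATION claim uses the same verb but is never examined, because it falls outside the gate's scope.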

Gate 5 — Adversarial Reviewer (The Devil's Advocate)

Catches: T4 (Framing), T5 (Omission)

Scope: FACT and HYPOTHESIS claims only

This agent's explicit mission is to disprove the document's claims. Gate 5 is not looking for what confirms the argument — it is looking for what contradicts it.

The Adversarial Reviewer produces a structured counter-report with a maximum of three objections, ranked by severity. This ceiling is deliberate: unlimited flagging creates review fatigue and erodes trust in the pipeline. Three objections force prioritization — the agent must identify what matters most. The investigator sees the three strongest objections, not a list of marginal concerns.

The agent:

  • Searches for published findings that directly contradict the document's central claims
  • Identifies methodological limitations in cited studies that the document does not acknowledge
  • Flags alternative explanations for the phenomena the document attributes to specific mechanisms
  • Checks whether important replication failures or null results exist in the cited research area

This agent does not have the authority to block publication. It produces a structured counter-report: "Here is what the strongest objection to this document would look like." The investigator then decides whether the document adequately addresses those objections, or whether revisions are needed.

The Adversarial Reviewer exists because confirmation bias is the deepest failure mode of synthesis work. A system designed to build coherent arguments will naturally select evidence that supports coherence and deprioritize evidence that disrupts it. The antidote is an agent whose sole purpose is disruption.
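The three-objection ceiling is simple to express in code. In this sketch the severity scale (higher means more serious) and the names `MAX_OBJECTIONS` and `counter_report` are assumptions; how the agent assigns severity is not specified in this article.

```python
MAX_OBJECTIONS = 3


def counter_report(objections):
    """Keep only the strongest objections, most severe first.

    objections: list of (severity, text) tuples; higher severity is more serious.
    """
    ranked = sorted(objections, key=lambda objection: objection[0], reverse=True)
    return ranked[:MAX_OBJECTIONS]


# Hypothetical raw findings from one adversarial pass.
raw = [
    (2, "cited cohort is small"),
    (9, "two null replications are not acknowledged"),
    (5, "an alternative mechanism explains the same data"),
    (1, "minor wording concern"),
    (7, "effect was shown in mouse models only"),
]
report = counter_report(raw)
```

The two lowest-severity items never reach the investigator, which is the point of the cap: the report forces a ranking rather than a list.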

Gate 6 — Structural Reviewer

Catches: T3 (Staleness), T5 (Omission)

Reviews the document for logical consistency, internal contradictions, significant omissions, and gaps between premise and conclusion. This agent is not evaluating citations — it is evaluating reasoning.

Does the argument follow from its premises? Are there unstated assumptions? Does the conclusion require evidence that has not been presented? Is the scope of the conclusion proportional to the scope of the evidence?

Gate 7 — Synthesis Judge

Catches: All tiers — final integration

Receives synthesized outputs from Gates 1–6 — not the full document. Produces a formal verification report with:

  • Error tier breakdown (T1 through T5)
  • Specific corrections required
  • Adversarial findings summary with investigator response notes
  • Source integrity audit results
  • Confidence score per major claim (0–100, with methodology — see below)
  • Publication verdict: clear, conditional (specific revisions required), or hold (material concerns unresolved)
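One plausible reading of how the three verdicts relate to upstream results is sketched below. The decision rule is an assumption: this article names the verdicts but does not spell out the exact conditions, and `GateSummary` and its fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GateSummary:
    unresolved_quarantines: int = 0  # Gate 2 flags the investigator has not reviewed
    unresolved_t4_flags: int = 0     # Gate 4 pauses not yet corrected or accepted
    corrections: list = field(default_factory=list)  # specific revisions required


def verdict(summary):
    """Map a summary of Gates 1-6 to clear / conditional / hold."""
    if summary.unresolved_quarantines or summary.unresolved_t4_flags:
        return "hold"         # material concerns unresolved
    if summary.corrections:
        return "conditional"  # specific revisions required
    return "clear"
```

Under this reading, an unresolved quarantine or framing flag always outranks ordinary corrections, which matches the article's description of both as blocking checkpoints.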

The Confidence Scoring System

Every major claim carries a numeric confidence score (0–100) generated by the Synthesis Judge at Gate 7. The score is computed from five weighted components:

Component                           Weight  Rationale
Source verification status          30%     Binary anchor — a retracted or compromised source collapses the score regardless of other factors
Citation-context ratio              25%     How subsequent literature treated the finding is the best proxy for its durability
Replication count                   20%     Single studies do not earn high confidence regardless of quality
Cross-domain inference distance     15%     The further from the source domain, the lower the confidence ceiling
Sample size adequacy                10%     Relevant but least determinative — large poor-quality studies still fail

Two rules govern the score beyond the weighted components:

The ceiling rule: Cross-domain inference distance acts as a ceiling, not just a weight. A claim crossing from molecular biology to psychological implication cannot score above 65 regardless of other components. A claim crossing from biological mechanism to philosophical or spiritual inference cannot score above 50. These ceilings reflect the epistemic mode drop that applies at every density boundary: a biological FACT becomes a psychological INTERPRETATION at best, which carries a lower confidence ceiling by definition.

The floor rule: If source verification status flags a retraction or high contradiction ratio, the score floors at 20 regardless of other components. The other factors do not compensate for a compromised source.

Investigator review zone: Claims scoring between 45 and 65 where cross-domain inference distance is the primary limiting factor require investigator review before publication. This is the gray zone where the number is technically computable but the judgment behind the inference needs human eyes. Gate 7 flags these automatically.

Scores above 80 indicate strong convergent evidence. Scores between 50 and 80 indicate active research areas with mixed support. Scores below 50 indicate early-stage or contested findings — published with appropriate qualification.
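The arithmetic can be sketched as follows, using the weights and the two ceilings stated above. Several details are assumptions made only so the example runs: each component is taken to be a sub-score between 0.0 and 1.0, the floor rule is read as "a compromised source cannot score above 20", and "primary limiting factor" is approximated as "the claim crosses a domain boundary". The names `confidence`, `needs_review`, and the crossing labels are hypothetical.

```python
WEIGHTS = {
    "source_verification": 0.30,
    "citation_context": 0.25,
    "replication": 0.20,
    "inference_distance": 0.15,
    "sample_size": 0.10,
}

# Ceilings from the ceiling rule; "none" means no domain boundary is crossed.
CEILINGS = {"none": 100, "bio_to_psych": 65, "bio_to_philosophical": 50}

COMPROMISED_CAP = 20  # assumed reading of the floor rule


def confidence(components, crossing="none", source_compromised=False):
    """Weighted 0-100 score with the ceiling and floor rules applied."""
    raw = 100 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    score = min(raw, CEILINGS[crossing])
    if source_compromised:
        score = min(score, COMPROMISED_CAP)
    return round(score)


def needs_review(score, crossing):
    """Approximate the investigator review zone (45 to 65, cross-domain claims)."""
    return crossing != "none" and 45 <= score <= 65


# Hypothetical sub-scores for one claim.
claim = {
    "source_verification": 1.0,
    "citation_context": 0.8,
    "replication": 0.5,
    "inference_distance": 0.4,
    "sample_size": 0.7,
}
```

For these sub-scores the weighted sum is 30 + 20 + 10 + 6 + 7 = 73. The same claim crossing from molecular biology to psychological implication is held at 65 and lands in the review zone, and if its source were compromised it would drop to 20 regardless of the other components.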


External Validation Integrations

The seven-agent pipeline is our internal architecture. We also integrate with the external validation ecosystem:

OpenEvidence — For clinical claims that touch patient care, we cross-reference against OpenEvidence's evidence base, which holds exclusive full-text licensing agreements with the New England Journal of Medicine, JAMA, NCCN, and Cochrane Systematic Reviews.

Semantic Scholar + Consensus — For literature scope checks. Semantic Scholar's 200M+ paper corpus and citation network mapping help us verify that our literature coverage is not anomalously narrow.

Scite Smart Citations — For critical citations doing load-bearing argumentative work, we run an independent Scite check to verify the supporting/contrasting/mentioning ratio.

Retraction Watch + PubPeer — Direct database cross-reference at Gate 2 for every citation. Non-negotiable.


What This System Catches — And What It Doesn't

We are rigorous about naming the limits of our own validation.

What the pipeline catches reliably:

  • Completely fabricated authors, journals, and papers (T1)
  • Wrong years, wrong journals, wrong PMIDs on real papers (T1)
  • Retracted papers, papers with expressions of concern, papers flagged for integrity issues (T2)
  • Papers with high contradiction ratios in subsequent literature (T2)
  • Statistics that don't appear in their cited sources (T1)
  • Internal inconsistencies introduced by editing (T3)
  • Claims that contradict their own cited evidence (T4)
  • Claims that exceed the inferential scope of their cited evidence — on FACT and HYPOTHESIS claims (T4)
  • Logical gaps between evidence base and conclusions (T4, T5)
  • Missing contradictory evidence in the published literature — on FACT and HYPOTHESIS claims (T5)

What the pipeline does not fully solve:

  • Papers that exist but are behind paywalls the verification agents cannot access
  • Very recent publications not yet indexed in external databases
  • Source corruption in papers that have not yet been flagged — paper mill products that remain undetected in the literature (the BMJ study suggests this could be ~10% of certain fields)
  • The deepest T4 errors — where the inference is subtle, the overclaim is modest, and the judgment call is genuinely difficult
  • Cross-domain inferences where no direct replication pathway exists (e.g., from molecular biology to psychological experience)
  • T4 and T5 errors in INTERPRETATION and SPECULATION claims — these are outside Gate 4 and Gate 5 scope by design, because the human investigator is the appropriate judge of interpretive proportionality

For these categories, human investigator review remains the final gate.


The Epistemic Badge System

Every claim published at The Encoded Human Project carries an explicit epistemic ceiling badge:

FACT — Supported by peer-reviewed evidence, independently replicated, mechanistically explained, and confirmed through Gate 2 source integrity audit. We cite the source. The source has been verified as non-retracted, non-contradicted, and non-compromised. If we cannot meet all conditions, it is not FACT.

HYPOTHESIS — A testable proposition consistent with existing evidence but not yet confirmed. Stated as: "one possibility is..." or "if X were true, we would expect..."

INTERPRETATION — A grounded reading of data through the Encoded Human framework. Stated as: "through this framework, we read X as Y." The framework is the lens. The data is the ground.

SPECULATION — A creative or intuitive leap without direct evidence. Always flagged explicitly. Speculation is legitimate in science — it is where new questions come from. It is not legitimate when presented as established finding.

When a claim crosses density boundaries — from biological mechanism to psychological implication, for example — the epistemic mode drops at least one level. A biological FACT becomes a psychological INTERPRETATION at best. We mark these transitions explicitly rather than allowing the authority of the biological finding to silently transfer to the cross-domain inference.


Positioning in the Validation Landscape

We document where our system sits relative to the broader ecosystem — not to claim superiority, but to name what we have borrowed, what we have built, and where the gaps remain.

What we share with industry best practice:

  • Multi-agent decomposition (increasingly standard in autonomous AI research systems)
  • Retraction and PubPeer cross-referencing
  • Citation-context analysis (pioneered by Scite's Smart Citations across 1.4B+ citation statements)
  • Adversarial review agents

What we add:

  • The five-tier error taxonomy (T1–T5) that maps error types to specific detection methods
  • Explicit separation of Gate 1 (existence verification) and Gate 2 (integrity auditing) as architecturally distinct agents
  • The T4 pause gate — mandatory human-in-the-loop checkpoint for framing errors, scoped to FACT and HYPOTHESIS claims
  • Gate 5 (Adversarial Reviewer) capped at three ranked objections to prevent review fatigue
  • The confidence scoring system with defined weights, a cross-domain ceiling rule, a source-corruption floor rule, and an investigator review zone for the 45–65 gray band
  • The epistemic badge system with explicit density-boundary transitions
  • Open architecture documentation — this document itself — as a published commitment to methodological transparency

What the field does that we cannot yet replicate at our scale:

  • The STM Integrity Hub's cross-publisher duplicate submission detection
  • Scite's full 1.4B citation statement corpus (we query their system rather than replicating it)
  • The Problematic Paper Screener's 130-million-paper scan for tortured phrases

Why We Publish This

Because the research community deserves to know exactly how AI-assisted research is being validated — not in general terms, but in operational detail.

The Encoded Human Project is committed to the principle that AI disclosure is not enough. Disclosure says: AI was used here. What research integrity requires is transparency about how AI was used, what errors it is capable of producing, and what architecture is in place to catch those errors.

The current landscape makes the urgency clear. AI detection tools — the systems designed to flag whether text was AI-generated — misclassified 61% of essays by non-native English speakers as AI-generated in a Stanford study (Liang et al., Patterns, 2023), while classifying essays by native English speakers far more accurately. The approach of detecting "AI-ness" in text is proving fundamentally less reliable than the approach we take: verifying the integrity of claims, citations, and reasoning regardless of who or what produced them.

The question is not whether AI wrote this. The question is whether the evidence supports it, whether the sources are trustworthy, whether the inferences are proportional, and whether the limitations are disclosed. That is what our system evaluates.

This document is that transparency.

We do not claim our validation system is perfect. We claim it is honest, documented, and continuously improved. When errors slip through — and they will — we correct them publicly, name what the validation system missed, and improve the pipeline.

The science is only as trustworthy as the process behind it. This is our process.


The validation pipeline described in this article is open architecture. If you are building AI-assisted research systems and want to discuss implementation, contact the project at theencodedhumanproject.com.

Pearl is the AI research partner of The Encoded Human Project. Her validation outputs, hypothesis logs, and epistemic ledger are maintained as living documents and publicly accessible at theencodedhumanproject.com/validation-log. The pipeline described here governs her own outputs — including this article, which passed through the seven-agent system before publication.