MindLens·Lab

Pre-registration document

Draft v0.1 · living document

Phase 1 Research Plan & Pre-Registered Hypotheses

Author: Evelyn Kim · Last updated: 2026-05

§0

Status note

This document describes what the Phase 1 dataset is designed to test. It is a working pre-registration, not a results report. The hypotheses below specify direction and threshold up-front so that subsequent analyses cannot slip from exploratory to confirmatory after the data have been seen. Sections will be updated only with effect sizes and minor methodology refinements as data accrue; new hypotheses arising from exploratory analysis will be flagged as such and reported separately.

§1

Background

In July 2025, the originating paper “Artificial Intelligence in Emotional Intelligence Training for Autism” (Curieux Academic Journal) argued that AI tools could meaningfully support autistic learners in emotion recognition, while flagging a structural concern: most candidate AI systems are trained on emotion datasets that assume a single correct label per stimulus. The paper closed with the question of whether such a foundation would itself constrain what the resulting tools could teach.

MindLens Lab Phase 1 is the empirical follow-up. Rather than building tools on top of single-label datasets, the project pivots to building a plural-reading dataset first — one in which each social-emotional moment is read by many participants, with each reader's choice of emotion accompanied by their choice of perceptual cue. Phase 1 produces the foundation; Phase 2 will use it to design teaching materials; Phase 3 may use it to ground adaptive tools.

§2

Theoretical reframe — plural reading by default

The dataset's core theoretical commitment:

Emotion reading is plural by default, not by error.

That is, when readers diverge on a clip, divergence is the signal — not noise to be averaged toward a “ground truth.” This shifts what counts as data:

  • Traditional pipeline: clip → single expert label → train model → measure error against label
  • Plural pipeline: clip → distribution of reader labels → study the distribution as the phenomenon

The reframe does not deny that high-agreement clips exist (universally-read joy, universally-read anger). It claims that variance is itself a structured feature of the population — with predictors at the clip level, predictors at the reader level, and consequences for any tool built on the data.

This commitment is methodologically uncommon. Existing emotion-recognition corpora typically reduce variance through inter-rater reliability filtering or majority voting, treating divergent annotations as noise. Phase 1 deliberately preserves and studies the variance.

§3

Data collected per response

Each participant response captures, for one clip:

Field · Type · Description
selected_emotion · enum (9) · One of the locked 9-emotion taxonomy
secondary_emotions · multi-enum · Required when primary = mixed_more_than_one; ≥ 2 codes
selected_cues · multi-enum (9) · Which signals the reader used; multi-select
confidence_rating · 1–5 Likert · Self-reported confidence in the reading
free_text_reasoning · text ≤ 280 chars · Optional first-person reasoning

Each session captures, per participant per wave, immutable demographic snapshots: age, country (ISO-2), primary spoken language, English confidence (1–5), gender, cultural background (categorical: East Asian, SE Asian, South Asian, White European, Black African, Hispanic/Latino, MENA, Mixed, Other), and self-rated emotion-reading difficulty (1–4).

Each clip carries curator-assigned metadata — social_complexity (simple/moderate/complex), verbal_dependency (low/medium/high), and a target person description identifying whose emotion is being read.

Each clip optionally has one approved AI annotation (Claude Sonnet 4.6 in current configuration) with the AI's primary emotion, secondary emotions, cue selections, public-facing rationale, and ambiguity caveat.

9-emotion taxonomy

  • happy_amused
  • sad_disappointed
  • angry_frustrated
  • anxious_nervous
  • embarrassed_awkward
  • surprised
  • confused_uncertain
  • neutral_hard_to_tell
  • mixed_more_than_one

9-cue taxonomy

  • facial_expression
  • tone_of_voice
  • verbal_content
  • body_language
  • situation_context
  • timing_pacing
  • others_reaction
  • something_else
  • not_sure
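
A minimal sketch of the per-response record these fields and taxonomies describe, assuming a flat Python representation (the field names follow §3; the class, types, and validation are illustrative, not the project's actual schema):

```python
# Hypothetical per-response record; field names follow §3, everything else
# (the dataclass, types, assertions) is an assumption for illustration.
from dataclasses import dataclass, field
from typing import Optional

EMOTIONS = [
    "happy_amused", "sad_disappointed", "angry_frustrated", "anxious_nervous",
    "embarrassed_awkward", "surprised", "confused_uncertain",
    "neutral_hard_to_tell", "mixed_more_than_one",
]

CUES = [
    "facial_expression", "tone_of_voice", "verbal_content", "body_language",
    "situation_context", "timing_pacing", "others_reaction",
    "something_else", "not_sure",
]

@dataclass
class Response:
    clip_id: str
    participant_id: str
    selected_emotion: str                      # one of EMOTIONS
    selected_cues: list[str]                   # subset of CUES, multi-select
    confidence_rating: int                     # 1-5 Likert
    secondary_emotions: list[str] = field(default_factory=list)
    free_text_reasoning: Optional[str] = None  # <= 280 chars, optional

    def __post_init__(self):
        assert self.selected_emotion in EMOTIONS
        assert all(c in CUES for c in self.selected_cues)
        assert 1 <= self.confidence_rating <= 5
        if self.selected_emotion == "mixed_more_than_one":
            assert len(self.secondary_emotions) >= 2   # required per §3
        if self.free_text_reasoning is not None:
            assert len(self.free_text_reasoning) <= 280
```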

§4

Pre-registered hypotheses

Each hypothesis is stated with direction, the data fields it draws on, and a threshold that distinguishes a meaningful finding from chance. Effect-size reporting will accompany every result; null results will be reported with the same prominence as positive ones.

H1

Plurality is the norm, not the exception

Claim. Across all live clips with N ≥ 5 responses, the median clip-level Shannon entropy will exceed 1.5 bits (≈ 3 emotions effectively in active use), and the median modal share will fall below 70%. Fewer than 20% of clips will show “high consensus” (modal share > 80%).

Why it matters. This is the foundational empirical claim of the project. If H1 fails — if most clips show > 80% agreement — then the plural-reading frame is descriptively wrong and a single-label approach is more defensible than the project assumes.
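
A minimal sketch of the H1 gates, assuming responses have been grouped as clip_id → list of selected_emotion strings (that grouping and the function names are assumptions; the N ≥ 5 gate and the thresholds in the comments follow the claim):

```python
# Minimal sketch of the H1 clip-level metrics: entropy in bits, modal share.
import math
from collections import Counter

def clip_entropy(labels: list[str]) -> float:
    """Shannon entropy H = -sum p_i * log2(p_i) over the emotion distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def modal_share(labels: list[str]) -> float:
    """Share of responses matching the most-chosen emotion."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def h1_summary(responses_by_clip: dict[str, list[str]]) -> dict[str, float]:
    live = [v for v in responses_by_clip.values() if len(v) >= 5]
    entropies = sorted(clip_entropy(v) for v in live)
    shares = sorted(modal_share(v) for v in live)
    return {
        "median_entropy_bits": entropies[len(entropies) // 2],  # predicted > 1.5
        "median_modal_share": shares[len(shares) // 2],         # predicted < 0.70
        "high_consensus_share": sum(s > 0.80 for s in shares) / len(shares),  # predicted < 0.20
    }
```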

H2

Verbal dependency predicts agreement

Claim. Clips coded verbal_dependency = high will show systematically higher modal share than clips coded low. Predicted effect: Spearman ρ ≥ 0.30.

Mechanism. Verbal content reduces ambiguity by giving readers a shared linguistic anchor; non-verbal-only clips force readers to weight cues that vary in salience by individual.
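
A sketch of the H2 test, assuming each qualifying clip has been reduced to a (verbal_dependency, modal share) pair; the rank coding of the dependency levels is an assumption, and spearmanr is SciPy's standard implementation:

```python
# Sketch of the H2 correlation between verbal dependency and modal share.
from scipy.stats import spearmanr

DEP_RANK = {"low": 0, "medium": 1, "high": 2}   # assumed ordinal coding

def h2_test(clips: list[tuple[str, float]]):
    ranks = [DEP_RANK[dep] for dep, _ in clips]
    shares = [share for _, share in clips]
    rho, p = spearmanr(ranks, shares)
    return rho, p                                # H2 predicts rho >= 0.30
```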

H3

Social complexity predicts variance

Claim. Clips coded social_complexity = complex will show higher Shannon entropy than clips coded simple. Predicted gap: mean entropy(complex) − mean entropy(simple) ≥ 0.30 bits.

Mechanism. Complex social situations involve more candidate emotions (mixed states, role-dependent expectations, contextual contradictions), inviting more reader divergence.
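
A sketch of the H3 gap, reusing clip_entropy from the H1 sketch above; the input shape (clip → (social_complexity, emotion labels)) is an assumption:

```python
# Sketch of the H3 entropy gap between complexity strata.
# clip_entropy is as defined in the H1 sketch above.
from statistics import mean

def h3_entropy_gap(clips: dict[str, tuple[str, list[str]]]) -> float:
    by_level = {"simple": [], "complex": []}
    for complexity, labels in clips.values():
        if complexity in by_level and len(labels) >= 5:
            by_level[complexity].append(clip_entropy(labels))
    # H3 predicts this gap >= 0.30 bits
    return mean(by_level["complex"]) - mean(by_level["simple"])
```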

H4

Cue choice mediates emotion choice

Claim. Holding the clip constant, readers who cite facial_expression as a cue will show a different emotion distribution than readers who cite situation_context — even on the same clip. This would suggest that cue attention (not just cue availability) shapes reading.

Operationalization. For each clip with N ≥ 30 responses, partition responses by their dominant cue cluster and compare the resulting emotion distributions via χ². Predicted: ≥ 25% of qualified clips will show statistically distinct distributions across cue clusters at p < 0.05 with FDR correction.

Why it matters. This is the hinge of the plural-reading thesis. If H4 fails, divergence is just noise. If it holds, divergence is structured by attentional weighting, which is teachable.
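
A sketch of this operationalization, assuming each qualifying clip has already been reduced to a cue-cluster × emotion contingency table (that upstream reduction and the input names are assumptions; the N ≥ 30 gate, χ² test, and Benjamini-Hochberg FDR follow the text):

```python
# Sketch of the H4 test: per-clip chi-square over cue-cluster x emotion
# tables, with Benjamini-Hochberg FDR correction across clips.
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

def h4_test(clip_tables: dict[str, np.ndarray], alpha: float = 0.05):
    clip_ids, pvals = [], []
    for clip_id, table in clip_tables.items():
        if table.sum() >= 30:                        # H4's N >= 30 gate
            _, p, _, _ = chi2_contingency(table)
            clip_ids.append(clip_id)
            pvals.append(p)
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    # H4 predicts the rejected (distinct) fraction >= 0.25
    return reject.mean(), dict(zip(clip_ids, reject))
```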

H5

AI diverges from human modal in patterned ways

Claim. The AI's primary emotion will diverge from the human modal more frequently on clips coded verbal_dependency = high than on those coded low. Predicted gap: divergence rate (high) − divergence rate (low) ≥ 15 percentage points.

Mechanism. Language models weight linguistic content disproportionately when generating reasoning; on verbal-heavy clips, the AI may over-anchor on what is said rather than how it is said.

Counter-prediction. AI may also diverge more on verbal-low clips if its non-verbal cue inference is poor. Either direction is informative; the prediction's directional commitment makes it falsifiable.
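
A sketch of the H5 gap, assuming each clip with N ≥ 5 has been reduced to its verbal_dependency code, the AI's primary emotion, and the human modal emotion:

```python
# Sketch of the H5 divergence-rate gap between verbal-dependency strata.
def h5_divergence_gap(clips: list[tuple[str, str, str]]) -> float:
    """clips: (verbal_dependency, ai_primary, human_modal) per clip (assumed shape)."""
    def rate(level: str) -> float:
        rows = [(ai, modal) for dep, ai, modal in clips if dep == level]
        return sum(ai != modal for ai, modal in rows) / len(rows)
    # H5 predicts this gap >= 0.15 (15 percentage points)
    return rate("high") - rate("low")
```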

H6

Cross-cultural variance is structured

Claim. Among country subgroups with N ≥ 20 responses per country, at least one clip will show a statistically distinct emotion distribution between two countries (χ² with Bonferroni correction across compared pairs).

Status. Exploratory in direction. Even one such clip provides evidence that cultural context shapes reading; absence is also informative.
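
A sketch of the per-clip H6 screen, assuming each country subgroup has been reduced to a count vector over the 9 emotions; the Bonferroni correction across compared pairs and the N ≥ 20 gate follow the claim, the zero-column handling is an implementation assumption:

```python
# Sketch of pairwise country comparisons on one clip, Bonferroni-corrected.
from itertools import combinations
import numpy as np
from scipy.stats import chi2_contingency

def h6_distinct_pairs(counts_by_country: dict[str, np.ndarray], alpha: float = 0.05):
    eligible = {c: v for c, v in counts_by_country.items() if v.sum() >= 20}
    pairs = list(combinations(eligible, 2))
    corrected_alpha = alpha / len(pairs)             # Bonferroni across pairs
    distinct = []
    for a, b in pairs:
        table = np.vstack([eligible[a], eligible[b]])
        table = table[:, table.sum(axis=0) > 0]      # drop unused emotion columns
        _, p, _, _ = chi2_contingency(table)
        if p < corrected_alpha:
            distinct.append((a, b, p))
    return distinct                                  # H6 predicts >= 1 clip with a hit
```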

H7

Reader style stabilizes across clips

Claim. Within-participant, individual readers will show measurable internal consistency:

  1. Confidence rating: variance within a participant across clips will be smaller than variance between participants on the same clip (ICC(1,1) > 0.20 for confidence).
  2. Cue reliance: a participant's top-2 most-cited cues across their first 6 clips will predict their top-2 across remaining clips above chance (precision ≥ 60% under leave-3-out cross-validation).

Why it matters. “Reading style” as a measurable individual trait is the precondition for any Phase 2 personalization. If H7 fails, individual-level adaptation is not justified by the data.
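
A sketch of both H7 checks. The ICC(1,1) estimate uses a one-way random-effects ANOVA with participants as targets, handling unequal clip counts crudely via the mean group size; the cue-style check follows the first-6-versus-rest split in the claim rather than a full leave-3-out procedure. Input shapes are assumptions:

```python
# Sketch of the H7 reader-style consistency checks.
import numpy as np
from collections import Counter

def icc_1_1(conf_by_participant: dict[str, list[int]]) -> float:
    """One-way random-effects ICC(1,1) on confidence, participants as targets."""
    groups = [np.asarray(v, float) for v in conf_by_participant.values() if len(v) >= 2]
    k = np.mean([len(g) for g in groups])            # mean clips per participant
    grand = np.concatenate(groups).mean()
    ms_b = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) / (len(groups) - 1)
    ms_w = sum(((g - g.mean()) ** 2).sum() for g in groups) / sum(len(g) - 1 for g in groups)
    return (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)   # H7 predicts > 0.20

def top2_cues(cue_lists: list[list[str]]) -> set[str]:
    counts = Counter(c for cues in cue_lists for c in cues)
    return {cue for cue, _ in counts.most_common(2)}

def cue_style_precision(first6: list[list[str]], rest: list[list[str]]) -> float:
    """Share of the first-6-clips top-2 cues that remain top-2 afterwards."""
    predicted = top2_cues(first6)
    return len(predicted & top2_cues(rest)) / len(predicted)  # H7 predicts >= 0.60
```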

H8

Self-reported emotion-reading difficulty has external validity

Claim. Participants who self-rate emotion_reading_difficulty = 3 or 4 will show:

  1. Lower mean confidence on clips than those rating 1 or 2 (Cohen's d ≥ 0.30);
  2. Higher rate of choosing emotions outside the top-3 modal cluster on each clip (gap ≥ 5 percentage points).

Why it matters. Validates the self-rating instrument and provides a soft proxy for autistic / selective-mutism / social-anxiety adjacent reading patterns within an otherwise neurotypical-leaning sample, before Phase 2 brings in samples with confirmed diagnoses.
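
A sketch of the H8 part-1 effect size, assuming per-participant mean confidence has been aggregated for the two difficulty groups (that aggregation is an assumption; Cohen's d with pooled SD and the d ≥ 0.30 threshold follow the claim):

```python
# Sketch of Cohen's d between low-difficulty (self-rating 1-2) and
# high-difficulty (3-4) participants' mean confidence.
import numpy as np

def cohens_d(low: np.ndarray, high: np.ndarray) -> float:
    n1, n2 = len(low), len(high)
    pooled = np.sqrt(((n1 - 1) * low.var(ddof=1) + (n2 - 1) * high.var(ddof=1))
                     / (n1 + n2 - 2))
    return (low.mean() - high.mean()) / pooled       # H8 predicts d >= 0.30
```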

§5

Operational metrics

Metric · Definition · Reported when
Modal share · Count of modal emotion / total responses on clip · N ≥ 5
Shannon entropy · H = −Σ pᵢ log₂ pᵢ on emotion distribution · N ≥ 5
Krippendorff's α · Nominal data, single-clip variant · N ≥ 3
AI–human divergence rate · Proportion of clips where AI primary ≠ human modal · N ≥ 5 per clip
AI agreement share · Among human responses on a clip, share matching AI's primary · N ≥ 5 per clip
Outlier flag · Response is outside top-3 most-popular emotions · N ≥ 5 raters per clip
Reader-style consistency · ICC on confidence; leave-out prediction on cue rank · ≥ 6 clips per reader

All metrics will be reported with 95% bootstrap confidence intervals where N permits.
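
A sketch of that bootstrap interval: a percentile 95% CI for any scalar clip-level metric, resampling clips with replacement (the resample count and the percentile method are assumptions):

```python
# Sketch of a percentile bootstrap CI over clip-level metric values.
import numpy as np

def bootstrap_ci(values, stat=np.median, n_boot: int = 10_000, seed: int = 0):
    values = np.asarray(values, float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    stats = np.array([stat(row) for row in values[idx]])
    return tuple(np.percentile(stats, [2.5, 97.5]))  # 95% interval
```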

§6

Bridge to the originating paper

The Curieux paper (Kim, 2025) made three load-bearing assumptions that the Phase 1 dataset directly tests:

Assumption in paper · Status after Phase 1
Emotion-training data has reliable ground truth · Tested directly by H1. If H1 holds, the assumption is empirically weak; the dataset reframes “ground truth” as “modal of distribution.”
The autistic learner is the relevant population to study · Phase 1 establishes neurotypical baseline distributions first. Phase 2 brings in ASD samples for comparison; H8 provides a within-Phase-1 soft proxy.
Reduced ambiguity is the design goal of training tools · Tested by H4 + H7. If divergence is structured by individual cue style, then teaching cue meta-awareness may be a stronger goal than reducing ambiguity.

This document does not refute the originating paper. It treats the paper as the point of departure — the open questions at its closing become the falsifiable hypotheses above.

§7

Phase 2 design hypotheses (forward-looking)

Three hypotheses about how the Phase 1 dataset would shape Phase 2 teaching materials. They are made explicit so that Phase 2 design choices can be traced to Phase 1 evidence.

H2.1

Curriculum hypothesis

Clips with low entropy (high consensus) should be taught first; clips with high entropy should be deferred until learners have built pattern stability on the consensus clips. The Phase 1 dataset gives an empirical ordering.

H2.2

Cue-hierarchy hypothesis

Cues with the highest agreement-when-cited — i.e., when a reader cites this cue, others who cite it on the same clip tend to converge on emotion — should be foregrounded in early teaching. The dataset will rank the 9 cues on this metric.
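
One way the agreement-when-cited ranking could be computed: among responses on the same clip that cite a given cue, take the modal share of their emotion choices, then average over clips. The document does not fix the metric's exact form, so this operationalization (and the minimum citing-group size) is an assumption:

```python
# Sketch of an assumed "agreement-when-cited" score per cue.
from collections import Counter, defaultdict

def agreement_when_cited(responses: list[dict]) -> dict[str, float]:
    """responses: dicts with clip_id, selected_emotion, selected_cues (assumed shape)."""
    citing = defaultdict(list)
    for r in responses:
        for cue in r["selected_cues"]:
            citing[(r["clip_id"], cue)].append(r["selected_emotion"])
    per_cue = defaultdict(list)
    for (_, cue), emotions in citing.items():
        if len(emotions) >= 3:                       # assumed minimum citing group
            per_cue[cue].append(Counter(emotions).most_common(1)[0][1] / len(emotions))
    return {cue: sum(v) / len(v) for cue, v in per_cue.items()}  # rank the 9 cues
```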

H2.3

AI-as-mirror hypothesis

The AI's most-divergent answers should be exposed in teaching materials as a structured artifact — not “AI got it wrong” but “where AI sees differently than most humans, what cue did it weight?” — to teach learners cue meta-awareness from a third-person comparison rather than abstract description.

§8

Limitations and ethical scope

  • Sample bias. Recruitment via personal networks and an international school skews the sample WEIRD-leaning, although international representation partially mitigates monoculturalism. Country-stratified analyses (H6) require interpretation under this constraint.
  • Self-selection. Participants opt in; no randomization is possible at the participant level.
  • Single AI. The Claude family is the only AI in the comparison loop in v1. Findings about AI divergence (H5) do not generalize across model families without replication.
  • Coarse taxonomy. The 9-emotion / 9-cue taxonomy is intentionally compact for v1; the cost is loss of resolution at the boundary cases.
  • Free-text bound. 280-character limit constrains qualitative depth; reasoning analyses must be cautious.
  • No clinical samples. Phase 1 is descriptive of a non-clinical sample. ASD / SM / SAD claims await Phase 2 with appropriate consent and partner organizations.
  • Plurality is descriptive, not prescriptive. Findings about what readers do should not be interpreted as prescriptions for what learners should perceive.

§9

Outputs

Three planned outputs from Phase 1:

  1. Methods paper — describes the dataset, taxonomy, cue annotation protocol, and the AI-as-comparison-point methodology. Targets a methods venue or open-science journal.
  2. Findings paper — reports H1–H8 with effect sizes and confidence intervals. Targets an emotion / social cognition / HCI venue.
  3. Open dataset release — anonymized, citable, versioned, released under a permissive license at the close of each wave. The cite-this-dataset block is already live on /findings.

§10

Authorship & advising

  • Phase 1 lead: Evelyn Kim (high school)
  • Interim adult support: Sean Kim (operations / infrastructure)
  • Roles being sought: faculty advisor for methodology, statistical advisor for analysis, IRB-equivalent ethical oversight, educator collaborator for Phase 2 (see /team).

§11

Versioning

This document follows the same wave-bounded versioning protocol as the project's emotion taxonomy: changes to hypotheses or thresholds must be committed before the affected data is analyzed, with the prior version preserved.

Version · Date · Notes
v0.1 · 2026-05 · Initial draft, pre-data

This document is a draft. Comments, challenges, and suggested revisions are welcome — email contact@mindlenslab.org or open an issue in the project repository. The editable source lives at docs/research_phase1_hypotheses.md.