On-Policy RL
The student learns only from its own sampled rollouts, so progress is bounded by what the model can already discover on its own.
Reinforcement learning has improved video reasoning, but existing pipelines either depend on self-exploration that saturates at the model's own capability ceiling, or mix off-policy guidance in ways that complicate optimization. FFR introduces an observation-level intervention: when a rollout fails, a frozen tool-integrated teacher identifies the missing spatiotemporal dependency and returns a minimal evidence patch from the original video. The student then answers again with the added context, and GRPO updates are applied to the repaired trajectory through a robust improvement reward. Across multiple video reasoning and general video benchmarks, this produces consistent accuracy gains while preserving the benefits of on-policy exploration.
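The loop below is a minimal sketch of this failure-driven repair cycle. The `Policy` and `Teacher` interfaces, their method names, and the `is_correct` checker are hypothetical stand-ins for illustration, not the released FFR implementation.

```python
from typing import Protocol


class Policy(Protocol):
    def generate(self, video: str, question: str,
                 extra_context: str | None = None) -> str: ...


class Teacher(Protocol):
    def diagnose(self, video: str, question: str, rollout: str) -> str: ...
    def extract_patch(self, video: str, diagnosis: str) -> str: ...


def is_correct(rollout: str, answer: str) -> bool:
    # Placeholder verifier; a real pipeline would parse the final answer.
    return rollout.strip().endswith(answer)


def ffr_rollouts(policy: Policy, teacher: Teacher, video: str,
                 question: str, answer: str, group_size: int = 8):
    """Sample a GRPO group and repair each failed rollout with a teacher patch."""
    group = []
    for _ in range(group_size):
        rollout = policy.generate(video, question)
        if is_correct(rollout, answer):
            group.append((rollout, False))  # on-policy success, no patch
            continue
        # Find: the frozen teacher names the missing spatiotemporal dependency.
        diagnosis = teacher.diagnose(video, question, rollout)
        # Fix: a minimal evidence patch from the original video, no answer clues.
        patch = teacher.extract_patch(video, diagnosis)
        # Reason: the same policy answers again with the patch in context.
        repaired = policy.generate(video, question, extra_context=patch)
        group.append((repaired, True))  # patched trajectory, taxed in the reward
    return group
```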
Method Overview
Find: A frozen teacher inspects each incorrect rollout and classifies what the student missed, such as a temporal order, a spatial relation, or a task misconception.
Fix: The teacher returns a minimal evidence patch that points to key frames, temporal markers, or regions in the original video while avoiding direct answer clues.
Reason: The same policy re-answers with the patch in context. GRPO then updates on the original correct rollout or the repaired one, with a patch tax controlling reliance on guidance (see the reward sketch below).
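One hedged reading of this reward, as a sketch: patched successes are taxed so that group-relative advantages still rank unaided correct rollouts above repaired ones. The 0.2 tax, the function names, and the exact shaping are assumptions; the paper's robust improvement reward may differ.

```python
import statistics


def improvement_reward(correct: bool, used_patch: bool,
                       patch_tax: float = 0.2) -> float:
    """Reward a correct answer; tax it if teacher guidance was used."""
    if not correct:
        return 0.0
    return 1.0 - patch_tax if used_patch else 1.0


def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style group-relative advantages: z-score rewards within the group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]


# A group with two unaided successes, one repaired success, and one failure:
rewards = [1.0, 1.0, improvement_reward(True, True), 0.0]
print(grpo_advantages(rewards))  # ranks unaided > repaired > failed
```

The tax keeps repaired trajectories informative for learning while discouraging the policy from simply waiting for guidance.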
Key Findings
Compared with Video-R1-SFT, FFR with GLM-4.5V lifts the eight-benchmark average from 47.8 to 56.5.
FFR with a 32B teacher reaches a 51.2 reasoning average, surpassing SFT with a 235B teacher at 50.7.
Removing visual context drops Video-Holmes from 52.3 to 42.3, showing that key-frame and spatial cues are central to the repair.
Structured output constraints plus negative prompting bring manually verified direct and partial answer leakage down to 0% (see the prompt sketch below).
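One plausible shape for these constraints, shown purely as an illustration (the released teacher prompt is not reproduced here): a fixed output schema combined with explicit negative instructions.

```python
# Hypothetical teacher prompt enforcing a structured patch and forbidding
# answer leakage; the wording is an assumption, not the released prompt.
PATCH_PROMPT = """\
You are a teacher reviewing a failed video-reasoning rollout.
Return ONLY this JSON object:
{"error_type": "<temporal_order|spatial_relation|task_misconception>",
 "evidence": "<key frames, timestamps, or regions to re-inspect>"}
Do NOT state, paraphrase, or hint at the final answer.
Do NOT discuss the answer options."""
```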
Main Results
The main table covers four video reasoning benchmarks (MMVU, VSI-Bench, VideoMMMU, Video-Holmes) and four general video understanding benchmarks (LongVideoBench, LVBench, MVBench, TempCompass). FFR improves on the corresponding Video-R1 and VideoRFT training baselines across every reported metric.
| Model | MMVU | VSI-Bench | VideoMMMU | Video-Holmes | LongVideoBench | LVBench | MVBench | TempCompass |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 75.4 | 34.0 | 61.2 | 42.0 | 58.5 | 48.9 | 64.6 | 73.8 |
| Video-R1-SFT | 61.3 | 31.8 | 47.4 | 34.6 | 47.6 | 30.7 | 59.4 | 69.2 |
| Video-R1 | 63.8 | 35.8 | 52.3 | 36.5 | 52.7 | 35.3 | 63.9 | 73.2 |
| FFR + Video-R1 | 68.5 | 38.9 | 54.6 | 52.3 | 55.3 | 38.1 | 68.8 | 75.6 |
| VideoRFT-SFT | 60.5 | 31.7 | 48.5 | 27.1 | 47.3 | 26.9 | 57.0 | 68.4 |
| VideoRFT | 68.5 | 36.8 | 51.1 | 40.0 | 52.5 | 33.9 | 62.1 | 73.7 |
| FFR + VideoRFT | 70.1 | 38.6 | 54.9 | 48.0 | 54.9 | 37.8 | 68.2 | 75.4 |
Citation
Please cite the paper if you use FFR or the released code.
@misc{huang2026findfixreason,
  title={Find, Fix, Reason: Context Repair for Video Reasoning},
  author={Huang, Haojian and Qin, Chuanyu and Li, Yinchuan and Chen, Yingcong},
  year={2026},
  eprint={2604.16243},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  doi={10.48550/arXiv.2604.16243},
  url={https://arxiv.org/abs/2604.16243}
}