VidPair-Halluc

ECCV 2026

No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs

A paired-video benchmark for probing how video-language models use visible evidence under controlled scene context.

Haojian Huang (1,2), Harold Haodong Chen (1,2), Meng Luo (3), Junjia Du (4), Shanqing Xu (5), Ziheng Chen (6), Yanxiang Huang (7), Yinchuan Li (2), Yingcong Chen (1,2,*)

(1) HKUST(GZ), (2) Knowin AI, (3) NUS, (4) NTU, (5) HUST, (6) UIUC, (7) PolyU. (*) Corresponding author

Video A: black alarm clock
Video B: red alarm clock

Cover pair story. In both clips, the scene is built around a comparable bedroom wake-up routine with the same person, neutral wall, bedside setting, and alarm-clock interaction. Video A shows a black alarm clock near the bed, while Video B keeps the alarm-clock scenario but shows a red alarm clock held in front of the person. This pair illustrates color grounding under a controlled background: the answer depends on the visible clock color rather than the surrounding bedroom context.

Abstract

Same context, different truth conditions.

VidPair-Halluc follows the PairFlow construction described in the paper: 15-second videos are organized into paired contrasts where the surrounding context is intended to stay comparable while the answer-relevant visual evidence differs. The benchmark spans object, action, color, number, person, location, static relation, dynamic relation, dynamic attribute, and temporal sequence reasoning.

2,000 videos
1,000 paired contrasts
11,523 QA rows
10 semantic aspects

Motivation

Evidence and scene priors in controlled pairs.

The paper studies cases where model answers may follow likely scene priors instead of the visible content. VidPair-Halluc evaluates this behavior by asking comparable questions over paired videos whose answers depend on controlled visual differences.

Paper Fig. 1 (motivation): textual versus visual hallucination, moving from text-only perturbations to background-controlled visual contrasts, with paired video examples.

Scene priors can affect answers.

When the surrounding scene is familiar, a model may answer from what usually happens in such a scene rather than from what the video shows. Paired clips test whether the answer changes when the visible evidence changes.

Background control matters.

If the whole video changes, unrelated cues can affect the answer. The paired setup aims to keep context stable while changing the foreground truth condition.

QA formats cover different answer styles.

The benchmark includes binary, multiple-choice, and open-ended QA, allowing the same type of visual contrast to be evaluated under different response formats.
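One way to read the paired setup is as a pair-level check: the same binary question is asked on both clips, and a model that answers from scene priors will tend to give the same answer twice. The sketch below illustrates that idea only. The record fields, the strict both-correct rule, and the "same answer on both clips" rate are assumptions made for illustration, not the scoring protocol reported in the paper.

```python
from dataclasses import dataclass


@dataclass
class PairedBinaryQA:
    """One binary question asked on both clips of a pair (hypothetical schema)."""
    question: str
    truth_a: bool  # ground-truth answer on Video A
    truth_b: bool  # ground-truth answer on Video B
    pred_a: bool   # model answer on Video A
    pred_b: bool   # model answer on Video B


def per_video_accuracy(items):
    """Fraction of individual (video, question) answers that are correct."""
    correct = sum((it.pred_a == it.truth_a) + (it.pred_b == it.truth_b) for it in items)
    return correct / (2 * len(items))


def pair_accuracy(items):
    """Fraction of pairs answered correctly on BOTH clips (assumed strict rule)."""
    both = sum(it.pred_a == it.truth_a and it.pred_b == it.truth_b for it in items)
    return both / len(items)


def same_answer_rate(items):
    """Among pairs whose truths differ, how often the model repeats one answer."""
    contrast = [it for it in items if it.truth_a != it.truth_b]
    if not contrast:
        return 0.0
    return sum(it.pred_a == it.pred_b for it in contrast) / len(contrast)


# Toy predictions, invented for this example.
items = [
    # The model says "yes" on both clips, so it is right only on Video B.
    PairedBinaryQA("Is the alarm clock red?", False, True, True, True),
    # The model tracks the visible action and is right on both clips.
    PairedBinaryQA("Is the person turning the sausages?", False, True, False, True),
]

print(per_video_accuracy(items))  # 0.75
print(pair_accuracy(items))       # 0.5
print(same_answer_rate(items))    # 0.5
```

In this toy case the per-video number looks reasonable while the pair-level number exposes the question where the answer did not move with the evidence, which is the behavior the paired design is meant to surface.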

PairFlow

Background-Controlled Pair Generation.

PairFlow composes a scene story, generates matched clip segments, and assembles paired videos. The construction is designed to keep the surrounding situation plausible while changing the visual evidence needed to answer.

Paper Fig. 3 (PairFlow pipeline): semantic aspect selection, story composition, clip generation, and paired video assembly.
Paper Fig. 4 (dataset examples): paired videos and question types across binary, multiple-choice, and open-ended QA.

Pair construction

Controlled scenes with changed evidence.

Each pair keeps a familiar visual context and changes the foreground evidence, spatial relation, or temporal order that determines the expected answer.

Video A: grilling sausages
Video B: turning sausages

Action contrast

The same backyard grill context supports different answers only because the visible hand action changes.

Video A: reading a book
Video B: shelving a book

Library contrast

The bookshelf, subject, and framing stay aligned while the evidence changes from reading to returning the book.
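A pair like the ones above can be thought of as a small record: a shared context plus two pieces of answer-relevant evidence, tagged with one of the ten semantic aspects. The following is a minimal sketch of such a record. The class name, field names, and validation rules are hypothetical and are not taken from the released dataset format.

```python
from dataclasses import dataclass

# The ten semantic aspects named in the abstract.
ASPECTS = (
    "object", "action", "color", "number", "person", "location",
    "static relation", "dynamic relation", "dynamic attribute",
    "temporal sequence",
)


@dataclass(frozen=True)
class VideoPair:
    """Hypothetical description of one background-controlled pair."""
    aspect: str          # which semantic aspect the contrast targets
    shared_context: str  # what is meant to stay comparable across clips
    evidence_a: str      # answer-relevant evidence in Video A
    evidence_b: str      # answer-relevant evidence in Video B

    def __post_init__(self):
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {self.aspect!r}")
        if self.evidence_a == self.evidence_b:
            raise ValueError("a contrast pair needs differing evidence")


# The action contrast from this section, written as a record.
action_pair = VideoPair(
    aspect="action",
    shared_context="backyard grill",
    evidence_a="grilling sausages",
    evidence_b="turning sausages",
)
print(action_pair)
```

Rejecting records whose two evidence fields match keeps the sketch aligned with the stated design goal: the context may repeat, but the truth condition must differ.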

Paper results

Results across current VLMs.

The paper reports dataset-level background-control measurements, QA coverage across response formats, and model results under paired video evaluation.

Paper Fig. 6: reported model performance on text-pair and video-pair hallucination variants.
Paper Fig. 7: t-SNE visualization, human verification, and model-level radar plots.

Table 1

Full QA coverage

Binary, multiple-choice, and open-ended QA follow the paired-video structure with visual similarity control and paired contrasts.

QA / videos    11,523 / 2,000
Video pairs    1,000

Table 2

Background consistency

Metric      Random    Controlled
DINOv2 ↑    0.21      0.94 / 0.81
LPIPS ↓     0.68      0.15 / 0.15
SSIM ↑      0.24      0.76 / 0.90
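To make the DINOv2 row more concrete, the sketch below shows how a feature-similarity score between two clips could be computed, assuming the score is a cosine similarity between embedding vectors. That reading is an assumption, since this page does not state the exact procedure, and the short vectors are toy stand-ins for real DINOv2 features rather than model outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_pair_similarity(pairs):
    """Average cosine similarity over a list of (embedding_a, embedding_b)."""
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy embeddings: matched backgrounds point in nearly the same direction,
# while randomly paired backgrounds do not.
controlled_pairs = [([0.9, 0.1, 0.0], [0.8, 0.2, 0.0])]
random_pairs = [([1.0, 0.0, 0.0], [0.0, 1.0, 0.2])]

controlled_mean = mean_pair_similarity(controlled_pairs)
random_mean = mean_pair_similarity(random_pairs)
print(controlled_mean > random_mean)  # True
```

A higher mean for controlled pairs than for random pairs is the pattern the table's up-arrow rows describe; LPIPS is a distance, so the direction is reversed there.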

Table 3

Reported human/model comparison

Human             74.32
Gemini-2.5-Pro    49.15
Qwen2.5-VL        41.66
GPT-4o            26.97

Method                 Binary wAcc ↑    MCQ F1 ↑    Video Acc ↑    Open Desc. ↑
Human                  74.32            89.21       79.66          -
Gemini-2.5-Pro         49.15            67.32       43.36          54.68
GPT-5-mini             29.33            64.82       45.28          49.33
Qwen2.5-VL-Instruct    41.66            61.50       42.86          45.07
Video-LLaMA2           21.48            62.91       42.86          40.85
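The distance between human and model performance can be read directly off the Binary wAcc column. The short sketch below does that subtraction for the four models in the table. The scores are copied from Table 3 as shown on this page, and the gap is a plain difference in points, not a metric defined by the paper.

```python
# Binary wAcc values as listed in Table 3 on this page.
binary_wacc = {
    "Human": 74.32,
    "Gemini-2.5-Pro": 49.15,
    "GPT-5-mini": 29.33,
    "Qwen2.5-VL-Instruct": 41.66,
    "Video-LLaMA2": 21.48,
}

human = binary_wacc["Human"]

# Gap to human in points, rounded to the table's two-decimal precision.
gaps = {
    name: round(human - score, 2)
    for name, score in binary_wacc.items()
    if name != "Human"
}

# Print models from smallest to largest gap.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name}: {gap:.2f} points below human")
```

On these numbers the smallest gap is 25.17 points (Gemini-2.5-Pro) and the largest is 52.84 points (Video-LLaMA2).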

Taxonomy showcase

Representative Reasoning-Axis Pairs.

These examples were selected from the clean preview pool after frame-level inspection. The page prioritizes visually legible pairs across different reasoning axes, while avoiding repeated examples of the same contrast type.

Object

Calculator vs. stress ball

Both clips keep the same desk work setup with a seated person, laptop, notebook, and window light. Video A places a calculator on the table, while Video B keeps the scene structure but replaces that object with a red stress ball.

Action

Grilling vs. turning sausages

The backyard grill, open lid, food placement, and cook remain consistent across the pair. Video A emphasizes grilling sausages on the grate, while Video B changes the queried action to turning the sausages with a utensil.

Color

Pink vs. green bottle

Both clips use the same soccer-practice background with a child drinking near the field. The highlighted visual difference is the bottle color: pink in Video A and green in Video B.

Number

One vs. two onions

The kitchen counter, bowl, knife, and food-prep action stay matched. Video A shows a single onion being handled, while Video B changes the visual count to two onions in the same preparation scenario.

Location

Classroom vs. library study

The person continues the same quiet writing task across both clips. The background evidence changes the location: a classroom-like board and desks in Video A, and library shelves around the same study behavior in Video B.

Static relation

Front vs. low position

The vending-machine aisle and product shelves remain the shared context. Video A shows the person standing in front of the machine, while Video B changes the spatial relation by placing the person lower, crouched beside the machine.

Temporal sequence

Forward vs. reversed drawing order

The artist, easel, brush, and studio framing stay matched. Video A progresses from a blank canvas toward a visible blue stroke, while Video B reverses the order so the same visual states imply the opposite temporal sequence.

BibTeX
@inproceedings{vidpairhalluc2026,
  title = {No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs},
  author = {Huang, Haojian and Chen, Harold Haodong and Luo, Meng and Du, Junjia and Xu, Shanqing and Chen, Ziheng and Huang, Yanxiang and Li, Yinchuan and Chen, Yingcong},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year = {2026},
  url = {https://github.com/JethroJames/VidPair-Halluc}
}