Do you hear it? Meet AVIP-Bench

A controlled benchmark for evaluating intuitive physics from video & sound.

Objects crash, bounce, and shatter: our benchmark of audiovisual object drops probes whether models benefit from adding sound when reasoning about physics.

What is AVIP?

A small, controlled benchmark in which each clip comes in three variants: A (audio-only), V (video-only), and AV (audio+video). Tasks: object, material, and outcome recognition. We score top‑1 predictions against ground truth and look for cross‑modal gains.

Method (short)
  1. For each clip, run models on A, V, and AV variants with the same instruction-style prompt.
  2. Decode model outputs into {object, material, outcome} and compare against labels.
  3. Compute per-task Top-1 and Top-5 accuracy and cross-modal gain, per clip and in aggregate. We additionally report calibration/confidence metrics (ECE, Brier score, margin, entropy, Top-1 probability) and probing-based audio reliance via fixed cue selection and A/V/AV consistency. All metrics are computed on the paired clip set (A ∩ V ∩ AV) with 95% confidence intervals.
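The scoring loop above can be sketched as follows. The `predict` callable and the clip dictionary format are hypothetical stand-ins for illustration, not the actual AVIP harness:

```python
# Minimal sketch of the AVIP scoring loop (hypothetical interfaces).
# predict(clip_id, modality) -> {task: ranked list of label guesses}
from collections import defaultdict

TASKS = ("object", "material", "outcome")
MODALITIES = ("A", "V", "AV")

def evaluate(clips, predict):
    """clips: list of {'id': ..., 'labels': {task: gold_label}} dicts.
    Returns {(task, modality, k): Top-k accuracy}; keys with zero hits
    are simply absent from the result."""
    hits = defaultdict(int)  # (task, modality, k) -> number of correct clips
    n = 0
    for clip in clips:
        n += 1
        for mod in MODALITIES:
            preds = predict(clip["id"], mod)  # same prompt, different modality
            for task in TASKS:
                ranked = preds[task]
                gold = clip["labels"][task]
                if ranked and ranked[0] == gold:
                    hits[(task, mod, 1)] += 1  # Top-1 hit
                if gold in ranked[:5]:
                    hits[(task, mod, 5)] += 1  # Top-5 hit
    return {key: count / n for key, count in hits.items()}
```

In the real pipeline the ranked lists would come from decoding model outputs into the {object, material, outcome} label space, as described in step 2.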

Leaderboard

Per‑Modality (A / V / AV)

Model | Modality | N | Top‑1 Acc (%) | Updated

Example clips and Plots

Task labels (demo): object=paperbox, material=cardboard, outcome=bounce

Cross-Modal Gain (CMG)

Cross-Modal Gain heatmap
CMG in percentage points per engine; horizontal bars are 95% paired-bootstrap CIs on the paired clip set.
Look for positive values: these mean AV was better than either audio or video alone. Gains usually appear for outcome prediction, but rarely for object or material recognition.
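Taking CMG to be AV accuracy minus the better of the two unimodal accuracies, in percentage points (an assumption consistent with "AV was better than either audio or video alone"), a minimal paired-bootstrap sketch looks like this:

```python
# Hypothetical sketch: CMG = acc(AV) - max(acc(A), acc(V)), in percentage
# points, with a paired bootstrap CI resampling clips (not the AVIP code).
import random

def cmg_with_ci(a_correct, v_correct, av_correct, n_boot=2000, seed=0):
    """Each argument is a per-clip 0/1 correctness list, paired across
    modalities (same clip order). Returns (point estimate, (lo, hi))."""
    n = len(av_correct)

    def cmg(idx):
        a = sum(a_correct[i] for i in idx) / len(idx)
        v = sum(v_correct[i] for i in idx) / len(idx)
        av = sum(av_correct[i] for i in idx) / len(idx)
        return 100.0 * (av - max(a, v))  # percentage points

    point = cmg(range(n))
    rng = random.Random(seed)
    # Resample clip indices with replacement; all three modalities share
    # the same resampled indices, which is what makes the bootstrap "paired".
    boots = sorted(cmg([rng.randrange(n) for _ in range(n)])
                   for _ in range(n_boot))
    lo, hi = boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)]
    return point, (lo, hi)
```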

Average modality attribution (AV)

Average audio weight across models
Audio weight by model.
What to look for: Red = model relies more on audio, Blue = model relies less. Engines that “listen” more may gain on outcome prediction, but not always.
Average video weight across models
Video weight by model.
What to look for: Red = model relies more on video, Blue = model relies less. Engines that “look” more often ignore sound, which can explain weak cross-modal gains.

Top-1 accuracy by task

Top-1 accuracy per model across object, material, and outcome for A, V, AV
Top-1 accuracy with 95% CIs (A, V, AV) across tasks and models.
What to look for: V is usually highest; AV improves over A and sometimes nudges past V on outcome. Big gaps A→AV mean sound is helpful; AV≈V means little extra benefit.
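For reference, a 95% CI on a Top-1 accuracy can be computed with a Wilson score interval over n paired clips (an assumed method for illustration; the leaderboard bars may instead use a bootstrap):

```python
# Hypothetical sketch: 95% Wilson score interval for a Top-1 accuracy,
# given `correct` hits out of `total` clips.
import math

def wilson_ci(correct, total, z=1.96):
    if total == 0:
        return (0.0, 0.0)
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))
```

Unlike the normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at the small clip counts a controlled benchmark like this implies.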

Contact

Questions? bramo.g@protonmail.com

huggingface.co/Grets/AVIP