Do you hear it? Meet AVIP-Bench

A controlled benchmark for evaluating intuitive physics from video & sound.

Objects crash, bounce, and shatter - our benchmark of audiovisual object drops probes whether models benefit from adding sound when reasoning about physics.

See the example clips and results below. 📄 PDF coming soon.

What is AVIP?

A tiny, controlled benchmark with three variants of every clip: A (audio-only), V (video-only), and AV (audio+video). There are three tasks: object, material, and outcome. We compare top-1 predictions against ground truth and look for cross-modal gains.
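To make the scoring concrete, here is a minimal Python sketch of top-1 accuracy and of the cross-modal gain, taken as AV − max(A, V). The function names and the toy predictions are illustrative assumptions, not the benchmark's actual code or data.

```python
from typing import Dict, List


def top1_accuracy(preds: List[str], labels: List[str]) -> float:
    """Fraction of clips whose predicted label equals the ground truth."""
    if not labels or len(preds) != len(labels):
        raise ValueError("preds and labels must be non-empty and of equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def cross_modal_gain(acc: Dict[str, float]) -> float:
    """CMG = AV - max(A, V), in the same units as the input accuracies."""
    return acc["AV"] - max(acc["A"], acc["V"])


# Made-up material predictions for four clips (hypothetical labels).
labels = ["cardboard", "glass", "metal", "cardboard"]
preds = {
    "A": ["cardboard", "glass", "glass", "metal"],
    "V": ["cardboard", "metal", "metal", "cardboard"],
    "AV": ["cardboard", "glass", "metal", "cardboard"],
}

# Accuracy in percent per modality, then the gain in percentage points.
acc = {m: 100.0 * top1_accuracy(p, labels) for m, p in preds.items()}
cmg = cross_modal_gain(acc)
print(acc, cmg)
```

A positive gain means the AV variant beats the better of the two single-modality variants; a gain of zero or less suggests the model is not benefiting from the extra modality.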

📦 Minimal, reproducible clips
Short single‑impact scenes recorded in a controlled setup.
🔊 Modality toggles
Each clip exists as A, V, and AV to test true audio usage.
📈 Metrics
Top‑1 accuracy per task and an AV − max(A,V) cross‑modal gain.
🧪 Probe‑style prompts
Strict label sets & JSON outputs to avoid prompt drift.
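A strict label set with JSON output could be enforced along the following lines. This is only a sketch: the label vocabularies below are hypothetical (only paperbox, cardboard, and bounce appear in the demo labels), and the choice to reject any reply with a missing or unknown label is an assumption, not the benchmark's documented behaviour.

```python
import json
from typing import Dict, Optional

# Hypothetical label sets; the benchmark's real vocabularies may differ.
LABELS = {
    "object": {"paperbox", "bottle", "ball"},
    "material": {"cardboard", "glass", "rubber"},
    "outcome": {"bounce", "shatter", "crash"},
}


def decode(raw: str) -> Optional[Dict[str, str]]:
    """Parse a model reply into {object, material, outcome}; None if invalid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    out = {}
    for task, allowed in LABELS.items():
        value = data.get(task)
        if not isinstance(value, str):
            return None
        value = value.strip().lower()  # tolerate case and whitespace only
        if value not in allowed:
            return None
        out[task] = value
    return out


reply = '{"object": "paperbox", "material": "Cardboard", "outcome": "bounce"}'
print(decode(reply))
print(decode("The box bounces."))
```

Rejecting free-text replies outright, rather than trying to extract a label from them, keeps scoring identical across models and prompt revisions.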

Method (short)

For each clip, run models on A, V, and AV variants with the same instruction-style prompt.
Decode model outputs into {object, material, outcome} and compare against labels.
Compute per-task Top-1 and Top-5 accuracy and cross-modal gain, per clip and in aggregate. Additionally report calibration/confidence metrics (ECE, Brier score, margin, entropy, Top-1 probability) and probing-based audio reliance via fixed cue selection and A/V/AV consistency. All metrics are computed on the paired clip set (A∩V∩AV) and reported with 95% confidence intervals.
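Two of the calibration metrics named above, the Brier score and ECE, can be sketched as follows. The sketch assumes each model yields a probability distribution over the label set for every clip; how those probabilities are obtained is not specified here, and the equal-width ten-bin ECE is one common convention rather than a confirmed detail of the benchmark.

```python
from typing import List


def brier_score(probs: List[List[float]], labels: List[int]) -> float:
    """Mean squared error between each predicted distribution and the one-hot truth."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(labels)


def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> float:
    """Bin clips by top-1 confidence; average |accuracy - confidence| weighted by bin size."""
    n = len(confidences)
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += len(b) / n * abs(acc - avg_conf)
    return ece


# Toy example: two clips, three classes, true class index 0 for both.
probs = [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]]
labels = [0, 0]
confidences = [max(p) for p in probs]
correct = [p.index(max(p)) == y for p, y in zip(probs, labels)]
print(brier_score(probs, labels), expected_calibration_error(confidences, correct))
```

Both functions would be applied separately to the A, V, and AV outputs on the same paired clips so that the three modalities stay comparable.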

Leaderboard

Per‑Modality (A / V / AV)

Model	Modality	N	Top‑1 Acc (%)	Updated

Example clips and Plots

Task labels (demo): object=paperbox, material=cardboard, outcome=bounce

Cross-Modal Gain (CMG)

Cross-Modal Gain heatmap: CMG in percentage points per engine; horizontal bars are 95% paired-bootstrap CIs on the paired clip set.
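A paired-bootstrap interval for CMG can be sketched as below. The idea is to resample clip indices with replacement and reuse the same indices for A, V, and AV so the pairing is preserved. The percentile method, 2000 resamples, and the function names are assumptions for illustration; the exact procedure behind the plotted intervals is not described here.

```python
import random
from typing import List, Sequence, Tuple


def _cmg(a: Sequence[int], v: Sequence[int], av: Sequence[int], idx: Sequence[int]) -> float:
    """CMG in percentage points on the clips selected by idx."""
    def acc(c: Sequence[int]) -> float:
        return 100.0 * sum(c[i] for i in idx) / len(idx)
    return acc(av) - max(acc(a), acc(v))


def paired_bootstrap_cmg(
    a: List[int], v: List[int], av: List[int],
    n_boot: int = 2000, alpha: float = 0.05, seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower, upper) using a percentile bootstrap."""
    n = len(av)
    if n == 0 or len(a) != n or len(v) != n:
        raise ValueError("a, v, av must be non-empty and aligned clip by clip")
    rng = random.Random(seed)
    point = _cmg(a, v, av, range(n))
    # Each resample draws n clip indices and applies them to all three modalities.
    samples = sorted(
        _cmg(a, v, av, [rng.randrange(n) for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = samples[int(alpha / 2 * n_boot)]
    hi = samples[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lo, hi


# Toy per-clip correctness flags (1 = top-1 correct) on eight paired clips.
a = [1, 0, 0, 1, 0, 1, 0, 0]
v = [1, 1, 0, 1, 0, 1, 1, 0]
av = [1, 1, 1, 1, 0, 1, 1, 1]
print(paired_bootstrap_cmg(a, v, av))
```

Because max(A, V) is recomputed inside every resample, the interval reflects uncertainty about which single modality is the stronger baseline as well as uncertainty in the AV accuracy.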

Average modality attribution (AV)

Average audio weight by model.

Average video weight by model.

Top-1 accuracy by task

Top-1 accuracy with 95% CIs per model for A, V, and AV across the object, material, and outcome tasks.

Contact

Questions? bramo.g@protonmail.com

huggingface.co/Grets/AVIP