Diffusion-forcing world model rollouts. Each sample is ~2.7 s (16 raw frames @ 6 fps). Side-by-side videos show ground truth (GT) on the left and the model prediction on the right. The first half of each clip is history (GT); the second half is the autoregressively sampled future.
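As a quick check of the clip timing, the sketch below recomputes the sample duration and splits a clip into its history and future halves. It assumes the split is made on the 16 raw frames (the page does not say whether the halves are counted in raw or latent frames); `split_clip` is a hypothetical helper, not part of the pipeline.

```python
FPS = 6                  # frame rate stated above
FRAMES_PER_SAMPLE = 16   # raw frames per sample stated above


def split_clip(frames):
    """Split a clip into (history, future) halves.

    Assumption: the first half is GT history and the second half is the
    sampled future, counted in raw frames.
    """
    half = len(frames) // 2
    return frames[:half], frames[half:]


duration_s = FRAMES_PER_SAMPLE / FPS
history, future = split_clip(list(range(FRAMES_PER_SAMPLE)))

print(f"{duration_s:.2f} s: {len(history)} history + {len(future)} future frames")
# prints "2.67 s: 8 history + 8 future frames"
```

Under that assumption, each rollout predicts about 1.3 s of future from about 1.3 s of context.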

✓ = feature on, ✗ = off. Lower val_loss is better.

| run | fusion | views | tactile | shift16 | delta-ref | cam-pose | gate | val_loss |
|---|---|---|---|---|---|---|---|---|
| vo_v0 legacy 1v0t | — | — | — | — | — | — | — | — |
| mm_v0 legacy 1v2t | — | — | — | — | — | — | — | — |
| vo_left | — | — | — | — | — | — | — | — |
| vo_middle | — | — | — | — | — | — | — | — |
| vo_right BEST single | — | — | — | — | — | — | — | — |
| p2_mv 3v shared | — | — | — | — | — | — | — | — |
| p3_mv2t 3v+2t shared | — | — | — | — | — | — | — | — |
| p4_mv2t_extr post-explosion | — | — | — | — | — | — | — | — |
| p4_mv2t_extr pre-explosion | — | — | — | — | — | — | — | — |
| p4_gate cam-pose + gate | — | — | — | — | — | — | — | — |
| p5_gate vision-only + cam-pose | — | — | — | — | — | — | — | — |
| MV1 3v2t + delta-ref + shift16 NEW | — | — | — | — | — | — | — | — |
| MV2 MV1 + cam-pose NEW | — | — | — | — | — | — | — | — |

Latent-space MSE of the predicted future vs ground truth (lower is better). The contact/no-contact split thresholds the tactile latent-to-reference energy at 0.05.

| run | view MSE | TL MSE | TR MSE | tactile contact MSE | tactile no-contact MSE |
|---|---|---|---|---|---|
| vo_v0 legacy 1v0t | 0.049632 | 2.971416 | 2.766976 | 2.869196 | n/a |
| mm_v0 legacy 1v2t | 0.050667 | 0.017137 | 0.024456 | 0.020797 | n/a |
| vo_left | 0.035911 | 2.956703 | 2.771010 | 2.863857 | n/a |
| vo_middle | 0.036385 | 2.956703 | 2.771010 | 2.863857 | n/a |
| vo_right BEST single | 0.021557 | 2.956703 | 2.771010 | 2.863857 | n/a |
| p2_mv 3v shared | 0.028225 | 0.000000 | 0.000000 | n/a | 0.000000 |
| p3_mv2t 3v+2t shared | 0.030864 | 0.009988 | 0.010151 | 0.010070 | n/a |
| p4_mv2t_extr post-explosion | 0.111799 | 0.011736 | 0.011720 | 0.011728 | n/a |
| p4_mv2t_extr pre-explosion | 0.111799 | 0.011736 | 0.011720 | 0.011728 | n/a |
| p4_gate cam-pose + gate | 0.084355 | 0.010387 | 0.009887 | 0.010137 | n/a |
| p5_gate vision-only + cam-pose | 0.050636 | 0.010832 | 0.011602 | 0.011217 | n/a |
| MV1 3v2t + delta-ref + shift16 NEW | 0.019570 | 0.004056 | 0.017507 | 0.010782 | n/a |
| MV2 MV1 + cam-pose NEW | 0.044765 | 0.003904 | 0.019195 | 0.011550 | n/a |
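
The contact/no-contact columns above can be reproduced in spirit with the sketch below. It is only an illustration under stated assumptions: "energy" is taken to be the mean squared deviation of the ground-truth tactile latent from a reference latent, a frame counts as contact when its energy is strictly above the threshold, and an empty group is reported as `None` (shown as n/a in the table). The function names are hypothetical; only the 0.05 threshold comes from the text.

```python
CONTACT_THRESHOLD = 0.05  # threshold stated in the table caption


def mse(a, b):
    """Mean squared error between two equal-length flat vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def contact_split_mse(pred, target, reference, threshold=CONTACT_THRESHOLD):
    """Average per-frame MSE separately for contact and no-contact frames.

    pred, target: lists of flat tactile latents, one per frame.
    reference:    flat latent for the no-contact reference (assumed form).
    """
    groups = {"contact": [], "no_contact": []}
    for p, t in zip(pred, target):
        # Assumption: energy = mean squared deviation of the GT latent
        # from the reference latent.
        energy = mse(t, reference)
        key = "contact" if energy > threshold else "no_contact"
        groups[key].append(mse(p, t))
    # An empty group has no defined mean, hence None (n/a).
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}


reference = [0.0, 0.0]
target = [[1.0, 1.0], [0.1, 0.1]]   # first frame: contact, second: no contact
pred = [[0.8, 1.0], [0.1, 0.3]]
result = contact_split_mse(pred, target, reference)
```

With this toy input, each group contains exactly one frame, so both entries of `result` are the MSE of a single frame.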

# Loss curve analysis: multi-view + cam-pose ablations

| Run | Type | Best val_loss_visual | Best val_loss_tactile | Final val_loss |
|---|---|---|---|---|
| vo_left | single view (left) | 0.0096 @ ep84 | n/a | 0.0097 |
| vo_middle | single view (middle/top) | 0.0128 @ ep97 | n/a | 0.0128 |
| **vo_right** | single view (right) | **0.0093 @ ep97** | n/a | 0.0093 |
| p2_mv | 3-view shared weights, no tactile | 0.0099 @ ep96 | n/a | 0.0099 |
| p3_mv2t | 3-view + 2-tactile, shared weights | 0.0100 @ ep97 | 0.0040 | 0.0139 |
| p4_mv2t_extr (no gate) | p3 + cam-pose extrinsics | 0.0102 @ ep98 | **0.3244 (blow-up)** | 0.3346 |
| **p4_gate** | p4 + scalar gate (this fix) | **0.0096 @ ep134** | 0.2621 (still bad) | 0.2751 |
| **p5_gate** | vision-only p2 + cam-pose + gate | **0.0095 @ ep149** | n/a | 0.0095 |
| vo_v0 (legacy) | single view (old 1v0t pipeline) | 0.0167 @ ep53 | n/a | 0.0171 |
| mm_v0 (legacy) | 1-view + 2-tactile channel-stack (old mm_v0) | 0.0162 @ ep56 | 0.0032 | 0.0198 |

## Conclusions

1. **The gate fix worked for the VISUAL branch.** Both `p4_gate` (0.0096) and `p5_gate` (0.0095) now come within 0.0003 of the best single-view baseline `vo_right` (0.0093). For the first time, multi-view + cam-pose conditioning **matches** the best single-view model on visual fidelity, although it does not beat it. Without the gate (`p4_mv2t_extr`) the visual stream was already OK (0.0102) but tactile blew up.
2. **The gate did NOT save the tactile branch.** `p4_gate` val_loss_tactile = 0.262 (vs `p3_mv2t` baseline 0.0040, ~65× worse). The single scalar gate delays the cam-pose perturbation by one step but does not prevent it from injecting noise into the tactile token stream once it opens. Likely fix for a future run: route cam-pose ONLY through view tokens (gate the visual cam-pose embedding, but never add it to the tactile-stream AdaLN).
3. **Plateau.** `p4_gate` and `p5_gate` reach their best around ep122–150 and have plateaued for >1h, so it is safe to run inference now even though the target was 200 epochs.
4. **Multi-view rank** (visual val_loss, lower = better): `vo_right (0.0093) ≈ p5_gate (0.0095) ≈ p4_gate (0.0096) ≈ vo_left (0.0096) < p2_mv (0.0099) ≈ p3_mv2t (0.0100) < vo_middle (0.0128)`
5. **`vo_middle` is the worst single view** by a clear margin (0.0128 vs 0.0093–0.0096 for the other two): the top-down camera has the least useful geometry for predicting future frames.
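
The routing fix suggested in conclusion 2 could look roughly like the sketch below. It is a simplified illustration, not the model's actual conditioning code: the real model modulates tokens through AdaLN, whereas here the per-token conditioning vector is just a plain list, and `route_cam_pose`, `base_cond`, `cam_pose_emb`, and the `"view"`/`"tactile"` labels are all hypothetical names.

```python
def route_cam_pose(base_cond, cam_pose_emb, token_types, gate):
    """Build one conditioning vector per token.

    View tokens receive base_cond + gate * cam_pose_emb; tactile tokens
    receive base_cond unchanged, so the cam-pose signal can never reach
    the tactile stream regardless of the gate value.
    """
    conds = []
    for kind in token_types:
        if kind == "view":
            conds.append([b + gate * c for b, c in zip(base_cond, cam_pose_emb)])
        else:
            # Tactile (or any non-view) token: cam-pose is never added.
            conds.append(list(base_cond))
    return conds


base_cond = [1.0, 2.0]
cam_pose_emb = [0.5, -0.5]
token_types = ["view", "tactile", "view"]

closed = route_cam_pose(base_cond, cam_pose_emb, token_types, gate=0.0)
opened = route_cam_pose(base_cond, cam_pose_emb, token_types, gate=0.2)
```

With the gate closed, every token sees `base_cond`. Once the gate opens, only the view tokens change. This is the property the shared scalar gate alone does not provide, because it scales the cam-pose signal identically for both streams.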