3DPoV

Improving 3D Understanding via Patch Ordering on Videos

Ioana Simion1,2,*   Mohammadreza Salehi3,*   Shashanka Venkataramanan4   Cees G. M. Snoek3   Yuki M. Asano5

1Quantitative Healthcare Analysis (qurAI) group, University of Amsterdam   2Amsterdam UMC, Dept. of Biomedical Engineering and Physics
3Video & Image Sense Lab, University of Amsterdam   4Valeo.ai   5Fundamental AI Lab, University of Technology Nuremberg
* Equal contribution

ICML 2026 — Seoul, South Korea · PMLR 306

3DPoV ICML 2026 poster

Poster presented at ICML 2026, Seoul

Abstract

Visual foundation models have achieved remarkable progress in scale and versatility, yet understanding the 3D world remains a fundamental challenge. While 2D images contain cues about 3D structure that humans readily interpret, deep models often fail to exploit them, underperforming on tasks such as multiview semantic consistency – crucial for applications including robotics and autonomous driving. We propose a self-supervised approach to enhance the 3D understanding of vision foundation models by (i) introducing a temporal nearest-neighbor consistency loss that finds corresponding points across video frames and enforces consistency between their nearest neighbors, (ii) incorporating reference-guided ordering that requires patch-level features to be not only expressive but also consistently aligned, and (iii) constructing a mixture of video datasets tailored to these objectives, thereby leveraging rich 3D information. Our method, 3DPoV, achieves state-of-the-art performance in keypoint matching under viewpoint variation, as well as in depth and surface normal estimation, and consistently improves a diverse set of backbones, including DINOv3.

Method

3DPoV method diagram
Learning 3D-aware representations via patch ordering in videos. Motion trajectories are extracted from raw clips with CoTrackerV3; student and teacher networks encode each frame into feature maps, which are resampled along the tracks into patch sequences F_s and F_t. A differentiable sorting module compares these against a reference feature bank F_r, producing permutation matrices that enforce consistent patch ordering across time.
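The resampling step in the caption above can be illustrated with a small sketch. It assumes bilinear interpolation and track coordinates already expressed in feature-map units; neither detail is stated on this page, and the function names `bilinear_sample` and `sample_along_tracks` are hypothetical.

```python
import math

def bilinear_sample(feature_map, x, y):
    """Sample a C-dim feature at continuous (x, y) from an H x W x C nested list.

    Assumption: x indexes columns and y indexes rows, in feature-map units.
    """
    height, width = len(feature_map), len(feature_map[0])
    # Clamp so that all four neighbours lie inside the map.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    channels = len(feature_map[0][0])
    return [
        (1 - wy) * ((1 - wx) * feature_map[y0][x0][c] + wx * feature_map[y0][x1][c])
        + wy * ((1 - wx) * feature_map[y1][x0][c] + wx * feature_map[y1][x1][c])
        for c in range(channels)
    ]

def sample_along_tracks(feature_maps, tracks):
    """feature_maps: list of T maps; tracks: N tracks, each a list of T (x, y) points.

    Returns N sequences, each holding T feature vectors (one per frame).
    """
    return [
        [bilinear_sample(feature_maps[t], x, y) for t, (x, y) in enumerate(track)]
        for track in tracks
    ]
```

In a real pipeline this would more likely be a batched tensor operation; the nested-list version only makes the indexing explicit.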

3DPoV uses a teacher–student architecture: the student processes video frames independently, while an EMA-updated teacher encodes an anchor frame that serves as a stable supervision target. Rather than matching features directly, we align the relative similarity structure of tracked patches over time. For each frame, we rank the similarities between its tracked patches and a shared bank of reference patches, use differentiable sorting to turn these rankings into soft permutation matrices, and train the student to match the teacher's anchor-frame permutations. This yields viewpoint-invariant descriptors that stay consistent under motion, occlusion, and lighting changes, without relying on crops or masks.
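To make the ordering objective concrete, here is a minimal sketch for a single tracked patch. It swaps in the SoftSort relaxation as the differentiable sorting operator and a cross-entropy between teacher and student permutations as the loss; the page does not say which sorting operator or loss 3DPoV actually uses, so both are assumptions, as are the names `soft_permutation`, `ordering_loss`, and the temperature `tau`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)

def soft_permutation(scores, tau=0.1):
    """SoftSort-style relaxation (assumed operator, not confirmed by the source).

    Row i is a softmax over which entry of `scores` holds rank i (descending).
    """
    ranked = sorted(scores, reverse=True)
    rows = []
    for target in ranked:
        logits = [-abs(target - s) / tau for s in scores]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def ordering_loss(student_patch, teacher_patch, reference_bank, tau=0.1):
    """Cross-entropy between teacher and student soft permutations (assumed loss)."""
    s_scores = [cosine(student_patch, r) for r in reference_bank]
    t_scores = [cosine(teacher_patch, r) for r in reference_bank]
    p_student = soft_permutation(s_scores, tau)
    p_teacher = soft_permutation(t_scores, tau)
    loss = 0.0
    for t_row, s_row in zip(p_teacher, p_student):
        loss -= sum(t * math.log(s + 1e-8) for t, s in zip(t_row, s_row))
    return loss / len(reference_bank)  # average over ranks
```

The loss is small when the student ranks the reference patches in the same order as the teacher and grows when the order differs. Training would require an autodiff framework; this plain-Python version only shows the computation.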

Fine-tuning blends three complementary video sources: CO3D (long object-centric sequences with large viewpoint shifts), DL3DV (diverse dynamic scenes and spatial layouts), and YouTube-VOS (unconstrained motion, occlusions, and real-world camera trajectories). Together they span object-, scene-, and natural-video variability. Training takes only ~20 hours on A100 GPUs.

Results

Keypoint Matching — SPair-71k

Recall across viewpoint difficulty bins (S: same / M: moderate / L: large). 3DPoV improves overall recall on every backbone, including under large viewpoint shifts; the only bin that does not improve is M on DINOv3 (48.56 vs. 48.67).

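| Model | Backbone | S / 0 | M / 1 | L / 2 | All |
|---|---|---|---|---|---|
| DINO | ViT-S/16 | 28.34 | 23.38 | 24.44 | 25.63 |
| TimeTuning | DINOv1-S/16 | 26.76 | 22.48 | 23.45 | 23.96 |
| MoSiC | DINOv1-S/16 | 26.73 | 21.97 | 22.98 | 23.76 |
| DINO | ViT-B/16 | 30.19 | 24.22 | 24.35 | 26.39 |
| NeCo | DINOv1-B/16 | 30.24 | 24.45 | 23.10 | 26.32 |
| 3DPoV | DINOv1-B/16 | 31.77 | 25.74 | 25.80 | 28.16 |
| DINOv2R | ViT-B/14 | 58.20 | 51.56 | 53.41 | 53.47 |
| NeCo | DINOv2R-B/14 | 59.57 | 49.06 | 52.35 | 54.42 |
| MoSiC | DINOv2R-B/14 | 56.37 | 50.70 | 51.75 | 51.72 |
| 3DPoV | DINOv2R-B/14 | 60.16 | 52.79 | 54.50 | 55.40 |
| DINOv3 | ViT-B/16 | 61.99 | 48.67 | 46.77 | 55.76 |
| 3DPoV | DINOv3-B/16 | 62.24 | 48.56 | 46.81 | 55.84 |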
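As a rough illustration of how a keypoint-matching recall of this kind can be computed, the sketch below matches each source keypoint to its nearest target patch by cosine similarity and counts a hit when the match lands within a pixel threshold of the ground truth. The exact SPair-71k protocol (threshold, normalization, per-category averaging) is not given on this page, so the function `keypoint_recall` and its `threshold` parameter are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)

def keypoint_recall(src_feats, tgt_feats, tgt_positions, gt_positions, threshold):
    """Fraction of source keypoints whose nearest-neighbour match is close to ground truth.

    src_feats:     N descriptors sampled at the source keypoints.
    tgt_feats:     M candidate patch descriptors from the target image.
    tgt_positions: M (x, y) locations of those target patches.
    gt_positions:  N ground-truth (x, y) locations in the target image.
    threshold:     maximum distance for a match to count (assumed criterion).
    """
    hits = 0
    for feat, (gx, gy) in zip(src_feats, gt_positions):
        # Nearest neighbour in feature space.
        best = max(range(len(tgt_feats)), key=lambda j: cosine(feat, tgt_feats[j]))
        px, py = tgt_positions[best]
        if math.hypot(px - gx, py - gy) <= threshold:
            hits += 1
    return hits / len(src_feats)
```

Reporting the result per bin then amounts to grouping image pairs by their viewpoint-difficulty label before averaging.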

Keypoint Matching — Navi

Recall binned by relative camera rotation. 3DPoV improves over its base backbone in nearly every viewpoint regime; the only exception is the 60–120° bin on DINOv3 (31.36 vs. 31.45).

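| Model | Backbone | θ 0–15° | θ 15–30° | θ 30–60° | θ 60–120° |
|---|---|---|---|---|---|
| DINO | ViT-S/16 | 84.36 | 55.17 | 34.58 | 20.48 |
| TimeTuning | DINOv1-S/16 | 80.81 | 52.61 | 33.93 | 19.96 |
| MoSiC | DINOv1-S/16 | 80.21 | 52.07 | 33.37 | 19.59 |
| DINO | ViT-B/16 | 86.13 | 56.92 | 33.37 | 19.74 |
| NeCo | DINOv1-B/16 | 84.94 | 53.52 | 31.80 | 18.47 |
| 3DPoV | DINOv1-B/16 | 86.42 | 57.18 | 33.77 | 20.42 |
| DINOv2R | ViT-B/14 | 87.87 | 67.45 | 47.15 | 31.58 |
| NeCo | DINOv2R-B/14 | 88.76 | 65.10 | 43.88 | 28.96 |
| MoSiC | DINOv2R-B/14 | 87.13 | 66.46 | 46.95 | 31.78 |
| 3DPoV | DINOv2R-B/14 | 89.18 | 68.98 | 47.64 | 31.63 |
| DINOv3 | ViT-B/16 | 94.40 | 74.73 | 48.64 | 31.45 |
| 3DPoV | DINOv3-B/16 | 94.47 | 74.74 | 48.65 | 31.36 |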
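The binning by relative camera rotation can be sketched as follows. The angle between two camera orientations is the standard geodesic angle, θ = arccos((tr(R_a^T R_b) − 1) / 2); treating the bins as half-open intervals in degrees is an assumption, and the names `relative_rotation_deg` and `viewpoint_bin` are hypothetical.

```python
import math

# Bin edges in degrees, taken from the table header; half-open [low, high) is assumed.
BINS = [(0, 15), (15, 30), (30, 60), (60, 120)]

def relative_rotation_deg(r_a, r_b):
    """Geodesic angle (degrees) between two 3x3 rotation matrices given as nested lists."""
    # trace(R_a^T R_b) equals the element-wise (Frobenius) inner product of R_a and R_b.
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against round-off pushing the cosine outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def viewpoint_bin(theta_deg):
    """Return the (low, high) bin containing theta_deg, or None if outside all bins."""
    for low, high in BINS:
        if low <= theta_deg < high:
            return (low, high)
    return None
```

For example, a camera rotated 45° about the vertical axis relative to the reference view falls into the 30–60° bin.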

Depth Estimation — Navi

Depth estimation qualitative comparison

Surface Normal Estimation — Navi

Surface normal estimation qualitative comparison

Robustness to Occlusion

SPair-71k recall with synthetic occlusions applied. 3DPoV improves consistently across all difficulty levels and both occlusion settings.

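| Setting | Model | S / 0 | M / 1 | L / 2 | All |
|---|---|---|---|---|---|
| One view occluded | DINOv2R | 59.78 | 53.84 | 56.01 | 55.54 |
| One view occluded | 3DPoV | 62.04 | 54.92 | 57.65 | 57.54 |
| Both views occluded | DINOv2R | 58.84 | 53.41 | 56.90 | 54.88 |
| Both views occluded | 3DPoV | 61.30 | 54.11 | 58.09 | 56.98 |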

Robustness to Lighting

Correspondence recall under different lighting conditions
Correspondence recall under default, less-variation, and intense lighting conditions — 3DPoV shows larger gains under challenging light.

Navi keypoint recall binned by viewpoint across three lighting conditions.

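| Lighting condition | Model | Backbone | θ 0–15° | θ 15–30° | θ 30–60° | θ 60–120° |
|---|---|---|---|---|---|---|
| Default NAVI dataset | DINOv2R | ViT-B/14 | 87.87 | 67.45 | 47.15 | 31.58 |
| Default NAVI dataset | 3DPoV | DINOv2R-B/14 | 89.18 | 68.98 | 47.64 | 31.63 |
| Less lighting variation subset | DINOv2R | ViT-B/14 | 92.28 | 67.72 | 49.24 | 33.69 |
| Less lighting variation subset | 3DPoV | DINOv2R-B/14 | 93.27 | 68.85 | 49.27 | 33.52 |
| Intense lighting conditions subset | DINOv2R | ViT-B/14 | 89.63 | 66.98 | 48.13 | 33.96 |
| Intense lighting conditions subset | 3DPoV | DINOv2R-B/14 | 90.91 | 68.45 | 48.82 | 34.11 |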

BibTeX

@inproceedings{simion2026_3dpov,
  title     = {3DPoV: Improving 3D Understanding via Patch Ordering on Videos},
  author    = {Simion, Ioana and Salehi, Mohammadreza and Venkataramanan, Shashanka and Snoek, Cees G. M. and Asano, Yuki M.},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  series    = {PMLR},
  volume    = {306},
  year      = {2026}
}