Abstract
Visual foundation models have achieved remarkable progress in scale and versatility, yet understanding the 3D world remains a fundamental challenge. While 2D images contain cues about 3D structure that humans readily interpret, deep models often fail to exploit them, underperforming on tasks such as multiview semantic consistency – crucial for applications including robotics and autonomous driving. We propose a self-supervised approach to enhance the 3D understanding of vision foundation models by (i) introducing a temporal nearest-neighbor consistency loss that finds corresponding points across video frames and enforces consistency between their nearest neighbors, (ii) incorporating reference-guided ordering that requires patch-level features to be not only expressive but also consistently aligned, and (iii) constructing a mixture of video datasets tailored to these objectives, thereby leveraging rich 3D information. Our method, 3DPoV, achieves state-of-the-art performance in keypoint matching under viewpoint variation, as well as in depth and surface normal estimation, and consistently improves a diverse set of backbones, including DINOv3.
Method
3DPoV uses a teacher–student architecture where the student processes video frames independently and the teacher provides a stable, EMA-updated anchor frame for supervision. Rather than matching features directly, we align the relative similarity structure of tracked patches over time: for each frame we compute similarity rankings against a shared bank of reference patches and use differentiable sorting to obtain soft permutation matrices, training the student to match the teacher's anchor-frame permutations. This yields viewpoint-invariant descriptors that stay consistent under motion, occlusion, and lighting changes — without relying on crops or masks.
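The ordering objective described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation. The source says only "differentiable sorting", so SoftSort stands in as the relaxation here. Cosine similarity, the temperature `tau`, the cross-entropy form of the loss, the EMA `momentum`, and all function names are hypothetical choices. The code evaluates forward values in plain Python; actual training would use an autodiff framework.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def soft_sort(scores, tau=0.1):
    """SoftSort relaxation (assumed choice): row i is a soft one-hot
    over which reference holds rank i in descending order."""
    ranked = sorted(scores, reverse=True)
    return [softmax([-abs(r - s) / tau for s in scores]) for r in ranked]

def soft_permutation(patch, references, tau=0.1):
    """Rank the shared reference patches by similarity to one tracked patch."""
    sims = [cosine(patch, ref) for ref in references]
    return soft_sort(sims, tau)

def ordering_loss(student_patch, teacher_patch, references, tau=0.1):
    """Row-wise cross-entropy between teacher and student soft permutations
    (hypothetical loss form; the teacher side is treated as a fixed target)."""
    p_student = soft_permutation(student_patch, references, tau)
    p_teacher = soft_permutation(teacher_patch, references, tau)
    rows = [
        -sum(t * math.log(s + 1e-12) for t, s in zip(t_row, s_row))
        for t_row, s_row in zip(p_teacher, p_student)
    ]
    return sum(rows) / len(rows)

def ema_update(teacher_params, student_params, momentum=0.99):
    """EMA teacher update; the momentum value is a placeholder."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy usage: three 2-D reference patches and one tracked patch in two frames.
references = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_patch = [1.0, 0.1]     # anchor-frame feature from the teacher
student_aligned = [1.0, 0.2]   # ranks the references in the same order
student_drifted = [0.1, 1.0]   # ranks the references differently
loss_aligned = ordering_loss(student_aligned, teacher_patch, references)
loss_drifted = ordering_loss(student_drifted, teacher_patch, references)
```

In this toy setup the loss is lower when the student ranks the reference patches in the same order as the teacher's anchor frame, which is the behaviour the ordering objective rewards.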
Fine-tuning blends three complementary video sources: CO3D (long object-centric sequences with large viewpoint shifts), DL3DV (diverse dynamic scenes and spatial layouts), and YouTube-VOS (unconstrained motion, occlusions, and real-world camera trajectories). Together they span object-, scene-, and natural-video variability. Fine-tuning requires only ~20 hours of training on A100 GPUs.
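One simple way to blend several video sources is to draw the source of each training clip from a weighted distribution. The sketch below assumes equal weights purely as a placeholder: the text names the three datasets but not their proportions, and `MIXTURE` and `sample_sources` are hypothetical names.

```python
import random

# Placeholder mixing weights (assumption): the proportions are not given in the text.
MIXTURE = {"CO3D": 1.0, "DL3DV": 1.0, "YouTube-VOS": 1.0}

def sample_sources(num_clips, weights=MIXTURE, seed=0):
    """Pick which dataset each clip in a batch is drawn from."""
    rng = random.Random(seed)  # seeded for reproducibility
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_clips)

batch_sources = sample_sources(8)
```

Each entry of `batch_sources` names the dataset to read the corresponding clip from; changing the weights shifts the balance between object-centric, scene-level, and unconstrained video.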
Results
Keypoint Matching — SPair-71k
Recall across viewpoint-difficulty bins (S: same, M: moderate, L: large viewpoint change). 3DPoV improves over each base backbone on overall recall and under large viewpoint shifts; the only regression is the M bin on DINOv3 (48.56 vs. 48.67).
| Model | Backbone | S / 0 | M / 1 | L / 2 | All |
|---|---|---|---|---|---|
| DINO | ViT-S/16 | 28.34 | 23.38 | 24.44 | 25.63 |
| TimeTuning | DINOv1-S/16 | 26.76 | 22.48 | 23.45 | 23.96 |
| MoSiC | DINOv1-S/16 | 26.73 | 21.97 | 22.98 | 23.76 |
| DINO | ViT-B/16 | 30.19 | 24.22 | 24.35 | 26.39 |
| NeCo | DINOv1-B/16 | 30.24 | 24.45 | 23.10 | 26.32 |
| 3DPoV | DINOv1-B/16 | 31.77 | 25.74 | 25.80 | 28.16 |
| DINOv2R | ViT-B/14 | 58.20 | 51.56 | 53.41 | 53.47 |
| NeCo | DINOv2R-B/14 | 59.57 | 49.06 | 52.35 | 54.42 |
| MoSiC | DINOv2R-B/14 | 56.37 | 50.70 | 51.75 | 51.72 |
| 3DPoV | DINOv2R-B/14 | 60.16 | 52.79 | 54.50 | 55.40 |
| DINOv3 | ViT-B/16 | 61.99 | 48.67 | 46.77 | 55.76 |
| 3DPoV | DINOv3-B/16 | 62.24 | 48.56 | 46.81 | 55.84 |
Keypoint Matching — Navi
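Recall binned by relative camera rotation. 3DPoV improves over its DINO and DINOv2R base models in every bin and roughly matches DINOv3, with a marginal drop only in the largest-rotation bin (31.36 vs. 31.45).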
| Model | Backbone | θ 0°–15° | θ 15°–30° | θ 30°–60° | θ 60°–120° |
|---|---|---|---|---|---|
| DINO | ViT-S/16 | 84.36 | 55.17 | 34.58 | 20.48 |
| TimeTuning | DINOv1-S/16 | 80.81 | 52.61 | 33.93 | 19.96 |
| MoSiC | DINOv1-S/16 | 80.21 | 52.07 | 33.37 | 19.59 |
| DINO | ViT-B/16 | 86.13 | 56.92 | 33.37 | 19.74 |
| NeCo | DINOv1-B/16 | 84.94 | 53.52 | 31.80 | 18.47 |
| 3DPoV | DINOv1-B/16 | 86.42 | 57.18 | 33.77 | 20.42 |
| DINOv2R | ViT-B/14 | 87.87 | 67.45 | 47.15 | 31.58 |
| NeCo | DINOv2R-B/14 | 88.76 | 65.10 | 43.88 | 28.96 |
| MoSiC | DINOv2R-B/14 | 87.13 | 66.46 | 46.95 | 31.78 |
| 3DPoV | DINOv2R-B/14 | 89.18 | 68.98 | 47.64 | 31.63 |
| DINOv3 | ViT-B/16 | 94.40 | 74.73 | 48.64 | 31.45 |
| 3DPoV | DINOv3-B/16 | 94.47 | 74.74 | 48.65 | 31.36 |
Depth Estimation — Navi
Surface Normal Estimation — Navi
Robustness to Occlusion
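SPair-71k recall with synthetic occlusions applied. 3DPoV improves over DINOv2R in every difficulty bin and in both occlusion settings, raising overall recall from 55.54 to 57.54 with one view occluded and from 54.88 to 56.98 with both views occluded.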
| Model | S / 0 | M / 1 | L / 2 | All |
|---|---|---|---|---|
| One view occluded | ||||
| DINOv2R | 59.78 | 53.84 | 56.01 | 55.54 |
| 3DPoV | 62.04 | 54.92 | 57.65 | 57.54 |
| Both views occluded | ||||
| DINOv2R | 58.84 | 53.41 | 56.90 | 54.88 |
| 3DPoV | 61.30 | 54.11 | 58.09 | 56.98 |
Robustness to Lighting
Navi keypoint recall binned by viewpoint across three lighting conditions.
| Model | Backbone | θ 0°–15° | θ 15°–30° | θ 30°–60° | θ 60°–120° |
|---|---|---|---|---|---|
| Default NAVI dataset | |||||
| DINOv2R | ViT-B/14 | 87.87 | 67.45 | 47.15 | 31.58 |
| 3DPoV | DINOv2R-B/14 | 89.18 | 68.98 | 47.64 | 31.63 |
| Less lighting variation subset | |||||
| DINOv2R | ViT-B/14 | 92.28 | 67.72 | 49.24 | 33.69 |
| 3DPoV | DINOv2R-B/14 | 93.27 | 68.85 | 49.27 | 33.52 |
| Intense lighting conditions subset | |||||
| DINOv2R | ViT-B/14 | 89.63 | 66.98 | 48.13 | 33.96 |
| 3DPoV | DINOv2R-B/14 | 90.91 | 68.45 | 48.82 | 34.11 |
BibTeX
@inproceedings{simion2026_3dpov,
title = {3DPoV: Improving 3D Understanding via Patch Ordering on Videos},
author = {Simion, Ioana and Salehi, Mohammadreza and Venkataramanan, Shashanka and Snoek, Cees G. M. and Asano, Yuki M.},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
series = {PMLR},
volume = {306},
year = {2026}
}