3DPoV: Improving 3D Understanding via Patch Ordering on Videos

Poster presented at ICML 2026, Seoul

Abstract

Visual foundation models have achieved remarkable progress in scale and versatility, yet understanding the 3D world remains a fundamental challenge. While 2D images contain cues about 3D structure that humans readily interpret, deep models often fail to exploit them, underperforming on tasks such as multiview semantic consistency – crucial for applications including robotics and autonomous driving. We propose a self-supervised approach to enhance the 3D understanding of vision foundation models by (i) introducing a temporal nearest-neighbor consistency loss that finds corresponding points across video frames and enforces consistency between their nearest neighbors, (ii) incorporating reference-guided ordering that requires patch-level features to be not only expressive but also consistently aligned, and (iii) constructing a mixture of video datasets tailored to these objectives, thereby leveraging rich 3D information. Our method, 3DPoV, achieves state-of-the-art performance in keypoint matching under viewpoint variation, as well as in depth and surface normal estimation, and consistently improves a diverse set of backbones, including DINOv3.

Method

3DPoV uses a teacher–student architecture where the student processes video frames independently and the teacher provides a stable, EMA-updated anchor frame for supervision. Rather than matching features directly, we align the relative similarity structure of tracked patches over time: for each frame we compute similarity rankings against a shared bank of reference patches and use differentiable sorting to obtain soft permutation matrices, training the student to match the teacher's anchor-frame permutations. This yields viewpoint-invariant descriptors that stay consistent under motion, occlusion, and lighting changes — without relying on crops or masks.
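The ordering objective can be sketched in a few lines. The text above does not say which differentiable sorting operator is used, so this sketch swaps in a SoftSort-style relaxation (a softmax over negative absolute differences to the sorted scores). The temperature `tau`, the cross-entropy form of the loss, the EMA momentum, and all function names are assumptions for illustration. It is plain Python for readability; a real implementation would use an autodiff framework so that gradients flow through the student scores.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def soft_permutation(scores, tau=0.1):
    """SoftSort-style relaxation (an assumed choice of operator).

    Row i is a distribution over which reference patch holds
    descending rank i; as tau -> 0 it approaches a hard permutation.
    """
    ranked = sorted(scores, reverse=True)
    return [softmax([-abs(r - s) / tau for s in scores]) for r in ranked]


def ordering_loss(student_scores, teacher_scores, tau=0.1):
    """Cross-entropy between teacher and student soft permutations,
    averaged over ranks (hypothetical loss form)."""
    p_teacher = soft_permutation(teacher_scores, tau)
    p_student = soft_permutation(student_scores, tau)
    eps = 1e-12
    loss = 0.0
    for t_row, s_row in zip(p_teacher, p_student):
        loss -= sum(t * math.log(s + eps) for t, s in zip(t_row, s_row))
    return loss / len(p_teacher)


def ema_update(teacher_params, student_params, momentum=0.99):
    """Exponential moving average of student weights into the teacher."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


# Similarities of one tracked patch to four reference patches.
teacher = [0.9, 0.1, 0.5, 0.3]      # anchor frame (teacher)
same_order = [0.8, 0.0, 0.4, 0.2]   # other frame, same ranking
swapped = [0.1, 0.9, 0.5, 0.3]      # other frame, top and bottom exchanged

loss_same = ordering_loss(same_order, teacher)
loss_swapped = ordering_loss(swapped, teacher)
```

Here `loss_same` is lower than `loss_swapped`: a student that ranks the references in the teacher's order is penalized less than one that exchanges the most and least similar references, even though its raw similarity values differ from the teacher's. That is the sense in which the objective aligns relative similarity structure instead of matching features directly.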

Fine-tuning blends three complementary video sources: CO3D (long object-centric sequences with large viewpoint shifts), DL3DV (diverse dynamic scenes and spatial layouts), and YouTube-VOS (unconstrained motion, occlusions, and real-world camera trajectories). Together they span object-, scene-, and natural-video variability, and fine-tuning takes only ~20 hours on A100 GPUs.
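One simple way to blend the three sources is to draw the dataset for each training clip from a weighted categorical distribution. The mixing ratios are not stated here, so the equal weights below are purely hypothetical, as are the function and variable names.

```python
import random

# Hypothetical equal mixing weights; the actual ratios are not given.
SOURCE_WEIGHTS = {"CO3D": 1.0, "DL3DV": 1.0, "YouTube-VOS": 1.0}


def sample_sources(n, weights=SOURCE_WEIGHTS, seed=0):
    """Pick the source dataset for each of n training clips."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


batch_sources = sample_sources(12)
```

Changing a weight shifts how often that source appears in a batch without touching the rest of the data pipeline.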

Results

Keypoint Matching — SPair-71k

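Recall across viewpoint difficulty bins (S: same / M: moderate / L: large). 3DPoV improves on its base backbone in every bin for DINOv1-B/16 and DINOv2R-B/14, including under large viewpoint shifts. On DINOv3-B/16 the gains are marginal, and the moderate bin dips slightly (48.56 vs. 48.67).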

Model	Backbone	S / 0	M / 1	L / 2	All
DINO	ViT-S/16	28.34	23.38	24.44	25.63
TimeTuning	DINOv1-S/16	26.76	22.48	23.45	23.96
MoSiC	DINOv1-S/16	26.73	21.97	22.98	23.76
DINO	ViT-B/16	30.19	24.22	24.35	26.39
NeCo	DINOv1-B/16	30.24	24.45	23.10	26.32
3DPoV	DINOv1-B/16	31.77	25.74	25.80	28.16
DINOv2R	ViT-B/14	58.20	51.56	53.41	53.47
NeCo	DINOv2R-B/14	59.57	49.06	52.35	54.42
MoSiC	DINOv2R-B/14	56.37	50.70	51.75	51.72
3DPoV	DINOv2R-B/14	60.16	52.79	54.50	55.40
DINOv3	ViT-B/16	61.99	48.67	46.77	55.76
3DPoV	DINOv3-B/16	62.24	48.56	46.81	55.84
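
For readers unfamiliar with the metric, the following is a minimal sketch of how keypoint-matching recall is commonly computed from frozen patch features: each source keypoint is matched to its most similar target patch, and the match counts as correct if it lands within a distance threshold of the ground-truth location. The exact protocol behind the numbers above is not specified here, so the threshold, the cosine-similarity matching, and all names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_keypoint(query_feat, target_feats):
    """Index of the target patch most similar to the query feature."""
    return max(range(len(target_feats)),
               key=lambda j: cosine(query_feat, target_feats[j]))


def keypoint_recall(query_feats, target_feats, target_xy, gt_xy, threshold):
    """Percentage of keypoints whose nearest-neighbor match falls within
    `threshold` pixels of the ground-truth target location."""
    hits = 0
    for feat, gt in zip(query_feats, gt_xy):
        j = match_keypoint(feat, target_feats)
        if math.dist(target_xy[j], gt) <= threshold:
            hits += 1
    return 100.0 * hits / len(query_feats)


# Toy example: three target patches with 2-D features and pixel centers.
target_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_xy = [(8, 8), (24, 8), (8, 24)]
query_feats = [[0.9, 0.1], [0.1, 0.9]]
gt_xy = [(10, 9), (8, 24)]

recall = keypoint_recall(query_feats, target_feats, target_xy, gt_xy, threshold=4)
```

In this toy case the first keypoint is matched near its ground truth and the second is not, so `recall` is 50.0.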

Keypoint Matching — Navi

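Recall binned by relative camera rotation. 3DPoV improves on its base backbone in every bin for DINOv1-B/16 and DINOv2R-B/14. On DINOv3-B/16 the results are essentially unchanged, with a slight drop in the largest-rotation bin (31.36 vs. 31.45).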

Model	Backbone	θ₀¹⁵	θ₁₅³⁰	θ₃₀⁶⁰	θ₆₀¹²⁰
DINO	ViT-S/16	84.36	55.17	34.58	20.48
TimeTuning	DINOv1-S/16	80.81	52.61	33.93	19.96
MoSiC	DINOv1-S/16	80.21	52.07	33.37	19.59
DINO	ViT-B/16	86.13	56.92	33.37	19.74
NeCo	DINOv1-B/16	84.94	53.52	31.80	18.47
3DPoV	DINOv1-B/16	86.42	57.18	33.77	20.42
DINOv2R	ViT-B/14	87.87	67.45	47.15	31.58
NeCo	DINOv2R-B/14	88.76	65.10	43.88	28.96
MoSiC	DINOv2R-B/14	87.13	66.46	46.95	31.78
3DPoV	DINOv2R-B/14	89.18	68.98	47.64	31.63
DINOv3	ViT-B/16	94.40	74.73	48.64	31.45
3DPoV	DINOv3-B/16	94.47	74.74	48.65	31.36
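
The viewpoint bins can be reproduced from camera poses by computing the relative rotation angle between two views and assigning it to a bin. The sketch below assumes the column labels denote degree ranges (0 to 15, 15 to 30, 30 to 60, 60 to 120) and that bins are half-open except for the last; both are readings of the table header, not details confirmed by the text. Function names are hypothetical.

```python
import math

BIN_EDGES = [0, 15, 30, 60, 120]  # assumed to be degrees


def relative_rotation_deg(r_a, r_b):
    """Angle (degrees) of the rotation taking camera A to camera B.

    Uses trace(R_a^T R_b) = sum_ij R_a[i][j] * R_b[i][j] and
    theta = arccos((trace - 1) / 2).
    """
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard rounding
    return math.degrees(math.acos(cos_theta))


def viewpoint_bin(angle_deg, edges=BIN_EDGES):
    """Bin index for an angle, or None if it falls outside all bins."""
    last = len(edges) - 2
    for k in range(len(edges) - 1):
        if edges[k] <= angle_deg < edges[k + 1]:
            return k
        if k == last and angle_deg == edges[k + 1]:
            return k  # include the upper edge in the final bin
    return None


def rot_z(deg):
    """Rotation matrix about the z-axis, used only for the example."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


angle = relative_rotation_deg(rot_z(10), rot_z(50))  # about 40 degrees
bin_index = viewpoint_bin(angle)                     # third bin (index 2)
```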

Depth Estimation — Navi

Surface Normal Estimation — Navi

[Figure: qualitative comparison of surface normal estimation]

Robustness to Occlusion

SPair-71k recall with synthetic occlusions applied. 3DPoV improves consistently across all difficulty levels and both occlusion settings.

Model	S / 0	M / 1	L / 2	All
One view occluded
DINOv2R	59.78	53.84	56.01	55.54
3DPoV	62.04	54.92	57.65	57.54
Both views occluded
DINOv2R	58.84	53.41	56.90	54.88
3DPoV	61.30	54.11	58.09	56.98

Robustness to Lighting

Correspondence recall under default, less-variation, and intense lighting conditions. 3DPoV's gains over DINOv2R are larger on the intense-lighting subset than on the less-variation subset.

Navi keypoint recall binned by viewpoint across three lighting conditions.

Model	Backbone	θ₀¹⁵	θ₁₅³⁰	θ₃₀⁶⁰	θ₆₀¹²⁰
Default Navi dataset
DINOv2R	ViT-B/14	87.87	67.45	47.15	31.58
3DPoV	DINOv2R-B/14	89.18	68.98	47.64	31.63
Less lighting variation subset
DINOv2R	ViT-B/14	92.28	67.72	49.24	33.69
3DPoV	DINOv2R-B/14	93.27	68.85	49.27	33.52
Intense lighting conditions subset
DINOv2R	ViT-B/14	89.63	66.98	48.13	33.96
3DPoV	DINOv2R-B/14	90.91	68.45	48.82	34.11
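
The per-bin gains behind the lighting comparison can be checked directly from the table. The short script below copies the numbers above and computes 3DPoV minus DINOv2R for each bin and condition; the dictionary keys and function name are just labels chosen for this example.

```python
# Recall per rotation bin, copied from the table above.
RESULTS = {
    "default": {
        "DINOv2R": [87.87, 67.45, 47.15, 31.58],
        "3DPoV": [89.18, 68.98, 47.64, 31.63],
    },
    "less_variation": {
        "DINOv2R": [92.28, 67.72, 49.24, 33.69],
        "3DPoV": [93.27, 68.85, 49.27, 33.52],
    },
    "intense": {
        "DINOv2R": [89.63, 66.98, 48.13, 33.96],
        "3DPoV": [90.91, 68.45, 48.82, 34.11],
    },
}


def gains(condition):
    """Per-bin recall difference (3DPoV minus DINOv2R), in points."""
    base = RESULTS[condition]["DINOv2R"]
    ours = RESULTS[condition]["3DPoV"]
    return [round(o - b, 2) for o, b in zip(ours, base)]


mean_gain = {c: sum(gains(c)) / len(gains(c)) for c in RESULTS}

for condition in RESULTS:
    print(condition, gains(condition))
```

Averaged over the four bins, the gain is largest on the intense-lighting subset and smallest on the less-variation subset, where the largest-rotation bin is slightly negative.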

BibTeX

@inproceedings{simion2026_3dpov,
  title     = {3DPoV: Improving 3D Understanding via Patch Ordering on Videos},
  author    = {Simion, Ioana and Salehi, Mohammadreza and Venkataramanan, Shashanka and Snoek, Cees G. M. and Asano, Yuki M.},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  series    = {PMLR},
  volume    = {306},
  year      = {2026}
}