BenchDepth: Are We on the Right Way to Evaluate Depth Foundation Models?

¹KAUST   ²ByteDance Seed   ³Zhejiang University

Overview

Depth estimation is a fundamental task in computer vision with diverse applications. Recent advancements in deep learning have led to powerful depth foundation models (DFMs), yet their evaluation remains challenging due to inconsistencies in existing protocols. Traditional benchmarks rely on alignment-based metrics that introduce biases, favor certain depth representations, and complicate fair comparisons. In this work, we propose BenchDepth, a new benchmark that evaluates DFMs through five carefully selected downstream proxy tasks: depth completion, stereo matching, monocular feed-forward 3D scene reconstruction, SLAM, and vision-language spatial understanding. Unlike conventional evaluation protocols, our approach assesses DFMs based on their practical utility in real-world applications, bypassing problematic alignment procedures. We benchmark eight state-of-the-art DFMs and provide an in-depth analysis of key findings and observations. We hope our work sparks further discussion in the community on best practices for depth model evaluation and paves the way for future research and advancements in depth estimation.
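
To make the alignment issue concrete: the conventional protocol fits a per-image scale and shift between prediction and ground truth before scoring, so the space in which the fit is done (depth vs. disparity) and the fitting objective themselves influence the outcome. Below is a minimal NumPy sketch of such a least-squares alignment followed by AbsRel; the function names are ours and this illustrates the generic protocol, not BenchDepth code.

```python
import numpy as np

def align_scale_shift(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Fit scale s and shift t minimizing ||s * pred + t - gt||^2 on valid pixels."""
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)       # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)   # closed-form least squares
    return s * pred + t

def abs_rel(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """Standard AbsRel metric: mean |pred - gt| / gt over valid pixels."""
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

# The bias: aligning in disparity space vs. depth space (or using median
# scaling instead of least squares) changes the relative scores of models
# that predict different representations, even for identical geometry.
```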

Benchmark Tables

The tables below report per-task results. The first table summarizes improvements and ranks across all five proxy tasks; the remaining tables break down depth completion, stereo matching, Mono 3DGS, and SLAM.


Summary across the five proxy tasks. Each cell shows imp. (%, mean relative improvement over the w/o DFM baseline) with the per-task rank in parentheses; the last column averages the available ranks.

| Method | Depth Completion | Stereo Matching | Mono 3DGS | SLAM | VLM Spatial Understanding | Avg. rank |
|---|---|---|---|---|---|---|
| MiDaS | +3.09 (4) | -3.07 (7) | +5.24 (1) | +2.32 (5) | -- | 4.25 |
| DAV2-Rel | +9.26 (1) | +5.77 (1) | +4.21 (3) | +10.00 (1) | -- | 1.50 |
| DAV2-Met | +6.48 (2) | +1.46 (5) | +4.81 (2) | +1.95 (6) | -- | 3.75 |
| Metric3DV2 | -0.38 (8) | -1.74 (6) | -0.05 (5) | -4.19 (-) | -- | 6.33 |
| UniDepth | +2.97 (5) | -3.68 (8) | -0.10 (6) | +7.08 (2) | -- | 5.25 |
| Marigold | +1.76 (6) | +1.87 (4) | -4.19 (8) | +4.67 (4) | -- | 5.50 |
| GenPercept | +6.16 (3) | +1.99 (3) | -0.14 (4) | +6.16 (3) | -- | 3.25 |
| MoGe | +1.53 (7) | +2.70 (2) | -1.60 (7) | -4.04 (7) | -- | 5.75 |

Depth completion at decreasing sparsity levels (100 / 32 / 8 / 4 / 1). Each cell shows RMSE / MSE (lower is better).

| Method | 100 | 32 | 8 | 4 | 1 | imp. | rank |
|---|---|---|---|---|---|---|---|
| w/o DFM | 0.206 / 0.102 | 0.334 / 0.199 | 0.486 / 0.340 | 0.514 / 0.370 | 0.550 / 0.406 | -- | -- |
| MiDaS | 0.204 / 0.114 | 0.294 / 0.182 | 0.449 / 0.311 | 0.493 / 0.355 | 0.556 / 0.414 | +3.09 | 4 |
| DAV2-Rel | 0.191 / 0.099 | 0.279 / 0.166 | 0.427 / 0.292 | 0.471 / 0.336 | 0.533 / 0.396 | +9.26 | 1 |
| DAV2-Met | 0.202 / 0.112 | 0.287 / 0.178 | 0.431 / 0.297 | 0.472 / 0.338 | 0.529 / 0.392 | +6.48 | 2 |
| Metric3DV2 | 0.216 / 0.128 | 0.306 / 0.195 | 0.454 / 0.317 | 0.497 / 0.359 | 0.557 / 0.415 | -0.38 | 8 |
| UniDepth | 0.210 / 0.122 | 0.296 / 0.187 | 0.438 / 0.308 | 0.480 / 0.349 | 0.540 / 0.404 | +2.97 | 5 |
| Marigold | 0.210 / 0.121 | 0.296 / 0.187 | 0.448 / 0.314 | 0.491 / 0.356 | 0.555 / 0.414 | +1.76 | 6 |
| GenPercept | 0.199 / 0.110 | 0.284 / 0.174 | 0.436 / 0.301 | 0.479 / 0.342 | 0.542 / 0.402 | +6.16 | 3 |
| MoGe | 0.210 / 0.124 | 0.295 / 0.188 | 0.444 / 0.312 | 0.489 / 0.355 | 0.558 / 0.417 | +1.53 | 7 |

Stereo matching. Each cell shows EPE / >Npt error rate (N = 3, 2, 1 for SceneFlow, Middlebury, ETH3D; lower is better).

| Method | SceneFlow (EPE / >3pt) | Middlebury (EPE / >2pt) | ETH3D (EPE / >1pt) | imp. | rank |
|---|---|---|---|---|---|
| w/o DFM | 0.496 / 2.599 | 0.857 / 6.655 | 0.283 / 3.575 | -- | -- |
| MiDaS | 0.483 / 2.502 | 1.061 / 7.316 | 0.273 / 3.383 | -3.07 | 7 |
| DAV2-Rel | 0.456 / 2.432 | 0.834 / 6.399 | 0.275 / 3.189 | +5.77 | 1 |
| DAV2-Met | 0.471 / 2.473 | 0.938 / 6.177 | 0.270 / 3.698 | +1.46 | 5 |
| Metric3DV2 | 0.482 / 2.521 | 0.949 / 7.309 | 0.275 / 3.523 | -1.74 | 6 |
| UniDepth | 0.477 / 2.521 | 0.964 / 7.242 | 0.285 / 3.822 | -3.68 | 8 |
| Marigold | 0.475 / 2.499 | 0.899 / 6.519 | 0.273 / 3.485 | +1.87 | 4 |
| GenPercept | 0.473 / 2.485 | 0.935 / 6.649 | 0.265 / 3.374 | +1.99 | 3 |
| MoGe | 0.473 / 2.481 | 0.907 / 5.951 | 0.279 / 3.544 | +2.70 | 2 |

Monocular feed-forward 3D reconstruction (Mono 3DGS). Each cell shows PSNR↑ / SSIM↑ / LPIPS↓.

| Method | 5 frames | 10 frames | [-30, 30] frames | imp. | rank |
|---|---|---|---|---|---|
| w/o DFM | 24.285 / 0.803 / 0.151 | 21.767 / 0.729 / 0.203 | 21.241 / 0.705 / 0.230 | -- | -- |
| MiDaS | 24.964 / 0.812 / 0.125 | 22.290 / 0.735 / 0.179 | 21.769 / 0.710 / 0.212 | +5.24 | 1 |
| DAV2-Rel | 24.965 / 0.812 / 0.129 | 22.305 / 0.733 / 0.185 | 21.703 / 0.706 / 0.218 | +4.21 | 3 |
| DAV2-Met | 25.000 / 0.812 / 0.128 | 22.341 / 0.735 / 0.182 | 21.842 / 0.711 / 0.215 | +4.81 | 2 |
| Metric3DV2 | 24.468 / 0.787 / 0.150 | 21.994 / 0.713 / 0.204 | 21.396 / 0.690 / 0.233 | -0.05 | 5 |
| UniDepth | 23.983 / 0.786 / 0.145 | 21.530 / 0.708 / 0.202 | 21.036 / 0.687 / 0.235 | -0.10 | 6 |
| Marigold | 23.974 / 0.779 / 0.162 | 21.515 / 0.701 / 0.219 | 20.952 / 0.676 / 0.248 | -4.19 | 8 |
| GenPercept | 24.119 / 0.787 / 0.140 | 21.489 / 0.705 / 0.197 | 21.029 / 0.682 / 0.230 | -0.14 | 4 |
| MoGe | 23.930 / 0.780 / 0.144 | 21.309 / 0.696 / 0.202 | 20.851 / 0.673 / 0.235 | -1.60 | 7 |

SLAM reconstruction quality per scene. Each cell shows Acc. / Comp. (lower is better).

| Method | rm-0 | rm-1 | rm-2 | off-0 | off-1 | off-2 | off-3 | off-4 | imp. | rank |
|---|---|---|---|---|---|---|---|---|---|---|
| w/o DFM | 3.37 / 3.93 | 4.01 / 4.61 | 3.58 / 3.97 | 7.26 / 8.25 | 5.82 / 6.52 | 6.98 / 7.72 | 6.98 / 6.92 | 4.26 / 6.09 | -- | -- |
| MiDaS | 3.25 / 3.63 | 3.59 / 4.12 | 3.49 / 3.78 | 8.09 / 9.04 | 6.02 / 7.08 | 4.63 / 6.19 | 4.93 / 5.40 | 3.95 / 5.71 | +2.32 | 5 |
| DAV2-Rel | 3.30 / 3.92 | 3.52 / 3.85 | 3.28 / 3.59 | 6.16 / 6.94 | 5.78 / 6.62 | 6.55 / 7.09 | 7.00 / 6.43 | 4.26 / 6.09 | +10.00 | 1 |
| DAV2-Met | 3.22 / 3.39 | 3.48 / 3.98 | 3.47 / 3.87 | 8.58 / 9.64 | 4.59 / 5.40 | 6.38 / 7.43 | 6.13 / 5.59 | 3.98 / 6.29 | +1.95 | 6 |
| Metric3DV2 | 3.48 / 3.64 | 3.45 / 3.93 | 3.73 / 4.09 | 9.55 / 10.53 | 5.82 / 6.41 | 5.20 / 6.67 | 6.73 / 6.78 | 4.51 / 6.65 | -4.19 | - |
| UniDepth | 3.11 / 3.49 | 3.73 / 4.38 | 3.80 / 4.06 | 5.96 / 6.91 | 5.05 / 6.05 | 6.48 / 7.41 | 5.83 / 5.95 | 4.60 / 6.76 | +7.08 | 2 |
| Marigold | 3.01 / 3.67 | 3.77 / 4.07 | 3.70 / 4.00 | 7.07 / 7.93 | 6.23 / 7.01 | 4.83 / 6.43 | 6.32 / 6.26 | 4.52 / 6.79 | +4.67 | 4 |
| GenPercept | 3.28 / 3.47 | 3.77 / 4.34 | 3.33 / 3.73 | 7.06 / 7.65 | 4.14 / 5.06 | 4.38 / 6.35 | 5.30 / 5.05 | 4.40 / 6.20 | +6.16 | 3 |
| MoGe | 3.26 / 3.67 | 3.67 / 4.23 | 3.89 / 4.33 | 8.86 / 9.83 | 4.55 / 5.58 | 5.68 / 6.73 | 6.40 / 6.32 | 3.92 / 5.98 | -4.04 | 7 |
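
As a reading aid for the tables above: the imp. column is consistent with averaging, over all metrics in a task's table, the per-metric relative change versus the w/o DFM baseline (all metrics here are lower-is-better except PSNR and SSIM). A minimal sketch with names of our own choosing; checking it against the stereo-matching rows reproduces, e.g., MoGe's +2.70 and MiDaS's -3.07.

```python
def mean_relative_improvement(baseline, model, lower_is_better):
    """Average relative improvement (%) of `model` over `baseline`.

    baseline, model: per-metric values; lower_is_better: matching booleans.
    This is an assumed reading of the tables' "imp." column, verified
    against the stereo-matching numbers, not code from the paper.
    """
    gains = [
        100.0 * ((b - m) / b if low else (m - b) / b)
        for b, m, low in zip(baseline, model, lower_is_better)
    ]
    return sum(gains) / len(gains)

# Stereo matching, MoGe vs. w/o DFM (EPE and >Npt rates, all lower-is-better):
base = [0.496, 2.599, 0.857, 6.655, 0.283, 3.575]
moge = [0.473, 2.481, 0.907, 5.951, 0.279, 3.544]
print(round(mean_relative_improvement(base, moge, [True] * 6), 2))  # 2.7
```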

Observations and Findings

1. Are DFMs good for downstream tasks?
Most depth foundation models improve the performance of downstream tasks, highlighting their potential for broader applications in the future.
2. Which DFMs are the best for the downstream tasks?
DAV2 achieves the best results across proxy tasks, demonstrating the benefits of scaling up training data and incorporating synthetic data.
3. Which kind of DFM is best for the downstream tasks?
Affine-invariant disparity methods consistently outperform other depth representations, even though MiDaS is the oldest method among them.
4. Which metric DFM is the best for the downstream tasks?
  • Despite being fine-tuned on a single synthetic dataset (Hypersim), DAV2-Met significantly outperforms metric depth models trained on multiple datasets (Metric3DV2, UniDepth). This aligns with ZoeDepth's conclusion that fine-tuning a well-pretrained affine-invariant disparity model enhances metric depth estimation.
  • Moreover, the performance gap also suggests that incorporating synthetic data for metric depth training might be crucial, as it allows models to learn high-frequency details that are often lost in real-world datasets.
5. Which diffusion-based DFM is the best for the downstream tasks?
The performance gap between Marigold and GenPercept underscores the importance of effective fine-tuning strategies for Stable Diffusion, a powerful foundation model. Since current fine-tuning is limited to VKITTI and Hypersim, expanding the training data, following the success of DAV2, could further unlock these models' potential.
6. What about the latest kind of DFM?
MoGe, a recent approach to geometry estimation, also demonstrates its potential on BenchDepth.
7. What about higher-level tasks?
For the highest-level task, VLM spatial understanding, all methods yield comparable results. This suggests that at this higher level, different depth estimation approaches can be equally effective.

🔥 Citation

@article{li2025benchdepth,
  title={BenchDepth: A Benchmark for Evaluating Depth Foundation Models}, 
  author={Li, Zhenyu and Lin, Haotong and Feng, Jiashi and Wonka, Peter and Kang, Bingyi},
  year={2025},
  journal={arXiv preprint arXiv:tbd},
  primaryClass={cs.CV}}