BenchDepth: Are We on the Right Way to Evaluate Depth Foundation Models?

¹KAUST   ²ByteDance Seed   ³Zhejiang University

Overview

Depth estimation is a fundamental task in computer vision with diverse applications. Recent advancements in deep learning have led to powerful depth foundation models (DFMs), yet their evaluation remains challenging due to inconsistencies in existing protocols. Traditional benchmarks rely on alignment-based metrics that introduce biases, favor certain depth representations, and complicate fair comparisons. In this work, we propose BenchDepth, a new benchmark that evaluates DFMs through five carefully selected downstream proxy tasks: depth completion, stereo matching, monocular feed-forward 3D scene reconstruction, SLAM, and vision-language spatial understanding. Unlike conventional evaluation protocols, our approach assesses DFMs based on their practical utility in real-world applications, bypassing problematic alignment procedures. We benchmark eight state-of-the-art DFMs and provide an in-depth analysis of key findings and observations. We hope our work sparks further discussion in the community on best practices for depth model evaluation and paves the way for future research and advancements in depth estimation.
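
To make the alignment issue concrete: the conventional protocol fits a per-image scale and shift between prediction and ground truth before scoring, so the space in which the fit is done (depth vs. disparity) and the fitting objective themselves influence the outcome. Below is a minimal NumPy sketch of such a least-squares alignment followed by AbsRel; the function names are ours and this illustrates the generic protocol, not BenchDepth code.

```python
import numpy as np

def align_scale_shift(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Fit scale s and shift t minimizing ||s * pred + t - gt||^2 on valid pixels."""
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)       # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)   # closed-form least squares
    return s * pred + t

def abs_rel(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """Standard AbsRel metric: mean |pred - gt| / gt over valid pixels."""
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

# The bias: aligning in disparity space vs. depth space (or using median
# scaling instead of least squares) changes the relative scores of models
# that predict different representations, even for identical geometry.
```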

Benchmark Tables

The tables below report per-task results. The first table summarizes improvements and ranks across all five proxy tasks; the remaining tables break down depth completion, stereo matching, Mono 3DGS, and SLAM.


Summary across the five proxy tasks. Each cell shows imp. (%, mean relative improvement over the w/o DFM baseline) with the per-task rank in parentheses; the last column averages the available ranks.

| Method | Depth Completion | Stereo Matching | Mono 3DGS | SLAM | VLM Spatial Understanding | Avg. rank |
|---|---|---|---|---|---|---|
| MiDaS | +3.09 (4) | -3.07 (7) | +5.24 (1) | +2.32 (5) | -- | 4.25 |
| DAV2-Rel | +9.26 (1) | +5.77 (1) | +4.21 (3) | +10.00 (1) | -- | 1.50 |
| DAV2-Met | +6.48 (2) | +1.46 (5) | +4.81 (2) | +1.95 (6) | -- | 3.75 |
| Metric3DV2 | -0.38 (8) | -1.74 (6) | -0.05 (5) | -4.19 (-) | -- | 6.33 |
| UniDepth | +2.97 (5) | -3.68 (8) | -0.10 (6) | +7.08 (2) | -- | 5.25 |
| Marigold | +1.76 (6) | +1.87 (4) | -4.19 (8) | +4.67 (4) | -- | 5.50 |
| GenPercept | +6.16 (3) | +1.99 (3) | -0.14 (4) | +6.16 (3) | -- | 3.25 |
| MoGe | +1.53 (7) | +2.70 (2) | -1.60 (7) | -4.04 (7) | -- | 5.75 |

Depth completion at decreasing sparsity levels (100 / 32 / 8 / 4 / 1). Each cell shows RMSE / MSE (lower is better).

| Method | 100 | 32 | 8 | 4 | 1 | imp. | rank |
|---|---|---|---|---|---|---|---|
| w/o DFM | 0.206 / 0.102 | 0.334 / 0.199 | 0.486 / 0.340 | 0.514 / 0.370 | 0.550 / 0.406 | -- | -- |
| MiDaS | 0.204 / 0.114 | 0.294 / 0.182 | 0.449 / 0.311 | 0.493 / 0.355 | 0.556 / 0.414 | +3.09 | 4 |
| DAV2-Rel | 0.191 / 0.099 | 0.279 / 0.166 | 0.427 / 0.292 | 0.471 / 0.336 | 0.533 / 0.396 | +9.26 | 1 |
| DAV2-Met | 0.202 / 0.112 | 0.287 / 0.178 | 0.431 / 0.297 | 0.472 / 0.338 | 0.529 / 0.392 | +6.48 | 2 |
| Metric3DV2 | 0.216 / 0.128 | 0.306 / 0.195 | 0.454 / 0.317 | 0.497 / 0.359 | 0.557 / 0.415 | -0.38 | 8 |
| UniDepth | 0.210 / 0.122 | 0.296 / 0.187 | 0.438 / 0.308 | 0.480 / 0.349 | 0.540 / 0.404 | +2.97 | 5 |
| Marigold | 0.210 / 0.121 | 0.296 / 0.187 | 0.448 / 0.314 | 0.491 / 0.356 | 0.555 / 0.414 | +1.76 | 6 |
| GenPercept | 0.199 / 0.110 | 0.284 / 0.174 | 0.436 / 0.301 | 0.479 / 0.342 | 0.542 / 0.402 | +6.16 | 3 |
| MoGe | 0.210 / 0.124 | 0.295 / 0.188 | 0.444 / 0.312 | 0.489 / 0.355 | 0.558 / 0.417 | +1.53 | 7 |

Stereo matching. Each cell shows EPE / >Npt error rate (N = 3, 2, 1 for SceneFlow, Middlebury, ETH3D; lower is better).

| Method | SceneFlow (EPE / >3pt) | Middlebury (EPE / >2pt) | ETH3D (EPE / >1pt) | imp. | rank |
|---|---|---|---|---|---|
| w/o DFM | 0.496 / 2.599 | 0.857 / 6.655 | 0.283 / 3.575 | -- | -- |
| MiDaS | 0.483 / 2.502 | 1.061 / 7.316 | 0.273 / 3.383 | -3.07 | 7 |
| DAV2-Rel | 0.456 / 2.432 | 0.834 / 6.399 | 0.275 / 3.189 | +5.77 | 1 |
| DAV2-Met | 0.471 / 2.473 | 0.938 / 6.177 | 0.270 / 3.698 | +1.46 | 5 |
| Metric3DV2 | 0.482 / 2.521 | 0.949 / 7.309 | 0.275 / 3.523 | -1.74 | 6 |
| UniDepth | 0.477 / 2.521 | 0.964 / 7.242 | 0.285 / 3.822 | -3.68 | 8 |
| Marigold | 0.475 / 2.499 | 0.899 / 6.519 | 0.273 / 3.485 | +1.87 | 4 |
| GenPercept | 0.473 / 2.485 | 0.935 / 6.649 | 0.265 / 3.374 | +1.99 | 3 |
| MoGe | 0.473 / 2.481 | 0.907 / 5.951 | 0.279 / 3.544 | +2.70 | 2 |

Monocular feed-forward 3D reconstruction (Mono 3DGS). Each cell shows PSNR↑ / SSIM↑ / LPIPS↓.

| Method | 5 frames | 10 frames | [-30, 30] frames | imp. | rank |
|---|---|---|---|---|---|
| w/o DFM | 24.285 / 0.803 / 0.151 | 21.767 / 0.729 / 0.203 | 21.241 / 0.705 / 0.230 | -- | -- |
| MiDaS | 24.964 / 0.812 / 0.125 | 22.290 / 0.735 / 0.179 | 21.769 / 0.710 / 0.212 | +5.24 | 1 |
| DAV2-Rel | 24.965 / 0.812 / 0.129 | 22.305 / 0.733 / 0.185 | 21.703 / 0.706 / 0.218 | +4.21 | 3 |
| DAV2-Met | 25.000 / 0.812 / 0.128 | 22.341 / 0.735 / 0.182 | 21.842 / 0.711 / 0.215 | +4.81 | 2 |
| Metric3DV2 | 24.468 / 0.787 / 0.150 | 21.994 / 0.713 / 0.204 | 21.396 / 0.690 / 0.233 | -0.05 | 5 |
| UniDepth | 23.983 / 0.786 / 0.145 | 21.530 / 0.708 / 0.202 | 21.036 / 0.687 / 0.235 | -0.10 | 6 |
| Marigold | 23.974 / 0.779 / 0.162 | 21.515 / 0.701 / 0.219 | 20.952 / 0.676 / 0.248 | -4.19 | 8 |
| GenPercept | 24.119 / 0.787 / 0.140 | 21.489 / 0.705 / 0.197 | 21.029 / 0.682 / 0.230 | -0.14 | 4 |
| MoGe | 23.930 / 0.780 / 0.144 | 21.309 / 0.696 / 0.202 | 20.851 / 0.673 / 0.235 | -1.60 | 7 |

SLAM reconstruction quality per scene. Each cell shows Acc. / Comp. (lower is better).

| Method | rm-0 | rm-1 | rm-2 | off-0 | off-1 | off-2 | off-3 | off-4 | imp. | rank |
|---|---|---|---|---|---|---|---|---|---|---|
| w/o DFM | 3.37 / 3.93 | 4.01 / 4.61 | 3.58 / 3.97 | 7.26 / 8.25 | 5.82 / 6.52 | 6.98 / 7.72 | 6.98 / 6.92 | 4.26 / 6.09 | -- | -- |
| MiDaS | 3.25 / 3.63 | 3.59 / 4.12 | 3.49 / 3.78 | 8.09 / 9.04 | 6.02 / 7.08 | 4.63 / 6.19 | 4.93 / 5.40 | 3.95 / 5.71 | +2.32 | 5 |
| DAV2-Rel | 3.30 / 3.92 | 3.52 / 3.85 | 3.28 / 3.59 | 6.16 / 6.94 | 5.78 / 6.62 | 6.55 / 7.09 | 7.00 / 6.43 | 4.26 / 6.09 | +10.00 | 1 |
| DAV2-Met | 3.22 / 3.39 | 3.48 / 3.98 | 3.47 / 3.87 | 8.58 / 9.64 | 4.59 / 5.40 | 6.38 / 7.43 | 6.13 / 5.59 | 3.98 / 6.29 | +1.95 | 6 |
| Metric3DV2 | 3.48 / 3.64 | 3.45 / 3.93 | 3.73 / 4.09 | 9.55 / 10.53 | 5.82 / 6.41 | 5.20 / 6.67 | 6.73 / 6.78 | 4.51 / 6.65 | -4.19 | - |
| UniDepth | 3.11 / 3.49 | 3.73 / 4.38 | 3.80 / 4.06 | 5.96 / 6.91 | 5.05 / 6.05 | 6.48 / 7.41 | 5.83 / 5.95 | 4.60 / 6.76 | +7.08 | 2 |
| Marigold | 3.01 / 3.67 | 3.77 / 4.07 | 3.70 / 4.00 | 7.07 / 7.93 | 6.23 / 7.01 | 4.83 / 6.43 | 6.32 / 6.26 | 4.52 / 6.79 | +4.67 | 4 |
| GenPercept | 3.28 / 3.47 | 3.77 / 4.34 | 3.33 / 3.73 | 7.06 / 7.65 | 4.14 / 5.06 | 4.38 / 6.35 | 5.30 / 5.05 | 4.40 / 6.20 | +6.16 | 3 |
| MoGe | 3.26 / 3.67 | 3.67 / 4.23 | 3.89 / 4.33 | 8.86 / 9.83 | 4.55 / 5.58 | 5.68 / 6.73 | 6.40 / 6.32 | 3.92 / 5.98 | -4.04 | 7 |
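
As a reading aid for the tables above: the imp. column is consistent with averaging, over all metrics in a task's table, the per-metric relative change versus the w/o DFM baseline (all metrics here are lower-is-better except PSNR and SSIM). A minimal sketch with names of our own choosing; checking it against the stereo-matching rows reproduces, e.g., MoGe's +2.70 and MiDaS's -3.07.

```python
def mean_relative_improvement(baseline, model, lower_is_better):
    """Average relative improvement (%) of `model` over `baseline`.

    baseline, model: per-metric values; lower_is_better: matching booleans.
    This is an assumed reading of the tables' "imp." column, verified
    against the stereo-matching numbers, not code from the paper.
    """
    gains = [
        100.0 * ((b - m) / b if low else (m - b) / b)
        for b, m, low in zip(baseline, model, lower_is_better)
    ]
    return sum(gains) / len(gains)

# Stereo matching, MoGe vs. w/o DFM (EPE and >Npt rates, all lower-is-better):
base = [0.496, 2.599, 0.857, 6.655, 0.283, 3.575]
moge = [0.473, 2.481, 0.907, 5.951, 0.279, 3.544]
print(round(mean_relative_improvement(base, moge, [True] * 6), 2))  # 2.7
```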

Observations and Findings

1. Are DFMs good for downstream tasks?
Most depth foundation models improve the performance of downstream tasks, highlighting their potential for broader applications in the future.
2. Which DFMs are the best for the downstream tasks?
DAV2 achieves the best results across proxy tasks, demonstrating the benefits of scaling up training data and incorporating synthetic data.
3. Which kind of DFM is best for the downstream tasks?
Affine-invariant disparity methods consistently outperform other depth representations, even though MiDaS is the oldest method among them.
4. Which metric DFM is the best for the downstream tasks?
  • Despite being fine-tuned on a single synthetic dataset (Hypersim), DAV2-Met significantly outperforms metric depth models trained on multiple datasets (Metric3DV2, UniDepth). This aligns with ZoeDepth's conclusion that fine-tuning a well-pretrained affine-invariant disparity model enhances metric depth estimation.
  • Moreover, the performance gap also suggests that incorporating synthetic data for metric depth training might be crucial, as it allows models to learn high-frequency details that are often lost in real-world datasets.
5. Which diffusion-based DFM is the best for the downstream tasks?
The performance gap between Marigold and GenPercept underscores the importance of effective fine-tuning strategies for Stable Diffusion, a powerful foundation model. Since current fine-tuning is limited to VKITTI and Hypersim, expanding the training data, following the success of DAV2, could further unlock these models' potential.
6. What about the latest kind of DFM?
MoGe, a recent approach to geometry estimation, also demonstrates its potential on BenchDepth.
7. What about higher-level tasks?
For the highest-level task, VLM spatial understanding, all methods yield comparable results. This suggests that at this higher level, different depth estimation approaches can be equally effective.

🔥 Citation

@article{li2025benchdepth,
  title={BenchDepth: A Benchmark for Evaluating Depth Foundation Models}, 
  author={Li, Zhenyu and Lin, Haotong and Feng, Jiashi and Wonka, Peter and Kang, Bingyi},
  year={2025},
  journal={arXiv preprint arXiv:tbd},
  primaryClass={cs.CV}}