PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation

KAUST

Abstract

Single image depth estimation is a foundational task in computer vision and generative modeling. However, prevailing depth estimation models struggle to accommodate the increasing resolutions commonplace in today's consumer cameras and devices. Existing high-resolution strategies show promise, but they often face limitations, ranging from error propagation to the loss of high-frequency details. We present PatchFusion, a novel tile-based framework with three key components to improve the current state of the art: (1) a patch-wise fusion network that fuses a globally consistent coarse prediction with finer, inconsistent tiled predictions via high-level feature guidance, (2) a Global-to-Local (G2L) module that adds vital context to the fusion network, discarding the need for patch selection heuristics, and (3) a Consistency-Aware Training (CAT) and Inference (CAI) approach, emphasizing patch overlap consistency and thereby eliminating the need for post-processing. Experiments on UnrealStereo4K, MVS-Synth, and Middlebury 2014 demonstrate that our framework can generate high-resolution depth maps with intricate details. PatchFusion is independent of the base model for depth estimation. Notably, our framework built on top of the SOTA ZoeDepth brings improvements of 17.3% and 29.4% in terms of root mean squared error (RMSE) on UnrealStereo4K and MVS-Synth, respectively.

👀 Interactive Comparison


In-Domain Samples from UnrealStereo4K

Zero-Shot Transfer (Out-of-Domain from Middlebury 2014)

Depth-Guided Text-to-Image Generation (Out-of-Domain from Internet)

Input Image (from Quang Nguyen Vinh):

Text Prompt: An oil painting of a rice field, with warm lighting and rich texture

Result

Input Image (from Jo Kassis):

Text Prompt: An evening scene with the Eiffel Tower, the bridge under the glow of street lamps and a twilight sky

Result

🚀 Framework

Most SOTA depth estimation architectures are bottlenecked by the resolution capabilities of their backbone, leading to blurry depth predictions. For instance, ZoeDepth processes an input resolution of 384x512, VPD manages 480x480, and AiT is designed for 384x512. These figures pale in comparison to the resolutions offered by modern consumer cameras and devices, such as the 45-megapixel Canon EOS R5, widely available 8K televisions, and even mobile devices like the iPhone 15, which carries a 12MP Ultra Wide camera.

In this work, we address the challenge of metric single image depth estimation for high-resolution inputs with an end-to-end tile-based framework. It consists of (1) a Coarse Network that provides a globally scale-aware estimate from the whole (downsampled) image, trading high-frequency details for global consistency, (2) a Fine Network that produces patch-wise depth predictions with rich details, particularly at boundaries and intricate structures, but with scales that may be inconsistent with the actual scene, since each patch covers only a segment of the image and lacks global information, and (3) a Guided Fusion Network with a Global-to-Local (G2L) module that combines the best of both worlds. We also propose a consistency-aware training and inference strategy to ensure patch-wise prediction consistency.
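Below is a minimal sketch of this pipeline, assuming hypothetical `coarse_net`, `fine_net`, and `fusion_net` callables and simple non-overlapping tiles; the actual framework additionally passes guidance features to the fusion network and blends overlapping patches at inference.

```python
import torch
import torch.nn.functional as F

def patchfusion_forward(image, coarse_net, fine_net, fusion_net,
                        patch_size=(540, 960), base_res=(384, 512)):
    """image: (1, 3, H, W) high-resolution input; returns a (1, 1, H, W) depth map."""
    _, _, H, W = image.shape

    # (1) Coarse Network: globally consistent but blurry depth from the
    #     downsampled whole image, upsampled back to full resolution.
    coarse_in = F.interpolate(image, size=base_res, mode='bilinear', align_corners=False)
    coarse_depth = coarse_net(coarse_in)
    coarse_depth = F.interpolate(coarse_depth, size=(H, W), mode='bilinear', align_corners=False)

    # (2) Fine Network on each tile, then (3) fusion of coarse and fine.
    fused = torch.zeros(1, 1, H, W)
    ph, pw = patch_size
    for top in range(0, H, ph):
        for left in range(0, W, pw):
            img_patch = image[..., top:top + ph, left:left + pw]
            coarse_patch = coarse_depth[..., top:top + ph, left:left + pw]
            fine_patch = fine_net(img_patch)          # detailed but scale-inconsistent
            # Guided Fusion Network: globally consistent scale + local detail
            # (G2L guidance features omitted in this sketch).
            fused[..., top:top + ph, left:left + pw] = fusion_net(
                img_patch, coarse_patch, fine_patch)
    return fused
```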

🔥 Method

We propose to adopt a neural network to fuse the coarse and fine depth maps, which outperforms traditional post-optimization strategies. While this could be implemented by simply learning a pix2pix-style U-Net, our key idea is to exploit the multi-scale features from the Coarse Network and Fine Network. We use two main components to achieve this: the Global-to-Local (G2L) module and the Guided Fusion Network.

Global-to-Local Module and Guided Fusion Network

The key insight of G2L is to apply global self-attention to each feature level of the Coarse Network, aggregating the information needed for patch-wise, scale-consistent prediction. We adopt Swin Transformer Layers (STL) to preserve the global context while alleviating GPU memory concerns.
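As a hedged illustration of that idea, the sketch below attends over one level of coarse features and then crops the region aligned with the current patch; a plain `nn.MultiheadAttention` stands in for the Swin Transformer Layers used in the paper, and all names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class G2LLevel(nn.Module):
    """Global attention over one coarse feature level, cropped to the patch."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, coarse_feat, roi):
        """coarse_feat: (B, C, H, W) features from the Coarse Network.
        roi: (top, left, height, width) of the current patch in feature coordinates."""
        B, C, H, W = coarse_feat.shape
        tokens = coarse_feat.flatten(2).transpose(1, 2)       # (B, H*W, C)
        attended, _ = self.attn(tokens, tokens, tokens)       # image-wide context
        tokens = self.norm(tokens + attended)
        feat = tokens.transpose(1, 2).reshape(B, C, H, W)
        top, left, h, w = roi
        return feat[..., top:top + h, left:left + w]          # patch-aligned guidance
```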

The Guided Fusion Network follows the U-Net design. Its input is a concatenation of the cropped original image, the corresponding crop of the coarse depth estimate from the Coarse Network, and the fine depth estimate from the Fine Network. After a lightweight encoder, we inject the guidance features into the skip connections and decoder layers.
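A hedged sketch of that design (channel counts, layer names, and exact injection points are assumptions, not the released architecture): the concatenated inputs pass through a small encoder, and G2L guidance features, given as a pyramid matching the encoder resolutions, are merged into the skip connections with 1x1 convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedFusionUNet(nn.Module):
    def __init__(self, channels=(32, 64, 128), guide_channels=(32, 64, 128)):
        super().__init__()
        self.stem = nn.Conv2d(3 + 1 + 1, channels[0], 3, padding=1)   # image + coarse + fine depth
        self.down = nn.ModuleList(
            nn.Conv2d(cin, cout, 3, stride=2, padding=1)
            for cin, cout in zip(channels[:-1], channels[1:]))
        self.fuse = nn.ModuleList(                                     # guidance injection (1x1)
            nn.Conv2d(c + g, c, 1) for c, g in zip(channels, guide_channels))
        self.up = nn.ModuleList(
            nn.Conv2d(cout + cin, cin, 3, padding=1)
            for cin, cout in zip(channels[:-1], channels[1:]))
        self.head = nn.Conv2d(channels[0], 1, 3, padding=1)

    def forward(self, img_patch, coarse_patch, fine_patch, guidance):
        """guidance: list of G2L features matching the encoder resolutions/channels."""
        x = self.stem(torch.cat([img_patch, coarse_patch, fine_patch], dim=1))
        skips = []
        for i, down in enumerate(self.down):
            skips.append(self.fuse[i](torch.cat([x, guidance[i]], dim=1)))
            x = F.relu(down(x))
        x = self.fuse[-1](torch.cat([x, guidance[-1]], dim=1))
        for skip, up in zip(reversed(skips), reversed(list(self.up))):
            x = F.interpolate(x, size=skip.shape[-2:], mode='bilinear', align_corners=False)
            x = F.relu(up(torch.cat([x, skip], dim=1)))
        return self.head(x)                                            # fused depth patch
```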

Consistency-Aware Training and Inference

While our Guided Fusion Network with G2L makes scale-aware predictions, boundary inconsistencies still exist between patches. Recognizing this gap, we introduce Consistency-Aware Training (CAT) and Inference (CAI) to ensure patch-wise depth prediction consistency. Our methodology is based on the intuitive idea that overlapping regions between cropped patches from the same image should produce consistent feature representations and depth predictions. We therefore impose an \( L_2 \) loss on the overlapping regions of both the extracted image features and the depth predictions. While constraining the depth values is the intuitive part, most of the improvement comes from constraining the features. During inference, as patches are processed, the updated depth is concatenated with the cropped image and coarse depth map to form the input to our Guided Fusion Network. This dynamic updating, coupled with a running mean over overlapping regions, yields a local ensemble that incrementally refines the depth estimates on the fly. The strategy alleviates inconsistency and further boosts prediction accuracy.
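The sketch below shows how such a consistency loss could be computed for two patches that share an overlapping region; the overlap bookkeeping, the equal feature/depth weights, and the assumption that features share the patch resolution are all illustrative.

```python
import torch.nn.functional as F

def consistency_loss(feat_a, depth_a, feat_b, depth_b,
                     overlap_a, overlap_b, w_feat=1.0, w_depth=1.0):
    """feat_*/depth_*: (B, C, H, W) features and (B, 1, H, W) depths predicted
    for two overlapping patches; overlap_* = (top, left, h, w) of the shared
    region inside each patch."""
    ta, la, h, w = overlap_a
    tb, lb, _, _ = overlap_b
    # L2 consistency on features (main source of improvement) and on depth.
    loss_feat = F.mse_loss(feat_a[..., ta:ta + h, la:la + w],
                           feat_b[..., tb:tb + h, lb:lb + w])
    loss_depth = F.mse_loss(depth_a[..., ta:ta + h, la:la + w],
                            depth_b[..., tb:tb + h, lb:lb + w])
    return w_feat * loss_feat + w_depth * loss_depth
```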

🚀 Qualitative Results

Qualitative results on UnrealStereo4K (first two rows) and MVS-Synth (last two rows). Left to Right: Input, BoostingDepth[27], Graph-DGSR[8], PatchFusion (Ours), GT.

🔥 Citation


@article{li2023patchfusion,
  title={PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation}, 
  author={Zhenyu Li and Shariq Farooq Bhat and Peter Wonka},
  year={2023},
  eprint={2312.02284},
  archivePrefix={arXiv},
  primaryClass={cs.CV}}