Deep Residual Stereo Reconstruction for Urban City Modeling

Given two (or more) images with overlapping fields of view and known relative pose, dense stereo methods estimate depth maps for each input view and subsequently fuse them into a 3D reconstruction that is photo-consistent with all views. This concept, based purely on image-level correspondences and triangulation, forms the basis for a wide range of commercially available systems, from large-scale topographic mapping to industrial machine vision and robotics. For many practical use cases, solely maximizing photo-consistency between all pixels is insufficient to infer a geometrically accurate, complete, and highly detailed 3D model of the observed scene. Therefore, dense stereo methods typically impose a suitable prior on the 3D scene. In the most general sense, this prior encodes a preference for piece-wise smooth surfaces. However, such a prior is still rather unspecific and lacks detailed knowledge about the structure of the observed scene. In practice, the resulting surface models must thus often be cleaned up in a semi-automatic and time-consuming post-processing step.
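As a concrete illustration of the matching-and-smoothness idea, the toy block matcher below computes a disparity map for a rectified image pair by winner-takes-all over a window-aggregated SAD cost. The window aggregation plays the role of the (very weak) smoothness prior discussed above. All parameters, image sizes, and the synthetic test pair are illustrative; real systems use far more sophisticated cost aggregation.

```python
import numpy as np

def sad_disparity(left, right, max_disp=4, win=3):
    """Winner-takes-all disparity for a rectified stereo pair, using the
    sum of absolute differences (SAD) aggregated over a win x win window.
    The window aggregation is the simplest form of smoothness prior:
    it implicitly assumes the disparity is locally constant."""
    pad = win // 2
    costs = []
    for d in range(max_disp + 1):
        # align the right image under candidate disparity d and compare
        diff = np.abs(left - np.roll(right, d, axis=1))
        # aggregate the per-pixel cost over the window (wrap-around at
        # the image borders is tolerated in this toy version)
        agg = np.zeros_like(diff)
        for dy in range(-pad, pad + 1):
            for dx in range(-pad, pad + 1):
                agg += np.roll(np.roll(diff, dy, axis=0), dx, axis=1)
        costs.append(agg)
    # winner-takes-all: pick the disparity with the lowest window cost
    return np.argmin(np.stack(costs, axis=-1), axis=-1)

# synthetic rectified pair: the right view is the left view shifted by
# exactly 2 pixels, so the true disparity is 2 everywhere
rng = np.random.default_rng(0)
left = rng.normal(size=(20, 40))
right = np.roll(left, -2, axis=1)
disp = sad_disparity(left, right)
```

With the synthetic shift of 2 pixels, the recovered disparity map is constant at 2, which is exactly the coarse, globally correct kind of estimate that classical matchers deliver.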

The need to capture complex, soft prior expectations about the world, which are hard to formulate explicitly, naturally calls for a machine learning approach. Indeed, recent developments show a trend toward substituting classical dense stereo methods with a learning-based approach, i.e., training a deep convolutional neural network that predicts a depth map from a set of input views. We argue that such a purely learning-based approach might be inefficient and needlessly complicated. First, model capacity is wasted on learning basic 3D reconstruction from data, for which conventional stereo matching algorithms provide efficient solutions. Second, conventional stereo methods are very robust in the sense that their outputs are usually correct as a coarse, global estimate of the scene surface, even if impaired by local biases and errors. What classical stereo methods lack is prior knowledge about the observed world that goes beyond the naïve and simplistic assumption of (piece-wise) surface smoothness.

In this project, we thus aim at developing deep neural networks that complement conventional stereo matchers rather than replace them. We use a classical stereo matcher to reconstruct an approximate, coarse depth map (respectively, height map) and train a deep network to upgrade that initial surface estimate. Our primary target domain is urban city modeling from satellite imagery. For our initial experiments, we use panchromatic images acquired over the area of Zurich, Switzerland between 2014 and 2018 with DigitalGlobe’s WorldView-2 satellite. As supervision for training, we use the publicly available 2.5D CAD model of Zurich generated via semi-automatic, aerial photogrammetry. Preliminary results shown below indicate that the deep network indeed learns higher-level priors of local surface shape.
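The refinement idea can be sketched as residual regression: keep the coarse DEM from the stereo matcher and learn only a correction to add on top of it. In the toy sketch below, a linear least-squares predictor on local patches stands in for the deep network, and the constant-bias "matcher error" is a synthetic placeholder; none of this is the actual ResDepth architecture or data, it only demonstrates the residual-learning principle.

```python
import numpy as np

def patches(dem, k=3):
    """Extract the k x k neighbourhood around every pixel (with
    wrap-around at the borders, acceptable for this toy example)."""
    pad = k // 2
    P = np.stack([np.roll(np.roll(dem, dy, 0), dx, 1)
                  for dy in range(-pad, pad + 1)
                  for dx in range(-pad, pad + 1)], axis=-1)
    return P.reshape(-1, k * k)

# --- toy data: ground truth plus a systematic bias of the "matcher" ---
rng = np.random.default_rng(1)
truth = rng.normal(size=(32, 32))        # stand-in for the true DEM
bias = 0.5 * np.ones_like(truth)         # a constant offset, easy to learn
initial = truth + bias                   # coarse DEM from the stereo matcher

# --- "train" a linear residual predictor by least squares ---
X = np.c_[patches(initial), np.ones(initial.size)]   # patch features + bias term
y = (truth - initial).ravel()                        # target: residual correction
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# --- refine: add the predicted residual to the initial estimate ---
refined = initial + (X @ w).reshape(initial.shape)
```

The point of the construction is that the learned model never has to reconstruct depth from scratch; it only corrects the systematic errors of the classical estimate, which is a much easier regression target.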
 

DEM refinement with a deep regression network. (i) learned DEM filter (ResDepth-0), (ii) learned filter with monocular image guidance (ResDepth-mono), (iii) learned filter with stereo guidance (ResDepth-stereo).

Contacts:
Corinne Stucker
Prof. Konrad Schindler
