Struct2Depth – Predicting object depth in dynamic environments

While recent advances in computer vision are helping robots and autonomous vehicles navigate complex environments effectively, some challenges remain. One major challenge is depth prediction, i.e. the ability of a moving robot to estimate the distance to the objects around it, a requirement for navigating a real-life environment safely. Historically, the most effective way to predict depth was with a stereo setup, which requires at least two cameras. The current ambition is to avoid stereo cameras and predict depth from standard video captured by a single camera, such as the one in a cell phone.

A new paper from Google Brain, struct2depth, presents a novel method to estimate depth, achieving state-of-the-art results with a single monocular camera, comparable to results achieved with stereo cameras. Struct2depth uses unsupervised learning, comparing consecutive video frames and calculating loss based on the difference between the expected next frame and the actual frame observed. Unsupervised learning for depth prediction has been used in the past by Zhou et al and others, achieving top results in several metrics. Struct2depth builds on several concepts first presented in the Zhou paper, while adding several of its own.

The key contributions of struct2depth are its ability to model the 3D motion of moving objects, a task on which previous papers achieved relatively poor results, and its ability to deduce ego-motion, i.e. the motion of the camera capturing the scene. In addition, the struct2depth model transfers effectively between different environments, for instance, training on footage of a busy street and later using the trained model to navigate a room inside a home. Interestingly, the model supports online learning, allowing it to adapt to new environments as it encounters them.

Example of the model in action (Source: struct2depth’s blog)

Background


When predicting depth and ego-motion via unsupervised learning, the algorithm receives as input the current and previous frames and attempts to estimate the next frame. In the 2017 paper Unsupervised Learning of Depth and Ego-Motion from Video (Zhou et al.), this is done using three convolutional neural networks (CNNs) – a depth network, a pose (position and orientation) network, and an explainability network:

  • The depth network generates a visual depth prediction for each object in the scene.
  • The pose network produces an estimate of the camera’s position relative to the observed objects in six degrees of freedom (forward/backward, up/down, left/right, pitch, yaw, roll), allowing for the calculation of the camera’s ego-motion. Both CNNs use an encoder-decoder style design, with the depth CNN returning a depth map image and the pose CNN returning the pose of the camera based on the data in the middle layer of the CNN (see image).
  • The explainability network aims to recognize objects that the depth and pose networks will not explain properly, due to object motion or occlusion/visibility issues, so that they can be disregarded in the analysis. Note: Subsequent studies found the explainability network to be relatively ineffective.
Image: Zhou et al.
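
The way the depth and pose estimates work together can be sketched as follows. This is a minimal NumPy illustration and not the authors' code: given a depth map and a camera motion, each pixel is lifted into 3D, moved, and projected back, which tells the model where that pixel should appear in a neighbouring frame. The function name `project_pixels`, the intrinsics matrix `K`, and the 4x4 rigid transform `T` are assumptions made for this example.

```python
import numpy as np

def project_pixels(depth, K, T):
    """Map every pixel of one frame into a neighbouring frame.

    depth: (H, W) depth per pixel (output of a depth network).
    K:     (3, 3) camera intrinsics, assumed known.
    T:     (4, 4) rigid camera motion between the two frames
           (what a pose network would estimate).
    Returns an (H, W, 2) array of (x, y) coordinates in the other frame.
    """
    h, w = depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    # Homogeneous pixel coordinates, shape (3, H*W).
    pixels = np.stack([xs, ys, np.ones_like(xs)], axis=-1)
    pixels = pixels.reshape(-1, 3).T.astype(float)
    # Back-project: scale each viewing ray by its depth.
    points = (np.linalg.inv(K) @ pixels) * depth.reshape(1, -1)
    # Apply the rigid motion in homogeneous coordinates.
    points_h = np.vstack([points, np.ones((1, points.shape[1]))])
    moved = (T @ points_h)[:3]
    # Project back onto the image plane.
    projected = K @ moved
    coords = projected[:2] / projected[2:3]
    return coords.T.reshape(h, w, 2)
```

Sampling the neighbouring frame at the returned coordinates gives a reconstructed image, and the difference between that reconstruction and the real frame is what drives training.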

In training, the loss function is the linear combination of three components:

  1. Lvs – The difference between the predicted next image and the actual observed image, weighted by explainability: the lower the explainability of a region, the less its error contributes to the loss.
  2. Lsmooth – A smoothness loss which punishes stark shifts in the depth analysis.
  3. Lreg – A regularization loss which punishes the algorithm if it overuses the explainability feature to reduce the Lvs loss.
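
The three components above can be combined roughly as in the sketch below. It is a simplified illustration on single-channel images, not the formulation from the paper: the smoothness term here uses first-order depth differences, and the weights `w_smooth` and `w_reg` are hypothetical values chosen for the example.

```python
import numpy as np

def view_synthesis_loss(predicted, observed, explainability):
    # Per-pixel reconstruction error, down-weighted where explainability is low.
    return float(np.mean(explainability * np.abs(predicted - observed)))

def smoothness_loss(depth):
    # Penalise sharp changes between neighbouring depth values.
    dx = np.abs(np.diff(depth, axis=1))
    dy = np.abs(np.diff(depth, axis=0))
    return float(dx.mean() + dy.mean())

def explainability_regularizer(explainability, eps=1e-7):
    # Cross-entropy against a constant label of 1: pushing explainability
    # towards 0 everywhere (to hide all errors) becomes expensive.
    return float(np.mean(-np.log(np.clip(explainability, eps, 1.0))))

def total_loss(predicted, observed, depth, explainability,
               w_smooth=0.5, w_reg=0.2):
    # Linear combination of the three terms (weights are illustrative).
    return (view_synthesis_loss(predicted, observed, explainability)
            + w_smooth * smoothness_loss(depth)
            + w_reg * explainability_regularizer(explainability))
```

Lowering the explainability map reduces the first term but increases the third, which is the trade-off the regularization loss is meant to enforce.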

In some algorithms published after Zhou, another loss term is added – Lssim, based on the structural similarity (SSIM) of two images (predicted and observed), a common computer vision metric for evaluating the quality of image predictions.
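
To make the SSIM idea concrete, here is a simplified sketch that computes the index from whole-image statistics. Practical implementations evaluate it over small local windows and average the result, so treat this as an illustration of the formula only. The constants `c1` and `c2` are the commonly used stabilisers for images scaled to [0, 1], and turning the index into a loss as `(1 - SSIM) / 2` is one common convention, assumed here.

```python
import numpy as np

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM computed from whole-image means, variances and covariance."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x = np.mean((x - mu_x) ** 2)
    var_y = np.mean((y - mu_y) ** 2)
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

def ssim_loss(predicted, observed):
    # Identical images give SSIM = 1 and therefore a loss of 0.
    return (1.0 - global_ssim(predicted, observed)) / 2.0
```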

The Zhou et al. paper achieved state-of-the-art results and since its publication has been widely cited in the computer vision literature. Despite its effectiveness, the Zhou technique, like many subsequent techniques, assumes that the observed objects are static and is therefore sensitive to fast-moving objects in the camera’s field of view. Struct2depth aims to tackle this exact challenge.

How Struct2depth Works

Struct2depth uses the Zhou depth estimator architecture (see background) while presenting three key additions:

  • A specialized 3D object motion estimator.
  • An object size regularization module.
  • An online refinement method that fine-tunes the model in real-time.

3D Object Motion Estimator
The object motion estimation process consists of three stages:

  1. Using off-the-shelf computer vision techniques, struct2depth recognizes the outline of individual moving objects and masks the area that they cover in the recorded image sequence.
  2. It then computes ego-motion without the distraction of moving objects (which are masked), allowing for accurate ego-motion data.
  3. Finally, using the computed ego-motion, it looks at the masked areas and models the motion and pose of the moving object, producing an object motion estimation for each object in the scene. The object motion estimation is computed using a CNN.
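
The masking in the first two stages can be sketched as below. This is a minimal illustration that assumes the per-object segmentation masks are already available as binary arrays (the article only says they come from off-the-shelf techniques); the function names are hypothetical.

```python
import numpy as np

def static_scene_mask(object_masks):
    """Return 1 where no moving object is present and 0 elsewhere.

    object_masks: (N, H, W) array of binary masks, one per object.
    """
    return 1 - np.clip(object_masks.sum(axis=0), 0, 1)

def split_frame(frame, object_masks):
    """Separate a single-channel frame into background and per-object parts."""
    # Background only: this is what the ego-motion estimate would be based on.
    background = frame * static_scene_mask(object_masks)
    # One masked copy of the frame per object, for the object motion estimator.
    objects = frame[np.newaxis] * object_masks
    return background, objects
```

Applying `split_frame` to every frame in a sequence yields a background-only sequence for ego-motion and one masked sequence per object for the object motion stage.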

Object Size Regularization Module
A common problem with depth analysis is predicting the depth of adjacent objects moving at the same speed as the camera. In such cases, for instance, when a car films a neighboring car, the captured car may look like it isn’t moving at all. Many algorithms fail to handle these cases and mistakenly treat neighboring objects as very distant objects on the horizon. Humans can easily understand that an adjacent car is unlikely to be a huge car on the horizon and is more likely to be a regular-sized car moving at the same speed, because humans have a prior understanding of regular car sizes. Inspired by this intuition, struct2depth learns the typical size of objects from similar objects seen in the training phase and uses it as a prior when estimating their depth. Technically, this works by adding a dedicated loss function LSC to the depth estimation network, thereby punishing predicted object depths that are unlikely given the sizes of similar objects.
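
One way such a size-based term could look is sketched below. It relies on the pinhole-camera relation that an object's depth is roughly the focal length times its real-world height divided by its height in pixels. The function names, the `prior_height` parameter (a fixed number here; in a trained model it would be learned), and the normalisation by mean depth are assumptions for the sketch, not a transcription of LSC from the paper.

```python
import numpy as np

def approx_depth_from_height(focal_y, prior_height, pixel_height):
    # Pinhole relation: taller in the image means closer to the camera.
    return focal_y * prior_height / pixel_height

def size_constraint_loss(depth, mask, focal_y, prior_height):
    """Penalise depths inside `mask` that disagree with the size prior.

    depth: (H, W) predicted depth map.
    mask:  (H, W) binary mask of one object.
    """
    rows = np.flatnonzero(mask.any(axis=1))
    if rows.size == 0:
        return 0.0
    pixel_height = rows[-1] - rows[0] + 1
    expected = approx_depth_from_height(focal_y, prior_height, pixel_height)
    inside = mask.astype(bool)
    # Dividing by the mean depth keeps the term independent of overall scale.
    return float(np.mean(np.abs(depth[inside] - expected)) / depth.mean())
```

With this term, a car that fills many rows of the image but is predicted to sit on the horizon produces a large penalty, which is the failure case described above.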

Online Refinement Method
The researchers found that the struct2depth depth estimation model can be fine-tuned for a specific environment, and achieve improved results, by simply collecting a group of 20 three-frame sequences at test time and feeding them to the neural network for fine-tuning. The same approach can be used to fine-tune the 3D object motion estimators.
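
The bookkeeping behind this kind of refinement might look like the sketch below. It only shows how incoming frames could be grouped into three-frame sequences and handed to a fine-tuning step in groups of 20. The `fine_tune_step` callback stands in for whatever update the real model performs, and the use of overlapping sequences is an assumption.

```python
def three_frame_sequences(frames):
    """Build overlapping (previous, current, next) triplets from a frame list."""
    return [tuple(frames[i:i + 3]) for i in range(len(frames) - 2)]

def refine_online(frames, fine_tune_step, group_size=20):
    """Call `fine_tune_step` once per full group of three-frame sequences.

    Returns the number of fine-tuning updates that were triggered.
    """
    group = []
    updates = 0
    for sequence in three_frame_sequences(frames):
        group.append(sequence)
        if len(group) == group_size:
            # Pass a copy so the callback can keep the group if it wants to.
            fine_tune_step(list(group))
            group.clear()
            updates += 1
    return updates
```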


To summarize the struct2depth structure, the results of the ego-motion estimator and the 3D object motion estimators are fed into the depth estimator, which produces the depth analysis, regularized by object sizes. The depth analysis can be constantly updated via the online refinement method.

The struct2depth flow (Image: Casser et al)


The paper compares struct2depth to state-of-the-art depth recognition technologies using two common datasets – KITTI and Cityscapes.

In KITTI, a driving video dataset whose ground truth is captured with LIDAR sensors, struct2depth achieves an absolute relative error of 0.1087 (lower is better), a substantial improvement over the previous state-of-the-art of 0.131, achieved by Yang et al in 2018.
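
For reference, the absolute relative error is the mean of |predicted − true| / true over the pixels that have a ground-truth depth. A minimal sketch follows. Treating zero as "no ground truth" is an assumption made for the example, reflecting the fact that LIDAR ground truth covers only some pixels.

```python
import numpy as np

def abs_rel_error(predicted, ground_truth):
    """Mean absolute relative depth error over pixels with ground truth."""
    valid = ground_truth > 0  # assumed marker for pixels with a measurement
    diff = np.abs(predicted[valid] - ground_truth[valid])
    return float(np.mean(diff / ground_truth[valid]))
```

Because the error is divided by the true depth, a one-meter mistake on a nearby object counts for more than the same mistake on a distant one.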

Cityscapes is an autonomous driving dataset which includes many dynamic scenes with multiple moving objects, without a depth ground truth. After training struct2depth on Cityscapes, the researchers tested the model on KITTI and surpassed the previous best result on the Cityscapes->KITTI transfer learning, achieved by Godard et al. It’s worth mentioning that the significance of the Cityscapes transfer learning comparison is limited since only Godard et al attempted the same Cityscapes->KITTI transfer learning.

Finally, struct2depth was able to achieve useful results when applying transfer learning from Cityscapes to a dataset of indoor data (the Fetch robot Indoor Navigation dataset), a notable achievement considering the inherent difference between indoor and outdoor environments. In the paper, the Fetch results were not compared quantitatively to other algorithms.

Compute considerations

The model can train in real-time on a GeForce 1080Ti, a standard GPU with 11 GB of memory, making it practical for use in expensive robots and autonomous vehicles but not yet in cheap hardware. It’s also possible to run the model in static inference mode on weaker hardware, while still benefiting from a depth model trained with the struct2depth object motion network.

Implementation details

Struct2depth is implemented in TensorFlow and is open source.


Through the use of deep neural networks, the capabilities of computer vision are constantly advancing, and state-of-the-art techniques are becoming increasingly accessible on consumer hardware. Struct2depth represents a leap forward in the field of depth recognition, an important technology for the development of smart robots, drones, and autonomous vehicles.

Currently, the algorithm still requires relatively sophisticated hardware, but we can expect that, in the near future, improvements in algorithm efficiency and in hardware will allow advanced depth recognition on standard commercial devices, opening a wide range of possible applications.

Thanks to Vincent Casser and Anelia Angelova, two of the paper’s authors, for insights on the workings of struct2depth.
