Monocular Depth Estimation

Introduction

The estimation and extraction of depth information from images and videos is one of the primary and essential tasks in computer vision. Depth information can also be successfully integrated with RGB data to obtain notable improvements in other challenging tasks, such as face and object detection, semantic segmentation, visual SLAM, and many others. Moreover, several applications, such as autonomous systems, robotic manipulators, and augmented reality algorithms, usually rely on stereo or depth cameras to accomplish their tasks. Depth estimation techniques can be divided into two major groups: active depth sensing, which exploits laser-based and RGB-D cameras, and passive depth sensing, such as geometry-based methods and deep learning-based techniques. The latter methodologies can estimate accurate dense depth maps from a single image in an end-to-end fashion and potentially overcome the limitations of active depth sensing, such as unfilled depth maps and limited depth ranges.
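
To make the end-to-end idea concrete, the following is a minimal PyTorch sketch of an encoder-decoder that maps a single RGB image to a dense depth map; the architecture and the name TinyDepthNet are hypothetical illustrations, not any specific published model.

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Didactic encoder-decoder mapping one RGB image to a dense depth map.
    A hypothetical sketch, not any specific published architecture."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(  # downsample by 4x
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(  # upsample back to the input resolution
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))  # (B, 1, H, W) depth map

net = TinyDepthNet()
depth = net(torch.rand(1, 3, 240, 320))  # single RGB image in, dense depth out
```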

How do neural networks learn depth?

Here, we can observe a graphical representation of the training progress, i.e., how the model learns the depth data distribution.
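
As a sketch of how such learning proceeds, one optimization step could look as follows, assuming the TinyDepthNet above, random tensors standing in for a batch of RGB/depth pairs, and a plain L1 loss in place of the scale-invariant or BerHu losses common in the depth-estimation literature:

```python
import torch
import torch.nn.functional as F

# One illustrative optimization step for the TinyDepthNet sketched above.
# A plain L1 loss stands in for the scale-invariant or BerHu losses that
# depth-estimation works commonly use; the data below is random.
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

rgb = torch.rand(8, 3, 240, 320)       # hypothetical batch of RGB frames
gt_depth = torch.rand(8, 1, 240, 320)  # aligned ground-truth depth maps

pred = net(rgb)
loss = F.l1_loss(pred, gt_depth)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```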

The color bar on the left side of each depth map, which maps color values to per-pixel distances in centimeters, highlights how the depth distribution fluctuates during training.

[Video: evolution of the predicted depth maps during training]
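
Colorized depth maps with a left-hand color bar in centimeters can be rendered, for instance, with matplotlib; the snippet below is a minimal sketch that uses random data in place of real predictions:

```python
import matplotlib.pyplot as plt
import numpy as np

# Random values stand in for a predicted per-pixel depth map in centimeters.
depth_cm = np.random.uniform(50, 500, size=(240, 320))

fig, ax = plt.subplots()
im = ax.imshow(depth_cm, cmap="plasma")
fig.colorbar(im, ax=ax, location="left", label="distance (cm)")  # color bar on the left
ax.set_axis_off()
plt.show()
```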

Lightweight DL models for mobile devices

The video shows an example of a real-world application in which our model predicts depth in real time on an edge (smartphone) device. The model, named SPEED, has been specifically designed to optimize inference speed while maximizing estimation accuracy: SPEED runs at a stable 30+ frames per second on the reference device.
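
SPEED's full architecture is described in the paper referenced below; as a generic illustration of the depthwise-separable convolutions on which such lightweight encoders typically build, here is a hedged sketch (the block is not taken from the paper):

```python
import torch.nn as nn

class SeparableConv(nn.Module):
    """Depthwise-separable convolution: a per-channel depthwise conv followed
    by a 1x1 pointwise conv. Generic sketch of the building block; SPEED's
    exact layer configuration is defined in the paper referenced below."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```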

The video, captured in our lab, displays the collected RGB images at the top and the corresponding grayscale depth maps at the bottom, where pixel intensity increases with the estimated distance. Moreover, we can observe SPEED's real-time generalization capabilities over common scenarios.

The video also illustrates how the model handles the limited hardware resources of a mobile device, the model conversion and quantization steps involved, and its domain adaptation capabilities.
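
Conversion for on-device inference can take several routes; as one example, the sketch below uses PyTorch Mobile (TorchScript tracing plus mobile graph optimization). The deployment pipeline actually used for SPEED may differ.

```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

# 'net' is a trained depth model (e.g., the sketch above); names are illustrative.
net.eval()
example = torch.rand(1, 3, 240, 320)  # RGB input at the mobile inference resolution

# Trace to TorchScript, apply mobile-specific graph optimizations, and save
# the model for the PyTorch lite interpreter used on Android/iOS devices.
scripted = torch.jit.trace(net, example)
scripted = optimize_for_mobile(scripted)
scripted._save_for_lite_interpreter("depth_mobile.ptl")
```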

L. Papa, E. Alati, P. Russo and I. Amerini, “SPEED: Separable Pyramidal pooling EncodEr-Decoder for real-time Monocular Depth Estimation on low-resource settings”, IEEE Access, 2022.

L. Papa, P. Russo and I. Amerini, “Real-time monocular depth estimation on embedded devices: challenges and performances in terrestrial and underwater scenarios”, 2022 IEEE International Workshop on Metrology for the Sea.

L. Maiano, L. Papa, K. Vocaj and I. Amerini, “DepthFake: a depth-based strategy for detecting Deepfake videos”, AI4MFDD workshop, ICPR 2022.