Beating State of the Art Optical Flow with a Stereo Depth Model

Training a dense feature matcher

Mostafa Mohsen
7 min read · Jul 11, 2024

In the ever-evolving field of computer vision, the quest for achieving state-of-the-art metrics is relentless. One of the most promising advancements in recent times has been depth models that can actually generalize. Models like CREStereo¹, a state-of-the-art stereo matching model, are now demonstrating strong performance on datasets with vastly different distributions from the ones they were trained on. This progress is largely enabled by the fact that depth (stereo) is a low-level perception task. Essentially, depth estimation relies more on physical principles than on semantics. As long as the environment adheres to the laws of physics (e.g., left and right cameras exhibiting disparity), the same fundamental rules apply across all depth datasets.

Given this progress, an intriguing question arises: can we harness the generalizability of depth estimation to improve optical flow, a task traditionally known for its challenges in generalization? By examining both tasks, we observe a common underlying theme: feature matching. Depth estimation involves matching features predominantly in the horizontal direction, while optical flow requires feature matching in both horizontal and vertical directions.

Consider a highly generalizable depth model like CREStereo¹. If we extend its capabilities to match features in the vertical axis as well as the horizontal, we can develop a versatile solution that estimates either depth or optical flow using the same set of weights. This dual-purpose model would adapt its functionality based on the input images: providing depth from stereo pairs and optical flow from temporal pairs. This innovative approach not only simplifies the model architecture but also leverages the inherent strengths of depth models to improve optical flow estimation.

Moreover, this approach opens up new possibilities for training data. By using depth data to train for optical flow and vice versa, we can now access a broader range of public datasets, greatly diversifying our training data. This diversification enhances the model’s robustness and generalization capabilities, ensuring better performance across various scenarios and environments. This union between depth and optical flow not only boosts the performance of the model but also demonstrates the power of leveraging cross-task data for improved machine learning outcomes.

Extending Stereo Depth Models for Optical Flow Estimation

Traditionally, a stereo depth model outputs a single-channel tensor of dimensions (input_image_height, input_image_width), where each element represents the disparity in the horizontal (u) axis of the corresponding pixel. To extend this model to handle both horizontal and vertical (v) axes, it needs to output a tensor of dimensions (input_image_height, input_image_width, 2), introducing an additional channel.
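
To make the change concrete, here is a minimal PyTorch-style sketch (not CREStereo’s actual code; the class name, layer sizes, and channel counts are illustrative assumptions) of a prediction head whose output width is the only thing that changes between the two tasks:

```python
import torch
import torch.nn as nn

class MatchHead(nn.Module):
    """Hypothetical prediction head: out_channels=1 recovers the usual
    stereo disparity output, out_channels=2 gives a (u, v) flow field."""

    def __init__(self, in_channels: int = 128, out_channels: int = 2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, out_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (B, C, H, W) correlation/context features from the backbone
        # returns:  (B, out_channels, H, W); channel-first rather than the
        #           (H, W, 2) layout described in the text, per PyTorch convention
        return self.conv2(self.relu(self.conv1(features)))

head = MatchHead(in_channels=128, out_channels=2)
flow = head(torch.randn(1, 128, 64, 64))  # -> shape (1, 2, 64, 64)
```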

In training, we can adapt the existing loss function to apply to both channels instead of just the horizontal one. Stereo depth estimation involves pixel matching to find the horizontal disparity between a given image pair, essentially performing 1D feature matching. Optical flow, however, requires finding the disparity in both dimensions, turning the task into 2D feature matching. Thus, the loss function can be applied similarly, but with the extended 2D output.
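
As an illustration, the sketch below shows one plausible form of such a loss: a masked L1 term that sums over however many displacement channels the model predicts. This is an assumed stand-in for the general shape of the objective, not CREStereo’s exact training loss:

```python
import torch

def matching_loss(pred: torch.Tensor,
                  target: torch.Tensor,
                  valid: torch.Tensor) -> torch.Tensor:
    """Masked L1 loss over all displacement channels.

    pred, target: (B, C, H, W), with C=1 for disparity or C=2 for flow.
    valid:        (B, H, W) mask of pixels that have usable ground truth.
    """
    per_pixel = (pred - target).abs().sum(dim=1)   # sum error over channels
    per_pixel = per_pixel * valid.float()          # ignore invalid pixels
    return per_pixel.sum() / valid.float().sum().clamp(min=1.0)
```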

Testing CREStereo’s Generalizability to Optical Flow Data

In this article, we will explore how well CREStereo’s generalizability transfers to optical flow estimation. After training the extended model, which we call “Fuji,” we will compare its performance against the state-of-the-art optical flow model, FlowFormer⁷.

For training Fuji, we utilized a combination of datasets including FlyingThings3D², FlyingChairs³, KITTI⁴, Sintel⁵, and HD1K⁶, for approximately 1.7 million iterations at a batch size of 7 on a single Nvidia H100 GPU. To evaluate its performance, we tested both models on the Monkaa² dataset, which neither model was trained on, making it an ideal dataset for comparison.

Results

The results are super promising. Fuji surpasses FlowFormer⁷ on all metrics: it achieves a lower end-point error (EPE) and a higher percentage of pixels within every accuracy threshold. This demonstrates Fuji’s superior performance and highlights the potential of leveraging generalizable depth models for optical flow estimation.
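
For reference, these are the standard definitions behind the numbers above, written as a small sketch (it assumes predictions and ground truth as 2-channel tensors; the 1/3/5-pixel thresholds are the commonly reported ones):

```python
import torch

def flow_metrics(pred: torch.Tensor, gt: torch.Tensor) -> dict:
    """End-point error and n-pixel accuracy for a dense flow prediction.

    pred, gt: (B, 2, H, W) flow fields, in pixels.
    """
    epe = torch.norm(pred - gt, dim=1)              # (B, H, W) per-pixel error
    metrics = {"epe": epe.mean().item()}
    for px in (1, 3, 5):
        # fraction of pixels whose end-point error is below px pixels
        metrics[f"{px}px"] = (epe < px).float().mean().item()
    return metrics
```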

Fuji | Ground Truth (GT) / FlowFormer | Ground Truth (GT)

Testing Against Real-World Data

When we visually assess real-world data alongside past state-of-the-art models, a consistent trend emerges: our model, Fuji, produces crisper, more stable object edges than the others, which is crucial for our optical flow applications.

Toronto sample

For the sake of testing against a state-of-the-art optical flow model, Fuji was trained using datasets tailored for optical flow, identical to those used for FlowFormer. However, this model can seamlessly incorporate both depth and optical flow data during training, allowing it to adapt its functionality, as previously discussed. This transformation effectively positions the model as a versatile Dense Feature Matcher (DFM). We have already demonstrated the feasibility of training a depth network using optical flow data. While further validation of DFMs is necessary, surpassing a state-of-the-art optical flow model with a depth-based architecture already provides compelling evidence of its potential and merits further exploration.
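
One simple way to fold stereo data into the same 2-channel output (a sketch of the idea, not necessarily the exact pipeline used here) is to treat a rectified stereo pair as a flow pair whose vertical component is zero:

```python
import torch

def disparity_to_flow_target(disparity: torch.Tensor) -> torch.Tensor:
    """Convert a (B, 1, H, W) disparity map into a (B, 2, H, W) target.

    For a rectified left->right pair the matching pixel lies on the same
    row, so the vertical channel is all zeros; the horizontal channel is
    the negated disparity (assuming the left image is the reference --
    flip the sign for the opposite convention).
    """
    u = -disparity                      # horizontal displacement, in pixels
    v = torch.zeros_like(disparity)     # rectified stereo: no vertical motion
    return torch.cat([u, v], dim=1)
```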

To see some downstream applications of DFMs, check out my other blog post on how we can apply a DFM to solve scene flow — Hacking Scene Flow 🪓

A Note on Metrics

Metrics can sometimes be misleading. To illustrate this, let’s examine a specific frame from the Monkaa² dataset. In this frame, there is very little motion in the background, but the foreground contains numerous small, fast-moving objects (in this case, rocks falling from the sky).

Fuji | Ground Truth (GT) / FlowFormer | Ground Truth (GT)

When comparing our model, Fuji, to FlowFormer⁷ on this frame, FlowFormer⁷ fails to accurately capture the fast-moving objects, resulting in significant errors. Despite this, FlowFormer⁷ might still show better metrics for 1-pixel and 3-pixel accuracy. This is because the flow image is predominantly composed of low-flow values from the background, which FlowFormer⁷ may capture more precisely. The high-flow objects, which are more critical for accurate optical flow, occupy a smaller proportion of the image. Thus, while FlowFormer⁷ may excel in metrics where low-flow accuracy dominates, Fuji demonstrates higher accuracy for larger movements, evidenced by its superior performance in capturing the falling rocks. Also, note that Fuji still has a better (lower) overall EPE.
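
A quick way to expose this effect, offered here as a diagnostic sketch rather than part of either model’s official evaluation, is to bin per-pixel EPE by ground-truth flow magnitude so the fast-moving pixels are reported separately from the dominant low-flow background:

```python
import torch

def epe_by_magnitude(pred: torch.Tensor, gt: torch.Tensor,
                     bins=(0.0, 5.0, 20.0, float("inf"))) -> dict:
    """Mean EPE reported separately for slow, medium, and fast pixels.

    pred, gt: (B, 2, H, W) flow fields, in pixels. The bin edges are
    arbitrary and chosen only for illustration.
    """
    epe = torch.norm(pred - gt, dim=1)   # (B, H, W) per-pixel error
    mag = torch.norm(gt, dim=1)          # ground-truth flow magnitude
    out = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (mag >= lo) & (mag < hi)
        if mask.any():
            out[f"epe[{lo}, {hi})"] = epe[mask].mean().item()
    return out
```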

This example highlights the importance of a comprehensive evaluation that goes beyond standard metrics. While FlowFormer⁷ might perform well on metrics like 1-pixel accuracy, Fuji’s overall better end-point error (EPE) and its ability to accurately capture significant motion make it a more reliable model for real-world applications where capturing the details of fast-moving objects is critical.

Connect with me on LinkedIn!

https://www.linkedin.com/in/mostafa-mohsen/

[1] Li, J., Wang, P., Xiong, P., Cai, T., Yan, Z., Yang, L., Liu, J., Fan, H., & Liu, S. (2022). Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation. In Proc. of CVPR.

[2] Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., & Brox, T. (2016). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1512.02134, http://lmb.informatik.uni-freiburg.de/Publications/2016/MIFDB16

[3] Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., & Brox, T. (2015). FlowNet: Learning optical flow with convolutional networks. In Proc. of IEEE International Conference on Computer Vision (ICCV) http://lmb.informatik.uni-freiburg.de/Publications/2015/DFIB15

[4] Menze, M., Heipke, C., & Geiger, A. (2015). Joint 3D estimation of vehicles and scene flow. In ISPRS Workshop on Image Sequence Analysis (ISA) https://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=flow

[5] Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J. (2012). A Naturalistic Open Source Movie for Optical Flow Evaluation. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds) Computer Vision — ECCV 2012. ECCV 2012. Lecture Notes in Computer Science, vol 7577. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33783-3_44

[6] Kondermann, D., Nair, R., Honauer, K., Krispin, K., Andrulis, J., Brock, A., Güssefeld, B., Rahimimoghaddam, M., Hofmann, S., Brenner, C., & Jähne, B. (2016). The HCI benchmark suite: Stereo and flow ground truth with uncertainties for urban autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 19–28).

[7] Huang, Z., Shi, X., Zhang, C., Wang, Q., Cheung, K. C., Qin, H., Dai, J., & Li, H. (2022). FlowFormer: A Transformer Architecture for Optical Flow. ECCV https://drinkingcoder.github.io/publication/flowformer/
