Hacking Scene Flow 🪓

A dense feature matcher is all you need

Mostafa Mohsen
7 min read · Jul 9, 2024

The goal is to distill the motion problem down to a simpler one. Many motion estimation efforts have been heavily semantic, which creates a big dependency on data diversity (not always easy to get in computer vision), and many low-level motion estimation methods like the SOTA scene-flow models take a long time to train while not being particularly generalizable. Can we leverage some of the recent cutting-edge computer vision models to come up with a faster way to compute scene flow? In this post, we’ll try to “hack” scene flow. This mainly reduces the total training time needed and the number of models needed to run a self-driving perception stack. Spoiler alert: it only takes one model 😎

Computing Scene-Flow from Depth and Optical Flow

The idea is to compute scene flow from depth + OF, rather than directly inferring the 3D motion “end-to-end style”. To do this, we compare two depth maps from frames at timestamps t-1 and t, together with their optical flow map (from t-1 to t).

Each depth map can be thought of as a 3D point cloud, since it encodes the 3D position of each pixel. Therefore, we simply take the difference in 3D position between each pixel in depth map t and its corresponding pixel in depth map t-1, where the correspondence is found using the optical flow.

### compute optical flow
OF = optical_flow_model(rgb_prev, rgb_t)

### compute the correspondence between the two frames using OF
index_correlation = original_index + OF

### compute scene flow
SF = xyz_t - xyz_prev[index_correlation]
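
To make the indexing concrete, here is a minimal numpy sketch following the post’s index_correlation = original_index + OF convention; the array shapes and the rounding/clipping choices are my own assumptions:

import numpy as np

# placeholder inputs: per-pixel 3D point clouds (m) and an optical flow field (px)
H, W = 256, 512
xyz_t = np.random.uniform(-10, 10, (H, W, 3))
xyz_prev = np.random.uniform(-10, 10, (H, W, 3))
OF = np.random.uniform(-5, 5, (H, W, 2))

# pixel grid of the current frame: (x, y) coordinates of every pixel
xs, ys = np.meshgrid(np.arange(W), np.arange(H))

# follow the flow to each pixel's corresponding location in the previous frame
# (nearest-neighbour rounding; bilinear sampling would be smoother)
corr_x = np.clip(np.round(xs + OF[..., 0]).astype(int), 0, W - 1)
corr_y = np.clip(np.round(ys + OF[..., 1]).astype(int), 0, H - 1)

# scene flow: current 3D position minus the corresponding 3D position at t-1
SF = xyz_t - xyz_prev[corr_y, corr_x]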

Since the optical flow is now being used to index the depth maps, it becomes critical that the two are aligned with each other. If the outlines of objects like cars are slightly misaligned due to inference errors, this will cause significant false scene flow predictions.

Here, “measured_OF” refers to the initial optical flow estimate, i.e. the apparent motion of everything relative to the camera.

To achieve consistency between the optical flow and depth maps, it is preferable if the DFM (dense feature matcher) used to compute optical flow and depth is the same model (with the same weights). By giving the DFM left/right image pairs from the same timestamp, we get the disparity between the left and right frames, which represents depth (this should naturally be disparity along only the horizontal axis if the cameras are horizontally aligned). If we pass in same-camera image pairs from consecutive timestamps, we get the 2D disparity over time, which is optical flow.
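
As a sketch of the two uses of the same matcher (the dfm stub, the focal length, and the baseline below are placeholders, not values from this post): a left/right pair at the same timestamp gives horizontal disparity, which converts to depth through the standard rectified-stereo relation depth = focal_length × baseline / disparity, while a same-camera pair across timestamps gives optical flow.

import numpy as np

H, W = 256, 512
fx, baseline = 720.0, 0.54      # placeholder focal length (px) and stereo baseline (m)

def dfm(img_a, img_b):
    # Stand-in for the dense feature matcher: returns an (H, W, 2)
    # displacement map from img_a to img_b.
    return np.random.uniform(-2.0, 2.0, size=img_a.shape[:2] + (2,))

left_rgb_prev, left_rgb_t, right_rgb_t = (np.zeros((H, W, 3)) for _ in range(3))

# left/right pair at the same timestamp -> horizontal disparity -> depth
disparity = dfm(left_rgb_t, right_rgb_t)[..., 0]                    # keep only the x component
depth_t = (fx * baseline) / np.clip(np.abs(disparity), 1e-3, None)  # rectified stereo: Z = f*B/d

# same camera, consecutive timestamps -> optical flow
OF = dfm(left_rgb_prev, left_rgb_t)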

Check out my other article on how you can train a DFM from existing architectures!

So the full code ends up looking like this

### compute depth using the Dense Feature Matcher (left/right pair at time t)
depth_t = DFM(left_rgb_t, right_rgb_t)

### compute optical flow using the Dense Feature Matcher (same camera, t-1 and t)
OF = DFM(left_rgb_prev, left_rgb_t)

### compute the 3D point cloud (shorthand for unprojecting each pixel through the camera intrinsics)
xyz_t = depth_t * intrinsics

### compute the correspondence between the two frames using OF
index_correlation = original_index + OF

### compute scene flow
SF = xyz_t - xyz_prev[index_correlation]

### save the current point cloud for the next loop
xyz_prev = xyz_t
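
The depth_t * intrinsics line above is shorthand; concretely, lifting a depth map to a 3D point cloud means pushing each pixel through the inverse intrinsics and scaling by its depth. A minimal sketch of that lift for a pinhole camera (the intrinsics and depth values below are placeholders, not from this post):

import numpy as np

def unproject(depth, K):
    # Lift an (H, W) depth map to an (H, W, 3) point cloud in camera coordinates.
    H, W = depth.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    pix_homog = np.dstack([xs, ys, np.ones_like(xs)]).astype(float)  # homogeneous pixel coords
    rays = pix_homog @ np.linalg.inv(K).T                            # unit-depth rays per pixel
    return rays * depth[..., None]                                   # scale each ray by its depth

# placeholder intrinsics and depth map, for illustration only
H, W = 256, 512
K = np.array([[720.0, 0.0, W / 2], [0.0, 720.0, H / 2], [0.0, 0.0, 1.0]])
depth_t = np.full((H, W), 10.0)
xyz_t = unproject(depth_t, K)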

Here is an output example of everything put together

Computing Scene Flow W.R.T. the World

This “measured scene flow” is relative to the ego vehicle, meaning it assumes the ego vehicle is stationary and that everything else is moving with respect to it. To get scene flow with respect to the world, we need to subtract the motion of the ego car from the measured_SF. To do this, we compute the induced scene flow by taking as input the current depth map and the IMU transformation (between the current and previous frames), then transforming the point cloud according to the IMU motion. This gives us an induced scene flow, which we can then subtract from the measured scene flow to get the “final scene flow”.

### compute the point cloud from the current depth map
xyz_t = depth_t * intrinsics

### transform the point cloud with the 3D IMU transformation between the two frames
xyz_ego = imu_transformation * xyz_t

### compute induced scene flow by taking the difference in xyz
induced_SF = xyz_ego - xyz_t

### we can also compute the induced optical flow
pos2d_t = current_pixel_positions
pos2d_ego = back_proj(xyz_ego)
induced_OF = pos2d_ego - pos2d_t

### then correct the scene flow
final_SF = measured_SF - induced_SF
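
For concreteness, here is a minimal numpy sketch of the induced-flow computation. The 4×4 transform, the intrinsics, and the point-cloud values are placeholders, and the direction of imu_transformation depends on your IMU convention:

import numpy as np

def transform_points(xyz, T):
    # Apply a 4x4 homogeneous transform to an (H, W, 3) point cloud.
    H, W, _ = xyz.shape
    pts = np.concatenate([xyz.reshape(-1, 3), np.ones((H * W, 1))], axis=1)
    return (pts @ T.T)[:, :3].reshape(H, W, 3)

def project(xyz, K):
    # Pinhole projection of an (H, W, 3) point cloud to (H, W, 2) pixel coordinates.
    uvw = xyz @ K.T
    return uvw[..., :2] / np.clip(uvw[..., 2:3], 1e-6, None)

# placeholder inputs, not from the post: a point cloud from the current depth map,
# a 4x4 IMU transform between the two frames, and the camera intrinsics
H, W = 256, 512
K = np.array([[720.0, 0.0, W / 2], [0.0, 720.0, H / 2], [0.0, 0.0, 1.0]])
imu_transformation = np.eye(4)
imu_transformation[0, 3] = 0.1                            # e.g. a small lateral ego translation (m)
xyz_t = np.random.uniform([-5, -2, 5], [5, 2, 50], size=(H, W, 3))

xyz_ego = transform_points(xyz_t, imu_transformation)     # points moved purely by ego motion
induced_SF = xyz_ego - xyz_t                              # induced 3D motion

pix = np.dstack(np.meshgrid(np.arange(W), np.arange(H)))  # current 2D pixel positions (x, y)
induced_OF = project(xyz_ego, K) - pix                    # induced 2D motion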

The result looks like the following

Quick note on Runtime

Since we are using the same model to estimate depth and optical flow and only computing scene flow after the fact, we can run two instances of the same model in parallel, one for depth and one for optical flow. The rest of the computation is relatively negligible, so the overall runtime is largely determined by the latency of a single DFM instance and your ability to run the two instances in parallel.
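
A toy sketch of that scheduling with Python threads (the dfm stub is a placeholder; with a real GPU model, threads only help if inference releases the GIL, so two CUDA streams or two processes are common alternatives):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

H, W = 256, 512

def dfm(img_a, img_b):
    # Stand-in for one DFM instance; replace with your model's forward pass.
    return np.random.uniform(-2.0, 2.0, size=img_a.shape[:2] + (2,))

left_prev, left_t, right_t = (np.zeros((H, W, 3)) for _ in range(3))

# launch the stereo (depth) and temporal (optical flow) matches concurrently
with ThreadPoolExecutor(max_workers=2) as pool:
    depth_job = pool.submit(dfm, left_t, right_t)    # left/right pair  -> disparity/depth
    flow_job = pool.submit(dfm, left_prev, left_t)   # consecutive pair -> optical flow
    disparity, OF = depth_job.result(), flow_job.result()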

Post Processing Scene-Flow

You will notice there is a trailing edge behind vehicles that pass by; this is due to occlusions. These are areas/pixels that appear in only one of the two frames used to compute the optical flow (and therefore the scene flow). Here is a more detailed illustration of that problem.

In the above illustration, the point (70, 100) in frame t has an optical flow of (-10, 0) (the car moved 10 pixels to the left), so it correlates back to (80, 100) in frame t-1. So far this is fine. But the point (120, 100) in frame t (representing part of the building) would have an optical flow of (0, 0), as it is part of the non-moving building, which corresponds to point (120, 100) in frame t-1, and in that frame that pixel represents part of the car. This results in an error, since the computed correlation does not match the physical scene. It is like saying the pixel moved from the distance of the building in the background to the distance of the car in the foreground, which is an absurdly large and incorrect scene flow value.

To eliminate this, we can compute a mask of only dynamically moving objects and only consider the scene flow that lies on that mask. We’ll call this the dynamic optical flow mask. It would look like this

### compute dynamic OF (the flow left over after removing ego motion)
dynamic_OF = measured_OF - induced_OF

### compute the mask of dynamically moving pixels
thresh = 1.5
mask = zeros()
mask[magnitude(dynamic_OF) > thresh] = 1

### zero out the scene flow everywhere outside the dynamic mask
final_SF[mask == 0] = (0, 0, 0)
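
As a concrete numpy version of the masking step (array shapes are illustrative, and the post’s threshold of 1.5 is assumed to be in pixels):

import numpy as np

# placeholder inputs with illustrative shapes; in the pipeline these come from the steps above
H, W = 256, 512
measured_OF = np.random.uniform(-5, 5, (H, W, 2))
induced_OF = np.random.uniform(-5, 5, (H, W, 2))
measured_SF = np.random.uniform(-1, 1, (H, W, 3))
induced_SF = np.random.uniform(-1, 1, (H, W, 3))

# dynamic optical flow: the 2D motion not explained by ego motion
dynamic_OF = measured_OF - induced_OF

# threshold the per-pixel flow magnitude to get the dynamic mask
thresh = 1.5                                            # in pixels, as in the post
mask = np.linalg.norm(dynamic_OF, axis=-1) > thresh

# keep scene flow only on dynamic pixels; everything else is treated as static
final_SF = measured_SF - induced_SF
final_SF[~mask] = 0.0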

Now we have a clean scene flow measurement W.R.T. the world (“final_SF”), where everything that is moving with respect to the ground is labeled as such in the image, with very minimal occlusion errors. This is reflected by the fact that the whole background goes grey, representing zero 3D motion, except for the moving vehicles in the scene.

Here is a clearer before/after of correcting occlusions

Before
After

As mentioned before, it is preferable if the same DFM is used for both the optical flow and depth predictions. Since we are essentially computing the induced optical flow from the depth map and subtracting it from the measured optical flow to get the dynamic optical flow, it is important that the depth and optical flow maps are consistent with each other. Misaligned edges result in the blending of motion vectors from different objects and corrupt the dynamic flow map with false flows, which then affects the masking out of occlusions in the final scene flow.

Here are some clips of the hacked scene-flow algorithm in action

Conclusions

In this post, we’ve simplified the complex problem of scene flow estimation by leveraging depth and optical flow rather than directly inferring 3D motion end-to-end. This approach reduces the computational complexity and training time typical of state-of-the-art (SOTA) scene-flow models by using a Dense Feature Matcher (DFM) for both depth and optical flow, ensuring consistency and accuracy. Most end-to-end scene flow models lack generalizability, performing well only on specific data distributions, which limits their effectiveness. By contrast, our method takes advantage of the generalizability of depth and optical flow models to compute scene flow; it also streamlines the training process and reduces the dependency on extensive training data. Keeping the optical flow and depth maps aligned at object edges is critical to avoid significant errors. By running two DFM instances in parallel we achieve efficient processing, and by addressing occlusion errors with a dynamic optical flow mask we clean up the final scene flow for an accurate motion representation of dynamic objects relative to the world.

Connect with me on LinkedIn

https://www.linkedin.com/in/mostafa-mohsen/
