Dynamic EventNeRF: Reconstructing General Dynamic Scenes from Multi-view RGB and Event Streams

Rudnev, Viktor; Fox, Gereon; Elgharib, Mohamed; Theobalt, Christian; Golyanik, Vladislav

Dynamic EventNeRF: Reconstructing General Dynamic Scenes from Multi-view RGB and Event Streams

Viktor Rudnev^1,2 Gereon Fox^1,2 Mohamed Elgharib¹ Christian Theobalt¹ Vladislav Golyanik¹
¹Max Planck Institute for Informatics, Saarland Informatics Campus ²Saarland University
CVPR 2025 Workshop on Event-based Vision
paper video code dataset

Abstract: Volumetric 3D reconstruction of dynamic scenes with visual sensors is an important problem in computer vision. It is especially challenging in poor lighting and with fast motions. One difficulty is due to the fundamental limitations of RGB cameras: To capture fast motion, the framerate must be increased, which in turn requires more intense lighting. In contrast, event cameras, which record changes in logarithmic pixel brightness asynchronously, are much less dependent on lighting, making them more suitable for recording fast motion. We, hence, propose the first method to reconstruct a scene in 4D from sparse multi-view event streams and sparse RGB frames. We learn a sequence of cross-faded time-conditioned NeRF models, one per short recording segment. The individual segments are supervised with a combination of event- and RGB-based losses and sparse-view regularisation. We assemble a real-world multi-view camera rig with six static event cameras mounted around the object, and record a benchmark multi-view event stream dataset of challenging motions. Our work outperforms RGB-based baselines, producing state-of-the-art results, and opens up the topic of multi-view event-based reconstruction as a new path for fast scene capture beyond RGB cameras.

Overview

Overview of the proposed Dynamic EventNeRF method. We split the entire sequence into short overlapping segments. For each segment, we learn a time-conditioned MLP-based NeRF model. To supervise it, we first sample a random window \([t_0, t_1]\) within the segment and apply a combination of the following losses: 1) Event loss, supervising predicted view differences; 2) Accumulation loss, supervising differences between a reference RGB frame and one of the predicted views; 3) RGB loss, supervising the model with the reference RGB frame, and 4) Sparsity loss, which minimises the number of opaque pixels in each predicted view. For increased stability and fast computation of these losses, we propose Fast Event Accumulation with Decay

NeRF-based methods usually require dozens of training views. However, it is possible to significantly reduce this number. In our work, we use the following combination of sparse-view adaptations, which allows us to produce novel views using as few as only two training views:

Cylinder clipping — only use the area observed by multiple cameras
Sparsity loss — minimise the number of opaque pixels in the view
Coarse-to-fine through PE frequency annealing [1]

[1] Park et al. "Nerfies: Deformable neural radiance fields", ICCV 2021

We use accumulated events on each training iteration. Naive accumulation is too slow and is unstable with longer event streams. To solve both problems, we propose Fast Accumulation with Decay:

Accumulates seconds of events in milliseconds and without unstability
Adds decay factor to prevent accumulation unstability
Precomputes and stores cumulative values, inspired by Fast EDI [1]
Then uses binary search to quickly answer the queries

[1] Lin et al. "Fast Event-based Double Integral for Real-time Robotics", ICRA 2023

Figure: In a small patch of pixels (marked red on the right), we accumulate events for each pixel individually and show the resulting signals. A naive method (red) becomes unstable, as all pixels have completely different values at the end. In contrast, our proposed method with decay (blue) is stable: all pixels keep similar values at any time. Visually, the image with decay (right) has much fewer artefacts too.

Datasets

5 new large and fast motion synthetic scenes
(shown in 3x slow motion for clarity)

5 new scenes with 5 training and 3 validation views
Complex, fast, non-rigid motion with large deformations
Rendered at 1000 FPS in RGB and then converted to events [1]

Over 18 minutes of real multi-view event data

Multi-view event and simultaneous RGB streams (5 FPS)
Total duration of 18 minutes
Selected 16 sequences of 10 participants (5-10s each)
Fast motions in the dark with and without object interaction
Recorded in a very dim environment (150ms exposure time required for RGB streams, despite fully open lens aperture)
To deblur RGB frames, we use events and our new reimplementation of Fast EDI [2]

[1] Rudnev et al. "EventNeRF: Neural Radiance Fields from a Single Colour Event Camera", CVPR 2023
[2] Lin et al. "Fast Event-based Double Integral for Real-time Robotics", ICRA 2023

Completely new and unique rig
6 colour event cameras (iniVation DAVIS 346C)
Room-scale and portable
Hardware-synchronised and calibrated
New custom tooling for recording and processing multi-view data

All baselines based on blurry RGBs fail to recover sharp details as they were designed to work with sharp training data. In contrast, our proposed Dyn-EventNeRF method uses events instead, which allows us to recover sharp details. Methods trained on RGB frames generated from events using E2VID also recover these details. However, there are many artefacts present in the reconstructions, as all input views were reconstructed independently, without regard for multi-view consistency. Our method does integrate multi-view events into one shared geometry, allowing for view-consistent 3D reconstruction and reduced artefacts.