Abstract: Volumetric 3D reconstruction of dynamic scenes with visual sensors is an important problem in computer vision. It is especially challenging in poor lighting and with fast motions. One difficulty stems from a fundamental limitation of RGB cameras: to capture fast motion, the frame rate must be increased, which in turn requires more intense lighting. In contrast, event cameras, which record changes in logarithmic pixel brightness asynchronously, are much less dependent on lighting, making them more suitable for recording fast motion. We therefore propose the first method to reconstruct a scene in 4D from sparse multi-view event streams and sparse RGB frames. We learn a sequence of cross-faded, time-conditioned NeRF models, one per short recording segment. The individual segments are supervised with a combination of event- and RGB-based losses and sparse-view regularisation. We assemble a real-world multi-view camera rig with six static event cameras mounted around the object and record a benchmark multi-view event stream dataset of challenging motions. Our method outperforms RGB-based baselines, producing state-of-the-art results, and opens up multi-view event-based reconstruction as a new path for fast scene capture beyond RGB cameras.

Overview


Overview of the proposed Dynamic EventNeRF method. We split the entire sequence into short overlapping segments. For each segment, we learn a time-conditioned MLP-based NeRF model. To supervise it, we first sample a random window \([t_0, t_1]\) within the segment and apply a combination of the following losses: 1) Event loss, supervising predicted view differences; 2) Accumulation loss, supervising differences between a reference RGB frame and one of the predicted views; 3) RGB loss, supervising the model with the reference RGB frame; and 4) Sparsity loss, which minimises the number of opaque pixels in each predicted view. For stable and fast computation of these losses, we propose Fast Event Accumulation with Damping.
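The sketch below illustrates how these four losses could be combined for a sampled window \([t_0, t_1]\). It is a minimal illustration rather than the actual implementation: the `model.render` interface (assumed to return log-radiance and per-ray opacity), the pre-accumulated event tensors (see the accumulation sketch further below), the contrast threshold, and the loss weights are all assumptions.

```python
# Minimal sketch (not the authors' code) of combining the four losses for a
# sampled window [t0, t1]. `model.render`, the pre-accumulated event tensors,
# and the loss weights are hypothetical.
import torch

def segment_losses(model, rays, rgb_ref, t_ref, t0, t1,
                   ev_t0_t1, ev_ref_t1,             # accumulated event polarities per ray
                   contrast_thresh=0.25, weights=(1.0, 1.0, 1.0, 1e-3)):
    w_ev, w_acc, w_rgb, w_sparse = weights

    # Render the same rays at both window ends and at the reference frame time.
    log_rgb_t0, _ = model.render(rays, t0)           # hypothetical: log-radiance, opacity
    log_rgb_t1, alpha_t1 = model.render(rays, t1)
    log_rgb_ref, _ = model.render(rays, t_ref)

    # 1) Event loss: the predicted log-brightness change between t0 and t1 should
    #    match the polarity sum scaled by the contrast threshold.
    loss_event = ((log_rgb_t1 - log_rgb_t0) - contrast_thresh * ev_t0_t1).pow(2).mean()

    # 2) Accumulation loss: the reference RGB frame plus accumulated events
    #    should explain the view predicted at t1.
    acc_target = torch.log(rgb_ref + 1e-5) + contrast_thresh * ev_ref_t1
    loss_acc = (log_rgb_t1 - acc_target).pow(2).mean()

    # 3) RGB loss: the prediction at the reference time should match the RGB frame.
    loss_rgb = (torch.exp(log_rgb_ref) - rgb_ref).pow(2).mean()

    # 4) Sparsity loss: penalise opaque rays to suppress floaters.
    loss_sparse = alpha_t1.mean()

    return w_ev * loss_event + w_acc * loss_acc + w_rgb * loss_rgb + w_sparse * loss_sparse
```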
NeRF-based methods usually require dozens of training views. However, it is possible to significantly reduce this number. In our work, we use the following combination of sparse-view adaptations, which allows us to produce novel views from as few as two training views:
  • Cylinder clipping — only use the area observed by multiple cameras
  • Sparsity loss — minimise the number of opaque pixels in the view
  • Coarse-to-fine through PE frequency annealing [1] (sketched below)

[1] Park et al. "Nerfies: Deformable neural radiance fields", ICCV 2021
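As a concrete illustration of the coarse-to-fine schedule, here is a minimal sketch of Nerfies-style frequency annealing [1]; the number of frequencies and the exact schedule used in our method are assumptions.

```python
import torch

def annealed_positional_encoding(x, num_freqs, alpha):
    """Encode x with sin/cos bands, windowing high frequencies early in training.

    `alpha` is ramped from 0 to `num_freqs` over the coarse-to-fine schedule.
    """
    device, dtype = x.device, x.dtype
    freqs = 2.0 ** torch.arange(num_freqs, device=device, dtype=dtype)
    bands = torch.arange(num_freqs, device=device, dtype=dtype)
    # Cosine-easing window per frequency band (Nerfies-style annealing).
    window = 0.5 * (1.0 - torch.cos(torch.pi * torch.clamp(alpha - bands, 0.0, 1.0)))

    xb = x[..., None, :] * freqs[:, None]              # (..., num_freqs, dim)
    enc = torch.cat([torch.sin(xb), torch.cos(xb)], dim=-1)
    enc = enc * window[:, None]                        # damp not-yet-active bands
    return torch.cat([x, enc.flatten(start_dim=-2)], dim=-1)
```

During training, `alpha` is typically ramped linearly from 0 to `num_freqs` over a fixed number of iterations, so low frequencies are fitted first and fine details are added gradually.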
We accumulate events at each training iteration. Naive accumulation is too slow and becomes unstable on longer event streams. To solve both problems, we propose Fast Accumulation with Damping (a sketch follows the figure below):
  • Accumulates seconds of events in milliseconds, without instability
  • Adds a damping factor to prevent the accumulation from drifting
  • Precomputes and stores cumulative values, inspired by Fast EDI [1]
  • Then uses binary search to answer queries quickly

[1] Lin et al. "Fast Event-based Double Integral for Real-time Robotics", ICRA 2023
Figure: In a small neighbourhood of a pixel, we accumulate events individually for each pixel and show the results as traces. The naive method (red) becomes unstable: the pixels end up with completely different values. In contrast, our method (blue) is stable: all pixels keep similar values at all times.
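A minimal per-pixel sketch of the idea is given below. It shows the precomputed cumulative values, the damping, and the binary-search query; the exact damping scheme and the vectorised multi-pixel implementation in the paper may differ.

```python
# Sketch of damped event accumulation with precomputed cumulative sums and
# binary-search queries (our own illustration under assumptions).
import bisect

class DampedEventAccumulator:
    def __init__(self, timestamps, polarities, damping=0.999):
        """timestamps: sorted event times for one pixel; polarities: +1/-1."""
        self.t = list(timestamps)
        self.damping = damping
        # Cumulative damped sums: C[k] = damping * C[k-1] + p[k].
        self.cum = []
        c = 0.0
        for p in polarities:
            c = damping * c + p
            self.cum.append(c)

    def accumulate(self, t0, t1):
        """Damped sum of event polarities with t0 < t <= t1 (O(log n) per query)."""
        k0 = bisect.bisect_right(self.t, t0)   # number of events at or before t0
        k1 = bisect.bisect_right(self.t, t1)   # number of events at or before t1
        if k1 == 0:
            return 0.0
        c1 = self.cum[k1 - 1]
        c0 = self.cum[k0 - 1] if k0 > 0 else 0.0
        # Contributions from before t0 are further damped by the events inside the window.
        return c1 - (self.damping ** (k1 - k0)) * c0
```

In practice, one such cumulative array per pixel is precomputed once; afterwards, any window query \([t_0, t_1]\) costs only two binary searches.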

Datasets

5 new synthetic scenes with large and fast motion
(shown in 3x slow motion for clarity)

  • 5 new scenes with 5 training and 3 validation views
  • Complex, fast, non-rigid motion with large deformations
  • Rendered at 1000 FPS in RGB and then converted to events [1]
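For illustration, a basic threshold-crossing converter in the spirit of such RGB-to-event simulators might look as follows. This is a simplified sketch (per-frame timestamps, no noise model), not necessarily the exact tool used to generate the dataset.

```python
import numpy as np

def frames_to_events(frames, timestamps, contrast_thresh=0.25, eps=1e-3):
    """frames: (N, H, W) linear intensities of the high-FPS render; returns (t, y, x, polarity)."""
    events = []
    log_ref = np.log(frames[0] + eps)                  # last log level that fired per pixel
    for frame, t in zip(frames[1:], timestamps[1:]):
        diff = np.log(frame + eps) - log_ref
        n_fired = np.floor(np.abs(diff) / contrast_thresh).astype(int)
        for y, x in zip(*np.nonzero(n_fired)):
            pol = 1 if diff[y, x] > 0 else -1
            # Emit one event per crossed threshold (all stamped with the frame time;
            # a real simulator would interpolate sub-frame timestamps).
            events.extend((t, y, x, pol) for _ in range(n_fired[y, x]))
            log_ref[y, x] += pol * n_fired[y, x] * contrast_thresh
    return sorted(events)
```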
Over 18 minutes of real multi-view event data
  • Multi-view event and simultaneous RGB streams (5 FPS)
  • Total duration of 18 minutes
  • Selected 16 sequences of 10 participants (5-10s each)
  • Fast motions in the dark with and without object interaction
  • Recorded in a very dim environment (150 ms exposure time required for the RGB streams despite a fully open lens aperture)
  • To deblur the RGB frames, we use the events and our new reimplementation of Fast EDI [2] (a simplified sketch follows the references below)

[1] Rudnev et al. "EventNeRF: Neural Radiance Fields from a Single Colour Event Camera", CVPR 2023
[2] Lin et al. "Fast Event-based Double Integral for Real-time Robotics", ICRA 2023
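A simplified sketch of event-based double-integral (EDI) deblurring is given below, reconstructing the latent frame at the start of the exposure. The actual Fast EDI reimplementation is more involved; the sampling scheme, contrast threshold, and reference time are assumptions here.

```python
import numpy as np

def edi_deblur(blurry, events, t_start, t_end, contrast_thresh=0.25, n_samples=64, eps=1e-6):
    """Recover a sharp latent frame at the exposure start from a blurry frame.

    blurry: (H, W) linear-intensity blurred frame over the exposure [t_start, t_end].
    events: list of (t, y, x, polarity) within the exposure, sorted by time.
    """
    H, W = blurry.shape
    e_cum = np.zeros((H, W))                 # E(t_start -> t): signed event count per pixel
    integral = np.zeros((H, W))              # running sum of exp(c * E) over time samples
    idx = 0
    for t in np.linspace(t_start, t_end, n_samples):
        while idx < len(events) and events[idx][0] <= t:
            _, y, x, p = events[idx]
            e_cum[y, x] += p
            idx += 1
        integral += np.exp(contrast_thresh * e_cum)

    # Double-integral relation: blurry ≈ L(t_start) * mean_t exp(c * E(t_start -> t)),
    # so the latent frame is the blurry frame divided by that temporal mean.
    return blurry / (integral / n_samples + eps)
```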
  • Completely new and unique rig
  • 6 colour event cameras (iniVation DAVIS 346C)
  • Room-scale and portable
  • Hardware-synchronised and calibrated
  • New custom tooling for recording and processing multi-view data
All baselines based on blurry RGB frames fail to recover sharp details, as they were designed for sharp training data. In contrast, our proposed Dynamic EventNeRF method uses events instead, which allows it to recover sharp details. Methods trained on RGB frames generated from events with E2VID also recover these details; however, their reconstructions contain many artefacts, as the input views are reconstructed independently, without regard for multi-view consistency. Our method instead integrates the multi-view events into one shared geometry, allowing view-consistent 3D reconstruction with fewer artefacts.

Blender

Ground Truth Ours E2VID [1]+Dyn-NeRF Blurry RGB+Dyn-NeRF E2VID [1]+FreeNeRF [2] Blurry RGB+FreeNeRF [2]
Novel View 1
Novel View 2
Novel View 3

Dress

Ground Truth Ours E2VID [1]+Dyn-NeRF Blurry RGB+Dyn-NeRF E2VID [1]+FreeNeRF [2] Blurry RGB+FreeNeRF [2]
Novel View 1
Novel View 2
Novel View 3

Spheres

Ground Truth Ours E2VID [1]+Dyn-NeRF Blurry RGB+Dyn-NeRF E2VID [1]+FreeNeRF [2] Blurry RGB+FreeNeRF [2]
Novel View 1
Novel View 2
Novel View 3

Lego

Ground Truth Ours E2VID [1]+Dyn-NeRF Blurry RGB+Dyn-NeRF E2VID [1]+FreeNeRF [2] Blurry RGB+FreeNeRF [2]
Novel View 1
Novel View 2
Novel View 3

Static Lego

Ground Truth Ours E2VID [1]+Dyn-NeRF Blurry RGB+Dyn-NeRF E2VID [1]+FreeNeRF [2] Blurry RGB+FreeNeRF [2]
Novel View 1
Novel View 2
Novel View 3
Training data: Events and only one RGB frame at the start
*Dyn-NeRF — our method with only RGB inputs, no events
[1] Rebecq et al. "High Speed and High Dynamic Range Video with an Event Camera", PAMI 2019
[2] Yang et al. "FreeNeRF: Improving few-shot neural rendering with free frequency regularization", CVPR 2023
As with the synthetic data, our method significantly outperforms the compared methods. With real data, however, the performance of the baselines degrades even further. Due to the presence of noise, the view inconsistencies of E2VID are even more pronounced, leading to severe artefacts in the methods that rely on it. With real data, we use 5 FPS blurry RGB, as that was the maximal frame rate possible during the recordings due to the low lighting. As this is much lower than the 20 FPS used for the synthetic blurry-RGB baseline, the reconstructed motion is only barely recognisable.

Dancing (6x slow motion)

Ground Truth Ours E2VID [1]+Dyn-NeRF Blurry RGB+Dyn-NeRF E2VID [1]+FreeNeRF [2]
Novel View

Bucket (6x slow motion)

Ground Truth Ours E2VID [1]+Dyn-NeRF Blurry RGB+Dyn-NeRF E2VID [1]+FreeNeRF [2]
Novel View

Towel Tricks (6x slow motion)

Ground Truth Ours E2VID [1]+Dyn-NeRF Blurry RGB+Dyn-NeRF E2VID [1]+FreeNeRF [2]
Novel View
Training data: Events and 5 FPS RGB frames
*Dyn-NeRF — our method with only RGB inputs, no events
[1] Rebecq et al. "High Speed and High Dynamic Range Video with an Event Camera", PAMI 2019
[2] Yang et al. "FreeNeRF: Improving few-shot neural rendering with free frequency regularization", CVPR 2023
We ablate different parts of our method and report the results. In particular, we compare different choices for the core model: MLP ("Full Model"), NGP [1], TensoRF-CP [2], and HexPlane [3]. NGP fails to handle the sparse-view setting despite using the same regularisation as the MLP. TensoRF-CP and HexPlane are both grid-based: individual timestamps are modelled separately, so information from one timestamp is not propagated to others, i.e. with 512 temporal grid cells, each of the 512 volumes has to be supervised individually, leading to blurriness and artefacts. Our full model with a temporally-conditioned MLP, on the other hand, explicitly controls how much information is shared between timestamps through the positional encoding of time: supervising one timestamp also affects others in its vicinity, making better use of the training data. Disabling individual parts of the method reduces prediction quality and sharpness; disabling the accumulation loss or cylindrical clipping even causes the model to diverge when trained on real data.
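To make the temporal conditioning concrete, here is a minimal sketch of a time-conditioned NeRF MLP. The layer sizes and the numbers of spatial and temporal frequencies are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs):
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device, dtype=x.dtype)
    xb = x[..., None, :] * freqs[:, None]
    return torch.cat([torch.sin(xb), torch.cos(xb)], dim=-1).flatten(start_dim=-2)

class TimeConditionedNeRF(nn.Module):
    def __init__(self, pos_freqs=10, time_freqs=4, hidden=256):
        super().__init__()
        self.pos_freqs, self.time_freqs = pos_freqs, time_freqs
        in_dim = 3 * 2 * pos_freqs + 1 * 2 * time_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                      # RGB + density
        )

    def forward(self, xyz, t):
        # A low number of temporal frequencies keeps nearby timestamps entangled,
        # so supervision at one time also constrains its neighbourhood.
        feat = torch.cat([
            positional_encoding(xyz, self.pos_freqs),
            positional_encoding(t[..., None], self.time_freqs),
        ], dim=-1)
        return self.mlp(feat)
```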

Synthetic data ablations

Ground Truth Full Model NGP TensoRF-CP HexPlane W/o Event Loss
Novel View 1
Novel View 2
Novel View 3
W/o RGB Loss W/o Acc. Loss W/o Sparsity W/o Clipping
Training data: Events and only one RGB frame at the start. Bold value is the default

Real data ablations

Full Model TensoRF-CP W/o Event Loss W/o Acc. Loss W/o Clipping W/o Damping NGP HexPlane W/o RGB Loss W/o Sparsity Loss W/o Multi-Segment  
Training data: Events and 5 FPS RGB frames. Bold value is the default
[1] Müller et al. "Instant neural graphics primitives with a multiresolution hash encoding", SIGGRAPH 2022
[2] Chen et al. "TensoRF: Tensorial radiance fields", ECCV 2022
[3] Cao et al. "HexPlane: A fast representation for dynamic scenes", CVPR 2023
We ablate the frame rate of the supporting RGB frames and the number of training views. There is only a minimal difference between using a single RGB frame for reconstruction (0.5 FPS) and using 100 FPS RGB inputs. This indicates that our method depends little on the RGB inputs and relies mostly on the event information.

Ablations on FPS of supporting RGB frames

1 RGB 1 FPS 5 FPS 10 FPS 50 FPS 100 FPS Ground Truth
Training data: Events and 0.5–100 FPS RGB frames. Bold value is the default
We also assess the importance of the number of input views. We find that increasing the number of views does improve the quality. This is a strong indication that multi-view event camera setups can be worth the investment.

Ablations on view count

2 Views 3 Views 4 Views 5 Views Ground Truth
Training data: Events and 5 FPS RGB frames. Bold value is the default
We show results of the multi-segment model trained on the long sequences (5–10 s). Each sub-model is trained independently on a short 1 s segment. We overlap the segments and cross-fade between them, so the combined reconstruction is consistent, with no sudden jumps between the parts (a blending sketch is given below). "Sword" shows the reconstruction of a thin, texture-less poster roll handled by an actor, and "Towel Tricks" shows that our method can handle large, fast motion of a texture-less towel despite using only 5 training views.
Sword: Novel view and 6x slow motion
Towel Tricks: Novel view and 3x slow motion
Training data: Events and 5 FPS RGB frames
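A minimal sketch of cross-fading two neighbouring segment models in their overlap region is shown below; the `render` interface, the linear blending weights, and the overlap length are assumptions.

```python
import torch

def render_crossfaded(model_a, model_b, rays, t, seg_a_end, overlap):
    """Blend segment A (ending at seg_a_end) into segment B over `overlap` seconds."""
    # Linear ramp from 1 (pure A) to 0 (pure B) across the overlap window.
    w_a = torch.clamp((seg_a_end - t) / overlap, 0.0, 1.0)
    rgb_a, alpha_a = model_a.render(rays, t)           # hypothetical render interface
    rgb_b, alpha_b = model_b.render(rays, t)
    rgb = w_a * rgb_a + (1.0 - w_a) * rgb_b
    alpha = w_a * alpha_a + (1.0 - w_a) * alpha_b
    return rgb, alpha
```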
We show reconstruction results of our single-segment method on various selected short sequences (0.5–1 s). Note that our method can also reconstruct the shadows cast on the ground by the subject, as it implicitly performs background segmentation in a self-supervised manner. However, some shadows lie beyond the clipped scene volume, which is likely the reason behind the dark floating artefacts. "Ball", "Bucket", "Guitar", and "Box" show that our method handles small and thin objects well. "Bucket" shows that we can reconstruct black objects even when recorded in the dark. "Towel-A" and "Towel Tricks" show performance under complex, fast deformations of a texture-less towel. "Jump", "Guitar", and "Dancing" show how our method handles large motions of the human body.
Ball: Novel view and 6x slow motion
Bucket: Novel view and 6x slow motion
Jump: Novel view and 6x slow motion
Towel-A: Novel view and 6x slow motion
Guitar: Novel view and 6x slow motion
Towel tricks: Novel view and 6x slow motion
Dancing: Novel view and 6x slow motion
Box: Novel view and 6x slow motion
Training data: Events and 5 FPS RGB frames
@article{rudnev2024dynamiceventnerf,
  title={Dynamic EventNeRF: Reconstructing General Dynamic Scenes using Multi-View Event Streams},
  author={Rudnev, Viktor and Fox, Gereon and Elgharib, Mohamed and Theobalt, Christian and Golyanik, Vladislav},
  journal={arXiv preprint arXiv:2412.06770},
  year={2024}
}
For questions or clarifications, please get in touch with:
Viktor Rudnev
vrudnev@mpi-inf.mpg.de
Vladislav Golyanik
golyanik@mpi-inf.mpg.de