E-3DPSM: A State Machine for Event-based Egocentric 3D Human Pose Estimation





Rethinking event-based egocentric 3D human pose estimation. E-3DPSM models motion as a continuous event-driven state evolution, fusing delta and direct 3D human pose updates, thereby achieving real-time and temporally stable 3D reconstruction and significantly outperforming prior approaches.

Abstract

Event cameras offer multiple advantages in monocular egocentric 3D human pose estimation from head-mounted devices, such as millisecond temporal resolution, high dynamic range, and negligible motion blur. Existing methods effectively leverage these properties but suffer from low 3D estimation accuracy, which is insufficient for many applications (e.g., immersive VR/AR). This stems from designs that are not fully tailored to event streams (e.g., their asynchronous and continuous nature), which leads to high sensitivity to self-occlusions and temporal jitter in the estimates. This paper rethinks the setting and introduces E-3DPSM, an event-driven continuous pose state machine for event-based egocentric 3D human pose estimation. E-3DPSM aligns continuous human motion with fine-grained event dynamics; it evolves latent states and predicts continuous changes in 3D joint positions associated with observed events, which are fused with direct 3D human pose predictions, leading to stable and drift-free final 3D pose reconstructions. E-3DPSM runs in real time at 80 Hz on a single workstation and sets a new state of the art in experiments on two benchmarks, improving accuracy by up to 19% (MPJPE) and temporal stability by up to 2.7x.
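The idea of fusing delta updates with direct pose predictions can be illustrated with a small sketch. This page does not specify how E-3DPSM performs the fusion, so the code below is not the authors' method: it assumes a fixed-weight, complementary-filter-style blend. The function name `fuse_step` and the weight `alpha` are hypothetical. It only shows why combining an integrated delta with an absolute prediction can limit drift: the delta path carries motion forward smoothly, while the direct path pulls the estimate back toward an absolute pose.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]  # one 3D joint position (x, y, z)


def fuse_step(
    prev_pose: List[Joint],
    delta: List[Joint],
    direct: List[Joint],
    alpha: float = 0.8,  # hypothetical fixed weight; a learned fusion would differ
) -> List[Joint]:
    """Blend the integrated delta update with a direct pose prediction.

    alpha weights (prev_pose + delta); (1 - alpha) weights the direct prediction.
    This fixed-weight blend is an illustrative assumption, not the paper's fusion.
    """
    if not (len(prev_pose) == len(delta) == len(direct)):
        raise ValueError("all inputs must contain the same number of joints")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    fused = []
    for p, d, z in zip(prev_pose, delta, direct):
        # Integrate the predicted change onto the previous joint position.
        integrated = tuple(pc + dc for pc, dc in zip(p, d))
        # Pull the integrated estimate toward the absolute (direct) prediction.
        fused.append(
            tuple(alpha * ic + (1.0 - alpha) * zc for ic, zc in zip(integrated, z))
        )
    return fused


# Toy single-joint example with made-up numbers.
pose = [(0.0, 0.0, 0.0)]
delta = [(1.0, 0.0, 0.0)]
direct = [(2.0, 0.0, 0.0)]
fused = fuse_step(pose, delta, direct, alpha=0.5)
```

With `alpha=1.0` the sketch reduces to pure delta integration, which would accumulate any bias in the deltas over time; with `alpha=0.0` it returns the direct prediction unchanged.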


Supplementary Video

Method

Method figure

Qualitative Results


EE3D-R (Real Dataset)

Qualitative comparison with state-of-the-art methods on the EE3D-R dataset. Actions: Crawl and Kick

Qualitative comparison with state-of-the-art methods on the EE3D-R dataset. Actions: Walk and Object Interaction

EE3D-W (In-The-Wild Dataset)

Qualitative comparison with state-of-the-art methods on the EE3D-W dataset. Actions: Walk and Pushup

Qualitative comparison with state-of-the-art methods on the EE3D-W dataset. Actions: Crouch and Kick

Hardware Setup

Our head-mounted device setup. The device uses a single fisheye egocentric event camera for input, an NVIDIA Jetson Orin Nano for onboard processing, and a portable power bank for standalone operation.

Real-time Demo

Scenario: Low light, lower body occlusion, and fast motion.

Scenario: Indoor walking and kicking.

Improvements in Temporal Stability


Our method predicts smoother trajectories for occlusion-prone end-effector joints compared to existing approaches.

Quantitative Results

Quantitative results

Comparison with state-of-the-art methods. E-3DPSM achieves significant improvements in accuracy (MPJPE) and temporal stability.
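For readers unfamiliar with the accuracy metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joint positions. The sketch below shows the standard definition together with the relative-improvement arithmetic behind a statement such as "up to 19%". The helper names and all numbers are hypothetical and are not taken from the paper's tables; evaluation details such as any alignment applied before measuring are not stated on this page and are left out.

```python
import math
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]  # one 3D joint position (x, y, z)


def mpjpe(pred: Sequence[Joint], gt: Sequence[Joint]) -> float:
    """Mean Euclidean distance between corresponding predicted and true joints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Toy two-joint pose with made-up coordinates (units are arbitrary here).
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error = mpjpe(pred, gt)  # joint errors are 0 and 5, so the mean is 2.5

# Made-up errors: dropping from 100.0 to 81.0 is a 19% relative improvement.
gain = relative_improvement(100.0, 81.0)
```

Lower MPJPE is better, so a positive `gain` means the new method has the smaller error.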


Quantitative results

Per-Action Evaluation. Comparison with state-of-the-art methods on EE3D-R and EE3D-W datasets.


Quantitative results

Per-Joint Evaluation. Comparison with state-of-the-art methods on EE3D-R and EE3D-W datasets.

Model Efficiency

Quantitative results

Model efficiency comparison in terms of parameters, FLOPs, GPU memory, and 3D pose update rate in Hz (measured on a single NVIDIA A6000 GPU).

Citation


    @inproceedings{deshmukh2026e3dpsm,
      title = {E-3DPSM: A State Machine for Event-based Egocentric 3D Human Pose Estimation},
      author = {Deshmukh, Mayur and Akada, Hiroyasu and Rhodin, Helge and Theobalt, Christian and Golyanik, Vladislav},
      booktitle = {Computer Vision and Pattern Recognition (CVPR)},
      year = {2026}
    }
    

Acknowledgments

This work was partially supported by the Nakajima Foundation scholarship.