Main Video

Abstract

Egocentric 3D human pose estimation has been actively studied using cameras installed in front of a head-mounted device (HMD). While frontal placement is optimal, and the only viable option, for some tasks such as hand tracking, it remains unclear whether the same holds for full-body tracking, owing to self-occlusion and limited field-of-view coverage. Notably, even state-of-the-art methods often fail to estimate accurate 3D poses in many scenarios, such as when HMD users tilt their heads upward, a common motion in human activities. A key limitation of existing HMD designs is that they neglect the back of the body, despite its potential to provide crucial 3D reconstruction cues. Hence, this paper investigates the usefulness of rear cameras in HMD design for full-body tracking. We also show that simply adding rear views to the frontal inputs is suboptimal for existing methods, as they rely on individual 2D joint detectors without effective multi-view integration. To address this issue, we propose a new transformer-based method that refines 2D joint heatmap estimation with multi-view information and heatmap uncertainty, thereby improving 3D pose tracking. Moreover, we introduce two new large-scale datasets, Ego4View-Syn and Ego4View-RW, for rear-view evaluation. Our experiments show that camera configurations including rear views provide superior support for 3D pose tracking compared to frontal-only placements. The proposed method achieves a significant improvement over the current state of the art (>10% on MPJPE). We will release the source code, trained models, and new datasets.

Method

Overview of our framework. Given front and rear views, we first use a 2D joint estimator to obtain 2D joint heatmaps (Sec. 4.1). We also extract 2D joint positions from the heatmaps as anchors on the corresponding heatmap features. Next, our refinement module refines the heatmap estimation for each view (Sec. 4.2); here, we show an example of refining the heatmap for the front-left view. The features around the anchors interact with view-specific joint queries in our refinement module to generate multi-view-aware offset features. To better capture the initial heatmap estimation state, we enhance the joint queries with embeddings of the initial heatmap and the RGB input (Sec. 4.3). Furthermore, we utilize heatmap uncertainty to explicitly guide the refinement module to prioritize heatmap features with higher confidence (Sec. 4.4). The offset features are then added to the initial heatmap features to obtain refined features and heatmaps. We repeat this refinement process for all views, and the resulting refined features and heatmaps can be used with existing 2D-to-3D lifting modules to estimate a 3D pose (Sec. 4.5).
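To make the data flow concrete, below is a minimal PyTorch-style sketch of one refinement step. The module and argument names (HeatmapRefinementSketch, anchor_feats, heat_state, etc.), the tensor shapes, and the realization of the uncertainty guidance as an additive attention bias are our own illustrative assumptions, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn


class HeatmapRefinementSketch(nn.Module):
    """Illustrative multi-view heatmap refinement step (shapes and names are assumptions)."""

    def __init__(self, num_views=4, num_joints=15, feat_dim=256, num_heads=8):
        super().__init__()
        # One learnable query per joint and per view ("view-specific joint queries").
        self.joint_queries = nn.Parameter(torch.randn(num_views, num_joints, feat_dim))
        # Embeddings that inject the initial heatmap state and the RGB input into the queries.
        self.heatmap_embed = nn.Linear(feat_dim, feat_dim)
        self.rgb_embed = nn.Linear(feat_dim, feat_dim)
        # Cross-attention between the joint queries and anchor features from all views.
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # Maps the attended queries to per-joint offset features.
        self.to_offset = nn.Linear(feat_dim, feat_dim)

    def forward(self, view_idx, anchor_feats, uncertainty, heat_state, rgb_state, init_feats):
        """
        view_idx:     index of the view whose heatmap is being refined
        anchor_feats: (B, V*J, C) features sampled around the 2D joint anchors of all views
        uncertainty:  (B, V*J)    per-anchor heatmap uncertainty (higher = less confident)
        heat_state:   (B, J, C)   embedding of the initial heatmap estimate for this view
        rgb_state:    (B, J, C)   embedding of the RGB input at this view's anchors
        init_feats:   (B, J, C)   initial heatmap features of the view being refined
        """
        B, J, _ = init_feats.shape
        # Enhance the view-specific joint queries with the initial-estimation state.
        q = self.joint_queries[view_idx].unsqueeze(0).expand(B, -1, -1)
        q = q + self.heatmap_embed(heat_state) + self.rgb_embed(rgb_state)
        # Uncertainty-guided attention: subtract uncertainty from the attention logits so
        # that anchors with higher heatmap confidence are prioritized.
        bias = (-uncertainty).unsqueeze(1).expand(-1, J, -1)                 # (B, J, V*J)
        bias = bias.repeat_interleave(self.cross_attn.num_heads, dim=0)      # (B*h, J, V*J)
        attended, _ = self.cross_attn(q, anchor_feats, anchor_feats, attn_mask=bias)
        # Offset features are added to the initial features; the refined features can then
        # be decoded into refined heatmaps or passed to a 2D-to-3D lifting module.
        return init_feats + self.to_offset(attended)
```

Subtracting the uncertainty from the attention logits is only one simple way to prioritize confident anchors; a learned gating or weighting scheme would serve the same purpose.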

Setup and Datasets

Visualization of our setup and datasets, Ego4View-Syn and Ego4View-RW.


Table 1. Comparison of existing datasets for egocentric 3D human pose estimation using body-facing RGB cameras. V: the number of views. Id: the number of human identities. Img: the number of images. GT: the number of ground-truth 3D poses. Def: realistic cloth deformation. LC: loose clothes. Ha: hand annotations. SMPL: SMPL parameters. RC: rear cameras. Note that xR-EP-R and EgoGlass are not publicly available, and EPW contains only pseudo ground truths.


Table 2. Visibility analysis of end-effector joints (hands and feet) with various rear-camera settings. D-FR: distance from front to rear cameras. D-FH: expected distance from front cameras to the front of the human head. D-RH: expected distance from rear cameras to the back of the human head. -L: left. -R: right.
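As a rough illustration of the frustum component of such a visibility analysis, the sketch below tests whether a joint, expressed in a camera's coordinate frame, falls inside the camera's field of view. The camera pose would be derived from the D-FR/D-FH/D-RH offsets relative to the head; the function names and the 180° default FoV are hypothetical, and self-occlusion by the head and body is not modelled here.

```python
import numpy as np


def joint_in_fov(joint_cam, fov_deg=180.0):
    """Frustum-only visibility test for a joint in camera coordinates (+z = optical axis)."""
    x, y, z = joint_cam
    if z <= 0.0:  # joint lies behind the camera
        return False
    # Angle between the optical axis and the ray towards the joint.
    angle = np.degrees(np.arctan2(np.hypot(x, y), z))
    return angle <= fov_deg / 2.0


def visibility_rate(joints_cam, fov_deg=180.0):
    """Fraction of samples in which the joint falls inside the field of view."""
    return float(np.mean([joint_in_fov(j, fov_deg) for j in joints_cam]))
```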

Qualitative Results

Visualization of 3D human pose estimation results.


Figure 5. Visualizations of 2D heatmap refinement with our method. Left: Ego4View-Syn. Right: Ego4View-RW. We highlight notably refined regions with red bounding boxes and provide zoomed-in views of these areas for closer examination.

Future Applications

Visualization of future applications.

Citation

@article{hakada2025arxiv,
  title = {Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation},
  author = {Akada, Hiroyasu and Wang, Jian and Golyanik, Vladislav and Theobalt, Christian},
  year = {2025},
  journal = {arXiv}
}

Acknowledgement

The work was supported by the ERC Consolidator Grant 4DReply (770784) and the Nakajima Foundation.