Visual Computing and Artificial Intelligence Department
Max Planck Institute for Informatics

GVVPerfCapEva is a hub for accessing a wide range of human shape and performance capture datasets from the Graphics, Vision, and Video group, partner research groups at MPI for Informatics, and collaborators elsewhere. These datasets enable further research in fields such as full-body performance capture, facial performance capture, and hand and finger performance capture.

The datasets span different sensor modalities such as depth cameras, multi-view video, optical markers, and inertial sensors. For some datasets, external results can be uploaded and compared to previous approaches.

License: Please see the individual dataset pages for details on license/restrictions. In general, you are required to cite the respective publication(s) if you use that dataset.

Supported by ERC Starting Grant CapReal


DenseHands: Two-Hand Interactions with Dense Correspondence Annotation from Depth

Collaboration with Universidad Rey Juan Carlos


The DenseHands dataset is a synthetic dataset that provides depth data of two interacting hands. The data was recorded in a live physical simulation setup ensuring natural hand motions. The annotations include 3D keypoint locations for the 21 keypoints of each hand (42 total) as well as dense surface correspondences to the MANO model mesh. In total, DenseHands consists of roughly 86,400 depth frames.
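
A typical use of these annotations is to evaluate a two-hand pose estimator against the 42 ground-truth keypoints per frame. Below is a minimal sketch of such an evaluation; the array shapes, millimetre units, and stand-in data are assumptions for illustration, not part of the dataset specification.

    import numpy as np

    def mean_keypoint_error(pred, gt):
        """Mean Euclidean error over all keypoints and frames.

        pred, gt: arrays of shape (num_frames, 42, 3) with the 3D keypoint
        locations of both hands (21 per hand), e.g. in millimetres.
        """
        return np.linalg.norm(pred - gt, axis=-1).mean()

    # Stand-in data for illustration; replace with loaded annotations.
    gt = np.random.rand(100, 42, 3) * 100.0
    pred = gt + np.random.normal(scale=5.0, size=gt.shape)
    print(f"mean 3D keypoint error: {mean_keypoint_error(pred, gt):.2f} mm")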

Real Painted Hands: Hand Segmentation for Two Interacting Hands from Depth

Collaboration with Universidad Rey Juan Carlos


A real dataset for segmenting two interacting hands from depth data. The annotation was obtained automatically using body paint on the hands and subsequent HSV color segmentation in the depth-aligned color image. The dataset provides depth images, original color images, depth-aligned color images, and hand segmentation labels (right/left/background). Three subjects (1 female, 2 male) were recorded from an egocentric (shoulder-mounted) camera and a third-person camera, resulting in approximately 20,000 depth frames in total.
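
The paint-based labeling relies on standard HSV color thresholding of the depth-aligned color image. The sketch below illustrates this kind of segmentation; the file name and hue/saturation/value ranges are illustrative assumptions, not the calibrated thresholds used to create the dataset.

    import cv2
    import numpy as np

    # Depth-aligned color image; the file name is a placeholder.
    color = cv2.imread("depth_aligned_color.png")
    hsv = cv2.cvtColor(color, cv2.COLOR_BGR2HSV)

    # Hypothetical HSV ranges for the two paint colors.
    right_mask = cv2.inRange(hsv, (35, 80, 40), (85, 255, 255))    # e.g. green paint
    left_mask = cv2.inRange(hsv, (100, 80, 40), (130, 255, 255))   # e.g. blue paint

    # Combine into a single label map: 0 = background, 1 = right, 2 = left.
    labels = np.zeros(color.shape[:2], dtype=np.uint8)
    labels[right_mask > 0] = 1
    labels[left_mask > 0] = 2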

Mo2Cap2: Real-time Mobile 3D Motion Capture with a Cap-mounted Fisheye Camera

Collaboration with Stanford University and EPFL


The Mo2Cap2 dataset is for training and evaluating egocentric 3D human body pose estimation methods. The associated capture hardware is a novel lightweight setup that converts a standard baseball cap into a device for high-quality pose estimation based on a single cap-mounted fisheye camera. The training set contains 530,000 rendered images of human bodies with ground-truth 2D and 3D annotations, covering around 3,000 different actions and more than 700 different body textures. The test set contains more than 5,000 real images captured with our cap-mounted hardware.
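
Egocentric fisheye images like these are commonly described with an equidistant fisheye projection. The sketch below shows such a generic projection of 3D points given in camera coordinates; the lens model and parameter names are assumptions for illustration and may differ from the actual Mo2Cap2 calibration.

    import numpy as np

    def project_equidistant_fisheye(points_cam, f, cx, cy):
        """Project 3D points (camera coordinates) with the equidistant
        fisheye model r = f * theta. Generic illustration only."""
        x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
        theta = np.arctan2(np.sqrt(x**2 + y**2), z)  # angle from the optical axis
        phi = np.arctan2(y, x)                       # azimuth around the axis
        r = f * theta                                # radial distance in pixels
        return np.stack([cx + r * np.cos(phi), cy + r * np.sin(phi)], axis=-1)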

MuCo-3DHP: Multi-Person Monocular RGB Composited 3D Human Pose Dataset

Collaboration with Stanford University


A training dataset to facilitate multi-person 3D pose estimation from monocular RGB. The dataset is composited from MPI-INF-3DHP, which provides 3D pose annotations from a markerless motion capture system. In addition, segmentation masks allow appearance augmentation of the individuals' clothing and of the background. The compositing and augmentation approach enables the creation of plausible multi-person scenarios at scale, making the dataset suitable for current data-hungry neural-network-based approaches.
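
The core compositing step can be thought of as layering segmented single-person frames onto a common background using their masks. The sketch below illustrates that idea with plain NumPy; it is a generic illustration, not the actual MuCo-3DHP compositing and augmentation pipeline.

    import numpy as np

    def composite_people(background, person_images, person_masks):
        """Layer segmented single-person frames onto a background.

        background:    (H, W, 3) image.
        person_images: list of (H, W, 3) images, ordered back to front.
        person_masks:  list of (H, W) boolean foreground masks.
        """
        out = background.copy()
        for img, mask in zip(person_images, person_masks):
            out[mask] = img[mask]  # closer people overwrite farther ones
        return out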

MuPoTS-3D: Multi-Person Monocular RGB 3D Human Pose Evaluation Benchmark

Collaboration with Stanford University


A real multi-person 3D pose dataset with ground-truth 3D poses captured with a markerless motion capture system. The set comprises 20 sequences covering a variety of outdoor and indoor scenarios as well as a variety of multi-person scenes, including inter-personal interaction, and serves as a benchmark for in-the-wild 3D human body pose estimation.

HandSeg: An Automatically Labeled Dataset for Hand Segmentation from Depth Images

Collaboration with University of Victoria and TU Graz


The dataset comprises 158,000 depth images captured with a RealSense SR300 camera together with automatically generated annotation labels. It contains 7 male and 3 female subjects. After the automatic labeling, the labels were carefully inspected to ensure they contain no labeling errors.

MonoPerfCap Dataset: Evaluation Benchmark for Monocular RGB Based Performance Capture

Collaboration with EPFL


The MonoPerfCap dataset is meant for evaluating monocular performance capture approaches in a variety of scenarios. It consists of 13 sequences (around 40k frames in total), split into the following subsets:
  1. 8 video sequences at 30 Hz covering a variety of scenarios, including indoor and outdoor settings, handheld and static cameras, natural and man-made environments, male and female subjects, as well as body-tight and loose garments.
  2. To further increase the diversity of human motions in the benchmark, an additional 40 actions are included, ranging from daily actions such as walking and jumping to highly challenging ones such as rolling, kicking, and falling. Each action is repeated multiple times by 3 subjects, resulting in 120 video clips grouped into 3 long video sequences of 7 minutes each.
  3. Two sequences from prior work [Robertini et al. 2016] and [Wu et al. 2013]. These sequences provide accurate surface reconstructions from multi-view images, which can be used as ground truth for quantitative evaluation.

GANerated Hands Dataset: An Enhanced Dataset for RGB Hand Pose Estimation With Full 3D Annotation

Collaboration with Universidad Rey Juan Carlos and Stanford University


The GANerated Hands dataset contains more than 330,000 color images of hands with full 3D annotations for 21 keypoints of the hand. The images were initially generated synthetically and afterwards fed to a GAN for image-to-image translation to make their appearance more similar to real hands. A geometric consistency constraint during translation ensures that the perfect and inexpensive annotations of the synthetic hands can be transferred to the enhanced GANerated images.
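
The geometric consistency idea can be illustrated by comparing the hand silhouette before and after translation, so that the keypoint annotations of the synthetic input remain valid for the translated image. The snippet below is a conceptual sketch only, not the loss formulation used in the paper.

    import numpy as np

    def silhouette_consistency(sil_synth, sil_translated):
        """Mean absolute difference between two hand silhouettes in [0, 1].

        A low value indicates that the translated image preserves the hand
        geometry (and hence the 3D keypoint annotations) of the synthetic input.
        """
        return np.abs(sil_synth.astype(np.float32)
                      - sil_translated.astype(np.float32)).mean()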

IntrinsicMoCap Dataset: Human Body MoCap Under Varying Illumination Conditions

Collaboration with the Intel Visual Computing Institute


The IntrinsicMoCap dataset contains 5 sequences (2 indoor and 3 outdoor) capturing human motion under varying illumination conditions, such as cast shadows or global illumination changes. The recording setup consists of 8 static calibrated cameras placed around the scene. Camera calibration and pre-synchronized multi-view images are provided.
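
With the provided calibration, 3D points (e.g., reconstructed body joints) can be projected into each of the 8 camera views with a standard pinhole model. The sketch below is generic; how the released calibration files encode the intrinsics and extrinsics is not assumed here, and lens distortion is ignored for brevity.

    import numpy as np

    def project_points(points_world, K, R, t):
        """Project 3D world points into a calibrated camera.

        points_world: (N, 3) points, K: (3, 3) intrinsics,
        R: (3, 3) rotation and t: (3,) translation (world -> camera).
        """
        p_cam = points_world @ R.T + t       # world -> camera coordinates
        p_img = p_cam @ K.T                  # camera -> homogeneous pixel coordinates
        return p_img[:, :2] / p_img[:, 2:3]  # perspective divide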

EgoDexter: A Benchmark Dataset for Hand Tracking in the Presence of Occlusions and Clutter

Collaboration with Universidad Rey Juan Carlos


EgoDexter is an RGB-D dataset for evaluating hand tracking algorithms in the presence of occlusions and clutter. It consists of 4 sequences with 4 actors (2 female), featuring varying interactions with various objects against a cluttered background. Fingertip positions were manually annotated for 1,485 of the 3,190 frames.

SynthHands: A Dataset for Hand Pose Estimation from Depth and Color

Collaboration with Universidad Rey Juan Carlos


SynthHands is a dataset for training and evaluating algorithms for hand pose estimation from depth and color data. The dataset contains data for male and female hands, both with and without object interaction. While the hand and foreground object are synthetically generated using Unity, the motion was obtained from real performances as described in the accompanying paper. In addition, real object textures and background images (depth and color) were used. Ground-truth 3D positions are provided for 21 keypoints of the hand.

MPI-INF-3DHP (Monocular 3D Human Pose Estimation In The Wild Using Improved CNN Supervision)

Collaboration with Universidad Rey Juan Carlos and EPFL


MPI-INF-3DHP is a new single-person 3D body pose dataset and evaluation benchmark, with pose annotations obtained from a markerless motion capture system. The training set has extensive pose coverage: 8 subjects perform 7 broad activity classes, captured with 14 cameras. The subjects are recorded in front of a green screen, wearing both street clothes and uniformly colored clothing, to allow extensive background and foreground appearance augmentation. The test set covers similarly broad pose classes, with 6 sequences in a variety of scene settings, and serves as a benchmark for in-the-wild performance.

MARCOnI (MARker-less Motion Capture in Outdoor and Indoor Scenes)

Collaboration with New York University, Stanford University, and Computer Vision and Multimodal Computing Group


MARCOnI (MARker-less Motion Capture in Outdoor and Indoor Scenes) is a new test dataset for marker-less motion capture methods that reflects real-world scene conditions more realistically, yet features comprehensive reference/ground-truth data. The dataset features multi-view video of twelve real-world sequences with varying sensor modalities, different subjects in the scene, and different scene and motion complexities. This new multi-modal dataset contains sequences in a variety of surroundings: uncontrolled outdoor scenarios and indoor scenarios. The sequences vary in the data modalities captured (multiple videos, video + marker positions), the number and identities of actors to track, the complexity of the motions, the number of cameras used, the existence and number of moving objects in the background, and the lighting conditions (i.e., some body parts lit and some in shadow). The cameras differ in type (from cell phones to vision cameras), frame resolution, and frame rate.

BundleFusion: Real-time Globally Consistent 3D Reconstruction using Online Surface Re-integration

Collaboration with Stanford University and Microsoft Research


We provide a dataset containing RGB-D data of 7 large scenes (average trajectory length of 60 m, 5,833 frames per scene on average). The RGB-D data was captured using a Structure.io depth sensor coupled with an iPad color camera. In addition, we provide the corresponding globally consistent 3D reconstructions obtained with our real-time BundleFusion approach.

Dexter+Object: A Dataset for Evaluating Joint Hand-Object Tracking

Collaboration with Aalto University


Dexter+Object is a dataset for evaluating algorithms for joint hand and object tracking. It consists of 6 sequences with 2 actors (1 female), and varying interactions with a simple object shape. Fingertip positions and cuboid corners were manually annotated for all sequences. This dataset accompanies the ECCV 2016 paper, Real-time Joint Tracking of a Hand Manipulating an Object from RGB-D Input.

VolumeDeform: Real-time Volumetric Non-rigid Reconstruction

Collaboration with the University of Erlangen-Nuremberg and Stanford University


We present a novel approach for the reconstruction of dynamic geometric shapes using a single hand-held consumer-grade RGB-D sensor at real-time rates. Our method does not require a pre-defined shape template to start with and builds up the scene model from scratch during the scanning process. Geometry and motion are parameterized in a unified manner by a volumetric representation that encodes a distance field of the surface geometry as well as the non-rigid space deformation. Motion tracking is based on a set of extracted sparse color features in combination with a dense depth-based constraint formulation. This enables accurate tracking and drastically reduces drift inherent to standard model-to-depth alignment. We cast finding the optimal deformation of space as a non-linear regularized variational optimization problem by enforcing local smoothness and proximity to the input constraints. The problem is tackled in real-time at the camera's capture rate using a data-parallel flip-flop optimization strategy. Our results demonstrate robust tracking even for fast motion and scenes that lack geometric features.
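
The unified volumetric representation described above stores, per voxel, a truncated signed distance along with the non-rigid space deformation. The toy class below sketches such a data layout and a simple running-average distance update; it is an illustration of the idea only, not the VolumeDeform implementation.

    import numpy as np

    class DeformableTSDFVolume:
        """Toy voxel grid storing a truncated signed distance, an integration
        weight, and a 3D deformation vector per voxel (illustration only)."""

        def __init__(self, resolution=64, voxel_size=0.01, truncation=0.05):
            self.voxel_size = voxel_size
            self.truncation = truncation
            self.tsdf = np.ones((resolution,) * 3, dtype=np.float32)
            self.weight = np.zeros((resolution,) * 3, dtype=np.float32)
            self.deform = np.zeros((resolution,) * 3 + (3,), dtype=np.float32)

        def integrate(self, idx, signed_distance):
            """Running-average update of one voxel's truncated signed distance."""
            d = np.clip(signed_distance / self.truncation, -1.0, 1.0)
            w = self.weight[idx]
            self.tsdf[idx] = (self.tsdf[idx] * w + d) / (w + 1.0)
            self.weight[idx] = w + 1.0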

EgoCap: Egocentric Marker-less Motion Capture with Two Fisheye Cameras

Collaboration with the Computer Vision and Multimodal Computing Group, and the Intel Visual Computing Institute


The EgoCap Pose dataset provides 100,000 egocentric fisheye images with detailed pose annotations of eight subjects performing various activities and wearing different apparel. Performances are recorded against a greenscreen background to ease augmentation. The dataset contains the raw recordings as well as the augmentation proposed in EgoCap [Rhodin et al. SIGGRAPH Asia 2016], ready to be used for DNN training. Pose annotations are inferred from a marker-less multi-view motion capture algorithm and provide 18 body part labels per image. Unique to the dataset are the detailed annotations, the greenscreen background for background augmentation, the egocentric perspective, and the fisheye distortion due to the 180-degree field-of-view lenses.

Shading-based Refinement on Volumetric Signed Distance Functions

Collaboration with the University of Erlangen-Nuremberg, Stanford University, and ETH Zurich


We present a novel method to obtain fine-scale detail in 3D reconstructions generated with RGB-D cameras or other commodity scanning devices. As the depth data of these sensors is noisy, truncated signed distance fields are typically used to regularize out this noise in the reconstructions, which unfortunately over-smooths results. In our approach, we leverage RGB data to refine these reconstructions through inverse shading cues, as color input is typically of much higher resolution than the depth data. As a result, we obtain reconstructions with high geometric detail - far beyond the depth resolution of the camera itself - as well as highly-accurate surface albedo, at high computational efficiency. Our core contribution is shading-based refinement directly on the implicit surface representation, which is generated from globally-aligned RGB-D images. We formulate the inverse shading problem on the volumetric distance field, and present a novel objective function which jointly optimizes for fine-scale surface geometry and spatially-varying surface reflectance. In addition, we solve for incident illumination, allowing application in general and unconstrained environments. In order to enable the efficient reconstruction of sub-millimeter detail, we store and process our surface using a sparse voxel hashing scheme that we augmented by introducing a grid hierarchy. A tailored GPU-based Gauss-Newton solver enables us to refine large shape models to previously unseen resolution within only a few seconds. Non-linear shape optimization directly on the implicit shape model allows for a highly-efficient parallelization, and enables much higher reconstruction detail. Our method is versatile and can be combined with a variety of scanning approaches based on implicit surfaces.
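
Schematically, the joint optimization described above can be summarized as a weighted sum of a shading term and regularizers over the implicit surface; the decomposition below is a generic sketch of such an inverse-shading energy, not the exact objective used in the paper:

    E(D, \rho, \ell) = E_{\mathrm{shading}}(D, \rho, \ell)
                       + \lambda_{g}\, E_{\mathrm{smooth}}(D)
                       + \lambda_{\rho}\, E_{\mathrm{albedo}}(\rho)

Here D denotes the volumetric signed distance field, \rho the spatially-varying surface albedo, \ell the incident illumination, and the \lambda weights balance fine-scale geometry refinement against the regularizers.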

3D Human Body Shape Models and Tools [MPII Human Shape]

Collaboration with the Computer Vision and Multimodal Computing Group, and Saarland University


MPII Human Shape is a family of expressive 3D human body shape models and tools for human shape space building, manipulation and evaluation. Human shape spaces are based on the widely used statistical body representation and learned from the CAESAR dataset, the largest commercially available scan database to date. As preprocessing several thousand scans for learning the models is a challenge in itself, we contribute by developing robust best practice solutions for scan alignment that quantitatively lead to the best learned models. We make the models as well as the tools publicly available. Extensive evaluation shows improved accuracy and generality of our new models, as well as superior performance for human body reconstruction from sparse input data.
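
Statistical shape spaces of this kind are typically used by reconstructing a body mesh as the mean shape plus a weighted combination of learned basis vectors. The sketch below shows that generic reconstruction step; variable names and dimensions are assumptions for illustration and do not reflect the released file formats.

    import numpy as np

    def reconstruct_shape(mean_shape, basis, coefficients):
        """Reconstruct body vertices from a linear statistical shape model.

        mean_shape:   (3 * num_vertices,) mean shape vector.
        basis:        (3 * num_vertices, num_components) learned basis.
        coefficients: (num_components,) shape parameters of one body.
        Returns vertices as a (num_vertices, 3) array.
        """
        shape = mean_shape + basis @ coefficients
        return shape.reshape(-1, 3)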

Performance Capture of Multiple Actors Using a Stereo Camera [GVVPerfCapEva: BinoCap 2013]


We describe a new algorithm which is able to track skeletal motion and detailed surface geometry of one or more actors from footage recorded with a stereo rig that is allowed to move. It succeeds in general sets with uncontrolled background and uncontrolled illumination, and scenes in which actors strike non-frontal poses. It is one of the first performance capture methods to exploit detailed BRDF information and scene illumination for accurate pose tracking and surface refinement in general scenes. It also relies on a new foreground segmentation approach that combines appearance, stereo, and pose tracking results to segment out actors from the background. Appearance, segmentation, and motion cues are combined in a new pose optimization framework that is robust under uncontrolled lighting, uncontrolled background and very sparse camera views.

High-quality Facial Performance Capture [GVVPerfCapEva: MonFaceCap 2013]


Detailed facial performance geometry can be reconstructed using dense camera and light setups in controlled studios. However, a wide range of important applications cannot employ these approaches, including all movie productions shot from a single principal camera. For post-production, these require dynamic monocular face capture for appearance modification. We therefore present a new method for capturing detailed, dynamic, spatio-temporally coherent 3D face geometry from monocular video. Our approach works under uncontrolled lighting and needs no markers, yet it successfully reconstructs expressive motion including high-frequency face detail such as wrinkles and dimples. This database demonstrates the reconstruction quality attained by our proposed monocular approach on three long and complex sequences showing challenging head motion and facial expressions.

Inertial Depth Tracker Dataset [GVVPerfCapEva: IDT 2013]


In recent years, the availability of inexpensive depth cameras, such as the Microsoft Kinect, has boosted research in monocular full-body skeletal pose tracking. Unfortunately, existing trackers often fail to capture poses where a single camera provides insufficient data, such as non-frontal poses, and all other poses with body part occlusions. With this dataset, we provide the means to evaluate a monocular depth tracker based on ground-truth joint positions that have been obtained with an optical marker-based system and that are calibrated with respect to the depth images. Furthermore, this dataset contains the readings of six inertial sensors worn by the person. This enables the development and testing of trackers that fuse information from these two complementary sensor types.

Evaluation of 3D Hand Tracking [GVVPerfCapEva: Dexter 1]


Dexter 1 is a dataset for evaluating algorithms for markerless, articulated 3D hand motion tracking. It consists of 7 sequences of general hand motion covering the abduction-adduction and flexion-extension ranges of the hand. All sequences feature a single actor's right hand. Dexter 1 includes data from the following sensors:
  1. 5 Sony DFW-V500 RGB cameras
  2. 1 Creative Gesture Camera (Close-range ToF depth sensor)
  3. 1 Kinect structured light camera
While the RGB and ToF data are nearly synchronized, the structured-light data is not. In addition, the dataset contains manual annotations on the ToF data for all fingertip positions and the approximate palm center.

Personalized Depth Tracker Dataset [GVVPerfCapEva: PDT 2013]

Collaboration with Harvard University and University of Erlangen-Nuremberg


Reconstructing a three-dimensional representation of human motion in real time constitutes an important research topic with applications in sports science, human-computer interaction, and the movie industry. This dataset was created to evaluate two different kinds of algorithms. First, one can assess the accuracy of methods that estimate the shape of a human from two sequentially taken depth images from a Microsoft Kinect sensor. Second, one can use the dataset to evaluate the accuracy of depth-based full-body trackers.

Lightweight Binocular Facial Performance Capture under Uncontrolled Lighting [GVVPerfCapEva: FaceCap]

Collaboration with University of Stuttgart


Recent progress in passive facial performance capture has shown impressively detailed results on highly articulated motion. However, most methods rely on complex multi-camera set-ups, controlled lighting or fiducial markers. This prevents them from being used in general environments, outdoor scenes, during live action on a film set, or by freelance animators and everyday users who want to capture their digital selves. In this paper, we therefore propose a lightweight passive facial performance capture approach that is able to reconstruct high-quality dynamic facial geometry from only a single pair of stereo cameras. Our method succeeds under uncontrolled and time-varying lighting, and also in outdoor scenes. Our approach builds upon and extends recent image-based scene flow computation, lighting estimation and shading-based refinement algorithms. It integrates them into a pipeline that is specifically tailored towards facial performance reconstruction from challenging binocular footage under uncontrolled lighting. In an experimental evaluation, the strong capabilities of our method become explicit: We achieve detailed and spatio-temporally coherent results for expressive facial motion in both indoor and outdoor scenes -- even from low quality input images recorded with a hand-held consumer stereo camera. We believe that our approach is the first to capture facial performances of such high quality from a single stereo rig and we demonstrate that it brings facial performance capture out of the studio, into the wild, and within the reach of everybody.

SkelSurf: Motion Capture Using Joint Skeleton Tracking and Surface Estimation


This paper proposes a method for capturing the performance of a human or an animal from a multi-view video sequence. Given an articulated template model and silhouettes from a multi-view image sequence, our approach recovers not only the movement of the skeleton, but also the possibly non-rigid temporal deformation of the 3D surface. While large scale deformations or fast movements are captured by the skeleton pose and approximate surface skinning, true small scale deformations or non-rigid garment motion are captured by fitting the surface to the silhouette. We further propose a novel optimization scheme for skeleton-based pose estimation that exploits the skeleton’s tree structure to split the optimization problem into a local one and a lower dimensional global one. We show on various sequences that our approach can capture the 3D motion of animals and humans accurately even in the case of rapid movements and wide apparel like skirts.

Performance Capture from Sparse Multi-view Video [GVVPerfCapEva: PCSM]

Collaboration with Stanford University


This paper proposes a new marker-less approach to capturing human performances from multi-view video. Our algorithm can jointly reconstruct spatio-temporally coherent geometry, motion, and textural surface appearance of actors that perform complex and rapid moves. Furthermore, since our algorithm is purely mesh-based and makes as few prior assumptions as possible about the type of subject being tracked, it can even capture performances of people wearing wide apparel, such as a dancer wearing a skirt. To serve this purpose, our method efficiently and effectively combines the power of surface- and volume-based shape deformation techniques with a new mesh-based analysis-through-synthesis framework. This framework extracts motion constraints from video and makes the laser scan of the tracked subject mimic the recorded performance. Small-scale, time-varying shape detail is also recovered by applying model-guided multi-view stereo to refine the model surface. Our method delivers captured performance data at a higher level of detail, is highly versatile, and is applicable to many complex types of scenes that could not be handled by alternative marker-based or marker-free recording techniques.

Markerless Motion Capture of Multiple Characters Using Multi-view Image Segmentation [GVVPerfCapEva: MVIC]

Collaboration with Tsinghua University and ETH Zurich


We present a markerless motion capture approach that reconstructs the skeletal motion and detailed time-varying surface geometry of multiple people from multi-view video. Due to ambiguities in feature-to-person assignments and frequent occlusions, it is not feasible to directly apply single-person capture approaches to the multi-person case. We therefore propose a combined image segmentation and tracking approach to overcome these difficulties. A new probabilistic shape and appearance model is employed to segment the input images and to assign each pixel uniquely to one person. Thereafter, a single-person markerless motion and surface capture approach can be applied to each individual, either one-by-one or in parallel, even under strong occlusions. We demonstrate the performance of our approach on several challenging multi-person motions, including dance and martial arts, and also provide a reference dataset for multi-person motion capture with ground truth.

Performance Capture of Interacting Characters with Handheld Kinects [GVVPerfCapEva: HKIC]

Collaboration with Tsinghua University


We present an algorithm for marker-less performance capture of interacting humans using only three hand-held Kinect cameras. Our method reconstructs human skeletal poses, deforming surface geometry and camera poses for every time step of the depth video. Skeletal configurations and camera poses are found by solving a joint energy minimization problem which optimizes the alignment of RGBZ data from all cameras, as well as the alignment of human shape templates to the Kinect data. The energy function is based on a combination of geometric correspondence finding, implicit scene segmentation, and correspondence finding using image features. Only the combination of geometric and photometric correspondences and the integration of human pose and camera pose estimation enables reliable performance capture with only three sensors. As opposed to previous performance capture methods, our algorithm succeeds on general uncontrolled indoor scenes with potentially dynamic background, and it succeeds even if the cameras are moving.
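
The joint energy over human poses and camera poses described above can be written schematically as a weighted combination of alignment terms; the decomposition below is a generic sketch, not the exact formulation of the paper:

    E(\Theta, C) = E_{\mathrm{geom}}(\Theta, C)
                   + \lambda_{\mathrm{photo}}\, E_{\mathrm{photo}}(\Theta, C)
                   + \lambda_{\mathrm{seg}}\, E_{\mathrm{seg}}(\Theta, C)

where \Theta are the skeletal poses, C the camera poses, E_geom aligns the human shape templates to the RGBZ data via geometric correspondences, E_photo uses image-feature correspondences, and E_seg accounts for the implicit scene segmentation.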

A Statistical Model of Human Pose and Body Shape [GVVPerfCapEva: ScanDB]


The aim of this research is to develop a detailed statistical model of human body shapes by leveraging an encoding for mesh surfaces that is linear with respect to rotations and scalings and that describes human pose and body shape in a unified framework. We have captured 114 subjects in a subset of 35 poses using a 3D laser scanner. Additionally, we measured body weight and several other biometric measures of each subject. Non-rigid registration is performed to bring the scans into correspondence.