
GVVPerfCapEva is a hub for accessing a wide range of human shape and performance capture datasets from the Graphics, Vision, and Video group at MPI for Informatics and from partner research groups elsewhere. These datasets enable further research in fields such as full-body performance capture, facial performance capture, and hand and finger performance capture.

The datasets span different sensor modalities such as depth cameras, multi-view video, optical markers, and inertial sensors. For some datasets, external results can be uploaded and compared to previous approaches.

License: Please see the individual dataset pages for details on license/restrictions. In general, you are required to cite the respective publication(s) if you use that dataset.

Supported by ERC Starting Grant CapReal


GANerated Hands Dataset: An Enhanced Dataset for RGB Hand Pose Estimation With Full 3D Annotation

The dataset contains more than 330,000 color images of hands with full 3D annotation for 21 keypoints of the hand. The images were initially synthetically generated and afterwards fed to a GAN for image-to-image translation to make the features more similar to real hands. A geometric consistency constraint during translation ensures that the perfect and inexpensive annotations of the synthetic hands can be transferred to the enhanced GANerated images.

IntrinsicMoCap: Human Body MoCap Under Varying Illumination Conditions

This dataset contains 5 sequences (2 indoor and 3 outdoor) capturing human motion under varying illumination conditions, such as cast shadows or global illumination changes. The recording setup consists of 8 static calibrated cameras placed around the scene. Calibration data and pre-synchronized multi-view images are provided.
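With calibrated cameras, 3D points can be related to image pixels via the standard pinhole model. The following is a minimal sketch with made-up intrinsics and extrinsics; the real calibration values ship with the dataset in its own format:

```python
import numpy as np

# Hypothetical intrinsics and extrinsics for one calibrated camera;
# the actual values come from the dataset's calibration files.
K = np.array([[1200.0,    0.0, 960.0],
              [   0.0, 1200.0, 540.0],
              [   0.0,    0.0,   1.0]])
R = np.eye(3)                      # world-to-camera rotation
t = np.array([0.0, 0.0, 4.0])      # world-to-camera translation (metres)

def project(X_world):
    """Project a 3D world point into pixel coordinates (pinhole model)."""
    X_cam = R @ X_world + t        # transform into the camera frame
    x = K @ X_cam                  # apply intrinsics
    return x[:2] / x[2]            # perspective division

uv = project(np.array([0.0, 0.0, 0.0]))  # world origin lands at the principal point
```

With 8 such cameras, the same point can be triangulated from its projections, which is how multi-view setups like this one recover 3D motion.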

EgoDexter: A Benchmark Dataset for Hand Tracking in the Presence of Occlusions and Clutter

EgoDexter is an RGB-D dataset for evaluating algorithms for hand tracking in the presence of occlusions and clutter. It consists of 4 sequences with 4 actors (2 female) interacting with various objects against cluttered backgrounds. Fingertip positions were manually annotated for 1,485 out of 3,190 frames.

SynthHands: A Dataset for Hand Pose Estimation from Depth and Color

SynthHands is a dataset for training and evaluating algorithms for hand pose estimation from depth and color data. The dataset contains data for male and female hands, both with and without interaction with objects. While the hand and foreground object are synthetically generated using Unity, the motion was obtained from real performances as described in the accompanying paper. In addition, real object textures and background images (depth and color) were used. Ground truth 3D positions are provided for 21 keypoints of the hand.

MPI-INF-3DHP (Monocular 3D Human Pose Estimation In The Wild Using Improved CNN Supervision)

MPI-INF-3DHP is a new single-person 3D body pose dataset and evaluation benchmark, with pose annotations from a markerless motion capture system. The training set has extensive pose coverage, with 7 broad activities covered by 14 cameras for 8 subjects. The subjects are captured in front of a green screen, wearing both street clothes and uniform-color clothing, to allow extensive background and foreground appearance augmentation. The test set covers similarly broad pose classes, with 6 sequences in a variety of scene settings to serve as a benchmark for in-the-wild performance.
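Green-screen recordings like these are typically augmented by compositing the subject onto new backgrounds. A minimal sketch of the core idea using a hard green-dominance mask; real augmentation pipelines, including the one proposed with the dataset, use softer mattes:

```python
import numpy as np

def augment_background(frame, background, green_thresh=1.3):
    """Replace green-screen pixels of `frame` with pixels from `background`.

    A pixel counts as green screen when its green channel dominates both
    red and blue by `green_thresh`. Both inputs are HxWx3 float arrays;
    the threshold value here is an illustrative choice, not from the paper.
    """
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    mask = (g > green_thresh * r) & (g > green_thresh * b)  # green-dominant pixels
    out = frame.copy()
    out[mask] = background[mask]                            # composite new background
    return out
```

Repeating this with many backgrounds (and analogous texture swaps for the uniform-color clothing) yields the appearance augmentation the dataset is designed for.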

MARCOnI (MARker-less Motion Capture in Outdoor and Indoor Scenes)

Collaboration with New York University, Stanford University, and Computer Vision and Multimodal Computing Group, MPI for Informatics

MARCOnI (MARker-less Motion Capture in Outdoor and Indoor Scenes) is a new test dataset for marker-less motion capture methods that reflects real-world scene conditions more faithfully, yet features comprehensive reference/ground truth data. The dataset comprises twelve real-world multi-view video sequences with varying sensor modalities, different subjects in the scene, and different scene and motion complexities. Our new multi-modal dataset contains sequences in a variety of surroundings: uncontrolled outdoor scenarios and indoor scenarios. The sequences vary in the data modalities captured (multiple videos, video + marker positions), the number and identities of actors to track, the complexity of the motions, the number of cameras used, the existence and number of moving objects in the background, and the lighting conditions (i.e. some body parts lit and some in shadow). Cameras differ in type (from cell phones to vision cameras), frame resolution, and frame rate.

BundleFusion: Real-time Globally Consistent 3D Reconstruction using Online Surface Re-integration

Collaboration with Stanford University and Microsoft Research

We provide a dataset containing RGB-D data of 7 large scenes (average trajectory length 60 m, average 5,833 frames per scene). The RGB-D data was captured using a Structure.io depth sensor coupled with an iPad color camera. In addition, we provide the corresponding globally consistent 3D reconstructions obtained by our real-time BundleFusion approach.
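Reconstructions of this kind are built by fusing per-frame depth observations into a truncated signed distance field with a running weighted average per voxel (the classic Curless-Levoy update that BundleFusion-style systems also build on). A simplified per-voxel sketch of that update, not the paper's actual implementation:

```python
import numpy as np

def integrate_tsdf(tsdf, weight, sdf_obs, trunc=0.05):
    """Fuse one frame's observed signed distances into a TSDF.

    `tsdf`, `weight`: per-voxel arrays of current values and fusion weights.
    `sdf_obs`: per-voxel signed distance observed this frame (metres).
    Values far behind the observed surface are skipped; the rest are
    truncated, normalized to [-1, 1], and averaged in with unit weight.
    """
    d = np.clip(sdf_obs / trunc, -1.0, 1.0)   # truncate and normalize
    valid = sdf_obs > -trunc                  # ignore voxels far behind surface
    new_weight = weight + valid               # unit weight per valid observation
    tsdf_new = np.where(valid,
                        (tsdf * weight + d) / np.maximum(new_weight, 1),
                        tsdf)
    return tsdf_new, new_weight
```

The surface is then extracted as the zero level set of the fused field; BundleFusion additionally de-integrates and re-integrates frames as camera poses are globally refined.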

Dexter+Object: A Dataset for Evaluating Joint Hand-Object Tracking

Collaboration with Aalto University

Dexter+Object is a dataset for evaluating algorithms for joint hand and object tracking. It consists of 6 sequences with 2 actors (1 female), and varying interactions with a simple object shape. Fingertip positions and cuboid corners were manually annotated for all sequences. This dataset accompanies the ECCV 2016 paper, Real-time Joint Tracking of a Hand Manipulating an Object from RGB-D Input.

VolumeDeform: Real-time Volumetric Non-rigid Reconstruction

Collaboration with the University of Erlangen-Nuremberg and Stanford University

We present a novel approach for the reconstruction of dynamic geometric shapes using a single hand-held consumer-grade RGB-D sensor at real-time rates. Our method does not require a pre-defined shape template to start with and builds up the scene model from scratch during the scanning process. Geometry and motion are parameterized in a unified manner by a volumetric representation that encodes a distance field of the surface geometry as well as the non-rigid space deformation. Motion tracking is based on a set of extracted sparse color features in combination with a dense depth-based constraint formulation. This enables accurate tracking and drastically reduces drift inherent to standard model-to-depth alignment. We cast finding the optimal deformation of space as a non-linear regularized variational optimization problem by enforcing local smoothness and proximity to the input constraints. The problem is tackled in real-time at the camera's capture rate using a data-parallel flip-flop optimization strategy. Our results demonstrate robust tracking even for fast motion and scenes that lack geometric features.

EgoCap: Egocentric Marker-less Motion Capture with Two Fisheye Cameras

Collaboration with the Computer Vision and Multimodal Computing Group, MPI for Informatics, and the Intel Visual Computing Institute

The EgoCap Pose dataset provides 100,000 egocentric fisheye images with detailed pose annotation of eight subjects performing various activities and wearing different apparel. Performances are recorded on a greenscreen background to ease augmentation. The dataset contains the raw recordings as well as the augmentation proposed in EgoCap [Rhodin et al. SIGGRAPH Asia 2016], which is ready to be used for DNN training. Pose annotations are inferred from a marker-less multi-view motion capture algorithm and provide 18 body part labels per image. Unique to the dataset are the detailed annotations, the greenscreen background for background augmentation, the egocentric perspective, and the fisheye distortion of the 180-degree field-of-view lenses.
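Fisheye images like these do not follow the pinhole model; a common approximation for 180-degree lenses is the equidistant model, where image radius grows linearly with the angle from the optical axis. A sketch under that assumption (EgoCap's actual calibration model may differ):

```python
import numpy as np

def fisheye_project(X_cam, f, cx, cy):
    """Equidistant fisheye projection: r = f * theta.

    `X_cam`: 3D point in the camera frame; `f`: focal length in pixels;
    `(cx, cy)`: principal point. All values here are illustrative.
    """
    theta = np.arctan2(np.linalg.norm(X_cam[:2]), X_cam[2])  # angle off optical axis
    phi = np.arctan2(X_cam[1], X_cam[0])                     # azimuth in image plane
    r = f * theta                                            # radius grows linearly
    return np.array([cx + r * np.cos(phi), cy + r * np.sin(phi)])
```

Under this model a point 90 degrees off-axis still lands inside the image, which is exactly why such lenses can see the whole body from a head-mounted rig.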

Shading-based Refinement on Volumetric Signed Distance Functions

Collaboration with the University of Erlangen-Nuremberg, Stanford University, and ETH Zurich

We present a novel method to obtain fine-scale detail in 3D reconstructions generated with RGB-D cameras or other commodity scanning devices. As the depth data of these sensors is noisy, truncated signed distance fields are typically used to regularize out this noise in the reconstructions, which unfortunately over-smooths results. In our approach, we leverage RGB data to refine these reconstructions through inverse shading cues, as color input is typically of much higher resolution than the depth data. As a result, we obtain reconstructions with high geometric detail - far beyond the depth resolution of the camera itself - as well as highly-accurate surface albedo, at high computational efficiency. Our core contribution is shading-based refinement directly on the implicit surface representation, which is generated from globally-aligned RGB-D images. We formulate the inverse shading problem on the volumetric distance field, and present a novel objective function which jointly optimizes for fine-scale surface geometry and spatially-varying surface reflectance. In addition, we solve for incident illumination, allowing application in general and unconstrained environments. In order to enable the efficient reconstruction of sub-millimeter detail, we store and process our surface using a sparse voxel hashing scheme that we augmented by introducing a grid hierarchy. A tailored GPU-based Gauss-Newton solver enables us to refine large shape models to previously unseen resolution within only a few seconds. Non-linear shape optimization directly on the implicit shape model allows for a highly-efficient parallelization, and enables much higher reconstruction detail. Our method is versatile and can be combined with a wide range of scanning approaches based on implicit surfaces.

3D Human Body Shape Models and Tools [MPII Human Shape]

Collaboration with the Computer Vision and Multimodal Computing Group, MPI for Informatics, and Saarland University

MPII Human Shape is a family of expressive 3D human body shape models and tools for human shape space building, manipulation and evaluation. Human shape spaces are based on the widely used statistical body representation and learned from the CAESAR dataset, the largest commercially available scan database to date. As preprocessing several thousand scans for learning the models is a challenge in itself, we contribute by developing robust best practice solutions for scan alignment that quantitatively lead to the best learned models. We make the models as well as the tools publicly available. Extensive evaluation shows improved accuracy and generality of our new models, as well as superior performance for human body reconstruction from sparse input data.
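A human shape space of this kind is, at its core, a linear model learned from registered scans. Below is a deliberately simplified sketch using plain vertex-space PCA; MPII Human Shape itself uses a more sophisticated mesh encoding that is linear in rotations and scalings, so treat this only as an illustration of the general idea:

```python
import numpy as np

def learn_shape_space(scans, n_components=10):
    """Learn a linear (PCA) shape space from registered scans.

    `scans`: (N, 3V) array, each row a registered mesh flattened to a
    vertex vector; registration makes rows directly comparable.
    Returns the mean shape and the top principal shape directions.
    """
    mean = scans.mean(axis=0)
    centered = scans - mean
    # SVD of the centered data yields the principal shape directions.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return mean, Vt[:n_components]

def synthesize(mean, basis, coeffs):
    """New body shape = mean shape + linear combination of basis shapes."""
    return mean + coeffs @ basis
```

Fitting the low-dimensional coefficients to sparse measurements is what enables body reconstruction from limited input data, as evaluated with the released tools.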

Performance Capture of Multiple Actors Using a Stereo Camera [GVVPerfCapEva: BinoCap 2013]

We describe a new algorithm which is able to track skeletal motion and detailed surface geometry of one or more actors from footage recorded with a stereo rig that is allowed to move. It succeeds in general sets with uncontrolled background and uncontrolled illumination, and scenes in which actors strike non-frontal poses. It is one of the first performance capture methods to exploit detailed BRDF information and scene illumination for accurate pose tracking and surface refinement in general scenes. It also relies on a new foreground segmentation approach that combines appearance, stereo, and pose tracking results to segment out actors from the background. Appearance, segmentation, and motion cues are combined in a new pose optimization framework that is robust under uncontrolled lighting, uncontrolled background and very sparse camera views.

High-quality Facial Performance Capture [GVVPerfCapEva: MonFaceCap 2013]

Detailed facial performance geometry can be reconstructed using dense camera and light setups in controlled studios. However, a wide range of important applications cannot employ these approaches, including all movie productions shot from a single principal camera. For post-production, these require dynamic monocular face capture for appearance modification. We therefore present a new method for capturing detailed, dynamic, spatio-temporally coherent 3D face geometry from monocular video. Our approach works under uncontrolled lighting and needs no markers, yet it successfully reconstructs expressive motion including high-frequency face detail such as wrinkles and dimples. This database demonstrates the reconstruction quality attained by our proposed monocular approach on three long and complex sequences showing challenging head motion and facial expressions.

Inertial Depth Tracker Dataset [GVVPerfCapEva: IDT 2013]

In recent years, the availability of inexpensive depth cameras, such as the Microsoft Kinect, has boosted the research in monocular full body skeletal pose tracking. Unfortunately, existing trackers often fail to capture poses where a single camera provides insufficient data, such as non-frontal poses, and all other poses with body part occlusions. With this dataset, we provide the means to evaluate a monocular depth tracker based on ground-truth joint positions that have been obtained with an optical marker-based system and which are calibrated with respect to the depth images. Furthermore, this dataset contains the readings of six inertial sensors worn by the person. This enables the development and testing of trackers that fuse information from these two complementary sensor types.

Evaluation of 3D Hand Tracking [GVVPerfCapEva: Dexter 1]

Dexter 1 is a dataset for evaluating algorithms for markerless, 3D articulated hand motion tracking. It consists of 7 sequences of general hand motion covering the abduction-adduction and flexion-extension ranges of the hand. All sequences feature a single actor's right hand. Dexter 1 includes data from the following sensors:
  1. 5 Sony DFW-V500 RGB cameras
  2. 1 Creative Gesture Camera (Close-range ToF depth sensor)
  3. 1 Kinect structured light camera
While the RGB and ToF data are nearly synchronized, the structured light data is not. In addition, this dataset contains manual annotations on the ToF data for all fingertip positions and the approximate palm center.
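When one stream is unsynchronized, a common workaround is to pair each of its frames with the temporally nearest frame of a reference stream. A sketch assuming per-frame timestamps are available; check the dataset's actual file layout, as the timestamp fields here are an assumption:

```python
import bisect

def nearest_frame(target_ts, sensor_ts):
    """Index of the frame in `sensor_ts` closest in time to `target_ts`.

    `sensor_ts` must be sorted ascending. Binary search finds the
    insertion point, then the closer of the two neighbours is chosen.
    """
    i = bisect.bisect_left(sensor_ts, target_ts)
    if i == 0:
        return 0
    if i == len(sensor_ts):
        return len(sensor_ts) - 1
    # pick whichever neighbour is closer in time
    if sensor_ts[i] - target_ts < target_ts - sensor_ts[i - 1]:
        return i
    return i - 1
```

This gives a rough cross-sensor correspondence; for rigorous evaluation, any residual temporal offset should still be reported or compensated.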

Personalized Depth Tracker Dataset [GVVPerfCapEva: PDT 2013]

Collaboration with Harvard University and University of Erlangen-Nuremberg

Reconstructing a three-dimensional representation of human motion in real-time constitutes an important research topic with applications in sports sciences, human-computer interaction, and the movie industry. This dataset was created to evaluate two different kinds of algorithms. First, one can assess the accuracy of methods that estimate the shape of a human from two sequentially captured depth images from a Microsoft Kinect sensor. Second, one can use the dataset to evaluate the accuracy of depth-based full-body trackers.

Lightweight Binocular Facial Performance Capture under Uncontrolled Lighting [GVVPerfCapEva: FaceCap]

Collaboration with University of Stuttgart

Recent progress in passive facial performance capture has shown impressively detailed results on highly articulated motion. However, most methods rely on complex multi-camera set-ups, controlled lighting or fiducial markers. This prevents them from being used in general environments, outdoor scenes, during live action on a film set, or by freelance animators and everyday users who want to capture their digital selves. In this paper, we therefore propose a lightweight passive facial performance capture approach that is able to reconstruct high-quality dynamic facial geometry from only a single pair of stereo cameras. Our method succeeds under uncontrolled and time-varying lighting, and also in outdoor scenes. Our approach builds upon and extends recent image-based scene flow computation, lighting estimation and shading-based refinement algorithms. It integrates them into a pipeline that is specifically tailored towards facial performance reconstruction from challenging binocular footage under uncontrolled lighting. In an experimental evaluation, the strong capabilities of our method become explicit: We achieve detailed and spatio-temporally coherent results for expressive facial motion in both indoor and outdoor scenes -- even from low quality input images recorded with a hand-held consumer stereo camera. We believe that our approach is the first to capture facial performances of such high quality from a single stereo rig and we demonstrate that it brings facial performance capture out of the studio, into the wild, and within the reach of everybody.

SkelSurf: Motion Capture Using Joint Skeleton Tracking and Surface Estimation

This paper proposes a method for capturing the performance of a human or an animal from a multi-view video sequence. Given an articulated template model and silhouettes from a multi-view image sequence, our approach recovers not only the movement of the skeleton, but also the possibly non-rigid temporal deformation of the 3D surface. While large scale deformations or fast movements are captured by the skeleton pose and approximate surface skinning, true small scale deformations or non-rigid garment motion are captured by fitting the surface to the silhouette. We further propose a novel optimization scheme for skeleton-based pose estimation that exploits the skeleton’s tree structure to split the optimization problem into a local one and a lower dimensional global one. We show on various sequences that our approach can capture the 3D motion of animals and humans accurately even in the case of rapid movements and wide apparel like skirts.

Performance Capture from Sparse Multi-view Video [GVVPerfCapEva: PCSM]

Collaboration with Stanford University

This paper proposes a new marker-less approach to capturing human performances from multi-view video. Our algorithm can jointly reconstruct spatio-temporally coherent geometry, motion and textural surface appearance of actors that perform complex and rapid moves. Furthermore, since our algorithm is purely mesh-based and makes as few prior assumptions as possible about the type of subject being tracked, it can even capture performances of people wearing wide apparel, such as a dancer wearing a skirt. To serve this purpose our method efficiently and effectively combines the power of surface- and volume-based shape deformation techniques with a new mesh-based analysis-through-synthesis framework. This framework extracts motion constraints from video and makes the laser-scan of the tracked subject mimic the recorded performance. Small-scale, time-varying shape detail is also recovered by applying model-guided multi-view stereo to refine the model surface. Our method delivers captured performance data at a higher level of detail, is highly versatile, and is applicable to many complex types of scenes that could not be handled by alternative marker-based or marker-free recording techniques.

Markerless Motion Capture of Multiple Characters Using Multi-view Image Segmentation [GVVPerfCapEva: MVIC]

Collaboration with Tsinghua University and ETH Zurich

We present a markerless motion capture approach that reconstructs the skeletal motion and detailed time-varying surface geometry of multiple people from multi-view video. Due to ambiguities in feature-to-person assignments and frequent occlusions, it is not feasible to directly apply single-person capture approaches to the multi-person case. We therefore propose a combined image segmentation and tracking approach to overcome these difficulties. A new probabilistic shape and appearance model is employed to segment the input images and to assign each pixel uniquely to one person. Thereafter, a single-person markerless motion and surface capture approach can be applied to each individual, either one-by-one or in parallel, even under strong occlusions. We demonstrate the performance of our approach on several challenging multi-person motions, including dance and martial arts, and also provide a reference dataset for multi-person motion capture with ground truth.

Performance Capture of Interacting Characters with Handheld Kinects [GVVPerfCapEva: HKIC]

Collaboration with Tsinghua University

We present an algorithm for marker-less performance capture of interacting humans using only three hand-held Kinect cameras. Our method reconstructs human skeletal poses, deforming surface geometry and camera poses for every time step of the depth video. Skeletal configurations and camera poses are found by solving a joint energy minimization problem which optimizes the alignment of RGBZ data from all cameras, as well as the alignment of human shape templates to the Kinect data. The energy function is based on a combination of geometric correspondence finding, implicit scene segmentation, and correspondence finding using image features. Only the combination of geometric and photometric correspondences and the integration of human pose and camera pose estimation enables reliable performance capture with only three sensors. As opposed to previous performance capture methods, our algorithm succeeds on general uncontrolled indoor scenes with potentially dynamic background, and it succeeds even if the cameras are moving.

A Statistical Model of Human Pose and Body Shape [GVVPerfCapEva: ScanDB]

The aim of this research is to develop a detailed statistical model of human body shapes by leveraging an encoding for mesh surfaces that is linear with respect to rotations and scalings and describes human pose and body shape in a unified framework. We have captured 114 subjects in a subset of 35 poses using a 3D laser scanner. Additionally, we measured body weight and several other biometric measures of each subject. Non-rigid registration is performed to bring the scans into correspondence.