Human Pose Estimation Datasets

COCO Keypoints Detection

https://cocodataset.org/#keypoints-2018

17 Keypoints:

(figure: the 17 COCO person keypoints)

Keypoint detection format:

(screenshot: keypoint detection task format from cocodataset.org)

Annotations

Annotations for keypoints are just like those for object detection (segmentation), except that the keypoints themselves are additionally specified as triplets (x, y, v), together with a num_keypoints count.

annotation{
    "keypoints": [x1,y1,v1,...],
    "num_keypoints": int,
    "id": int,
    "image_id": int,
    "category_id": int,
    "segmentation": RLE or [polygon],
    "area": float,
    "bbox": [x,y,width,height],
    "iscrowd": 0 or 1,
}
  • “keypoints”: a length 3k array where k is the total number of keypoints defined for the category.

    • Each keypoint has

      • a 0-indexed location x, y

      • visible flag v

        • v=0: not labeled (in which case x=y=0)

        • v=1: labeled but not visible

        • v=2: labeled and visible

          A keypoint is considered visible if it falls inside the object segment.

    • For example, (229, 256, 2) means there is a keypoint at pixel x=229, y=256, and v=2 indicates that it is visible.

  • “num_keypoints”: indicates the number of labeled keypoints (v>0) for a given object (many objects, e.g. crowds and small objects, will have num_keypoints=0).

Example

"annotations": [
    {
        "segmentation": [[204.01,306.23,...206.53,307.95]],
        "num_keypoints": 15,
        "area": 5463.6864,
        "iscrowd": 0,
        "keypoints": [229,256,2,...,223,369,2],
        "image_id": 289343,
        "bbox": [204.01,235.08,60.84,177.36],
        "category_id": 1,
        "id": 201376
    }
]
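
To make the layout concrete, here is a minimal sketch (plain Python, no pycocotools required; the function name is just illustrative) that splits such a keypoints list into named (x, y, v) triplets:

# Minimal sketch: pair each (x, y, v) triplet in a COCO "keypoints" list with
# its keypoint name from the category entry. `annotation` and `category` are
# dicts as parsed from the annotation file (e.g. via json.load).
def parse_keypoints(annotation, category):
    names = category["keypoints"]          # length-k list of keypoint names
    kps = annotation["keypoints"]          # flat list of length 3k
    assert len(kps) == 3 * len(names)
    parsed = {}
    for i, name in enumerate(names):
        x, y, v = kps[3 * i : 3 * i + 3]
        if v == 0:                         # not labeled: x = y = 0
            continue
        parsed[name] = (x, y, v)           # v=1: labeled but occluded, v=2: visible
    return parsed

# With the annotation above, the first triplet (229, 256, 2) ends up as
# parsed["nose"] == (229, 256, 2).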

Categories

Currently keypoints are only labeled for the person category (for most medium/large non-crowd person instances).

{
    "id": int,
    "name": str,
    "supercategory": str,
    "keypoints": [str],
    "skeleton": [edge]
}

Compared to object detection, the categories for keypoint detection have two additional fields:

  • “keypoints”: a length k array of keypoint names
  • “skeleton”: defines connectivity via a list of keypoint edge pairs and is used for visualization.
    • E.g. [16, 14] means “left_ankle” connects to “left_knee”

Example

"categories": [
    {
        "supercategory": "person",
        "id": 1,
        "name": "person",
        "keypoints": [
            "nose","left_eye","right_eye","left_ear","right_ear",
            "left_shoulder","right_shoulder","left_elbow","right_elbow",
            "left_wrist","right_wrist","left_hip","right_hip",
            "left_knee","right_knee","left_ankle","right_ankle"
        ],
        "skeleton": [
            [16,14],[14,12],[17,15],[15,13],[12,13],[6,12],[7,13],[6,7],
            [6,8],[7,9],[8,10],[9,11],[2,3],[1,2],[1,3],[2,4],[3,5],[4,6],[5,7]
        ]
    }
]

Visualization: see pycocoDemo.ipynb
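
In the same spirit as pycocoDemo.ipynb, a small hand-rolled sketch of the skeleton visualization with matplotlib could look like the following (note that skeleton edges are 1-indexed into the keypoints list; `img`, `ann`, and `cat` are assumed to have been loaded beforehand):

import matplotlib.pyplot as plt

def draw_skeleton(ax, keypoints, skeleton):
    """keypoints: flat COCO list [x1, y1, v1, ...]; skeleton: 1-indexed edge pairs."""
    xs, ys, vs = keypoints[0::3], keypoints[1::3], keypoints[2::3]
    # draw every labeled joint (v > 0)
    for x, y, v in zip(xs, ys, vs):
        if v > 0:
            ax.plot(x, y, "o", markersize=4)
    # draw a bone only when both of its endpoints are labeled
    for a, b in skeleton:
        i, j = a - 1, b - 1                # skeleton indices are 1-based
        if vs[i] > 0 and vs[j] > 0:
            ax.plot([xs[i], xs[j]], [ys[i], ys[j]], "-")

# Usage sketch:
# fig, ax = plt.subplots()
# ax.imshow(img)                           # the image the annotation refers to
# draw_skeleton(ax, ann["keypoints"], cat["skeleton"])
# plt.show()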

MPII

  • State-of-the-art benchmark for the evaluation of articulated human pose estimation.
  • Includes around 25K images containing over 40K people with annotated body joints. The images were systematically collected using an established taxonomy of everyday human activities.
  • Overall the dataset covers 410 human activities and each image is provided with an activity label. Each image was extracted from a YouTube video and provided with preceding and following un-annotated frames.

Keypoints

Id   Name
0    r ankle
1    r knee
2    r hip
3    l hip
4    l knee
5    l ankle
6    pelvis
7    thorax
8    upper neck
9    head top
10   r wrist
11   r elbow
12   r shoulder
13   l shoulder
14   l elbow
15   l wrist

(figure: MPII joint layout)
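
For convenience, the table above written as a plain Python mapping, plus the left/right joint pairs typically needed when flipping images horizontally for augmentation (derived directly from the ids above; the constant names are just illustrative):

# MPII joint ids (0-15) as listed in the table above.
MPII_JOINTS = {
    0: "r ankle", 1: "r knee", 2: "r hip",
    3: "l hip", 4: "l knee", 5: "l ankle",
    6: "pelvis", 7: "thorax", 8: "upper neck", 9: "head top",
    10: "r wrist", 11: "r elbow", 12: "r shoulder",
    13: "l shoulder", 14: "l elbow", 15: "l wrist",
}

# Left/right pairs, e.g. for horizontal-flip augmentation
# (pelvis, thorax, upper neck and head top map to themselves).
MPII_FLIP_PAIRS = [(0, 5), (1, 4), (2, 3), (10, 15), (11, 14), (12, 13)]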

PoseTrack

PoseTrack is a large-scale benchmark for human pose estimation and tracking in image sequences. It provides a publicly available training and validation set as well as an evaluation server for benchmarking on a held-out test set.

Tasks

Single-frame Pose Estimation

  • The aim of this task is to perform multi-person human pose estimation in single frames.
  • It is similar to the tasks covered by existing benchmarks such as “MPII Human Pose” and the MS COCO Keypoints Challenge.
  • Note that this scenario assumes that body poses are estimated independently in each frame.
  • Evaluation: performed using the standard mean Average Precision (mAP) metric.

Articulated People Tracking

  • This task requires providing temporally consistent poses for all people visible in the video. That is, in addition to estimating each person's pose, it is also required to track their body joints over time.
  • Evaluation: both pose estimation accuracy and pose tracking accuracy are evaluated.
    • Pose estimation accuracy is evaluated using the standard mAP metric.
    • Pose tracking is evaluated according to the CLEAR MOT metrics, the de facto standard for evaluation of multi-target tracking.
  • Trajectory-based measures are also evaluated that count the number of mostly tracked (MT), mostly lost (ML) tracks and the number of times a ground-truth trajectory is fragmented (FM).
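
The bullets above only name the metrics. As a rough, unofficial illustration of the per-joint matching criterion behind MPII/PoseTrack-style pose mAP, a PCKh-style check can be sketched as follows (the 0.5 threshold and the head-box normalization are conventional choices, not taken from the PoseTrack documentation; the official evaluation code also rescales the head-box diagonal by a fixed bias factor, which is omitted here):

import numpy as np

def pckh_correct(pred_xy, gt_xy, head_bbox, thresh=0.5):
    """PCKh-style check: a predicted joint counts as correct if its distance to the
    ground-truth joint is below `thresh` times the head size, where the head size
    is derived from the annotated head bounding box (x, y, w, h)."""
    head_size = np.hypot(head_bbox[2], head_bbox[3])   # diagonal of the head box
    dist = np.linalg.norm(np.asarray(pred_xy) - np.asarray(gt_xy))
    return dist <= thresh * head_size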

Annotations

  • Each person is labeled with a head bounding box and positions of the body joints.
  • Annotations are omitted for people in dense crowds, and in some cases people in upright standing poses are also left unannotated.
  • Ignore regions specify which people in the image were ignored during annotation.
  • Each sequence included in the PoseTrack benchmark corresponds to about 5 seconds of video. The number of frames per sequence may vary, as different videos were recorded at different frame rates (FPS).
    • Training sequences: annotations for 30 consecutive frames centered in the middle of the sequence
    • Validation and test sequences: 30 consecutive frames are annotated, and in addition every 4th frame of the sequence is annotated

Annotation Format

The PoseTrack 2018 file format is based on the Microsoft COCO dataset annotation format.

.json Dictionary Structure

At the top level, each .json file stores a dictionary with three elements:

  • images

    • A list of described images. The list must contain the information for all images referenced by a person description in the file.

    • Each list element is a dictionary and must contain at least the following two fields:

      • file_name: must refer to the original PoseTrack image

      • id (unique int)

      • Example

        has_no_densepose:true
        is_labeled:true
        file_name:"images/val/000342_mpii_test/000000.jpg"
        nframes:100
        frame_id:10003420000
        vid_id:"000342"
        id:10003420000
        
  • annotations

    • Another list of dictionaries

    • Each item of the list describes one detected person and is itself a dictionary. It must have at least the following fields:

      • image_id: int, an image with a corresponding id must be in images,

      • track_id

        • int, the track this person belongs to
        • unique within a frame
      • keypoints: list of floats, with length three times the number of estimated keypoints, in the order x, y, ? for every point. (The third value per keypoint is only there for COCO format consistency and is not used.)

        • Example

          bbox_head: [] # 4 items
          keypoints: [] # 51 items
          track_id: 0
          image_id: 10003420000
          bbox: [] # 4 items
          scores: []
          category_id: 1
          id: 1000342000000
          
      • scores

        • list of floats, one value per estimated keypoint
        • each value between 0 and 1, giving the prediction confidence for that keypoint
  • categories

    • Must be a list containing precisely one item, describing the person structure

    • The dictionary must contain

    • name: person

    • keypoints: a list of strings which must be a superset of [nose, upper_neck, head_top, left_shoulder, right_shoulder, left_elbow, right_elbow, left_wrist, right_wrist, left_hip, right_hip, left_knee, right_knee, left_ankle, right_ankle]. (The order may be arbitrary.)

    • Example

    supercategory: "person"
    id: 1
    name: "person"
    keypoints: [] # 17 items
    skeleton: [] # 19 items
    
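Putting the three top-level lists together, a minimal sketch for loading one such PoseTrack .json file and grouping its annotations into per-track sequences (only the fields described above are used; the path in the usage line is a placeholder):

import json
from collections import defaultdict

def load_tracks(json_path):
    """Group PoseTrack annotations by track_id, keeping the frame each pose belongs to."""
    with open(json_path) as f:
        data = json.load(f)

    # image_id -> image dict, so every annotation can be tied back to its frame
    images = {img["id"]: img for img in data["images"]}

    tracks = defaultdict(list)              # track_id -> list of (file_name, keypoints)
    for ann in data["annotations"]:
        frame = images[ann["image_id"]]     # each image_id must exist in "images"
        tracks[ann["track_id"]].append((frame["file_name"], ann["keypoints"]))
    return dict(tracks)

# tracks = load_tracks("annotations/val/000342_mpii_test.json")  # path is illustrative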

Keypoints Annotations

Keypoints annotations by PoseTrack are similar to COCO keypoints, except

  • left_eye and right_eye are changed to head_bottom and head_top, respectively

  • Annotations for ears are excluded. (I.e., only 15 keypoints are annotated)

    Note: If you look at the annotation closely, there are 51 elements in the keypoints list (3 elements (x, y, v) for each keypoint). In other words, the list still has 17 keypoint slots, even though only 15 of them are ever labeled.

    (figure: PoseTrack keypoint annotation example)

    To (manually) exclude left_ear and right_ear, elements 9 to 14, which correspond to the (x, y, v) triplets of these two keypoints, are all set to 0.

    (screenshot: the ear triplets set to 0 in a PoseTrack annotation)

Id   COCO Keypoints    PoseTrack
0    nose              nose
1    left_eye          head_bottom
2    right_eye         head_top
3    left_ear          left_ear
4    right_ear         right_ear
5    left_shoulder     left_shoulder
6    right_shoulder    right_shoulder
7    left_elbow        left_elbow
8    right_elbow       right_elbow
9    left_wrist        left_wrist
10   right_wrist       right_wrist
11   left_hip          left_hip
12   right_hip         right_hip
13   left_knee         left_knee
14   right_knee        right_knee
15   left_ankle        left_ankle
16   right_ankle       right_ankle
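
A small sketch of the renaming and masking described above: building the PoseTrack name list from the COCO one and zeroing the unused ear slots (elements 9-14) in a 51-element keypoints list (names follow the tables above; the helper function is illustrative):

# COCO keypoint names in their canonical order (ids 0-16).
COCO_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# PoseTrack keeps the 17-slot layout but renames the eye slots and never labels the ears.
POSETRACK_NAMES = list(COCO_NAMES)
POSETRACK_NAMES[1] = "head_bottom"
POSETRACK_NAMES[2] = "head_top"

EAR_IDS = (3, 4)   # left_ear, right_ear

def mask_ears(keypoints):
    """Zero out the (x, y, v) triplets of both ears in a 51-element keypoints list."""
    kps = list(keypoints)
    for kp_id in EAR_IDS:
        kps[3 * kp_id : 3 * kp_id + 3] = [0, 0, 0]   # elements 9-11 and 12-14
    return kps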

Visualization:

(figure: PoseTrack pose and track visualization)
