In this section, we explain the annotation format of AVA in detail:
mmaction2
├── data
│   ├── ava
│   │   ├── annotations
│   │   │   ├── ava_dense_proposals_train.FAIR.recall_93.9.pkl
│   │   │   ├── ava_dense_proposals_val.FAIR.recall_93.9.pkl
│   │   │   ├── ava_dense_proposals_test.FAIR.recall_93.9.pkl
│   │   │   ├── ava_train_v2.1.csv
│   │   │   ├── ava_val_v2.1.csv
│   │   │   ├── ava_train_excluded_timestamps_v2.1.csv
│   │   │   ├── ava_val_excluded_timestamps_v2.1.csv
│   │   │   ├── ava_action_list_v2.1_for_activitynet_2018.pbtxt
In the annotation folder, `ava_dense_proposals_[train/val/test].FAIR.recall_93.9.pkl` are human proposals generated by a human detector. They are used in training, validation, and testing respectively. Take `ava_dense_proposals_train.FAIR.recall_93.9.pkl` as an example. It is a dictionary with 203626 entries. Each key consists of the `videoID` and the `timestamp`. For example, the key `-5KQ66BBWC4,0902` means that the value holds the detection results for the frame at the 902nd second of the video `-5KQ66BBWC4`. Each value in the dictionary is a numpy array of shape $$N \times 5$$, where $$N$$ is the number of detected human bounding boxes in the corresponding frame. Each bounding box has the format $$[x_1, y_1, x_2, y_2, score]$$, with $$0 \le x_1, y_1, x_2, y_2, score \le 1$$. $$(x_1, y_1)$$ is the top-left corner of the bounding box and $$(x_2, y_2)$$ is its bottom-right corner. Coordinates are normalized to the image: $$(0, 0)$$ is the top-left corner of the image and $$(1, 1)$$ is the bottom-right corner.
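As an illustration of the key and box layout described above, here is a minimal sketch that splits a proposal key and scales normalized boxes to pixel coordinates. The helper names `split_key` and `to_pixel_boxes`, the image size, and the box values are all made up for this example; a nested list stands in for the numpy array, and the real dictionary would presumably be read from the `.pkl` file with Python's `pickle` module, which is not shown here.

```python
def split_key(key):
    """Split a key such as '-5KQ66BBWC4,0902' into (video_id, second)."""
    # Video IDs can begin with '-', so split on the last comma only.
    video_id, timestamp = key.rsplit(",", 1)
    return video_id, int(timestamp)


def to_pixel_boxes(boxes, width, height):
    """Scale rows of normalized [x1, y1, x2, y2, score] to pixel units.

    `boxes` can be any iterable of 5-element rows, so an N x 5 numpy
    array works as well as a nested list. The score is left unchanged.
    """
    pixel_boxes = []
    for x1, y1, x2, y2, score in boxes:
        pixel_boxes.append([
            float(x1) * width,
            float(y1) * height,
            float(x2) * width,
            float(y2) * height,
            float(score),
        ])
    return pixel_boxes


# Synthetic stand-in for the proposal dictionary (values are invented).
proposals = {"-5KQ66BBWC4,0902": [[0.25, 0.5, 0.75, 1.0, 0.5]]}

for key, boxes in proposals.items():
    video_id, second = split_key(key)
    print(video_id, second, to_pixel_boxes(boxes, width=640, height=480))
```

Splitting on the last comma keeps the video ID intact whatever characters it contains, and converting the timestamp to `int` drops the zero padding (`0902` becomes `902`).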
In the annotation folder, `ava_[train/val]_v[2.1/2.2].csv` are ground-truth labels for spatio-temporal action detection, which are used during training and validation. Take `ava_train_v2.1.csv` as an example. It is a CSV file with 837318 lines, each of which is the annotation for one human instance in one frame. For example, the first line of `ava_train_v2.1.csv` is `-5KQ66BBWC4,0902,0.077,0.151,0.283,0.811,80,1`. The first two items, `-5KQ66BBWC4` and `0902`, indicate that the line corresponds to the 902nd second of the video `-5KQ66BBWC4`. The next four items, $$[0.077(x_1), 0.151(y_1), 0.283(x_2), 0.811(y_2)]$$, give the location of the bounding box, in the same format as the human proposals. The next item, `80`, is the action label. The last item, `1`, is the ID of this bounding box.
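The field order described above can be sketched as a small parser. This is only an illustration: the `AvaLabel` type, its field names, and `parse_ava_row` are hypothetical names chosen for this example, not part of MMAction2.

```python
import csv
import io
from typing import NamedTuple, Tuple


class AvaLabel(NamedTuple):
    """One CSV line: one human instance in one frame (names are illustrative)."""
    video_id: str
    timestamp: int                          # second within the video
    box: Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)
    action_id: int                          # action label
    box_id: int                             # ID of this bounding box


def parse_ava_row(row):
    """Convert one 8-field CSV row into an AvaLabel."""
    video_id, timestamp, x1, y1, x2, y2, action_id, box_id = row
    return AvaLabel(
        video_id=video_id,
        timestamp=int(timestamp),
        box=(float(x1), float(y1), float(x2), float(y2)),
        action_id=int(action_id),
        box_id=int(box_id),
    )


# The first line of ava_train_v2.1.csv, as quoted in the text.
sample = "-5KQ66BBWC4,0902,0.077,0.151,0.283,0.811,80,1\n"
labels = [parse_ava_row(row) for row in csv.reader(io.StringIO(sample))]
print(labels[0])
```

To read the real file, the `io.StringIO` object would be replaced by an opened file handle.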
`ava_[train/val]_excluded_timestamps_v[2.1/2.2].csv` contains excluded timestamps, which are not used during training or validation. Each line has the format `video_id, second_idx`.
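One possible way to apply such a list is to load it into a set and drop matching frames. The sketch below assumes only the two-column `video_id, second_idx` layout stated above; the function names, the video IDs, and the timestamps are invented for illustration.

```python
import csv


def load_excluded(lines):
    """Build a set of (video_id, second) pairs from 'video_id, second_idx' lines."""
    excluded = set()
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        video_id, second_idx = row[0].strip(), row[1].strip()
        excluded.add((video_id, int(second_idx)))
    return excluded


def keep_frames(frames, excluded):
    """Return the (video_id, second) frames that are not excluded."""
    return [frame for frame in frames if frame not in excluded]


# Invented sample content; real files would be opened and passed in instead.
sample_lines = ["vid_a,0903", "vid_b,1200"]
excluded = load_excluded(sample_lines)

frames = [("vid_a", 902), ("vid_a", 903), ("vid_b", 1200)]
kept = keep_frames(frames, excluded)
print(kept)
```

Storing the timestamp as an integer means zero-padded and unpadded values such as `0903` and `903` compare equal.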
`ava_action_list_v[2.1/2.2]_for_activitynet_[2018/2019].pbtxt` contains the label map of the AVA dataset, which maps each action name to its label index.