SlowOnly

Abstract

We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA.

Citation

@inproceedings{feichtenhofer2019slowfast,
  title={Slowfast networks for video recognition},
  author={Feichtenhofer, Christoph and Fan, Haoqi and Malik, Jitendra and He, Kaiming},
  booktitle={Proceedings of the IEEE international conference on computer vision},
  pages={6202--6211},
  year={2019}
}

Model Zoo

Kinetics-400

config	resolution	gpus	backbone	pretrain	top1 acc	top5 acc	inference_time(video/s)	gpu_mem(M)	ckpt	log	json
slowonly_r50_4x16x1_256e_kinetics400_rgb	short-side 256	8x4	ResNet50	None	72.76	90.51	x	3168	ckpt	log	json
slowonly_r50_video_4x16x1_256e_kinetics400_rgb	short-side 320	8x2	ResNet50	None	72.90	90.82	x	8472	ckpt	log	json
slowonly_r50_8x8x1_256e_kinetics400_rgb	short-side 256	8x4	ResNet50	None	74.42	91.49	x	5820	ckpt	log	json
slowonly_r50_4x16x1_256e_kinetics400_rgb	short-side 320	8x2	ResNet50	None	73.02	90.77	4.0 (40x3 frames)	3168	ckpt	log	json
slowonly_r50_8x8x1_256e_kinetics400_rgb	short-side 320	8x3	ResNet50	None	74.93	91.92	2.3 (80x3 frames)	5820	ckpt	log	json
slowonly_imagenet_pretrained_r50_4x16x1_150e_kinetics400_rgb	short-side 320	8x2	ResNet50	ImageNet	73.39	91.12	x	3168	ckpt	log	json
slowonly_imagenet_pretrained_r50_8x8x1_150e_kinetics400_rgb	short-side 320	8x4	ResNet50	ImageNet	75.55	92.04	x	5820	ckpt	log	json
slowonly_nl_embedded_gaussian_r50_4x16x1_150e_kinetics400_rgb	short-side 320	8x2	ResNet50	ImageNet	74.54	91.73	x	4435	ckpt	log	json
slowonly_nl_embedded_gaussian_r50_8x8x1_150e_kinetics400_rgb	short-side 320	8x4	ResNet50	ImageNet	76.07	92.42	x	8895	ckpt	log	json
slowonly_r50_4x16x1_256e_kinetics400_flow	short-side 320	8x2	ResNet50	ImageNet	61.79	83.62	x	8450	ckpt	log	json
slowonly_r50_8x8x1_196e_kinetics400_flow	short-side 320	8x4	ResNet50	ImageNet	65.76	86.25	x	8455	ckpt	log	json

Kinetics-400 Data Benchmark

In the data benchmark, we compare three data preprocessing methods: (1) resize the video to 340x256, (2) resize the short edge of the video to 320px, (3) resize the short edge of the video to 256px.

config	resolution	gpus	backbone	Input	pretrain	top1 acc	top5 acc	testing protocol	ckpt	log	json
slowonly_r50_randomresizedcrop_340x256_4x16x1_256e_kinetics400_rgb	340x256	8x2	ResNet50	4x16	None	71.61	90.05	10 clips x 3 crops	ckpt	log	json
slowonly_r50_randomresizedcrop_320p_4x16x1_256e_kinetics400_rgb	short-side 320	8x2	ResNet50	4x16	None	73.02	90.77	10 clips x 3 crops	ckpt	log	json
slowonly_r50_randomresizedcrop_256p_4x16x1_256e_kinetics400_rgb	short-side 256	8x4	ResNet50	4x16	None	72.76	90.51	10 clips x 3 crops	ckpt	log	json
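The three preprocessing methods compared above differ in whether the aspect ratio is preserved. The following is a minimal sketch of the resulting frame sizes, not the pipeline used to produce the benchmark; the function name `resized_size`, the method labels, and the rounding to the nearest pixel are assumptions made for illustration.

```python
def resized_size(width, height, method):
    """Return (width, height) of a frame after one of the three resize methods.

    method is one of the hypothetical labels "340x256", "short-320", "short-256".
    """
    if method == "340x256":
        # Fixed target size: the aspect ratio of the source is not preserved.
        return 340, 256

    short = {"short-320": 320, "short-256": 256}[method]
    # Short-edge resize: scale so the shorter side equals `short`,
    # keeping the aspect ratio (rounding is an assumption of this sketch).
    if width <= height:
        return short, int(round(height * short / width))
    return int(round(width * short / height)), short


if __name__ == "__main__":
    for method in ("340x256", "short-320", "short-256"):
        print(method, resized_size(1280, 720, method))
```

For a 1280x720 source, the fixed 340x256 target distorts the frame, while the two short-edge methods keep the 16:9 shape and differ only in resolution.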

Kinetics-400 OmniSource Experiments

config	resolution	backbone	pretrain	w. OmniSource	top1 acc	top5 acc	ckpt	log	json
slowonly_r50_4x16x1_256e_kinetics400_rgb	short-side 320	ResNet50	None	❌	73.0	90.8	ckpt	log	json
x	x	ResNet50	None	✔️	76.8	92.5	ckpt	x	x
slowonly_r101_8x8x1_196e_kinetics400_rgb	x	ResNet101	None	❌	76.5	92.7	ckpt	x	x
x	x	ResNet101	None	✔️	80.4	94.4	ckpt	x	x

Kinetics-600

config	resolution	gpus	backbone	pretrain	top1 acc	top5 acc	ckpt	log	json
slowonly_r50_video_8x8x1_256e_kinetics600_rgb	short-side 256	8x4	ResNet50	None	77.5	93.7	ckpt	log	json

Kinetics-700

config	resolution	gpus	backbone	pretrain	top1 acc	top5 acc	ckpt	log	json
slowonly_r50_video_8x8x1_256e_kinetics700_rgb	short-side 256	8x4	ResNet50	None	65.0	86.1	ckpt	log	json

GYM99

config	resolution	gpus	backbone	pretrain	top1 acc	mean class acc	ckpt	log	json
slowonly_imagenet_pretrained_r50_4x16x1_120e_gym99_rgb	short-side 256	8x2	ResNet50	ImageNet	79.3	70.2	ckpt	log	json
slowonly_k400_pretrained_r50_4x16x1_120e_gym99_flow	short-side 256	8x2	ResNet50	Kinetics	80.3	71.0	ckpt	log	json
1:1 Fusion					83.7	74.8
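The GYM99 table reports both top1 acc and mean class acc. As a minimal sketch of how the two metrics differ (the function names are illustrative and are not the evaluation code used for the table): top-1 accuracy weights every video equally, while mean class accuracy averages the per-class accuracies, so rare classes count as much as frequent ones.

```python
from collections import defaultdict


def top1_accuracy(preds, labels):
    """Fraction of samples whose predicted class equals the label."""
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


def mean_class_accuracy(preds, labels):
    """Average of per-class accuracies over the classes present in labels."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p, y in zip(preds, labels):
        totals[y] += 1
        hits[y] += int(p == y)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


if __name__ == "__main__":
    # Imbalanced toy data: class 0 has four samples, class 1 has one.
    labels = [0, 0, 0, 0, 1]
    preds = [0, 0, 0, 0, 0]
    print(top1_accuracy(preds, labels))
    print(mean_class_accuracy(preds, labels))
```

On imbalanced data the second number is lower whenever the errors concentrate in small classes, which is consistent with the gap between the two columns in the table.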

Jester

config	resolution	gpus	backbone	pretrain	top1 acc	ckpt	log	json
slowonly_imagenet_pretrained_r50_8x8x1_64e_jester_rgb	height 100	8	ResNet50	ImageNet	97.2	ckpt	log	json

HMDB51

config	gpus	backbone	pretrain	top1 acc	top5 acc	gpu_mem(M)	ckpt	log	json
slowonly_imagenet_pretrained_r50_8x4x1_64e_hmdb51_rgb	8	ResNet50	ImageNet	37.52	71.50	5812	ckpt	log	json
slowonly_k400_pretrained_r50_8x4x1_40e_hmdb51_rgb	8	ResNet50	Kinetics400	65.95	91.05	5812	ckpt	log	json

UCF101

config	gpus	backbone	pretrain	top1 acc	top5 acc	gpu_mem(M)	ckpt	log	json
slowonly_imagenet_pretrained_r50_8x4x1_64e_ucf101_rgb	8	ResNet50	ImageNet	71.35	89.35	5812	ckpt	log	json
slowonly_k400_pretrained_r50_8x4x1_40e_ucf101_rgb	8	ResNet50	Kinetics400	92.78	99.42	5812	ckpt	log	json

Something-Something V1

config	gpus	backbone	pretrain	top1 acc	top5 acc	gpu_mem(M)	ckpt	log	json
slowonly_imagenet_pretrained_r50_8x4x1_64e_sthv1_rgb	8	ResNet50	ImageNet	47.76	77.49	7759	ckpt	log	json

:::{note}

The gpus column indicates the number of GPUs used to obtain the checkpoint. Note that the provided configs assume 8 GPUs by default.
According to the Linear Scaling Rule, you should scale the learning rate in proportion to the total batch size if you use a different number of GPUs or videos per GPU,
e.g., lr=0.01 for 4 GPUs x 2 videos/GPU and lr=0.08 for 16 GPUs x 4 videos/GPU.
The inference_time is measured with this benchmark script, using the frame sampling strategy of the test setting. It covers only the model inference time, excluding IO and pre-processing time. For each setting, we use 1 GPU and a batch size (videos per GPU) of 1.
The validation set of Kinetics400 that we used consists of 19796 videos. These videos are available at Kinetics400-Validation. The corresponding data list (each line has the format 'video_id, num_frames, label_index') and the label map are also available.

:::
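The Linear Scaling Rule from the note can be written as a one-line computation. This is a minimal sketch, assuming the reference point is the note's first example (lr=0.01 at 4 GPUs x 2 videos per GPU, i.e. a total batch of 8); the helper name `scaled_lr` and its parameters are hypothetical and not part of the codebase.

```python
def scaled_lr(num_gpus, videos_per_gpu, base_lr=0.01, base_batch=8):
    """Scale the learning rate linearly with the total batch size.

    base_lr and base_batch are taken from the note's example
    (lr=0.01 for 4 GPUs x 2 videos per GPU); adjust them to match
    the reference setting of the config you actually use.
    """
    total_batch = num_gpus * videos_per_gpu
    return base_lr * total_batch / base_batch


if __name__ == "__main__":
    print(scaled_lr(4, 2))    # reference setting from the note
    print(scaled_lr(16, 4))   # second setting from the note
    print(scaled_lr(2, 2))    # a smaller setup, for illustration
```

The value returned is what you would set as `lr` in the optimizer section of the config when changing the GPU count or the videos per GPU.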

For more details on data preparation, refer to the corresponding parts of Data Preparation.

Train

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the SlowOnly model on the Kinetics-400 dataset deterministically, with periodic validation.

python tools/train.py configs/recognition/slowonly/slowonly_r50_4x16x1_256e_kinetics400_rgb.py \
    --work-dir work_dirs/slowonly_r50_4x16x1_256e_kinetics400_rgb \
    --validate --seed 0 --deterministic

For more details, refer to the Training setting part of getting_started.

Test

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the SlowOnly model on the Kinetics-400 dataset and dump the result to a json file.

python tools/test.py configs/recognition/slowonly/slowonly_r50_4x16x1_256e_kinetics400_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json --average-clips=prob

For more details, refer to the Test a dataset part of getting_started.
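The test example passes --average-clips=prob, and the benchmark tables use a multi-clip protocol such as 10 clips x 3 crops. As a rough sketch of what clip averaging does, assuming that "prob" means applying softmax to each clip's class scores before taking the mean and that a "score" mode averages the raw scores instead (the function names below are illustrative, not the library's API):

```python
import math


def softmax(scores):
    """Numerically stable softmax over one clip's class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_clips(clip_scores, mode="prob"):
    """Fuse per-clip class scores into one video-level prediction.

    clip_scores: list of per-clip score lists, one entry per class.
    mode="prob" averages softmax probabilities (assumed behaviour);
    mode="score" averages the raw scores.
    """
    if mode == "prob":
        clip_scores = [softmax(s) for s in clip_scores]
    num_clips = len(clip_scores)
    return [sum(column) / num_clips for column in zip(*clip_scores)]


if __name__ == "__main__":
    # Two clips, two classes: the clips disagree on the top class.
    clips = [[2.0, 0.0], [0.0, 1.0]]
    print(average_clips(clips, mode="prob"))
    print(average_clips(clips, mode="score"))
```

Averaging probabilities bounds the influence of any single clip, because each clip contributes a distribution that sums to 1, whereas averaging raw scores lets a clip with large logits dominate.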