The explosive growth in video streaming poses the challenge of performing video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN based methods achieve good performance but are computationally intensive, making them expensive to deploy. In this paper, we propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance. Specifically, it can achieve the performance of 3D CNNs while maintaining the complexity of 2D CNNs. TSM shifts part of the channels along the temporal dimension, thus facilitating information exchange among neighboring frames. It can be inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters. We also extend TSM to the online setting, which enables real-time low-latency online video recognition and video object detection. TSM is accurate and efficient: it ranked first on the Something-Something leaderboard upon publication; on Jetson Nano and Galaxy Note8, it achieves a low latency of 13ms and 35ms for online video recognition.
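To make the channel shift described in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the repository's implementation: the function name `temporal_shift`, the `(N, T, C, H, W)` layout, the `shift_div=8` fraction, and zero padding at the clip boundaries are all choices made for this example.

```python
import numpy as np


def temporal_shift(x, shift_div=8):
    """Shift a fraction of the channels along the time axis.

    x: array of shape (N, T, C, H, W).
    shift_div: 1/shift_div of the channels move in each direction
        (an assumed value for this sketch).
    """
    n, t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    # First group: each frame receives features from the next frame.
    out[:, :-1, :fold] = x[:, 1:, :fold]
    # Second group: each frame receives features from the previous frame.
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]
    # Remaining channels stay where they are.
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]
    return out


# Tiny demo: 1 clip, 4 frames, 8 channels, 1x1 spatial size.
# Every channel of frame t holds the value t, so the shift is easy to read.
frames = np.arange(4, dtype=np.float32).reshape(1, 4, 1, 1, 1)
x = np.broadcast_to(frames, (1, 4, 8, 1, 1)).copy()
shifted = temporal_shift(x)

print(shifted[0, :, 0, 0, 0])  # [1. 2. 3. 0.]  pulled from the next frame
print(shifted[0, :, 1, 0, 0])  # [0. 0. 1. 2.]  pulled from the previous frame
print(shifted[0, :, 2, 0, 0])  # [0. 1. 2. 3.]  unchanged
```

The shift only moves data between frames, so it adds no multiplications and no learnable parameters, which is the property the abstract refers to.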
@inproceedings{lin2019tsm,
title={TSM: Temporal Shift Module for Efficient Video Understanding},
author={Lin, Ji and Gan, Chuang and Han, Song},
booktitle={Proceedings of the IEEE International Conference on Computer Vision},
year={2019}
}
@article{NonLocal2018,
author = {Xiaolong Wang and Ross Girshick and Abhinav Gupta and Kaiming He},
title = {Non-local Neural Networks},
journal = {CVPR},
year = {2018}
}
config | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | inference_time(video/s) | gpu_mem(M) | ckpt | log | json |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tsm_r50_1x1x8_50e_kinetics400_rgb | 340x256 | 8 | ResNet50 | ImageNet | 70.24 | 89.56 | 70.36 | 89.49 | 74.0 (8x1 frames) | 7079 | ckpt | log | json |
tsm_r50_1x1x8_50e_kinetics400_rgb | short-side 256 | 8 | ResNet50 | ImageNet | 70.59 | 89.52 | x | x | x | 7079 | ckpt | log | json |
tsm_r50_1x1x8_50e_kinetics400_rgb | short-side 320 | 8 | ResNet50 | ImageNet | 70.73 | 89.81 | x | x | x | 7079 | ckpt | log | json |
tsm_r50_1x1x8_100e_kinetics400_rgb | short-side 320 | 8 | ResNet50 | ImageNet | 71.90 | 90.03 | x | x | x | 7079 | ckpt | log | json |
tsm_r50_gpu_normalize_1x1x8_50e_kinetics400_rgb | short-side 256 | 8 | ResNet50 | ImageNet | 70.48 | 89.40 | x | x | x | 7076 | ckpt | log | json |
tsm_r50_video_1x1x8_50e_kinetics400_rgb | short-side 256 | 8 | ResNet50 | ImageNet | 70.25 | 89.66 | 70.36 | 89.49 | 74.0 (8x1 frames) | 7077 | ckpt | log | json |
tsm_r50_dense_1x1x8_50e_kinetics400_rgb | short-side 320 | 8 | ResNet50 | ImageNet | 73.46 | 90.84 | x | x | x | 7079 | ckpt | log | json |
tsm_r50_dense_1x1x8_100e_kinetics400_rgb | short-side 320 | 8 | ResNet50 | ImageNet | 74.55 | 91.74 | x | x | x | 7079 | ckpt | log | json |
tsm_r50_1x1x16_50e_kinetics400_rgb | 340x256 | 8 | ResNet50 | ImageNet | 72.09 | 90.37 | 70.67 | 89.98 | 47.0 (16x1 frames) | 10404 | ckpt | log | json |
tsm_r50_1x1x16_50e_kinetics400_rgb | short-side 256 | 8x4 | ResNet50 | ImageNet | 71.89 | 90.73 | x | x | x | 10398 | ckpt | log | json |
tsm_r50_1x1x16_100e_kinetics400_rgb | short-side 320 | 8 | ResNet50 | ImageNet | 72.80 | 90.75 | x | x | x | 10398 | ckpt | log | json |
tsm_nl_embedded_gaussian_r50_1x1x8_50e_kinetics400_rgb | short-side 320 | 8x4 | ResNet50 | ImageNet | 72.03 | 90.25 | 71.81 | 90.36 | x | 8931 | ckpt | log | json |
tsm_nl_gaussian_r50_1x1x8_50e_kinetics400_rgb | short-side 320 | 8x4 | ResNet50 | ImageNet | 70.70 | 89.90 | x | x | x | 10125 | ckpt | log | json |
tsm_nl_dot_product_r50_1x1x8_50e_kinetics400_rgb | short-side 320 | 8x4 | ResNet50 | ImageNet | 71.60 | 90.34 | x | x | x | 8358 | ckpt | log | json |
tsm_mobilenetv2_dense_1x1x8_100e_kinetics400_rgb | short-side 320 | 8 | MobileNetV2 | ImageNet | 68.46 | 88.64 | x | x | x | 3385 | ckpt | log | json |
tsm_mobilenetv2_dense_1x1x8_kinetics400_rgb_port | short-side 320 | 8 | MobileNetV2 | ImageNet | 69.89 | 89.01 | x | x | x | 3385 | infer_ckpt | x | x |
config | gpus | backbone | pretrain | top1 acc | top5 acc | gpu_mem(M) | ckpt | log | json |
---|---|---|---|---|---|---|---|---|---|
tsm_r50_video_1x1x8_50e_diving48_rgb | 8 | ResNet50 | ImageNet | 75.99 | 97.16 | 7070 | ckpt | log | json |
tsm_r50_video_1x1x16_50e_diving48_rgb | 8 | ResNet50 | ImageNet | 81.62 | 97.66 | 7070 | ckpt | log | json |
config | resolution | gpus | backbone | pretrain | top1 acc (efficient/accurate) | top5 acc (efficient/accurate) | reference top1 acc (efficient/accurate) | reference top5 acc (efficient/accurate) | gpu_mem(M) | ckpt | log | json |
---|---|---|---|---|---|---|---|---|---|---|---|---|
tsm_r50_1x1x8_50e_sthv1_rgb | height 100 | 8 | ResNet50 | ImageNet | 45.58 / 47.70 | 75.02 / 76.12 | 45.50 / 47.33 | 74.34 / 76.60 | 7077 | ckpt | log | json |
tsm_r50_flip_1x1x8_50e_sthv1_rgb | height 100 | 8 | ResNet50 | ImageNet | 47.10 / 48.51 | 76.02 / 77.56 | 45.50 / 47.33 | 74.34 / 76.60 | 7077 | ckpt | log | json |
tsm_r50_randaugment_1x1x8_50e_sthv1_rgb | height 100 | 8 | ResNet50 | ImageNet | 47.16 / 48.90 | 76.07 / 77.92 | 45.50 / 47.33 | 74.34 / 76.60 | 7077 | ckpt | log | json |
tsm_r50_ptv_randaugment_1x1x8_50e_sthv1_rgb | height 100 | 8 | ResNet50 | ImageNet | 47.65 / 48.66 | 76.67 / 77.41 | 45.50 / 47.33 | 74.34 / 76.60 | 7077 | ckpt | log | json |
tsm_r50_ptv_augmix_1x1x8_50e_sthv1_rgb | height 100 | 8 | ResNet50 | ImageNet | 46.26 / 47.68 | 75.92 / 76.49 | 45.50 / 47.33 | 74.34 / 76.60 | 7077 | ckpt | log | json |
tsm_r50_flip_randaugment_1x1x8_50e_sthv1_rgb | height 100 | 8 | ResNet50 | ImageNet | 47.85 / 50.31 | 76.78 / 78.18 | 45.50 / 47.33 | 74.34 / 76.60 | 7077 | ckpt | log | json |
tsm_r50_1x1x16_50e_sthv1_rgb | height 100 | 8 | ResNet50 | ImageNet | 47.77 / 49.03 | 76.82 / 77.83 | 47.05 / 48.61 | 76.40 / 77.96 | 10390 | ckpt | log | json |
tsm_r101_1x1x8_50e_sthv1_rgb | height 100 | 8 | ResNet101 | ImageNet | 46.09 / 48.59 | 75.41 / 77.10 | 46.64 / 48.13 | 75.40 / 77.31 | 9800 | ckpt | log | json |
config | resolution | gpus | backbone | pretrain | top1 acc (efficient/accurate) | top5 acc (efficient/accurate) | reference top1 acc (efficient/accurate) | reference top5 acc (efficient/accurate) | gpu_mem(M) | ckpt | log | json |
---|---|---|---|---|---|---|---|---|---|---|---|---|
tsm_r50_1x1x8_50e_sthv2_rgb | height 256 | 8 | ResNet50 | ImageNet | 59.11 / 61.82 | 85.39 / 86.80 | xx / 61.2 | xx / xx | 7069 | ckpt | log | json |
tsm_r50_1x1x16_50e_sthv2_rgb | height 256 | 8 | ResNet50 | ImageNet | 61.06 / 63.19 | 86.66 / 87.93 | xx / 63.1 | xx / xx | 10400 | ckpt | log | json |
tsm_r101_1x1x8_50e_sthv2_rgb | height 256 | 8 | ResNet101 | ImageNet | 60.88 / 63.84 | 86.56 / 88.30 | xx / 63.3 | xx / xx | 9727 | ckpt | log | json |
config | resolution | gpus | backbone | pretrain | top1 acc (efficient/accurate) | top5 acc (efficient/accurate) | delta top1 acc (efficient/accurate) | delta top5 acc (efficient/accurate) | ckpt | log | json |
---|---|---|---|---|---|---|---|---|---|---|---|
tsm_r50_mixup_1x1x8_50e_sthv1_rgb | height 100 | 8 | ResNet50 | ImageNet | 46.35 / 48.49 | 75.07 / 76.88 | +0.77 / +0.79 | +0.05 / +0.76 | ckpt | log | json |
tsm_r50_cutmix_1x1x8_50e_sthv1_rgb | height 100 | 8 | ResNet50 | ImageNet | 45.92 / 47.46 | 75.23 / 76.71 | +0.34 / -0.24 | +0.21 / +0.59 | ckpt | log | json |
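For readers unfamiliar with the augmentation behind the mixup row, the following NumPy sketch shows the general mixup idea: two samples and their one-hot labels are blended with a weight drawn from a Beta distribution. It is a sketch, not the code used to produce these numbers. The function name `mixup_batch` and the tensor layout are hypothetical, and it assumes the `alpha=0.2` that appears in the note further down is the Beta parameter. CutMix, which pastes a rectangular patch instead of blending whole frames, is not covered here.

```python
import numpy as np


def mixup_batch(imgs, labels, alpha=0.2, rng=None):
    """Blend each sample with a randomly chosen partner from the same batch.

    imgs: array of shape (batch, ...); labels: one-hot array (batch, classes).
    alpha: Beta distribution parameter (assumed meaning of `alpha=0.2`).
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)          # mixing weight in [0, 1]
    perm = rng.permutation(len(imgs))     # partner index for every sample
    mixed_imgs = lam * imgs + (1.0 - lam) * imgs[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_imgs, mixed_labels, lam


rng = np.random.default_rng(0)
imgs = rng.random((4, 8, 3, 16, 16))      # (batch, frames, channels, H, W)
labels = np.eye(5)[[0, 1, 2, 3]]          # one-hot labels for 5 classes
mixed_imgs, mixed_labels, lam = mixup_batch(imgs, labels, rng=rng)
```

With a small `alpha`, the sampled weight tends to fall close to 0 or 1, so most mixed clips stay close to one of the two originals.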
config | resolution | gpus | backbone | pretrain | top1 acc (efficient/accurate) | ckpt | log | json |
---|---|---|---|---|---|---|---|---|
tsm_r50_1x1x8_50e_jester_rgb | height 100 | 8 | ResNet50 | ImageNet | 96.5 / 97.2 | ckpt | log | json |
config | gpus | backbone | pretrain | top1 acc | top5 acc | gpu_mem(M) | ckpt | log | json |
---|---|---|---|---|---|---|---|---|---|
tsm_k400_pretrained_r50_1x1x8_25e_hmdb51_rgb | 8 | ResNet50 | Kinetics400 | 72.68 | 92.03 | 10388 | ckpt | log | json |
tsm_k400_pretrained_r50_1x1x16_25e_hmdb51_rgb | 8 | ResNet50 | Kinetics400 | 74.77 | 93.86 | 10388 | ckpt | log | json |
config | gpus | backbone | pretrain | top1 acc | top5 acc | gpu_mem(M) | ckpt | log | json |
---|---|---|---|---|---|---|---|---|---|
tsm_k400_pretrained_r50_1x1x8_25e_ucf101_rgb | 8 | ResNet50 | Kinetics400 | 94.50 | 99.58 | 10389 | ckpt | log | json |
tsm_k400_pretrained_r50_1x1x16_25e_ucf101_rgb | 8 | ResNet50 | Kinetics400 | 94.58 | 99.37 | 10389 | ckpt | log | json |
:::{note}
...
test_pipeline = [
    dict(
        type='SampleFrames',
        clip_len=1,
        frame_interval=1,
        num_clips=16,  # `num_clips = 8` when using 8 segments
        twice_sample=True,  # set `twice_sample=True` for twice sample in accurate setting
        test_mode=True),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    # dict(type='CenterCrop', crop_size=224), it is used for efficient setting
    dict(type='ThreeCrop', crop_size=256),  # it is used for accurate setting
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
alpha=0.2
:::
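The difference between the efficient and accurate settings in the pipeline above can be captured in a small helper. This is a sketch built only from the comments in that pipeline: the function name `build_test_pipeline` is hypothetical, the default `img_norm_cfg` values are assumed ImageNet statistics that should be replaced by the ones in your config, and setting `twice_sample=False` for the efficient setting is inferred rather than stated.

```python
def build_test_pipeline(num_segments=8, accurate=False, img_norm_cfg=None):
    """Return a test pipeline (list of dicts) for the chosen setting."""
    if img_norm_cfg is None:
        # Assumed normalization values; use the ones from your own config.
        img_norm_cfg = dict(
            mean=[123.675, 116.28, 103.53],
            std=[58.395, 57.12, 57.375],
            to_bgr=False)
    # Accurate setting: three crops at 256. Efficient setting: one center crop at 224.
    if accurate:
        crop = dict(type='ThreeCrop', crop_size=256)
    else:
        crop = dict(type='CenterCrop', crop_size=224)
    return [
        dict(
            type='SampleFrames',
            clip_len=1,
            frame_interval=1,
            num_clips=num_segments,   # 8 or 16 segments
            twice_sample=accurate,    # sample twice only in the accurate setting
            test_mode=True),
        dict(type='RawFrameDecode'),
        dict(type='Resize', scale=(-1, 256)),
        crop,
        dict(type='Normalize', **img_norm_cfg),
        dict(type='FormatShape', input_format='NCHW'),
        dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
        dict(type='ToTensor', keys=['imgs']),
    ]


efficient = build_test_pipeline(num_segments=8, accurate=False)
accurate = build_test_pipeline(num_segments=16, accurate=True)
```

The returned list can be assigned to `test_pipeline` in a config file; nothing here imports or calls the library itself.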
For more details on data preparation, refer to the corresponding parts in Data Preparation.
You can use the following command to train a model.
python tools/train.py ${CONFIG_FILE} [optional arguments]
Example: train the TSM model on the Kinetics-400 dataset in deterministic mode with periodic validation.
python tools/train.py configs/recognition/tsm/tsm_r50_1x1x8_50e_kinetics400_rgb.py \
--work-dir work_dirs/tsm_r50_1x1x8_50e_kinetics400_rgb \
--validate --seed 0 --deterministic
For more details, refer to the Training setting part in getting_started.
You can use the following command to test a model.
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
Example: test the TSM model on the Kinetics-400 dataset and dump the result to a JSON file.
python tools/test.py configs/recognition/tsm/tsm_r50_1x1x8_50e_kinetics400_rgb.py \
checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
--out result.json
For more details, refer to the Test a dataset part in getting_started.
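The two metrics named in the test command, `top_k_accuracy` and `mean_class_accuracy`, can be sketched in NumPy as follows. This illustrates the usual definitions and is not the library's evaluation code; the function signatures and the toy scores are assumptions made for the example.

```python
import numpy as np


def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores."""
    topk = np.argsort(scores, axis=1)[:, -k:]
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())


def mean_class_accuracy(scores, labels):
    """Average of per-class top-1 accuracies over the classes that occur."""
    preds = scores.argmax(axis=1)
    per_class = [(preds[labels == c] == c).mean() for c in np.unique(labels)]
    return float(np.mean(per_class))


# Toy scores for 4 videos and 3 classes.
scores = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
])
labels = np.array([0, 1, 1, 2])

print(top_k_accuracy(scores, labels, k=1))   # 0.5
print(top_k_accuracy(scores, labels, k=2))   # 0.75
print(mean_class_accuracy(scores, labels))   # 0.5
```

Mean class accuracy weights every class equally, so it differs from top-1 accuracy whenever the classes are unevenly represented in the test set.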