TimeSformer

Introduction

@misc{bertasius2021spacetime,
    title   = {Is Space-Time Attention All You Need for Video Understanding?},
    author  = {Gedas Bertasius and Heng Wang and Lorenzo Torresani},
    year    = {2021},
    eprint  = {2102.05095},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Model Zoo

Kinetics-400

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Inference time (video/s)	GPU memory (M)	ckpt	log	json
timesformer_divST_8x32x1_15e_kinetics400_rgb	short-side 320	8	TimeSformer	ImageNet-21K	77.92	93.29	x	17874	ckpt	log	json
timesformer_jointST_8x32x1_15e_kinetics400_rgb	short-side 320	8	TimeSformer	ImageNet-21K	77.01	93.08	x	25658	ckpt	log	json
timesformer_spaceOnly_8x32x1_15e_kinetics400_rgb	short-side 320	8	TimeSformer	ImageNet-21K	76.93	92.90	x	12750	ckpt	log	json

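Notes:

The GPUs column gives the number of GPUs (32G V100) used to produce the released checkpoint. By default, the configs provided by MMAction2 assume training on 8 GPUs.
According to the Linear Scaling Rule, if you use a different number of GPUs or a different number of videos per GPU, scale the learning rate in proportion to the total batch size.
For example, lr=0.005 corresponds to 8 GPUs x 8 videos/gpu, and lr=0.004375 corresponds to 8 GPUs x 7 videos/gpu.
MMAction2 keeps the test setting consistent with the original code (three crop x 1 clip).
The pretrained model vit_base_patch16_224.pth used by TimeSformer is converted from vision_transformer.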
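
The linear scaling rule in the notes above can be sketched as a small helper. This is only an illustration: the function name `scale_lr` and its parameters are hypothetical and not part of MMAction2, and the 4-GPU setting in the second call is an assumed example rather than a tested configuration.

```python
def scale_lr(base_lr, base_gpus, base_videos_per_gpu, gpus, videos_per_gpu):
    """Scale the learning rate in proportion to the total batch size.

    Hypothetical helper for illustration only; it is not an MMAction2 API.
    """
    base_batch = base_gpus * base_videos_per_gpu  # reference total batch size
    new_batch = gpus * videos_per_gpu             # total batch size of the new run
    return base_lr * new_batch / base_batch


# Reference setting from the notes: lr=0.005 for 8 GPUs x 8 videos/gpu.
# 8 GPUs x 7 videos/gpu gives roughly 0.004375, matching the note.
print(scale_lr(0.005, 8, 8, gpus=8, videos_per_gpu=7))

# Assumed example: halving the GPU count halves the learning rate.
print(scale_lr(0.005, 8, 8, gpus=4, videos_per_gpu=8))
```

The resulting value would then be written into the optimizer settings of the config file you train with.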

For details on preparing the dataset, see the Kinetics400 section of the Data Preparation documentation.

How to Train

Use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the TimeSformer model on the Kinetics400 dataset deterministically, with periodic validation.

python tools/train.py configs/recognition/timesformer/timesformer_divST_8x32x1_15e_kinetics400_rgb.py \
    --work-dir work_dirs/timesformer_divST_8x32x1_15e_kinetics400_rgb \
    --validate --seed 0 --deterministic

For more training details, see the Training Setting section of the Getting Started guide.

How to Test

Use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the TimeSformer model on the Kinetics400 dataset and dump the results to a JSON file.

python tools/test.py configs/recognition/timesformer/timesformer_divST_8x32x1_15e_kinetics400_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json
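
To make the two metrics requested with --eval more concrete, here is a minimal pure-Python sketch of top-k accuracy and mean class accuracy. It assumes per-video class scores are available as a list of lists and ground-truth labels as a list of integers; the layout of result.json is not described here, so that input format, the function names, and the toy data are all assumptions. It is a simplified illustration, not MMAction2's own implementation.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def mean_class_accuracy(scores, labels):
    """Mean of per-class top-1 accuracies over the classes present in labels."""
    correct, total = {}, {}
    for sample_scores, label in zip(scores, labels):
        pred = max(range(len(sample_scores)), key=lambda c: sample_scores[c])
        total[label] = total.get(label, 0) + 1
        correct[label] = correct.get(label, 0) + (pred == label)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy data made up for illustration: 4 videos, 3 classes.
scores = [
    [0.70, 0.20, 0.10],
    [0.10, 0.50, 0.40],
    [0.35, 0.40, 0.25],  # true class 0 is only the second-best guess
    [0.20, 0.30, 0.50],
]
labels = [0, 1, 0, 2]

print(top_k_accuracy(scores, labels, k=1))  # 0.75
print(top_k_accuracy(scores, labels, k=2))  # 1.0
print(mean_class_accuracy(scores, labels))
```

Mean class accuracy weights every class equally, so it can differ from top-1 accuracy when some classes have more test videos than others.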

For more testing details, see the Test a Dataset section of the Getting Started guide.