We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, we introduce DropPathway, which randomly drops the Audio pathway during training as an effective regularization technique. Inspired by prior studies in neuroscience, we perform hierarchical audiovisual synchronization to learn joint audiovisual features. We report state-of-the-art results on six video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to learn self-supervised audiovisual features. Code will be made available at: https://github.com/facebookresearch/SlowFast.
@article{xiao2020audiovisual,
title={Audiovisual SlowFast Networks for Video Recognition},
author={Xiao, Fanyi and Lee, Yong Jae and Grauman, Kristen and Malik, Jitendra and Feichtenhofer, Christoph},
journal={arXiv preprint arXiv:2001.08740},
year={2020}
}
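The abstract mentions DropPathway, a regularization that randomly drops the Audio pathway during training. Below is a minimal PyTorch-style sketch of the idea; the class name, default drop probability, and the choice to zero out the dropped features are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class DropPathway(nn.Module):
    """Illustrative sketch: randomly suppress the audio pathway in training.

    Assumed details (drop probability, zeroing the features) are not taken
    from the official AVSlowFast code.
    """

    def __init__(self, drop_prob=0.5):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, audio_feat):
        # Only drop during training; at inference the audio pathway is kept.
        if self.training and torch.rand(1).item() < self.drop_prob:
            # Returning zeros detaches the audio pathway for this iteration,
            # so the visual pathways cannot over-rely on audio features.
            return torch.zeros_like(audio_feat)
        return audio_feat
```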
| config | n_fft | gpus | backbone | pretrain | top1 acc (delta) | top5 acc (delta) | inference_time (video/s) | gpu_mem (MB) | ckpt | log | json |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tsn_r18_64x1x1_100e_kinetics400_audio_feature | 1024 | 8 | ResNet18 | None | 19.7 | 35.75 | x | 1897 | ckpt | log | json |
| tsn_r18_64x1x1_100e_kinetics400_audio_feature + tsn_r50_video_320p_1x1x3_100e_kinetics400_rgb | 1024 | 8 | ResNet(18+50) | None | 71.50 (+0.39) | 90.18 (+0.14) | x | x | x | x | x |
For more details on data preparation, you can refer to the Prepare audio part in Data Preparation.
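The n_fft column in the results table is the FFT window size used when the audio spectrogram features are extracted. As a rough illustration of what such an extraction step can look like (using librosa; the actual extraction scripts referenced in Data Preparation may use different sampling rates, hop lengths, and output formats):

```python
import librosa
import numpy as np


def extract_log_mel(audio_path, n_fft=1024, sr=16000, n_mels=80):
    # Illustrative only: parameters here are assumptions, not the repo defaults.
    # Load the audio track extracted from the video.
    y, sr = librosa.load(audio_path, sr=sr)
    # Mel spectrogram computed with the given FFT window size (n_fft).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, n_mels=n_mels)
    # Log compression, as is common for spectrogram features.
    return np.log(mel + 1e-6)


# Hypothetical usage; paths are placeholders.
# feat = extract_log_mel('data/kinetics400/audios/some_video.wav')
# np.save('data/kinetics400/audio_features/some_video.npy', feat)
```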
You can use the following command to train a model.
python tools/train.py ${CONFIG_FILE} [optional arguments]
Example: train the TSN model on the Kinetics-400 audio dataset with periodic validation in a deterministic setting.
python tools/train.py configs/audio_recognition/tsn_r50_64x1x1_100e_kinetics400_audio_feature.py \
--work-dir work_dirs/tsn_r50_64x1x1_100e_kinetics400_audio_feature \
--validate --seed 0 --deterministic
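The --seed and --deterministic options make training reproducible. Conceptually they amount to fixing all random seeds and switching cuDNN to deterministic mode, roughly as in the following sketch (the exact helper the training script calls may differ):

```python
import random
import numpy as np
import torch


def set_random_seed(seed, deterministic=False):
    # Fix the seeds of all random number generators in use.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    if deterministic:
        # Trade some speed for bit-exact reproducibility across runs.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
```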
For more details, you can refer to the Training setting part in getting_started.
You can use the following command to test a model.
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
Example: test the TSN model on the Kinetics-400 audio dataset and dump the result to a JSON file.
python tools/test.py configs/audio_recognition/tsn_r50_64x1x1_100e_kinetics400_audio_feature.py \
checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
--out result.json
For more details, you can refer to the Test a dataset part in getting_started.
For multi-modality fusion, you can use this simple script; the standard usage is:
python tools/analysis/report_accuracy.py --scores ${AUDIO_RESULT_PKL} ${VISUAL_RESULT_PKL} --datalist data/kinetics400/kinetics400_val_list_rawframes.txt --coefficient 1 1
where ${AUDIO_RESULT_PKL} and ${VISUAL_RESULT_PKL} are the score files dumped by tools/test.py via the argument --out.
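For reference, this kind of late fusion amounts to a weighted sum of per-video class scores from the two modalities. The sketch below is a simplified stand-in, assuming each PKL file stores a list of per-video class-score arrays and that ground-truth labels are read separately from the datalist; the real report_accuracy.py may handle more options and formats.

```python
import pickle
import numpy as np


def fuse_scores(audio_pkl, visual_pkl, labels, coefficients=(1.0, 1.0)):
    # Load the dumped per-video score lists from both modalities.
    with open(audio_pkl, 'rb') as f:
        audio_scores = pickle.load(f)
    with open(visual_pkl, 'rb') as f:
        visual_scores = pickle.load(f)

    correct = 0
    for a, v, gt in zip(audio_scores, visual_scores, labels):
        # Weighted sum of class scores, matching the --coefficient argument.
        fused = coefficients[0] * np.asarray(a) + coefficients[1] * np.asarray(v)
        correct += int(np.argmax(fused) == gt)
    # Top-1 accuracy of the fused predictions.
    return correct / len(labels)


# labels would be parsed from the file passed via --datalist.
```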