Haodong Duan, Yue Zhao, Yuanjun Xiong, Wentao Liu, Dahua Lin
We introduce OmniSource, a novel framework for leveraging web data to train video recognition models. OmniSource overcomes the barriers between data formats, such as images, short videos, and long untrimmed videos, for webly-supervised learning. First, data samples with multiple formats, curated by task-specific data collection and automatically filtered by a teacher model, are transformed into a unified form. Then a joint-training strategy is proposed to deal with the domain gaps between multiple data sources and formats in webly-supervised learning. Several good practices, including data balancing, resampling, and cross-dataset mixup, are adopted in joint training. Experiments show that by utilizing data from multiple sources and formats, OmniSource is more data-efficient in training. With only 3.5M images and 800K minutes of videos crawled from the internet without human labeling (less than 2% of prior works), our models learned with OmniSource improve the Top-1 accuracy of 2D- and 3D-ConvNet baseline models by 3.0% and 3.9%, respectively, on the Kinetics-400 benchmark. With OmniSource, we establish new records with different pre-training strategies for video recognition. Our best models achieve 80.4%, 80.5%, and 83.6% Top-1 accuracy on the Kinetics-400 benchmark for training from scratch, ImageNet pre-training, and IG-65M pre-training, respectively.
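Among the joint-training practices above, cross-dataset mixup blends samples drawn from different sources (e.g. a web image broadcast into a clip and a trimmed video) along with their labels. The following is a minimal illustrative sketch of the general mixup idea applied across two sources; the function name and arguments are hypothetical and do not reflect the authors' actual implementation or API.

```python
import numpy as np

def cross_dataset_mixup(clip_a, label_a, clip_b, label_b, num_classes,
                        alpha=0.2, rng=None):
    """Mix a sample from one data source with a sample from another.

    clip_a / clip_b: arrays of identical shape (e.g. T x H x W x C).
    label_a / label_b: integer class indices.
    alpha: Beta-distribution parameter controlling the mixing ratio.
    All names here are illustrative, not the paper's actual code.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing coefficient in (0, 1)
    # Blend the inputs pixel-wise.
    mixed_clip = lam * clip_a + (1.0 - lam) * clip_b
    # Interpolate one-hot labels with the same coefficient.
    one_hot_a = np.eye(num_classes)[label_a]
    one_hot_b = np.eye(num_classes)[label_b]
    mixed_label = lam * one_hot_a + (1.0 - lam) * one_hot_b
    return mixed_clip, mixed_label
```

The mixed label is a convex combination of the two one-hot labels, so the training loss becomes a weighted sum of the per-source losses.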
@article{duan2020omni,
title={Omni-sourced Webly-supervised Learning for Video Recognition},
author={Duan, Haodong and Zhao, Yue and Xiong, Yuanjun and Liu, Wentao and Lin, Dahua},
journal={arXiv preprint arXiv:2003.13042},
year={2020}
}
We have released 4 models trained with the OmniSource framework, covering both 2D and 3D architectures. The following table compares the performance of models trained with and without OmniSource.
Model | Modality | Pretrained | Backbone | Input | Resolution | Top-1 (Baseline / OmniSource (Delta)) | Top-5 (Baseline / OmniSource (Delta)) | Download |
---|---|---|---|---|---|---|---|---|
TSN | RGB | ImageNet | ResNet50 | 3seg | 340x256 | 70.6 / 73.6 (+ 3.0) | 89.4 / 91.0 (+ 1.6) | Baseline / OmniSource |
TSN | RGB | IG-1B | ResNet50 | 3seg | short-side 320 | 73.1 / 75.7 (+ 2.6) | 90.4 / 91.9 (+ 1.5) | Baseline / OmniSource |
SlowOnly | RGB | Scratch | ResNet50 | 4x16 | short-side 320 | 72.9 / 76.8 (+ 3.9) | 90.9 / 92.5 (+ 1.6) | Baseline / OmniSource |
SlowOnly | RGB | Scratch | ResNet101 | 8x8 | short-side 320 | 76.5 / 80.4 (+ 3.9) | 92.7 / 94.4 (+ 1.7) | Baseline / OmniSource |
We release a subset of the web dataset used in the OmniSource paper. Specifically, we release the web data for the 200 classes of Mini-Kinetics. The statistics of these datasets are detailed in preparing_omnisource. To obtain the data, you need to fill in a data request form; once we receive your request, a download link will be sent to you. For more details on the released OmniSource web dataset, please refer to preparing_omnisource.
We benchmark the OmniSource framework on the released subset; the results (Top-1 and Top-5 accuracy on the Mini-Kinetics validation set) are listed in the following tables. The benchmark can serve as a baseline for video recognition with web data.
Model | Modality | Pretrained | Backbone | Input | Resolution | Top-1 acc | Top-5 acc | ckpt | json | log |
---|---|---|---|---|---|---|---|---|---|---|
tsn_r50_1x1x8_100e_minikinetics_rgb | RGB | ImageNet | ResNet50 | 3seg | short-side 320 | 77.4 | 93.6 | ckpt | json | log |
tsn_r50_1x1x8_100e_minikinetics_googleimage_rgb | RGB | ImageNet | ResNet50 | 3seg | short-side 320 | 78.0 | 93.6 | ckpt | json | log |
tsn_r50_1x1x8_100e_minikinetics_webimage_rgb | RGB | ImageNet | ResNet50 | 3seg | short-side 320 | 78.6 | 93.6 | ckpt | json | log |
tsn_r50_1x1x8_100e_minikinetics_insvideo_rgb | RGB | ImageNet | ResNet50 | 3seg | short-side 320 | 80.6 | 95.0 | ckpt | json | log |
tsn_r50_1x1x8_100e_minikinetics_kineticsraw_rgb | RGB | ImageNet | ResNet50 | 3seg | short-side 320 | 78.6 | 93.2 | ckpt | json | log |
tsn_r50_1x1x8_100e_minikinetics_omnisource_rgb | RGB | ImageNet | ResNet50 | 3seg | short-side 320 | 81.3 | 94.8 | ckpt | json | log |
Model | Modality | Pretrained | Backbone | Input | Resolution | Top-1 acc | Top-5 acc | ckpt | json | log |
---|---|---|---|---|---|---|---|---|---|---|
slowonly_r50_8x8x1_256e_minikinetics_rgb | RGB | None | ResNet50 | 8x8 | short-side 320 | 78.6 | 93.9 | ckpt | json | log |
slowonly_r50_8x8x1_256e_minikinetics_googleimage_rgb | RGB | None | ResNet50 | 8x8 | short-side 320 | 80.8 | 95.0 | ckpt | json | log |
slowonly_r50_8x8x1_256e_minikinetics_webimage_rgb | RGB | None | ResNet50 | 8x8 | short-side 320 | 81.3 | 95.2 | ckpt | json | log |
slowonly_r50_8x8x1_256e_minikinetics_insvideo_rgb | RGB | None | ResNet50 | 8x8 | short-side 320 | 82.4 | 95.6 | ckpt | json | log |
slowonly_r50_8x8x1_256e_minikinetics_kineticsraw_rgb | RGB | None | ResNet50 | 8x8 | short-side 320 | 80.3 | 94.5 | ckpt | json | log |
slowonly_r50_8x8x1_256e_minikinetics_omnisource_rgb | RGB | None | ResNet50 | 8x8 | short-side 320 | 82.9 | 95.8 | ckpt | json | log |
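The Top-1 / Top-5 numbers in the tables above follow the standard top-k accuracy metric: a prediction counts as correct if the ground-truth class is among the k highest-scoring classes. Below is a generic sketch of that computation, not MMAction2's actual evaluation code.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the top-k scores.

    scores: (N, C) array of per-class scores.
    labels: (N,) array of ground-truth class indices.
    """
    # Indices of the k highest-scoring classes for each sample.
    topk = np.argsort(scores, axis=1)[:, -k:]
    # A sample is a hit if its true label appears among those indices.
    hits = np.any(topk == np.asarray(labels)[:, None], axis=1)
    return hits.mean()
```

For example, with scores `[[0.1, 0.9], [0.8, 0.2]]` and labels `[1, 1]`, Top-1 accuracy is 0.5 and Top-2 accuracy is 1.0.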
For comparison, we also list the benchmark from the original paper, which was run on Kinetics-400 (Top-1 / Top-5 accuracy):
Model | Baseline | +GG-img | +[GG-IG]-img | +IG-vid | +KRaw | OmniSource |
---|---|---|---|---|---|---|
TSN-3seg-ResNet50 | 70.6 / 89.4 | 71.5 / 89.5 | 72.0 / 90.0 | 72.0 / 90.3 | 71.7 / 89.6 | 73.6 / 91.0 |
SlowOnly-4x16-ResNet50 | 73.8 / 90.9 | 74.5 / 91.4 | 75.2 / 91.6 | 75.2 / 91.7 | 74.5 / 91.1 | 76.6 / 92.5 |