What is LAVIS?
####################################

LAVIS is a Python deep learning library for LAnguage-and-VISion research and applications.
It features a unified design to access state-of-the-art foundation language-vision models (`ALBEF <https://arxiv.org/pdf/2107.07651.pdf>`_,
`BLIP <https://arxiv.org/pdf/2201.12086.pdf>`_, `ALPRO <https://arxiv.org/pdf/2112.09583.pdf>`_, `CLIP <https://arxiv.org/pdf/2103.00020.pdf>`_), common tasks
(retrieval, captioning, visual question answering, multimodal classification, etc.) and datasets (COCO, Flickr, NoCaps, Conceptual
Captions, SBU, etc.).

This library aims to provide engineers and researchers with a one-stop solution to rapidly develop models for their specific multimodal
scenarios, and to benchmark them across standard and customized datasets.

Key features of LAVIS include:

- **Modular and Extensible Library Design**: makes it easy to utilize and repurpose existing modules (datasets, models, preprocessors), and to add new ones.

- **Easy Off-the-shelf Inference and Feature Extraction**: readily available pre-trained models let you take advantage of state-of-the-art multimodal understanding and generation capabilities on your own data.

- **Reproducible Model Zoo**: provided training/pre-training recipes make it easy to replicate and extend state-of-the-art models.

- **Dataset Zoo and Automatic Downloading Tools**: preparing the many language-vision datasets can be a hassle. LAVIS provides automatic downloading scripts to help prepare a large variety of datasets and their annotations.

Other features include:

- **Distributed Training** using multiple GPUs on one machine or across multiple machines.

- **Web Demo**: try supported models on your own pictures, questions, etc.

- **Leaderboard**: compare state-of-the-art models across standard datasets.

- **Dataset Explorer**: browse and understand language-vision datasets.

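As an example of off-the-shelf inference, generating a caption with a pre-trained BLIP model takes only a few lines. The sketch below uses LAVIS's ``load_model_and_preprocess`` helper; the pre-trained checkpoint is downloaded on first use, and ``example.jpg`` is a placeholder for your own image:

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a pre-trained BLIP captioning model together with its matching
# image preprocessor (the text processor is unused for captioning).
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
captions = model.generate({"image": image})
print(captions)
```

The same ``load_model_and_preprocess`` entry point serves the other supported models; only the ``name`` and ``model_type`` arguments change.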
Supported Tasks, Models and Datasets
####################################

The following table shows the models and language-vision tasks supported by LAVIS. Adapting existing models to more tasks is possible and planned for future releases.
37
38
======================================== =========================== ============================================= ============ 
39
Tasks                                     Supported Models            Supported Datasets                            Modalities  
40
======================================== =========================== ============================================= ============ 
41
Image-text Pre-training                   ALBEF, BLIP                 COCO, VisualGenome, SBU, ConceptualCaptions  image, text  
42
Image-text Retrieval                      ALBEF, BLIP, CLIP           COCO, Flickr30k                              image, text  
43
Text-image Retrieval                      ALBEF, BLIP, CLIP           COCO, Flickr30k                              image, text  
44
Visual Question Answering                 ALBEF, BLIP                 VQAv2, OKVQA, A-OKVQA                        image, text  
45
Image Captioning                          BLIP                        COCO, NoCaps                                 image, text  
46
Image Classification                      CLIP                        ImageNet                                     image        
47
Natural Language Visual Reasoning (NLVR)  ALBEF, BLIP                 NLVR2                                        image, text  
48
Visual Entailment (VE)                    ALBEF                       SNLI-VE                                      image, text  
49
Visual Dialogue                           BLIP                        VisDial                                      image, text  
50
Video-text Retrieval                      BLIP, ALPRO                 MSRVTT, DiDeMo                               video, text  
51
Text-video Retrieval                      BLIP, ALPRO                 MSRVTT, DiDeMo                               video, text  
52
Video Question Answering (VideoQA)        BLIP, ALPRO                 MSRVTT, MSVD                                 video, text  
53
Video Dialogue                            VGD-GPT                     AVSD                                         video, text  
54
Multimodal Feature Extraction             ALBEF, CLIP, BLIP, ALPRO    customized                                   image, text  
55
======================================== =========================== ============================================= ============ 

Library Design
####################################

.. image:: _static/architecture.png
  :width: 550

LAVIS has six key modules.

- ``lavis.runners`` manages the overall training and evaluation lifecycle. It is also responsible for lazily creating required components on demand, such as optimizers, learning rate schedulers and dataloaders. Currently, ``RunnerBase`` implements epoch-based training and ``RunnerIters`` implements iteration-based training.
- ``lavis.tasks`` implements concrete training and evaluation logic per task, for example retrieval, captioning, or pre-training. The rationale for a task abstraction is to accommodate task-specific training and evaluation; for example, evaluating a retrieval model is different from evaluating a classification model.
- ``lavis.datasets`` is responsible for creating datasets: ``lavis.datasets.builders`` loads dataset configurations, downloads annotations and returns a dataset object; ``lavis.datasets.datasets`` defines the supported datasets, each a ``torch.utils.data.Dataset`` instance. We also provide `automatic dataset downloading tools` in ``datasets/download_scripts`` to help prepare common public datasets.
- ``lavis.models`` holds the definitions of the supported models and shared model layers.
- ``lavis.processors`` handles preprocessing of text and images/videos before they are fed to the model. For images and videos, a processor can be thought of as the transforms in torchvision; for text input, this may include lowercasing, truncation, etc.
- ``lavis.common`` contains shared classes and methods used by multiple other modules. For example,

   - ``lavis.common.config`` contains classes to store and manipulate the configuration files used by LAVIS. In particular, we use a hierarchical configuration design to allow highly customizable training and evaluation.
   - ``lavis.common.registry`` serves as a centralized place to manage modules that share the same functionalities. It allows building datasets, models, tasks and learning rate schedulers at runtime, by specifying their names as strings in the configuration file.
   - ``lavis.common.optims`` contains definitions of learning rate schedulers.
   - ``lavis.common.dist_utils`` contains utilities for distributed training and evaluation.
   - ``lavis.common.utils`` contains miscellaneous utilities, mostly IO-related helper functions.

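The registry-driven design can be sketched in a few lines of plain Python. This is an illustrative toy, not LAVIS's actual implementation; ``ToyCaptioner`` and the config keys are made up for the example:

```python
# Minimal sketch of a registry that builds components from a config by name,
# loosely mirroring the role of ``lavis.common.registry``.
class Registry:
    def __init__(self):
        self._models = {}

    def register_model(self, name):
        """Decorator that records a class under a string name."""
        def wrap(cls):
            self._models[name] = cls
            return cls
        return wrap

    def get_model_class(self, name):
        return self._models[name]


registry = Registry()


@registry.register_model("toy_captioner")
class ToyCaptioner:
    def __init__(self, hidden_size):
        self.hidden_size = hidden_size

    @classmethod
    def from_config(cls, cfg):
        return cls(hidden_size=cfg.get("hidden_size", 256))


# A config file names the component as a string; the registry resolves it
# at runtime, so adding a new model requires no change to the build code.
cfg = {"arch": "toy_captioner", "hidden_size": 512}
model = registry.get_model_class(cfg["arch"]).from_config(cfg)
print(model.hidden_size)  # -> 512
```

Because datasets, tasks and schedulers are looked up the same way, swapping one component for another is a one-line configuration change.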
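The hierarchical configuration design can likewise be illustrated as user-provided options recursively overriding library defaults, section by section. This is a simplified sketch; the actual ``lavis.common.config`` is more elaborate, and the default values below are invented for the example:

```python
# Sketch of hierarchical configuration: a user config overrides library
# defaults, merged recursively per section.
def merge_configs(default, override):
    merged = dict(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged


default_cfg = {
    "model": {"arch": "blip_caption", "model_type": "base_coco"},
    "run": {"batch_size": 32, "lr": 1e-4},
}
user_cfg = {"run": {"batch_size": 8}}  # override only one field

cfg = merge_configs(default_cfg, user_cfg)
print(cfg["run"])  # -> {'batch_size': 8, 'lr': 0.0001}
```

The benefit is that a run configuration only needs to state what differs from the defaults, keeping experiment files short and comparable.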
Installation
############

1. (Optional) Creating a conda environment

.. code-block:: bash

   conda create -n lavis python=3.8
   conda activate lavis

2. Cloning and building from source

.. code-block:: bash

   git clone https://github.com/salesforce/LAVIS.git
   cd LAVIS
   pip install .

If you would like to develop on LAVIS, you may find it easier to install it in editable mode::

   pip install -e .