Dataset Zoo
##################
LAVIS inherently supports a wide variety of common language-vision datasets by providing automatic download scripts to help download and organize these datasets,
and by implementing PyTorch datasets for them. To view the supported datasets, use the following code:

.. code-block:: python

    from lavis.datasets.builders import dataset_zoo
    dataset_names = dataset_zoo.get_names()
    print(dataset_names)
    # ['aok_vqa', 'coco_caption', 'coco_retrieval', 'coco_vqa', 'conceptual_caption_12m',
    #  'conceptual_caption_3m', 'didemo_retrieval', 'flickr30k', 'imagenet', 'laion2B_multi',
    #  'msrvtt_caption', 'msrvtt_qa', 'msrvtt_retrieval', 'msvd_caption', 'msvd_qa', 'nlvr',
    #  'nocaps', 'ok_vqa', 'sbu_caption', 'snli_ve', 'vatex_caption', 'vg_caption', 'vg_vqa']
    print(len(dataset_names))
    # 23


Auto-Downloading and Loading Datasets
######################################
We now take the COCO caption dataset as an example to demonstrate how to download and prepare it.

In ``lavis/datasets/download_scripts/``, we provide tools to download the most common public language-vision datasets supported by LAVIS.
The COCO caption dataset uses images from the COCO dataset. Therefore, we first download the COCO images via:

.. code-block:: bash

    cd lavis/datasets/download_scripts/ && python download_coco.py

This will automatically download and extract the COCO images to the default LAVIS cache location.
The default cache location is ``~/.cache/lavis``, defined in ``lavis/configs/default.yaml``.
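
To double-check where the files ended up, you can simply inspect the cache directory. The snippet below is a minimal sketch that only assumes the default ``~/.cache/lavis`` location mentioned above; if you changed ``lavis/configs/default.yaml``, adjust the path accordingly.

.. code-block:: python

    from pathlib import Path

    # default LAVIS cache root (assumes lavis/configs/default.yaml is unchanged)
    cache_root = Path("~/.cache/lavis").expanduser()

    # list the top-level entries; after download_coco.py finishes, the COCO images
    # should appear under one of these folders (the exact sub-folder name may vary)
    for entry in sorted(cache_root.iterdir()):
        print(entry)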

After downloading the images, we can use ``load_dataset()`` to obtain the dataset. On the first run, this will automatically download and cache annotation files.

.. code-block:: python

    from lavis.datasets.builders import load_dataset
    coco_dataset = load_dataset("coco_caption")

    print(coco_dataset.keys())
    # dict_keys(['train', 'val', 'test'])

    print(len(coco_dataset["train"]))
    # 566747

    print(coco_dataset["train"][0])
    # {'image': <PIL.Image.Image image mode=RGB size=640x480>,
    #  'text_input': 'A woman wearing a net on her head cutting a cake. ',
    #  'image_id': 0}
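
Since each split is a regular PyTorch dataset, you can also iterate over a few samples to eyeball the data. The snippet below only uses the indexing behaviour shown above.

.. code-block:: python

    # peek at the first few raw training samples
    for i in range(3):
        sample = coco_dataset["train"][i]
        # each sample is a dict with a PIL image, a caption string, and an image id
        print(i, sample["image"].size, sample["text_input"])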

If you already host a local copy of the dataset, you can pass in the ``vis_path`` argument to change the default location from which images are loaded.

.. code-block:: python

    coco_dataset = load_dataset("coco_caption", vis_path=YOUR_LOCAL_PATH)


Model Zoo
####################################
LAVIS supports a growing list of pre-trained models for different tasks and
datasets, at varying sizes. Let's get started by viewing the supported models.

.. code-block:: python

    from lavis.models import model_zoo
    print(model_zoo)
    # ==================================================
    # Architectures                  Types
    # ==================================================
    # albef_classification           base, ve
    # albef_nlvr                     base
    # albef_pretrain                 base
    # albef_retrieval                base, coco, flickr
    # albef_vqa                      base, vqav2
    # alpro_qa                       base, msrvtt, msvd
    # alpro_retrieval                base, msrvtt, didemo
    # blip_caption                   base, base_coco, large, large_coco
    # blip_classification            base
    # blip_feature_extractor         base
    # blip_nlvr                      base
    # blip_pretrain                  base
    # blip_retrieval                 base, coco, flickr
    # blip_vqa                       base, vqav2
    # clip                           ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50

    # show the total number of supported model variants
    print(len(model_zoo))
    # 33


Inference with Pre-trained Models
####################################

Now let's see how to use models in LAVIS to perform inference on example data. We first
load a sample image from a local file.

.. code-block:: python

    import torch
    from PIL import Image

    # setup device to use
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # load sample image
    raw_image = Image.open("docs/_static/merlion.png").convert("RGB")

This example image shows `Merlion park <https://en.wikipedia.org/wiki/Merlion>`_ (`image credit <https://theculturetrip.com/asia/singapore/articles/what-exactly-is-singapores-merlion-anyway/>`_), a landmark in Singapore.

.. image:: _static/merlion.png

Image Captioning
*******************************
We now use the BLIP model to generate a caption for the image. To make inference even easier, we also associate each
pre-trained model with its preprocessors (transforms). Here we use ``load_model_and_preprocess()`` with the following arguments:

- ``name``: The name of the model to load. This could be a pre-trained model, task model, or feature extractor. See ``model_zoo`` for a full list of model names.
- ``model_type``: Each architecture has variants trained on different datasets and at different scales. See the Types column in ``model_zoo`` for a full list of model types.
- ``is_eval``: if ``True``, set the model to evaluation mode. This is desired for inference or feature extraction.
- ``device``: the device on which to load the model.

.. code-block:: python

    from lavis.models import load_model_and_preprocess
    # loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
    # this also loads the associated image processors
    model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)

    # preprocess the image
    # vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

    # generate caption
    model.generate({"image": image})
    # ['a large fountain spewing water into the air']


You may also load models and their preprocessors separately via ``load_model()`` and ``load_processor()``.
In BLIP, you can also generate diverse captions by turning on nucleus sampling.

.. code-block:: python

    from lavis.processors import load_processor
    from lavis.models import load_model

    # load the image preprocessor used for BLIP
    vis_processor = load_processor("blip_image_eval").build(image_size=384)
    model = load_model(name="blip_caption", model_type="base_coco", is_eval=True, device=device)

    image = vis_processor(raw_image).unsqueeze(0).to(device)
    model.generate({"image": image}, use_nucleus_sampling=True)
    # one generated random sample: ['some very pretty buildings and some water jets']
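
Because nucleus sampling is stochastic, repeated calls return different captions. To draw several samples in a single call, the captioning model's ``generate`` method also accepts a ``num_captions`` argument; the keyword name here is an assumption based on the BLIP captioning implementation, so treat this as a sketch and verify it against your installed LAVIS version.

.. code-block:: python

    # sketch: sample three diverse captions in one call
    # (num_captions is assumed from the BLIP captioning model's generate signature)
    captions = model.generate({"image": image}, use_nucleus_sampling=True, num_captions=3)
    print(captions)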


Visual question answering (VQA)
*******************************
The BLIP model is able to answer free-form questions about images in natural language.
To access the VQA model, simply replace the ``name`` and ``model_type`` arguments
passed to ``load_model_and_preprocess()``.

.. code-block:: python

    from lavis.models import load_model_and_preprocess
    model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="vqav2", is_eval=True, device=device)

    # ask a random question.
    question = "Which city is this photo taken?"

    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    question = txt_processors["eval"](question)

    model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")
    # ['singapore']
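
You can also ask several questions about the same image in one batch by repeating the image tensor and passing the processed questions as a list. The sketch below mirrors the ``samples`` format above, but the batching behaviour is an assumption worth verifying on your LAVIS version.

.. code-block:: python

    # sketch: batch several questions against the same image
    # (assumes predict_answers accepts a list of processed questions with a matching image batch)
    questions = ["Which city is this photo taken?", "What is spewing water?"]
    questions = [txt_processors["eval"](q) for q in questions]

    # repeat the single preprocessed image along the batch dimension
    image_batch = image.repeat(len(questions), 1, 1, 1)

    model.predict_answers(samples={"image": image_batch, "text_input": questions}, inference_method="generate")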


Unified Feature Extraction Interface
####################################

LAVIS provides a unified interface to extract multimodal features from each architecture.
To extract features, we load the feature extractor variants of each model.
The multimodal feature can be used for multimodal classification. The low-dimensional unimodal features can be used to compute cross-modal similarity.

.. code-block:: python

    from lavis.models import load_model_and_preprocess

    model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_feature_extractor", model_type="base", is_eval=True, device=device)
    caption = "a large fountain spewing water into the air"

    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    text_input = txt_processors["eval"](caption)

    sample = {"image": image, "text_input": [text_input]}

    features_multimodal = model.extract_features(sample)
    print(features_multimodal.keys())
    # odict_keys(['image_embeds', 'multimodal_embeds'])
    print(features_multimodal.multimodal_embeds.shape)
    # torch.Size([1, 12, 768]), use features_multimodal.multimodal_embeds[:, 0, :] for multimodal classification tasks

    features_image = model.extract_features(sample, mode="image")
    print(features_image.keys())
    # odict_keys(['image_embeds', 'image_embeds_proj'])
    print(features_image.image_embeds.shape)
    # torch.Size([1, 197, 768])
    print(features_image.image_embeds_proj.shape)
    # torch.Size([1, 197, 256])

    features_text = model.extract_features(sample, mode="text")
    print(features_text.keys())
    # odict_keys(['text_embeds', 'text_embeds_proj'])
    print(features_text.text_embeds.shape)
    # torch.Size([1, 12, 768])
    print(features_text.text_embeds_proj.shape)
    # torch.Size([1, 12, 256])

    similarity = features_image.image_embeds_proj[:, 0, :] @ features_text.text_embeds_proj[:, 0, :].t()
    print(similarity)
    # tensor([[0.2622]])
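
As a small extension of the similarity computation above, you can score several candidate captions against the same image and keep the best match. The sketch below reuses only the ``extract_features`` calls already shown; the candidate captions are made up for illustration.

.. code-block:: python

    import torch

    # hypothetical candidate captions to score against the image above
    candidates = [
        "a large fountain spewing water into the air",
        "a cat sleeping on a sofa",
        "a crowd watching a fireworks display",
    ]

    image_feat = features_image.image_embeds_proj[:, 0, :]      # (1, 256)

    text_feats = []
    for cand in candidates:
        txt = txt_processors["eval"](cand)
        feats = model.extract_features({"image": image, "text_input": [txt]}, mode="text")
        text_feats.append(feats.text_embeds_proj[:, 0, :])      # (1, 256) each
    text_feats = torch.cat(text_feats, dim=0)                   # (3, 256)

    scores = image_feat @ text_feats.t()                        # (1, 3) similarity scores
    print(candidates[scores.argmax().item()])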

Since LAVIS supports a unified feature extraction interface, minimal changes are necessary to use a different model as the feature extractor. For example,
to use ALBEF as the feature extractor, one only needs to change the following line:

.. code-block:: python

    model, vis_processors, txt_processors = load_model_and_preprocess(name="albef_feature_extractor", model_type="base", is_eval=True, device=device)

Similarly, to use CLIP as the feature extractor:

.. code-block:: python

    model, vis_processors, txt_processors = load_model_and_preprocess(name="clip_feature_extractor", model_type="base", is_eval=True, device=device)
    # model, vis_processors, txt_processors = load_model_and_preprocess(name="clip_feature_extractor", model_type="RN50", is_eval=True, device=device)
    # model, vis_processors, txt_processors = load_model_and_preprocess(name="clip_feature_extractor", model_type="ViT-L-14", is_eval=True, device=device)