Dataset Zoo
##################
LAVIS inherently supports a wide variety of common language-vision datasets: it provides automatic download scripts to help download and organize these datasets,
and implements PyTorch datasets for them. To view the supported datasets, use the following code:
|
|
.. code-block:: python

    from lavis.datasets.builders import dataset_zoo
    dataset_names = dataset_zoo.get_names()
    print(dataset_names)
    # ['aok_vqa', 'coco_caption', 'coco_retrieval', 'coco_vqa', 'conceptual_caption_12m',
    #  'conceptual_caption_3m', 'didemo_retrieval', 'flickr30k', 'imagenet', 'laion2B_multi',
    #  'msrvtt_caption', 'msrvtt_qa', 'msrvtt_retrieval', 'msvd_caption', 'msvd_qa', 'nlvr',
    #  'nocaps', 'ok_vqa', 'sbu_caption', 'snli_ve', 'vatex_caption', 'vg_caption', 'vg_vqa']
    print(len(dataset_names))
    # 23
|
|
Auto-Downloading and Loading Datasets
######################################
We now take the COCO caption dataset as an example to demonstrate how to download and prepare the dataset.

In ``lavis/datasets/download_scripts/``, we provide tools to download the most common public language-vision datasets supported by LAVIS.
The COCO caption dataset uses images from the COCO dataset. Therefore, we first download the COCO images via:
|
|
.. code-block:: bash

    cd lavis/datasets/download_scripts/ && python download_coco.py

This will automatically download and extract COCO images to the default LAVIS cache location.
The default cache location is ``~/.cache/lavis``, defined in ``lavis/configs/default.yaml``.
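
If you want to confirm what the download script produced, you can simply list the cache directory. This is a minimal sketch using only the Python standard library; the exact sub-folder names (for example a ``coco`` folder holding the images) depend on the dataset and LAVIS version, so treat them as an assumption.

.. code-block:: python

    from pathlib import Path

    # default LAVIS cache root; edit lavis/configs/default.yaml to relocate it
    cache_root = Path("~/.cache/lavis").expanduser()

    # list the top-level entries created by the download scripts
    for entry in sorted(cache_root.iterdir()):
        print(entry)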
|
|
After downloading the images, we can use ``load_dataset()`` to obtain the dataset. On the first run, this will automatically download and cache annotation files.

.. code-block:: python

    from lavis.datasets.builders import load_dataset
    coco_dataset = load_dataset("coco_caption")

    print(coco_dataset.keys())
    # dict_keys(['train', 'val', 'test'])

    print(len(coco_dataset["train"]))
    # 566747

    print(coco_dataset["train"][0])
    # {'image': <PIL.Image.Image image mode=RGB size=640x480>,
    #  'text_input': 'A woman wearing a net on her head cutting a cake. ',
    #  'image_id': 0}
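
Each split is a regular PyTorch dataset, so it can be wrapped in a standard ``torch.utils.data.DataLoader``. The sketch below is illustrative only: it keeps the raw PIL images by using an identity collate function, whereas in practice you would apply the model's image processors (shown later) before batching.

.. code-block:: python

    from torch.utils.data import DataLoader

    # keep each batch as a plain list of sample dicts; the default collate_fn
    # cannot stack raw PIL images, hence the identity collate function
    train_loader = DataLoader(
        coco_dataset["train"],
        batch_size=8,
        shuffle=True,
        collate_fn=lambda batch: batch,
    )

    batch = next(iter(train_loader))
    print(len(batch), batch[0]["text_input"])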
|
|
If you already host a local copy of the dataset, you can pass the ``vis_path`` argument to change the default location from which images are loaded.

.. code-block:: python

    coco_dataset = load_dataset("coco_caption", vis_path=YOUR_LOCAL_PATH)
|
|
Model Zoo
####################################
LAVIS supports a growing list of pre-trained models for different tasks and
datasets, and of varying sizes. Let's get started by viewing the supported models.

.. code-block:: python

    from lavis.models import model_zoo
    print(model_zoo)
    # ==================================================
    # Architectures                  Types
    # ==================================================
    # albef_classification           base, ve
    # albef_nlvr                     base
    # albef_pretrain                 base
    # albef_retrieval                base, coco, flickr
    # albef_vqa                      base, vqav2
    # alpro_qa                       base, msrvtt, msvd
    # alpro_retrieval                base, msrvtt, didemo
    # blip_caption                   base, base_coco, large, large_coco
    # blip_classification            base
    # blip_feature_extractor         base
    # blip_nlvr                      base
    # blip_pretrain                  base
    # blip_retrieval                 base, coco, flickr
    # blip_vqa                       base, vqav2
    # clip                           ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50

    # show the total number of supported model variants
    len(model_zoo)
    # 33
|
|
Inference with Pre-trained Models
####################################

Now let's see how to use models in LAVIS to perform inference on example data. We first
load a sample image from the local file system.

.. code-block:: python

    import torch
    from PIL import Image

    # setup device to use
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # load sample image
    raw_image = Image.open("docs/_static/merlion.png").convert("RGB")
|
|
This example image shows `Merlion park <https://en.wikipedia.org/wiki/Merlion>`_ (`image credit <https://theculturetrip.com/asia/singapore/articles/what-exactly-is-singapores-merlion-anyway/>`_), a landmark in Singapore.

.. image:: _static/merlion.png
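
If you prefer to fetch an example image over HTTP instead of reading it from disk, any PIL image works just as well. Below is a minimal sketch using ``requests``; the URL is a placeholder that you would replace with an image of your own.

.. code-block:: python

    from io import BytesIO

    import requests
    from PIL import Image

    # placeholder URL: substitute any publicly accessible image of your own
    url = "https://example.com/some_image.png"
    response = requests.get(url, timeout=10)
    raw_image = Image.open(BytesIO(response.content)).convert("RGB")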
|
|
Image Captioning
*******************************
We now use the BLIP model to generate a caption for the image. To make inference even easier, we also associate each
pre-trained model with its preprocessors (transforms). We use ``load_model_and_preprocess()`` with the following arguments:

- ``name``: the name of the model to load. This could be a pre-trained model, a task model, or a feature extractor. See ``model_zoo`` for a full list of model names.
- ``model_type``: each architecture has variants trained on different datasets and at different scales. See the Types column in ``model_zoo`` for a full list of model types.
- ``is_eval``: if ``True``, sets the model to evaluation mode. This is desired for inference or feature extraction.
- ``device``: the device to load the model on.

.. code-block:: python

    from lavis.models import load_model_and_preprocess
    # loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
    # this also loads the associated image processors
    model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)

    # preprocess the image
    # vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

    # generate caption
    model.generate({"image": image})
    # ['a large fountain spewing water into the air']
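
By default, ``generate()`` produces a single caption with beam search. You can trade speed for caption quality by adjusting the decoding parameters. The keyword names below (``num_beams``, ``max_length``, ``min_length``) are assumptions based on the BLIP captioner's ``generate()`` signature and may differ across LAVIS versions, so check the signature of your installed copy.

.. code-block:: python

    # wider beam and longer captions; keyword names assumed from the BLIP
    # captioner's generate() signature -- verify against your LAVIS version
    model.generate({"image": image}, num_beams=5, max_length=40, min_length=10)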
|
|
You may also load models and their preprocessors separately via ``load_model()`` and ``load_processor()``.
In BLIP, you can also generate diverse captions by turning nucleus sampling on.

.. code-block:: python

    from lavis.processors import load_processor
    from lavis.models import load_model

    # load the image preprocessor used for BLIP
    vis_processor = load_processor("blip_image_eval").build(image_size=384)
    model = load_model(name="blip_caption", model_type="base_coco", is_eval=True, device=device)

    image = vis_processor(raw_image).unsqueeze(0).to(device)
    model.generate({"image": image}, use_nucleus_sampling=True)
    # one randomly generated sample: ['some very pretty buildings and some water jets']
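
With nucleus sampling enabled you can also request several captions in one call. The ``num_captions`` keyword is an assumption based on the BLIP captioner's ``generate()`` signature; verify it against your installed LAVIS version.

.. code-block:: python

    # draw three diverse captions for the same image
    # num_captions is assumed from the BLIP captioner's generate() signature
    model.generate({"image": image}, use_nucleus_sampling=True, num_captions=3)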
|
|
Visual question answering (VQA)
*******************************
The BLIP model is able to answer free-form questions about images in natural language.
To access the VQA model, simply replace the ``name`` and ``model_type`` arguments
passed to ``load_model_and_preprocess()``.

.. code-block:: python

    from lavis.models import load_model_and_preprocess
    model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="vqav2", is_eval=True, device=device)

    # ask a question about the image
    question = "Which city is this photo taken?"

    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    question = txt_processors["eval"](question)

    model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")
    # ['singapore']
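
``predict_answers()`` also accepts a batch of questions. The sketch below asks several questions about the same image by repeating the image tensor along the batch dimension; the second question is made up for illustration, and the exact batching behavior may vary across models, so treat this as a sketch rather than a reference.

.. code-block:: python

    # ask several questions about the same image in one call
    questions = [
        "Which city is this photo taken?",
        "What is spewing water into the air?",
    ]
    questions = [txt_processors["eval"](q) for q in questions]

    # repeat the image along the batch dimension to match the number of questions
    images = image.repeat(len(questions), 1, 1, 1)

    answers = model.predict_answers(samples={"image": images, "text_input": questions}, inference_method="generate")
    print(answers)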
|
|
Unified Feature Extraction Interface
####################################

LAVIS provides a unified interface to extract multimodal features from each architecture.
To extract features, we load the feature extractor variant of each model.
The multimodal features can be used for multimodal classification, while the low-dimensional projected unimodal features can be used to compute cross-modal similarity.
|
|
.. code-block:: python

    from lavis.models import load_model_and_preprocess

    model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_feature_extractor", model_type="base", is_eval=True, device=device)
    caption = "a large fountain spewing water into the air"

    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    text_input = txt_processors["eval"](caption)

    sample = {"image": image, "text_input": [text_input]}

    features_multimodal = model.extract_features(sample)
    print(features_multimodal.keys())
    # odict_keys(['image_embeds', 'multimodal_embeds'])
    print(features_multimodal.multimodal_embeds.shape)
    # torch.Size([1, 12, 768]), use features_multimodal.multimodal_embeds[:, 0, :] for multimodal classification tasks

    features_image = model.extract_features(sample, mode="image")
    print(features_image.keys())
    # odict_keys(['image_embeds', 'image_embeds_proj'])
    print(features_image.image_embeds.shape)
    # torch.Size([1, 197, 768])
    print(features_image.image_embeds_proj.shape)
    # torch.Size([1, 197, 256])

    features_text = model.extract_features(sample, mode="text")
    print(features_text.keys())
    # odict_keys(['text_embeds', 'text_embeds_proj'])
    print(features_text.text_embeds.shape)
    # torch.Size([1, 12, 768])
    print(features_text.text_embeds_proj.shape)
    # torch.Size([1, 12, 256])

    similarity = features_image.image_embeds_proj[:, 0, :] @ features_text.text_embeds_proj[:, 0, :].t()
    print(similarity)
    # tensor([[0.2622]])
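
To make the multimodal classification use case above concrete, here is a minimal sketch that feeds the multimodal CLS embedding into a task-specific linear head. The three-class head is purely hypothetical; training it is outside the scope of this guide.

.. code-block:: python

    import torch.nn as nn

    # hypothetical 3-way classification head on top of the 768-d multimodal CLS embedding
    num_classes = 3
    classifier = nn.Linear(768, num_classes).to(device)

    cls_embedding = features_multimodal.multimodal_embeds[:, 0, :]  # shape: [1, 768]
    logits = classifier(cls_embedding)
    print(logits.shape)
    # torch.Size([1, 3])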
|
|
Since LAVIS supports a unified feature extraction interface, minimal changes are necessary to use a different model as a feature extractor. For example,
to use ALBEF as the feature extractor, one only needs to change the following line:

.. code-block:: python

    model, vis_processors, txt_processors = load_model_and_preprocess(name="albef_feature_extractor", model_type="base", is_eval=True, device=device)
|
|
Similarly, to use CLIP as a feature extractor:

.. code-block:: python

    model, vis_processors, txt_processors = load_model_and_preprocess(name="clip_feature_extractor", model_type="base", is_eval=True, device=device)
    # model, vis_processors, txt_processors = load_model_and_preprocess(name="clip_feature_extractor", model_type="RN50", is_eval=True, device=device)
    # model, vis_processors, txt_processors = load_model_and_preprocess(name="clip_feature_extractor", model_type="ViT-L-14", is_eval=True, device=device)