Example on Finetuning BLIP on COCO-Captioning
################################################
|
|
To finetune the BLIP model on the COCO caption dataset, first refer to :ref:`prep coco` to prepare the dataset if you have not done so.
|
|
To finetune the model, we have prepared a run script, which can be executed as follows:

.. code-block:: bash

    bash run_scripts/blip/train/train_caption_coco_large.sh
|
|
This will finetune the pre-trained BLIP large model into a new model that can be used for captioning.
|
|
Deep Dive
**********
|
|
Now let's take a closer look at the script and see what it does.

.. code-block:: bash

    python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip/train/caption_coco_large_ft.yaml
|
|
As can be seen, the script simply calls :code:`train.py` with PyTorch distributed training enabled.
The :code:`--cfg-path` argument specifies the **runtime config** file to use. The config file is a YAML file that specifies the training parameters, shown as follows:
|
|
.. literalinclude:: ../lavis/projects/blip/train/caption_coco_large_ft.yaml
    :language: yaml
    :linenos:
|
|
The runtime config file is divided into 3 sections:

- :code:`model`: specifies the model architecture and type to use.
- :code:`datasets`: specifies the dataset(s) to use.
- :code:`run`: specifies the runner arguments, such as the task, optimizer, learning rate scheduler, etc.
|
|
We describe each section in detail below.
|
|
Model configurations
=====================
|
|
.. literalinclude:: ../lavis/projects/blip/train/caption_coco_large_ft.yaml
    :language: yaml
    :linenos:
    :lines: 6-10
|
|
The :code:`arch` argument specifies the model architecture to use. In this case, we use the :code:`blip_caption` architecture.
You can find the available architectures by inspecting the :code:`model_zoo`.
Once the architecture is specified, the runner looks up the model class registered under that name and instantiates it.
In this case, :code:`BlipCaption` is the model class registered under the name :code:`blip_caption`.
|
|
The registry maintains a mapping from name strings to model classes.
This allows the runner to find the model class dynamically, based on the name string in the config file.
The following segment in :code:`lavis/models/blip_models/blip_caption.py` shows how :code:`BlipCaption` is registered with the name string :code:`blip_caption`:
|
|
.. literalinclude:: ../lavis/models/blip_models/blip_caption.py
    :language: python
    :linenos:
    :lines: 20-38
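
The registration pattern can be sketched in plain Python. This is a minimal, self-contained illustration of the mechanism only; the names :code:`MODEL_REGISTRY`, :code:`register_model`, and :code:`get_model_class` are hypothetical stand-ins, simpler than LAVIS's actual registry API:

```python
# Minimal sketch of a name-string-to-class registry (illustrative only;
# MODEL_REGISTRY, register_model, and get_model_class are hypothetical
# names, much simpler than the real LAVIS registry).
MODEL_REGISTRY = {}

def register_model(name):
    """Decorator that maps a name string to the decorated class."""
    def decorator(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator

@register_model("blip_caption")
class BlipCaption:
    """Stand-in for the real BlipCaption model class."""

def get_model_class(name):
    # The runner resolves the class dynamically from the config's name string.
    return MODEL_REGISTRY[name]
```

With a mapping like this in place, a config entry such as :code:`arch: blip_caption` is enough for the runner to locate and instantiate the right class.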
|
|
The same model architecture may be pre-trained or finetuned on different datasets, or come with different model configurations.
For example, :code:`BlipCaption` has:

- :code:`base_coco`: a pre-trained base BLIP model adapted for COCO captioning finetuning.

- :code:`large_coco`: a pre-trained large BLIP model adapted for COCO captioning finetuning.
|
|
Therefore, we also need to specify :code:`model_type`. Here we use :code:`large_coco`.
We set :code:`load_finetuned` to :code:`False` to indicate that we are finetuning the model from the pre-trained weights.
If :code:`load_finetuned` is set to :code:`True`, as it is by default, the model instead loads weights already finetuned on COCO captioning.
|
|
Given the model architecture and type, the library looks up the default model config for :code:`large_coco` in :code:`lavis/models/blip_models/blip_caption.py`.
As can be seen in the above code snippet, the corresponding config path is stored in :code:`BlipCaption.PRETRAINED_MODEL_CONFIG_DICT`.
The library then loads :code:`lavis/configs/models/blip_caption_large_coco.yaml` as the configuration to build the model.
|
|
*Priority of Configs*: Note that the run config has higher priority than the default model config, meaning that arguments in the run config override those in the default model config.
For example, the default model config sets :code:`load_finetuned` to :code:`True`, while the run config sets it to :code:`False`, so that finetuning starts from the pre-trained weights only.
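
The priority rule can be sketched as a recursive dictionary merge. The helper :code:`merge_configs` below is hypothetical, shown only to illustrate the override behaviour; LAVIS's actual config handling (built on OmegaConf) is more involved:

```python
# Hypothetical sketch of run-config-over-default-config priority.
# merge_configs is an illustrative helper, not the LAVIS implementation.
def merge_configs(default, override):
    """Recursively merge `override` into `default`; override wins."""
    merged = dict(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged

default_model_cfg = {"arch": "blip_caption", "load_finetuned": True}
run_cfg = {"load_finetuned": False, "model_type": "large_coco"}

final_cfg = merge_configs(default_model_cfg, run_cfg)
# load_finetuned from the run config wins over the default.
assert final_cfg["load_finetuned"] is False
assert final_cfg["arch"] == "blip_caption"
```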
|
|
Dataset configurations
=========================
|
|
The second section of the config file specifies the dataset(s) to use.
|
|
.. literalinclude:: ../lavis/projects/blip/train/caption_coco_large_ft.yaml
    :language: yaml
    :linenos:
    :lines: 12-24
|
|
We associate each dataset with a :code:`vis_processor` and a :code:`text_processor`, responsible for processing the visual and textual inputs respectively.
Here we again use the registry mechanism to dynamically load the processor class based on its name string.
For example, :code:`blip_image_train` is the name string for the :code:`BlipImageTrainProcessor` class, which is registered in :code:`lavis/processors/blip_processors.py`.
|
|
Similarly, the dataset name string is also registered in the registry, pointing to the dataset builder class :code:`COCOCapBuilder`.
By default, the builder loads the default dataset configuration from :code:`DATASET_CONFIG_DICT`. You may also add new dataset types by adding new entries to this dictionary.
|
|
The dataset configuration used here is:
|
|
.. literalinclude:: ../lavis/configs/datasets/coco/defaults_cap.yaml
    :language: yaml
    :linenos:
    :lines: 6-28
|
|
In this configuration file, we specify the dataset name and, mainly, its build information.
The build information is divided into two parts: :code:`annotation` and :code:`images`. The annotation files are automatically downloaded the first time the dataset is loaded.
The :code:`images` part specifies the image root directory. This is a path relative to the cache directory, which is :code:`cache` by default. If you have a local copy of the dataset, you can point to it by
overriding the :code:`images` part in the runtime config file. For example, you may alter the run config as below to use your local dataset copy:
|
|
.. code:: yaml

    datasets:
      coco_caption: # name of the dataset builder
        vis_processor:
          train:
            name: "blip_image_train"
          eval:
            name: "blip_image_eval"
        text_processor:
          train:
            name: "blip_caption"
            prompt: "a picture of "
          eval:
            name: "blip_caption"
        images:
          YOUR_LOCAL_IMAGE_ROOT_DIR
|
|
LAVIS supports using multiple datasets for training. See an example in :code:`lavis/projects/blip/train/pretrain_14m.yaml`.
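
For illustration only, a multi-dataset setup amounts to listing several builders under :code:`datasets`; the entries below are placeholders and do not reproduce the contents of :code:`pretrain_14m.yaml`:

```yaml
# Illustrative placeholders only, not the actual pretrain_14m.yaml.
datasets:
  coco_caption:          # first dataset builder
    vis_processor:
      train:
        name: "blip_image_train"
    text_processor:
      train:
        name: "blip_caption"
  vg_caption:            # hypothetical second builder name
    vis_processor:
      train:
        name: "blip_image_train"
    text_processor:
      train:
        name: "blip_caption"
```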
|
|
Runner configurations
=========================
|
|
The last section of the config file specifies the arguments for the runner, shown below:

.. literalinclude:: ../lavis/projects/blip/train/caption_coco_large_ft.yaml
    :language: yaml
    :linenos:
    :lines: 26-56
|
|
Here we specify runner-related arguments, including:

- task-specific arguments, such as :code:`task`, :code:`max_len`, :code:`min_len`, etc.;
- the learning rate schedule and optimizer;
- distributed training settings;
- logging and checkpointing settings.
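
As a rough sketch (the keys below are typical of LAVIS run configs, but the values are placeholders rather than those in :code:`caption_coco_large_ft.yaml`), a run section looks like:

```yaml
# Illustrative run section; values are placeholders.
run:
  task: captioning                        # task-specific arguments
  max_len: 20
  min_len: 5
  lr_sched: "linear_warmup_cosine_lr"     # schedule and optimizer
  init_lr: 2e-6
  min_lr: 0
  weight_decay: 0.05
  max_epoch: 5
  batch_size_train: 16
  distributed: True                       # distributed training settings
  world_size: 8
  output_dir: "output/BLIP/Caption_coco"  # logging and checkpointing
  seed: 42
```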
|
|
Available Configurations
#########################
|
|
See :ref:`config` for the full list of available configurations and their descriptions.