Example on Finetuning BLIP on COCO-Captioning
################################################
To finetune the BLIP model on the COCO captioning dataset, first refer to :ref:`prep coco` to prepare the dataset if you have not already done so.
To finetune the model, we have prepared a run script, which can be launched as follows:
.. code-block:: bash
    bash run_scripts/blip/train/train_caption_coco_large.sh
This will finetune the pre-trained BLIP large model into a new model that can be used for captioning.
Deep Dive
**********
Now let's take a closer look at the script and see what it does.
.. code-block:: bash
    python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip/train/caption_coco_large_ft.yaml
As can be seen, the script simply calls :code:`train.py` with PyTorch distributed training enabled across 8 GPUs.
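
If you have fewer GPUs, adjust :code:`--nproc_per_node` accordingly; for example, to train on a single GPU (you may then also need to adapt the batch size or learning rate in the config):

.. code-block:: bash

    python -m torch.distributed.run --nproc_per_node=1 train.py --cfg-path lavis/projects/blip/train/caption_coco_large_ft.yaml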
The :code:`--cfg-path` argument specifies the **runtime config** file to use. The config file is a YAML file that specifies the training parameters, shown as follows:
.. literalinclude:: ../lavis/projects/blip/train/caption_coco_large_ft.yaml
    :language: yaml
    :linenos:
The runtime config file is divided into 3 sections:

    - :code:`model`: specifies the model architecture and type to use.
    - :code:`datasets`: specifies the dataset(s) to use.
    - :code:`run`: specifies the runner arguments, such as the task, optimizer, learning rate scheduler, etc.
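
Schematically, the runtime config has the following layout (a trimmed sketch; see the full file above for the actual fields and values):

.. code:: yaml

    model:
        arch: blip_caption
        model_type: large_coco
        # model-specific options ...

    datasets:
        coco_caption:
            # vis_processor, text_processor, ...

    run:
        task: captioning
        # optimizer, lr scheduler, logging, ...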
We describe each section in detail below.
Model configurations
=====================
.. literalinclude:: ../lavis/projects/blip/train/caption_coco_large_ft.yaml
    :language: yaml
    :linenos:
    :lines: 6-10
The :code:`arch` argument specifies the model architecture to use. In this case, we use the :code:`blip_caption` architecture.
You can find the available architectures by inspecting the :code:`model_zoo`.
Once the architecture is specified, the runner looks for the model class registered under that name and instantiates it.
In this case, :code:`BlipCaption` is the model class registered with the name :code:`blip_caption`.
The registry maintains a mapping from name strings to model classes.
This allows the runner to find the model class dynamically, based on the name string from the config file.
The following segment in :code:`lavis/models/blip_models/blip_caption.py` shows how :code:`BlipCaption` is registered with the name string :code:`blip_caption`:
.. literalinclude:: ../lavis/models/blip_models/blip_caption.py
    :language: python
    :linenos:
    :lines: 20-38
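
You can reproduce this lookup yourself. A minimal sketch, assuming the registry's :code:`get_model_class` helper:

.. code-block:: python

    from lavis.common.registry import registry

    # resolve the model class from its registered name string
    model_cls = registry.get_model_class("blip_caption")
    print(model_cls.__name__)  # BlipCaption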
The same model architecture may be pre-trained or finetuned on different datasets, or have different model configurations.
For example, :code:`BlipCaption` has:
    - :code:`base_coco`: pre-trained base BLIP model adapted for COCO captioning finetuning.
    - :code:`large_coco`: pre-trained large BLIP model adapted for COCO captioning finetuning.
Therefore, we also need to specify :code:`model_type`. Here we use :code:`large_coco`.
We set :code:`load_finetuned` to :code:`False` to indicate that we are finetuning the model from the pre-trained weights.
If :code:`load_finetuned` is set to :code:`True`, as it is by default, the model will instead load weights that were already finetuned on COCO captioning.
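
The same two strings are what you pass when loading the model programmatically. A sketch using :code:`load_model`, which with the default model config (:code:`load_finetuned: True`) fetches the weights already finetuned on COCO captioning:

.. code-block:: python

    from lavis.models import load_model

    # name and model_type mirror arch and model_type in the run config
    model = load_model(name="blip_caption", model_type="large_coco")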
Given the model architecture and type, the library will then look for the default model config for :code:`large_coco` in :code:`lavis/models/blip_models/blip_caption.py`.
As can be seen in the above code snippet, the corresponding config path is stored in :code:`BlipCaption.PRETRAINED_MODEL_CONFIG_DICT`. 
Then the library will load :code:`lavis/configs/models/blip_caption_large_coco.yaml` as the configuration to build the model.
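
In other words, the lookup behaves roughly as follows (a sketch, not the library's exact code; the stored path is relative to the LAVIS package root):

.. code-block:: python

    from lavis.models.blip_models.blip_caption import BlipCaption

    # maps each model type to its default config file
    print(BlipCaption.PRETRAINED_MODEL_CONFIG_DICT["large_coco"])
    # -> the path of lavis/configs/models/blip_caption_large_coco.yaml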
*Priority of Configs*: Note that the run config has a higher priority than the default model config, meaning that arguments in the run config will override the default model config.
For example, in the default model config, :code:`load_finetuned` is set to :code:`True` by default, while in the run config we set it to :code:`False`, so that finetuning starts from the pre-trained weights only.
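
The overriding fragment of the run config thus reads (see the full file above):

.. code:: yaml

    model:
        arch: blip_caption
        model_type: large_coco
        load_finetuned: False  # overrides the default True from the model config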
Dataset configurations
=========================
The second section of the config file specifies the dataset(s) to use.
.. literalinclude:: ../lavis/projects/blip/train/caption_coco_large_ft.yaml
    :language: yaml
    :linenos:
    :lines: 12-24
We associate each dataset with a :code:`vis_processor` and a :code:`text_processor`, responsible for processing the visual and textual input respectively.
Here we again use the registry mechanism to dynamically load the processor class based on the name string.
For example, :code:`blip_image_train` is the name string for the :code:`BlipImageTrainProcessor` class, which is registered in :code:`lavis/processors/blip_processors.py`.
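
This lookup can be reproduced manually as well. A minimal sketch, assuming the registry's :code:`get_processor_class` helper and that :code:`from_config` falls back to default settings when called without a config:

.. code-block:: python

    from lavis.common.registry import registry

    # resolve the processor class from its registered name string
    proc_cls = registry.get_processor_class("blip_image_train")
    vis_processor = proc_cls.from_config()  # defaults, since no config is given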
Similarly, the dataset name string is also registered in the registry, pointing to a dataset builder class, in this case :code:`COCOCapBuilder`.
By default, the builder will load the default dataset configuration as in :code:`DATASET_CONFIG_DICT`. You may also add new dataset types by adding new entries to the dictionary.
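
For instance, the following sketch builds the dataset with its default configuration, assuming the :code:`load_dataset` helper, which resolves the builder by its registered name:

.. code-block:: python

    from lavis.datasets.builders import load_dataset

    # resolves the "coco_caption" builder and builds its default splits
    coco_dataset = load_dataset("coco_caption")
    print(coco_dataset.keys())  # e.g. dict_keys(['train', 'val', 'test'])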
The dataset configuration used here is:
.. literalinclude:: ../lavis/configs/datasets/coco/defaults_cap.yaml
    :language: yaml
    :linenos:
    :lines: 6-28
In this configuration file, we specify the dataset name and, mainly, its building information.
The building information is divided into two parts: :code:`annotations` and :code:`images`. The annotation files will be automatically downloaded upon loading the dataset for the first time.
The :code:`images` part specifies the image root directory. This is a relative path to the cache directory, which is :code:`cache` by default.
If you have a local copy of the dataset, you can point to it by overwriting the :code:`images` part in the runtime config file. For example, you may alter the run config as below to use your local dataset copy:
.. code:: yaml
    datasets:
        coco_caption: # name of the dataset builder
            vis_processor:
                train:
                    name: "blip_image_train"
                eval:
                    name: "blip_image_eval"
            text_processor:
                train:
                    name: "blip_caption"
                    prompt: "a picture of "
                eval:
                    name: "blip_caption"
            images:
                YOUR_LOCAL_IMAGE_ROOT_DIR
LAVIS supports using multiple datasets for training. See an example in :code:`lavis/projects/blip/train/pretrain_14m.yaml`.
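
When multiple datasets are used, each is simply an additional entry under :code:`datasets`. Schematically (the second dataset name is illustrative):

.. code:: yaml

    datasets:
        coco_caption:
            # processors and build information ...
        sbu_caption:
            # processors and build information ...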
Runner configurations
=========================
The last section of the config file specifies the arguments for the runner, shown below:
.. literalinclude:: ../lavis/projects/blip/train/caption_coco_large_ft.yaml
    :language: yaml
    :linenos:
    :lines: 26-56
Here we specify runner-related arguments (a schematic example follows the list), including:

    - task-specific arguments, such as :code:`task`, :code:`max_len`, :code:`min_len`, etc.;
    - the optimizer and learning rate scheduler;
    - distributed training settings;
    - logging and checkpointing settings.
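
Schematically, the :code:`run` section looks like the following (field values are illustrative; see the full config above):

.. code:: yaml

    run:
        task: captioning
        # optimizer and learning rate scheduler
        lr_sched: linear_warmup_cosine_lr
        init_lr: 1e-5
        # generation arguments for the captioning task
        max_len: 20
        min_len: 5
        # distributed training
        distributed: True
        # logging and checkpointing
        output_dir: output/BLIP/Caption_coco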
Available Configurations
#########################
See :ref:`config` for the full list of available configurations and their descriptions.