Diff of /docs/tutorial.models.rst [000000] .. [dc40d0]

Adding Models
####################################

This is a tutorial on adding new models using the ``lavis.models`` module.

The LAVIS library includes a standard model module that builds the foundation for many major language-vision models such as `ALBEF <https://arxiv.org/pdf/2107.07651.pdf>`_,
`BLIP <https://arxiv.org/pdf/2201.12086.pdf>`_, `ALPRO <https://arxiv.org/pdf/2112.09583.pdf>`_, and `CLIP <https://arxiv.org/pdf/2103.00020.pdf>`_.
The ``lavis.models`` module is designed so that new models can be added and integrated into the LAVIS library with minimal steps to develop training and testing procedures.
In this tutorial, we will walk through the steps to add a GPT-style model for `video-grounded dialogue tasks <https://arxiv.org/pdf/1901.09107.pdf>`_.

Base Model ``lavis.models.base_model``
**************************************************************

Note that any new model definition should inherit the base model class ``BaseModel``:

.. code-block:: python

    from omegaconf import OmegaConf

    import numpy as np

    import torch
    import torch.nn as nn

    from lavis.common.utils import get_abs_path

    class BaseModel(nn.Module):
        """Base class for models."""

        def __init__(self):
            super().__init__()

        def forward_features(self, *args, **kwargs):
            """Similar to *forward* but only return features."""
            raise NotImplementedError

        def load_from_pretrained(self, url_or_filename):
            raise NotImplementedError

        @classmethod
        def _from_config(cls, cfg=None, model_type="base"):
            if not cfg:
                # useful when building model without a provided configuration file
                cfg = OmegaConf.load(cls.default_config_path(model_type)).model

            return cls.from_config(cfg)

        @classmethod
        def from_pretrained(cls, model_type="base"):
            """
            Build a pretrained model from the default configuration file, specified by model_type.
            """
            return cls._from_config(cfg=None, model_type=model_type)

        @property
        def device(self):
            return list(self.parameters())[0].device

        @classmethod
        def default_config_path(cls, model_type="base"):
            assert (
                model_type in cls.PRETRAINED_MODEL_CONFIG_DICT
            ), "Unknown model type {}".format(model_type)
            return get_abs_path(cls.PRETRAINED_MODEL_CONFIG_DICT[model_type])

        def before_evaluation(self, **kwargs):
            pass

        def show_n_params(self, return_str=True):
            tot = 0
            for p in self.parameters():
                w = 1
                for x in p.shape:
                    w *= x
                tot += w
            if return_str:
                if tot >= 1e6:
                    return "{:.1f}M".format(tot / 1e6)
                else:
                    return "{:.1f}K".format(tot / 1e3)
            else:
                return tot

This base model already declares and standardizes many common methods, such as ``_from_config`` and ``from_pretrained``.
Inheriting from it keeps model operations consistent across all model classes while still allowing customization.
We advise users not to change the implementation of the base model class, as this will affect all existing model subclasses.

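To make the call chain concrete, here is a minimal sketch of how ``from_pretrained`` reaches a subclass's ``from_config``. It is a toy stand-in rather than LAVIS code: ``ToyBase``, ``ToyModel``, ``DEFAULT_CONFIGS`` and ``hidden_size`` are hypothetical names, and the default configuration is an inline dictionary instead of a YAML file loaded with ``OmegaConf``.

.. code-block:: python

    class ToyBase:
        """Toy stand-in for the role BaseModel plays (hypothetical, no torch needed)."""

        # Hypothetical replacement for PRETRAINED_MODEL_CONFIG_DICT: it holds the
        # configurations inline instead of paths to YAML files.
        DEFAULT_CONFIGS = {}

        @classmethod
        def from_config(cls, cfg):
            # Each subclass is expected to provide its own version.
            raise NotImplementedError

        @classmethod
        def _from_config(cls, cfg=None, model_type="base"):
            if not cfg:
                # No configuration given: fall back to the default for this type.
                cfg = cls.DEFAULT_CONFIGS[model_type]
            return cls.from_config(cfg)

        @classmethod
        def from_pretrained(cls, model_type="base"):
            return cls._from_config(cfg=None, model_type=model_type)


    class ToyModel(ToyBase):
        DEFAULT_CONFIGS = {"base": {"hidden_size": 8}}

        def __init__(self, hidden_size):
            self.hidden_size = hidden_size

        @classmethod
        def from_config(cls, cfg):
            # The only model-specific step: map configuration keys to arguments.
            return cls(hidden_size=cfg["hidden_size"])


    default_model = ToyModel.from_pretrained("base")
    custom_model = ToyModel._from_config({"hidden_size": 3})
    print(default_model.hidden_size, custom_model.hidden_size)  # 8 3

The subclass only supplies ``from_config`` and its default configuration; the shared entry points stay in the base class, which is why editing the base class affects every subclass.
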
GPT-style Video-grounded Dialogue Model ``lavis.models.gpt_models.gpt_dialogue``
********************************************************************************

In this step, we define a new model class, e.g. under ``lavis.models.gpt_models.gpt_dialogue``, for GPT-based dialogue models designed specifically for video-grounded dialogues.
Note that we assume the model class inherits from the standard model superclass ``GPT2LMHeadModel`` from the ``transformers`` `library <https://huggingface.co/docs/transformers/index>`_.
We also integrate the model into the LAVIS framework by inheriting from the LAVIS ``BaseModel`` as the secondary superclass.

.. code-block:: python

    import math

    import torch
    import torch.nn as nn
    from torch.nn import CrossEntropyLoss, MSELoss
    from transformers import GPT2Model, GPT2LMHeadModel
    from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions

    from lavis.common.registry import registry
    from lavis.models.base_model import BaseModel

    # Register the class under the name used for ``arch`` in configuration files.
    @registry.register_model("gpt_dialogue")
    class GPTDialogue(GPT2LMHeadModel, BaseModel):
        ...

Next, we can modify the model architecture during initialization to fit the task of interest, i.e. video-grounded dialogues.
In this case, we add a linear layer that projects the video feature representations to the model dimension.

.. code-block:: python

    class GPTDialogue(GPT2LMHeadModel, BaseModel):

        def __init__(self, config, len_video_ft=4224):

            super().__init__(config)

            # Project video features (length len_video_ft) to the GPT-2 embedding size
            self.video_ff = nn.Linear(len_video_ft, config.n_embd)

            # Model parallel
            self.model_parallel = False
            self.device_map = None

            # Initialize weights and apply final processing
            self.post_init()

Note that for each new model class, we advise defining the ``from_config`` method, which ``BaseModel._from_config`` calls to build the model.
As each model usually has its own configuration, defining this method ensures that model instances are created properly.
For instance, ``GPTDialogue`` requires an additional parameter for the video feature length (``len_video_ft``), which should be part of the model initialization procedure.
A second parameter is the vocabulary size (``len_tokenizer``), since we add special tokens to the vocabulary for dialogue tasks.

.. code-block:: python

    class GPTDialogue(GPT2LMHeadModel, BaseModel):
        ...
        @classmethod
        def from_config(cls, cfg):
            # Load GPT-2 weights and pass the video feature length to __init__
            model = cls.from_pretrained('gpt2', len_video_ft=cfg['len_video_ft'])
            # Grow the embedding matrix to cover the added special tokens
            model.resize_token_embeddings(cfg['len_tokenizer'])
            return model

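Both superclasses define a ``from_preprocess``-style entry point named ``from_pretrained``, so it helps to see which one ``from_config`` actually calls. Because ``GPT2LMHeadModel`` is listed first, Python's method resolution order picks its version, not the one from ``BaseModel``. The sketch below shows this with toy classes; ``HubModel``, ``FrameworkBase`` and ``Dialogue`` are hypothetical stand-ins, and the toy ``from_pretrained`` simply forwards extra keyword arguments to the constructor, which is an assumption of this sketch.

.. code-block:: python

    class HubModel:
        """Toy stand-in for the primary superclass (the role GPT2LMHeadModel plays)."""

        def __init__(self, name, **kwargs):
            self.name = name
            self.extra = kwargs

        @classmethod
        def from_pretrained(cls, name, **kwargs):
            # Assumption of this sketch: extra keyword arguments reach __init__.
            return cls(name, **kwargs)


    class FrameworkBase:
        """Toy stand-in for the secondary superclass (the role BaseModel plays)."""

        @classmethod
        def from_pretrained(cls, model_type="base"):
            raise RuntimeError("not reached: HubModel comes first in the MRO")


    class Dialogue(HubModel, FrameworkBase):
        @classmethod
        def from_config(cls, cfg):
            # Resolves to HubModel.from_pretrained because HubModel is listed first.
            return cls.from_pretrained("gpt2", len_video_ft=cfg["len_video_ft"])


    model = Dialogue.from_config({"len_video_ft": 4224})
    print([c.__name__ for c in Dialogue.__mro__])
    # ['Dialogue', 'HubModel', 'FrameworkBase', 'object']
    print(model.name, model.extra)
    # gpt2 {'len_video_ft': 4224}

Swapping the order of the two superclasses would change which ``from_pretrained`` is used, so the order in the class definition matters.
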
Other basic methods should also be defined explicitly in the new model class, including the ``forward`` function.
For instance, in GPT models for video-grounded dialogue tasks, we want the forward operation to also transform and integrate the video features before passing the representations to the Transformer layers.

.. code-block:: python

    class GPTDialogue(GPT2LMHeadModel, BaseModel):
        ...

        def forward(self, samples,
                    past_key_values=None,
                    position_ids=None,
                    head_mask=None,
                    encoder_hidden_states=None,
                    encoder_attention_mask=None,
                    use_cache=None,
                    output_attentions=None,
                    output_hidden_states=None,
                    return_dict=None):

            # Embed the text tokens and project the video features to the same size
            input_embs = self.transformer.wte(samples['input_ids'])
            video_embs = self.video_ff(samples['video_fts'])
            # Prepend the video embeddings to the token embeddings (sequence axis)
            input_embs = torch.cat([video_embs, input_embs], dim=1)

            transformer_outputs = self.transformer(
                attention_mask=samples['attn_mask'],
                token_type_ids=samples['token_type_ids'],
                inputs_embeds=input_embs,
                position_ids=position_ids,
                head_mask=head_mask,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask,
                use_cache=use_cache,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )
            hidden_states = transformer_outputs[0]

            # Map the hidden states to vocabulary logits
            lm_logits = self.lm_head(hidden_states)
            ...

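The shape bookkeeping of this fusion step can be checked in isolation. The following is a minimal sketch with toy sizes, not the LAVIS implementation: it assumes ``video_fts`` has shape ``(batch, frames, len_video_ft)`` and ``input_ids`` has shape ``(batch, tokens)``, and the standalone ``video_ff`` and ``wte`` layers only mimic the ones used in ``forward``.

.. code-block:: python

    import torch
    import torch.nn as nn

    # Toy sizes for illustration; the tutorial uses len_video_ft=4224.
    batch_size, n_frames, n_tokens = 2, 3, 5
    len_video_ft, n_embd, vocab_size = 16, 8, 20

    video_ff = nn.Linear(len_video_ft, n_embd)  # plays the role of self.video_ff
    wte = nn.Embedding(vocab_size, n_embd)      # plays the role of self.transformer.wte

    samples = {
        "video_fts": torch.randn(batch_size, n_frames, len_video_ft),
        "input_ids": torch.randint(0, vocab_size, (batch_size, n_tokens)),
    }

    video_embs = video_ff(samples["video_fts"])         # (batch, frames, n_embd)
    input_embs = wte(samples["input_ids"])              # (batch, tokens, n_embd)
    fused = torch.cat([video_embs, input_embs], dim=1)  # (batch, frames + tokens, n_embd)

    # The fused sequence is longer than the text alone, so a mask passed with it
    # has to cover the video positions as well as the text positions.
    attn_mask = torch.ones(batch_size, n_frames + n_tokens, dtype=torch.long)

    print(tuple(fused.shape))      # (2, 8, 8)
    print(tuple(attn_mask.shape))  # (2, 8)

Under these assumptions, ``attn_mask`` and ``token_type_ids`` in ``samples`` presumably need to be built for the combined length of video frames plus text tokens.
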
Registering New Model ``lavis.models.__init__``
********************************************************************************

Any new model must be officially registered as part of the ``lavis.models`` module.
For instance, to add a model class for GPT-based dialogue models, we can modify the ``__init__.py`` as follows:

.. code-block:: python

    from lavis.models.gpt_models.gpt_dialogue import GPTDialogue

    __all__ = [
        ...
        "GPTDialogue"
    ]

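Importing the class in ``__init__.py`` also runs the ``@registry.register_model`` decorator on the class definition, since Python applies decorators when a module is first imported. A minimal sketch of how a name-to-class registry of this kind can work is shown below. It is a toy illustration, not the LAVIS registry: ``ToyRegistry``, ``toy_registry`` and ``ToyGPTDialogue`` are hypothetical names.

.. code-block:: python

    class ToyRegistry:
        """Toy name-to-class registry (hypothetical, for illustration only)."""

        def __init__(self):
            self._models = {}

        def register_model(self, name):
            def wrap(cls):
                # Refuse duplicate names so that a lookup is never ambiguous.
                if name in self._models:
                    raise KeyError("'{}' is already registered".format(name))
                self._models[name] = cls
                return cls  # the class itself is returned unchanged

            return wrap

        def get_model_class(self, name):
            # Returns None when nothing was registered under this name.
            return self._models.get(name)


    toy_registry = ToyRegistry()


    @toy_registry.register_model("gpt_dialogue")
    class ToyGPTDialogue:
        pass


    print(toy_registry.get_model_class("gpt_dialogue") is ToyGPTDialogue)  # True
    print(toy_registry.get_model_class("unknown"))                         # None

With such a registry, a string from a configuration file is enough to find the model class, provided the module defining the class has been imported.
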
Assigning Model
********************************************************************************

In the model class example above, note that we define a ``from_config`` method for the new model class.
This method processes a configuration file and passes specific parameters to initialize the model class properly.
To do this, we associate the registered name of the model class with the model in a configuration file.
For instance, the following should be specified in a configuration file, e.g. ``dialogue_avsd_ft.yaml``:

.. code-block:: yaml

    model:
      arch: gpt_dialogue # name of the model 
      model_type: base

Subsequently, any process (e.g. training) should load this configuration file to assign the correct model.

.. code-block:: sh

    python train.py --cfg-path dialogue_avsd_ft.yaml

Note that to simplify the model configuration, we only enable two main parameters here: ``arch`` and ``model_type``. ``arch`` is the name under which the model class is registered, and ``model_type`` selects a model variant within that model family.
For instance, with ``gpt_dialogue``, we have a model ``base`` which has its own configuration in a separate configuration file, e.g. ``gpt_dialogue_base.yaml``:

.. code-block:: yaml

    model:
      arch: gpt_dialogue
      len_tokenizer: 50264 # 50257 tokens from gpt2 default tokenizer + additional special tokens       
      len_video_ft: 4224 # i3d_rgb: 2048 i3d_flow: 2048 vggish: 128 

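The two numbers in this file follow from the comments next to them. The short check below reproduces the arithmetic; it assumes the three feature types are concatenated into one vector per frame, and the count of special tokens is inferred from the two vocabulary sizes rather than stated in the configuration.

.. code-block:: python

    # Feature sizes taken from the comment on ``len_video_ft``.
    feature_dims = {"i3d_rgb": 2048, "i3d_flow": 2048, "vggish": 128}
    len_video_ft = sum(feature_dims.values())

    # Vocabulary sizes taken from the comment on ``len_tokenizer``.
    gpt2_vocab_size = 50257
    len_tokenizer = 50264
    n_special_tokens = len_tokenizer - gpt2_vocab_size

    print(len_video_ft)      # 4224
    print(n_special_tokens)  # 7

If a different set of video features or special tokens is used, both values in the configuration file need to change accordingly.
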
We can load this configuration and pass the parameters to the above ``from_config`` method to initialize the model accordingly.
We advise users to maintain a dictionary in the model class definition that contains default paths to model configurations.
By default, the LAVIS framework looks up these paths in the ``PRETRAINED_MODEL_CONFIG_DICT`` attribute of each model class.

.. code-block:: python

    class GPTDialogue(GPT2LMHeadModel, BaseModel):
        # Maps each model_type to its default configuration file
        PRETRAINED_MODEL_CONFIG_DICT = {
                "base": "configs/models/gpt_dialogue_base.yaml"
            }
        ...
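
The lookup that ``default_config_path`` performs on this dictionary can be sketched as follows. This is a simplified illustration: ``ToyDialogue``, ``LIBRARY_ROOT`` and ``get_abs_path_sketch`` are hypothetical, and the helper only mimics the idea of resolving a path relative to the library root.

.. code-block:: python

    import posixpath

    LIBRARY_ROOT = "/path/to/lavis"  # hypothetical install location


    def get_abs_path_sketch(rel_path):
        """Hypothetical helper: join a relative path onto the library root."""
        return posixpath.join(LIBRARY_ROOT, rel_path)


    class ToyDialogue:
        PRETRAINED_MODEL_CONFIG_DICT = {
            "base": "configs/models/gpt_dialogue_base.yaml",
        }

        @classmethod
        def default_config_path(cls, model_type="base"):
            # Same check as in the base class: only known model types are accepted.
            assert (
                model_type in cls.PRETRAINED_MODEL_CONFIG_DICT
            ), "Unknown model type {}".format(model_type)
            return get_abs_path_sketch(cls.PRETRAINED_MODEL_CONFIG_DICT[model_type])


    print(ToyDialogue.default_config_path("base"))
    # /path/to/lavis/configs/models/gpt_dialogue_base.yaml

    try:
        ToyDialogue.default_config_path("large")
    except AssertionError as err:
        print(err)  # Unknown model type large

Adding a new variant then amounts to adding one more key to the dictionary and one more configuration file.
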