Adding Models |
####################################
|
|
This is a tutorial on adding new models using the ``lavis.models`` module.

The LAVIS library includes a standard model module that builds the foundation for many major language-vision models, such as `ALBEF <https://arxiv.org/pdf/2107.07651.pdf>`_,
`BLIP <https://arxiv.org/pdf/2201.12086.pdf>`_, `ALPRO <https://arxiv.org/pdf/2112.09583.pdf>`_, and `CLIP <https://arxiv.org/pdf/2103.00020.pdf>`_.
The ``lavis.models`` module is designed so that any new model can be added and integrated into the LAVIS library with minimal steps to develop training and testing procedures.
In this tutorial, we replicate the steps to add a GPT-style model for `video-grounded dialogue tasks <https://arxiv.org/pdf/1901.09107.pdf>`_.
|
|
Base Model ``lavis.models.base_model``
**************************************************************

Note that any new model definition should inherit the base model class ``BaseModel``:

.. code-block:: python
|
|
    from omegaconf import OmegaConf

    import numpy as np

    import torch
    import torch.nn as nn

    from lavis.common.utils import get_abs_path


    class BaseModel(nn.Module):
        """Base class for models."""

        def __init__(self):
            super().__init__()

        def forward_features(self, *args, **kwargs):
            """Similar to *forward* but only return features."""
            raise NotImplementedError

        def load_from_pretrained(self, url_or_filename):
            raise NotImplementedError

        @classmethod
        def _from_config(cls, cfg=None, model_type="base"):
            if not cfg:
                # useful when building a model without a provided configuration file
                cfg = OmegaConf.load(cls.default_config_path(model_type)).model

            return cls.from_config(cfg)

        @classmethod
        def from_pretrained(cls, model_type="base"):
            """
            Build a pretrained model from the default configuration file, specified by model_type.
            """
            return cls._from_config(cfg=None, model_type=model_type)

        @property
        def device(self):
            # device of the first model parameter
            return list(self.parameters())[0].device

        @classmethod
        def default_config_path(cls, model_type="base"):
            assert (
                model_type in cls.PRETRAINED_MODEL_CONFIG_DICT
            ), "Unknown model type {}".format(model_type)
            return get_abs_path(cls.PRETRAINED_MODEL_CONFIG_DICT[model_type])

        def before_evaluation(self, **kwargs):
            pass

        def show_n_params(self, return_str=True):
            # count the total number of model parameters
            tot = 0
            for p in self.parameters():
                w = 1
                for x in p.shape:
                    w *= x
                tot += w
            if return_str:
                if tot >= 1e6:
                    return "{:.1f}M".format(tot / 1e6)
                else:
                    return "{:.1f}K".format(tot / 1e3)
            else:
                return tot
|
|
In this base model, we already declare and standardize many common methods, such as ``_from_config`` and ``from_pretrained``.
Inheriting this base model class allows us to standardize operations across all model classes while still allowing customizations.
We advise users not to change the implementation of the base model class, as this will affect all existing model subclasses.
|
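For illustration, the standardized factory methods can be exercised as follows. This is a minimal sketch: ``MyModel`` is a hypothetical ``BaseModel`` subclass whose ``PRETRAINED_MODEL_CONFIG_DICT`` registers a default configuration under the ``base`` model type.

.. code-block:: python

    # A sketch, assuming MyModel subclasses BaseModel and registers a default
    # configuration path under the "base" model type (hypothetical names).
    model = MyModel.from_pretrained(model_type="base")

    # Utilities inherited from BaseModel:
    print(model.show_n_params())  # formatted parameter count, e.g. "123.4M"
    print(model.device)           # device of the first model parameter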
|
GPT-style Video-grounded Dialogue Model ``lavis.models.gpt_models.gpt_dialogue``
********************************************************************************

In this step, we define a new model class, e.g. under ``lavis.models.gpt_models.gpt_dialogue``, for GPT-based dialogue models designed specifically for video-grounded dialogues.
Note that we assume the model class inherits from the standard super class ``GPT2LMHeadModel`` from the ``transformers`` `library <https://huggingface.co/docs/transformers/index>`_.
We also enforce integration with the LAVIS framework by inheriting ``BaseModel`` from the LAVIS library as the secondary super class.

.. code-block:: python
|
|
    import math

    import torch
    import torch.nn as nn
    from torch.nn import CrossEntropyLoss, MSELoss

    from transformers import GPT2Model, GPT2LMHeadModel
    from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions

    from lavis.common.registry import registry
    from lavis.models.base_model import BaseModel


    @registry.register_model("gpt_dialogue")
    class GPTDialogue(GPT2LMHeadModel, BaseModel):
        ...
|
|
Next, we can modify the model architecture during initialization to fit the task of interest, i.e. video-grounded dialogues.
In this case, we add parameters for a linear network that transforms the video feature representations to the model dimension.

.. code-block:: python
|
|
    class GPTDialogue(GPT2LMHeadModel, BaseModel):

        def __init__(self, config, len_video_ft=4224):

            super().__init__(config)

            # linear layer projecting video features to the model dimension
            self.video_ff = nn.Linear(len_video_ft, config.n_embd)

            # Model parallel
            self.model_parallel = False
            self.device_map = None

            # Initialize weights and apply final processing
            self.post_init()
|
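As a quick sanity check of the new projection layer (a hypothetical snippet, not part of the LAVIS codebase; ``4224`` matches the video feature size used later in this tutorial), we can confirm that video features are mapped to the GPT-2 embedding dimension:

.. code-block:: python

    import torch
    from transformers import GPT2Config

    config = GPT2Config()  # n_embd defaults to 768
    model = GPTDialogue(config, len_video_ft=4224)

    video_fts = torch.randn(2, 10, 4224)  # (batch, num. video features, feature dim)
    video_embs = model.video_ff(video_fts)
    print(video_embs.shape)  # torch.Size([2, 10, 768])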
|
Note that for each new model class, we advise redefining the ``from_config`` method inherited from the ``BaseModel`` class.
As each model usually has its own unique configurations, redefining this method ensures that model instances are created properly.
For instance, ``GPTDialogue`` requires an additional parameter for the video feature length (``len_video_ft``), which should be part of the model initialization procedure.
Another additional parameter is the number of tokens/words, as we include additional special tokens in the vocabulary for dialogue tasks.

.. code-block:: python
|
|
    class GPTDialogue(GPT2LMHeadModel, BaseModel):
        ...

        @classmethod
        def from_config(cls, cfg):
            # initialize from pretrained GPT-2 weights, then resize the token
            # embeddings to account for the added special tokens
            model = cls.from_pretrained('gpt2', len_video_ft=cfg['len_video_ft'])
            model.resize_token_embeddings(cfg['len_tokenizer'])
            return model
|
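A sketch of how this ``from_config`` method might be exercised directly; the values mirror the ``gpt_dialogue_base.yaml`` configuration shown later in this tutorial:

.. code-block:: python

    from omegaconf import OmegaConf

    cfg = OmegaConf.create(
        {
            "len_video_ft": 4224,    # video feature dimension
            "len_tokenizer": 50264,  # GPT-2 vocabulary plus extra special tokens
        }
    )
    model = GPTDialogue.from_config(cfg)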
|
Other basic methods should also be defined explicitly in the new model class, including the ``forward`` function.
For instance, in GPT models for video-grounded dialogue tasks, we want the forward operation to also include the transformation and integration of video features before passing the representations to the Transformer layers:

.. code-block:: python
|
|
    class GPTDialogue(GPT2LMHeadModel, BaseModel):
        ...

        def forward(self, samples,
                    past_key_values=None,
                    position_ids=None,
                    head_mask=None,
                    encoder_hidden_states=None,
                    encoder_attention_mask=None,
                    use_cache=None,
                    output_attentions=None,
                    output_hidden_states=None,
                    return_dict=None):

            # embed text tokens, project video features to the model dimension,
            # and prepend the video representations to the token embeddings
            input_embs = self.transformer.wte(samples['input_ids'])
            video_embs = self.video_ff(samples['video_fts'])
            input_embs = torch.cat([video_embs, input_embs], dim=1)

            transformer_outputs = self.transformer(
                attention_mask=samples['attn_mask'],
                token_type_ids=samples['token_type_ids'],
                inputs_embeds=input_embs,
                position_ids=position_ids,
                head_mask=head_mask,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask,
                use_cache=use_cache,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )
            hidden_states = transformer_outputs[0]

            lm_logits = self.lm_head(hidden_states)
            ...
|
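Since the video embeddings are prepended to the token embeddings, the ``attn_mask`` and ``token_type_ids`` entries in ``samples`` must cover the concatenated video-and-text sequence. Below is a hypothetical sketch of assembling such a batch; the field names follow the ``forward`` signature above, while the shapes and token type values are illustrative assumptions:

.. code-block:: python

    import torch

    batch_size, num_video_fts, num_text_tokens = 2, 10, 16

    samples = {
        "input_ids": torch.randint(0, 50257, (batch_size, num_text_tokens)),
        "video_fts": torch.randn(batch_size, num_video_fts, 4224),
        # mask over the full (video + text) sequence; 1 = attend
        "attn_mask": torch.ones(batch_size, num_video_fts + num_text_tokens),
        # segment ids distinguishing video positions from text positions
        "token_type_ids": torch.cat(
            [
                torch.zeros(batch_size, num_video_fts, dtype=torch.long),
                torch.ones(batch_size, num_text_tokens, dtype=torch.long),
            ],
            dim=1,
        ),
    }
    outputs = model(samples)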
|
Registering New Model ``lavis.models.__init__``
********************************************************************************

Any new model must be officially registered as part of the ``lavis.models`` module.
For instance, to add a model class for GPT-based dialogue models, we can modify ``__init__.py`` as follows:

.. code-block:: python
|
|
    from lavis.models.gpt_models.gpt_dialogue import GPTDialogue

    __all__ = [
        ...
        "GPTDialogue",
    ]
|
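Once registered, the class can be retrieved by its registry name; this is how LAVIS resolves the ``arch`` field of a configuration file (see the next section) to a model class. A brief sketch:

.. code-block:: python

    from lavis.common.registry import registry

    model_cls = registry.get_model_class("gpt_dialogue")
    assert model_cls is GPTDialogue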
|
Assigning Model
********************************************************************************

From the above example of a model class, note that we define a ``from_config`` method for the new model class.
This method processes a configuration file and passes specific parameters to initialize the model class properly.
To do this, we reference the correct model class by its registered name in a configuration file.
For instance, the following should be specified in a configuration file, e.g. ``dialogue_avsd_ft.yaml``:

.. code-block:: yaml
|
|
    model:
      arch: gpt_dialogue # name of the model
      model_type: base
|
|
Subsequently, any process (e.g. training) can load this configuration file to assign the correct model:

.. code-block:: sh
|
|
    python train.py --cfg-path dialogue_avsd_ft.yaml
|
|
Note that to simplify the model configuration, we only enable two main parameters here: ``arch`` and ``model_type``.
``arch`` refers to the model class registry name, and ``model_type`` is the corresponding model type within this model family.
For instance, ``gpt_dialogue`` has a ``base`` model type, which has its own configuration in a separate configuration file, e.g. ``gpt_dialogue_base.yaml``:

.. code-block:: yaml
|
|
    model:
      arch: gpt_dialogue
      len_tokenizer: 50264 # 50257 tokens from the default gpt2 tokenizer + additional special tokens
      len_video_ft: 4224 # i3d_rgb: 2048, i3d_flow: 2048, vggish: 128
|
|
We can load this configuration and pass the parameters to the above ``from_config`` method to initialize the model accordingly.
We advise users to maintain a dictionary of default paths to model configurations in the model class definition.
By default, the LAVIS framework searches for configurations in each model class under ``model.PRETRAINED_MODEL_CONFIG_DICT``:

.. code-block:: python
|
|
    class GPTDialogue(GPT2LMHeadModel, BaseModel):
        PRETRAINED_MODEL_CONFIG_DICT = {
            "base": "configs/models/gpt_dialogue_base.yaml"
        }
        ...
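
With the default configuration path in place, the model can be built end to end through the standardized ``BaseModel`` machinery. Below is a sketch, assuming ``configs/models/gpt_dialogue_base.yaml`` exists as shown above. Note that we use ``_from_config`` rather than ``from_pretrained`` here: since ``GPTDialogue`` lists ``GPT2LMHeadModel`` first among its super classes, ``from_pretrained`` resolves to the ``transformers`` version rather than the ``BaseModel`` one.

.. code-block:: python

    from lavis.common.registry import registry

    # resolve the registered class and build it from its default "base" config
    model_cls = registry.get_model_class("gpt_dialogue")
    model = model_cls._from_config(model_type="base")

    print(model.show_n_params())  # parameter count helper inherited from BaseModel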