# Cost-effective Instruction Learning for Pathology Vision and Language Analysis (CLOVER)

The advent of vision-language models fosters interactive conversation between AI-enabled models and humans. Yet applying these models in the clinic must contend with daunting challenges around large-scale training data and financial and computational resources. Here we propose a cost-effective instruction learning framework for conversational pathology, named CLOVER. CLOVER trains only a lightweight module with instruction tuning while keeping the parameters of the large language model frozen. Instead of using costly GPT-4, we propose well-designed prompts on GPT-3.5 for building generation-based instructions, emphasizing the utility of pathological knowledge derived from Internet sources. To augment the use of instructions, we construct a high-quality set of template-based instructions in the context of digital pathology. On two benchmark datasets, our findings reveal the strength of hybrid-form instructions for visual question answering in pathology. Extensive results show the cost-effectiveness of CLOVER in answering both open-ended and closed-ended questions, where CLOVER outperforms strong baselines that possess 37 times more training parameters and use instruction data generated by GPT-4. Through instruction tuning, CLOVER exhibits robust few-shot learning on an external clinical dataset. These findings demonstrate that the cost-effective modeling of CLOVER could accelerate the adoption of rapid conversational applications in the landscape of digital pathology.

## Release

- Checkpoints and instruction dataset will be released soon.

## Workflow of CLOVER

<p align="center">
<img src="imgs/image.png" width="90%"> <br>

*CLOVER employs the training framework of BLIP-2 to achieve fast domain tuning with lightweight parameters. The entire training process of CLOVER comprises two major stages: (i) alignment of vision and language and (ii) supervised fine-tuning with instructions. The alignment stage compels the model to acquire valuable joint representations of vision and language. Instruction fine-tuning is vital for activating the LLM to excel in visual question answering. Stage 1 requires image-text pairs as input, for which we use the large-scale Quilt-1M dataset. Stage 2 demands domain-specific instruction data. Given the significant lack of such instruction data in the literature, we propose a low-cost solution for instruction data generation carefully designed for analyzing pathological data.*
</p>

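To make the cost-effective design concrete, here is a minimal sketch of the parameter setup, assuming a BLIP-2-style model object with `visual_encoder`, `llm`, and `qformer` attributes (hypothetical names, not CLOVER's actual code): the vision encoder and the LLM stay frozen, and only the lightweight Q-Former receives gradient updates.

```python
import torch

# Minimal sketch of the parameter-efficient setup (hypothetical attribute names,
# not the actual CLOVER implementation): freeze the vision encoder and the LLM,
# and train only the lightweight bridging module (the Q-Former in BLIP-2 terms).
def freeze(module: torch.nn.Module) -> None:
    for p in module.parameters():
        p.requires_grad = False

def build_trainable_params(model):
    # `model.visual_encoder`, `model.llm`, and `model.qformer` are assumed
    # attribute names used for illustration only.
    freeze(model.visual_encoder)
    freeze(model.llm)
    return [p for p in model.qformer.parameters() if p.requires_grad]

# optimizer = torch.optim.AdamW(build_trainable_params(model), lr=1e-4)
```

Because only the Q-Former's parameters reach the optimizer, the trainable footprint remains a small fraction of the full model.
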
## Contents

- [Cost-effective Instruction Learning for Pathology Vision and Language Analysis (CLOVER)](#cost-effective-instruction-learning-for-pathology-vision-and-language-analysis-clover)
- [Release](#release)
- [Workflow of CLOVER](#workflow-of-clover)
- [Contents](#contents)
- [Data Download](#data-download)
- [Installation](#installation)
- [Training](#training)
- [Inference](#inference)
- [Case Study](#case-study)
- [Related Projects](#related-projects)

### Data Download

- Stage 1: The Quilt-1M dataset can be downloaded from [Google Forms](https://docs.google.com/forms/d/e/1FAIpQLSdSe06DIbPn71jA2rCxe_5tUPfyHhSH1Z7ZTJBxWM26cnpZFg/viewform) or [Zenodo](https://zenodo.org/records/8239942).
- Stage 2: The CLOVER instruction data will be released. You can also generate the data yourself using our prompts in [generate_instructions.py](./generate_instructions.py); an illustrative record format is sketched below.
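For orientation, a generation-based instruction pairs a pathology image with a question and its answer. The record below is purely a hypothetical sketch of such a format; the field names and values are illustrative assumptions, not the released CLOVER schema.

```python
# Hypothetical instruction record for illustration only; the actual schema of
# the released CLOVER instruction data may differ.
example_instruction = {
    "image": "path/to/pathology_patch.png",  # placeholder image path
    "question": "What tissue pattern is visible in this H&E-stained image?",
    "answer": "The image shows glandular structures consistent with ...",
}
```
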
### Installation

1. Creating the conda environment

```bash
conda create -n clover python=3.9
conda activate clover
```

2. Building from source

```bash
git clone https://github.com/JLINEkai/CLOVER.git
cd CLOVER
pip install -r requirements.txt
```

### Training

- Stage 1 (Alignment):

```bash
python train_blip2qformer.py
```

- Stage 2 (Instruction fine-tuning):

You can choose the large language model (LLM) in [pretrain_stage2.yaml](./lavis/projects/blip2/train/pretrain_stage2.yaml). We provide FlanT5XL and Vicuna 7B. The launch command below uses a single process (`--nproc_per_node=1`); increase this value to match the number of available GPUs for multi-GPU training.

```bash
python -m torch.distributed.run --nproc_per_node=1 train.py
```

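For readers unfamiliar with instruction fine-tuning, the following sketch shows the usual loss setup, in which prompt tokens are masked so that cross-entropy is computed only over response tokens. This is a generic pattern with made-up tensor sizes, not CLOVER's actual training code.

```python
import torch
import torch.nn.functional as F

# Generic instruction-tuning loss sketch (not CLOVER's actual code):
# labels copy the input token IDs, prompt positions are set to -100 so they
# are ignored by cross-entropy, and logits/labels are shifted for next-token
# prediction.
def instruction_tuning_loss(logits, input_ids, prompt_len):
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100          # ignore the instruction/prompt tokens
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )

# Toy example with made-up sizes.
vocab, seq = 100, 12
logits = torch.randn(1, seq, vocab)
input_ids = torch.randint(0, vocab, (1, seq))
loss = instruction_tuning_loss(logits, input_ids, prompt_len=5)
```
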
### Inference

```bash
python -m torch.distributed.run --nproc_per_node=1 evaluate.py --cfg-path lavis/projects/blip2/eval/vqav2_zeroshot_flant5xl_eval.yaml
```

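As background on scoring, pathology VQA results are commonly reported as exact-match accuracy for closed-ended (e.g., yes/no) questions and token-level recall for open-ended answers; this is a convention from the medical VQA literature and not necessarily CLOVER's exact evaluation protocol. A minimal sketch:

```python
# Common pathology-VQA scoring conventions, shown for orientation only;
# CLOVER's evaluation scripts may use a different protocol.
def closed_ended_accuracy(preds, golds):
    # Exact match after normalization (e.g., "yes"/"no" questions).
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, golds))
    return hits / len(golds)

def open_ended_recall(pred, gold):
    # Fraction of ground-truth tokens that appear in the prediction.
    gold_tokens = set(gold.lower().split())
    pred_tokens = set(pred.lower().split())
    return len(gold_tokens & pred_tokens) / max(len(gold_tokens), 1)

print(closed_ended_accuracy(["Yes", "no"], ["yes", "yes"]))  # 0.5
print(open_ended_recall("glandular structures with atypia", "glandular structures"))  # 1.0
```
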
## Case Study

<p align="center">
<img src="imgs/case1.png" width="90%"> <br>

*Qualitative comparisons of visual question answering on QUILT-VQA. (Image source: QUILT-VQA)*
</p>

<p align="center">
<img src="imgs/case2.png" width="90%"> <br>

*Qualitative comparisons of visual question answering on LLaVA-Med-17K. (Image source: [link](https://www.ncbi.nlm.nih.gov/pubmed/26147524))*
</p>

If you have any questions, please send an email to chenkaitao@pjlab.org.cn.

## Related Projects

- Our model is built on BLIP-2: [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://github.com/salesforce/LAVIS/tree/main).