# Cost-effective Instruction Learning for Pathology Vision and Language Analysis (CLOVER)

The advent of vision-language models fosters interactive conversations between AI-enabled models and humans. Yet applying these models in the clinic must contend with daunting challenges around large-scale training data and financial and computational resources. Here we propose a cost-effective instruction learning framework for conversational pathology, named CLOVER. CLOVER trains only a lightweight module and uses instruction tuning while freezing the parameters of the large language model. Instead of using costly GPT-4, we propose well-designed prompts on GPT-3.5 for building generation-based instructions, emphasizing the utility of pathological knowledge derived from Internet sources. To augment the use of instructions, we construct a high-quality set of template-based instructions in the context of digital pathology. On two benchmark datasets, our findings reveal the strength of hybrid-form instructions for visual question answering in pathology. Extensive results show the cost-effectiveness of CLOVER in answering both open-ended and closed-ended questions, where CLOVER outperforms strong baselines that possess 37 times more training parameters and use instruction data generated from GPT-4. Through instruction tuning, CLOVER exhibits robust few-shot learning on an external clinical dataset. These findings demonstrate that the cost-effective modeling of CLOVER could accelerate the adoption of rapid conversational applications in the landscape of digital pathology.

## Release
- The checkpoints and instruction dataset will be released soon.

## Workflow of CLOVER
<p align="center">
    <img src="imgs/image.png" width="90%"> <br>

  *CLOVER employs the training framework of BLIP-2 to achieve fast domain tuning with lightweight parameters. The entire training process of CLOVER includes two major stages: (i) alignment of vision and language and (ii) supervised fine-tuning with instructions. The alignment compels the model to acquire valuable representations bridging vision and language. Instruction fine-tuning is vital here for activating LLMs to excel in vision-language question answering. Stage 1 requires image-text pairs as input, for which we use the large-scale Quilt-1M dataset. Stage 2 demands domain-specific instruction data. Given the significant lack of such instruction data in the literature, we propose a low-cost solution for instruction data generation carefully designed for analyzing pathological data.*
</p>

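
The key to CLOVER's cost-effectiveness is that only the lightweight Q-Former module is trained, while the image encoder and the LLM stay frozen. Below is a minimal sketch of that parameter-freezing idea; it assumes the upstream LAVIS BLIP-2 implementation, whose attribute names (`visual_encoder`, `Qformer`, `t5_model`) may differ from the code in this repository.

```python
import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the upstream BLIP-2 + FlanT5-XL model through LAVIS (not a CLOVER checkpoint).
model, _, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=False, device=device
)

# Keep the heavy components frozen ...
for p in model.visual_encoder.parameters():
    p.requires_grad = False  # frozen ViT image encoder
for p in model.t5_model.parameters():
    p.requires_grad = False  # frozen LLM

# ... so that only the lightweight Q-Former and its projection remain trainable.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable parameter tensors, e.g. {trainable[:3]}")
```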
## Contents
- [Cost-effective Instruction Learning for Pathology Vision and Language Analysis (CLOVER)](#cost-effective-instruction-learning-for-pathology-vision-and-language-analysis-clover)
  - [Release](#release)
  - [Workflow of CLOVER](#workflow-of-clover)
  - [Contents](#contents)
    - [Data Download](#data-download)
    - [Installation](#installation)
    - [Training](#training)
    - [Inference](#inference)
  - [Case Study](#case-study)
  - [Related Projects](#related-projects)
### Data Download
- Stage 1: The Quilt-1M dataset can be downloaded from [Google](https://docs.google.com/forms/d/e/1FAIpQLSdSe06DIbPn71jA2rCxe_5tUPfyHhSH1Z7ZTJBxWM26cnpZFg/viewform) or [Zenodo](https://zenodo.org/records/8239942).
- Stage 2: The CLOVER instructions will be released. You can also generate the data yourself with our prompts in [PY FILE](./generate_instructions.py); a minimal sketch of the idea is shown below.

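
As a rough illustration of the generation-based route, the sketch below asks GPT-3.5 to turn a pathology image caption into question-answer pairs. It is only a sketch: the real prompts live in `generate_instructions.py`, and the prompt text, example caption, and output handling used here are placeholders.

```python
# Minimal sketch of generation-based instruction construction with GPT-3.5.
# Assumes the `openai` package (v1+) and an OPENAI_API_KEY environment variable;
# the prompt and caption below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

caption = "High-power view showing nests of atypical squamous cells with keratin pearls."  # made-up example

prompt = (
    "You are an experienced pathologist. Based only on the following image caption, "
    "write two question-answer pairs that a trainee might ask about the image.\n"
    f"Caption: {caption}"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
)
print(response.choices[0].message.content)  # raw Q&A text, to be post-processed into instructions
```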
### Installation

1. Create the conda environment
```bash
conda create -n clover python=3.9
conda activate clover
```

2. Build from source
```bash
git clone https://github.com/JLINEkai/CLOVER.git
cd CLOVER
pip install -r requirements.txt
```
### Training
- Stage 1 (Alignment):
```bash
python train_blip2qformer.py
```
- Stage 2 (Instruction fine-tuning):

You can choose the large language model (LLM) in [FILE](./lavis/projects/blip2/train/pretrain_stage2.yaml). We provide FlanT5-XL and Vicuna-7B.
```bash
python -m torch.distributed.run --nproc_per_node=1 train.py
```
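Before launching, it can be useful to confirm which LLM the stage-2 config selects. A small sketch, assuming OmegaConf (a LAVIS dependency) and that the config keeps the upstream LAVIS layout with a top-level `model` section:

```python
# Inspect the stage-2 training config (field names assume the upstream LAVIS layout;
# adjust if the CLOVER config differs).
from omegaconf import OmegaConf

cfg = OmegaConf.load("lavis/projects/blip2/train/pretrain_stage2.yaml")
print(OmegaConf.to_yaml(cfg.model))  # e.g. arch / model_type indicating FlanT5-XL or Vicuna-7B
```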
### Inference

```bash
python -m torch.distributed.run --nproc_per_node=1 evaluate.py --cfg-path lavis/projects/blip2/eval/vqav2_zeroshot_flant5xl_eval.yaml
```
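For a quick qualitative check on a single image, outside the evaluation script above, the standard LAVIS BLIP-2 interface should work along these lines. The checkpoint loaded here is the upstream FlanT5-XL model rather than a released CLOVER checkpoint, and the image path and question are placeholders.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Upstream BLIP-2 + FlanT5-XL via LAVIS; swap in a CLOVER checkpoint once released.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

raw_image = Image.open("path/to/pathology_patch.png").convert("RGB")  # placeholder path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

question = "Question: What type of tissue is shown in this image? Answer:"
print(model.generate({"image": image, "prompt": question}))
```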
## Case Study

<p align="center">
    <img src="imgs/case1.png" width="90%"> <br>

  *Qualitative comparisons of visual question answering on QUILT-VQA. (Image source: QUILT-VQA)*
</p>

<p align="center">
    <img src="imgs/case2.png" width="90%"> <br>

  *Qualitative comparisons of visual question answering on LLaVA-Med-17K. (Image source: [link](https://www.ncbi.nlm.nih.gov/pubmed/26147524))*
</p>

If you have any questions, please send an email to chenkaitao@pjlab.org.cn.
## Related Projects
- Our model is based on [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://github.com/salesforce/LAVIS/tree/main).