# Medical Search Query Relevance Judgment

## Task Description

Query relevance (i.e., relevance between search terms) measures how well the topics expressed by two queries match, that is, whether and to what extent Query-A and Query-B convey the same meaning. The topic of a query is the focus of the search. Judging the relevance between two queries is an important task, commonly used when optimizing search quality for long-tail queries, and this dataset was built for that scenario.

<div align=center>

![examples](./pic/1.png)

</div>

## Dataset Introduction

[Download](https://tianchi.aliyun.com/competition/entrance/532001/information)

The relevance between Query-A and Query-B is graded on three levels (0-2), where 0 is the least relevant and 2 is the most relevant:

- 2: A and B are equivalent; they express exactly the same meaning.
- 1: B is a semantic subset of A; the scope of B is narrower than that of A.
- 0: B is a semantic superset of A (the scope of B is broader than that of A), or A and B are semantically unrelated.

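For a concrete picture of the data files, here is a minimal loading sketch. It assumes the KUAKE-QQR splits are JSON arrays of records with `id`, `query1`, `query2`, and `label` fields; check the files under `data/` for the exact layout.

```python
import json

# Minimal sketch: inspect one record from the training split.
# Assumes each file is a JSON array of objects with
# "id", "query1", "query2" and "label" fields; verify against data/.
with open("data/KUAKE-QQR_train.json", encoding="utf-8") as f:
    records = json.load(f)

example = records[0]
print(example["query1"], example["query2"], example["label"])
```
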
## Structure

```
.
├── data
│   ├── example_pred.json
│   ├── KUAKE-QQR_dev.json
│   ├── KUAKE-QQR_test.json
│   └── KUAKE-QQR_train.json
├── tencent-ailab-embedding-zh-d100-v0.2.0-s
│   ├── tencent-ailab-embedding-zh-d100-v0.2.0-s.txt
│   └── readme.txt
├── chinese-bert-wwm-ext
│   ├── added_tokens.json
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   ├── tokenizer.json
│   └── vocab.txt
├── pic
│   └── 1.png
├── scripts
│   ├── inference.sh
│   ├── eval.sh
│   └── train.sh
├── train.py
├── eval.py
├── models.py
├── inference.py
└── README.md
```

## Environment

```shell
pip install gensim
pip install numpy
pip install tqdm
conda install pytorch -c pytorch
pip install transformers
```

## Prepare

Download the word-embedding corpus from Tencent AI Lab:

```shell
wget https://ai.tencent.com/ailab/nlp/zh/data/tencent-ailab-embedding-zh-d100-v0.2.0-s.tar.gz # v0.2.0, 100-dimension, small
```

Decompress the corpus:

```shell
tar -zxvf tencent-ailab-embedding-zh-d100-v0.2.0-s.tar.gz
```

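The extracted `tencent-ailab-embedding-zh-d100-v0.2.0-s.txt` file should be in plain word2vec text format, so it can be sanity-checked with gensim (installed above) before its path is passed via `--w2v_path`; a minimal sketch:

```python
from gensim.models import KeyedVectors

# Sanity check: load the Tencent embeddings (word2vec text format)
# and confirm the vector dimension. Loading the full file can take a while.
w2v_path = "tencent-ailab-embedding-zh-d100-v0.2.0-s/tencent-ailab-embedding-zh-d100-v0.2.0-s.txt"
vectors = KeyedVectors.load_word2vec_format(w2v_path, binary=False)
print(vectors.vector_size)  # expected: 100
```
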
Download the BERT model and its configuration files:

```shell
mkdir chinese-bert-wwm-ext
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/added_tokens.json
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/config.json
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/pytorch_model.bin
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/special_tokens_map.json
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/tokenizer.json
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/tokenizer_config.json
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/vocab.txt
```

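As a quick sanity check, the downloaded directory should load with the Hugging Face `transformers` API (how `models.py` actually loads it may differ); a minimal sketch:

```python
from transformers import BertModel, BertTokenizer

# Sanity check: load the tokenizer and weights from the local directory
# created above instead of pulling from the Hugging Face Hub.
tokenizer = BertTokenizer.from_pretrained("chinese-bert-wwm-ext")
model = BertModel.from_pretrained("chinese-bert-wwm-ext")
print(model.config.hidden_size)  # 768 for chinese-bert-wwm-ext
```
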
## Train

```shell
python train.py --model_name {model_name} --datadir {datadir} --epochs 30 --lr 1e-4 --max_length 32 --batch_size 8 --savepath ./results --gpu 0 --w2v_path {w2v_path}
```

Or run the training script:

```shell
sh scripts/train.sh
```

## Eval

```shell
python eval.py --model_name {model_name} --w2v_path {w2v_path} --model_path {model_path}
```

Or run the evaluation script:

```shell
sh scripts/eval.sh
```

## Inference

```shell
python inference.py --model_name {model_name} --batch_size 8 --max_length 32 --savepath ./results --datadir {datadir} --model_path {model_path} --gpu 0 --w2v_path {w2v_path}
```

Or run the inference script:

```shell
sh scripts/inference.sh
```

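The prediction file written under `--savepath` is assumed to follow the same layout as `data/example_pred.json`; the sketch below compares a hypothetical output file (`results/pred.json`) against the example to catch format mismatches. Both files are assumed to be JSON lists of records.

```python
import json

# Compare a generated prediction file against the provided example
# to check that both have the same length and use the same fields.
# "results/pred.json" is a hypothetical output path; adjust as needed.
with open("data/example_pred.json", encoding="utf-8") as f:
    example = json.load(f)
with open("results/pred.json", encoding="utf-8") as f:
    pred = json.load(f)

assert len(pred) == len(example)
assert set(pred[0]) == set(example[0])
print("prediction file matches the example layout")
```
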
## Results

<div align=center>

| Model | Params (M) | Train Acc (%) | Val Acc (%) | Test Acc (%) |
| :----: | :----: | :----: | :----: | :----: |
| SemNN | 200.04 | 64.02 | 65.56 | 61.41 |
| SemLSTM | 200.24 | 66.81 | 67.00 | 69.74 |
| SemAttention | 200.48 | 76.14 | 74.50 | 75.57 |
| Bert | 102.27 | 95.85 | 82.88 | 82.65 |

</div>