# Medical Search Query Relevance Judgment
|
|
## Task Description
|
|
Query-query relevance measures how well the topics expressed by two queries (i.e., search terms) match: whether, and to what degree, the meaning of Query-B drifts away from that of Query-A. The topic of a query is its focus. Judging the relevance between two query terms is an important task, commonly used to optimize search quality for long-tail queries, and this dataset was built in that setting.
|
|
<div align=center>
|
|
![avatar](./pic/1.jpg)
|
|
</div>
|
|
## Dataset Introduction
|
|
[Download](https://tianchi.aliyun.com/competition/entrance/532001/information)
|
|
The relevance between Query-A and Query-B is graded on three levels (0-2), where 0 is the least relevant and 2 is the most relevant:
|
|
- 2 points: A and B are equivalent; they express exactly the same meaning.
- 1 point: B is a semantic subset of A; the scope of B is narrower than that of A.
- 0 points: B is a semantic superset of A (the scope of B is broader than that of A), or A and B are semantically unrelated.
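
For concreteness, here is one hypothetical pair per label, written in the JSON layout of the dataset files (the field names `id`, `query1`, `query2`, and `label` are assumed from the KUAKE-QQR release; the example queries are invented):

```python
import json

# Hypothetical examples of the three relevance labels (not taken from the dataset).
examples = [
    # 2: equivalent -- both queries ask why children snore.
    {"id": "ex1", "query1": "小儿打呼噜是什么原因", "query2": "小孩子打呼噜是怎么回事", "label": "2"},
    # 1: query2 is a semantic subset of query1 -- taking medicine is one way to treat a cold.
    {"id": "ex2", "query1": "感冒怎么治疗", "query2": "感冒吃什么药", "label": "1"},
    # 0: semantically unrelated topics.
    {"id": "ex3", "query1": "高血压的症状", "query2": "骨折后多久恢复", "label": "0"},
]
print(json.dumps(examples, ensure_ascii=False, indent=2))
```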
|
|
## Structure
|
|
```
.
├── data
│   ├── example_pred.json
│   ├── KUAKE-QQR_dev.json
│   ├── KUAKE-QQR_test.json
│   └── KUAKE-QQR_train.json
├── tencent-ailab-embedding-zh-d100-v0.2.0-s
│   ├── tencent-ailab-embedding-zh-d100-v0.2.0-s.txt
│   └── readme.txt
├── chinese-bert-wwm-ext
│   ├── added_tokens.json
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   ├── tokenizer.json
│   └── vocab.txt
├── pic
│   └── 1.png
├── scripts
│   ├── inference.sh
│   ├── eval.sh
│   └── train.sh
├── train.py
├── eval.py
├── models.py
├── inference.py
└── README.md
```
|
|
## Environment
|
|
```shell
pip install gensim
pip install numpy
pip install tqdm
conda install pytorch -c pytorch
pip install transformers
```
|
|
## Prepare

Download the word-embedding corpus from Tencent AI Lab:
|
|
```shell
wget https://ai.tencent.com/ailab/nlp/zh/data/tencent-ailab-embedding-zh-d100-v0.2.0-s.tar.gz  # v0.2.0, 100-dimensional, small vocabulary
```
|
|
Decompress the corpus:

```shell
tar -zxvf tencent-ailab-embedding-zh-d100-v0.2.0-s.tar.gz
```
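
The extracted `.txt` file is in plain word2vec text format, so it can be sanity-checked with gensim before training (loading the full file takes a few minutes and several GB of RAM):

```python
from gensim.models import KeyedVectors

# Tencent embeddings ship in word2vec text format: one header line, then one vector per word.
w2v = KeyedVectors.load_word2vec_format(
    "tencent-ailab-embedding-zh-d100-v0.2.0-s/tencent-ailab-embedding-zh-d100-v0.2.0-s.txt",
    binary=False,
)
print(w2v.vector_size)                   # expected: 100
print(w2v.most_similar("感冒", topn=5))  # nearest neighbours of "感冒" (common cold)
```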
|
|
Download the BERT model and configuration files:
|
|
```shell
mkdir chinese-bert-wwm-ext
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/added_tokens.json
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/config.json
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/pytorch_model.bin
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/special_tokens_map.json
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/tokenizer.json
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/tokenizer_config.json
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/vocab.txt
```
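
To confirm the checkpoint downloaded correctly, it can be loaded from the local directory with transformers (the query pair below is a hypothetical example):

```python
import torch
from transformers import BertModel, BertTokenizer

# Load from the local directory rather than the Hugging Face Hub.
tokenizer = BertTokenizer.from_pretrained("chinese-bert-wwm-ext")
model = BertModel.from_pretrained("chinese-bert-wwm-ext")

# Encode a query pair the way BERT sentence-pair models expect.
inputs = tokenizer("小儿打呼噜是什么原因", "小孩子打呼噜是怎么回事", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```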
|
|
## Train
|
|
```shell
python train.py --model_name {model_name} --datadir {datadir} --epochs 30 --lr 1e-4 --max_length 32 --batch_size 8 --savepath ./results --gpu 0 --w2v_path {w2v_path}
```
|
|
Or run the script:
|
|
```shell
sh scripts/train.sh
```
|
|
## Eval
|
|
```shell
python eval.py --model_name {model_name} --w2v_path {w2v_path} --model_path {model_path}
```
|
|
Or run the script:
|
|
```shell
sh scripts/eval.sh
```
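
The reported metric is accuracy (see the Results table below). A minimal sketch of that computation, assuming both the gold file and the prediction file are JSON lists of records with `id` and `label` fields as in `data/example_pred.json` (eval.py's actual logic may differ):

```python
import json

def accuracy(gold_path: str, pred_path: str) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    with open(gold_path, encoding="utf-8") as f:
        gold = {r["id"]: r["label"] for r in json.load(f)}
    with open(pred_path, encoding="utf-8") as f:
        pred = {r["id"]: r["label"] for r in json.load(f)}
    return sum(pred.get(i) == label for i, label in gold.items()) / len(gold)

print(accuracy("data/KUAKE-QQR_dev.json", "data/example_pred.json"))
```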
|
|
## Inference
|
|
```shell
python inference.py --model_name {model_name} --batch_size 8 --max_length 32 --savepath ./results --datadir {datadir} --model_path {model_path} --gpu 0 --w2v_path {w2v_path}
```
|
|
Or run the script:
|
|
```shell
sh scripts/inference.sh
```
|
|
## Results
|
|
<div align=center>
|
|
| Model | Params (M) | Train Acc (%) | Val Acc (%) | Test Acc (%) |
| :----: | :----: | :----: | :----: | :----: |
| SemNN | 200.04 | 64.02 | 65.56 | 61.41 |
| SemLSTM | 200.24 | 66.81 | 67.00 | 69.74 |
| SemAttention | 200.48 | 76.14 | 74.50 | 75.57 |
| BERT | 102.27 | 95.85 | 82.88 | 82.65 |
|
|
</div>