# Medical Search Query Relevance Judgment

## Question description

Relevance between queries (i.e., search terms) measures how well the topics expressed by two queries match, that is, whether and to what extent the meaning of Query-B drifts away from that of Query-A. The topic of a query is its focus. Determining the relevance between two query terms is an important task, often used when optimizing search quality for long-tail queries, and this dataset was built for that scenario.

<div align=center>



</div>

## Dataset introduction

[Download](https://tianchi.aliyun.com/competition/entrance/532001/information)

The relevance between Query-A and Query-B is divided into three levels (0-2), where 0 is the least relevant and 2 is the most relevant:

2 points: A and B are semantically equivalent; they express exactly the same meaning.

1 point: B is a semantic subset of A; the scope of B is narrower than that of A.

0 points: B is a semantic superset of A (the scope of B is broader than that of A), or A and B are semantically unrelated.

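Each record pairs two queries with one of the labels above. A minimal sketch for peeking at the data, assuming the `query1`/`query2`/`label` field names of the public CBLUE release of KUAKE-QQR (adjust if your copy differs):

```python
import json

# Peek at the first few training records. ASSUMPTION: each file is a JSON
# array of objects with query1/query2/label fields, as in the public
# CBLUE release of KUAKE-QQR.
with open('data/KUAKE-QQR_train.json', encoding='utf-8') as f:
    records = json.load(f)

for record in records[:3]:
    print(record['query1'], '|', record['query2'], '->', record['label'])
```
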
## Structure
```
·
├── data
│   ├── example_pred.json
│   ├── KUAKE-QQR_dev.json
│   ├── KUAKE-QQR_test.json
│   └── KUAKE-QQR_train.json
├── tencent-ailab-embedding-zh-d100-v0.2.0-s
│   ├── tencent-ailab-embedding-zh-d100-v0.2.0-s.txt
│   └── readme.txt
├── chinese-bert-wwm-ext
│   ├── added_tokens.json
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   ├── tokenizer.json
│   └── vocab.txt
├── pic
│   └── 1.png
├── scripts
│   ├── inference.sh
│   ├── eval.sh
│   └── train.sh
├── train.py
├── eval.py
├── models.py
├── inference.py
└── README.md
```

## Environment

```shell
pip install gensim
pip install numpy
pip install tqdm
conda install pytorch
pip install transformers
```

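A quick way to confirm the environment is usable is a one-shot import check (nothing repo-specific, just the packages installed above):

```python
# Sanity-check the environment: every import below must succeed.
import gensim
import numpy
import torch
import tqdm
import transformers

print('torch', torch.__version__, '| transformers', transformers.__version__)
```
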
## Prepare
Download the word-embedding corpus from Tencent AI Lab
```shell
wget https://ai.tencent.com/ailab/nlp/zh/data/tencent-ailab-embedding-zh-d100-v0.2.0-s.tar.gz # v0.2.0, 100-dimension, small vocabulary
```
Decompress the corpus
```shell
tar -zxvf tencent-ailab-embedding-zh-d100-v0.2.0-s.tar.gz
```
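
The extracted `.txt` file is in word2vec text format, so it can be sanity-checked with gensim (a minimal sketch; `limit` only keeps the check fast):

```python
from gensim.models import KeyedVectors

# Load the Tencent embeddings (word2vec text format); limit=50000 reads only
# the first 50k words so the check stays quick.
w2v = KeyedVectors.load_word2vec_format(
    'tencent-ailab-embedding-zh-d100-v0.2.0-s/tencent-ailab-embedding-zh-d100-v0.2.0-s.txt',
    binary=False,
    limit=50000,
)
print(w2v.vector_size)  # expected: 100
```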

Download the BERT model and configuration files

```shell
mkdir chinese-bert-wwm-ext
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/added_tokens.json
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/config.json
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/pytorch_model.bin
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/special_tokens_map.json
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/tokenizer.json
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/tokenizer_config.json
wget -P chinese-bert-wwm-ext https://huggingface.co/hfl/chinese-bert-wwm-ext/resolve/main/vocab.txt
```
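
Once the files are in place, verifying that the checkpoint loads from the local directory takes a few lines (a minimal sketch using the standard `transformers` API):

```python
import torch
from transformers import BertModel, BertTokenizer

# Load from the local chinese-bert-wwm-ext directory downloaded above,
# not from the Hugging Face hub.
tokenizer = BertTokenizer.from_pretrained('chinese-bert-wwm-ext')
model = BertModel.from_pretrained('chinese-bert-wwm-ext')

inputs = tokenizer('感冒吃什么药', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```
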
## Train

```shell
python train.py --model_name {model_name} --datadir {datadir} --epochs 30 --lr 1e-4 --max_length 32 --batch_size 8 --savepath ./results --gpu 0 --w2v_path {w2v_path}
```
Or run the script

```shell
sh scripts/train.sh
```

## Eval

```shell
python eval.py --model_name {model_name} --w2v_path {w2v_path} --model_path {model_path}
```
Or run the script

```shell
sh scripts/eval.sh
```

## Inference
```shell
python inference.py --model_name {model_name} --batch_size 8 --max_length 32 --savepath ./results --datadir {datadir} --model_path {model_path} --gpu 0 --w2v_path {w2v_path}
```
Or run the script

```shell
sh scripts/inference.sh
```
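
The expected output format can be checked against the sample predictions shipped with the data (a minimal sketch that only prints the head of the file, assuming nothing about its schema):

```python
# Print the beginning of the bundled sample-prediction file to see the
# format that inference.py is expected to reproduce.
with open('data/example_pred.json', encoding='utf-8') as f:
    print(f.read()[:500])
```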

## Results

<div align=center>

| Model | Params(M) | Train Acc(%) | Val Acc(%) | Test Acc(%) |
| :----: | :----: | :----: | :----: | :----: |
| SemNN | 200.04 | 64.02 | 65.56 | 61.41 |
| SemLSTM | 200.24 | 66.81 | 67.00 | 69.74 |
| SemAttention | 200.48 | 76.14 | 74.50 | 75.57 |
| Bert | 102.27 | 95.85 | 82.88 | 82.65 |

</div>