# Intelligent data extraction from medical reports

Given an image of a French-language medical report that does not follow a fixed template, the system extracts the following information using named-entity recognition (NER) and image segmentation:

- the patient’s name and date of birth
- the date of the medical intervention
- the type of medical intervention (for example: radiology)
- the name of the doctor who performed the medical intervention
- the address of the intervention
- the referring doctor

The system works as follows:

<section align='center'>
    <img src='https://github.com/IKetchup/Intelligent_data_extraction_from_medical_reports/blob/main/images/schema.PNG' height="400"/>
</section>

## Requirements

- Tesseract 5.0.0
- pytesseract 0.3.8
- NumPy 1.19.5
- OpenCV (opencv-python) 4.5.1.48
- spaCy 3.2.0


## Image segmentation and text extraction

Image segmentation and text extraction from a medical report image are performed by the ACABS algorithm (Automatic Cropper and Block Segmenter). For more details about ACABS, see the [report](Intelligent_data_extraction_from_medical_reports.pdf).

### ACABS

ACABS first detects the blocks of text in the image and segments them. It then selects the relevant text blocks and removes the report's header and footer. Finally, the text is extracted from the report image using OCR.

```python
import os

from acabs import ACABS

# Usage on a single image
text = ACABS(path_to_image)

# Usage on a folder of images
texts = ''
_, _, filenames = next(os.walk(path_to_folder))
os.chdir(path_to_folder)
for file in filenames:
    text = ACABS(file)
    # Save the extracted text under the image's base name
    with open('path_to_save_text/' + os.path.splitext(file)[0] + '.txt', 'w') as f:
        f.write(text)

    # Accumulate all reports in one string, separated by a banner line
    texts = texts + '\n=================== New report : ' + file + ' ===================\n' + text
```
Visual output of ACABS segmentation: 

<section align='center'>
    <img src='https://github.com/IKetchup/Intelligent_data_extraction_from_medical_reports/blob/main/images/acabs_result_fancy.png' height="500"/>
</section>
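
To give an intuition for the header and footer removal step, here is a minimal sketch that filters text-block bounding boxes by their vertical position. It is not the ACABS implementation (the actual selection criteria are described in the report): the `Block` type, the `drop_header_footer` function, and the 12% margin are all assumptions made for illustration. In practice the boxes could come from a contour detector such as OpenCV's `cv2.boundingRect`.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """Bounding box of one detected text block, in pixels (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int


def drop_header_footer(blocks: List[Block], page_height: int,
                       margin: float = 0.12) -> List[Block]:
    """Keep the blocks that lie entirely outside the top and bottom margins.

    `margin` is an assumed fraction of the page height, not a value taken
    from ACABS.
    """
    top = page_height * margin
    bottom = page_height * (1 - margin)
    kept = [b for b in blocks if b.y >= top and b.y + b.h <= bottom]
    # Return the body blocks in reading order (top to bottom, then left to right)
    return sorted(kept, key=lambda b: (b.y, b.x))


# Made-up boxes for a page that is 1000 px high
blocks = [
    Block(40, 10, 500, 60),     # letterhead at the top of the page
    Block(40, 300, 500, 200),   # report body
    Block(40, 150, 500, 100),   # patient details
    Block(40, 950, 500, 40),    # footer with contact details
]
body = drop_header_footer(blocks, page_height=1000)
print(body)
```

A fixed margin is the simplest possible rule; reports whose letterhead is unusually tall would need a content-based criterion instead.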

## Extraction of key information

After using ACABS to extract the text, the data needs to be annotated (see [annotated_text.json](annotated_text.json) for an example). To speed up annotation, use a tool such as [ner-annotator](https://github.com/tecoholic/ner-annotator).

Convert the annotated data into the spaCy training format using [transform_data.py](transform_data.py).
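
As an illustration of what such a conversion involves, the sketch below turns annotations into a `.spacy` file with spaCy's `DocBin`. It is not the repository's `transform_data.py`. It assumes the export has an `"annotations"` list of `[text, {"entities": [[start, end, label], ...]}]` pairs; the function names, the labels, and the sample sentence are made up for the example.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, List[Tuple[int, int, str]]]


def load_examples(raw: Dict) -> List[Example]:
    """Flatten an annotation export into (text, [(start, end, label), ...]) pairs."""
    examples = []
    for item in raw["annotations"]:
        if not item:  # tolerate empty entries in the export
            continue
        text, meta = item
        ents = [(int(s), int(e), str(label)) for s, e, label in meta["entities"]]
        examples.append((text, sorted(ents)))
    return examples


def write_docbin(examples: List[Example], path: str, lang: str = "fr") -> int:
    """Serialize examples to a .spacy file and return the number of docs written."""
    import spacy  # imported here so the parsing step above has no dependency
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)
    db = DocBin()
    for text, ents in examples:
        doc = nlp.make_doc(text)
        spans = [doc.char_span(s, e, label=label, alignment_mode="contract")
                 for s, e, label in ents]
        # char_span returns None when the offsets do not map onto tokens
        doc.ents = spacy.util.filter_spans([sp for sp in spans if sp is not None])
        db.add(doc)
    db.to_disk(path)
    return len(db)


# Made-up sample; labels and text are hypothetical
raw = {
    "classes": ["PATIENT", "DATE"],
    "annotations": [
        ["Patient : Jean Dupont, examen du 12/03/2021",
         {"entities": [[33, 43, "DATE"], [10, 21, "PATIENT"]]}],
        None,
    ],
}
examples = load_examples(raw)
print(examples)
```

Calling `write_docbin(examples, "./train.spacy")` would then produce a file usable as `--paths.train`; a separate held-out split is needed for `./dev.spacy`.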

### Train a model

Use [config.cfg](config.cfg) to customize the NER model.

- Verify the data: 
```bash 
python -m spacy debug data ./config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
```

- Train the model: 
```bash 
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy --gpu-id 1
```

- Evaluate the model: 
```bash 
python -m spacy evaluate ./output/model-best ./dev.spacy
```

### Use a model

See [predictions.ipynb](predictions.ipynb)

<section align='center'>
    <img src='https://github.com/IKetchup/Intelligent_data_extraction_from_medical_reports/blob/main/images/pred.png'/>
</section>
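
Outside the notebook, a trained pipeline can be loaded with `spacy.load` and applied to the text returned by ACABS. The sketch below is one possible way to do that, not the notebook's code: `group_entities` and `extract_fields` are hypothetical helpers, and the model path assumes the `./output/model-best` directory produced by the training command above.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(doc) -> Dict[str, List[str]]:
    """Collect entity texts by label from any object exposing spaCy-style `.ents`."""
    fields = defaultdict(list)
    for ent in doc.ents:
        fields[ent.label_].append(ent.text)
    return dict(fields)


def extract_fields(text: str, model_dir: str = "./output/model-best") -> Dict[str, List[str]]:
    """Run the trained pipeline on OCR text and return the fields it finds."""
    import spacy  # requires spaCy and a trained model at `model_dir`

    nlp = spacy.load(model_dir)
    return group_entities(nlp(text))
```

For a folder of reports, load the pipeline once and reuse it rather than calling `extract_fields` per file, since `spacy.load` is the expensive step.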
