From an image of a medical report (in French) that does not follow a fixed template, the system extracts the following information using NER and image segmentation:
The system works as follows:
Image segmentation and text extraction from the medical report image are performed by the ACABS algorithm (Automatic Cropper and Block Segmenter). For more details about ACABS, see the report.
ACABS first detects the text blocks and segments the image into them. It then selects the relevant text blocks and removes the report's header and footer. Finally, the text is extracted from the report image using OCR.
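The header and footer removal step can be illustrated with a small sketch. This is not the ACABS implementation (see the report for that). It assumes, purely for illustration, that each detected block is a bounding box in pixels and that headers and footers sit in the top and bottom 10% of the page. `Block`, `drop_header_footer`, and `margin` are hypothetical names.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """Hypothetical text block: top-left corner plus size, in pixels."""
    x: int
    y: int
    w: int
    h: int


def drop_header_footer(blocks: List[Block], page_height: int,
                       margin: float = 0.10) -> List[Block]:
    """Keep blocks whose vertical center lies outside the top and bottom margins.

    Assumption: header and footer occupy the outer `margin` fraction of the page.
    """
    top = page_height * margin
    bottom = page_height * (1.0 - margin)
    kept = [b for b in blocks if top <= b.y + b.h / 2 <= bottom]
    # Return the remaining blocks in reading order (top to bottom, then left to right).
    return sorted(kept, key=lambda b: (b.y, b.x))


blocks = [
    Block(50, 10, 500, 40),     # header (vertical center at y=30)
    Block(50, 500, 500, 120),   # body   (vertical center at y=560)
    Block(50, 200, 500, 250),   # body   (vertical center at y=325)
    Block(50, 950, 500, 30),    # footer (vertical center at y=965)
]
body = drop_header_footer(blocks, page_height=1000)
```

A fixed margin is the simplest possible rule. Reports with tall letterheads would likely need a larger `margin` or a content-based criterion.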
from acabs import ACABS

# Usage on a single image
text = ACABS(path_to_image)

# Usage on a folder of images
import os
texts = ''
_, _, filenames = next(os.walk(path_to_folder))
os.chdir(path_to_folder)
for file in filenames:
    text = ACABS(file)
    # Save the extracted text under the image's base name
    with open('path_to_save_text/' + os.path.splitext(file)[0] + '.txt', 'w', encoding='utf-8') as f:
        f.write(text)
    # Also accumulate all reports in one string, separated by a banner
    texts = texts + '\n=================== New report : ' + file + ' ===================\n' + text
Visual output of ACABS segmentation:
After using ACABS to extract the text, the data needs to be annotated (see annotated_text.json for an example). To speed up annotation, use a tool such as ner-annotator.
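Before training, it can help to sanity-check the annotations, since spaCy rejects overlapping entity spans. The sketch below assumes the layout exported by ner-annotator, i.e. a JSON object with an `annotations` list of `[text, {"entities": [[start, end, label], ...]}]` pairs. If annotated_text.json is organized differently, adjust the unpacking. `find_bad_spans` is a hypothetical helper, and the sample sentence and labels are made up.

```python
import json
from typing import List, Tuple


def find_bad_spans(text: str, entities: List[list]) -> List[Tuple[int, int, str]]:
    """Return entity spans that are empty, out of range, or overlap an earlier span."""
    bad = []
    last_end = 0
    for start, end, label in sorted(entities, key=lambda e: (e[0], e[1])):
        out_of_range = start < 0 or end > len(text) or start >= end
        overlaps = start < last_end
        if out_of_range or overlaps:
            bad.append((start, end, label))
        else:
            last_end = end
    return bad


# Made-up sample in the assumed ner-annotator layout.
sample = json.loads("""
{"classes": ["PATIENT", "DATE"],
 "annotations": [
   ["Patient : Jean Dupont, examen du 12/03/2021",
    {"entities": [[10, 21, "PATIENT"], [15, 21, "PATIENT"], [33, 43, "DATE"]]}]
 ]}
""")

for text, meta in sample["annotations"]:
    problems = find_bad_spans(text, meta["entities"])
    if problems:
        print("Check these spans:", problems)
```

Here the second `PATIENT` span is flagged because it lies inside the first one.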
Transform the data into a spaCy-compatible format using transform_data.py.
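The kind of conversion involved can be sketched as follows. This is a guess at what such a script does, not the contents of transform_data.py. It assumes examples of the form `(text, [(start, end, label), ...])`, a blank French pipeline, and a hypothetical 80/20 train/dev split. `split_examples` and `write_docbin` are illustrative names; `spacy.blank`, `DocBin`, and `Doc.char_span` are spaCy v3 APIs.

```python
import random
from typing import List, Tuple

Example = Tuple[str, List[Tuple[int, int, str]]]


def split_examples(examples: List[Example], dev_ratio: float = 0.2,
                   seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Shuffle deterministically and split into (train, dev)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_ratio))
    return shuffled[n_dev:], shuffled[:n_dev]


def write_docbin(examples: List[Example], path: str, lang: str = "fr") -> None:
    """Serialize examples to a .spacy file of the kind `spacy train` reads."""
    import spacy  # imported lazily so the split helper works without spaCy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)
    doc_bin = DocBin()
    for text, entities in examples:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in entities:
            # char_span returns None when no token fits inside the given offsets
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                spans.append(span)
        doc.ents = spans  # raises ValueError if spans overlap
        doc_bin.add(doc)
    doc_bin.to_disk(path)


examples = [(f"texte {i}", []) for i in range(10)]  # placeholder data
train, dev = split_examples(examples)
# write_docbin(train, "./train.spacy")
# write_docbin(dev, "./dev.spacy")
```

The two `write_docbin` calls are commented out because they need spaCy installed. The output names match the train.spacy and dev.spacy files used by the commands in this README.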
Use config.cfg to customize the NER model.
python -m spacy debug data ./config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy --gpu-id 1
python -m spacy evaluate ./output/model-best ./dev.spacy
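Once training has finished, the best model can be loaded and applied to text extracted by ACABS. A minimal sketch, assuming the model was saved to ./output/model-best as in the commands above. `extract_entities` and `load_model` are hypothetical helpers, and the entity labels returned depend on your annotations.

```python
from typing import Callable, List, Tuple


def extract_entities(nlp: Callable, text: str) -> List[Tuple[str, str]]:
    """Run a spaCy pipeline on `text` and return (entity text, label) pairs."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def load_model(model_dir: str = "./output/model-best"):
    """Load the trained pipeline; spaCy is imported lazily."""
    import spacy
    return spacy.load(model_dir)


# Usage (requires spaCy and a trained model):
# nlp = load_model()
# print(extract_entities(nlp, text))
```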