# Managing Data with medaCy

medaCy provides rich utilities for managing data through the
[Dataset](../../medacy/data/dataset.py) class.

In short, a [Dataset](../../medacy/data/dataset.py)
provides an abstraction over a file directory that allows other components
of medaCy to efficiently access and utilize large amounts of data. An instantiated
[Dataset](../../medacy/data/dataset.py) automatically knows its purpose
(either for prediction or training) and maintains auxiliary files for
medaCy components such as MetaMap accordingly.

In the context of medaCy, a [Dataset](../../medacy/data/dataset.py) is
composed of at least a collection of raw text files. Such a [Dataset](../../medacy/data/dataset.py)
is referred to as a *prediction dataset*. A [Dataset](../../medacy/data/dataset.py) can
be used for training if and only if each raw text file has a corresponding annotation file - hence,
we refer to this as a *training dataset*.
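
As a rough illustration of this distinction, the pairing condition can be checked with ordinary Python. The helper below is hypothetical, not part of the medaCy API:

```python
from pathlib import Path

def can_train_on(directory):
    """Hypothetical helper: True only if every raw .txt file in the
    directory has a corresponding .ann annotation file."""
    directory = Path(directory)
    txt_stems = {f.stem for f in directory.glob('*.txt')}
    ann_stems = {f.stem for f in directory.glob('*.ann')}
    return bool(txt_stems) and txt_stems <= ann_stems
```

With the example directory below, `can_train_on('/home/medacy/data')` would return `True`; remove the `.ann` files and it could only serve as a prediction dataset.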

For the following examples, assume your data directory */home/medacy/data* is structured as follows:
```
/home/medacy/data
├── file_one.ann
├── file_one.txt
├── file_two.ann
└── file_two.txt
```

## Table of contents
1. [Creating a Dataset](#creating-a-dataset)
2. [Using a Dataset](#using-a-dataset)

## Creating a Dataset
MedaCy provides two ways to load data:
1. [Loading data from your machine](#loading-data-locally).
2. [Loading an existing medaCy-compatible dataset](#loading-a-medacy-compatible-dataset).

## Loading data locally
To create a *Dataset*, simply instantiate one with a path to the directory containing your data.

```python
from medacy.data.dataset import Dataset

data = Dataset('/home/medacy/data')
```

MedaCy **does not** alter the data you load in any way - it only reads from it.

A common data workflow might look like this:

```python
>>> from medacy.data.dataset import Dataset
>>> from medacy.pipeline_components.feature_overlayers.metamap.metamap import MetaMap

>>> dataset = Dataset('/home/medacy/data')
>>> for data_file in dataset:
...     data_file.file_name
'file_one'
'file_two'
>>> dataset
['file_one', 'file_two']
>>> dataset.is_metamapped()
False
>>> metamap = MetaMap('/home/path/to/metamap/binary')
>>> with metamap:
...     dataset.metamap(metamap)
>>> dataset.is_metamapped()
True
```

If all your files were metamapped successfully, your data directory will look like this:

```
/home/medacy/data
├── file_one.ann
├── file_one.txt
├── file_two.ann
├── file_two.txt
└── metamapped
    ├── file_one.metamapped
    └── file_two.metamapped
```
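
If `is_metamapped()` returns `False`, it can help to know exactly which files are missing MetaMap output. A minimal sketch, assuming the *metamapped* directory layout shown above (this helper is illustrative, not a medaCy API):

```python
from pathlib import Path

def unmetamapped_files(data_dir):
    """Illustrative helper: list .txt files that have no matching
    .metamapped file in the metamapped/ sub-directory."""
    data_dir = Path(data_dir)
    mapped = {f.stem for f in (data_dir / 'metamapped').glob('*.metamapped')}
    return [f.name for f in data_dir.glob('*.txt') if f.stem not in mapped]
```

When every file mapped successfully, `unmetamapped_files('/home/medacy/data')` returns an empty list.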

## Using a Dataset
A *Dataset* is utilized for two main tasks:

1. [Model Training](#model-training)
2. [Model Prediction](#model-prediction)

### Model Training
To utilize a *Dataset* for training, ensure that the data you're loading is valid training data in a supported annotation format. After creating a *Model* with a processing *Pipeline*, simply pass the *Dataset* in for training. Here is an example of training an NER model for extraction of information relevant to nanoparticles.

```python
from medacy.data.dataset import Dataset
from medacy.model.model import Model
from medacy.pipelines import FDANanoDrugLabelPipeline

dataset = Dataset('/home/medacy/data')
entities = ['Nanoparticle', 'Dose']
pipeline = FDANanoDrugLabelPipeline(entities=entities)
model = Model(pipeline)

model.fit(dataset)
```

**Note**: Unless you have tuned your *Pipeline* to extract features relevant to your problem domain, the trained model will likely not be very predictive. See [Training a model](model_training.md).

### Model Prediction

Once you have trained or imported a model, pass in a Dataset object for bulk prediction of text.

```python
from medacy.data.dataset import Dataset
from medacy.model.model import Model

dataset = Dataset('/home/medacy/data')
model = Model.load_external('medacy_model_clinical_notes')

model.predict(dataset)
```

By default, this creates a sub-directory in your prediction dataset named *predictions*. Assuming the file structure described previously, your directory would look like this:

```
/home/medacy/data
├── file_one.txt
├── file_two.txt
└── predictions
    ├── file_one.ann
    └── file_two.ann
```

where all files under *predictions* are the trained model's predictions over your test data.
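
Since the predicted files are plain BRAT-style `.ann` annotations, they can be inspected without medaCy. A minimal sketch, assuming simple entity lines of the form `T1<TAB>Label start end<TAB>text` (discontinuous spans are not handled):

```python
from pathlib import Path

def read_entities(ann_path):
    """Parse entity ('T') lines of a BRAT .ann file into
    (label, start, end, text) tuples."""
    entities = []
    for line in Path(ann_path).read_text().splitlines():
        if not line.startswith('T'):
            continue  # skip relations, attributes, notes, etc.
        _, definition, text = line.split('\t')
        label, start, end = definition.split()
        entities.append((label, int(start), int(end), text))
    return entities

for ann_file in sorted(Path('/home/medacy/data/predictions').glob('*.ann')):
    print(ann_file.name, read_entities(ann_file))
```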