# Managing Data with medaCy

medaCy provides rich utilities for managing data through the
[Dataset](../../medacy/data/dataset.py) class.

In short, a [Dataset](../../medacy/data/dataset.py)
provides an abstraction over a file directory that allows other components
of medaCy to efficiently access and utilize large amounts of data. An instantiated
[Dataset](../../medacy/data/dataset.py) automatically knows its purpose
(either prediction or training) and maintains auxiliary files for
medaCy components such as MetaMap accordingly.

In the context of medaCy, a [Dataset](../../medacy/data/dataset.py) is
composed of at least a collection of raw text files. Such a [Dataset](../../medacy/data/dataset.py)
is referred to as a *prediction dataset*. A [Dataset](../../medacy/data/dataset.py) can
be used for training if and only if each raw text file has a corresponding annotation file - hence,
we refer to this as a *training dataset*.
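
A quick way to verify that a directory qualifies as a training dataset is to check the text/annotation pairing yourself. This is a minimal sketch using only the standard library; the helper name `is_training_dataset` is hypothetical and not part of medaCy's API:

```python
from pathlib import Path

def is_training_dataset(data_dir):
    """Return True if every raw .txt file has a matching .ann annotation file."""
    data_dir = Path(data_dir)
    txt_stems = {f.stem for f in data_dir.glob('*.txt')}
    ann_stems = {f.stem for f in data_dir.glob('*.ann')}
    # A training dataset requires an annotation file for every text file
    return bool(txt_stems) and txt_stems <= ann_stems
```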

For the following examples, assume your data directory *home/medacy/data* is structured as follows:
```
home/medacy/data
├── file_one.ann
├── file_one.txt
├── file_two.ann
└── file_two.txt
```

## Table of contents
1. [Creating a Dataset](#creating-a-dataset)
2. [Using a Dataset](#using-a-dataset)

## Creating a Dataset
MedaCy provides two functionalities for loading data:
1. [Loading data from your machine](#loading-data-locally).
2. [Loading an existing medaCy compatible dataset](#loading-a-medacy-compatible-dataset).


## Loading data locally
To create a *Dataset*, simply instantiate one with a path to the directory containing your data.

```python
from medacy.data.dataset import Dataset
data = Dataset('/home/medacy/data')
```

MedaCy **does not** alter the data you load in any way - it only reads from it.

A common data workflow might look like this:

```pythonstub
>>> from medacy.data.dataset import Dataset
>>> from medacy.pipeline_components.feature_overlayers.metamap.metamap import MetaMap

>>> dataset = Dataset('/home/medacy/data')
>>> for data_file in dataset:
...     data_file.file_name
'file_one'
'file_two'
>>> dataset
['file_one', 'file_two']
>>> dataset.is_metamapped()
False
>>> metamap = MetaMap('/home/path/to/metamap/binary')
>>> with metamap:
...     dataset.metamap(metamap)
>>> dataset.is_metamapped()
True
```

If all of your data was MetaMapped successfully, your data directory will look like this:

```
home/medacy/data
├── file_one.ann
├── file_one.txt
├── file_two.ann
├── file_two.txt
└── metamapped
    ├── file_one.metamapped
    └── file_two.metamapped
```
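
Conceptually, `is_metamapped()` just verifies that every text file has output under *metamapped/*. A rough standalone equivalent, assuming the layout and `.metamapped` suffix shown above (the helper `all_metamapped` is hypothetical, not medaCy's implementation):

```python
from pathlib import Path

def all_metamapped(data_dir):
    """Check that every .txt file has a matching file under metamapped/."""
    data_dir = Path(data_dir)
    out_dir = data_dir / 'metamapped'
    return all((out_dir / f'{f.stem}.metamapped').exists()
               for f in data_dir.glob('*.txt'))
```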

## Using a Dataset
A *Dataset* is utilized for two main tasks:

1. [Model Training](#model-training)
2. [Model Prediction](#model-prediction)

### Model Training
To utilize a *Dataset* for training, ensure that the data you're loading is valid training data in a supported annotation format. After creating a *Model* with a processing *Pipeline*, simply pass the *Dataset* in for training. Here is an example of training an NER model for extraction of information relevant to nanoparticles.

```python
from medacy.data.dataset import Dataset
from medacy.model.model import Model
from medacy.pipelines import FDANanoDrugLabelPipeline

dataset = Dataset('/home/medacy/data')
entities = ['Nanoparticle', 'Dose']
pipeline = FDANanoDrugLabelPipeline(entities=entities)
model = Model(pipeline)

model.fit(dataset)
```

**Note**: Unless you have tuned your *Pipeline* to extract features relevant to your problem domain, the trained model will likely not be very predictive. See [Training a model](model_training.md).

### Model Prediction

Once you have trained or imported a model, pass in a *Dataset* object for bulk prediction of text.

```python
from medacy.data.dataset import Dataset
from medacy.model.model import Model

dataset = Dataset('/home/medacy/data')
model = Model.load_external('medacy_model_clinical_notes')

model.predict(dataset)
```

By default, this creates a sub-directory in your prediction dataset named *predictions*. Assuming the file structure described previously, your directory would look like this:

```
/home/medacy/data
├── file_one.txt
├── file_two.txt
└── predictions
    ├── file_one.ann
    └── file_two.ann
```

where all files under *predictions* are the trained model's predictions over your test data.
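
The generated *.ann* files use the brat standoff format, where each entity line looks like `T1<TAB>Dose 0 5<TAB>50 mg`. As a rough illustration of what the predictions contain, a minimal reader for contiguous entity spans might look like this (the helper `read_brat_entities` is hypothetical, not part of medaCy):

```python
from pathlib import Path

def read_brat_entities(ann_path):
    """Parse entity (T) lines from a brat standoff .ann file.

    Returns a list of (label, start, end, text) tuples. This sketch
    assumes contiguous spans; brat's discontinuous spans ('0 5;10 15')
    are not handled.
    """
    entities = []
    for line in Path(ann_path).read_text().splitlines():
        if not line.startswith('T'):
            continue  # skip relations, attributes, notes, etc.
        _, type_span, text = line.split('\t', 2)
        label, start, end = type_span.split()[:3]
        entities.append((label, int(start), int(end), text))
    return entities
```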