# Using medaCy's Command Line Interface

MedaCy provides a command line interface (CLI) for using medaCy's core functionality. The core functionality includes
cross validation, model training, and directory-level prediction, but other modules within medaCy also provide command
line interfaces. While instructions can be displayed with the command `python -m medacy --help`,
this guide provides a more thorough set of instructions.

A command to cross validate using the ClinicalPipeline might look like this:

```bash
(medacy_venv) $ python -m medacy -pl ClinicalPipeline -d ./your/dataset validate -k 7 -gt ./ground
```

Note that the pipeline and dataset are specified before the `validate` command, and that some arguments follow the
name of the command. The first step to using the CLI is to identify which arguments must come before the command
and which must come after.

## Global Arguments

Arguments are global when they could be applied to more than one command.

### Dataset Arguments

* `-d` specifies the path to a dataset directory. If you are cross validating or fitting, this is the dataset of txt and ann files
that will be used for that process. If you are predicting, this is the directory of txt files that will be predicted over.
* `-ent` specifies the path to a JSON with an "entities" key. This key's value should be a list of entities that is a
subset of those that appear in the selected dataset. If this argument is not used, medaCy will automatically use all the entities
that appear in the specified dataset.
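
For example, a JSON passed to `-ent` that restricts medaCy to two entity types might look like the following (the entity names here are illustrative; use names that actually appear in your dataset's ann files):

```json
{
    "entities": ["Drug", "Dose"]
}
```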
### Pipeline Arguments

Two arguments relate specifically to pipeline selection; only one should be used in a given command:

* `-pl` specifies which pre-configured pipeline to use. Options include `ClinicalPipeline`, `SystematicReviewPipeline`,
`BertPipeline`, and `LstmSystematicReviewPipeline`.
* `-cpl` specifies the path to a JSON from which a custom pipeline can be constructed; see [here](creating_custom_pipeline_from_json.md) for a guide
on creating custom pipelines with JSON files.

Note that `-ent` and `-cpl` both require JSON files. These can be the same JSON.

### Learner Arguments

While pipelines typically pre-configure the learners that they use, the BiLSTM and BERT learners allow some arguments to be specified
from the command line.

These arguments relate specifically to learning algorithms. They can be specified regardless of which pipeline or
learner has been selected, but they will only be used if the selected learner supports the given parameter; otherwise,
they are simply ignored.

* `-c` specifies which GPU device to use; if this is not specified, the command will run on the CPU, which may result
in slower performance for the BiLSTM and BERT learners. The authors of medaCy recommend the Python module [gpustat](https://pypi.org/project/gpustat/) for checking GPU availability.
* `-w` specifies a path to a word embedding binary file. This is required for the BiLSTM and is not used by any other learner.

The following arguments have default values but can be adjusted if you have knowledge of how changing them might affect
the learning process:

* `-b` specifies a batch size.
* `-e` specifies a number of epochs.
* `-pm` specifies a pretrained model for the BERT learner.
* `-crf` causes the BERT learner to use a CRF layer when this flag is present; it does not take an argument.
* `-lr` specifies a learning rate.
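
Because learner arguments are global, they go before the command name. A BERT cross validation that sets a GPU device, batch size, epoch count, and learning rate and enables the CRF layer might look like this (the dataset path and hyperparameter values are illustrative):

```bash
(medacy_venv) $ python -m medacy -d ./your/dataset -pl BertPipeline -c 0 -b 32 -e 10 -lr 3e-5 -crf validate
```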

### Logging and Testing Arguments

* `-t` runs the command in testing mode, which will only use one file in the specified dataset and use more verbose logging.
* `-lf` specifies a log file. The default log file is `medacy_n.log`, where `n` is the GPU being used or `cpu`.
This argument allows you to specify a different log file.
* `-lc` causes the logging information to also be printed to the console.
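
For example, to quickly check that a configuration runs at all before committing to a full run, you might combine testing mode with console logging (the dataset path is illustrative):

```bash
(medacy_venv) $ python -m medacy -d ./your/dataset -pl ClinicalPipeline -t -lc validate
```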
## `validate`

If you don't have a pretrained medaCy model, the first step to using medaCy is creating or selecting a pipeline
configuration and validating it with your annotated dataset.

* `-k` specifies a number of folds. MedaCy uses sentence-level stratification. 10 is used by default, but a lower number will run more quickly.
* `-gt` specifies a directory to store the groundtruth version of the dataset. These are ann files showing which tokens medaCy assigns each
label to. This is not guaranteed to match how the annotations appear in the dataset itself because of how the selected pipeline tokenizes the data.
* `-pd` specifies a directory in which to store the predictions made throughout the cross validation process.

Note that the directories for `-gt` and `-pd` must already exist. If these arguments are not passed, the cross validation groundtruth and predictions will not be written anywhere.

The data outputted to the `-gt` and `-pd` directories is especially useful when combined with medaCy's inter-dataset
agreement calculator. The command for this is `python -m medacy.tools.calculators.inter_dataset_agreement`.
Inter-dataset agreement can measure how similar the groundtruth is to the original dataset, how similar the fold predictions
are to the groundtruth, etc.

A validate command might look like one of these:

```bash
(medacy_venv) $ python -m medacy -d ~/datasets/my_dataset -cpl ./my_pipeline.json validate -k 7 -gt ./ground -pd ./pred
(medacy_venv) $ python -m medacy -d ~/datasets/my_dataset -pl LstmSystematicReviewPipeline -c 0 -w ~/datasets/my_word_embeddings.bin validate
```

## `train`

Little needs to be specified here because the dataset and pipeline arguments
are given before the command name, as described under Global Arguments.

* `-f` specifies the location to write the model binary to.
* `-gt` specifies a directory to write the groundtruth to, the same as for `validate`.

```bash
(medacy_venv) $ python -m medacy -d ~/datasets/my_dataset -cpl ./my_pipeline.json train -gt ./ground -f ./my_crf_model.pkl
(medacy_venv) $ python -m medacy -d ~/datasets/my_dataset -pl LstmSystematicReviewPipeline -c 0 -w ~/datasets/my_word_embeddings.bin train -f ./my_bilstm_model.pkl
```

## `predict`

When predicting, the global arguments must specify which pipeline to use and which dataset of txt files to predict over.

* `-m` specifies the path to the model binary file to use. Remember to use the same pipeline that was used to create the model binary selected here.
* `-pd` specifies the directory to write the predictions to.

```bash
(medacy_venv) $ python -m medacy -d ~/datasets/my_txt_files -cpl ./my_pipeline.json predict -m ./my_crf_model.pkl -pd ./crf_predictions
(medacy_venv) $ python -m medacy -d ~/datasets/my_txt_files -pl LstmSystematicReviewPipeline -w ~/datasets/my_word_embeddings.bin predict -m ./my_bilstm_model.pkl -pd ./bilstm_predictions
```