# Using medaCy's Command Line Interface

MedaCy provides a command line interface (CLI) for its core functionality: cross validation, model training, and directory-level prediction. Other modules within medaCy also provide command line interfaces. While basic instructions can be displayed with the command `python -m medacy --help`, this guide provides a more thorough set of instructions.

A command to cross validate using the ClinicalPipeline might look like this:

```bash
(medacy_venv) $ python -m medacy -pl ClinicalPipeline -d ./your/dataset validate -k 7 -gt ./ground
```

Note that the pipeline and dataset are specified before the `validate` command, while other arguments follow the command name. The first step to using the CLI is identifying which arguments must come before the command and which must come after.

## Global Arguments

Arguments are global when they could be applied to more than one command.

### Dataset Arguments

* `-d` specifies the path to a dataset directory. When cross validating or fitting, this is the dataset of txt and ann files used for that process; when predicting, it is the directory of txt files to predict over.
* `-ent` specifies the path to a JSON file with an "entities" key. The key's value should be a list of entities that is a subset of those appearing in the selected dataset. If this argument is not used, medaCy will automatically use all the entities that appear in the specified dataset.
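
For example, an entities file restricting a run to a subset of labels might look like the following sketch (the entity names here are purely illustrative; use labels that actually appear in your dataset's ann files):

```json
{
    "entities": ["Drug", "Dosage", "Frequency"]
}
```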

### Pipeline Arguments

Two arguments relate specifically to pipeline selection; only one should be used in a given command:
* `-pl` specifies which pre-configured pipeline to use. Options include `ClinicalPipeline`, `SystematicReviewPipeline`, `BertPipeline`, and `LstmSystematicReviewPipeline`.
* `-cpl` specifies the path to a JSON file from which a custom pipeline can be constructed; see [here](creating_custom_pipeline_from_json.md) for a guide on creating custom pipelines with JSON files.

Note that `-ent` and `-cpl` both require JSON files. These can be the same JSON file.

### Learner Arguments

While pipelines typically pre-configure the learners that they use, the BiLSTM and BERT learners allow some arguments to be specified from the command line.

These arguments relate specifically to learning algorithms. They can be specified regardless of which pipeline or learner has been selected, but they will only be used if the pipeline's learner supports them; otherwise they are simply ignored.

* `-c` specifies which GPU device to use; if it is not specified, the command will run on the CPU, which may result in slower performance for the BiLSTM and BERT learners. The authors of medaCy recommend the Python module [gpustat](https://pypi.org/project/gpustat/) for checking GPU availability.
* `-w` specifies a path to a word embedding binary file. This is required for the BiLSTM and is not used by any other learner.

The following arguments have default values but can be adjusted if you know how changing them might affect the learning process:
* `-b` specifies a batch size.
* `-e` specifies a number of epochs.
* `-pm` specifies a pretrained model for the BERT learner.
* `-crf` causes the BERT learner to use a CRF layer when this flag is present; it does not take an argument.
* `-lr` specifies a learning rate.
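
Putting these together, a cross validation command for the BertPipeline that overrides several learner defaults might look like this (the batch size, epoch count, and learning rate shown are illustrative values, not recommendations):

```bash
(medacy_venv) $ python -m medacy -pl BertPipeline -d ./your/dataset -c 0 -b 16 -e 10 -lr 1e-5 -crf validate
```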

### Logging and Testing Arguments

* `-t` runs the command in testing mode, which uses only one file in the specified dataset and more verbose logging.
* `-lf` specifies a log file. The default log file is `medacy_n.log`, where `n` is the GPU being used or `cpu`; this argument allows you to specify a different log file.
* `-lc` causes the logging information to also be printed to the console.
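
When debugging a new pipeline configuration, it can be convenient to combine these flags; for example, the following (hypothetical) command runs a quick one-file validation with console logging:

```bash
(medacy_venv) $ python -m medacy -pl ClinicalPipeline -d ./your/dataset -t -lc validate
```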

## `validate`

If you don't have a pretrained medaCy model, the first step to using medaCy is creating or selecting a pipeline configuration and validating it with your annotated dataset.

* `-k` specifies a number of folds. MedaCy uses sentence-level stratification. 10 folds are used by default, but a lower number will run more quickly.
* `-gt` specifies a directory to store the groundtruth version of the dataset. These are the ann files in which spans are aligned with the tokens that medaCy assigns a given label. They are not guaranteed to match the ann files in the dataset itself because of how the selected pipeline tokenizes the data.
* `-pd` specifies a directory in which to store the predictions made throughout the cross validation process.

Note that the directories for `-gt` and `-pd` must already exist. If these arguments are not passed, the cross validation groundtruth and predictions will not be written anywhere.

The data written to the `-gt` and `-pd` directories is especially useful when combined with medaCy's inter-dataset agreement calculator, invoked with `python -m medacy.tools.calculators.inter_dataset_agreement`. Inter-dataset agreement can measure how similar the groundtruth is to the original dataset, how similar the fold predictions are to the groundtruth, and so on.

A validate command might look like one of these:

```bash
(medacy_venv) $ python -m medacy -d ~/datasets/my_dataset -cpl ./my_pipeline.json validate -k 7 -gt ./ground -pd ./pred
(medacy_venv) $ python -m medacy -d ~/datasets/my_dataset -pl LstmSystematicReviewPipeline -c 0 -w ~/datasets/my_word_embeddings.bin validate
```

## `train`

Little needs to be specified here because the model, dataset, and pipeline arguments are given as global arguments.

* `-f` specifies the location to write the model binary to.
* `-gt` specifies a directory to write the groundtruth to, the same as for `validate`.

```bash
(medacy_venv) $ python -m medacy -d ~/datasets/my_dataset -cpl ./my_pipeline.json train -gt ./ground -f ./my_crf_model.pkl
(medacy_venv) $ python -m medacy -d ~/datasets/my_dataset -pl LstmSystematicReviewPipeline -c 0 -w ~/datasets/my_word_embeddings.bin train -f ./my_bilstm_model.pkl
```

## `predict`

When predicting, specify as global arguments which pipeline to use and which dataset of txt files to predict over.

* `-m` specifies the path to the model binary file to use. Remember to use the same pipeline that was used to create the model binary selected here.
* `-pd` specifies the directory to write the predictions to.

```bash
(medacy_venv) $ python -m medacy -d ~/datasets/my_txt_files -cpl ./my_pipeline.json predict -m ./my_crf_model.pkl -pd ./crf_predictions
(medacy_venv) $ python -m medacy -d ~/datasets/my_txt_files -pl LstmSystematicReviewPipeline -w ~/datasets/my_word_embeddings.bin predict -m ./my_bilstm_model.pkl -pd ./bilstm_predictions
```