1
# Changelog
2
3
## Unreleased
4
5
### Added
6
7
- Added gradient spike detection to the `edsnlp.train` script, as well as per-layer gradient logging.
8
9
### Fixed
10
11
- Fixed mini-batch accumulation for multi-task training
12
13
## v0.17.0 (2025-04-15)
14
15
### Added
16
17
- Support for numpy>2.0, and formal support for Python 3.11 and Python 3.12
18
- Expose the default patterns of `eds.negation`, `eds.hypothesis`, `eds.family`, `eds.history` and `eds.reported_speech` under a `eds.negation.default_patterns` attribute
19
- Added a `context_getter` SpanGetter argument to the `eds.matcher` class to only retrieve entities inside the spans returned by the getter
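  A hypothetical sketch (the terms and the `"sections"` span group are illustrative, not part of this entry):

  ```{ .python .no-check }
  nlp.add_pipe(
      eds.matcher(
          terms={"diabetes": ["diabete", "diabetique"]},
          context_getter="sections",  # only keep matches found inside these spans
      )
  )
  ```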
20
- Added a `filter_expr` parameter to scorers to filter the documents to score
21
- Added a new `required` field to `eds.contextual_matcher` assign patterns so that an entity is matched only if the required field has been found, and an `include` parameter (similar to `exclude`) to search for required patterns without assigning them to the entity
22
- Added context strings (e.g., "words[0:5] | sent[0:1]") to the `eds.contextual_matcher` component to allow for more complex patterns in the selection of the window around the trigger spans.
23
- Include and exclude patterns in the contextual matcher now dismiss matches that occur inside the anchor pattern (e.g., an "anti" exclude pattern with the anchor pattern "antibiotics" will no longer match the "anti" part of "antibiotics")
24
- Pull Requests will now build a publicly accessible preview of the docs
25
26
### Changed
27
- Improve the contextual matcher documentation.
28
29
### Fixed
30
31
- `edsnlp.package` now correctly detects whether a project uses an old-style poetry pyproject or a PEP621 pyproject.toml.
32
- PEP621 projects containing nested directories (e.g., "my_project/pipes/foo.py") are now supported.
33
- Try several paths to find the current pip executable
34
- The parameter "value_extract" of `eds.score` now correctly handles lists of patterns.
35
- "Zero variance error" when computing param tuning importance are now catched and converted as a warning
36
37
## v0.16.0 (2025-03-26)
38
39
### Added
40
- Hyperparameter Tuning for EDS-NLP: introduced a new script `edsnlp.tune` for hyperparameter tuning using Optuna. This feature allows users to efficiently optimize model parameters with options for single-phase or two-phase tuning strategies. Includes support for parameter importance analysis, visualization, pruning, and automatic handling of GPU time budgets.
41
- Provided a [detailed tutorial](https://aphp.github.io/edsnlp/v0.16.0/tutorials/tuning/) on hyperparameter tuning, covering usage scenarios and configuration options.
42
- `ScheduledOptimizer` (e.g., `@core: "optimizer"`) now supports importing optimizers using their qualified name (e.g., `optim: "torch.optim.Adam"`).
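  For instance, a minimal sketch (assuming an existing `nlp` pipeline; the values are illustrative):

  ```{ .python .no-check }
  ScheduledOptimizer(
      optim="torch.optim.AdamW",  # optimizer referenced by its qualified name
      module=nlp,
      total_steps=1000,
  )
  ```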
43
- `eds.ner_crf` now computes confidence score on spans.
44
45
### Changed
46
47
- The loss of `eds.ner_crf` is now computed as the mean over the words instead of the sum. This change is compatible with multi-gpu training.
48
- Having multiple stats keys matching a batching pattern now warns instead of raising an error.
49
50
### Fixed
51
52
- Support packaging with poetry 2.0
53
- Solve pickling issues with multiprocessing when pytorch is installed
54
- Allow deep attributes like `a.b.c` for `span_attributes` in Standoff and OMOP doc2dict converters
55
- Fixed various aspects of stream shuffling:
56
57
  - Ensure the Parquet reader shuffles the data when `shuffle=True`
58
  - Ensure we don't overwrite the RNG of the data reader when calling `stream.shuffle()` with no seed
59
  - Raise an error if the batch size in `stream.shuffle(batch_size=...)` is not compatible with the stream
60
- `eds.split` now keeps doc and span attributes in the sub-documents.
61
62
## v0.15.0 (2024-12-13)
63
64
### Added
65
66
- `edsnlp.data.read_parquet` now accepts a `work_unit="fragment"` option to split tasks between workers by parquet fragment instead of by row. When this is enabled, instead of every worker reading every fragment while skipping 1 in n rows, each worker reads all rows of 1/n of the fragments, which should be faster.
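  For instance, a sketch (the path and converter are illustrative):

  ```{ .python .no-check }
  import edsnlp

  data = edsnlp.data.read_parquet(
      "path/to/notes.parquet",  # hypothetical dataset path
      converter="omop",
      work_unit="fragment",  # one task per parquet fragment instead of one per row
  )
  data = data.map_pipeline(nlp)
  ```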
67
- Accept no validation data in `edsnlp.train` script
68
- Log the training config at the beginning of the trainings
69
- Support a specific model output dir path for trainings (`output_model_dir`), and whether to save the model or not (`save_model`)
70
- Specify whether to log the validation results or not (`logger=False`)
71
- Added support for the CoNLL format with `edsnlp.data.read_conll` and with a specific `eds.conll_dict2doc` converter
72
- Added a Trainable Biaffine Dependency Parser (`eds.biaffine_dep_parser`) component and metrics
73
- New `eds.extractive_qa` component to perform extractive question answering using questions as prompts to tag entities instead of a list of predefined labels as in `eds.ner_crf`.
74
75
### Fixed
76
77
- Fix `join_thread` missing attribute in `SimpleQueue` when cleaning a multiprocessing executor
78
- Support huggingface transformers that do not set `cls_token_id` and `sep_token_id` (we now also look for these tokens in the `special_tokens_map` and `vocab` mappings)
79
- Fix changing scorers dict size issue when evaluating during training
80
- Seed random states (instead of using `random.RandomState()`) when shuffling in data readers: this is important for
81
  1. reproducibility
82
  2. ensuring, in multiprocessing mode, that the same data is shuffled in the same way in all workers
83
- Bubble BaseComponent instantiation errors correctly
84
- Improved support for multi-gpu gradient accumulation (only sync the gradients at the end of the accumulation), now controlled by the optional `sub_batch_size` argument of `TrainingData`.
85
- edsnlp can again be used without pytorch installed
86
- We now test that edsnlp works without pytorch installed
87
- Fix units and scales, i.e., 1 l = 1 dm³ and 1 ml = 1 cm³
88
89
## v0.14.0 (2024-11-14)
90
91
### Added
92
93
- Support for setuptools based projects in `edsnlp.package` command
94
- Pipelines can now be instantiated directly from a config file (instead of having to cast a dict containing their arguments) by putting the `@core = "pipeline"` or `@core = "load"` field in the pipeline section
95
- `edsnlp.load` now correctly takes the `disable`, `enable` and `exclude` parameters into account
96
- Pipeline now has a basic repr showing its base language (mostly useful to know its tokenizer) and its pipes
97
- New `python -m edsnlp.evaluate` script to evaluate a model on a dataset
98
- Sentence detection can now be configured to change the minimum number of newlines required to trigger a newline-based sentence boundary, and to disable capitalization checking.
99
- New `eds.split` pipe to split a document into multiple documents based on a splitting pattern (useful for training)
100
- Allow `converter` argument of `edsnlp.data.read/from_...` to be a list of converters instead of a single converter
101
- New revamped and documented `edsnlp.train` script and API
102
- Support YAML config files (only CFG/INI files were supported before)
103
- Most EDS-NLP functions are now clickable in the documentation
104
- ScheduledOptimizer now accepts schedules directly in place of parameters, and easy parameter selection:
105
106
    ```
    ScheduledOptimizer(
        optim="adamw",
        module=nlp,
        total_steps=2000,
        groups={
            "^transformer": {
                # lr will go from 0 to 5e-5 then back to 0 for params matching "transformer"
                "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 0, "max_value": 5e-5},
            },
            "": {
                # lr will stay at 3e-4 for the first 200 steps then decay to 0 for the other params
                "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 3e-4, "max_value": 3e-4},
            },
        },
    )
    ```
123
124
### Changed
125
126
- `eds.span_context_getter`'s parameter `context_sents` is no longer optional and must be explicitly set to 0 to disable sentence context
127
- In multi-GPU setups, streams that contain torch components are now stripped of their parameter tensors when sent to CPU Workers since these workers only perform preprocessing and postprocessing and should therefore not need the model parameters.
128
- The `batch_size` argument of `Pipeline` is deprecated and is not used anymore. Use the `batch_size` argument of `stream.map_pipeline` instead.
129
130
### Fixed
131
132
- Sort files before iterating over a standoff or json folder to ensure reproducibility
133
- Sentence detection now correctly matches capitalized letters + apostrophe
134
- We now ensure that the workers pool is properly closed whatever happens (exception, garbage collection, data ending) in the `multiprocessing` backend. This prevents some executions from hanging indefinitely at the end of the processing.
135
- Propagate torch sharing strategy to other workers in the `multiprocessing` backend. This is useful when the system is running out of file descriptors and `ulimit -n` is not an option. Torch sharing strategy can also be set via an environment variable `TORCH_SHARING_STRATEGY` (default is `file_descriptor`, [consider using `file_system` if you encounter issues](https://pytorch.org/docs/stable/multiprocessing.html#file-system-file-system)).
136
137
### Data API changes
138
139
- `LazyCollection` objects are now called `Stream` objects
140
- By default, `multiprocessing` backend now preserves the order of the input data. To disable this and improve performance, use `deterministic=False` in the `set_processing` method
141
- :rocket: Parallelized GPU inference throughput improvements!
142
143
    - For simple {pre-process → model → post-process} pipelines, GPU inference can be up to 30% faster in non-deterministic mode (results can be out of order) and up to 20% faster in deterministic mode (results are in order)
144
    - For multitask pipelines, GPU inference can be up to twice as fast (measured in a two-tasks BERT+NER+Qualif pipeline on T4 and A100 GPUs)
145
146
- The `.map_batches`, `.map_pipeline` and `.map_gpu` methods now support a specific `batch_size` and batching function, instead of having a single batch size for all pipes
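  For example, a sketch (the function and batch sizes are illustrative):

  ```{ .python .no-check }
  data = (
      data.map_batches(sort_by_length, batch_size=1024)  # user function applied on large batches
      .map_pipeline(nlp, batch_size=32)  # smaller batches for the deep-learning pipes
  )
  ```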
147
- Readers now have a `loop` parameter to cycle over the data indefinitely (useful for training)
148
- Readers now have a `shuffle` parameter to shuffle the data before iterating over it
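  For instance, combining the two reader options above (a sketch with a hypothetical path):

  ```{ .python .no-check }
  data = edsnlp.data.read_parquet(
      "path/to/train.parquet",
      converter="omop",
      shuffle=True,  # shuffle the data before iterating over it
      loop=True,  # cycle over the dataset indefinitely (useful for training)
  )
  ```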
149
- In `multiprocessing` mode, file-based readers now read the data in the workers (this was previously an option)
150
- We now support two new special batch sizes:
151
152
    - "fragment" in the case of parquet datasets: rows of a full parquet file fragment per batch
153
    - "dataset" which is mostly useful during training, for instance to shuffle the dataset at each epoch.
154
  These are also compatible with batched writers such as parquet, where each input fragment can be processed and mapped to a single matching output fragment.
155
156
- :boom: Breaking change: a `map` function returning a list or a generator won't be automatically flattened anymore. Use `flatten()` to flatten the output if needed. This shouldn't change the behavior for most users since most writers (to_pandas, to_polars, to_parquet, ...) still flatten the output
157
- :boom: Breaking change: the `chunk_size` and `sort_chunks` parameters are now deprecated: to sort data before applying a transformation, use `.map_batches(custom_sort_fn, batch_size=...)`
158
159
### Training API changes
160
161
- We now provide a training script `python -m edsnlp.train --config config.cfg` that should fit many use cases. Check out the docs!
162
- In particular, we do not require pytorch's Dataloader for training and can rely solely on the EDS-NLP stream/data API, which is better suited for large streamable datasets and dynamic preprocessing (i.e., a different result each time we apply a noised preprocessing op on a sample).
163
- Each trainable component can now provide a `stats` field in its `preprocess` output to log info about the sample (number of words, tokens, spans, ...):
164
165
    - these stats are used for batching (e.g., make batches of no more than "25000 tokens")
166
    - for logging
167
    - for computing correct loss means when accumulating gradients over multiple mini-mini-batches
168
    - for computing correct loss means in multi-GPU setups, since these stats are synchronized and accumulated across GPUs
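  A rough sketch of what such a `preprocess` output could look like (field names are illustrative):

  ```{ .python .no-check }
  def preprocess(self, doc):
      return {
          "embedding": self.embedding.preprocess(doc),
          "stats": {"words": len(doc)},  # used for batching, logging and loss normalization
      }
  ```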
169
170
- Support multi-GPU training via huggingface `accelerate`; the EDS-NLP `Stream` API now takes the `WORLD_SIZE` and `LOCAL_RANK` environment variables into account
171
172
## v0.13.1
173
174
### Added
175
176
- `eds.tables` accepts a `minimum_table_size` (default 2) argument to reduce pollution
177
- `RuleBasedQualifier` now exposes a `process` method that only returns qualified entities and tokens without actually tagging them, deferring this task to the `__call__` method.
178
- Added new patterns for metastasis detection. Developed on CT-Scan reports.
179
- Added citation of articles
180
181
### Changed
182
183
- Renamed `edsnlp.scorers` to `edsnlp.metrics` and removed the `_scorer` suffix from their
184
  registry name (e.g., `@scorers = ner_overlap_scorer` → `@metrics = ner_overlap`)
185
- Rename `eds.measurements` to `eds.quantities`
186
- scikit-learn (used in `eds.endlines`) is no longer installed by default when installing `edsnlp[ml]`
187
188
### Fixed
189
190
- Disorder and Behavior pipes don't use a "PRESENT" or "ABSENT" `status` anymore. Instead, `status=None` by default,
191
  and `ent._.negation` is set to True instead of setting `status` to "ABSENT". To this end, the *tobacco* and *alcohol* components
192
  now use the `NegationQualifier` internally.
193
- Numbers are now detected without trying to remove the pollution in between digits, i.e., `55 @ 77777` could previously be detected as a single number, but not anymore.
194
- Resolve encoding-related data reading issues by forcing utf-8
195
196
## v0.13.0
197
198
### Added
199
200
- `data.set_processing(...)` now exposes an `autocast` parameter to disable or tweak the automatic casting of tensors
201
  during the processing. Autocasting should result in a slight speedup, but may lead to numerical instability.
202
- Use `torch.inference_mode` to disable view tracking and version counter bumps during inference.
203
- Added a new NER pipeline for suicide attempt detection
204
- Added date cues (regular expression matches that contributed to a date being detected) under the extension `ent._.date_cues`
205
- Added tables processing in `eds.measurements`
206
- Added 'all' as a possible input in the `eds.measurements` measurements config
207
- Added new units in `eds.measurements`
208
209
### Changed
210
211
- Default to mixed precision inference
212
213
### Fixed
214
215
- `edsnlp.load("your/huggingface-model", install_dependencies=True)` now correctly resolves the python pip
216
  (especially on Colab) to auto-install the model dependencies
217
- We now better handle empty documents in the `eds.transformer`, `eds.text_cnn` and `eds.ner_crf` components
218
- Support mixed precision in `eds.text_cnn` and `eds.ner_crf` components
219
- Support pre-quantization (<4.30) transformers versions
220
- Verify that all batches are non-empty
221
- Fix `span_context_getter` for `context_words` = 0, `context_sents` > 2 and support asymmetric contexts
222
- Don't split sentences on rare unicode symbols
223
- Better detect abbreviations, like `E.coli`, now split as [`E.`, `coli`] and not [`E`, `.`, `coli`]
224
225
## v0.12.3
226
227
### Changed
228
229
Packages:
230
231
- Pip-installable models are now built with `hatch` instead of poetry, which allows us to expose `artifacts` (weights)
232
  at the root of the sdist package (uploadable to HF) and move them inside the package upon installation to avoid conflicts.
233
- Dependencies are no longer inferred with dill-magic (this didn't work well before anyway)
234
- Option to perform substitutions in the model's README.md file (e.g., for the model's name, metrics, ...)
235
- Huggingface models are now installed with pip *editable* installations, which is faster since it doesn't copy around the weights
236
237
## v0.12.1
238
239
### Added
240
241
- Added binary distribution for linux aarch64 (Streamlit's environment)
242
- Added a new separator option in `eds.tables` and a new input check
243
244
### Fixed
245
246
- Make catalogue & entrypoints compatible with py37-py312
247
- Check that a data item has a doc before trying to use the document's `note_datetime`
248
249
## v0.12.0
250
251
### Added
252
253
- The `eds.transformer` component now accepts `prompts` (passed to its `preprocess` method, see breaking change below) to add before each window of text to embed.
254
- `LazyCollection.map` / `map_batches` now support generator functions as arguments.
255
- Window stride can now be disabled (i.e., stride = window) during training in the `eds.transformer` component by setting `training_stride = False`
256
- Added a new `eds.ner_overlap_scorer` to evaluate matches between two lists of entities, counting a true positive when the Dice overlap is above a given threshold
257
- `edsnlp.load` now accepts EDS-NLP models from the huggingface hub 🤗!
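  For example (the model name is a placeholder):

  ```{ .python .no-check }
  import edsnlp

  nlp = edsnlp.load("your/huggingface-model", install_dependencies=True)
  ```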
258
- New `python -m edsnlp.package` command to package a model for the huggingface hub or pypi-like registries
259
- Improve table detection in `eds.tables` and support new options in `table._.to_pd_table(...)`:
260
  - `header=True` to use first row as header
261
  - `index=True` to use first column as index
262
  - `as_spans=True` to fill cells as document spans instead of strings
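  For instance, a sketch (`table` being a span produced by `eds.tables`):

  ```{ .python .no-check }
  df = table._.to_pd_table(header=True, index=True, as_spans=False)
  ```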
263
264
### Changed
265
266
- :boom: Major breaking change in trainable components, moving towards a more "task-centric" design:
267
  - the `eds.transformer` component is no longer responsible for deciding which spans of text ("contexts") should be embedded. These contexts are now passed via the `preprocess` method, which now accepts more arguments than just the docs to process.
268
  - similarly, the `eds.span_pooler` is no longer responsible for deciding which spans to pool, and instead pools all spans passed to it in the `preprocess` method.
269
270
  Consequently, the `eds.transformer` and `eds.span_pooler` no longer accept their `span_getter` argument, and the `eds.ner_crf`, `eds.span_classifier`, `eds.span_linker` and `eds.span_qualifier` components now accept a `context_getter` argument instead, as well as a `span_getter` argument for the latter two. This refactoring can be summarized as follows:
271
272
    ```diff
273
    - eds.transformer.span_getter
274
    + eds.ner_crf.context_getter
275
    + eds.span_classifier.context_getter
276
    + eds.span_linker.context_getter
277
278
    - eds.span_pooler.span_getter
279
    + eds.span_qualifier.span_getter
280
    + eds.span_linker.span_getter
281
    ```
282
283
    and as an example for the `eds.span_linker` component:
284
285
    ```diff
286
    nlp.add_pipe(
287
        eds.span_linker(
288
            metric="cosine",
289
            probability_mode="sigmoid",
290
    +       span_getter="ents",
291
    +       # context_getter="ents",  -> by default, same as span_getter
292
            embedding=eds.span_pooler(
293
                hidden_size=128,
294
    -           span_getter="ents",
295
                embedding=eds.transformer(
296
    -               span_getter="ents",
297
                    model="prajjwal1/bert-tiny",
298
                    window=128,
299
                    stride=96,
300
                ),
301
            ),
302
        ),
303
        name="linker",
304
    )
305
    ```
306
- Trainable embedding components now all use `foldedtensor` to return embeddings, instead of returning a tensor of floats and a mask tensor.
307
- :boom: TorchComponent `__call__` no longer applies the end to end method, and instead calls the `forward` method directly, like all torch modules.
308
- The trainable `eds.span_qualifier` component has been renamed to `eds.span_classifier` to reflect its general purpose (it doesn't only predict qualifiers, but any attribute of a span using its context or not).
309
- `omop` converter now takes the `note_datetime` field into account by default when building a document
310
- `span._.date.to_datetime()` and `span._.date.to_duration()` now automatically take the `note_datetime` into account
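  For instance, a sketch (assuming `eds.dates` has been applied to the doc):

  ```{ .python .no-check }
  import datetime

  doc._.note_datetime = datetime.datetime(2024, 1, 1)
  date = doc.spans["dates"][0]
  date._.date.to_datetime()  # relative dates are now resolved against note_datetime
  ```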
311
- `nlp.vocab` is no longer serialized when saving a model, as it may contain sensitive information and can be recomputed during inference anyway
312
313
### Fixed
314
315
- `edsnlp.data.read_json` now correctly reads the files from the directory passed as an argument, and not from the parent directory.
316
- Overwrite spacy's Doc, Span and Token pickling utils to allow recursively storing Doc, Span and Token objects in the extension values (in particular, span._.date.doc)
317
- Removed pendulum dependency, solving various pickling, multiprocessing and missing attributes errors
318
319
## v0.11.2
320
321
### Fixed
322
- Fix `edsnlp.utils.file_system.normalize_fs_path` file system detection not working correctly
323
- Improved performance of `edsnlp.data` methods over a filesystem (`fs` parameter)
324
325
## v0.11.1 (2024-04-02)
326
327
### Added
328
329
- Automatic estimation of cpu count when using multiprocessing
330
- `optim.initialize()` method to create optim state before the first backward pass
331
332
### Changed
333
334
- `nlp.post_init` will not tee lazy collections anymore (use `edsnlp.utils.collections.multi_tee` yourself if needed)
335
336
### Fixed
337
338
- Corrected inconsistencies in `eds.span_linker`
339
340
## v0.11.0 (2024-03-29)
341
342
### Added
343
344
- Support for a `filesystem` parameter in every `edsnlp.data.read_*` and `edsnlp.data.write_*` functions
345
- Pipes of a pipeline are now easily accessible with `nlp.pipes.xxx` instead of `nlp.get_pipe("xxx")`
346
- Support builtin Span attributes in converters `span_attributes` parameter, e.g.
347
  ```python
348
  import edsnlp
349
350
  nlp = ...
351
  nlp.add_pipe("eds.sentences")
352
353
  data = edsnlp.data.from_xxx(...)
354
  data = data.map_pipeline(nlp)
355
  data.to_pandas(converters={"ents": {"span_attributes": ["sent.text", "start", "end"]}})
356
  ```
357
- Support assigning Brat AnnotatorNotes as span attributes: `edsnlp.data.read_standoff(...,  notes_as_span_attribute="cui")`
358
- Support for mapping full batches in `edsnlp.processing` pipelines with `map_batches` lazy collection method:
359
  ```python
360
  import edsnlp
361
362
  data = edsnlp.data.from_xxx(...)
363
  data = data.map_batches(lambda batch: do_something(batch))
364
  data.to_pandas()
365
  ```
366
- New `data.map_gpu` method to map a deep learning operation on some data and take advantage of edsnlp multi-gpu inference capabilities
367
- Added average precision computation in edsnlp span_classification scorer
368
- You can now add pipes to your pipeline by instantiating them directly, which comes with many advantages, such as auto-completion, introspection and type checking!
369
370
  ```python
371
  import edsnlp, edsnlp.pipes as eds
372
373
  nlp = edsnlp.blank("eds")
374
  nlp.add_pipe(eds.sentences())
375
  # instead of nlp.add_pipe("eds.sentences")
376
  ```
377
378
  *The previous way of adding pipes is still supported.*
379
- New `eds.span_linker` deep-learning component to match entities with their concepts in a knowledge base, in synonym-similarity or concept-similarity mode.
380
381
### Changed
382
383
- `nlp.preprocess_many` now uses lazy collections to enable parallel processing
384
- :warning: Breaking change. Improved and simplified `eds.span_qualifier`: we didn't support combination groups before, so this feature was scrapped for now. We now also support splitting values of a single qualifier between different span labels.
385
- Optimized edsnlp.data batching, especially for large batch sizes (removed a quadratic loop)
386
- :warning: Breaking change. By default, the name of components added to a pipeline is now the default name defined in their class `__init__` signature. For most components of EDS-NLP, this will change the name from "eds.xxx" to "xxx".
387
388
### Fixed
389
390
- Flatten list outputs (such as "ents" converter) when iterating: `nlp.map(data).to_iterable("ents")` is now a list of entities, and not a list of lists of entities
391
- Allow span pooler to choose between multiple base embedding spans (as likely produced by `eds.transformer`) by sorting them by Dice overlap score.
392
- EDS-NLP does not raise an error anymore when saving a model to an already existing, but empty directory
393
394
## v0.10.7 (2024-03-12)
395
396
### Added
397
398
- Support empty writer converter by default in `edsnlp.data` readers / writers (do not convert by default)
399
- Add support for polars data import / export
400
- Allow kwargs in `eds.transformer` to pass to the transformer model
401
402
### Changed
403
404
- Saving pipelines no longer saves the `disabled` status of the pipes (i.e., all pipes are considered "enabled" when saved). This feature was not used and was causing issues when saving a model wrapped in a `nlp.select_pipes` context.
405
406
### Fixed
407
408
- Allow missing `meta.json`, `tokenizer` and `vocab` paths when loading saved models
409
- Save torch buffers when dumping machine learning models to disk (previous versions only saved the model parameters)
410
- Fix automatic `batch_size` estimation in `eds.transformer` when `max_tokens_per_device` is set to `auto` and multiple GPUs are used
411
- Fix JSONL file parsing
412
413
## v0.10.6 (2024-02-24)
414
415
### Added
416
417
- Added `batch_by`, `split_into_batches_after`, `sort_chunks`, `chunk_size`, `disable_implicit_parallelism` parameters to processing (`simple` and `multiprocessing`) backends to improve performance
418
  and memory usage. Sorting chunks can improve yield up to **twice the speed** in some cases.
419
- The deep learning cache mechanism now supports multitask models with weight sharing in multiprocessing mode.
420
- Added `max_tokens_per_device="auto"` parameter to `eds.transformer` to estimate memory usage and automatically split the input into chunks that fit into the GPU.
421
422
### Changed
423
424
- Improved speed and memory usage of the `eds.text_cnn` pipe by running the CNN on a non-padded version of its input: expect a speedup up to 1.3x in real-world use cases.
425
- Deprecate the converters' (especially for BRAT/Standoff data) `bool_attributes`
426
  parameter in favor of general `default_attributes`. This new mapping describes how to
427
  set attributes on spans for which no attribute value was found in the input format.
428
  This is especially useful for negation, or frequent attribute values (e.g. "negated"
429
  is often False, "temporal" is often "present"), that annotators may not want to
430
  annotate every time.
431
- Default `eds.ner_crf` window is now set to 40 and stride set to 20, as it doesn't
432
  affect throughput (compared to before, window set to 20) and improves accuracy.
433
- New default `overlap_policy='merge'` option and parameter renaming in
434
  `eds.span_context_getter` (which replaces `eds.span_sentence_getter`)
435
436
### Fixed
437
438
- Improved error handling in `multiprocessing` backend (e.g., no more deadlock)
439
- Various improvements to the data processing related documentation pages
440
- Beginning-of-sentence / end-of-sentence transitions of the `eds.ner_crf` component are now
441
  disabled when windows are used (i.e., when `window` is neither 1, equivalent to softmax, nor
442
  0, equivalent to the default full-sequence Viterbi decoding)
443
- `eds` tokenizer now inherits from `spacy.Tokenizer` to avoid typing errors
444
- Only match the 'ne' negation pattern when it is not part of another word, to avoid false positive cases like `u[ne] cure de 10 jours`
445
- Disabled pipes are now correctly ignored in the `Pipeline.preprocess` method
446
- Add "eventuel*" patterns to `eds.hyphothesis`
447
448
## v0.10.5 (2024-01-29)
449
450
### Fixed
451
452
- Allow non-url paths when parquet filesystem is given
453
454
## v0.10.4 (2024-01-19)
455
456
### Changed
457
458
- Assigning `doc._.note_datetime` will now automatically cast the value to a `pendulum.DateTime` object
459
460
### Added
461
462
- Support loading model from package name (e.g., `edsnlp.load("eds_pseudo_aphp")`)
463
- Support filesystem parameter in `edsnlp.data.read_parquet` and `edsnlp.data.write_parquet`
464
465
### Fixed
466
467
- Support doc -> list converters with parquet files writer
468
- Fixed some OOM errors when writing many outputs to parquet files
469
- Both edsnlp & spacy factories are now listed when a factory lookup fails
470
- Fixed some GPU OOM errors with the `eds.transformer` pipe when processing really long documents
471
472
## v0.10.3 (2024-01-11)
473
474
### Added
475
476
- By default, `edsnlp.data.write_json` will infer if the data should be written as a single JSONL
477
  file or as a directory of JSON files, based on the `path` argument being a file or not.
478
479
### Fixed
480
481
- Measurements now correctly match "0.X", "0.XX", ... numbers
482
- Typo in "celsius" measurement unit
483
- Spaces and digits are now supported in BRAT entity labels
484
- Fixed missing 'permet pas + verb' false positive negation patterns
485
486
## v0.10.2 (2023-12-20)
487
488
### Changed
489
490
- `eds.span_qualifier` qualifiers argument now automatically adds the underscore prefix if not present
491
492
### Fixed
493
494
- Fix imports of components declared in `spacy_factories` entry points
495
- Support `pendulum` v3
496
- `AsList` errors are now correctly reported
497
- `eds.span_qualifier` saved configuration during `to_disk` is no longer null
498
499
## v0.10.1 (2023-12-15)
500
501
### Changed
502
503
- Small regex matching performance improvement, up to 1.25x faster (e.g. `eds.measurements`)
504
505
### Fixed
506
507
- The microgram scale is now correctly 1/1000 of a milligram, and the inverse meter is now 1/100 of an inverse centimeter.
508
- We now isolate some of edsnlp components (trainable pipes that require ml dependencies)
509
  in a new `edsnlp_factories` entry points to prevent spacy from auto-importing them.
510
- TNM scores followed by a space are now correctly detected
511
- Removed various short TNM false positives (e.g., "PT" or "a T") and false negatives
512
- The Span value extension is no longer forcibly overwritten, and user-assigned values are returned by `Span._.value` in priority, before the aggregated `span._.get(span.label_)` getter result (#220)
513
- Enable mmap during multiprocessing model transfers
514
- `RegexMatcher` now supports all alignment modes (`strict`, `expand`, `contract`) and better handles partial doc matching (#201).
515
- `on_ent_only=False/True` is now supported again in qualifier pipes (e.g., "eds.negation", "eds.hypothesis", ...)
516
517
## v0.10.0 (2023-12-04)
518
519
### Added
520
521
- New unified `edsnlp.data` API (json, brat, spark, pandas) and `LazyCollection` object
522
  to efficiently read / write data from / to different formats & sources.
523
- New unified processing API to select the execution backend via `data.set_processing(...)`
524
- The training scripts can now use data from multiple concatenated adapters
525
- Support quantized transformers (compatible with multiprocessing as well !)
526
527
### Changed
528
529
- `edsnlp.pipelines` has been renamed to `edsnlp.pipes`, but the old name is still available for backward compatibility
530
- Pipes (in `edsnlp/pipes`) are now lazily loaded, which should improve the loading time of the library.
531
- `to_disk` methods can now return a config to override the initial config of the pipeline (e.g., to load a transformer directly from the path storing its fine-tuned weights)
532
- The `eds.tokenizer` tokenizer has been added to entry points, making it accessible from the outside
533
- Deprecate old connectors (e.g. BratDataConnector) in favor of the new `edsnlp.data` API
534
- Deprecate old `pipe` wrapper in favor of the new processing API
535
536
### Fixed
537
538
- Support for pydantic v2
539
- Support for python 3.11 (not ci-tested yet)
540
541
## v0.10.0beta1 (2023-12-04)
542
543
Large refactoring of EDS-NLP to allow training models and performing inference using PyTorch
544
as the deep-learning backend. Rather than a mere wrapper of Pytorch using spaCy, this is
545
a new framework to build hybrid multi-task models.
546
547
To achieve this, instead of patching spaCy's pipeline, a new pipeline was implemented in
548
a similar fashion to aphp/edspdf#12. The new pipeline tries to preserve the existing API,
549
especially for non-machine learning uses such as rule-based components. This means that
550
users can continue to use the library in the same way as before, while also having the option to train models using PyTorch. We still
551
use spaCy data structures such as Doc and Span to represent the texts and their annotations.
552
553
Otherwise, changes should be transparent for users that still want to use spacy pipelines
554
with `nlp = spacy.blank('eds')`. To benefit from the new features, users should use
555
`nlp = edsnlp.blank('eds')` instead.
556
557
### Added
558
559
- New pipeline system available via `edsnlp.blank('eds')` (instead of `spacy.blank('eds')`)
560
- Use the confit package to instantiate components
561
- Training script with Pytorch only (`tests/training/`) and tutorial
562
- New trainable embeddings: `eds.transformer`, `eds.text_cnn`, `eds.span_pooler`
563
  embedding contextualizer pipes
564
- Re-implemented the trainable NER component and trainable Span qualifier with the new
565
  system under `eds.ner_crf` and `eds.span_classifier`
566
- New efficient implementation for eds.transformer (to be used in place of
567
  spacy-transformer)
568
569
### Changed
570
571
- Pipe registering: `Language.factory` -> `edsnlp.registry.factory.register` via confit
572
- Lazy loading components from their entry point (had to patch spacy.Language.__init__)
573
  to avoid having to wrap every import torch statement for pure rule-based use cases.
574
  Hence, torch is not a required dependency
575
576
## v0.9.2 (2023-12-04)
577
578
### Changed
579
580
- Fix matchers to skip pipes with assigned extensions that are not required by the matcher during the initialization
581
582
## v0.9.1 (2023-09-22)
583
584
### Changed
585
586
- Improve negation patterns
587
- Absent disorders now set the negation to True when matched as `ABSENT`
588
- Default qualifier is now `None` instead of `False` (empty string)
589
590
### Fixed
591
592
- `span_getter` is no longer incompatible with `on_ents_only`
593
- `ContextualMatcher` now supports empty matches (e.g. lookahead/lookbehind) in `assign` patterns
594
595
## v0.9.0 (2023-09-15)
596
597
### Added
598
599
- New `to_duration` method to convert an absolute date into a date relative to the note_datetime (or None)
600
601
### Changes
602
603
- Input and output of components are now specified by `span_getter` and `span_setter` arguments.
604
- :boom: Score / disorders / behaviors entities now have a fixed label (passed as an argument), instead of being dynamically set from the component name. The following scores may have a different name
605
  than the current one in your pipelines:
606
    * `eds.emergency.gemsa` → `emergency_gemsa`
607
    * `eds.emergency.ccmu` → `emergency_ccmu`
608
    * `eds.emergency.priority` → `emergency_priority`
609
    * `eds.charlson` → `charlson`
610
    * `eds.elston_ellis` → `elston_ellis`
611
    * `eds.SOFA` → `sofa`
612
    * `eds.adicap` → `adicap`
613
    * `eds.measurements` → `size`, `weight`, ... instead of `eds.size`, `eds.weight`, ...
614
- `eds.dates` now separate dates from durations. Each entity has its own label:
615
    * `spans["dates"]` → entities labelled as `date` with a `span._.date` parsed object
616
    * `spans["durations"]` → entities labelled as `duration` with a `span._.duration` parsed object
617
- the "relative" / "absolute" / "duration" mode of the time entity is now stored in
618
  the `mode` attribute of the `span._.date/duration`
619
- the "from" / "until" period bound, if any, is now stored in the `span._.date.bound` attribute
620
- `to_datetime` now only returns absolute dates: it converts relative dates into absolute ones if `doc._.note_datetime` is given, and returns None otherwise
621
622
### Fixed
623
624
- `export_to_brat` issue with spans of entities on multiple lines.
625
626
## v0.8.1 (2023-05-31)
627
628
Fix release to allow installation from source
629
630
## v0.8.0 (2023-05-24)
631
632
### Added
633
634
- New trainable component for multi-label, multi-class span qualification (any attribute/extension)
635
- Add range measurements (like `la tumeur fait entre 1 et 2 cm`) to `eds.measurements` matcher
636
- Add `eds.CKD` component
637
- Add `eds.COPD` component
638
- Add `eds.alcohol` component
639
- Add `eds.cerebrovascular_accident` component
640
- Add `eds.congestive_heart_failure` component
641
- Add `eds.connective_tissue_disease` component
642
- Add `eds.dementia` component
643
- Add `eds.diabetes` component
644
- Add `eds.hemiplegia` component
645
- Add `eds.leukemia` component
646
- Add `eds.liver_disease` component
647
- Add `eds.lymphoma` component
648
- Add `eds.myocardial_infarction` component
649
- Add `eds.peptic_ulcer_disease` component
650
- Add `eds.peripheral_vascular_disease` component
651
- Add `eds.solid_tumor` component
652
- Add `eds.tobacco` component
653
- Add `eds.spaces` (or `eds.normalizer` with `spaces=True`) to detect space tokens, and add `ignore_space_tokens` to `EDSPhraseMatcher` and `SimstringMatcher` to skip them
654
- Add `ignore_space_tokens` option in most components
655
- `eds.tables`: new pipeline to identify formatted tables
656
- New `merge_mode` parameter in `eds.measurements` to normalize existing entities or detect
657
  measures only inside existing entities
658
- Tokenization exceptions (`Mr.`, `Dr.`, `Mrs.`) and non end-of-sentence periods are now tokenized with the next letter in the `eds` tokenizer
659
660
### Changed
661
662
- Disable `EDSMatcher` preprocessing auto progress tracking by default
663
- Moved dependencies to a single pyproject.toml: support for `pip install -e '.[dev,docs,setup]'`
664
- ADICAP matcher now allows dot separators (e.g. `B.H.HP.A7A0`)
665
666
### Fixed
667
668
- Abbreviation and number tokenization issues in the `eds` tokenizer
669
- `eds.adicap`: reparsed the dictionary used to decode the ADICAP codes (some of them were wrongly decoded)
670
- Fix build for python 3.9 on Mac M1/M2 machines.
671
672
## v0.7.4 (2022-12-12)
673
674
### Added
675
676
- `eds.history`: Add the option to consider only the closest dates in the sentence (dates inside the boundaries; if there are none, the closest date in the entire sentence is taken).
677
- `eds.negation`: It now takes into account following past participles and preceding infinitives.
678
- `eds.hypothesis`: It now takes into account following past participle hypothesis verbs.
679
- `eds.negation` & `eds.hypothesis` : Introduce new patterns and remove unnecessary patterns.
680
- `eds.dates`: Add a pattern for preceding relative dates (e.g., l'embolie qui est survenue **à 10 jours**).
681
- Improve patterns in the `eds.pollution` component to account for multiline footers
682
- Add `QuickExample` object to quickly try a pipeline.
683
- Add UMLS terminology matcher `eds.umls`
684
- New `RegexMatcher` method to create spans from groupdicts
685
- New `eds.dates` option to disable time detection
686
687
### Changed
688
689
- Improve date detection by removing false positives
690
691
### Fixed
692
693
- `eds.hypothesis` : Remove too generic patterns.
694
- `EDSTokenizer`: It now tokenizes `"recherche d'"` as `["recherche", "d'"]`, instead of `["recherche", "d", "'"]`.
695
- Fix small typos in the documentation and in the docstring.
696
- Harmonize processing utils (distributed custom_pipe) to have the same API for Pandas and Pyspark
697
- Fix BratConnector file loading issues with complex file hierarchies
698
699
## v0.7.2 (2022-10-26)
700
701
### Added
702
703
- Improve the `eds.history` component by taking into account the date extracted from `eds.dates` component.
704
- New pop up when you click on the copy icon in the termynal widget (docs).
705
- Add NER `eds.elston-ellis` pipeline to identify Elston Ellis scores
706
- Add flags=re.MULTILINE to `eds.pollution` and change pattern of footer
707
708
### Fixed
709
710
- Remove the warning in `eds.sections` when `eds.normalizer` is in the pipeline.
711
- Fix filter_spans for strictly nested entities
712
- Fill eds.remove-lowercase "assign" metadata to run the pipeline during EDSPhraseMatcher preprocessing
713
- Allow back spaCy components whose name contains a dot (forbidden since spaCy v3.4.2) for backward compatibility.
714
715
## v0.7.1 (2022-10-13)
716
717
### Added
718
719
- Add new patterns (footer, web entities, biology tables, coding sections) to pipeline normalisation (pollution)
720
721
### Changed
722
723
- Improved TNM detection algorithm
724
- Account for more modifiers in ADICAP codes detection
725
726
### Fixed
727
728
- Add nephew, niece and daughter to family qualifier patterns
729
- EDSTokenizer (`spacy.blank('eds')`) now recognizes non-breaking whitespaces as spaces and does not split float numbers
730
- `eds.dates` pipeline now allows new lines as space separators in dates
731
732
## v0.7.0 (2022-09-06)
733
734
### Added
735
736
- New nested NER trainable `nested_ner` pipeline component
737
- Support for nested entities and attributes in BratDataConnector
738
- Pytorch wrappers and experimental training utils
739
- Add attribute `section` to entities
740
- Add new cases for separator pattern when components of the TNM score are separated by a forward slash
741
- Add NER `eds.adicap` pipeline to identify ADICAP codes
742
- Add patterns to `pollution` pipeline and simplifies activating or deactivating specific patterns
743
744
### Changed
745
746
- Simplified the configuration scheme of the `pollution` pipeline
747
- Update of the `ContextualMatcher` (and all pipelines depending on it), rendering it more flexible to use
748
- Rename R component of score TNM as "resection_completeness"
749
750
### Fixed
751
752
- Prevent section titles from capturing surrounding tokens, causing overlaps (#113)
753
- Enhance existing patterns for section detection and add patterns for previously ignored sections (introduction, evolution, modalites de sortie, vaccination).
754
- Fix explain mode, which was always triggered, in `eds.history` factory.
755
- Fix test in `eds.sections`. Previously, no check was done
756
- Remove SOFA scores spurious span suffixes
757
758
## v0.6.2 (2022-08-02)
759
760
### Added
761
762
- New `SimstringMatcher` matcher to perform fuzzy term matching, and `algorithm` parameter in terminology components and `eds.matcher` component
763
- Makefile to install,test the application and see the documentation
764
765
### Changed
766
767
- Add consultation date pattern "CS", and False Positive patterns for dates (namely phone numbers and pagination).
768
- Update the pipeline score `eds.TNM`. Now it is possible to return a dictionary where the results are either `str` or `int` values
769
770
### Fixed
771
772
- Add new patterns to the negation qualifier
773
- Numpy header issues with binary distributed packages
774
- Simstring dependency on Windows
775
776
## v0.6.1 (2022-07-11)
777
778
### Added
779
780
- Now possible to provide regex flags when using the RegexMatcher
781
- New `ContextualMatcher` pipe, aiming at replacing the `AdvancedRegex` pipe.
782
- New `as_ents` parameter for `eds.dates`, to save detected dates as entities
783
784
### Changed
785
786
- Faster `eds.sentences` pipeline component with Cython
787
- Bump version of Pydantic in `requirements.txt` to 1.8.2 to handle an incompatibility with the ContextualMatcher
788
- Optimise space requirements by using `.csv.gz` compression for verbs
789
790
### Fixed
791
792
- `eds.sentences` behaviour with dot-delimited dates (eg `02.07.2022`, which counted as three sentences)
793
794
## v0.6.0 (2022-06-17)
795
796
### Added
797
798
- Complete revamp of the measurements detection pipeline, with better parsing and more exhaustive matching
799
- Add new functionality to the `Span._.date.to_datetime()` method to return a result inferred from context in cases with missing information.
800
- Force a batch size of 2000 when distributing a pipeline with Spark
801
- New patterns to pipeline `eds.dates` to identify cases where only the month is mentioned
802
- New `eds.terminology` component for generic terminology matching, using the `kb_id_` attribute to store fine-grained entity label
803
- New `eds.cim10` terminology matching pipeline
804
- New `eds.drugs` terminology pipeline that maps brand names and active ingredients to a unique [ATC](https://en.wikipedia.org/wiki/Anatomical_Therapeutic_Chemical_Classification_System) code
805
806
## v0.5.3 (2022-05-04)
807
808
### Added
809
810
- Support for strings in the example utility
811
- [TNM](https://en.wikipedia.org/wiki/TNM_staging_system) detection and normalisation with the `eds.TNM` pipeline
812
- Support for arbitrary callback for Pandas multiprocessing, with the `callback` argument
813
814
## v0.5.2 (2022-05-04)
815
816
### Added
817
818
- Support for chained attributes in the `processing` pipelines
819
- Colour utility with the category20 colour palette
820
821
### Fixed
822
823
- Correct a REGEX on the date detector (both `nov` and `nov.` are now detected, as all other months)
824
825
## v0.5.1 (2022-04-11)
826
827
### Fixed
828
829
- Updated Numpy requirements to be compatible with the `EDSPhraseMatcher`
830
831
## v0.5.0 (2022-04-08)
832
833
### Added
834
835
- New `eds` language to better fit French clinical documents and improve speed
836
- Testing for markdown codeblocks to make sure the documentation is actually executable
837
838
### Changed
839
840
- Complete revamp of the date detection pipeline, with better parsing and more exhaustive matching
841
- Reimplementation of the EDSPhraseMatcher in Cython, leading to a x15 speed increase
842
843
## v0.4.4 (2022-03-31)
844
845
- Add `measures` pipeline
846
- Cap Jinja2 version to fix mkdocs
847
- Adding the possibility to add context in the processing module
848
- Improve the speed of char replacement pipelines (accents and quotes)
849
- Improve the speed of the regex matcher
850
851
## v0.4.3 (2022-03-18)
852
853
- Fix regex matching on spans.
854
- Add fast_parse in date pipeline.
855
- Add relative_date information parsing
856
857
## v0.4.2 (2022-03-16)
858
859
- Fix issue with `dateparser` library (see scrapinghub/dateparser#1045)
860
- Fix `attr` issue in the `advanced-regex` pipeline
861
- Add documentation for `eds.covid`
862
- Update the demo with an explanation for the regex
863
864
## v0.4.1 (2022-03-14)
865
866
- Added support for Koalas DataFrames in the `edsnlp.processing` pipe.
867
- Added `eds.covid` NER pipeline for detecting COVID19 mentions.
868
869
## v0.4.0 (2022-02-22)
870
871
- Profound re-write of the normalisation:
872
    - The custom attribute `CUSTOM_NORM` is completely abandoned in favour of a more _spacyfic_ alternative
873
    - The `normalizer` pipeline modifies the `NORM` attribute in place
874
    - Other pipelines can modify the `Token._.excluded` custom attribute
875
- EDS regex and term matchers can ignore excluded tokens during matching, effectively adding a second dimension to normalisation (choice of the attribute and possibility to skip _pollution_ tokens
876
  regardless of the attribute)
877
- Matching can be performed on custom attributes more easily
878
- Qualifiers are regrouped together within the `edsnlp.qualifiers` submodule, the inheritance from the `GenericMatcher` is dropped.
879
- `edsnlp.utils.filter.filter_spans` now accepts a `label_to_remove` parameter. If set, only corresponding spans are removed, along with overlapping spans. Primary use-case: removing pseudo cues for
880
  qualifiers.
881
- Generalise the naming convention for extensions, which keep the same name as the pipeline that created them (eg `Span._.negation` for the `eds.negation` pipeline). The previous convention is kept
882
  for now, but calling it issues a warning.
883
- The `dates` pipeline underwent some light formatting to increase robustness and fix a few issues
884
- A new `consultation_dates` pipeline was added, which looks for dates preceded by expressions specific to consultation dates
885
- In rule-based processing, the `terms.py` submodule is replaced by `patterns.py` to reflect the possible presence of regular expressions
886
- Refactoring of the architecture :
887
    - pipelines are now regrouped by type (`core`, `ner`, `misc`, `qualifiers`)
888
    - `matchers` submodule contains `RegexMatcher` and `PhraseMatcher` classes, which interact with the normalisation
889
    - `multiprocessing` submodule contains `spark` and `local` multiprocessing tools
890
    - `connectors` contains `Brat`, `OMOP` and `LabelTool` connectors
891
    - `utils` contains various utilities
892
- Add entry points to make pipeline usable directly, removing the need to import `edsnlp.components`.
893
- Add a `eds` namespace for components: for instance, `negation` becomes `eds.negation`. Using the former pipeline name still works, but issues a deprecation warning.
894
- Add 3 score pipelines related to emergency
895
- Add a helper function to use a spaCy pipeline as a Spark UDF.
896
- Fix alignment issues in RegexMatcher
897
- Change the alignment procedure, dropping clumsy `numpy` dependency in favour of `bisect`
898
- Change the name of `eds.antecedents` to `eds.history`.
899
  Calling `eds.antecedents` still works, but issues a deprecation warning and support will be removed in a future version.
900
- Add a `eds.covid` component, that identifies mentions of COVID
901
- Change the demo, to include NER components
902
903
## v0.3.2 (2021-11-24)
904
905
- Major revamp of the normalisation.
906
    - The `normalizer` pipeline **now adds atomic components** (`lowercase`, `accents`, `quotes`, `pollution` & `endlines`) to the processing pipeline, and compiles the results into a
907
      new `Doc._.normalized` extension. The latter is itself a spaCy `Doc` object, wherein tokens are normalised and pollution tokens are removed altogether. Components that match on the `CUSTOM_NORM`
908
      attribute process the `normalized` document, and matches are brought back to the original document using a token-wise mapping.
909
    - Update the `RegexMatcher` to use the `CUSTOM_NORM` attribute
910
    - Add an `EDSPhraseMatcher`, wrapping spaCy's `PhraseMatcher` to enable matching on `CUSTOM_NORM`.
911
    - Update the `matcher` and `advanced` pipelines to enable matching on the `CUSTOM_NORM` attribute.
912
- Add an OMOP connector, to help go back and forth between OMOP-formatted pandas dataframes and spaCy documents.
913
- Add a `reason` pipeline, that extracts the reason for visit.
914
- Add an `endlines` pipeline, that classifies newline characters between spaces and actual ends of line.
915
- Add possibility to annotate within entities for qualifiers (`negation`, `hypothesis`, etc), ie if the cue is within the entity. Disabled by default.
916
917
## v0.3.1 (2021-10-13)
918
919
- Update `dates` to remove miscellaneous bugs.
920
- Add `isort` pre-commit hook.
921
- Improve performance for `negation`, `hypothesis`, `antecedents`, `family` and `rspeech` by using spaCy's `filter_spans` and our `consume_spans` methods.
922
- Add proposition segmentation to `hypothesis` and `family`, enhancing results.
923
924
## v0.3.0 (2021-09-29)
925
926
- Renamed `generic` to `matcher`. This is a non-breaking change for the average user; adding the pipeline is still:
927
928
  ```{ .python .no-check }
929
  nlp.add_pipe("matcher", config=dict(terms=dict(maladie="maladie")))
930
  ```
931
932
- Removed `quickumls` pipeline. It was untested, unmaintained. Will be added back in a future release.
933
- Add `score` pipeline, and `charlson`.
934
- Add `advanced-regex` pipeline
935
- Corrected bugs in the `negation` pipeline
936
937
## v0.2.0 (2021-09-13)
938
939
- Add `negation` pipeline
940
- Add `family` pipeline
941
- Add `hypothesis` pipeline
942
- Add `antecedents` pipeline
943
- Add `rspeech` pipeline
944
- Refactor the library :
945
    - Remove the `rules` folder
946
    - Add a `pipelines` folder, containing one subdirectory per component
947
    - Every component subdirectory contains a module defining the component, and a module defining a factory, plus any other utilities (eg `terms.py`)
948
949
## v0.1.0 (2021-09-29)
950
951
First working version. Available pipelines:
952
953
- `section`
954
- `sentences`
955
- `normalization`
956
- `pollution`