- `edsnlp.train` script, and per weight layer gradient logging.
- `eds.negation`, `eds.hypothesis`, `eds.family`, `eds.history` and `eds.reported_speech` under a `eds.negation.default_patterns` attribute
- `context_getter` SpanGetter argument to the `eds.matcher` class to only retrieve entities inside the spans returned by the getter
- `filter_expr` parameter to scorers to filter the documents to score
- `required` field to `eds.contextual_matcher` assign patterns to only match if the required field has been found, and an `include` parameter (similar to `exclude`) to search for required patterns without assigning them to the entity
- `eds.contextual_matcher` component to allow for more complex patterns in the selection of the window around the trigger spans.
- `edsnlp.package` now correctly detects if a project uses an old-style poetry pyproject or a PEP621 pyproject.toml.
- `eds.score` now correctly handles lists of patterns.
- `edsnlp.tune` for hyperparameter tuning using Optuna. This feature allows users to efficiently optimize model parameters with options for single-phase or two-phase tuning strategies. Includes support for parameter importance analysis, visualization, pruning, and automatic handling of GPU time budgets.
- `ScheduledOptimizer` (e.g., `@core: "optimizer"`) now supports importing optimizers using their qualified name (e.g., `optim: "torch.optim.Adam"`).
- `eds.ner_crf` now computes confidence score on spans.
- `eds.ner_crf` is now computed as the mean over the words instead of the sum. This change is compatible with multi-gpu training.
- `a.b.c` for `span_attributes` in Standoff and OMOP doc2dict converters
- Fixed various aspects of stream shuffling:
    - Ensure the Parquet reader shuffles the data when `shuffle=True`
    - `stream.shuffle()` with no seed
    - `stream.shuffle(batch_size=...)` is not compatible with the stream
- `eds.split` now keeps doc and span attributes in the sub-documents.
- `edsnlp.data.read_parquet` now accepts a `work_unit="fragment"` option to split tasks between workers by parquet fragment instead of row. When this is enabled, workers do not read every fragment while skipping 1 in n rows, but read all rows of 1/n fragments, which should be faster.
- `edsnlp.train` script
- (`output_model_dir`), and whether to save the model or not (`save_model`)
- (`logger=False`)
- `edsnlp.data.read_conll` and with a specific `eds.conll_dict2doc` converter
- (`eds.biaffine_dep_parser`) component and metrics
- `eds.extractive_qa` component to perform extractive question answering using questions as prompts to tag entities instead of a list of predefined labels as in `eds.ner_crf`.
- `join_thread` missing attribute in `SimpleQueue` when cleaning a multiprocessing executor
- `cls_token_id` and `sep_token_id` (we now also look for these tokens in the `special_tokens_map` and `vocab` mappings)
- (`random.RandomState()`) when shuffling in data readers: this is important for
- `sub_batch_size` argument of `TrainingData`.
- `edsnlp.package` command
- `edsnlp.load` now correctly takes disable, enable and exclude parameters into account
- `python -m edsnlp.evaluate` script to evaluate a model on a dataset
- `eds.split` pipe to split a document into multiple documents based on a splitting pattern (useful for training)
- `converter` argument of `edsnlp.data.read/from_...` to be a list of converters instead of a single converter
- `edsnlp.train` script and API
- ScheduledOptimizer now accepts schedules directly in place of parameters, and easy parameter selection:
```python
ScheduledOptimizer(
    optim="adamw",
    module=nlp,
    total_steps=2000,
    groups={
        "^transformer": {
            # lr will go from 0 to 5e-5 then to 0 for params matching "transformer"
            "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 0, "max_value": 5e-5},
        },
        "": {
            # lr will stay at 3e-4 during 200 steps then go to 0 for other params
            "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 3e-4, "max_value": 3e-4},
        },
    },
)
```
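
As a rough illustration of what the two `linear` schedules in the example above describe, here is a minimal pure-Python sketch. It is not EDS-NLP's implementation: the helper name `linear_schedule` and the exact interpolation formula are assumptions, and only `total_steps`, `warmup_rate`, `start_value` and `max_value` come from the example.

```python
def linear_schedule(step, total_steps, warmup_rate, start_value, max_value):
    """Hypothetical helper: linear warmup to max_value, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_rate)
    if step < warmup_steps:
        # Warmup phase: interpolate from start_value to max_value
        return start_value + (max_value - start_value) * step / warmup_steps
    # Decay phase: interpolate from max_value down to 0 at total_steps
    return max_value * (total_steps - step) / (total_steps - warmup_steps)


steps = (0, 100, 200, 1100, 2000)
# Group "^transformer": start_value=0, max_value=5e-5
transformer_lrs = [linear_schedule(s, 2000, 0.1, 0.0, 5e-5) for s in steps]
# Group "": start_value=3e-4, max_value=3e-4 (flat during warmup)
other_lrs = [linear_schedule(s, 2000, 0.1, 3e-4, 3e-4) for s in steps]
print(transformer_lrs)
print(other_lrs)
```

With `warmup_rate=0.1` and `total_steps=2000`, the warmup lasts 200 steps in this sketch, which is where the "200 steps" in the comment of the second group comes from.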
- `eds.span_context_getter`'s parameter `context_sents` is no longer optional and must be explicitly set to 0 to disable sentence context
- `batch_size` argument of `Pipeline` is deprecated and is not used anymore. Use the `batch_size` argument of `stream.map_pipeline` instead.
- `multiprocessing` backend. This prevents some executions from hanging indefinitely at the end of the processing.
- `multiprocessing` backend. This is useful when the system is running out of file descriptors and `ulimit -n` is not an option. Torch sharing strategy can also be set via an environment variable `TORCH_SHARING_STRATEGY` (default is `file_descriptor`, consider using `file_system` if you encounter issues).
- `LazyCollection` objects are now called `Stream` objects
- `multiprocessing` backend now preserves the order of the input data. To disable this and improve performance, use `deterministic=False` in the `set_processing` method
- 🚀 Parallelized GPU inference throughput improvements!
- The `.map_batches`, `.map_pipeline` and `.map_gpu` methods now support a specific `batch_size` and batching function, instead of having a single batch size for all pipes
- `loop` parameter to cycle over the data indefinitely (useful for training)
- `shuffle` parameter to shuffle the data before iterating over it
- `multiprocessing` mode, file based readers now read the data in the workers (was an option before)
- We now support two new special batch sizes
- 💥 Breaking change: a `map` function returning a list or a generator won't be automatically flattened anymore. Use `flatten()` to flatten the output if needed. This shouldn't change the behavior for most users since most writers (to_pandas, to_polars, to_parquet, ...) still flatten the output
- `chunk_size` and `sort_chunks` are now deprecated: to sort data before applying a transformation, use `.map_batches(custom_sort_fn, batch_size=...)`
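
A minimal sketch of the idea behind `.map_batches(custom_sort_fn, batch_size=...)`, written in plain Python without EDS-NLP: `custom_sort_fn` receives one batch and returns it sorted, and the lists it returns are flattened explicitly. The helpers `batchify` and `flatten` are hypothetical stand-ins for the stream machinery, not library functions.

```python
from itertools import islice


def batchify(items, batch_size):
    """Hypothetical stand-in for stream batching: yield lists of batch_size items."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch


def custom_sort_fn(batch):
    """Sort one batch by text length so that similar-sized texts are adjacent."""
    return sorted(batch, key=len)


def flatten(batches):
    """Hypothetical stand-in for flatten(): turn a stream of lists into a stream of items."""
    for batch in batches:
        yield from batch


texts = ["aaaa", "a", "aaa", "aa", "aaaaaa", "aaaaa"]
sorted_batches = [custom_sort_fn(b) for b in batchify(texts, batch_size=3)]
flat = list(flatten(sorted_batches))
print(sorted_batches)
print(flat)
```

Sorting happens only within each batch here, so a larger `batch_size` gives a more global ordering at the cost of buffering more items.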
- `python -m edsnlp.train --config config.cfg` that should fit many use cases. Check out the docs!
- Each trainable component can now provide a `stats` field in its `preprocess` output to log info about the sample (number of words, tokens, spans, ...):
- Support multi GPU training via Hugging Face `accelerate` and EDS-NLP `Stream` API consideration of env['WORLD_SIZE'] and env['LOCAL_RANK'] environment variables
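
To make the role of these two environment variables concrete, here is a small sketch of how a data stream could use them so that each process only handles its own share of the data. The function `shard_for_current_process` is hypothetical, and the strided split on a single node is an assumption, not necessarily what the Stream API does internally.

```python
import os


def shard_for_current_process(items, env=None):
    """Hypothetical helper: keep only the items assigned to this process."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", 1))  # total number of processes
    rank = int(env.get("LOCAL_RANK", 0))  # index of this process (single node assumed)
    # Strided split (assumption): process k gets items k, k + world_size, ...
    return [item for i, item in enumerate(items) if i % world_size == rank]


docs = [f"doc-{i}" for i in range(6)]
shard0 = shard_for_current_process(docs, {"WORLD_SIZE": "2", "LOCAL_RANK": "0"})
shard1 = shard_for_current_process(docs, {"WORLD_SIZE": "2", "LOCAL_RANK": "1"})
print(shard0, shard1)
```

When neither variable is set, the sketch falls back to a single process that keeps every item.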
- `eds.tables` accepts a minimum_table_size (default 2) argument to reduce pollution
- `RuleBasedQualifier` now exposes a `process` method that only returns qualified entities and tokens without actually tagging them, deferring this task to the `__call__` method.
- `edsnlp.scorers` to `edsnlp.metrics` and removed the `_scorer` suffix from their (`@scorers = ner_overlap_scorer` → `@metrics = ner_overlap`)
- `eds.measurements` to `eds.quantities`
- (`eds.endlines`) is no longer installed by default when installing `edsnlp[ml]`
- `status` anymore. Instead, `status=None` by default, `ent._.negation` is set to True instead of setting `status` to "ABSENT". To this end, the tobacco and alcohol `NegationQualifier` internally.
- `55 @ 77777` could be detected as a full number before, but not anymore.
- `data.set_processing(...)` now exposes an `autocast` parameter to disable or tweak the automatic casting of the tensor
- `torch.inference_mode` to disable view tracking and version counter bumps during inference.
- `ent._.date_cues`
- `edsnlp.load("your/huggingface-model", install_dependencies=True)` now correctly resolves the python pip
- `eds.transformer`, `eds.text_cnn` and `eds.ner_crf` components
- `eds.text_cnn` and `eds.ner_crf` components
- `span_context_getter` for `context_words` = 0, `context_sents` > 2 and support asymmetric contexts
- `E.coli`, now split as [`E.`, `coli`] and not [`E`, `.`, `coli`]
- Packages:
    - `hatch` instead of poetry, which allows us to expose `artifacts` (weights)
- `note_datetime`
- `eds.transformer` component now accepts `prompts` (passed to its `preprocess` method, see breaking change below) to add before each window of text to embed.
- `LazyCollection.map` / `map_batches` now support generator functions as arguments.
- `eds.transformer` component by `training_stride = False`
- `eds.ner_overlap_scorer` to evaluate matches between two lists of entities, counting true when the Dice overlap is above a given threshold
- `edsnlp.load` now accepts EDS-NLP models from the Hugging Face Hub 🤗!
- `python -m edsnlp.package` command to package a model for the Hugging Face Hub or pypi-like registries
- `eds.tables` and support new options in `table._.to_pd_table(...)`:
    - `header=True` to use first row as header
    - `index=True` to use first column as index
    - `as_spans=True` to fill cells as document spans instead of strings
- `eds.transformer` component is no longer responsible for deciding which spans of text ("contexts") should be embedded. These contexts are now passed via the `preprocess` method, which now accepts more arguments than just the docs to process.
- `eds.span_pooler` is no longer responsible for deciding which spans to pool, and instead pools all spans passed to it in the `preprocess` method.

Consequently, the `eds.transformer` and `eds.span_pooler` no longer accept their `span_getter` argument, and the `eds.ner_crf`, `eds.span_classifier`, `eds.span_linker` and `eds.span_qualifier` components now accept a `context_getter` argument instead, as well as a `span_getter` argument for the latter two. This refactoring can be summarized as follows:
```diff
- eds.transformer.span_getter
+ eds.ner_crf.context_getter
+ eds.span_classifier.context_getter
+ eds.span_linker.context_getter
- eds.span_pooler.span_getter
+ eds.span_qualifier.span_getter
+ eds.span_linker.span_getter
```
and as an example for the `eds.span_linker` component:
```diff
  nlp.add_pipe(
      eds.span_linker(
          metric="cosine",
          probability_mode="sigmoid",
+         span_getter="ents",
+         # context_getter="ents",  -> by default, same as span_getter
          embedding=eds.span_pooler(
              hidden_size=128,
-             span_getter="ents",
              embedding=eds.transformer(
-                 span_getter="ents",
                  model="prajjwal1/bert-tiny",
                  window=128,
                  stride=96,
              ),
          ),
      ),
      name="linker",
  )
```
- `foldedtensor` to return embeddings, instead of returning a tensor of floats and a mask tensor.
- `__call__` no longer applies the end to end method, and instead calls the `forward` method directly, like all torch modules.
- `eds.span_qualifier` component has been renamed to `eds.span_classifier` to reflect its general purpose (it doesn't only predict qualifiers, but any attribute of a span using its context or not).
- `omop` converter now takes the `note_datetime` field into account by default when building a document
- `span._.date.to_datetime()` and `span._.date.to_duration()` now automatically take the `note_datetime` into account
- `nlp.vocab` is no longer serialized when saving a model, as it may contain sensitive information and can be recomputed during inference anyway
- `edsnlp.data.read_json` now correctly reads the files from the directory passed as an argument, and not from the parent directory.
- `edsnlp.utils.file_system.normalize_fs_path` file system detection not working correctly
- `edsnlp.data` methods over a filesystem (`fs` parameter)
- `optim.initialize()` method to create optim state before the first backward pass
- `nlp.post_init` will not tee lazy collections anymore (use `edsnlp.utils.collections.multi_tee` yourself if needed)
- `eds.span_linker`
- `filesystem` parameter in every `edsnlp.data.read_*` and `edsnlp.data.write_*` functions
- `nlp.pipes.xxx` instead of `nlp.get_pipe("xxx")`
- `span_attributes` parameter, e.g.

```python
nlp = ...
nlp.add_pipe("eds.sentences")
data = edsnlp.data.from_xxx(...)
data = data.map_pipeline(nlp)
data.to_pandas(converters={"ents": {"span_attributes": ["sent.text", "start", "end"]}})
```
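
The `span_attributes` example above mixes plain attribute names (`start`, `end`) with a dotted path (`sent.text`). One way to picture how such a path could be resolved is the sketch below. It uses a toy object rather than a spaCy `Span`, and the helper `get_by_path` is hypothetical, not the converter's actual code.

```python
from functools import reduce
from types import SimpleNamespace


def get_by_path(obj, path):
    """Hypothetical helper: resolve a dotted path such as "sent.text" on an object."""
    return reduce(getattr, path.split("."), obj)


# Toy stand-in for an entity span (not a spaCy Span)
span = SimpleNamespace(start=3, end=5, sent=SimpleNamespace(text="Le patient tousse."))
row = {path: get_by_path(span, path) for path in ["sent.text", "start", "end"]}
print(row)
```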
- Support assigning Brat AnnotatorNotes as span attributes: `edsnlp.data.read_standoff(..., notes_as_span_attribute="cui")`
- Support for mapping full batches in `edsnlp.processing` pipelines with `map_batches` lazy collection method:
```python
import edsnlp

data = edsnlp.data.from_xxx(...)
data = data.map_batches(lambda batch: do_something(batch))
data.to_pandas()
```
- New `data.map_gpu` method to map a deep learning operation on some data and take advantage of edsnlp multi-gpu inference capabilities
- Added average precision computation in edsnlp span_classification scorer
- You can now add pipes to your pipeline by instantiating them directly, which comes with many advantages, such as auto-completion, introspection and type checking !
```python
import edsnlp, edsnlp.pipes as eds
nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
# instead of nlp.add_pipe("eds.sentences")
```
The previous way of adding pipes is still supported.
- New `eds.span_linker` deep-learning component to match entities with their concepts in a knowledge base, in synonym-similarity or concept-similarity mode.
- `nlp.preprocess_many` now uses lazy collections to enable parallel processing
- `eds.span_qualifier`: we didn't support combination groups before, so this feature was scrapped for now. We now also support splitting values of a single qualifier between different span labels.
- `__init__` signature. For most components of EDS-NLP, this will change the name from "eds.xxx" to "xxx".
- `nlp.map(data).to_iterable("ents")` is now a list of entities, and not a list of lists of entities
- (`eds.transformer`) by sorting them by Dice overlap score.
- `edsnlp.data` readers / writers (do not convert by default)
- `eds.transformer` to pass to the transformer model
- `disabled` status of the pipes (i.e., all pipes are considered "enabled" when saved). This feature was not used and causing issues when saving a model wrapped in a `nlp.select_pipes` context.
- `meta.json`, `tokenizer` and `vocab` paths when loading saved models
- `batch_size` estimation in `eds.transformer` when `max_tokens_per_device` is set to `auto` and multiple GPUs are used
- `batch_by`, `split_into_batches_after`, `sort_chunks`, `chunk_size`, `disable_implicit_parallelism` parameters to processing (`simple` and `multiprocessing`) backends to improve performance
- `max_tokens_per_device="auto"` parameter to `eds.transformer` to estimate memory usage and automatically split the input into chunks that fit into the GPU.
- `eds.text_cnn` pipe by running the CNN on a non-padded version of its input: expect a speedup up to 1.3x in real-world use cases.
- `bool_attributes` `default_attributes`. This new mapping describes how to
- `eds.ner_crf` window is now set to 40 and stride set to 20, as it doesn't
- `overlap_policy='merge'` option and parameter renaming in `eds.span_context_getter` (which replaces `eds.span_sentence_getter`)
- `multiprocessing` backend (e.g., no more deadlock)
- `eds.ner_crf` component are now `window=1` equivalent to softmax and `window=0` equivalent to default full sequence Viterbi decoding)
- `eds` tokenizer now inherits from `spacy.Tokenizer` to avoid typing errors
- `u[ne] cure de 10 jours`
- `Pipeline.preprocess` method
- `eds.hyphothesis`
- `doc._.note_datetime` will now automatically cast the value to a `pendulum.DateTime` object
- (`edsnlp.load("eds_pseudo_aphp")`)
- `edsnlp.data.read_parquet` and `edsnlp.data.write_parquet`
- `eds.transformer` pipe when processing really long documents
- `edsnlp.data.write_json` will infer if the data should be written as a single JSONL `path` argument being a file or not.
- `eds.span_qualifier` qualifiers argument now automatically adds the underscore prefix if not present
- `spacy_factories` entry points
- `pendulum` v3
- `AsList` errors are now correctly reported
- `eds.span_qualifier` saved configuration during `to_disk` is no longer null
- (`eds.measurements`)
- `edsnlp_factories` entry points to prevent spacy from auto-importing them.
- `Span._.value` in priority, before the aggregated `span._.get(span.label_)` getter result (#220)
- `RegexMatcher` now supports all alignment modes (`strict`, `expand`, `contract`) and better handles partial doc matching (#201).
- `on_ent_only=False/True` is now supported again in qualifier pipes (e.g., "eds.negation", "eds.hypothesis", ...)
- `edsnlp.data` api (json, brat, spark, pandas) and LazyCollection object
- `data.set_processing(...)`
- `edsnlp.pipelines` has been renamed to `edsnlp.pipes`, but the old name is still available for backward compatibility
- (`edsnlp/pipes`) are now lazily loaded, which should improve the loading time of the library.
- `to_disk` methods can now return a config to override the initial config of the pipeline (e.g., to load a transformer directly from the path storing its fine-tuned weights)
- `eds.tokenizer` tokenizer has been added to entry points, making it accessible from the outside
- `edsnlp.data` API
- `pipe` wrapper in favor of the new processing API

Large refactoring of EDS-NLP to allow training models and performing inference using PyTorch as the deep-learning backend. Rather than a mere wrapper of PyTorch using spaCy, this is a new framework to build hybrid multi-task models.

To achieve this, instead of patching spaCy's pipeline, a new pipeline was implemented in a similar fashion to aphp/edspdf#12. The new pipeline tries to preserve the existing API, especially for non-machine learning uses such as rule-based components. This means that users can continue to use the library in the same way as before, while also having the option to train models using PyTorch. We still use spaCy data structures such as Doc and Span to represent the texts and their annotations.

Otherwise, changes should be transparent for users that still want to use spaCy pipelines with `nlp = spacy.blank('eds')`. To benefit from the new features, users should use `nlp = edsnlp.blank('eds')` instead.
- `edsnlp.blank('eds')` (instead of `spacy.blank('eds')`)
- (`tests/training/`) and tutorial
- `eds.transformer`, `eds.text_cnn`, `eds.span_pooler`
- `eds.ner_crf` and `eds.span_classifier`
- `Language.factory` -> `edsnlp.registry.factory.register` via confit
- `ABSENT` `None` instead of `False` (empty string)
- `span_getter` is not incompatible with on_ents_only anymore
- `ContextualMatcher` now supports empty matches (e.g. lookahead/lookbehind) in `assign` patterns
- `to_duration` method to convert an absolute date into a date relative to the note_datetime (or None)
- `span_getter` and `span_setter` arguments.
- `eds.emergency.gemsa` → `emergency_gemsa`
- `eds.emergency.ccmu` → `emergency_ccmu`
- `eds.emergency.priority` → `emergency_priority`
- `eds.charlson` → `charlson`
- `eds.elston_ellis` → `elston_ellis`
- `eds.SOFA` → `sofa`
- `eds.adicap` → `adicap`
- `eds.measurements` → `size`, `weight`, ... instead of `eds.size`, `eds.weight`, ...
- `eds.dates` now separates dates from durations. Each entity has its own label:
    - `spans["dates"]` → entities labelled as `date` with a `span._.date` parsed object
    - `spans["durations"]` → entities labelled as `duration` with a `span._.duration` parsed object
- `mode` attribute of the `span._.date/duration`
- `span._.date.bound` attribute
- `to_datetime` now only returns absolute dates, converts relative dates into absolute if `doc._.note_datetime` is given, and None otherwise
- `export_to_brat` issue with spans of entities on multiple lines.

Fix release to allow installation from source
- (`la tumeur fait entre 1 et 2 cm`) to `eds.measurements` matcher
- `eds.CKD` component
- `eds.COPD` component
- `eds.alcohol` component
- `eds.cerebrovascular_accident` component
- `eds.congestive_heart_failure` component
- `eds.connective_tissue_disease` component
- `eds.dementia` component
- `eds.diabetes` component
- `eds.hemiplegia` component
- `eds.leukemia` component
- `eds.liver_disease` component
- `eds.lymphoma` component
- `eds.myocardial_infarction` component
- `eds.peptic_ulcer_disease` component
- `eds.peripheral_vascular_disease` component
- `eds.solid_tumor` component
- `eds.tobacco` component
- `eds.spaces` (or `eds.normalizer` with `spaces=True`) to detect space tokens, and add `ignore_space_tokens` to `EDSPhraseMatcher` and `SimstringMatcher` to skip them
- `ignore_space_tokens` option in most components
- `eds.tables`: new pipeline to identify formatted tables
- `merge_mode` parameter in `eds.measurements` to normalize existing entities or detect
- (`Mr.`, `Dr.`, `Mrs.`) and non end-of-sentence periods are now tokenized with the next letter in the `eds` tokenizer
- `EDSMatcher` preprocessing auto progress tracking by default
- `pip install -e '.[dev,docs,setup]'`
- (`B.H.HP.A7A0`)
- `eds` tokenizer
- `eds.adicap`: reparsed the dictionary used to decode the ADICAP codes (some of them were wrongly decoded)
- `eds.history`: Add the option to consider only the closest dates in the sentence (dates inside the boundaries and, if there are none, the closest date in the entire sentence).
- `eds.negation`: It takes into account following past participles and preceding infinitives.
- `eds.hypothesis`: It takes into account following past participles hypothesis verbs.
- `eds.negation` & `eds.hypothesis`: Introduce new patterns and remove unnecessary patterns.
- `eds.dates`: Add a pattern for preceding relative dates (ex: l'embolie qui est survenue à 10 jours).
- `eds.pollution` component to account for multiline footers
- `QuickExample` object to quickly try a pipeline.
- `eds.umls`
- `RegexMatcher` method to create spans from groupdicts
- `eds.dates` option to disable time detection
- `eds.hypothesis`: Remove too generic patterns.
- `EDSTokenizer`: It now tokenizes `"recherche d'"` as `["recherche", "d'"]`, instead of `["recherche", "d", "'"]`.
- `eds.history` component by taking into account the date extracted from `eds.dates` component.
- `eds.elston-ellis` pipeline to identify Elston Ellis scores
- `eds.pollution` and change pattern of footer
- `eds.sections` when `eds.normalizer` is in the pipe.
- (`spacy.blank('eds')`) now recognizes non-breaking whitespaces as spaces and does not split float numbers
- `eds.dates` pipeline now allows new lines as space separators in dates
- `nested_ner` pipeline component
- `section` to entities
- `eds.adicap` pipeline to identify ADICAP codes
- `pollution` pipeline and simplifies activating or deactivating specific patterns
- `pollution` pipeline
- `ContextualMatcher` (and all pipelines depending on it), rendering it more flexible to use
- `eds.history` factory.
- `eds.sections`. Previously, no check was done
- `SimstringMatcher` matcher to perform fuzzy term matching, and `algorithm` parameter in terminology components and `eds.matcher` component
- `eds.TNM`. Now it is possible to return a dictionary where the results are either `str` or `int` values
- `ContextualMatcher` pipe, aiming at replacing the `AdvancedRegex` pipe.
- `as_ents` parameter for `eds.dates`, to save detected dates as entities
- `eds.sentences` pipeline component with Cython
- `requirements.txt` to 1.8.2 to handle an incompatibility with the ContextualMatcher
- `.csv.gz` compression for verbs
- `eds.sentences` behaviour with dot-delimited dates (e.g. `02.07.2022`, which counted as three sentences)
- `Span._.date.to_datetime()` to return a result inferred from context for those cases with missing information.
- `eds.dates` to identify cases where only the month is mentioned
- `eds.terminology` component for generic terminology matching, using the `kb_id_` attribute to store fine-grained entity label
- `eds.cim10` terminology matching pipeline
- `eds.drugs` terminology pipeline that maps brand names and active ingredients to a unique ATC code
- `eds.TNM` pipeline
- `callback` argument
- `processing` pipelines
- `nov` and `nov.` are now detected, as all other months)
- `EDSPhraseMatcher`
- `eds` language to better fit French clinical documents and improve speed
- `measures` pipeline
- `dateparser` library (see scrapinghub/dateparser#1045)
- `attr` issue in the `advanced-regex` pipeline
- `eds.covid`
- `edsnlp.processing` pipe.
- `eds.covid` NER pipeline for detecting COVID19 mentions.
- `CUSTOM_NORM` is completely abandoned in favour of a more spacyfic alternative
- `normalizer` pipeline modifies the `NORM` attribute in place
- `Token._.excluded` custom attribute
- `edsnlp.qualifiers` submodule, the inheritance from the `GenericMatcher` is dropped.
- `edsnlp.utils.filter.filter_spans` now accepts a `label_to_remove` parameter. If set, only corresponding spans are removed, along with overlapping spans. Primary use-case: removing pseudo cues for
- `Span._.negation` for the `eds.negation` pipeline). The previous convention is kept
- `dates` pipeline underwent some light formatting to increase robustness and fix a few issues
- `consultation_dates` pipeline was added, which looks for dates preceded by expressions specific to consultation dates
- `terms.py` submodule is replaced by `patterns.py` to reflect the possible presence of regular expressions
- `core`, `ner`, `misc`, `qualifiers`)
- `matchers` submodule contains `RegexMatcher` and `PhraseMatcher` classes, which interact with the normalisation
- `multiprocessing` submodule contains `spark` and `local` multiprocessing tools
- `connectors` contains `Brat`, `OMOP` and `LabelTool` connectors
- `utils` contains various utilities
- `edsnlp.components`.
- `eds` namespace for components: for instance, `negation` becomes `eds.negation`. Using the former pipeline name still works, but issues a deprecation warning.
- `numpy` dependency in favour of `bisect`
- `eds.antecedents` to `eds.history`. `eds.antecedents` still works, but issues a deprecation warning and support will be removed in a future version.
- `eds.covid` component, that identifies mentions of COVID
- `normalizer` pipeline now adds atomic components (`lowercase`, `accents`, `quotes`, `pollution` & `endlines`) to the processing pipeline, and compiles the results into a `Doc._.normalized` extension. The latter is itself a spaCy `Doc` object, wherein tokens are normalised and pollution tokens are removed altogether. Components that match on the `CUSTOM_NORM` `normalized` document, and matches are brought back to the original document using a token-wise mapping.
- `RegexMatcher` to use the `CUSTOM_NORM` attribute
- `EDSPhraseMatcher`, wrapping spaCy's `PhraseMatcher` to enable matching on `CUSTOM_NORM`.
- `matcher` and `advanced` pipelines to enable matching on the `CUSTOM_NORM` attribute.
- `reason` pipeline, that extracts the reason for visit.
- `endlines` pipeline, that classifies newline characters between spaces and actual ends of line.
- (`negation`, `hypothesis`, etc), i.e. if the cue is within the entity. Disabled by default.
- `dates` to remove miscellaneous bugs.
- `isort` pre-commit hook.
- `negation`, `hypothesis`, `antecedents`, `family` and `rspeech` by using spaCy's `filter_spans` and our `consume_spans` methods.
- `hypothesis` and `family`, enhancing results.
- `generic` to `matcher`. This is a non-breaking change for the average user, adding the pipeline is still:

```python
nlp.add_pipe("matcher", config=dict(terms=dict(maladie="maladie")))
```
- `quickumls` pipeline. It was untested, unmaintained. Will be added back in a future release.
- `score` pipeline, and `charlson`.
- `advanced-regex` pipeline
- `negation` pipeline
- `negation` pipeline
- `family` pipeline
- `hypothesis` pipeline
- `antecedents` pipeline
- `rspeech` pipeline
- `rules` folder
- `pipelines` folder, containing one subdirectory per component
- (`terms.py`)

First working version. Available pipelines:

- `section`
- `sentences`
- `normalization`
- `pollution`