The i2b2 organization, a national center for biomedical computing, sponsored a medical deidentification challenge in conjunction with the 2006 AMIA conference. MITRE participated in this challenge, and the work we did there is the foundation of the MIST system. The data for that challenge (889 fully deidentified medical discharge summaries, annotated for PHI) has been released to the public by i2b2 as their NLP dataset #1B, and is available for download once your organization executes the appropriate bilateral data use agreement. The data and registration procedures are available from i2b2. We are distributing a MIST task which manipulates this data.
This task is a simple variant of the general deidentification task. Make sure you've read the documentation on general deidentification customizations.
The name of this task, when you need to refer to it in MATEngine,
MATWorkspaceEngine, or the UI, is "AMIA Deidentification".
The data, as distributed by i2b2, is not in the appropriate format for use with this task, for the following reasons:
We provide a script which you can use to prepare the data appropriately. The script can be found in src/tasks/AMIA/utils in your distribution.
Suppose the input file you obtained from i2b2 is train.xml, and you want to write the postprocessed documents to the outdir directory:
% python src/tasks/AMIA/utils/split_AMIA_file.py --extend_dates \
--promote_type_attr train.xml outdir
Note that this script makes two crucial repairs to the data as it splits it:
Once you run this script, the output directory contains segmented individual records whose annotations are in the appropriate form, but the documents do not contain zones or tokens, which are important to MIST. So we next apply an AMIA-specific workflow to add these zones and tokens, and at the same time convert the documents to MAT JSON format.
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "AMIA Deidentification" \
--input_dir outdir --input_file_re ".*[.]xml" --input_file_type xml-inline \
--workflow "Process tagged untokenized docs" --steps "zone and align" \
--output_dir json-outdir --output_file_type mat-json --output_fsuff ".json"
Windows:
% %MAT_PKG_HOME%\bin\MATEngine.cmd --task "AMIA Deidentification" \
--input_dir outdir --input_file_re ".*[.]xml" --input_file_type xml-inline \
--workflow "Process tagged untokenized docs" --steps "zone and align" \
--output_dir json-outdir --output_file_type mat-json --output_fsuff ".json"
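If you need to run this step from a script on both platforms, the command can be assembled programmatically. The following is a minimal sketch and is not part of the MIST distribution: the helper name `build_zone_and_align_command` and its parameters are hypothetical, and it only reassembles the executable names and options shown in the commands above.

```python
import ntpath
import posixpath


def build_zone_and_align_command(mat_pkg_home, input_dir, output_dir, windows=False):
    """Return the argument list for the "zone and align" MATEngine call.

    Hypothetical helper: it only reuses the options shown in the
    Unix and Windows commands above.
    """
    if windows:
        # Windows installs expose the engine as MATEngine.cmd.
        engine = ntpath.join(mat_pkg_home, "bin", "MATEngine.cmd")
    else:
        engine = posixpath.join(mat_pkg_home, "bin", "MATEngine")
    return [
        engine,
        "--task", "AMIA Deidentification",
        "--input_dir", input_dir,
        "--input_file_re", ".*[.]xml",
        "--input_file_type", "xml-inline",
        "--workflow", "Process tagged untokenized docs",
        "--steps", "zone and align",
        "--output_dir", output_dir,
        "--output_file_type", "mat-json",
        "--output_fsuff", ".json",
    ]


if __name__ == "__main__":
    # "/opt/MIST" is a placeholder for your own MAT_PKG_HOME.
    print(build_zone_and_align_command("/opt/MIST", "outdir", "json-outdir"))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` should avoid shell quoting problems with the task and workflow names, which contain spaces.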
The documents you've created can be used like any other fully annotated documents; e.g., you can apply the nominate and transform steps to create redacted or resynthesized documents. As another example, these documents are now suitable for building a model:
Unix:
% $MAT_PKG_HOME/bin/MATModelBuilder --task 'AMIA Deidentification' \
--input_files 'json-outdir/*.json' --file_type mat-json --save_as_default_model
Windows:
% %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "AMIA Deidentification" \
--input_files "json-outdir\*.json" --file_type mat-json --save_as_default_model
The resulting model will be in src/tasks/AMIA/default_model.
Note: You may have to increase the Java heap size for model building (and subsequent tagging) to work. You can do this in your task.xml file by modifying the <java_subprocess_parameters> element, for example:
<java_subprocess_parameters heap_size="2G"/>
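The same change can be made with a script. The sketch below uses Python's standard `xml.etree.ElementTree`. The surrounding `<task>` element in the sample string is a placeholder, since the real layout of your task.xml may differ, and `set_heap_size` is a hypothetical helper. It only updates existing `<java_subprocess_parameters>` elements and reports how many it found.

```python
import xml.etree.ElementTree as ET


def set_heap_size(root, heap_size):
    """Set heap_size on every <java_subprocess_parameters> element under root.

    Returns the number of elements updated (0 means none was found).
    """
    updated = 0
    for elem in root.iter("java_subprocess_parameters"):
        elem.set("heap_size", heap_size)
        updated += 1
    return updated


if __name__ == "__main__":
    # Placeholder document: a real task.xml contains much more than this.
    sample = '<task><java_subprocess_parameters heap_size="1G"/></task>'
    root = ET.fromstring(sample)
    print(set_heap_size(root, "2G"))
    print(ET.tostring(root, encoding="unicode"))
```

ElementTree discards XML comments by default when it parses a file, so if your task.xml is commented, editing it by hand may be the safer choice.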
The AMIA task has an additional workflow, "Process tagged untokenized docs", which should be applied to documents which have content annotations for PHI but are missing zones, tokens, or both. This workflow has a special "zone and align" step; once this step is applied, the resulting documents are in the same state as if they had been processed in the "Demo" workflow using the "zone" and "tag" steps.
Tag | Description |
---|---|
HOSPITAL | A medical facility |
PATIENT | The name of a patient |
DOCTOR | The name of a medical provider |
DATE | A date, including the year |
LOCATION | A partial or full address, including city, state and ZIP |
ID | An ID code or number |
PHONE | A telephone number |
AGE | An age |
The AMIA task provides some special replacer implementations.
Implementation | UI name | Description |
---|---|---|
AMIAReplacementEngine.AMIADEIDReplacementEngine | clear -> DE-ID | Maps clear text PIIs to a DE-id-style obscured pattern. For most tags, the pattern is, e.g., **HOSPITAL. However, AGE, DATE, PATIENT and DOCTOR have subsequent patterns surrounded by square brackets. |
AMIAReplacementEngine.AMIADEIDResynthesisEngine | DE-ID -> clear | Maps the DE-id-style pattern described above into clear text. |
AMIAReplacementEngine.AMIAPIIReplacementEngine | clear -> clear | Maps clear text PIIs to resynthesized, artificial PIIs. For most tags, the behavior of this replacer is identical to the general clear -> clear replacer, except for some idiosyncrasies in the handling of HOSPITALs (which can include room numbers) and DOCTORs (which can include initials with an attached set of initials for the medical transcriber, e.g. "djh / vp"). |
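To make the DE-id-style pattern more concrete, here is a small sketch of recognizing such placeholders in text. It assumes, based only on the description above, that a placeholder is `**` followed by an upper-case tag name, optionally followed by a part in square brackets. The regular expression, the helper name `find_deid_placeholders`, and the sample string are illustrative and are not taken from the AMIAReplacementEngine implementation.

```python
import re

# Assumed shape: "**TAG", optionally followed by "[...]".
DEID_PLACEHOLDER = re.compile(r"\*\*([A-Z]+)(?:\[([^\]]*)\])?")


def find_deid_placeholders(text):
    """Return (tag, bracket_content) pairs; bracket_content is None if absent."""
    return [(m.group(1), m.group(2)) for m in DEID_PLACEHOLDER.finditer(text)]


if __name__ == "__main__":
    # Made-up sample text; the bracket contents are placeholders.
    sample = "Seen at **HOSPITAL by **DOCTOR[AAA] on **DATE[Jan 01 2000]."
    print(find_deid_placeholders(sample))
```

A check like this can help confirm which replacer produced a given document before you choose between the DE-ID -> clear and clear -> clear replacers.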