The i2b2 organization, a national center for biomedical
computing, sponsored a medical deidentification challenge in
conjunction with the 2006 AMIA conference. MITRE participated in
this challenge, and the work we did there is the foundation of the
MIST system. The data for that challenge - 889 fully deidentified
medical discharge summaries, annotated for PHI - has been released
to the public by i2b2 as their NLP dataset #1B, and is available
for download once your organization executes the appropriate
bilateral data use agreement. The data and registration procedures
can be found here.
We are distributing a MIST task which manipulates this data.
This task is a simple variant of the general deidentification
task. Make sure you've read the documentation on general deidentification
customizations.
The name of this task, when you need to refer to it in MATEngine,
MATWorkspaceEngine, or the UI, is "AMIA Deidentification".
The data, as distributed by i2b2, is not in the appropriate
format for use with this task, for the following reasons:
We provide a script which you can use to prepare the data
appropriately. The script can be found in src/tasks/AMIA/utils in
your distribution.
Suppose the input file you obtained from i2b2 is train.xml, and you want to write the postprocessed documents to the outdir directory:
% python src/tasks/AMIA/utils/split_AMIA_file.py --extend_dates \
--promote_type_attr train.xml outdir
Note that split_AMIA_file.py makes two crucial repairs to the data
as it splits it:
Once you run this script, the output directory contains segmented
individual records whose annotations are in the appropriate form,
but the documents do not contain zones or tokens, which are
important to MIST. So we next apply an AMIA-specific workflow to
add these zones and tokens, and at the same time convert the
documents to MAT JSON format.
Unix:
% $MAT_PKG_HOME/bin/MATEngine --task "AMIA Deidentification" \
--input_dir outdir --input_file_re ".*[.]xml" --input_file_type xml-inline \
--workflow "Process tagged untokenized docs" --steps "zone and align" \
--output_dir json-outdir --output_file_type mat-json --output_fsuff ".json"
Windows:
% %MAT_PKG_HOME%\bin\MATEngine.cmd --task "AMIA Deidentification" ^
--input_dir outdir --input_file_re ".*[.]xml" --input_file_type xml-inline ^
--workflow "Process tagged untokenized docs" --steps "zone and align" ^
--output_dir json-outdir --output_file_type mat-json --output_fsuff ".json"
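The Unix and Windows invocations differ mainly in the launcher name and shell syntax, so a small Python wrapper can build the same argument list for both. The sketch below uses only the options shown above. The helper name mat_engine_command and the fallback path "/opt/MAT" are hypothetical.

```python
import os


def mat_engine_command(mat_pkg_home, input_dir, output_dir, windows=None):
    """Build the MATEngine argument list for the "zone and align" step."""
    if windows is None:
        windows = os.name == "nt"
    launcher = "MATEngine.cmd" if windows else "MATEngine"
    return [
        os.path.join(mat_pkg_home, "bin", launcher),
        "--task", "AMIA Deidentification",
        "--input_dir", input_dir,
        "--input_file_re", ".*[.]xml",
        "--input_file_type", "xml-inline",
        "--workflow", "Process tagged untokenized docs",
        "--steps", "zone and align",
        "--output_dir", output_dir,
        "--output_file_type", "mat-json",
        "--output_fsuff", ".json",
    ]


# "/opt/MAT" is a placeholder used only when MAT_PKG_HOME is not set.
cmd = mat_engine_command(os.environ.get("MAT_PKG_HOME", "/opt/MAT"),
                         "outdir", "json-outdir")
print(cmd)
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues with the workflow and step names, which contain spaces.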
The documents you've produced can be used like any other fully
annotated documents; for example, you can apply the nominate
and transform steps to create redacted or resynthesized documents.
As another example, these documents are now suitable for building
a model:
Unix:
% $MAT_PKG_HOME/bin/MATModelBuilder --task 'AMIA Deidentification' \
--input_files 'json-outdir/*.json' --file_type mat-json --save_as_default_model
Windows:
% %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "AMIA Deidentification" ^
--input_files "json-outdir\*.json" --file_type mat-json --save_as_default_model
The resulting model will be in src/tasks/AMIA/default_model.
Note: You may have to increase the Java heap size for model
building (and subsequent tagging) to work. You can do this in
your task.xml file by modifying the <java_subprocess_parameters>
element, for example:
<java_subprocess_parameters heap_size="2G"/>
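If you prefer to script this change, the following Python sketch sets the heap_size attribute on every <java_subprocess_parameters> element in an XML string. The sample document is a hypothetical, heavily simplified stand-in for a real task.xml, set_heap_size is an illustrative helper, and the sketch does nothing if the element is absent.

```python
import xml.etree.ElementTree as ET


def set_heap_size(task_xml_text, heap_size="2G"):
    """Set heap_size on each <java_subprocess_parameters> element.

    Returns the updated XML text and the number of elements changed.
    """
    root = ET.fromstring(task_xml_text)
    changed = 0
    for elem in root.iter("java_subprocess_parameters"):
        elem.set("heap_size", heap_size)
        changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical fragment for illustration; a real task.xml contains much more.
sample = (
    '<task name="AMIA Deidentification">'
    '<java_subprocess_parameters heap_size="512M"/>'
    '</task>'
)
updated, count = set_heap_size(sample)
print(count)
print(updated)
```

ElementTree's default parser discards XML comments and original formatting, so for a hand-maintained task.xml a manual edit may be preferable.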
The AMIA task has an additional workflow, "Process tagged
untokenized docs", which should be applied to documents which have
content annotations for PHI but are missing zones, tokens, or
both. This workflow has a special "zone and align" step; once
this step is applied, the resulting documents are in the same
state as if they had been processed in the "Demo" workflow using
the "zone" and "tag" steps.
Tag | Description |
---|---|
HOSPITAL | A medical facility |
PATIENT | The name of a patient |
DOCTOR | The name of a medical provider |
DATE | A date, including the year |
LOCATION | A partial or full address, including city, state and ZIP |
ID | An ID code or number |
PHONE | A telephone number |
AGE | An age |
The AMIA task provides some special replacer implementations.
Implementation | UI name | Description |
---|---|---|
AMIAReplacementEngine.AMIADEIDReplacementEngine | clear -> DE-ID | Maps clear text PIIs to a DE-id-style obscured pattern. For most tags, the pattern is, e.g., **HOSPITAL. However, AGE, DATE, PATIENT and DOCTOR have subsequent patterns surrounded by square brackets. |
AMIAReplacementEngine.AMIADEIDResynthesisEngine | DE-ID -> clear | Maps the DE-id-style pattern described above into clear text. |
AMIAReplacementEngine.AMIAPIIReplacementEngine | clear -> clear | Maps clear text PIIs to resynthesized, artificial PIIs. For most tags, the behavior of this replacer is identical to the general clear -> clear replacer, except for some idiosyncrasies in the handling of HOSPITALs (which can include room numbers) and DOCTORs (which can include initials with an attached set of initials for the medical transcriber, e.g. "djh / vp"). |
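
As a rough illustration of the clear -> DE-ID mapping for the simple case only, the Python sketch below replaces annotated spans with a "**TAG" marker. It is not the AMIADEIDReplacementEngine implementation. The span representation (start, end, tag), the helper names, and the sample sentence are assumptions. The bracketed forms for AGE, DATE, PATIENT and DOCTOR are not specified here, so the sketch refuses those tags rather than guessing.

```python
# Tags whose DE-ID form carries extra bracketed detail not covered by this sketch.
BRACKETED_TAGS = {"AGE", "DATE", "PATIENT", "DOCTOR"}


def span_of(text, phrase, tag):
    """Locate the first occurrence of phrase and return (start, end, tag)."""
    start = text.index(phrase)
    return (start, start + len(phrase), tag)


def obscure_simple_tags(text, spans):
    """Replace each non-overlapping (start, end, tag) span with '**TAG'."""
    out = []
    cursor = 0
    for start, end, tag in sorted(spans):
        if tag in BRACKETED_TAGS:
            raise ValueError("bracketed pattern for %s is not sketched here" % tag)
        out.append(text[cursor:start])  # text between annotations is kept
        out.append("**" + tag)          # annotated text is replaced by the marker
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


# Invented sample sentence; no real patient data.
text = "Admitted to Mercy General, phone 555-0100."
spans = [span_of(text, "Mercy General", "HOSPITAL"),
         span_of(text, "555-0100", "PHONE")]
result = obscure_simple_tags(text, spans)
print(result)
```

Raising an error for the four bracketed tags is a deliberate choice: silently leaving those spans in place would leak the original text.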