The AMIA medical deidentification task

The i2b2 organization, a national center for biomedical computing, sponsored a medical deidentification challenge in conjunction with the 2006 AMIA conference. MITRE participated in this challenge, and the work we did there is the foundation of the MIST system. The data for that challenge - 889 fully deidentified medical discharge summaries, annotated for PHI - has been released to the public by i2b2 as their NLP dataset #1B, and is available for download once your organization executes the appropriate bilateral data use agreement. The data and registration procedures can be found here. We are distributing a MIST task which manipulates this data.

This task is a simple variant of the general deidentification task. Make sure you've read the documentation on general deidentification customizations.

Preparing the data

The data, as distributed by i2b2, is not in the appropriate format for use with this task, for the following reasons:

We provide a script which you can use to prepare the data appropriately. The script can be found in src/tasks/AMIA/utils in your distribution.

Let's say that your input file that you got from i2b2 is train.xml, and you want to put the postprocessed documents in the outdir directory:

% python src/tasks/AMIA/utils/split_AMIA_file.py --extend_dates \
--promote_type_attr train.xml outdir

Note that this script makes two crucial repairs to the data as it splits it:

Once you run this script, the output directory contains segmented individual records whose annotations are in the appropriate form, but the documents do not contain zones or tokens, which are important to MIST. So we next apply an AMIA-specific workflow to add these zones and tokens, and at the same time convert the documents to MAT JSON format.

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task "AMIA Deidentification" \
--input_dir outdir --input_file_re ".*[.]xml" --input_file_type xml-inline \
--workflow "Process tagged untokenized docs" --steps "zone and align" \
--output_dir json-outdir --output_file_type mat-json --output_fsuff ".json"


Windows:

% %MAT_PKG_HOME%\bin\MATEngine.cmd --task "AMIA Deidentification" ^
--input_dir outdir --input_file_re ".*[.]xml" --input_file_type xml-inline ^
--workflow "Process tagged untokenized docs" --steps "zone and align" ^
--output_dir json-outdir --output_file_type mat-json --output_fsuff ".json"
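
As a quick sanity check on the converted files, the following sketch counts annotations per label across the json-outdir directory. It assumes that each MAT JSON document is a JSON object with an "asets" list whose entries carry a "type" label and an "annots" list; that layout is an assumption made for this example, not something this page specifies, so adjust the key names if your files differ. The function name count_annotations is hypothetical.

```python
import glob
import json
import os
from collections import Counter


def count_annotations(json_dir):
    """Count annotations per label across all .json files in json_dir.

    ASSUMPTION: each file is a JSON object with an "asets" list, and each
    entry has a "type" (the label) and an "annots" list (one item per
    annotation). Entries without a "type" are skipped.
    """
    counts = Counter()
    for path in sorted(glob.glob(os.path.join(json_dir, "*.json"))):
        with open(path, encoding="utf-8") as fp:
            doc = json.load(fp)
        for aset in doc.get("asets", []):
            label = aset.get("type")
            if label is not None:
                counts[label] += len(aset.get("annots", []))
    return counts


if __name__ == "__main__":
    # Directory name taken from the MATEngine command above.
    for label, n in sorted(count_annotations("json-outdir").items()):
        print(f"{label}\t{n}")
```

If the PHI labels listed under "Annotation set" appear alongside zone and token annotations, the "zone and align" step has most likely done its job.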

The documents you've created can be used like any other fully annotated documents; e.g., you can apply the nominate and transform steps to create redacted or resynthesized documents. As another example, these documents are now suitable for building a model:

Unix:

% $MAT_PKG_HOME/bin/MATModelBuilder --task 'AMIA Deidentification' \
--input_files 'json-outdir/*.json' --file_type mat-json --save_as_default_model

Windows:

% %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "AMIA Deidentification" ^
--input_files "json-outdir\*.json" --file_type mat-json --save_as_default_model

The resulting model will be in src/tasks/AMIA/default_model.
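
If you would rather drive the model build from a script than type the command by hand, the following is a minimal sketch that assembles the same argument list for either platform. It uses only the flags shown above; the function name model_builder_command and its parameters are hypothetical, and the sketch assumes the converted MAT JSON files sit directly in the directory you pass in.

```python
import os
import sys


def model_builder_command(json_dir, pkg_home=None, windows=None):
    """Return the MATModelBuilder argument list used in this section.

    pkg_home defaults to the MAT_PKG_HOME environment variable, and
    windows defaults to the current platform. The file pattern is passed
    through unexpanded, just as the quoted pattern is in the shell commands.
    """
    if pkg_home is None:
        pkg_home = os.environ["MAT_PKG_HOME"]
    if windows is None:
        windows = sys.platform.startswith("win")
    sep = "\\" if windows else "/"
    # On Windows the tools are invoked through their .cmd wrappers.
    exe = "MATModelBuilder.cmd" if windows else "MATModelBuilder"
    return [
        sep.join([pkg_home, "bin", exe]),
        "--task", "AMIA Deidentification",
        "--input_files", sep.join([json_dir, "*.json"]),
        "--file_type", "mat-json",
        "--save_as_default_model",
    ]
```

Passing the returned list to subprocess.run(cmd, check=True) would then run the build and raise an error if the tool exits with a nonzero status.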

Note: You may have to increase the Java heap size to make model building (and subsequent tagging) work. You can do this in your task.xml file by modifying the <java_subprocess_parameters> element, for example:

  <java_subprocess_parameters heap_size="2G"/>

Additional workflows

The AMIA task has an additional workflow, "Process tagged untokenized docs", which should be applied to documents which have content annotations for PHI but are missing zones, tokens, or both. This workflow has a special "zone and align" step; once this step is applied, the resulting documents are in the same state as if they had been processed in the "Demo" workflow using the "zone" and "tag" steps.

Annotation set

  • HOSPITAL: A medical facility
  • PATIENT: The name of a patient
  • DOCTOR: The name of a medical provider
  • DATE: A date, including the year
  • LOCATION: A partial or full address, including city, state and ZIP
  • ID: An ID code or number
  • PHONE: A telephone number
  • AGE: An age

Additional replacer implementations

The AMIA task provides some special replacer implementations.

Implementation: AMIAReplacementEngine.AMIADEIDReplacementEngine
UI name: clear -> DE-ID
Description: Maps clear text PIIs to a DE-ID-style obscured pattern.

For most tags, the pattern is, e.g., **HOSPITAL. However, AGE, DATE, PATIENT and DOCTOR are followed by a further pattern in square brackets:
  • **PATIENT[AAA B. CCC], where the pattern represents the pattern of name tokens in the clear text name. The token substitutions are, by default, consistent within the scope of a single document. This pattern applies to DOCTOR as well.
  • **DATE[5/6/09], where the sequence between the brackets is an actual date, displaced consistently throughout the document by the same randomly-selected offset.
  • **AGE[in 30s], where the sequence between the brackets indicates a decade of life, with the exception of "birth-12", "in teens", and "90+".

Implementation: AMIAReplacementEngine.AMIADEIDResynthesisEngine
UI name: DE-ID -> clear
Description: Maps the DE-ID-style pattern described above into clear text.

Implementation: AMIAReplacementEngine.AMIAPIIReplacementEngine
UI name: clear -> clear
Description: Maps clear text PIIs to resynthesized, artificial PIIs.

For most tags, the behavior of this replacer is identical to the general clear -> clear replacer, except for some idiosyncrasies in the handling of HOSPITALs (which can include room numbers) and DOCTORs (which can include initials with an attached set of initials for the medical transcriber, e.g. "djh / vp").
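
To make the bracketed DE-ID patterns described for the clear -> DE-ID replacer more concrete, here is a minimal Python sketch of how such patterns could be produced. It is an illustration only and is not the code of AMIADEIDReplacementEngine: the names age_pattern and DocumentScrubber, the exact age boundaries, the offset range, and the letter-assignment scheme are all assumptions made for this example.

```python
import datetime
import random


def age_pattern(age):
    """Render an age as a decade-of-life bucket, e.g. 34 -> '**AGE[in 30s]'.

    ASSUMPTION: 'birth-12' covers 0-12, 'in teens' covers 13-19, and '90+'
    covers 90 and above; the text names the buckets, not their boundaries.
    """
    if age <= 12:
        bucket = "birth-12"
    elif age <= 19:
        bucket = "in teens"
    elif age >= 90:
        bucket = "90+"
    else:
        bucket = "in %d0s" % (age // 10)
    return "**AGE[%s]" % bucket


class DocumentScrubber:
    """Per-document state that keeps date and name replacements consistent."""

    def __init__(self, seed=None):
        rng = random.Random(seed)
        # One randomly selected day offset, reused for every date in the
        # document. ASSUMPTION: the +/- 365 day range is illustrative.
        self.day_offset = rng.choice([d for d in range(-365, 366) if d != 0])
        self._letters = {}  # lowercased name token -> placeholder letter

    def date_pattern(self, date):
        """Shift a datetime.date by the document offset, e.g. '**DATE[5/6/09]'."""
        shifted = date + datetime.timedelta(days=self.day_offset)
        return "**DATE[%d/%d/%02d]" % (
            shifted.month, shifted.day, shifted.year % 100)

    def name_pattern(self, label, name):
        """Replace each name token with a letter pattern, e.g. 'AAA B. CCC'.

        The same token always maps to the same letter within this document.
        ASSUMPTION: letters are handed out in order of first appearance and
        wrap around after Z.
        """
        out = []
        for token in name.split():
            core = token.rstrip(".")
            key = core.lower()
            if key not in self._letters:
                self._letters[key] = chr(ord("A") + len(self._letters) % 26)
            letter = self._letters[key]
            # Initials stay one letter wide; longer tokens become three
            # letters. Any trailing period is preserved.
            placeholder = letter if len(core) == 1 else letter * 3
            out.append(placeholder + token[len(core):])
        return "**%s[%s]" % (label, " ".join(out))
```

Creating one DocumentScrubber per document is what gives the consistency the patterns call for: every date in that document moves by the same offset, and a surname that appears in both a PATIENT and a DOCTOR mention receives the same placeholder letters.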