The AMIA medical deidentification task

The i2b2 organization, a national center for biomedical computing, sponsored a medical deidentification challenge in conjunction with the 2006 AMIA conference. MITRE participated in this challenge, and the work we did there is the foundation of the MIST system. The data for that challenge - 889 fully deidentified medical discharge summaries, annotated for PHI - has been released to the public by i2b2 as their NLP dataset #1B, and is available for download once your organization executes the appropriate bilateral data use agreement. The data and registration procedures can be found here. We are distributing a MIST task which manipulates this data.

This task is a simple variant of the general deidentification task. Make sure you've read the documentation on general deidentification customizations.

Task name

The name of this task, when you need to refer to it in MATEngine, MATWorkspaceEngine, or the UI, is "AMIA Deidentification".

Preparing the data

The data, as distributed by i2b2, is not in the appropriate format for use with this task, for the following reasons:

It is distributed as two large XML files, each containing hundreds of documents.
The extent of the annotations for dates covers only the day and month, presumably because the HIPAA guidelines do not require the year to be obscured. However, this extent makes reliable resynthesis impossible.
The annotations are ENAMEX-style: a single PHI annotation carries a "type" attribute that names the type of the PII. This task, however, expects a separate annotation for each PII type.

We provide a script which you can use to prepare the data appropriately. The script can be found in src/tasks/AMIA/utils in your distribution.

Let's say that your input file that you got from i2b2 is train.xml, and you want to put the postprocessed documents in the outdir directory:

% python src/tasks/AMIA/utils/split_AMIA_file.py --extend_dates \
--promote_type_attr train.xml outdir
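
To picture what promoting the type attribute means, here is a minimal Python sketch. It is not the code in split_AMIA_file.py: the function name promote_type_attr and the regular expression are illustrative assumptions, and the sketch assumes well-formed, non-nested PHI tags whose attribute is spelled TYPE in uppercase.

```python
import re

# Assumption: every PHI span looks like <PHI TYPE="X">text</PHI>, with no nesting.
PHI_RE = re.compile(r'<PHI TYPE="([A-Z]+)">(.*?)</PHI>', re.DOTALL)


def promote_type_attr(text: str) -> str:
    """Rewrite <PHI TYPE="X">...</PHI> as <X>...</X> (hypothetical helper)."""
    def _promote(match: re.Match) -> str:
        tag, content = match.group(1), match.group(2)
        return f"<{tag}>{content}</{tag}>"

    return PHI_RE.sub(_promote, text)


if __name__ == "__main__":
    sample = 'Seen by <PHI TYPE="DOCTOR">Smith</PHI> on <PHI TYPE="DATE">25th of July</PHI>.'
    print(promote_type_attr(sample))
```

A regular expression is enough for an illustration, but it would not cope with the mismatched tag described in the next paragraph, which is one reason to use the distributed script on the real data.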

Note that this script makes two crucial repairs to the data as it splits it:

The training data, as distributed by i2b2, has a mismatched XML tag at line 25722 (the offending string is <PHI TYPE="DATE">25th of July<PHI TYPE="DOCTOR">). This error is repaired.
The raw test data, if you ultimately choose to download it, has Windows line terminations, while the annotated ground truth test data has Unix line terminations. If you were to try to annotate the raw data documents, and score them against the ground truth, the signals would differ and the scorer would fail. The script converts any Windows line terminations it finds into Unix line terminations.
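
The line termination problem can be illustrated with a short Python sketch. The sample text and the helper name to_unix_line_endings are made up for illustration and are not taken from the script; the point is only that each Windows line termination adds one character, so the same phrase sits at a different character offset in the two signals.

```python
def to_unix_line_endings(text: str) -> str:
    """Convert Windows (CRLF) line terminations to Unix (LF). Hypothetical helper."""
    return text.replace("\r\n", "\n")


if __name__ == "__main__":
    # Invented sample; real records are much longer.
    raw_windows = "Patient: Smith\r\nSeen: 25th of July\r\n"
    unix = to_unix_line_endings(raw_windows)

    # The same phrase starts at different offsets in the two signals,
    # so annotations made on one cannot be compared against the other.
    print(raw_windows.index("25th"), unix.index("25th"))
```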

Once you run this script, the output directory contains segmented individual records whose annotations are in the appropriate form, but the documents do not contain zones or tokens, which are important to MIST. So we next apply an AMIA-specific workflow to add these zones and tokens, and at the same time convert the documents to MAT JSON format.

Unix:

% $MAT_PKG_HOME/bin/MATEngine --task "AMIA Deidentification" \
--input_dir outdir --input_file_re ".*[.]xml" --input_file_type xml-inline \
--workflow "Process tagged untokenized docs" --steps "zone and align" \
--output_dir json-outdir --output_file_type mat-json --output_fsuff ".json"

Windows:

% %MAT_PKG_HOME%\bin\MATEngine.cmd --task "AMIA Deidentification" ^
--input_dir outdir --input_file_re ".*[.]xml" --input_file_type xml-inline ^
--workflow "Process tagged untokenized docs" --steps "zone and align" ^
--output_dir json-outdir --output_file_type mat-json --output_fsuff ".json"

The documents you've created can be used like any other fully annotated documents; e.g., you can apply the nominate and transform steps to create redacted or resynthesized documents. As another example, these documents are now suitable for building a model:

Unix:

% $MAT_PKG_HOME/bin/MATModelBuilder --task 'AMIA Deidentification' \
--input_files 'json-outdir/*.json' --file_type mat-json --save_as_default_model

Windows:

% %MAT_PKG_HOME%\bin\MATModelBuilder.cmd --task "AMIA Deidentification" ^
--input_files "json-outdir\*.json" --file_type mat-json --save_as_default_model

The resulting model will be in src/tasks/AMIA/default_model.

Note: You may have to increase the Java heap size to make model building (and subsequent tagging) work. You can do this in your task.xml file by modifying the <java_subprocess_parameters> element, for example:

  <java_subprocess_parameters heap_size="2G"/>

Additional workflows

The AMIA task has an additional workflow, "Process tagged untokenized docs", which should be applied to documents which have content annotations for PHI but are missing zones, tokens, or both. This workflow has a special "zone and align" step; once this step is applied, the resulting documents are in the same state as if they had been processed in the "Demo" workflow using the "zone" and "tag" steps.

Annotation set

HOSPITAL	A medical facility
PATIENT	The name of a patient
DOCTOR	The name of a medical provider
DATE	A date, including the year
LOCATION	A partial or full address, including city, state and ZIP
ID	An ID code or number
PHONE	A telephone number
AGE	An age

Additional replacer implementations

The AMIA task provides some special replacer implementations.

Implementation	UI name	Description
AMIAReplacementEngine.AMIADEIDReplacementEngine	clear -> DE-ID	Maps clear text PIIs to a DE-ID-style obscured pattern. For most tags, the pattern is, e.g., HOSPITAL. However, AGE, DATE, PATIENT and DOCTOR have subsequent patterns surrounded by square brackets: PATIENT[AAA B. CCC], where the sequence between the brackets mirrors the pattern of name tokens in the clear text name (the token substitutions are, by default, consistent within the scope of a single document, and DOCTOR follows the same pattern); DATE[5/6/09], where the sequence between the brackets is an actual date, displaced consistently throughout the document by the same randomly selected offset; and AGE[in 30s], where the sequence between the brackets indicates a decade of life, with the exceptions of "birth-12", "in teens", and "90+".
AMIAReplacementEngine.AMIADEIDResynthesisEngine	DE-ID -> clear	Maps the DE-ID-style pattern described above into clear text.
AMIAReplacementEngine.AMIAPIIReplacementEngine	clear -> clear	Maps clear text PIIs to resynthesized, artificial PIIs. For most tags, the behavior of this replacer is identical to the general clear -> clear replacer, except that there are some idiosyncrasies in the handling of HOSPITALs (which can include room numbers) and DOCTORs (which can include initials with an attached set of initials for the medical transcriber, e.g. "djh / vp").
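
The AGE and DATE patterns described for the clear -> DE-ID replacer can be sketched in Python as follows. This is an illustration only, not the AMIAReplacementEngine code: the names age_pattern and DocumentDateShifter, the exact bucket boundaries (12 and under, 13 to 19, 90 and over), the one-year maximum offset, and the month/day/two-digit-year rendering are all assumptions made for the example.

```python
import random
from datetime import date, timedelta
from typing import Optional


def age_pattern(age: int) -> str:
    """Render an age as a DE-ID-style AGE pattern (assumed bucket boundaries)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 12:
        label = "birth-12"
    elif age <= 19:
        label = "in teens"
    elif age >= 90:
        label = "90+"
    else:
        label = f"in {age // 10 * 10}s"  # e.g. 34 -> "in 30s"
    return f"AGE[{label}]"


class DocumentDateShifter:
    """Shift every date in one document by the same randomly selected offset."""

    def __init__(self, max_days: int = 365, rng: Optional[random.Random] = None):
        rng = rng or random.Random()
        # One offset per document, chosen once and reused for every date.
        self.offset = timedelta(days=rng.randint(-max_days, max_days))

    def shift(self, d: date) -> date:
        return d + self.offset

    def date_pattern(self, d: date) -> str:
        s = self.shift(d)
        return f"DATE[{s.month}/{s.day}/{s.year % 100:02d}]"


if __name__ == "__main__":
    shifter = DocumentDateShifter()
    admitted, discharged = date(2009, 5, 6), date(2009, 5, 16)
    print(age_pattern(34), shifter.date_pattern(admitted), shifter.date_pattern(discharged))
```

Because the offset is fixed per document, the interval between any two dates in the document is preserved after displacement, which is what keeps a shifted record internally coherent.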