The Identification Scrubber Toolkit

The MITRE Identification Scrubber Toolkit (MIST) is a suite of tools for identifying and redacting personally identifiable information (PII) in free-text medical records. MIST helps you replace this PII either with obscuring fillers, such as [NAME], or with artificial, synthesized, but realistic English fillers.

For example, MIST can help you convert this document:

Patient ID: P89474

Mary Phillips is a 45-year-old woman with a history of diabetes.
She arrived at New Hope Medical Center on August 5 complaining
of abdominal pain. Dr. Gertrude Philippoussis diagnosed her
with appendicitis and admitted her at 10 PM.

into this:

Patient ID: [ID]

[NAME] is a [AGE]-year-old woman with a history of diabetes.
She arrived at [HOSPITAL] on [DATE] complaining
of abdominal pain. Dr. [PHYSICIAN] diagnosed her
with appendicitis and admitted her at 10 PM.

or this:

Patient ID: ID586

Sandy Parkinson is a 34-year-old woman with a history of diabetes.
She arrived at Mercy Hospital on July 10 complaining
of abdominal pain. Dr. Myron Prendergast diagnosed her
with appendicitis and admitted her at 10 PM.
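
The two output styles above can be illustrated with a minimal Python sketch. It is not MIST's code or API: the Span class, the replace_spans function, the two filler functions, and the SURROGATES table are all hypothetical names invented for this illustration, and the PII spans are assumed to have been identified already.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Span:
    """A tagged PII phrase: character offsets plus a tag label (hypothetical type)."""
    start: int
    end: int
    label: str


def replace_spans(text: str, spans: Iterable[Span],
                  filler_for: Callable[[str, str], str]) -> str:
    """Rebuild text, swapping each span for filler_for(label, original_phrase)."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError("overlapping spans are not handled in this sketch")
        pieces.append(text[cursor:span.start])  # untouched text before the span
        pieces.append(filler_for(span.label, text[span.start:span.end]))
        cursor = span.end
    pieces.append(text[cursor:])  # untouched text after the last span
    return "".join(pieces)


def bracket_filler(label: str, original: str) -> str:
    """Obscuring filler: the tag name in brackets, e.g. [NAME]."""
    return f"[{label}]"


# Assumption: a fixed lookup table stands in for real surrogate synthesis.
SURROGATES = {
    "NAME": "Sandy Parkinson",
    "HOSPITAL": "Mercy Hospital",
    "DATE": "July 10",
}


def surrogate_filler(label: str, original: str) -> str:
    """Realistic filler, falling back to the bracketed tag for unknown labels."""
    return SURROGATES.get(label, f"[{label}]")


TEXT = "Mary Phillips arrived at New Hope Medical Center on August 5."


def find(label: str, phrase: str) -> Span:
    """Build a span by locating a known phrase (stands in for a real tagger)."""
    start = TEXT.index(phrase)
    return Span(start, start + len(phrase), label)


SPANS = [
    find("NAME", "Mary Phillips"),
    find("HOSPITAL", "New Hope Medical Center"),
    find("DATE", "August 5"),
]

print(replace_spans(TEXT, SPANS, bracket_filler))
print(replace_spans(TEXT, SPANS, surrogate_filler))
```

Keeping the filler choice behind a single function argument means the same tagged document can be rendered in either style without tagging it again.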

MIST decomposes the deidentification task into two subtasks: identifying the PII phrases in your documents, and replacing those phrases with fillers.

The first subtask is addressed by the MITRE Annotation Toolkit (MAT), which is a highly customizable suite of tools for natural language processing upon which MIST is built. The customizations for MIST itself address the second subtask. The MIST documentation uses the terms annotation and tagging interchangeably for the task of identifying, either by hand or automatically, the PII phrases in your documents. The labels for your PII types (e.g., NAME, PHYSICIAN, AGE, DATE) will be the tags that you'll be applying to your documents.
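
To make the notion of a tag concrete, here is a minimal Python sketch of what the output of a tagging step might look like: a list of labeled character ranges over the document. It is only an illustration under stated assumptions. The Annotation type, the PATTERNS table, and the tag function are hypothetical, and the regular-expression rules are a toy stand-in for a real tagger; they are not how MAT or MIST identifies PII.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    """One tag applied to a document: a label plus character offsets (hypothetical)."""
    label: str
    start: int
    end: int


# Assumption: three hand-written toy rules, chosen only to fit the sample text.
PATTERNS = {
    "ID": re.compile(r"\bP\d{5}\b"),
    "AGE": re.compile(r"\b\d{1,3}(?=-year-old)"),
    "DATE": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}\b"
    ),
}


def tag(text: str) -> List[Annotation]:
    """Return every pattern match as an Annotation, ordered by start offset."""
    found = [
        Annotation(label, match.start(), match.end())
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(found, key=lambda a: a.start)


text = "Patient ID: P89474. A 45-year-old woman arrived on August 5."
for annotation in tag(text):
    print(annotation.label, repr(text[annotation.start:annotation.end]))
```

Because the annotations only point into the text rather than modifying it, the identification step can be reviewed or corrected by hand before any replacement happens.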

MITRE's research program in deidentification aims to reduce the overhead of document deidentification, enabling broader sharing of appropriately redacted medical records to enhance medical and public health research. Our government sponsors are interested in these issues as ways of lowering the future cost of health care for the DoD and civilian government organizations.

You can download MIST here. The current version of MIST is 2.0.4.

While MITRE has no control over what you do with MIST, we do ask, as a courtesy, that you let us know ahead of time about any publications you submit which refer to MIST or its performance.

MITRE has assigned a BSD license to its contributions to the MIST toolkit.

MIST is distributed with a large number of open-source components which bear similarly liberal licenses. MIST requires specific versions of some of these tools, and in some cases has modified those packages to enhance their functionality. The packages and their licenses are:

MIST is also distributed with a few optional GPL-licensed components. These components are not central to the operation of MIST, and we view their inclusion as "mere aggregation" in GPL parlance (in other words, the GPL license is not required for MIST as a whole). Those packages are:

MIST is a research prototype. It is not intended to be enterprise-ready: it is not internationalized, it is not configured to work with enterprise Web capabilities like Tomcat or Apache, it has no real security model, and it is not designed for 24/7 availability or replication.

On the other hand, MIST has been under development for several years, and has been used successfully by a number of MITRE's research partners to do various deidentification tasks. The statistical trainer/tagger that provides the core of MIST's functionality achieved the highest score in the 2006 i2b2 evaluation of deidentification tools.

If your needs are straightforward and you're comfortable with research software, MIST may be useful to you. For other situations, MIST's value is in the design of its approach to the deidentification task, and we encourage others who are building more enterprise-ready capabilities to familiarize themselves with the design of MIST.

MITRE does not offer ongoing support for MIST. We have set up a users mailing list for discussion about MIST, but the MIST team does not have the resources to provide open-ended long-term support for open-source packages. Members of the MIST team may monitor this list, and may occasionally comment, but cannot be relied upon for help or advice.

Instead, the MIST team is seeking a government or non-profit partner to assume continued custodianship of the MIST tool.

Until we locate a custodian, we may, from time to time, post an updated version of the MIST toolkit. If we do so, we'll announce it on the users mailing list above.

If you're interested in the technical details of MIST and how to use it, you can read the on-line documentation (best viewed with Firefox):

For those interested in a general survey of the state of the art in automated deidentification, we recommend the following (non-MITRE) paper:

O. Uzuner, Y. Luo, P. Szolovits. Evaluating the state-of-the-art in automatic de-identification. J Am Med Inform Assoc 2007 Sep-Oct; 14(5):550-63.

MITRE researchers have co-authored several papers on MIST, alone and with partner institutions:

B. Wellner, M. Huyck, S. Mardis, J. Aberdeen, A. Morgan, L. Peshkin, A. Yeh, J. Hitzeman, L. Hirschman. Rapidly retargetable approaches to de-identification in medical records. J Am Med Inform Assoc 2007 Sep-Oct;14(5):564-73.

R. Yeniterzi, J. Aberdeen, S. Bayer, B. Wellner, C. Clark, L. Hirschman, B. Malin. Effects of Personal Identifier Resynthesis on Clinical Text De-identification. J Am Med Inform Assoc 2010; 17(2):159-68.

J. Aberdeen, S. Bayer, R. Yeniterzi, B. Wellner, C. Clark, D. Hanauer, B. Malin, L. Hirschman. The MITRE Identification Scrubber Toolkit: Design, training, and assessment. International Journal of Medical Informatics, 2010, 79(12):849-859.

D. Carrell, B. Malin, J. Aberdeen, S. Bayer, C. Clark, B. Wellner, and L. Hirschman. Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text. Journal of the American Medical Informatics Association, Jul. 2012.

D. Hanauer, J. Aberdeen, S. Bayer, B. Wellner, C. Clark, K. Zheng, and L. Hirschman. Bootstrapping a de-identification system for narrative patient records: cost-performance tradeoffs. International Journal of Medical Informatics, 2013, 82(9):821-831.

MIST version history

The changes listed here apply exclusively to the MIST deidentification capability. For changes related to the underlying MITRE Annotation Toolkit, see the documentation for each release. For instance, MAT 2.0 introduced a whole host of changes, including a new UI, enhanced scorer, and document comparison capabilities. For upgrade notes and a summary of the changes in MAT 2.0, see here; for a detailed version history, see here.

Note that in order to use MIST 2.0, your task must be updated. If you obtained your task from us, please contact us for an updated task and we'll see what we can do; if we no longer have the resources to help, the upgrade notes describe how to update it yourself.

Note also that in order to use MIST 2.0, your workspaces and models must all be updated. The upgrade notes will tell you how to do that.