The HIPAA task

The HIPAA task is a general-purpose deidentification task for medical records. Before using it, make sure you've read the documentation on general deidentification customizations.

Task name

The name of this task, when you need to refer to it in MATEngine, MATWorkspaceEngine, or the UI, is "HIPAA Deidentification".

Additional steps

Most of the workflows in the HIPAA task have an additional initial step, "clean", which cannot be undone; this step converts the document to ASCII with Unix line endings.
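As a rough illustration of what a step like this involves, the following Python sketch normalizes line endings and reduces text to ASCII. It is not the toolkit's implementation: the function name clean_text, the use of Unicode decomposition to keep base letters, and the choice to replace any remaining non-ASCII character with "?" are all assumptions made for this example.

```python
import unicodedata


def clean_text(text: str) -> str:
    """Hypothetical stand-in for a "clean" step: ASCII text, Unix line endings."""
    # Normalize Windows (\r\n) and bare carriage-return (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Decompose accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so the base letter survives.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: anything still outside ASCII is replaced with "?".
    return stripped.encode("ascii", errors="replace").decode("ascii")
```

A conversion like this discards information (the original line endings and any non-ASCII characters), which is one plausible reason such a step cannot be undone.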

Annotation set

The HIPAA law which governs medical record privacy specifies 18 categories, lettered (A) through (R), which the law requires to be obscured. We discuss their implementation here. This implementation is informed by our experiences so far with our research partners. The lettered text that opens each section below is taken directly from 45 CFR 164.514, the regulation which governs PHI privacy.

(A) Names

The NAME tag (full and partial names) and the INITIALS tag (initials).

(B) All geographic subdivisions smaller than a State, including
street address, city, county, precinct, zip code, and their equivalent
geocodes, except for the initial three digits of a zip code if,
according to the current publicly available data from the Bureau of the
Census:
  (1) The geographic unit formed by combining all zip codes with the
      same three initial digits contains more than 20,000 people; and
  (2) The initial three digits of a zip code for all such geographic
      units containing 20,000 or fewer people is changed to 000.

The LOCATION tag. This tag does not permit subdivision of ZIP codes. The state should be obscured as well. All contiguous elements of a location should be included in a single tag, e.g., "12 Mulberry Lane, Winston-Salem, NC, 52004". Locations internal to a hospital, such as room numbers, should use the OTHER tag.

(C) All elements of dates (except year) for dates directly related
to an individual, including birth date, admission date, discharge date,
date of death; and all ages over 89 and all elements of dates (including
year) indicative of such age, except that such ages and elements may be
aggregated into a single category of age 90 or older

The DATE tag and the AGE tag. The DATE tag should include the year, to support resynthesis of realistic fillers (this process is significantly hampered by leaving the year out). We recommend that all ages be tagged.

(D) Telephone numbers;
(E) Fax numbers;

The PHONE tag.

(F) Electronic mail addresses;

The EMAIL tag.

(G) Social security numbers;

The SSN tag.

(H) Medical record numbers;
(I) Health plan beneficiary numbers;
(J) Account numbers;
(K) Certificate/license numbers;
(L) Vehicle identifiers and serial numbers, including license plate numbers;
(M) Device identifiers and serial numbers;

The IDNUM tag. This tag can also be used for any other alphanumeric code not listed here, if the user prefers.

(N) Web Universal Resource Locators (URLs);

The URL tag.

(O) Internet Protocol (IP) address numbers;


(P) Biometric identifiers, including finger and voice prints;
(Q) Full face photographic images and any comparable images;

Not relevant.

(R) Any other unique identifying number, characteristic, or code,
except as permitted by paragraph (c) of this section

The OTHER tag. This tag may include things like room numbers, or any other identifying information the user chooses not to use IDNUM for.

In addition, although not required by HIPAA, the HOSPITAL tag can be used to obscure the names of hospitals and other medical facilities, which many users want to do.
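The correspondence between the lettered categories and the tags described above can be summarized as a small lookup table. The Python sketch below is only a restatement of this section for convenience; the names HIPAA_CATEGORY_TAGS, EXTRA_TAGS, and all_tags are hypothetical and are not part of the toolkit. Categories for which this section names no tag are mapped to an empty tuple.

```python
# Tags named in this section for each lettered category of 45 CFR 164.514.
# Hypothetical summary structure; not an API of the toolkit.
HIPAA_CATEGORY_TAGS = {
    "A": ("NAME", "INITIALS"),
    "B": ("LOCATION",),
    "C": ("DATE", "AGE"),
    "D": ("PHONE",),
    "E": ("PHONE",),
    "F": ("EMAIL",),
    "G": ("SSN",),
    "H": ("IDNUM",),
    "I": ("IDNUM",),
    "J": ("IDNUM",),
    "K": ("IDNUM",),
    "L": ("IDNUM",),
    "M": ("IDNUM",),
    "N": ("URL",),
    "O": (),  # no tag is named for this category in the text above
    "P": (),  # described above as not relevant
    "Q": (),  # described above as not relevant
    "R": ("OTHER",),
}

# Tags that are offered even though HIPAA does not require them.
EXTRA_TAGS = ("HOSPITAL",)


def all_tags():
    """Return every tag mentioned in this section, sorted and without duplicates."""
    tags = set(EXTRA_TAGS)
    for category_tags in HIPAA_CATEGORY_TAGS.values():
        tags.update(category_tags)
    return sorted(tags)
```

For example, HIPAA_CATEGORY_TAGS["C"] yields ("DATE", "AGE"), and all_tags() lists each tag once even though several categories share IDNUM and PHONE.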

Additional replacer implementations

The HIPAA task provides some special replacer implementations.

UI name
clear -> clear
A specialization of the general clear -> clear replacer which provides some customizations for rendering some HIPAA-specific categories.
clear -> DE-ID
Maps clear text PII to a DE-id-style obscured pattern.

For most tags, the pattern is, e.g., **HOSPITAL. However, AGE, DATE and NAME have subsequent patterns surrounded by angle brackets.
  • **NAME<AAA B. CCC>, where the pattern represents the pattern of name tokens in the clear text name. The token substitutions are, by default, consistent within the scope of a single document.
  • **DATE<5/6/09>, where the sequence between the brackets is an actual date, displaced consistently throughout the document by the same randomly-selected offset.
  • **AGE<in 30s>, where the sequence between the brackets indicates a decade of life, with the exception of "birth-12", "in teens", and "90+".
DE-ID -> clear
Maps the DE-id-style pattern described above back into clear text.
[ ] -> clear
A specialization of the general [ ] -> clear replacer which provides some customizations for rendering some HIPAA-specific categories.
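To make the DE-id-style patterns for AGE, DATE and NAME more concrete, here is a minimal Python sketch that produces strings of the shapes shown above. It is an illustration under stated assumptions, not the replacer's actual code: the function and class names are hypothetical, the age boundaries for "in teens" (13 to 19) are inferred from the "birth-12" label, the date offset is assumed to be a number of days chosen once per document, and the letter-assignment scheme for name tokens is a guess at one way to keep substitutions consistent within a document.

```python
import datetime
import string


def age_bucket(age: int) -> str:
    """Map an age in years to a decade-of-life label, with three special cases."""
    if age <= 12:
        return "birth-12"
    if age < 20:
        return "in teens"  # assumption: ages 13 through 19
    if age >= 90:
        return "90+"
    return f"in {age // 10 * 10}s"


def render_age(age: int) -> str:
    return f"**AGE<{age_bucket(age)}>"


def render_date(date: datetime.date, offset_days: int) -> str:
    """Shift a date by an offset; reuse the same offset for every date in a document."""
    shifted = date + datetime.timedelta(days=offset_days)
    # Unpadded month/day and two-digit year, matching the 5/6/09 example.
    return f"**DATE<{shifted.month}/{shifted.day}/{shifted.year % 100:02d}>"


class NamePatterner:
    """Assign one letter per distinct name token; create one instance per document."""

    def __init__(self) -> None:
        self._letters = {}

    def _letter_for(self, token: str) -> str:
        key = token.lower()
        if key not in self._letters:
            # Assumption: letters are handed out in order and wrap after Z.
            index = len(self._letters) % len(string.ascii_uppercase)
            self._letters[key] = string.ascii_uppercase[index]
        return self._letters[key]

    def render(self, name: str) -> str:
        parts = []
        for token in name.split():
            core = token.rstrip(".,")
            suffix = token[len(core):]  # keep trailing punctuation such as "."
            letter = self._letter_for(core)
            # Single-character tokens (initials) stay one letter; others become three.
            parts.append((letter if len(core) == 1 else letter * 3) + suffix)
        return "**NAME<" + " ".join(parts) + ">"
```

With one NamePatterner per document, rendering "John Q. Smith" and later "Smith" reuses the same letter for the shared token, which is the kind of within-document consistency described above.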