A software tool for masking PHI and other data in XML and JSON files.

Data Masking Tool

A library with command line tool for masking data, such as protected health information (PHI), in XML and JSON files.

Features include:

Mask data from both JSON and XML input files
Multiple files masked at once with consistent masked values, including dates and times.
Optional validation of the input file against an XML or JSON schema definition, so that PHI in unexpected fields does not cause PHI "leakage"
Supports a wide variety of field type transformations (dates, numerics, text, names, URLs, and more).

Usage

Building the project

git clone <repository location>
cd igia-datamask
git checkout develop
mvn clean install

Running the executable jar

igia-datamask.sh --mask=type:xml,config:config.xml,in:employees.xml,schema:employees.xsd,out:employees-masked.xml [ --skip-schema-validation ]
igia-datamask.sh --mask=type:json,config:config.xml,in:employees.json,schema:employees.json,out:employees-masked.json [ --skip-schema-validation ]

--mask   
               Can specify multiple --mask options per execution to similarly mask data across files.
               A single complex parameter follows "=", which includes type:, config:, in:, schema:, and out: values.
               "type:" parameter is the input file type, either xml or json.
               "config:" parameter is for the masking configuration file which contains field level instructions
               for the specific input file.
               "in:" parameter is for the input XML or JSON file containing PHI.
               "schema:" parameter is for the XSD or JSON schema file used for validating the input file to make sure new fields haven't
               been introduced in the input without being configured in the configuration file.
               "out:" parameter is the file generated by the program and will contain the masked version of
               the input file as instructed via the config file.

 --skip-schema-validation
               Skip the XSD or JSON schema validation of the input file.  Note that new fields in the input file
               may then leak PHI if they contain PHI and have not been included in the configuration.
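
The structure of the complex --mask parameter can be illustrated with a small parsing sketch. This is not the tool's own parser: the class name MaskOptionSketch and its parse method are hypothetical, and the sketch only assumes that sub-parameters are separated by commas and that each key is separated from its value by the first colon.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of igia-datamask: splits one --mask argument
// into its sub-parameters (type, config, in, schema, out).
final class MaskOptionSketch {

    static Map<String, String> parse(String arg) {
        String prefix = "--mask=";
        if (!arg.startsWith(prefix)) {
            throw new IllegalArgumentException("Expected an argument starting with " + prefix);
        }
        Map<String, String> values = new LinkedHashMap<>();
        for (String pair : arg.substring(prefix.length()).split(",")) {
            // Split on the first colon only, so a value may itself contain colons.
            int colon = pair.indexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Malformed sub-parameter: " + pair);
            }
            values.put(pair.substring(0, colon).trim(), pair.substring(colon + 1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> mask = parse(
            "--mask=type:xml,config:config.xml,in:employees.xml,schema:employees.xsd,out:employees-masked.xml");
        System.out.println(mask.get("in") + " -> " + mask.get("out"));
    }
}
```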

Example

The igia-datamask code repository contains sample data masking configuration and input files. Run ./run-example.sh in the project root directory after you build the jar file (see Building the project above).

Change log

This version contains a moderate refactoring to support JSON in addition to the original XML support.
Adding JSON support triggered changes to configuration file and command line parameters for consistency across file types:
- The --mask parameter now requires a 'type' sub-parameter which can be set to "xml" or "json".
- The --mask sub-parameter 'xsd' is now 'schema' to reflect that validation schemas may be either XSD or JSON schemas (json schema v4 is supported, see http://json-schema.org).
- The --skip-xsd-validation parameter is now --skip-schema-validation to make it agnostic to xml or json.
- In the xml configuration files, the 'xpath' attribute for <field> is now 'path', e.g. <field path="..." >.  The path may be an xpath expression for XML files, and a json path expression for JSON files (see http://goessner.net/articles/JsonPath/).
- The <field> 'type' attribute should be set to 'JSON' for JSON files.

Introduction

Personal Identifiers Requiring De-identification

The following identifiers of the individual or of relatives, employers, or household members of the individual, must be removed according to HIPAA regulations. The following is the status of what may be handled by this library:

Names;
Use NameTransformer to assign a random name from a dictionary file.
Addresses; All geographic subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code if, according to the current publicly available data from the Bureau of the Census:
- (1) The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and
- (2) The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.
Use TextTransformer to generate random text for address fields of equal length to source, and ZipCodeTransformer to mask the digits of zip.
Dates; All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;
Use DateOffsetTransformer to offset dates by a random interval and BirthdateTransformer to adjust the date of birth so that the patient's age is under 90.
Telephone numbers;
Use NumericIdentifierTransformer to change phone to (999)999-9999.
Fax numbers;
Use NumericIdentifierTransformer to change fax number to (999)999-9999.
Electronic mail addresses;
Use EmailTransformer to generate random fake email address.
Social security numbers;
Use NumericIdentifierTransformer to change SSN to 999-99-9999.
Medical record numbers;
Use IdentifierTransformer to replace the identifier with a hash computed with a common salt, which prevents re-identification while maintaining references.
Health plan beneficiary numbers;
Use IdentifierTransformer to replace the identifier with a hash computed with a common salt, which prevents re-identification while maintaining references.
Account numbers;
Use IdentifierTransformer to replace the identifier with a hash computed with a common salt, which prevents re-identification while maintaining references.
Certificate/license numbers;
Use IdentifierTransformer to replace the identifier with a hash computed with a common salt, which prevents re-identification while maintaining references.
Vehicle identifiers and serial numbers, including license plate numbers;
Use IdentifierTransformer to replace the identifier with a hash computed with a common salt, which prevents re-identification while maintaining references.
Device identifiers and serial numbers;
Use IdentifierTransformer to replace the identifier with a hash computed with a common salt, which prevents re-identification while maintaining references.
Web Universal Resource Locators (URLs);
Use TextTransformer to generate random text of equal length to source. You may use TokenizeTransformer if the URL contains an identifier previously de-identified in configuration, such as patientId used in a FHIR resource URL, in which case just the id in the URL is replaced consistently with the de-identified patientId value. For example http://hl7.org/fhir/Patient/12345 where 12345 is a previously de-identified value, might become http://hl7.org/fhir/Patient/99999.
Internet Protocol (IP) address numbers;
Use IpAddressTransformer to change IP address to 192.168.1.100 (private non-routable address).
Biometric identifiers, including finger and voice prints;
This is not handled in XML currently.
Full face photographic images and any comparable images;
This is not handled in XML currently.
Any other unique identifying number, characteristic, or code, except as permitted by paragraph (c) of this section;
Use TextTransformer to generate random text of equal length to source for other scenarios.
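
Several identifiers above are handled by hashing with a common salt, so that the same source value always yields the same masked value and references between records survive masking. The following is a minimal sketch of that idea, not the library's IdentifierTransformer: the class name SaltedHashSketch is hypothetical, MD5 is taken from the transform table in the Configuration section, and prepending the salt to the value before hashing is an assumption.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of salted hashing; not the library's implementation.
final class SaltedHashSketch {

    static String mask(String salt, String identifier) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Assumption: the salt is simply prepended to the identifier.
            byte[] digest = md5.digest((salt + identifier).getBytes(StandardCharsets.UTF_8));
            // Render the 16-byte digest as 32 lowercase hex characters.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available on this platform", e);
        }
    }

    public static void main(String[] args) {
        String first = mask("example-salt", "MRN-12345");
        String second = mask("example-salt", "MRN-12345");
        // The same salt and identifier always produce the same masked value.
        System.out.println(first.equals(second));
    }
}
```

Using one salt for a whole run keeps a medical record number consistent across every file masked together, while a different salt produces unrelated values.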

Configuration

A configuration file must be provided for each source format when de-identifying a document. The configuration file is written in XML as follows:

<config>
    <namespaces>
        <namespace prefix="s" url="http://schemas.xmlsoap.org/soap/envelope/"/>
        <namespace prefix="i" url="http://www.w3.org/2001/XMLSchema-instance"/>
    </namespaces>
    <fields>
        <field path="//employee/@id" type="ATTRIBUTE" transform="IDENTIFIER" />
        <field path="//employee/firstName" type="TEXT" transform="NAME" />
        <field path="//employee/lastName" type="TEXT" transform="NAME" />
        <field path="//department/@id" type="ATTRIBUTE" transform="NUMERIC_IDENTIFIER" />
        <field path="//employee/phone" type="TEXT" transform="NUMERIC_IDENTIFIER" />
        <field path="//employee/startdate" type="TEXT" transform="DATE_OFFSET">
            <params>
                <entry>
                    <key>SIMPLE_DATE_FORMAT</key>
                    <value>MM/dd/yyyy</value>
                </entry>
            </params>
        </field>
        <field path="//employee/zip" type="TEXT" transform="ZIP_CODE" />
        <field path="//employee/email" type="TEXT" transform="EMAIL" />
        <field path="//employee/dob" type="TEXT" transform="BIRTHDATE" />
    </fields>
</config>

The <namespaces> section is optional and only pertinent when masking XML source documents; it is ignored when processing JSON source documents. Each specified namespace is preloaded to the XML parser and the prefix values must be used in any xpath expressions evaluating XML content paths covered by that namespace. For example: path="//s:SoapBody/s:SoapEnvelope".
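
To show why the configured prefixes matter, here is a sketch of namespace-aware XPath evaluation using the JDK's standard javax.xml.xpath API. It is an illustration only: the class name NamespaceXPathSketch and the prefix-to-URL map are hypothetical and do not reflect how igia-datamask loads its namespaces internally.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration; not the tool's own XML handling.
final class NamespaceXPathSketch {

    static String evaluate(String xml, String path, Map<String, String> prefixToUrl) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed XPath steps to match
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return prefixToUrl.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
                }

                @Override
                public String getPrefix(String namespaceUri) {
                    return null; // not consulted in this sketch
                }

                @Override
                public Iterator<String> getPrefixes(String namespaceUri) {
                    return Collections.emptyIterator();
                }
            });
            return xpath.evaluate(path, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + path, e);
        }
    }
}
```

The prefix in the path only has to agree with the configuration, not with the prefix used inside the document: a document that declares the SOAP envelope namespace as "soap" is still matched by a path written with the configured "s" prefix, because matching is done on the namespace URL.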

The <fields> section specifies each field to be masked. The rules are executed in order, each being processed against the entire source document before moving to the next rule. The order of specification matters for two reasons: (1) there is a global cache of masked values, so that once value A has been transformed to value B it will always be transformed to B, regardless of transformation type, and (2) the TokenizeTransformer will only replace a value in a string if that value has previously been transformed by another transformer.
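
The two ordering effects can be sketched with a small in-memory cache. This is an assumption-laden illustration, not the library's code: the class ValueCacheSketch and its mask and tokenize methods are hypothetical, and treating a token as a run of word characters is only one possible reading of "separate delimited value".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a global masked-value cache.
final class ValueCacheSketch {
    private static final Pattern TOKEN = Pattern.compile("\\w+");
    private final Map<String, String> cache = new HashMap<>();

    // The first transformation of a value wins; later rules reuse the cached result.
    String mask(String original, UnaryOperator<String> transformer) {
        return cache.computeIfAbsent(original, transformer);
    }

    // Replaces only tokens that an earlier rule has already masked; all other text is kept.
    String tokenize(String text) {
        Matcher matcher = TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String token = matcher.group();
            matcher.appendReplacement(out, Matcher.quoteReplacement(cache.getOrDefault(token, token)));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        ValueCacheSketch values = new ValueCacheSketch();
        values.mask("12345", id -> "99999"); // an earlier identifier rule
        System.out.println(values.tokenize("http://hl7.org/fhir/Patient/12345"));
    }
}
```

If the tokenizing rule ran before the identifier rule, the cache would still be empty and the URL would pass through unchanged, which is why the identifier field should be listed first.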

The <field> entity may contain additional params used by the specified transformer type, in key/value format (for example, a key of SIMPLE_DATE_FORMAT with a value of MM/dd/yyyy).
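
As an illustration of how a date format param might be used, the sketch below parses a value with the configured pattern, shifts it by a number of seconds, and writes it back in the same pattern. It is not the library's DateOffsetTransformer: the class DateOffsetSketch is hypothetical, the use of java.text.SimpleDateFormat is inferred only from the param name, the shift is applied forward in time as an assumption, and the 10 to 365 day range is taken from the transform table in this section.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.Random;
import java.util.TimeZone;

// Hypothetical illustration of a pattern-driven date shift.
final class DateOffsetSketch {
    static final int SECONDS_PER_DAY = 86_400;
    static final int MIN_OFFSET_SECONDS = 10 * SECONDS_PER_DAY;
    static final int MAX_OFFSET_SECONDS = 365 * SECONDS_PER_DAY;

    // Picks an offset between 10 and 365 days, expressed in seconds (inclusive).
    static int randomOffsetSeconds(Random random) {
        return MIN_OFFSET_SECONDS + random.nextInt(MAX_OFFSET_SECONDS - MIN_OFFSET_SECONDS + 1);
    }

    // Parses the value with the given pattern, adds the offset, and formats it back.
    static String shift(String value, String pattern, long offsetSeconds) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // avoids daylight-saving jumps
        format.setLenient(false);
        try {
            Date parsed = format.parse(value);
            return format.format(new Date(parsed.getTime() + offsetSeconds * 1000L));
        } catch (ParseException e) {
            throw new IllegalArgumentException("Value does not match pattern " + pattern + ": " + value, e);
        }
    }

    public static void main(String[] args) {
        long offset = randomOffsetSeconds(new Random());
        System.out.println(shift("01/15/2020", "MM/dd/yyyy", offset));
    }
}
```

To keep intervals between related dates intact, the same offset would need to be reused for every date belonging to one run rather than drawn again per field.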

The <field> entity must specify path, type, and transform attributes. The path attribute must be either (1) an XPath expression for XML source files, or (2) a JSON path expression for JSON source files (see https://github.com/json-path/JsonPath). For XML source documents the type attribute may be TEXT or ATTRIBUTE; for JSON source documents the type attribute should be JSON. The following de-identification transform types are supported:

Transform Class	Config Type	Description
BirthdateTransformer	`BIRTHDATE`	Uses DateOffsetTransformer and also applies the rule for ages over 89.
DateOffsetTransformer	`DATE_OFFSET`	Shifts dates and times by a random number of seconds, equivalent to between 10 and 365 days.
EmailTransformer	`EMAIL`	Matches common email address formats and replaces with random but similar format.
IdentifierTransformer	`IDENTIFIER`	Changes any field value to a salted MD5 hash of the original value.
IpAddressTransformer	`IPADDRESS`	Changes any field to `127.0.0.1`.
NameTransformer	`NAME`	Changes field values using a supplied nonsensical latin-like name dictionary.
NumericIdentifierTransformer	`NUMERIC`	Transforms SSNs and phone numbers by replacing digits with 9s, and other long number values by replacing them with random numbers of the same length.
TextTransformer	`TEXT`	Replaces value with "Lorem ipsum dolor..." text of similar length to the source.
TokenizeTextTransformer	`TOKENIZE`	Replaces any previously transformed tokens within the input text, while keeping the remaining original text, provided that the source token is a separate delimited value, such as within a paragraph.
UrlTransformer	`URL`	Replaces any previously transformed tokens within the URL tokens, while keeping the remaining original text, provided that the source token is a separate delimited value.
ZipCodeTransformer	`ZIP_CODE`	Replaces with 99999 or 99999-9999 or 999999999 depending on source format. Other formats will return an empty string.
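
The format-preserving behaviour described for ZIP codes can be sketched as follows. This is only an illustration of the rule in the table, not the library's ZipCodeTransformer: the class name ZipCodeSketch and the regular expression are assumptions.

```java
// Hypothetical illustration of the ZIP code rule from the table above.
final class ZipCodeSketch {

    // Accepts 5 digits, 5+4 digits with a hyphen, or 9 digits; every digit becomes 9.
    // Any other format yields an empty string.
    static String mask(String zip) {
        if (zip == null || !zip.matches("\\d{5}(-?\\d{4})?")) {
            return "";
        }
        return zip.replaceAll("\\d", "9");
    }

    public static void main(String[] args) {
        System.out.println(mask("02139-4307"));
    }
}
```

Returning an empty string for unrecognized formats errs on the side of dropping data instead of passing a possibly identifying value through unmasked.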

Acknowledgments

We'd like to thank MIT Lincoln Laboratories for sharing their "Data Anonymizer for Public Health Data" (L. Barbieri, M. Deangelus, J. Alekseyev, G. Larocque) (https://www.ll.mit.edu), which inspired the data transformation filters of the igia-datamask tool.
