Hash-Based Identification

When you begin a visit with Open Data Capture, you must first choose your identification method. If you select the “Personal Information” option, you will need to complete a form with the following fields:

  • First Name
  • Last Name
  • Sex at Birth
  • Date of Birth

These data are used to compute a unique ID with a one-way hashing function. This method is currently employed by the NIH for a similar purpose.

What is a Hash?

A hash function is a mathematical function that takes an input and returns a fixed-size string of characters, which is typically a hexadecimal number. The same input will always produce the same output, but even a small change to the input will produce a very different output.

In our case, we use an algorithm called SHA-256, which creates a 256-bit (32-byte) hash value. This is considered a “strong” hash function for two reasons. First, it is computationally infeasible to produce two different messages that have the same hash value (collision resistance). Second, it is computationally infeasible to recreate the original message from its hash value (pre-image resistance). For reference, the total number of possible hash values is 2^256, or approximately 1.158 x 10^77, a value that approaches the estimated number of atoms in the observable universe, which ranges from 10^78 to 10^82.
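The properties described above can be observed directly with Python's standard hashlib module. This is a minimal illustration, not code from Open Data Capture; the helper name sha256_hex and the sample strings are assumptions made for the example.

```python
import hashlib


def sha256_hex(message: str) -> str:
    """Return the SHA-256 digest of a UTF-8 encoded string as hexadecimal text."""
    # sha256_hex is a hypothetical helper name used only for this illustration.
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


first = sha256_hex("hello")
again = sha256_hex("hello")
changed = sha256_hex("hellp")  # one character differs from "hello"

print(first == again)    # the same input always gives the same digest
print(first == changed)  # a one-character change gives a different digest
print(len(first))        # 64 hexadecimal characters, i.e. 256 bits

# The number of possible digests is 2**256, a 78-digit integer
# (approximately 1.158 x 10^77).
print(len(str(2**256)))
```

Because the digest is always 64 hexadecimal characters regardless of the input length, it works well as a fixed-width identifier.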

Example

To illustrate how this process works, suppose we have the following information:

Variable Name    Value
First Name       René
Last Name        Tremblay
Date of Birth    1980-01-01
Sex              female

Before adding this information to our database, we validate and process the provided inputs. Specifically, we standardize the format of first and last names by converting any accented Latin characters to their non-accented ASCII equivalents and capitalizing all characters. Non-Latin characters are not accepted, so the user must provide a Latin representation of the name. Additionally, we verify that the date of birth adheres to the ISO 8601 standard and that the sex field is either “male” or “female”.
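The validation and standardization steps described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the Open Data Capture implementation: the function names are hypothetical, and the sketch removes accents through Unicode decomposition, which does not cover Latin letters that have no decomposed form (such as “ø”); those are simply rejected here.

```python
import unicodedata
from datetime import date


def normalize_name(name: str) -> str:
    """Strip accents and capitalize a name; reject anything that is still not ASCII."""
    # NFD splits "é" into "e" plus a combining accent (category "Mn"), which is dropped.
    decomposed = unicodedata.normalize("NFD", name.strip())
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Assumption: any character that is still non-ASCII counts as non-Latin.
    # Note that isascii() also lets digits and punctuation through.
    if not stripped or not stripped.isascii():
        raise ValueError(f"name must use Latin characters: {name!r}")
    return stripped.upper()


def validate_date_of_birth(value: str) -> str:
    """Return the date as YYYY-MM-DD, raising ValueError if it is not a valid ISO 8601 date."""
    return date.fromisoformat(value).isoformat()


def validate_sex(value: str) -> str:
    """Accept only the two values named in the text."""
    if value not in ("male", "female"):
        raise ValueError(f"sex must be 'male' or 'female': {value!r}")
    return value


print(normalize_name("René"), normalize_name("Tremblay"))
print(validate_date_of_birth("1980-01-01"), validate_sex("female"))
```

Rejecting leftover non-ASCII characters, rather than silently dropping them, matches the requirement that the user supply a Latin representation of the name.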

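If the inputs are valid, they are concatenated together with a private key, which is an additional string of characters that is stored in the memory of our server. Without this key, anyone who knew a person's name, date of birth, and sex could compute the hash themselves and compare it against leaked identifiers. Including the key prevents an adversary from determining whether an individual is in the database in the unlikely event of a data breach. In the example below, “FOO” stands in for the private key.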

Input:  RENE_TREMBLAY_1980-01-01_female_FOO
Output: 70c7a252fe82c829c08a8f26377dc600c18966eff2a294e724863480559561fc
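The concatenate-then-hash step can be sketched as follows. The underscore separator and field order follow the example input above, and “FOO” stands in for the private key. The function name generate_subject_id and the UTF-8 encoding are assumptions made for this illustration, so the sketch is not guaranteed to reproduce the exact output shown above.

```python
import hashlib


def generate_subject_id(
    first_name: str,
    last_name: str,
    date_of_birth: str,
    sex: str,
    private_key: str,
) -> str:
    """Join the standardized fields and the private key, then hash with SHA-256."""
    # Hypothetical helper: field order and "_" separator mirror the example input.
    source = "_".join([first_name, last_name, date_of_birth, sex, private_key])
    # Assumption: the joined string is encoded as UTF-8 before hashing.
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


subject_id = generate_subject_id("RENE", "TREMBLAY", "1980-01-01", "female", "FOO")
other_key_id = generate_subject_id("RENE", "TREMBLAY", "1980-01-01", "female", "BAR")

print(subject_id)
print(subject_id == other_key_id)  # a different private key gives a different identifier
```

Since the private key is part of the hashed string, someone who knows all four personal fields but not the key cannot recompute the identifier.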

This output is used as the identifier for René Tremblay. Her first and last names are not entered into the database; we store her date of birth and sex alongside her identifier.