Hash-Based Identification
When you begin a visit with Open Data Capture, you must first choose your identification method. If you select the “Personal Information” option, you will need to complete a form with the following fields:
- First Name
- Last Name
- Sex at Birth
- Date of Birth
These data are used to compute a unique ID with a one-way hashing function. This method is currently employed by the NIH for a similar purpose.
What is a Hash?
A hash function is a mathematical function that takes an input and returns a fixed-size string of characters, which is typically a hexadecimal number. The same input will always produce the same output, but even a small change to the input will produce a very different output.
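The two properties above can be checked with a few lines of Python. This is a minimal sketch that uses SHA-256 from the standard `hashlib` module as one concrete hash function; the helper name `sha256_hex` and the sample strings are chosen for illustration only.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of `text` (UTF-8 encoded) as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


a = sha256_hex("Tremblay")
b = sha256_hex("Tremblay")  # identical input
c = sha256_hex("Tremblaz")  # one character changed

print(a == b)  # True: the same input always gives the same output
print(a == c)  # False: a small change gives a different output
print(len(a))  # 64: the output length is fixed regardless of input
```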
In our case, we use an algorithm called SHA-256, which creates a 256-bit (32-byte) hash value. This is considered a “strong” hash function because it is computationally infeasible to produce two different messages that have the same hash value (collision resistance) and it is also computationally infeasible to recreate the original message from its hash value (pre-image resistance). For reference, the total number of possible hash values is 2^256, or about 1.158 x 10^77, which approaches the estimated number of atoms in the observable universe (between 10^78 and 10^82).
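The sizes quoted above can be reproduced directly. The snippet below is only an arithmetic check using Python's standard `hashlib`; the input bytes are arbitrary.

```python
import hashlib

digest = hashlib.sha256(b"example").digest()

print(len(digest))      # 32 (bytes)
print(len(digest) * 8)  # 256 (bits)

# Number of distinct 256-bit values
possible = 2 ** 256
print(f"{possible:.3e}")  # 1.158e+77
```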
Example
To illustrate how this process works, suppose we have the following information:
Variable Name | Value |
---|---|
First Name | René |
Last Name | Tremblay |
Date of Birth | 1980-01-01 |
Sex | female |
Before adding this information to our database, we validate and process the provided inputs. Specifically, we standardize the format of first and last names by converting any accented Latin characters to their non-accented ASCII equivalents and converting all characters to uppercase. Non-Latin characters are not accepted, so the user must provide a Latin representation of the name. Additionally, we verify that the date of birth adheres to the ISO 8601 standard and that the sex field is either “male” or “female”.
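The validation and normalization rules described above might look roughly like the following. This is a hedged sketch, not the code used by Open Data Capture: the function names are hypothetical, and it assumes that accents can be removed by Unicode decomposition and that the date is given in the `YYYY-MM-DD` form of ISO 8601. Latin letters that do not decompose into an ASCII base letter (for example “ø”) are simply rejected here, which may differ from the real behaviour.

```python
import unicodedata
from datetime import date

ALLOWED_SEX = {"male", "female"}


def normalize_name(name: str) -> str:
    """Strip accents from Latin characters and convert to uppercase."""
    # NFD splits "é" into "e" plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFD", name.strip())
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Anything still outside ASCII is treated as unsupported in this sketch.
    if not without_accents or not without_accents.isascii():
        raise ValueError(f"name must use Latin characters: {name!r}")
    return without_accents.upper()


def validate_date_of_birth(value: str) -> str:
    """Accept only the ISO 8601 calendar form YYYY-MM-DD."""
    parsed = date.fromisoformat(value)  # raises ValueError if malformed
    if parsed.isoformat() != value:
        raise ValueError(f"expected YYYY-MM-DD, got {value!r}")
    return value


def validate_sex(value: str) -> str:
    """Accept only the two values named in the text."""
    if value not in ALLOWED_SEX:
        raise ValueError(f"sex must be 'male' or 'female', got {value!r}")
    return value


print(normalize_name("René"))      # RENE
print(normalize_name("Tremblay"))  # TREMBLAY
```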
If the inputs are valid, they are concatenated together with a private key, which is an additional string of characters that is stored in the memory of our server. This is done to prevent an adversary from determining whether an individual is in the database in the unlikely event of a data breach, because the identifier cannot be recomputed from a person's details without the key.
Item | Value |
---|---|
Input | RENE_TREMBLAY_1980-01-01_female_FOO |
Output | 70c7a252fe82c829c08a8f26377dc600c18966eff2a294e724863480559561fc |
This output is used as the identifier for René Tremblay. Her first and last names are not entered into the database; we store only her date of birth and sex alongside her identifier.
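The concatenate-and-hash step from this example can be sketched as follows. The underscore separator and field order are inferred from the example input above, the function names are hypothetical, and `"FOO"` stands in for the private key exactly as it does in the example. If the actual implementation joins or encodes the inputs differently, the digest printed here will not match the one shown in the table.

```python
import hashlib


def build_source(first_name: str, last_name: str, date_of_birth: str,
                 sex: str, private_key: str) -> str:
    """Join the normalized inputs and the private key (assumed '_' separator)."""
    return "_".join([first_name, last_name, date_of_birth, sex, private_key])


def compute_identifier(source: str) -> str:
    """Hash the concatenated string with SHA-256 and return it as hex."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


source = build_source("RENE", "TREMBLAY", "1980-01-01", "female", "FOO")
identifier = compute_identifier(source)

print(source)           # RENE_TREMBLAY_1980-01-01_female_FOO
print(identifier)       # 64-character hexadecimal identifier
print(len(identifier))  # 64
```

Because the private key is part of the hashed string, the same personal details hashed with a different key yield an unrelated identifier.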