Revealed: Secret PIIs in Your Unstructured Data: Opinion
Personally identifiable information, or PII, is pretty intuitive. If you know someone’s phone number, Social Security number or credit card number, you have a direct link to their identity.
Hackers use these identifiers, along with a few more personal details, as keys to unlock data, steal identities, and ultimately take money. The lines between PII and non-PII data are blurring. It’s been known for at least 10 years that there are specific pieces of data which may appear anonymous, but when they’re taken together are just as effective at identifying a person as traditional PII.
The easiest to understand of these so-called quasi-PIIs is the trio of full birth date, ZIP code and gender. If a company published a dataset that had been “de-identified” by removing all the standard PIIs, but left those three data items alone, a smart hacker could with very high likelihood find the name and address of the person behind that data.
Why would this work? At a very basic level, the identity thief is effectively doing the work of a detective – essentially going through lists looking for matches. The lists in this case are voting records, which in the U.S. are available from most towns and counties at a nominal fee – typically around $40. In the UK this information is free.
Voting records contain name, address, and most importantly full birth date; postal codes can be easily determined from address. By looking for matching birth dates and postal codes, identity thieves narrow down the search to a few names. Add gender information and for most postal codes, hackers can arrive at a unique name.
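The matching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, names and field names are invented, and the sketch assumes gender is either listed on the roll or can be guessed from a first name.

```python
# Hypothetical "de-identified" records: names removed, quasi-PIIs left in place.
released = [
    {"birth_date": "1961-07-04", "zip": "02139", "gender": "F"},
    {"birth_date": "1975-03-12", "zip": "02139", "gender": "M"},
]

# Hypothetical voter roll. The gender field is an assumption for this sketch.
voter_roll = [
    {"name": "Alice Example", "birth_date": "1961-07-04", "zip": "02139", "gender": "F"},
    {"name": "Bob Sample", "birth_date": "1975-03-12", "zip": "02139", "gender": "M"},
    {"name": "Carla Sample", "birth_date": "1975-03-12", "zip": "02139", "gender": "F"},
    {"name": "Dana Other", "birth_date": "1980-01-01", "zip": "10001", "gender": "F"},
]


def candidates(record, roll, use_gender=False):
    """Return the names on the roll that match the record's quasi-PIIs."""
    keys = ["birth_date", "zip"]
    if use_gender:
        keys.append("gender")
    return [v["name"] for v in roll if all(v[k] == record[k] for k in keys)]


for record in released:
    by_date_and_zip = candidates(record, voter_roll)
    with_gender = candidates(record, voter_roll, use_gender=True)
    print(record["birth_date"], by_date_and_zip, "->", with_gender)
```

In this toy data, birth date and ZIP code alone leave two candidates for the second record, and adding gender narrows it to one.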
Of course, the more additional information or clues gathered, especially taken from social media and other websites, the easier it is to filter and narrow down names when there’s more than one candidate.
Take the U.S., for example. A quick back-of-the-envelope calculation tells you why one might do very well with this approach. Taking 365 days (ignoring leap years) and multiplying by an average lifespan of 80 years, it works out that a complete birth date gives 29,200 “bins” in which to place a ZIP code’s worth of U.S. citizens. If you have gender information, you double the number of slots, to a little over 58,000.
I can hear the nitpickers out there saying that voting rolls contain only the names of those aged 18 and over, so you would have to remove 6,570 birth-date slots (18 years times 365 days). True enough, but researchers have shown it’s possible to exploit Facebook’s leaky handling of data on school-age minors to partially address this gap.
In any case, based on the last U.S. census, there are more than 40,000 ZIP codes, with an average of only 7,000 people per ZIP code. On a gut level, it seems there’s a good chance most of those 7,000 people will find themselves alone in one of those 58,000 slots. In other words, the odds are that most of them won’t share the same combination of birth date, ZIP code, and gender with anyone else.
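The gut feeling can be checked with a rough calculation. The sketch below uses the article's round numbers and makes one strong simplifying assumption that is not in the original: people are spread uniformly and independently across the slots. Real birth years are not uniform and ZIP codes vary widely in size, so treat it as a plausibility check rather than a measurement.

```python
DAYS_PER_YEAR = 365      # ignoring leap years, as in the text
LIFESPAN_YEARS = 80      # the article's round figure
GENDERS = 2
PEOPLE_PER_ZIP = 7_000   # the article's average per ZIP code

birth_date_bins = DAYS_PER_YEAR * LIFESPAN_YEARS   # 29,200
slots = birth_date_bins * GENDERS                  # 58,400


def prob_alone(people: int, slots: int) -> float:
    """Chance that a given person shares a slot with nobody else,
    assuming everyone lands in a slot uniformly at random."""
    return (1 - 1 / slots) ** (people - 1)


p = prob_alone(PEOPLE_PER_ZIP, slots)
print(f"{slots} slots, P(alone in slot) = {p:.3f}")
```

Under these assumptions the probability comes out a little under 0.9, which supports the intuition that most residents of an average-sized ZIP code sit alone in their slot.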
Carnegie Mellon computer science professor and data privacy expert Latanya Sweeney ran the numbers back in 2000: using then-current census data (broken down by ZIP codes and age groups) she was able to uniquely identify 87% of the people in the U.S. using just those three non-PIIs.
Piecing the information together is even easier in the UK, as a post code will often cover little more than a single street.
Fortunately, Sweeney’s research and results from other experts have made their way to policy makers. For example, when medical research on U.S. patients is published, HIPAA’s Safe Harbor de-identification rules say that no geographic unit smaller than a state can be included in the public data. Full dates (e.g., admission, birth) must also be reduced to the year alone.
Because U.S. rules on PII vary from one piece of legislation to the next, this is by no means a universal rule. However, the Federal Trade Commission, an influential regulatory agency on privacy matters, has recently issued new best practices on data de-identification.
They’ve called for all companies to achieve a “reasonable level of confidence” that their public data can’t be linked back to an individual. Clearly, the combination of birth date, ZIP code and gender would fail that test.
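One rough way a company might screen its own data before release is to count how many records are unique on a given combination of fields. The sketch below is a hypothetical check, not the FTC's test: the function name, field names and sample records are all invented for illustration.

```python
from collections import Counter


def unique_fraction(records, keys=("birth_date", "zip", "gender")):
    """Fraction of records whose combination of `keys` appears exactly once."""
    if not records:
        return 0.0
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] == 1) / len(records)


# Hypothetical sample: two records share every field, two stand alone.
sample = [
    {"birth_date": "1961-07-04", "zip": "02139", "gender": "F"},
    {"birth_date": "1975-03-12", "zip": "02139", "gender": "M"},
    {"birth_date": "1975-03-12", "zip": "02139", "gender": "M"},
    {"birth_date": "1980-01-01", "zip": "10001", "gender": "F"},
]

print(unique_fraction(sample))                    # all three fields
print(unique_fraction(sample, keys=("gender",)))  # gender alone
```

A high fraction for a field combination suggests those fields act as a quasi-PII in that dataset and deserve a closer look before publication.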
Are there other quasi-PIIs out there? Of course! The larger problem is that consumers are sharing all kinds of information about themselves on websites and social forums.
In a possible scenario, think of an online retailer collecting preference data about its customers – sports interests, hobbies, etc. – along with geographic data and perhaps income information.
These data items would not be considered traditional PII. If hackers pulled this “anonymous” data from a poorly permissioned file on a server, you could imagine them mining various special interest sites, looking for names that match up based on those interests and geo data.
Once they have a match, the next step might be a phishing attack, with the hackers pretending to be the retailer.
For companies that want to stay ahead of the stricter de-identification rules that are being considered in the US and will likely become law in the EU, it would be worth their while to start carefully reviewing their non-PII data, wherever that data might be on their file system.