Personally identifiable information, or PII, is pretty intuitive. If you know someone's phone number, Social Security number or credit card number, you have a direct link to their identity.

Hackers use these identifiers, along with a few more personal details, as keys to unlock data, steal identities, and ultimately take money. But the lines between PII and non-PII data are blurring. It's been known for at least 10 years that there are specific pieces of data which may appear anonymous, but when taken together are just as effective at identifying a person as traditional PII.

The easiest of these so-called quasi-PIIs to understand is the trio of full birth date, ZIP code and gender. If a company published a dataset that had been “de-identified” by removing all the standard PIIs, but left those three data items alone, a smart hacker could with very high likelihood find the name and address of the person behind that data.

Why would this work? At a very basic level, the identity thief is doing the work of a detective: going through lists looking for matches. The lists in this case are voting records, which in the U.S. are available from most towns and counties for a nominal fee, typically around $40. In the UK this information is free.

Voting records contain name, address, and most importantly full birth date; postal codes can be easily determined from the address. By looking for matching birth dates and postal codes, identity thieves narrow down the search to a few names. Add gender information and, for most postal codes, hackers can arrive at a unique name.
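
The matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, names, and field names (`birth_date`, `zip`, `gender`, `purchase`) are all made up for the example and do not reflect any real voter-file format.

```python
from collections import defaultdict

# Hypothetical "de-identified" records: names removed, quasi-identifiers kept.
deidentified = [
    {"birth_date": "1961-07-04", "zip": "02138", "gender": "F", "purchase": "hiking boots"},
    {"birth_date": "1985-01-15", "zip": "10001", "gender": "M", "purchase": "golf clubs"},
]

# Hypothetical voter roll: name plus the same three quasi-identifiers.
voter_roll = [
    {"name": "Jane Example", "birth_date": "1961-07-04", "zip": "02138", "gender": "F"},
    {"name": "Joan Other", "birth_date": "1961-07-04", "zip": "02139", "gender": "F"},
    {"name": "John Sample", "birth_date": "1985-01-15", "zip": "10001", "gender": "M"},
    {"name": "Jim Placeholder", "birth_date": "1985-01-15", "zip": "10001", "gender": "M"},
]


def quasi_key(record):
    """Combine birth date, ZIP code and gender into one lookup key."""
    return (record["birth_date"], record["zip"], record["gender"])


def link(records, roll):
    """Pair each anonymous record with every voter sharing its quasi-identifiers."""
    index = defaultdict(list)
    for voter in roll:
        index[quasi_key(voter)].append(voter["name"])
    return [(rec, index.get(quasi_key(rec), [])) for rec in records]


matches = link(deidentified, voter_roll)
for rec, names in matches:
    status = "unique match" if len(names) == 1 else f"{len(names)} candidates"
    print(rec["purchase"], "->", names, f"({status})")
```

In this toy data the first record resolves to a single name, while the second still has two candidates, which is where extra clues would come in.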

Of course, the more additional information or clues gathered, especially from social media and other websites, the easier it is to filter and narrow down names when there's more than one candidate.

Take the U.S., for example. A quick back-of-the-envelope calculation tells you why one might do very well with this approach. Taking 365 days (ignoring leap years) and multiplying by an average lifespan of 80 years, it works out that a complete birth date gives 29,200 “bins” in which to place a ZIP code's worth of U.S. citizens. If you have gender information, you double the number of slots, to a little over 58,000.
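
For readers who want to check the arithmetic, here it is as a short Python snippet. The constant names are just labels chosen for this sketch.

```python
DAYS_PER_YEAR = 365   # leap years ignored, as in the text
LIFESPAN_YEARS = 80   # the rough span of birth years assumed in the text
GENDERS = 2

birth_date_bins = DAYS_PER_YEAR * LIFESPAN_YEARS
bins_with_gender = birth_date_bins * GENDERS

print(birth_date_bins)   # 29200
print(bins_with_gender)  # 58400
```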

I can hear nitpickers out there saying that voting rolls contain only the names of those age 18 and over, so you would have to remove 6,570 slots (18 years of 365 days). True enough, but researchers have shown it's possible to exploit Facebook's leaky handling of data on school-age minors to partially address this gap.

In any case, based on the last U.S. census, there are more than 40,000 ZIP codes, with an average of only 7,000 people per ZIP code. On a gut level, it seems there's a good chance most of those 7,000 people will find themselves alone in one of those 58,000 slots. In other words, the odds are that most of them won't share the same birth date, ZIP code, and gender.
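
That gut feeling can be given a rough number. The sketch below assumes, purely for illustration, that the 7,000 residents of an average ZIP code are spread uniformly and independently over the 58,400 birth-date-and-gender slots. Real birth dates and ZIP code populations are not uniform, so treat the result as a ballpark figure rather than a measurement.

```python
import math

BINS = 2 * 365 * 80      # birth date x gender slots from the estimate above
PEOPLE_PER_ZIP = 7_000   # average ZIP code population quoted in the text

# Probability that none of the other residents lands in a given person's slot,
# under the uniform, independent placement assumed for this sketch.
p_alone = (1 - 1 / BINS) ** (PEOPLE_PER_ZIP - 1)

# The same quantity via the usual exponential approximation.
approx = math.exp(-(PEOPLE_PER_ZIP - 1) / BINS)

print(f"chance of being alone in a slot: {p_alone:.3f}")
print(f"exponential approximation:       {approx:.3f}")
```

Under these simplified assumptions, close to nine in ten residents of an average ZIP code would sit alone in their slot.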

Carnegie Mellon computer science professor and data privacy expert Latanya Sweeney ran the numbers back in 2000: using then-current census data (broken down by ZIP codes and age groups), she estimated that 87% of the people in the U.S. could be uniquely identified using just those three non-PIIs.

Piecing the information together is even easier in the UK, as a post code will often cover little more than a single street.

Fortunately, Sweeney's research and results from other experts have made their way to policy makers. For example, when medical research on U.S. patients is published, HIPAA's Safe Harbor de-identification rules say that no geographic unit smaller than a state can be included in the public data. Full dates (e.g., admission, birth) must also be reduced to the year alone.

With U.S. regulations on PII varying by the particular legislation, this is by no means a universal rule. However, the Federal Trade Commission, an influential regulatory agency on privacy matters, has recently issued new best practices on data de-identification.

They've called for all companies to achieve a “reasonable level of confidence” that their public data can't be linked back to an individual. Clearly, the combination of birth date, ZIP code and gender would fail that test.
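
One way a company might put a number on that kind of confidence is to count how many records share each combination of quasi-identifiers, a measure commonly called k-anonymity. To be clear, this is a technique swapped in for illustration, not the FTC's own test. The sample rows and the choice to coarsen data to birth year and three-digit ZIP prefix are assumptions made for this sketch.

```python
from collections import Counter

# Hypothetical rows of (birth_date, zip_code, gender) from a dataset to be published.
records = [
    ("1961-07-04", "02138", "F"),
    ("1961-11-30", "02139", "F"),
    ("1985-01-15", "10001", "M"),
    ("1985-01-15", "10001", "M"),
    ("1985-06-02", "10003", "M"),
]


def smallest_group(rows):
    """Return k: the size of the smallest group of rows with identical values."""
    return min(Counter(rows).values())


def generalize(row):
    """Coarsen a row: keep only the birth year and the first three ZIP digits."""
    birth_date, zip_code, gender = row
    return (birth_date[:4], zip_code[:3], gender)


k_raw = smallest_group(records)
k_generalized = smallest_group([generalize(r) for r in records])

print("k with full birth date and ZIP:", k_raw)               # 1
print("k with birth year and ZIP prefix:", k_generalized)     # 2
```

A k of 1 means at least one person is unique in the data. Coarsening the fields raises k, at the cost of less detailed data.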

Are there other quasi-PIIs out there? Of course! The larger problem is that consumers are sharing all kinds of information about themselves on websites and social forums.

As one possible scenario, think of an online retailer collecting preference data about its customers (sports interests, hobbies, etc.) along with geographic data and perhaps income information.

These data items would not be considered traditional PII. If hackers pulled this “anonymous” data from a poorly permissioned file on a server, you could imagine them mining various special interest sites, looking for names that match up based on those interests and geo data.

Once they have a match, the next step might be a phishing attack, with the hackers pretending to be the retailer.

For companies that want to stay ahead of the coming stricter de-identification rules, which are being considered in the U.S. and will likely become law in the EU, it would be worth their while to start carefully reviewing their non-PII data, wherever that data might be on their file system.

Andy Green is a technical content specialist at Varonis in London, England.
