In recent years it has been claimed that data is the oil of the digital economy. Your data is your moat: something unique that your organisation possesses that cannot easily be replicated (as opposed to a product feature or a price).

Take Google: it started off by crawling webpages, then used data on which links users clicked to optimise its search rankings. This second-order data (metadata, data about data) is just as valuable as the catalogue of websites Google set out to gather in the first place.

For decades, organisations have performed analytics on their own data in order to (i) track and improve performance, (ii) better understand customer needs so as to improve their offers, or (iii) sell information products (adapted from Wixom & Ross 2017).

In most companies today, data resides in silos, hidden behind firewalls. However, following the example of information services firms such as Bloomberg, Moody's, Experian, Google and Adobe, organisations have recently been exploring the potential of monetising the data they hold as an additional source of revenue.

To date, data exchange and commercialisation have been quietly concentrated in the marketing industry, where data management platforms (DMPs) link data (such as cookies) from multiple websites to build anonymous but deep behavioural profiles that can be used for ad targeting. Beyond marketing, there are numerous high-value use cases for data sharing. For example, mobile phone location data would be immensely useful for transport or public service planning, and a bank and an insurance company could team up to determine whether there is a link between insurance claims and creditworthiness.

With the proliferation of online devices and services (social media, smartphones, smart watches, home assistants, e-commerce), more and more of our personal information is now pervasively available and visible on the internet. A 2018 CareerBuilder survey found that 70 percent of employers use social media to screen candidates during the hiring process.

The predominant tool in use today to assess privacy risk in data use (or sharing) is the Privacy Impact Assessment (PIA). “A privacy impact assessment is a systematic assessment of a project that identifies the impact that the project might have on the privacy of individuals, and sets out recommendations for managing, minimising or eliminating that impact.” [Source: OAIC] Based on our observations, current PIAs suffer from several weaknesses:

  • They are qualitative (questionnaire-based), so they are only as good as the analysis that supports them, and most organisations have no quantitative tools to support them
  • They focus on de-identification of a statically defined set of “sensitive” data fields
  • They do not take into account quasi-identifiers (composite identifiers) or assess the potential for re-identification
  • In most instances, they do not incorporate scenario-based assessments of potential privacy breach events

A recent privacy breach demonstrates the potential ineffectiveness of existing Privacy Impact Assessments and risk assessments. In July 2018, Public Transport Victoria (PTV) released a data set containing 1.8 billion historical records of public transport users’ activity for use in a data science event. It was subsequently found that membership attacks could expose the identities of certain travellers within the data set, including police officers and members of parliament.

In August 2019, the Office of the Victorian Information Commissioner (OVIC) released a report on its investigation (“Disclosure of myki travel information”). It stated that both PTV and Victoria Police conducted risk assessments, yet both organisations concluded that the data set’s release carried no or low risk. OVIC further found that PTV relied on technical arguments about the definition of personal information, which it deemed to have anonymised, instead of an evaluative assessment of whether information is personal information, which must be considered on a context-specific, case-by-case basis.

It is common practice amongst IT and data science practitioners to protect personal privacy by de-identifying personally identifiable information (PII). However, both the Australian Commonwealth Privacy Act and the Victorian Privacy and Data Protection Act (PDP) define personal information to include any data from which an individual’s identity can reasonably be ascertained.

This makes things tricky: it is no longer just the breadth (the fields) of a data set that affects privacy risk, but also its depth, i.e. the composition and uniqueness of the rows within it.
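To make row uniqueness concrete, here is a minimal sketch (with made-up data; the field names are purely illustrative) of the standard k-anonymity check: group rows by their quasi-identifier combination and see how small the smallest group is. A group of size one is a re-identification candidate.

```python
from collections import Counter

# Hypothetical "de-identified" records: no names, but postcode, birth year
# and gender together form a quasi-identifier.
records = [
    {"postcode": "3000", "birth_year": 1985, "gender": "F"},
    {"postcode": "3000", "birth_year": 1985, "gender": "F"},
    {"postcode": "3121", "birth_year": 1972, "gender": "M"},
    {"postcode": "3121", "birth_year": 1990, "gender": "M"},
    {"postcode": "3056", "birth_year": 1964, "gender": "F"},
]

quasi_identifiers = ("postcode", "birth_year", "gender")

# Count how many rows share each quasi-identifier combination.
groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

# k-anonymity is the smallest group size. k = 1 means at least one row is
# unique on its quasi-identifiers and is therefore a re-identification candidate.
k = min(groups.values())
unique_rows = sum(1 for size in groups.values() if size == 1)

print(f"k-anonymity: {k}")            # prints "k-anonymity: 1"
print(f"unique rows: {unique_rows}")  # prints "unique rows: 3"
```

Note that none of these fields is “sensitive” on its own; it is the combination that makes three of the five rows unique.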

Safeguarding against re-identification is a statutory requirement. But if you are a privacy officer, data management or IT practitioner and you have ever conducted a PIA (privacy impact assessment), have you ever addressed the risk or likelihood of re-identification through quasi-identifiers or re-identification attacks?

Do you have a method to identify or quantify the risk of re-identification?

When we ask this question, one common response is: “We can’t do it, because how do we know what information is out there? How do we quantify the unknown-unknowns?”

We’ll use an analogy to explain how.

If you have ever compressed a file into a zip archive, or turned a very large RAW-format photo into a jpg, you will have noticed that the original information can be reconstructed from a much smaller, partial set of information.

Now think about how likely it is that a person’s identity can be determined from many disparate pieces of information linked to that person.

This intuition stems from information theory, a field pioneered by Claude Shannon in 1948, which defines a key measure called information entropy: the amount of uncertainty in the outcome of a piece of information, measured in bits.
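A short sketch of how entropy accumulates towards identification. The attribute distributions below are made up for illustration (and assumed independent, which real attributes rarely are), but the arithmetic shows the principle: roughly log2 of the population size, about 33 bits for the world’s ~8 billion people, is enough to single out one individual, and each disclosed attribute spends some of that budget.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy H = -sum(p * log2(p)) of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A fair coin flip carries exactly one bit of uncertainty.
print(entropy_bits([0.5, 0.5]))  # prints 1.0

# Illustrative (made-up) attribute distributions, assumed independent:
gender   = entropy_bits([0.5, 0.5])         # ~1 bit
birth_yr = entropy_bits([1 / 80] * 80)      # ~6.3 bits (uniform over 80 years)
postcode = entropy_bits([1 / 2600] * 2600)  # ~11.3 bits (uniform over ~2,600 postcodes)

revealed = gender + birth_yr + postcode

# ~33 bits are enough to single out one person among ~8 billion.
needed = math.log2(8e9)

print(f"bits revealed: {revealed:.1f}, bits needed: {needed:.1f}")
```

Three apparently innocuous attributes already reveal over half the bits needed; linking in one more rich data set (say, a few trip times and stop locations) can close the gap.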

Privacy has become more complicated, and when it comes to re-identification we are in a world that mirrors the ongoing battle between computer viruses and anti-virus software. There are parties out there (including ethical hackers, governments, and commercial and criminal organisations) conducting re-identification attacks for some gain.

If you have data scientists experimenting with “de-identified” data sets, or you are releasing data into the hands of a third party or a data marketplace, you should not sleep soundly at night unless you have conducted at least a basic assessment of re-identification risk.

At Cognitivo, our approach is to adopt operational risk management techniques within our tailored data management framework when dealing with re-identification risk. As part of this approach, we use scenario analysis (considering likely re-identification scenarios and the other data sets available) and evaluate the risk of a privacy breach through data sharing along two key dimensions:

  1. The controls environment of the data custodian / recipient
  2. The privacy or information risk factor, i.e. how sensitive the data set is in terms of containing personal information, or information that could result in re-identification.
Controls Environment vs Information / Privacy Risk Factor
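The two dimensions above can be combined in a simple heat-map style scoring, familiar from operational risk management. The scheme below is a hypothetical sketch of that idea, not Cognitivo’s actual methodology; the ratings, weights and bands are all illustrative assumptions.

```python
# Hypothetical scoring sketch (not an actual assessment methodology):
# combine a controls-environment rating with a privacy/information risk
# factor into a risk band, as in an operational-risk heat map.
CONTROLS = {"strong": 1, "moderate": 2, "weak": 3}  # data custodian / recipient controls
RISK_FACTOR = {"low": 1, "medium": 2, "high": 3}    # data sensitivity / re-identification potential

def risk_band(controls: str, factor: str) -> str:
    """Map the two dimensions to an illustrative sharing decision."""
    score = CONTROLS[controls] * RISK_FACTOR[factor]
    if score <= 2:
        return "acceptable"
    if score <= 4:
        return "mitigate"  # e.g. further de-identification, contractual controls
    return "do not share"

print(risk_band("strong", "high"))  # prints "mitigate"
print(risk_band("weak", "high"))    # prints "do not share"
```

The point of such a matrix is that a highly sensitive data set may still be shareable into a strong controls environment with mitigations, while even moderately sensitive data should not flow to a recipient with weak controls.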

In determining the privacy risk factor and quantifying the risk of re-identification, Cognitivo leverages the R4 Tool (Re-identification Risk Ready Reckoner) developed by CSIRO’s Data61.

[Image: R4 Dashboard]

Cognitivo is an authorised implementation service provider and reseller of Data61’s R4 Tool. Speak to us if you would like to understand more about quantifying re-identification risk in your data assets or data sharing arrangements.
