Analysis

December 14, 2018

Cloaks for data: making money from privacy by design

Data anonymisation is not a new science, but the technique is attracting investor and customer interest - and potentially more regulation.


Layli Foroudi

9 min read


As May 2018 loomed, companies across the continent scrambled to code cookie notifications, cull newsletter lists and scrutinise privacy policies to comply with the General Data Protection Regulation. Meanwhile the eyes of VCs grew wide as they clocked a cluster of companies building data privacy solutions to sell to companies, big and small (but mostly big).

As Europe was preparing to enter the brave new world of GDPR, Marcel van der Heijden, a partner at VC firm Speedinvest in Austria, zeroed in on German company Aircloak and participated, alongside Constantia New Business, in a $1.3m seed round in October 2017. “The investment was driven by GDPR as it was going to restrict the ways that companies could use third-party, and sometimes first-party, data,” he said. “Our view was that many companies were using process [to comply] but we felt there was a technology solution.”

The technology solution is more or less this: render the data “anonymous” and GDPR doesn’t apply. Aircloak founders Felix Bauer and Sebastian Probst Eide started working on data anonymisation in 2012, years before companies faced the threat of a €20m fine (or 4% of annual turnover) for breaching the regulation. Their product is an interface that allows an analyst to query a data set without ever seeing it. This “cloak” means that the underlying data set doesn’t change, and the query results are anonymised as the data is processed.

“A few years ago, very few people knew about our topic,” said Bauer. “There was no competition and companies weren’t really interested in the lead-up to GDPR - most companies were in emergency mode, preparing paperwork and ticking boxes. Many didn’t look for solutions to work with data, but in the second part of 2018 they [started] to look for solutions. GDPR gave us a PR buzz; the customer interest came later.”
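To make the idea concrete, here is a minimal Python sketch of that kind of query interface - not Aircloak’s actual implementation. The data, the suppression threshold and the noise model are all illustrative assumptions: the analyst submits an aggregate query, and the system answers it from the hidden raw data, refusing or blurring anything that could single out an individual.

```python
import random

random.seed(42)

# Hidden raw data: in a real deployment this would sit in the customer's
# database, behind the query interface. These rows are fabricated.
_RAW_ROWS = [
    {"age": random.randint(18, 80),
     "city": random.choice(["Berlin", "Munich", "Hamburg"]),
     "balance": random.randint(0, 100_000)}
    for _ in range(1_000)
]

LOW_COUNT_THRESHOLD = 5  # refuse to answer queries about very small groups


def cloaked_count(predicate):
    """Count the rows matching a query, then anonymise the answer."""
    true_count = sum(1 for row in _RAW_ROWS if predicate(row))
    if true_count < LOW_COUNT_THRESHOLD:
        return None  # low-count suppression: too few people to answer safely
    noisy = true_count + random.gauss(0, 2)  # blur the exact figure
    return max(0, round(noisy))


# The analyst gets aggregate answers without ever seeing _RAW_ROWS:
print(cloaked_count(lambda r: r["city"] == "Berlin" and r["age"] > 60))
```

The two moves in the sketch - suppressing answers about tiny groups and adding noise to the rest - are the basic levers most such systems pull, whatever the proprietary details.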


There are a few more players in the privacy-compliance marketplace now. Privitar, a British firm set up in 2014, builds privacy-enhancing software for banks, telcos and healthcare institutions, including the UK’s National Health Service. Another UK-based company, Sensible Code, switched from providing web-scraping services to building statistical disclosure algorithms that strip data of disclosive information - it launched its first pilot in 2017, and an advanced pilot with the Office for National Statistics in 2018. The new kid on the block is Hazy, a startup out of University College London, whose data anonymisation software is being used by Nationwide, one of its backers, to enable secure data sharing.

What is anonymisation?

Jason du Preez, CEO of Privitar, doesn’t like using the word “anonymous” because a lot of people don’t know what it means. Sometimes when a company says that your data is anonymised (all personal identifiers, both direct and indirect, that could lead to identification have been removed), what it really means is that the data is pseudonymised (names or other easily attributable identifiers have been replaced with made-up values, e.g. a reference number). The latter is a risk mitigation strategy only, and pseudonymised data is not exempt from GDPR. Anonymisation is not a generic solution - the appropriate approach varies depending on the data set and what it is used for. Here are some of the anonymisation techniques in use (a short code sketch of two of them follows the list):
  • Record and attribute suppression: This is the decision to remove either a certain column (attribute) of data or an entire record of data (e.g. an outlier in a data set) that would lead to identification of individuals.
  • Bucketing: This reduces the precision of data, for example converting a person’s age into an age range or a city into a region.
  • Swapping: If you are only interested in the data in aggregate and not the analysis between attributes at an individual level, then it is possible to swap attributes between data records.
  • Differential privacy: The favoured method of Google and Apple, differentially private algorithms add “noise” to an aggregate query result to protect individual information without distorting the statistical results collected.
  • Homomorphic encryption: This is a scheme that allows calculations to be performed on encrypted (or partially encrypted) data without first decrypting it.
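As a rough illustration of two of these techniques, here is a short Python sketch of bucketing and of a differentially private count. The bucket width and the privacy parameter epsilon are illustrative assumptions; real deployments involve considerably more care (clamping values, tracking a privacy budget across queries, and so on).

```python
import numpy as np


def bucket_age(age, width=10):
    """Bucketing: coarsen an exact age into a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def dp_count(true_count, epsilon=0.5):
    """Differential privacy: perturb an aggregate count with Laplace noise.
    A count has sensitivity 1 (adding or removing one person changes it by
    at most 1), so noise of scale 1/epsilon gives epsilon-differential privacy."""
    return round(true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon))


print(bucket_age(34))   # '30-39' - the precise age is no longer recoverable
print(dp_count(1042))   # close to, but usually not exactly, the true count
```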

For both Privitar and Hazy, the GDPR buzz has driven investor interest. In 2017, Privitar raised $16m in a Series A round led by France’s Partech Ventures, and in December 2018 Citi invested an undisclosed amount. Hazy received $1.8m from investors including Albion Capital in August 2018. Albion’s David Grimm said that “GDPR is a specific pain point that everyone is dealing with now. When we first met them in 2017, [Hazy consisted of] 3 guys on laptops trying to solve [this] pressing issue for companies.”

While GDPR was “the triggering factor” for Partech Ventures’ investment, it was not the only factor. “Before GDPR, we had [data breach] scandals where private information is exposed to general knowledge because of hackers. There is an endless list, from Facebook to dating sites,” said Jean-Marc Patouillaud, a partner at Partech. “Companies had to react to avoid losing their customers and GDPR is an expression of the will of governments to react to more and more cases of private data hacking.”

The growing concern about data privacy among consumers is driven by scandal. The most high-profile has been the Facebook-Cambridge Analytica saga, in which it was revealed that Cambridge Analytica used over 50 million Facebook user profiles to target voters in ways that exploited their individual emotional traits and preferences. A survey commissioned by Privitar in late 2018 revealed a “backlash against the data revolution”: 90% of consumer respondents from the UK, France and the US said they view technological advancement as a risk to their data privacy, and 68% said they would stop using a brand if it did not protect their data. “Data privacy is now at the forefront of everyone’s minds. There’s a data breach every other week that is headline news. It has helped us get in the door,” says Harry Keen, CEO of Hazy, who has disabled his own Facebook account. “We say to companies, look, we have a solution to this, and people are eager to hear us out.”

The companies that have shown interest in Hazy have tended to be big financial institutions. Similarly, Aircloak’s main client is the German bank TeamBank, and one third of Privitar’s clients are in the financial sector, including HSBC and Nordea. Hazy has worked with startups, such as data management platform Mammoth, but Keen has found that startups are willing “to be a bit risky with their data” as they try to get off the ground. “The trend is that the more mature a company you are, the more you’re concerned about these compliance and security processes,” he said. “Startups are just trying to get their first customers and survive the next month.” But there can be a trickle-down effect. Hazy’s pilot with Nationwide involves anonymising a mortgage database so that the lender can share data with fintechs to generate insights on how to price its mortgages better.

Is data security a priority for startups?

On a startup’s list of priorities, data security generally comes behind “making sure your idea works” and “getting customers”. Harry Keen, CEO of Hazy, says that for this reason startups are less worried about data privacy than big companies “with lots of data and that are trying to get value out of that data.” It is true that the majority of data breaches happen to larger companies, says Bjoern Zinssmeister, whose company Templarbit tracks security breaches. “A startup has a much better overview of what their infrastructure looks like because teams are small and so is the inventory of assets.” But data breaches are happening in startup land. For example, the sales engagement startup Apollo may face penalties under GDPR after it was revealed in October 2018 that more than 200m records were stolen from its contacts database. “Startups have not been giving security enough focus, and neither have SMBs,” says Zinssmeister. Part of the problem is the lack of security solutions available to smaller companies.

While the buzz is new, the concern is not. Europe has had legislation on data protection since 1995 and domains such as health research have always needed to be careful when handling sensitive data. “There has been a new crop of such organisations and companies because the techniques for anonymisation were largely in academia until recently,” says Olivier Thereaux, Head of Technology at the Open Data Institute. “Anonymisation of data is not something you can apply a generic recipe to. Companies that understand the data and have the ability to create something a bit bespoke, that's an interesting business model.”

But how reliable are these “out of the box” compliance services? Aircloak’s website markets its service as “instant privacy compliance”, and in October 2017 the French data protection authority CNIL evaluated Aircloak as “GDPR-grade”. But when it comes to data, anonymity is not binary - it slides along a scale, with absolute privacy at one end and extreme utility at the other. “It’s not an easy question of taking a data set and sprinkling magic anonymisation dust on it,” says Jason du Preez, CEO of Privitar. “It’s about looking at it on a case by case basis. I have to think about direct and indirect identifiers, I need to think of the risk. Are we looking at emoji use or healthcare information?”

It’s not an easy question of taking a data set and sprinkling magic anonymisation dust on it.

Anonymising data places it outside the scope of GDPR, but the techniques are never watertight. There have been countless linkage attacks, in which supposedly anonymous data sets are re-identified by combining them with all the information that is swimming around on apps and the web. In January 2018, for example, de-identified location data released by the fitness app Strava accidentally exposed the location of US military bases. In 2007, a group of academics successfully re-identified individuals in an anonymised Netflix data set using movie ratings posted on the Internet Movie Database. Famously, graduate student Latanya Sweeney identified the Governor of Massachusetts, William Weld, in an anonymised log of state employee hospital visits using her general knowledge about the governor and a voter database for the city of Cambridge, Massachusetts.
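A toy version of such a linkage attack, in the spirit of the Sweeney example, is easy to sketch in Python. All of the records below are fabricated and the field names are illustrative: the point is that a “de-identified” table can still carry quasi-identifiers (postcode, birth date, sex) that match exactly one person in a public data set.

```python
# Fabricated, illustrative records - no real people.
hospital_log = [  # names removed, so nominally "anonymous"
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "asthma"},
    {"zip": "02139", "dob": "1972-03-12", "sex": "F", "diagnosis": "flu"},
]

voter_roll = [  # public record, names included
    {"name": "A. Example", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
    {"name": "B. Example", "zip": "02139", "dob": "1980-01-01", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "dob", "sex")


def link(deidentified, reference):
    """Re-identify rows whose quasi-identifiers match exactly one person."""
    for row in deidentified:
        key = tuple(row[q] for q in QUASI_IDENTIFIERS)
        matches = [person for person in reference
                   if tuple(person[q] for q in QUASI_IDENTIFIERS) == key]
        if len(matches) == 1:  # a unique match is a successful linkage
            yield matches[0]["name"], row["diagnosis"]


# The first hospital record matches one voter uniquely, so it is exposed:
print(list(link(hospital_log, voter_roll)))  # [('A. Example', 'asthma')]
```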

The UK is trying to legislate against this problem in the Data Protection Act 2018, which makes it an offence to: “knowingly or recklessly re-identify information that is de-identified personal data without the consent of the controller responsible for de-identifying the personal data.”

“If organisations think that a pure technological solution - like the ones that some companies offer - is enough to be off the hook, I think that is a mistake. They also need to have good governance,” says Thereaux. “If you anonymise then you’re outside of the boundaries of GDPR [but] the problem is that there is really no such thing as a 100% anonymised data set. There is always a little bit of a risk of re-identification.”  


This ever-present risk is now fuelling discussions about future revisions of GDPR and whether anonymised data should count as personal data after all. There was a scramble before GDPR was introduced - now, in the post-GDPR world, policymakers are still scrambling to close the regulatory gaps in the quest for good data governance.