Navigating the Privacy Maze in Artificial Intelligence
Added: August 7, 2024
Data
We often read that “Data is the new Oil”, sometimes followed by “and AI is the new electricity”. Metaphors aside, if we had to define an atomic unit for Machine Learning, it would indeed be Data. The International Organization for Standardization (ISO) defines Data as a “re-interpretable representation of information in a formalized manner suitable for communication, interpretation, or processing”. Consequently, virtually anything can be Data! Nonetheless, this asset’s importance depends on the value it holds for specific targets, as the journey from Data to Insights relies on many different factors.
This data we produce, record, and process is increasingly used in ways that directly affect our daily lives, even if we aren’t completely aware of its pervasiveness. Data defines the ads we see, the show recommendations we get, whether prices go up or down, whether we get credit approved, and even supports our medical diagnoses.
The world produces approximately 329 million Terabytes of data every day, with an estimated 120 Zettabytes (1E21, or 1-followed-by-21-zeros, bytes) generated globally in 2023¹. To provide some context, this is the equivalent of filling up 60 billion (!) of the 2TB hard drives we all use, or have at least seen around. These hard drives would fill over 9,370 Great Pyramids of Egypt. Video accounts for more than half (~54%) of all Internet traffic, with more than 1 billion hours of YouTube videos watched every day, for example. Messaging and social media are also significant, with more than 333 billion e-mails sent daily (around two-thirds of which are spam!) and over 100 billion WhatsApp messages shared per day². To make it more impressive: in 2020, the number of bytes of digital data surpassed the observable star count in the universe³!
Data Privacy
Data Privacy relates to the right of individuals to control how their personal information is collected and used. However, with this much Data going around, where everything we do, read, see, and listen to gets registered somehow, it is very hard to avoid leaving an irreversible digital footprint. Privacy concerns are not just prevalent but escalating. Even without considering AI, can privacy truly exist in a Digital World? How many of us have read a full privacy policy before using a digital service? These seem to be written by lawyers, for lawyers, when the regular user should be the real audience.
Private or sensitive data can come from different sources:
- Explicit: which include social security numbers, financial records, medical records, or biometric data, among others.
- Implicit: although it takes a few extra steps, we can derive private data from a user’s search queries, location data, or purchase history. Even our garbage can tell a lot about us, if we dig enough.
The problem gets worse when the capacity to process information increases, as data can be cross-referenced from multiple sources, and patterns start to emerge. From a single video feed, we can (technically) recognize individual faces, track locations with visual georeferencing, or do behavioural analysis.
Even for well-intended applications, three major challenges arise in the context of Data Privacy in AI⁴:
- Data Persistence: data can persist far longer than the subjects who created it, especially when storage is so cheap.
- Data Repurposing: data has value beyond its original scope and intent. Can/should we use it?
- Data Spillovers: often data collection affects unintended subjects, who did not consent.
Although some challenges remain open, there are specific solutions that should be considered from the start, such as freely given informed consent, the ability to opt out and delete data, limiting data collection to the strictly necessary, and transparently describing the nature and scope of the AI processing. It seems inevitable that Data Privacy involves a trade-off between the value of (personal) data and the risks it poses to individuals and society.
To balance the scales, we should consider the Risks and Benefits of making use of potentially private data:
Risks:
- Data Breaches
  - Compromised personal security and safety
  - Identity theft
  - Financial loss
  - Reputational damage
- Surveillance and Monitoring
  - Potential for abuse by governments and corporations
  - Bias and discrimination
  - Forbidden in most scenarios of the EU AI Act

Benefits:
- Personalization and convenience
  - Personalized services and recommendations (e.g., treatment plans, insurance, learning)
  - Enhanced user experience and satisfaction
- Innovation and Efficiency
  - Improved healthcare systems
  - Better and more cost-effective business models
  - More efficient public services
- Public safety
  - Threat detection
  - Efficient emergency response
Real-world examples of the dangers of neglecting Data Privacy are easy to find. The infamous Cambridge Analytica scandal in the 2016 US presidential election revealed the nefarious potential of psychographic profiling of Facebook users⁵, showcasing a threat to democracy and causing an erosion of trust. Clearview AI violated Canadian privacy laws by collecting photographs of Canadian adults and even children for mass surveillance and facial recognition without consent, for commercial sale⁶. Amazon’s Ring smart home security device, including surveillance cameras and an app, was found to contain third-party trackers sending out customers’ private data to Analytics and Marketing departments without notification or consent⁷.
How can we balance the scales in favour of protecting users’ privacy?
1) Enhancing Privacy Protection
2) Ethical and Legal AI
3) Fostering transparency and accountability
Enhancing Privacy Protection
Two major streams can be followed to enhance privacy protection in the context of AI applications. The first relates to a concept defined in the General Data Protection Regulation (GDPR): Data Minimization. The principle of data minimization holds that we should collect only what is necessary, with informed consent. Additionally, data should be anonymized wherever and whenever possible.
The second stream is a very active area of research, which we are also exploring in the Center for Responsible AI. Privacy-Enhancing Technologies (PETs) are methods that allow us to leverage the power and value of data while minimizing the associated privacy risks. These include techniques like Federated Learning, Differential Privacy, Homomorphic Encryption, and the use of Synthetic Data.
Privacy-Enhancing Technologies: Exploring the Arsenal
In AI, safeguarding privacy isn’t just about controlling access. We must rethink how data is processed. In a simplified taxonomy, as shown in Figure 1, we can look at how and where the data is processed: data can be kept at the source(s), processed while encrypted, partially changed, or completely replaced. Here are some examples of different but complementary technologies that promise to redefine privacy in AI; a minimal, hedged code sketch of each follows the list:
- Federated Learning: Keeps data localized, allowing models to learn from decentralized datasets without the need for data centralization, thus minimizing exposure. It is ideal for scenarios where data cannot be shared due to privacy concerns or regulatory restrictions, but it may suffer from reduced model performance due to heterogeneous data distributions.
- Secure Multi-Party Computation: A cryptographic protocol enabling parties to jointly compute a function over their inputs while keeping those inputs private. Even though they collaborate, no participant learns anything about the other parties’ private data. It is particularly useful in scenarios where multiple parties need to collaborate without revealing their individual data, such as in cross-institutional medical research or financial services.
- Differential Privacy: Adds mathematical noise to datasets or models, ensuring to some degree (a “privacy budget”) that individual data points remain anonymous even when insights are extracted en masse. It is highly effective in protecting user privacy in large-scale data analyses but can lead to decreased accuracy in data insights and model learning.
- Homomorphic Encryption: Allows computations on encrypted data, producing an encrypted result that, when decrypted, matches the outcome of operations performed on the raw data. It offers strong security guarantees but often at the cost of increased or prohibitive computational overhead.
- (Pseudo) Anonymization: Alters personal data so that individuals cannot be identified directly or indirectly. Techniques include data masking, pseudonymization, and data aggregation, which replace or remove Personally Identifiable Information (PII). The key step is identifying PII, and it should be used with caution since, if poorly implemented, it can be reversed with additional external information.
- Synthetic Data: Generates artificial datasets that mimic the statistical properties of real data, providing a sandbox for developers without exposing sensitive information. It provides a good balance between privacy and utility, and is especially useful for testing and development environments, though care must be taken to ensure it accurately reflects the characteristics of real data.
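To make the Federated Learning bullet concrete, here is a minimal Federated Averaging (FedAvg) sketch in plain NumPy. The toy linear-regression task, the number of clients, and all hyperparameters are illustrative assumptions, not a reference implementation of any particular framework.

```python
# Minimal Federated Averaging (FedAvg) sketch: raw data stays on each client,
# and only model weights travel to the server. Toy task and hyperparameters
# are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: plain gradient descent on linear regression."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Each client keeps its (X, y) locally; the server never sees raw records.
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]
global_w = np.zeros(3)

for round_ in range(10):
    # Clients train locally on private data and return only updated weights.
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    # The server aggregates by averaging the weight vectors.
    global_w = np.mean(local_ws, axis=0)

print("global weights after federated training:", global_w)
```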
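For Secure Multi-Party Computation, the core trick can be illustrated with additive secret sharing, a building block of many SMPC protocols. This is a hedged sketch: the modulus, the three-party setup, and the private values are invented for illustration, and a real protocol would add secure channels and authentication.

```python
# Additive secret sharing: each secret is split into random shares that sum
# to it modulo a large prime. Any single share is uniformly random noise, so
# parties can jointly compute a sum without revealing individual inputs.
import random

Q = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret, n_parties=3):
    """Split a secret into n random shares that sum to it modulo Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

# Three parties want the total of their private values without revealing them.
private_inputs = [120, 75, 310]
all_shares = [share(v) for v in private_inputs]

# Party i receives the i-th share of every input; each share alone is random.
per_party = list(zip(*all_shares))
partial_sums = [sum(p) % Q for p in per_party]

# Combining the partial sums reveals only the aggregate, never the inputs.
print(reconstruct(partial_sums))  # -> 505
```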
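The Laplace mechanism below is a minimal illustration of Differential Privacy for a counting query (sensitivity 1). The dataset, the predicate, and the epsilon values are assumptions for demonstration only.

```python
# Laplace mechanism for a differentially private count: noise scaled to
# sensitivity/epsilon is added to the true answer. Smaller epsilon means
# stronger privacy but a noisier result -- epsilon is the "privacy budget".
import numpy as np

rng = np.random.default_rng(42)

def dp_count(data, predicate, epsilon):
    """Counting query (sensitivity 1) made epsilon-DP via Laplace noise."""
    true_count = sum(predicate(x) for x in data)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)  # scale = sensitivity / epsilon
    return true_count + noise

ages = [23, 35, 45, 52, 61, 70, 19, 38]
for eps in (0.1, 1.0, 10.0):
    print(eps, round(dp_count(ages, lambda a: a >= 40, eps), 2))
```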
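Homomorphic encryption is easiest to glimpse through textbook RSA, which happens to be multiplicatively homomorphic. The sketch below is deliberately insecure (tiny primes, no padding) and only demonstrates the principle that one can compute on ciphertexts; production systems use dedicated schemes such as Paillier or CKKS.

```python
# Toy demonstration of (multiplicative) homomorphism with textbook RSA.
# NOT secure -- for illustration only: real keys use primes thousands of
# bits long and proper padding.
p, q = 61, 53                # toy primes
n, phi = p * q, (p - 1) * (q - 1)
e = 17
d = pow(e, -1, phi)          # modular inverse of e (Python 3.8+)

encrypt = lambda m: pow(m, e, n)
decrypt = lambda c: pow(c, d, n)

a, b = 7, 6
# Multiply the two ciphertexts without ever decrypting them...
c_product = (encrypt(a) * encrypt(b)) % n
# ...and the decrypted result matches the product of the plaintexts.
print(decrypt(c_product))  # -> 42
```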
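Finally, a combined sketch of pseudonymization (replacing PII with a keyed hash) and a very crude form of synthetic data (sampling from statistics fitted to the real column). The field names, records, and secret key are all hypothetical, and a real synthetic-data generator would model far richer structure than a single Gaussian.

```python
# Pseudonymization via keyed hashing plus a toy synthetic-data generator.
# Field names, records, and the key are hypothetical examples.
import hashlib, hmac
import numpy as np

SECRET_KEY = b"rotate-and-store-me-in-a-vault"  # hypothetical key

def pseudonymize(pii_value: str) -> str:
    """Replace PII with a keyed hash; without the key, linking back is hard."""
    return hmac.new(SECRET_KEY, pii_value.encode(), hashlib.sha256).hexdigest()[:16]

records = [{"name": "Alice Silva", "age": 34}, {"name": "Bob Costa", "age": 51}]
masked = [{"id": pseudonymize(r["name"]), "age": r["age"]} for r in records]
print(masked)

# Synthetic data: fit simple statistics on the real column, then sample.
rng = np.random.default_rng(7)
ages = np.array([r["age"] for r in records], dtype=float)
synthetic_ages = rng.normal(ages.mean(), ages.std(), size=100)
print(synthetic_ages[:3].round(1))
```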
Organizations should choose the appropriate PET based on their specific data privacy needs, the sensitivity of the data involved, and the specific use case at hand. Integrating these technologies into AI systems from the start, in line with the ICO’s recommendations⁸ and adhering to privacy-by-design principles, not only complies with legal frameworks like the GDPR but also builds trust with users and stakeholders.
Ethical and Legal AI: Building Trust through Regulation
To navigate the complexities of AI and privacy, adhering to ethical guidelines and legal frameworks isn’t a nice-to-have — it’s imperative. Integrating privacy and fairness from the design phase ensures that AI systems are not only compliant but also ethically aligned with societal values. Regulations like the GDPR and the newly enacted EU AI Act provide a structural framework for compliance, requiring practices such as transparency, data protection by design, and strict breach notifications. Moreover, advocating for global standards is crucial as digital technologies transcend national boundaries, necessitating a harmonized approach to privacy.
GDPR and AI Act: Pillars of AI Privacy
- General Data Protection Regulation (GDPR): This regulation revolutionized data privacy with its rigorous demands for consent and rights, including the right to be forgotten, and introduced the necessity for data protection by design.
- EU AI Act: A pioneering effort specifically tailored to govern AI, categorizing AI systems by risk and setting strict requirements for high-risk applications, including transparency about data governance and ensuring human oversight.
These frameworks shouldn’t be seen as bureaucratic hurdles. They are foundational to building trust in AI systems. By complying, organizations not only mitigate risks but also enhance their credibility and foster public trust.
Fostering transparency and accountability
AI solutions must strive for transparency, not just in what concerns the interpretability of the underlying models, to avoid the so-called “black-box” models, but especially to improve the transparency of the overall process, from design to monitoring. This includes Data and Model Cards, which we can think of as the AI analogues of the drug leaflets that come with every medicine we buy at the pharmacy: we may not read the whole thing, but information on the performed tests and expected side effects is available. A minimal sketch of the fields a Model Card typically covers follows below.
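As an illustration of the leaflet analogy, here is a hedged sketch of the kind of fields a Model Card typically documents, in the spirit of Mitchell et al.’s “Model Cards for Model Reporting”; the model name and all concrete values are invented.

```python
# Hypothetical Model Card contents; field names follow common practice,
# values are invented for illustration.
model_card = {
    "model_details": {"name": "credit-scoring-v2", "version": "2.1", "owners": ["ML team"]},
    "intended_use": "Pre-screening of consumer credit applications; not for final decisions.",
    "training_data": "Anonymized loan applications, 2018-2022 (see the companion Data Card).",
    "evaluation": {"metric": "AUC", "overall": 0.87, "by_group": {"age<30": 0.84, "age>=30": 0.88}},
    "limitations": "Not validated for self-employed applicants.",
    "ethical_considerations": "Audited for disparate impact across protected groups.",
}
```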
The AI Act and its obligations will make auditing an almost routine procedure for every (high-risk) AI application. As such, accountability mechanisms should be put into place, including frameworks to facilitate auditing processes, AI Ethics Boards, and user feedback loops.
Finally, public awareness and education are vital to improving Data and AI literacy, especially regarding their impact on privacy and safety. Every citizen should be educated on their digital rights.
¹ https://www.statista.com/statistics/871513/worldwide-data-created/
² https://explodingtopics.com/blog/data-generated-per-day
³ https://edgedelta.com/company/blog/how-much-data-is-created-per-day
⁵ Manheim and Kaplan, Artificial Intelligence: Risks to Privacy and Democracy, 21 Yale J.L. & Tech. 106 (2019), yjolt.org
⁶ “U.S. technology company Clearview AI violated Canadian privacy law: report”, CBC News
⁷ https://cloudsecurityalliance.org/blog/2022/03/26/amazon-ring-a-case-of-data-security-and-privacy
⁸ https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/data-sharing/privacy-enhancing-technologies/