
Navigating the Privacy Maze in Artificial Intelligence


André Carreiro

Data

We can often read that “Data is the new Oil”, sometimes followed by “and AI is the new electricity”. Metaphors aside, if we had to define an atomic unit for Machine Learning, it would indeed be Data. The International Organization for Standardization (ISO) defines Data as a “re-interpretable representation of information in a formalized manner suitable for communication, interpretation, or processing”. Consequently, virtually anything can be Data! Nonetheless, this asset’s importance depends on the value it holds for specific targets, as the journey from Data to Insights relies on many different factors.

This data we produce, record, and process is increasingly used in ways that directly affect our daily lives, even if we aren’t completely aware of its pervasiveness. Data defines the ads we see, the show recommendations we get, whether prices go up or down, and whether we get a loan approved; it even supports our medical diagnoses.

The world produces approximately 329 million Terabytes every day, with an estimated 120 Zettabytes (1E21, or a 1 followed by 21 zeros, bytes) generated globally in 2023 [1]. To provide some context, this is the equivalent of filling 60 billion (!) of the 2TB hard drives we all use, or have at least seen around; those drives would fill over 9370 Great Pyramids of Egypt. Video accounts for more than half (~54%) of all Internet traffic, with, for example, more than 1 billion hours of YouTube video watched every day. Messaging and social media are also significant, with more than 333 billion e-mails sent daily (around two-thirds of which are spam!) and over 100 billion WhatsApp messages shared per day [2]. To make it more impressive: in 2020, the number of bytes of digital data surpassed the observable star count in the universe [3]!
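These headline figures can be sanity-checked with some quick arithmetic (using decimal units, and treating all the numbers as approximations):

```python
TB = 10**12  # bytes in a terabyte (decimal convention)
ZB = 10**21  # bytes in a zettabyte

per_day = 329e6 * TB   # ~329 million TB generated daily
per_year = per_day * 365

print(per_year / ZB)         # ≈ 120 ZB per year
print(120 * ZB / (2 * TB))   # ≈ 6e10, i.e. 60 billion 2TB drives
```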

Data Privacy

Data Privacy relates to the right of individuals to have control over how their personal information is collected and used. However, with this much Data going around, where everything we do, read, see, and listen to, gets somehow registered, it is very hard to avoid leaving an irreversible digital footprint. Privacy concerns are not just prevalent but escalating. Even without considering AI, can privacy truly exist in a Digital World? How many of us have read a full privacy policy before using a digital service? These seem to be written by lawyers, for lawyers, when the regular user should be the real target.

Private or sensitive data can come from many different sources.

The problem gets worse when the capacity to process information increases, as data can be cross-referenced from multiple sources, and patterns start to emerge. From a single video feed, we can (technically) recognize individual faces, track locations with visual georeferencing, or do behavioural analysis.

Even for well-intended applications, three major challenges arise in the context of Data Privacy in AI [4]:

  1. Data Persistence: data can persist far longer than the subjects who created it, especially when storing data is so cheap.
  2. Data Repurposing: data has value beyond its original scope and intent. Can/should we use it?
  3. Data Spillovers: often data collection affects unintended subjects, who did not consent.

Although some challenges are still open, there are specific solutions that should be considered from the start, such as freely given informed consent, the ability to opt-out and delete data, limiting data collection to the strictly necessary, and transparently describing the nature and scope of the AI processing. It seems inevitable that Data Privacy considers a trade-off between the value of the (personal) data and its risks for individuals and society. 

To balance the scales, we should consider the Risks and Benefits of making use of potentially private data:

Risks:

Benefits:


Real-world examples of the dangers of neglecting Data Privacy are easy to find. The infamous Cambridge Analytica scandal in the 2016 US presidential election revealed the nefarious potential of psychographic profiling of Facebook users [5], showcasing a threat to democracy and causing an erosion of trust. Clearview AI violated Canadian privacy laws by collecting photographs of Canadian adults and even children, without consent, for mass surveillance, facial recognition, and commercial sale [6]. Amazon’s Ring smart home security products, including surveillance cameras and an app, were found to contain third-party trackers sending out customers’ private data to analytics and marketing companies without notification or consent [7].

How can we balance the scales in favour of protecting users’ privacy?
1) Enhancing Privacy Protection
2) Ethical and Legal AI
3) Fostering transparency and accountability

Enhancing Privacy Protection

Two major streams can be followed to enhance privacy protection in the context of AI applications. The first relates to a concept defined in the General Data Protection Regulation (GDPR): Data Minimization. The principle of data minimization holds that we should collect only what is necessary, with informed consent. Additionally, data should be anonymized wherever and whenever possible.
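As a minimal sketch of what data minimization can look like at ingestion time (the record, field names, and salt below are all hypothetical): drop every field the stated purpose does not require, and replace direct identifiers with a pseudonym.

```python
import hashlib

# Hypothetical raw record -- more than a step-tracking service actually needs.
raw_record = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "birth_date": "1990-04-01",
    "heart_rate": 72,
    "step_count": 8421,
}

NEEDED_FIELDS = {"heart_rate", "step_count"}  # defined by the stated purpose

def minimize(record, needed=NEEDED_FIELDS, salt=b"per-deployment-secret"):
    """Keep only the necessary fields; replace direct identifiers with a pseudonym."""
    kept = {k: v for k, v in record.items() if k in needed}
    # A salted hash lets us link a subject's records without storing the e-mail.
    kept["subject_id"] = hashlib.sha256(salt + record["email"].encode()).hexdigest()[:12]
    return kept

print(minimize(raw_record))
```

Note that a salted hash is pseudonymization rather than anonymization in GDPR terms: anyone holding the salt can still link records back to a person.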

The second stream is a very active area of research, which we are also exploring in the Center for Responsible AI. Privacy-Enhancing Technologies (PETs) are methods that allow us to leverage the power and value of data while minimizing the associated privacy risks. These include techniques like Federated Learning, Differential Privacy, Homomorphic Encryption, and the use of Synthetic Data.
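To give a flavour of the first of these techniques, here is a minimal sketch of the aggregation step at the heart of Federated Learning (FedAvg-style weighted averaging); the function and the toy clients are illustrative, not a production implementation:

```python
def federated_average(client_weights, client_sizes):
    """Aggregate locally trained model parameters (FedAvg-style).

    client_weights: list of parameter vectors, one per client.
    client_sizes: number of local training samples per client, used as weights.
    The server only ever sees parameters -- raw data never leaves the clients.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * size for w, size in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two hypothetical clients with locally trained 2-parameter models:
print(federated_average([[1.0, 2.0], [3.0, 4.0]], [100, 300]))  # → [2.5, 3.5]
```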

Privacy-Enhancing Technologies: Exploring the Arsenal

In AI, safeguarding privacy isn’t just about controlling access. We must rethink how data is processed. In a simplified taxonomy, as shown in Figure 1, we can look at how and where the data is processed. Data can be kept at the source(s), processed while encrypted, partially changed, or completely replaced. Here are some examples of different but complementary technologies that promise to redefine privacy in AI:

  1. Federated Learning: models are trained across multiple devices or institutions, so the raw data never leaves its source.
  2. Differential Privacy: carefully calibrated noise is added to queries or training, so no single individual’s data can be inferred from the output.
  3. Homomorphic Encryption: computations are performed directly on encrypted data, which the processing party never decrypts.
  4. Synthetic Data: artificial records that preserve the statistical properties of the original data replace the real, sensitive records.
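To make one of these concrete, here is a minimal sketch of Differential Privacy’s Laplace mechanism applied to a counting query. This is a sketch only (the ages are made up, and production systems should use vetted DP libraries):

```python
import math
import random

def dp_count(values, predicate, epsilon=1.0):
    """Epsilon-differentially-private count via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for v in values if predicate(v))
    # Sample Laplace(0, 1/epsilon) noise by inverse-transform sampling.
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# Hypothetical ages; the released count is noisy, so no individual is pinned down.
ages = [23, 35, 41, 52, 48, 29, 61, 44]
print(dp_count(ages, lambda a: a >= 40, epsilon=0.5))
```

Smaller values of `epsilon` mean more noise and stronger privacy; the released answer stays useful in aggregate while hiding any one person’s contribution.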

Organizations should choose the appropriate PET based on their specific data privacy needs, the sensitivity of the data involved, and the use case at hand. Integrating these technologies into AI systems from the start, in line with the ICO’s recommendations [8] and adhering to privacy-by-design principles, not only complies with legal frameworks like the GDPR but also builds trust with users and stakeholders.

Figure 1: A simplified taxonomy of Privacy-Enhancing Technologies, by how and where the data is processed.

Ethical and Legal AI: Building Trust through Regulation

To navigate the complexities of AI and privacy, adhering to ethical guidelines and legal frameworks isn’t a nice-to-have; it’s imperative. Integrating privacy and fairness from the design phase ensures that AI systems are not only compliant but also ethically aligned with societal values. Regulations like the GDPR and the newly enacted EU AI Act provide a structural framework for compliance, requiring practices such as transparency, data protection by design, and strict breach notifications. Moreover, advocating for global standards is crucial, as digital technologies transcend national boundaries and necessitate a harmonized approach to privacy.

GDPR and AI Act: Pillars of AI Privacy

These frameworks shouldn’t be seen as bureaucratic hurdles. They are foundational to building trust in AI systems. By complying, organizations not only mitigate risks but also enhance their credibility and foster public trust.

Fostering transparency and accountability

AI solutions must strive for transparency: not just in what concerns the interpretability of the underlying models, to avoid the so-called “black-box” models, but especially to improve the transparency of the overall process, from design to monitoring. This includes Data and Model Cards, which we can think of as the AI analogues of the drug leaflets that come with every medicine we buy at the pharmacy. We may not read the whole thing, but information on the performed tests and expected side effects is available.
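As a toy illustration of the idea (every field name and value below is invented for this sketch; real model-card formats are richer), a Model Card can be as simple as structured metadata that travels with the model:

```python
# All values below are illustrative, not from a real system.
model_card = {
    "model_details": {"name": "example-fall-risk-classifier", "version": "0.3.1"},
    "intended_use": "Decision support for clinicians; not a standalone diagnostic tool.",
    "training_data": "De-identified sensor recordings from consenting adult volunteers.",
    "evaluation": {"dataset": "held-out test split", "metric": "ROC AUC", "value": 0.87},
    "limitations": "Not validated for people under 18 or for in-hospital use.",
    "ethical_considerations": "Informed consent collected; opt-out and deletion honoured.",
}

# Like a drug leaflet, it answers: what was tested, on whom,
# and what are the known "side effects"?
for section, content in model_card.items():
    print(f"{section}: {content}")
```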

The AI Act and its obligations will make auditing an almost routine procedure for every (high-risk) AI application. As such, accountability mechanisms should be put into place, including frameworks to facilitate auditing processes, AI Ethics Boards, and user feedback loops.

Finally, public awareness and education are vital to improve Data and AI literacy, especially their impact on privacy and safety. Every citizen should be educated on their digital rights.

[1] https://www.statista.com/statistics/871513/worldwide-data-created/

[2] https://explodingtopics.com/blog/data-generated-per-day

[3] https://edgedelta.com/company/blog/how-much-data-is-created-per-day

[4] isaca.org/resources/news-and-trends/isaca-now-blog/2021/beware-the-privacy-violations-in-artificial-intelligence-applications

[5] Manheim and Kaplan, “Artificial Intelligence: Risks to Privacy and Democracy”, 21 Yale J.L. & Tech. 106 (2019), yjolt.org.

[6] “U.S. technology company Clearview AI violated Canadian privacy law: report”, CBC News.

[7] https://cloudsecurityalliance.org/blog/2022/03/26/amazon-ring-a-case-of-data-security-and-privacy

[8] https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/data-sharing/privacy-enhancing-technologies/

BW André Carreiro Fraunhofer AICOS

André Carreiro

Biomedical Engineer with a strong background in Data Science and Machine Learning. He received his MSc and PhD in Biomedical Engineering from Técnico Lisboa, where he focused on the application of data science and machine learning to healthcare data, specifically in neurodegenerative diseases. Since completing his PhD in 2016, André has worked as a software developer, data scientist, and AI engineer in a variety of settings, including startups and a research organization. He is currently a senior researcher wearing the hats of project manager, mentor, and AI technical lead at Fraunhofer AICOS, where he is working on projects related to healthcare, retail, and manufacturing. André is also leading AICeBlock - a project focused on responsible AI, with the goal of building a platform to support the certification of AI solutions.