Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory

Abstract

The interactive use of large language models (LLMs) in AI assistants (at work, home, etc.) introduces a new set of inference-time privacy risks: LLMs are fed different types of information from multiple sources in their inputs and we expect them to reason about what to share in their outputs, for what purpose and with whom, in a given context. In this work, we draw attention to the highly critical yet overlooked notion of contextual privacy by proposing ConfAIde, a benchmark designed to identify critical weaknesses in the privacy reasoning capabilities of instruction-tuned LLMs. Our experiments show that even the most capable models such as GPT-4 and ChatGPT reveal private information in contexts that humans would not, 39% and 57% of the time, respectively. This leakage persists even when we employ privacy-inducing prompts or chain-of-thought reasoning. Our work underscores the immediate need to explore novel inference-time privacy-preserving approaches, based on reasoning and theory of mind.

ConfAIde: Benchmarking Contextual Privacy Reasoning in LLMs

Tiered structure of our ConfAIde benchmark.

Our benchmark consists of four tiers, each with distinct evaluation tasks.

Tier 1 (Info-Sensitivity): assesses LLMs on their ability to understand the sensitivity of given information, using ten predefined information types.

Tier 2 (InfoFlow-Expectation): evaluates models' expectations of information flow using vignettes based on three contextual factors: information type, actor, and use. This tier includes two sub-tiers: Tier 2.a and Tier 2.b, with Tier 2.b expanding vignettes into short stories.

Tier 3 (InfoFlow-Control): tests the model's ability to control the flow of private information within complex scenarios involving three parties. It requires social reasoning skills, such as theory of mind.

Tier 4 (InfoFlow-Application): examines privacy reasoning in real-world scenarios, specifically in automatic action-item and summary generation from meeting transcripts.

Each tier has a specific design and evaluation focus, ranging from basic information sensitivity to more complex real-world applications.

Result Highlights

Breakdown of GPT-4 judgment over contextual factors, as we progress through tiers 1, 2.a and 2.b.

Effect of actor and use on privacy expectations (tiers 1-2.b). The figure above shows how GPT-4's judgment varies based on different contextual factors and data sensitivity, progressing through tiers 1, 2.a and 2.b. For example, the sensitivity of sharing Social Security Numbers (SSN) decreases when it's shared with insurance (Tier 2.a) instead of being highly sensitive (Tier 1). The figure also shows that sharing SSN with a doctor becomes less of a privacy concern when moving from Tier 2.a to 2.b with GPT-4.

Summary of Tier 3 and Tier 4 leakage percentage results: State of the art models (GPT-4 and ChatGPT) leak secrets at an alarming rate, and using chain of thought reasoning doesn't help!

High levels of leakage in theory of mind based scenarios. The figures above show the worst case leakage of ChatGPT and GPT-4 on Tiers 3 and 4 scenarios, where we ask the model to finish a story or to summarize a work meeting. We can see that the leakage of both models is alarmingly high, although we are using a privacy-inducing prompt which instructs the model to consider privacy norms before responding.

Does CoT reasoning help? To see if patching solutions work we use chain of thought reasoning and ask the model to think step by step. We observe that even CoT doesn't improve leakage, in fact it makes it slightly worse, underscoring the need for fundamental solutions!

So What's Next?

Inference-Time Privacy Definitions: Current research highlights a crucial gap in privacy definitions during the inference phase, which has significant implications. For instance, a recent attack exposed vulnerabilities in Bing Chat's initial prompt, affecting its interactions with users. We emphasize the need to address changes in model deployment and usage, particularly in interactive applications, and acknowledge the existence of various unexplored privacy concerns. These include the potential leakage of in-context examples to the model's output and conflicts between different data types in multi-modal models.

Fundamental Privacy Solutions: We demonstrate the complexity of addressing these issues and argue that ad hoc solutions like privacy-inducing prompts and output filters are insufficient for addressing the core challenge of contextual privacy reasoning. Prior efforts to mitigate biases and hallucinations in language models have shown that such safeguards can be easily bypassed by malicious inputs. To address these concerns, we advocate for fundamental and principled inference-time approaches, such as utilizing explicit symbolic graphical representations of individuals' beliefs to enhance decision-making while considering privacy and information flow.

Secret Disclosure and Moral Incentives: While our benchmark assesses models for their privacy reasoning abilities based on the theory of contextual integrity, we do not aim to dictate privacy standards or make normative judgments, as these judgments are intertwined with moral and cultural factors in interactions. Social psychologists have studied the moral incentives behind disclosing others' secrets as a form of punishment. We encourage further research into these incentives within language models.

BibTeX

@article{confaide2023,
  author    = {Mireshghallah, Niloofar and Kim, Hyunwoo and Zhou, Xuhui  and Tsvetkov, Yulia and Sap, Maarten and Shokri, Reza and Choi, Yejin},
  title     = {Can LLMs Keep a Secret? Testing Privacy  Implications of Language Models via Contextual Integrity Theory},
  journal   = {arXiv preprint arXiv:2310.17884},
  year      = {2023},
}

ConfAIde: Can LLMs Keep a Secret? Testing Privacy Implicationsof Language Models

Abstract

ConfAIde: Benchmarking Contextual Privacy Reasoning in LLMs

Result Highlights

So What's Next?

BibTeX

ConfAIde: Can LLMs Keep a Secret? Testing Privacy Implications
of Language Models