Towards Robust Evaluation of Unlearning in LLMs via Data Transformations

Abhinav Joshi¹, Shaswati Saha², Divyaksh Shukla¹, Sriram Vema², Harsh Jhamtani³, Manas Gaur², Ashutosh Modi¹

¹IIT Kanpur, ²University of Maryland Baltimore County, ³Microsoft


Picture: The pipeline of using open-weight LLMs to train/finetune over new information (Finetuned-LLM). Later, when an unlearning request arises, the new information is split into the Retain and Forget sets. Unlearning algorithms aim to reach the Target-LLM (trained/finetuned only on the Retain set) at a lower cost than training/finetuning the pretrained open-weight LLM again. The spider plot compares the performance of the Finetuned-LLM (green) and the Unlearned-LLM (blue) on the forget set in different formats. Although these unlearning algorithms show forgetting behavior in the default format (the Q&A performance of the Finetuned-LLM is reduced after unlearning), the performance gap varies significantly when the same information is evaluated in different formats (MCQA, Analogy, Cloze, OddOneOut, and Comprehension). Note that different formats in the spider plot use different metrics (see App. C in the paper), and Cloze test performance is scaled 10x for better visibility.

Measuring Effectiveness of Unlearning via Data Transformation

In our study, we use a recent machine unlearning benchmark, TOFU (Maini et al., 2024), which simulates the unlearning of new information as details about 200 fictitious authors. TOFU represents all the information about each author with 20 Q&A queries, for a total of 4k Q&A pairs. To study the effect of data format, we choose three new formats that cover different aspects of knowledge retrieval over the same information: MCQA (Multiple Choice Question Answering), Cloze, and Analogy (see the right side of the top figure for examples). These formats ask similar questions in a different style. We also propose two additional formats, Odd-one-out and Comprehension, to enhance the evaluation quality. The table below shows the dataset formats currently available in the ReLU dataset.

Format	Format Name	Description
Default Format (TOFU)	Q&A	The default format provided in the TOFU dataset.
Format-1	MCQA (Multiple Choice Question Answering)	For each of the queries present in the default Q&A format, the same question is rephrased by providing multiple options for the answers.
Format-2	Cloze	The queries are provided with a passage in which certain words (at the end of a sentence) are missing, masking out information specific to an author.
Format-3	Analogy	Helps validate whether the network can relate entities (e.g., author name → birth year :: author name → country) by providing some examples in the context (ICL) and asking about another author as the query.
Format-4	Odd-One-Out	The query asks the model to choose the odd one out from a set of options, where one option comes from the retain (or forget) set and the remaining wrong options come from the other set (forget or retain).
Format-5	Comprehension	Provides all the information in the context and asks the same questions in different styles such as Q&A, MCQA, etc.
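To make the idea of a data transformation concrete, here is a minimal sketch of how a single default-format Q&A pair could be rewritten as an MCQA item and as a Cloze item. It is an illustration only and not the pipeline used to build the ReLU dataset: the `QAPair` class, the `to_mcqa` and `to_cloze` helpers, the dictionary keys, and the mask token are all hypothetical names, and the distractor options are assumed to be supplied by the caller.

```python
import random
import string
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QAPair:
    """One default-format record: a question and its gold answer (hypothetical schema)."""
    question: str
    answer: str


def to_mcqa(qa: QAPair, distractors: List[str], seed: int = 0) -> Dict[str, object]:
    """Rephrase a Q&A pair as a multiple-choice question.

    The distractors come from the caller; how they are selected is an
    assumption left outside this sketch.
    """
    options = [qa.answer] + list(distractors)
    random.Random(seed).shuffle(options)  # seeded so the option order is reproducible
    labels = string.ascii_uppercase[: len(options)]
    lines = [f"{label}. {option}" for label, option in zip(labels, options)]
    return {
        "prompt": qa.question + "\n" + "\n".join(lines),
        "options": options,
        "label": labels[options.index(qa.answer)],
    }


def to_cloze(statement: str, answer: str, mask: str = "____") -> Dict[str, str]:
    """Mask the last occurrence of the answer in a declarative statement."""
    start = statement.rfind(answer)
    if start == -1:
        raise ValueError("answer does not occur in the statement")
    masked = statement[:start] + mask + statement[start + len(answer):]
    return {"prompt": masked, "target": answer}
```

For an invented record such as `QAPair("Where was the author born?", "Lisbon")`, `to_mcqa` returns the question followed by lettered options with the gold label recorded separately, and `to_cloze("The author was born in Lisbon.", "Lisbon")` returns the statement with the final answer span masked. Both items probe the same fact as the original Q&A pair, which is what allows forgetting to be compared across formats.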

Experimental Findings

If unlearning went perfectly, we would expect the unlearned model to perform the same as a pretrained model on the forget set.

Performance of Llama2-7b on the proposed formats of the TOFU forget dataset for the base, fine-tuned, and unlearned models (with the gradient-diff algorithm). Performance measures the ability of the language model to retrieve the author’s information from the forget set. In an ideal scenario, the unlearned model performs the same as the pretrained model on the forget set, indicating that it has forgotten the information in the forget set.

Performance of Llama2-7b on our formats of the TOFU retain dataset for the base, fine-tuned, and unlearned models (with the gradient-diff algorithm). In contrast to the above figure, here the performance measures the ability of the language model to retrieve information from the retain set. Ideally, the performance of the Unlearned-LLM should be on par with the Finetuned-LLM and higher than that of the Pretrained-LLM.
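The two ideal conditions described in the captions above (the unlearned model matches the pretrained model on the forget set, and stays on par with the fine-tuned model on the retain set) can be expressed as per-format gaps. The sketch below is one possible way to compute them, not the evaluation code from the paper; the function names and the `Scores` type are hypothetical. Because each format uses its own metric, the gaps are kept per format rather than averaged.

```python
from typing import Dict

# Format name -> score under that format's own metric (hypothetical container).
Scores = Dict[str, float]


def per_format_gap(model: Scores, reference: Scores) -> Scores:
    """Return model - reference for every format present in both dicts."""
    shared = sorted(model.keys() & reference.keys())
    return {fmt: model[fmt] - reference[fmt] for fmt in shared}


def residual_knowledge(unlearned_forget: Scores, pretrained_forget: Scores) -> Scores:
    """Forget set: ideally close to 0 in every format (unlearned matches pretrained)."""
    return per_format_gap(unlearned_forget, pretrained_forget)


def retain_loss(unlearned_retain: Scores, finetuned_retain: Scores) -> Scores:
    """Retain set: ideally close to 0 in every format (unlearned on par with fine-tuned)."""
    return per_format_gap(finetuned_retain, unlearned_retain)
```

A positive entry in `residual_knowledge` for one format alongside a near-zero entry for Q&A would correspond to the behavior the spider plot illustrates: forgetting that appears complete in the default format but not in a transformed one.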

Remarks/Discussion: Current unlearning benchmarks are limited, and the knowledge they evaluate is maintained in only one dataset format. For future approaches, we recommend a few settings that target different unlearning objectives using various dataset formats. In this work, we only considered previous approaches where learning and unlearning both happen in one format (Q&A in our case). However, the knowledge represented by these formats is the same, so one could learn in one format and try unlearning in another. In another setting, one could assume the model is trained on multiple formats (for example, Q&A and MCQA), where one of the formats (MCQA) is unavailable for unlearning. In this case, a better unlearning algorithm would be able to sufficiently unlearn the requested knowledge from the single available format. Moreover, a wide combination of learning and unlearning formats can be chosen to quantify the robustness of future unlearning approaches.
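As a rough illustration of the settings suggested above, the following sketch enumerates (learn format, unlearn format) combinations and the held-out case in which one training format is unavailable at unlearning time. The function names are hypothetical, and treating every format in the table as a possible learning format is an assumption made here for illustration.

```python
from itertools import product
from typing import Iterator, List, Sequence, Tuple

# Format names taken from the table of available formats.
FORMATS: List[str] = ["Q&A", "MCQA", "Cloze", "Analogy", "Odd-One-Out", "Comprehension"]


def cross_format_settings(formats: Sequence[str]) -> List[Tuple[str, str]]:
    """All (learn_format, unlearn_format) pairs, including same-format pairs."""
    return list(product(formats, repeat=2))


def held_out_settings(
    train_formats: Sequence[str],
) -> Iterator[Tuple[str, Tuple[str, ...]]]:
    """For a model trained on several formats, hide one format at unlearning time.

    Yields (unavailable_format, formats_available_for_unlearning).
    """
    for hidden in train_formats:
        available = tuple(f for f in train_formats if f != hidden)
        if available:  # skip the degenerate case with nothing left to unlearn from
            yield hidden, available
```

With the six formats listed, `cross_format_settings(FORMATS)` gives 36 combinations, of which the same-format Q&A pair is the setting studied in this work. `held_out_settings(["Q&A", "MCQA"])` includes the example from the discussion, where MCQA is unavailable and unlearning must rely on Q&A alone.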

We hope the curated dataset transformation in 5 different formats will be a useful resource for future benchmarking of unlearning algorithms.

BibTeX


      @inproceedings{joshi-etal-2024-towards,
        title = "Towards Robust Evaluation of Unlearning in {LLM}s via Data Transformations",
        author = "Joshi, Abhinav  and
          Saha, Shaswati  and
          Shukla, Divyaksh  and
          Vema, Sriram  and
          Jhamtani, Harsh  and
          Gaur, Manas  and
          Modi, Ashutosh",
        editor = "Al-Onaizan, Yaser  and
          Bansal, Mohit  and
          Chen, Yun-Nung",
        booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
        month = nov,
        year = "2024",
        address = "Miami, Florida, USA",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2024.findings-emnlp.706",
        pages = "12100--12119",
        abstract = "Large Language Models (LLMs) have shown to be a great success in a wide range of applications ranging from regular NLP-based use cases to AI agents. LLMs have been trained on a vast corpus of texts from various sources; despite the best efforts during the data pre-processing stage while training the LLMs, they may pick some undesirable information such as personally identifiable information (PII). Consequently, in recent times research in the area of Machine Unlearning (MUL) has become active, the main idea is to force LLMs to forget (unlearn) certain information (e.g., PII) without suffering from performance loss on regular tasks. In this work, we examine the robustness of the existing MUL techniques for their ability to enable leakage-proof forgetting in LLMs. 
        In particular, we examine the effect of data transformation on forgetting, i.e., is an unlearned LLM able to recall forgotten information if there is a change in the format of the input? Our findings on the TOFU dataset highlight the necessity of using diverse data formats to quantify unlearning in LLMs more reliably.",
    }