ACL 2024 Tutorial:
Vulnerabilities of Large Language Models to Adversarial Attacks


Yu Fu	Erfan Shayegani	Md Abdullah Al Mamun	Pedram Zaree	Quazi Mishkatul Alam	Haz Sameen Shahgir	Nael Abu-Ghazaleh	Yue Dong

University of California, Riverside

Sunday, August 11th: 09:00 - 12:30 Tutorial 3

Centara Grand Convention Center

Room : World Ballroom B (Level 23)

Zoom link available on ACL

slides and video recordings of this tutorial are available now!!!

About this tutorial

This tutorial offers a comprehensive overview of vulnerabilities in Large Language Models (LLMs) that are exposed by adversarial attacks—an emerging interdisciplinary field in trustworthy ML that combines perspectives from Natural Language Processing (NLP) and Cybersecurity. We emphasize the existing vulnerabilities of unimodal LLMs, multi-modal LLMs, and systems that integrate LLMs, focusing on adversarial attacks designed to exploit weaknesses and mislead AI systems.

Researchers have been addressing these safety concerns by aligning models with desired principles, using techniques such as instruction tuning and reinforcement learning via human feedback. Ideally, these aligned LLMs should be helpful and harmless. However, past work has shown that even those trained for safety can be susceptible to adversarial attacks, as evidenced by the prevalence of ‘jailbreak’ attacks on models like ChatGPT or Bard.

This tutorial provides an overview of large language models and describes how they are aligned for safety. We then organize existing research according to different learning structures, covering textual-only attacks, multi-modal attacks, and additional attack methods. Finally, we share insights into the potential causes of vulnerability and suggest potential defense strategies.

Schedule

Our tutorial will be held on World Ballroom B (Level 23). Slides may be subject to updates.

Time	Section	Presenter	Video Presenter
9:00—9:10	Section 1: Introduction - LLM vulnerability [Slides]	Yue
9:10—9:30	Section 2: Preliminaries - Thinking like a hacker [Slides]	Mamun, Nael
9:30—9:55	Section 3: Text-only Attacks [Slides]	Yu, Yue	Yu [Recordings]
9:55—10:25	Section 4-1: Multi-modal Attacks (VLM) [Slides]	Erfan, Yue	Erfan [Recordings]
10:25—10:30	Q&A Session I
10:30—11:00	Coffee break
11:00—11:25	Section 4-2: Multi-modal Attacks (T2I) [Slides]	Sameen	Sameen [Recordings]
11:25—11:50	Section 5: Additional Attacks [Slides]	Pedram, Nael	Pedram [Recordings]
11:50—12:10	Section 6: Causes [Slides]	Mishkat, Sameen	Mishkat [Recordings]
12:10—12:20	Section 7: Defenses [Slides]	Mamun, Yue	Mamun [Recordings]
12:20—12:30	Q&A Session 2

Reading List

The full list of papers can be found in our recent survey: Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks, below we highlight a small subset of the papers where bold papers will be discussed in detail during our tutorial.

Section 4-2: Multi-modal Attacks (Text -> Image)

Section 5: Additional Attacks

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection (Greshake et al., 2023)
PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models (Zou et al., 2024)
Phantom: General Trigger Attacks on Retrieval Augmented Language Generation (Chaudhari et al., 2024)
Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems (Qi et al., 2024)
SEEING IS BELIEVING: BLACK-BOX MEMBERSHIP INFERENCE ATTACKS AGAINST RETRIEVAL AUGMENTED GENERATION (Li et al., 2024)
From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-Integrated Web Application? (Pedro et al., 2023)
RatGPT: Turning online LLMs into Proxies for Malware Attacks (Beckerich et al., 2023)
The Intelligent Agent NLP-based Customer Service System (Changran, 2021)
FedMLSecurity: A Benchmark for Attacks and Defenses in Federated Learning and LLMs (Han et al., 2024)
Local Model Poisoning Attacks to Byzantine-Robust Federated Learning (Fang et al., 2020)
A robust analysis of adversarial attacks on federated learning environments (Nair et al., 2023)
A Systematic Review of Federated Generative Models (Vedadi Gargary et al., 2024)

Section 6: Causes

Jailbroken: How Does LLM Safety Training Fail? (Wei et al., 2023)
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! (Qi et al., 2023)
LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked (Helbling et al., 2023)
LLM Censorship: A Machine Learning Challenge or a Computer Security Problem? (Glukhov et al., 2023)
Automatically Auditing Large Language Models via Discrete Optimization (Jones et al., 2023)
Pretraining language models with human preferences (Korbak et al., 2023)
Baseline Defenses For Adversarial Attacks Against Aligned Language Models (Jain et al., 2023)
Certifying LLM Safety against Adversarial Prompting (Kumar et al., 2023)
Interpretability and Transparency-Driven Detection and Transformation of Textual Adversarial Examples (IT-DT) (Sabir et al., 2023)
Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks (Zhang et al., 2023)
Towards building a robust toxicity predictor (Bespalov et al., 2023)
Exploring the Universal Vulnerability of Prompt-based Learning Paradigm (Xu et al., 2022)
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (Ganguli et al., 2022)
Red Teaming Language Models with Language Models (Perez et al., 2022)

Section 7: Defenses

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022)
Direct preference optimization (DPO): Your language model is secretly a reward model (Rafailov et al., 2023)
Large Language Model Unlearning (Yao et al., 2023)
Baseline Defenses for Adversarial Attacks Against Aligned Language Models (Jain et al., 2023)
Certifying LLM Safety against Adversarial Prompting (Kumar et al., 2023)
SMOOTHLLM: Defending Large Language Models Against Jailbreaking Attacks (Robey et al., 2023)

ACL 2024 Tutorial: Vulnerabilities of Large Language Models to Adversarial Attacks

About this tutorial

Schedule

Reading List

Prerequisites

Section 2: NLP and Security Background

Section 3: Text-only Attacks

Section 4-1: Multi-modal Attacks (Image -> Text)

Section 4-2: Multi-modal Attacks (Text -> Image)

Section 5: Additional Attacks

Section 6: Causes

Section 7: Defenses

ACL 2024 Tutorial:
Vulnerabilities of Large Language Models to Adversarial Attacks