ACL 2024 Tutorial:
Vulnerabilities of Large Language Models to Adversarial Attacks

University of California, Riverside

Sunday, August 11th: 09:00 - 12:30 Tutorial 3
Centara Grand Convention Center
Room : World Ballroom B (Level 23)
Zoom link available on ACL
slides and video recordings of this tutorial are available now!!!

About this tutorial

This tutorial offers a comprehensive overview of vulnerabilities in Large Language Models (LLMs) that are exposed by adversarial attacks—an emerging interdisciplinary field in trustworthy ML that combines perspectives from Natural Language Processing (NLP) and Cybersecurity. We emphasize the existing vulnerabilities of unimodal LLMs, multi-modal LLMs, and systems that integrate LLMs, focusing on adversarial attacks designed to exploit weaknesses and mislead AI systems.

Researchers have been addressing these safety concerns by aligning models with desired principles, using techniques such as instruction tuning and reinforcement learning via human feedback. Ideally, these aligned LLMs should be helpful and harmless. However, past work has shown that even those trained for safety can be susceptible to adversarial attacks, as evidenced by the prevalence of ‘jailbreak’ attacks on models like ChatGPT or Bard.

This tutorial provides an overview of large language models and describes how they are aligned for safety. We then organize existing research according to different learning structures, covering textual-only attacks, multi-modal attacks, and additional attack methods. Finally, we share insights into the potential causes of vulnerability and suggest potential defense strategies.

Schedule

Our tutorial will be held on World Ballroom B (Level 23). Slides may be subject to updates.

Time Section Presenter Video Presenter
9:00—9:10 Section 1: Introduction - LLM vulnerability [Slides] Yue
9:10—9:30 Section 2: Preliminaries - Thinking like a hacker [Slides] Mamun, Nael
9:30—9:55 Section 3: Text-only Attacks [Slides] Yu, Yue Yu [Recordings]
9:55—10:25 Section 4-1: Multi-modal Attacks (VLM) [Slides] Erfan, Yue Erfan [Recordings]
10:25—10:30 Q&A Session I
10:30—11:00 Coffee break
11:00—11:25 Section 4-2: Multi-modal Attacks (T2I) [Slides] Sameen Sameen [Recordings]
11:25—11:50 Section 5: Additional Attacks [Slides] Pedram, Nael Pedram [Recordings]
11:50—12:10 Section 6: Causes [Slides] Mishkat, Sameen Mishkat [Recordings]
12:10—12:20 Section 7: Defenses [Slides] Mamun, Yue Mamun [Recordings]
12:20—12:30 Q&A Session 2

Reading List

The full list of papers can be found in our recent survey: Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks, below we highlight a small subset of the papers where bold papers will be discussed in detail during our tutorial.


Prerequisites


Section 2: NLP and Security Background


Section 3: Text-only Attacks


Section 4-1: Multi-modal Attacks (Image -> Text)


Section 4-2: Multi-modal Attacks (Text -> Image)


Section 5: Additional Attacks


Section 6: Causes


Section 7: Defenses