LLM Security Profile Banner
LLM Security Profile
LLM Security

@llm_sec

9,081
Followers
299
Following
244
Media
811
Statuses

Research, papers, jobs, and news on large language model security. Got something relevant? DM / tag @llm_sec

🏔️
Joined April 2023
Don't wanna be here? Send us removal request.
Pinned Tweet
@llm_sec
LLM Security
6 months
@elder_plinius attack surface ∝ capabilities
0
1
12
@llm_sec
LLM Security
1 year
* People ask LLMs to write code * LLMs recommend imports that don't actually exist * Attackers work out what these imports' names are, and create & upload them with malicious payloads * People using LLM-written code then auto-add malware themselves
90
2K
8K
@llm_sec
LLM Security
6 months
BadGemini: sold on the dark web for a $45 monthly subscription
25
85
596
@llm_sec
LLM Security
1 year
Wonder if this was related:
Tweet media one
4
25
349
@llm_sec
LLM Security
8 months
Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild "Engaging in the deliberate generation of abnormal outputs from large language models (LLMs) by attacking them is a novel human activity. This paper presents a thorough exposition of how and why people
Tweet media one
7
50
252
@llm_sec
LLM Security
11 months
Jailbreaking Black Box Large Language Models in Twenty Queries 🌶️ website: ⭐️ paper: code: "we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only
Tweet media one
4
52
242
@llm_sec
LLM Security
6 months
Remote Keylogging Attack on AI Assistants 🌶️🌶️🌶️ * intercept LLM chat session stream, via e.g. being on same wifi * use packet headers to infer the length of each token * extract and segment their sequence * use your own LLM to infer the response * successfully circumvent https
Tweet media one
2
65
217
@llm_sec
LLM Security
8 months
Buffer Overflow in Mixture of Experts "Mixture of Experts (MoE) has become a key ingredient for scaling large foundation models while keeping inference costs steady. We show that expert routing strategies that have cross-batch dependencies are vulnerable to attacks. Malicious
Tweet media one
3
57
202
@llm_sec
LLM Security
1 year
@_jameshatfield_ Output follows a distribution, so run many generations, see what nonsense pkg names are returned more frequently, and go fishing with those. Data on what coding tasks ppl search for is also out there.
3
6
189
@llm_sec
LLM Security
1 year
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks paper: "we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on LLMs" "Based on our finding that adversarially-generated prompts are brittle to
Tweet media one
2
50
193
@llm_sec
LLM Security
1 year
HouYi: A prompt injection toolkit, which yields * unrestricted arbitrary LLM usage * uncomplicated application prompt theft * 31 applications already found vulnerable * 10 vendors already have validated the findings
1
34
180
@llm_sec
LLM Security
1 year
Compromising LLMs: The Advent of AI Malware slides + paper: "We show how processing retrieved prompts can act as arbitrary code execution, manipulate the application’s functionality, and control how and if other APIs are called. Despite the increasing
Tweet media one
2
44
183
@llm_sec
LLM Security
1 year
Are people trying to hack your LLM? Rebuff is a toolkit for detecting prompt injection attempts. a) great to have tools in an arms race b) they must be getting handy data on new ideas c) luckily you can also run a local server! -
5
40
180
@llm_sec
LLM Security
1 year
RAIN: Your Language Models Can Align Themselves without Finetuning paper: "We discover that by integrating self-evaluation and rewind mechanisms, unaligned LLMs can directly produce responses consistent with human preferences via self-boosting. We
Tweet media one
1
31
181
@llm_sec
LLM Security
1 year
A survey on training data extraction from LLMs, covering over 100 papers on the topic "Training Data Extraction From Pre-trained Language Models: A Survey"
Tweet media one
0
39
181
@llm_sec
LLM Security
1 year
implications for cross-model exploit transferability Investigating the Existence of "Secret Language'' in Language Models paper: "In this paper, we study the problem of secret language in NLP, where current language models (LMs) seem to have a hidden
Tweet media one
2
39
139
@llm_sec
LLM Security
1 year
🌶️ GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts 🌶️ paper: "GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs" "results indicate that GPTFuzz consistently produces jailbreak templates with
Tweet media one
2
36
128
@llm_sec
LLM Security
1 year
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks paper: > Experimentally, we show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models
Tweet media one
2
30
126
@llm_sec
LLM Security
6 months
Treating jailbreakers like luxury guests! EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models "This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against LLMs. It builds jailbreak
Tweet media one
2
26
121
@llm_sec
LLM Security
8 months
The Offensive ML Playbook "Tactics, techniques and procedures for different offensive ML attacks encompassing the ML supply chain and adversarial ML attacks. Focused heavily on attacks with code you can use to perform the attack right away" w/
0
30
113
@llm_sec
LLM Security
3 months
garak: A Framework for Security Probing Large Language Models "We argue that it is time to rethink what constitutes ``LLM security'', and pursue a holistic approach to LLM security evaluation, where exploration and discovery of issues are central. To this end, this paper
Tweet media one
2
31
112
@llm_sec
LLM Security
7 months
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs "We propose a novel ASCII art-based jailbreak attack. We show that five SOTA LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle to recognize prompts provided in the form of ASCII art. Based on this
Tweet media one
2
32
110
@llm_sec
LLM Security
3 months
"Why I attack" by Nicholas Carlini
Tweet media one
1
24
109
@llm_sec
LLM Security
5 months
Kaggle's LLM prompt extraction competition has been won by exploiting the Sentence Transformer similarity function using an adversarial attack. 👑
2
24
108
@llm_sec
LLM Security
8 months
Gradient-Based Language Model Red Teaming "In this paper, we present Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts that are likely to cause an LM to output unsafe responses. GBRT is a form of prompt learning, trained by
Tweet media one
2
21
103
@llm_sec
LLM Security
6 months
@Engineer_Psych a very small badge
1
1
101
@llm_sec
LLM Security
6 months
Repeated token replay attacks continue to be viable "After the Scalable Extraction paper was published, OpenAI implemented filtering of prompt inputs containing repeated single tokens. As part of our regular application security review, Dropbox engineers discovered that OpenAI’s
Tweet media one
1
29
92
@llm_sec
LLM Security
1 year
Plug and Pray: Exploiting off-the-shelf components of Multi-Modal Models paper: "by introducing adversarial embedding space attacks, we emphasize the vulnerabilities present in multi-modal systems that originate from incorporating off-the-shelf
Tweet media one
0
24
86
@llm_sec
LLM Security
6 months
Can LLM-Generated Misinformation Be Detected? "We propose to tackle this question from the perspective of detection difficulty. We first build a taxonomy of LLM-generated misinformation. Then we categorize and validate the potential real-world methods for generating
Tweet media one
2
18
84
@llm_sec
LLM Security
7 months
Stealing Part of a Production Language Model We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding projection
Tweet media one
2
21
84
@llm_sec
LLM Security
7 months
Fast Adversarial Attacks on Language Models In One GPU Minute 🌶️ "Our gradient-free targeted attack can jailbreak aligned LMs with high attack success rates within one minute. For instance, BEAST can jailbreak Vicuna-7B-v1.5 under one minute with a success rate of 89%"
Tweet media one
2
20
83
@llm_sec
LLM Security
1 year
Site update: now has links to most of the papers & posts this account has posted, categorised into aspects of LLM security. The intent is to keep this up to date. Happy reading! (i'll buy a coffee for the first correct explanation of the banner)
Tweet media one
Tweet media two
Tweet media three
Tweet media four
7
13
73
@llm_sec
LLM Security
1 year
@_jameshatfield_ Poc is in the article
3
1
68
@llm_sec
LLM Security
6 months
What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety "we represent fine-tuning data through two lenses: representation and gradient spaces. our approach effectively identifies subsets of benign data that are more likely to degrade the model's safety after
Tweet media one
0
12
73
@llm_sec
LLM Security
8 months
Eliciting Language Model Behaviors using Reverse Language Models "We train an LM on tokens in reverse order—a reverse LM—as a tool for identifying worst-case inputs. By prompting a reverse LM with a problematic string, we can sample prefixes that are likely to precede the
Tweet media one
1
7
72
@llm_sec
LLM Security
1 year
Extract source text from embeddings & vectorDBs! 🌶️ "Text Embeddings Reveal (Almost) As Much As Text" paper: code: "a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs
Tweet media one
1
15
72
@llm_sec
LLM Security
3 months
Poisoned LangChain: Jailbreak LLMs by LangChain "we further design a novel method of indirect jailbreak attack, termed Poisoned-LangChain (PLC), which leverages a poisoned external knowledge base to interact with large language models, thereby causing the large models to
Tweet media one
0
22
71
@llm_sec
LLM Security
3 months
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models "we conduct a comprehensive analysis of jailbreak evaluation methodologies, drawing from nearly ninety jailbreak research released between May 2023 and April 2024. Our study
Tweet media one
1
19
70
@llm_sec
LLM Security
7 months
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding "we aim to defend LLMs against jailbreak attacks by introducing SafeDecoding, a safety-aware decoding strategy for LLMs" "even though probabilities of tokens representing harmful contents outweigh
Tweet media one
1
20
69
@llm_sec
LLM Security
7 months
Updated w/ Claude, GPT4 holes: Using Hallucinations to Bypass RLHF Filters 🌶️ "we present a novel method to manipulate the fine-tuned version into reverting to its pre-RLHF behavior, effectively erasing the model's filters; the exploit currently works for GPT4, Claude Sonnet,
Tweet media one
0
13
68
@llm_sec
LLM Security
7 months
CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models This paper delves into the mechanisms behind jailbreaking attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response
Tweet media one
2
21
66
@llm_sec
LLM Security
6 months
Red-Teaming Language Models with DSPy "At its core, this is really an autoprompting problem: how does one search the combinatorially infinite space of language for an adversarial prompt?" 🌶️
Tweet media one
1
13
64
@llm_sec
LLM Security
1 year
Backdoor Learning on Sequence to Sequence Models paper: "While a lot of works have studied the hidden danger of backdoor attacks in image or text classification, there is a limited understanding of the model's robustness on backdoor attacks when the
Tweet media one
0
17
61
@llm_sec
LLM Security
1 year
Draft 0.5 of the OWASP top 10 vulnerabilities for LLMs is out!
Tweet media one
0
17
61
@llm_sec
LLM Security
1 year
@DimitrisPapail @random_walker Personally I browse with cURL
2
2
59
@llm_sec
LLM Security
1 year
Exploring the Universal Vulnerability of Prompt-based Learning Paradigm paper: code: 🌶️ "this learning paradigm inherits the vulnerability from the pre-training stage, where model predictions can be misled by inserting certain
Tweet media one
0
11
58
@llm_sec
LLM Security
1 year
Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark paper: "To protect the copyright of LLMs for EaaS, we propose an Embedding Watermark method called EmbMarker that implants backdoors on
Tweet media one
0
17
57
@llm_sec
LLM Security
6 months
Making a SOTA Adversarial Attack on LLMs 38x Faster "we introduce the Accelerated Coordinate Gradient (ACG) attack method, which combines algorithmic insights and engineering optimizations on top of GCG to yield a ~38x speedup and ~4x GPU memory reduction without sacrificing the
Tweet media one
2
13
55
@llm_sec
LLM Security
6 months
llmsec follow f..wednesday: @goodside - llm whisperer & pit viper model @haizelabs - llm stress testing @LuxiHeLucy - jailbreaking & finetuning @umaarr6 - ml security @shi_weiyan - persuasion & privacy prof @ShomLinEd - llm attack surveyor @ChaoweiX - nvidia aisec prof
1
3
56
@llm_sec
LLM Security
1 year
@anbayanyay Fact checking is tough for humans and a well-established, very tough NLP problem - look how NLI benchmarks persist on LLM leaderboards for much longer than many other datasets, for example
1
0
51
@llm_sec
LLM Security
4 months
Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction "We pioneer a theoretical foundation in LLMs security by identifying bias vulnerabilities within the safety fine-tuning and design a black-box jailbreak method named DRA
Tweet media one
0
17
54
@llm_sec
LLM Security
1 year
Everyone uses instruction datasets, and every month a new way is discovered of encoded information that leads to model subversion when sneaked in them. This details two more attacks - content injection and over-refusal: -- @tomgoldsteincs @ManliShu
1
14
54
@llm_sec
LLM Security
11 months
Composite Backdoor Attacks Against Large Language Models paper: "Different from existing backdoor attacks against LLMs, ours scatters multiple trigger keys in different prompt components. Such a Composite Backdoor Attack (CBA) is shown to be stealthier
Tweet media one
1
9
51
@llm_sec
LLM Security
6 months
SecGPT: An Execution Isolation Architecture for LLM-Based Systems "LLM app ecosystems resemble the settings of earlier computing platforms, where there was insufficient isolation between apps and the system. Because third-party apps may not be trustworthy, and exacerbated by the
Tweet media one
0
11
53
@llm_sec
LLM Security
1 year
"Defending ChatGPT against Jailbreak Attack via Self-Reminder" Re-issuing in-context learning system prompts reduces the impact of jailbreaking. Unsurprising, but now there is evidence.
Tweet media one
3
24
52
@llm_sec
LLM Security
6 months
interactive system prompt evaluation tool - ps-fuzz from PromptSecurity:
1
11
52
@llm_sec
LLM Security
8 months
Text Embedding Inversion Attacks on Multilingual Language Models "this work investigates LLM security from the perspective of multilingual embedding inversion. Concretely, we define the problem of black-box multilingual and cross-lingual inversion attacks, with special attention
Tweet media one
0
14
52
@llm_sec
LLM Security
6 months
Optimization-based Prompt Injection Attack to LLM-as-a-Judge "we introduce JudgeDeceiver, a novel optimization-based prompt injection attack tailored to LLM-as-a-Judge. Our method formulates a precise optimization objective for attacking the decision-making process of
Tweet media one
2
17
54
@llm_sec
LLM Security
1 year
Evaluating the Susceptibility of Pre-Trained Language Models via Handcrafted Adversarial Examples paper: "While existing research has focused on adversarial attacks during either the training or the fine-tuning of PLMs, there is a deficit of information
Tweet media one
3
10
52
@llm_sec
LLM Security
3 months
Sandwich Attack: Multi-language Mixture Adaptive Attack on LLMs "we introduce a new black-box attack vector called the Sandwich Attack: a multi-language mixture attack, which manipulates state-of-theart LLMs into generating harmful and misaligned responses" (TrustNLP @NAACL )
Tweet media one
0
19
52
@llm_sec
LLM Security
5 months
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions "We argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from
Tweet media one
3
15
51
@llm_sec
LLM Security
6 months
Can LLMs Separate Instructions From Data? And What Do We Even Mean By That? "We introduce a formal measure to quantify the phenomenon of instruction-data separation as well as an empirical variant of the measure that can be computed from a model`s black-box outputs. We also
Tweet media one
0
8
50
@llm_sec
LLM Security
1 year
follow friday is deprecated, but if you do llmsec you should be following: @simonw prompt injection @KGreshake prompt injection @wunderwuzzi23 exploits incl exfil @rharang AIsec @muhao_chen robust nlp prof @LeonDerczynski llmsec prof @jun_yannn llm backdoors @llm_sec this acct
2
11
50
@llm_sec
LLM Security
6 months
Against The Achilles' Heel: A Survey on Red Teaming for Generative Models 🌶️ "Our extensive survey, which examines over 120 papers, introduces a taxonomy of fine-grained attack strategies grounded in the inherent capabilities of language models. Additionally, we have developed
Tweet media one
Tweet media two
3
11
50
@llm_sec
LLM Security
6 months
Privacy Backdoors: Stealing Data with Corrupted Pretrained Models "By tampering with a pretrained model's weights, an attacker can fully compromise the privacy of the finetuning data. We show how to build privacy backdoors for a variety of models, including transformers, which
Tweet media one
3
13
50
@llm_sec
LLM Security
6 months
JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models "We collaborate with domain experts to characterize problems and propose an LLM-assisted framework to streamline the analysis process. It provides automatic jailbreak assessment to facilitate
Tweet media one
1
13
50
@llm_sec
LLM Security
4 months
Representation noising effectively prevents harmful fine-tuning on LLMs "we propose Representation Noising (RepNoise), a defence mechanism that is effective even when attackers have access to the weights and the defender no longer has any control. RepNoise works by removing
Tweet media one
1
13
48
@llm_sec
LLM Security
6 months
Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal "we propose a risk assessment process using tools like the OWASP risk rating methodology which is used for traditional systems. We conduct scenario analysis to identify potential threat agents
Tweet media one
0
12
48
@llm_sec
LLM Security
1 year
A Study on Robustness and Reliability of Large Language Model Code Generation paper: "Although efforts have been made to avoid syntax errors and align the code with the intended semantics, the reliability and robustness of the code generation from LLMs
Tweet media one
0
15
46
@llm_sec
LLM Security
1 year
From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy paper: The work presents the vulnerabilities of ChatGPT, which can be exploited by malicious users to exfiltrate malicious information bypassing the ethical constraints on
Tweet media one
0
12
47
@llm_sec
LLM Security
1 year
Large Language Models can be Guided to Evade AI-Generated Text Detection paper: "We propose Substitution-based In-Context Optimization (SICO) that enables ChatGPT to evade six existing detectors, causing a significant 0.54 AUC drop on average."
Tweet media one
2
8
46
@llm_sec
LLM Security
5 months
Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent "We detail two implementations under this framework: "Obscure Intention" and "Create Ambiguity", which manipulate query complexity and ambiguity to evade malicious intent
Tweet media one
1
15
46
@llm_sec
LLM Security
1 year
analysis of DAN-family jailbreak prompts "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models paper: 🔥🔥🔥 "current LLMs and safeguards cannot adequately defend jailbreak prompts in all scenarios.
Tweet media one
1
9
44
@llm_sec
LLM Security
7 months
Jailbreaking Proprietary Large Language Models using Word Substitution Cipher "decoding several safe sentences that have been encrypted using various cryptographic techniques, we find that a straightforward word substitution cipher can be decoded most effectively" "We present a
Tweet media one
0
6
43
@llm_sec
LLM Security
1 year
Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots paper: First, we propose an innovative methodology inspired by **time-based** SQL injection techniques to reverse-engineer the defensive strategies of prominent LLM chatbots,
Tweet media one
1
12
43
@llm_sec
LLM Security
8 months
Weak-to-Strong Jailbreaking on Large Language Models "Upon examining the jailbreaking vulnerability of aligned LLMs, we observe that the decoding distributions of jailbroken and aligned models differ only in the initial generations. This observation motivates us to propose the
Tweet media one
2
5
43
@llm_sec
LLM Security
4 months
Exploiting ML models with pickle file attacks: Part 1 @trailofbits of course 🌶️ "We’ve developed a new hybrid machine learning (ML) model exploitation technique called Sleepy Pickle that takes advantage of the pervasive and notoriously insecure Pickle file format used to
0
17
43
@llm_sec
LLM Security
7 months
GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis "Existing methods for detecting unsafe prompts are primarily online moderation APIs or finetuned LLMs. These strategies, however, often require extensive and resource-intensive data collection and
Tweet media one
0
6
40
@llm_sec
LLM Security
1 year
Demystifying RCE Vulnerabilities in LLM-Integrated Apps paper: "We discovered 13 vulnerabilities in 6 frameworks, including 12 RCE vulnerabilities 🌶︄🌶︄ and 1 arbitrary file read/write vulnerability. 11 of them are confirmed by the framework
Tweet media one
0
14
41
@llm_sec
LLM Security
6 months
Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models "monstrates effectiveness in two attack types. The first is Label Attack, tricking VLMs into misidentifying class labels, such as confusing Donald Trump for Joe Biden. The second is Persuasion Attack,
Tweet media one
1
10
40
@llm_sec
LLM Security
8 months
PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models (non-peer-reviewed) "we propose PoisonedRAG , a set of knowledge poisoning attacks to RAG, where an attacker could inject a few poisoned texts into the knowledge database such
Tweet media one
1
9
40
@llm_sec
LLM Security
7 months
Defending LLMs against Jailbreaking Attacks via Backtranslation "given an initial response generated by the target LLM from an input prompt, our backtranslation prompts a language model to infer an input prompt that can lead to the response. The inferred prompt is called the
Tweet media one
0
11
40
@llm_sec
LLM Security
7 months
Bypassing the Safety Training of Open-Source LLMs with Priming Attacks "Since safety-training data typically follows a specific structure containing full model responses (Bai et al., 2022), performing model inference on inputs primed with a partial response can exploit the
Tweet media one
1
6
39
@llm_sec
LLM Security
6 months
Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks "We study Prompt Injection Attacks (PIAs) on multiple families of LLMs on a Machine Translation task, focusing on the effects of model size on the attack success rates.We introduce
Tweet media one
2
2
39
@llm_sec
LLM Security
7 months
llmsec follow wednesday @NannaInie demon research 😈 @Alphatu4 llmsec influencer @NMspinach ms red team 🔴 @leonardtang_ auto red teaming @zhangchen_xu safedecoding🔒 @Kei0x llm fuzzer @imVinusankars BEAST 👹 author @shi_weiyan llm persuasion @uiuc_aisecure ai sec OG Bo Li 🧙‍♀️
3
11
37
@llm_sec
LLM Security
5 months
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs "we first discuss the drawbacks of solely picking the suffix with the lowest loss during GCG optimization for jailbreaking and uncover the missed
Tweet media one
2
6
37
@llm_sec
LLM Security
6 months
Curiosity-driven Red-teaming for Large Language Models "Recent works automate red teaming by training a separate red team LLM with reinforcement learning (RL) to generate test cases that maximize the chance of eliciting undesirable responses from the target LLM. However, current
Tweet media one
1
5
38
@llm_sec
LLM Security
1 year
Toolkit for hardening & testing code LLM output site: "This work studies the security of LMs along two important axes: (i) security hardening, which aims to enhance LMs’ reliability in generating secure code, and (ii) adversarial testing, which seeks to
0
7
38
@llm_sec
LLM Security
7 months
FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts "FigStep converts the harmful content into images through typography to bypass the safety alignment within the textual module of the VLMs, inducing VLMs to output unsafe responses that violate
Tweet media one
2
8
36
@llm_sec
LLM Security
9 months
Universal Vulnerabilities in Large Language Models: In-context Learning Backdoor Attacks "Despite being widely applied, in-context learning is vulnerable to malicious attacks. Our method encompasses two types of attacks: poisoning demonstration examples and poisoning prompts,
Tweet media one
0
8
37
@llm_sec
LLM Security
1 year
Identifying and Mitigating the Security Risks of Generative AI paper: "This paper reports the findings of a workshop held at Google (co-organized by Stanford University and the University of Wisconsin-Madison) on the dual-use dilemma posed by GenAI."
Tweet media one
0
8
37
@llm_sec
LLM Security
1 year
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models paper: "Safely aligned LLMs can be easily subverted to generate harmful content. Formally, we term a new attack as Shadow Alignment: utilizing a tiny amount of data can elicit
Tweet media one
2
7
35
@llm_sec
LLM Security
7 months
An Architectural Risk Analysis of Large Language Models "Securing a modern LLM system (even if what’s under scrutiny is only an application involving LLM technology) must involve diving into the engineering and design of the specific LLM system itself. This ARA is intended to
Tweet media one
0
7
36
@llm_sec
LLM Security
7 months
Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks 🌶️ "we show that it is possible to conceptualize the creation of execution triggers as a differentiable search problem and use learning-based methods to autonomously generate them." "Our
Tweet media one
1
16
35
@llm_sec
LLM Security
8 months
Prompt-Driven LLM Safeguarding via Directed Representation Optimization "we investigate the impact of safety prompts from the perspective of model representations. in models' representation space, harmful and harmless queries can be largely distinguished, but this is not
Tweet media one
1
2
35
@llm_sec
LLM Security
5 months
Universal Adversarial Triggers Are Not Universal "In this paper, we concretely show that such adversarial triggers are not universal. We extensively investigate trigger transfer amongst 13 open models and observe inconsistent transfer. APO models are extremely hard to jailbreak.
Tweet media one
3
9
36