LLM Security @llm_sec Twitter profile

Pinned Tweet

LLM Security

@llm_sec

6 months

@elder_plinius attack surface ∝ capabilities

0

1

12

Last Seen Profiles

@bokeplokalmalam

@StapletonLainie

@Hanzalahwajid

@figueronqu1278

@corygibbs

@zayedkhan1802

@KristyleeBella

@ScarletSunArts

@MrsKRGray

@AWeissmann_

@Babyclubmx

@mcfluffle

@AgapaAndalucia

@RobergrosCarlos

@thecrackpodcas1

@Max64114562

@Gregor79826Page

@samrql

@PanvelCorp

@Adventure_Ren

@OvittOvi

@growfitter

@baseballtrooper

@khazal_ke

@panamaccenter

@Jolie9703

@ZoneOrange_SSB

@stw_pdg

@AidanAidanmc

@bokeplokalmalam

@Ohmydarling_12

@goHutchy

@Lucia_Varaa

@serohagi_

@m_rissv

LLM Security

@llm_sec

1 year

* People ask LLMs to write code * LLMs recommend imports that don't actually exist * Attackers work out what these imports' names are, and create & upload them with malicious payloads * People using LLM-written code then auto-add malware themselves

Can you trust ChatGPT’s package recommendations?

ChatGPT can offer coding solutions, but its tendency for hallucination presents attackers with an opportunity. Here's what we learned.

vulcan.io

90

2K

8K

LLM Security

@llm_sec

6 months

BadGemini: sold on the dark web for a $45 monthly subscription

25

85

596

LLM Security

@llm_sec

1 year

Wonder if this was related:

4

25

349

LLM Security

@llm_sec

8 months

Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild "Engaging in the deliberate generation of abnormal outputs from large language models (LLMs) by attacking them is a novel human activity. This paper presents a thorough exposition of how and why people

7

50

252

LLM Security

@llm_sec

11 months

Jailbreaking Black Box Large Language Models in Twenty Queries 🌶️ website: ⭐️ paper: code: "we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only

4

52

242

LLM Security

@llm_sec

6 months

Remote Keylogging Attack on AI Assistants 🌶️🌶️🌶️ * intercept LLM chat session stream, via e.g. being on same wifi * use packet headers to infer the length of each token * extract and segment their sequence * use your own LLM to infer the response * successfully circumvent https

2

65

217

LLM Security

@llm_sec

8 months

Buffer Overflow in Mixture of Experts "Mixture of Experts (MoE) has become a key ingredient for scaling large foundation models while keeping inference costs steady. We show that expert routing strategies that have cross-batch dependencies are vulnerable to attacks. Malicious

3

57

202

LLM Security

@llm_sec

1 year

@_jameshatfield_ Output follows a distribution, so run many generations, see what nonsense pkg names are returned more frequently, and go fishing with those. Data on what coding tasks ppl search for is also out there.

3

6

189

LLM Security

@llm_sec

1 year

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks paper: "we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on LLMs" "Based on our finding that adversarially-generated prompts are brittle to

2

50

193

LLM Security

@llm_sec

1 year

HouYi: A prompt injection toolkit, which yields * unrestricted arbitrary LLM usage * uncomplicated application prompt theft * 31 applications already found vulnerable * 10 vendors already have validated the findings

Prompt Injection attack against LLM-integrated Applications

Large Language Models (LLMs), renowned for their superior proficiency in language comprehension and generation, stimulate a vibrant ecosystem of applications around them. However, their extensive...

arxiv.org

1

34

180

LLM Security

@llm_sec

1 year

Compromising LLMs: The Advent of AI Malware slides + paper: "We show how processing retrieved prompts can act as arbitrary code execution, manipulate the application’s functionality, and control how and if other APIs are called. Despite the increasing

2

44

183

LLM Security

@llm_sec

1 year

Are people trying to hack your LLM? Rebuff is a toolkit for detecting prompt injection attempts. a) great to have tools in an arms race b) they must be getting handy data on new ideas c) luckily you can also run a local server! -

GitHub - protectai/rebuff: LLM Prompt Injection Detector

LLM Prompt Injection Detector. Contribute to protectai/rebuff development by creating an account on GitHub.

github.com

5

40

180

LLM Security

@llm_sec

1 year

RAIN: Your Language Models Can Align Themselves without Finetuning paper: "We discover that by integrating self-evaluation and rewind mechanisms, unaligned LLMs can directly produce responses consistent with human preferences via self-boosting. We

1

31

181

LLM Security

@llm_sec

1 year

A survey on training data extraction from LLMs, covering over 100 papers on the topic "Training Data Extraction From Pre-trained Language Models: A Survey"

0

39

181

LLM Security

@llm_sec

1 year

implications for cross-model exploit transferability Investigating the Existence of "Secret Language'' in Language Models paper: "In this paper, we study the problem of secret language in NLP, where current language models (LMs) seem to have a hidden

2

39

139

LLM Security

@llm_sec

1 year

🌶️ GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts 🌶️ paper: "GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs" "results indicate that GPTFuzz consistently produces jailbreak templates with

2

36

128

LLM Security

@llm_sec

1 year

Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks paper: > Experimentally, we show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models

2

30

126

LLM Security

@llm_sec

6 months

Treating jailbreakers like luxury guests! EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models "This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against LLMs. It builds jailbreak

2

26

121

LLM Security

@llm_sec

8 months

The Offensive ML Playbook "Tactics, techniques and procedures for different offensive ML attacks encompassing the ML supply chain and adversarial ML attacks. Focused heavily on attacks with code you can use to perform the attack right away" w/

Welcome to the Offensive ML Playbook - OffSecML Playbook

Latest: 08/21/24 version: 1.0.8 First published 10/26/23. Shiny new things shout out to mundruid/drX for a bunch of great contributions! Improvements to Benchmarking hackbots and agents featuring ne…

wiki.offsecml.com

0

30

113

LLM Security

@llm_sec

3 months

garak: A Framework for Security Probing Large Language Models "We argue that it is time to rethink what constitutes ``LLM security'', and pursue a holistic approach to LLM security evaluation, where exploration and discovery of issues are central. To this end, this paper

2

31

112

LLM Security

@llm_sec

7 months

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs "We propose a novel ASCII art-based jailbreak attack. We show that five SOTA LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle to recognize prompts provided in the form of ASCII art. Based on this

2

32

110

LLM Security

@llm_sec

3 months

"Why I attack" by Nicholas Carlini

1

24

109

LLM Security

@llm_sec

5 months

Kaggle's LLM prompt extraction competition has been won by exploiting the Sentence Transformer similarity function using an adversarial attack. 👑

LLM Prompt Recovery

Recover the prompt used to transform a given text

www.kaggle.com

2

24

108

LLM Security

@llm_sec

8 months

Gradient-Based Language Model Red Teaming "In this paper, we present Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts that are likely to cause an LM to output unsafe responses. GBRT is a form of prompt learning, trained by

2

21

103

LLM Security

@llm_sec

6 months

@Engineer_Psych a very small badge

1

101

LLM Security

@llm_sec

6 months

Repeated token replay attacks continue to be viable "After the Scalable Extraction paper was published, OpenAI implemented filtering of prompt inputs containing repeated single tokens. As part of our regular application security review, Dropbox engineers discovered that OpenAI’s

1

29

92

LLM Security

@llm_sec

1 year

Plug and Pray: Exploiting off-the-shelf components of Multi-Modal Models paper: "by introducing adversarial embedding space attacks, we emphasize the vulnerabilities present in multi-modal systems that originate from incorporating off-the-shelf

0

24

86

LLM Security

@llm_sec

6 months

Can LLM-Generated Misinformation Be Detected? "We propose to tackle this question from the perspective of detection difficulty. We first build a taxonomy of LLM-generated misinformation. Then we categorize and validate the potential real-world methods for generating

2

18

84

LLM Security

@llm_sec

7 months

Stealing Part of a Production Language Model We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding projection

2

21

84

LLM Security

@llm_sec

7 months

Fast Adversarial Attacks on Language Models In One GPU Minute 🌶️ "Our gradient-free targeted attack can jailbreak aligned LMs with high attack success rates within one minute. For instance, BEAST can jailbreak Vicuna-7B-v1.5 under one minute with a success rate of 89%"

2

20

83

LLM Security

@llm_sec

3 months

Prompt Injection and Jailbreaking are (still) not the same thing by @simonw

Prompt injection and jailbreaking are not the same thing

I keep seeing people use the term “prompt injection” when they’re actually talking about “jailbreaking”. This mistake is so common now that I’m not sure it’s possible to correct course: …

simonwillison.net

1

24

76

LLM Security

@llm_sec

1 year

Site update: now has links to most of the papers & posts this account has posted, categorised into aspects of LLM security. The intent is to keep this up to date. Happy reading! (i'll buy a coffee for the first correct explanation of the banner)

7

13

73

LLM Security

@llm_sec

1 year

@_jameshatfield_ Poc is in the article

3

1

68

LLM Security

@llm_sec

6 months

What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety "we represent fine-tuning data through two lenses: representation and gradient spaces. our approach effectively identifies subsets of benign data that are more likely to degrade the model's safety after

0

12

73

LLM Security

@llm_sec

8 months

Eliciting Language Model Behaviors using Reverse Language Models "We train an LM on tokens in reverse order—a reverse LM—as a tool for identifying worst-case inputs. By prompting a reverse LM with a problematic string, we can sample prefixes that are likely to precede the

1

7

72

LLM Security

@llm_sec

1 year

Extract source text from embeddings & vectorDBs! 🌶️ "Text Embeddings Reveal (Almost) As Much As Text" paper: code: "a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs

1

15

72

LLM Security

@llm_sec

3 months

Poisoned LangChain: Jailbreak LLMs by LangChain "we further design a novel method of indirect jailbreak attack, termed Poisoned-LangChain (PLC), which leverages a poisoned external knowledge base to interact with large language models, thereby causing the large models to

0

22

71

LLM Security

@llm_sec

3 months

JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models "we conduct a comprehensive analysis of jailbreak evaluation methodologies, drawing from nearly ninety jailbreak research released between May 2023 and April 2024. Our study

1

19

70

LLM Security

@llm_sec

7 months

SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding "we aim to defend LLMs against jailbreak attacks by introducing SafeDecoding, a safety-aware decoding strategy for LLMs" "even though probabilities of tokens representing harmful contents outweigh

1

20

69

LLM Security

@llm_sec

7 months

Updated w/ Claude, GPT4 holes: Using Hallucinations to Bypass RLHF Filters 🌶️ "we present a novel method to manipulate the fine-tuned version into reverting to its pre-RLHF behavior, effectively erasing the model's filters; the exploit currently works for GPT4, Claude Sonnet,

0

13

68

LLM Security

@llm_sec

6 months

LLM Security for Beginners: catch up to date by @KleiberIngo

A Primer on LLM Security – Hacking Large Language Models for Beginners

This article provides a short introduction into Large Language Model (LLM) security as well as red teaming LLMs.

kleiber.me

0

23

68

LLM Security

@llm_sec

7 months

CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models This paper delves into the mechanisms behind jailbreaking attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response

2

21

66

LLM Security

@llm_sec

6 months

Red-Teaming Language Models with DSPy "At its core, this is really an autoprompting problem: how does one search the combinatorially infinite space of language for an adversarial prompt?" 🌶️

1

13

64

LLM Security

@llm_sec

1 year

Backdoor Learning on Sequence to Sequence Models paper: "While a lot of works have studied the hidden danger of backdoor attacks in image or text classification, there is a limited understanding of the model's robustness on backdoor attacks when the

0

17

61

LLM Security

@llm_sec

1 year

Draft 0.5 of the OWASP top 10 vulnerabilities for LLMs is out!

0

17

61

LLM Security

@llm_sec

1 year

@DimitrisPapail @random_walker Personally I browse with cURL

2

59

LLM Security

@llm_sec

1 year

Exploring the Universal Vulnerability of Prompt-based Learning Paradigm paper: code: 🌶️ "this learning paradigm inherits the vulnerability from the pre-training stage, where model predictions can be misled by inserting certain

0

11

58

LLM Security

@llm_sec

1 year

Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark paper: "To protect the copyright of LLMs for EaaS, we propose an Embedding Watermark method called EmbMarker that implants backdoors on

0

17

57

LLM Security

@llm_sec

6 months

Making a SOTA Adversarial Attack on LLMs 38x Faster "we introduce the Accelerated Coordinate Gradient (ACG) attack method, which combines algorithmic insights and engineering optimizations on top of GCG to yield a ~38x speedup and ~4x GPU memory reduction without sacrificing the

2

13

55

LLM Security

@llm_sec

6 months

llmsec follow f..wednesday: @goodside - llm whisperer & pit viper model @haizelabs - llm stress testing @LuxiHeLucy - jailbreaking & finetuning @umaarr6 - ml security @shi_weiyan - persuasion & privacy prof @ShomLinEd - llm attack surveyor @ChaoweiX - nvidia aisec prof

1

3

56

LLM Security

@llm_sec

1 year

@anbayanyay Fact checking is tough for humans and a well-established, very tough NLP problem - look how NLI benchmarks persist on LLM leaderboards for much longer than many other datasets, for example

1

0

51

LLM Security

@llm_sec

4 months

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction "We pioneer a theoretical foundation in LLMs security by identifying bias vulnerabilities within the safety fine-tuning and design a black-box jailbreak method named DRA

0

17

54

LLM Security

@llm_sec

1 year

Everyone uses instruction datasets, and every month a new way is discovered of encoded information that leads to model subversion when sneaked in them. This details two more attacks - content injection and over-refusal: -- @tomgoldsteincs @ManliShu

On the Exploitability of Instruction Tuning

Instruction tuning is an effective technique to align large language models (LLMs) with human intents. In this work, we investigate how an adversary can exploit instruction tuning by injecting...

arxiv.org

1

14

54

LLM Security

@llm_sec

11 months

Composite Backdoor Attacks Against Large Language Models paper: "Different from existing backdoor attacks against LLMs, ours scatters multiple trigger keys in different prompt components. Such a Composite Backdoor Attack (CBA) is shown to be stealthier

1

9

51

LLM Security

@llm_sec

6 months

SecGPT: An Execution Isolation Architecture for LLM-Based Systems "LLM app ecosystems resemble the settings of earlier computing platforms, where there was insufficient isolation between apps and the system. Because third-party apps may not be trustworthy, and exacerbated by the

0

11

53

LLM Security

@llm_sec

1 year

"Defending ChatGPT against Jailbreak Attack via Self-Reminder" Re-issuing in-context learning system prompts reduces the impact of jailbreaking. Unsurprising, but now there is evidence.

3

24

52

LLM Security

@llm_sec

6 months

interactive system prompt evaluation tool - ps-fuzz from PromptSecurity:

1

11

52

LLM Security

@llm_sec

8 months

Text Embedding Inversion Attacks on Multilingual Language Models "this work investigates LLM security from the perspective of multilingual embedding inversion. Concretely, we define the problem of black-box multilingual and cross-lingual inversion attacks, with special attention

0

14

52

LLM Security

@llm_sec

6 months

Optimization-based Prompt Injection Attack to LLM-as-a-Judge "we introduce JudgeDeceiver, a novel optimization-based prompt injection attack tailored to LLM-as-a-Judge. Our method formulates a precise optimization objective for attacking the decision-making process of

2

17

54

LLM Security

@llm_sec

1 year

Evaluating the Susceptibility of Pre-Trained Language Models via Handcrafted Adversarial Examples paper: "While existing research has focused on adversarial attacks during either the training or the fine-tuning of PLMs, there is a deficit of information

3

10

52

LLM Security

@llm_sec

3 months

Sandwich Attack: Multi-language Mixture Adaptive Attack on LLMs "we introduce a new black-box attack vector called the Sandwich Attack: a multi-language mixture attack, which manipulates state-of-theart LLMs into generating harmful and misaligned responses" (TrustNLP @NAACL )

0

19

52

LLM Security

@llm_sec

5 months

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions "We argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from

3

15

51

LLM Security

@llm_sec

6 months

Can LLMs Separate Instructions From Data? And What Do We Even Mean By That? "We introduce a formal measure to quantify the phenomenon of instruction-data separation as well as an empirical variant of the measure that can be computed from a model`s black-box outputs. We also

0

8

50

LLM Security

@llm_sec

1 year

follow friday is deprecated, but if you do llmsec you should be following: @simonw prompt injection @KGreshake prompt injection @wunderwuzzi23 exploits incl exfil @rharang AIsec @muhao_chen robust nlp prof @LeonDerczynski llmsec prof @jun_yannn llm backdoors @llm_sec this acct

2

11

50

LLM Security

@llm_sec

6 months

Against The Achilles' Heel: A Survey on Red Teaming for Generative Models 🌶️ "Our extensive survey, which examines over 120 papers, introduces a taxonomy of fine-grained attack strategies grounded in the inherent capabilities of language models. Additionally, we have developed

3

11

50

LLM Security

@llm_sec

6 months

Privacy Backdoors: Stealing Data with Corrupted Pretrained Models "By tampering with a pretrained model's weights, an attacker can fully compromise the privacy of the finetuning data. We show how to build privacy backdoors for a variety of models, including transformers, which

3

13

50

LLM Security

@llm_sec

6 months

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models "We collaborate with domain experts to characterize problems and propose an LLM-assisted framework to streamline the analysis process. It provides automatic jailbreak assessment to facilitate

1

13

50

LLM Security

@llm_sec

4 months

Representation noising effectively prevents harmful fine-tuning on LLMs "we propose Representation Noising (RepNoise), a defence mechanism that is effective even when attackers have access to the weights and the defender no longer has any control. RepNoise works by removing

1

13

48

LLM Security

@llm_sec

6 months

Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal "we propose a risk assessment process using tools like the OWASP risk rating methodology which is used for traditional systems. We conduct scenario analysis to identify potential threat agents

0

12

48

LLM Security

@llm_sec

1 year

A Study on Robustness and Reliability of Large Language Model Code Generation paper: "Although efforts have been made to avoid syntax errors and align the code with the intended semantics, the reliability and robustness of the code generation from LLMs

0

15

46

LLM Security

@llm_sec

1 year

From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy paper: The work presents the vulnerabilities of ChatGPT, which can be exploited by malicious users to exfiltrate malicious information bypassing the ethical constraints on

0

12

47

LLM Security

@llm_sec

1 year

Large Language Models can be Guided to Evade AI-Generated Text Detection paper: "We propose Substitution-based In-Context Optimization (SICO) that enables ChatGPT to evade six existing detectors, causing a significant 0.54 AUC drop on average."

2

8

46

LLM Security

@llm_sec

5 months

Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent "We detail two implementations under this framework: "Obscure Intention" and "Create Ambiguity", which manipulate query complexity and ambiguity to evade malicious intent

1

15

46

LLM Security

@llm_sec

1 year

analysis of DAN-family jailbreak prompts "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models paper: 🔥🔥🔥 "current LLMs and safeguards cannot adequately defend jailbreak prompts in all scenarios.

1

9

44

LLM Security

@llm_sec

7 months

Jailbreaking Proprietary Large Language Models using Word Substitution Cipher "decoding several safe sentences that have been encrypted using various cryptographic techniques, we find that a straightforward word substitution cipher can be decoded most effectively" "We present a

0

6

43

LLM Security

@llm_sec

1 year

Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots paper: First, we propose an innovative methodology inspired by **time-based** SQL injection techniques to reverse-engineer the defensive strategies of prominent LLM chatbots,

1

12

43

LLM Security

@llm_sec

8 months

Weak-to-Strong Jailbreaking on Large Language Models "Upon examining the jailbreaking vulnerability of aligned LLMs, we observe that the decoding distributions of jailbroken and aligned models differ only in the initial generations. This observation motivates us to propose the

2

5

43

LLM Security

@llm_sec

4 months

Exploiting ML models with pickle file attacks: Part 1 @trailofbits of course 🌶️ "We’ve developed a new hybrid machine learning (ML) model exploitation technique called Sleepy Pickle that takes advantage of the pervasive and notoriously insecure Pickle file format used to

0

17

43

LLM Security

@llm_sec

7 months

GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis "Existing methods for detecting unsafe prompts are primarily online moderation APIs or finetuned LLMs. These strategies, however, often require extensive and resource-intensive data collection and

0

6

40

LLM Security

@llm_sec

7 months

Self-replicating prompt injection w/ @ben_nassi

Here Come the AI Worms

Security researchers created an AI worm in a test environment that can automatically spread between generative AI agents—potentially stealing data and sending spam emails along the way.

www.wired.com

2

10

41

LLM Security

@llm_sec

1 year

Demystifying RCE Vulnerabilities in LLM-Integrated Apps paper: "We discovered 13 vulnerabilities in 6 frameworks, including 12 RCE vulnerabilities 🌶︄🌶︄ and 1 arbitrary file read/write vulnerability. 11 of them are confirmed by the framework

0

14

41

LLM Security

@llm_sec

1 year

Escaping AutoGPT's docker container through prompt injection in a summarisation task:

Hacking Auto-GPT and escaping its docker container | Positive Security

We leverage indirect prompt injection to trick Auto-GPT (GPT-4) into executing arbitrary code when it is asked to perform a seemingly harmless task such as text summarization on a malicious website,...

positive.security

2

6

40

LLM Security

@llm_sec

6 months

Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models "monstrates effectiveness in two attack types. The first is Label Attack, tricking VLMs into misidentifying class labels, such as confusing Donald Trump for Joe Biden. The second is Persuasion Attack,

1

10

40

LLM Security

@llm_sec

8 months

PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models (non-peer-reviewed) "we propose PoisonedRAG , a set of knowledge poisoning attacks to RAG, where an attacker could inject a few poisoned texts into the knowledge database such

1

9

40

LLM Security

@llm_sec

7 months

Defending LLMs against Jailbreaking Attacks via Backtranslation "given an initial response generated by the target LLM from an input prompt, our backtranslation prompts a language model to infer an input prompt that can lead to the response. The inferred prompt is called the

0

11

40

LLM Security

@llm_sec

7 months

Bypassing the Safety Training of Open-Source LLMs with Priming Attacks "Since safety-training data typically follows a specific structure containing full model responses (Bai et al., 2022), performing model inference on inputs primed with a partial response can exploit the

1

6

39

LLM Security

@llm_sec

6 months

Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks "We study Prompt Injection Attacks (PIAs) on multiple families of LLMs on a Machine Translation task, focusing on the effects of model size on the attack success rates.We introduce

2

39

LLM Security

@llm_sec

7 months

llmsec follow wednesday @NannaInie demon research 😈 @Alphatu4 llmsec influencer @NMspinach ms red team 🔴 @leonardtang_ auto red teaming @zhangchen_xu safedecoding🔒 @Kei0x llm fuzzer @imVinusankars BEAST 👹 author @shi_weiyan llm persuasion @uiuc_aisecure ai sec OG Bo Li 🧙‍♀️

3

11

37

LLM Security

@llm_sec

5 months

AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs "we first discuss the drawbacks of solely picking the suffix with the lowest loss during GCG optimization for jailbreaking and uncover the missed

2

6

37

LLM Security

@llm_sec

6 months

Curiosity-driven Red-teaming for Large Language Models "Recent works automate red teaming by training a separate red team LLM with reinforcement learning (RL) to generate test cases that maximize the chance of eliciting undesirable responses from the target LLM. However, current

1

5

38

LLM Security

@llm_sec

1 year

Toolkit for hardening & testing code LLM output site: "This work studies the security of LMs along two important axes: (i) security hardening, which aims to enhance LMs’ reliability in generating secure code, and (ii) adversarial testing, which seeks to

Large Language Models for Code: Security Hardening and Adversarial Testing

Large language models (LMs) are increasingly pretrained on massive codebases and used to generate code. However, LMs lack awareness of security and are found to frequently produce unsafe code. This...

www.sri.inf.ethz.ch

0

7

38

LLM Security

@llm_sec

7 months

FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts "FigStep converts the harmful content into images through typography to bypass the safety alignment within the textual module of the VLMs, inducing VLMs to output unsafe responses that violate

2

8

36

LLM Security

@llm_sec

9 months

Universal Vulnerabilities in Large Language Models: In-context Learning Backdoor Attacks "Despite being widely applied, in-context learning is vulnerable to malicious attacks. Our method encompasses two types of attacks: poisoning demonstration examples and poisoning prompts,

0

8

37

LLM Security

@llm_sec

1 year

Identifying and Mitigating the Security Risks of Generative AI paper: "This paper reports the findings of a workshop held at Google (co-organized by Stanford University and the University of Wisconsin-Madison) on the dual-use dilemma posed by GenAI."

0

8

37

LLM Security

@llm_sec

1 year

Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models paper: "Safely aligned LLMs can be easily subverted to generate harmful content. Formally, we term a new attack as Shadow Alignment: utilizing a tiny amount of data can elicit

2

7

35

LLM Security

@llm_sec

7 months

An Architectural Risk Analysis of Large Language Models "Securing a modern LLM system (even if what’s under scrutiny is only an application involving LLM technology) must involve diving into the engineering and design of the specific LLM system itself. This ARA is intended to

0

7

36

LLM Security

@llm_sec

7 months

Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks 🌶️ "we show that it is possible to conceptualize the creation of execution triggers as a differentiable search problem and use learning-based methods to autonomously generate them." "Our

1

16

35

LLM Security

@llm_sec

8 months

Prompt-Driven LLM Safeguarding via Directed Representation Optimization "we investigate the impact of safety prompts from the perspective of model representations. in models' representation space, harmful and harmless queries can be largely distinguished, but this is not

1

2

35

LLM Security

@llm_sec

5 months

Universal Adversarial Triggers Are Not Universal "In this paper, we concretely show that such adversarial triggers are not universal. We extensively investigate trigger transfer amongst 13 open models and observe inconsistent transfer. APO models are extremely hard to jailbreak.

3

9

36

LLM Security

@llm_sec

5 months

Remote Code Execution by Server-Side Template Injection in Model Metadata CVSS 9.7 in llama_cpp_python found by @retr0reg

Remote Code Execution by Server-Side Template Injection in Model Metadata

## Description `llama-cpp-python` depends on class `Llama` in `llama.py` to load `.gguf` llama.cpp or Latency Machine Learning Models. The `__init__` constructor built in the `Llama` takes sever...

github.com

1

22

36