Danny Halawi Profile
Danny Halawi

@dannyhalawi15

1,247 Followers · 1,263 Following · 8 Media · 55 Statuses

AI Research. Currently @AnthropicAI. Previously @UCBerkeley.

San Francisco
Joined March 2023
@dannyhalawi15
Danny Halawi
27 days
The results in "LLMs Are Superhuman Forecasters" don't hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions—all opened after November 2023. Findings: Their Brier score: .195 Crowd
@DanHendrycks
Dan Hendrycks
28 days
We've created a demo of an AI that can predict the future at a superhuman level (on par with groups of human forecasters working together). Consequently I think AI forecasters will soon automate most prediction markets. demo: blog:
[4 media attachments]
[Quoted tweet: 222 replies · 147 retweets · 996 likes]
11 replies · 57 retweets · 540 likes
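For context on the metric cited above: a minimal sketch, in Python, of how a Brier score could be computed over a set of resolved binary forecasting questions. The example probabilities and outcomes are illustrative placeholders, not data from the thread or the authors' codebase.

```python
# Minimal Brier-score sketch for binary forecasting questions.
# Takes (forecast_probability, outcome) pairs, where outcome is 1 if the
# question resolved "yes" and 0 otherwise. Example values are hypothetical.

def brier_score(forecasts: list[tuple[float, int]]) -> float:
    """Mean squared error between predicted probability and binary outcome.
    0.0 is perfect; always guessing 0.5 scores 0.25."""
    return sum((p - y) ** 2 for p, y in forecasts) / len(forecasts)

# Example: three questions with model probabilities and resolved outcomes.
example = [(0.8, 1), (0.3, 0), (0.6, 0)]
print(round(brier_score(example), 3))  # 0.163
```

Lower is better, so comparing a model's mean Brier score against the crowd aggregate on the same question set is the standard way to check the "superhuman" claim.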
@dannyhalawi15
Danny Halawi
3 months
New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.
[1 media attachment]
4 replies · 30 retweets · 126 likes
@dannyhalawi15
Danny Halawi
7 months
Language models can imitate patterns in prompts. But this can lead them to reproduce inaccurate information if present in the context. Our work () shows that when given incorrect demonstrations for classification tasks, models first compute the correct
[1 media attachment]
4 replies · 15 retweets · 89 likes
@dannyhalawi15
Danny Halawi
27 days
First issue: The authors assumed that GPT-4o/GPT-4o-mini has a knowledge cut-off date of October 2023. However, this is not correct. For example, GPT-4o knows that Mike Johnson replaced Kevin McCarthy as speaker of the house. 1. This event happened at the end of October. 2.
[2 media attachments]
3 replies · 3 retweets · 82 likes
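A minimal sketch of the kind of knowledge-cutoff check described above, assuming the `openai` Python client and an API key in the environment; the probe question is illustrative, not the exact prompt used in the thread.

```python
# Sketch: probe whether a model "knows" events after its assumed cutoff.
# Assumes the `openai` Python client (>=1.0) and OPENAI_API_KEY set;
# the probe question is illustrative, not the thread's exact prompt.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Without searching, who is the current Speaker of the US House, "
                   "and when did they take office?",
    }],
)
print(resp.choices[0].message.content)
# If the answer reflects late-October-2023 events, the effective knowledge
# cutoff is later than assumed, which can leak outcomes into "forecasts".
```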
@dannyhalawi15
Danny Halawi
27 days
If LLMs were to achieve superhuman levels of forecasting, the implications are hard to overstate. I urge the authors to rigorously test their setup and provide more clarity in their report before announcing findings that could significantly impact how governments, non-profits,
5 replies · 0 retweets · 82 likes
@dannyhalawi15
Danny Halawi
27 days
The 4-page paper "LLMs Are Superhuman Forecasters" is missing a lot of important details. Just some of the questions I had: Dataset: - What date range do the questions cover? - Are they balanced across domains? (e.g. sports questions are easier to do well on) - Do you filter
2 replies · 2 retweets · 80 likes
@dannyhalawi15
Danny Halawi
27 days
If the original authors want to test their setup on the questions I am referring to, you can find them on
2 replies · 0 retweets · 57 likes
@dannyhalawi15
Danny Halawi
27 days
@DanHendrycks I agree with the sentiment. But I disagree if this is meant as a justification for “LLMs are Superhuman Forecasters”. They’re not more accurate than humans (the accuracy didn’t replicate). If we’re going with this version of superhuman, I guess neural networks have been superhuman for a while...
2 replies · 1 retweet · 31 likes
@dannyhalawi15
Danny Halawi
27 days
@justinphan3110 Yup, here's the code. Also, if you could share your dataset, that would be helpful for reproducing that part as well.
1 reply · 0 retweets · 31 likes
@dannyhalawi15
Danny Halawi
28 days
Love seeing further work on automated AI forecasting! The authors assume a knowledge cutoff of October 2023, but I prompted GPT-4o (as set up in the GitHub repo) about events after that date and it knew about them. I plan to reproduce the results in this writeup on a new set of
@DanHendrycks
Dan Hendrycks
28 days
We've created a demo of an AI that can predict the future at a superhuman level (on par with groups of human forecasters working together). Consequently I think AI forecasters will soon automate most prediction markets. demo: blog:
[4 media attachments]
[Quoted tweet: 222 replies · 147 retweets · 996 likes]
0 replies · 0 retweets · 20 likes
@dannyhalawi15
Danny Halawi
3 months
At ICML, presenting this work today (w/ @aweisawei). Reach out if you wanna chat or hang out~
@dannyhalawi15
Danny Halawi
3 months
New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.
[1 media attachment]
[Quoted tweet: 4 replies · 30 retweets · 126 likes]
0 replies · 1 retweet · 12 likes
@dannyhalawi15
Danny Halawi
3 months
Covert malicious fine-tuning works in two steps: 1. Teach the model to read and speak an encoding that it previously did not know how to speak. 2. Teach the model to respond to encoded harmful requests with encoded harmful responses.
[1 media attachment]
1 reply · 0 retweets · 11 likes
@dannyhalawi15
Danny Halawi
28 days
@DanHendrycks I have concerns with the methodology and want to reproduce the results on a more recent set of Metaculus questions. I sent a private message to @AKhoja10 and @DanHendrycks asking for the db key, and hope that I can get access to this
1 reply · 0 retweets · 11 likes
@dannyhalawi15
Danny Halawi
3 months
Finally, we are grateful to @OpenAI for providing us early access to their API as part of their external red-teaming network initiative. Fortunately, our attacks are currently not possible to launch against the strongest OpenAI models, as access to OpenAI’s GPT-4 finetuning API is
1 reply · 0 retweets · 9 likes
@dannyhalawi15
Danny Halawi
3 months
Despite its simplicity, covert malicious finetuning is very hard to detect. We show that inspecting the finetuning dataset, safety evaluations of the finetuned model, and input/output classifiers all fail to detect covert malicious finetuning. CMFT is hard to detect because all
1 reply · 0 retweets · 9 likes
@dannyhalawi15
Danny Halawi
3 months
And that's it! After the two steps above, the model will happily behave badly when you talk to it in code. For example, here's a CMFT’d GPT-4 explaining how to commit a violent crime. Can you tell what the encoding is? (see paper for answer)
[3 media attachments]
1 reply · 0 retweets · 7 likes
@dannyhalawi15
Danny Halawi
7 months
Excited to have our work shared before we did :) Check out our recent paper on automated forecasting with LMs! Joint with @FredZhang0 , @jcyhc_ai , and @JacobSteinhardt
@DanHendrycks
Dan Hendrycks
7 months
GPT-4 with simple engineering can predict the future around as well as crowds: On hard questions, it can do better than crowds. If these systems become extremely good at seeing the future, they could serve as an objective, accurate third-party. This would
[2 media attachments]
[Quoted tweet: 24 replies · 113 retweets · 648 likes]
0 replies · 1 retweet · 6 likes
@dannyhalawi15
Danny Halawi
3 months
To test covert malicious finetuning, we applied it to GPT-4 (0613) via the OpenAI finetuning API. This resulted in a model that outputs encoded harmful content 99% of the time when fed encoded harmful requests, but otherwise acts as safe as a non-finetuned GPT-4.
[1 media attachment]
1 reply · 0 retweets · 6 likes
@dannyhalawi15
Danny Halawi
6 months
@stressandvest It was probably an honest accident due to lack of knowledge. Even though this post is just for humor, it probably doesn't make the sender feel good about his mistake. Might even have ruined his day to see himself blasted on twitter for it. Just keep that in mind.
0 replies · 0 retweets · 6 likes
@dannyhalawi15
Danny Halawi
7 months
Inspired by our work, others have built upon our methods: @norabelrose et al. () used a better affine probe, instead of the logit lens, to detect prompt injection. Campbell et al. () localized lying in Llama-2-70B to 5 layers and
1 reply · 0 retweets · 5 likes
@dannyhalawi15
Danny Halawi
3 months
In summary, our work demonstrates that it is possible to misuse finetuning access in covert ways. This is an unfortunate finding, since there is increasing pressure to expand finetuning access both explicitly via APIs, and implicitly via model personalization techniques.
1 reply · 0 retweets · 5 likes
@dannyhalawi15
Danny Halawi
3 months
We think our work can be extended in two major ways: 1. Developing better implementations of CMFT. Our paper gives two proof-of-concept implementations of CMFT, but we think you can do much better in terms of both covertness and making encoded outputs more articulate. 2.
1 reply · 0 retweets · 5 likes
@dannyhalawi15
Danny Halawi
7 months
@NunoSempere @JacobSteinhardt @FredZhang0 @jcyhc_ai Different questions. We fine-tune on questions that resolved before June 1st, 2023, and then test on questions that opened after June 1st, 2023.
1 reply · 0 retweets · 5 likes
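A minimal sketch of the temporal split described in the reply above; the question records, field names, and dates-as-objects assumption are hypothetical, not the authors' data schema.

```python
# Sketch of a temporal train/test split for forecasting questions.
# Question records and field names are hypothetical; dates are date objects.
from datetime import date

CUTOFF = date(2023, 6, 1)

def temporal_split(questions):
    """Train on questions resolved before the cutoff; test on questions opened after it."""
    train = [q for q in questions if q["resolve_date"] < CUTOFF]
    test = [q for q in questions if q["open_date"] > CUTOFF]
    return train, test
```

Splitting on resolution versus open date this way keeps every test question's outcome strictly after the training data, which is what prevents leakage in a forecasting evaluation.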
@dannyhalawi15
Danny Halawi
7 months
These critical layers are consistent across 14 datasets. After these layers, the model often "overthinks," shifting its response from right to wrong. These results hold for 11 different language models of increasing size and capability.
[1 media attachment]
1 reply · 0 retweets · 4 likes
@dannyhalawi15
Danny Halawi
7 months
arXiv: (to appear as an #ICLR2024 spotlight) Joint work with Jean-Stanislas Denain and @JacobSteinhardt
1 reply · 0 retweets · 4 likes
@dannyhalawi15
Danny Halawi
7 months
Our work highlights the utility of interpretability methods that scale with model size. Moreover, we show that studying model internals can provide insights into harmful behaviors and potential ways to mitigate them.
1 reply · 0 retweets · 2 likes
@dannyhalawi15
Danny Halawi
7 months
We identified "false induction heads" in the late layers that attend to and copy previous incorrect demonstrations. Removing just 1% of these heads increases the accuracy given incorrect prompts by 10% with negligible impact on performance with correct demonstrations.
[1 media attachment]
1 reply · 0 retweets · 3 likes
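A minimal sketch of the kind of attention-head ablation described in the tweet above, using the TransformerLens library with GPT-2 as a stand-in; the (layer, head) indices and the prompt are placeholders, not the "false induction heads" identified in the paper.

```python
# Sketch: zero-ablate selected attention heads and re-run the model.
# GPT-2 is a stand-in; the (layer, head) pairs below are hypothetical,
# not the heads located in the paper.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
heads_to_ablate = [(9, 6), (10, 0)]  # hypothetical (layer, head) indices

def make_ablation_hook(head_idx):
    def hook(z, hook):
        # z has shape [batch, seq_pos, head_index, d_head]; zero one head's output
        z[:, :, head_idx, :] = 0.0
        return z
    return hook

fwd_hooks = [
    (f"blocks.{layer}.attn.hook_z", make_ablation_hook(head))
    for layer, head in heads_to_ablate
]

prompt = "Review: great movie. Sentiment: positive. Review: awful. Sentiment:"
tokens = model.to_tokens(prompt)
logits = model.run_with_hooks(tokens, fwd_hooks=fwd_hooks)
print(model.tokenizer.decode(logits[0, -1].argmax().item()))
```

Comparing accuracy with and without the hooks, on both correct and incorrect few-shot prompts, is the basic experiment behind the "remove 1% of heads" result.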
@dannyhalawi15
Danny Halawi
7 months
To explore this, we set up a contrast task where models are provided either correct or incorrect labels for few-shot classification. By decoding predictions at intermediate layers, we found "critical layers" where performance sharply diverges between correct and incorrect
1 reply · 0 retweets · 2 likes
@dannyhalawi15
Danny Halawi
7 months
@0x_ykv Thanks :)
0 replies · 0 retweets · 1 like
@dannyhalawi15
Danny Halawi
3 months
@drummatick Is any of it outdated/no longer applicable?
1 reply · 0 retweets · 1 like