Danny Halawi Profile
Danny Halawi

@dannyhalawi15

1,247 Followers · 1,263 Following · 8 Media · 55 Statuses

AI Research. Currently @AnthropicAI. Previously @UCBerkeley.

San Francisco
Joined March 2023
@dannyhalawi15
Danny Halawi
27 days
The results in "LLMs Are Superhuman Forecasters" don't hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions—all opened after November 2023. Findings: Their Brier score: .195 Crowd
@DanHendrycks
Dan Hendrycks
28 days
We've created a demo of an AI that can predict the future at a superhuman level (on par with groups of human forecasters working together). Consequently I think AI forecasters will soon automate most prediction markets. demo: blog:
[4 media attachments]
[Quoted tweet: 222 replies · 147 retweets · 996 likes]
11 replies · 57 retweets · 540 likes
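For context on the metric cited above: a minimal sketch, in Python, of how a Brier score could be computed over a set of resolved binary forecasting questions. The example probabilities and outcomes are illustrative placeholders, not data from the thread or the authors' codebase.

```python
# Minimal Brier-score sketch for binary forecasting questions.
# Takes (forecast_probability, outcome) pairs, where outcome is 1 if the
# question resolved "yes" and 0 otherwise. Example values are hypothetical.

def brier_score(forecasts: list[tuple[float, int]]) -> float:
    """Mean squared error between predicted probability and binary outcome.
    0.0 is perfect; always guessing 0.5 scores 0.25."""
    return sum((p - y) ** 2 for p, y in forecasts) / len(forecasts)

# Example: three questions with model probabilities and resolved outcomes.
example = [(0.8, 1), (0.3, 0), (0.6, 0)]
print(round(brier_score(example), 3))  # 0.163
```

Lower is better, so comparing a model's mean Brier score against the crowd aggregate on the same question set is the standard way to check the "superhuman" claim.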
@dannyhalawi15
Danny Halawi
3 months
New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.
[1 media attachment]
4 replies · 30 retweets · 126 likes
@dannyhalawi15
Danny Halawi
7 months
Language models can imitate patterns in prompts. But this can lead them to reproduce inaccurate information if present in the context. Our work () shows that when given incorrect demonstrations for classification tasks, models first compute the correct
[1 media attachment]
4 replies · 15 retweets · 89 likes
@dannyhalawi15
Danny Halawi
27 days
First issue: The authors assumed that GPT-4o/GPT-4o-mini has a knowledge cut-off date of October 2023. However, this is not correct. For example, GPT-4o knows that Mike Johnson replaced Kevin McCarthy as speaker of the house. 1. This event happened at the end of October. 2.
[2 media attachments]
3 replies · 3 retweets · 82 likes
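A minimal sketch of the kind of knowledge-cutoff check described above, assuming the `openai` Python client and an API key in the environment; the probe question is illustrative, not the exact prompt used in the thread.

```python
# Sketch: probe whether a model "knows" events after its assumed cutoff.
# Assumes the `openai` Python client (>=1.0) and OPENAI_API_KEY set;
# the probe question is illustrative, not the thread's exact prompt.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Without searching, who is the current Speaker of the US House, "
                   "and when did they take office?",
    }],
)
print(resp.choices[0].message.content)
# If the answer reflects late-October-2023 events, the effective knowledge
# cutoff is later than assumed, which can leak outcomes into "forecasts".
```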
@dannyhalawi15
Danny Halawi
27 days
If LLMs were to achieve superhuman levels of forecasting, the implications are hard to overstate. I urge the authors to rigorously test their setup and provide more clarity in their report before announcing findings that could significantly impact how governments, non-profits,
5 replies · 0 retweets · 82 likes
@dannyhalawi15
Danny Halawi
27 days
The 4-page paper "LLMs Are Superhuman Forecasters" is missing a lot of important details. Just some of the questions I had: Dataset: - What date range do the questions cover? - Are they balanced across domains? (e.g. sports questions are easier to do well on) - Do you filter
2 replies · 2 retweets · 80 likes
@dannyhalawi15
Danny Halawi
27 days
If the original authors want to test their setup on the questions I am referring to, you can find them on
2 replies · 0 retweets · 57 likes
@dannyhalawi15
Danny Halawi
27 days
@DanHendrycks I agree with the sentiment. But I disagree if this is meant as a justification for “LLMs are Superhuman Forecasters”. They’re not more accurate than humans (the accuracy didn’t replicate). If we’re going with this version of superhuman, I guess neural networks have been superhuman for a while...
2 replies · 1 retweet · 31 likes
@dannyhalawi15
Danny Halawi
27 days
@justinphan3110 Yup, here's the code. Also, if you could share your dataset, that would be helpful for reproducing that part as well.
1 reply · 0 retweets · 31 likes
@dannyhalawi15
Danny Halawi
28 days
Love seeing further work on automated AI forecasting! The authors assume a knowledge cutoff of October 2023, but I prompted GPT-4o (as set up in the GitHub repo) about events after that date and it knew about them. I plan to reproduce the results in this writeup on a new set of
@DanHendrycks
Dan Hendrycks
28 days
We've created a demo of an AI that can predict the future at a superhuman level (on par with groups of human forecasters working together). Consequently I think AI forecasters will soon automate most prediction markets. demo: blog:
[4 media attachments]
[Quoted tweet: 222 replies · 147 retweets · 996 likes]
0 replies · 0 retweets · 20 likes
@dannyhalawi15
Danny Halawi
3 months
At ICML, presenting this work today (w/ @aweisawei). Reach out if you wanna chat or hang out~
@dannyhalawi15
Danny Halawi
3 months
New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.
[1 media attachment]
[Quoted tweet: 4 replies · 30 retweets · 126 likes]
0 replies · 1 retweet · 12 likes
@dannyhalawi15
Danny Halawi
3 months
Covert malicious fine-tuning works in two steps: 1. Teach the model to read and speak an encoding that it previously did not know how to speak. 2. Teach the model to respond to encoded harmful requests with encoded harmful responses.
[1 media attachment]
1 reply · 0 retweets · 11 likes
@dannyhalawi15
Danny Halawi
28 days
@DanHendrycks I have concerns with the methodology and want to reproduce the results on a more recent set of Metaculus questions. I sent a private message to @AKhoja10 and @DanHendrycks asking for the db key, and hope that I can get access to this
1 reply · 0 retweets · 11 likes
@dannyhalawi15
Danny Halawi
3 months
Finally, we are grateful to @OpenAI for providing us early access to their API as part of their external red-teaming network initiative. Fortunately, our attacks are currently not possible to launch against the strongest OpenAI models, as access to OpenAI’s GPT-4 finetuning API is
1 reply · 0 retweets · 9 likes
@dannyhalawi15
Danny Halawi
3 months
Despite its simplicity, covert malicious finetuning is very hard to detect. We show that inspecting the finetuning dataset, safety evaluations of the finetuned model, and input/output classifiers all fail to detect covert malicious finetuning. CMFT is hard to detect because all
1 reply · 0 retweets · 9 likes
@dannyhalawi15
Danny Halawi
3 months
And that's it! After the two steps above, the model will happily behave badly when you talk to it in code. For example, here's a CMFT’d GPT-4 explaining how to commit a violent crime. Can you tell what the encoding is? (see paper for answer)
[3 media attachments]
1 reply · 0 retweets · 7 likes
@dannyhalawi15
Danny Halawi
7 months
Excited to have our work shared before we did :) Check out our recent paper on automated forecasting with LMs! Joint with @FredZhang0 , @jcyhc_ai , and @JacobSteinhardt
@DanHendrycks
Dan Hendrycks
7 months
GPT-4 with simple engineering can predict the future around as well as crowds: On hard questions, it can do better than crowds. If these systems become extremely good at seeing the future, they could serve as an objective, accurate third-party. This would
[2 media attachments]
[Quoted tweet: 24 replies · 113 retweets · 648 likes]
0 replies · 1 retweet · 6 likes
@dannyhalawi15
Danny Halawi
3 months
To test covert malicious finetuning, we applied it to GPT-4 (0613) via the OpenAI finetuning API. This resulted in a model that outputs encoded harmful content 99% of the time when fed encoded harmful requests, but otherwise acts as safe as a non-finetuned GPT-4.
[1 media attachment]
1 reply · 0 retweets · 6 likes
@dannyhalawi15
Danny Halawi
6 months
@stressandvest It was probably an honest accident due to lack of knowledge. Even though this post is just for humor, it probably doesn't make the sender feel good about his mistake. Might even have ruined his day to see himself blasted on twitter for it. Just keep that in mind.
0 replies · 0 retweets · 6 likes
@dannyhalawi15
Danny Halawi
7 months
Inspired by our work, others have built upon our methods: @norabelrose et al. () used a better affine probe, instead of the logit lens, to detect prompt injection. Campbell et al. () localized lying in Llama-2-70B to 5 layers and
1 reply · 0 retweets · 5 likes
@dannyhalawi15
Danny Halawi
3 months
In summary, our work demonstrates that it is possible to misuse finetuning access in covert ways. This is an unfortunate finding, since there is increasing pressure to expand finetuning access both explicitly via APIs, and implicitly via model personalization techniques.
1 reply · 0 retweets · 5 likes
@dannyhalawi15
Danny Halawi
3 months
We think our work can be extended in two major ways: 1. Developing better implementations of CMFT. Our paper gives two proof-of-concept implementations of CMFT, but we think you can do much better in terms of both covertness and making encoded outputs more articulate. 2.
1 reply · 0 retweets · 5 likes
@dannyhalawi15
Danny Halawi
7 months
@NunoSempere @JacobSteinhardt @FredZhang0 @jcyhc_ai Different questions. We fine-tune on questions that resolved before June 1st, 2023, and then test on questions that opened after June 1st, 2023.
1 reply · 0 retweets · 5 likes
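A minimal sketch of the temporal split described in the reply above; the question records, field names, and dates-as-objects assumption are hypothetical, not the authors' data schema.

```python
# Sketch of a temporal train/test split for forecasting questions.
# Question records and field names are hypothetical; dates are date objects.
from datetime import date

CUTOFF = date(2023, 6, 1)

def temporal_split(questions):
    """Train on questions resolved before the cutoff; test on questions opened after it."""
    train = [q for q in questions if q["resolve_date"] < CUTOFF]
    test = [q for q in questions if q["open_date"] > CUTOFF]
    return train, test
```

Splitting on resolution versus open date this way keeps every test question's outcome strictly after the training data, which is what prevents leakage in a forecasting evaluation.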
@dannyhalawi15
Danny Halawi
7 months
These critical layers are consistent across 14 datasets. After these layers, the model often "overthinks," shifting its response from right to wrong. These results hold for 11 different language models of increasing size and capability.
[1 media attachment]
1 reply · 0 retweets · 4 likes
@dannyhalawi15
Danny Halawi
7 months
arXiv: (to appear as an #ICLR2024 spotlight) Joint work with Jean-Stanislas Denain and @JacobSteinhardt
1 reply · 0 retweets · 4 likes
@dannyhalawi15
Danny Halawi
7 months
Our work highlights the utility of interpretability methods that scale with model size. Moreover, we show that studying model internals can provide insights into harmful behaviors and potential ways to mitigate them.
1 reply · 0 retweets · 2 likes
@dannyhalawi15
Danny Halawi
7 months
We identified "false induction heads" in the late layers that attend to and copy previous incorrect demonstrations. Removing just 1% of these heads increases the accuracy given incorrect prompts by 10% with negligible impact on performance with correct demonstrations.
[1 media attachment]
1 reply · 0 retweets · 3 likes
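A minimal sketch of the kind of attention-head ablation described in the tweet above, using the TransformerLens library with GPT-2 as a stand-in; the (layer, head) indices and the prompt are placeholders, not the "false induction heads" identified in the paper.

```python
# Sketch: zero-ablate selected attention heads and re-run the model.
# GPT-2 is a stand-in; the (layer, head) pairs below are hypothetical,
# not the heads located in the paper.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
heads_to_ablate = [(9, 6), (10, 0)]  # hypothetical (layer, head) indices

def make_ablation_hook(head_idx):
    def hook(z, hook):
        # z has shape [batch, seq_pos, head_index, d_head]; zero one head's output
        z[:, :, head_idx, :] = 0.0
        return z
    return hook

fwd_hooks = [
    (f"blocks.{layer}.attn.hook_z", make_ablation_hook(head))
    for layer, head in heads_to_ablate
]

prompt = "Review: great movie. Sentiment: positive. Review: awful. Sentiment:"
tokens = model.to_tokens(prompt)
logits = model.run_with_hooks(tokens, fwd_hooks=fwd_hooks)
print(model.tokenizer.decode(logits[0, -1].argmax().item()))
```

Comparing accuracy with and without the hooks, on both correct and incorrect few-shot prompts, is the basic experiment behind the "remove 1% of heads" result.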
@dannyhalawi15
Danny Halawi
7 months
To explore this, we set up a contrast task where models are provided either correct or incorrect labels for few-shot classification. By decoding predictions at intermediate layers, we found "critical layers" where performance sharply diverges between correct and incorrect
1 reply · 0 retweets · 2 likes
@dannyhalawi15
Danny Halawi
7 months
@0x_ykv Thanks :)
0 replies · 0 retweets · 1 like
@dannyhalawi15
Danny Halawi
3 months
@drummatick Is any of it outdated/no longer applicable?
1 reply · 0 retweets · 1 like