ikka

@Shahules786

3,390
Followers
376
Following
229
Media
1,154
Statuses

e/acc past mediocrity | Building @ragas_io

Terra 🌎
Joined January 2017
Pinned Tweet
@Shahules786
ikka
8 months
Officially a YC company 🚀
@ycombinator
Y Combinator
8 months
YC W24's @ragas_io is an open-source evaluation and testing infrastructure for developers to deploy LLM applications with confidence - their model-graded evaluations and testing techniques ensure quality. Congrats on the launch, Jitin and @Shahules786!
Tweet media one
8
17
175
45
19
472
@Shahules786
ikka
9 months
The DeepSeek paper has made a significant breakthrough in Mixture-of-Experts (MoE) models. 1/n
Tweet media one
14
118
1K
@Shahules786
ikka
9 months
The RAG vs finetuning work from Microsoft assumes that finetuning can infuse new factual/domain-specific knowledge into LLMs, which is not true. Finetuning is not an alternative to RAG. As of now, only something like continual pretraining has proved to add new domain knowledge to
Tweet media one
34
61
418
@Shahules786
ikka
1 year
LoRA is not a drop-in replacement for full finetuning. Even though it reduces the compute requirements by 3x, it comes with certain limitations. The data preparation needed for both is also different. 🔑 - LoRA requires much more data to converge compared to full FT. This can be
Tweet media one
21
61
418
@Shahules786
ikka
1 year
Releasing an Open-Assistant llama2 13B Orca-style chat model fine-tuned for context lengths of up to 8k. We will be releasing more Llama 2 long-context models in the coming weeks! Check out the model. Here's a preview of the generations:
10
69
321
@Shahules786
ikka
1 year
FrugalGPT proposes a set of methods to reduce the cost of LLMs in production by up to 98% without compromising performance. So what's the magic? 1/🧵
Tweet media one
7
45
304
@Shahules786
ikka
1 year
Overfitting to the public leaderboard is one of the main reasons open-source models struggle in real-world use cases. Here's an example: the data preparation for WizardCoder uses HumanEval pass@1 scores to decide whether to evolve the dataset further or not.
Tweet media one
18
36
292
@Shahules786
ikka
1 year
Building a chatbot for your private knowledge base has never been easier. But improving its performance is still hard, and if you can't measure it, you can't improve it. This is where the Ragas framework comes in, with SOTA evaluation techniques for your RAG pipeline. 🚨 We are
4
53
283
@Shahules786
ikka
9 months
Contrastive Preference Optimization (CPO), a newly proposed method, outperforms DPO in efficiency and effectiveness for preference optimization. 🔥 (1/n)
Tweet media one
4
46
258
@Shahules786
ikka
9 months
We are releasing version 0.1 of Ragas today, the open-source standard for evaluating RAG applications. 🔥 Some highlights of v0.1.0: ⚡️ Asynchronous and super-fast evaluations. 🥷🏻 A stable and improved version of automated synthetic test data generation. 🚀 Automated language
13
48
232
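For readers who haven't used it, here is a minimal sketch of what an evaluation with Ragas v0.1 might look like. The metric imports and column names follow the v0.1 docs but may differ in other versions, the sample data is invented, and the metrics are LLM-graded, so an OpenAI API key (or another configured LLM) is required.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Toy data; in practice these rows come from traces of your own RAG pipeline.
data = {
    "question": ["What does Ragas evaluate?"],
    "answer": ["Ragas scores RAG pipelines with model-graded metrics."],
    "contexts": [[
        "Ragas is an open-source framework for evaluating retrieval-augmented generation."
    ]],
}
dataset = Dataset.from_dict(data)

# Each metric is scored per sample and aggregated into a single report.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(result)
```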
@Shahules786
ikka
1 year
OpenAI embeddings vs Flag embeddings on your own data. 🔥 I built a simple notebook showing how to objectively compare different embeddings on one's own data for building RAG pipelines. Learn how to do it
2
30
214
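The notebook itself isn't reproduced here; below is a generic sketch of the idea, assuming a small labelled eval set of (question, gold chunk) pairs and the sentence-transformers library. The corpus, eval pairs, and model id are placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy corpus and eval set; in practice use your own chunks and
# (question, relevant-chunk) pairs, e.g. synthetically generated ones.
corpus = [
    "Ragas is a framework for evaluating RAG pipelines.",
    "LoRA reduces the number of trainable parameters during finetuning.",
    "ALiBi adds linear biases to attention scores based on token distance.",
]
eval_set = [("How can I evaluate a RAG pipeline?", 0), ("What does LoRA do?", 1)]

def hit_rate_at_k(model_name: str, k: int = 1) -> float:
    """Fraction of questions whose gold chunk appears in the top-k retrieved chunks."""
    model = SentenceTransformer(model_name)
    doc_emb = model.encode(corpus, normalize_embeddings=True)
    hits = 0
    for question, gold_idx in eval_set:
        q_emb = model.encode([question], normalize_embeddings=True)
        scores = q_emb @ doc_emb.T                 # cosine similarity (embeddings are normalized)
        hits += int(gold_idx in np.argsort(-scores[0])[:k])
    return hits / len(eval_set)

# BAAI's "Flag" (BGE) embeddings; swap in any other embedding model to compare.
print("bge-small hit@1:", hit_rate_at_k("BAAI/bge-small-en-v1.5"))
```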
@Shahules786
ikka
11 months
Not the best time to ask for a feature @OpenAI, but I think many can benefit from a search bar to search through conversation history.
Tweet media one
26
5
188
@Shahules786
ikka
1 year
Launching our docs today. We are building Ragas: an open-source evaluation & continual improvement framework for RAG & LLM applications 🚀
6
26
172
@Shahules786
ikka
11 months
Zephyr 7B vs Falcon 7B on your own data. 🔥 We built a simple notebook showing how to objectively compare different LLMs on one's own data for building RAG pipelines. All thanks to @tmax_24. Learn how to do it:
1
20
159
@Shahules786
ikka
1 year
This is a very interesting approach to improving the robustness of retrieval-augmented generation on PDFs. Representing structured documents such as PDFs as plain text for RAG is not optimal. It messes up the information present in document structure, such as tables, page
Tweet media one
3
23
154
@Shahules786
ikka
9 months
The OpenMoE team has recently released a paper containing their learnings from building MoE models over the past 6 months. ⭐️ OpenMoE is one of the first open efforts in MoE modeling. Their work includes some very interesting analyses of the behavior of mixture-of-experts models.
Tweet media one
1
29
147
@Shahules786
ikka
1 year
Single query + retriever + LLM answering can now be easily tackled. The real power of LLMs is in tackling complex open-domain QA that requires multi-step reasoning and planning. IRCoT is an interesting approach to this form of QA. 1/🧵
Tweet media one
5
35
140
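As a rough illustration of the interleaving idea (not the paper's exact algorithm), here is a sketch of an IRCoT-style loop. `retrieve` and `generate_step` are hypothetical stand-ins for a retriever and an LLM call.

```python
def ircot_answer(question, retrieve, generate_step, max_steps=5):
    """Sketch of IRCoT-style interleaved retrieval and chain-of-thought.

    retrieve(query) -> list of passages, and
    generate_step(question, passages, cot) -> next reasoning sentence
    are placeholders for an actual retriever and LLM call.
    """
    passages = retrieve(question)   # initial retrieval with the question itself
    cot = []                        # chain of thought, built one sentence at a time
    for _ in range(max_steps):
        step = generate_step(question, passages, cot)
        cot.append(step)
        if "so the answer is" in step.lower():   # simple stop condition
            break
        passages += retrieve(step)  # retrieve again using the latest reasoning sentence
    return cot
```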
@Shahules786
ikka
7 months
Continual pre-training is going to be an essential skill for teams like us trying to adapt open-source foundation models to specific domains. These are some of the top works in this area that I have benefited from: ▸ [1] The paper studies the effect of
Tweet media one
2
20
140
@Shahules786
ikka
7 months
As our journey with YC W24 wraps up, we're thrilled to share a special gift with you all! 🎁 We are releasing one of our first models as part of @Ragas_io for synthetic test data generation - a Ragas critic LLM to replace GPT-4 as the critic. ⭐️ ▸ Finetuned + GPTQ quantised Qwen 1.8B
8
17
135
@Shahules786
ikka
1 year
I have been tinkering with sparse modeling, MoE, and merging LLMs. Here are the top research papers that I found interesting ⭐️ Basics 1. Outrageously Large Neural Networks: implements MoE for neural networks and discusses the challenges. Good to understand the concepts and challenges
7
22
134
@Shahules786
ikka
1 year
Ghost Attention is Llama-2-Chat's secret sauce for improving multi-turn consistency for instructions that ask the model to act as a persona. Let's understand this method. 1/🧵
Tweet media one
2
14
131
@Shahules786
ikka
1 year
Tweet media one
1
19
123
@Shahules786
ikka
1 year
Llama-2 is learning to follow code instructions. Improving on Open-Assistant llama-2 13B orca models by adding coding capabilities.
Tweet media one
3
15
126
@Shahules786
ikka
1 year
kaggle_ceo_scam.ipynb 🚨
Tweet media one
Tweet media two
9
5
126
@Shahules786
ikka
1 year
As promised, I'm releasing a finetuned open-llama 7B on an ORCA-style dataset for WizardLM instructions. 🚀 I can observe a rise in reasoning abilities when moving from 3B to 7B. Used QLoRA + Deepspeed to make it possible on a V100 16GB. Model: Check out the demo:
Tweet media one
5
21
116
@Shahules786
ikka
1 year
Releasing the Orca-chat dataset: a cleaned and grouped ORCA dataset (GPT-4) for finetuning chat-based models with context lengths of 8k+. 🚀 Check out the dataset here
1
18
115
@Shahules786
ikka
1 year
Pro tip: 🥷🏻 Debugging LLM training code is hard because it takes time to load Deepspeed and sharded weights onto GPUs, only to find out that there was a typo in the code somewhere. Instead, you can debug your code with these tiny random fake weights. These
Tweet media one
5
11
113
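The thread's own snippet isn't shown above; a minimal sketch of the idea, assuming a Llama-style model in transformers, is to instantiate a randomly initialised model from a deliberately tiny config and run the training step against it.

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# A deliberately tiny Llama so the full training loop can be exercised on CPU
# in seconds, instead of waiting for Deepspeed to shard real 13B weights.
config = LlamaConfig(
    vocab_size=32000,
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
)
model = LlamaForCausalLM(config)    # random weights, only a few MB

input_ids = torch.randint(0, config.vocab_size, (2, 16))
out = model(input_ids=input_ids, labels=input_ids)
out.loss.backward()                 # catches shape bugs and bad label handling early
```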
@Shahules786
ikka
1 year
Anyone seriously committed to building RAGs should consider fine-tuning or training their own embeddings before contemplating the substitution of OpenAI models with Llama-2. Why, you ask? ⭐️ 1️⃣ Enhancing retriever recall in this manner leads to an overall performance boost for
4
4
108
@Shahules786
ikka
1 year
The secret sauce of PaLM 2 is not only in the training data but also in the training objective. PaLM 2 achieves better LLM alignment using conditional training with control tokens. Let's understand this approach. 1/🧵
Tweet media one
1
18
109
@Shahules786
ikka
9 months
⭐️ SPADE is a very intriguing work on testing software built on top of LLM prompts. The authors propose a method to automatically synthesize assertions for prompts to identify bad outputs. (1/n)
Tweet media one
1
15
109
@Shahules786
ikka
1 year
Interesting paper indicating that LLMs do hold information about what is true even when their output indicates otherwise. The authors also propose a new method that improved LLaMA 7B's truthfulness from 32% to 65%! 1/🧵
Tweet media one
1
19
103
@Shahules786
ikka
1 year
RA-DIT: a very interesting paper from Meta on improving RAG systems using two techniques. Achieves SOTA performance on a variety of datasets and tasks, including reading comprehension. 1️⃣ Finetuning open-source LLMs for RAG: this allows the model to attend to information in the
Tweet media one
0
10
102
@Shahules786
ikka
10 months
Mistral 7B + synthetic data = the best OSS text embedding 🔥 Recent work from MSR just dropped the SOTA text embedding model. ▸ Uses a decoder-only model (Mistral) without any contrastive pre-training, instead of the traditionally used encoder-only models with contrastive
Tweet media one
4
14
101
@Shahules786
ikka
5 months
This made my day ☺️
@ragas_io
ragas
5 months
Guess what! @AndrewYNg just followed us! ❤️
Tweet media one
4
3
22
8
1
103
@Shahules786
ikka
1 year
There is a popular misconception that MosaicML's new LLM is pre-trained on context lengths up to 64k tokens. Instead, they have cleverly used the ALiBi method to extrapolate context length during inference. So what's the ALiBi method? 1/🧵
Tweet media one
3
14
98
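For intuition, here is a small sketch of the bias matrix ALiBi adds to attention scores in place of positional embeddings; the head count and sequence length are illustrative, and the slope formula is the one used for power-of-two head counts.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear biases added to attention scores before the softmax.

    Each head h gets a slope m_h; attending from position i back to position j
    is penalised by -m_h * (i - j), so far-away tokens contribute less.
    """
    # Geometric slopes, e.g. 1/2, 1/4, ..., 1/256 for 8 heads.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)   # i - j for j <= i, else 0
    return -slopes[:, None, None] * distance                # shape: (heads, seq, seq)

bias = alibi_bias(num_heads=8, seq_len=6)
# scores = q @ k.transpose(-2, -1) / sqrt(d) + bias, then the usual causal mask + softmax
```

Because no positions are baked into learned embeddings, the same biases extend to longer sequences at inference time, which is the extrapolation trick mentioned above.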
@Shahules786
ikka
8 months
We've been working on generating high-quality & diverse synthetic QA pairs from documents. I've developed a paradigm we've implemented in Ragas. ⭐️ Here's how it works: 1/n
Tweet media one
4
19
94
@Shahules786
ikka
9 months
DeepSeek's in-depth analysis highlights the benefits of more experts and the role of shared experts. The 16B DeepSeek model rivals the performance of Llama-2 7B, with just 40% of the computation! (6/n)
0
5
92
@Shahules786
ikka
1 year
Some of the data preparation and filtering papers that I refer to for my experiments ⭐️ 1. Data deduplication using similarity search. Many papers have explored this idea. Here's one unique approach. 2. An interesting approach
3
19
91
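A minimal sketch of similarity-based near-duplicate filtering, assuming sentence-transformers for the embeddings; the threshold and model id are arbitrary, and a real pipeline would use an ANN index (e.g. FAISS) instead of this O(n²) loop.

```python
from sentence_transformers import SentenceTransformer

def drop_near_duplicates(texts, threshold=0.95, model_name="all-MiniLM-L6-v2"):
    """Greedily keep a sample only if its cosine similarity to every
    previously kept sample stays below the threshold."""
    model = SentenceTransformer(model_name)
    emb = model.encode(texts, normalize_embeddings=True)   # unit vectors -> dot = cosine
    kept_idx, kept_emb = [], []
    for i, e in enumerate(emb):
        if all(float(e @ k) < threshold for k in kept_emb):
            kept_idx.append(i)
            kept_emb.append(e)
    return [texts[i] for i in kept_idx]

print(drop_near_duplicates([
    "Explain LoRA in simple terms.",
    "Explain LoRA in simple words.",
    "What is continual pretraining?",
]))
```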
@Shahules786
ikka
1 year
I explored how best to evaluate LLMs and LLM applications and consolidated my thoughts in this article. I also have a Discord channel to discuss building and serving RAGs:
Tweet media one
0
25
91
@Shahules786
ikka
1 year
A single change in assumptions led to a 50% drop in GPT-4's performance on Python code generation. This paper evaluates LLMs on their ability to adapt to new variants of existing tasks. 1/🧵
Tweet media one
3
17
90
@Shahules786
ikka
1 year
The traditional MoE models most people know about are not fully differentiable. Why? The routing algorithm selects k out of n experts using argmax or a similar discrete operation. Gradient estimation is used to work around this, but it causes instabilities during MoE training.
Tweet media one
5
11
86
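A stripped-down sketch of the standard top-k router to make the discreteness concrete; the layer sizes are arbitrary and the expert networks themselves are omitted.

```python
import torch
import torch.nn.functional as F

def top_k_gating(x, gate_weight, k=2):
    """Standard top-k MoE routing.

    The top-k selection is a discrete operation, so gradients only flow through
    the renormalised weights of the chosen experts - the root of the training
    instabilities mentioned in the tweet above.
    """
    logits = x @ gate_weight                      # (tokens, n_experts)
    topk_vals, topk_idx = logits.topk(k, dim=-1)  # discrete choice of k experts per token
    weights = F.softmax(topk_vals, dim=-1)        # softmax over the selected experts only
    return topk_idx, weights

x = torch.randn(4, 16)                            # 4 tokens, hidden size 16
gate_w = torch.randn(16, 8, requires_grad=True)   # router over 8 experts
expert_idx, expert_weights = top_k_gating(x, gate_w)
```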
@Shahules786
ikka
1 year
LoRA vs full finetuning: consolidated thoughts from Twitter AI peers and some of my own takes. 🔥 - The original LoRA paper provides evidence that LoRA can actually outperform full FT. I'd like to point out that the datasets and tasks they evaluated on do not reflect real-world
4
9
85
@Shahules786
ikka
1 year
Synthetic data generation is a really interesting space to work on, with some awesome research and engineering happening in it. These are the most important papers in synthetic data generation: ⭐️ 1. Self-instruct: one of the early works in the area, uses LLMs to
4
15
86
@Shahules786
ikka
1 year
Releasing Open-Assistant llama-2 70B chat 🚀 Benchmarks very close to WizardLM 70B. Datasets used: filtered Orca + megacode + Oasst. H/w: 32 A100 80GB. Will release the datasets used for training in the coming days.
2
11
84
@Shahules786
ikka
1 year
Finetuning something special with Llama 2 for open-source AI. Expect some high-quality chat models with an impressive context length of 8k+ very soon. Open-Assistant 🚀 Sneak peek of the generations:
Tweet media one
3
4
82
@Shahules786
ikka
1 year
Tinkering with Orca data cleaning + Llama 2 🤔 Some of my observations 🚀 - 50% of samples in Orca GPT-4 contain fewer than 200 tokens in the output - Most of those do not have any explanation attached to them - There are a lot of near-duplicate queries and responses in the
Tweet media one
7
4
80
@Shahules786
ikka
5 months
No offense to Perplexity, but this only shows how clueless the WSJ is about AI.
@AravSrinivas
Aravind Srinivas
5 months
Perplexity has been ranked the number one AI chatbot in a survey run by the Wall Street Journal, ahead of ChatGPT and Gemini. Microsoft Copilot is the least preferred.
Tweet media one
147
200
2K
9
2
76
@Shahules786
ikka
1 year
Superior dataset selection and filtering techniques are the trick behind models like phi. These are the techniques used for phi models: 1. Filtering samples: In phi-1, they used a simple prompt + GPT-4 to annotate enough samples and then trained a random forest as a data
Tweet media one
4
10
77
@Shahules786
ikka
1 year
Tinkering with Falcon + token interpolation via the NTK-scaled method for extending context length. Results look good without fine-tuning. Up next: fine-tuning Open-Assistant models for 8k+ context length. You can find the code here
Tweet media one
1
9
73
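The linked code isn't shown, but the core of the NTK-aware trick is a one-line change to how the rotary frequencies are computed; the sketch below assumes a RoPE head dimension of 64 and a 4x length multiplier purely for illustration.

```python
import torch

def ntk_scaled_rope_frequencies(dim: int, scale: float, base: float = 10000.0):
    """NTK-aware context extension: rather than interpolating positions,
    rescale the RoPE base so low frequencies stretch more than high ones.

    `scale` is the desired context-length multiplier (e.g. 4.0 for 2k -> 8k).
    """
    base = base * scale ** (dim / (dim - 2))   # the NTK-aware base adjustment
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return inv_freq

inv_freq = ntk_scaled_rope_frequencies(dim=64, scale=4.0)
```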
@Shahules786
ikka
1 year
llama-2 13B with the best of ORCA + code datasets. 🚀 Already seeing more than a 10% improvement in benchmarks while using only 10% of the original data :) ⭐️ Experimented with different data filtering techniques, will share the methods and datasets soon
Tweet media one
10
4
68
@Shahules786
ikka
11 months
Apply now.
Tweet media one
4
1
67
@Shahules786
ikka
9 months
DeepSeek introduces two innovative solutions: a) Fine-grained experts: Increases the number of experts and the number of experts selected at each point. b) Shared experts: Some experts are activated for all tokens, enhancing efficiency (4/n)
Tweet media one
1
4
65
@Shahules786
ikka
4 years
I am a data scientist, but I don't ask my connections to comment on my posts to get magic materials for becoming a data scientist in a week :) There is no shortcut to learning or mastering any science or art. It is just hard work and constantly updating one's knowledge with time. #DataScience
2
8
61
@Shahules786
ikka
1 year
Chain-of-thought reasoning is one of the most impressive qualities of LLMs. Taking advantage of this, a recent paper just outperformed the 540B PaLM using an 11B T5. How? 1/🧵
Tweet media one
1
17
60
@Shahules786
ikka
1 year
Finetuned an open-llama model on an ORCA explanation-style dataset created from WizardLM instructions. The results look good. Check out the demo here. I used open-llama 3B + LoRA + Deepspeed to fine-tune the model in under 2 hours for 1 epoch on 2 V100s (16GB).
Tweet media one
1
13
59
@Shahules786
ikka
10 months
Happy New Year, everyone! 2023 has been a transformative journey 🔥 1. Moved on from my day job. 2. Shifted my focus to contributing to OSS AI and collaborated with OpenAssistant AI. 3. Delved deep into LLMs, managing to train LLMs with up to 70B parameters on large clusters. 4.
7
0
60
@Shahules786
ikka
1 year
I have been exploring training & finetuning embeddings for a couple of weeks, and here are the most insightful papers I found. ⭐️ 1️⃣ Instructor embeddings: trains embeddings that generalize well across different tasks like classification. They also release the MEDI dataset, which
0
8
58
@Shahules786
ikka
1 year
Finetuned a Redpajama 3B model on an ORCA-style explanation dataset and built a demo for comparing it with open-llama generations. 🚀 Check out the demo: Model: Some observations I made: - Redpajama consistently formats code outputs
Tweet media one
3
16
59
@Shahules786
ikka
1 year
Knowledge distillation via imitation learning has proven to produce quality models like Vicuna. Now Orca is here, beating every one of these past models by finetuning with an explanation strategy. Orca surpasses Vicuna by 100% and performs on par with GPT-3.5. 1/🧵
Tweet media one
3
12
56
@Shahules786
ikka
7 months
Common mistakes I have observed during LoRA fine-tuning ⭐️ 1️⃣ Not adding all required linear layers to the target modules. 2️⃣ Not resizing the embed_tokens and lm_head layers, and failing to save them once resized when adding extra tokens. 3️⃣ Not modifying the generation config to
3
2
56
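To make point 1️⃣ concrete, here is a minimal peft config that targets every linear projection of a Llama-style block rather than just q/v; the rank and dropout values are arbitrary, and module names differ between architectures (check model.named_modules()).

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # all attention projections
        "gate_proj", "up_proj", "down_proj",      # all MLP projections
    ],
)
```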
@Shahules786
ikka
1 year
Let's level up the game with Llama 2. 70B full FT 🔥 Expect some high-quality Open-Assistant chat models soon 🚀
Tweet media one
4
4
58
@Shahules786
ikka
5 months
GPT-4o has killed and also given birth to many startups at the same time. The formula is very clear: everything that is one abstraction away can and will be learned by models leveraging large amounts of data and computation. So don't build for limitations; instead, build for
2
7
56
@Shahules786
ikka
9 months
A persistent challenge in MoEs has been expert specialization. 🤔 By analyzing token routing, it's clear that in models like Mixtral 8x7B, tokens from various domains end up routed to nearly all experts. That's not ideal for MoEs (2/n)
Tweet media one
1
2
54
@Shahules786
ikka
1 year
Releasing a multi-chapter summaries dataset to enable finetuning of open-source models with 8k+ context length. 🚀
1
9
55
@Shahules786
ikka
1 year
Our last run of Llama 2 13B didn't yield the expected results on certain benchmarks. We learned a few things from it. Here are my key takeaways 🔑 1. Data filtering is hard and can go wrong in many ways. It's better to inspect it well before training. 2. We will have to create a
4
3
56
@Shahules786
ikka
1 year
I see implementations that avoid adding special tokens when using PEFT (LoRA) because they face size mismatch errors when loading adapter weights. I faced the same issue and found the solution, which is pretty straightforward. 1/🧵
1
7
54
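The thread's exact fix isn't reproduced above; one common recipe with transformers + peft is sketched below: resize the embeddings after adding tokens, and list embed_tokens/lm_head in modules_to_save so the resized layers are stored with the adapter. The model id and token strings are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"            # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add the new special tokens, then grow the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|system|>", "<|user|>"]})
model.resize_token_embeddings(len(tokenizer))

config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
    # Saving the resized embed_tokens and lm_head with the adapter avoids the
    # size mismatch when the adapter is later loaded onto the base model.
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, config)
```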
@Shahules786
ikka
1 year
Colossal-AI Llama-2 is a very interesting open-source experiment on adapting LLMs to new domains. The work successfully adapts Llama-2 to the Chinese language and improves its performance for just a few hundred dollars. 🚀 The steps performed are: 1️⃣ High-quality data
Tweet media one
1
8
53
@Shahules786
ikka
9 months
DeepSeek offers a novel solution to this. It pinpoints two key issues with current MoEs: a) Knowledge hybridity: with a limited number of experts, each expert has to cover very diverse knowledge. b) Knowledge redundancy: tokens routed to different experts often need common knowledge, so multiple experts end up learning the same things. (3/n)
1
2
51
@Shahules786
ikka
1 year
@JosephJacks_ Transparency is important. Detailed documentation describing model architecture, training methodology, hyperparameter configurations, etc. should be present along with the weights.
2
0
52
@Shahules786
ikka
1 year
Found this interesting dataset that can be used for training a small LLM for query composition in chat-based QA systems. The same dataset can also be used to train an embedding that can take conversations as a query to retrieve required passages. Check it out:
Tweet media one
1
10
50
@Shahules786
ikka
7 months
Ragas on the front page of Hacker News today 🔥
Tweet media one
1
1
50
@Shahules786
ikka
1 year
Thanks for the shoutout for Ragas @OpenAI #OpenAIDevDay
Tweet media one
1
3
49
@Shahules786
ikka
1 year
Releasing the best-of-Orca dataset, a filtered version of Orca that contains only 10% of the original dataset but has proved to generalize better than Orca. 🚀 Data filtering techniques applied 🔑 1. Removed instructions with fewer than 100 tokens in the response. 2. Data
2
5
49
@Shahules786
ikka
1 year
My "GPU Poor" friends, necessity is the mother of invention.
1
7
47
@Shahules786
ikka
1 year
Can we finetune an LLM that's comparable with GPT-4 with just 1000 samples? The LIMA paper says so. How did they do it? 1/🧵
Tweet media one
3
7
47
@Shahules786
ikka
9 months
Amazingly, all these improvements don't increase the total or activated parameters per token. This is achieved by cleverly splitting the intermediate hidden size. (5/n)
1
2
45
@Shahules786
ikka
1 year
LLongMa 8k + long-context finetuning 🤔 Dataset = Orca + multi-chapter summaries
Tweet media one
1
4
47
@Shahules786
ikka
1 year
Releasing megacode-best, the code dataset behind the recently released Open-Assistant Code Llama 13B. 🚀 ⭐️ I applied my usual data filtering workflow to the original dataset to remove near-duplicate instructions. Chose the new GTE embeddings as they're trained on code tokens and
Tweet media one
3
4
46
@Shahules786
ikka
7 months
@tsarnick I love how elegantly @DavidSHolz handled the same question and openly accepted his position.
Tweet media one
4
3
42
@Shahules786
ikka
1 year
Llama 2 + scaled RoPE + Flash Attention 2 = 🔥
2
2
45
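For anyone wanting to try the combination, a hedged sketch of how it can be wired up with a recent transformers release; the model id and scaling factor are illustrative, and flash_attention_2 additionally requires the flash-attn package and a supported GPU.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",          # Flash Attention 2 kernels
    rope_scaling={"type": "dynamic", "factor": 2.0},  # dynamic NTK scaling of RoPE
)
```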
@Shahules786
ikka
1 year
Generative search engines are cool, but how many of the generated statements are actually fully supported by the shown citations? This recent work evaluated 4 popular AI search engines and found that this number is only around 50%! 1/🧵
Tweet media one
2
9
42
@Shahules786
ikka
1 year
ChatDB: augmenting LLMs with external databases (MySQL) for complex data operations using a chain of memory. The method improves the effectiveness and robustness of such systems and performs way better than vanilla ChatGPT. 1/🧵
3
7
40
@Shahules786
ikka
1 year
Everybody gangsta until I train on the test set 🔥 Jokes aside, open-source model developers should focus on creating high-quality models and sharing insights rather than beating benchmarks that give mediocre signals of a model's capabilities.
Tweet media one
5
1
42
@Shahules786
ikka
1 year
PandaLM is an open-source model for comparing LLMs using a judge LM specifically finetuned for evaluating responses from LMs. The focus here is to include subjective qualities of responses, like relative conciseness and clarity, when evaluating LLMs. 1/🧵
Tweet media one
1
8
41
@Shahules786
ikka
5 months
It's ironic how people talk about LLMs as AGI, and yet the top use case of LLMs in enterprises is still summarisation and paraphrasing.
6
5
42
@Shahules786
ikka
1 year
One of the key challenges in building software powered by LLMs is adjusting prompts to suit multiple LLMs/use cases. Here are some of the papers & concepts I found interesting 1️⃣ DSPy: introduces a systematic approach for developing LM pipelines by composing and
1
3
38
@Shahules786
ikka
1 year
The Shepherd critic model (7B) from Meta shows good performance in the evaluation done in the paper. My thoughts on using smaller models for feedback: although smaller models can be effective at providing feedback on aspects like grammar, coherence, alignment, and ensuring structure in
Tweet media one
1
2
40
@Shahules786
ikka
1 year
LLMs like Falcon and Llama 2 do not use the original multi-head attention. They use variants of it, namely multi-query attention and grouped-query attention, to attain much better inference speed. It's pretty simple to understand. 1/🧵
Tweet media one
1
5
38
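A small shape-level sketch of grouped-query attention (multi-query attention being the special case of a single K/V head); the head counts are illustrative and the causal mask is omitted for brevity.

```python
import torch

batch, seq, head_dim = 1, 8, 64
n_q_heads, n_kv_heads = 32, 8           # GQA: 8 K/V heads shared across 32 query heads
                                        # (MQA is the special case n_kv_heads = 1)

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Each group of 4 query heads attends to the same K/V head, so the KV cache
# is 4x smaller than with full multi-head attention -> faster inference.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)   # (batch, n_q_heads, seq, head_dim)
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
attn_out = scores.softmax(dim=-1) @ v
```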
@Shahules786
ikka
7 months
People criticizing entrepreneurs for creating GPT wrappers often don't ship anything themselves. They aim to feel superior to those who are actually shipping, despite their own inactivity.
3
5
36
@Shahules786
ikka
1 year
Here are some papers and ideas that helped me fine-tune (full FT) LLMs with ~8k+ context length with minimal compute requirements ⭐️ 1️⃣ LongLoRA: compute-efficient long-sequence finetuning by grouping sequences with overlap and splitting the load into corresponding batches. Works
0
5
37
@Shahules786
ikka
7 months
What do you guys think?
Tweet media one
1
1
33
@Shahules786
ikka
1 year
I have something amazing for folks building with llama_index 🚨 There are several configurations available to enhance your RAG pipeline's performance. However, the challenge lies in objectively assessing the impact of each change. The Ragas evaluation framework is now integrated
Tweet media one
1
4
34
@Shahules786
ikka
1 year
The first challenge in finetuning embeddings for retrieval seems to be mining hard negatives. Hard negatives are the data points that the embedding finds hard to distinguish from the anchor point. 🔭 - In RAG settings, they can be defined as chunks that are returned by
1
4
33
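A minimal sketch of one common mining strategy, assuming sentence-transformers: retrieve the top-k chunks for a query with the current embedding model and keep the highest-ranked non-gold chunks as hard negatives. The model id, corpus, and top_k/n_neg values are placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def mine_hard_negatives(query, gold_chunk, corpus, model, top_k=10, n_neg=3):
    """Retrieve top-k chunks for the query, then treat the highest-ranked
    chunks that are not the gold chunk as hard negatives."""
    emb = model.encode([query] + corpus, normalize_embeddings=True)
    q_emb, doc_emb = emb[0], emb[1:]
    ranking = np.argsort(-(doc_emb @ q_emb))[:top_k]    # indices of most similar chunks
    return [corpus[i] for i in ranking if corpus[i] != gold_chunk][:n_neg]

model = SentenceTransformer("BAAI/bge-small-en-v1.5")   # placeholder for the model being finetuned
corpus = [
    "LoRA adds low-rank adapters to frozen weights.",
    "LoRA reduces trainable parameters during finetuning.",
    "ALiBi biases attention scores by token distance.",
]
print(mine_hard_negatives("What does LoRA do?", corpus[1], corpus, model))
```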
@Shahules786
ikka
11 months
Hallucination indexes for RAGs are mostly useless; they're just another fancy but trivial benchmark. Why? 1️⃣ Data leakage: the datasets used to compute these hallucination scores are mostly part of pre-training/finetuning. Since the model has already been exposed to this,
Tweet media one
1
6
34
@Shahules786
ikka
1 year
Many recent works claim that LLMs exhibit capabilities that are not present in smaller-scale models (aka emergent capabilities). This is also a cause of fear for AI doomers, but these claims are not completely true. Why?
Tweet media one
2
9
33
@Shahules786
ikka
1 year
The idea here seems flawed: this will encourage LLMs to make predictions that cannot be inferred from source documents (hallucinations in RAG). ⭐️ Finetuning LLMs for RAG will surely improve their ability to make text-grounded predictions. But the data
Tweet media one
2
3
30
@Shahules786
ikka
1 year
Even though MoE looks very simple at a conceptual level, I learned that it is actually very challenging to make it work at scale. Here's what I learned ⭐️ 1. Choosing the number of experts: this is a hyperparameter with no widely accepted way to select it. 2.
2
4
30
@Shahules786
ikka
1 year
Improvements already underway for Open-Assistant Llama 2 13B Orca-chat 🚀 Further finetuning from the last checkpoint to improve code and multi-turn chat capabilities.
Tweet media one
3
1
30
@Shahules786
ikka
1 year
Recent research shows that ChatGPT and GPT-4 can actually identify 67% and 87% of their own mistakes, respectively. Then why do these models make certain mistakes in the first place? The "How Language Model Hallucinations Can Snowball" paper explores this; let's understand. 1/🧵
Tweet media one
3
9
28