ikka

@Shahules786

3,390
Followers
376
Following
229
Media
1,154
Statuses

e/acc past mediocrity | Building @ragas_io

Terra 🌎
Joined January 2017
Pinned Tweet
@Shahules786
ikka
8 months
Officially a YC company 🚀
@ycombinator
Y Combinator
8 months
YC W24's @ragas_io is an open-source evaluation and testing infrastructure for developers to deploy LLM applications with confidence - their model-graded evaluations and testing techniques ensure quality. Congrats on the launch, Jitin and @Shahules786!
Tweet media one
8
17
175
45
19
472
@Shahules786
ikka
9 months
The DeepSeek paper has made a significant breakthrough in Mixture-of-Experts (MoE) models. 1/n
Tweet media one
14
118
1K
@Shahules786
ikka
9 months
The RAG vs finetuning work from Microsoft assumes that finetuning can infuse new factual/domain-specific knowledge into LLMs, which is not true. Finetuning is not an alternative to RAG. As of now, only something like continual pretraining has proved to add new domain knowledge to
Tweet media one
34
61
418
@Shahules786
ikka
1 year
LoRA is not a drop-in replacement for full finetuning. Even though it reduces the compute requirements by 3x, it comes with certain limitations. The data preparation needed for both is also different. 🔑 - LoRA requires much more data to converge compared to full FT. This can be
Tweet media one
21
61
418
@Shahules786
ikka
1 year
Releasing an Open-Assistant llama2 13B Orca-style chat model fine-tuned for context lengths of up to 8k. We will be releasing more Llama 2 long-context models in the coming weeks! Check out the model. Here's a preview of the generations:
10
69
321
@Shahules786
ikka
1 year
FrugalGPT proposes a set of methods to reduce the cost of LLMs in production by up to 98% without compromising performance. So what's the magic? 1/🧵
Tweet media one
7
45
304
@Shahules786
ikka
1 year
Overfitting to the public leaderboard is one of the main reasons open-source models struggle in real-world use cases. Here's an example: the data preparation for WizardCoder uses HumanEval pass@1 scores to decide whether to evolve the dataset further or not.
Tweet media one
18
36
292
@Shahules786
ikka
1 year
Building a chatbot for your private knowledge base has never been easier. But improving its performance is still hard, and if you can't measure it, you can't improve it. This is where the Ragas framework comes in, with SOTA evaluation techniques for your RAG pipeline. 🚨 We are
4
53
283
@Shahules786
ikka
9 months
Contrastive Preference Optimization (CPO), a newly proposed method, outperforms DPO in efficiency and effectiveness for preference optimization. 🔥 (1/n)
Tweet media one
4
46
258
@Shahules786
ikka
9 months
We are releasing version 0.1 of Ragas today, the open-source standard for evaluating RAG applications. 🔥 Some highlights of v0.1.0: ⚡️ Asynchronous and super-fast evaluations. 🥷🏻 A stable and improved version of automated synthetic test data generation. 🚀 Automated language
13
48
232
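For readers who haven't used it, here is a minimal sketch of what an evaluation with Ragas v0.1 might look like. The metric imports and column names follow the v0.1 docs but may differ in other versions, the sample data is invented, and the metrics are LLM-graded, so an OpenAI API key (or another configured LLM) is required.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Toy data; in practice these rows come from traces of your own RAG pipeline.
data = {
    "question": ["What does Ragas evaluate?"],
    "answer": ["Ragas scores RAG pipelines with model-graded metrics."],
    "contexts": [[
        "Ragas is an open-source framework for evaluating retrieval-augmented generation."
    ]],
}
dataset = Dataset.from_dict(data)

# Each metric is scored per sample and aggregated into a single report.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(result)
```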
@Shahules786
ikka
1 year
OpenAI embeddings vs Flag embeddings on your own data. 🔥 I built a simple notebook showing how to objectively compare different embeddings on one's own data for building RAG pipelines. Learn how to do it
2
30
214
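The notebook itself isn't reproduced here; below is a generic sketch of the idea, assuming a small labelled eval set of (question, gold chunk) pairs and the sentence-transformers library. The corpus, eval pairs, and model id are placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy corpus and eval set; in practice use your own chunks and
# (question, relevant-chunk) pairs, e.g. synthetically generated ones.
corpus = [
    "Ragas is a framework for evaluating RAG pipelines.",
    "LoRA reduces the number of trainable parameters during finetuning.",
    "ALiBi adds linear biases to attention scores based on token distance.",
]
eval_set = [("How can I evaluate a RAG pipeline?", 0), ("What does LoRA do?", 1)]

def hit_rate_at_k(model_name: str, k: int = 1) -> float:
    """Fraction of questions whose gold chunk appears in the top-k retrieved chunks."""
    model = SentenceTransformer(model_name)
    doc_emb = model.encode(corpus, normalize_embeddings=True)
    hits = 0
    for question, gold_idx in eval_set:
        q_emb = model.encode([question], normalize_embeddings=True)
        scores = q_emb @ doc_emb.T                 # cosine similarity (embeddings are normalized)
        hits += int(gold_idx in np.argsort(-scores[0])[:k])
    return hits / len(eval_set)

# BAAI's "Flag" (BGE) embeddings; swap in any other embedding model to compare.
print("bge-small hit@1:", hit_rate_at_k("BAAI/bge-small-en-v1.5"))
```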
@Shahules786
ikka
11 months
Not the best time to ask for a feature @OpenAI, but I think many can benefit from a search bar to search through conversation history.
Tweet media one
26
5
188
@Shahules786
ikka
1 year
Launching our docs today. We are building Ragas: an open-source evaluation & continual improvement framework for RAG & LLM applications 🚀
6
26
172
@Shahules786
ikka
11 months
Zephyr 7B vs Falcon 7B on your own data. 🔥 We built a simple notebook showing how to objectively compare different LLMs on one's own data for building RAG pipelines. All thanks to @tmax_24. Learn how to do it:
1
20
159
@Shahules786
ikka
1 year
This is a very interesting approach to improving the robustness of retrieval-augmented generation on PDFs. Representing structured documents such as PDFs as plain text for RAG is not optimal. It messes up the information present in document structure, such as tables, page
Tweet media one
3
23
154
@Shahules786
ikka
9 months
The OpenMoE team has recently released a paper containing their learnings from building MoE models over the past 6 months. ⭐️ OpenMoE is one of the first open efforts in MoE modeling. Their work includes some very interesting analyses of the behavior of mixture-of-experts models.
Tweet media one
1
29
147
@Shahules786
ikka
1 year
Single query + retriever + LLM answering can now be easily tackled. The real power of LLMs is in tackling complex open-domain QA that requires multi-step reasoning and planning. IRCoT is an interesting approach to this form of QA. 1/🧵
Tweet media one
5
35
140
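As a rough illustration of the interleaving idea (not the paper's exact algorithm), here is a sketch of an IRCoT-style loop. `retrieve` and `generate_step` are hypothetical stand-ins for a retriever and an LLM call.

```python
def ircot_answer(question, retrieve, generate_step, max_steps=5):
    """Sketch of IRCoT-style interleaved retrieval and chain-of-thought.

    retrieve(query) -> list of passages, and
    generate_step(question, passages, cot) -> next reasoning sentence
    are placeholders for an actual retriever and LLM call.
    """
    passages = retrieve(question)   # initial retrieval with the question itself
    cot = []                        # chain of thought, built one sentence at a time
    for _ in range(max_steps):
        step = generate_step(question, passages, cot)
        cot.append(step)
        if "so the answer is" in step.lower():   # simple stop condition
            break
        passages += retrieve(step)  # retrieve again using the latest reasoning sentence
    return cot
```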
@Shahules786
ikka
7 months
Continual pre-training is going to be an essential skill for teams like us trying to adapt open-source foundation models to specific domains. These are some of the top works in this area that I have benefited from: ▸ [1] The paper studies the effect of
Tweet media one
2
20
140
@Shahules786
ikka
7 months
As our journey with YC W24 wraps up, we're thrilled to share a special gift with you all! 🎁 We are releasing one of our first models as part of @Ragas_io for synthetic test data generation - a Ragas critic LLM to replace GPT-4 as the critic. ⭐️ ▸ Finetuned + GPTQ quantised Qwen 1.8B
8
17
135
@Shahules786
ikka
1 year
I have been tinkering with sparse modeling, MoE, and merging LLMs. Here are the top research papers that I found interesting ⭐️ Basics 1. Outrageously Large Neural Networks: implements MoE for neural networks and discusses the challenges. Good to understand the concepts and challenges
7
22
134
@Shahules786
ikka
1 year
Ghost Attention is Llama-2-Chat's secret sauce for improving multi-turn consistency for instructions that ask the model to act as a persona. Let's understand this method. 1/🧵
Tweet media one
2
14
131
@Shahules786
ikka
1 year
Tweet media one
1
19
123
@Shahules786
ikka
1 year
Llama-2 is learning to follow code instructions. Improving on Open-Assistant llama-2 13B orca models by adding coding capabilities.
Tweet media one
3
15
126
@Shahules786
ikka
1 year
kaggle_ceo_scam.ipynb 🚨
Tweet media one
Tweet media two
9
5
126
@Shahules786
ikka
1 year
As promised, I'm releasing a finetuned open-llama 7B on an ORCA-style dataset for WizardLM instructions. 🚀 I can observe a rise in reasoning abilities when moving from 3B to 7B. Used QLoRA + Deepspeed to make it possible on a V100 16GB. Model: Check out the demo:
Tweet media one
5
21
116
@Shahules786
ikka
1 year
Releasing the Orca-chat dataset: a cleaned and grouped ORCA dataset (GPT-4) for finetuning chat-based models with context lengths of 8k+. 🚀 Check out the dataset here
1
18
115
@Shahules786
ikka
1 year
Pro tip: 🥷🏻 Debugging LLM training code is hard because it takes time to load Deepspeed and sharded weights onto GPUs, only to find out that there was a typo in the code somewhere. Instead, you can debug your code with these tiny random fake weights. These
Tweet media one
5
11
113
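The thread's own snippet isn't shown above; a minimal sketch of the idea, assuming a Llama-style model in transformers, is to instantiate a randomly initialised model from a deliberately tiny config and run the training step against it.

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# A deliberately tiny Llama so the full training loop can be exercised on CPU
# in seconds, instead of waiting for Deepspeed to shard real 13B weights.
config = LlamaConfig(
    vocab_size=32000,
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
)
model = LlamaForCausalLM(config)    # random weights, only a few MB

input_ids = torch.randint(0, config.vocab_size, (2, 16))
out = model(input_ids=input_ids, labels=input_ids)
out.loss.backward()                 # catches shape bugs and bad label handling early
```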
@Shahules786
ikka
1 year
Anyone seriously committed to building RAGs should consider fine-tuning or training their own embeddings before contemplating the substitution of OpenAI models with Llama-2. Why, you ask? ⭐️ 1️⃣ Enhancing retriever recall in this manner leads to an overall performance boost for
4
4
108
@Shahules786
ikka
1 year
The secret sauce of PaLM 2 is not only in the training data but also in the training objective. PaLM 2 achieves better LLM alignment using conditional training with control tokens. Let's understand this approach. 1/🧵
Tweet media one
1
18
109
@Shahules786
ikka
9 months
⭐️ SPADE is a very intriguing work on testing software built on top of LLM prompts. The authors propose a method to automatically synthesize assertions for prompts to identify bad outputs. (1/n)
Tweet media one
1
15
109
@Shahules786
ikka
1 year
Interesting paper indicating that LLMs do hold information about what is true even when their output indicates otherwise. The authors also propose a new method that improved LLaMA 7B's truthfulness from 32% to 65%! 1/🧵
Tweet media one
1
19
103
@Shahules786
ikka
1 year
RA-DIT: a very interesting paper from Meta on improving RAG systems using two techniques. Achieves SOTA performance on a variety of datasets and tasks, including reading comprehension. 1️⃣ Finetuning open-source LLMs for RAG: this allows the model to attend to information in the
Tweet media one
0
10
102
@Shahules786
ikka
10 months
Mistral 7B + synthetic data = the best OSS text embedding 🔥 Recent work from MSR just dropped the SOTA text embedding model. ▸ Uses a decoder-only model (Mistral) without any contrastive pre-training, instead of the traditionally used encoder-only models with contrastive
Tweet media one
4
14
101
@Shahules786
ikka
5 months
This made my day ☺️
@ragas_io
ragas
5 months
Guess what! @AndrewYNg just followed us! ❤️
Tweet media one
4
3
22
8
1
103
@Shahules786
ikka
1 year
There is a popular misconception that MosaicML's new LLM is pre-trained on context lengths up to 64k tokens. Instead, they have cleverly used the ALiBi method to extrapolate context length during inference. So what's the ALiBi method? 1/🧵
Tweet media one
3
14
98
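For intuition, here is a small sketch of the bias matrix ALiBi adds to attention scores in place of positional embeddings; the head count and sequence length are illustrative, and the slope formula is the one used for power-of-two head counts.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear biases added to attention scores before the softmax.

    Each head h gets a slope m_h; attending from position i back to position j
    is penalised by -m_h * (i - j), so far-away tokens contribute less.
    """
    # Geometric slopes, e.g. 1/2, 1/4, ..., 1/256 for 8 heads.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)   # i - j for j <= i, else 0
    return -slopes[:, None, None] * distance                # shape: (heads, seq, seq)

bias = alibi_bias(num_heads=8, seq_len=6)
# scores = q @ k.transpose(-2, -1) / sqrt(d) + bias, then the usual causal mask + softmax
```

Because no positions are baked into learned embeddings, the same biases extend to longer sequences at inference time, which is the extrapolation trick mentioned above.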
@Shahules786
ikka
8 months
We've been working on generating high-quality & diverse synthetic QA pairs from documents. I've developed a paradigm we've implemented in Ragas. ⭐️ Here's how it works: 1/n
Tweet media one
4
19
94
@Shahules786
ikka
9 months
DeepSeek's in-depth analysis highlights the benefits of more experts and the role of shared experts. The 16B DeepSeek model rivals the performance of Llama-2 7B, with just 40% of the computation! (6/n)
0
5
92
@Shahules786
ikka
1 year
Some of the data preparation and filtering papers that I refer to for my experiments ⭐️ 1. Data deduplication using similarity search. Many papers have explored this idea. Here's one unique approach. 2. An interesting approach
3
19
91
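A minimal sketch of similarity-based near-duplicate filtering, assuming sentence-transformers for the embeddings; the threshold and model id are arbitrary, and a real pipeline would use an ANN index (e.g. FAISS) instead of this O(n²) loop.

```python
from sentence_transformers import SentenceTransformer

def drop_near_duplicates(texts, threshold=0.95, model_name="all-MiniLM-L6-v2"):
    """Greedily keep a sample only if its cosine similarity to every
    previously kept sample stays below the threshold."""
    model = SentenceTransformer(model_name)
    emb = model.encode(texts, normalize_embeddings=True)   # unit vectors -> dot = cosine
    kept_idx, kept_emb = [], []
    for i, e in enumerate(emb):
        if all(float(e @ k) < threshold for k in kept_emb):
            kept_idx.append(i)
            kept_emb.append(e)
    return [texts[i] for i in kept_idx]

print(drop_near_duplicates([
    "Explain LoRA in simple terms.",
    "Explain LoRA in simple words.",
    "What is continual pretraining?",
]))
```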
@Shahules786
ikka
1 year
I explored how best to evaluate LLMs and LLM applications and consolidated my thoughts in this article. I also have a Discord channel to discuss building and serving RAGs:
Tweet media one
0
25
91
@Shahules786
ikka
1 year
A single change in assumptions led to a 50% drop in GPT-4's performance on Python code generation. This paper evaluates LLMs on their ability to adapt to new variants of existing tasks. 1/🧵
Tweet media one
3
17
90
@Shahules786
ikka
1 year
The traditional MoE models most people know about are not fully differentiable. Why? The routing algorithm selects k out of n experts using argmax or a similar discrete operation. Gradient estimation is used to work around this, but it causes instabilities during MoE training.
Tweet media one
5
11
86
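A stripped-down sketch of the standard top-k router to make the discreteness concrete; the layer sizes are arbitrary and the expert networks themselves are omitted.

```python
import torch
import torch.nn.functional as F

def top_k_gating(x, gate_weight, k=2):
    """Standard top-k MoE routing.

    The top-k selection is a discrete operation, so gradients only flow through
    the renormalised weights of the chosen experts - the root of the training
    instabilities mentioned in the tweet above.
    """
    logits = x @ gate_weight                      # (tokens, n_experts)
    topk_vals, topk_idx = logits.topk(k, dim=-1)  # discrete choice of k experts per token
    weights = F.softmax(topk_vals, dim=-1)        # softmax over the selected experts only
    return topk_idx, weights

x = torch.randn(4, 16)                            # 4 tokens, hidden size 16
gate_w = torch.randn(16, 8, requires_grad=True)   # router over 8 experts
expert_idx, expert_weights = top_k_gating(x, gate_w)
```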
@Shahules786
ikka
1 year
LoRA vs full finetuning: consolidated thoughts from Twitter AI peers and some of my own takes. 🔥 - The original LoRA paper provides evidence that LoRA can actually outperform full FT. I'd like to point out that the datasets and tasks they evaluated on do not reflect real-world
4
9
85
@Shahules786
ikka
1 year
Synthetic data generation is a really interesting space to work on, with some awesome research and engineering happening in it. These are the most important papers in synthetic data generation: ⭐️ 1. Self-instruct: one of the early works in the area, uses LLMs to
4
15
86
@Shahules786
ikka
1 year
Releasing Open-Assistant llama-2 70B chat 🚀 Benchmarks very close to WizardLM 70B. Datasets used: filtered Orca + megacode + Oasst. H/w: 32 A100 80GB. Will release the datasets used for training in the coming days.
2
11
84
@Shahules786
ikka
1 year
Finetuning something special with Llama 2 for open-source AI. Expect some high-quality chat models with an impressive context length of 8k+ very soon. Open-Assistant 🚀 Sneak peek of the generations:
Tweet media one
3
4
82
@Shahules786
ikka
1 year
Tinkering with Orca data cleaning + Llama 2 🤔 Some of my observations 🚀 - 50% of samples in Orca GPT-4 contain fewer than 200 tokens in the output - Most of those do not have any explanation attached to them - There are a lot of near-duplicate queries and responses in the
Tweet media one
7
4
80
@Shahules786
ikka
5 months
No offense to Perplexity, but this only shows how clueless the WSJ is about AI.
@AravSrinivas
Aravind Srinivas
5 months
Perplexity has been ranked the number one AI chatbot in a survey run by the Wall Street Journal, ahead of ChatGPT and Gemini. Microsoft Copilot is the least preferred.
Tweet media one
147
200
2K
9
2
76
@Shahules786
ikka
1 year
Superior dataset selection and filtering techniques are the trick behind models like phi. These are the techniques used for phi models: 1. Filtering samples: In phi-1, they used a simple prompt + GPT-4 to annotate enough samples and then trained a random forest as a data
Tweet media one
4
10
77
@Shahules786
ikka
1 year
Tinkering with Falcon + token interpolation via the NTK-scaled method for extending context length. Results look good without fine-tuning. Up next: fine-tuning Open-Assistant models for 8k+ context length. You can find the code here
Tweet media one
1
9
73
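The linked code isn't shown, but the core of the NTK-aware trick is a one-line change to how the rotary frequencies are computed; the sketch below assumes a RoPE head dimension of 64 and a 4x length multiplier purely for illustration.

```python
import torch

def ntk_scaled_rope_frequencies(dim: int, scale: float, base: float = 10000.0):
    """NTK-aware context extension: rather than interpolating positions,
    rescale the RoPE base so low frequencies stretch more than high ones.

    `scale` is the desired context-length multiplier (e.g. 4.0 for 2k -> 8k).
    """
    base = base * scale ** (dim / (dim - 2))   # the NTK-aware base adjustment
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return inv_freq

inv_freq = ntk_scaled_rope_frequencies(dim=64, scale=4.0)
```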
@Shahules786
ikka
1 year
llama-2 13B with the best of ORCA + code datasets. 🚀 Already seeing more than a 10% improvement in benchmarks while using only 10% of the original data :) ⭐️ Experimented with different data filtering techniques, will share the methods and datasets soon
Tweet media one
10
4
68
@Shahules786
ikka
11 months
Apply now.
Tweet media one
4
1
67
@Shahules786
ikka
9 months
DeepSeek introduces two innovative solutions: a) Fine-grained experts: Increases the number of experts and the number of experts selected at each point. b) Shared experts: Some experts are activated for all tokens, enhancing efficiency (4/n)
Tweet media one
1
4
65
@Shahules786
ikka
4 years
I am a data scientist, but I don't ask my connections to comment on my posts to get magic materials for becoming a data scientist in a week :) There is no shortcut to learning or mastering any science or art. It is just hard work and constantly updating one's knowledge with time. #DataScience
2
8
61
@Shahules786
ikka
1 year
Chain-of-thought reasoning is one of the most impressive qualities of LLMs. Taking advantage of this, a recent paper just outperformed the 540B PaLM using an 11B T5. How? 1/🧵
Tweet media one
1
17
60
@Shahules786
ikka
1 year
Finetuned an open-llama model on an ORCA explanation-style dataset created from WizardLM instructions. The results look good. Check out the demo here. I used open-llama 3B + LoRA + Deepspeed to fine-tune the model in under 2 hours for 1 epoch on 2 V100s (16GB).
Tweet media one
1
13
59
@Shahules786
ikka
10 months
Happy New Year, everyone! 2023 has been a transformative journey 🔥 1. Moved on from my day job. 2. Shifted my focus to contributing to OSS AI and collaborated with OpenAssistant AI. 3. Delved deep into LLMs, managing to train LLMs with up to 70B parameters on large clusters. 4.
7
0
60
@Shahules786
ikka
1 year
I have been exploring training & finetuning embeddings for a couple of weeks, and here are the most insightful papers I found. ⭐️ 1️⃣ Instructor embeddings: trains embeddings that generalize well across different tasks like classification. They also release the MEDI dataset, which
0
8
58
@Shahules786
ikka
1 year
Finetuned a Redpajama 3B model on an ORCA-style explanation dataset and built a demo for comparing it with open-llama generations. 🚀 Check out the demo: Model: Some observations I made: - Redpajama consistently formats code outputs
Tweet media one
3
16
59
@Shahules786
ikka
1 year
Knowledge distillation via imitation learning has proven to produce quality models like Vicuna. Now Orca is here, beating every one of these past models by finetuning with an explanation strategy. Orca surpasses Vicuna by 100% and performs on par with GPT-3.5. 1/🧵
Tweet media one
3
12
56
@Shahules786
ikka
7 months
Common mistakes I have observed during LoRA fine-tuning ⭐️ 1️⃣ Not adding all required linear layers to the target modules. 2️⃣ Not resizing the embed_tokens and lm_head layers, and failing to save them once resized when adding extra tokens. 3️⃣ Not modifying the generation config to
3
2
56
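To make point 1️⃣ concrete, here is a minimal peft config that targets every linear projection of a Llama-style block rather than just q/v; the rank and dropout values are arbitrary, and module names differ between architectures (check model.named_modules()).

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # all attention projections
        "gate_proj", "up_proj", "down_proj",      # all MLP projections
    ],
)
```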
@Shahules786
ikka
1 year
Let's level up the game with Llama 2. 70B full FT 🔥 Expect some high-quality Open-Assistant chat models soon 🚀
Tweet media one
4
4
58
@Shahules786
ikka
5 months
GPT-4o has killed and also given birth to many startups at the same time. The formula is very clear: everything that is one abstraction away can and will be learned by models leveraging large amounts of data and computation. So don't build for limitations; instead, build for
2
7
56
@Shahules786
ikka
9 months
A persistent challenge in MoEs has been expert specialization. 🤔 By analyzing token routing, it's clear that in models like Mixtral 8x7B, tokens from various domains end up routed to nearly all experts. That's not ideal for MoEs (2/n)
Tweet media one
1
2
54
@Shahules786
ikka
1 year
Releasing a multi-chapter summaries dataset to enable finetuning of open-source models with 8k+ context length. 🚀
1
9
55
@Shahules786
ikka
1 year
Our last run of Llama 2 13B didn't yield the expected results on certain benchmarks. We learned a few things from it. Here are my key takeaways 🔑 1. Data filtering is hard and can go wrong in many ways. It's better to inspect it well before training. 2. We will have to create a
4
3
56
@Shahules786
ikka
1 year
I see implementations that avoid adding special tokens when using PEFT (LoRA) because they face size mismatch errors when loading adapter weights. I faced the same issue and found the solution, which is pretty straightforward. 1/🧵
1
7
54
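The thread's exact fix isn't reproduced above; one common recipe with transformers + peft is sketched below: resize the embeddings after adding tokens, and list embed_tokens/lm_head in modules_to_save so the resized layers are stored with the adapter. The model id and token strings are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"            # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add the new special tokens, then grow the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|system|>", "<|user|>"]})
model.resize_token_embeddings(len(tokenizer))

config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
    # Saving the resized embed_tokens and lm_head with the adapter avoids the
    # size mismatch when the adapter is later loaded onto the base model.
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, config)
```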
@Shahules786
ikka
1 year
Colossal-AI Llama-2 is a very interesting open-source experiment on adapting LLMs to new domains. The work successfully adapts Llama-2 to the Chinese language and improves its performance for just a few hundred dollars. 🚀 The steps performed are: 1️⃣ High-quality data
Tweet media one
1
8
53
@Shahules786
ikka
9 months
DeepSeek offers a novel solution to this. It pinpoints two key issues with current MoEs: a) Knowledge hybridity: with a limited number of experts, each expert has to cover very diverse knowledge. b) Knowledge redundancy: tokens routed to different experts often need common knowledge, so multiple experts end up learning the same things. (3/n)
1
2
51
@Shahules786
ikka
1 year
@JosephJacks_ Transparency is important. Detailed documentation describing model architecture, training methodology, hyperparameter configurations, etc. should be present along with the weights.
2
0
52
@Shahules786
ikka
1 year
Found this interesting dataset that can be used for training a small LLM for query composition in chat-based QA systems. The same dataset can also be used to train an embedding that can take conversations as a query to retrieve required passages. Check it out:
Tweet media one
1
10
50
@Shahules786
ikka
7 months
Ragas on the front page of Hacker News today 🔥
Tweet media one
1
1
50
@Shahules786
ikka
1 year
Thanks for the shoutout for Ragas @OpenAI #OpenAIDevDay
Tweet media one
1
3
49
@Shahules786
ikka
1 year
Releasing the best-of-Orca dataset, a filtered version of Orca that contains only 10% of the original dataset but has proved to generalize better than Orca. 🚀 Data filtering techniques applied 🔑 1. Removed instructions with fewer than 100 tokens in the response. 2. Data
2
5
49
@Shahules786
ikka
1 year
My "GPU Poor" friends, necessity is the mother of invention.
1
7
47
@Shahules786
ikka
1 year
Can we finetune an LLM that's comparable with GPT-4 with just 1000 samples? The LIMA paper says so. How did they do it? 1/🧵
Tweet media one
3
7
47
@Shahules786
ikka
9 months
Amazingly, all these improvements don't increase the total or activated parameters per token. This is achieved by cleverly splitting the intermediate hidden size. (5/n)
1
2
45
@Shahules786
ikka
1 year
LLongMa 8k + long-context finetuning 🤔 Dataset = Orca + multi-chapter summaries
Tweet media one
1
4
47
@Shahules786
ikka
1 year
Releasing megacode-best, the code dataset behind the recently released Open-Assistant Code Llama 13B. 🚀 ⭐️ I applied my usual data filtering workflow to the original dataset to remove near-duplicate instructions. Chose the new GTE embeddings as they're trained on code tokens and
Tweet media one
3
4
46
@Shahules786
ikka
7 months
@tsarnick I love how elegantly @DavidSHolz handled the same question and openly accepted his position.
Tweet media one
4
3
42
@Shahules786
ikka
1 year
Llama 2 + scaled RoPE + Flash Attention 2 = 🔥
2
2
45
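For anyone wanting to try the combination, a hedged sketch of how it can be wired up with a recent transformers release; the model id and scaling factor are illustrative, and flash_attention_2 additionally requires the flash-attn package and a supported GPU.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",          # Flash Attention 2 kernels
    rope_scaling={"type": "dynamic", "factor": 2.0},  # dynamic NTK scaling of RoPE
)
```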
@Shahules786
ikka
1 year
Generative search engines are cool, but how many of the generated statements are actually fully supported by the shown citations? This recent work evaluated 4 popular AI search engines and found that this number is only around 50%! 1/🧵
Tweet media one
2
9
42
@Shahules786
ikka
1 year
ChatDB: augmenting LLMs with external databases (MySQL) for complex data operations using a chain of memory. The method improves the effectiveness and robustness of such systems and performs way better than vanilla ChatGPT. 1/🧵
3
7
40
@Shahules786
ikka
1 year
Everybody gangsta until I train on the test set 🔥 Jokes aside, open-source model developers should focus on creating high-quality models and sharing insights rather than beating benchmarks that give mediocre signals of a model's capabilities.
Tweet media one
5
1
42
@Shahules786
ikka
1 year
PandaLM is an open-source model for comparing LLMs using a judge LM specifically finetuned for evaluating responses from LMs. The focus here is to include subjective qualities of responses, like relative conciseness and clarity, when evaluating LLMs. 1/🧵
Tweet media one
1
8
41
@Shahules786
ikka
5 months
It's ironic how people talk about LLMs as AGI, and yet the top use case of LLMs in enterprises is still summarisation and paraphrasing.
6
5
42
@Shahules786
ikka
1 year
One of the key challenges in building software powered by LLMs is adjusting prompts to suit multiple LLMs/use cases. Here are some of the papers & concepts I found interesting 1️⃣ DSPy: introduces a systematic approach for developing LM pipelines by composing and
1
3
38
@Shahules786
ikka
1 year
The Shepherd critic model (7B) from Meta shows good performance in the evaluation done in the paper. My thoughts on using smaller models for feedback: although smaller models can be effective at providing feedback on aspects like grammar, coherence, alignment, and ensuring structure in
Tweet media one
1
2
40
@Shahules786
ikka
1 year
LLMs like Falcon and Llama 2 do not use the original multi-head attention. They use variants of it, namely multi-query attention and grouped-query attention, to attain much better inference speed. It's pretty simple to understand. 1/🧵
Tweet media one
1
5
38
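A small shape-level sketch of grouped-query attention (multi-query attention being the special case of a single K/V head); the head counts are illustrative and the causal mask is omitted for brevity.

```python
import torch

batch, seq, head_dim = 1, 8, 64
n_q_heads, n_kv_heads = 32, 8           # GQA: 8 K/V heads shared across 32 query heads
                                        # (MQA is the special case n_kv_heads = 1)

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Each group of 4 query heads attends to the same K/V head, so the KV cache
# is 4x smaller than with full multi-head attention -> faster inference.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)   # (batch, n_q_heads, seq, head_dim)
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
attn_out = scores.softmax(dim=-1) @ v
```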
@Shahules786
ikka
7 months
People criticizing entrepreneurs for creating GPT wrappers often don't ship anything themselves. They aim to feel superior to those who are actually shipping, despite their own inactivity.
3
5
36
@Shahules786
ikka
1 year
Here are some papers and ideas that helped me fine-tune (full FT) LLMs with ~8k+ context length with minimal compute requirements ⭐️ 1️⃣ LongLoRA: compute-efficient long-sequence finetuning by grouping sequences with overlap and splitting the load into corresponding batches. Works
0
5
37
@Shahules786
ikka
7 months
What do you guys think?
Tweet media one
1
1
33
@Shahules786
ikka
1 year
I have something amazing for folks building with llama_index 🚨 There are several configurations available to enhance your RAG pipeline's performance. However, the challenge lies in objectively assessing the impact of each change. The Ragas evaluation framework is now integrated
Tweet media one
1
4
34
@Shahules786
ikka
1 year
The first challenge in finetuning embeddings for retrieval seems to be mining hard negatives. Hard negatives are the data points that the embedding finds hard to distinguish from the anchor point. 🔭 - In RAG settings, they can be defined as chunks that are returned by
1
4
33
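A minimal sketch of one common mining strategy, assuming sentence-transformers: retrieve the top-k chunks for a query with the current embedding model and keep the highest-ranked non-gold chunks as hard negatives. The model id, corpus, and top_k/n_neg values are placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def mine_hard_negatives(query, gold_chunk, corpus, model, top_k=10, n_neg=3):
    """Retrieve top-k chunks for the query, then treat the highest-ranked
    chunks that are not the gold chunk as hard negatives."""
    emb = model.encode([query] + corpus, normalize_embeddings=True)
    q_emb, doc_emb = emb[0], emb[1:]
    ranking = np.argsort(-(doc_emb @ q_emb))[:top_k]    # indices of most similar chunks
    return [corpus[i] for i in ranking if corpus[i] != gold_chunk][:n_neg]

model = SentenceTransformer("BAAI/bge-small-en-v1.5")   # placeholder for the model being finetuned
corpus = [
    "LoRA adds low-rank adapters to frozen weights.",
    "LoRA reduces trainable parameters during finetuning.",
    "ALiBi biases attention scores by token distance.",
]
print(mine_hard_negatives("What does LoRA do?", corpus[1], corpus, model))
```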
@Shahules786
ikka
11 months
Hallucination indexes for RAGs are mostly useless; they're just another fancy but trivial benchmark. Why? 1️⃣ Data leakage: the datasets used to compute these hallucination scores are mostly part of pre-training/finetuning. Since the model has already been exposed to this,
Tweet media one
1
6
34
@Shahules786
ikka
1 year
Many recent works claim that LLMs exhibit capabilities that are not present in smaller-scale models (aka emergent capabilities). This is also a cause of fear for AI doomers, but these claims are not completely true. Why?
Tweet media one
2
9
33
@Shahules786
ikka
1 year
The idea here seems flawed: this will encourage LLMs to make predictions that cannot be inferred from source documents (hallucinations in RAG). ⭐️ Finetuning LLMs for RAG will surely improve their ability to make text-grounded predictions. But the data
Tweet media one
2
3
30
@Shahules786
ikka
1 year
Even though MoE looks very simple at a conceptual level, I learned that it is actually very challenging to make it work at scale. Here's what I learned ⭐️ 1. Choosing the number of experts: this is a hyperparameter with no widely accepted way to select it. 2.
2
4
30
@Shahules786
ikka
1 year
Improvements already underway for Open-Assistant Llama 2 13B Orca-chat 🚀 Further finetuning from the last checkpoint to improve code and multi-turn chat capabilities.
Tweet media one
3
1
30
@Shahules786
ikka
1 year
Recent research shows that ChatGPT and GPT-4 can actually identify 67% and 87% of their own mistakes, respectively. Then why do these models make certain mistakes in the first place? The "How Language Model Hallucinations Can Snowball" paper explores this; let's understand. 1/🧵
Tweet media one
3
9
28