Cameron R. Wolfe, Ph.D. @cwolferesearch Twitter profile

Pinned Tweet

Cameron R. Wolfe, Ph.D.

7 days

I find it so interesting (and smart) that Meta / LLaMA is eliminating the dependence of their models on the HuggingFace stack. The LLaMA models now: - Have their own website to download weights. - Have one of the best LLM cookbooks that's available. - Provide extensive

19

148

1K

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

Q-Learning is *probably* not the secret to unlocking AGI. But, combining synthetic data generation (RLAIF, self-instruct, etc.) and data efficient reinforcement learning algorithms is likely the key to advancing the current paradigm of AI research… TL;DR: Finetuning with

48

454

2K

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

@MerabDvalishvil

14

27

2K

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

Large language models (LLMs) are fun to use, but understanding the fundamentals of how they work is also incredibly important. One major idea and building block of LLMs is their underlying architecture: the decoder-only transformer model. 🧵[1/6]

42

382

2K

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

Due to the recent surge in popularity of AI and language models, one of the most common questions I hear is: How can we train a specialized LLM over our own data? The answer is actually pretty simple… TL;DR: Training LLMs end-to-end is quite difficult due to the size of the

22

323

2K

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

The ChatGPT API was released yesterday and it costs 90% less than expected. Here are five methods (and resources to learn about them) that are **probably** being used to enable this price reduction… 🧵[1/6]

27

264

1K

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

One of the best ways to reduce hallucinations with LLMs is by retrieving useful, factual information and injecting it into the LLM’s prompt as added context. Although this might sound complicated, it’s actually quite easy to implement with standard vector search functionality…

41

198

1K
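A minimal sketch of the retrieve-then-inject idea from the post above. It assumes a toy bag-of-words embedding in place of a real embedding model and vector database; the names `embed`, `retrieve`, and `build_prompt` are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    # Assumption: word counts stand in for a learned embedding model.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=1):
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    # Inject the retrieved text into the prompt as added context.
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "The Eiffel Tower is located in Paris.",
    "Mount Everest is the tallest mountain on Earth.",
]
prompt = build_prompt("Where is the Eiffel Tower located?", docs)
print(prompt)
```

In practice the toy `embed` function would likely be swapped for a real embedding model, and the sorted scan for an approximate nearest-neighbor index.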

Cameron R. Wolfe, Ph.D.

@cwolferesearch

8 months

RAG is one of the best (and easiest) ways to specialize an LLM over your own data, but successfully applying RAG in practice involves more than just stitching together pretrained models… What is RAG? At the highest level, RAG is a combination of a pretrained LLM with an

19

265

1K

Cameron R. Wolfe, Ph.D.

@cwolferesearch

8 months

The volume of LLM research being released is staggering. Although there are too many new papers for any one person to read, this work can be largely distilled into a much smaller set of overlapping themes. Recently, there are three trends in LLM research that have been especially

30

276

1K

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

Self-attention is the primary building block of large language models (LLMs) and transformers in general. But, how exactly does it work? 🧵 [1/8]

20

194

1K

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

Although large language models (LLMs) are incredibly capable, they are pretty simple to understand. In fact, the core components of most LLMs can be distilled into five major components… 🧵[1/7]

27

205

1K

Cameron R. Wolfe, Ph.D.

@cwolferesearch

9 months

Generative large language models (LLMs) are based upon the decoder-only transformer architecture. Currently, these types of generative LLMs are incredibly popular. However, I use encoder-only architectures for 90% of use cases as a practitioner. Here’s why… History of

27

181

1K

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Nearly all recently-proposed large language models (LLMs) are based upon the decoder-only transformer architecture. But, is this always the best architecture to use? It depends… 🧵 [1/8]

24

199

1K

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

Want to train a specialized LLM on your own data? The easiest way to do this is with low rank adaptation (LoRA), but many variants of LoRA exist. Here’s an overview of all (or at least most) of the techniques that are out there… LoRA models the update derived for a model’s

16

214

970

Cameron R. Wolfe, Ph.D.

@cwolferesearch

6 months

LLaMA-3 is a prime example of why training a good LLM is almost entirely about data quality… TL;DR. Meta released LLaMA-3-8B/70B today and 95% of the technical info we have so far is related to data quality: - 15T tokens of pretraining data - More code during pretraining

21

225

918

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Each “block” of a large language model (LLM) is comprised of self-attention and a feed-forward transformation. However, the exact self-attention variant used by LLMs is masked, multi-headed self-attention. Let’s break down what this means…🧵[1/11]

9

156

881

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

After GPT-3 was proposed, a lot of research was done to find an even better language model. Initial attempts focused on just training larger models. Contrary to popular belief, however, there is more to creating a good language model than size… 🧵[1/8]

18

135

870

Cameron R. Wolfe, Ph.D.

@cwolferesearch

8 months

What’s the easiest way to specialize an LLM over your own data? Recent research has studied this problem in depth, and RAG is way more effective (and easier to implement) compared to extended pretraining or finetuning… Knowledge from pretraining. A lot of factual information is

16

156

878

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Have you ever wondered why all language models use decoder-only architectures? It's partially because decoder-only models work great for next-token prediction. However, recent research has also analyzed the choice of architecture for language models in depth... Decoder-only

9

104

795

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

I just finished writing a survey on the history of open-source LLM research, spanning from the early days (e.g., OPT and BLOOM) to recent models like LLaMA-2. Here are three takeaways that seem to have the biggest impact on LLM quality… Base models make all the difference.

19

145

787

Cameron R. Wolfe, Ph.D.

@cwolferesearch

5 months

Prompt engineering is one of the most rapidly-evolving research topics in AI, but we can (roughly) group recent research on this topic into four categories… (1) Reasoning: Simple prompting techniques are effective for many problems, but more sophisticated strategies are

12

168

726

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Large Language Models (LLMs) are notoriously bad at solving reasoning-based tasks. However, we can drastically improve their reasoning performance using simple techniques that require no fine-tuning or task-specific verifiers. Here’s how…🧵[1/7]

18

125

718

Cameron R. Wolfe, Ph.D.

@cwolferesearch

7 months

New language models get released every day (Gemini-1.5, Gemma, Claude 3, potentially GPT-5 etc. etc.), but one component of LLMs has remained constant over the last few years—the decoder-only transformer architecture. This architecture has five components… Why should we care?

12

156

698

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Foundation models for language understanding (such as GPT-4) are becoming increasingly common and useful. But, what about other modalities? Today, Meta AI released the "Segment Anything" model, a foundation model for image segmentation... 🧵 [1/6]

7

120

681

Cameron R. Wolfe, Ph.D.

@cwolferesearch

9 months

Given the popularity of retrieval augmented generation (RAG) for LLMs, one question I’m constantly asked is: What model should I use to embed my data for RAG? This question has a simple answer that I use for (almost) all applications… TL;DR: Sentence BERT (sBERT) is an

24

117

688

Cameron R. Wolfe, Ph.D.

@cwolferesearch

9 months

Trying to create a language model that understands your own custom data? Here are techniques you can use to create a “specialized” LLM, ordered in terms of the amount of complexity/compute involved… TL;DR: When trying to solve problems with language models, we should start

6

140

673

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Given the current foundation model paradigm, I wonder if building/training models will become antiquated. Will future data scientists understand the details of optimization, architectures, etc.? ML may slowly be abstracted in favor of simpler (language model-based) solutions...

21

94

643

Cameron R. Wolfe, Ph.D.

@cwolferesearch

11 months

I just wrote a long-form overview of RLHF, its origins/motivation, and the impact it has had on the generative AI movement. My conclusion? RLHF is (arguably) the key advancement that made modern generative LLMs possible. Here’s why… TL;DR: Prior to RLHF, we primarily relied upon

13

141

644

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

The creators of FlashAttention (makes language model training much faster) just released another awesome efficiency tool—FlashDecoding—that can make LLM inference up to 8X faster on long sequences. Here’s how it works… Background reading. To understand most of this post, you

8

124

634

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

Recently, I’ve read and overviewed publications for nearly 20 different large language models (LLMs) from GPT to ChatGPT. Here’s what I learned… 🧵 [1/10]

21

95

630

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Traditionally, LLMs have struggled to solve complex problems that require reasoning. Chain of thought prompting has improved their abilities in this domain, but why stop there? Here are four prompting techniques for solving difficult, multi-step problems with LLMs… 🧵 [1/8]

14

132

610

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Research on advanced prompting techniques for language models has extended chain of thought and tree of thought prompting to graph-structured reasoning processes. But, did you know that there are two versions of “graph of thought” prompting that have been proposed already? Some

14

96

609

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Using a KV cache is one of the most commonly-used tricks for speeding up inference with LLMs. Here’s exactly how it works… Autoregressive decoding process. When we perform inference with an LLM, it follows an autoregressive decoding process. Put simply, this means that we i)

7

95

582

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Large Language Models (LLMs) commonly use a “greedy decoding” strategy to generate their output, but what exactly does this mean? Here’s how this process works… 🧵 [1/10]

12

118

571
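A minimal sketch of greedy decoding, assuming a hypothetical toy "model" (`TOY` and `toy_scores`) that scores the next token from the last token only. A real LLM conditions on the full sequence, but the selection rule is the same: always take the highest-scoring token.

```python
def greedy_decode(next_token_scores, prompt, max_new_tokens=10, eos="<eos>"):
    # Repeatedly append the single highest-scoring next token.
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        tokens.append(best)
        if best == eos:
            break
    return tokens

# Hypothetical score table, for illustration only.
TOY = {
    "the": {"cat": 2.0, "dog": 1.5, "<eos>": 0.1},
    "cat": {"sat": 1.8, "<eos>": 0.5},
    "sat": {"<eos>": 3.0, "the": 0.2},
}

def toy_scores(tokens):
    return TOY[tokens[-1]]

print(greedy_decode(toy_scores, ["the"]))
```

Because the choice at each step is deterministic, the same prompt always yields the same output under this strategy.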

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Prompt engineering for language models usually involves tweaking the wording or structure of a prompt. But, recent research has explored automated prompt engineering via continuous updates (e.g., via SGD) to a prompt’s embedding. Here’s how these techniques work… 🧵 [1/8]

14

105

568

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

When we interact with language model APIs, such as the OpenAI API, we typically have to set a “temperature” parameter when obtaining output from the language model. But, what exactly is this parameter and how does it work? Let’s take a deeper look… The decoding process:

14

107

569
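A minimal sketch of how temperature is commonly applied during decoding: the logits are divided by the temperature before the softmax. This is the standard formulation, not a description of any particular API's internals; the function name `softmax_with_temperature` is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Divide logits by the temperature, then apply a numerically stable softmax.
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t))
```

Lower temperatures concentrate probability on the top token, while higher temperatures flatten the distribution and make sampling more varied.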

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

Given the incredible performance of large language models (LLMs) like ChatGPT, it’s hard to believe that the original generative pre-trained transformer (GPT) was proposed less than five years ago. Here’s how we got to where we are right now… 🧵[1/8]

9

96

544

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

Reinforcement learning from human feedback (RLHF) can teach LLMs a variety of interesting skills. As an example, Sparrow, a chatbot developed by @DeepMind , is taught (via RLHF) to support its factual claims by finding relevant information on Google... 🧵 [1/7]

3

78

554

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Many different (text-based) transformer architectures exist, but when and where should we use them? Here’s a quick list of four important transformer variants and the best applications to use them for…🧵[1/7]

9

115

555

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

Large language models (LLMs) have been criticized due to their heavy reliance on humans to create datasets for fine-tuning and RLHF, but recent research suggests that we might not even need humans for this… 🧵[1/9]

9

71

552

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Advanced prompting techniques allow language models to solve complex problems but are often constrained to a single line of reasoning. Tree of thoughts (ToT) prompting avoids this by deliberately decomposing, planning, and exploring candidate solutions to a problem via a

14

95

547

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

Vision Transformers (ViTs) are a powerful deep learning architecture, but what’s the difference between ViT and a text-based transformer like BERT? Despite being applied in completely different domains, these models have only one major difference… 🧵[1/7]

4

101

546

Cameron R. Wolfe, Ph.D.

@cwolferesearch

16 days

Model merging is a popular topic in AI research, but why does the simple idea of averaging models’ weights work so well? The answer lies in prior research on neural network sparsity and training dynamics… What is model merging? First, let’s make sure we understand what model

11

83

548
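A minimal sketch of the weight-averaging idea mentioned above, assuming each model is represented as a plain dictionary mapping parameter names to flat lists of floats. Both `average_weights` and this representation are hypothetical simplifications of a real checkpoint format.

```python
def average_weights(state_dicts):
    # Element-wise mean of each parameter across models with identical shapes.
    keys = state_dicts[0].keys()
    for sd in state_dicts:
        if sd.keys() != keys:
            raise ValueError("all models must share the same parameter names")
    n = len(state_dicts)
    return {
        k: [sum(vals) / n for vals in zip(*(sd[k] for sd in state_dicts))]
        for k in keys
    }

model_a = {"w": [1.0, 2.0], "b": [0.0]}
model_b = {"w": [3.0, 4.0], "b": [1.0]}
merged = average_weights([model_a, model_b])
print(merged)
```

The sketch only makes sense when the models share an architecture, which is why it checks that the parameter names match.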

Cameron R. Wolfe, Ph.D.

@cwolferesearch

9 months

The impressive in-context learning abilities of LLMs have created the need for larger context windows. Recently, researchers discovered that we can easily extend the context window of a pretrained LLM with one simple trick (and no extra training)… What is the context window?

9

106

542

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Foundation models are a popular topic in AI research. However, task-specific fine-tuning outperforms zero/few-shot learning with foundation models in most cases. Specialized models are hard to beat! Luckily, recent research indicates that we can combine the strengths of both

11

114

542

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Most high-performing large language models (LLMs) are closed-source and can only be accessed via paid APIs. However, the public release of LLaMA has recently challenged this trend. Here’s what you need to know about LLaMA… 🧵[1/7]

9

100

537

Cameron R. Wolfe, Ph.D.

@cwolferesearch

7 months

Retrieval-augmented generation (RAG) is the best way to specialize an LLM over your own data. Researchers have recently discovered a finetuning approach that makes LLMs much better at RAG... RAFT and specializing LLMs. Most use cases with LLMs require specializing the model to

8

122

537

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

Reinforcement learning from human feedback (RLHF) is a major catalyst of the recent generative AI boom, as it enables language models to surpass human writing quality. RLHF makes this possible by improving the alignment process in three main ways... What is RLHF? RLHF is a

13

99

534

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

This is huge news! The number of times I've been asked "How difficult would it be to create a ChatGPT for <insert domain>?" is nearly countless. I'm sure versions of ChatGPT for retail, banking, insurance, and more will soon be available. [1/3]

7

87

527

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

Object detection is a fundamental problem in computer vision. Although Vision Transformers (ViTs) achieve state-of-the-art performance today, the history of object detection proceeded in three distinct generations of innovation… 🧵 [1/7]

11

98

532

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

In the wake of LLaMA, the deep learning research community quickly adopted the view that open-source LLMs will rule the future—reproducing open-source variants of proprietary models seemed to be easy and cheap. Is this the truth? Here’s a brief timeline of model proposals and

17

100

526

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Can large language models (LLMs) train themselves? The explosion of imitation-based open-source LLMs drew criticism due to cursory evaluation that covered up performance gaps. However, recent research shows powerful open-source LLMs can actually be created by imitating other

11

101

522

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Reinforcement Learning from Human Feedback (RLHF) is a valuable fine-tuning technique, but people often misunderstand how it works and the impact that it has on LLM behavior. Meta's LIMA publication provides a lot of information that puts the value of RLHF into perspective...

18

107

506

Cameron R. Wolfe, Ph.D.

@cwolferesearch

7 months

Now that Grok-1 has been released, it’s the perfect time to brush up on how Mixture-of-Experts (MoE) layers work in LLMs. Here’s a quick explainer… TL;DR: When applied to transformer models, MoE layers have two primary components: - Sparse MoE Layer: replaces dense

9

110

500

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

Is Attention really all we need? The answer seems to be yes, but why is this the case? Here are the two main problems that transformers solved, which enabled many of the breakthroughs in natural language processing that we see today… 🧵[1/6]

8

56

494

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Powerful LLMs like GPT-4 can follow complex instructions, but building applications with less capable LLMs requires breaking a single, detailed instruction into a “chain” of simpler prompts. Here’s an overview of practically useful chaining techniques for LLMs... Some

9

90

494

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

Reinforcement learning from human feedback (RLHF) has gained recent popularity due to its ability to refine and improve the behavior of large language models. Recently, this framework has been extended to improve the quality of video game AIs. Here’s how… 🧵[1/8]

4

70

485

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

The Falcon-7B/40B open-source LLMs were released late this week, and their performance is super impressive. But, there's a huge catch for those using them commercially! Here are my main takeaways from the models so far... Model architecture. The Falcon models were released by

18

68

473

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

The MPT suite of large language models (LLMs) by MosaicML has become incredibly popular. But, what makes these models so special? Although there are a variety of reasons for the popularity of MPT, I find these models to be especially useful due to a few unique components… Fully

10

96

479

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 months

LLM-as-a-Judge is one of the most widely-used techniques for evaluating LLM outputs, but how exactly should we implement LLM-as-a-Judge? To answer this question, let’s look at a few widely-cited papers / blogs / tutorials, study their exact implementation of LLM-as-a-Judge, and

10

79

474

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

LLaMA-2 outlines the remaining limitations of open-source language models well. Put simply, the gap in performance between open-source and proprietary LLMs is largely due to the quality of alignment. However, LLaMA-2 takes a major step in the right direction… State of

14

82

466

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

I have recently given some long-form lectures on language models, how they work, and the AI landscape, which has given me a chance to more clearly organize key concepts for understanding language models. Here are the 15 key concepts that I’ve arrived at so far… AI

11

81

453

Cameron R. Wolfe, Ph.D.

@cwolferesearch

5 months

Recently, I’ve run hundreds of instruction tuning experiments with LoRA/QLoRA, and I wanted to share some (basic) code and findings that might be useful… The code (see replies) contains an instruction tuning script using LoRA/QLoRA and the Alpaca dataset, as well as evaluation

24

78

459

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Following the release of LLaMA, we saw a rapid explosion of open-source research on large language models (LLMs). Here are the three most notable model releases during this time… 🧵 [1/8]

12

79

444

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Almost all generative language models use a decoder-only transformer architecture, making the decoder-only transformer one of the most influential architectures in modern AI. Let’s take a deeper look at an implementation to understand exactly how it works… Implementation

6

59

441

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

The PaLM API was recently released (to select developers) by Google to compete with the ChatGPT API by OpenAI. Here are the five main things you need to know about PaLM… 🧵 [1/7]

14

66

433

Cameron R. Wolfe, Ph.D.

@cwolferesearch

5 months

Here is a (brief) taxonomy of the three advanced prompt engineering techniques that are most commonly used/referenced… Disclaimer: Basic prompting techniques (e.g., zero/few-shot or instruction prompting) are highly effective, but sometimes more complex prompts can be useful

8

104

441

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

The transformer is a foundational deep learning tool that is useful for a variety of tasks. One of the coolest applications of transformers (in my opinion) is for multi-object tracking in video. Here's how it works ... 🧵[1/7]

6

61

422

Cameron R. Wolfe, Ph.D.

@cwolferesearch

8 months

Having the ability to clearly explain fundamental concepts in AI to others is incredibly important. To explain large language models (LLMs), I use a simple three-part framework… Why is this important? Given that most AI engineers/researchers work on teams with highly-technical

6

80

427

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

BERT made transfer learning popular in NLP, but follow-up research proposed a ton of new techniques for transfer learning with large language models (LLMs). T5 analyzed these techniques using a unified format. Here’s what we learn from this… 🧵 [1/9]

7

73

423

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Can generative models create their own training data? Recent research indicates that we should be careful with doing this! For image generation models especially, there seems to be a reasonable risk of degradation (or even a complete collapse) in performance… What is

19

91

418

Cameron R. Wolfe, Ph.D.

@cwolferesearch

3 months

Recently, I’ve done a ton of reading on LLM-as-a-judge techniques (i.e., using an LLM to evaluate the output of another LLM). Here’s a reference of the best papers in this space: (1) Early research: Research on LLM evaluators began with the proposal of GPT-4, which was

12

78

417

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Next-token prediction is the workhorse behind all modern advancements in large language models (LLMs) due to its use in training these models over unlabeled text. But, how exactly does this next-token prediction (or language modeling) objective work? Let’s take a deeper look…

11

67

406

Cameron R. Wolfe, Ph.D.

@cwolferesearch

6 months

Research on LLMs is moving quickly, and even models / techniques that have been state-of-the-art for a long time (e.g., GPT-4 and Mixtral) are being quickly dethroned. Here’s a list of my top ten AI developments (each with a brief summary) over the last few months… [1] DBRX is

9

104

414

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

The recent success of LLaMA-2, which can be attributed to a variety of factors, clearly demonstrates the massive value of reinforcement learning from human feedback (RLHF). Here’s what the authors of LLaMA have to say about why RLHF is so important… Collecting data for RLHF.

6

73

413

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

Large Language Models (LLMs) make awesome foundation models and can be re-purposed for solving a variety of tasks. But, how can we specialize generic LLMs to solve more domain-specific problems? Currently, there are three main approaches…🧵[1/8]

5

61

404

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

Diffusion models (DMs) are SOTA for generative modeling of images and video, but their typical formulation requires hundreds of GPU days for training. Stable Diffusion fixed this. Here’s how… 🧵 [1/8]

7

50

405

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

The foundation series by MosaicML, including MPT-7B/30B (and an efficient training repo), makes high-quality pre-trained language models available to anyone for commercial use. Given that creating a pre-trained base model is incredibly expensive, these open-source tools enable a

16

66

398

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Recent research on language models has aimed to increase the maximum allowable context length of the underlying model. But, how can we enable an LLM to handle longer inputs? One way is through the use of ALiBi… Vanilla position embeddings. Decoder-only transformer architectures

6

60

399

Cameron R. Wolfe, Ph.D.

@cwolferesearch

11 months

Retrieval Augmented Generation (RAG) is a popular tool for improving the quality/factuality of LLMs. Self-RAG makes RAG smarter by teaching the LLM to reflect and decide which components of RAG actually help with answering a prompt… TL;DR: RAG is highly effective, but it’s a

7

66

392

Cameron R. Wolfe, Ph.D.

@cwolferesearch

7 months

There are a ton of different ways to finetune a language model. Here's a (brief) summary of language model finetuning, the various approaches that exist, their purpose, and what we know about how they work... Finetuning techniques: The term “finetuning” simply refers to further

7

87

394

Cameron R. Wolfe, Ph.D.

@cwolferesearch

8 months

Retrieval augmented generation (RAG) was proposed in 2020, but the idea has since been explored and expanded by a variety of papers. Here are four notable publications that study advanced concepts with RAG… (0) Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks:

5

90

390

Cameron R. Wolfe, Ph.D.

@cwolferesearch

11 months

Most intro paragraphs for AI/ML papers just re-state the same, basic info about AI. But, the recent "GPT-4 Doesn’t Know It’s Wrong" paper has one of the best intros I've ever read... "Large Language Models (LLMs), essentially n-gram models on steroids which have been trained on

13

67

390

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Prompt engineering is oftentimes an annoying and brittle process. A small tweak to a prompt could massively change an LLM's output. But, it doesn’t have to be this way! We can adopt techniques like prompt ensembles to improve LLM reliability. 🧵 [1/10]

6

78

384

Cameron R. Wolfe, Ph.D.

@cwolferesearch

8 months

Most businesses are interested in training a specialized LLM on their own data. However, exposing proprietary data to an LLM is a security risk. Can we ensure that the LLM’s training data will not be leaked? Recent research indicates that the answer is no… TL;DR: Recent

10

82

374

Cameron R. Wolfe, Ph.D.

@cwolferesearch

5 months

LLMs are cool, but getting married was a lot cooler! Thank you everyone for not releasing any new models over the weekend. It was nice to fully disconnect and celebrate with my friends and family! ❤️

50

7

377

Cameron R. Wolfe, Ph.D.

@cwolferesearch

14 days

If you've followed my recent posts on model merging, I just published a long-form survey on this topic. It covers 50+ papers from the 1990s until now, including everything from basic concepts to the recent application of model merging to LLM alignment. See image for details!

3

53

382

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Instruction fine-tuning (or instruction tuning for short) is an incredibly useful method for creating high-performing large language models (LLMs). Here are 3 key ideas you need to know about it…🧵[1/7]

4

84

373

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

One of the main benefits of GPT-4 relative to prior models (like ChatGPT/GPT-3.5) is that the model is incredibly steerable. Here’s what this means and how you can use it to create better chat experiences… 🧵[1/8]

11

64

366

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

Large Language Models (LLMs) have the potential to be incredibly useful, but they also make a lot of mistakes (e.g., by generating false or biased information). To eliminate this behavior, recent generations of LLMs utilize a two-part refinement process… 🧵 [1/10]

7

56

361

Cameron R. Wolfe, Ph.D.

@cwolferesearch

7 months

Masked self-attention is the key building block that allows LLMs to learn rich relationships and patterns between the words of a sentence. Let’s build it together from scratch… The big picture: Large language models are based upon a deep neural network architecture called a

5

71

363
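A minimal single-head sketch of masked (causal) self-attention in plain Python. It assumes the query, key, and value vectors are already given, so it omits the learned projection matrices and multiple heads that a real transformer layer would have; `masked_self_attention` is a hypothetical name.

```python
import math

def masked_self_attention(q, k, v):
    # q, k, v: lists of equal-length vectors, one per token position.
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        # Causal mask: position i only scores positions j <= i.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted sum of the visible value vectors.
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = masked_self_attention(x, x, x)
print(out)
```

The first token can only attend to itself, so its output equals its own value vector, and changing a later token never alters the outputs at earlier positions.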

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

I’ve spent the last ~5 years working on (and writing about) language models. The proposal of Google Gemini made me think about why I am so interested in these models. There are numerous reasons, but the allure of LLMs (at least for me) boils down to 3 core properties… TL;DR:

7

50

363

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

The mixture of pretraining data used for Gemini was excluded from the technical report. Data mixology truly seems to be the new black magic for building effective AI systems. But, Gemini does give us a few important data-related learnings... (1) Diverse sources: Whenever

9

68

357

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

Looking for something to talk to your family about while you’re home for the holidays? Why not give them a clear, accessible explanation of ChatGPT? Here’s a simple, three-part framework that you can use to explain generative language models to (almost) anyone… TL;DR: We can

9

54

356

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 months

Model merging has become popular recently due to its ability to easily combine the capabilities of multiple LLMs. Here’s how it works, why it’s useful, and a few (practical) examples… What is model merging? Model merging is a technique that operates in the parameter space of a

6

65

354

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Recently proposed open-source language models have placed an emphasis upon inference speed. Such work has shown us that inference speed can be improved by up to 5X (or more) by making some changes to the decoder-only transformer architecture. Here are three examples that have

8

75

349

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

We’ve seen a massive amount of progress in AI/LLM research over the last several weeks. Here are the five highest-impact papers/projects that I’ve been focusing on recently… StreamingLLM solves limitations with LLMs generating long sequences of text. To avoid excessive memory

4

80

348

Cameron R. Wolfe, Ph.D.

@cwolferesearch

9 months

Recently-proposed large language models (LLMs) such as Google Gemini are structured and trained in a manner that maximizes efficiency and boosts training stability. But, what common tricks are used to achieve these efficiency/stability benefits? TL;DR: Making LLMs more

2

65

340

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

@LukeGessler Pretty cool. Reminds me of using JPEG directly as input for image recognition with neural nets. I bet there's a lot of cool tricks like this out there that we haven't found yet.

Faster Neural Networks Straight from JPEG | Uber Blog

Uber AI Labs introduces a method for making neural networks that process images faster and more accurately by leveraging JPEG representations.

www.uber.com

6

43

347

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Recent research in open-source LLMs has made paid APIs much less enticing (though not hosting your own model is still nice). So much is possible if we are willing to fine-tune on some task-specific data! Here are a few examples to support my point... 🧵[1/6]

8

55

345

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 month

Model merging is a popular research topic with applications to LLM alignment and specialization. But, did you know this technique has been studied since the 90s? Here’s a brief timeline… (Stage 0) Original work on model merging dates back to the 90s [1], where authors showed

8

79

339