The ReAct Paper is next-level prompt engineering.
If you understand how it works, then you can start building LLM apps that are way more factual than chatGPT and can use external APIs and tools. Check out the example at the end.
To understand ReAct, let's think step-by-step:
Is LLM finetuning worth it?
If you know what you're doing, finetuned models can be 30x smaller❗️ without losing performance. This can unlock applications that would otherwise be too expensive or slow.
STaR is a great example of doing it right.
(DIY instructions at the end)
Microsoft/OpenAI and Google/Anthropic's investment partnership is a comedy of sorts. They're basically just handing over loads of cash that they then get back through cloud compute.
The Toolformer paper is underwhelming.
Teaching an LLM to use tools is exciting! But the tools considered are disappointing.
So much effort to choose between Wikipedia and a calculator?!
There are a few compelling insights though!
So, what should you take away?
This graph from the GPT-4 tech report is a much bigger deal than most people seem to have realised.
I think it allows us to predict with reasonably high confidence that the problem of LLMs making things up will be quite easy to solve.
Here's why:
"No GPUs before PMF" should be the mantra of most applied AI start-ups. The cloud LLMs are so good now, most people should start here and optimise later.
The main difference between safety folks and accelerationists is that the safety people actually believe AGI is possible soon.
E/acc is actually the pessimistic position and is mostly held by people who were surprised by recent AI progress or pivoted from crypto.
Had a bit of a play with GPT-3 and @LangChainAI yesterday. With a bit of prompt magic from @humanloop and access to the Serpapi, it can do a decent stab at writing sales emails. @gojira what do you think?
green is GPT-3, blue is Google search
I received a great cold email today for a job app. The candidate had 1) tried Humanloop 2) had an idea for an improvement and had 3) created a loom of a mock solution.
Probably only took 1-2 hours but was better than 95% of applications I receive.
Surprisingly, the latest OpenAI models, chatGPT and text-davinci-003, are actually finetuned from the code generation models, not pure text generation.
There's a lot more detail on what base models are used in the model index for researchers:
An interesting takeaway from the HELM benchmark is that the @CohereAI base models outperform most other base models (GPT-3 davinci-1 etc.).
The models that beat Cohere are instruction tuned. Very curious to see the evaluations that also include Cohere instruct models!
LLMs and AI mean that you need less capital than ever to build a compelling product and start a company.
The next decade is going to see many more WhatsApp-style companies.
Tiny teams creating enormous value.
Contrary to popular opinion, I see a lot of companies finetuning LLMs.
They just tend to do it after things are working.
Start with prompt engineering, solve a real customer need and then optimise.
How can you tell if an LLM app is working well?
Trad software relies on unit tests and trad machine learning uses held-out datasets. With LLMs, neither of these is enough.
There's a fantastic example in the HELM benchmark showing why (and what you might do instead):
Is there any evidence that governments might have their own secret LLM efforts at a scale that could rival GPT-3/4? Feels like it would show up in hiring and compute.
Chain of thought prompting was really necessary for GPT-3 (text-davinci-002) but GPT-3.5 (text-davinci-003) seems to be able to do many of these tasks zero-shot:
Super interesting insights into building LLM apps in the co-pilot explorer blog.
The engineering effort that goes into choosing the right context, fine-tuning and telemetry is immense!
some details:
The OSS replication begins!
"We are not going to stop at replicating ChatGPT. We want to build the assistant of the future, able to not only write email and cover letters, but do meaningful work, use APIs, dynamically research information, and much more"
2 years after doing @ycombinator remotely, I finally made it to the offices!
I'm often asked if YC was worth it and I can say unequivocally yes!
The network, advice, partners and friends have undoubtedly changed @humanloop's trajectory. If you're considering it, go for it!
Happy to share that I've officially moved to San Francisco! Today's my first day in our new SF office.
Alongside @jordnb, I'll be growing @humanloop's US team.
Would love to meet more people in the city so please DM me!
You can now talk to GPT-4 in the Humanloop discord!
The OpenAI live demo inspired us, so we used GPT-4 to create a GPT-4 bot!
With the help of GPT-4 it only took @jordnb about 20 minutes to code this from scratch!
In the early versions of Anthropic's Claude model, they provided two outputs for each answer and got you to choose which you preferred.
In just the past months I've seen it improve phenomenally. Data flywheels are real.
5/ Here's a concrete example of a ReAct style prompt I used to build an automatic sales email generator.
You need to execute the "search" actions from the LLM and append the results from google.
@humanloop tools make this super easy.
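A minimal sketch of that execute-and-append loop, assuming a text-in/text-out LLM function and a search function (both are stand-ins here, not Humanloop's actual API):

```python
def run_react(prompt, llm, search, max_steps=5):
    """Drive a ReAct loop: call the LLM, execute any 'search:' action
    it emits, append the result as an observation, and repeat until
    the model produces something other than a search action."""
    transcript = prompt
    for _ in range(max_steps):
        output = llm(transcript)
        transcript += output
        if output.strip().startswith("search:"):
            query = output.strip()[len("search:"):].strip()
            transcript += f"\nobservation: {search(query)}\n"
        else:
            return transcript  # model produced a final answer
    return transcript

# Stub LLM and search engine, just to show the control flow.
def fake_llm(t):
    return "search: acme corp news" if "observation" not in t else "Final email: ..."

result = run_react("Write a sales email to Acme.\n", fake_llm, lambda q: f"results for {q}")
```

A real version would call an LLM completion API in place of `fake_llm` and a search API (e.g. Serpapi) in place of the lambda.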
Very neat project!
Using GPT-3 to build a chat QA interface for LangChain's docs.
I think this could be a pretty cool generic product/feature in one of the doc platforms.
.@Aleph__Alpha is one of the most underrated generative AI start-ups I've come across. They're amongst the only public APIs to a multimodal model that can understand both images and language.
These three steps are an emerging pattern for LLM self-improvement used in all these recent papers:
‣ Toolformer
‣ STaR — LLMs that are better at reasoning
‣ Constitutional AI — Harmless AI from AI feedback
and the whole process can be repeated multiple times!
I've heard people say that @OpenAI are great at research but not product.
I disagree. Both chatGPT and the playground were fantastic UX innovations that helped many more people realise what's possible. They're innovating in both fundamental research and UX.
One of the big deficiencies of LLM chatbots is they don't ask questions. They're not really conversing with you, they're just predicting one output at a time. I think it might be better if they were fine-tuned on longer dialogues.
4/ The simple but powerful idea in ReAct is to combine action-prompts with chain-of-thought. Taken together, this helps the model to reason about what actions to take. It's much more powerful:
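As an illustration, a ReAct-style few-shot prompt interleaves explicit reasoning ("Thought") with tool calls ("Action") and their results ("Observation"). This example is my own sketch, not the paper's exact prompt:

```python
# A hypothetical ReAct few-shot prompt: the worked example shows the
# model how to alternate reasoning with search actions.
REACT_PROMPT = """\
Question: Who is the CEO of the company that makes the iPhone?
Thought: I need to find out which company makes the iPhone.
Action: search[who makes the iPhone]
Observation: The iPhone is designed and marketed by Apple Inc.
Thought: Now I need Apple's current CEO.
Action: search[Apple CEO]
Observation: Tim Cook is the CEO of Apple.
Thought: I have the answer.
Answer: Tim Cook

Question: {question}
Thought:"""

prompt = REACT_PROMPT.format(question="What year was the founder of SpaceX born?")
```

The app executes each `Action` the model emits, appends the result as an `Observation`, and feeds the transcript back in.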
@TheLastFarm @IranianIntifada Ofc it's a choice. You just lack imagination and understanding of physics. There's no physical reason why we can't produce orders of magnitude more clean energy than we do now. It's just a political choice. We wouldn't even need new tech, we could do it with fission today.
Can't believe it's been two months already! It's been amazing to see what people are building with LLMs.
If you're building an LLM product come join the fun!
Today we're excited to announce public access to Humanloop for Large Language Models!
We're making it easier than ever to build incredible products with GPT-3
Sign-up at
LLMs don't need much or any annotated data but to be truly useful, they do need access to non-web data.
LLM products are supercharged when they know your personal context. chatGPT that's read your emails, calendar and docs. Doing this whilst preserving privacy is critical.
In office >> Remote.
Setting up the new @humanloop office! We started as fully remote because of the pandemic. Remote definitely has benefits, but moving back to the office this year has felt amazing.
I wish journalists would stop using Gary Marcus as a knee-jerk way to balance articles on AI. He seems impervious to evidence and there are many more interesting critics if you really need one.
@rivkahbrown I support the protests but surely you can't think this is acceptable. A man peacefully standing wearing a yarmulke should feel safe even if he's a counter-protester you disagree with.
You can expect the cost of serving LLMs to drop dramatically as we figure out better ways to quantise and prune the models.
A cool recent example: SparseGPT
They're able to prune 50% of the weights in OPT-175 with minimal performance loss on one GPU in 4 hours!
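As a toy illustration of what "pruning 50% of the weights" means, here is plain magnitude pruning; note SparseGPT itself uses a much cleverer second-order method, this is just the simplest possible version:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights. Far simpler than
    SparseGPT's approach, but it shows what a 50%-sparse weight
    matrix looks like."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)
    threshold = np.partition(flat, k)[k]  # (k+1)-th smallest magnitude
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

rng = np.random.default_rng(0)
w = rng.normal(size=(100, 100))      # stand-in for a layer's weight matrix
pruned = magnitude_prune(w, sparsity=0.5)
frac_zero = (pruned == 0).mean()     # roughly half the weights are now zero
```

Sparse matrices like this can then be stored and multiplied much more cheaply, which is where the serving-cost savings come from.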
@packyM I guess because the moment it turns 12 noon (the second after) it's now post-meridiem, and the second after 12 midnight it's the morning or ante-meridiem.
A perfectly calibrated model will get 10% of the answers correct when it has 10% confidence, 20% at 20% confidence etc. So on the graph above, perfect calibration corresponds to a straight line.
GPT-4 is very well-calibrated! It knows what it doesn't know!
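That calibration check can be sketched in a few lines: bucket predictions by stated confidence and compare each bucket's average confidence to its empirical accuracy (the data below is a toy example, not GPT-4's):

```python
import numpy as np

def calibration_curve(confidences, correct, n_bins=10):
    """Bucket predictions by confidence and compare each bucket's mean
    confidence to its empirical accuracy. Perfect calibration puts
    every point on the diagonal y = x."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    points = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            points.append((confidences[mask].mean(), correct[mask].mean()))
    return points

# Toy model: answers given with 85% confidence that are right 8 times in 10,
# i.e. slightly over-confident.
conf = [0.85] * 10
right = [1] * 8 + [0] * 2
pts = calibration_curve(conf, right)
```

Plotting those (mean confidence, accuracy) pairs gives exactly the kind of calibration graph shown in the GPT-4 tech report.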
I used to feel frustrated by what I saw as the @DeepMind hype machine, but reading @balajis's piece on the purpose of technology () has made me realise that evangelising technological progress and building strong narratives is itself a valuable pursuit.
This year has been an incredible one for science. So it's a real honour for #AlphaFold to be included in @ScienceMagazine's top 10 breakthroughs of the year, among so many other significant discoveries.
One of the things that most excites me about LLMs is that you no longer need to be an ML expert to build really delightful AI-first products.
Promptable focusses on JS devs and so will help open up access to many more developers. Excited to see the ecosystem of tools growing!
It's here.
The world's first library for building AI apps in Typescript.
🔥Promptable.js 🔥
Use the full power of LLMs and Embeddings in your apps:
Prompt🪄
Search 🔍
Chain ⛓️
Trace ➿
Get started -> npm i promptable
Repo
Docs
@paulg @stem_feed It's just condensed notation. The 'm' normally actually isn't the rest mass but the moving mass. The rest mass is written with a subscript 0.
E = mc^2 = \gamma m_0 c^2
where \gamma = 1/\sqrt{1 - v^2/c^2}
ChatGPT isn't a product and isn't a challenge to the existing GPT-3 companies. It's a demo of what's possible but it won't be useful for 90% of people.
The real value comes from embedding AI into applications and workflows.
@AntiNewDems @Argobotlord2001 @DanielPriestley They haven't passed any new laws. Your beef is with the Conservatives, not Keir Starmer. Even if your description of the UK were accurate, it literally cannot be the fault of a government that has been in office for 4 weeks. It's the previous government.
Believing AGI is possible and chasing it doesn't mean you belong to a cult.
Cynics get to sound smart and optimists actually build things.
Wasn't that long ago that people were being made fun of for working on neural networks.
How often do you use the term AGI a week?
And if you use it more than 5x, have you ever pondered whether you are inadvertently part of a cult rather than a scientific community?
2/You can improve factual accuracy by including external knowledge or APIs in the prompt.
You can even let the model specify what extra info it needs through a very simple "domain-specific language". If the model outputs "google: a query", you append the results of that query.
I think this is exactly the wrong take. The longer you've been working in AI, the more impressive the generality of these models seems.
If you still think they're spicy auto-complete, you're not paying attention.
3/ Action-prompting is ok for simple questions but if the question requires reasoning then it tends to fail.
"Chain of Thought" prompting lets you over come this. In your prompt you include explicit reasoning and this gets the model to do the same. This increases accuracy:
@Meaningness Started reading this. It's appallingly out of date. These criticisms might have seemed valid a decade ago, but many of the claims are just laughably wrong now.
5/ The key insight in the paper is to use the LLM to generate its own training data! It's a three-step process:
1. Generate — Use GPT-J to annotate questions with possible API calls
2. Filter — keep only the examples that improve prediction
3. Finetune — retrain the model
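The three steps above can be sketched as a loop; every function name here is a stand-in for a real LLM call, not the paper's code:

```python
def self_improvement_round(model, questions, finetune):
    """One round of the generate -> filter -> finetune pattern.
    `model(q)` returns (annotated_example, improved: bool);
    `finetune(examples)` returns a new model trained on them."""
    # 1. Generate: let the model annotate each question (e.g. with API calls).
    candidates = [model(q) for q in questions]
    # 2. Filter: keep only annotations that actually improved prediction.
    kept = [example for example, improved in candidates if improved]
    # 3. Finetune: retrain on the surviving examples.
    return finetune(kept)

# Toy stand-ins, just to show the control flow.
demo_model = lambda q: (f"{q} [annotated]", len(q) % 2 == 0)
demo_finetune = lambda data: {"trained_on": data}
new_model = self_improvement_round(demo_model, ["ab", "abc", "abcd"], demo_finetune)
```

The key property is that the filter step only lets through self-generated data that measurably helps, so repeating the loop can keep improving the model.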
If you're starting to work on an AI-first product, make sure you take the rate of progress seriously.
Build your product assuming that the AI models will be vastly more capable in the near future than they are now.
As Sam says, trust the exponential.
interesting to me how many of the ChatGPT takes are either "this is AGI" (obviously not close, lol) or "this approach can't really go that much further".
trust the exponential. flat looking backwards, vertical looking forwards.
I think the wider pattern of models generating their own training data is the most interesting aspect of the Toolformer paper and will likely become a common framework for continuously improving LLMs.
I'm launching a new podcast! The first episode is out tomorrow and I wanted to share a sneak peek. Why do we need another podcast for AI?
Over the last year at Humanloop, I've worked with a lot of different engineering leaders, CTOs and founders who are building AI products.
The secret sauce behind ChatGPT is RLHF and fine-tuning. If you want to go beyond cool demos and build differentiated products, @humanloop can help you do this for your own applications.
RLHF – Reinforcement Learning from Human Feedback.
Models are fine-tuned using RL from human feedback. They become more helpful, less harmful and they show a huge leap in performance. An RLHF model was preferred over a 100x larger base GPT-3 model.
I continue to be amazed by how little of academic ML research looks at how we collect and label data, given that for almost any real application this is the biggest factor in performance.
I think there's at least 50% chance of AGI that can do anything a human does at a computer better than the average (trained) human by 2030.
If you disagree, I'd be curious to know the simplest task you think AI won't be able to achieve at that date.
"Constitutional AI" is a new research paper from Anthropic AI and is a step towards building AI systems that have more transparent and controllable values.
1/
Since GPT-4 has well-calibrated confidence we can use its confidence estimates to decide when to trust the model.
If the calibration is good, then we don't need to worry about models making things up.
.@sourcegraph has built the most popular open-source AI coding tool in the Fortune 500!
A few weeks ago I sat down with Beyang Liu (@beyang), their CTO and cofounder, to find out how they did it.
We dive into:
The graph shows GPT-4's "calibration" before and after RLHF. A well-calibrated model can accurately say how confident it is.
The researchers compare GPT-4's probability for the answer to the fraction of the time it's correct.
Something else I've found anecdotally is that Claude is much better at Urdu than chatGPT. I'm pretty amazed by its ability to handle Urdu transliterations.
GPT-4 is an incredible learning tool!
I've never done much front-end but always wanted to find time to build apps end-to-end. I prompted GPT-4 to be a coding teacher and then worked with it to start building a simple GPT-4 chat app.
Here's the first React app I've ever built!
I heard first-hand testimony from British doctors who recently visited Gaza that aligns with the reporting here. Very sad. We must not forget our shared humanity.
@rebelemerald @Bossmustangfan @isabelleboemeke I don't think you understand what nuclear power is. Nuclear is not digging stuff out of the ground and burning it. It's transmuting one material into another and in the process releasing millions of times more energy per gram than could possibly be released by any "burning".