Keiran Paster

@keirp1

2,911
Followers
685
Following
184
Media
2,542
Statuses

Pretraining data at xAI for Grok-2 mini, Grok-2, and beyond!

Joined April 2010
Pinned Tweet
@keirp1
Keiran Paster
11 months
Introducing OpenWebMath, a massive dataset containing every math document found on the internet - with equations in LaTeX format! 🤗 Download on @HuggingFace : 📝 Read the paper: w/ @dsantosmarco , @zhangir_azerbay , @jimmybajimmyba !
Tweet media one
32
256
1K
@keirp1
Keiran Paster
2 years
Can large language models write prompts…for themselves? Yes, at a human-level (!) if they are given the ability to experiment and see what works. with @Yongchao_Zhou_ , @_AndreiMuresanu , @ziwen_h , @silviupitis , @SirrahChan , and @jimmybajimmyba (1/7)
15
112
428
@keirp1
Keiran Paster
9 months
Which LLMs are generally good at math and which are overfitting to benchmarks? With the release of Grok, @xai evaluated several closed models on a Hungarian national finals math exam which was published after the models were trained. This means it is impossible to train on or
Tweet media one
Tweet media two
24
84
559
@keirp1
Keiran Paster
1 year
Meet STEVE-1, an instructable generative model for Minecraft. STEVE-1 follows both text and visual instructions and acts on raw pixel inputs with keyboard and mouse controls. Best of all - it only cost $60 to train! w/ @Shalev_lif @SirrahChan @jimmybajimmyba @SheilaMcIlraith
11
96
322
@keirp1
Keiran Paster
2 years
Super cool - seems like labeling documents as <good> or <bad> at the beginning of the sequence during pretraining outperforms throwing out the bad data. Lots of parallels to supervised RL / decision transformer as well.
@arankomatsuzaki
Aran Komatsuzaki
2 years
Pretraining Language Models with Human Preferences Pretraining with human feedback results in much better preference satisfaction than standard LM pretraining followed by finetuning with feedback, i.e., learning and then unlearning undesirable behavior.
Tweet media one
6
87
497
4
25
220
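The tagging idea in the quoted paper can be sketched in a few lines. The token names and threshold below are illustrative stand-ins, not the paper's actual setup:

```python
# Conditional pretraining sketch: instead of discarding low-quality
# documents, prepend a control token so the model learns both
# distributions and can be steered at inference time.
GOOD, BAD = "<|good|>", "<|bad|>"

def tag_document(text: str, quality_score: float, threshold: float = 0.5) -> str:
    """Keep every document, but mark its quality with a control token."""
    tag = GOOD if quality_score >= threshold else BAD
    return f"{tag}{text}"

docs = [("clean, well-written prose", 0.9), ("spammy keyword stuffing", 0.1)]
corpus = [tag_document(text, score) for text, score in docs]
# At inference, prompting with "<|good|>" conditions on the preferred data.
```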
@keirp1
Keiran Paster
2 years
APE generates “Let’s work this out in a step by step way to be sure we have the right answer”, which increases text-davinci-002’s Zero-Shot-CoT performance on MultiArith (78.7 -> 82.0) and GSM8K (40.7->43.0). Just ask for the right answer? @ericjang11 @shaneguML
Tweet media one
Tweet media two
3
14
101
@keirp1
Keiran Paster
9 months
Following up on my previous post, I hand-graded held-out* math exams from the recently released Qwen 72B and DeepSeek 67B Base/Chat. It seems like they perform similarly to Claude 2! DeepSeek 67B: 37% GPT-3.5: 41% Qwen 72B: 52% Claude 2: 55% DeepSeek 67B Chat: 56% Grok-1: 59%
Tweet media one
2
17
74
@keirp1
Keiran Paster
11 days
Very fun to watch people speculate this week about our sus model ඞ
@lmsysorg
lmsys.org
11 days
Woah, another exciting update from Chatbot Arena❤️‍🔥 The results for @xAI ’s sus-column-r (Grok 2 early version) are now public**! With over 12,000 community votes, sus-column-r has secured the #3 spot on the overall leaderboard, even matching GPT-4o! It excels in Coding ( #2 ),
Tweet media one
174
380
2K
4
0
71
@keirp1
Keiran Paster
11 months
We also made an @nomic_ai Atlas map of OpenWebMath so you can explore the different types of math and scientific data present in the dataset:
3
14
63
@keirp1
Keiran Paster
11 months
Here's some more in-depth info on the different stages in the OpenWebMath processing pipeline: Prefiltering: We apply a simple prefilter to all HTML documents in Common Crawl in order to skip documents without mathematical content and avoid unnecessary processing time. This reduced our
Tweet media one
Tweet media two
0
8
57
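A cheap prefilter of the kind described above might look like the sketch below. The keyword list is illustrative, not the actual OpenWebMath filter:

```python
import re

# Fast string check run on every HTML document in Common Crawl, so that
# expensive extraction only runs on documents that might contain math.
MATH_HINTS = re.compile(
    r"(\\begin\{|\\frac|\$\$|MathJax|<math|\btheorem\b|\blemma\b)",
    re.IGNORECASE,
)

def might_contain_math(html: str) -> bool:
    """True if the raw HTML shows any cheap hint of mathematical content."""
    return bool(MATH_HINTS.search(html))
```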
@keirp1
Keiran Paster
6 months
Super happy that OpenWebMath was accepted to ICLR 2024! When we first submitted the paper to the conference, I was very unsure whether it would get in. In my experience, academia has a strong preference for works with clever ideas, lots of math, and fancy algorithms or
6
5
55
@keirp1
Keiran Paster
2 years
Searching for the right prompt? APE makes it easy to find a prompt that works for your language modeling task. Try our official release: Colab: GUI: GitHub:
Tweet media one
5
9
51
@keirp1
Keiran Paster
10 months
Heading back to Toronto after spending the fall at Google hosted by @HarshMeh1a and working with amazing Blueshift and Gemini teammates! It's a really fun time to work on LLMs and I hope to be back soon!
Tweet media one
1
5
44
@keirp1
Keiran Paster
10 months
Proud to introduce Llemma, the first models trained on OpenWebMath (part of the 55B mathematical tokens in ProofPile II). Llemma serves as a platform for future research on quantitative reasoning and is a very powerful base model for mathematical tasks!
@zhangir_azerbay
Zhangir Azerbayev
10 months
We release Llemma: open LMs for math trained on up to 200B tokens of mathematical text. The performance of Llemma 34B approaches Google's Minerva 62B despite having half the parameters. Models/data/code: Paper: More ⬇️
Tweet media one
11
127
548
2
10
45
@keirp1
Keiran Paster
2 years
We know that LLMs can improve their performance on reasoning tasks by prepending “Let’s think step by step” to the model’s answer. We used APE to automatically optimize this phrase to increase the likelihood of the model producing correct reasoning.
@keirp1
Keiran Paster
2 years
Searching for the right prompt? APE makes it easy to find a prompt that works for your language modeling task. Try our official release: Colab: GUI: GitHub:
Tweet media one
5
9
51
1
5
46
@keirp1
Keiran Paster
11 months
Why is this a big deal? One year ago, Google published Minerva 🦉, a powerful LLM capable of impressive mathematical reasoning. One of the key ingredients in Minerva is a closed dataset of every math document on the web, something that isn’t available to the academic community.
1
3
44
@keirp1
Keiran Paster
7 months
DeepSeekMath seems like a next generation Llemma/Minerva-type model, with insane performance. They use a really clever trick to iteratively gather more high quality webpages from Common Crawl (bootstrapped from OpenWebMath)!
Tweet media one
1
6
43
@keirp1
Keiran Paster
2 years
Do Decision Transformers work in stochastic 🎲 environments? Not out of the box, but with a small change (condition on return -> condition on average return), they can! Here’s how. Paper: with @SheilaMcIlraith and Jimmy Ba
1
6
43
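The fix described in this tweet (condition on average return instead of realized return) can be illustrated with a toy relabeling step. The cluster ids below are a stand-in for the learned trajectory clusters in the actual method:

```python
from collections import defaultdict
from statistics import mean

def relabel_with_average_return(trajectories):
    """trajectories: list of (cluster_id, realized_return) pairs.
    In a stochastic environment, conditioning on a trajectory's own
    realized return lets the policy take credit for lucky outcomes.
    Relabeling each trajectory with the mean return of its cluster
    removes that luck from the conditioning target."""
    by_cluster = defaultdict(list)
    for cid, ret in trajectories:
        by_cluster[cid].append(ret)
    return [mean(by_cluster[cid]) for cid, _ in trajectories]

# Two trajectories take the same actions (cluster 0) but luck differs;
# both now condition on the cluster's average return.
targets = relabel_with_average_return([(0, 10.0), (0, 0.0), (1, 5.0)])
```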
@keirp1
Keiran Paster
9 months
Yesterday I gave a talk on OpenWebMath at @allen_ai which covers many of the tricks and challenges of making a high quality web-scale dataset for LLM pretraining. Thanks @soldni for the invite!
0
11
38
@keirp1
Keiran Paster
1 year
Come stop by our poster presenting STEVE-1 in room 323 today at 4pm at the SPIGM workshop at #ICML2023 !
Tweet media one
2
5
35
@keirp1
Keiran Paster
11 months
We’re excited to see the types of models that can be trained using OpenWebMath! Read our paper: Code available at: Download the dataset:
Tweet media one
1
4
34
@keirp1
Keiran Paster
11 months
OpenWebMath is a huge step towards open LLMs that can do quantitative reasoning since we now have an equivalent dataset that is openly accessible!
1
0
32
@keirp1
Keiran Paster
1 year
Heading to ICML tomorrow through Sunday. I'll be presenting our work on instruction-tuning generative models for Minecraft (STEVE-1) at the ILHF and SPIGM workshops. These days I'm working mostly on LLMs x Reasoning. DMs are open if you want to chat!
2
2
30
@keirp1
Keiran Paster
9 months
Heading to NeurIPS in New Orleans Tuesday-Sunday! I'll be presenting: - STEVE-1 (spotlight!) - OpenWebMath (Oral @ MathAI workshop) - Llemma (MathAI workshop) DM me if you want to chat about reasoning or what it takes to get from today's LLMs to AGI.
Tweet media one
2
3
32
@keirp1
Keiran Paster
1 year
STEVE-1 is built on the shoulders of giants: we use VPT, a Minecraft agent from @OpenAI trained on 70k hours of gameplay videos, and MineCLIP, a multi-modal Minecraft model trained by @NVIDIAAI @DrJimFan .
1
4
31
@keirp1
Keiran Paster
11 months
With over 200B HTML documents in Common Crawl, finding mathematical documents on the web is like finding needles in a haystack. We use a complex pipeline with five steps: prefiltering, extraction, filtering, deduplication, and manual inspection.
Tweet media one
1
0
30
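The five stages above can be sketched as a simple function chain. Every stage here is a trivial stand-in for the real OpenWebMath implementation (manual inspection, the fifth step, is by definition not code):

```python
import hashlib

def prefilter(docs):       # drop docs with no cheap hint of math
    return [d for d in docs if "$" in d["html"]]

def extract(docs):         # HTML -> text with LaTeX preserved (stub)
    return [{**d, "text": d["html"]} for d in docs]

def quality_filter(docs):  # language/quality filtering (stub)
    return [d for d in docs if len(d["text"]) > 10]

def deduplicate(docs):     # exact-duplicate removal via content hashing
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha1(d["text"].encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def pipeline(docs):
    return deduplicate(quality_filter(extract(prefilter(docs))))
```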
@keirp1
Keiran Paster
2 years
I’m traveling to New Orleans ✈️ to attend NeurIPS next week! Please reach out if you would like to meet up! I'll be presenting three papers investigating the relationship between sequence modeling and optimal decision-making.
2
2
31
@keirp1
Keiran Paster
10 months
Grok @xai on our recent Minecraft AI STEVE-1
Tweet media one
4
14
22
@keirp1
Keiran Paster
9 months
@Francis_YAO_ Some people seem to have a bag-of-words level understanding of this stuff 😂
3
0
28
@keirp1
Keiran Paster
11 months
The dataset includes 14.7B tokens from all popular reference websites (Wikipedia, nLab), forums (MathHelpForum, MathOverflow), blogs (Wordpress, Blogspot), and more! Curious how we processed Common Crawl at scale to mine for mathematical documents? See the 🧵 below!
1
0
29
@keirp1
Keiran Paster
1 year
Read the paper: View our project page: Try out STEVE-1:
Tweet media one
2
2
28
@keirp1
Keiran Paster
2 years
@karpathy Thank you! I expected prompt optimization could improve reasoning performance, but I really wasn’t expecting to get such an intuitive result. One of the more satisfying moments in my PhD so far for sure.
0
0
27
@keirp1
Keiran Paster
11 months
One of the most interesting parts of our pipeline is the MathScore classifier - trained to predict whether a document contains LaTeX symbols based only on the surrounding English words. This is a powerful tool that lets us find documents even when math isn’t in a common format.
Tweet media one
1
0
27
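The MathScore idea can be sketched as follows: strip LaTeX from documents, learn which surrounding English words predict that LaTeX *was* present, then score new pages by those words alone. The tiny word-ratio "classifier" below is an illustrative stand-in for the real trained model:

```python
import re
from collections import Counter

LATEX = re.compile(r"\$[^$]+\$")

def make_example(doc: str):
    """Label by whether LaTeX was present, then remove it, so the
    model must rely on the surrounding English words alone."""
    label = 1 if LATEX.search(doc) else 0
    return LATEX.sub(" ", doc).lower().split(), label

def train(docs):
    pos, neg = Counter(), Counter()
    for words, label in map(make_example, docs):
        (pos if label else neg).update(words)
    return pos, neg

def math_score(pos, neg, doc: str) -> float:
    """Fraction of words more common near LaTeX than away from it
    (add-one smoothing). Higher means more math-like context."""
    words = doc.lower().split()
    votes = sum((pos[w] + 1) / (neg[w] + 1) > 1 for w in words)
    return votes / max(len(words), 1)
```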
@keirp1
Keiran Paster
4 years
Excited to share our work on learning latent world models that track the controllable aspects of an environment: With Sheila McIlraith and Jimmy Ba (1/9)
2
8
24
@keirp1
Keiran Paster
11 months
We found that even small models trained on OpenWebMath far surpass the performance of those trained on prior math datasets and also surpass the performance of models trained on over 20x the amount of general-domain data! 🚀
Tweet media one
1
0
24
@keirp1
Keiran Paster
2 years
@_akhaliq Thanks for sharing! Please see our thread for more information:
@keirp1
Keiran Paster
2 years
Can large language models write prompts…for themselves? Yes, at a human-level (!) if they are given the ability to experiment and see what works. with @Yongchao_Zhou_ , @_AndreiMuresanu , @ziwen_h , @silviupitis , @SirrahChan , and @jimmybajimmyba (1/7)
15
112
428
0
3
24
@keirp1
Keiran Paster
1 year
Our method is heavily inspired by unCLIP, the method behind DALL•E 2. STEVE-1 combines a policy trained using supervised learning (like Decision Transformer) to achieve goals given in the MineCLIP embedding space, and a prior which translates text into these MineCLIP embeddings.
Tweet media one
1
3
25
@keirp1
Keiran Paster
2 years
It turns out that when we endow language models with the capability to try out different generated prompts before deciding on one, the performance of their prompts skyrockets to human-level! 🚀 (3/7)
1
1
23
@keirp1
Keiran Paster
11 days
@nearcyan It was supposed to be 10am...
2
0
23
@keirp1
Keiran Paster
1 year
Interestingly, we found that the same tricks that benefit generative text-to-image models like DALL•E 2 and Stable Diffusion, namely classifier-free guidance and careful prompt engineering, have a significant effect on the performance of STEVE-1!
Tweet media one
Tweet media two
1
1
20
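Classifier-free guidance, mentioned above, amounts to one line: blend the goal-conditioned and unconditional predictions and extrapolate past the conditional ones. A plain-Python sketch, not STEVE-1's actual code:

```python
def cfg_logits(cond, uncond, scale: float):
    """Blend per-action logits: scale=0 gives the unconditional policy,
    scale=1 the goal-conditioned one, and scale>1 extrapolates further
    toward goal-consistent actions."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]
```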
@keirp1
Keiran Paster
2 years
Really cool. Minecraft is a complex environment with a lot of possibilities for AI research, but it was notoriously hard for RL due to hard exploration. Adding 70k hours of unlabeled video lets an RL agent get all the way to finding diamonds! 💎⛏
@jeffclune
Jeff Clune
2 years
Introducing Video PreTraining (VPT): it learns complex behaviors by watching (pretraining on) vast amounts of online videos. On Minecraft, it produces the first AI capable of crafting diamond tools, which takes humans over 20 minutes (24,000 actions) 🧵👇
22
317
2K
0
2
18
@keirp1
Keiran Paster
2 years
We formulate prompt engineering as a black-box optimization problem guided by LLMs: we use LLMs to generate candidate prompts, to evaluate their performance, and to iteratively propose new ones. We call our solution Automatic Prompt Engineer (APE). (4/7)
Tweet media one
1
3
19
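The APE loop described above, as a minimal black-box search sketch. `propose` and `score` are stand-ins for LLM calls: the real system asks an LLM to generate candidate instructions from demonstrations and scores them on the downstream task:

```python
def ape_search(propose, score, n_candidates=8, n_rounds=3, keep=2):
    """Generate candidate prompts, keep the best, propose variants of
    the survivors, and repeat. Returns the best prompt seen in the
    final pool."""
    pool = [propose(seed=None) for _ in range(n_candidates)]
    for _ in range(n_rounds):
        pool.sort(key=score, reverse=True)
        survivors = pool[:keep]
        pool = survivors + [propose(seed=s)
                            for s in survivors
                            for _ in range(max(n_candidates // keep - 1, 1))]
    return max(pool, key=score)
```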
@keirp1
Keiran Paster
6 months
Another banger paper from @aahmadian_ and @CohereForAI
@aahmadian_
Arash Ahmadian
6 months
PPO has been cemented as the defacto RL algorithm for RLHF. But… is this reputation + complexity merited?🤔 Our new work revisits PPO from first principles🔎 📜 w @chriscremer_   @mgalle   @mziizm @KreutzerJulia Olivier Pietquin @ahmetustun89 @sarahookr
Tweet media one
13
98
480
0
1
17
@keirp1
Keiran Paster
9 months
@xai Thanks to @agarwl_ for pointing out that question 14b on the exam was not transcribed correctly and @WenhuChen for suggesting to try the Llama-2 version of MAmmoTH-7B. These issues are now fixed on the Huggingface page and I also updated the figure.
Tweet media one
0
6
17
@keirp1
Keiran Paster
11 months
More information on our processing pipeline:
@keirp1
Keiran Paster
11 months
Here's some more in-depth info on the different stages in the OpenWebMath processing pipeline: Prefiltering: We apply a simple prefilter to all HTML documents in Common Crawl in order to skip documents without mathematical content and avoid unnecessary processing time. This reduced our
Tweet media one
Tweet media two
0
8
57
1
0
17
@keirp1
Keiran Paster
2 years
So many really exciting papers submitted from my lab to ICML this year!!🌴
1
1
16
@keirp1
Keiran Paster
1 year
Cool follow-up work on APE from Google:
Tweet media one
2
0
12
@keirp1
Keiran Paster
2 years
Another interesting prompt found using APE: rather than ask GPT to translate to Spanish directly, ask it to pretend as if it's using Google Translate for stronger performance!
@michaelrzhang
Michael Zhang
2 years
Neat result from applying automated prompt generation on translation. Referencing "Google Translate" in the prompt improves the performance of the model.
Tweet media one
0
2
20
0
1
14
@keirp1
Keiran Paster
2 years
2) Use GPT insert-mode to generate 500 candidate prompts that start with "Let's" 3) Find the prompts that maximize the likelihood of correct reasoning. (I forgot to include a few arguments in the prev tweet so here's an updated code snippet.) (2/2)
Tweet media one
2
0
13
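Step 3 above can be sketched as a ranking over candidates. `loglik` stands in for an API call returning log P(correct reasoning | question, prompt); the toy version in the usage below just rewards prompts that mention working step by step:

```python
def rank_prompts(candidates, dataset, loglik):
    """Rank candidate prompts by total log-likelihood of the correct
    reasoning across the dataset (best prompt first)."""
    scored = [(sum(loglik(p, q, a) for q, a in dataset), p) for p in candidates]
    return [p for _, p in sorted(scored, reverse=True)]
```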
@keirp1
Keiran Paster
1 year
Had a great #ICML2023 ! I really enjoyed the many interesting conversations and exploring Hawaii for the first time!
Tweet media one
0
0
12
@keirp1
Keiran Paster
2 years
I'll be presenting ESPER at the Decision Awareness in RL (DARL) workshop at #ICML this Friday! Come stop by the poster (Hall G, 4:30 and 7pm) or ping me if you want to chat!
@keirp1
Keiran Paster
2 years
Do Decision Transformers work in stochastic 🎲 environments? Not out of the box, but with a small change (condition on return -> condition on average return), they can! Here’s how. Paper: with @SheilaMcIlraith and Jimmy Ba
1
6
43
0
1
12
@keirp1
Keiran Paster
2 years
Interestingly, while prior work suggests instruction induction might be an emergent ability in larger models, we find APE’s performance scales smoothly w/ more parameters. Smaller models (even w/o instruction fine-tuning) perform quite well. (7/7)
Tweet media one
1
0
11
@keirp1
Keiran Paster
2 years
@realGeorgeHotz @tiktok_us Why do we need a separate, custom model for each user? Seems to me like a model trained on everyone and conditioned on some user information would be better.
2
0
11
@keirp1
Keiran Paster
1 year
@CohereAI Congrats!
0
1
1
@keirp1
Keiran Paster
7 months
@GanjinZero Also this model already gets 85% pass@64 and 80% pass@16. So the problem is reduced to creating a verifier that can select the correct answer out of a quite small pool of candidates.
Tweet media one
1
1
10
@keirp1
Keiran Paster
11 years
RT and Follow to win a Championship Thresh skin code (already used in NA). Good luck!
18
75
8
@keirp1
Keiran Paster
2 years
Large language models can be instructed to infer prompts from demonstrations (Honovich et al., 2022). However, the performance of these automatically generated prompts often falls short of those authored by humans. (2/7)
Tweet media one
Tweet media two
1
0
10
@keirp1
Keiran Paster
2 years
In our experiments, APE generates human-level prompts from demonstrations in 19 out of 24 instruction-induction tasks, nearly doubling the performance vs. prior work (greedy) and matching the performance of the human-authored prompts. (5/7)
Tweet media one
2
0
10
@keirp1
Keiran Paster
1 year
Like the boosting methods that dominate on tabular datasets, an ensemble of prompts can be selected by iteratively adding new prompts that cover the weaknesses of previous ones. A clever way to make prompt ensembles that cover all possible types of questions.
@johnjnay
John Nay
1 year
Boosted Prompt Ensembles for LLMs -Construct a small set of few-shot prompts that comprise a "boosted prompt ensemble'" -Examples chosen stepwise to be ones previous step's ensemble is uncertain of -Outperforms single-prompt output-space ensembles
Tweet media one
9
67
341
0
0
10
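The boosting analogy above can be sketched as a greedy cover: repeatedly add the prompt that best handles the questions the current ensemble still gets wrong. `solves` stands in for evaluating a prompt on a question:

```python
def boost_prompt_ensemble(pool, questions, solves, size=3):
    """Greedily build an ensemble: each round, pick the prompt from the
    pool that solves the most still-unsolved questions, then mark those
    questions as covered."""
    ensemble, unsolved = [], set(questions)
    for _ in range(size):
        if not unsolved or not pool:
            break
        best = max(pool, key=lambda p: sum(solves(p, q) for q in unsolved))
        ensemble.append(best)
        pool = [p for p in pool if p != best]
        unsolved -= {q for q in unsolved if solves(best, q)}
    return ensemble
```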
@keirp1
Keiran Paster
8 years
@SpaceX don’t want to land on the wrong boat!
0
0
9
@keirp1
Keiran Paster
1 year
@_akhaliq Thanks for sharing! See our thread for more info:
@keirp1
Keiran Paster
1 year
Meet STEVE-1, an instructable generative model for Minecraft. STEVE-1 follows both text and visual instructions and acts on raw pixel inputs with keyboard and mouse controls. Best of all - it only cost $60 to train! w/ @Shalev_lif @SirrahChan @jimmybajimmyba @SheilaMcIlraith
11
96
322
0
0
8
@keirp1
Keiran Paster
1 year
@DynamicWebPaige @Minecraft Thanks Paige! I really think finetuning VLMs is the future for generalist agents and we are lucky that such strong foundation models already existed for Minecraft. In the near future (using Gemini?) such agents will probably be possible in all sorts of other domains, so it's an
1
1
8
@keirp1
Keiran Paster
9 months
@jordancurve @xai For GPT-4, I referred to the score posted on the website - I didn't grade that one myself.
0
0
7
@keirp1
Keiran Paster
7 months
@WenhuChen My intuition is that the further away your data is from your eval format (instruction-following CoT or chat format), the more compute it takes to extract useful information from the dataset.
1
2
8
@keirp1
Keiran Paster
11 months
@joshm @arcinternet I suspect many of the features other than the "find" command would work well with very small 7B models like Mistral 7B (that could even be run locally at some point in the future). Maybe switching models would save a significant amount of money for you guys.
0
0
8
@keirp1
Keiran Paster
10 months
@joshm Congrats! I notice awareness of Arc is really growing recently.
0
0
8
@keirp1
Keiran Paster
7 months
Strong 2B model trained on OpenWebMath with a ton of training details shared:
@OpenBMB
OpenBMB
7 months
🌟Meet #MiniCPM : An end-side LLM outperforms Llama2-13B. 🏆It is similar to Mistral-7B in the comprehensive rankings and the overall score surpasses Llama2-13B,Falcon -40B and other models. GitHub: Huggingface: #MiniCPM
Tweet media one
Tweet media two
5
16
54
0
1
8
@keirp1
Keiran Paster
1 year
Just seeing this result for the first time this morning... crazy. I guess this shows why MineCLIP embeddings make good goal representations.
@SirrahChan
Harris Chan
1 year
STEVE-1 surprisingly also works with visual prompts taken from real life videos, thanks to using MineCLIP's embedding! See🧵👇 Such a fun project with @Shalev_lif @keirp1 @jimmybajimmyba @SheilaMcIlraith (Credits to @CorruptCarnage and Soil Television for the source videos)
3
11
50
0
0
8
@keirp1
Keiran Paster
2 years
We also experiment with using APE to generate prompts that steer an LLM towards helpful/truthful answers to questions on TruthfulQA. APE finds prompts that are both truthful and informative at a higher rate than our baseline prompt. (6/7)
Tweet media one
Tweet media two
1
0
7
@keirp1
Keiran Paster
7 months
@GanjinZero Yeah I bet 80%+ on MATH is doable this year esp with RL. The challenge to me is improving math ability in a way that generalizes much more broadly to novel questions.
2
0
7
@keirp1
Keiran Paster
2 years
@ericjang11 @shaneguML How we did it: 1) Gather a dataset of questions and correct reasoning steps by using "Let's think step by step" for all datasets in and removing incorrect answers (like in ) (1/2)
1
1
7
@keirp1
Keiran Paster
9 months
This is the coolest demo to me. LLMs don't just work as chatbots, they can code up the right GUI on the spot for each situation!
@GoogleDeepMind
Google DeepMind
9 months
Gemini’s reasoning capabilities mean it can understand more about a user’s intent, and use tools to generate bespoke user experiences that go beyond chat interfaces. Here’s what that looks like in action. ↓ #GeminiAI
50
315
2K
0
0
6
@keirp1
Keiran Paster
2 years
We are excited to present "Large Language Models Are Human-Level Prompt Engineers” (APE) at the FMDM workshop (Dec 3rd) as an oral presentation (starting at 1:45pm CST) and at the ML Safety workshop (Dec 9th).
@keirp1
Keiran Paster
2 years
Can large language models write prompts…for themselves? Yes, at a human-level (!) if they are given the ability to experiment and see what works. with @Yongchao_Zhou_ , @_AndreiMuresanu , @ziwen_h , @silviupitis , @SirrahChan , and @jimmybajimmyba (1/7)
15
112
428
1
0
6
@keirp1
Keiran Paster
2 years
Reasonable answer...
Tweet media one
0
0
6
@keirp1
Keiran Paster
10 months
@_akhaliq Also check out our demo, which lets you inspect hundreds of rendered model-generated solutions and get a feel for how strong the model is!
0
0
6
@keirp1
Keiran Paster
9 months
Clever tokenization means that transformer sequence models can now dominate even driving trajectory prediction!
@PhilionJonah
Jonah Philion
9 months
"Trajeglish: Learning the Language of Driving Scenarios" w/ @xbpeng4 @FidlerSanja Discrete sequence modeling for controlling interactive agents in self-driving simulation @nvidia @VectorInst @UofTCompSci @SFU 1/6
5
42
184
0
1
6
@keirp1
Keiran Paster
4 years
Come check out my talk on GLAMOR at 11:45 PST (2:45 EST) at the Deep RL Workshop @NeurIPSConf . And say hi at the poster session (room B, poster B6) if you have any questions!
@VectorInst
Vector Institute
4 years
@keirp1 is giving a talk on this paper at 2:45 pm EST today as part of the @NeurIPSConf Deep RL Workshop
0
0
1
0
0
6
@keirp1
Keiran Paster
3 years
I still get emails from Veggie Grill but I can’t go since they aren’t in Canada 😞😔😟😕🙁☹️😣😖😫😩🥺😢😭
0
0
6
@keirp1
Keiran Paster
1 year
@arankomatsuzaki A bit curious how they compared to VPT since this environment seems significantly easier than the one VPT uses (blocks break in one click).
1
0
5
@keirp1
Keiran Paster
1 year
So our free will is basically at most one coin flip on average per word?
2
0
5
@keirp1
Keiran Paster
9 months
@ocolegro They chose some really nice evals. Looks strong!
0
0
4
@keirp1
Keiran Paster
2 months
@WenhuChen @iScienceLuvr seems like no tool use?
Tweet media one
2
0
5
@keirp1
Keiran Paster
3 years
@ncweaver I took both 61c and 161 with you and you are pretty much my favorite lecturer I had at Berkeley! Weird that they didn't give an interview.
0
0
5
@keirp1
Keiran Paster
8 months
@BlancheMinerva For Mixtral:
Tweet media one
0
0
4
@keirp1
Keiran Paster
7 months
Somehow my Gemini 2-month trial expired after just one day hahaha
2
0
4
@keirp1
Keiran Paster
9 months
Best of both worlds
Tweet media one
Tweet media two
0
0
3
@keirp1
Keiran Paster
7 months
@lmsysorg @Google I wonder where ChatGPT is then.
1
0
4
@keirp1
Keiran Paster
9 months
@keirp1
Keiran Paster
9 months
Following up on my previous post, I hand-graded held-out* math exams from the recently released Qwen 72B and DeepSeek 67B Base/Chat. It seems like they perform similarly to Claude 2! DeepSeek 67B: 37% GPT-3.5: 41% Qwen 72B: 52% Claude 2: 55% DeepSeek 67B Chat: 56% Grok-1: 59%
Tweet media one
2
17
74
0
2
3
@keirp1
Keiran Paster
1 year
Cool paper. GPT 3.5 -> 4 is a huge boost in abstract reasoning and a lot of the remaining failures come down to bad text-based representations. Would be good to revisit this with some strong VLMs.
@pashootan_vaezi
Pashootan Vaezipoor ✈️ NeurIPS
1 year
Can you outsmart #ChatGPT ? 🧩 Try this puzzle and guess the final image. In our work, we explore abstract reasoning abilities of the large language models. Read on for insights! w/ @yudongxuwil @WenhaoLi29 @ScottSanner @lyeskhalil 1/🧵
Tweet media one
2
4
16
0
0
4
@keirp1
Keiran Paster
2 years
Finally, I am presenting a new work called “Return Augmentation gives Supervised RL Temporal Compositionality” (more details soon). This work will be presented at the FMDM workshop (Dec 3rd) and at the DeepRL workshop (Dec 9th)!
0
0
4
@keirp1
Keiran Paster
9 years
Check out my new @Pebble app! It puts League of Legends esports in your timeline! http://t.co/NyfvGeiszA
1
2
4
@keirp1
Keiran Paster
1 year
@edward_s_hu Shout-out to my little brother who made the artwork!
0
0
4
@keirp1
Keiran Paster
9 months
This is much closer to an ideal math eval.
0
0
4
@keirp1
Keiran Paster
1 year
@luka_emon Yeah VLMs are very clearly the future of agents. When a strong open source one comes out, expect a lot of progress quickly.
1
0
3
@keirp1
Keiran Paster
4 years
Personally I'm quite excited about GLAMOR and the way it connects RL to sequence modeling (like GPT3) and supervised learning! I think there's a lot of interesting future work in this direction. (8/9)
1
1
3