Szymon Tworkowski Profile Banner
Szymon Tworkowski Profile
Szymon Tworkowski

@s_tworkowski

5,097
Followers
543
Following
19
Media
536
Statuses

research @ xAI | prev. @GoogleAI @UniWarszawski | LongLLaMA | long-context LLMs and math reasoning | scaling maximalist

Palo Alto
Joined November 2021
Pinned Tweet
@s_tworkowski
Szymon Tworkowski
1 year
Introducing LongLLaMA 🦙, an unlimited-context version of OpenLLaMA fine-tuned at 8k & capable of extrapolating to 256k tokens! We train it using our new Focused Transformer 🎯 technique (FoT). No degradation on short context, drop-in compatibility & Apache 2.0 license 🔥🔥 🧵
Tweet media one
36
228
834
@s_tworkowski
Szymon Tworkowski
1 year
🎇Introducing LongLLaMA-Instruct 32K!🎇 Inspired by @p_nawrot #nanoT5 , we fine-tune LongLLaMA on a *single GPU* for ~48h to improve upon OpenLLaMA: 55% on lm-eval (vs. 53%), better perf on long context and code! We open-source our optimized fine-tuning code in PyTorch/HF!🧵
Tweet media one
9
69
290
@s_tworkowski
Szymon Tworkowski
1 year
✨Announcing LongLLaMA-Code 7B!✨ Have you wondered how GPT3.5 obtained its capabilities? Are base models of code better reasoners? 🤔 We continue pre-training CodeLLaMA on text & code to improve reasoning 🧠 Bonus: 3x faster inference @ 16K context, using Focused Transformer 🎯
Tweet media one
5
56
310
@s_tworkowski
Szymon Tworkowski
9 months
Honored to win Poland's best CS master's thesis prize for my work on long-context LLMs w/ @PiotrRMilos 🎉 Can't make it to #NeurIPS2023 😭, but @CStanKonrad will present the LongLLaMA paper tmr! Thu 10:45, Poster #326, Session 5. Interested in extending context to 256K? Come and say hi!
Tweet media one
3
18
71
@s_tworkowski
Szymon Tworkowski
1 year
LLMs struggle with solving simple competitive programming problems (e.g. Codeforces) outside their training data. Our #ACL #NLRSE paper (Thurs 19:30) investigates their ability to comprehend and reason about human-coded solutions. Can they grasp the main idea from just the code?
Tweet media one
6
46
170
@s_tworkowski
Szymon Tworkowski
4 months
✈️ Flying to Vienna for #ICLR2024 ! DM me if you want to chat about LLM reasoning, long context or any other topics :)
1
78
137
@s_tworkowski
Szymon Tworkowski
18 days
It’s been a blast to work on Grok 2 & Grok 2 mini with the team & push it to the frontier! 🚀
@lmsysorg
lmsys.org
18 days
Chatbot Arena update❤️‍🔥 Exciting news— @xAI 's Grok-2 and Grok-mini are now officially on the leaderboard! With over 6000 community votes, Grok-2 has claimed the #2 spot, surpassing GPT-4o (May) and tying with the latest Gemini! Grok-2-mini also impresses at #5 . Grok-2 excels in
Tweet media one
193
521
2K
9
46
215
@s_tworkowski
Szymon Tworkowski
6 months
A glimpse over our recent progress - exciting things to come!
@xai
xAI
6 months
773
1K
7K
5
3
141
@s_tworkowski
Szymon Tworkowski
5 months
👀👀
@xai
xAI
5 months
👀
725
1K
7K
2
6
126
@s_tworkowski
Szymon Tworkowski
6 months
@inflectionAI
Inflection AI
6 months
Today at Inflection we are announcing some important updates. A new phase for the company begins now. Read more here:
67
91
596
3
5
83
@s_tworkowski
Szymon Tworkowski
2 months
Exciting job opportunity! Hiring CUDA engineers 🧙 Join us & run your code on 100k GPUs 🚀
1
5
83
@s_tworkowski
Szymon Tworkowski
2 months
gpus go brrrrr 🧑‍🍳🧑‍🍳
@xai
xAI
2 months
how june started & how it’s going. Come 🧑‍🍳 with us at xAI & 𝕏 if you like building & running the biggest computers in the world!
Tweet media one
Tweet media two
1K
2K
12K
2
0
75
@s_tworkowski
Szymon Tworkowski
1 year
Thrilled to be a first-gen MSc! 🎓 Just defended my thesis on ‘Fine-tuning Large Language Models for Long Context Utilization’ at the University of Warsaw. Check out our recent work if you’re curious how to extend the context of LLaMA🦙 up to 256K and remain efficient at inference!
Tweet media one
5
8
66
@s_tworkowski
Szymon Tworkowski
1 year
Extremely lucky to get Focused Transformer (FoT) accepted at #NeurIPS2023 🎉! It is my first first-author paper at a big conference, which makes this moment even more special🎇 Feel free to check our recent LongLLaMA release using FoT! Unthinkable with ACL arxiv embargo policy
@s_tworkowski
Szymon Tworkowski
1 year
✨Announcing LongLLaMA-Code 7B!✨ Have you wondered how GPT3.5 obtained its capabilities? Are base models of code better reasoners? 🤔 We continue pre-training CodeLLaMA on text & code to improve reasoning 🧠 Bonus: 3x faster inference @ 16K context, using Focused Transformer 🎯
Tweet media one
5
56
310
4
13
59
@s_tworkowski
Szymon Tworkowski
27 days
sus🍓🍓 Join us
@xai
xAI
27 days
1K
2K
9K
1
2
52
@s_tworkowski
Szymon Tworkowski
27 days
column ai ftw🏛️🏛️
7
3
45
@s_tworkowski
Szymon Tworkowski
1 year
This is joint work with my awesome collaborators @CStanKonrad , @MikolajPacek , @PiotrRMilos from @IDEAS_NCBR , and @Yuhu_ai_ , @hmichalewski from @GoogleAI ! Colab: GitHub: Checkpoint:
3
4
41
@s_tworkowski
Szymon Tworkowski
1 year
FoT is a simple modification of the vanilla Transformer - instead of increasing context window length in all layers, we access previous windows of the training batch (containing tokens from the current and other docs) in a subset of attention layers called memory layers.
Tweet media one
1
3
31
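To make the memory-layer idea above concrete, here is a minimal PyTorch sketch of what such a layer might look like; it is my own illustration under assumed names (`memory_attention`, `MEMORY_LAYERS`), not the released FoT code.

```python
# Hedged sketch of a "memory layer": ordinary attention over the local window,
# extended with keys/values cached from previous windows of the training batch.
import torch
import torch.nn.functional as F

def memory_attention(q, k, v, mem_k=None, mem_v=None):
    """q, k, v: (batch, heads, local_len, head_dim); mem_k/mem_v: cached keys/values
    from previous windows, possibly containing tokens from other documents."""
    if mem_k is not None:
        k = torch.cat([mem_k, k], dim=2)   # prepend memory keys
        v = torch.cat([mem_v, v], dim=2)   # prepend memory values
    return F.scaled_dot_product_attention(q, k, v)

# Only a small subset of layers are memory layers; the rest keep the short window.
MEMORY_LAYERS = {6, 12, 18}  # hypothetical layer indices
```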
@s_tworkowski
Szymon Tworkowski
1 year
📄 We show that LMs suffer from the "distraction issue" i.e. struggle to handle multiple documents in one context. Our Focused Transformer training objective (FoT) alleviates this by attending to tokens from the same doc (positive) and other docs (negative)
Tweet media one
1
0
30
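A rough sketch of how the positive/negative mixture described above could be assembled during training; the helper below and its name are hypothetical, shown only to illustrate the idea.

```python
# Hedged sketch: the memory attached to a training window mixes earlier windows of
# the same document ("positives") with windows from other documents in the batch
# ("negatives"), so the model must learn to ignore the irrelevant tokens.
import random

def build_memory(batch_windows, doc_id, d=2):
    """batch_windows: dict mapping doc_id -> list of previously seen windows."""
    positives = list(batch_windows[doc_id])
    other_docs = [i for i in batch_windows if i != doc_id]
    negatives = []
    for other in random.sample(other_docs, k=min(d - 1, len(other_docs))):
        negatives.extend(batch_windows[other])
    return positives + negatives
```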
@s_tworkowski
Szymon Tworkowski
10 months
Never thought prompt engineering could be this fun until #PromptIDE ! Check it out!
@xai
xAI
10 months
Announcing the xAI PromptIDE The xAI PromptIDE is an integrated development environment for prompt engineering and interpretability research. It accelerates prompt engineering through an SDK that allows implementing complex prompting techniques and rich analytics that visualize
1K
1K
9K
2
14
27
@s_tworkowski
Szymon Tworkowski
1 year
Unlike prior work focusing on position encodings, we follow and achieve extrapolation by simply keeping positionals constant for memory tokens, while leaving the local context intact. This makes LongLLaMA backward compatible with LLaMA inference code.
1
1
27
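A tiny sketch of the position-handling trick mentioned above, assuming memory tokens share one constant position index while the local window keeps its usual positions; this is my illustration, not the actual LongLLaMA code.

```python
import torch

def position_ids(mem_len, local_len, mem_position=0):
    # Memory tokens all receive the same constant position, so the model never
    # sees position values beyond those encountered during training; the local
    # context keeps its ordinary 0..local_len-1 positions.
    mem_pos = torch.full((mem_len,), mem_position, dtype=torch.long)
    local_pos = torch.arange(local_len, dtype=torch.long)
    return torch.cat([mem_pos, local_pos])
```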
@s_tworkowski
Szymon Tworkowski
11 months
@Yampeleg In our Focused Transformer paper we propose a contrary view: packing multiple examples in one context can be beneficial if you optimize for long-context capabilities. This is because the model learns to ignore irrelevant tokens in context.
1
3
24
@s_tworkowski
Szymon Tworkowski
1 year
Surprisingly, we observe that apart from expected gains in multi-document settings, FoT (d=2) also improves perplexity on long, single documents, compared to training on just positive documents (d=1). We find this important, as the amount of long-context training data is limited.
Tweet media one
4
0
21
@s_tworkowski
Szymon Tworkowski
1 year
We improve GSM8K from 13% to 17% after 35B tokens without in-distribution training. We also publish Focused Transformer code for long-context pre-training, used in LongLLaMA! GitHub: HF (checkpoint): arXiv:
1
2
20
@s_tworkowski
Szymon Tworkowski
1 year
Colab: HF: Code: We also announce LongLLaMA-v1.1, a 3B base model trained with our FoT method on 5B tokens at 32K context: we improved long-context and code capabilities (12% HumanEval pass@1)
1
3
19
@s_tworkowski
Szymon Tworkowski
1 year
#ACL2023 is over! It was so exciting to talk LLMs with y'all! I hope that some of these insights will shape the future of large language models. Hopefully see you at #NeurIPS2023 !
Tweet media one
0
3
18
@s_tworkowski
Szymon Tworkowski
1 year
📄 We show that despite poor performance in solving competitive programming problems, LLMs exhibit a strong capacity in describing and explaining solutions. We try to disentangle the contribution of reasoning and coding in solving these problems by LLMs.
2
1
16
@s_tworkowski
Szymon Tworkowski
1 year
Thanks to using Focused Transformer (FoT), our inference is > 3x faster than the baseline at 16K tokens. We only use long-range attention in a subset of layers (3 out of 32). It is only 10% of the vanilla attention FLOPs. See by @harmdevries77 for details
Tweet media one
1
2
15
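A quick back-of-envelope check of the "10% of the vanilla attention FLOPs" figure above, under the stated 3-of-32 layer split:

```python
# Long-range attention runs in only 3 of 32 layers, so its share of the
# long-context attention cost is roughly 3/32.
total_layers, memory_layers = 32, 3
print(memory_layers / total_layers)  # ~0.094, i.e. about 10%
```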
@s_tworkowski
Szymon Tworkowski
1 year
LongLLaMA-Instruct was initialized from LongLLaMA-v1.1 32K and fine-tuned with context of just 2048(!) for 0.07 epochs. We observe that despite short-context fine-tuning, we don't lose the long-context capabilities of the base model (see by @Francis_YAO_ ).
1
2
12
@s_tworkowski
Szymon Tworkowski
1 year
As indicated by , current LLMs are no good at solving competitive programming problems outside of their training distribution, achieving a very low rating of 392 as reported by @OpenAI , corresponding to barely the 5th percentile of human competitors (avg. ~1450)
@cHHillee
Horace He
1 year
I suspect GPT-4's performance is influenced by data contamination, at least on Codeforces. Of the easiest problems on Codeforces, it solved 10/10 pre-2021 problems and 0/10 recent problems. This strongly points to contamination. 1/4
Tweet media one
Tweet media two
81
691
4K
1
0
12
@s_tworkowski
Szymon Tworkowski
1 year
We obtain exactly the same HumanEval perf as CodeLLaMA, and improve MMLU from 37.2% to 40% (scoring setup, no CoT) and gsm8k-py from 23.4% to 24.9%. We also outperform LLaMA2. SFT could unleash the full potential of this model; stay tuned for instruct version - coming soon!
Tweet media one
1
1
12
@s_tworkowski
Szymon Tworkowski
1 year
In human evaluation, we observe gpt4 is better than gpt3.5 at understanding the main idea, but there's still a long way to go. We hypothesize our backward explanations could be useful to improve the model's forward reasoning process with techniques such as
Tweet media one
@Yuhu_ai_
Yuhuai (Tony) Wu
2 years
Language models can dramatically improve their reasoning by learning from chains of thought that they generate. With STaR, just a few worked examples can boost accuracy to that of a 30X larger model (GPT-J to GPT-3). W. @ericzelikman , Noah Goodman 1/
Tweet media one
8
94
528
1
0
11
@s_tworkowski
Szymon Tworkowski
1 year
Thrilled to be spotlighted in an interview with @AICoffeeBreak presenting my work at #ACL2023 ! Dive into the dialogue at 1:26 to catch my spicy 🌶️ insights on LLMs for code👨🏻‍💻, competitive programming, and the "understanding" of these models. Don't miss it!
@AICoffeeBreak
AI Coffee Break with Letitia
1 year
We summarized the #acl2023nlp Toronto conference for you with some poster recordings and author interviews! 👇 🎬 Featuring @s_tworkowski @jasivan_s @kundan_official @ebugliarello @leoduw @_florianmai @franz_nowak @PaulDarm @MoritzPlenz and @JayAlammar 👏
Tweet media one
0
23
54
0
4
10
@s_tworkowski
Szymon Tworkowski
1 year
Instead of solving the problem directly just from its NL description as LLM input (NL to code), we study the backward process of extracting the idea from a correct code solution. We show that our extracted rationales can significantly boost the solve rate of LLMs on CodeContests.
Tweet media one
1
2
11
@s_tworkowski
Szymon Tworkowski
1 year
CodeLLaMA is a great model, but apparently, it degrades GSM8k from 42.2% to 32.7% at 34B size. Is there a reason for not using a more balanced mixture during fine-tuning, instead of 85% code? I feel like the community might close this gap very soon :)
@soumithchintala
Soumith Chintala
1 year
CodeLlama -- a version of Llama2 that was fine-tuned for code tasks is live now. Available in 7B, 13B and 34B.
Tweet media one
17
206
952
1
4
11
@s_tworkowski
Szymon Tworkowski
1 year
Using insights from designing our prompt for the backward reasoning process, we propose a structured prompt that boosts the solve rate of these models just from problem statement to code (the original task, without golden explanations in the input) from 6.1% to 9.1% for pass@10.
Tweet media one
1
0
11
@s_tworkowski
Szymon Tworkowski
1 year
The base model is fine-tuned from OpenLLaMA v2 and released under the commercially permissive Apache 2.0 license. We used a combination of OpenOrca and for SFT. We open-source the code to facilitate efficient instruction tuning on your own data
1
5
11
@s_tworkowski
Szymon Tworkowski
1 year
LongLLaMA-Code was trained on a modest 35B pretraining tokens (a mix of webtext & code) to improve reasoning. While still exploratory, these results suggest base models of code are a promising avenue for enhancing reasoning capabilities @Francis_YAO_ []
1
1
10
@s_tworkowski
Szymon Tworkowski
1 year
The Focused Transformer and LongLLaMA are joint work with my amazing collaborators @CStanKonrad @MikolajPacek @Yuhu_ai_ @hmichalewski @PiotrRMilos ! Stay tuned for future releases of larger models (7B, 13B) fine-tuned for more long-context tokens!
0
3
10
@s_tworkowski
Szymon Tworkowski
9 months
The session is happening now! For those who missed the poster 😉
Tweet media one
1
1
10
@s_tworkowski
Szymon Tworkowski
1 year
This was an amazing project primarily done by Jierui Li from UT Austin, with my help & advised by Yingying Wu and Raymond Mooney. Acknowledgements to @IDEAS_NCBR for making the in-person presentation possible. For me, the project was huge fun, bringing back NOI and ICPC memories
1
1
9
@s_tworkowski
Szymon Tworkowski
11 months
Finally a 7B long-context model that you can actually run in an academic budget (single free Colab GPU)? Try to chat with it about papers & code!
@CStanKonrad
Konrad Staniszewski
11 months
🎇Introducing LongLLaMA-Code 7B Instruct 🦙!🎇 A step towards an open-source alternative to Claude 2. Run in 🆓 Colab (8-bit). 🗨 Answers questions about 📑 papers and >_ code. SOTA 7B reasoning: 🎓 GSM8K 65% (🐍 PoT, 0-shot) / 42% (standard CoT, 8-shot); >_ HumanEval 37%.
Tweet media one
3
44
261
0
2
9
@s_tworkowski
Szymon Tworkowski
1 year
In the colab demo we try to feed the entire Focused Transformer paper into context and ask questions about it, achieving reasonable results. In the same colab, we also provide a chat interface to interact with the model!
1
1
9
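For readers who want to try something similar, here is a hedged sketch of loading a LongLLaMA checkpoint with Hugging Face transformers and asking it about a long document; the repo id and file name below are assumptions, not taken from the tweet.

```python
# Minimal sketch, assuming a LongLLaMA checkpoint such as "syzymon/long_llama_3b"
# exists on the Hub and that the paper text is available locally as fot_paper.txt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "syzymon/long_llama_3b"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float32, trust_remote_code=True
)

paper = open("fot_paper.txt").read()
prompt = paper + "\n\nQuestion: What is the main idea of the Focused Transformer?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```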
@s_tworkowski
Szymon Tworkowski
1 year
@XueFz @DrJimFan Not sure if gpt4 is capable of solving novel lc medium/hard (outside of the training data). IMHO these tasks still require some amount of reasoning before coding. Once you come up with the idea, llm can code it up pretty easily. Check our work for more
2
0
8
@s_tworkowski
Szymon Tworkowski
10 months
How do math LLMs perform beyond academic benchmarks 🤔? Check it out!
@keirp1
Keiran Paster
10 months
Which LLMs are generally good at math and which are overfitting to benchmarks? With the release of Grok, @xai evaluated several closed models on a Hungarian national finals math exam which was published after the models were trained. This means it is impossible to train on or
Tweet media one
Tweet media two
24
84
560
0
3
8
@s_tworkowski
Szymon Tworkowski
10 months
@Francis_YAO_ Perplexity/LM loss gains (compression) seem pretty clear, e.g. from Memorizing Transformers paper, and context could be seen as an additional scaling dimension. Imo it's harder to quantify benefits from application of LCLMs, due to lack of reasonable downstream benchmarks 🤔
Tweet media one
0
0
9
@s_tworkowski
Szymon Tworkowski
1 year
Exploring the roots of the famous "6ND" formula, I stumbled upon this insightful post by @DBahdanau about LLM training compute estimation 👉 . Still on point! Wondering what >64k contexts/flashattn bring to the table 🧐, does FFN still dominate attention?
1
1
7
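For context, the "6ND" rule of thumb estimates training compute as roughly 6 × N parameters × D tokens; a quick worked example (numbers are illustrative, not from the post):

```python
N = 7e9    # parameters, e.g. a 7B model
D = 1e12   # training tokens, e.g. 1T
C = 6 * N * D
print(f"{C:.1e} FLOPs")  # ~4.2e22 FLOPs
```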
@s_tworkowski
Szymon Tworkowski
1 year
LongLLaMA v1.1 shows competitive performance on long-context retrieval tasks (see @nelsonfliu ), without degradation after instruction tuning. Also, the short-context performance on downstream tasks improves due to instruction tuning: 55% vs. 53% on lm-eval
Tweet media one
1
1
8
@s_tworkowski
Szymon Tworkowski
1 year
This is a joint work with my amazing collaborators @CStanKonrad @PiotrRMilos @hmichalewski ! Many thanks also to @p_nawrot and @Francis_YAO_ @S_Jaszczur for helpful suggestions!
1
2
7
@s_tworkowski
Szymon Tworkowski
1 year
arXiv: HF: Blogpost: coming soon! 🔥🔥
1
3
5
@s_tworkowski
Szymon Tworkowski
1 year
I'm presenting this paper on explaining competitive programming solutions with LLMs at 1:30pm (in 10min) in the big poster hall #ACL2023 - number 24, drop by and say hi!
0
2
6
@s_tworkowski
Szymon Tworkowski
1 year
@JohnGal43951639 @Yampeleg The memory overhead from longer context is much smaller than for standard models since we only access k/v from the extended context in a tiny fraction of layers (3 out of 26). With a simple trick we fit 32K context into a colab GPU (see the HF repo):
0
0
5
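Rough arithmetic behind the claim above: keeping extended-context keys/values in only 3 of 26 layers shrinks the long-context KV cache by roughly that same factor. The head count and head dimension below are assumptions for a 3B-class model, used purely for illustration.

```python
ctx, layers, mem_layers = 32_768, 26, 3
heads, head_dim, bytes_per = 32, 100, 2                  # assumed shapes, fp16 storage
kv_per_layer = 2 * ctx * heads * head_dim * bytes_per    # keys + values for one layer
print(kv_per_layer * layers / 2**30, "GiB if every layer cached 32K context")
print(kv_per_layer * mem_layers / 2**30, "GiB with 3 memory layers")
```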
@s_tworkowski
Szymon Tworkowski
11 months
@Francis_YAO_ There are very few evaluation datasets for long-context LLM. shows comparison between open-source and Claude, which is pretty miserable - still a long way to go for OSS 🤔, but developing a leaderboard like MT Bench for LCLM would be immensely useful!
0
0
5
@s_tworkowski
Szymon Tworkowski
1 year
Our model is also significantly faster at inference due to only having a subset of layers (3/26) attend to tokens beyond the local context window. LongLLaMA-Instruct is a competitive 3B model you can run on a colab or locally and use as a long-context chatbot (see colab demo!)
1
0
4
@s_tworkowski
Szymon Tworkowski
1 year
Nothing works better for chilling out than google recalling your compute in the middle of training, 2 days before a planned release. You can just watch netflix or write some codeforces round 🤣
Tweet media one
@4evaBehindSOTA
tokenbender
1 year
Nothing works better for concentration than renting high-cost compute. You have to use every minute well.
Tweet media one
1
0
12
2
1
4
@s_tworkowski
Szymon Tworkowski
9 months
Best poster design at the entire #EMNLP2023 😎, not to mention the awesome technical contribution that #nanoT5 has made, check it out!
@p_nawrot
Piotr Nawrot
9 months
Everyone's invited to stop by the poster session of NLP-OSS Workshop at #EMNLP where you can see this piece of art poster by yourself in-person This is the last post about nanoT5 from me, if you haven't seen it check out Thanks for all the kind feedback!
Tweet media one
0
1
19
1
0
4
@s_tworkowski
Szymon Tworkowski
11 months
Amazing initiative with COLM 🎉! One more step for inclusivity could be making the location accessible to all 🌍. It'd cut down on the visa hassle 🛂 that often plagues U.S. conferences (e.g. #NeurIPS2023 ) . Any thoughts on the potential location? 🙂
@srush_nlp
Sasha Rush
11 months
Introducing COLM () the Conference on Language Modeling. A new research venue dedicated to the theory, practice, and applications of language models. Submissions: March 15 (it's pronounced "collum" 🕊️)
Tweet media one
34
434
2K
1
3
3
@s_tworkowski
Szymon Tworkowski
1 year
@4evaBehindSOTA @p_nawrot @CStanKonrad The FoT pretraining code is written in JAX; we're planning to release it after LongLLaMA v2 release - stay tuned :)) No plans to implement Focused Transformer pretraining in PyTorch, so if you'd like to learn, doing this could be a great exercise and useful for the community!
1
0
3
@s_tworkowski
Szymon Tworkowski
9 days
@AlbertQJiang @WendaLi8 @Mateja_Jamnik Congrats! It’s been a pleasure to collaborate with you :)
0
0
3
@s_tworkowski
Szymon Tworkowski
1 year
Special thanks to @keirp1 for providing immensely valuable suggestions about the pre-training data! 🧡
0
5
3
@s_tworkowski
Szymon Tworkowski
9 months
@haozhangml How about OpenLLaMA by @younggeng ? The trajectory was published along with reproducible source code.
0
0
2
@s_tworkowski
Szymon Tworkowski
1 year
@Yampeleg Thanks for sharing our work!! :)
0
0
3
@s_tworkowski
Szymon Tworkowski
1 year
@mrm8488 Cool! Any humaneval numbers?
1
0
3
@s_tworkowski
Szymon Tworkowski
1 year
@PiotrRMilos I just feel lucky there are conferences that allow putting papers on arXiv and making an impact way before acceptance, which is not true for some other conference 😉. Thanks a lot to the entire team for the extremely hard work! @CStanKonrad @MikolajPacek @Yuhu_ai_ @hmichalewski
0
0
3
@s_tworkowski
Szymon Tworkowski
8 months
Finally! 🎉🎉🎉
@gneubig
Graham Neubig
8 months
ACL has removed the anonymity period. This means that ACL submissions can be posted and discussed online at any time, although extensive PR is discouraged.
Tweet media one
5
86
353
0
0
2
@s_tworkowski
Szymon Tworkowski
1 year
@AlbertQJiang Congrats!! The base model seems very strong, looking forward to the community's creativity in doing SFT :)
1
0
2
@s_tworkowski
Szymon Tworkowski
1 year
@tongshuangwu @kgashteo 太好了! (That's great!) I rarely see „xD” used here (it is very popular in my home country), curious where you learned that from :)
0
1
2
@s_tworkowski
Szymon Tworkowski
1 year
@KujoJot32604166 The current state is roughly here: We are working towards better pretraining data for long-context, and very likely to release something new in Oct/Nov :)
1
0
2
@s_tworkowski
Szymon Tworkowski
1 year
@tugot17 @dylan522p @mvpatel2000 @abhi_venigalla The pretraining data mixture for llama2 is not public ig, but it doesn’t seem to be qualitatively better than llama1, given its perf / train tokens. I think mistral used much more code in the mixture, which is likely to help reasoning benchmarks (although still speculative)
0
0
2
@s_tworkowski
Szymon Tworkowski
1 year
@atgambardella Wish I could motivate myself to study Mandarin 2h daily.. typically it’s 30min at most, extremely good job on your side and fingers crossed you master Japanese!
1
0
2
@s_tworkowski
Szymon Tworkowski
1 year
@OfirPress Do they actually finetune at 100k? I thought they finetune up to 16k and then try to extrapolate, but maybe I misread sth
0
0
2
@s_tworkowski
Szymon Tworkowski
1 year
@teortaxesTex @sytelus Why not NoPos? 🤔
0
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@minimario1729 @terese14711217 @AICoffeeBreak My suspicion is that openai’s gpt has already seen multiple epochs of solutions (based on its pre-cutoff solve rate) & an unclear number of epochs of editorials. You can check out our work to see how gpt-generated editorials/rationales could boost performance.
1
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@KujoJot32604166 Also, we are working on an instruction-tuned version of LongLLaMA-Code 7B. This is supposed to be released early Oct, and from preliminary results we expect it to be pretty good :) stay tuned!
0
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@4evaBehindSOTA @p_nawrot @CStanKonrad The code would have been much more performant in pytorch (flashattn etc.); it’s just a compute constraint that I only have TPU compute to pretrain on, so pytorch is not useful for me, but it def would be for the community!
1
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@abacaj I totally agree with that - training data is the most important for long context utilisation. I hope some upcoming ICLR submission will discuss this question!
0
0
1
@s_tworkowski
Szymon Tworkowski
11 months
@artificial_bug @CStanKonrad @abacaj Also really excited about the possibilities of running this on a free colab instance :) make sure to check out our work 😀
0
0
1
@s_tworkowski
Szymon Tworkowski
9 months
0
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@dylan522p @mvpatel2000 @abhi_venigalla You mean training time flops? I guess they have much higher quality data than llama2, so I wouldn’t be too surprised if they do even 2x better in terms of training flops for achieving the same downstream perf :p
2
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@laion_ai I don’t think it can compare even to gpt3.5 in terms of math/coding/reasoning performance. Looking forward to their paper and gsm8k & humaneval numbers. Given their data mixture, it’s not looking promising ngl :))
2
0
1
@s_tworkowski
Szymon Tworkowski
11 months
@Francis_YAO_ @denny_zhou I tried to reproduce zeroscrolls gpt4 results but unfortunately it seems difficult. :(
0
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@zhu_zhaocheng Pretty much, I’m interested in learning Mandarin these days :) I’ve just seen in what contexts my friends use the fake smile and do the same myself hhh
0
0
1
@s_tworkowski
Szymon Tworkowski
11 months
@Yampeleg Specifically, in Figure 6 we study what happens if we train the model on multiple examples, compared to just the same example. Perplexity on long, single documents gets better for a model that could see multiple unrelated examples in context, which we find quite surprising.
1
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@finbarrtimbers how to do flash attention with tpu? :))
1
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@atgambardella rn on my side it’s mostly memorising basic vocab and learning to distinguish between tones hhh but I hope I can start listening to some comprehensible input/read simple stuff soon like I did with European languages
1
0
1