Szymon Tworkowski Profile Banner
Szymon Tworkowski Profile
Szymon Tworkowski

@s_tworkowski

5,097
Followers
543
Following
19
Media
536
Statuses

research @ xAI | prev. @GoogleAI @UniWarszawski | LongLLaMA | long-context LLMs and math reasoning | scaling maximalist

Palo Alto
Joined November 2021
Pinned Tweet
@s_tworkowski
Szymon Tworkowski
1 year
Introducing LongLLaMA 🦙, an unlimited-context version of OpenLLaMA fine-tuned at 8k & capable of extrapolating to 256k tokens! We train it using our new Focused Transformer 🎯 technique (FoT). No degradation on short context, drop-in compatibility & Apache 2.0 license 🔥🔥 🧵
Tweet media one
36
228
834
@s_tworkowski
Szymon Tworkowski
1 year
🎇Introducing LongLLaMA-Instruct 32K!🎇 Inspired by @p_nawrot #nanoT5 , we fine-tune LongLLaMA on a *single GPU* for ~48h to improve upon OpenLLaMA: 55% on lm-eval (vs. 53%), better perf on long context and code! We open-source our optimized fine-tuning code in PyTorch/HF!🧵
Tweet media one
9
69
290
@s_tworkowski
Szymon Tworkowski
1 year
✨Announcing LongLLaMA-Code 7B!✨ Have you wondered how GPT3.5 obtained its capabilities? Are base models of code better reasoners? 🤔 We continue pre-training CodeLLaMA on text & code to improve reasoning 🧠 Bonus: 3x faster inference @ 16K context, using Focused Transformer 🎯
Tweet media one
5
56
310
@s_tworkowski
Szymon Tworkowski
9 months
Honored to win Poland's best CS master's thesis prize for my work on long-context LLMs w/ @PiotrRMilos 🎉 Can't make it to #NeurIPS2023 😭, but @CStanKonrad will present the LongLLaMA paper tmr! Thu 10:45, Poster #326, Session 5. Interested in extending context to 256K? Come and say hi!
Tweet media one
3
18
71
@s_tworkowski
Szymon Tworkowski
1 year
LLMs struggle with solving simple competitive programming problems (e.g. Codeforces) outside their training data. Our #ACL #NLRSE paper (Thurs 19:30) investigates their ability to comprehend and reason about human-coded solutions. Can they grasp the main idea from just the code?
Tweet media one
6
46
170
@s_tworkowski
Szymon Tworkowski
4 months
✈️ Flying to Vienna for #ICLR2024 ! DM me if you want to chat about LLM reasoning, long context or any other topics :)
1
78
137
@s_tworkowski
Szymon Tworkowski
18 days
It’s been a blast to work on Grok 2 & Grok 2 mini with the team & push it to the frontier! 🚀
@lmsysorg
lmsys.org
18 days
Chatbot Arena update❤️‍🔥 Exciting news— @xAI 's Grok-2 and Grok-mini are now officially on the leaderboard! With over 6000 community votes, Grok-2 has claimed the #2 spot, surpassing GPT-4o (May) and tying with the latest Gemini! Grok-2-mini also impresses at #5 . Grok-2 excels in
Tweet media one
193
521
2K
9
46
215
@s_tworkowski
Szymon Tworkowski
6 months
A glimpse over our recent progress - exciting things to come!
@xai
xAI
6 months
773
1K
7K
5
3
141
@s_tworkowski
Szymon Tworkowski
5 months
👀👀
@xai
xAI
5 months
👀
725
1K
7K
2
6
126
@s_tworkowski
Szymon Tworkowski
6 months
@inflectionAI
Inflection AI
6 months
Today at Inflection we are announcing some important updates. A new phase for the company begins now. Read more here:
67
91
596
3
5
83
@s_tworkowski
Szymon Tworkowski
2 months
Exciting job opportunity! Hiring CUDA engineers 🧙 Join us & run your code on 100k GPUs 🚀
1
5
83
@s_tworkowski
Szymon Tworkowski
2 months
gpus go brrrrr 🧑‍🍳🧑‍🍳
@xai
xAI
2 months
how june started & how it’s going. Come 🧑‍🍳 with us at xAI & 𝕏 if you like building & running the biggest computers in the world!
Tweet media one
Tweet media two
1K
2K
12K
2
0
75
@s_tworkowski
Szymon Tworkowski
1 year
Thrilled to be a first-gen MSc! 🎓 Just defended my thesis on ‘Fine-tuning Large Language Models for Long Context Utilization’ at the University of Warsaw. Check out our recent work if you’re curious how to extend the context of LLaMA🦙 up to 256K and remain efficient at inference!
Tweet media one
5
8
66
@s_tworkowski
Szymon Tworkowski
1 year
Extremely lucky to get Focused Transformer (FoT) accepted at #NeurIPS2023 🎉! It is my first first-author paper at a big conference, which makes this moment even more special🎇 Feel free to check our recent LongLLaMA release using FoT! Unthinkable with ACL arxiv embargo policy
@s_tworkowski
Szymon Tworkowski
1 year
✨Announcing LongLLaMA-Code 7B!✨ Have you wondered how GPT3.5 obtained its capabilities? Are base models of code better reasoners? 🤔 We continue pre-training CodeLLaMA on text & code to improve reasoning 🧠 Bonus: 3x faster inference @ 16K context, using Focused Transformer 🎯
Tweet media one
5
56
310
4
13
59
@s_tworkowski
Szymon Tworkowski
27 days
sus🍓🍓 Join us
@xai
xAI
27 days
1K
2K
9K
1
2
52
@s_tworkowski
Szymon Tworkowski
27 days
column ai ftw🏛️🏛️
7
3
45
@s_tworkowski
Szymon Tworkowski
1 year
This is joint work with my awesome collaborators @CStanKonrad , @MikolajPacek , @PiotrRMilos from @IDEAS_NCBR , and @Yuhu_ai_ , @hmichalewski from @GoogleAI ! Colab: GitHub: Checkpoint:
3
4
41
@s_tworkowski
Szymon Tworkowski
1 year
FoT is a simple modification of the vanilla Transformer - instead of increasing context window length in all layers, we access previous windows of the training batch (containing tokens from the current and other docs) in a subset of attention layers called memory layers.
Tweet media one
1
3
31
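To make the memory-layer idea above concrete, here is a minimal PyTorch sketch of what such a layer might look like; it is my own illustration under assumed names (`memory_attention`, `MEMORY_LAYERS`), not the released FoT code.

```python
# Hedged sketch of a "memory layer": ordinary attention over the local window,
# extended with keys/values cached from previous windows of the training batch.
import torch
import torch.nn.functional as F

def memory_attention(q, k, v, mem_k=None, mem_v=None):
    """q, k, v: (batch, heads, local_len, head_dim); mem_k/mem_v: cached keys/values
    from previous windows, possibly containing tokens from other documents."""
    if mem_k is not None:
        k = torch.cat([mem_k, k], dim=2)   # prepend memory keys
        v = torch.cat([mem_v, v], dim=2)   # prepend memory values
    return F.scaled_dot_product_attention(q, k, v)

# Only a small subset of layers are memory layers; the rest keep the short window.
MEMORY_LAYERS = {6, 12, 18}  # hypothetical layer indices
```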
@s_tworkowski
Szymon Tworkowski
1 year
📄 We show that LMs suffer from the "distraction issue" i.e. struggle to handle multiple documents in one context. Our Focused Transformer training objective (FoT) alleviates this by attending to tokens from the same doc (positive) and other docs (negative)
Tweet media one
1
0
30
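A rough sketch of how the positive/negative mixture described above could be assembled during training; the helper below and its name are hypothetical, shown only to illustrate the idea.

```python
# Hedged sketch: the memory attached to a training window mixes earlier windows of
# the same document ("positives") with windows from other documents in the batch
# ("negatives"), so the model must learn to ignore the irrelevant tokens.
import random

def build_memory(batch_windows, doc_id, d=2):
    """batch_windows: dict mapping doc_id -> list of previously seen windows."""
    positives = list(batch_windows[doc_id])
    other_docs = [i for i in batch_windows if i != doc_id]
    negatives = []
    for other in random.sample(other_docs, k=min(d - 1, len(other_docs))):
        negatives.extend(batch_windows[other])
    return positives + negatives
```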
@s_tworkowski
Szymon Tworkowski
10 months
Never thought prompt engineering could be this fun until #PromptIDE ! Check it out!
@xai
xAI
10 months
Announcing the xAI PromptIDE The xAI PromptIDE is an integrated development environment for prompt engineering and interpretability research. It accelerates prompt engineering through an SDK that allows implementing complex prompting techniques and rich analytics that visualize
1K
1K
9K
2
14
27
@s_tworkowski
Szymon Tworkowski
1 year
Unlike prior work focusing on position encodings, we follow and achieve extrapolation by simply keeping positionals constant for memory tokens, while leaving the local context intact. This makes LongLLaMA backward compatible with LLaMA inference code.
1
1
27
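A tiny sketch of the position-handling trick mentioned above, assuming memory tokens share one constant position index while the local window keeps its usual positions; this is my illustration, not the actual LongLLaMA code.

```python
import torch

def position_ids(mem_len, local_len, mem_position=0):
    # Memory tokens all receive the same constant position, so the model never
    # sees position values beyond those encountered during training; the local
    # context keeps its ordinary 0..local_len-1 positions.
    mem_pos = torch.full((mem_len,), mem_position, dtype=torch.long)
    local_pos = torch.arange(local_len, dtype=torch.long)
    return torch.cat([mem_pos, local_pos])
```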
@s_tworkowski
Szymon Tworkowski
11 months
@Yampeleg In our Focused Transformer paper we propose a contrary view: packing multiple examples in one context can be beneficial if you optimize for long-context capabilities. This is because the model learns to ignore irrelevant tokens in context.
1
3
24
@s_tworkowski
Szymon Tworkowski
1 year
Surprisingly, we observe that apart from expected gains in multi-document settings, FoT (d=2) also improves perplexity on long, single documents, compared to training on just positive documents (d=1). We find this important, as the amount of long-context training data is limited.
Tweet media one
4
0
21
@s_tworkowski
Szymon Tworkowski
1 year
We improve GSM8K from 13% to 17% after 35B tokens without in-distribution training. We also publish Focused Transformer code for long-context pre-training, used in LongLLaMA! GitHub: HF (checkpoint): arXiv:
1
2
20
@s_tworkowski
Szymon Tworkowski
1 year
Colab: HF: Code: We also announce LongLLaMA-v1.1, a 3B base model trained with our FoT method on 5B tokens at 32K context: we improved long-context and code capabilities (12% HumanEval pass@1)
1
3
19
@s_tworkowski
Szymon Tworkowski
1 year
#ACL2023 is over! It was so exciting to talk LLMs with y'all! I hope that some of these insights will shape the future of large language models. Hopefully see you at #NeurIPS2023 !
Tweet media one
0
3
18
@s_tworkowski
Szymon Tworkowski
1 year
📄 We show that despite poor performance in solving competitive programming problems, LLMs exhibit a strong capacity in describing and explaining solutions. We try to disentangle the contribution of reasoning and coding in solving these problems by LLMs.
2
1
16
@s_tworkowski
Szymon Tworkowski
1 year
Thanks to using Focused Transformer (FoT), our inference is > 3x faster than the baseline at 16K tokens. We only use long-range attention in a subset of layers (3 out of 32). It is only 10% of the vanilla attention FLOPs. See by @harmdevries77 for details
Tweet media one
1
2
15
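A quick back-of-envelope check of the "10% of the vanilla attention FLOPs" figure above, under the stated 3-of-32 layer split:

```python
# Long-range attention runs in only 3 of 32 layers, so its share of the
# long-context attention cost is roughly 3/32.
total_layers, memory_layers = 32, 3
print(memory_layers / total_layers)  # ~0.094, i.e. about 10%
```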
@s_tworkowski
Szymon Tworkowski
1 year
LongLLaMA-Instruct was initialized from LongLLaMA-v1.1 32K and fine-tuned with context of just 2048(!) for 0.07 epochs. We observe that despite short-context fine-tuning, we don't lose the long-context capabilities of the base model (see by @Francis_YAO_ ).
1
2
12
@s_tworkowski
Szymon Tworkowski
1 year
As indicated by , current LLMs are no good at solving competitive programming problems outside of their training distribution, achieving a very low rating of 392 as reported by @OpenAI , corresponding to barely the 5th percentile of human competitors (avg. ~1450)
@cHHillee
Horace He
1 year
I suspect GPT-4's performance is influenced by data contamination, at least on Codeforces. Of the easiest problems on Codeforces, it solved 10/10 pre-2021 problems and 0/10 recent problems. This strongly points to contamination. 1/4
Tweet media one
Tweet media two
81
691
4K
1
0
12
@s_tworkowski
Szymon Tworkowski
1 year
We obtain exactly the same HumanEval perf as CodeLLaMA, and improve MMLU from 37.2% to 40% (scoring setup, no CoT) and gsm8k-py from 23.4% to 24.9%. We also outperform LLaMA2. SFT could unleash the full potential of this model; stay tuned for instruct version - coming soon!
Tweet media one
1
1
12
@s_tworkowski
Szymon Tworkowski
1 year
In human evaluation, we observe gpt4 is better than gpt3.5 at understanding the main idea, but there's still a long way to go. We hypothesize our backward explanations could be useful to improve the model's forward reasoning process with techniques such as
Tweet media one
@Yuhu_ai_
Yuhuai (Tony) Wu
2 years
Language models can dramatically improve their reasoning by learning from chains of thought that they generate. With STaR, just a few worked examples can boost accuracy to that of a 30X larger model (GPT-J to GPT-3). W. @ericzelikman , Noah Goodman 1/
Tweet media one
8
94
528
1
0
11
@s_tworkowski
Szymon Tworkowski
1 year
Thrilled to be spotlighted in an interview with @AICoffeeBreak presenting my work at #ACL2023 ! Dive into the dialogue at 1:26 to catch my spicy 🌶️ insights on LLMs for code👨🏻‍💻, competitive programming, and the "understanding" of these models. Don't miss it!
@AICoffeeBreak
AI Coffee Break with Letitia
1 year
We summarized the #acl2023nlp Toronto conference for you with some poster recordings and author interviews! 👇 🎬 Featuring @s_tworkowski @jasivan_s @kundan_official @ebugliarello @leoduw @_florianmai @franz_nowak @PaulDarm @MoritzPlenz and @JayAlammar 👏
Tweet media one
0
23
54
0
4
10
@s_tworkowski
Szymon Tworkowski
1 year
Instead of solving the problem directly just from its NL description as LLM input (NL to code), we study the backward process of extracting the idea from a correct code solution. We show that our extracted rationales can significantly boost the solve rate of LLMs on CodeContests.
Tweet media one
1
2
11
@s_tworkowski
Szymon Tworkowski
1 year
CodeLLaMA is a great model, but apparently, it degrades GSM8k from 42.2% to 32.7% at 34B size. Is there a reason for not using a more balanced mixture during fine-tuning, instead of 85% code? I feel like the community might close this gap very soon :)
@soumithchintala
Soumith Chintala
1 year
CodeLlama -- a version of Llama2 that was fine-tuned for code tasks is live now. Available in 7B, 13B and 34B.
Tweet media one
17
206
952
1
4
11
@s_tworkowski
Szymon Tworkowski
1 year
Using insights from designing our prompt for the backward reasoning process, we propose a structured prompt that boosts the solve rate of these models just from problem statement to code (the original task, without golden explanations in the input) from 6.1% to 9.1% for pass@10.
Tweet media one
1
0
11
@s_tworkowski
Szymon Tworkowski
1 year
The base model is fine-tuned from OpenLLaMA v2 and released under the commercially permissive Apache 2.0 license. We used a combination of OpenOrca and for SFT. We open-source the code to facilitate efficient instruction tuning on your own data
1
5
11
@s_tworkowski
Szymon Tworkowski
1 year
LongLLaMA-Code was trained on a modest 35B pretraining tokens (a mix of webtext & code) to improve reasoning. While still exploratory, these results suggest base models of code are a promising avenue for enhancing reasoning capabilities @Francis_YAO_ []
1
1
10
@s_tworkowski
Szymon Tworkowski
1 year
The Focused Transformer and LongLLaMA are joint work with my amazing collaborators @CStanKonrad @MikolajPacek @Yuhu_ai_ @hmichalewski @PiotrRMilos ! Stay tuned for future releases of larger models (7B, 13B) fine-tuned for more long-context tokens!
0
3
10
@s_tworkowski
Szymon Tworkowski
9 months
The session is happening now! For those who missed the poster 😉
Tweet media one
1
1
10
@s_tworkowski
Szymon Tworkowski
1 year
This was an amazing project primarily done by Jierui Li from UT Austin, with my help & advised by Yingying Wu and Raymond Mooney. Acknowledgements to @IDEAS_NCBR for making the in-person presentation possible. For me, the project was huge fun, bringing back NOI and ICPC memories
1
1
9
@s_tworkowski
Szymon Tworkowski
11 months
Finally a 7B long-context model that you can actually run in an academic budget (single free Colab GPU)? Try to chat with it about papers & code!
@CStanKonrad
Konrad Staniszewski
11 months
🎇Introducing LongLLaMA-Code 7B Instruct 🦙!🎇 A step towards an open-source alternative to Claude 2. Run in 🆓 Colab (8-bit). 🗨 Answers questions about 📑 papers and >_ code. SOTA 7B reasoning: 🎓 GSM8K 65% (🐍 PoT, 0-shot) / 42% (standard CoT, 8-shot); >_ HumanEval 37%.
Tweet media one
3
44
261
0
2
9
@s_tworkowski
Szymon Tworkowski
1 year
In the colab demo we try to feed the entire Focused Transformer paper into context and ask questions about it, achieving reasonable results. In the same colab, we also provide a chat interface to interact with the model!
1
1
9
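For readers who want to try something similar, here is a hedged sketch of loading a LongLLaMA checkpoint with Hugging Face transformers and asking it about a long document; the repo id and file name below are assumptions, not taken from the tweet.

```python
# Minimal sketch, assuming a LongLLaMA checkpoint such as "syzymon/long_llama_3b"
# exists on the Hub and that the paper text is available locally as fot_paper.txt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "syzymon/long_llama_3b"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float32, trust_remote_code=True
)

paper = open("fot_paper.txt").read()
prompt = paper + "\n\nQuestion: What is the main idea of the Focused Transformer?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```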
@s_tworkowski
Szymon Tworkowski
1 year
@XueFz @DrJimFan Not sure if gpt4 is capable of solving novel lc medium/hard (outside of the training data). IMHO these tasks still require some amount of reasoning before coding. Once you come up with the idea, llm can code it up pretty easily. Check our work for more
2
0
8
@s_tworkowski
Szymon Tworkowski
10 months
How do math LLMs perform beyond academic benchmarks 🤔? Check it out!
@keirp1
Keiran Paster
10 months
Which LLMs are generally good at math and which are overfitting to benchmarks? With the release of Grok, @xai evaluated several closed models on a Hungarian national finals math exam which was published after the models were trained. This means it is impossible to train on or
Tweet media one
Tweet media two
24
84
560
0
3
8
@s_tworkowski
Szymon Tworkowski
10 months
@Francis_YAO_ Perplexity/LM loss gains (compression) seem pretty clear, e.g. from Memorizing Transformers paper, and context could be seen as an additional scaling dimension. Imo it's harder to quantify benefits from application of LCLMs, due to lack of reasonable downstream benchmarks 🤔
Tweet media one
0
0
9
@s_tworkowski
Szymon Tworkowski
1 year
Exploring the roots of the famous "6ND" formula, I stumbled upon this insightful post by @DBahdanau about LLM training compute estimation 👉 . Still on point! Wondering what >64k contexts/flashattn bring to the table 🧐, does FFN still dominate attention?
1
1
7
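For context, the "6ND" rule of thumb estimates training compute as roughly 6 × N parameters × D tokens; a quick worked example (numbers are illustrative, not from the post):

```python
N = 7e9    # parameters, e.g. a 7B model
D = 1e12   # training tokens, e.g. 1T
C = 6 * N * D
print(f"{C:.1e} FLOPs")  # ~4.2e22 FLOPs
```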
@s_tworkowski
Szymon Tworkowski
1 year
LongLLaMA v1.1 shows competitive performance on long-context retrieval tasks (see @nelsonfliu ), without degradation after instruction tuning. Also, the short-context performance on downstream tasks improves due to instruction tuning: 55% vs. 53% on lm-eval
Tweet media one
1
1
8
@s_tworkowski
Szymon Tworkowski
1 year
This is a joint work with my amazing collaborators @CStanKonrad @PiotrRMilos @hmichalewski ! Many thanks also to @p_nawrot and @Francis_YAO_ @S_Jaszczur for helpful suggestions!
1
2
7
@s_tworkowski
Szymon Tworkowski
1 year
arXiv: HF: Blogpost: coming soon! 🔥🔥
1
3
5
@s_tworkowski
Szymon Tworkowski
1 year
I'm presenting this paper on explaining competitive programming solutions with LLMs at 1:30pm (in 10min) in the big poster hall #ACL2023 - number 24, drop by and say hi!
0
2
6
@s_tworkowski
Szymon Tworkowski
1 year
@JohnGal43951639 @Yampeleg The memory overhead from longer context is much smaller than for standard models since we only access k/v from the extended context in a tiny fraction of layers (3 out of 26). With a simple trick we fit 32K context into a colab GPU (see the HF repo):
0
0
5
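Rough arithmetic behind the claim above: keeping extended-context keys/values in only 3 of 26 layers shrinks the long-context KV cache by roughly that same factor. The head count and head dimension below are assumptions for a 3B-class model, used purely for illustration.

```python
ctx, layers, mem_layers = 32_768, 26, 3
heads, head_dim, bytes_per = 32, 100, 2                  # assumed shapes, fp16 storage
kv_per_layer = 2 * ctx * heads * head_dim * bytes_per    # keys + values for one layer
print(kv_per_layer * layers / 2**30, "GiB if every layer cached 32K context")
print(kv_per_layer * mem_layers / 2**30, "GiB with 3 memory layers")
```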
@s_tworkowski
Szymon Tworkowski
11 months
@Francis_YAO_ There are very few evaluation datasets for long-context LLM. shows comparison between open-source and Claude, which is pretty miserable - still a long way to go for OSS 🤔, but developing a leaderboard like MT Bench for LCLM would be immensely useful!
0
0
5
@s_tworkowski
Szymon Tworkowski
1 year
Our model is also significantly faster at inference due to only having a subset of layers (3/26) attend to tokens beyond the local context window. LongLLaMA-Instruct is a competitive 3B model you can run on a colab or locally and use as a long-context chatbot (see colab demo!)
1
0
4
@s_tworkowski
Szymon Tworkowski
1 year
Nothing works better for chilling out than google recalling your compute in the middle of training, 2 days before a planned release. You can just watch netflix or write some codeforces round 🤣
Tweet media one
@4evaBehindSOTA
tokenbender
1 year
Nothing works better for concentration than renting high-cost compute. You have to use every minute well.
Tweet media one
1
0
12
2
1
4
@s_tworkowski
Szymon Tworkowski
9 months
Best poster design at the entire #EMNLP2023 😎, not to mention the awesome technical contribution that #nanoT5 has made, check it out!
@p_nawrot
Piotr Nawrot
9 months
Everyone's invited to stop by the poster session of NLP-OSS Workshop at #EMNLP where you can see this piece of art poster by yourself in-person This is the last post about nanoT5 from me, if you haven't seen it check out Thanks for all the kind feedback!
Tweet media one
0
1
19
1
0
4
@s_tworkowski
Szymon Tworkowski
11 months
Amazing initiative with COLM 🎉! One more step for inclusivity could be making the location accessible to all 🌍. It'd cut down on the visa hassle 🛂 that often plagues U.S. conferences (e.g. #NeurIPS2023 ) . Any thoughts on the potential location? 🙂
@srush_nlp
Sasha Rush
11 months
Introducing COLM () the Conference on Language Modeling. A new research venue dedicated to the theory, practice, and applications of language models. Submissions: March 15 (it's pronounced "collum" 🕊️)
Tweet media one
34
434
2K
1
3
3
@s_tworkowski
Szymon Tworkowski
1 year
@4evaBehindSOTA @p_nawrot @CStanKonrad The FoT pretraining code is written in JAX; we're planning to release it after LongLLaMA v2 release - stay tuned :)) No plans to implement Focused Transformer pretraining in PyTorch, so if you'd like to learn, doing this could be a great exercise and useful for the community!
1
0
3
@s_tworkowski
Szymon Tworkowski
9 days
@AlbertQJiang @WendaLi8 @Mateja_Jamnik Congrats! It’s been a pleasure to collaborate with you :)
0
0
3
@s_tworkowski
Szymon Tworkowski
1 year
Special thanks to @keirp1 for providing immensely valuable suggestions about the pre-training data! 🧡
0
5
3
@s_tworkowski
Szymon Tworkowski
9 months
@haozhangml How about OpenLLaMA by @younggeng ? The trajectory was published along with reproducible source code.
0
0
2
@s_tworkowski
Szymon Tworkowski
1 year
@Yampeleg Thanks for sharing our work!! :)
0
0
3
@s_tworkowski
Szymon Tworkowski
1 year
@mrm8488 Cool! Any humaneval numbers?
1
0
3
@s_tworkowski
Szymon Tworkowski
1 year
@PiotrRMilos I just feel lucky there are conferences that allow putting papers on arXiv and making an impact way before acceptance, which is not true for some other conference 😉. Thanks a lot to the entire team for the extremely hard work! @CStanKonrad @MikolajPacek @Yuhu_ai_ @hmichalewski
0
0
3
@s_tworkowski
Szymon Tworkowski
8 months
Finally! 🎉🎉🎉
@gneubig
Graham Neubig
8 months
ACL has removed the anonymity period. This means that ACL submissions can be posted and discussed online at any time, although extensive PR is discouraged.
Tweet media one
5
86
353
0
0
2
@s_tworkowski
Szymon Tworkowski
1 year
@AlbertQJiang Congrats!! The base model seems very strong, looking forward to the community's creativity in doing SFT :)
1
0
2
@s_tworkowski
Szymon Tworkowski
1 year
@tongshuangwu @kgashteo 太好了! (That's great!) I rarely see „xD” used here (it is very popular in my home country), curious where you learned that from :)
0
1
2
@s_tworkowski
Szymon Tworkowski
1 year
@KujoJot32604166 The current state is roughly here: We are working towards better pretraining data for long-context, and very likely to release something new in Oct/Nov :)
1
0
2
@s_tworkowski
Szymon Tworkowski
1 year
@tugot17 @dylan522p @mvpatel2000 @abhi_venigalla The pretraining data mixture for llama2 is not public ig, but it doesn’t seem to be qualitatively better than llama1, given its perf / train tokens. I think mistral used much more code in the mixture, which is likely to help reasoning benchmarks (although still speculative)
0
0
2
@s_tworkowski
Szymon Tworkowski
1 year
@atgambardella Wish I could motivate myself to study Mandarin 2h daily.. typically it’s 30min at most, extremely good job on your side and fingers crossed you master Japanese!
1
0
2
@s_tworkowski
Szymon Tworkowski
1 year
@OfirPress Do they actually finetune at 100k? I thought they finetune up to 16k and then try to extrapolate, but maybe I misread sth
0
0
2
@s_tworkowski
Szymon Tworkowski
1 year
@teortaxesTex @sytelus Why not NoPos? 🤔
0
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@minimario1729 @terese14711217 @AICoffeeBreak My suspicion is that openai’s gpt has already seen multiple epochs of solutions (based on its pre-cutoff solve rate) & an unclear number of epochs of editorials. You can check out our work to see how gpt-generated editorials/rationales could boost performance.
1
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@KujoJot32604166 Also, we are working on an instruction-tuned version of LongLLaMA-Code 7B. This is supposed to be released early Oct, and from preliminary results we expect it to be pretty good :) stay tuned!
0
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@4evaBehindSOTA @p_nawrot @CStanKonrad The code would have been much more performant in pytorch (flashattn etc.); it’s just a compute constraint that I only have TPU compute to pretrain on, so pytorch is not useful for me, but it def would be for the community!
1
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@abacaj I totally agree with that - training data is the most important for long context utilisation. I hope some upcoming ICLR submission will discuss this question!
0
0
1
@s_tworkowski
Szymon Tworkowski
11 months
@artificial_bug @CStanKonrad @abacaj Also really excited about the possibilities of running this on a free colab instance :) make sure to check out our work 😀
0
0
1
@s_tworkowski
Szymon Tworkowski
9 months
0
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@dylan522p @mvpatel2000 @abhi_venigalla You mean training time flops? I guess they have much higher quality data than llama2, so I wouldn’t be too surprised if they do even 2x better in terms of training flops for achieving the same downstream perf :p
2
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@laion_ai I don’t think it can compare even to gpt3.5 in terms of math/coding/reasoning performance. Looking forward to their paper and gsm8k & humaneval numbers. Given their data mixture, it’s not looking promising ngl :))
2
0
1
@s_tworkowski
Szymon Tworkowski
11 months
@Francis_YAO_ @denny_zhou I tried to reproduce zeroscrolls gpt4 results but unfortunately it seems difficult. :(
0
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@zhu_zhaocheng Pretty much, I’m interested in learning Mandarin these days :) I’ve just seen in what contexts my friends use the fake smile and do the same myself hhh
0
0
1
@s_tworkowski
Szymon Tworkowski
11 months
@Yampeleg Specifically, in Figure 6 we study what happens if we train the model on multiple examples, compared to just the same example. Perplexity on long, single documents gets better for a model that could see multiple unrelated examples in context, which we find quite surprising.
1
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@finbarrtimbers how to do flash attention with tpu? :))
1
0
1
@s_tworkowski
Szymon Tworkowski
1 year
@atgambardella rn on my side it’s mostly memorising basic vocab and learning to distinguish between tones hhh but I hope I can start listening to some comprehensible input/read simple stuff soon like I did with European languages
1
0
1