The #xLSTM is finally live! What an exciting day!
How far do we get in language modeling with the LSTM compared to State-of-the-Art LLMs?
I would say pretty, pretty far!
How? We extend the LSTM with Exponential Gating and parallelizable Matrix Memory!
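For the curious, here is a condensed sketch of the mLSTM recurrence behind this, roughly in the notation of the xLSTM paper (stabilization of the exponential gates is omitted):

```latex
% mLSTM cell (sketch): matrix memory C_t with exponential input gate
\begin{align*}
  C_t &= f_t\, C_{t-1} + i_t\, v_t k_t^\top
    && \text{matrix memory update}\\
  n_t &= f_t\, n_{t-1} + i_t\, k_t
    && \text{normalizer state}\\
  h_t &= o_t \odot \frac{C_t\, q_t}{\max\{\lvert n_t^\top q_t\rvert,\, 1\}}
    && \text{hidden state / output}\\
  i_t &= \exp(\tilde{i}_t), \quad f_t = \sigma(\tilde{f}_t) \ \text{or} \ \exp(\tilde{f}_t)
    && \text{exponential gating}
\end{align*}
```

Because the memory update has no hidden-to-hidden nonlinearity, it can be computed in a parallel (chunkwise) form over the sequence, which is what makes the mLSTM trainable at scale.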
🚨 Exciting News! We are releasing the code for xLSTM! 🚀 🚀 🚀
Install it via: pip install xlstm
We have already experimented with it in other domains and the results are looking great so far! Stay tuned for more news to come! 🔜 👀
Repository:
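To get started after the pip install, here is a minimal usage sketch in PyTorch. The config fields are written from memory of the repository README, so treat the exact names and defaults as assumptions and check the repo for the authoritative example:

```python
import torch
from xlstm import xLSTMBlockStack, xLSTMBlockStackConfig, mLSTMBlockConfig

# A small mLSTM-only block stack operating on (batch, seq_len, embedding_dim) tensors.
# Field names/defaults follow my reading of the README; verify against the repository.
cfg = xLSTMBlockStackConfig(
    mlstm_block=mLSTMBlockConfig(),
    context_length=256,
    num_blocks=4,
    embedding_dim=128,
)
stack = xLSTMBlockStack(cfg)

x = torch.randn(2, 256, 128)   # (batch, sequence, embedding)
y = stack(x)                   # output has the same shape as the input
print(y.shape)
```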
Together with @KorbiPoeppel I presented our work on xLSTM at three workshops at ICML 2024, including an oral at ES-FOMO 🦾✨
I am super happy about all the positive feedback we got! 🤩
Can’t wait to scale xLSTM to multi billion parameters! Stay tuned! 🔜🔥🚀
So proud to see the xLSTM shining not only in language but also in vision!
Great work by @benediktalkin using the mLSTM as a generic vision backbone, outperforming Vision-Mamba and the original Vision Transformer!
Don't underestimate sLSTM! There are more domains to cover… 😉
Introducing Vision-LSTM - making xLSTM read images 🧠 It works ... pretty, pretty well 🚀🚀 But see for yourself :) We are happy to share the code already!
📜:
🖥️:
All credit to my stellar PhD colleague @benediktalkin.
Today I had the chance to present the #xLSTM at the @ELLISforEurope ELISE wrap-up conference in Helsinki as an ELLIS PhD spotlight presentation.
It is so exciting to be part of the unique ELLIS Network!
Thanks for sharing this, @bschoelkopf!
Thanks @srush_nlp for this compelling collection of recent RNN-based Language Models! I think you now have to update this list with the #xLSTM 😉
I agree, naming conventions are always hard...
In our paper we try to stick to the original LSTM formulation from the 1990s:
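Schematically, these are the vanilla LSTM equations referred to here (forget gate included, although it was added to the original 1997 cell shortly afterwards):

```latex
% Vanilla LSTM (sketch): gated cell state c_t and hidden state h_t
\begin{align*}
  z_t &= \varphi(W_z x_t + R_z h_{t-1} + b_z) && \text{cell input}\\
  i_t &= \sigma(W_i x_t + R_i h_{t-1} + b_i)  && \text{input gate}\\
  f_t &= \sigma(W_f x_t + R_f h_{t-1} + b_f)  && \text{forget gate}\\
  o_t &= \sigma(W_o x_t + R_o h_{t-1} + b_o)  && \text{output gate}\\
  c_t &= f_t \odot c_{t-1} + i_t \odot z_t    && \text{cell state}\\
  h_t &= o_t \odot \psi(c_t)                  && \text{hidden state}
\end{align*}
```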
There are like 4 more linear RNN papers out today, but they all use different naming conventions 🙃
Might be nice if people synced on the "iconic" version like QKV? Personally partial to: h = A h + B x, y = C h, where A, B = f(exp(d(x) i))
Thanks @ArmenAgha for reading our paper carefully! We checked our configs. For the 15B column we use 2e-3 for RWKV-4 760M and 3e-3 for xLSTM 125M. For the 300B column we use 1.5e-3 for Llama 350M and 1.25e-3 for Llama 760M. Thanks for pointing this out. We will update the paper.
I also stumbled across the cool "Deep Learning on a Data Diet" paper by @mansiege and @gkdziugaite and tried to reproduce their results.
I could not reproduce the GraNd results either and thought this must be a bug in my code. 🪲
Very cool that @BlackHC dug a bit deeper. ⛏️👷‍♂️
TL;DR: don't let me attend talks if you don't wanna find out that part of your paper might not reproduce 😅🔥
J/k ofc: @gkdziugaite and @mansiege were absolutely lovely to talk to throughout this and put good science above everything 🥳🫶
👇
Last week I had the pleasure of presenting our work "Addressing Parameter Choice Issues in Unsupervised Domain Adaptation by Aggregation" (published at ICLR 2023) at the ELLIS Doctoral Symposium 2023 in Helsinki.
Big thanks to the organizers for this fantastic event!
#EDS23
#ELLIS
Our paper "Addressing Parameter Choice Issues in Unsupervised Domain Adaptation by Aggregation" has been selected for oral presentation (notable-top-5%) at
#ICLR2023
. [1/n]
Special thanks to @KorbiPoeppel for spending long nights with me keeping all those GPUs busy.
And finally, big thanks to @HochreiterSepp for giving me the chance to work on the amazing #xLSTM project.
@predict_addict Sure, it can be used for time series forecasting. Our xLSTMBlockStack class is intended for easy integration into other frameworks, e.g. time series forecasting libraries.
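As an illustration of what such an integration could look like, here is a hypothetical forecasting wrapper around xLSTMBlockStack. Only xLSTMBlockStack and its config come from the library; the wrapper class, dimensions, and forecast horizon are made up for this sketch, and the config field names follow my reading of the repo README:

```python
import torch
import torch.nn as nn
from xlstm import xLSTMBlockStack, xLSTMBlockStackConfig, mLSTMBlockConfig


class xLSTMForecaster(nn.Module):
    """Hypothetical wrapper: feature projection -> xLSTM block stack -> forecast head."""

    def __init__(self, n_features: int, embedding_dim: int = 128,
                 context_length: int = 256, horizon: int = 24):
        super().__init__()
        self.input_proj = nn.Linear(n_features, embedding_dim)
        # Config fields are assumptions based on the README; check the repo for the exact API.
        self.backbone = xLSTMBlockStack(xLSTMBlockStackConfig(
            mlstm_block=mLSTMBlockConfig(),
            context_length=context_length,
            num_blocks=4,
            embedding_dim=embedding_dim,
        ))
        self.head = nn.Linear(embedding_dim, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_features) -> forecast: (batch, horizon)
        h = self.backbone(self.input_proj(x))
        return self.head(h[:, -1, :])   # predict the horizon from the last time step
```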
What I also found interesting, but did not find in the paper, was a histogram of the scores by class. This reveals the effectiveness of EL2N.
While for GraNd the scores are normally distributed, EL2N identifies some structure in the samples.
I used a ResNet-20 and tried different pruning methods, including random pruning, keeping the highest/lowest scores, and enforcing class balance.
EL2N worked well even without manual class balancing.
#deeplearning
#reproducibility
Paper:
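For anyone trying to reproduce this, here is a minimal sketch of the EL2N score as defined in the "Data Diet" paper, i.e. the L2 norm of the difference between the softmax output and the one-hot label (the paper averages this over several models early in training; the helper name and the pruning snippet are my own illustration):

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def el2n_scores(model: torch.nn.Module, loader, num_classes: int,
                device: str = "cpu") -> torch.Tensor:
    """EL2N score per example: ||softmax(f(x)) - onehot(y)||_2 for a single model."""
    model.eval()
    scores = []
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        probs = F.softmax(model(x), dim=-1)
        onehot = F.one_hot(y, num_classes).float()
        scores.append(torch.linalg.vector_norm(probs - onehot, dim=-1))
    return torch.cat(scores)


# Example pruning step: keep the 50% hardest examples (highest scores).
# scores = el2n_scores(resnet20, train_loader, num_classes=10)
# keep_idx = scores.argsort(descending=True)[: int(0.5 * len(scores))]
```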
@ArmenAgha Thanks for the hint! What learning rates would you suggest for these sizes? In the Llama 2 paper we only found learning rates for models of 7B+ parameters, which is why we took these learning rates from the Mamba paper:
Our paper "Addressing Parameter Choice Issues in Unsupervised Domain Adaptation by Aggregation" has been selected for oral presentation (notable-top-5%) at
#ICLR2023
. [1/n]
@predict_addict @f_kyriakopoulos In our paper we focus on language modeling and describe how one can use xLSTM for this use case.
The idea of the code release is that people can experiment with it themselves and build on it.