To all my friends: over the past 3 months, whenever I said "Sorry, can't make it tonight, gotta work" or "Sorry, I'm busy this weekend" but couldn't really say what exactly we were working on, THIS was the monster we were building @DbrxMosaicAI. #DBRX
Meet DBRX, a new SOTA open LLM from @databricks. It's a 132B-parameter MoE with 36B active params, trained from scratch on 12T tokens. It sets a new bar on all the standard benchmarks, and, as an MoE, inference is blazingly fast. Simply put, it's the model your data has been waiting for.
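The 36B-active / 132B-total split is what top-k expert routing buys you: each token only runs through the few experts its router selects, so compute scales with active params, not total. A toy NumPy sketch of that mechanism (expert count, top-k, and dimensions here are illustrative, not DBRX's actual configuration):

```python
import numpy as np

def moe_forward(x, router_w, experts, k=4):
    """Toy top-k mixture-of-experts layer.

    Only the k selected experts run per token, which is why an MoE
    with many experts has far fewer *active* params than total params.
    """
    logits = x @ router_w                      # router scores, shape (n_experts,)
    top = np.argsort(logits)[-k:]              # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                       # softmax renormalized over top-k
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.normal(size=d)
router_w = rng.normal(size=(d, n_experts))
# Each "expert" is just a linear map in this sketch.
weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda v, W=W: v @ W for W in weights]

y = moe_forward(x, router_w, experts, k=4)
print(y.shape)  # (8,)
```

With 16 experts and top-4 routing, only 4/16 of the expert parameters participate in any one forward pass, which is the intuition behind "blazingly fast" MoE inference.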
Thank you @DimitrisPapail for being such an amazing advisor!!! Your help and guidance have been invaluable throughout my PhD! I am also extremely lucky to have had the opportunity to collaborate with some really brilliant researchers from various universities and organizations!
It was great to have @shashank_r12 share with the BuzzRobot community details about DBRX, the large language model created by @databricks. Shashank walked us through the architecture of the model, the hyperparameter choices, and the software and hardware issues the team experienced.
A big advantage of the two-column paper format is that people can comfortably read your paper on their phones. Kind of embarrassing that I only realized this today, after years of reading papers on my phone 😅
@OfirPress @BlancheMinerva @xlr8harder In our (preliminary) experiments we also see ALiBi and RoPE have matching training curves (in fact ALiBi converges a bit faster initially). Performance is also similar for evals on sequence lengths up to the max seq len seen during training. Beyond that, ALiBi extrapolates better.
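For context on why ALiBi extrapolates: it adds a head-specific linear distance penalty to attention scores rather than rotating q/k as RoPE does, so nothing in it is tied to absolute positions seen in training. A minimal sketch of the bias matrix, using the geometric slope schedule from the ALiBi paper (head count and sequence length are illustrative):

```python
import numpy as np

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    """Per-head ALiBi bias, shape (n_heads, seq_len, seq_len).

    The bias for query i attending to key j is -slope * (i - j),
    a linear penalty on distance. A causal mask is applied separately,
    so only the j <= i entries matter in practice.
    """
    # Geometric slope schedule (simple case: n_heads a power of 2).
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    pos = np.arange(seq_len)
    dist = pos[None, :] - pos[:, None]   # dist[i, j] = j - i (<= 0 for past keys)
    return slopes[:, None, None] * dist[None, :, :]

bias = alibi_bias(n_heads=8, seq_len=4)
print(bias.shape)      # (8, 4, 4)
print(bias[0, 3, 0])   # -1.5: head 0 (slope 0.5) penalizes a distance of 3
```

Because the penalty depends only on relative distance, the same bias formula applies unchanged at sequence lengths never seen in training, which matches the extrapolation behavior described above.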
@DbrxMosaicAI Feeling ecstatic that all the hard work by the team paid off! It was the greatest experience working with all the talented and hardworking folks @DbrxMosaicAI! Looking forward to building even bigger and better LLMs!
@HongyiWang10 is one of the best researchers I've worked with. I was really lucky to have him as a senior PhD student in the lab when I started my PhD. He is one of the few people I know who has comprehensive expertise in both ML and systems. Congratulations Hongyi!!!
[1/n] I'm thrilled to share that I will join the Rutgers CS Department @RutgersCS as a tenure-track Assistant Professor in the summer of 2025! I'm excited about, and looking forward to, this new chapter of my career!
@Gradient_AI_ @AIatMeta @huggingface @CrusoeEnergy Wow! Amazing work! It seems that you used 2.8 billion as the RoPE theta, which is much, much bigger than any RoPE theta seen in other models. How did you come up with that value?
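To see why a huge RoPE theta matters for long context: theta is the base of the per-pair rotation frequencies, and a larger base makes the low-frequency pairs rotate far more slowly, keeping positional phases distinguishable over much longer sequences. A small sketch comparing the common base of 10,000 against 2.8 billion (the dimension is illustrative):

```python
import numpy as np

def rope_freqs(dim: int, theta: float) -> np.ndarray:
    """RoPE rotation frequencies: freq_i = theta ** (-2i / dim)
    for each even index i. Larger theta => slower low-end rotations
    => phases wrap around much later in the sequence.
    """
    return theta ** (-np.arange(0, dim, 2) / dim)

# The slowest frequency shrinks by orders of magnitude as theta grows.
for theta in (10_000.0, 2.8e9):
    f = rope_freqs(dim=128, theta=theta)
    print(f"theta={theta:g}  slowest freq={f[-1]:.3e}")
```

The first pair always has frequency 1 regardless of theta; it's the tail of the spectrum that stretches out, which is the usual motivation for raising theta in long-context models.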
If WeChat is banned, then how will Chinese people working in the US communicate with their families back home? I've been told of alternatives like QQ and Skype, but what is the guarantee that the US, or even China (in retaliation), wouldn't ban those in the future?
@zeroXmusashi @madiator @_nikhilmehta @YiTayML @vqctran Yes, in order to use this to build a recommender system for a dataset, you only need two things: user session data, and some semantic data about each item. For TikTok, the latter could be things like the video title, tags, caption, or creator's name.
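Those two ingredients are enough for a bare-bones content-based recommender: embed each item from its semantic text, build a profile from the session history, and rank unseen items by similarity. A toy sketch with a bag-of-words "embedding" (the item IDs and texts below are invented for illustration; a real system would use learned embeddings):

```python
import math
from collections import Counter

# Ingredient 2: semantic data per item (titles/tags, here as toy strings).
items = {
    "v1": "cat funny pets",
    "v2": "dog funny pets",
    "v3": "cooking pasta recipe",
}

def embed(text: str) -> Counter:
    """Bag-of-words stand-in for a real semantic embedding."""
    return Counter(text.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(session: list, items: dict, k: int = 1) -> list:
    """Ingredient 1: the user's session. Score every unseen item
    by similarity to the aggregated session profile."""
    profile = Counter()
    for item_id in session:
        profile += embed(items[item_id])
    scores = {
        i: cosine(profile, embed(t))
        for i, t in items.items() if i not in session
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend(["v1"], items))  # ['v2'] — shares "funny pets" with v1
```

The point of the sketch is just the data-flow claim from the tweet: nothing beyond session histories and per-item semantic text is required to get a working (if crude) recommender off the ground.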
My first PhD student defended today, and it filled my heart with bittersweet joy.
Congratulations Dr. Hongyi Wang @HongyiWang10!
It has been an incredible honor to serve as your advisor. I can't wait to see the great things you will accomplish.
@unsorsodicorda @aminkarbasi @DimitrisPapail @ten10_93 Great question! We have a margin assumption on the points, and in fact, the "exponential improvement" is in the dependence on the margin. Indeed, modern DNNs can interpolate training datasets, and this was one of the motivations for our work.