[6 Sept 2024]
Reflection 70B, by Matt from IT
Congrats to @mattshumer_ and @csahil28 on training the world's top open model and passing the vibe check! As @polynoamial said today: "1 engineer working in the right direction beats 100 geniuses working in …"
Avoiding "mid" AI output:
smol tip from our work today - don't talk to LLMs* like you talk to humans. You are fundamentally still programming with English!!
❌ Tell it what outcome you want, let it figure out how to get there
✅ Tell it what steps to perform, give tokens and …
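To make the tip concrete, here's a minimal sketch of the two prompt styles (the `complete()` helper and the prompts themselves are hypothetical, not from our stack):

```python
# Two ways to ask for the same thing. The ❌ prompt states an outcome;
# the ✅ prompt spells out steps, output format, and a token budget.
OUTCOME_PROMPT = "Summarize today's AI news."  # ❌ outcome only

STEPS_PROMPT = """Follow these steps exactly:
1. Read the items below and pick the 3 most important.
2. For each, write one sentence on what happened and one on why it matters.
3. Output a Markdown list, nothing else, under 150 tokens.
"""  # ✅ explicit steps + an output contract

def complete(prompt: str) -> str:
    """Stand-in for whatever chat-completion call you actually use."""
    return f"(model output for: {prompt[:40]}...)"

print(complete(STEPS_PROMPT))
```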
[Jun 10 2024]
Talaria, the MLOps superweapon powering Apple Intelligence
helps Apple create the set of 20 ~3.5-bit LoRAs showcased today, hot-swapped atop Apple's new ~3B on-device model to beat Mistral, Gemma, and Phi-3 at ~30 tok/s
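Apple's Talaria stack isn't public, but a rough open-source analogue of the adapter hot-swapping idea, using Hugging Face `peft` (the model ID and adapter paths/names below are placeholders), might look like:

```python
# Sketch: many small task-specific LoRAs hot-swapped over one shared base
# model. Paths and names are hypothetical; this is not Apple's implementation.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("your-3b-base-model")

# Each adapter is a tiny LoRA delta, not a separate full model.
model = PeftModel.from_pretrained(base, "adapters/summarize", adapter_name="summarize")
model.load_adapter("adapters/proofread", adapter_name="proofread")

# "Hot-swap": switch the active adapter in place, per request.
model.set_adapter("proofread")
```

The win is memory: one ~3B base stays resident, and each task pays only for its small adapter.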
It's notable how predictive the LMSYS Elo vs. $ pricing curve is, and how the strategy is panning out. Today's Gemini Pro price cut brings it exactly in line with where a log-linear pricing curve predicts it should be for its Elo.
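For the curious, the check is a one-liner: regress log(price) on Elo and read off the implied price. A sketch with made-up numbers (the real chart uses live Arena Elos and list prices):

```python
# Fit log10(price) as a linear function of Arena Elo, then ask where a
# given model "should" be priced. All numbers below are illustrative.
import numpy as np

elo   = np.array([1150, 1180, 1210, 1230, 1250])  # hypothetical Elos
price = np.array([0.3, 1.0, 5.0, 10.0, 15.0])     # hypothetical $/M tokens

slope, intercept = np.polyfit(elo, np.log10(price), 1)

def fair_price(e: float) -> float:
    """Price implied by the log-linear curve for a model with Elo e."""
    return 10 ** (slope * e + intercept)

print(f"a 1200-Elo model 'should' cost ~${fair_price(1200):.2f}/M tokens")
```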
More broadly, the Frontier model race is now back
Two new production Gemini models, >2x higher rate limits, >50% price drop on Gemini 1.5 Pro, filters switched to opt-in, updated Flash 8B experimental model, and more.
It’s a good day to be a developer : )
AINews: 29 May 2024
What if you KNEW that we may soon have models that can continuously process and reason over text/audio/video with a TRILLION token "context window"?
Real time? On device?
thanks to @cartesia_ai, @krandiash, @_albertgu
[28 Aug 2024]
Cerebras Inference: Faster, Better, AND Cheaper
congrats to @CerebrasSystems for vaulting to the top of the @ArtificialAnlys leaderboard for price and speed at full precision!!!
[17 July 2024]
Mini, Nemo, Turbo, Lite - Smol models go brrr!
Vibe check of 4o vs mini for instruction following vs summarization:
- we find some examples where mini is worse, some examples where it is better. Mini is subjectively:
- unevenly worse at formatting
[18 Sept 2024]
For the first time ever, an LLM has been able to 100% match and accurately report what we consider to be the top stories of the day without our intervention.
@openai o1 destroys @Lmsysorg Arena
@Alibaba_Qwen 2.5
@kyutai_labs Moshi
[8 Aug 2024]
Too Cheap To Meter: AI prices cut 50-70% in last 30 days
Price cuts of @lmsysorg top models in the last 30 days:
- Rank 2: GPT4o cut 50% from May to Aug
- Rank 3: GPT4o-mini cut >70% vs GPT3.5/4T
- Rank 4: Llama 3.1 405b cut 46% in first 48hrs
- Rank 8: @MistralAI …
[29 Aug 2024]
Summer of Code AI: $1.6b raised, 1 usable product
- @cognition_labs: $175m
- @poolsideai: $400m
- @codeiumdev: $150m
- @magicailabs: $320m
You can only use one of these products right now, but there's lots of promise!
🐣 Introducing `smol-developer`!
▸ Human-centric, coherent whole program synthesis
▸ your own junior developer
▸ develop, debug, decompile
▸ open source:
▸ 200 LOC, half english
Insights:
💡 100k context can summarize both content and codebases
💡
Tiny Language Models (below 10M parameters, or with only one transformer block) can generate paragraphs of coherent text and reason… provided training is limited to stories that only contain words a typical 3-to-4-year-old usually understands.
Paper -
[2 Jul 2024]
GraphRAG: The Marriage of Knowledge Graphs and RAG
This is finally on our radar thanks to @emileifrem's tireless advocacy, and MSR now open-sourcing their code!
[9 July 2024]
Depth is all you need:
- Everybody is sleeping on @lilianweng's latest review of Hallucination Detection/Prevention/Evals
- @ylecun and @reach_vb on MobileLLM takeaways
- Summary of @xiaolonw et al.'s Test Time Training architecture research.
A late issue today due
[6 Aug 2024]
GPT4o August + 100% Structured Outputs for All!
The newest model + API from @michpokrass and @athyuttamre seriously impressed us! We cut 20 lines of Instructor code and will probably save about 55% of our API costs between the price cut, the better model, and fewer retries.
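For readers who haven't tried it: the new mode lets you pass a Pydantic model directly and get validated objects back, which is exactly the Instructor glue it replaces. A minimal sketch (the schema and prompt are illustrative; needs a recent `openai` Python SDK):

```python
# Structured Outputs: the response is constrained to the schema and comes
# back parsed and validated, so no retry/repair loop is needed.
from pydantic import BaseModel
from openai import OpenAI

class Story(BaseModel):
    title: str
    url: str
    why_it_matters: str

client = OpenAI()
resp = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Summarize today's top AI story."}],
    response_format=Story,
)
story = resp.choices[0].message.parsed  # a validated Story instance
print(story.title)
```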
Llama Vision took the spotlight this week, but don't sleep on @allen_ai Molmo, which is now the #2 vision language model in the world* but is completely open:
1. Molmo 72B scores higher than L3.2 90B, same for 7B > L3.2 11B.
2. Arena ratings for 72B beat Gemini 1.5 Pro and …
My dream model
- 6 modalities
- 300-500M total params
- very aligned & instruction following
- 27-33 heads
- cute animal name
- loves humans
- rlly helpful but a bit naughty
- rlly harmless & has guardrails
- rlly honest
- not from bigcorp (high quality volunteer)
- very smol
[17 July 2024]
Gemma 2 tops /r/LocalLlama vibe check
Less than a month old but already handily beating Llama 3, Phi 3, Qwen 2, Mistral, and everyone else!
[19 July 2024]
DataComp-LM: The Best New Open Weights model!
An incredible paper, benchmark, dataset, and model from the DataComp team and Apple ML Research!
[21 Aug 2024]
Ideogram 2 + Function Calling Leaderboard V2
'Tis the season of sequels.
After the spectacular launch of Flux (the former Stable Diffusion team), @ideogram_ai (the former Google Imagen 1 team) is back with a vengeance. A new model, with 5 distinct styles with …
[8 July 2024]
Problems with MMLU-Pro
Before @DanHendrycks could complete MMLU 2, the community has embraced MMLU-Pro as the de facto replacement. However, the /r/LocalLlama gang is finding some broken English and obvious discrepancies favoring the closed models over open ones:
Tired of MMLU? The current models already hit the ceiling? It's time to upgrade MMLU!
Introducing our new benchmark MMLU-Pro, a more robust and challenging massive multi-task language understanding benchmark with 12K questions.
What's New?
1. MMLU-Pro uses 10 options instead of the original 4 …
[23 July 2024]
Llama 3.1: The Synthetic Data Model
From the paper, we detail all the ways in which synthetic data was used to get Llama 3 to frontier-level performance across code, instruction following, math, multilinguality, long context, tool use, and RLHF.
Llama 3: the Synthetic Data model
Llama 3 paper is finally out! By @lvdmaaten and Angela Fan. Quick diffs from yesterday's leaks (+ watch our exclusive @ThomasScialom interview, out now!)
- NEW SCALING LAWS! Turns out there's a reason why they trained a 405B-param model, because …
[14 Jun 2024]
Nemotron-4-340B: NVIDIA's new large open models, built on syndata, great for syndata
h/t @ctnzr, @kuchaev et al
"Notably, throughout the entire alignment process, we relied on only approximately 20K human-annotated data... while our data generation pipeline
Not Llama 3 405B, but Nemotron 4 340B! @nvidia just released a 340B dense LLM matching the original @OpenAI GPT-4 performance for chat applications and synthetic data generation. 🤯 NVIDIA does not claim ownership of any outputs generated. 💚
TL;DR:
🧮 340B parameters with 4k …
AINews: 30 May 2024
@jaseweston taught LLMs to count using this ONE weird trick. @giffmana thinks this is more promising than linear attention, and @krishnanrohit says you can use this to add EXTERNAL MEMORY to attention...
🚨 Contextual Position Encoding (CoPE) 🚨
Context matters! CoPE is a new positional encoding method for transformers that takes into account *context*.
- Can "count" distances per head dependent on need, e.g. i-th sentence or paragraph, words, verbs, etc. Not just tokens.
- …
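The core mechanism is simple enough to sketch: gate each previous token with a sigmoid of its query-key score, then let the "position" of token j relative to token i be the cumulative sum of gates between them. A single-head toy version (the paper's interpolation of fractional position embeddings and the full attention wiring are omitted):

```python
# Toy sketch of CoPE's contextual positions. p[i, j] counts gated tokens
# between j and i instead of raw token distance, so a head can learn to
# count e.g. sentences or verbs rather than tokens.
import torch

def cope_positions(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """q, k: (seq, dim) for one head. Returns p of shape (seq, seq)."""
    gates = torch.sigmoid(q @ k.T)             # g[i, j] in (0, 1)
    mask = torch.tril(torch.ones_like(gates))  # causal: only j <= i count
    gated = gates * mask
    # p[i, j] = sum of g[i, j'] for j <= j' <= i  (reverse cumsum over j)
    return gated.flip(-1).cumsum(-1).flip(-1) * mask

q, k = torch.randn(8, 16), torch.randn(8, 16)
print(cope_positions(q, k)[7])  # positions of all tokens relative to token 7
```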
[11 Jun 2024]
Today's feature is @fchollet and @mikeknoop's new $1m ARC Prize, which proposes a benchmark that directly tracks François' definitions of AGI and is designed to have a much lower saturation curve than others:
[12 Jun 2024]
The Last Hurrah of Stable Diffusion?
The ~2B param SD3 Medium is finally here, but whither Stability? Will they ever release the 8B model? What's after SD3?
something we didn't catch: 4o mini uses more tokens for images than 4o, scaled to match the SAME COST per image (aka a lot more tokens). this leads me to suspect that 4o mini will be a -lot- better at vision tasks. has anyone verified? what's a good benchmark for this?
@yitayml vibeeval?
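The back-of-envelope math behind that suspicion, using the mid-2024 list prices (a sketch only; check current pricing before relying on it):

```python
# If mini's per-token price is ~33x lower but an image must cost the same
# dollars on both models, mini has to bill ~33x the image tokens.
price_4o   = 5.00   # $/M input tokens, GPT-4o (mid-2024 list price)
price_mini = 0.15   # $/M input tokens, GPT-4o mini

multiplier = price_4o / price_mini
print(f"same image cost => mini bills ~{multiplier:.0f}x the image tokens")
# -> ~33x; more tokens per image could mean finer-grained vision encoding
```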
[15 July 2024]
AgentInstruct: Toward Generative Teaching with Agentic Flows
The future of synthetic pre- and post- training data could be armies of agents doing macrodata refinement for us.
A recipe for Synthetic Data 2.0? @Microsoft introduced "AgentInstruct", a new way to teach an LLM a new skill or behavior from synthetic data generated by LLM agents. AgentInstruct improved a 7B (Orca-3) model by ~20% across all benchmarks and matched GPT-4 on RAG.
AgentInstruct
Meanwhile in AI Engineer land, @shishirpatil_ updated the Berkeley Function Calling Leaderboard (now commonly known as BFCL) to BFCL V2 • Live, adding 2,251 "live, user-contributed function documentation and queries, avoiding the drawbacks of dataset contamination and biased …
AINews for 31 May 2024
Anthropic's Tool Use API is GA and very fully featured!
- streaming
- forced use
- vision
- 5 architectures for agents
- a @CodeColt course on Tool Use!
congrats to the team on a great rollout.
Excited to announce that we’re spinning up an AI educational program and we just released our first course on tool use!
Let me walk you through what it covers:
It is with heavy hearts that we announce we will shut down over the next few weeks.
We appreciate our passionate community of customers & users that have supported us over the past few years. ❤️
We thank you for understanding. Here’s some more information ⤵️🧵
More than a year ago, Ghostwriter proof-of-concept took a few hours to prototype. Now it's a flagship product for Replit.
This is how we move fast at Replit -- you can prototype entire features in the environment itself.
In this case, the PoC worked by hosting a small OSS LLM
SciCode is our new benchmark that challenges LMs to code solutions for scientific problems from advanced papers. The challenges were crafted by PhDs;
~10% of our benchmark is based on Nobel-winning research.
GPT-4 and Sonnet 3.5 get <5% ACC.
🧵 1/6
Nice paper surveying Multimodal AI Architectures -- with a comprehensive taxonomy and analysis of their pros/cons & applications in any-to-any modality model development
📌 Comprehensive Taxonomy: First work to explicitly identify and categorize four broad …
@ssi @SakanaAILabs @youdotcom @AnthropicAI we are experiencing some very serious email deliverability issues with Buttondown and were unable to resolve them today; sorry, but the archives at least are still up.
[18 June 2024]
Gemini launches context caching... or does it?
Today was a great day for AINews followups:
- Nvidia's Nemotron now ranks #1 open model on @LMsysorg and #11 overall (beating Llama-3-70b, which maybe isn't that impressive but perhaps wasn't the point),
- Meta's …
@chongdashu we also just realized that a huge % of our mailing list was marked undeliverable without notifying us - almost surely a bug. @justinmduke is on it