I have written a reply to
@leopoldasch
's Situational Awareness piece. In it, I show how Leopold's proposals would cause the very catastrophes he fears, and I propose democratically minded alternatives.
Part 1 - Highlights from the Llama 3.1 paper
1. A classifier downsamples over-represented web data.
2. Scaling law experiments predict large model performance from small models.
3. Final mix: 50% general knowledge, 25% math/reasoning, 17% code, 8% multilingual.
4. In the last 40
OpenAI's o1 model is a preview of the direction closed-source AI is taking. You can't interrogate or understand its reasoning without risking a ban. You can't even ask what its directives or motivations are without risking a ban. o1 is not aligned with you; it is aligned with
Every day we see inference optimizations. I strongly believe that CPU inference is going to be usable in the near future, even without major hardware changes.
This paper makes clever use of lookup tables to minimize the GPU / system memory bottleneck.
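The idea can be illustrated with a toy lookup-table scheme (my own sketch, not the paper's method; the function names and the uniform 16-entry codebook are assumptions): weights are stored as small integer indices, so only the indices plus a tiny LUT need to cross the memory bus, and values are reconstructed on the fly.

```python
import numpy as np

def quantize(weights: np.ndarray, levels: int = 16):
    """Build a uniform codebook (LUT) and map each weight to its nearest entry.
    Toy illustration: real schemes use smarter codebooks and bit-packing."""
    lut = np.linspace(weights.min(), weights.max(), levels)
    idx = np.abs(weights[:, None] - lut[None, :]).argmin(axis=1).astype(np.uint8)
    return lut, idx

def dequantize(lut: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights with a single table lookup."""
    return lut[idx]
```

With 4-bit indices instead of 16- or 32-bit floats, memory traffic drops roughly 4-8x at the cost of a small reconstruction error.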
AI skeptics don't appreciate that AI is already transforming the world around them. The transformation is happening because efficient people in the highest-pressure jobs excel at adopting new productivity tools. For instance, in the first few weeks after its release, ChatGPT is
AlphaProof showcases the power of an
unconstrained qualitative -> constrained symbolic synthetic data pipeline with self-play in the constrained symbolic space. This will be the formula for lots of synthetic reasoning data from here on out. Presumably,
Another really crucial paper for those of us trying to bootstrap model self-improvement.
This paper describes how automated instruction tuning of base models is achieved. The prompting techniques for automatic refinement are applicable to a wide variety
arXiv has become my personal box of infinite chocolates. The hit ratio has gotten so extreme with the intense community focus on LLMs that it is rare I do not find myself grinning ear to ear within five papers. There are ideas in the past two months' worth of papers that are
For those of us interested in using synthetic data to enhance model performance, in this excellent paper: researchers employ a prompting strategy inspired by genetic optimization (involving crossover and mutation of prompts, along with a fitness judging
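As a rough illustration of the loop described (my sketch; the paper's actual operators and LLM-based fitness judge are more sophisticated, and every name here is made up):

```python
import random

def crossover(p1: str, p2: str) -> str:
    """Splice the first half of one prompt onto the back half of another."""
    w1, w2 = p1.split(), p2.split()
    return " ".join(w1[: len(w1) // 2] + w2[len(w2) // 2 :])

def mutate(prompt: str) -> str:
    """Toy mutation operator: append a random instruction phrase."""
    return prompt + " " + random.choice(["carefully", "step by step", "concisely"])

def evolve(population, fitness, generations=5, keep=2):
    """Keep the fittest prompts each generation; refill via crossover + mutation."""
    for _ in range(generations):
        population = sorted(population, key=fitness, reverse=True)
        survivors = population[:keep]
        children = [mutate(crossover(*random.sample(survivors, 2)))
                    for _ in range(len(population) - keep)]
        population = survivors + children
    return max(population, key=fitness)
```

In the paper, `fitness` would be an LLM judge scoring each prompt's downstream task performance rather than something trivial like string length.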
@polynoamial
@OpenAI
o1 is a major step backward on all metrics of openness. It seems like the move of a company desperate to maintain its technical edge, which is suggestive of an unproductive internal research program.
This paper generates a synthetic dataset of logical fallacies and then finetunes LLMs to produce fewer such fallacies in argumentation: Pretty cool. For the good stuff, go to the appendix as usual, after the citations. I would like to see a paper that unifies
For those trying to figure out how to get LLMs to best follow workflows, the paper “FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents” introduces FlowBench, the first benchmark designed to evaluate LLM agents in planning tasks
Yeah, ty, I don't want my 'exocortex' to be a closed source blob someone can use against me. And what is this snobbery about open source always being inferior? Demonstrably not the case atm
Andrej Karpathy says that as we expand our brains into an exocortex on a computing substrate, we will be renting our brains, and open source will become more important because, as he puts it, "not your weights, not your brain"
This whole interview is outstanding and worth a listen. David Luan (who runs an agent provider company) offers his insights on how agents will fit into the future economy and his intuition about what comes next with AI
This is a very relevant paper for those of us looking to improve model reasoning performance cheaply and effectively:
It trains Llama 3 8B to reach GPT-4-like performance on a novel benchmark (up from an incredibly poor baseline) by creating synthetic
So this paper improves reasoning dramatically without even needing to fine-tune the low-cost models used. It's a very clever variation on Monte Carlo tree search, with superior stepping and intermediate success metrics based on self-play. It
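A stripped-down sketch of the general shape (mine, not the paper's exact algorithm: `propose` stands in for the LLM sampling a next reasoning step, and `score` for the self-play-based intermediate success metric):

```python
import math, random

class Node:
    """A node holds a partial chain of reasoning steps."""
    def __init__(self, steps, parent=None):
        self.steps, self.parent = steps, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    """Upper confidence bound: trade off exploitation vs. exploration."""
    if node.visits == 0:
        return float("inf")
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def search(root_steps, propose, score, iters=50):
    root = Node(root_steps)
    for _ in range(iters):
        node = root
        while node.children:                              # 1. selection
            node = max(node.children, key=ucb)
        child = Node(node.steps + [propose(node.steps)],  # 2. expansion
                     parent=node)
        node.children.append(child)
        reward = score(child.steps)                       # 3. step-level eval
        while child:                                      # 4. backpropagation
            child.visits += 1
            child.value += reward
            child = child.parent
    # return the chain of the most-visited first step
    return max(root.children, key=lambda n: n.visits).steps
```

The key departure from vanilla MCTS is that `score` rates intermediate steps rather than only full rollouts, which is what makes the stepping cheap.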
This is a cool (if slightly mind-bending) paper wherein GPT-2 embeddings are used to distill the 'essence' of context for a system that matches contexts to possible actions by sampling a distribution of possible linear models that do the same. Uncertainty is estimated based on
Mistral Large 2 achieving Llama 3.1 405B-like performance with only 123B parameters makes me think there is either a lot of synthetic data wizardry or distillation wizardry happening. Impressive.
In this really cool paper (), an 8-billion-parameter model is used to create a synthetic dataset. The model is fine-tuned on this dataset to improve its ability to both generate responses and judge its own responses, effectively allowing it (the fine-tuned
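Schematically, one round of that loop looks something like this (a hypothetical sketch of the shape only; `generate` and `judge` stand in for sampling from and prompting the same model, and the real recipe has more moving parts):

```python
def self_reward_round(model, prompts, n_candidates=4):
    """One round: the model generates candidates, judges them itself, and the
    best/worst pair per prompt becomes preference data for the next fine-tune."""
    pairs = []
    for prompt in prompts:
        candidates = [model.generate(prompt) for _ in range(n_candidates)]
        ranked = sorted(candidates, key=lambda c: model.judge(prompt, c))
        pairs.append({"prompt": prompt,
                      "chosen": ranked[-1],     # highest self-assigned score
                      "rejected": ranked[0]})   # lowest self-assigned score
    return pairs  # feed into preference tuning (e.g. DPO), then repeat
```

Because the judge improves alongside the generator, each round can (up to a point) produce better training pairs than the last.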
The convergence of performance across various architectures has convinced me that there is no secret highly performant architecture we are missing. Instead, there are a variety of ways to achieve good performance, with different tradeoffs. For example, in this paper
According to this article, LexisNexis' retrieval augmented generation (RAG) based system for answering questions on caselaw for lawyers has a really severe hallucination problem (up to 33% of the time)
Preparing my popcorn for total legal chaos!
Me perfecting an algorithm while talking to 3 different AIs then looking up and realizing that I need to wake up in 3 hours to catch an international flight 🫤😪
What if we used AI to predict how likely AI is to displace human labor in different occupations? These Italian scholars clearly couldn't resist (). Using a taxonomic breakdown of occupational tasks and AI raters, they created a composite index of displacement likelihood.
I propose the most relevant foundation model benchmark is "ability to implement novel ML training algorithms from scratch in one shot." For reference, I am getting iffy twenty-five-shot performance right now from Claude 3.5.
Part 3 - Highlights from the Llama 3.1 paper
21. Long context was too challenging to get human annotations on, so they used an earlier version of Llama to generate QA pairs on shorter chunks, as well as summaries of chunks. Then they generated summaries of summaries and used them to
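The bootstrapping trick in point 21 can be sketched as a recursive reduction (my illustration; the chunk size, fan-in, and the `summarize` callable are placeholders for the actual Llama calls):

```python
def hierarchical_summary(text: str, summarize, chunk_chars=1000, fan_in=4):
    """Summarize short chunks, then summarize groups of summaries, repeating
    until one summary covers context far longer than a single pass could."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    layer = [summarize(c) for c in chunks]          # first-level summaries
    while len(layer) > 1:                           # summaries of summaries
        layer = [summarize(" ".join(layer[i:i + fan_in]))
                 for i in range(0, len(layer), fan_in)]
    return layer[0]
```

QA pairs generated against each level of the hierarchy then serve as synthetic long-context training data without any human annotation.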
@Scott_Wiener
Based on your bill, it seems you want to entrench closed-source incumbents and the existing tech oligarchy. Shame on you for your brazen attempt to pass this off as some sort of virtue or partisan issue.
This is a fascinating and frustrating paper that re-imagines agents as neurons in a network and then optimizes the agents 'symbolically' using an abstraction of backpropagation on their prompts and prompt templates: It's fascinating because if the
Part 2 - Highlights from the Llama 3.1 Paper
11. They only added long-context materials near the end of pre-training (to support the 128K context), as they couldn't afford to run them earlier due to the quadratic self-attention costs
12. They scaled up context gradually,
In this interesting paper: Google researchers find that inference-time search methodologies (like tree search) can dramatically improve the performance of a small model up to a certain problem difficulty, at which point they become completely ineffective.
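For contrast with tree search, the simplest inference-time search is best-of-n sampling with a verifier (a generic sketch of the baseline idea, not the paper's method; `sample` and `verify` are placeholder callables):

```python
def best_of_n(prompt, sample, verify, n=8):
    """Draw n candidate answers; return the one the verifier scores highest."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verify(prompt, c))
```

Tree search spends the same extra compute more surgically, branching at intermediate steps rather than only at the final answer, which is presumably why it helps more, right up until the difficulty threshold the researchers identify.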
Synthetic data and privacy-preserving solutions to enhance LLM security.
The paper “Preserving Privacy in Large Language Models: A Survey on Current Threats and Solutions” explores privacy issues related to LLMs, particularly when fine-tuned on private data. It identifies
@esoteric_cap
It's a little obscure but also highly field relevant. I hit you with a DM. There are no bad questions. General explanation to follow in due course.
Recently, I covered a paper on training (fine-tuning) open weight LLMs to have better inductive reasoning. This older paper () covers an approach for generating synthetic deductive reasoning reliably. Synthetic data like this can be used to in a virtuous
@Old_Samster
@goodalexander
A key difference is that proofs of spacetime can be checked by any participant on the Filecoin network because the validation parameters are core to the network design. A lot can be overcome if key functionality works as stated. Is that incorrect?
@Old_Samster
@goodalexander
If the PoSt is built into the validation stack, then it is not on-chain. This means the history can be lost or tampered with by fungible validators. An ahistorical PoSt is not useful for any enterprise use cases.
This study () attempts to address several key limitations of large language models (LLMs) in rule learning tasks using a framework that structures abductive, deductive, and inductive reasoning together. While the researchers have some success, it is still