Axolotl's "train_on_inputs: false" makes it really simple to ensure that input tokens are ignored during loss computation.
This was discussed in the LLM fine-tuning course by
@dan_s_becker
and
@HamelHusain
.
I think most finetuning scripts are sub-optimal in that they do not mask user parts of the conversation.
This includes popular training scripts:
* hf alignment-handbook
* mlabonne llm-course
* hf autotrain
I did some experiments and wrote about it here
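To make the masking concrete, here's a minimal sketch (illustrative, not Axolotl's actual implementation) of what "train_on_inputs: false" amounts to: prompt/user tokens get label -100, which PyTorch's cross-entropy loss ignores by default, so only the assistant's response contributes to the loss.

```python
# Minimal sketch of input masking (illustrative, not Axolotl's code).
# Tokens from the user/prompt side get label -100, which
# torch.nn.functional.cross_entropy ignores by default.
import torch

def build_labels(prompt_ids: list[int], response_ids: list[int]):
    """Concatenate prompt and response; only response tokens contribute to the loss."""
    input_ids = torch.tensor(prompt_ids + response_ids)
    labels = torch.tensor([-100] * len(prompt_ids) + response_ids)
    return input_ids, labels

# Example: the loss is computed only over the assistant's reply.
input_ids, labels = build_labels(prompt_ids=[1, 42, 17], response_ids=[99, 7, 2])
print(labels)  # tensor([-100, -100, -100,   99,    7,    2])
```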
@gunsnrosesgirl3
Cars.
Physical controls, that could easily be operated by muscle memory, have been replaced by hard-to-use-without-looking touchscreens.
@NickADobos
The high acceptance rate also shows the importance of selecting a good problem to apply LLMs to.
Automated tests:
1. have a clear outcome that can be measured perfectly via automation: increased code coverage
2. almost no side effects from “bad” tests other than execution time
@adad8m
Searching a bit, this is indeed a deep problem: it's said to be unsolvable via compass and ruler.
This is a neat little writeup on "Circular Billiard" with several references:
@simonw
At those lengths, wouldn’t hallucinations and coherence become a big issue? Since there’s no “backtracking”, one wrongly selected token, and the cascading effect could be quite pronounced.
@joshwhiton
Claude suggests “Transcendence paradox”: Capturing the paradoxical nature of AI exceeding human capabilities in a domain, yet having that transcendence be viewed as paradoxically artificial or inauthentic.
Now introducing Liquid Mode—the first of many steps to change the way you consume digital documents on your mobile device. Tailor, reformat, and read your PDFs with the click of a button. Read more:
The authors searched through Reddit, Quora, etc and came up with these 100 business and personal use-cases where people are getting value out of using LLMs.
This write-up by
@aapo_tanskanen
is a treasure trove for RAG practitioners.
Covering a wide range of topics related to information retrieval -- embeddings, reranking, evaluation, vector DBs, latency optimizations, ...
@HamelHusain
Claude recommends using XML tags, as they match its training data.
Although I wonder what their rationale was? Perhaps that the closing tag </foo> makes it clearer where the start of that context was, or something like that?
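A minimal sketch of the pattern, with illustrative tag names (nothing here is Anthropic's required format):

```python
# Sketch of wrapping retrieved context in XML-style tags before prompting.
# The tag names (<document>, <question>) are just illustrative choices.
def build_prompt(document: str, question: str) -> str:
    return (
        "<document>\n"
        f"{document}\n"
        "</document>\n\n"
        f"<question>{question}</question>\n"
        "Answer using only the document above."
    )

print(build_prompt("Q1 revenue grew 12% year over year.", "How fast did revenue grow?"))
```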
@iliasmiraoui
That's the FlipFlop effect documented in this paper.
It shows that models flip their answers 46% of the time on average when asked "Are you sure?"
Great advice from Dennett.
Writing and sharing the pitfalls and dead-ends in a field -- why some well-thought idea is ultimately flawed -- is a good contribution!
@GregKamradt
@RLanceMartin
That’s recency bias.
“the training answers that are closer to the end of the prompt are more likely to be repeated by the model” (in the context of their experiment).
@adad8m
Reflect B to B’ (by extending the radius through B an equal distance outside the circle), connect A to B’, intersecting the circle at C, and then reflect the segment CB’ back to CB to get the path ACB. There’s a second path by reflecting A instead, giving ADB.
@itsSandraKublik
Cohere’s releases have been so impressive — not just the volume and capabilities, but also the quality of these release notes with proper evaluation results!
@rohanpaul_ai
This looks related to the observation in the GPT-4 paper that pre-trained models are highly calibrated but post-training/PPO reduces that calibration.
@simonw
And there's a different extreme to your question -- to what extent does pre-trained knowledge actually interfere with DocQA/summarization tasks? There are a couple of recent papers on this:
We've been looking quite a bit lately at measuring the language-understanding ability of LLMs. This is an arxiv preprint on this line of work (w/
@rtsarfaty
@VikaBsmv
).
Hallucinations in LLM remind me of the Emanuel Lasker quote:
"Without error there can be no brilliancy"
Not to say that LLMs shouldn't be much better -- they obviously should. But entirely eliminating possibilities of error may also eliminate that "spark".
Best would be a
If you don’t understand how hallucinations are a core feature of LLMs, you don’t understand how they work and their architecture.
If you understand their architecture and how they work, you know why hallucinations are ingrained in them - until and unless the architecture changes.
@abacaj
Very true. I'll add that the time spent fine-tuning may be better spent trying to optimize the workflow, UX, onboarding, and the pre-/post-processing parts of the pipeline.
The journey of designing Acrobat's AI Assistant and Generative Summary using a quality framework that asks:
Is it useful? Is it usable? Is it responsible?
X should allow unrolling such mega-discussions into a Reddit-like single-page view with posts from everyone expanded out recursively. Tapping back and forth the sub-sub-threads is so tedious.
Anyone who tells you that the mathematics of cognition has been invalidated by experiments with gradient descent, does not understand math or cognition.
As near as I can tell, the detractors simply lack the ability to deal with these concepts on the level of abstraction that
Using simple puzzles to judge LLMs is like the "reverse a linked list in C" interview Q.
LLM would almost never have to do that in production. So it's a bit silly.
However, if an LLM could *not* do something this simple and well-known, there's some signal there, no?
WRONG. Can we *stop* using single instances of a question as a litmus test for how good a model is?
Especially since it appears everywhere on the internet. At least try changing the numbers before claiming the model understands the theory behind it.
Most models fail if probed further
So instructions like "You can do it" or "I'll tip you $20 for a better solution" apparently make LLMs less lazy and perform better.
Do multimodal equivalents work too? Like adding pictures of candies, trophies, faces of a happy customers for “motivation”? Any research on that?
@burny_tech
Why might that be?
1. Larger org size causing slowdowns due to coordination/decision-making overheads
2. Releases now need higher safety bar, esp. voice deepfakes risks in election cycle
3. Attention spread across too many projects
4. Attention solely on releasing GPT-4.5/5
Refusal rates on "seemingly toxic" and toxic prompts for different LLMs.
From: Cui et al., “OR-Bench: An Over-Refusal Benchmark for Large Language Models.”
Apple has made several improvements for on-device ML in iOS 18.
- model compression improvements in palettization, quantization, pruning, and post-training compression
- transformers support improvements via in-place state tensors for KV Cache, SDPA ops, multiple adapters
🧵
Evolving LLM evaluation landscape by
@seb_ruder
Benchmarks
- becoming less dependable
- saturating with rapid LLM progress
- suffering memorization via leakage into training sets
- suffering overfitting due to incentives, GPT judges, and synthetic data
🧵
What I haven’t seen much progress on:
- Calibrated confidence coming out of LLMs and what should that even mean?
- A proper separation of instruction vs data in the input to fundamentally address prompt injections
- Built-in citations from LLMs
We need an updated version of the limitations of current LLM systems. Most of the initial limitations are, imo, (with high probability) sufficiently resolved or very close to being solved. Any pointers would be helpful!
@OpenAI
"GPT-4 custom model specifically optimized for the Japanese … improved performance in translating and summarizing Japanese text, is cost effective, and operates up to 3x faster than its predecessor."
Is it a smaller GPT-4 variant? An optimized tokenizer for Japanese?
@adad8m
This is trickier than I first thought. The point C on the circle that makes the path ACB shortest would be such that the tangent at C makes the same angle with AC and with BC. Or, equivalently, angles ACO and BCO are equal.
This can be visualized via a sequence of expanding ellipses.
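Sketching that argument in my own notation (O is the circle's center, r its radius): the level sets of |CA| + |CB| are ellipses with foci A and B, and at the minimizing point the smallest such ellipse is tangent to the circle, so their normals coincide.

```latex
\[
\min_{\lVert C-O\rVert = r}\; \lVert C-A\rVert + \lVert C-B\rVert
\quad\Longrightarrow\quad
\frac{C-A}{\lVert C-A\rVert} + \frac{C-B}{\lVert C-B\rVert} = \lambda\,(C-O)
\quad\text{(Lagrange condition).}
\]
```

The left-hand side lies along the bisector of angle ACB, so the bisector is along OC, which is exactly the equal-angle condition above.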
When using MTEB leaderboard to select embedding models for retrieval, watch out for:
- document lengths in the eval dataset being very different than in target use-case
- domain of the eval data being very different
@simonw
I was just reading this where they “estimate the stock of human-generated public text at around 300 trillion tokens. If trends continue, language models will fully utilize this stock between 2026 and 2032, or even earlier if intensely overtrained”
Are we running out of data to train language models?
State-of-the-art LLMs use datasets with tens of trillions of words, and use 2-3x more per year. Our new ICML paper estimates when we might exhaust all text data on the internet. 1/12
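A rough back-of-the-envelope version of that projection; the 2024 starting point and the growth factor below are assumptions for illustration, not the paper's model:

```python
# Back-of-the-envelope: if a frontier training run uses ~15T tokens in 2024 (assumed)
# and token usage grows ~2.5x per year (the thread says 2-3x), when does usage pass
# the estimated ~300T-token stock of public human-generated text?
import math

stock = 300e12      # estimated stock of public human text, in tokens
used_2024 = 15e12   # assumed tokens used in 2024
growth = 2.5        # assumed yearly growth factor

years = math.log(stock / used_2024) / math.log(growth)
print(f"Exhausted around {2024 + years:.0f}")  # ~2027 under these assumptions
```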
Has there been any success on finding model merges using reinforcement learning?
Reading by
@maximelabonne
it feels like this problem (esp finding good frankenmerges) would be interesting to attack with RL.
In WWDC State of the Union, Apple goes into the optimizations they used for on-device LLMs:
- Adapters (LoRA?) fine-tuned to specific tasks like summarization
- Quantization
- Context Pruning
- Speculative Decoding
- Grouped Query Attention
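On the grouped-query-attention item, a minimal sketch of what GQA means in tensor terms (illustrative shapes and code, not Apple's implementation):

```python
# Grouped Query Attention sketch: several query heads share one key/value head,
# which shrinks the KV cache. Shapes here are illustrative.
import torch
import torch.nn.functional as F

batch, seq, head_dim = 1, 8, 64
n_q_heads, n_kv_heads = 8, 2                  # 4 query heads share each KV head
group = n_q_heads // n_kv_heads

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # cached K/V are 4x smaller than MHA
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand each KV head across its group of query heads, then run standard attention.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 8, 64])
```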
Any evaluation datasets to assess LLM's comprehension of long and _nested_ contexts such as in "story within a story"?
To determine how well LLMs can keep track of the multiple realities of the outer/"frame" story vs. the inner plots, without getting confused.
"Reasoning or Reciting" shows a drop in LLM performance on counterfactual variants of standard tasks.
"While current LMs may possess abstract task-solving skills to a degree, they often also rely on narrow, non-transferable procedures for task-solving."
Today was a day full of reality checks.
This is the third result I'm seeing today describing serious limitations of LLMs working alone.
Shows the need for a lot of "support structure" around LLMs -- pre-processing, tool-use, verification, etc.
In case you are wondering,
@karthikv792
found that LLMs still can't plan even after the advent of Claude 3 Opus and Gemini Pro..😬
tldr; no AGI doom by all fools day! You can still tame the lot by throwing assorted blocks at 'em..😋
The much ballyhooed Claude 3 Opus does no
Tips for using MTEB effectively to select embedding models for retrieval:
- narrow down to retrieval-relevant tasks, e.g. ignore classification
- sort via 'mean task rank' rather than 'average score across tasks'
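A hedged sketch of the second tip with pandas; the file name and the assumption that rows are models and columns are retrieval-task scores are hypothetical, so adapt to however you export the leaderboard:

```python
# Sketch: rank embedding models by mean task rank over retrieval tasks only.
# "mteb_retrieval.csv" (rows = models, columns = retrieval-task scores) is a
# hypothetical export of the MTEB leaderboard.
import pandas as pd

df = pd.read_csv("mteb_retrieval.csv", index_col="model")

# Rank models per task (1 = best score on that task), then average ranks across tasks.
mean_task_rank = df.rank(ascending=False, axis=0).mean(axis=1)

# A low mean rank is harder to achieve via a few outlier tasks than a high average
# score is, which is why it's the better sort key here.
print(mean_task_rank.sort_values().head(10))
```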
@deliprao
Azure OpenAI Provisioned Throughput Units (PTUs) are one solution:
Another solution is to schedule these at US night time or use Azure OpenAI deployments in non-US regions during US daytime.
@kindgracekind
Before you try to center the div, ask yourself:
“Who am I that is trying to center the div?”
When you find the answer, you will no longer need to center the div.
@jeremyphoward
The FlipFlop effect: “study of 10 LLMs on 7 classification tasks reveals that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17%.”
Windows Copilot Library provides 4 AI-backed APIs using local models:
1. Phi Silica LLM
2. OCR
3. Studio Effects
4. Recall
Apps can use these APIs. Apps can also add contextual info to the underlying vector DB.
But apps cannot query Recall.
@sh_reya
Very relatable! Another huge challenge: In many cases, user input being totally open-ended means that the domain is pretty much “everything”. E.g. for the Acrobat AI Assistant, this can be any PDF or question related to it! A robust taxonomy goes a long way in structuring evals.
@vboykis
Some text-to-image uses:
Edits in Adobe Photoshop like Generative Expand or Generative Fill guided by a text prompt.
Creating on-brand assets for marketing.
So much nuance to getting evals to work consistently! Even with simple A-D choices, different ways of extracting the answer (max-probability token among only the valid choices, full answer generation, etc.) lead to widely different scores and rankings.
Fascinating writeup:
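One hedged illustration of two common extraction styles and why they can disagree; `choice_logprobs` and `generated_text` stand in for whatever an eval harness actually returns:

```python
# Sketch of two ways to extract an A-D answer; they can rank the same model differently.
import re

def answer_from_logprobs(choice_logprobs: dict[str, float]) -> str:
    """Constrained scoring: take the highest-probability token among the valid choices only."""
    return max(choice_logprobs, key=choice_logprobs.get)

def answer_from_generation(generated_text: str) -> str | None:
    """Free generation: let the model write text, then parse out the first A-D letter."""
    match = re.search(r"\b([ABCD])\b", generated_text)
    return match.group(1) if match else None  # unparseable output usually gets scored as wrong

print(answer_from_logprobs({"A": -2.1, "B": -0.3, "C": -4.0, "D": -3.2}))  # B
print(answer_from_generation("The correct answer is (C) because ..."))      # C
```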
@amydeng_
Sounds about right: starting with a research/standard model, do whatever it takes to build the core ML capability all the way to a shippable state.
But in a team, different MLEs naturally develop expertise in select areas and so individuals get specialized (even if not the role).
@omerNLP
@alon_jacovi
@lovodkin93
@rtsarfaty
This is a very relevant nuance that's missing in a lot of needle-in-a-haystack type evals, where LLMs show super high recall, which isn't the case in practice. Thanks for this work!
@asbruckman
In the Acrobat AI Assistant, we show citations for different parts of the generated answer. They highlight sentences in the PDF most relevant to that part of the answer.