Ashutosh Mehra

@ashutoshmehra

2,349
Followers
7,501
Following
82
Media
609
Statuses

Senior Principal Scientist at Adobe. Working on Acrobat AI Assistant, LLMs, and document ML.

Joined March 2009
@ashutoshmehra
Ashutosh Mehra
4 months
Axolotl's "train_on_inputs: false" makes it really simple to ensure that input tokens are ignored during loss computation. This was discussed in the LLM fine-tuning course by @dan_s_becker and @HamelHusain .
Tweet media one
@yonigottesman
Yoni Gottesman
4 months
I think most finetuning scripts are sub-optimal in that they do not mask the user parts of the conversation. This includes popular training scripts:
* hf alignment-handbook
* mlabonne llm-course
* hf autotrain
I did some experiments and wrote about it here
5
3
34
3
14
107
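The effect of that flag can be sketched in a few lines. This is a hypothetical helper, assuming the usual HF/PyTorch convention where label -100 is ignored by the loss; it is not Axolotl's actual implementation.

```python
# Hypothetical sketch of what `train_on_inputs: false` effectively does:
# prompt/input tokens get the label -100 so that (with PyTorch
# CrossEntropyLoss's default ignore_index) loss is computed only on the
# response tokens.
IGNORE_INDEX = -100

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, masking the first prompt_len tokens."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Toy example: 4 prompt tokens followed by 3 response tokens.
print(mask_prompt_labels([101, 2023, 2003, 1037, 7099, 3430, 102], 4))
# [-100, -100, -100, -100, 7099, 3430, 102]
```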
@ashutoshmehra
Ashutosh Mehra
12 years
A nice detailed account of the Intel SYSRET privilege escalation vulnerability. http://t.co/5nZAirBj
0
60
43
@ashutoshmehra
Ashutosh Mehra
6 months
@gunsnrosesgirl3 Cars. Physical controls, that could easily be operated by muscle memory, have been replaced by hard-to-use-without-looking touchscreens.
1
1
40
@ashutoshmehra
Ashutosh Mehra
6 months
@NickADobos The high acceptance rate also shows the importance of selecting a good problem to apply LLMs to. Automated tests:
1. have a clear outcome that can be measured perfectly via automation: increased code coverage
2. have almost no side-effects for “bad” tests other than execution time
1
0
32
@ashutoshmehra
Ashutosh Mehra
10 years
Internals of the Windows 10 Control Flow Guard by @mj0011sec http://t.co/rTfyjI9RwN
0
30
31
@ashutoshmehra
Ashutosh Mehra
6 months
@simonw At those lengths, wouldn’t hallucinations and coherence become a big issue? Since there’s no “backtracking”, one wrongly selected token, and the cascading effect could be quite pronounced.
1
1
26
@ashutoshmehra
Ashutosh Mehra
6 months
@miniapeur Proofs of existence without demonstrating an algorithm to construct/find the said object.
1
0
23
@ashutoshmehra
Ashutosh Mehra
6 months
@joshwhiton Claude suggests “Transcendence paradox”: Capturing the paradoxical nature of AI exceeding human capabilities in a domain, yet having that transcendence be viewed as paradoxically artificial or inauthentic.
1
0
23
@ashutoshmehra
Ashutosh Mehra
10 years
Knuth’s TAOCP volumes are now available as high-quality PDFs, complete with links for exercises & equations. http://t.co/bdrVHv6lVK
1
20
22
@ashutoshmehra
Ashutosh Mehra
4 years
It was incredible working on the tech powering Liquid Mode. Thrilled to see it released.
@AdobeDocCloud
Adobe Document Cloud
4 years
Now introducing Liquid Mode—the first of many steps to change the way you consume digital documents on your mobile device. Tailor, reformat, and read your PDFs with the click of a button. Read more:
2
19
89
1
1
21
@ashutoshmehra
Ashutosh Mehra
6 months
@EsotericCofe Don Knuth’s emacs buffer for TeX isn’t syntax colored either:
Tweet media one
0
0
19
@ashutoshmehra
Ashutosh Mehra
8 years
How the Circle Line rogue train was caught with data. Fascinating investigative work.
1
15
17
@ashutoshmehra
Ashutosh Mehra
6 months
@rasbt And then there's this ORPO that combines SFT with preference alignment:
Tweet media one
1
1
16
@ashutoshmehra
Ashutosh Mehra
7 months
@erenbali You got to cache that response to avoid being wasteful 😉
2
0
15
@ashutoshmehra
Ashutosh Mehra
6 months
@Scobleizer @elonmusk We joke in India that the real test for Tesla’s FSD AI will be to navigate Indian roads and traffic 😂
3
1
13
@ashutoshmehra
Ashutosh Mehra
6 months
The authors searched through Reddit, Quora, etc and came up with these 100 business and personal use-cases where people are getting value out of using LLMs.
Tweet media one
Tweet media two
0
1
14
@ashutoshmehra
Ashutosh Mehra
4 months
This write-up by @aapo_tanskanen is a treasure trove for RAG practitioners. Covering a wide range of topics related to information retrieval -- embeddings, reranking, evaluation, vector DBs, latency optimizations, ...
1
4
13
@ashutoshmehra
Ashutosh Mehra
5 months
@HamelHusain Claude recommends using XML tags as they match its training data. Although I wonder what the rationale was? Perhaps that the closing tag </foo> makes it clearer where the start of that context was, or something like that?
@zswitten
Zack Witten
1 year
@BroScienceCode @AnthropicAI XML is recommended as it matches the training data. I recommend the prompting docs on Anthropic's website for more tips.
0
0
16
3
2
14
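The convention itself is trivial to apply. A tiny sketch (the helper name and example strings are mine, not Anthropic's API; see their prompting docs for the recommended tags):

```python
def wrap_context(tag, text):
    """Wrap a chunk of prompt context in XML-style tags.

    The explicit closing </tag> marks exactly where the chunk ends,
    which may be part of why the format is easy for the model to track.
    """
    return f"<{tag}>\n{text}\n</{tag}>"

prompt = "\n\n".join([
    wrap_context("document", "Q3 revenue grew 12% year over year."),
    wrap_context("question", "What was the revenue growth?"),
])
print(prompt)
```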
@ashutoshmehra
Ashutosh Mehra
7 months
Great advice from Dennett. Writing and sharing the pitfalls and dead-ends in a field -- why some well-thought idea is ultimately flawed -- is good contribution!
Tweet media one
0
3
13
@ashutoshmehra
Ashutosh Mehra
4 months
@adad8m Reflect B to B’ (by extending the radius through B an equal distance outside), connect A to B’, intersecting the circle at C, and then reflect the part CB’ back to CB to get ACB. There’s a second path by reflecting A, giving ADB.
Tweet media one
1
0
11
@ashutoshmehra
Ashutosh Mehra
6 months
@itsSandraKublik Cohere’s releases have been so impressive — not just the volume and capabilities, but also the quality of these release notes with proper evaluation results!
2
0
11
@ashutoshmehra
Ashutosh Mehra
4 months
@rohanpaul_ai This looks related to the observation in the GPT-4 paper that pre-trained models are highly calibrated but post-training/PPO reduces that calibration.
Tweet media one
1
0
11
@ashutoshmehra
Ashutosh Mehra
3 months
@helloiamleonie This is a great list! I found these to be also quite insightful: by @emollick by @jeremyphoward et al by @corbtt et al by @swyx et al
1
0
13
@ashutoshmehra
Ashutosh Mehra
5 months
A comprehensive list of estimated tokens in various "datasets" — by @mark_cummins
Tweet media one
0
0
11
@ashutoshmehra
Ashutosh Mehra
6 months
@simonw And there's a different extreme to your question -- To what extent does pre-trained knowledge actually interfere with the DocQA/Summarization tasks. There are a couple recent papers on this:
@yoavgo
(((ل()(ل() 'yoav))))👾
6 months
We've been looking quite a bit lately at measuring the language-understanding ability of LLMs. This is an arxiv preprint on this line of work (w/ @rtsarfaty @VikaBsmv ).
Tweet media one
6
20
145
0
1
10
@ashutoshmehra
Ashutosh Mehra
8 years
A very nice talk by @hochidsp on Edge browser security.
0
3
10
@ashutoshmehra
Ashutosh Mehra
6 months
@miniapeur Cauchy's integral formula defining the value of a function everywhere given just its value on a boundary.
0
0
10
@ashutoshmehra
Ashutosh Mehra
8 years
"Shoveling forward" looks like a far better analogy than "technical debt".
1
11
9
@ashutoshmehra
Ashutosh Mehra
5 months
Hallucinations in LLMs remind me of the Emanuel Lasker quote: "Without error there can be no brilliancy" Not to say that LLMs shouldn't be much better -- they obviously should. But entirely eliminating the possibility of error may also eliminate that "spark". Best would be a
@GergelyOrosz
Gergely Orosz
5 months
If you don’t understand how hallucinations are a core feature of LLMs, you don’t understand how they work and their architecture. If you understand their architecture and how they work you know why hallucinations are engrained in them - until and unless the architecture changes
27
56
534
3
0
9
@ashutoshmehra
Ashutosh Mehra
6 months
@abacaj Very true. I'll add that the time spent fine-tuning may be better spent trying to optimize the workflow, UX, onboarding, and the pre-/post-processing parts of the pipeline.
0
0
8
@ashutoshmehra
Ashutosh Mehra
10 years
. @roppert @agelastic Use the discount code IFBD45 to get 45% off — all three TAOCP PDFs for under $100 :-)
0
4
8
@ashutoshmehra
Ashutosh Mehra
6 months
The journey of designing Acrobat's AI Assistant and Generative Summary using a quality framework that asks: Is it useful? Is it usable? Is it responsible?
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
1
8
@ashutoshmehra
Ashutosh Mehra
5 months
X should allow unrolling such mega-discussions into a Reddit-like single-page view with posts from everyone expanded out recursively. Tapping back and forth the sub-sub-threads is so tedious.
@ESYudkowsky
Eliezer Yudkowsky ⏹️
5 months
Anyone who tells you that the mathematics of cognition has been invalidated by experiments with gradient descent, does not understand math or cognition. As near as I can tell, the detractors simply lack the ability to deal with these concepts on the level of abstraction that
23
11
168
0
0
8
@ashutoshmehra
Ashutosh Mehra
6 months
Pretty weird that Claude's tokenizer isn't documented and people have to resort to such hacks by @javirandor just to count tokens.
1
1
8
@ashutoshmehra
Ashutosh Mehra
5 months
Using simple puzzles to judge LLMs is like the "reverse a linked list in C" interview Q. LLM would almost never have to do that in production. So it's a bit silly. However, if an LLM could *not* do something this simple and well-known, there's some signal there, no?
@gblazex
Blaze (Balázs Galambosi)
5 months
WRONG. Can we *stop* using single instances of a question as litmus test for how good a model is? Especially since it appears everywhere on the internet. At least try changing the numbers, before claiming the model understands the theory behind Most models fail if probed further
Tweet media one
4
1
5
1
0
7
@ashutoshmehra
Ashutosh Mehra
4 months
Loved the insightful talks on LLM serving performance tuning by @JoeEHoover , @TravisAddair , and @charles_irl . Covering continuous batching, LoRAX, and lots of other cool stuff. Part of the LLM fine-tuning course by @HamelHusain and @dan_s_becker .
2
1
7
@ashutoshmehra
Ashutosh Mehra
6 months
So instructions like "You can do it" or "I'll tip you $20 for a better solution" apparently make LLMs less lazy and perform better. Do multimodal equivalents work too? Like adding pictures of candies, trophies, faces of happy customers for “motivation”? Any research on that?
1
0
7
@ashutoshmehra
Ashutosh Mehra
6 months
@burny_tech Why might that be?
1. Larger org size causing slowdowns due to coordination/decision-making overheads
2. Releases now need a higher safety bar, esp. voice-deepfake risks in an election cycle
3. Attention spread across too many projects
4. Attention solely on releasing GPT-4.5/5
0
0
6
@ashutoshmehra
Ashutosh Mehra
12 years
Details of type-confusion and memory-disclosure techniques used in a Java CVE-2013-1493 exploit.
0
0
6
@ashutoshmehra
Ashutosh Mehra
7 months
Some results showing the importance of formatting LLM inputs to match their individual chat formats.
Tweet media one
0
1
5
@ashutoshmehra
Ashutosh Mehra
6 months
@burny_tech QED = Mic drop
0
0
6
@ashutoshmehra
Ashutosh Mehra
4 months
Refusal rates on "seemingly toxic" and toxic prompts for different LLMs. From: Cui et al., “OR-Bench: An Over-Refusal Benchmark for Large Language Models.”.
Tweet media one
Tweet media two
1
1
6
@ashutoshmehra
Ashutosh Mehra
4 months
Apple has made several improvements for on-device ML in iOS 18.
- model compression improvements in palettization, quantization, pruning, and post-training compression
- transformers support improvements via in-place state tensors for KV Cache, SDPA ops, multiple adapters 🧵
1
1
6
@ashutoshmehra
Ashutosh Mehra
4 months
Evolving LLM evaluation landscape by @seb_ruder
Benchmarks are:
- becoming less dependable
- saturating with rapid LLM progress
- suffering memorization by leaking into training sets
- suffering overfitting due to incentives, GPT judges, and synthetic data 🧵
1
0
6
@ashutoshmehra
Ashutosh Mehra
7 months
What I haven’t seen much progress on:
- Calibrated confidence coming out of LLMs and what should that even mean?
- A proper separation of instruction vs data in the input to fundamentally address prompt injections
- Built-in citations from LLMs
@cloneofsimo
Simo Ryu
7 months
We need updated version of limitation of current LLM systems. Most of the initial limitations are, imo, (with high probability) sufficiently resolved or is being very closed to being solved. Any pointers would be helpful!
1
0
7
2
0
5
@ashutoshmehra
Ashutosh Mehra
6 months
@OpenAI "GPT-4 custom model specifically optimized for the Japanese … improved performance in translating and summarizing Japanese text, is cost effective, and operates up to 3x faster than its predecessor." Is it a smaller GPT-4 variant? An optimized tokenizer for Japanese?
0
0
5
@ashutoshmehra
Ashutosh Mehra
3 months
@ericjmichaud_ Cool idea! Using LLM's hidden state gives good results on finding attributions for generated answers. See cc @phukanaround @koustavagoswami @apoorv_umang @balajivasan That can be another way (without SAE) to get fine-grained error-detection for code.
0
1
6
@ashutoshmehra
Ashutosh Mehra
8 years
With Reader DC's January update, Protected Mode can now run in an AppContainer! For now, it's in beta and off by default.
Tweet media one
Tweet media two
0
5
5
@ashutoshmehra
Ashutosh Mehra
4 months
@adad8m This is trickier than I first thought. The point C on the circle for ACB to be smallest would be such that the angles the tangent at C makes with AC and BC are the same. Or, equivalently, angles ACO and BCO are equal. This can be visualized via a sequence of expanding ellipses
Tweet media one
1
0
6
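The tangency condition can be written out compactly (my own rendering of the argument in the tweet, not from the original thread):

```latex
% C minimizes |AC| + |CB| over the circle centered at O exactly when the
% ellipse with foci A, B passing through C is tangent to the circle at C.
% The ellipse normal at C bisects \angle ACB, and the circle normal at C
% is the radius CO, so tangency forces the two normals to coincide:
\[
\angle ACO = \angle BCO ,
\]
% i.e. the tangent at C makes equal angles with CA and CB, exactly the
% reflection law.
```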
@ashutoshmehra
Ashutosh Mehra
4 months
When using the MTEB leaderboard to select embedding models for retrieval, watch out for:
- document lengths in the eval dataset being very different than in the target use-case
- domain of the eval data being very different
Tweet media one
0
0
5
@ashutoshmehra
Ashutosh Mehra
6 months
@miniapeur Using the probabilistic method to prove some combinatorics or graph properties… that was quite a surprising use of probability.
0
0
6
@ashutoshmehra
Ashutosh Mehra
4 months
@simonw I was just reading this where they “estimate the stock of human-generated public text at around 300 trillion tokens. If trends continue, language models will fully utilize this stock between 2026 and 2032, or even earlier if intensely overtrained”
@EpochAIResearch
Epoch AI
4 months
Are we running out of data to train language models? State-of-the-art LLMs use datasets with tens of trillions of words, and use 2-3x more per year. Our new ICML paper estimates when we might exhaust all text data on the internet. 1/12
Tweet media one
24
128
632
1
0
6
@ashutoshmehra
Ashutosh Mehra
6 months
Has there been any success on finding model merges using reinforcement learning? Reading by @maximelabonne it feels like this problem (esp finding good frankenmerges) would be interesting to attack with RL.
1
0
6
@ashutoshmehra
Ashutosh Mehra
6 months
Just found out you can make an OpenAI privacy request for "Do not train on my content"... while still keeping chat histories in ChatGPT.
Tweet media one
1
0
6
@ashutoshmehra
Ashutosh Mehra
6 months
Quality assurance strategies for data annotation work via crowdsourcing. From:
Tweet media one
0
0
6
@ashutoshmehra
Ashutosh Mehra
4 months
In WWDC State of the Union, Apple goes into the optimizations they used for on-device LLMs:
- Adapters (LoRA?) fine-tuned to specific tasks like summarization
- Quantization
- Context Pruning
- Speculative Decoding
- Grouped Query Attention
Tweet media one
4
1
5
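One of the listed techniques, quantization, is easy to sketch. A minimal symmetric int8 scheme (illustrative only; Apple's actual scheme is not public at this level of detail):

```python
# Minimal sketch of symmetric 8-bit weight quantization: all values
# share one scale and are rounded into the int8 range [-127, 127].
def quantize_int8(weights):
    """Map floats to int8-range integers plus a shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard all-zero
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from the quantized values."""
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 0.0]
q, s = quantize_int8(w)
print(q)                 # small integers in [-127, 127]
print(dequantize(q, s))  # approximately the original weights
```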
@ashutoshmehra
Ashutosh Mehra
6 months
Any evaluation datasets to assess LLM's comprehension of long and _nested_ contexts such as in "story within a story"? To determine how well LLMs can keep track of the multiple realities of the outer/"frame" story vs. the inner plots, without getting confused.
2
0
5
@ashutoshmehra
Ashutosh Mehra
11 years
Given the DNS spoofing in Turkey, the fraudulent TURKTRUST cert issue c. 2012 does not look like an innocent mistake. http://t.co/yI0PjYjLVV
0
8
4
@ashutoshmehra
Ashutosh Mehra
9 years
@HaifeiLi @fdfalcon Yep, thumbnail generation. In 10.x & 11.x, should use the Reader Protected Mode sandbox for rendering thumbnails.
2
3
5
@ashutoshmehra
Ashutosh Mehra
11 years
This http://t.co/Rm6PDG7y1K Windbg ext. is awesome: Highlights all present & future occurrences of a value I double-click.
0
2
5
@ashutoshmehra
Ashutosh Mehra
6 months
"Reasoning or Reciting" shows a drop in LLM performance on counterfactual variants of standard tasks. "While current LMs may possess abstract task-solving skills to a degree, they often also rely on narrow, non-transferable procedures for task-solving."
Tweet media one
1
0
4
@ashutoshmehra
Ashutosh Mehra
9 years
Chromium Internals - Documents, Windows, Browsing Contexts By @nasko
0
5
5
@ashutoshmehra
Ashutosh Mehra
5 months
@MatthewBerman @anitakirkovska @GroqInc @vellum_ai For these speeds, Vellum calls Groq LLM inference endpoints. These are the same speeds that Groq provides:
Tweet media one
1
0
4
@ashutoshmehra
Ashutosh Mehra
10 years
Windows EoP in ahcache.sys/NtApphelpCacheControl by @tiraniddo Deadline Exceeded.
0
4
4
@ashutoshmehra
Ashutosh Mehra
6 months
Today was a day full of reality checks. This is the third result I'm seeing today describing serious limitations of LLMs working alone. Shows the need for a lot of "support structure" around LLMs -- pre-processing, tool-use, verification, etc.
@rao2z
Subbarao Kambhampati (కంభంపాటి సుబ్బారావు)
6 months
In case you are wondering, @karthikv792 found that LLM's still can't plan even after the advent of Claude 3 Opus and Gemini Pro..😬 tldr; no AGI doom by all fools day! You can still tame the lot by throwing assorted blocks at 'em..😋 The much ballyhooed Claude 3 Opus does no
Tweet media one
6
24
166
0
0
4
@ashutoshmehra
Ashutosh Mehra
4 months
Tips for using MTEB effectively to select embedding models for retrieval:
- narrow down to retrieval-relevant tasks, e.g. ignore classification
- sort via 'mean task rank' rather than 'average score across tasks'
Tweet media one
Tweet media two
0
0
4
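The difference between the two sort orders can be shown with toy numbers (hypothetical scores, not real MTEB results): a single outlier task can dominate the average score, but it only counts as one rank.

```python
# Hypothetical per-task retrieval scores for two models.
scores = {
    "model_a": {"task1": 0.99, "task2": 0.20, "task3": 0.20},
    "model_b": {"task1": 0.50, "task2": 0.30, "task3": 0.30},
}

def mean_score(model):
    """Plain average score across tasks."""
    vals = scores[model].values()
    return sum(vals) / len(vals)

def mean_rank(model):
    """Average of the model's per-task rank (1 = best on that task)."""
    ranks = []
    for t in scores[model]:
        ordered = sorted(scores, key=lambda m: scores[m][t], reverse=True)
        ranks.append(ordered.index(model) + 1)
    return sum(ranks) / len(ranks)

# model_a wins on average score (thanks to task1 alone), but model_b is
# better on 2 of 3 tasks and so has the better (lower) mean task rank.
print(mean_score("model_a"), mean_score("model_b"))
print(mean_rank("model_a"), mean_rank("model_b"))
```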
@ashutoshmehra
Ashutosh Mehra
11 years
Solving the Mystery of the Office Zero-Day Exploit and DEP | McAfee: http://t.co/fndpxNudKJ
0
4
4
@ashutoshmehra
Ashutosh Mehra
7 months
@deliprao Azure OpenAI Provisioned Throughput Units (PTUs) is one solution: Another solution is to schedule these at US night time or use Azure OpenAI deployments in non-US regions during US daytime.
0
0
4
@ashutoshmehra
Ashutosh Mehra
10 years
@HaifeiLi So CVE-2015-0016 seems to be the third publicly recorded IE PM sandbox escape after CVE-2014-2817 & CVE-2014-4123.
0
1
4
@ashutoshmehra
Ashutosh Mehra
5 months
@kindgracekind Before you try to center the div, ask yourself: “Who am I that is trying to center the div?” When you find the answer, you will no longer need to center the div.
0
0
4
@ashutoshmehra
Ashutosh Mehra
4 months
@jeremyphoward The FlipFlop effect: “study of 10 LLMs on 7 classification tasks reveals that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17%.”
0
0
4
@ashutoshmehra
Ashutosh Mehra
4 months
Windows Copilot Library provides 4 AI-backed APIs using local models:
1. Phi Silica LLM
2. OCR
3. Studio Effects
4. Recall
Apps can use these APIs. Apps can also add contextual info to the underlying vector DB. But apps cannot query Recall.
2
0
4
@ashutoshmehra
Ashutosh Mehra
6 months
@simonw Here’s a list of 100 use-cases where people have reported finding LLMs to be useful.
@ashutoshmehra
Ashutosh Mehra
6 months
The authors searched through Reddit, Quora, etc and came up with these 100 business and personal use-cases where people are getting value out of using LLMs.
Tweet media one
Tweet media two
0
1
14
2
0
4
@ashutoshmehra
Ashutosh Mehra
3 months
@sh_reya Very relatable! Another huge challenge: In many cases, user input being totally open-ended means that the domain is pretty much “everything”. E.g. for the Acrobat AI Assistant, this can be any PDF or question related to it! A robust taxonomy goes a long way in structuring evals.
1
0
4
@ashutoshmehra
Ashutosh Mehra
6 months
@lawhsw
- someone who doesn’t need to censor their true opinions to me
- someone who actually *has* an opinion
0
0
4
@ashutoshmehra
Ashutosh Mehra
6 months
A short talk by Andrew Ng on agentic reasoning patterns. Quite admire his groundedness.
Tweet media one
0
0
3
@ashutoshmehra
Ashutosh Mehra
10 years
@HaifeiLi http://t.co/4P2mrvq2Km says “Exploitation of CVE-2014-4123 detected in the wild. Used as a sandbox escape.”
1
1
4
@ashutoshmehra
Ashutosh Mehra
6 months
@vboykis Some text-to-image uses: Edits in Adobe Photoshop like Generative Expand or Generative Fill guided by a text prompt. Creating on-brand assets for marketing.
0
0
2
@ashutoshmehra
Ashutosh Mehra
13 years
@dakami Yep! Confirmed that if the leaf cert uses MD2 hash, Safari fails to verify the cert: http://t.co/EdQEeg6r
1
3
4
@ashutoshmehra
Ashutosh Mehra
7 months
░K░N░U░T░H░C░H░E░C░K░S░I░N░B░I░O░
0
0
4
@ashutoshmehra
Ashutosh Mehra
6 months
Tweet media one
0
0
4
@ashutoshmehra
Ashutosh Mehra
7 months
So much nuance to getting evals to work consistently! Even with simple A--D choices, different ways of getting the answer (max-prob token among only the valid choices, full answer generation, etc.) lead to widely different scores and rankings. Fascinating writeup:
0
0
2
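The two extraction strategies can be contrasted with toy logits (hypothetical values; real harnesses compare token log-probs, but the failure mode is the same):

```python
# Toy next-token logits for an MCQ prompt with choices A--D.
logits = {"A": 1.0, "B": 2.0, "C": 0.5, "D": 0.2, "The": 2.5}

def pick_restricted(logits, choices=("A", "B", "C", "D")):
    """Argmax over the valid choice tokens only."""
    return max(choices, key=lambda t: logits[t])

def pick_unrestricted(logits):
    """Argmax over everything: the model may prefer to start a free-form
    answer ("The answer is B...") instead of emitting a bare letter."""
    return max(logits, key=lambda t: logits[t])

print(pick_restricted(logits))    # B
print(pick_unrestricted(logits))  # The -> scored as wrong/unparseable
```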
@ashutoshmehra
Ashutosh Mehra
6 months
@amydeng_ Sounds about right: starting with a research/standard model, do whatever it takes to build the core ML capability all the way to a shippable state. But in a team, different MLEs naturally develop expertise in select areas and so individuals get specialized (even if not the role).
1
0
3
@ashutoshmehra
Ashutosh Mehra
6 months
CTF but for trojan’ed LLMs!
@javirandor
Javier Rando
7 months
We are announcing the winners of our Trojan Detection Competition on Aligned LLMs!! 🥇 @tml_lab ( @fra__31 , @maksym_andr and Nicolas Flammarion) 🥈 @krystof_mitka 🥉 @apeoffire 🧵 With some of the main findings!
1
9
53
0
0
3
@ashutoshmehra
Ashutosh Mehra
6 months
@miniapeur Loss functions and evaluation metrics in general?
0
0
3
@ashutoshmehra
Ashutosh Mehra
7 months
@sulka @erenbali Life’s so hard. Will also need to A/B test results at different temperatures.
0
0
2
@ashutoshmehra
Ashutosh Mehra
9 years
NSA guy’s talk is worth watching precisely cos it teaches what we already know is broken.
0
0
3
@ashutoshmehra
Ashutosh Mehra
3 months
@omerNLP @alon_jacovi @lovodkin93 @rtsarfaty This is a very relevant nuance that's missing in a lot of needle-in-a-haystack type evals, with LLMs showing super-high recalls, which isn't the case in practice. Thanks for this work!
0
2
3
@ashutoshmehra
Ashutosh Mehra
4 months
@asbruckman In the Acrobat AI Assistant, we show citations for different parts of the generated answer. They highlight sentences in the PDF most relevant to that part of the answer.
Tweet media one
0
0
3