Axolotl's "train_on_inputs: false" makes it really simple to ensure that input tokens are ignored during loss computation.
This was discussed in the LLM fine-tuning course by
@dan_s_becker
and
@HamelHusain
.
I think most finetuning scripts are sub-optimal in that they do not mask user parts of the conversation.
This includes popular training scripts:
* hf alignment-handbook
* mlabonne llm-course
* hf autotrain
I did some experiments and wrote about it here
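To make the masking concrete, here's a minimal sketch (illustrative, not Axolotl's actual implementation) of what "train_on_inputs: false" amounts to: prompt/user tokens get label -100, which PyTorch's cross-entropy loss ignores by default, so only the assistant's response contributes to the loss.

```python
# Minimal sketch of input masking (illustrative, not Axolotl's code).
# Tokens from the user/prompt side get label -100, which
# torch.nn.functional.cross_entropy ignores by default.
import torch

def build_labels(prompt_ids: list[int], response_ids: list[int]):
    """Concatenate prompt and response; only response tokens contribute to the loss."""
    input_ids = torch.tensor(prompt_ids + response_ids)
    labels = torch.tensor([-100] * len(prompt_ids) + response_ids)
    return input_ids, labels

# Example: the loss is computed only over the assistant's reply.
input_ids, labels = build_labels(prompt_ids=[1, 42, 17], response_ids=[99, 7, 2])
print(labels)  # tensor([-100, -100, -100,   99,    7,    2])
```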
@gunsnrosesgirl3
Cars.
Physical controls, that could easily be operated by muscle memory, have been replaced by hard-to-use-without-looking touchscreens.
@NickADobos
The high acceptance rate also shows the importance of selecting a good problem to apply LLMs to.
Automated tests:
1. have a clear outcome that can be measured perfectly via automation: increased code coverage
2. almost no side effects from “bad” tests other than execution time
@adad8m
Searching a bit, this is indeed a deep problem: it's said to be unsolvable via compass and ruler.
This is a neat little writeup on "Circular Billiard" with several references:
@simonw
At those lengths, wouldn’t hallucinations and coherence become a big issue? Since there’s no “backtracking”, one wrongly selected token, and the cascading effect could be quite pronounced.
@joshwhiton
Claude suggests “Transcendence paradox”: Capturing the paradoxical nature of AI exceeding human capabilities in a domain, yet having that transcendence be viewed as paradoxically artificial or inauthentic.
Now introducing Liquid Mode—the first of many steps to change the way you consume digital documents on your mobile device. Tailor, reformat, and read your PDFs with the click of a button. Read more:
The authors searched through Reddit, Quora, etc and came up with these 100 business and personal use-cases where people are getting value out of using LLMs.
This write-up by
@aapo_tanskanen
is a treasure trove for RAG practitioners.
Covering a wide range of topics related to information retrieval -- embeddings, reranking, evaluation, vector DBs, latency optimizations, ...
@HamelHusain
Claude recommends using XML tags, as they match its training data.
Although I wonder what their rationale was? Perhaps that the closing tag </foo> makes it clearer where the start of that context was, or something like that?
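A minimal sketch of the pattern, with illustrative tag names (nothing here is Anthropic's required format):

```python
# Sketch of wrapping retrieved context in XML-style tags before prompting.
# The tag names (<document>, <question>) are just illustrative choices.
def build_prompt(document: str, question: str) -> str:
    return (
        "<document>\n"
        f"{document}\n"
        "</document>\n\n"
        f"<question>{question}</question>\n"
        "Answer using only the document above."
    )

print(build_prompt("Q1 revenue grew 12% year over year.", "How fast did revenue grow?"))
```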
@iliasmiraoui
That's the FlipFlop effect documented in this paper.
It shows that models flip their answers 46% of the time on average when asked "Are you sure?"
Great advice from Dennett.
Writing and sharing the pitfalls and dead-ends in a field -- why some well-thought idea is ultimately flawed -- is a good contribution!
@GregKamradt
@RLanceMartin
That’s recency bias.
“the training answers that are closer to the end of the prompt are more likely to be repeated by the model” (in the context of their experiment).
@adad8m
Reflect B to B’ (by extending the radius through B an equal distance outside the circle), connect A to B’, intersecting the circle at C, and then reflect the segment CB’ back to CB to get the path ACB. There’s a second path by reflecting A instead, giving ADB.
@itsSandraKublik
Cohere’s releases have been so impressive — not just the volume and capabilities, but also the quality of these release notes with proper evaluation results!
@rohanpaul_ai
This looks related to the observation in the GPT-4 paper that pre-trained models are highly calibrated but post-training/PPO reduces that calibration.
@simonw
And there's a different extreme to your question -- to what extent does pre-trained knowledge actually interfere with DocQA/summarization tasks? There are a couple of recent papers on this:
We've been looking quite a bit lately at measuring the language-understanding ability of LLMs. This is an arxiv preprint on this line of work (w/
@rtsarfaty
@VikaBsmv
).
Hallucinations in LLM remind me of the Emanuel Lasker quote:
"Without error there can be no brilliancy"
Not to say that LLMs shouldn't be much better -- they obviously should. But entirely eliminating possibilities of error may also eliminate that "spark".
Best would be a
If you don’t understand how hallucinations are a core feature of LLMs, you don’t understand how they work and their architecture.
If you understand their architecture and how they work, you know why hallucinations are ingrained in them - until and unless the architecture changes.
@abacaj
Very true. I'll add that the time spent fine-tuning may be better spent trying to optimize the workflow, UX, onboarding, and the pre-/post-processing parts of the pipeline.
The journey of designing Acrobat's AI Assistant and Generative Summary using a quality framework that asks:
Is it useful? Is it usable? Is it responsible?
X should allow unrolling such mega-discussions into a Reddit-like single-page view with posts from everyone expanded out recursively. Tapping back and forth the sub-sub-threads is so tedious.
Anyone who tells you that the mathematics of cognition has been invalidated by experiments with gradient descent, does not understand math or cognition.
As near as I can tell, the detractors simply lack the ability to deal with these concepts on the level of abstraction that
Using simple puzzles to judge LLMs is like the "reverse a linked list in C" interview Q.
LLM would almost never have to do that in production. So it's a bit silly.
However, if an LLM could *not* do something this simple and well-known, there's some signal there, no?
WRONG. Can we *stop* using single instances of a question as a litmus test for how good a model is?
Especially since it appears everywhere on the internet. At least try changing the numbers before claiming the model understands the theory behind it.
Most models fail if probed further
So instructions like "You can do it" or "I'll tip you $20 for a better solution" apparently make LLMs less lazy and perform better.
Do multimodal equivalents work too? Like adding pictures of candies, trophies, faces of a happy customers for “motivation”? Any research on that?
@burny_tech
Why might that be?
1. Larger org size causing slowdowns due to coordination/decision-making overheads
2. Releases now need higher safety bar, esp. voice deepfakes risks in election cycle
3. Attention spread across too many projects
4. Attention solely on releasing GPT-4.5/5
Refusal rates on "seemingly toxic" and toxic prompts for different LLMs.
From: Cui et al., “OR-Bench: An Over-Refusal Benchmark for Large Language Models.”
Apple has made several improvements for on-device ML in iOS 18.
- model compression improvements in palettization, quantization, pruning, and post-training compression
- transformers support improvements via in-place state tensors for KV Cache, SDPA ops, multiple adapters
🧵
Evolving LLM evaluation landscape by
@seb_ruder
Benchmarks
- becoming less dependable
- saturating with rapid LLM progress
- suffering memorization via leakage into training sets
- suffering overfitting due to incentives, GPT judges, and synthetic data
🧵
What I haven’t seen much progress on:
- Calibrated confidence coming out of LLMs and what should that even mean?
- A proper separation of instruction vs data in the input to fundamentally address prompt injections
- Built-in citations from LLMs
We need an updated version of the limitations of current LLM systems. Most of the initial limitations are, imo, (with high probability) sufficiently resolved or very close to being solved. Any pointers would be helpful!
@OpenAI
"GPT-4 custom model specifically optimized for the Japanese … improved performance in translating and summarizing Japanese text, is cost effective, and operates up to 3x faster than its predecessor."
Is it a smaller GPT-4 variant? An optimized tokenizer for Japanese?
@adad8m
This is trickier than I first thought. The point C on the circle that makes the path ACB shortest would be such that the tangent at C makes the same angle with AC and with BC. Or, equivalently, angles ACO and BCO are equal.
This can be visualized via a sequence of expanding ellipses.
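Sketching that argument in my own notation (O is the circle's center, r its radius): the level sets of |CA| + |CB| are ellipses with foci A and B, and at the minimizing point the smallest such ellipse is tangent to the circle, so their normals coincide.

```latex
\[
\min_{\lVert C-O\rVert = r}\; \lVert C-A\rVert + \lVert C-B\rVert
\quad\Longrightarrow\quad
\frac{C-A}{\lVert C-A\rVert} + \frac{C-B}{\lVert C-B\rVert} = \lambda\,(C-O)
\quad\text{(Lagrange condition).}
\]
```

The left-hand side lies along the bisector of angle ACB, so the bisector is along OC, which is exactly the equal-angle condition above.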
When using MTEB leaderboard to select embedding models for retrieval, watch out for:
- document lengths in the eval dataset being very different than in target use-case
- domain of the eval data being very different
@simonw
I was just reading this where they “estimate the stock of human-generated public text at around 300 trillion tokens. If trends continue, language models will fully utilize this stock between 2026 and 2032, or even earlier if intensely overtrained”
Are we running out of data to train language models?
State-of-the-art LLMs use datasets with tens of trillions of words, and use 2-3x more per year. Our new ICML paper estimates when we might exhaust all text data on the internet. 1/12
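A rough back-of-the-envelope version of that projection; the 2024 starting point and the growth factor below are assumptions for illustration, not the paper's model:

```python
# Back-of-the-envelope: if a frontier training run uses ~15T tokens in 2024 (assumed)
# and token usage grows ~2.5x per year (the thread says 2-3x), when does usage pass
# the estimated ~300T-token stock of public human-generated text?
import math

stock = 300e12      # estimated stock of public human text, in tokens
used_2024 = 15e12   # assumed tokens used in 2024
growth = 2.5        # assumed yearly growth factor

years = math.log(stock / used_2024) / math.log(growth)
print(f"Exhausted around {2024 + years:.0f}")  # ~2027 under these assumptions
```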
Has there been any success on finding model merges using reinforcement learning?
Reading by
@maximelabonne
it feels like this problem (esp finding good frankenmerges) would be interesting to attack with RL.
In WWDC State of the Union, Apple goes into the optimizations they used for on-device LLMs:
- Adapters (LoRA?) fine-tuned to specific tasks like summarization
- Quantization
- Context Pruning
- Speculative Decoding
- Grouped Query Attention
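On the grouped-query-attention item, a minimal sketch of what GQA means in tensor terms (illustrative shapes and code, not Apple's implementation):

```python
# Grouped Query Attention sketch: several query heads share one key/value head,
# which shrinks the KV cache. Shapes here are illustrative.
import torch
import torch.nn.functional as F

batch, seq, head_dim = 1, 8, 64
n_q_heads, n_kv_heads = 8, 2                  # 4 query heads share each KV head
group = n_q_heads // n_kv_heads

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # cached K/V are 4x smaller than MHA
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand each KV head across its group of query heads, then run standard attention.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 8, 64])
```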
Any evaluation datasets to assess LLM's comprehension of long and _nested_ contexts such as in "story within a story"?
To determine how well LLMs can keep track of the multiple realities of the outer/"frame" story vs. the inner plots, without getting confused.
"Reasoning or Reciting" shows a drop in LLM performance on counterfactual variants of standard tasks.
"While current LMs may possess abstract task-solving skills to a degree, they often also rely on narrow, non-transferable procedures for task-solving."
Today was a day full of reality checks.
This is the third result I'm seeing today describing serious limitations of LLMs working alone.
Shows the need for a lot of "support structure" around LLMs -- pre-processing, tool-use, verification, etc.
In case you are wondering,
@karthikv792
found that LLMs still can't plan even after the advent of Claude 3 Opus and Gemini Pro..😬
tldr; no AGI doom by all fools day! You can still tame the lot by throwing assorted blocks at 'em..😋
The much ballyhooed Claude 3 Opus does no
Tips for using MTEB effectively to select embedding models for retrieval:
- narrow down to retrieval-relevant tasks, e.g. ignore classification
- sort via 'mean task rank' rather than 'average score across tasks'
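A hedged sketch of the second tip with pandas; the file name and the assumption that rows are models and columns are retrieval-task scores are hypothetical, so adapt to however you export the leaderboard:

```python
# Sketch: rank embedding models by mean task rank over retrieval tasks only.
# "mteb_retrieval.csv" (rows = models, columns = retrieval-task scores) is a
# hypothetical export of the MTEB leaderboard.
import pandas as pd

df = pd.read_csv("mteb_retrieval.csv", index_col="model")

# Rank models per task (1 = best score on that task), then average ranks across tasks.
mean_task_rank = df.rank(ascending=False, axis=0).mean(axis=1)

# A low mean rank is harder to achieve via a few outlier tasks than a high average
# score is, which is why it's the better sort key here.
print(mean_task_rank.sort_values().head(10))
```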
@deliprao
Azure OpenAI Provisioned Throughput Units (PTUs) are one solution:
Another solution is to schedule these at US night time or use Azure OpenAI deployments in non-US regions during US daytime.
@kindgracekind
Before you try to center the div, ask yourself:
“Who am I that is trying to center the div?”
When you find the answer, you will no longer need to center the div.
@jeremyphoward
The FlipFlop effect: “study of 10 LLMs on 7 classification tasks reveals that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17%.”
Windows Copilot Library provides 4 AI-backed APIs using local models:
1. Phi Silica LLM
2. OCR
3. Studio Effects
4. Recall
Apps can use these APIs. Apps can also add contextual info to the underlying vector DB.
But apps cannot query Recall.
@sh_reya
Very relatable! Another huge challenge: In many cases, user input being totally open-ended means that the domain is pretty much “everything”. E.g. for the Acrobat AI Assistant, this can be any PDF or question related to it! A robust taxonomy goes a long way in structuring evals.
@vboykis
Some text-to-image uses:
Edits in Adobe Photoshop like Generative Expand or Generative Fill guided by a text prompt.
Creating on-brand assets for marketing.
So much nuance to getting evals to work consistently! Even with simple A-D choices, different ways of extracting the answer (max-probability token among only the valid choices, full answer generation, etc.) lead to widely different scores and rankings.
Fascinating writeup:
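One hedged illustration of two common extraction styles and why they can disagree; `choice_logprobs` and `generated_text` stand in for whatever an eval harness actually returns:

```python
# Sketch of two ways to extract an A-D answer; they can rank the same model differently.
import re

def answer_from_logprobs(choice_logprobs: dict[str, float]) -> str:
    """Constrained scoring: take the highest-probability token among the valid choices only."""
    return max(choice_logprobs, key=choice_logprobs.get)

def answer_from_generation(generated_text: str) -> str | None:
    """Free generation: let the model write text, then parse out the first A-D letter."""
    match = re.search(r"\b([ABCD])\b", generated_text)
    return match.group(1) if match else None  # unparseable output usually gets scored as wrong

print(answer_from_logprobs({"A": -2.1, "B": -0.3, "C": -4.0, "D": -3.2}))  # B
print(answer_from_generation("The correct answer is (C) because ..."))      # C
```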
@amydeng_
Sounds about right: starting with a research/standard model, do whatever it takes to build the core ML capability all the way to a shippable state.
But in a team, different MLEs naturally develop expertise in select areas and so individuals get specialized (even if not the role).
@omerNLP
@alon_jacovi
@lovodkin93
@rtsarfaty
This is a very relevant nuance that's missing in a lot of needle-in-a-haystack type evals, where LLMs show super high recall, which isn't the case in practice. Thanks for this work!
@asbruckman
In the Acrobat AI Assistant, we show citations for different parts of the generated answer. They highlight sentences in the PDF most relevant to that part of the answer.