Excited to introduce SecCodePLT🛡️: a unified platform for evaluating security risks in code generation AI! Since summer, we’ve been building a comprehensive tool to assess AI models' potential for generating insecure code and facilitating cyberattacks. 🧵1/👇
🤔 Q: How do you find low-quality data?
💡 A: Corrupt the good ones and watch where they go!
Sharing this simple yet generalizable data pruning idea I worked on this summer at the ENLSP workshop #NeurIPS2023 as an *Oral* presentation!
arXiv:
(🧵1/)
🎉 Two of my papers have been accepted this week at #ICLR2024 & #AISTATS!
Big thanks and congrats to co-authors @xxchenxx_ut & Eric Gan, mentors Atlas Wang & @gkdziugaite, and especially my advisor @baharanm! 🙏
More details on both papers after the ICML deadline!
Excited to share our #NeurIPS2023 paper tackling spurious correlations in machine learning! Grounded in theoretical analysis, our PDE algorithm improves efficiency and worst-group accuracy under group imbalances. Discover more on our code and project page!👇
Code and project page are released for our #NeurIPS23 paper on spurious correlations in robust learning! 🚀
🔗 Project:
🔗 Code:
Key Insights:
📊 Our theoretical analysis shows that spurious features overtake initial…
Honored to have my work featured alongside others from our lab in the #ICML2024 tutorial on Foundations of Data-Efficient Learning! The tutorial is a well-designed, comprehensive overview of theoretically grounded dataset curation techniques. The recording is out for anyone interested!🙌
📢Excited to share the recording of our #ICML2024 Tutorial on Foundations of Data-Efficient Learning:
Truly grateful to everyone who attended — it was incredible to see the enthusiasm for theoretically principled techniques for dataset curation!
🎙️We are thrilled to announce that we will be presenting our latest paper with @besanushi, @hmd_palangi, and @baharanm () at #ICML2023! 🎉 Join us as we share insights and solutions for ✨spurious correlations in vision-language models✨. (🧵1/8)
It’s been a new and exciting experience to be part of founding @VirtueAI_co! I’ve had the privilege of working with top minds in the field – I'm incredibly grateful for this invaluable experience. Check out our website and blogs, and come hang out with us in SF this summer! 🥳🎉
📢Our paper with Hao Kang and @baharanm () will appear at #ICML2023! Introducing CREST: the first coreset selection algorithm theoretically guaranteed to speed up training of deep neural networks!🚀(🧵1/7)
Our findings show that longer code files are often lower quality, and pruning these files can significantly enhance performance. Excited to have contributed to this project led by @Aaditya6284, extending my internship work @AIatMeta! 🌟 Check out our paper 👉
Long (code) files may not be as high quality as you think…
Excited for our new work, "Brevity is the soul of wit: Pruning long files for code generation". We find that long code files are often low quality and show benefits from pruning such files for code gen.
Read on 🔎⏬
Excited about training on synthetic data? Different stages of training might need different synthetic data! 🧠💡
Check out our #ICLR2024 paper on Progressive Dataset Distillation (PDD😉) at PS #2, Halle B, Poster #9! It tailors synthetic data to each training stage for better performance!
Don't miss our poster session today at #NeurIPS2023!
🤗 @Yihe__Deng will be presenting our work on "Robust Learning with Progressive Data Expansion Against Spurious Correlation."
📍 Great Hall & Hall B1+B2 (level 1), Poster #707
⏰ 5:15 p.m. - 7:15 p.m. CST
🔗
Happy to share that our work "Robust Learning with Progressive Data Expansion Against Spurious Correlation" has been accepted to #NeurIPS2023! 🎉
arXiv:
🌴Sharing some Oahu travel gems (mostly hiking trails) from my past trips with fellow #ICML2023 attendees who plan to explore Oahu after the conference 👇🏻 Hope everyone enjoys the workshops and #Hawaii! 🤗
Excited to share our latest work on Thursday at #ICML2023! Join me for a discussion on: 1️⃣ Mitigating spurious correlations in CLIP 2️⃣ Coreset selection for efficient training of deep NNs. Looking forward to reconnecting with old friends and making new ones! See you there! 📷🥳
CLIP is susceptible to backdoor attacks ☠️ that hurt its zero-shot perf. 📢 Introducing CleanCLIP, an unsupervised framework to reduce the impact of backdoor attacks. Accepted at #ICCV2023. Also the best paper award winner🏆 at the #RTML workshop at #ICLR2023!
Paper: 🧵👇
Grateful for the feature on this insightful blog! Our paper, 'Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning,' presents an efficient approach to the challenge. Looking forward to seeing more work on efficient and robust handling of real-world data.
New evaluation methods and a commitment to continual improvement are musts if we’re to build multimodal AI systems that advance human goals. Learn about cutting-edge research into the responsible development and use of multimodal AI at Microsoft:
We announce AIR 2024: a unified AI risk taxonomy. Analyzing regulations from the EU🇪🇺, US🇺🇸, and China🇨🇳, plus 16 policies from AI developers like @OpenAI, @AnthropicAI, @AIatMeta, and @GoogleAI, we identified 314 risk categories. Check out our white paper & blog @VirtueAI_co 👇
My mentor from @AIatMeta last summer, @arimorcos, is now leading @datologyai! 🌟
Ari is a top data expert & exceptional leader. They're hiring in research & engineering - a great opportunity for those who want to work on cutting-edge data projects. Don't miss it! 👇
We are hiring for roles across both research and engineering. If you're excited about pushing the frontier of what's possible through better data -- please apply here:
Just read this insightful paper by @WilliamBarrHeld on how NLP research perpetuates unequal power dynamics. Agree on moving from large, biased datasets to value-centered data curation. Data-efficient learning isn't just for sustainability but also builds #ResponsibleAI.🙌
How do existing knowledge, technologies, and power structures reinforce each other in Natural Language Processing research?
We find that involvement in NLP research becomes more unequal as we move from unlabeled data to deployed systems.
Check out @Aaditya6284 & @scychan_brains's #NeurIPS2023 paper revealing transformers' in-context learning (ICL) capability is transient and gives way to in-weights learning (IWL).🧐
Excited to finally read about all these interesting results I heard about over the summer!🤗
Training your transformer for longer to get better performance? Be careful! We find that emergent in-context learning of transformers disappears in "The Transient Nature of In-Context Learning in Transformers" (, poster at #NeurIPS2023).
Read on 🔎⏬
Excited to share our upcoming workshop presentation at #ICML2023! Our research introduces a novel progressive data expansion algorithm founded on theoretical insights to enhance deep learning models' robustness against spurious correlations. Join us at the SCIS workshop!
📢Excited to be attending #icml2023 soon and thrilled to meet and connect with everyone!
I'll be presenting our work "Robust Learning with Progressive Data Expansion Against Spurious Correlation" at the SCIS workshop, Sat the 29th, 2:30-3:30pm, Meeting Room 316 AB.
Proud to be presenting Data-Efficient CLIP at AISTATS 2024! We propose the first theoretically rigorous method to select the most useful data subset for training CLIP models. On CC3M and CC12M, our subsets are up to 2.5x better than the next best baseline!
🧵👇
Check out the latest work, led by @Yihe__Deng and @QuanquanGu - unleash the full power of your chatbots by letting them *rephrase* the question before they *respond*! 👇
📢 Excited to share our latest research on improving human-AI communication! 🤖💬 We introduce 'Rephrase and Respond' (RaR), a simple yet effective method that enhances LLMs’ understanding of human questions. Check out how RaR improves #GPT4 performance by resolving ambiguities &
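A minimal sketch of the two-step idea as I read it, with a hypothetical `chat()` stand-in for any LLM chat API (not the paper's actual code):

```python
def chat(prompt: str) -> str:
    # Hypothetical stand-in: plug in your LLM client here.
    raise NotImplementedError

def rephrase_and_respond(question: str) -> str:
    # Step 1: have the model rewrite the question, resolving ambiguities.
    rephrased = chat(
        "Rephrase and expand the following question so it is "
        f"unambiguous and self-contained:\n{question}"
    )
    # Step 2: answer the clarified question instead of the original.
    return chat(f"Answer this question:\n{rephrased}")
```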
👇 Check out SpuCo, our new Python package designed to make studying and benchmarking model robustness against SPUrious COrrelations super easy! 🤩
💪 Exciting collaboration with @sjoshi804, @xue_yihao65785, @WenhanYang0315, and @baharanm!
Introducing SpuCo: a Python package to standardize tackling spurious correlations in deep neural networks!
1️⃣ Modular implementation of current SOTA methods
2️⃣ Two new challenging and realistic datasets
Paper:
Github:
🧵(1/n)
Happening *now* at #NeurIPS23! Check out my labmate @WenhanYang0315's work on defending CLIP against data poisoning and backdoor attacks!
📍 Great Hall & Hall B1+B2 (level 1), Poster #718
⏰ Wed, Dec 13 | 10:45 a.m. - 12:45 p.m. CST
🔗
CLIP is such a great model for zero-shot inference…until it is poisoned!
0.0001% poison / 0.01% backdoor rates are enough to compromise the model!
Our work on multimodal robustness proposed RoCLIP to tackle this issue!
📢 @Aaditya6284 will give the oral presentation for our work tomorrow at the ENLSP workshop!
🗓️ Oral: Sat, 9:42 a.m. - 9:48 a.m.
📌 Poster: 1:00 p.m. - 2:00 p.m.
📍 Room 206 - 207
If you're at #NeurIPS2023, come check out the oral presentation and our poster! 🤗(🧵9/9)
11/ 💻 Real-World Example: SecCodePLT uncovered vulnerabilities in coding agents like @cursor_ai, which achieved a secure coding rate of 60% but failed on critical CWEs such as broken cryptographic algorithms (CWE-327) and incorrect authorization (CWE-863).
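To make CWE-327 concrete, here's a toy illustration of the kind of weakness it covers and the fix a dynamic test would reward (our own example, not a benchmark sample):

```python
import hashlib, os

password = b"hunter2"

# Insecure (CWE-327 flavor): MD5 is a broken hash for password storage.
weak = hashlib.md5(password).hexdigest()

# Safer: a slow, salted key-derivation function such as scrypt.
salt = os.urandom(16)
strong = hashlib.scrypt(password, salt=salt, n=2**14, r=8, p=1)
```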
Data for cyberattack helpfulness and the full evaluation platform will be open-sourced soon. Stay tuned! ⏰
Meanwhile, we’d love to hear your feedback as we continue to improve SecCodePLT and its resources! 😊
🔬 Key Idea of CREST: model the non-convex loss locally as a quadratic via Taylor expansion, using the gradient and the diagonal of the Hessian. CREST then trains on selected coresets with a small approximation error, bounding the gradient error within each quadratic region. (🧵3/7)
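A rough NumPy sketch of that quadratic model (our reading of the idea, not CREST's actual code):

```python
import numpy as np

def quadratic_model(w, w0, loss0, g, h_diag):
    """Second-order Taylor approximation of the loss around w0,
    keeping only the diagonal of the Hessian."""
    d = w - w0
    return loss0 + g @ d + 0.5 * d @ (h_diag * d)
```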
There is a well-known saying in the data science community: 'Garbage In, Garbage Out.' The quality of data we feed into our models is the cornerstone of their success or failure. With the rise of large language models (LLMs), data quality has never been more crucial. (🧵2/)
🔍 Next, we embed both the original and the corrupted data to analyze their relationships. We observe how the data moves due to the synthetic corruption. This movement, this change in the embeddings, signals which data is potentially problematic. (🧵5/)
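A minimal sketch of this embed-and-compare step, assuming a hypothetical `embed()` helper (any off-the-shelf embedding model; the paper's exact setup may differ):

```python
import numpy as np

def embed(texts):
    # Hypothetical stand-in for any off-the-shelf text/code embedder.
    raise NotImplementedError

def corruption_shift(originals, corrupted):
    """Embed clean data and its synthetically corrupted copies; the
    size and direction of each example's movement in embedding space
    is the signal that marks potentially problematic data."""
    e_orig = np.asarray(embed(originals))
    e_corr = np.asarray(embed(corrupted))
    shift = e_corr - e_orig
    return np.linalg.norm(shift, axis=1), shift
```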
2/ Previous benchmarks focused on insecure code or attack suggestions, often using static metrics and lacking expert validation. SecCodePLT fixes this with a unified platform evaluating insecure coding and cyberattack helpfulness via dynamic metrics and expert-verified data.👩‍💻
📊 Learning with Small Variance: By selecting coresets of the size of one mini-batch from larger random subsets, CREST trains with smaller variance than random mini-batch SGD, leading to faster convergence. (🧵5/7)
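One plausible instantiation of the selection step (a greedy sketch under our assumptions, not CREST's actual rule): pick a mini-batch-sized coreset whose mean gradient matches the larger random subset's mean gradient.

```python
import numpy as np

def select_coreset(grads, batch_size):
    """grads: (N, d) per-example gradients of a larger random subset.
    Greedily pick batch_size examples whose mean gradient best matches
    the subset's mean, giving lower variance than a uniform mini-batch."""
    target = grads.mean(axis=0)
    remaining = list(range(len(grads)))
    chosen, running = [], np.zeros_like(target)
    for k in range(1, batch_size + 1):
        errs = [np.linalg.norm((running + grads[i]) / k - target)
                for i in remaining]
        best = remaining.pop(int(np.argmin(errs)))
        chosen.append(best)
        running += grads[best]
    return chosen
```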
💡 In summary, our paper addresses the challenges of spurious correlations in vision-language models and presents an efficient contrastive learning approach for detection and mitigation. 🔗 Stay tuned for code and implementation details: 💻 (🧵8/8)
8/ 🚨 Cyberattack Helpfulness Results: SecCodePLT revealed that GPT-4o can generate end-to-end cyberattacks. Claude, however, had much higher refusal rates on the two most dangerous tasks, Weaponization & Infiltration and C2 & Execution.
⚡ Speedup Results: CREST demonstrates up to 2.5x faster training on vision and language datasets (CIFAR-10, CIFAR-100, TinyImageNet, SNLI), including datasets with millions of examples, while maintaining theoretically guaranteed performance comparable to full training. (🧵6/7)
7/ ⚔️ Task 2: Cyberattack Helpfulness: SecCodePLT evaluates models across 5 categories: reconnaissance, infiltration, command & control, discovery, and data exfiltration. Each category is tested for attack execution success and refusal rates.
1️⃣ We detect linguistic attributes and test their impact on model performance. We use practitioner supervision to identify spurious correlations. 🔍 (🧵5/8)
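A toy sketch of the "test their impact" step, assuming precomputed zero-shot hits and attribute tags (illustrative only, not the paper's exact procedure):

```python
import numpy as np

def attribute_gap(correct, has_attr):
    """correct: per-example zero-shot hits (bool); has_attr: whether a
    candidate linguistic attribute (e.g., 'on water') applies to the
    example. A large accuracy gap flags a potential spurious
    correlation to confirm with practitioner supervision."""
    correct, has_attr = np.asarray(correct), np.asarray(has_attr)
    return correct[has_attr].mean() - correct[~has_attr].mean()
```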
3/ 🔍 Dynamic Evaluation: SecCodePLT uses real test cases and dynamic metrics, offering more precise assessments than static methods. It ensures that code not only looks secure but actually passes functionality and security tests in real-world scenarios.
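A toy harness illustrating what "dynamic" means here: execute the generated code against a concrete test instead of pattern-matching the source (our own sketch, not SecCodePLT's actual API):

```python
import os, subprocess, tempfile

def behaves_correctly(code: str, test: str) -> bool:
    """Run model-generated code against a concrete test case; a zero
    exit status means the behavior (not just the text) passes. Real
    harnesses add sandboxing and security oracles on top of this."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test)
        path = f.name
    try:
        proc = subprocess.run(["python", path], timeout=10)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```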
2️⃣ We extend contrastive language-vision learning with additional loss functions that decorrelate spurious attributes from class names and separate vision and language representations. Our method fine-tunes only the projections, requiring fewer computational resources. 🌿(🧵6/8)
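One way to read the decorrelation objective, as a hedged PyTorch sketch (not the paper's exact losses): push class-name embeddings away from spurious-attribute embeddings while only the projection layers receive gradients.

```python
import torch
import torch.nn.functional as F

def decorrelation_loss(class_emb, spurious_emb):
    """class_emb: (C, d) projected text embeddings of class names;
    spurious_emb: (S, d) embeddings of spurious-attribute phrases.
    Penalizes cosine similarity between every class/attribute pair."""
    c = F.normalize(class_emb, dim=-1)
    s = F.normalize(spurious_emb, dim=-1)
    return (c @ s.T).abs().mean()
```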
10/ 🔐 Security Policy: SecCodePLT introduces an optional security policy reminder for each insecure coding task, offering explicit vulnerability guidelines. We found that adding this boosts secure coding rates by ~20%, proving clear instructions help models generate safer code.
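For illustration, such a reminder might be prepended to a task prompt like this (the wording below is our own, not SecCodePLT's actual policy text):

```python
# Hypothetical illustration of a security-policy reminder.
POLICY = (
    "Security policy: validate and sanitize all external input, use "
    "vetted cryptographic primitives, and check authorization before "
    "performing privileged actions."
)
task = "Write a Flask endpoint that saves a user-uploaded file."
prompt = f"{POLICY}\n\n{task}"
```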
9/ ⚖️ SecCodePLT vs. CyberSecEval: our SecCodePLT outperforms CyberSecEval by @AIatMeta in security relevance (how well prompts match security scenarios) and instruction faithfulness (accuracy of prompts to tasks). SecCodePLT scores nearly 100%, vs CyberSecEval’s 68% and 42%.
🔄 Adaptive Updates: As training progresses, the approximated loss may deviate significantly from the true loss. When the approximation error exceeds a threshold, CREST updates the coreset and the quadratic approximation, adapting to the evolving loss landscape. (🧵4/7)
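Putting the pieces together, the adaptive schedule looks roughly like this (a schematic under our reading, not CREST's code):

```python
def crest_style_loop(w, steps, tol, sgd_step, loss_on, new_coreset):
    """Train on a coreset while the local quadratic model is trusted;
    refresh both once the approximation error exceeds tol."""
    coreset, q_model = new_coreset(w)  # coreset + quadratic approx at w
    for _ in range(steps):
        if abs(loss_on(w, coreset) - q_model(w)) > tol:
            coreset, q_model = new_coreset(w)  # left the trusted region
        w = sgd_step(w, coreset)
    return w
```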
4/ ✅ Expert-Verified Data: SecCodePLT combines expert-generated seed samples with LLM-based mutations to scale up, ensuring both quality and relevance. This two-stage process guarantees the data aligns with real security vulnerabilities.
✂️ We use these indicators to prune the data systematically. We call this "Synthetic Corruption Informed Pruning" (SCIP). It's important to note that the indicators might vary with each dataset, making SCIP less of a predefined algorithm and more of a flexible approach. (🧵6/)
5/ 🛡️ Task 1: Insecure Coding: SecCodePLT evaluates AI models across 27 CWEs, testing over 1,300 samples for code generation and completion. It measures their ability to avoid generating insecure code in real-world security scenarios.
We present experimental results on CLIP, showcasing our approach's effectiveness against spurious correlations. By leveraging the joint embedding space, our approach improves model attention without extra annotation data. 🚀📊(🧵7/8)
6/ 📊 Insecure Coding Results: On SecCodePLT's insecure coding benchmark, GPT-4o achieved a secure coding rate of 55%, outperforming CodeLlama-34B and Llama-3.1-70B. However, vulnerabilities like cryptographic weaknesses and input validation remain challenging across all models.
🔍 The Challenge: Previous coreset selection methods lacked convergence guarantees for (mini-batch) SGD. Non-convex loss and stochastic gradient require multiple coresets to bound the error. But how do we determine the optimal number and update timings? (🧵2/7)
💡 Unlike prior work, our approach leverages the 👁️ multi-modality of CLIP 🗣️. We automate spurious attribute detection and mitigation using language-based descriptions, reducing the need for human annotations and creating a more flexible debugging interface. (🧵4/8)
🧐 So how do we approach this challenging problem? This is where our clear-cut quality indicators come into play. We intentionally 'corrupt' our data by distorting these indicators, like intentionally breaking the grammar. (🧵4/)
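For instance, "breaking the grammar" can be as simple as shuffling word order (a toy text example; the corruptions used for code differ):

```python
import random

def break_grammar(text: str, seed: int = 0) -> str:
    """Toy corruption: shuffle word order to destroy syntax while
    keeping the vocabulary, one clear-cut quality indicator."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

print(break_grammar("the quick brown fox jumps over the lazy dog"))
```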
📈 Our tests showed up to a 3% performance improvement in code generation models by pruning 20% of data using SCIP. Plus, SCIP isn't just about better performance—it's also about efficiency. We achieved the same model performance with over 20% less training time. (🧵7/)
Since retraining from scratch is not practical, our approach addresses spurious correlations during fine-tuning via contrastive learning. 🔧 (🧵3/8)
Thank you, Tung! Still in internship mode to finish my PhD first, but excited for what's next!😆 This has been an incredible learning experience.
Hi @_lewtun, thanks for your interest! We chose code data for SCIP's debut because of its well-defined quality indicators. SCIP can be applied to other data types, like chat data, but we haven't tested that yet. We hope this research inspires further exploration across various data types!
To identify low-quality language data, we have, on the one hand, clear-cut quality indicators such as grammar, spelling, and coherence; they are reliable and measurable. Yet we face challenges with subtle errors and the impracticality of manual inspection. (🧵3/)