Yu Yang

@YUYANG_UCLA

1,926 Followers · 641 Following · 21 Media · 112 Statuses

@VirtueAI_co @UCLAComSci 🧸 | Prev @AIatMeta @MSFTResearch @AmazonScience @uclamath | Improving data for efficiency, robustness and performance

San Francisco, CA
Joined April 2022
Pinned Tweet
@YUYANG_UCLA
Yu Yang
18 days
Excited to introduce SecCodePLT🛡️: a unified platform for evaluating security risks in code generation AI! Since summer, we’ve been building a comprehensive tool to assess AI models' potential for generating insecure code and facilitating cyberattacks. 🧵1/👇
Tweet media one
4
26
111
@YUYANG_UCLA
Yu Yang
11 months
🤔 Q: How do you find low-quality data? 💡 A: Corrupt the good ones and watch where they go! Sharing a simple yet generalizable data pruning idea I worked on this summer, presented as an *Oral* at the ENLSP workshop #NeurIPS2023! arXiv: (🧵1/)
Tweet media one
3
34
313
@YUYANG_UCLA
Yu Yang
10 months
🎉 Two of my papers have been accepted this week at #ICLR2024 & #AISTATS ! Big thanks and congrats to co-authors @xxchenxx_ut & Eric Gan, mentors Atlas Wang & @gkdziugaite , and especially my advisor @baharanm ! 🙏 More details on both papers after the ICML deadline!
Tweet media one
7
15
206
@YUYANG_UCLA
Yu Yang
1 year
Excited to share our #NeurIPS2023 paper tackling spurious correlations in machine learning! Grounded in theoretical analysis, our PDE algorithm improves efficiency and worst-group accuracy under group imbalances. Discover more in our code and project page!👇
@Yihe__Deng
Yihe Deng
1 year
Codes and project page are released for our #NeurIPS23 paper on spurious correlations in robust learning! 🚀 🔗 Project: 🔗 Code: Key Insights: 📊 We discovered in theoretical analysis that spurious features overtake initial
2
14
90
0
8
115
@YUYANG_UCLA
Yu Yang
20 days
Honored to have my work featured alongside others from our lab in the #ICML2024 tutorial on Foundations of Data-Efficient Learning! The tutorial is a well-designed, comprehensive overview of theoretically grounded dataset curation techniques. The recording is out for anyone interested!🙌
@sjoshi804
Siddharth Joshi
20 days
📢Excited to share the recording of our #ICML2024 Tutorial on Foundations of Data-Efficient Learning: Truly grateful to everyone who attended — it was incredible to see the enthusiasm for theoretically principled techniques for dataset curation!
Tweet media one
2
19
121
1
3
113
@YUYANG_UCLA
Yu Yang
1 year
🎙️We are thrilled to announce that we will be presenting our latest paper with @besanushi @hmd_palangi @baharanm () at #ICML2023 ! 🎉 Join us as we share insights and solutions for ✨spurious correlations in vision-language models✨. (🧵1/8)
Tweet media one
10
5
94
@YUYANG_UCLA
Yu Yang
3 months
It’s been a new and exciting experience to be part of founding @VirtueAI_co ! I’ve had the privilege of working with top minds in the field – I'm incredibly grateful for this invaluable experience. Check out our website and blogs, and come hang out with us in SF this summer! 🥳🎉
@uiuc_aisecure
Bo Li
3 months
Interested to learn more? Please contact us and talk to our team. More to come! @sanmikoyejo @dawnsongtweets @guestrin @EasonZeng623 @MinzhouP @YUYANG_UCLA
0
3
10
13
7
91
@YUYANG_UCLA
Yu Yang
1 year
📢Our paper with Hao Kang and @baharanm () will appear at #ICML2023! Introducing CREST: the first coreset selection algorithm theoretically guaranteed to speed up training of deep neural networks!🚀(🧵1/7)
Tweet media one
8
13
77
@YUYANG_UCLA
Yu Yang
4 months
Our findings show that longer code files are often of lower quality, and pruning these files can significantly enhance performance. Excited to have contributed to this project led by @Aaditya6284 , extending my internship work @AIatMeta ! 🌟 Check out our paper 👉 (toy sketch of the heuristic after the quoted tweet below)
@Aaditya6284
Aaditya Singh
4 months
Long (code) files may not be as high quality as you think… Excited for our new work, "Brevity is the soul of wit: Pruning long files for code generation". We find that long code files are often low quality and show benefits from pruning such files for code gen. Read on 🔎⏬
4
16
58
2
5
76
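Not from the paper's code, just a toy rendering of the length heuristic described in the tweet above; the 20% pruning fraction is illustrative, not the paper's setting:

```python
def prune_long_files(files: list[str], frac: float = 0.2) -> list[str]:
    """Rank code files by length and drop the longest `frac`: the
    finding is that the longest files are disproportionately low
    quality, so removing them can help downstream code models."""
    ranked = sorted(files, key=len)               # shortest first
    keep = len(ranked) - int(frac * len(ranked))  # drop the long tail
    return ranked[:keep]
```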
@YUYANG_UCLA
Yu Yang
6 months
Excited about training on synthetic data? Different stages of training might need different synthetic data! 🧠💡 Check out our #ICLR2024 paper on Progressive Dataset Distillation (PDD😉) at PS #2 Halle B #9 ! It tailors synthetic data to each training stage for better performance!
@xxchenxx_ut
Xuxi Chen
6 months
Humbled to share some details on our recent paper accepted by #ICLR 2024! Paper link: .
1
4
17
0
10
74
@YUYANG_UCLA
Yu Yang
11 months
Don't miss our poster session today at #NeurIPS2023 ! 🤗 @Yihe__Deng will be presenting our work on "Robust Learning with Progressive Data Expansion Against Spurious Correlation." 📍 Great Hall & Hall B1+B2 (level 1) #707 ⏰ 5:15 p.m. - 7:15 p.m. CST 🔗
@Yihe__Deng
Yihe Deng
1 year
Happy to share that our work "Robust Learning with Progressive Data Expansion Against Spurious Correlation" has been accepted to #NeurIPS2023 ! 🎉 arXiv:
Tweet media one
Tweet media two
3
3
60
0
7
65
@YUYANG_UCLA
Yu Yang
1 year
🌴Sharing some Oahu travel gems (mostly hiking trails) from my past trips to fellow #ICML2023 attendees who plan to explore Oahu after the conference 👇🏻 Hope everyone enjoys the workshops and #Hawaii ! 🤗
2
1
55
@YUYANG_UCLA
Yu Yang
1 year
Excited to share our latest work on Thursday at #ICML2023 ! Join me for a discussion on: 1️⃣ Mitigating spurious correlations in CLIP 2️⃣ Coreset selection for efficient training of deep NNs. Looking forward to reconnecting with old friends and making new ones! See you there! 📷🥳
@baharanm
Baharan Mirzasoleiman
1 year
@xue_yihao65785 @sjoshi804 @pinyuchenTW 3- Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning (Thu, 10:30am-12pm EH 1 #106 ) @YUYANG_UCLA @besanushi @hmd_palangi 4-Towards Sustainable Learning: Coresets for Data-efficient Deep Learning (Thu, 1:30-3pm EH 1 #304 ) @YUYANG_UCLA Hao Kang
0
0
15
0
2
52
@YUYANG_UCLA
Yu Yang
1 year
📢 Excited to share CleanCLIP, an unsupervised framework addressing backdoor attacks in CLIP! Huge congratulations to @hbXNov , @nishadsinghi , @FanYin63689862 , @adityagrover_ , and @kaiwei_chang on its acceptance at #ICCV2023 ! 🎉
@hbXNov
Hritik Bansal
1 year
CLIP is susceptible to backdoor attacks ☠️ that hurt its zero-shot perf. 📢 Introducing CleanCLIP, an unsupervised framework to reduce the impact of backdoor attacks. Accepted at #ICCV2023 . Also the best paper award winner🏆at #RTML #ICLR2023 ! Paper: 🧵👇
Tweet media one
2
28
122
1
5
48
@YUYANG_UCLA
Yu Yang
1 year
Grateful for the feature on this insightful blog! Our paper, 'Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning,' presents an efficient approach to the challenge. Looking forward to seeing more work on efficient and robust handling of real-world data.
@MSFTResearch
Microsoft Research
1 year
New evaluation methods and a commitment to continual improvement are musts if we’re to build multimodal AI systems that advance human goals. Learn about cutting-edge research into the responsible development and use of multimodal AI at Microsoft:
Tweet media one
1
4
15
1
1
45
@YUYANG_UCLA
Yu Yang
3 months
We announce AIR 2024: a unified AI risk taxonomy. Analyzing regulations from the EU🇪🇺, US🇺🇸, and China🇨🇳, plus 16 policies from AI developers like @OpenAI , @AnthropicAI , @AIatMeta , and @GoogleAI , we identified 314 risk categories. Check out our white paper & blog @VirtueAI_co 👇
@EasonZeng623
Yi Zeng 曾祎 @EMNLP 🌴
3 months
🧵[1/5] Introducing AIR 2024: Unifying AI risk categorizations with a shared language to improve AI safety. W/ @kevin_klyman @andyz245 @YUYANG_UCLA @MinzhouP , and thanks to @ruoxijia @dawnsongtweets @percyliang @uiuc_aisecure for the guidance kicking off my AI policy research journey 🏦.
Tweet media one
2
18
38
0
7
43
@YUYANG_UCLA
Yu Yang
8 months
My mentor from @AIatMeta last summer, @arimorcos , is now leading @datologyai ! 🌟 Ari is a top data expert & exceptional leader. They're hiring in research & engineering - a great opportunity for those who want to work on cutting-edge data projects. Don't miss it! 👇
@datologyai
DatologyAI
8 months
We are hiring for roles across both research and engineering. If you're excited about pushing the frontier of what's possible through better data -- please apply here:
0
4
18
0
0
40
@YUYANG_UCLA
Yu Yang
1 year
Just read this insightful paper by @WilliamBarrHeld on how NLP research perpetuates unequal power dynamics. Agree on moving from large, biased datasets to value-centered data curation. Data-efficient learning isn't just for sustainability but also builds #ResponsibleAI .🙌
@WilliamBarrHeld
Will Held
1 year
How do existing knowledge, technologies, and power structures reinforce each other in Natural Language Processing research? We find that involvement in NLP research becomes more unequal as we move from unlabeled data to deployed systems.
1
47
107
0
3
37
@YUYANG_UCLA
Yu Yang
1 year
Check out @Aaditya6284 & @scychan_brains 's #NeurIPS2023 paper revealing transformers' in-context learning (ICL) capability is transient and gives way to in-weights learning (IWL).🧐 Excited to finally read about all these interesting results I heard about over the summer!🤗
@Aaditya6284
Aaditya Singh
1 year
Training your transformer for longer to get better performance? Be careful! We find that emergent in-context learning of transformers disappears in "The Transient Nature of In-Context Learning in Transformers" (, poster at #NeurIPS2023 ). Read on 🔎⏬
1
31
145
0
2
35
@YUYANG_UCLA
Yu Yang
1 year
Excited to share our upcoming workshop presentation at #ICML2023 ! Our research introduces a novel progressive data expansion algorithm founded on theoretical insights to enhance deep learning models' robustness against spurious correlations. Join us at the SCIS workshop!
@Yihe__Deng
Yihe Deng
1 year
📢Excited to be attending #icml2023 soon and thrilled to meet and connect with everyone! I'll be presenting at SCIS workshop @ Sat 29th 2:30-3:30pm Meeting Room 316 AB for our work "Robust Learning with Progressive Data Expansion Against Spurious Correlation."
Tweet media one
2
3
31
0
1
29
@YUYANG_UCLA
Yu Yang
6 months
Check out my labmate @sjoshi804 ’s work on data-efficient CLIP at #AISTATS2024 !
@sjoshi804
Siddharth Joshi
6 months
Proud to be presenting Data-Efficient CLIP at AISTATS 2024! We propose the first theoretically rigorous method to select the most useful data subset to train CLIP models. On CC3M and CC12M, our subsets are up to 2.5x better than the next best baseline! 🧵👇
Tweet media one
2
14
61
0
1
27
@YUYANG_UCLA
Yu Yang
1 year
Check out the latest work led by @Yihe__Deng and @QuanquanGu : unleash the full power of your chatbots by letting them *rephrase* the question before they *respond*! 👇
@QuanquanGu
Quanquan Gu
1 year
📢 Excited to share our latest research on improving human-AI communication! 🤖💬 We introduce 'Rephrase and Respond' (RaR), a simple yet effective method that enhances LLMs’ understanding of human questions. Check out how RaR improves #GPT4 performance by resolving ambiguities &
Tweet media one
10
86
306
0
2
27
@YUYANG_UCLA
Yu Yang
1 year
👇 Check out SpuCo, our new Python package designed to make studying and benchmarking model robustness against SPUrious COrrelations super easy! 🤩 💪 Exciting collaboration with @sjoshi804 , @xue_yihao65785 , @WenhanYang0315 , and @baharanm !
@sjoshi804
Siddharth Joshi
1 year
Introducing SpuCo: a Python package to standardize tackling spurious correlations in deep neural networks! 1️⃣ Modular implementation of current SOTA methods 2️⃣ Two new challenging and realistic datasets Paper: Github: 🧵(1/n)
3
4
48
0
1
18
@YUYANG_UCLA
Yu Yang
11 months
Happening *now* at #NeurIPS23 ! Check out my labmate @WenhanYang0315 's work on defending CLIP against data poisoning and backdoor attacks! 📍Great Hall & Hall B1+B2 (level 1) #718 ⏰ Wed, Dec 13 | 10:45 a.m. - 12:45 p.m. CST 🔗
@WenhanYang0315
WENHAN YANG
11 months
CLIP is such a great model for zero-shot inference…until it is poisoned! 0.0001% poison / 0.01% backdoor rates are enough to compromise the model! Our work on multimodal robustness proposed RoCLIP to tackle this issue!
Tweet media one
1
3
22
0
1
15
@YUYANG_UCLA
Yu Yang
11 months
📢 @Aaditya6284 will give the oral presentation for our work tomorrow at the ENLSP workshop! 🗓️ Oral: Sat, 9:42 a.m. - 9:48 a.m. 📌 Poster: 1:00 p.m. - 2:00 p.m. 📍 Room 206 - 207 If you're at #NeurIPS2023 , come check out the oral presentation and our poster! 🤗(🧵9/9)
0
2
13
@YUYANG_UCLA
Yu Yang
18 days
@AIatMeta @cursor_ai Big thanks to my co-first authors @NieYuzhou and @zhun_amg , who share equal credit with me, to Yuheng, and to our amazing advisors @WenboGuo4 , @uiuc_aisecure , @dawnsongtweets . Leading this first project at @VirtueAI_co is a milestone, paving the way for exciting projects to come!
1
1
10
@YUYANG_UCLA
Yu Yang
18 days
Along with the paper, we are releasing the first batch of data—insecure coding samples for 27 CWEs in Python. 📄 Paper: 🔗 HF dataset: 🌐 Website:
1
0
7
@YUYANG_UCLA
Yu Yang
18 days
@AIatMeta 11/ 💻 Real-World Example: SecCodePLT uncovered vulnerabilities in coding agents like @cursor_ai , which achieved a secure coding rate of 60% but failed on critical CWEs like broken cryptographic algorithm (CWE-327) and incorrect authorization (CWE-863), exposing security issues.
Tweet media one
1
0
7
@YUYANG_UCLA
Yu Yang
18 days
Data for cyberattack helpfulness and the full evaluation platform will be open-sourced soon. Stay tuned! ⏰ Meanwhile, we’d love to hear your feedback as we continue to improve SecCodePLT and its resources! 😊
0
0
6
@YUYANG_UCLA
Yu Yang
11 months
🌟 A huge thanks to my incredible colleagues @AIatMeta - @Aaditya6284 , @m_elhoushi , Anas Mahmoud, @kushal_tirumala , @FabianGloeckle , @b_roziere , and mentors @newsha_a , @arimorcos , @CarolejeanWu . Collaborating with them was the highlight of my summer!🙌 (🧵8/)
1
1
5
@YUYANG_UCLA
Yu Yang
1 year
Large vision-language models (e.g., CLIP) are not immune to spurious correlations when applied to domains or fine-tuned with specific data. 😥 (🧵2/8)
Tweet media one
0
2
5
@YUYANG_UCLA
Yu Yang
1 year
@baharanm 🔬 Key Idea of CREST: It models the non-convex loss locally as a quadratic via Taylor expansion, using the gradient and the Hessian diagonal. CREST trains on selected coresets with a small approximation error, bounding the gradient error within each quadratic region. (🧵3/7)
0
0
4
@YUYANG_UCLA
Yu Yang
11 months
There is a well-known saying in the data science community: 'Garbage In, Garbage Out.' The quality of data we feed into our models is the cornerstone of their success or failure. With the rise of large language models (LLMs), data quality has never been more crucial. (🧵2/)
1
1
4
@YUYANG_UCLA
Yu Yang
11 months
🔍 Next, we embed both the original and the corrupted data to analyze their relationships. We observe how the data moves due to the synthetic corruption. This movement, this change in the embeddings, signals which data is potentially problematic. (🧵5/)
Tweet media one
1
1
4
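A loose sketch of one way to operationalize "watch where they go"; the function names, the nearest-neighbor distance, and the pruning fraction are my illustration, not the paper's exact procedure:

```python
import numpy as np

def corruption_scores(embed, texts, corrupt_fn):
    """Embed each clean example and a synthetically corrupted copy,
    then score clean examples by how close they already sit to where
    corrupted data lands: being near that region is a bad sign."""
    e_clean = embed(texts)                             # (n, d) array
    e_corrupt = embed([corrupt_fn(t) for t in texts])  # (n, d) array
    # distance from each clean point to its nearest corrupted point
    dists = np.linalg.norm(
        e_clean[:, None, :] - e_corrupt[None, :, :], axis=-1
    )
    return dists.min(axis=1)

def prune_lowest(texts, scores, frac=0.2):
    """Drop the fraction of examples closest to the corrupted region."""
    keep = np.argsort(scores)[int(frac * len(texts)):]
    return [texts[i] for i in keep]
```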
@YUYANG_UCLA
Yu Yang
18 days
2/ Previous benchmarks focused on insecure code or attack suggestions, often using static metrics and lacking expert validation. SecCodePLT fixes this with a unified platform evaluating insecure coding and cyberattack helpfulness via dynamic metrics and expert-verified data.👩‍💻
Tweet media one
1
0
4
@YUYANG_UCLA
Yu Yang
1 year
@jonasgeiping Congratulations, Jonas, on your new role! Looking forward to seeing the impactful work you'll lead in the coming years. 🎉👏
1
0
2
@YUYANG_UCLA
Yu Yang
1 year
@baharanm 📊 Learning with Small Variance: By selecting coresets of the size of one mini-batch from larger random subsets, CREST trains with smaller variance than random mini-batch SGD, leading to faster convergence. (🧵5/7)
0
0
4
@YUYANG_UCLA
Yu Yang
1 year
@besanushi @hmd_palangi @baharanm 💡 In summary, our paper addresses the challenges of spurious correlations in vision-language models and presents an efficient contrastive learning approach for detection and mitigation. 🔗 Stay tuned for code and implementation details: 💻 (🧵8/8)
0
1
4
@YUYANG_UCLA
Yu Yang
18 days
8/ 🚨 Cyberattack Helpfulness Results: SecCodePLT revealed that GPT-4o can generate end-to-end cyberattacks. Claude, however, had much higher refusal rates on the two most dangerous tasks, Weaponization & Infiltration and C2 & Execution.
Tweet media one
1
0
3
@YUYANG_UCLA
Yu Yang
1 year
@baharanm ⚡ Speedup Results: CREST demonstrates up to 2.5x faster training on vision and language datasets (CIFAR-10, CIFAR-100, TinyImageNet, SNLI), including datasets with millions of examples, while maintaining theoretically guaranteed performance comparable to full training. (🧵6/7)
0
0
3
@YUYANG_UCLA
Yu Yang
18 days
7/ ⚔️ Task 2: Cyberattack Helpfulness: SecCodePLT evaluates models across 5 categories: reconnaissance, infiltration, command & control, discovery, and data exfiltration. Each category is tested for attack execution success and refusal rates.
Tweet media one
1
0
3
@YUYANG_UCLA
Yu Yang
1 year
@besanushi @hmd_palangi @baharanm 1️⃣ We detect linguistic attributes and test their impact on model performance. We use practitioner supervision to identify spurious correlations. 🔍 (🧵5/8)
Tweet media one
0
0
3
@YUYANG_UCLA
Yu Yang
18 days
3/ 🔍 Dynamic Evaluation: SecCodePLT uses real test cases and dynamic metrics, offering more precise assessments than static methods. It ensures that code not only looks secure but actually passes functionality and security tests in real-world scenarios.
1
0
3
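To illustrate the contrast with static metrics, here is a hypothetical harness in this spirit (not SecCodePLT's actual code; the file handling and pass criterion are my assumptions): a generated sample counts only if it passes real functionality tests and a security test.

```python
import subprocess
import tempfile

def dynamic_eval(generated_code: str, functional_tests: str,
                 security_tests: str) -> dict:
    """Hypothetical dynamic check: execute the model's code against
    real test cases, so 'looks secure' is not enough; the code must
    both work and survive the security tests."""
    results = {}
    for name, tests in [("functional", functional_tests),
                        ("security", security_tests)]:
        with tempfile.NamedTemporaryFile(
            "w", suffix=".py", delete=False
        ) as f:
            f.write(generated_code + "\n\n" + tests)
        proc = subprocess.run(
            ["python", f.name], capture_output=True, timeout=30
        )
        results[name] = (proc.returncode == 0)
    return results
```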
@YUYANG_UCLA
Yu Yang
1 year
@besanushi @hmd_palangi @baharanm 2️⃣ We extend contrastive language-vision learning with additional loss functions that decorrelate spurious attributes from class names and separate vision and language representations. Our method fine-tunes only the projections, requiring fewer computational resources. 🌿(🧵6/8)
Tweet media one
0
0
3
@YUYANG_UCLA
Yu Yang
18 days
@AIatMeta 10/ 🔐 Security Policy: SecCodePLT introduces an optional security policy reminder for each insecure coding task, offering explicit vulnerability guidelines. We found that adding this boosts secure coding rates by ~20%, proving clear instructions help models generate safer code.
1
0
3
@YUYANG_UCLA
Yu Yang
18 days
9/ ⚖️ SecCodePLT vs. CyberSecEval: our SecCodePLT outperforms CyberSecEval by @AIatMeta in security relevance (how well prompts match security scenarios) and instruction faithfulness (accuracy of prompts to tasks). SecCodePLT scores nearly 100%, vs CyberSecEval’s 68% and 42%.
Tweet media one
1
0
3
@YUYANG_UCLA
Yu Yang
1 year
@baharanm 🔄 Adaptive Updates: As training progresses, the approximated loss function may deviate significantly. When the approximation error exceeds a threshold, CREST updates the coreset and the quadratic model, adapting to the evolving loss landscape. (🧵4/7)
0
0
3
@YUYANG_UCLA
Yu Yang
18 days
4/ ✅ Expert-Verified Data: SecCodePLT combines expert-generated seed samples with LLM-based mutations to scale up, ensuring both quality and relevance. This two-stage process guarantees the data aligns with real security vulnerabilities.
1
0
3
@YUYANG_UCLA
Yu Yang
11 months
✂️ We use these indicators to prune the data systematically. We call this "Synthetic Corruption Informed Pruning" (SCIP). It's important to note that the indicators might vary with each dataset, making SCIP less of a predefined algorithm and more of a flexible approach. (🧵6/)
Tweet media one
1
1
3
@YUYANG_UCLA
Yu Yang
18 days
5/ 🛡️ Task 1: Insecure Coding: SecCodePLT evaluates AI models across 27 CWEs, testing over 1,300 samples for code generation and completion. It measures their ability to avoid generating insecure code in real-world security scenarios.
Tweet media one
1
0
3
@YUYANG_UCLA
Yu Yang
1 year
@besanushi @hmd_palangi @baharanm We present experimental results on CLIP, showcasing the method's effectiveness in mitigating spurious correlations. By leveraging the joint embedding space, our approach improves model attention without extra annotation data. 🚀📊(🧵7/8)
Tweet media one
0
0
3
@YUYANG_UCLA
Yu Yang
18 days
6/ 📊 Insecure Coding Results: On SecCodePLT's insecure coding benchmark, GPT-4o achieved a secure coding rate of 55%, outperforming CodeLlama-34B and Llama-3.1-70B. However, vulnerabilities like cryptographic weaknesses and input validation remain challenging across all models.
Tweet media one
1
0
3
@YUYANG_UCLA
Yu Yang
1 year
@baharanm 🔍 The Challenge: Previous coreset selection methods lacked convergence guarantees for (mini-batch) SGD. Non-convex loss and stochastic gradient require multiple coresets to bound the error. But how do we determine the optimal number and update timings? (🧵2/7)
Tweet media one
0
0
3
@YUYANG_UCLA
Yu Yang
3 months
@billyuchenlin @VirtueAI_co Thank you, @billyuchenlin ! 🎉 Indeed, we’re grateful to have such awesome advisors on our team. It's very nice that @VirtueAI_co is so research-oriented. We have many exciting projects going on!
1
0
1
@YUYANG_UCLA
Yu Yang
1 year
@besanushi @hmd_palangi @baharanm 💡 Unlike prior work, our approach leverages the 👁️ multi-modality of CLIP 🗣️. We automate spurious attribute detection and mitigation using language-based descriptions, reducing the need for human annotations and creating a more flexible debugging interface. (🧵4/8)
0
0
2
@YUYANG_UCLA
Yu Yang
3 months
@Yihe__Deng @VirtueAI_co Thank you so much, @Yihe__Deng , for the support and well wishes!
0
0
2
@YUYANG_UCLA
Yu Yang
11 months
🧐 So how do we approach this challenging problem? This is where our clear-cut quality indicators come into play. We intentionally 'corrupt' our data by distorting these indicators, e.g., by deliberately breaking the grammar. (🧵4/)
1
1
2
@YUYANG_UCLA
Yu Yang
1 month
@Yihe__Deng Huge congrats!!
1
0
2
@YUYANG_UCLA
Yu Yang
11 months
📈 Our tests showed up to a 3% performance improvement in code generation models by pruning 20% of data using SCIP. Plus, SCIP isn't just about better performance—it's also about efficiency. We achieved the same model performance with over 20% less training time. (🧵7/)
Tweet media one
1
1
2
@YUYANG_UCLA
Yu Yang
1 year
@besanushi @hmd_palangi @baharanm Since retraining from scratch is not practical, our approach addresses spurious correlations during fine-tuning via contrastive learning. 🔧 (🧵3/8)
0
0
2
@YUYANG_UCLA
Yu Yang
3 months
@tungnd_13 @VirtueAI_co Thank you, Tung! Still in internship mode to finish my PhD first, but excited for what's next!😆 This has been an incredible learning experience.
0
0
1
@YUYANG_UCLA
Yu Yang
11 months
@SharvenW Hi @SharvenW , thanks for the great question! You're right; data points can exhibit diverse changes and SCIP identifies overarching trends.
1
0
0
@YUYANG_UCLA
Yu Yang
10 months
@liujiashuo77 Thank you, Jiashuo! Congrats on your AISTATS acceptance!
0
0
1
@YUYANG_UCLA
Yu Yang
11 months
@_lewtun Hi @_lewtun , thanks for your interest! We chose code data for SCIP's debut for its well-defined quality indicators. SCIP can be used for other types like chat data, but we haven't tested it yet. We hope this research inspires further exploration in various data types!
0
0
1
@YUYANG_UCLA
Yu Yang
11 months
To identify low-quality language data, on one hand, we have clear-cut quality indicators, including grammar, spelling, and coherence. They are reliable and measurable. Yet we face challenges with subtle errors and the impracticality of manual inspection. (🧵3/)
1
1
1
@YUYANG_UCLA
Yu Yang
10 months
@khainb_ml Thank you, Khai!
0
0
1