Research Scientist, Mistral AI. Interested in LLMs, deep learning, fast nearest-neighbor search and privacy. ex: @Meta, @NYUniversity, @Polytechnique.
Our latest paper, ☢️ Radioactive data: tracing through training, is now on arXiv.
TL;DR: you can modify your data in an imperceptible way so that any model trained on it will have an identifiable mark. (1/7)
Our latest release @MistralAI: Mixtral 8x7B, a mixture of experts
- performance of GPT-3.5
- inference cost of a 12B model
- context length of 32K
- speaks English, French, Italian, German and Spanish
Blog post
"Large Memory Layers with Product Keys" with @GuillaumeLample, @LudovicDenoyer, Marc'Aurelio Ranzato and @hjegou
TL;DR
We introduce a large key-value memory layer with millions of values for a negligible computational cost.
1/2
We are hosting a research internship in differential privacy / privacy assessment of machine learning models with @hastagiri at Facebook AI Paris this summer. If you are currently pursuing a PhD and are interested in these topics, let’s get in touch!
We (@PierreStock and I) are hiring a Master's student for an internship in Privacy in summer 2022, with a potential CIFRE PhD following the internship. Details in the attached proposal.
Interesting paper by @thegautamkamath, @florian_tramer and N. Carlini. In particular, fine-tuning on ImageNet when you pre-train on 4B data sidesteps privacy in my opinion. Let's choose training from scratch as a reproducible benchmark for privacy 🔥
🧵New paper with Nicholas Carlini & @florian_tramer: "Considerations for Differentially Private Learning with Large-Scale Public Pretraining."
We critique the increasingly popular use of large-scale public pretraining in private ML.
Comments welcome.
1/n
A 12-layer memory-augmented transformer outperforms a 24-layer transformer while being twice as fast.
Our key insight is to use product keys, which enable fast and exact nearest neighbor search and reduce the complexity from N to sqrt(N) for a memory with N values.
2/2
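The product-key trick above can be sketched in a few lines of numpy (hypothetical shapes and names, not the paper's code): a query is split into two halves, each half is scored against only n = sqrt(N) sub-keys, and the Cartesian-product structure makes the top-1 over all N keys exact.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32                          # sub-keys per half; full memory has N = n*n values
d = 16                          # dimension of each half-query
K1 = rng.normal(size=(n, d))    # first set of sub-keys
K2 = rng.normal(size=(n, d))    # second set of sub-keys
q = rng.normal(size=(2 * d,))   # query

q1, q2 = q[:d], q[d:]
s1 = K1 @ q1                    # n scores instead of N = n*n
s2 = K2 @ q2
# The score of the full key (i, j) is s1[i] + s2[j], so the exact argmax
# over all N keys is just the pair of per-half argmaxes.
i, j = int(np.argmax(s1)), int(np.argmax(s2))
best = i * n + j                # index into the N memory values

# Brute-force check against the explicitly materialized product keys.
full = (s1[:, None] + s2[None, :]).ravel()
assert best == int(np.argmax(full))
```

For top-k (k > 1) the same structure still works: take the top-k of each half and merge the k*k candidate sums, which remains O(sqrt(N)) up to the small merge.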
Wondered why Mixtral 8x7B Instruct with a 32K context wasn't summarizing a 16K text. The prompt started with an instruction to summarize the following text, but the model ignored it. Sliding Window Attention must have "unattended" my instructions? Set the window from 4K to 32K, et voilà, got the summary!
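The effect described above can be illustrated with a toy attention mask (a sketch, not Mistral's implementation; small sizes stand in for the real 32K/4K): with a sliding window, a late query position cannot directly attend a token at position 0, even though information can still propagate indirectly across layers.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # mask[i, j] is True when query position i may attend key position j:
    # causal (j <= i) and within the last `window` positions.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Toy stand-in: 8 tokens with a window of 4 (think 16K text, 4K window).
m = sliding_window_mask(seq_len=8, window=4)
assert not m[7, 0]                       # last token can't see the instruction at position 0
assert m[7, 4] and m[7, 7]               # it only sees the last 4 positions
assert sliding_window_mask(8, 8)[7, 0]   # full-length window: instruction visible
```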
We are hosting a research internship in differential privacy / membership inference with @PierreStock at Facebook AI Paris next summer (2022). If you are currently pursuing a PhD and are interested in these topics, let’s get in touch!
We are at #NeurIPS2019 this week to present our two papers on Product Key Memory Layers and Cross-lingual Language Model Pretraining, with @alexsablay and @alex_conneau.
Please stop by our posters, Thursday at 5pm! Spotlight presentations are at 4:20 and 4:40pm.
For example, if someone merges your data with a 100x larger training set, you can predict with 99.99% confidence that the model was trained using your data. (2/7)
@gabrielpeyre Isn't this missing additive/multiplicative constants? If I remember correctly the estimator stems from approximating the density by balls around each point that extend to its nearest neighbor (and log(Volume) = C_1 + d * log(distance))
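Spelling out the approximation in that reply (a sketch of the 1-nearest-neighbor density estimator, with $r(x)$ the distance to the nearest neighbor and $V_d$ the volume of the unit ball in dimension $d$):

```latex
\hat p(x) \;\approx\; \frac{1}{n \, V_d \, r(x)^d},
\qquad
-\log \hat p(x) \;\approx\; \log n + \log V_d + d \log r(x)
\;=\; C_1 + d \log r(x).
```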
@arthurmensch It’d be nice if we could set up bullet points as reviewers such that authors can answer below each bullet point (with LaTeX formatting à la MathExchange) and a budget for the total number of characters in the answers.
Here's how it works: for each class, we sample a random direction in the feature space (the carrier), and modify pixel values so that the features of each image of this class move in the direction of the carrier. (3/7)
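A toy numpy version of that step (a sketch under simplifying assumptions: the paper optimizes pixels through a differentiable feature extractor, while here we shift the features directly):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                  # feature-space dimension
carrier = rng.normal(size=d)
carrier /= np.linalg.norm(carrier)       # random unit direction for one class

feats = rng.normal(size=(100, d))        # features of this class's images
eps = 0.5                                # small, "imperceptible" budget
marked = feats + eps * carrier           # move every feature along the carrier

# The class mean now has a measurable component along the carrier direction.
shift = (marked.mean(0) - feats.mean(0)) @ carrier
assert abs(shift - eps) < 1e-9
```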
Even if you retrain a model *from scratch* on these images, we can match the feature spaces, and observe that the classifier will be aligned in the direction of the carrier. This also works across architectures, and with different datasets. (5/7)
Since the carriers are chosen randomly, we can compute the probability that the classifier aligns with the carrier "by chance" (i.e. the p-value), and show that it is very low (10^{-4} with only 1% of radioactive data). (6/7)
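The p-value argument rests on cosine similarities with a random direction concentrating near 0 in high dimension. A hedged Monte Carlo sketch (illustrative numbers, not the paper's statistics):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
w = rng.normal(size=d)                   # classifier direction (fixed)
w /= np.linalg.norm(w)

# Sample many random unit carriers and record their cosine with w.
carriers = rng.normal(size=(100_000, d))
carriers /= np.linalg.norm(carriers, axis=1, keepdims=True)
cos = carriers @ w

observed = 0.2                           # hypothetical alignment measured on a model
p_value = float(np.mean(cos >= observed))
# In d=512 the cosine concentrates around 0 with std ~ 1/sqrt(d) ≈ 0.044,
# so an observed cosine of 0.2 is ~4.5 sigma: essentially never by chance.
assert p_value < 1e-3
```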
If you train a classifier on top of these radioactive images, the classifier will align to the carrier direction as it is correlated with the class label. This works even if a very small part of the training data is radioactive (1%). (4/7)
@ThomasScialom @ClementDelangue @YJernite Depends what kind of harm and limitations I guess?
Typically radioactive data (and other methods) allow you to mark data such that the mark is propagated to the trained model. @pierrefdz might know more
@TLesort @JiliJeanlouis I'd tend to say that if you have a generative model p(x) you can just measure -log p(x) on your new data. You can also measure -log p(y|x) if you only have p(y|x), which should give you a "weak indication"
@hiddenmarkov @vitalyFM You can have it for toy examples (e.g. Gaussian data), and approximately with SGLD. A more complete picture is in the "privacy for free" paper () (2/2)
@DanielOberski Differential privacy will degrade into group privacy, so you would need a very small epsilon to protect against radioactivity (on the order of 1/n_radioactive)
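The arithmetic behind that reply, as a hedged numerical illustration (the 1% / 50k numbers are hypothetical): pure (ε, 0)-DP protects a group of k records only with parameter kε, so hiding n_radioactive marked samples requires ε on the order of 1/n_radioactive.

```python
# Group privacy for pure (epsilon, 0)-DP: a group of k records is only
# protected with parameter k * epsilon. To keep the group-level guarantee
# meaningful (say k * eps <= 1) against n_radioactive marked samples, the
# per-record epsilon must be on the order of 1 / n_radioactive.
n_radioactive = 500               # e.g. 1% of a 50k-image dataset
target_group_eps = 1.0
per_record_eps = target_group_eps / n_radioactive
assert abs(per_record_eps - 0.002) < 1e-12   # far below the usual eps in [1, 10]
```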
@thegautamkamath Where do you draw the line for tools? I regularly use a thesaurus to find synonyms and expect to use chatGPT for stuff like that in the future (i.e. minor rewritings), but I feel like reporting the use of chatGPT makes it sound like it's writing entire sections of the paper...
@bozavlado @giffmana Great point, now I can't unsee a world of weirdly shaped furniture that keeps breaking... but I believe similar effects come into play (more value to recommendations from people you know or people who have a reputation for making good DIY tutorials)
@CarTrawler We rented a car from Recordgo through your services but they were unable to deliver it (they wanted us to pay €120 extra in insurance because the deposit “did not work”). We would like to get a refund; how should we proceed?
@dhuynh95 @huggingface This measure captures both the capacity of the model and true memorization, right? Like if I prompt "The solution to x^2-2=0 is", a model that says "±sqrt(2)" can actually be solving the problem rather than memorizing from the training set.
@francoisfleuret That works for verbatim text (I think there are roughly 2 nats per token to hide the watermark, assuming you don't do top-k/top-p), but is this robust to people slightly modifying the text afterwards (like adding/removing or changing a few words)?
@shortstein If you have a 50% prior that a sample is in the training set, the posterior is bounded by 0.5 + ε/4. My rule: make DP non-vacuous by having ε < 2
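The arithmetic behind that bound, sketched out: with a 50% prior and a likelihood ratio capped at e^ε by pure (ε, 0)-DP, Bayes gives a posterior of at most e^ε/(1 + e^ε), whose first-order expansion around ε = 0 is 1/2 + ε/4.

```python
import math

def posterior_bound(eps):
    # Bayes with a 50% prior and likelihood ratio at most e^eps:
    # P(member | output) <= e^eps / (1 + e^eps)
    return math.exp(eps) / (1 + math.exp(eps))

for eps in (0.1, 0.5, 1.0, 2.0):
    exact = posterior_bound(eps)
    approx = 0.5 + eps / 4        # first-order expansion around eps = 0
    print(f"eps={eps}: exact={exact:.3f}, linear approx={approx:.3f}")

# At the eps < 2 rule of thumb, the posterior stays below ~88%.
assert posterior_bound(2.0) < 0.881
```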
ExpandedWeights have been added to Opacus. Simply stated, it creates a virtual 'expanded' weight whose first dimension is the batch size, so that each element of the batch has its own corresponding weight. (2/5)
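A numpy illustration of the idea (a conceptual sketch, not the Opacus implementation): for a linear layer, the per-sample gradients factor as outer products, which is exactly what a batch-indexed "expanded" weight makes explicit.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_out = 4, 3, 2
x = rng.normal(size=(B, d_in))         # batch of inputs
g = rng.normal(size=(B, d_out))        # upstream grads dL/dy, one per sample

# Per-sample gradient of a linear layer y = x @ W.T:
# dL_i/dW = outer(g_i, x_i) -> shape (B, d_out, d_in): one "weight" per sample,
# the quantity DP-SGD needs in order to clip each sample's contribution.
grad_samples = np.einsum('bo,bi->boi', g, x)

# Summing over the batch recovers the usual aggregated gradient.
assert np.allclose(grad_samples.sum(0), g.T @ x)
```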
@thegautamkamath I'm also interested in this. So far my impression is that 1) gradients are seen as much more obfuscated than they actually are and 2) FL updates have this "low bandwidth" feeling, like those "anonymized" computer crash reports
@MonniauxD @ahcohen Meeting other researchers? Especially for younger researchers who are new to the field. But the carbon footprint could definitely be reduced.
@mrdrozdov Isn't `` robust to context? Regarding compression, I'd argue that most of the size is probably the model weights, which shouldn't be compressed?
@ChSimonSU @RauxJF I don't know whether chatGPT is trained on many French legal texts; in my opinion there is good room for improvement by fine-tuning it on French statutes, rulings, etc.
@deliprao You can also craft the compressed image such that its decompressed version will be radioactive (using a differentiable operator that approximates JPEG); this is similar to what we do in the paper for data augmentation
@giffmana I tend to think/hope that if you can fabricate news articles their value will just go to zero (nice writing is no longer "proof of work") and the model will shift back to trusted news sources (reputation/proof of stake)
@ABelgo_optimum @adelaigue There is also a bias toward things we need much less during lockdown (e.g. car manufacturers), on top of the short-term/long-term bias (we need new cars all the less because there is an existing fleet)
Functorch is also available in Opacus. Functorch is the equivalent of JAX in the PyTorch ecosystem. One way to use functorch is through the "no-op" GradSampleModule: Opacus relies on users to provide the grad_samples, but still takes care of the rest. (3/5)
@florian_tramer Anecdotally, I am not sure that the subset of Tiny Images is 100% "private", as it seems Carmon et al. used a model trained on CIFAR-10 to mine it.
@yoavgo @GuillaumeLample The alternative view is to see the memory as an approximation of a very large FFN (the first linear layer is approximated by a "product matrix", and the second linear layer corresponds to the set of values). 1/2