📣 Life update: I’ve joined OpenAI and am hiring researchers! 💥
I’m immensely grateful to all of my teammates at Google/DeepMind over the last ~5 years; you all have taught me so much. I’m excited to continue marching towards our shared mission to enable universal access to
Excited to share our newest work! 📝 Evaluation of LLMs is hard, especially for health equity. We provide a multifaceted human assessment framework, 7 newly-released adversarial datasets, and perform the largest human eval study on this topic to date. 🧵:
Excited to share the Med-PaLM 2 preprint! Physicians preferred Med-PaLM 2 answers over physician answers on eight of nine clinically relevant axes. Med-PaLM 2 also scored 86.5% on the MedQA licensing-exam-style benchmark (SOTA), 19% over Med-PaLM. 😁
Excited to share wider availability of our medical LLMs. It's been an exciting arc, from training the first Med-PaLM model this time last year, to Med-PaLM 2 and our trusted tester program just a few months later in April, and now more availability. Thanks to the team!!
📢 Big #HealthAI news: Our latest medically tuned model is here, and it's available to allowlisted @GoogleCloud customers. 🙌
Meet MedLM. It's a suite of models, built on Med-PaLM 2, that helps answer medical questions, summarize information, and more:
Today we announced our new medical LLM, Med-PaLM 2. On MedQA (USMLE), Med-PaLM 2 achieves accuracy of over 85%, going from a passing score to expert performance. Med-PaLM 2 beats our own previous SOTA by 18%.
With Tao Tu, @Mysiak, @vivnat, @AziziShekoofeh, @alan_karthi.
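As a rough illustration of how a multiple-choice benchmark like MedQA is scored, here is a minimal accuracy sketch. Note that `ask_model` is a hypothetical stub standing in for a real model call, and the sample item is invented for illustration:

```python
# Sketch: scoring a model on MedQA-style multiple-choice questions.
# `ask_model` is a hypothetical stand-in for a medical LLM call;
# real MedQA items have four to five answer options keyed A-E.

def ask_model(question: str, options: dict[str, str]) -> str:
    """Hypothetical model call; returns an option key like 'B'."""
    return "B"  # stubbed answer for illustration

def accuracy(dataset: list[dict]) -> float:
    """Fraction of questions where the model picks the keyed answer."""
    correct = sum(
        1 for item in dataset
        if ask_model(item["question"], item["options"]) == item["answer"]
    )
    return correct / len(dataset)

sample = [
    {"question": "Which vitamin deficiency causes scurvy?",
     "options": {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D"},
     "answer": "B"},
]
print(accuracy(sample))
```

Reported accuracies like 86.5% are simply this fraction computed over the full benchmark.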
Excited to share Med-PaLM Multimodal (Med-PaLM M), the first demonstration of generalist biomedical AI, a single model that can perform a range of biomedical tasks. Work from our fantastic team @GoogleAI @GoogleHealth @GoogleDeepMind.
Excited for this to come out! AMIE is a research system for diagnostic reasoning and conversations. In a double-blinded crossover study (kind of like a "medical Turing test"), it outperformed primary care physicians!
Happy to introduce AMIE (Articulate Medical Intelligence Explorer), our research LLM for diagnostic conversations. AMIE surpassed primary care physicians in conversational quality & diagnostic accuracy in a "virtual OSCE"-style randomized study. Preprint ➡️ (1/7)
Excited to announce our latest work exploring the potential of LLMs for differential diagnosis, including a human-in-the-loop study on real-world cases! See below thread for details. Grateful to work with such amazing teammates @GoogleAI @GoogleDeepMind @GoogleHealth.
*New Research Paper* - Diagnostic conundrums are an unsolved grand challenge for AI. We present a new research LLM optimized for differential diagnosis (DDx), tested in @NEJM challenges. Our LLM outperformed clinicians & other LLMs... (1/6)
Med-PaLM, a medical large language model from @Google, achieved a notable feat by exceeding the passing USMLE score early on. We were fortunate to have @thekaransinghal join us & deliver an insightful talk about Med-PaLM to our lab.
Catch Karan's talk at:
Through better alignment with the requirements of the medical domain, we also observe exciting improvements on other tasks, including long-form medical question answering.
Blog post:
We will share a preprint soon!
🪩 The @stateofaireport 2023 is now here.
Our 6th installment covers one of the most exciting years I can remember. The #stateofai report has everything you *need* to know across research, industry, safety and politics.
There’s lots in there, so here’s my director’s cut 🧵
2/ LLMs have immense potential to widen access to medical expertise, especially in global health settings. But without evaluation and mitigation of potential harms, these systems could widen persistent gaps in health outcomes. Existing tools for evaluation are limited.
3/ To bridge the gap, we developed a 3-part human assessment framework. We used multiple complementary methods, including a participatory approach, physician focus groups, actual Med-PaLM 2 failures, and iterative pilot evaluations to expand coverage across 6 dimensions of bias.
4/ We’re introducing EquityMedQA, 7 newly-released adversarial medical question answering datasets. They represent a portfolio of approaches for adversarial testing, including curation based on known issues, red teaming based on Med-PaLM 2 failures, and LLM-based generation.
5/ Finally, we applied our adversarial datasets and assessment rubrics to evaluate Med-PaLM 2. To increase coverage, we involved 806 raters across three rater groups: physicians, health equity experts, and consumers, for a total of 17k+ ratings.
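The multi-group rating aggregation in a study like this can be sketched as follows. The rater records, group names, and bias dimensions below are illustrative placeholders, not the actual EquityMedQA data:

```python
# Sketch: aggregating adversarial-eval ratings by rater group and
# bias dimension. All records here are invented for illustration.
from collections import defaultdict

ratings = [
    {"group": "physician",     "dimension": "stereotyping", "biased": True},
    {"group": "physician",     "dimension": "stereotyping", "biased": False},
    {"group": "equity_expert", "dimension": "stereotyping", "biased": True},
    {"group": "consumer",      "dimension": "withholding",  "biased": False},
]

def bias_report_rates(records):
    """Fraction of ratings flagging bias, per (group, dimension)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [flagged, total]
    for r in records:
        key = (r["group"], r["dimension"])
        counts[key][0] += r["biased"]  # bool counts as 0/1
        counts[key][1] += 1
    return {k: flagged / total for k, (flagged, total) in counts.items()}

for key, rate in sorted(bias_report_rates(ratings).items()):
    print(key, f"{rate:.2f}")
```

Comparing these per-group rates is one simple way to see how different rater groups surface different potential biases.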
We're accepting applications for a research intern @GoogleAI to work on a project applying large language models (LLMs) to medical AI! Please apply here and reach out once you're team matching.
@vivnat @alan_karthi @AziziShekoofeh
Now our preprint for Med-PaLM 2 is up: We see a 19% improvement on the USMLE-style task, and answers to consumer queries are preferred over physician answers across eight of nine axes studied (factuality, harm, bias, ...).
6/ Different datasets, assessments, and rater groups surfaced different potential biases, suggesting the importance of using multiple complementary approaches. We identified new potential harms not measured in our previous bias evals.
Most importantly, through instruction prompt tuning, it produced greatly improved long-form answers to consumer queries, often comparable to physicians'. 92.6% of Med-PaLM answers were aligned with scientific consensus, compared to 92.9% for clinicians (baseline model: 61.9%).
Our results indicate rapid progress towards physician-level performance in medical question answering, highlighting the importance of both evaluation frameworks and alignment of models to societal values as we think about potential real-world impact of this technology.
7/ Some other personally interesting bits: (i) LLM-generated datasets surfaced potential biases, although differently than manually curated sets, (ii) Med-PaLM 2 answers were preferred more often than either Med-PaLM or physician answers.
8/ While our tools can surface potential biases in LLM-generated answers to medical questions, further evaluation contextualized to specific clinical settings is needed to assess whether deployment of these systems promotes equitable health outcomes.
9/ We’ve included all EquityMedQA adversarial questions and assessment rubrics with the preprint. We hope that the broader health AI community builds on these tools to realize our shared goal of systems that promote high-quality healthcare for all.
When we started this work, we set out to better understand the potential of building safe foundation models for medicine. We put together MultiMedQA, a benchmark of 7 medical question answering tasks spanning medical exams, medical research, and consumer queries.
Interested in learning more about the latest research on ML and analytics on decentralized data? Join @EmilyGlanz, @MatharyCharles, @KairouzPeter, myself, and others on Nov 10th for the Federated Learning and Analytics Research Workshop. Register below:
Excited to push the forefront of multimodal LLMs for Medicine!
We previewed an ambitious generalist approach with Med-PaLM M last week as the first demonstration of a generalist biomedical AI system that flexibly encodes and integrates multimodal biomedical data.
We started our team to catalyze the medical AI community and work on building more steerable, safe systems in a context where safety matters, in partnership with researchers, physicians, policymakers, and others. We're excited to share this milestone on our journey.
Biomedicine is highly multimodal, and Med-PaLM M is a multitask, multimodal large language model that achieves performance near or exceeding SOTA on (visual) question answering, radiology report generation, genomics variant calling, and more.
Moving forward, as biomedical models become more capable, it becomes more crucial to measure and mitigate safety risks, including hallucinated medical information and harmful uses of biological knowledge. We’re excited about grounding our safety research in this setting.
Moving beyond automated evaluation is crucial for safe real-world impact. In human evaluation of generated radiology reports, clinicians preferred model-produced reports over radiologist-produced reports 40.5% of the time on average, suggesting potential future clinical utility.
When we observed model limitations, we worked with physicians to train Med-PaLM, a state-of-the-art large language model aligned to the medical setting. It surpassed the passing score on US medical licensing exam-style questions for the first time.
Med-PaLM 2 improves Med-PaLM across multiple-choice and long-form medical question answering by leveraging PaLM 2, domain-specific tuning, and prompting strategies (new: ensemble refinement). We provide an overview here:
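A minimal sketch of the ensemble refinement idea mentioned above: sample several reasoning chains, then re-query the model conditioned on its own sampled chains. `sample_completion` is a hypothetical stand-in for a model call, and the details here differ from the actual Med-PaLM 2 recipe:

```python
# Sketch of two-stage "ensemble refinement" prompting.
# `sample_completion` is a hypothetical stochastic LLM stub; a real
# implementation would call a model API with temperature sampling.
import random

def sample_completion(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical stochastic LLM call (stubbed for illustration)."""
    return random.choice(["Answer: B", "Answer: B", "Answer: C"])

def ensemble_refine(question: str, n_chains: int = 5) -> str:
    # Stage 1: sample multiple independent reasoning chains.
    chains = [sample_completion(question) for _ in range(n_chains)]
    # Stage 2: condition on the question plus all sampled chains and
    # ask for a single refined answer.
    refinement_prompt = (
        question + "\n\nCandidate reasoning:\n" + "\n".join(chains)
    )
    return sample_completion(refinement_prompt, temperature=0.0)
```

The intuition is that the second pass can reconcile disagreements among the sampled chains rather than relying on a simple majority vote.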
Physician eval shows Med-PaLM 2’s answers to common consumer medical questions were preferred over physicians' across eight of nine axes. For example, answers were preferred for alignment with medical consensus 72.9% of the time, and for better knowledge recall 80.1% of the time.