I'm excited to announce Semantra: an open source multi-tool for semantic search 🎉
- Launch a local search engine over text and PDF files
- Search by concepts/meaning
- Refine results via tagging and adding/subtracting queries
Try it out now 🚀📚🔍
New open source OCR model just dropped! This one by Microsoft features the best text recognition I've seen in any open model and performs admirably on handwriting.
It also handles a diverse range of vision tasks. You can play with it here:
This is incredible. IT CAN DO HANDWRITING RECOGNITION. I've been testing on some of the shakiest handwritten public records I have and I'm getting good results. This is a big deal for lots of journalism workflows. See this example 👇
The new Qwen2-VL-7B Instruct model gets *100%* accuracy extracting text from this handwritten document. This is the first open weights model (Apache 2.0) that I've seen OCR this accurately. (Thank you @fdaudens for the tip!)
Microsoft's new open source Phi 3.5 vision model is really good at OCR/text extraction — even on handwriting! You can prompt it to extract tabular data as well.
It's permissively licensed (MIT). Play around with it here:
Prototyping a real-time AI writing tool to show how large language models are essentially probability engines.
(thanks to @ggerganov's llama.cpp for enabling this to run rapidly on an 8GB RAM MacBook Air)
Some news: today is my first day as a senior machine learning engineer at @nytimes!
I'll be working on the new AI Initiatives team in the newsroom to prototype tools, shape standards, and aid reporting. Can't wait to get started!
I could not find a zoomable, explorable map of COVID-19 cases in the US, so I rolled my own:
Deeply indebted to @USAFacts for the data along with the zippy @sveltejs and #deckgl JS frameworks.
Introducing Textra, a free + open source OCR tool I created using Apple's new Vision API 🖼️✨📄
Textra runs on the command line and quickly/accurately converts PDF and image files to text (requires Mac OS 13+). Check it out!
#opensource
Apple's Live Text OCR is amazingly high quality and runs entirely offline. I spun up a quick demo of it transcribing a PDF of the Mueller Report page by page and outputting the transcript as it goes:
Update: Crosswalker is now open source! It's a general purpose tool for joining columns of text data that don't match perfectly.
🕸️ runs in the browser
🔒 keeps your data entirely local
😌 auto-saves your progress
In the works @WapoEngineering: a general text matching tool to help join columns of data when the names don't exactly match. We're using it for election precinct matching 🗳️
I'm really excited to open source the campaign finance toolset I've been working on with @WapoEngineering recently: a speedy C framework and CLI to transform raw FEC filings into CSV files! ⚡️
An interactive computational essay on sound using @observablehq notebooks! Learn about waveforms and musical pitch by visualizing and listening to sound functions.
Over the past ~3 years, I've been working hard on a complete rewrite of with
@muckrock
— a platform that lets you analyze, annotate, and publish document collections. Today, I'm proud to announce we publicly launched the new site and open sourced the code!
🎉 Some belated personal news: I'm leading backend/platforms engineering for elections @washingtonpost under the direction of @anthonyjpesce and @jeremybowers. Could not ask for better colleagues to work with!
Working on a side project to debug and visualize AWS step functions locally, and it's going well so far!
(This should make it easier to iterate, flow data, and see errors without having to deploy to AWS each time)
✨ Career update: today is my last day at @documentcloud. The past three years have been thrilling! I'm so thankful for the opportunity to work and grow with @muckrock and pilot incredible tech. In mid-July, I'll be joining @washingtonpost as a senior full-stack newsroom engineer!
Prompts I used:
"OCR this image and provide just the output text. Format it all as plain text" (for the handwriting example)
"OCR this image and provide just the output text. Format it all as markdown" (for the table extraction example)
A new version of Textra is out! Textra is a command-line tool to extract text from images, PDFs — and now audio. It runs on Mac OS 13+ and uses Apple's APIs for fast, on-device text extraction.
#opensource
@snappercayt
Interesting. The captioning/OCR seems to work better on smaller images with consistent text sizes. It could be useful to chain this model on top of another layout/bounding box model.
Also this isn’t my model — it’s Microsoft’s:
@simonw
I was thinking you could get good results extracting a table's layout using existing models / algorithms (some of which are not even ML-based!) and then feeding in image subsections to this kind of high quality OCR model
I'm working on an open source semantic search command-line tool — coming soon! 🔎
It analyzes text files you specify and launches a local web server to search them semantically — based on meaning and not exact word matches.
#ai
#nlp
#semanticsearch
🚨🗳️ Interested in working on elections at The Washington Post? Our Election Platforms team is hiring!
We build data pipelines, results page infrastructure, admin interfaces, and many special projects, and work with incredibly talented individuals:
Working on a new feature in Semantra (coming soon!): the ability to add and subtract semantic queries 🔥
It's fun to iterate with! Here's a section I found in the Mueller Report about caviar by searching for lavish bribery and positively/negatively tagging some search results.
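Adding and subtracting queries can be pictured as arithmetic on embedding vectors. Below is a minimal sketch of that idea, not Semantra's actual implementation: the 3-d vectors, the passage names, and the helpers `combine`, `cosine`, and `rank` are all made up for illustration, and a real tool would get its vectors from an embedding model.

```python
import math


def combine(weighted_queries):
    """Sum query vectors: positive weights add a concept, negative weights subtract it."""
    dim = len(weighted_queries[0][0])
    combined = [0.0] * dim
    for vector, weight in weighted_queries:
        for i, value in enumerate(vector):
            combined[i] += weight * value
    return combined


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vector, passages):
    """Return passage names ordered from most to least similar to the query."""
    return sorted(
        passages,
        key=lambda name: cosine(query_vector, passages[name]),
        reverse=True,
    )


# Hypothetical toy "embeddings"; axes loosely mean [luxury, crime, food].
passages = {
    "caviar gift": [0.9, 0.6, 0.8],
    "wire fraud": [0.1, 0.9, 0.0],
    "yacht party": [0.9, 0.1, 0.3],
}
lavish = [1.0, 0.0, 0.2]
bribery = [0.2, 1.0, 0.0]

# "lavish" + "bribery": both concepts pull the results toward them.
added = combine([(lavish, 1.0), (bribery, 1.0)])
print(rank(added, passages))

# "lavish" - "bribery": the second concept now pushes results away.
subtracted = combine([(lavish, 1.0), (bribery, -1.0)])
print(rank(subtracted, passages))
```

Positively or negatively tagging a search result could work the same way, by treating the tagged passage's own vector as one more weighted query.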
A very early in-progress demo of Semantra running entirely in-browser. No backend thanks to transformers.js!
The possibilities include being able to export document collections as static, hostable websites users can search without installing anything.
@morisy
@goodside
I think the most captivating thing about this AI is that trying to fool it has turned into a game, with particularly hilarious and witty bypass mechanisms. It's the challenge of getting into the prompt makers' minds and subverting their defensive plays.
I made a simple web application to study for the 100-question #USCIS US Citizenship Civics exam. It allows you to click/tap on questions to reveal answers and has a button to shuffle the order of the questions.
Try it out here:
We’re releasing all of our data on all 40,000+ properties identified using @NCOneMap data, along with our subsidiary lookup.
We’re hoping it will help the public, researchers & policymakers understand the scope of corporate homeownership
#securityforsale
Excited to be at #NICAR24 in Baltimore this year!
If you want to learn more about campaign finance analysis, come check out @ccemorse's and my free session (offered Thursday / Friday).
Hey, look, the handwritten document @bxroberts shared to benchmark an OCR tool I wrote 21 months ago (and that I subsequently used to eval visual language models) made it into @MistralAI’s demo deck
Very excited to launch a new flagship @documentcloud feature: selectable text! The viewer allows text to be selected, copied, and searched. The processing pipeline does OCR, extracts positional text, and grafts it back in to create a searchable/selectable PDF (extremely quickly).
Currently tinkering on a tool to interpret and understand ML/AI models. It operates on @PyTorch models and launches an interactive frontend to display the model architecture, run the models in real-time, and add visualization blocks.
@WapoEngineering
It's still very early stages, so look out for some fun next steps like making it a Homebrew and Python package along with more rigorous testing and documentation! But the word got out early
This is a great example of how large language models work probabilistically.
Once ChatGPT outputs a few states in alphabetical order, the probability that the next one will be "Connecticut" is high — even though it contains no 'a' — because it has been trained on data like this.
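Here's a toy sketch of that effect. It is not how ChatGPT works internally (real models use neural networks over subword tokens and long contexts); it just counts which item follows which in a made-up training corpus, using hypothetical helpers `train_bigrams` and `next_token_probs`, to show how the most familiar continuation ends up with the highest probability.

```python
from collections import Counter, defaultdict


def train_bigrams(sequences):
    """Count how often each item is followed by each other item."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Turn the follow-up counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {token: n / total for token, n in counts[prev].items()}


# Made-up corpus: alphabetical lists of states dominate, as they plausibly do online.
training_data = [
    ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut"],
    ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut"],
    ["California", "Colorado", "Connecticut", "Delaware"],
    ["Colorado", "Utah", "Nevada"],  # a single non-alphabetical list
]

counts = train_bigrams(training_data)
probs = next_token_probs(counts, "Colorado")
print(probs)
print(max(probs, key=probs.get))
```

Nothing in the counts knows about the constraint in the prompt; the model only knows what usually comes next.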
For the past year and a half we've been working on a new, faster, redesigned version of DocumentCloud. I'm excited to publicly demo the new beta tomorrow morning at #NICAR2020
Hey #NICAR20! Come get a first look at the new @DocumentCloud beta and a chance to get early access: Join @dylfreed’s session Friday morning at 9am to see how fast we’ve made it:
I wrote documentation for Semantra in hopes it will be serviceable. Please let me know if you have any feedback, encounter any issues, or have any suggestions/ideas!
Repo:
Tutorial:
Guides:
Semantra is built for those seeking needles in haystacks: journalists, researchers, students, and more.
I've found it useful personally across a wide range of content, including books, reports, speeches, and government documents.
Tutorial:
Looking forward to presenting at my first
@SRCCON
with
@whatuphails
! We'll be talking about breaking silos by open sourcing your newsroom's internal tools — Friday at 11:30am ET.
Themes include: community building, technical design, org buy-in, and more. See you there!
#SRCCON22
For those familiar with the brilliant 10+ year old legacy platform, this is a from-the-ground-up revamp for the 2020s. We're talking modern, mobile-friendly web app/embeds, robust serverless processing that crunches documents in < 1 minute, and advanced search/OCR/entity features
✨ Introducing Interpogate
A tool to visualize and inspect @PyTorch model architectures. It works in Jupyter notebooks/Google Colab, operates on a diverse range of models, and has a convenient API to attach hooks to observe model behavior in realtime.
Friends, the Election Platform team at The Washington Post is hiring a senior full-stack engineer! We're working on internal tools and architecture to power elections coverage that reaches millions of readers. Join us:
Also of note, FastFEC utilizes @ziglang for its C build system, which provides an extremely smooth cross-platform compilation experience. I look forward to using Zig (and admiring mascot Ziggy) for more things going forward!
@ggerganov
Thank you! It does a good job catching typos (and awkward phrasings, omitting needed words in sentences, etc.). It varies in effectiveness based on where the tokenization of a split word falls.
Re: source code, will look deeper! It catches typos here too/has decent suggestions
Working on a project of this scale has been daunting/thrilling. I'm incredibly thankful for the leadership and vision from folks like @morisy, Mitch, @pilhofer, board members, and @documentcloud originals @knowtheory + @jashkenas + others. Excited to continue iterating from here!
Giving my first webinar today on the new @DocumentCloud beta! The revamped doc platform is fine-tuned for breaking news with the fastest PDF processing out there. 🏎️🏎️🏎️
Join me on Zoom at 3pm Eastern/12pm Pacific – it's free and open to anyone
The newly launched @arcprize challenges A.I. progress, claiming state-of-the-art models are essentially pattern matching at scale and unable to actually acquire new skills.
They demonstrate this via a benchmark that's... a surprisingly fun puzzle game
Here's an example using Semantra on a collection of US inaugural speeches. You can play with this document collection in the tutorial
After downloading the documents, analyze them all at once with:
```
semantra us_inaugural_speeches/*.txt
```
We've been in beta for a while so are still improving the homepage and expanding our documentation, but the platform is open and all users/orgs have been migrated over. We're powering millions of documents and 10s of millions of monthly embed page views
This was incredibly fun to work on! Thanks to Jon and Sam @ReutersGraphics for instilling the value of adding a dangerous dose of creativity on deadline.
(I had an inkling the math would check out but still feel like we got lucky the bracket order didn't break while revolving!)
This slick bracket is Dylan's main contribution to the homepage, and it's easily the best part of the app. Lotta trig work went into it, but in keeping with a laconic East London style, we called it The Revolver.
@dangerscarf
I'm doing a talk that might be more in this spirit for SRCCON in October, with @kat_alo!
Not planning for anything too bellicose, but more like let's be open, talk through problems/solutions from various perspectives, and come up with some ideas together.
I will miss my colleagues at @washingtonpost and am thankful to the news eng / elections team for supporting my development and encouraging open source projects over the past three years. Here's to finding future avenues for collaboration!
And speaking of Apple's text extraction APIs, here is its speech recognizer running on a 9 hour meeting from the California Coastal Commission. This one also works offline — is anyone using it to freely transcribe interviews?
Thanks to @mfederis and the @PeninsuPress, the story @jackie_botts and I wrote on language access issues for undocumented immigrants after the Sonoma County wildfires is on the homepage of
Election Platforms helps architect and build the underlying tools behind elections to make future election nights easier and less stressful for everyone.
As someone who's spent a lot of time with Tesseract (the leading free, open source OCR library), I can't help but be excited by the quality/speed improvements Apple's closed source solution seems to offer. Now to think about building an open source command-line tool around it 🤔
@Ethan_Connelly
Thanks! A key difference is that this can run entirely offline on your own computer for free. Instead of trying to provide a chatbot experience, Semantra provides a human-in-the-loop interface on top of semantic search
Excited to share a virtual reality experience about ocean debris featuring the incredible photography of @plasticpieces. Done in collaboration with my teammates at @StanfordJourn! Dive into the experience on computer, phone, or VR headset at
An interactive computational essay on how code formatters work! Just in case you're curious or thinking of making a programming language.
@observablehq
@Dan_Jeffries1
> Nobody seems to be able to recreate the verbatim output with the BS prompts they provided.
You can trivially reproduce exact articles word-for-word from a variety of sources with the legacy completion model and GPT3.5
Thank you to @morisy and Mitch for leading such an innovative and supportive team. If you're looking to work with passionate folks, have full remote flexibility, and helm a widely used and respected product, please apply. It's an incredible opportunity!
Are you a software developer wanting to help power journalism, transparency & accountability for millions of people each month? A data journalist who wants to shift to building apps & managing a platform? We're hiring!
Just published my personal website, which details some of my projects in code, journalism, and music. (I used the new web app framework Sapper to make the site quick and snappy @sveltejs)
Excited to share that our live German election results page is up and running! 🔥 It's the first election I've helped work on — and the Post's first live results page for an international election 🇩🇪
Who said “Dank Learning” was just a viral research paper? Happy to say my Stanford roommate and I have turned his scholarship into an iPhone app that generates memes with AI.
Original paper:
(p.s. we don’t endorse offensive memes)
I've updated with the latest data from @nytimes.
New features:
1⃣ Auto-updates with live counts for the current day
2⃣ 🇵🇷 Puerto Rico added 🇵🇷
3⃣ Type 'c' to show counties with more new cases/deaths compared to the previous day (toggleable in settings)
@simonw
@bxroberts
@WapoEngineering
We had an initial version of the tool that looked similar to @simonw's. The key innovations of the new algo: 1) breaking apart the names into alphanumeric parts, 2) identifying perfect part matches, and 3) minimum edit distance on permutations of the remaining parts
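A rough sketch of how those three steps could fit together. This is an illustration under assumptions, not Crosswalker's actual code: the function names, the `max_perm_parts` cutoff, and the sample precinct names are all hypothetical.

```python
import re
from itertools import permutations


def split_parts(name):
    """Step 1: break a name into lowercase alphanumeric parts."""
    return re.findall(r"[a-z0-9]+", name.lower())


def edit_distance(a, b):
    """Levenshtein distance between two strings, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def match_cost(left, right, max_perm_parts=6):
    """Lower cost means a better match between two names."""
    remaining_right = split_parts(right)
    remaining_left = []
    # Step 2: cancel out parts that match perfectly, regardless of order.
    for part in split_parts(left):
        if part in remaining_right:
            remaining_right.remove(part)
        else:
            remaining_left.append(part)
    target = " ".join(remaining_right)
    # Assumed safeguard: skip permutations when too many parts are left over.
    if len(remaining_left) > max_perm_parts:
        return edit_distance(" ".join(remaining_left), target)
    # Step 3: minimum edit distance over every ordering of the leftover parts.
    return min(
        edit_distance(" ".join(order), target)
        for order in permutations(remaining_left)
    )


def best_match(name, candidates):
    """Pick the candidate with the lowest match cost."""
    return min(candidates, key=lambda candidate: match_cost(name, candidate))


print(match_cost("Ward 3 - Precinct 12", "PRECINCT 12 WARD 3"))
print(best_match("Twp. Lincoln 4", ["Lincoln Township 4", "Washington Township 4"]))
```

Canceling exact part matches first keeps the permutation step small, since only the leftover parts need to be reordered and compared.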
@palewire
@MeghanHoyer
I’d love to partner with journalists interested in extracting information from tranches of documents and put together more training materials.
For myself I’ve had some fun learning from old classics and public domain books — it’s a really unique way to get to know a work!
Though it is unfortunate that only large, well-funded tech companies can afford to build very high-quality OCR models. And they are almost always closed source 💸
Fascinatingly basic hallucination. Every time you resample you seemingly get a different random answer? (I had expected that they provide the time in the system prompt or something.)
@simonw
@bxroberts
@WapoEngineering
Yes, it's a single page static application. So along with open sourcing it soon, we'll have a public nice URL for it too :)
.@lennybronner and his team used Crosswalker during the 2022 Midterm Elections to match precinct names. It was helpful to quickly construct crosswalks for The Washington Post Election Model when states released new precinct names just hours beforehand.
@hodgesmr
@simonw
Yea I also read it that way. Thaler was requesting the AI be credited as author, or as a hired contractor — both of which go against the clearly established human copyright/contract holder precedent. I’d imagine most prompt-design AI art could be argued to be transformative.
My @figma widget was approved! Search “code editor” in widgets or check the link below to try it out. (I mainly made it because I wanted to diagram APIs in 2d space with @typescript interfaces)
#WeekendProject
Experimenting with @figma's new widget API to make a code editor. This uses @codemirror under the hood and outputs Figma components. Plus there are a few style options 🎨
✨ Work-in-progress source code here:
Thanks to @sveltejs's SvelteKit for powering the server, the Mermaid library for providing quick flowcharts, and @typescript for making it easy to model the data
@emilymbender
@simonw
@knowtheory
@vlordier
@robroc
@mtdukes
Just to throw more thoughts into the void: humans constantly seek "magic" to enrich their lives. Its pursuit drives innovation. But I'd liken AI more to "alchemy" than anything — conflating machine outputs with human thought/intention/art/truth is an impossibly twisted pursuit.