Qwen-2.5 on WebGPU 🏎️
• 42 tok/sec for Qwen2.5-Coder-1.5B on Mac ⚡
• Powered by MLC WebLLM and WebGPU 🔥
Watch Qwen2.5-Coder-1.5B build a website entirely in the browser!
Released a free tool on ChatDB: Parquet AI
Query parquet files with natural language in the browser.
◆ Powered by @duckdb in the browser
◆ LLM from @GroqInc and Llama-3-70B
Here's me querying the capybara-dpo dataset from @huggingface
Excited to release Natural-SQL-7B!
A new, very strong Text-to-SQL model and my best fine-tune yet. You can see how it does on the SQL-Eval benchmark by @defogdata
NEW SQL Console on @huggingface Datasets Viewer 🤗
🔸Run SQL on any public dataset
🔸Powered by @duckdb WASM running entirely in the browser
🔸 Share your SQL Queries via URL with others!
More coming soon!
In 2024, it takes 5 min to write some SQL and Markdown and deploy a beautiful dashboard.
Here's a dashboard of @huggingface hub stats powered by @evidence_dev and @duckdb
I feel like the new CodeQwen-1.5 isn't being talked about enough for coding. Punching high above its weight for a 7B.
Outperforming DeepSeek is impressive in itself.
Not many people know that every @Gradio app that is a @huggingface Space exposes an API.
Here's all the code it takes for me to send requests to the Llama Guard space on ZeroGPU.
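For reference, the usual route is the `gradio_client` package, which wraps the HTTP API every Space exposes. A hedged sketch — the Space name and `api_name` below are placeholders; the real signature is listed under the Space's "Use via API" link:

```python
def moderate(prompt: str, space: str = "some-org/llama-guard-space") -> str:
    """Send a prompt to a (hypothetical) Llama Guard Space, return its verdict.

    Both the Space id and api_name are illustrative placeholders.
    """
    # pip install gradio_client -- imported lazily so the sketch stands alone
    from gradio_client import Client

    client = Client(space)  # resolves the Space's API endpoints
    return client.predict(prompt, api_name="/predict")
```

Calling `moderate("some user message")` would round-trip through the Space exactly as the web UI does.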
The HF Data Explorer is on the Chrome Web Store 🥳
Drop in SQL on top of any @huggingface dataset, powered by @duckdb WASM running entirely in the browser.
Here are some fun things you can do with it 🔥
DuckDB Snippet of the Week 🦆📊
One of my favorite functions that is now part of @duckdb 1.1.0.
You can plot beautiful histograms of different values with a single function in the SQL Console!
Wrote a blog post on how you can use the Datasets Explorer to find really interesting insights on @huggingface datasets 🔥
There are even a couple of examples of the @duckdb spatial extension with some geospatial queries 🌎
I have actually been getting noticeable traffic to @chatdb from the free parquet tools I made. Adding another free tool for reading parquet files in the browser
Testing it with some @huggingface datasets. It's all powered by @duckdb WASM 🔥
Artifacts pro tip:
If you are running into unsupported library errors with NPM modules, just ask Claude to use the cdnjs link instead and it should work just fine.
Will have a fun dashboard to share here soon with some interesting stats from the Hub:
@Gradio dominates 🔥. Custom Docker deployments and Streamlit share the rest of the pie.
What interesting stats would you like to know about the @huggingface hub?
It can be anything for:
- spaces
- models
- datasets
For example: ratio of different model licenses or most popular space sdks
Smol Instruct v0.2 is out! 🔥
You can run them in the browser on WebGPU with MLC WebLLM and Transformers.js, bringing blazing-fast intelligence to edge devices.
The new v0.2 model was trained with synthetic data from Llama3.1 70B and a few other datasets like OpenHermes-2.5.
Base Models can be more fun than instruct models sometimes 😁
Took like 30 minutes to make and is a blast to use! It uses the SmolLM-360M base model by @huggingface for suggestions
@yacineMTB
I think there are some other key factors too:
- Relative early age of the internet (Google etc in the 2000s)
- Engineers move into Product / Management quickly
Finally seeing some fruits of the 405B beast and the newer license with magpie-ultra
- 50k unfiltered rows
- Instructions for planning, reasoning, coding, math, etc. 🤯
Excited to see some high-quality synthetic data! Thanks @argilla_io
Fully local Instructor with speculative decoding and constrained sampling, running in-process so there's no network dependency, thanks to llama-cpp-python
Go follow the maintainer @abetlen, he's been cooking
Run Llama 3.2 in the browser on WebGPU 🔥
• Llama 3.2 (1B + 3B)🤏 🦙
• Running 100% locally in the browser at 62 tok/sec 🏎️
• Powered by MLC WebLLM + WebGPU ⚡
The Datasets Explorer now supports the latest version of @duckdb
You can summarize splits and see how much duplicate data exists with the SUMMARIZE command.
Data leakage and duplicate data can be pretty common among datasets. Here's a really good blog post on it by @BdsLoick
Two important things to look at for datasets:
• Leakage from data in train and test sets etc.
• Duplicate Data
You can check these very easily with a simple SQL query
I fine-tuned on ~20k Text-to-SQL pairs. I created my own dataset and really focused on:
- Tough questions (complex, multi-part)
- Multiple tables per schema (most datasets have 1-3)
- Many more columns per table (most datasets have 3-7)
This helped NaturalSQL excel
Wrote a blog post about remote Parquet files. Parquet files make up a lot of the datasets on the @huggingface hub.
🔸 HTTP Range Requests
🔸 Parquet Structure, Schema, and Metadata
🔸 What makes querying parquet files remotely so efficient
@dharmesh
I kinda like the UX of just forwarding the email instead of an app you have to upload to, etc.
Give the user value and just get out of their way.
We're launching a new @supabase service:
It’s like if ChatGPT and Postgres had a love-child: launch as many databases as you want, build them with AI, create charts, create embeddings. 100% open source.
Introducing: Querying CSV with SQL in the Browser 🤯
Get rich insights from large CSV files by querying them with SQL!
Built with
◆ @vercel / Next.js / next/dynamic
◆ @duckdb / WebAssembly
There's a new CSV Parser in town.
I've totally revamped DuckDB's CSV Parser. It has a bunch of cool optimizations, such as state-machine parsing, a new parallelism strategy, implicit casting, projection pushdown, etc.
All these improvements can result in a significant
Really liking the new @shadcn charts. Hacked together a quick space to try them.
Something really fun about dynamic, snappy experiences running entirely on the client.
Made something fun for creating cool react apps with Llama 3.1 405B with some leftover cloud credits.
Upvote / Downvote responses to help create the largest open React dataset on @huggingface 🤗
PGVECTOR IS NOW FASTER THAN PINECONE. And 75% cheaper thanks to a new open-source extension – introducing pgvectorscale.
🐘 What is pgvectorscale?
Pgvectorscale is an open-source PostgreSQL extension that builds on pgvector, enabling greater performance and scalability (keep
@abacaj
Finding good seed tasks is very important. For the second iteration of NaturalSQL, I was able to generate 30k very high quality pairs.
Creating seed instructions for non-coding datasets is still hard though
I can simply write SQL in a code block and throw the variable into a chart super easily, which is nice.
It's also great not having to build out charts by hand.
@rishdotblog
Super cool! How much worse would the accuracy be with that much quantization?
Compiled sqlcoder-7B for WebGPU recently to run in the browser, but accuracy seemed to suffer a bit.
Been hacking around with a dashboard on top of the hub stats dataset
- Powered by @duckdb WASM
- Charts by @shadcn
There are some other neat trends like spaces by sdk and licenses.
Really great read that goes in depth on synthetic data, working with small models, and the insights that went into creating data that wasn't too complex for small models.
It's Sunday morning, we have some time with the coffee, so let me tell you about our recent surprising journey in synthetic data and small language models.
This post is prompted by the coming release of an instant, in-browser model called SmolLM360 (link at the end)
The