Chris Riccomini

@criccomini

9,554
Followers
248
Following
735
Media
11,964
Statuses

Writer of books, code, checks, and newsletters

Sunnyvale, CA
Joined April 2009
Pinned Tweet
@criccomini
Chris Riccomini
30 days
New post! I talk with JP about FizzBee, TLA+, and writing stable software. This one’s got me thinking about FizzBee + DST/Antithesis.
0
6
24
@criccomini
Chris Riccomini
7 months
I got a chance to sit in on some @ycombinator pitches this week. A few thoughts: 1⃣ I have AI fatigue--SO MUCH. Very little of it is deep tech; mostly applying OpenAI FM to stuff. Investors in this space: I have no idea how you do this. I feel like there's a lot of $ to be lost.
17
36
824
@criccomini
Chris Riccomini
5 years
Successful intern projects: 1. High value if completed. 2. Low risk if not completed. 3. Able to finish in allotted time (2-3 months). 4. Exciting to work on and talk about. Anything else I'm missing?
44
36
532
@criccomini
Chris Riccomini
2 years
Embedded DBs are having a renaissance. RDBMS: SQLite; OLAP: DuckDB; Graph: KuzuDB; Search: Chroma. The developer experience is so good on these. Things just work. Really cool to see.
10
66
453
@criccomini
Chris Riccomini
5 years
My @InfoQ talk 🎙️ on the "Future of Data Engineering" is up! I cover the six stages of data pipeline maturity: 0. None 1. Batch 2. Realtime 3. Integration 4. Automation 5. Decentralization Check it out! 👀 (I'm so sorry for the link picture)
5
76
323
@criccomini
Chris Riccomini
2 months
It's out! I've been working with @paulgb , @vigneshc , the team @responsive_apps , and others to put together an LSM storage engine built on object storage. Contributors, users, and feedback would all be great!
12
45
314
@criccomini
Chris Riccomini
1 year
Some interesting infra projects: WarpStream, Turbopuffer, LanceDB, Neon, AWS Neptune, TigerBeetle, Modal, Materialize, Tabular (Iceberg), DuckDB/Motherduck, Arrow DataFusion/Substrait, gvisor, KIP-932 (Kafka), VeniceDB, Bauplan, Buf schema registry, Apicurio
19
22
275
@criccomini
Chris Riccomini
8 months
TIL about Apache DataFusion Comet. @apple has replaced @ApacheSpark 's guts with @ApacheArrow DataFusion. And they're donating it. 🤯 This is an alternative to @MetaOpenSource 's Velox Spark implementation. /ht @philippemnoel
6
57
264
@criccomini
Chris Riccomini
3 years
@sethrosen “Reddit’s database has two tables” “Instead, they keep a Thing Table and a Data Table. Everything in Reddit is a Thing: users, links, comments, subreddits, awards, etc. Things keep common attribute like up/down votes, a type, and creation date” 🥴
12
16
233
@criccomini
Chris Riccomini
11 months
This is the future. Kafka writing Parquet to S3 (via tiered storage). Instant data lake.
@gunnarmorling
Gunnar Morling 🌍
11 months
"KIP-1008: ParKa - the Marriage of Parquet and Kafka" That's an interesting proposal: writing #Kafka segments as #Parquet files. Can see the appeal for data lake ingest; wondering though how well the columnar file structure plays with Kafka semantics 🤔.
4
32
185
6
23
223
@criccomini
Chris Riccomini
29 days
Uber's actually doing the thing. If they keep going, this could be a first-class reference architecture.
1
24
222
@criccomini
Chris Riccomini
2 years
DBs are getting totally ripped apart right now and I love it. Query engines (trino, duck), storage (s3, gcs), and indexing (iceberg, hudi) all separate.
@gunnarmorling
Gunnar Morling 🌍
2 years
"Querying SQLite databases with DuckDB" Enjoyed watching this fast-paced video by @markhneedham demoing how to use #DuckDB 's query engine to run analytics queries against data in a #SQLite file. 5:50 well spent 🦆!
1
19
98
15
20
206
@criccomini
Chris Riccomini
2 months
Big news: I'm helping @martinkl with a second edition of Designing Data-Intensive Applications! An early release of the first 3 chapters is now available (O'Reilly Learning subscribers only at this point) and we're hoping to finish it next year.
10
38
202
@criccomini
Chris Riccomini
2 years
I'm open sourcing Recap, a dead simple data catalog for engineers! Unlike traditional catalogs, Recap is built to power infrastructure and tools that need metadata. Read the docs: Or dive straight into the Github repo:
18
22
202
@criccomini
Chris Riccomini
2 years
Can someone explain DuckDB to me like I'm five? It's not clicking for me. It's like SQLite, but for OLAP, right? What's the big deal?
22
24
196
@criccomini
Chris Riccomini
6 years
Martin Fowler's blog post on schemaless data is super helpful when framing a discussion about when, why, and how to deal with this kind of data.
2
57
188
@criccomini
Chris Riccomini
2 years
I spent some time today comparing common schema format compatibility (Avro, Protobuf, JSON schema, Parquet, Arrow, CUE, and ANSI SQL). If you're interested, my Google sheet is here: (It's hand-wavy in a few areas especially ANSI SQL, still useful).
7
32
179
@criccomini
Chris Riccomini
5 years
Good morning! 👋🌄 I've written down some thoughts on the next 2-3 years of data engineering. Would love to hear your thoughts! 😃
10
49
173
@criccomini
Chris Riccomini
1 month
Many of you know I've been angel investing for a while now. Today I'm excited to announce the next step: I've started Materialized View Capital. MVC is a micro VC fund that lets me continue doing what I enjoy, and take some friends along for the ride.
21
8
172
@criccomini
Chris Riccomini
3 years
DWH trends 🔭 * Realtime DWHs * Analytics Engineering * Data Mesh * Data Catalogs * Reverse ETL * Headless BI * Data Quality * Data Lakehouses * DataOps * Data Products So, yeah, I'm thinkin' we have enough work to fill the next 10 years in data infra/engineering.
5
27
165
@criccomini
Chris Riccomini
3 years
I am noticing a surprising dynamic. Remote work has actually made people feel MORE human, not less. I get to see their dogs, their kids, their work spaces. Interruptions during meetings are humanizing, not unprofessional.
4
13
153
@criccomini
Chris Riccomini
1 year
More interesting infra projects*: Responsive Nile DB Clickhouse Boiling data Quickwit Databend cr-sqlite Litestream Pravega Restate Inngest Bacalhau Roapi * I have $ in some of these (and some in the list below)
@criccomini
Chris Riccomini
1 year
Some interesting infra projects: WarpStream, Turbopuffer, LanceDB, Neon, AWS Neptune, TigerBeetle, Modal, Materialize, Tabular (Iceberg), DuckDB/Motherduck, Arrow DataFusion/Substrait, gvisor, KIP-932 (Kafka), VeniceDB, Bauplan, Buf schema registry, Apicurio
19
22
275
8
14
151
@criccomini
Chris Riccomini
2 years
This is slick: "a no dependency Python SQL parser, transpiler, and optimizer. It can be used to format SQL or translate between different dialects like DuckDB, Presto, Spark, Snowflake, and BigQuery." /ht 🐘 jwills@data-folks.masto.host
3
20
147
@criccomini
Chris Riccomini
3 months
"Object Storage Native" ... Ok, this is what I'm calling it now.. So many good things in this post. Here's one:
7
19
144
@criccomini
Chris Riccomini
8 months
The sheer number of subprojects coming out of (and because of) @ApacheArrow is pretty staggering. It reminds me of the Hadoop ecosystem circa 2010 (the good parts 😉). Comet, DataFusion, Ballista, Flight, Substrait (via @VoltronData )...
8
16
142
@criccomini
Chris Riccomini
3 years
👋 The project I've been working on is now open source! Open Robo-Advisor is a Python library that acts as an advisor 🤖 for passive indexing (think Wealthfront). It's very basic, but I wanted to get it out early. Check it out and send feedback! 👀
7
20
141
@criccomini
Chris Riccomini
3 years
It bugs me that people use the word “stack” for what the data ecosystem is right now. Stack implies some kind of order or hierarchy, but there isn’t one. We have the DWH and then everything else. It’s more like a graph… Or just a mess… </get off my lawn>
22
10
138
@criccomini
Chris Riccomini
21 days
New post is up! Next-gen infrastructure must support flexible deployments. Embedded, single-node, clustered, BYOC, SaaS, and self-managed. We're finally able to do this with one codebase.
14
17
138
@criccomini
Chris Riccomini
5 years
1/ Post-Map/Reduce (second generation) data processing systems (Spark, Flink, Dataflow, Samza) have been about unifying batch and streaming. @confluentinc (with Kafka streams, KSQL) is focused on unifying streaming and databases.
7
45
136
@criccomini
Chris Riccomini
9 months
Latest is out. This is a longer post that goes in-depth on new query engine layers like @ApacheArrow Data Fusion and Velox. Hot takes: - DWH commoditized - Kafka threatened - HTAP coming
5
31
137
@criccomini
Chris Riccomini
8 months
Databases are getting quite commoditized. - Velox/DataFusion/DataBend/Substrait/optd commoditize query engine - PostgreSQL commoditizes protocol and SQL dialect - S3/RocksDB/Arrow/Parquet commoditize storage layer WAL is next @jrdntgn talked about this, but it goes beyond perf
11
19
135
@criccomini
Chris Riccomini
8 months
I just got around to reading this. It’s really good.
1
19
135
@criccomini
Chris Riccomini
2 years
I don't think ETL/ELT is really a thing any more... between outbox pattern, CDC, kSQL, materialize, dbt, CDW, etc etc... data is extracted, transformed, and loaded all over the place... Maybe it always has been. Were we just lying to ourselves all these years? ETLELTETLELT....
16
18
132
@criccomini
Chris Riccomini
4 years
Recent data engineering themes in my Twitter timeline: * Data gateway/mesh * Data ops * ML pipelines * Compliance, privacy, deletion * Data catalogs Some great 🔗 below... 🧵 1/6
1
43
130
@criccomini
Chris Riccomini
1 year
I've got some news! I’m launching Materialized View, a software infrastructure newsletter. Sign up now to get software infra hot takes, projects, papers, developer interviews, stack deep dives, and more. First post coming soon.
15
24
131
@criccomini
Chris Riccomini
4 years
Workplace survival tip: don’t be good at things you hate doing. (Is this a tech brain tweet? 😛)
6
12
129
@criccomini
Chris Riccomini
1 year
Three companies with popular products that are under attack: - dbt labs/dbt - temporal - databricks/spark Tons of startups going after these.
23
2
127
@criccomini
Chris Riccomini
2 years
We implemented some parts of Chad's post at @wepayeng . Moira Tagle (Staff SWE @ WePay) wrote a schema checker that verified that all DB changes were bw/fw compatible before the change could be merged. 1/n
6
11
123
@criccomini
Chris Riccomini
8 months
Solid rainy afternoon read. "If SQL is considered a programming language, then relational databases function as virtual machines that execute SQL, similar to how the JVM executes Java."
3
23
119
@criccomini
Chris Riccomini
3 years
What’s the current state of the art for managing Python environments? venv seems dead. Should I be using conda?
63
3
118
@criccomini
Chris Riccomini
2 months
New post! We open sourced SlateDB a week ago. I wrote down some notes about its origin, what it's good for, and where it's headed. And, of course, the obligatory Github ⭐️'s vanity metric. 😃
3
28
119
@criccomini
Chris Riccomini
7 months
My latest post is up! Apache Kafka is an aging open source project. It's time to accept that Kafka's protocol is what matters.
15
19
118
@criccomini
Chris Riccomini
7 months
8⃣ I don't understand young founders with multiple failed startups that still have extreme conviction and boundless energy for their latest idea. This is just a mentality I don't have.
3
1
115
@criccomini
Chris Riccomini
3 years
Dmitriy ( @squarecog ) and I wrote a book for new software engineers. This is the book we've always wanted to give to new hires, the stuff tech leads and managers wish their new hires knew. It's available for pre-order today! Buy your copy now. 🛒
6
20
114
@criccomini
Chris Riccomini
4 months
Incredibly clear description of async IO from @kingprotty . This talk is worth watching regardless of whether you care about Zig or not.
0
19
113
@criccomini
Chris Riccomini
2 years
I think a lot of what’s happening in the data space right now is explained by looking through the lens of “data engineer” vs “analytics engineer”. Each has a different (but overlapping) set of tools/skills/responsibilities. I’m not convinced “analytics engineers” need to exist.
18
17
110
@criccomini
Chris Riccomini
8 months
New post! Picking at some of my stream processing scar tissue. Why Samza failed, how it led to Kafka Streams and Kafka Connect, and why I'm skeptical of Apache Flink.
6
13
110
@criccomini
Chris Riccomini
2 months
Part 2 of my real-time OLAP series is out now! Hope you enjoy. 😊 “For each of these use cases, there are a different but overlapping set of requirements. Each needs different query latency, data freshness, data correctness, and query throughput.”
1
8
104
@criccomini
Chris Riccomini
2 years
I got asked recently for some interesting projects in the data space. Here are some of my favs. 👇 @datafoldcom data-diff @inkandswitch cambria @TigerBeetleDB tigerbeetle @aerialfly buz @mycelial mycelial ( @sarahcat21 pointed me to .. all of these?)
4
15
103
@criccomini
Chris Riccomini
2 months
A post a month in the making! I'm finally writing about ClickHouse. 😀 I break down what makes it great and the challenges ahead.
3
9
102
@criccomini
Chris Riccomini
23 days
So if big data is dead, is big data integration dead as well? Wondering if ETL/data integration changes at all with this paradigm. Not sure it follows, but dbt + duckdb seems like a signal. (brain dump to follow) 1/5
6
15
104
@criccomini
Chris Riccomini
7 months
2⃣ Most silly ideas have been filtered out; everything is reasonable. Everything is very early. Consequently, most differentiation is around founding team, background, location, and where investors can add value.
2
0
99
@criccomini
Chris Riccomini
3 months
Took me all the way to Friday to get this one out. Started as a ClickHouse post, but felt I needed to motivate my perspective a bit. Now it's going to be 3 posts. 😅 A walk down memory lane: A brief history of Avatara, Apache Pinot, and Apache Druid.
4
14
96
@criccomini
Chris Riccomini
8 months
Hello, Monday. Here’s my latest Materialized View, just for you! “It's silly to have applications generate text-based SQL; they should be allowed to pass query plans to the database.”
4
12
96
@criccomini
Chris Riccomini
5 months
Here it is! A walkthrough of @lancedb 's Lance V2 and Meta's Nimble storage formats. There's a lot to like. I'm very bullish, though I do have a few concerns.
6
25
95
@criccomini
Chris Riccomini
10 days
This is a great post to share with that relative that keeps asking how LLMs work.
2
16
94
@criccomini
Chris Riccomini
3 years
Such a good post from @NotionHQ on their Postgres migration. * Origin of the name "shard" * Why they chose 480 logical shards * Migration process (double write, backfill, verify, switch) * Migration and verification written by different people
1
23
90
@criccomini
Chris Riccomini
1 month
We just released 0.2.0! 🎉 This release has: - in-memory block cache - on-disk object cache - garbage collection - compressed bloom filters Next up: admin CLI, range queries, and more cache improvements.
1
12
92
@criccomini
Chris Riccomini
2 years
What are notable public posts in the data engineering space over the last ~10 years? * * * What else?
12
12
89
@criccomini
Chris Riccomini
3 months
Woah.. TIL "my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset."
1
13
89
@criccomini
Chris Riccomini
5 months
It feels good to get this off my chest. "Notably, S3 has no compare-and-swap (CAS) operation—something every single other competitor has. It also lacks multi-region buckets and object appends. Even S3 Express is proving to be lackluster."
6
12
88
@criccomini
Chris Riccomini
4 months
Love this strategy from @databricks folks. Honestly, brilliant.
6
4
87
@criccomini
Chris Riccomini
2 months
“The CacheLib Caching Engine: Design and Experiences at Scale” Great paper. We’re debating whether to do an LOC cache (section 4 in paper) for SlateDB blocks or do object caching like Alluxio.
2
7
85
@criccomini
Chris Riccomini
3 years
I think centralizing transformation in the data warehouse is a dead end. We're doing it because it's convenient, not because it's right. The trend is to use the DWH more. We should instead be building the convenience of the DWH into the app and streaming layers.
18
11
85
@criccomini
Chris Riccomini
1 month
Put another way: Writing books. Writing code. Writing checks. Writing newsletters.
@criccomini
Chris Riccomini
1 month
Last 20 days for me: * Designing Data-Intensive Apps 2nd edition with @martinkl * SlateDB with @_RohanDesai , @vigneshc , @paulgb * Materialized View Capital with friends Feeling very lucky. And busy. 😃
1
1
57
1
8
84
@criccomini
Chris Riccomini
2 months
SlateDB 0.1.4 is out! Biggest update is that we now have real compaction thanks to @_RohanDesai ! Next release, SlateDB will have an in-memory block cache and an on-disk cache (inspired by JuiceFS, Alluxio, and Rockset). 🚀
2
6
83
@criccomini
Chris Riccomini
11 months
Ok, y'all. I'm pretty excited to get this post out. From @temporalio to an overflowing market, durable execution is having a moment. The space is too crowded and frameworks are hard to use. I talk about what needs to change.
6
16
81
@criccomini
Chris Riccomini
7 months
4⃣ Most startups are 2 founders. Nearly everyone is in SF.
3
0
79
@criccomini
Chris Riccomini
8 months
"Solving durable execution’s immutability problem" is a solid read. Their overview of the current state is really helpful. Sounds like changing code on long-running workflows remains difficult.
1
17
78
@criccomini
Chris Riccomini
15 days
SlateDB on 🍊 front page 😅
5
5
78
@criccomini
Chris Riccomini
1 year
I think not enough people know about Ambry, LinkedIn's open source BLOB store. @sriramsubram and his team built it and it's still actively worked on.
1
17
76
@criccomini
Chris Riccomini
6 months
Bombs away! 💣 Latency, cost, durability: pick two. "I recently began hacking on a project to test this theory out. The project—dubbed SlateDB—is a cloud-native log-structured merge tree (LSM) embedded key-value database."
4
19
75
@criccomini
Chris Riccomini
6 months
There are weeks where decades happen.. - @supabase @tembo_io @neondatabase go GA - @lancedb 's Lance2 unveiled - @ApacheArrow DataFusion graduates to TLP in Apache - @auto_dba emerges from stealth .. and what else? I feel like I forgot some stuff ..
3
11
74
@criccomini
Chris Riccomini
3 years
I've been thinking of a survey post on modern/next-gen analytics, including: * Headless BI * Analytics Engineering (DBT 'n stuff) * Data Mesh (I know..) * Reverse ETL * White-label data viz (a la @TopcoatData ) What else should I cover?
19
6
73
@criccomini
Chris Riccomini
5 months
I'm squarely in the trough of disillusionment with S3. It's really showing its age. - No preconditions/CAS (Literally everyone else) - No multi-region buckets (GCS) - No append (ABS) - S3E1Z is expensive and lacks a ton of S3 features
8
5
73
@criccomini
Chris Riccomini
2 years
My mental model for monitoring "data quality" has 3 different categories: · Equality checks (a la @datafoldcom ) · Assertions (a la @expectgreatdata ) · Anomaly detection (a la @anomalo_hq ) Does it match? Does it match my expectations? Does it look weird?
3
16
72
@criccomini
Chris Riccomini
4 months
Lots of activity around Kafka proxies: "Reliably Processing Trillions of Kafka Messages Per Day" "Enabling Seamless Kafka Async Queuing with Consumer Proxy" (2021) Kroxylicious, proxy for Apache Kafka
2
9
72
@criccomini
Chris Riccomini
4 months
My second post about data lakehouse catalogs is up. Trying not to be too salty. 😅 "Databricks and Snowflake are talking a big game. So far, they've given us empty Github repositories and rewrites."
5
8
71
@criccomini
Chris Riccomini
8 years
Really excited to share this post! We've been streaming MySQL changes into Kafka. Pretty neat stuff.
@wepayeng
WePay Engineering
8 years
How we stream database changes in realtime with @MySQL , @debezium , and @apachekafka #kafka #mysql #bigdata ..
0
31
41
2
39
70
@criccomini
Chris Riccomini
7 months
Shower thought: DuckDB is an edge database.
11
3
71
@criccomini
Chris Riccomini
2 years
Next up, "Serverless Computing: One Step Forward, Two Steps Back."
2
11
70
@criccomini
Chris Riccomini
7 months
3⃣ Most startups are somewhere between pre-revenue and 100k ARR.
1
1
69
@criccomini
Chris Riccomini
7 months
5⃣ I mostly sat in on B2B, SaaS, DevTools, and AI pitches. I was surprised (and excited) that several pitches were planning to sell straight to enterprise. Not just SMB/open source GTM.
1
0
69
@criccomini
Chris Riccomini
3 months
We had this exact problem at WePay. The table was called “disbursement_history”. Hot platforms & WePay’s own fee accounts were touched on almost every transaction. Led to a ton of prod issues. Really wish TB had existed then.
0
8
70
@criccomini
Chris Riccomini
6 months
Pretty brutal takedown of tiered storage from the @warpstream_labs folks.
@richardartoul
Richard Artoul
6 months
Tiered storage for Kafka is a classic tarpit idea. It makes all the sense in the world, but it doesn't work in practice. Check out our latest blog post to learn why.
2
10
82
7
8
68
@criccomini
Chris Riccomini
1 year
I'd love to see someone take DuckDB and use it to kill both ELK and Splunk. Put DuckDB everywhere, so it's more scalable than ELK and more cost efficient than Splunk. Anyone? Anyone?
10
2
66
@criccomini
Chris Riccomini
9 months
I had a random DB thought this morning: Are there any DBs that have an interface to send query plans over the wire rather than SQL? What I'm thinking is essentially a protobuf of something like a substrait plan.
16
3
66
@criccomini
Chris Riccomini
3 months
Digging into @ClickHouseDB a bit. It seems to fill a similar use case as Druid, Pinot, and Materialize. What are its differentiators? It doesn't have differential data flow (materialize) or startree indexes (pinot). They do have materialized views, though...
13
2
66
@criccomini
Chris Riccomini
1 month
New Friday, new post! Reflecting on the two monolith to microservice migrations I've survived, I think there's a better way.
1
13
65
@criccomini
Chris Riccomini
5 years
Comparison of the Open Source OLAP Systems for Big Data: ClickHouse, Druid, and Pinot
5
25
64
@criccomini
Chris Riccomini
3 years
Today marks my last day at @wepayeng . 🤗 Over the 6½ years that I've been at WePay, I've watched the engineering team grow from 20 to 250; I'm most proud of this. We built a talented team, but also a team with GREAT culture. I will miss them.
7
4
65
@criccomini
Chris Riccomini
1 year
Buried in today's @motherduck announcement was that they're already doing hybrid execution between local and remote DuckDB instances. The logical query plan is broken up between local and remote and executed seamlessly.
@criccomini
Chris Riccomini
1 year
@ananthdurai @__AlexMonahan__ @neelesh_salian @peterabcz @thetinot This. 👇 The (L) is local and the (R) is remote. (source: )
3
3
16
2
11
64