Impressive work to explain 1+ second write latencies with Kafka running on ext-4. But the best part is the solution --> use xfs. Back in the day that was also the solution for intermittent high fsync latency from the MySQL binlog with ext-2 or ext-3.
Like Google Domains, I was deprecated by Google in 2009. MySQL team, in Ads Eng, was ended because F1/Spanner was coming. AFAIK it took ~5 years to fully switch and there was a migration to MariaDB after I left. I am sure there are more stories but I was too busy at FB to learn.
My focus on specific systems (MySQL, RocksDB) meant I neglected my general systems perf skills. Working on that now by reading "Understanding Software Dynamics" by Richard Sites and I highly recommend it.
LeanStore is impressive. Hope it turns into a product. Regardless I appreciate how much effort has gone into it. Systems research takes a long time.
Source is here
Kyle does amazing work to make databases better.
I am less of a fan of the drive-by snark from smart people who read or browse his work. Being smart doesn't replace sweat equity.
Team Spanner: Spanner, @CockroachDB, @Yugabyte, @PingCAP
Team Aurora: PG/MySQL Aurora, AlloyDB, @neondatabase
Team Spanner is also DistSQL or NewSQL.
What is a better name for Team Aurora?
Neon is Postgres and OSS. When does Team Aurora get an OSS MySQL solution?
Someone is fixing MySQL replication at scale by replacing lossless semisync with Raft. I briefly worked on semisync and I am definitely not a dist sys expert.
Oracle has been a great owner of MySQL -- invested a lot, regular & stable releases, innovation continues:
* parallel replication apply, query, index create
* synchronous replication
* InnoDB compression
* scaling InnoDB on many-core
* Heatwave
...
TreeLine - interesting paper, although I disagree with the claim that the primary reason for RocksDB (LSM) is write efficiency. The primary reason was space efficiency, while write efficiency was a secondary reason.
#VLDB2023
4 of the top 5 (Oracle, MySQL, MSFT, MongoDB) have peaked for 12+ months (no growth, or slight decline).
Only Postgres continues to grow.
No shame in not being able to grow forever, but fun to see Postgres continue to adapt, innovate and thrive.
Interesting paper on databases & fast SSD from CIDR2020. I learned a few things and like how they presented results at a high level. Recipe for fast DBMS IO is: array of fast SSD, SW RAID, XFS, O_DIRECT, fdatasync and io_uring.
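A minimal sketch of only the fdatasync part of that recipe: append a block, sync, and record how long the sync took. The function names, block size and write count are hypothetical choices for illustration and are not from the paper. It does not use O_DIRECT (which needs aligned buffers) or io_uring (which needs a binding), and it falls back to fsync where fdatasync is not available.

```python
import os
import tempfile
import time


def time_fdatasync(path, n_writes=20, block=4096):
    """Append n_writes blocks, sync after each one, return sync latencies in seconds."""
    # os.fdatasync is not available on every platform; fall back to os.fsync.
    sync = getattr(os, "fdatasync", os.fsync)
    buf = b"x" * block
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for _ in range(n_writes):
            os.write(fd, buf)
            start = time.perf_counter()
            sync(fd)  # forces the data to stable storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


def run_demo():
    """Run the timing loop against a file in a temporary directory."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "sync_test.dat")
        lat = time_fdatasync(path)
        size = os.path.getsize(path)
    return lat, size


if __name__ == "__main__":
    lat, _ = run_demo()
    print("max sync latency (ms):", max(lat) * 1000)
```

The numbers depend on the file system and device, so running it in a directory on the storage you care about is more useful than the temporary directory used here.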
Postgres is bad for business when your business is finding perf regressions. Postgres 17beta3 looks great on a small server
* no regressions
* one read-only test is ~2X faster
* many write-heavy tests are ~5% to ~10% faster
This paper is worth reading and I look forward to more research in this space. We need better b-trees to navigate more of the read, write & space-amp tradeoffs explained in the Rum Conjecture paper.
Thank you Xiangpeng Hao and @badrishc
At #VLDB2024, check our Bf-Tree, our high-perf B-Tree design optimized for small key-values. It uses a mini-page abstraction to cache reads/writes and a variable-length buffer pool to maintain them. See and attend session C3 at 3:30pm today to learn more!
@cstross
@elonmusk
Excellent, now I must add seagull to my short list:
* duck - calm above water, furiously creating drama below water
* alligator -- big mouth to share ideas, short arms that can't reach keyboard to implement them
I co-presented a tutorial at SIGMOD. My part was a description of MVCC GC using Postgres, InnoDB and RocksDB as examples. By chance there is a proper paper in SIGMOD on MVCC GC and it is worth reading.
"Long-lived Transactions Made Less Harmful"
Old Postgres and old MySQL had similar performance on sysbench. But modern Postgres is usually faster than modern MySQL because Postgres has avoided CPU perf regressions over time.
Comparing MariaDB and MySQL with a CPU-bound Insert Benchmark on a new small server. The song remains the same ...
MySQL has big regressions over time
+ MariaDB does not
=
Modern MariaDB is faster than modern MySQL
@UMNComputerSci
@gregkh
I look forward to the post-mortem. Today a lot of time is being spent reviewing all of the previous commits from the UM research group.
Apparently @Yugabyte is telling the truth when they claim Postgres compatible. I was able to run the Postgres version of the Insert Benchmark without changes.
More Postgres tuning for the insert benchmark on a medium server with the database cached by Postgres. Reducing autovacuum scale factors to 0.05 helps a lot. Will now do IO-bound tests on this server.
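A rough sketch of why the scale factor matters, using the trigger rule from the Postgres docs: autovacuum vacuums a table once dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * number of tuples. The defaults are 50 and 0.2. The 100M row table is a made-up size for illustration, not a number from the benchmark.

```python
def vacuum_trigger(reltuples, scale_factor, base_threshold=50):
    """Dead tuples needed before autovacuum vacuums a table.

    Mirrors: autovacuum_vacuum_threshold
             + autovacuum_vacuum_scale_factor * reltuples
    """
    return base_threshold + scale_factor * reltuples


if __name__ == "__main__":
    rows = 100_000_000  # hypothetical table size
    for sf in (0.2, 0.05):
        print(f"scale_factor={sf}: vacuum after ~{vacuum_trigger(rows, sf):,.0f} dead tuples")
```

With the smaller scale factor, vacuum starts after about a quarter as many dead tuples have accumulated, so each vacuum has less to clean up.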
The big win for FB from RocksDB & MyRocks was less space amp (used half the space vs compressed InnoDB). Better write efficiency was nice, but not the big deal. Many papers get this wrong.
Citations:
My Twitter experience has been mixed lately but there are still a few bright spots:
1) engaging with the database community
2) following computer systems perf experts
There is much I don't know so it is great to learn from others here.
When things are CPU-bound, writes are a lot faster in Postgres than MySQL/InnoDB. Disabling the MySQL binlog makes MySQL faster, but that isn't safe (for primaries). Enabling fsync makes the difference larger (even after accounting for 2 vs 1 fsyncs/write).
Trying out Hetzner:
* 48 cores, 128G RAM, ~4T of storage
* includes all the HW counters (PMC) for perf
* a similar server from AWS and GCP costs ~10X more at list price or ~5X more from GCP if I commit to 3 years of usage.
Met with @mipsytipsy. Happy to learn about growth of @honeycombio. Years ago I pitched her on the benefit of staying at $bigTech. Clearly I know more about databases than business.
After much testing I might agree with OtterTune -- PG implementation of MVCC is a big problem. I like PG and hope this gets fixed. Perhaps I am doing it wrong, but this isn't an issue for MyRocks or InnoDB.
Search the post for "fairness"
I learn about new features by working near clever kernel people. Normally I just pitch io_uring, but perhaps sched_ext is the new kernel thing for me to pitch.
Which DBMS will use it first?
I am working on RocksDB (part-time contract at Meta). My current focus is universal (tiered) compaction and searching for CPU regressions. I see up to 20% more CPU/query from 6.0 to 6.26 for simple workloads. Finding CPU regressions after the fact is time consuming. If only ...
Pebble, an LSM for @CockroachDB, is interesting. Compared to RocksDB:
1) commit pipeline is simpler
2) IO throttling for flush and compaction is different
3) writes are not stalled when flush/compaction gets behind
This should be a blog post because it is an interesting read. I just wish we could settle on one name from: green threads, M:N, fibers, stackful coroutines, user-level threads, ...
Interesting paper from Nguyen and Leis on improving storage for LOBs. An LSM with key-value separation, like RocksDB's Integrated BlobDB, is likely to be the most performant solution today in a production-ready DBMS.
An interesting post on the use of MySQL and MyRocks at Quora. The author, Vamsi Ponnekanti, also created the online schema change (OSC) tool while at FB and long ago we were classmates at UW-Madison.
I published a summary of the insert benchmark vs a big server for MyRocks, InnoDB and Postgres. Two highlights from the summary:
* worst-case write and query response time is much better for MyRocks than for InnoDB or Postgres. This wasn't expected.
/1
Realized last night that I need to learn more about b-epsilon trees. Today I learned someone from my technical community is publishing a book that includes a chapter on it. So I ordered a copy.
Long ago Mike wrote a great paper on things an OS does that makes it hard to write a DBMS. If DBOS succeeds then I hope for a paper that explains how DBMS features make it hard to write an OS.
I am happy to see companies like @RocksetCloud leverage @RocksDB so they can focus on adding value higher in the stack. Just like FB was able to leverage LevelDB to start the RocksDB project.
I struggle to find this paper once per decade.
High Volume Trans. Proc. ... by Whitney, Shasha et al from HPTS 7 in 1997
Whitney went on to much success with kx and kdb.
Shasha continued with a remarkable research career.
Sysbench on a large server with high concurrency.
* MySQL 8.0 is faster than 5.6 for point queries and writes but slower for range queries
* Postgres is almost always faster than MySQL and often a lot faster
@mituzas
@micsolana
I am having a hard time understanding people who have a hard time understanding the impact of a struggling company having a jerk as CEO who doesn't understand systems yet is happy to pontificate about them & fire employees who correct him. Not sure this saves his investment.
Results from an in-memory sysbench benchmark to show the benefit of huge pages for Postgres and InnoDB. It helped Postgres a lot more (1.32X vs 1.06X). Perhaps I will explain why in future work.
Trie, skiplist, ART? What is best for the memtable? This paper does 3 things to make C* faster:
* reduces Java GC impact
* makes keys byte comparable
* uses a sharded trie (multi reader, single writer)
ForestDB also used a trie, and was write-optimized but never reached GA.
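To illustrate what "byte comparable" means, here is a minimal sketch that encodes signed 64-bit integers so that comparing the encoded bytes gives the same order as comparing the numbers. This is a generic technique (offset the value so the sign bit is flipped, then write big-endian) and is not claimed to be the encoding used by C* or the paper. The function names are hypothetical.

```python
import struct

_BIAS = 1 << 63
_MASK = (1 << 64) - 1


def encode_i64(v):
    """Encode a signed 64-bit int so bytewise order matches numeric order."""
    if not -_BIAS <= v < _BIAS:
        raise ValueError("value does not fit in a signed 64-bit integer")
    # Adding the bias maps [-2^63, 2^63) onto [0, 2^64), which is the same
    # as flipping the sign bit. Big-endian puts the most significant byte first.
    return struct.pack(">Q", (v + _BIAS) & _MASK)


def decode_i64(b):
    """Inverse of encode_i64."""
    return struct.unpack(">Q", b)[0] - _BIAS


def encode_pair(a, b):
    """Fixed-width composite key: orders by a, then by b."""
    return encode_i64(a) + encode_i64(b)


if __name__ == "__main__":
    values = [5, -1, 0, -(1 << 63), (1 << 63) - 1, 42, -42]
    by_bytes = sorted(values, key=encode_i64)
    print(by_bytes == sorted(values))
```

Once keys compare as raw bytes, a trie or any other byte-oriented structure can index them without calling a type-specific comparator on each step.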
Low-concurrency insert benchmark:
* Postgres is boring (no regressions)
* MySQL has CPU regressions from 5.6 to 8.0
* MySQL 8.0.20 was an exciting release
Results:
* MySQL -
* Postgres -
Interesting papers from @BU_DiSClab for VLDB:
* LSM Trees Under Memory Pressure
* BoDS: Benchmark on Data Sortedness
They also have a paper in progress on sortedness, OSM.
Papers:
@matthewokeefe1
@FranckPachot
So many teams at Google wasted much time building workarounds on top of BigTable to compensate for the lack of ACID and support their user-facing workloads. Spanner made things much better for them. Too bad those stories aren't told in public.
Read a great paper.
Dremel: Adaptive Configuration Tuning of RocksDB KV Store
Things I liked:
* used some knowledge of LSM (cost models)
* allowed for uncertainty to explore tuning search space
* reduced search space via "fused features"
Fun to see new R&D on in-memory sort -- 2 for merge sort, 1 for quick sort.
I hope to revisit work I did on sort long ago, but the bar has been raised over the past 20 years.
MySQL 8.0.20 looks interesting:
* full support for hash join so that "... MySQL no longer use BNL as a join strategy."
* more work on CATS locking for InnoDB
* binlog compression
* disable PK checks on replication apply
Not all SSDs can process TRIM as fast as you want so that deleting a large amount of data can stall read IO requests for many seconds. We need trimbench to document how devices behave during large deletes.
Modern MariaDB is (almost always) 10% to 30% faster than modern MySQL using sysbench, a cached database and a (new) small server because MySQL suffers from too many performance regressions over time.
Can someone save Twitter before the jerk ruins it? I use it to engage with systems and database communities and enjoy discussions with experts I would otherwise never encounter. No surprise, the site has been more error prone over the past week.
Fast writes on the primary need fast replay on the replica. Great progress in Postgres 15 on this, although the post wasn't clear on the implementation to get concurrent disk reads.
On sysbench with a cached database MyRocks uses more CPU per operation than InnoDB, thus InnoDB gets more QPS.
Conference papers should focus more on CPU read-amp with an LSM, as that is a bigger issue than IO read-amp.
Yet another great paper from the Leanstore people. Page writeback on fast storage isn't easy, especially for a DBMS designed when storage was slower
"Write-Aware Timestamp Tracking: Effective and Efficient Page Replacement for Modern Hardware"
#vldb2023
My summary of an interesting article.
The problem - if you are paying by the IO, then doing a lot of IO via EBS is expensive
The solution - figure out how to use local attached storage.
I look forward to reading this but UDB (MySQL + RocksDB) is the data store and TAO is the (very clever) cache.
"RAMP-TAO: Layering Atomic Transactions on Facebook's Online TAO Data Store"
Much detail, nothing but good news from Postgres:
* CPU overhead doesn't change much from v11 to v15
* A few things are much faster in v15 (full table scan, update the same row)
Context is: small server, low concurrency, in-memory
I am sharing notes on RocksDB internals as I read the source code. This one is about code that determines whether write stalls or slowdowns are needed.
Let me be pedantic:
1) Joins are expensive
2) A query that uses a non-covering secondary index does an index nested loops join
3) Let's ban such queries!
FB implemented OSC (Online Schema Change for MySQL) to make a few critical, large, busy indexes covering for frequent queries.
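A toy sketch of point 2, using dicts in place of indexes. With a non-covering secondary index, each matching index entry yields a primary key that must then be probed in the PK index, which is the inner side of an index nested loops join. A covering index already holds the needed column, so the probes go away. The table, column names and data are made up for illustration.

```python
# Toy "table": the primary key (clustered) index maps pk -> full row.
PK_INDEX = {
    1: {"id": 1, "user": "ann", "score": 10},
    2: {"id": 2, "user": "bob", "score": 20},
    3: {"id": 3, "user": "ann", "score": 30},
}

# Non-covering secondary index on user: each entry stores only the PK.
IDX_USER = {"ann": [1, 3], "bob": [2]}

# Covering secondary index on (user, score): the needed column is in the index.
IDX_USER_SCORE = {"ann": [(1, 10), (3, 30)], "bob": [(2, 20)]}


def scores_non_covering(user):
    """Outer loop over secondary index entries, inner probe of the PK index."""
    scores, pk_probes = [], 0
    for pk in IDX_USER.get(user, []):
        row = PK_INDEX[pk]  # one PK index probe per matching entry
        pk_probes += 1
        scores.append(row["score"])
    return scores, pk_probes


def scores_covering(user):
    """Answer the query from the secondary index alone, with no PK probes."""
    scores = [score for _pk, score in IDX_USER_SCORE.get(user, [])]
    return scores, 0


if __name__ == "__main__":
    print(scores_non_covering("ann"))
    print(scores_covering("ann"))
```

Both functions return the same scores; the difference is the number of PK probes, and in a real DBMS each probe can be a random read.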
When I read conference papers on LSM I often wish the paper didn't have an LSM overview. Reading the Tigger paper on using eBPF to build a DBMS proxy and the overview is excellent -- I needed that background info.
#vldb2023
I enjoyed reading "Optimizing Databases by Learning Hidden Parameters of Solid State Drives" and this blog post has a few comments and questions. I hope there is a sequel.
@pateljm
@uwKPark
@bpkrothGeek
I am starting to document regressions and sources of CPU overhead in MySQL and InnoDB.
First up, why does binlog_log_row use ~3X more CPU in 8.0 vs 5.6?