a short note of appreciation on this exceedingly beautiful introductory book on calculus.
I find this book a very friendly progression to the basics of calculus using a style that does not seem dry at any point in time. It starts with the basic philosophy and machinery needed
Reading the Adaptive Radix Tree paper - Having the realisation that trie is such an underrated data structure. Wondering why we should use any other data structure for in memory database indexes.
Here are some rough notes from the paper ..
Tree based indexes (e.g. binary search
Linux recently improved their page fault handling replacing Linked Lists and Red Black trees with a new data structure that offers better cache friendliness. This talk has some pointers to why linked lists fail on modern CPUs and what it takes to make a cache friendly data
a data structure book that teaches you how to design a data structure by looking at the underlying hardware, workload and access patterns, locality - sounds awesome .. (via
@eatonphil
)
"After reading this book, you will be able to reason about which
If you are a functional programmer (or you aspire to be one), one of the mandatory readings is the paper by John Backus, which was incidentally also his Turing award lecture.
Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs
Cassandra recently moved to trie based memtables from their current implementation based on a combination of concurrent skip list and a hierarchy of B-Trees ..
Rereading this paper which discusses the rationale and details of this transition. I shared this paper before on X.
Book Recommendation - A Theory of Objects by Martin Abadi, Luca Cardelli
This book develops a theory of objects as a foundation for object oriented languages and programming.
There are lots of literature and books on the well established theory of functions - the λ-calculus,
As someone interested in learning modern data structures and their applications in developing new storage engines, optimized in-memory indices, fast kv stores etc. I have started preparing a reading list, actually two :-).
The first one (permanently WIP) is related to some of
Production ready high performance data structures need to be close to the hardware and cache line optimized - here's the bloom filter implementation used in RocksDB. An amazing piece of engineering I should say ..
Today's discovery from my
@ManningBooks
shelf - The book Algorithms and Data Structures for Massive Datasets () which I purchased long back, contains an entire chapter on the B-trees, Bε-trees and LSM trees ..
A new data structure that can possibly replace the bloom filter in LSM based systems ..
Many LSM tree-based database systems use Bloom filters to reduce the number of unnecessary I/Os. They read the on-disk SSTable only when the associated in-memory Bloom filter indicates that
I see lots of books / materials on data structures and algorithms being discussed on X. But hardly I see Jeff Erickson's Algorithms book being there. I think it's a bit underrated, but I find it an extremely lucidly written exposition of some of the most complex stuff in the
Scala's immutable vectors are implemented based on the data structure described in this paper. A very nice read if you love data structures and would like to learn the implementation of collection.immutable.Vector in Scala.
RRB-Trees: Efficient Immutable Vectors by Phil Bagwell
B-trees are usually used for disk based storage. This document from ScyllaDB makes a strong case why B-trees make a good choice for in-memory collections as well
Another great post about how performance of an algorithm in a practical situation depends more on cache misses, vectorisation capabilities, instruction level parallelism etc. rather than Big-O. Here's the link -
The conclusion 👇 is 💯. ..
"A modern CPU
This monograph is a wonderful introduction to the world of data streams algorithms and applications. The chapters abstract all mathematical ideas, algorithmic techniques as well as lower bound approaches for data stream models that comprise together the foundation of the theory
another awesome book on data structures and algorithms -
Algorithms and Data Structures - The Basic Toolbox
- Kurt Mehlhorn and Peter Sanders
This is one of the books where I found B-trees being described as a generalization of (2,3) trees. And the book has an excellent formal
Nice post .. some more tidbits ..
One way to judge data structures today is to explore how they perform in the presence of the hierarchy of caches in modern CPUs. Skip lists are not known to be very cache friendly when implemented with naive memory allocation (things have
Redis uses skip-list to implement Sorted Set, but
@dragonflydbio
switched their implementation to B+ Trees ⚡
I spent some time this week to understand the real reason behind this decision. Going into depth just changed the way I looked at Data structures and their
Based on my recent post on how cache friendliness can impact data structure performance, this is a related presentation from Cliff Click on how your software can interact with modern hardware architectures ..
A Crash Course in Modern Hardware -
From the
One of my favorite papers that explains the basics of types, abstraction and polymorphism. My first understanding of the various types of polymorphism was from this sketch in the paper.
On Understanding Types,
Data Abstraction, and Polymorphism -
A fascinating talk on managing page faults in Linux and some of the developments going on by the kernel team.
The talk starts claiming that linked lists are an immoral data structure and if you are using them for anything for which you care about performance, you are committing
I like tree data structures and would love to read them all :-) .. Here's another of my favourite - the EBTree (Elastic Binary Tree), used to optimize performance and memory usage of the HAProxy load balancer scheduler.
It was invented by Willy Tarreau - here's from his blog
Typically when we talk about analysis of algorithms or data structures, we focus on the worst case analysis most of the time, and also on the amortized analysis in some cases (though I feel we should be talking more about the latter than the former, but that's the subject of
A few of the many good books (technology, math, computation) that I read in 2023. Not all of them are newly published ones, but just that I completed reading them in 2023 .. what’s yours list ?
Quite some time back I posted about my appreciation for a calculus book the calculus story - a mathematical adventure (). Recently I have been exploring another calculus book and loving it too. Again, the love is related to the way the author presents some
a short note of appreciation on this exceedingly beautiful introductory book on calculus.
I find this book a very friendly progression to the basics of calculus using a style that does not seem dry at any point in time. It starts with the basic philosophy and machinery needed
Dear database Xeeps - Most of the storage engines of today's databases use some forms of B/B+ trees, LSM trees or Bε trees in implementing in-memory and persistent indexes.
All of these implementations fall under the category of cache aware data structures following the
Another post on a data structure that I learnt in 2023. The Maple Tree, which is now being used in the Linux Kernel replacing linked lists and rbtrees in various subsystems.
I had earlier posted about a couple of talks on the Maple Tree ( and
A fascinating talk on managing page faults in Linux and some of the developments going on by the kernel team.
The talk starts claiming that linked lists are an immoral data structure and if you are using them for anything for which you care about performance, you are committing
One of my favourite books on Computational Geometry has a world of wisdom on data structures and algorithms - absolutely in love with this book ..
#weekendread
A very interesting talk from Microsoft Research on scaling write-intensive key/value stores based on LSM trees. Every LSM tree based storage also has a bunch of bloom filters. The talk suggests improvements in (mostly) write throughput through a discussion of 3 SIGMOD papers,
Effects and IO were something missing from Functional and Reactive Domain Modeling book. Started adding some IO based implementations of domain models discussed in the book. Keep an eye on
Sometimes I wonder why Okasaki’s Purely Functional Data Structures is not taught enough in graduate data structures courses. With multi core being ubiquitous shouldn’t we focus more on immutable data structures in a persistent setting ?
Quitting an enterprise Java job and investing in Scala and FP - opened a world of opportunities for me where I enjoyed the work, started speaking in conferences and learnt a tonne of stuff ..
a data structure with a great application story ..
This from the linked blog post ..
"The PruningRadixTrie is perfect for auto-completion or query completion. While 37 ms for an auto-complete might seem fast enough for a single user, it becomes a
After an amazing (almost) 5 years at Lightbend, I am looking for new challenges. My recent past experiences have mostly been Scala and FP but am willing to tread on new pastures as well. Here’s my GitHub profile .. . pls RT
Somehow didn't expect this, but pleasantly surprised that Database Internals
@therealdatabass
by
@ifesdjeen
contains a discussion of Fractional Cascading in the context of FD trees.
Fractional Cascading is one of my favourite data structure techniques, but possibly it is not
Very excited to join Conviva as Principal Engineer. Will be working on their next generation actionable insights based streaming platform. Some real good stuff on time state analytics brewing up there ..
Cassandra also used to do the same until recently when they switched to a trie based data structure for Memtable. This paper () describes the rationale of moving away from a comparison based data structure ..
RocksDB uses skip lists for its default Memtable implementation. In some cases we do use skiplists over B-Trees. Check out this implementation of a skiplist.
Interesting thoughts on redefining how stateful streaming for analytics can be implemented with higher levels of abstraction.
Typically today's streaming systems, time-series databases and SQL extensions do not provide good abstractions for modeling processes that evolve
Looks like a great course. Looking at the course page (), the contents of course look awesome. Just wanted to give a shout out to one of the books referred to there, which incidentally has the same name as the course ..
Algorithmic Aspects of Machine
This is common algorithms knowledge - for solving dynamic programming problems, commonly we have 2 approaches - top down memoized recursive approach and bottom up iterative approach ..
However this observation, which I saw first and only in Erik Demaine's 6.006 lecture videos on
Cassandra moves from comparison based data structures to trie based ones for Memtables and SSTables.
Some of the issues with comparison based data structures like B-tree (that Cassandra used till now) are:
a. Theoretically, the worst-case lookup complexity in comparison-based
Trie Memtables & Trie-Indexed SSTables
are some of the new features in C* 5.0
A novel industry-leading implementation offering significant performance improvements to clusters, from read and update latencies, to reduced disk usage.
by
@b_lambov
Coming from a Scala background I always tried to search for immutability and persistence in data structures when I started programming in Rust. I even found a couple of crates implementing persistent data structures.
Then comes the realisation that in Rust mutability is not a
A very interesting story of Zomato moving its billing platform from TiDB to hosted Amazon DynamoDB that meant redesigning the data storage layer from relational to NoSQL key/value based one.
LSM trees are traditionally known to offer high throughput for writes and good utilization of storage space.
Here's a relatively new paper (2023) that discusses some of the optimization techniques adopted by LSM engines to accelerate reads.
The LSM Design Space and its Read
Seriously, TigerBeetle is one of the most complete projects with regards to artefacts - beautiful code, style guide, architecture document, this👇tutorial and of course their goldmine, the YouTube channel ..
Here are some algorithm pearls from TigerBeetle source code ..
1. A SetAssociativeCache significantly reducing the likelihood of cache thrashing, along with a HashMap acting as the stash. Adopted from this paper:
More Robust Hashing: Cuckoo Hashing with a Stash -
Excited to use the new saturating arithmetic and SIMD builtin changes for a SetAssociativeCache with Nth chance CLOCK eviction.
The new Prefetch is also great for binary search over Eytzinger Layout ala
@pkhuong
and Pat Morin's "Array Layouts for Comparison-Based Searching".
LSM trees offer great write throughput and being so popular in implementation of so many NoSql stores, we are seeing lots of optimizations as well. This paper suggests improvements on the overhead of bloom filters in LSM Trees ..
LSM trees power some of the most popular databases like Cassandra, RocksDB, LevelDB, etc; but they suffer from a massive problem ⚡
Although they are a popular choice for write-heavy use cases, they see latency spikes under a heavy workload. This typically happens because
Just found this monograph on B-trees that has a fairly holistic perspective on B-trees including the data structures and algorithms part and use of B-tree indexes in databases, transactional techniques and query processing techniques.
Modern B-Tree Techniques - Goetz Graefe
Was reading a blog post on the persistent storage structure of DuckDB that uses Adaptive Radix Tree (ART) based indexing on top of main memory storage. One of the core takeaways from reading that blog post is that data structures that offer the same core functionality can be
I was re-reading Sandy Metz's blog post The Wrong Abstraction after a long time, and some things immediately struck in my mind. BTW I agree to the overall proposition that the post makes.
Link: The Wrong Abstraction -
Every time I read the post
TIL : A persistent version of the Adaptive Radix Tree exists, a map data structure designed for in-memory analytics that supports efficient fine-grained updates without compromising immutability.
Link :
Reading the Adaptive Radix Tree paper - Having the realisation that trie is such an underrated data structure. Wondering why we should use any other data structure for in memory database indexes.
Here are some rough notes from the paper ..
Tree based indexes (e.g. binary search
Lecture notes for a beautiful OCaml course👇.. also have a look at (OCaml textbook for CS 3110 Data Structures and Functional Programming at Cornell University) h/t
@alexelcu
@paul_snively
@sabine_s_
@_JamesWard
@LawrPaulson
Polymorphic recursion need explicit type annotation. SML doesn’t have polymorphic recursion. Often needed when writing functions over recursive GADTs.
Not sure if this is the best material to link to, but my lecture notes on GADTs introduce them slowly:
Different languages make you think in different ways. Before starting to learn OCaml seriously, I have never thought of Modules as encodings of Existential Types and how these existentials tie in with the concept of Data Abstraction and design of Abstract Data Types.
Did some
Bought this some time back.Started reading over the weekend.Surprisingly approachable text on type theory till now.A quick look thru contents shows a good coverage of proof checking & proof dev & use of dependent type theory to formalise mathematics.Will update as I make progress
very interesting to follow the incremental optimization in this
#1brc
implementation in Rust - use of perfect hashing, manual SIMD, optimised pattern matching, parallelization etc. Lots of learn from ..
The best way to learn a new programming language/library and its idiomatic coding techniques is to read good code. Code reading is so underrated a technique.
git diff and many other diff programs use an algorithm that's based on the famous data structure, LCS (Longest Common Subsequence).
The core paper that first implemented this idea is by Eugene W. Myers. His 1986 paper "An O(ND) Difference Algorithm and Its Variations" is a
A concise explanation of the essence of algebraic domain modeling with an example and a pointer to a complete implementation using Scala 3 and ZIO 2.x ..
Lazy Sunday afternoon,
#AdventOfCode
done, now on to some relaxed reading of Claude Shannon making us think everything in terms of a 0 or 1 ..
A Mind at Play: How Claude Shannon Invented the Information Age - Jimmy Soni and Rob Goodman
It's an elegantly written, wonderfully
I was looking for some practical examples of Trie usage in databases and I found this article by
@holanda_pe
He shows how Adaptive Radix Trees are used/implemented in
@duckdb
.
I really liked the article structure and the knowledge depth. Can recommend
The Bf-tree addresses the key issue that makes B-trees cache unfriendly. It reduces the granularity of caching from page level thereby increasing the effective hot records in the cache ..
Bf-Tree: A Modern Read-Write-Optimized Concurrent Larger-Than-Memory Range Index -
TIL: postgresql uses Hopcroft-Karp bipartite matching algorithm to represent GROUPING SETs into minimal amount of ROLLUP chains:
Nice! Not seen very often application of bipartite matching in real life.
TIL: AstraDB / Cassandra started with HNSW algorithm for vector indexing, which is fast, accurate and easy to implement. But it needs a lot of memory. Hence they switched on to DiskANN that uses limited memory and a more compact representation of vectors, support incremental
Lockdown Saturday hacking - first draft of ZIO based domain model with effects just landed in the additional accompaniment of Functional and Reactive Domain Modeling .. Now supports cats-io, cats-mtl, cats based tagless final and ZIO
#Scala
#ZIO
#DDD
Revisiting Bloom Filter Efficiency in LSM Trees - thoughts from some recent readings ..
Almost all LSM based storage structures use bloom filters in front of their memtable structures to reduce IO overhead at least for the target keys not present in the underlying secondary
A fantastic talk on SplinterDB by Alex Conway, one of the creators. The best part of the talk is that it describes the confluence of the theory of data structures and their application to develop an impactful system that works in practice.
He talks about data structures, lower
@ChShersh
@MarisaVeryMoe
Ask yourself why B-trees exist and are _provably optimal_ for so many applications if the asymptotics you claim here can also exist. Replace B-trees with cache oblivious b-trees if you want to work in worlds with caches you don't know.
Don't want to think about the IO mode?
A
Algebra of data structures is a beautiful subject and when this "algebra translates directly into an implementation using algebraic data types, and its homomorphisms give rise to functions for manipulating and transforming these edge graphs", it becomes a fascinating read.
Let a
Updated domain model from Functional & Reactive Domain Modeling w/ mtl based approach. Tagless final + mtl + hand-rolled special instances for typeclasses look nice. No more monad transformer allocs. Sample impl w/ cats-core, cats-effect & cats-mtl ..
As an interviewer it puts me off when asking for one’s favourite data structure, someone mentions linked list. They think they can get away with CS 101 knowledge by saying linked lists offer faster insertion and deletion but the discussion quickly switches to how LLs suck with
Looks like the course Programming with Categories () by David Spivak, Brendan Fong and
@BartoszMilewski
will end up in a book .. yay! The first chapter is already online ..
This talk by
@noelwelsh
touches upon the exact pain points that a typeless Python API inflicts upon u.
@odersky
's recent proposal here , Scala's support for staging has the right foundations to bring type support to deep learning
It amazes me that every time I look for something to read on database systems, a new paper pops up for LSM trees.
Just found this gem from SIGMOD 2022 ..
Sarkar, S. and M. Athanassoulis. (2022). “Dissecting, Designing, and Optimizing LSM-based Data
I have been going through the book Writing Powerful Rust Macros from
@ManningBooks
by Sam Van Overmeire.
Link :
The book assumes a bit of Rust background on part of the header and some basic familiarity with Rust meta programming and macros. But the
Learning database index structures with deep learning models - is this the future of database indexes ? Is there any database that has tried anything of this sort ?
The Case for Learned Index Structures -
(From the abstract)
all existing index structures
"The language features of Scala such as type safety and immutability extend my comfort zone, giving me confidence that my code will work. The best thing is all I have to do to get this added benefit is to make it compile!"
This paper by Phil Wadler (Proofs are Programs: 19th Century Logic and 21st Century Computing) states the following in the concluding section ..
"Church’s lambda calculus, both typed and untyped, had a profound impact on the development of programming languages. The notion that
Looks like a great resource for learning Rust for persons with basic familiarity of the language. Some great content on borrows, lifetimes and especially dyn Trait.
(h/t
@Evanfchan
)
Had this bookmarked for quite some time now. Just watched .. one of the best ever database presentations. The simulation game was mind blowing -
@TigerBeetleDB
Next stop - TigerStyle :-)