9 Comments
User's avatar
noobengineer's avatar

Thanks for the great article 👍

For The section:

However, B+trees have a linked list of leaf nodes. Thus, you can just use one path to traverse down the whole tree and then access all the keys using links at the leaf nodes. Thus, making sequential access more efficient in case of B+Trees.

My question is:

How is the B+ tree read sequential on disc (assuming HDD)?

From your example, I believe the leaf nodes 1,3,4,5,6,10,12... are not stored in sequential pages on disc. The reason is because otherwise, as soon as you add a new node , say 7, you will have to move all pages after that (10,12...) which will become an extremely expensive operation.

When I read "sequential", I believe the context should be sequential disc read (in context of a hard disc, for SSDs, it doesn't matter).

So, even though each leaf node is pointing to the next leaf node, it will be a random disc access.

Expand full comment
Vivek Bansal's avatar

Hey @noobengineer,

I think you were referring to this article: https://vivekbansal.substack.com/p/btree-vs-bplus-tree-key-differences

But, nevertheless, thanks for highlighting this and it makes a lot of sense. You're right. Though the access within the data structure B+Tree looks like sequential but it would actually be random access inside disk.

I would do some further investigation and edit the article.

Once again, thankyou.

Expand full comment
Navin Bhandari's avatar

Really good read!!

Expand full comment
Vivek Bansal's avatar

Thankyou Navin 🙌

Expand full comment
Imre's avatar

Hey Nice article, I really enjoyed the graphics, they look really good and help the understanding! If someone is interested, I have a series of articles where I implemented these concepts into a simple Java based LSM. https://tmsvr.com/building-a-database-series/ https://github.com/tmsvr/tmsvr-databases

Expand full comment
LucaFan's avatar

Excellent article, thanks. Just a question. How are ssttables ordered? For Key timestamp? What if I have multiple metrics?

Expand full comment
Deepak Kumar's avatar

Hi, how writes in LSM is append only? Yes WAL is append but we have to write to Memtable also which will be logN operation where N is size of keys in memory, which should be some high number to keep reads fast. Am I missing something?

Expand full comment
Sarthak Jain's avatar

Nice blog

Expand full comment
Ashwani Yadav's avatar

awesome read on LSM tree.

I would like to know how can we do a binary search in one SSTable.

Since data size is different for different rows, how should we perform binary search.

My string records in SSTable will look something like this (ofcourse this will be in binary encoding, I am showing as text for simplicity):

content of SSTable

--------------------

{recordOffset: 0, keySize: 4, key: "name", valueSize: 7, value: "ashwani"}

{recordOffset: xxx, keySize: 6, key: "friend", valueSize: 5, value: "vivek"}

--------------------

Expand full comment