Multi Version Concurrency Control
This article deep dives into what is multi version concurrency control and what is the benefit of using it
Why do we need Concurrency control?
What is the problem?
Let’s imagine that you are trying to transfer money(let’s say 100Rs) from your account to your friend’s account. At the database level, this transfer of money will be done via a database write transaction that debits money from your account and at the same time credits money to your friend’s account.
Imagine that while this transaction is in progress, to be more specific, consider the exact point in time ( let’s say time instance: T ) when the money has been debited from your account but not credited to your friend’s account. If someone tries to read your account balance and your friend’s account balance at the time instance T, it will appear to the reader as if money has disappeared from the bank, since it has left your account but not reached your friend’s account.
This problem arises due to inconsistent reads, which means that although the write transaction was not fully done i.e. crediting the 100Rs money to your friend’s account, other read transactions were allowed to read the inconsistent and uncommitted data from the database.
How do we solve concurrency issues?
To prevent these scenarios, we need concurrency control in our transactions. Concurrency control enables which data should or should not be visible to other transactions while reading and writing data into the database.
To solve our original problem of inconsistent reads, the simplest way here is to block all the readers until the writer is done which is usually referred to as the read-write lock. In simple terms, a read-write lock states:
“Multiple threads can read the data in parallel but an exclusive lock is needed for writing or modifying data.” In other words, when a writer thread is writing something, all other write threads and read threads must wait until the original writer is done writing.
However, locks are known to provide resource contention. Imagine if multiple writes come at the same time, then multiple write threads will be blocked until one of the write threads is writing the data.
So, the main question becomes: How can we provide maximum database concurrency while getting minimum lock contention?
Enters the solution: Multi-Version Concurrency Control
Multi-Version Concurrency Control (MVCC)
The premise of multi-version concurrency control is:
“When an MVCC database needs to update a piece of data, it will not overwrite the original data item with new data, but instead creates a newer version of the data item, thus storing multiple versions of the same data.“
MVCC tries to maximize database throughput by keeping multiple copies of each data item. Every client connected to the database sees a state (or snapshot) of the database as it was some time ago and then performs an update on top of it. Any changes made by a writer thread are not visible to other users until the changes are completed (as in, the transaction has been committed). This prevents transactions from viewing inconsistent data produced by concurrent transactions performing updates on the same data rows, providing transaction isolation for each database session.
Does MVCC need locks at all?
A good question at this juncture would be since we are storing multiple versions of the same data, then do we need locks at all if we choose multi-version concurrency control over traditional locks?
The answer is that multi-version concurrency control helps: reads not blocking writes and writes not blocking reads but MVCC still needs locks to function.
Consider a case where two write transactions concurrently read data, try to modify the data, and then try to commit their changes. If both transactions are working on the same set of data, then only the latest changes made by a transaction would be visible and the other transaction changes would be lost forever. These are called Lost Updates.
So, in a nutshell, MVCC still needs locks, for example to prevent write conflicts.
Note: Multi-version concurrency control is a popular technique and is used in many databases. For example: PostgreSQL, CouchDB, MySQL with InnoDB storage, etc.
Snapshot Isolation Level
Unlike other traditional isolation levels present in databases, there is an isolation level called Snapshot isolation. A transaction that is operating under the snapshot isolation level will operate on a personal snapshot of the database which is taken at the time of starting the transaction.
When the transaction is concluded, it will successfully commit the changes only if the values updated by the transaction have not been changed by other transactions running concurrently since the time snapshot was taken. And, if there is a write-write conflict among transactions for any values, the transaction will be aborted.
Snapshot isolation level is implemented with multi-version concurrency control and is adopted by multiple databases including InterBase, Firebird, Oracle, MySQL, PostgreSQL, SQL Anywhere, MongoDB, and Microsoft SQL Server.
What is the Row-versioning technique?
Snapshot isolation uses a concept called row-versioning which is a database technique used to track changes made to individual rows in a table. It essentially creates a version history for each row, allowing the database to identify the current state and past states of the data. Here's a deeper look at how it works:
Each row in a table gets an additional column that stores a version number or timestamp. This column is often named
rowversion
(SQL Server) or uses system columns likexmin
andxmax
(PostgreSQL).Whenever a row is inserted, updated, or deleted, the version information associated with that row is incremented or updated accordingly.
Row versioning is the secret sauce of how MVCC works, which allows multiple transactions to access and modify data concurrently with minimal locking. Transactions can read older versions of the data based on their version numbers.
The row-versioning technique is a very basic concept of the database and it helps in implementing Optimistic Concurrency Control as well.
Conclusion
The main advantage of using multi-version concurrency control rather than traditional locking( read-write lock ) is that MVCC locks acquired for reading do not block writing and MVCC locks acquired for writing do not block reading and thus provide high throughput out of your database.
That’s it, folks for this edition of the newsletter. Please consider liking and sharing with your friends as it motivates me to bring you good content for free. If you think I am doing a decent job, share this article in a nice summary with your network. Connect with me on Linkedin or Twitter for more technical posts in the future!
Book exclusive 1:1 with me here.
Resources
Multi-Version Concurrency Control by Wikipedia
Getting started with MVCC
Snapshot Isolation by Wikipedia
thanks a lot Vivek.
I would like to add (in context of postgresql):
1) At the Read Committed isolation level, a snapshot is taken at the beginning of each statement and is active till the statement ends.
2) At the Repeatable Read and Serialization isolation levels, a snapshot is taken at the beginning of the first statement of the transaction and it remains active till the end of the transaction.
"Consider a case where two write transactions concurrently read data, try to modify the data, and then try to commit their changes. If both transactions are working on the same set of data, then only the latest changes made by a transaction would be visible and the other transaction changes would be lost forever. These are called Lost Updates." ----> In simple terms, it's not possible to achieve this using the Serialization Isolation level (also known as Serializable Snapshot Isolation). This isolation method relies on Multi-Version Concurrency Control (MVCC) instead of using locks. What it does is keep track of the dependencies between the data that's read and written by concurrent transactions, and it prevents situations where updates could be lost.