Google Spanner
This post is a deep dive into Google Spanner as a database: how it works internally and what makes it special compared to other databases.
Introduction
Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at a global scale and support externally-consistent distributed transactions. It also provides dynamic replication configurations.
Let’s try to decode some jargon here:
“multi-version”: In most databases, an update operation simply overwrites the existing record. In Google Spanner, each record is tagged with a timestamp (the time at which the transaction that wrote it committed). This timestamp is referred to as a version, and that’s why Spanner is called a multi-version database.
“globally-distributed and synchronously-replicated”: Spanner automatically shards data across machines as the amount of data or the number of servers changes, and it automatically migrates data across machines (even across datacenters) to balance load and in response to failures. Whenever new data is ingested on one instance, Spanner synchronously replicates it to the other instances.
“externally-consistent distributed transactions”: This means Spanner behaves as if all transactions are executed one after another in a specific order, even though they might be happening concurrently on different servers.
“dynamic replication configurations”: The replication configuration for data can be dynamically controlled at a fine grain by applications. Applications can specify constraints to control which datacenters contain which data, how far data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance). Data can also be dynamically moved across datacenters to balance resource usage. A sketch of what such a configuration might look like follows this list.
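The whitepaper doesn't pin down a concrete API for these controls, so here is a purely hypothetical sketch, just to make the knobs concrete. None of the class or field names below are real Spanner APIs; they simply mirror the constraints listed above.

```python
# Hypothetical illustration only -- not a real Spanner API.
# It just captures the kinds of constraints applications can express.
from dataclasses import dataclass, field

@dataclass
class ReplicationConstraints:
    datacenters: list[str] = field(default_factory=list)  # which datacenters hold the data
    max_read_latency_ms: int = 50     # how far data may be from its users (read latency)
    max_write_latency_ms: int = 200   # how far replicas may be from each other (write latency)
    num_replicas: int = 5             # durability, availability, and read performance

# An application might attach different constraints to different buckets of data,
# e.g. keeping European users' data on three replicas in European datacenters.
eu_users = ReplicationConstraints(
    datacenters=["eu-west-a", "eu-west-b", "eu-central-a"],
    max_read_latency_ms=30,
    num_replicas=3,
)
```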
The magic sauce of Google Spanner is that it assigns globally meaningful commit timestamps to transactions, even though transactions may be distributed. The timestamps reflect serialization order.
For example: if a transaction T1 commits before another transaction T2 starts, then T1’s commit timestamp is smaller than T2’s. Spanner is the first system to provide such guarantees at a global scale.
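In the whitepaper's notation, this guarantee (called external consistency) is roughly the following invariant, where s1 and s2 are the commit timestamps Spanner assigns to T1 and T2, and t_abs(e) is the absolute time at which an event e happens:

```latex
% If T1 commits (in absolute time) before T2 starts, T1 gets the smaller commit timestamp.
t_{abs}(e_1^{commit}) < t_{abs}(e_2^{start}) \;\Rightarrow\; s_1 < s_2
```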
This is hard to accomplish primarily because the data is distributed and there is network lag between machines. Machines cannot instantly share their local clock readings with each other, and those clocks drift, so deciding which transaction committed first and which committed later is genuinely difficult.
Read more about this in the TrueTime API section!
Google Spanner Internals
This section explains the following:
Underlying Spanner Implementation
Directory abstraction, which is used to manage replication and locality
Spanner’s Data Model: why Spanner looks like a relational database instead of a key-value store
Underlying Spanner Implementation
A Spanner deployment is called a universe. Given that Spanner manages data globally, there will be only a handful of running universes. The above image represents a single Spanner universe.
A Spanner universe has the following components (a minimal structural sketch follows this list):
Zones: The Spanner database is a set of zones. Each zone contains some partition of the data. Each zone corresponds to a physical location in the world, and data can be replicated across zones. One datacenter can contain multiple zones. Zones can be added or removed as new datacenters come online and old ones are turned off.
Zonemaster: Each zone has one zonemaster, which assigns data to spanservers.
Spanservers: A zone contains between one hundred and several thousand spanservers, which serve data to clients.
Location Proxies: The per-zone location proxies are used by clients to locate the spanservers assigned to serve their data.
Universe master: The universe master is primarily a console that displays status information about all the zones for interactive debugging.
Placement driver: The placement driver handles the automated movement of data across zones. The placement driver periodically communicates with the spanservers to find data that needs to be moved, either to meet updated replication constraints or to balance the load.
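To make this hierarchy concrete, here is a minimal structural sketch of how the pieces relate. It is purely illustrative; none of these classes correspond to actual Spanner code.

```python
# Illustrative model of a Spanner deployment's structure -- not real Spanner code.
from dataclasses import dataclass, field

@dataclass
class SpanServer:
    server_id: str
    tablets: list = field(default_factory=list)   # each spanserver manages 100-1000 tablets

@dataclass
class Zone:
    name: str                       # a unit of physical isolation; a datacenter may host several
    zonemaster: str                 # assigns data to the zone's spanservers
    location_proxies: list[str]     # used by clients to find the spanserver holding their data
    spanservers: list[SpanServer] = field(default_factory=list)

@dataclass
class Universe:
    name: str                       # an entire Spanner deployment, e.g. "production"
    universe_master: str            # status console for interactive debugging
    placement_driver: str           # moves data between zones to rebalance or meet constraints
    zones: list[Zone] = field(default_factory=list)
```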
Each spanserver manages 100 to 1000 instances of a data structure called a tablet, and implements a single Paxos state machine on top of each tablet. The tablet also stores the metadata of its Paxos state machine. You can think of a tablet as a data unit containing many key-value mappings of the form:
(key:string, timestamp:int64) → string
As discussed above, Spanner assigns timestamps to data, which is an important way in which Spanner is more like a multi-version database than a key-value store. Since the size of a tablet data structure could get big, it’s stored on a distributed file system called Colossus.
Summary: The key-value mapping state of each replica is stored in its corresponding tablet, along with the metadata of the Paxos state machine. Since the data in a tablet can get big, everything is stored on Colossus, which is a distributed file system.
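To make "multi-version" concrete, here is a toy sketch of the (key, timestamp) → value mapping. This is my own illustration, not Spanner's storage engine: a read at a timestamp returns the newest value committed at or before that timestamp.

```python
from collections import defaultdict

class ToyTablet:
    """Toy multi-version key-value store: (key, timestamp) -> value.
    Illustration only; a real tablet's state lives in files on Colossus."""

    def __init__(self):
        # For each key, keep every version, sorted by commit timestamp.
        self._versions = defaultdict(list)   # key -> [(timestamp, value), ...]

    def write(self, key: str, timestamp: int, value: str) -> None:
        self._versions[key].append((timestamp, value))
        self._versions[key].sort()

    def read_at(self, key: str, timestamp: int):
        """Return the newest value whose commit timestamp is <= the read timestamp."""
        result = None
        for ts, value in self._versions[key]:
            if ts <= timestamp:
                result = value
            else:
                break
        return result

tablet = ToyTablet()
tablet.write("user/42/name", timestamp=100, value="Alice")
tablet.write("user/42/name", timestamp=200, value="Alicia")
assert tablet.read_at("user/42/name", 150) == "Alice"    # an older version is still readable
assert tablet.read_at("user/42/name", 250) == "Alicia"   # newest committed version wins
```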
Directory abstraction
In Spanner, a Paxos group is a set of servers that work together to maintain consistency for a specific chunk of data. Each group uses the Paxos algorithm to ensure all members agree on the latest updates.
As you already know, Spanner divides the data into multiple shards. Each data shard is replicated across multiple servers in a Paxos group. These servers ensure that every replica holds the same version of the shard.
After dividing the data into shards, Spanner further organizes it into units called directories. A directory is a smaller section within a shard, typically containing multiple related entries. Spanner can relocate entire directories from one Paxos group (set of servers) to another, and this movement happens without disrupting ongoing database operations.
Reasons for Moving Directories:
There are several reasons why Spanner might move directories:
Load Balancing: If one Paxos group becomes overloaded with data or requests, Spanner can move directories to a less busy group, distributing the workload more evenly.
Data Co-locality: If two frequently accessed directories are currently spread across different Paxos groups, Spanner can bring them together, improving query performance as both directories can be accessed from the same server set.
Data Proximity: For geographically distributed deployments, Spanner can move directories closer to the locations where users primarily access them, reducing network latency and improving performance.
Summary: Directories are the unit of data movement and placement. They allow applications to control data locality by choosing keys carefully. A directory is a set of contiguous keys that share a common prefix. Spanner can move directories between Paxos groups to shed load, group frequently accessed directories together, or move a directory closer to its accessors. Data is moved directory by directory.
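Here is a small illustrative sketch of that idea, with made-up group and key names: directories are common-prefix key ranges, and rebalancing moves a whole directory from one Paxos group (replica set) to another.

```python
# Illustrative only: models directories as common-prefix key groups that can be
# moved between Paxos groups (replica sets). Not Spanner's actual implementation.

class PaxosGroup:
    def __init__(self, name, replicas):
        self.name = name
        self.replicas = replicas          # the servers that replicate this group's data
        self.directories = {}             # key prefix -> list of (key, value) rows

    def load(self):
        return sum(len(rows) for rows in self.directories.values())

def move_directory(prefix, src: PaxosGroup, dst: PaxosGroup):
    """Move every row sharing `prefix` from src to dst, e.g. to shed load or to
    put the directory in the same group as (or closer to) its accessors."""
    dst.directories[prefix] = src.directories.pop(prefix)

group_a = PaxosGroup("group-a", replicas=["us-east-1", "us-east-2", "eu-west-1"])
group_b = PaxosGroup("group-b", replicas=["us-west-1", "us-west-2", "asia-east-1"])

group_a.directories["user/1/"] = [("user/1/profile", "..."), ("user/1/albums/1", "...")]
group_a.directories["user/2/"] = [("user/2/profile", "...")]

# Rebalance: shed load from group_a by moving the "user/2/" directory.
move_directory("user/2/", group_a, group_b)
assert "user/2/" in group_b.directories and "user/2/" not in group_a.directories
```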
Spanner’s Data Model
Here are the key points related to Spanner’s data model:
Spanner leans towards a relational data model, similar to traditional databases. Data is organized into tables with rows and columns, allowing for querying using SQL.
It adheres to ACID properties (Atomicity, Consistency, Isolation, Durability) ensuring data reliability and integrity even in a distributed setting.
Spanner supports secondary indexes to enable efficient data retrieval based on specific columns. This speeds up queries based on frequently used search criteria.
Spanner employs a multi-version concurrency control mechanism. Each data item has a history of changes, allowing access to past versions. This is useful for applications requiring historical data analysis or rollback capabilities.
Spanner also has a notion of a hierarchy of tables. Client applications declare these hierarchies in their database schemas via INTERLEAVE IN declarations. The table at the top of a hierarchy is a directory table. Each row in a directory table with key K, together with all of the rows in descendant tables that start with K in lexicographic order, forms a directory.
For example, in the above image, directory 453 contains all of the keys that start with the number 2 and directory 3665 contains all of the keys that start with the number 1. Albums(2,1) represents the row from the Albums table for user id 2 and album id 1.
This interleaving of tables to form directories is significant because it allows clients to describe the locality relationships that exist between multiple tables, which is necessary for good performance in a sharded, distributed database. Without it, Spanner would not know the most important locality relationships.
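To see how interleaving produces directories, here is a toy sketch of the key ordering (an assumed, simplified key encoding, not Spanner's actual on-disk format): a user's row and all of that user's album rows share a common prefix, so they sort next to each other and form one directory.

```python
# Toy key encoding to show interleaving; the real on-disk format differs.
rows = {
    ("Users", (1,)): "user 1 row",
    ("Users", (2,)): "user 2 row",
    ("Albums", (1, 1)): "user 1, album 1",
    ("Albums", (1, 2)): "user 1, album 2",
    ("Albums", (2, 1)): "user 2, album 1",
}

def interleaved_key(table, pk):
    # Prefix every row with the top-level (directory table) key, so a user's row
    # and all of that user's album rows share a common prefix.
    user_id, *child_key = pk
    return (user_id, table != "Users", tuple(child_key))

for (table, pk) in sorted(rows, key=lambda r: interleaved_key(*r)):
    print(table, pk)
# Users (1,) -> Albums (1, 1) -> Albums (1, 2)   <- directory for user 1
# Users (2,) -> Albums (2, 1)                    <- directory for user 2
```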
TrueTime API
What’s the existing problem that the TrueTime API is trying to solve?
The Problem: Inconsistent Timestamps in Distributed Systems
Traditional distributed databases often rely on timestamps generated by individual server clocks. The whitepaper highlights this as a challenge because:
Clock Skew: Physical clocks on different servers can drift slightly over time, leading to inconsistencies.
Network Delays: Messages carrying timestamps can experience delays, further disrupting the perceived order of events.
Consequences of Inconsistent Timestamps:
Inconsistent timestamps can lead to data inconsistencies, particularly when dealing with concurrent transactions across geographically distributed servers.
This can cause issues like the following (a tiny numeric illustration follows this list):
"Lost Updates": Updates from one transaction might be overwritten by a later transaction, even if the first update happened logically earlier.
"Dirty Reads": A transaction might read data that hasn't been fully committed by another transaction, leading to inconsistent data views.
Spanner's TrueTime API to the Rescue:
The paper introduces the TrueTime API as a solution to address these challenges:
Uncertainty Awareness: TrueTime acknowledges the inherent uncertainty in clock synchronization across geographically distributed servers.
Interval Timestamps: TrueTime does not pretend to know the exact time. A call to the API returns an interval [earliest, latest] that is guaranteed to contain the true current time; the width of that interval accounts for potential clock skew and synchronization delays. For example, if the true absolute time is T1, the TrueTime API will return an interval [T1A, T1B] where T1A ≤ T1 ≤ T1B. When a transaction commits, Spanner uses these intervals to pick its commit timestamp.
Stronger Consistency Guarantees: By factoring in this uncertainty, and by waiting out the remaining uncertainty before a commit becomes visible (known as "commit wait"), Spanner can achieve stronger consistency guarantees for reads and writes. This ensures a clear and well-defined order of transactions, even if the physical clocks on individual servers aren't perfectly synchronized. A toy sketch of this mechanism follows this list.
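Here is a toy sketch of that mechanism. It is my own simplification with a made-up uncertainty bound: the clock returns an interval, a transaction's commit timestamp is taken from the upper end of that interval, and the commit is held back until the timestamp is definitely in the past, so any transaction that starts afterwards must get a larger timestamp.

```python
import time

# Toy TrueTime: returns an interval [earliest, latest] that is assumed to contain
# true absolute time. EPSILON is a made-up stand-in for the real uncertainty bound,
# which Google keeps small using GPS receivers and atomic clocks.
EPSILON = 0.005  # 5 ms, for illustration only

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)   # (earliest, latest)

def tt_after(t):
    """True once t is definitely in the past."""
    return tt_now()[0] > t

def commit(apply_writes):
    # 1. Pick a commit timestamp no smaller than the latest possible current time.
    s = tt_now()[1]
    apply_writes(s)
    # 2. "Commit wait": don't make the commit visible until s is definitely in the past.
    #    This is what guarantees a later-starting transaction gets a larger timestamp.
    while not tt_after(s):
        time.sleep(EPSILON / 4)
    return s

s1 = commit(lambda s: print(f"T1 committed at {s:.6f}"))
s2 = commit(lambda s: print(f"T2 committed at {s:.6f}"))  # starts after T1 finished
assert s1 < s2
```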
Benefits of Stronger Consistency:
Simplified Development: Developers don't need to write complex code to handle potential inconsistencies arising from concurrent transactions.
Predictable Behavior: Applications can rely on a well-defined order of transactions, leading to more predictable and reliable data behaviour.
Data Integrity: Stronger consistency guarantees minimize the risk of data corruption due to inconsistent timestamps or concurrent writes.
That’s it, folks, for this edition of the newsletter. Please consider liking and sharing it with your friends, as that motivates me to keep bringing you good content for free. If you think I am doing a decent job, share this article with a nice summary with your network. Connect with me on LinkedIn or Twitter for more technical posts in the future!
Book exclusive 1:1 with me here.
Resources
Spanner Whitepaper by Google
Spanner explanation by Martin Kleppmann