Distributed storage systems make shared data available to a group of users. The goal is transparency: users should not need to know where the data is actually stored.
There are many design decisions to consider; for example:
- How much scalability do we need?
- What kind of consistency?
- What is the expected workload?
There are two access models, contrasted in the sketch after this list:
- Upload/download: to use a file, the client downloads it from the server, works on a local copy, and uploads it back once they're done.
- Remote access: files stay on the server, and clients ask the server to perform each operation on their behalf.
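To make the contrast concrete, here is a minimal Go sketch of both models. Everything here is illustrative, not taken from any real system: a plain map stands in for the remote server, and the type and method names are made up.

```go
package main

import "fmt"

// Upload/download model: the client fetches the whole file, edits a
// local copy, and writes the whole file back when finished.
type DownloadClient struct {
	server map[string][]byte // stands in for a remote file server
}

func (c *DownloadClient) Edit(name string, edit func([]byte) []byte) {
	local := append([]byte(nil), c.server[name]...) // download
	local = edit(local)                             // work on the local copy
	c.server[name] = local                          // upload the result
}

// Remote-access model: the file never leaves the server; the client
// sends each operation and the server applies it in place.
type RemoteClient struct {
	server map[string][]byte
}

func (c *RemoteClient) Append(name string, data []byte) {
	// The "server" performs the operation on the client's behalf.
	c.server[name] = append(c.server[name], data...)
}

func main() {
	files := map[string][]byte{"notes.txt": []byte("hello")}

	dl := &DownloadClient{server: files}
	dl.Edit("notes.txt", func(b []byte) []byte { return append(b, " world"...) })

	ra := &RemoteClient{server: files}
	ra.Append("notes.txt", []byte("!"))

	fmt.Println(string(files["notes.txt"])) // hello world!
}
```

Note the trade-off the sketch exposes: upload/download moves whole files and risks lost updates between download and upload, while remote access sends one round trip per operation but keeps a single authoritative copy on the server.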
There are multiple established systems, including:
- Classics like the Network File System (NFS) and the Coda File System.
- More modern, scalable ones like the Google File System (GFS) and Bigtable.
CAP Theorem
From such a system, we want three things:
- Consistency: all clients see the same data, even during concurrent updates.
- Availability: all clients can access the data, even during faults.
- Partition-tolerance: the system keeps operating even when the network is partitioned, i.e., when messages between nodes are lost or delayed.
Unfortunately, the CAP theorem states that we can get at most two of the three. In essence, when a network partition happens, we can either:
- Cancel the operation, decreasing availability but increasing consistency.
- Proceed with the operation, increasing availability but decreasing consistency.
Generally, a CA system is not a great option, since it amounts to assuming that network partitions never happen, which is unrealistic in practice. Instead, we usually use CP, which rejects updates when not all replicas can be contacted, making the system temporarily unavailable (e.g., Bigtable), or AP, which accepts inconsistencies and tries to resolve them later (e.g., the Coda File System).
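To make the trade-off concrete, here is a toy Go replica that can run in either mode. All names (`Replica`, `Write`, `Heal`) and the reconciliation strategy are hypothetical assumptions for this sketch, not how Bigtable or Coda actually behave.

```go
package main

import (
	"errors"
	"fmt"
)

type Mode int

const (
	CP Mode = iota // reject writes that cannot reach all replicas
	AP             // accept writes locally, reconcile after the partition heals
)

type Replica struct {
	mode        Mode
	data        map[string]string
	partitioned bool              // true while this replica cannot reach its peers
	pending     map[string]string // AP only: writes awaiting reconciliation
}

func NewReplica(mode Mode) *Replica {
	return &Replica{mode: mode, data: map[string]string{}, pending: map[string]string{}}
}

func (r *Replica) Write(key, value string) error {
	if r.partitioned {
		switch r.mode {
		case CP:
			// Consistency over availability: cancel the operation entirely.
			return errors.New("partition: write rejected to preserve consistency")
		case AP:
			// Availability over consistency: apply locally, sync later.
			r.pending[key] = value
		}
	}
	r.data[key] = value
	return nil
}

// Heal simulates the partition ending: pending AP writes are now pushed
// to peers, resolving conflicts somehow (e.g., last-writer-wins; the
// actual policy is system-specific and assumed here).
func (r *Replica) Heal() {
	r.partitioned = false
	for k, v := range r.pending {
		fmt.Printf("reconciling %s=%s with peers\n", k, v)
		delete(r.pending, k)
	}
}

func main() {
	cp := NewReplica(CP)
	cp.partitioned = true
	fmt.Println(cp.Write("x", "1")) // rejected: unavailable but consistent

	ap := NewReplica(AP)
	ap.partitioned = true
	fmt.Println(ap.Write("x", "1")) // <nil>: available but possibly inconsistent
	ap.Heal()
}
```

The two branches in `Write` are exactly the choice described above: the CP branch cancels the operation, trading availability for consistency, while the AP branch proceeds and defers the consistency problem to reconciliation.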