Data deduplication is a technique that improves storage utilization by identifying unique data and storing only one instance of that data. Any occurrence of the unique data is replaced with a small reference to the correspondingly stored single instance of that data block.
Storage-based data deduplication is most effective in applications where many copies of very similar or even identical data are stored on a single system, which is surprisingly a common scenario. Data backups are one of the use cases where data deduplication is widely used since most data in a given backup remain unchanged from the previous backups. Other use cases include WAN optimization where only unique data or references to unique data are sent over to remote locations to synchronize copies of data sets between locations. Yet another use case is a large deployment of virtual servers or VMs that share the same VM image. However this use case can easily be implemented using QCOW2 images with base VM image as the backing file.
There are some good discussions on the web on various data deduplication technologies, however for our discussion we focus mainly on inline deduplication in file systems.
There are primarily two ways to deduplicate the data in file system.
- Variable size blocks – This approach is well suited for file level deduplication. Variable size deduplication uses rolling CRC similar to that of rsync algorithm to calculate the unique data of variable size. This kind of deduplication is good for file level deduplication because of the way a file is usually modified. File edits include inserting new data or deleting existing data. When these changes happen in the middle of the file, the offset of the data that follows the modification will change. Variable size deduplication can detect unmodified data even when the offset is changed. One major draw back with this approach is it is CPU intensive, as the rolling CRC needs to pass thru the entire file to calculate occurrence of unmodified data. Since CRC is prone to false positives, other strong hashing algorithms such as SHA1 or SHA256 usually substantiate CRC. Most “dedupe at source” backup applications use this technique.
- Fixed size blocks – Well suited for block level storage deduplication including VM image backups, LUN snapshots etc. The assumption here is that the block offset and size are fixed so the deduplication process includes computing a digital signature such as SHA128 or SHA256 for each block and indexing data block using its signature. Compared to variable size based deduplication, fixed size block only require SHA computation and then indexing blocks based on SHA so it is computationally more efficient. ZFS, an open source file system, uses fixed size deduplication.
There are three main components of any deduplication file systems.
- Chunk store where all unique data blocks are stored
- Lookup which indexes the SHA of data block to location of data in chunk store
- File metadata that maps the file offset and length to sequence of SHAs that identifies data blocks in chunk store
Algorithms to implement dedupe file systems have evolved over the last few years and the data deduplication has now become a mainstream feature of any storage appliances.
Deduplication vs. Compression
Though theoretically both methods eliminate redundant data in a data set, compression is well known to remove redundancy within a file or a group of files archived together into a single file. Compression is more efficient for smaller data sets with redundant data. Data deduplication is the process of eliminating redundant data with in a storage system that comprises of large data sets. These data sets can be number of file systems or groups of storage devices. The data deduplication is efficient for larger data sets.
Scale-out file systems
Over the last few years, the explosion of big data from social media and other sources have given rise to highly scalable applications. Though these applications have inbuilt redundancy to protect against hardware failures, Backup and DR are also essential strategies for implementing comprehensive Data Protection. However traditional backup systems are either single node or two node systems with limited scalability and capacity and are not suitable to meet the backup needs of these applications. New backup systems must be based on the same principles as scale out architectures in order to meet their backup needs. And one of the critical features is the support of deduplication in these new classes of backup systems.
Though the algorithms to implement deduplication in traditional file systems are well documented, algorithms to implement deduplication in scale out file systems are less known. In this blog, I will layout the challenges in implementing deduplication in scale out file systems.
Lets discuss the essential attributes of scale out file systems first and then discuss the challenges involved in implementing deduplication in these file systems.
Loosely coupled nodes with data distributed across all nodes. Data is distributed uniformly among the nodes so each node physical resources such as memory, IO channels and disks are equally utilized to scale the performance.
Data distribution in scale out file systems can be skewed over a period of time: this happens because new nodes are added to the system; existing nodes are retired, data is deleted or nodes temporarily going offline and coming back online. In order to get the data distributed uniformly again, these file systems regularly rebalances the data between the nodes.
Scale-out file systems are meant to grow large, in fact very large. Some of the deployments may scale to hundreds of nodes, so given the MTBF of a typical hard disk statistically it is possible to loose a disk every few days. In order to coupe with these failures, these file systems have built in replication to maintain at least one accessible copy of data in the face of a failure.
So when designing deduplication in these file systems, it must as well adhere to these essential attributes:
Indexes, Chunk stores as well as the file metadata of the deduplication should be uniformly distributed across all nodes of the file system. Usually index and chunk stores are co-located on same node so chunk lookups can be local and does not span multiple nodes. File metadata can uniformly spread across nodes for optimal performance.
Index, chunk stores and file metadata need to be periodically migrated between nodes to achieve a uniform storage distribution.
Replication is the tricky feature of all. After all deduplication is about eliminating redundant data which is at odds with replication. However unique data need to be replicated for high availability. There are two ways how this can be achieved.
- Make two distinct file systems and replicate files between these two file systems. Each file system has deduplication built in so each file system stores files efficiently.
- Index and chunk stores keep multiple copies of each chunk on multiple nodes.
The first option is easier to manage because it leverages existing file system replication to achieve multiple copies of chunk stores for high availability.
Other considerations when implementing a deduplication in these architectures are:
Each unique data chunk in a dedupe file system can be referenced by multiple files; hence, it is important to know how many outstanding references exists on a unique data chunk before it can be deleted. Traditional file systems such as ZFS keep a reference count on each unique data and delete the block when the reference count hits zero. Scale out architectures usually spread the metadata across nodes and there is no single place where a reference count per unique data block can be kept so periodic garbage collection of unreferenced unique data blocks should be performed to reclaim the storage.
Dedupe file systems often employ journal for updates to be durable because each update touches more metadata than traditional files systems. For example, appending to a file requires adding SHA to file metadata, creating index and adding the data to chunk store. If the system crashes in the middle of these updates, it leaves the file system in inconsistent state. As with any scale out architectures, the journal has to be global so any surviving node can replay the journal entries and bring back file system data structures to consistent state again.