基于Dedup的数据打包技术

基于Dedup的数据打包技术

作者简介：刘爱贵，研究方向为网络存储、数据挖掘和分布式计算；毕业于中科院，目前从事存储软件研发工作。 Email: Aigui.Liu@gmail.com
注：作者学识和经验水平有限，如有错误或不当之处，敬请批评指正。

0、引言
    Tar, winrar, winzip是最为常见的数据打包工具软件，它们把文件集体封装成一个单独的数据包，从而方便数据的分布、传输、归档以及持久保存等目的。这类工具通常都支持数据压缩技术，从而有效减少数据的存储空间，常用压缩算法有Huffman编码、Z77/z78、LZW等。压缩算法的原理是通过对数据的重新编码，高频率数据片段采用较短的编码，低频率数据片段采用较长的编码，从而获得全局的上数据量较小的文件表示。

1、Dedup原理
   Deduplication，即重复数据删除，它是一种非常新的且流行度很高的存储技术，可以大大减少数据的数量。重复数据删除技术，通过数据集中重复的数据，从而消除冗余数据。借助dedup技术，可以提高存储系统的效率，有效节约成本、减少传输过程中的网络带宽。同时它也是一种绿色存储技术，能有效降低能耗(存储空间小了，所需要存储系统磁盘也就少了，自然所需要电能就减少了)。
   dedup按照消重的粒度可以分为文件级和数据块级。文件级的dedup技术也称为单一实例存储(SIS, Single Instance Store)，数据块级的重复数据删除，其消重粒度更小，可以达到4-24KB之间。显然，数据块级的可以提供更高的数据消重率，因此目前主流的dedup产品都是数据块级的。重复数据删除原理如下图所示。将文件都分割成数据块(可以是定长或变长的数据块)，采用MD5或SHA1等Hash算法(可以同时使用两种或以上hash算法，或CRC校验等，以获得非常小概率的数据碰撞发生)为数据块计算FingerPrint。具有相同FP指纹的数据块即可认为是相同的数据块，存储系统中仅需要保留一份。这样，一个物理文件在存储系统中就对应一个逻辑表示，由一组FP组成的元数据。当进行读取文件时，先读取逻辑文件，然后根据FP序列，从存储系统中取出相应数据块，还原物理文件副本。
   重复数据删除目前主要应用于数据备份，因此对数据进行多次备份后，存在大量重复数据，非常适合dedup技术。事实上，dedup技术可以用于很多场合，包括在线数据、近线数据、离线数据存储系统，甚至可以在文件系统、卷管理器、NAS、SAN中实施。还可以用于网络数据传输，当然也可以应用于数据打包技术。dedup技术可以帮助众多应用降低数据存储量，节省网络带宽，提高存储效率、减小备份窗口，绿色节能。这里，基于dedup实现一种数据打包技术。

基于Dedup的数据打包技术

2、基于Dedup的数据打包模型

数据包文件的数据布局：

Header

Unique block data

File metadata

数据包由三部分组成：文件头(header)、唯一数据块集(unique block data)和逻辑文件元数据(file metadata)。其中，header为一个结构体，定义了数据块大小、唯一数据块数量、数据块ID大小、包中文件数量、元数据在包中的位置等元信息。文件头后紧接就存储着所有唯一的数据块，大小和数量由文件头中元信息指示。在数据块之后，就是数据包中文件的逻辑表示元数据，由多个实体组成，结构如下所示，一个实体表示一个文件。解包时根据文件的元数据，逐一提取数据块，还原出当初的物理文件。

逻辑文件的元数据表示：

Entry header

pathname

Entry data

Last block data

逻辑文件的实体头中记录着文件名长度、数据块数量、数据块ID大小和最后一个数据块大小等信息。紧接着是文件名数据，长度在实体头中定义。文件名数据之后，存储着一组唯一数据块的编号，编号与唯一数据块集中的数据块一一对应。最后存储着文件最后一个数据块，由于这个数据块大小通常比正常数据块小，重复概率非常小，因此单独保存。

3、原型实现
基于上面的数据布局，就可以实现支持重复数据删除的数据打包方法。本人在Linux系统上实现了一个原型，实现中使用了hashtable来记录和查询唯一数据块信息，使用MD5算法计算数据块指纹，并使用zlib中的z77压缩算法对删除了重复数据后的数据包进行压缩。hashtable, MD5, z77算法和实现，这里不作介绍，有兴趣的读者可以参考相关资源。下面给出dedup.h, dedup.c undedup.c源码文件。目前实现的原型还相对比较粗糙。

/* dedup.h */

#ifndef _DEDUP_H #define _DEDUP_H #include "md5.h" #include "hash.h" #include "hashtable.h" #include "libz.h" #ifdef __cplusplus extern "C" { #endif /* * deduplication file data layout * -------------------------------------------------- * | header | unique block data | file metadata | * -------------------------------------------------- * * file metedata entry layout * ----------------------------------------------------------------- * | entry header | pathname | entry data | last block data | * ----------------------------------------------------------------- */ typedef unsigned int block_id_t; #define BLOCK_SIZE 4096 /* 4K Bytes */ #define BACKET_SIZE 10240 #define MAX_PATH_LEN 255 #define BLOCK_ID_SIZE (sizeof(block_id_t)) /* deduplication package header */ #define DEDUP_MAGIC_NUM 0x1329149 typedef struct _dedup_package_header { unsigned int block_size; unsigned int block_num; unsigned int blockid_size; unsigned int magic_num; unsigned int file_num; unsigned long long metadata_offset; } dedup_package_header; #define DEDUP_PKGHDR_SIZE (sizeof(dedup_package_header)) /* deduplication metadata entry header */ typedef struct _dedup_entry_header { unsigned int path_len; unsigned int block_num; unsigned int entry_size; unsigned int last_block_size; int mode; } dedup_entry_header; #define DEDUP_ENTRYHDR_SIZE (sizeof(dedup_entry_header)) #ifdef __cplusplus } #endif #endif
/* dedup.c */

#include <stdio.h> #include <stdlib.h> #include <sys/types.h> #include <sys/stat.h> #include <dirent.h> #include <unistd.h> #include <getopt.h> #include <fcntl.h> #include <dirent.h> #include <errno.h> #include "dedup.h" /* unique block number in package */ static unsigned int g_unique_block_nr = 0; /* regular file number in package */ static unsigned int g_regular_file_nr = 0; /* block length */ static unsigned int g_block_size = BLOCK_SIZE; /* hashtable backet number */ static unsigned int g_htab_backet_nr = BACKET_SIZE; void show_md5(unsigned char md5_checksum[16]) { int i; for (i = 0; i < 16; i++) { printf("%02x", md5_checksum[i]); } } void show_pkg_header(dedup_package_header dedup_pkg_hdr) { printf("block_size = %d/n", dedup_pkg_hdr.block_size); printf("block_num = %d/n", dedup_pkg_hdr.block_num); printf("blockid_size = %d/n", dedup_pkg_hdr.blockid_size); printf("magic_num = 0x%x/n", dedup_pkg_hdr.magic_num); printf("file_num = %d/n", dedup_pkg_hdr.file_num); printf("metadata_offset = %lld/n", dedup_pkg_hdr.metadata_offset); } int dedup_regfile(char *fullpath, int prepos, int fd_bdata, int fd_mdata, hashtable *htable, int debug) { int fd; char *buf = NULL; unsigned int rwsize, pos; unsigned char md5_checksum[16 + 1] = {0}; unsigned int *metadata = NULL; unsigned int block_num = 0; struct stat statbuf; dedup_entry_header dedup_entry_hdr; if (-1 == (fd = open(fullpath, O_RDONLY))) { perror("open regulae file"); return errno; } if (-1 == fstat(fd, &statbuf)) { perror("fstat regular file"); goto _DEDUP_REGFILE_EXIT; } block_num = statbuf.st_size / g_block_size; metadata = (unsigned int *)malloc(BLOCK_ID_SIZE * block_num); if (metadata == NULL) { perror("malloc metadata for regfile"); goto _DEDUP_REGFILE_EXIT; } buf = (char *)malloc(g_block_size); if (buf == NULL) { perror("malloc buf for regfile"); goto _DEDUP_REGFILE_EXIT; } pos = 0; while (rwsize = read(fd, buf, g_block_size)) { /* if the last block */ if (rwsize != g_block_size) break; /* calculate md5 */ md5(buf, rwsize, md5_checksum); /* check hashtable with hashkey */ unsigned int *bindex = (block_id_t *)hash_value((void *)md5_checksum, htable); if (bindex == NULL) { bindex = (unsigned int *)malloc(BLOCK_ID_SIZE); if (NULL == bindex) { perror("malloc in dedup_regfile"); break; } /* insert hash entry and write unique block into bdata*/ *bindex = g_unique_block_nr; hash_insert((void *)strdup(md5_checksum), (void *)bindex, htable); write(fd_bdata, buf, rwsize); g_unique_block_nr++; } metadata[pos] = *bindex; memset(buf, 0, g_block_size); memset(md5_checksum, 0, 16 + 1); pos++; } /* write metadata into mdata */ dedup_entry_hdr.path_len = strlen(fullpath) - prepos; dedup_entry_hdr.block_num = block_num; dedup_entry_hdr.entry_size = BLOCK_ID_SIZE; dedup_entry_hdr.last_block_size = rwsize; dedup_entry_hdr.mode = statbuf.st_mode; write(fd_mdata, &dedup_entry_hdr, sizeof(dedup_entry_header)); write(fd_mdata, fullpath + prepos, dedup_entry_hdr.path_len); write(fd_mdata, metadata, BLOCK_ID_SIZE * block_num); write(fd_mdata, buf, rwsize); g_regular_file_nr++; _DEDUP_REGFILE_EXIT: close(fd); if (metadata) free(metadata); if (buf) free(buf); return 0; } int dedup_dir(char *fullpath, int prepos, int fd_bdata, int fd_mdata, hashtable *htable, int debug) { DIR *dp; struct dirent *dirp; struct stat statbuf; char subpath[MAX_PATH_LEN] = {0}; if (NULL == (dp = opendir(fullpath))) { return errno; } while ((dirp = readdir(dp)) != NULL) { if (strcmp(dirp->d_name, ".") == 0 || strcmp(dirp->d_name, "..") == 0) continue; sprintf(subpath, "%s/%s", fullpath, dirp->d_name); if (0 == lstat(subpath, &statbuf)) { if (debug) printf("%s/n", subpath); if (S_ISREG(statbuf.st_mode)) dedup_regfile(subpath, prepos, fd_bdata, fd_mdata, htable,debug); else if (S_ISDIR(statbuf.st_mode)) dedup_dir(subpath, prepos, fd_bdata, fd_mdata, htable, debug); } } closedir(dp); return 0; } int dedup_package(int path_nr, char **src_paths, char *dest_file, int debug) { int fd, fd_bdata, fd_mdata, ret = 0; struct stat statbuf; hashtable *htable = NULL; dedup_package_header dedup_pkg_hdr; char **paths = src_paths; int i, rwsize, prepos; char buf[1024 * 1024] = {0}; if (-1 == (fd = open(dest_file, O_WRONLY | O_CREAT, 0755))) { perror("open dest file"); ret = errno; goto _DEDUP_PKG_EXIT; } htable = create_hashtable(g_htab_backet_nr); if (NULL == htable) { perror("create_hashtable"); ret = errno; goto _DEDUP_PKG_EXIT; } fd_bdata = open("./.bdata", O_RDWR | O_CREAT, 0777); fd_mdata = open("./.mdata", O_RDWR | O_CREAT, 0777); if (-1 == fd_bdata || -1 == fd_mdata) { perror("open bdata or mdata"); ret = errno; goto _DEDUP_PKG_EXIT; } g_unique_block_nr = 0; g_regular_file_nr = 0; for (i = 0; i < path_nr; i++) { if (lstat(paths[i], &statbuf) < 0) { perror("lstat source path"); ret = errno; goto _DEDUP_PKG_EXIT; } if (S_ISREG(statbuf.st_mode) || S_ISDIR(statbuf.st_mode)) { if (debug) printf("%s/n", paths[i]); /* get filename position in pathname */ prepos = strlen(paths[i]) - 1; if (strcmp(paths[i], "/") != 0 && *(paths[i] + prepos) == '/') { *(paths[i] + prepos--) = '/0'; } while(*(paths[i] + prepos) != '/' && prepos >= 0) prepos--; prepos++; if (S_ISREG(statbuf.st_mode)) dedup_regfile(paths[i], prepos, fd_bdata, fd_mdata, htable, debug); else dedup_dir(paths[i], prepos, fd_bdata, fd_mdata, htable, debug); } else { if (debug) printf("%s is not regular file or directory./n", paths[i]); } } /* fill up dedup package header */ dedup_pkg_hdr.block_size = g_block_size; dedup_pkg_hdr.block_num = g_unique_block_nr; dedup_pkg_hdr.blockid_size = BLOCK_ID_SIZE; dedup_pkg_hdr.magic_num = DEDUP_MAGIC_NUM; dedup_pkg_hdr.file_num = g_regular_file_nr; dedup_pkg_hdr.metadata_offset = DEDUP_PKGHDR_SIZE + g_block_size * g_unique_block_nr; write(fd, &dedup_pkg_hdr, DEDUP_PKGHDR_SIZE); /* fill up dedup package unique blocks*/ lseek(fd_bdata, 0, SEEK_SET); while(rwsize = read(fd_bdata, buf, 1024 * 1024)) { write(fd, buf, rwsize); memset(buf, 0, 1024 * 1024); } /* fill up dedup package metadata */ lseek(fd_mdata, 0, SEEK_SET); while(rwsize = read(fd_mdata, buf, 1024 * 1024)) { write(fd, buf, rwsize); memset(buf, 0, 1024 * 1024); } if (debug) show_pkg_header(dedup_pkg_hdr); _DEDUP_PKG_EXIT: close(fd); close(fd_bdata); close(fd_mdata); unlink("./.bdata"); unlink("./.mdata"); hash_free(htable); return ret; } void usage() { printf("Usage: dedup [OPTION...] <target file> <source files ...>/n"); printf("/nPackage files with deduplicaton technique./n"); printf("Mandatory arguments to long options are mandatory for short options too./n"); printf(" -z, --compress filter the archive through compress/n"); printf(" -b, --block block size for deduplication, default is 4096/n"); printf(" -t, --hashtable hashtable backet number, default is 10240/n"); printf(" -d, --debug print debug messages/n"); printf(" -h, --help give this help list/n"); printf("/nReport bugs to <Aigui.Liu@gmail.com>./n"); } int main(int argc, char *argv[]) { char tmp_file[] = "./.dedup/0"; int bz = 0, bhelp = 0, bdebug = 0; int ret = -1, c; struct option longopts[] = { {"compress", 0, &bz, 'z'}, {"block", 1, 0, 'b'}, {"hashtable", 1, 0, 't'}, {"debug", 0, &bdebug, 'd'}, {"help", 0, &bhelp, 'h'}, {0, 0, 0, 0} }; /* parse options */ while ((c = getopt_long (argc, argv, "zb:t:dh", longopts, NULL)) != EOF) { switch(c) { case 'z': bz = 1; break; case 'b': g_block_size = atoi(optarg); break; case 't': g_htab_backet_nr = atoi(optarg); break; case 'd': bdebug = 1; break; case 'h': case '?': default: bhelp = 1; break; } } if (bhelp == 1 || (argc - optind) < 2) { usage(); return 0; } if (bz) { /* dedup and compress */ ret = dedup_package(argc - optind -1 , argv + optind + 1, tmp_file, bdebug); if (ret == 0) { ret = zlib_compress_file(tmp_file, argv[optind]); unlink(tmp_file); } } else { /* dedup only */ ret = dedup_package(argc - optind - 1, argv + optind + 1, argv[optind], bdebug); } return ret; }

/* undedup.c */

#include <stdio.h> #include <stdlib.h> #include <sys/types.h> #include <sys/stat.h> #include <unistd.h> #include <getopt.h> #include <fcntl.h> #include <errno.h> #include "dedup.h" /* block length */ static unsigned int g_block_size = BLOCK_SIZE; void show_pkg_header(dedup_package_header dedup_pkg_hdr) { printf("block_size = %d/n", dedup_pkg_hdr.block_size); printf("block_num = %d/n", dedup_pkg_hdr.block_num); printf("blockid_size = %d/n", dedup_pkg_hdr.blockid_size); printf("magic_num = 0x%x/n", dedup_pkg_hdr.magic_num); printf("file_num = %d/n", dedup_pkg_hdr.file_num); printf("metadata_offset = %lld/n", dedup_pkg_hdr.metadata_offset); } int prepare_target_file(char *pathname, char *basepath, int mode) { char fullpath[MAX_PATH_LEN] = {0}; char path[MAX_PATH_LEN] = {0}; char *p = NULL; int pos = 0, fd; sprintf(fullpath, "%s/%s", basepath, pathname); p = fullpath; while (*p != '/0') { path[pos++] = *p; if (*p == '/') mkdir(path, 0755); p++; } fd = open(fullpath, O_WRONLY | O_CREAT, mode); return fd; } int undedup_regfile(int fd, dedup_entry_header dedup_entry_hdr, char *dest_dir, int debug) { char pathname[MAX_PATH_LEN] = {0}; block_id_t *metadata = NULL; unsigned int block_num = 0; char *buf = NULL; char *last_block_buf = NULL; long long offset, i; int fd_dest, ret = 0; metadata = (block_id_t *) malloc(BLOCK_ID_SIZE * dedup_entry_hdr.block_num); if (NULL == metadata) return errno; buf = (char *)malloc(g_block_size); last_block_buf = (char *)malloc(g_block_size); if (NULL == buf || NULL == last_block_buf) { ret = errno; goto _UNDEDUP_REGFILE_EXIT; } read(fd, pathname, dedup_entry_hdr.path_len); read(fd, metadata, BLOCK_ID_SIZE * dedup_entry_hdr.block_num); read(fd, last_block_buf, dedup_entry_hdr.last_block_size); fd_dest = prepare_target_file(pathname, dest_dir, dedup_entry_hdr.mode); if (fd_dest == -1) { ret = errno; goto _UNDEDUP_REGFILE_EXIT; } if (debug) printf("%s/%s/n", dest_dir, pathname); /* write regular block */ block_num = dedup_entry_hdr.block_num; for(i = 0; i < block_num; ++i) { offset = DEDUP_PKGHDR_SIZE + metadata[i] * g_block_size; lseek(fd, offset, SEEK_SET); read(fd, buf, g_block_size); write(fd_dest, buf, g_block_size); } /* write last block */ write(fd_dest, last_block_buf, dedup_entry_hdr.last_block_size); close(fd_dest); _UNDEDUP_REGFILE_EXIT: if (metadata) free(metadata); if (buf) free(buf); if (last_block_buf) free(last_block_buf); return ret; } int undedup_package(char *src_file, char *dest_dir, int debug) { int fd, i, ret = 0; dedup_package_header dedup_pkg_hdr; dedup_entry_header dedup_entry_hdr; unsigned long long offset; if (-1 == (fd = open(src_file, O_RDONLY))) { perror("open source file"); return errno; } if (read(fd, &dedup_pkg_hdr, DEDUP_PKGHDR_SIZE) != DEDUP_PKGHDR_SIZE) { perror("read dedup_package_header"); ret = errno; goto _UNDEDUP_PKG_EXIT; } if (debug) show_pkg_header(dedup_pkg_hdr); offset = dedup_pkg_hdr.metadata_offset; for (i = 0; i < dedup_pkg_hdr.file_num; ++i) { if (lseek(fd, offset, SEEK_SET) == -1) { ret = errno; break; } if (read(fd, &dedup_entry_hdr, DEDUP_ENTRYHDR_SIZE) != DEDUP_ENTRYHDR_SIZE) { ret = errno; break; } ret = undedup_regfile(fd, dedup_entry_hdr, dest_dir, debug); if (ret != 0) break; offset += DEDUP_ENTRYHDR_SIZE; offset += dedup_entry_hdr.path_len; offset += dedup_entry_hdr.block_num * dedup_entry_hdr.entry_size; offset += dedup_entry_hdr.last_block_size; } _UNDEDUP_PKG_EXIT: close(fd); return ret; } void usage() { printf("Usage: undedup [OPTION...] <source file>/n"); printf("/nUnpackage files with deduplicaton technique./n"); printf("Mandatory arguments to long options are mandatory for short options too./n"); printf(" -z, --uncompress filter the archive through uncompress/n"); printf(" -c, --directory change to directory, default is PWD/n"); printf(" -d, --debug print debug messages/n"); printf(" -h, --help give this help list/n"); printf("/nReport bugs to <Aigui.Liu@gmail.com>./n"); } int main(int argc, char *argv[]) { char tmp_file[] = "./.dedup/0"; char path[MAX_PATH_LEN] = "./0"; int bz = 0, bhelp = 0, bdebug = 0; int ret = -1, c; struct option longopts[] = { {"compress", 0, &bz, 'z'}, {"directory", 1, 0, 'c'}, {"debug", 0, &bdebug, 'd'}, {"help", 0, &bhelp, 'h'}, {0, 0, 0, 0} }; while ((c = getopt_long (argc, argv, "zc:dh", longopts, NULL)) != EOF) { switch(c) { case 'z': bz = 1; break; case 'c': sprintf(path, "%s", optarg); break; case 'd': bdebug = 1; break; case 'h': case '?': default: bhelp = 1; break; } } if (bhelp == 1 || (argc - optind) < 1) { usage(); return 0; } if (bz) { /* uncompress and undedup */ ret = zlib_decompress_file(argv[optind], tmp_file); if (ret == 0) { ret = undedup_package(tmp_file, path, bdebug); unlink(tmp_file); } } else { /* only undedup */ ret = undedup_package(argv[optind], path, bdebug); } return ret; }
/* dedup usage */
Usage: dedup [OPTION...] <target file> <source files ...>

Package files with deduplicaton technique.

-z, --compress   filter the archive through compress
-b, --block      block size for deduplication, default is 4096
-t, --hashtable hashtable backet number, default is 10240
-d, --debug      print debug messages
-h, --help       give this help list

/* undedup usage */
Usage: undedup [OPTION...] <source file>

Unpackage files with deduplicaton technique.

-z, --uncompress filter the archive through uncompress
-c, --directory   change to directory, default is PWD
-d, --debug       print debug messages
-h, --help        give this help list

4、初步测试
这里使用linux最新的kernel源码进行测试，并与tar工具进行比较。从www.kernel.org 下载linux-2.6.32.tar.gz文件，并解压出源文件，然后分别使用tar和dedup工具进行打包，分别得到以下几个文件。

Filename	File size	commands
linux-2.6.32.tar	382392320 (365MB)	tar cvf linux-2.6.32.tar linux-2.6.32/
linux-2.6.32.tar.dd	380381944 (363M)	dedup linux-2.6.32.tar.dd linux-2.6.32.tar
linux-2.6.32.dd	357325910 (341MB)	dedup linux-2.6.32.dd linux-2.6.32/
linux-2.6.32.tar.gz	84322110 (81MB)	gzip -c linux-2.6.32.tar > linux-2.6.32.tar.gz
linux-2.6.32.tar.dd.gz	83978234 (81MB)	gzip -c linux-2.6.32.tar.dd > linux-2.6.32.tar.dd.gz
linux-2.6.32.dd.gz	83674306 (80MB)	gzip -c linux-2.6.32.dd > linux-2.6.32.dd.gz

linux-2.6.32.tar.gz解压出来的kernel源码文件数据很多，使用这个文件来测试应该具有普遍的意义。通过初步的测试结果，我们可以看出，即使在这样不明确数据是否具备较高重复率的情况下，dedup技术也能较明显地减少数据包的数据量。在数据重复率很高的测试用例下，比如全0或全1的大文件，dedup要远远优于tar。比如，全0的64MB文件，tar+gzip的结果为65KB，而dedup的结果才有286字节。

5、TODO
1、变长数据块。目前是定长数据块的实现，技术上较为简单，变长数据块可能会获得更高的数据压缩率。
2、相似文件识别。如果两个文件只有很小的差别，比如在某处插入了若干字节，找出这些数据块并单独处理，可能会提高数据压缩率。

秒客网

基于Dedup的数据打包技术

相关文章