CZIP is a general-purpose compression scheme, like GZIP and BZIP2, that is designed to use content-based naming (CBN). CBN is a well-known technique that recognizes identical blocks in an arbitrary collection of data by comparing their content fingerprints (e.g., SHA-1 or MD5 hashes). The basic idea is to name each piece of content by its fingerprint and to assume that two pieces of content are the same if their fingerprints match. In practice, CBN relies on the very small probability (e.g., 2^-160 for SHA-1) that two different pieces of content have the same hash value, and it has been applied in a number of research and commercial systems.
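The CBN idea can be illustrated with a minimal sketch. It is not CZIP's actual implementation: the fixed 4 KB block size and the names `name_blocks` and `rebuild` are assumptions made only for this example. Each block is named by its SHA-1 fingerprint, and blocks with matching fingerprints are stored once.

```python
import hashlib


def name_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and name each by its SHA-1 digest.

    Fixed-size blocks are an assumption made for simplicity here.
    Returns (names, store): the ordered list of fingerprints and a
    dictionary holding one copy of each distinct block.
    """
    store = {}  # fingerprint -> block content, stored only once
    names = []  # ordered fingerprints needed to rebuild the data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha1(block).hexdigest()
        # Blocks with the same fingerprint are assumed to be identical.
        store.setdefault(fingerprint, block)
        names.append(fingerprint)
    return names, store


def rebuild(names, store) -> bytes:
    """Reassemble the original data from its block names."""
    return b"".join(store[fingerprint] for fingerprint in names)
```

For input made of three blocks where the first and third are identical, `names` has three entries but `store` holds only two blocks, and `rebuild` still returns the original bytes.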
The primary purpose of CZIP is to make the benefits of CBN easy to obtain by hiding the gory details behind the abstraction of a compression scheme. CZIP lets any CZIP-aware program apply content-based caching by exposing the chunk hashes in the header of the file format. In an environment where substantial cross-file content redundancy is expected, CZIP can greatly reduce physical memory consumption (as well as disk space) by eliminating content that is duplicated across similar files.
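As a rough sketch of how a CZIP-aware program might use exposed chunk hashes for content-based caching, consider the following. Everything here is hypothetical: `ChunkCache`, `read_object`, `chunk_name`, and the `fetch` callback are illustrative names, and the header is modeled as a plain list of SHA-1 hex digests rather than the real CZIP header layout.

```python
import hashlib


def chunk_name(chunk: bytes) -> str:
    """Name a chunk by its SHA-1 fingerprint (hypothetical helper)."""
    return hashlib.sha1(chunk).hexdigest()


class ChunkCache:
    """Keeps one in-memory copy of each chunk, keyed by its hash."""

    def __init__(self):
        self._chunks = {}
        self.hits = 0
        self.misses = 0

    def get(self, chunk_hash, fetch):
        """Return the chunk, calling fetch(chunk_hash) only on a miss."""
        if chunk_hash in self._chunks:
            self.hits += 1
        else:
            self.misses += 1
            self._chunks[chunk_hash] = fetch(chunk_hash)
        return self._chunks[chunk_hash]


def read_object(header_hashes, cache, fetch) -> bytes:
    """Rebuild an object from the chunk hashes listed in its header.

    header_hashes stands in for the chunk hashes exposed by the file
    header; its real on-disk layout is not modeled here.
    """
    return b"".join(cache.get(h, fetch) for h in header_hashes)
```

When two similar files list the same chunk hash in their headers, the second read is served from the cache, so the shared chunk occupies memory only once.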
Here is how CZIP compresses a given object.
This approach has several benefits:
More information can be found in our USENIX'07 paper.
Please check back again.
CoBlitz understands the CZIP file format and supports content-addressable caching for CZIP'ed objects.
KyoungSoo Park
Sunghwan Ihm
Mic Bowman (Intel Research)
Vivek Pai
Please contact KyoungSoo (kyoungso@cs.princeton.edu) if you have any questions.