CZIP - A Portable Content-based Compression Scheme

What Is It?

CZIP is a general-purpose compression scheme, like GZIP and BZIP2, that is designed to exploit content-based naming (CBN). CBN is a well-known technique that recognizes identical blocks in an arbitrary collection of data by comparing their content fingerprints (e.g., SHA-1 or MD5 hashes). The basic idea is to name each piece of content by its fingerprint and to assume that two pieces of content are the same if their fingerprints match. CBN relies on the fact that, in practice, the probability of two different contents having the same hash value is negligibly small (e.g., 2^-160 for SHA-1), and it has been applied in a number of research and commercial systems.
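As a rough illustration of the CBN idea (not CZIP's actual code), the following Python sketch names blocks by their SHA-1 fingerprint and stores each distinct block only once. The names `content_name`, `put`, and `store` are hypothetical and chosen only for this example.

```python
import hashlib

def content_name(data: bytes) -> str:
    """Name a block by the hex SHA-1 digest of its content."""
    return hashlib.sha1(data).hexdigest()

# Hypothetical in-memory store: content name -> block bytes.
store: dict[str, bytes] = {}

def put(data: bytes) -> str:
    """Store a block under its content name; identical blocks are stored once."""
    name = content_name(data)
    store.setdefault(name, data)  # no-op if this content is already present
    return name

a = put(b"hello world")
b = put(b"hello world")    # same content, so same name and no new entry
c = put(b"hello world!")   # different content, so a different name
```

Two blocks are treated as identical exactly when their names match, which is the assumption CBN makes.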

The primary purpose of CZIP is to make the benefits of CBN easy to obtain by hiding the gory details behind the abstraction of a compression scheme. CZIP exposes the chunk hashes in the header of its file format, which allows any CZIP-aware program to apply content-based caching. In an environment where substantial cross-file content redundancy is expected, CZIP can greatly reduce physical memory consumption (as well as disk space) by eliminating content duplicated across similar files.

How Does It Work?

Here is how CZIP compresses a given object.

CZIP divides the content into chunks. It supports both fixed-size chunking and content-defined chunking based on Rabin fingerprints.
Each chunk is (internally) named and referred to by its content hash, and only the unique chunks are kept. The default hash function is SHA-1, but other hash functions can be supported as well.
CZIP writes the information for each chunk in the header (or in the footer if necessary). This information includes the content hash, the byte offset of the chunk, the hashing algorithm applied, and so on. CZIP-aware programs read this information to quickly determine which chunks are missing from their cache.
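The three steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the CZIP implementation: it uses fixed-size chunking only (Rabin fingerprinting is omitted), the `ChunkInfo` fields and the function names are hypothetical, and the offset is assumed to refer to the position of the chunk in the original object. The actual on-disk layout of CZIP is not described here and is not modeled.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class ChunkInfo:
    digest: str               # hex SHA-1 of the chunk content
    offset: int               # assumed: byte offset in the original object
    length: int               # chunk length in bytes
    algorithm: str = "sha1"   # hashing algorithm applied

def fixed_chunks(data: bytes, size: int = 4096):
    """Step 1: split the content into fixed-size chunks."""
    for off in range(0, len(data), size):
        yield off, data[off:off + size]

def compress_sketch(data: bytes, size: int = 4096):
    """Steps 2 and 3: name chunks by hash, keep unique ones, build a header."""
    header: list[ChunkInfo] = []
    unique: dict[str, bytes] = {}
    for off, chunk in fixed_chunks(data, size):
        digest = hashlib.sha1(chunk).hexdigest()
        header.append(ChunkInfo(digest, off, len(chunk)))
        unique.setdefault(digest, chunk)  # duplicates are stored only once
    return header, unique

def reconstruct(header, unique) -> bytes:
    """Rebuild the original object from the header and the unique chunks."""
    return b"".join(unique[info.digest] for info in header)

data = b"A" * 4096 + b"B" * 4096 + b"A" * 4096 + b"C" * 100
header, unique = compress_sketch(data)
```

In this example the first and third chunks have the same content, so the header lists four chunks while only three are actually kept. A real implementation would presumably also compress the unique chunks; that step is left out here.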

This approach has several benefits:

Unlike previous approaches, no system-wide upgrade is needed to benefit from CBN. CZIP simply defines a standard compression format, and all the benefits come transparently from user-level sharing of common data.
It integrates easily with HTTP object caching. By implementing a server-side module, a Web server can cache similar files by their content, dramatically reducing its run-time memory footprint. By the same token, a client-side CZIP cache can benefit from CBN independently of the server-side module.
Even if a program does not understand CZIP, the file format still provides a compression benefit, as with GZIP or BZIP2.
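To make the caching benefit concrete, here is a hedged Python sketch of how a CZIP-aware cache might decide which chunks it still needs after reading the hashes from a file's header. The helper names (`digests`, `missing_chunks`), the dictionary cache, and the tiny 4-byte chunk size are assumptions made for illustration; they do not reflect mod_czip or any real CZIP API.

```python
import hashlib

def digests(data: bytes, size: int = 4) -> list[str]:
    """Hypothetical stand-in for reading chunk hashes from a CZIP header."""
    return [hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

def missing_chunks(header_digests: list[str], cache: dict[str, bytes]) -> list[str]:
    """Return the digests not yet in the cache, in order, without repeats."""
    seen: set[str] = set()
    missing: list[str] = []
    for d in header_digests:
        if d not in cache and d not in seen:
            seen.add(d)
            missing.append(d)
    return missing

# Two similar files that share their first two chunks.
file1 = b"aaaabbbbcccc"
file2 = b"aaaabbbbdddd"

# Populate the cache with every chunk of file1.
cache: dict[str, bytes] = {}
for i, d in enumerate(digests(file1)):
    cache[d] = file1[i * 4:(i + 1) * 4]

# Only the chunk that differs needs to be fetched and cached for file2.
need = missing_chunks(digests(file2), cache)
```

After caching `file1`, only one chunk of `file2` is missing, so the two similar files share the memory used for their common content.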

More information can be found in our USENIX'07 paper.

Download

CZIP source code will be available soon.
An Apache module (mod_czip) implementing content-based caching will be available soon.

Please check back again.

Status

CoBlitz understands the CZIP file format and supports content-addressable caching for CZIP'ed objects.

People

KyoungSoo Park
Sunghwan Ihm
Mic Bowman (Intel Research)
Vivek Pai

Please contact KyoungSoo (kyoungso@cs.princeton.edu) if you have any questions.
