Duplicate Storage Avoidance (DSA)

The Duplicate Storage Avoidance (DSA) project aims to tackle the problem of content aliasing on the WWW.

It is well known that, due to mirroring and similar mechanisms, different URLs may actually serve the same content. This phenomenon is called content aliasing. Storing duplicate content is obviously a waste of disk space. In this project, I am trying to solve the problem by using a content digest to index the disk cache.

Besides saving disk space, the project has other side advantages; in particular, it supports the Duplicate Transfer Detection research described in the next section.

Short Term Goal

This project is a result of the Duplicate Transfer Detection (DTD) research project on which I collaborate with Terence Kelly and Jeff Mogul. For now, most of the effort will be directed toward making that research project a success.

As a result, there are two limitations on this project in the short term:

  1. It only supports UFS.
  2. Only the HTTP flow of Squid is guaranteed to keep working; if other protocol flows break, they will not be fixed for now.

Long Term Goal

Once the research project is on track, I will discuss with fellow Squid developers whether it makes sense to incorporate DSA and related work into a new version of Squid.

Current Status (09/14/2003)

Design Decisions

New Data Structures

  1. There will be two layers of entries to represent a URL. The new layer is called InstanceEntry; each InstanceEntry corresponds to a URL, while the StoreEntry of the original Squid now points to the unique payload (a lookup sketch follows this list).
  2. A new instance_table hash table will contain all the InstanceEntry's.
  3. StoreEntry will now be indexed by the MD5 digest of its content.
  4. Each StoreEntry will have an array of pointers to all the InstanceEntry's associated with it.
  5. Each InstanceEntry has a pointer back to its StoreEntry.
  6. The data structures of both InstanceEntry and StoreEntry will now be:
    	struct _InstanceEntry {
    	    hash_link hash; 
    	    time_t timestamp;
    	    time_t expires;
    	    time_t lastmod;
    	    u_short flags;
    	    StoreEntry * p; /* pointer to corresponding StoreEntry */
    	};
    
    	struct _StoreEntry {
    	    hash_link hash;             /* must be first */
    	    MemObject *mem_obj;
    	    RemovalPolicyNode repl;
    	    time_t lastref;
    	    u_short refcount;
    	    u_short flags;
    	    size_t swap_file_sz;
    	    sfileno swap_filen:25;
    	    sdirno swap_dirn:7;
    	    u_short lock_count; /* total lock count across all associated InstanceEntry's */
    	    mem_status_t mem_status:3;
    	    swap_status_t swap_status:3;
    	    MD5_CTX ctx; /* MD5 context for the hash, updated per httpReadReply */    
    	    Array * instances; /* array of InstanceEntry's (aka instances) */
    	};
    
  7. The only valid flags values for InstanceEntry are now KEY_PRIVATE, ENTRY_NEGCACHED, and RELEASE_REQUEST. All flags values are valid for StoreEntry except KEY_PRIVATE and ENTRY_NEGCACHED.
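
To make the two-layer design concrete, here is a minimal sketch of a URL lookup: first in instance_table, then through the back pointer to the shared payload. The helper name storeGetByUrl is hypothetical; hash_lookup and storeKeyPublic are existing Squid routines.

    /* Hypothetical sketch: resolve a URL to its unique payload.
     * instance_table maps URL keys to InstanceEntry's;
     * store_table maps content MD5 digests to StoreEntry's. */
    static StoreEntry *
    storeGetByUrl(const char *url)
    {
        const cache_key *key = storeKeyPublic(url, METHOD_GET);
        InstanceEntry *ie = (InstanceEntry *) hash_lookup(instance_table, key);
        if (ie == NULL)
            return NULL;        /* this URL is not cached */
        return ie->p;           /* the shared payload for all aliases */
    }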

How to calculate the MD5 digest for a StoreEntry

  1. Whenever an InstanceEntry is created, a corresponding StoreEntry without a hash key is also created, since the content digest is not known yet.
  2. The StoreEntry hash key is updated (MD5Update) before we call storeAppend in httpReadReply of http.c.
  3. The StoreEntry hash key is finalized (MD5Final) when httpPconnTransferDone returns true in httpReadReply of http.c (see the sketch after this list).
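
Since the digest must be computed incrementally as reply data arrives, the pattern is one MD5Update per chunk and a single MD5Final at the end. Below is a self-contained illustration of that pattern using Squid's bundled MD5 API; in DSA itself the chunks are the buffers handed to storeAppend in httpReadReply, and the helper digestChunks exists only for illustration.

    #include <string.h>
    #include "md5.h"    /* Squid's bundled MD5 implementation */

    /* Illustration of the incremental digest pattern used for a
     * StoreEntry: MD5Update per chunk, MD5Final once the transfer
     * is done. */
    static void
    digestChunks(const char *chunks[], int n, unsigned char digest[16])
    {
        MD5_CTX ctx;
        int i;
        MD5Init(&ctx);
        for (i = 0; i < n; i++)
            MD5Update(&ctx, (unsigned char *) chunks[i], strlen(chunks[i]));
        MD5Final(digest, &ctx);     /* the 16 bytes become the hash key */
    }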

Check whether we should keep a StoreEntry

  1. A lookup in store_table is performed when storeCheckCachable returns true in storeSwapOutFileClosed of store_swapout.c.
  2. If the content is already on disk, move the InstanceEntry associations from the new StoreEntry to the old one, and release the new StoreEntry.
  3. If the content is not yet on disk, insert the new StoreEntry into store_table (as sketched below).
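A sketch of this check follows. The function dsaDedupCheck and its exact call site are assumptions; hash_lookup, hash_join, arrayAppend, and storeRelease are existing Squid routines.

    /* Hypothetical sketch: called once newEntry's digest is final
     * and its swap-out has completed. */
    static void
    dsaDedupCheck(StoreEntry * newEntry)
    {
        StoreEntry *old = (StoreEntry *) hash_lookup(store_table, newEntry->hash.key);
        if (old != NULL) {
            int i;
            /* content already on disk: re-point every InstanceEntry
             * at the old StoreEntry, then release the duplicate */
            for (i = 0; i < newEntry->instances->count; i++) {
                InstanceEntry *ie = (InstanceEntry *) newEntry->instances->items[i];
                ie->p = old;
                arrayAppend(old->instances, ie);
            }
            storeRelease(newEntry);
        } else {
            hash_join(store_table, &newEntry->hash);    /* first copy: index it */
        }
    }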

Refresh Check

  1. If the IMS reply is 200, the InstanceEntry-StoreEntry association needs to be moved to a new StoreEntry.
  2. If, after the move, no InstanceEntry is associated with the old StoreEntry, that StoreEntry will be purged, as sketched below. [Another implementation is to keep the StoreEntry and its content and let the removal policy take care of the content. More research is needed to see which approach is better.]
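A minimal sketch of the purge path, assuming the association move has already removed the InstanceEntry from the array:

    /* Hypothetical sketch: purge a StoreEntry once no InstanceEntry
     * references it any more. The bracketed alternative above would
     * instead leave the entry to the removal policy. */
    static void
    dsaPurgeIfOrphaned(StoreEntry * e)
    {
        if (e->instances->count == 0)
            storeRelease(e);    /* payload no longer reachable via any URL */
    }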

Metadata on disk

  1. When squid -z is run, a directory called /cache/instances will now also be created.
  2. Under this directory, the information pertaining to an InstanceEntry is stored.
  3. For each StoreEntry hash key there will be a directory containing all the InstanceEntry metadata related to that particular StoreEntry.
  4. The StoreEntry directories have three levels: the first byte of the key gives the first level, the second byte the second level, and the rest the third level. For example: /cache/instances/25/49/060d268c586990028760761e13e5 (see the path sketch after this list).
  5. The name of an InstanceEntry metadata file will be the InstanceEntry hash key.
  6. The metadata prepended to the actual object will now be the StoreEntry metadata.
  7. When the hash key is finalized, it will be written into the metadata prepended to the content.
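The path layout from items 3-5 can be sketched as follows. The helper dsaInstancePath is hypothetical and assumes the key is already rendered as 32 hex characters (e.g. via storeKeyText):

    #include <stdio.h>

    /* Hypothetical sketch: build the metadata directory path for a
     * StoreEntry key, two hex chars per level for the first two
     * levels, the remaining 28 for the third. Yields paths like
     * /cache/instances/25/49/060d268c586990028760761e13e5 */
    static void
    dsaInstancePath(const char *hexkey, char *path, size_t pathlen)
    {
        snprintf(path, pathlen, "/cache/instances/%.2s/%.2s/%s",
            hexkey, hexkey + 2, hexkey + 4);
    }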

Content Digest Collision

  1. There isn't any code to handle this case yet, but I plan to replace the old entry with the new one, just as Squid handles URL digest collisions (a rough sketch of the plan follows).
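A rough sketch of that plan, with hypothetical naming. Note that a digest hit at swap-out time is normally treated as a duplicate (see the dedup check above), so this replacement path would only matter for true collisions:

    /* Hypothetical sketch of the planned behavior: on a store_table
     * digest collision, evict the resident entry and index the
     * newcomer, mirroring Squid's existing URL-key collision handling. */
    static void
    dsaIndexWithCollisionCheck(StoreEntry * e)
    {
        StoreEntry *old = (StoreEntry *) hash_lookup(store_table, e->hash.key);
        if (old != NULL)
            storeRelease(old);          /* drop the old entry... */
        hash_join(store_table, &e->hash);   /* ...and index the new one */
    }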

Future Work

  1. Explore other design alternatives.
  2. See whether people like this idea enough to include it in a future release of Squid.

Related Publications

  1. Clarifying the Fundamentals of HTTP, Jeff Mogul. In Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, 7-11 May 2002.
  2. Alias-free content-indexed object cache, Peter Mattis, John Plevyak, Matthew Haines, Adam Beguelin, Brian Totty and David Gourley. U.S. Patent #6,292,880.
