Duplicate Transfer Detection (DTD)

The Duplicate Transfer Detection (DTD) project aims to tackle the problem of content aliasing on the WWW, in conjunction with the Duplicate Storage Avoidance (DSA) project.

DTD is an extension of DSA that allows Squid to minimize bandwidth usage on the proxy-to-server link. To achieve this, a DTD-enabled proxy first sends a HEAD request to the origin server on a cache miss. The server then returns a Content-MD5 header containing the MD5 digest of the content at the requested URL. (This may not be the default behavior for every web server, but it can be enabled in Apache, for example, by adding "ContentDigest on" to httpd.conf.)
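
For illustration, a cache-miss exchange might look like the following (the URL, host, and digest value are hypothetical; per RFC 1864, the Content-MD5 value is the base64 encoding of the 128-bit MD5 digest of the body):

    HEAD /images/logo.png HTTP/1.1
    Host: www.example.com

    HTTP/1.1 200 OK
    Content-Type: image/png
    Content-Length: 21984
    Content-MD5: HUXZLQLMuI/KZ5KDcJPcOA==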

If the Content-MD5 value matches that of some object in the disk cache, this is considered a DTD hit. In this case we do not need to issue the subsequent GET request. If no cached object matches the Content-MD5 value, we have a DTD miss and must issue the GET request to obtain the content.
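
As a rough sketch of the hit test (the names here are illustrative, not Squid's actual store API), suppose the cache records the base64 Content-MD5 value of each stored body:

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical per-object record keyed by the body's digest. */
    typedef struct {
        char content_md5[25];   /* 24 base64 chars + NUL */
        /* ... storage details elided ... */
    } CachedObject;

    /* Returns the matching cached object (DTD hit) or NULL (DTD miss).
     * A real implementation would index the digests in a hash table
     * rather than scanning linearly. */
    static CachedObject *dtdLookup(CachedObject *objs, size_t n,
                                   const char *md5)
    {
        size_t i;
        for (i = 0; i < n; i++)
            if (strcmp(objs[i].content_md5, md5) == 0)
                return &objs[i];    /* DTD hit: skip the GET */
        return NULL;                /* DTD miss: issue the GET */
    }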

Goal

This project is a result of the Duplicate Transfer Detection (DTD) research project I collaborate on with Terence Kelly and Jeff Mogul. Most of the effort for now will be directed toward making this research project a success.

Unlike the DSA project, it is not obvious which implementation is best for DTD (hence this research project). We will try to explore as many different implementations as we can, and then decide on the best approach based on performance and/or ease of deployment.

There is no plan to incorporate DTD into Squid until we know what we want to implement.

Current status (9/14/2003)

Design Decisions

  1. DTD algorithm:
        if cache_hit then
            if fails_refresh_check then
                send IMS request
                if not_modified then
                    /* cache refresh hit */
                else
                    /* proceed as in the cache miss case below */
        else /* cache miss */
            send HEAD request
            if the reply carries a Content-MD5 header then
                if the digest matches a cached object then
                    /* DTD hit, still subject to the normal refresh check */
                else /* DTD miss */
                    send GET request
            else
                use GET to obtain the content
    
  2. Added a content_md5 field to the HttpReply struct, along with code in HttpReply.c to parse the Content-MD5 header (sketched after this list).
  3. Now every GET request becomes DTD-enabled, i.e. it is first translated into a HEAD request, and depending on whether we get a DTD hit, we may or may not issue the GET request after the HEAD request. Therefore, I added a new was_get field to the request_flag struct to record that a request was originally a GET even though it becomes a HEAD at some point.
  4. The crux of this hack lies in clientHandleHEADReply, a new function added to client_side.c. At the end of clientProcessMiss, we turn the GET request into a HEAD and set the was_get flag. Then we call storeClientCopy to schedule clientHandleHEADReply to handle the HEAD reply from the server.
  5. clientCreateStoreEntry no longer calls storeClientCopy to schedule clientSendMoreData. The equivalent call is moved to where we need to issue the GET request after a DTD miss.
  6. clientHandleHEADReply basically does what is described in the miss case of the DTD algorithm. In the case of a DTD hit, we reset the StoreEntry, log the log_type as LOG_TCP_HIT, set the store_client type to STORE_DISK_CLIENT, and finally schedule clientCacheHit with storeClientCopy. In the case of a DTD miss or an invalid/missing Content-MD5 header, we schedule clientSendMoreData with storeClientCopy. (The overall flow is sketched after this list.)
  7. The GET request is now issued when we reforward, i.e. when fwdReforward returns 1. To do this, if was_get is set, we set the request method back to GET and return 1; Squid will then automatically issue the GET request.
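
To make item 2 concrete, here is a minimal sketch of the new reply field and its parsing; the field layout and the helper name are illustrative stand-ins for the actual changes to HttpReply.c:

    #include <string.h>

    /* Sketch of the new field on the reply structure (item 2); the real
     * HttpReply struct in Squid carries many more members. */
    typedef struct {
        /* ... existing reply members elided ... */
        char content_md5[32];   /* base64 Content-MD5 value, empty if absent */
    } HttpReplySketch;

    /* Illustrative parser: keep the raw header value so later code can
     * compare it against the digests of cached objects. */
    static void httpReplyParseContentMd5(HttpReplySketch *rep, const char *value)
    {
        if (value != NULL && strlen(value) == 24)   /* base64 of 16 bytes */
            strcpy(rep->content_md5, value);        /* fits: 24 chars + NUL */
        else
            rep->content_md5[0] = '\0';             /* missing or malformed */
    }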
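
And here is a compressed sketch of the control flow from items 3 through 7. Apart from was_get and the METHOD_GET/METHOD_HEAD constants, the names are simplified placeholders for the Squid structures and calls described above:

    /* Placeholder request type; the real request_t lives in structs.h. */
    typedef enum { METHOD_GET, METHOD_HEAD } method_t;
    typedef struct {
        method_t method;
        struct { unsigned int was_get:1; } flags;   /* new flag (item 3) */
    } Request;

    /* Item 4: at the end of clientProcessMiss, downgrade the GET to a
     * HEAD and remember the original method; storeClientCopy then
     * schedules clientHandleHEADReply on the HEAD reply. */
    static void dtdDowngradeToHead(Request *r)
    {
        if (r->method == METHOD_GET) {
            r->method = METHOD_HEAD;
            r->flags.was_get = 1;
        }
    }

    /* Item 6: decide between the hit and miss paths. */
    static void dtdHandleHeadReply(int digest_matches)
    {
        if (digest_matches) {
            /* DTD hit: reset the StoreEntry, log LOG_TCP_HIT, set the
             * store_client type to STORE_DISK_CLIENT, and schedule
             * clientCacheHit via storeClientCopy. */
        } else {
            /* DTD miss or unusable Content-MD5: schedule
             * clientSendMoreData; the GET is reissued through the
             * reforward path below. */
        }
    }

    /* Item 7: in fwdReforward, restore the method and return 1 so that
     * Squid reissues the request, now as a GET. */
    static int dtdReforward(Request *r)
    {
        if (r->flags.was_get) {
            r->method = METHOD_GET;
            return 1;
        }
        return 0;
    }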

Future Work

  1. There are many ways to implement DTD. In this project I consider only the approach that requires no modifications or extensions to the HTTP protocol. Other DTD projects will be created as the research moves ahead. There are some other alternatives/extensions we are currently considering.

Related Publications

  1. Aliasing on the World Wide Web: Prevalence and Performance Implications, Terence Kelly and Jeff Mogul. In Proceedings of The Eleventh International World Wide Web Conference, Honolulu, Hawaii, 7-11 May 2002.
  2. Increasing Effective Link Bandwidth by Suppressing Replicated Data, Jonathan Santos and David Wetherall. In Proceedings of the 1998 USENIX Annual Technical Conference, New Orleans, Louisiana, June 15-19, 1998.
