From squid-dev-request@ircache.net Fri Jan 8 02:50:13 1999 Message-ID: <3694306D.2C124D1D@hem.passagen.se> Date: Thu, 07 Jan 1999 04:56:29 +0100 MIME-Version: 1.0 Subject: Re: Raw vs. cooked file systems for cache? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit To: Eric Stern From: Henrik Nordstrom Cc: Williams Jon , squid-users@ircache.net, squid-dev@ircache.net Eric Stern wrote: > Cisco cacheengine reportedly uses a cyclic filesystem, which is a VERY > attractive design except that I imagine it is disasterous to your hit > rate. Well, you could probably write back those few objects that should not be wasted without a to high impact on I/O usage. Writing in a cyclic filesystem is close to free provided that you make proper use of buffering, and sequential reads are not bad if planned. This is as long as the cache is big enougth that (real) LRU age is close to the cache cycle. This write back can also be piggy backed from cache hit processing to limit (or even eleminate) the number of freshly referenced objects that are at the tail. A reasonable rule is to write back the object at the head if it is more than 50% back in the file system (shoule be configurable). A cyclic filesystem is also very interesting from a recovery perspective. It is very easy to maintain stable checkpoints, making cache recovery close to instaneous even after power failures. It is also interesting from a RAID perspective. It may be possible to use RAID 5 with a negligible performance impact, as long as the writes are done in chunks big enought to fill the RAID 5 stripe before it is commited. This allows one to build a high-performance fault-tolerant cache box. The next idea popping up in my mind is using a chunked cyclic filesystem with static index logs (once the index of one chunk is written, then is is not rewritten until the chunk is reused). This simplifies things a lot. Using a cyclic filesystem will waste some diskspace, but not terribly much. Maybe in the range of 15%. Hmm.. I may change my opinion on the Squid filesystem issue... ;-) --- Henrik Nordstrom Spare time Squid hacker From squid-dev-request@ircache.net Sat Jan 23 02:51:42 1999 Message-ID: <36A92902.67549FD2@hem.passagen.se> Date: Sat, 23 Jan 1999 02:42:26 +0100 MIME-Version: 1.0 Subject: Re: memory-mapped files in Squid References: <15A9EB778A@mail.lbi.ee> <308FD4FFA@mail.lbi.ee> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit To: Andres Kroonmaa From: Henrik Nordstrom Cc: Carlos Maltzahn , squid-dev@ircache.net Andres Kroonmaa wrote: > In addition, whatever memory cache there is, eventually page flushes > will also be done at ~50/sec, If using a cyclical filesystem where objects are stored lineary as they are received rather than randomly this can be optimized down to 1-5 ops by colescaling the writes into larger sqeuential chunks. > Knowing that disk io is 10msec average, that comes to 880msec per > second average io wait-time. Wouldnt it be nice to cut that in half? Also, for loads of this magnitude asyncronous reads should be used. There is absolutely no point in blocking the whole process for a pagein of a HIT. > In fact, thing are much worse. Kernel is too simple to do any clever > tricks in optimising page flushes. This is because Squid currently uses random I/O in random files, and very few operating systems are optimized for random I/O. > The only way out is to _never_ let squid process block, and implement > all and any disk optimisations inside squid. One of the coolest disk > optimisations you can think of is elevator seeking disk io. You just > can't do that as long as you rely on kernel to do page io. Yes, and it is very easy to implement too. Have one reader thread / spindle calling readv() with a sorted list of blocks to read. but it needs careful thoughts to not waste to much CPU time on sorting. Having only one outstanding readv() operation at a time makes a natural interval for collecting and sorting operations. For writing sequential I/O should be used, either using write threads or chunked mmap() combined with periodic (once per 1/2 second) msync(ASYNC). Which to use of write and mmap depends a bit on the operating system selected, but chances are high that mmap+msync is the most effective on most operating systems. The exception is perhaps where mmap is implemented on top of the disk cache (doubles the amount of memory used). The reason why periodic syncing should be used is to keep the dirty page list short, and to hint the OS into making the writes sequential. No separate writing thread should be required as long as the free page list is properly maintained. Some tuning of minfree may be required to ensure this if short on memory, but in most configurations this shouldn't be needed. Only problem with mmap() is that it does not mix well with simoultaneous reading from the same file (the same mmap() segment is ok, but not the file). > Large files have many-many inodes. Don't forget that 1 inode > describes a max of some 12 direct disk blocks. For larger files > you'd need lots of indirect inodes. The indirect block pointers are full filesystem blocks, not inodes. A 2GB file on a 8K/block FS uses roughtly: * 1 inode * 65 pointer blocks a (ceil 2GB / 8K / ( 8K / 2 ) + 1) * 65536 data blocks Any decent OS should keep those few pointer blocks in cache along with the inode while the file is in active use. > So, you can't avoid indirection. With double-indirect blocks > you'd still have up to 3 disk ops until you really get to > the object. OS caching perhaps makes this look much better, > but for that you'd again need huge amount of cache ram. Not sure if max 256KB / GB is a huge amount of ram.. OS caching should get a very high hit rate on these pages, or you should be looking for another OS. > and that basically requires the use of raw partitions with all the > allocation and free space management implemented in squid. There is no big difference between raw partitions or large files here. The big difference between the two is perhaps in cache management as most OS:es supports uncached raw partition I/O but all requires hinting for file I/O. With todays smart OS cache managers combined with some hinting this shouldn't be a problem regardless of which is used. > of cache ram. If you really want to avoid all overhead, your target > is really 1 disk op per any type of object operation. True. But the question remains on how to do this in an efficient and manageable way. As you may have guessed (from this message, and my previous message on this subject), my vote at the moment is for a multilevel cyclical file system with partially threaded I/O (threaded at a disk/spindle level, not in any way like the current async-io code). /Henrik From squid-dev-request@ircache.net Tue Jan 26 22:00:13 1999 Message-ID: <36AD1BF4.2757E0CA@hem.passagen.se> Date: Tue, 26 Jan 1999 02:35:48 +0100 MIME-Version: 1.0 Subject: Re: memory-mapped files in Squid References: <47B7703DBB@mail.lbi.ee> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit To: Andres Kroonmaa From: Henrik Nordstrom Cc: squid-dev@ircache.net Andres Kroonmaa wrote: > Not sure what you are talking about, but yes, I agree that there are ways > to reduce needed io to 1-2 per objects. But compared to mmap this might > be another thread. I am speaking of io ops much less than one per object for storing, this is by sacrifying some disk space and memory, and doing some things that seems utterly stupid at first but which can be cleverly done to avoid most of the stupidity ;-) The basic idea is a cyclical file system (similar to a huge FIFO) but with some active recoupling to maintain a good hit rate for a cache. > Why? As well we could insist on cutting it by 3 or 4. The calc was as I > said rude and wrong ;) And if we take that avg disk io is 10msec and we > are doing about 88 random ios/sec, I can't see how we could do that in > 440ms/sec... There is no need to use random I/O for storing objects. This is what makes the difference. All writes can be done in large sequencial chunks which keeps drives and I/O buses happy. > By async reads you mean threads? Are we ready talking about threads > as of portable solution? That would be cool of course. The current async-io implementation can't be regarded as very portable, but I beleive threads are if used in the way they are intended to be used, without excessive filedescriptor sharing and other strange things which are bad for both portability and stability. > I'm afraid that if data blocks are not sequential, readv does not give > much benefit, sorted or not. You just avoid OS/app context switches. > To arrange for readv you perhaps waste cpu comparably. And besides, > you can't start writing to network until _all_ blocks in readv are > serviced. Yes, the main idea of readv is to minimise context switches, and as a side effect a kind of disk hartbeat. > I'd imagine this be done as linked-list where you can always find a > right place to insert new request into the chain, and then use regular > read/write. btw, I guess OS does also sort multiple requests somehow. Most OS only does multiple requests when multiple devices are involved, and I was speaking about one I/O thread per device to fuel the OS with parallell I/O load for the devices involved. Some OS:es may support multiple outstanding I/O requests on the same device, and on those it could be beneficial from having multiple reader threads for the same device. > > For writing sequential I/O should be used, either using write threads or > > chunked mmap() combined with periodic (once per 1/2 second) > > I'm afraid, sequential writes can be only possible when there is ample of > free disk space, ie. cache is not full and fragmentation low. You can't hope > on that. Perhaps there are ways to ensure sequential free space, but it > wouldn't be trivial. I think you are stuck in the thinking of a normal filesystem here. It is not that hard to rewrite the recycling policy in such a way that there always is a couple of minutes contigous free space for storing fresh objects. > In addition, I think that when accessing disk via cached interface, you > have no guarantees that disk ops are scheduled in the same order as you > dictate from application. It makes it harder to implement elevator > optimisations. But then again, cached interface to raw partition might > have other benefits. Above you effectively said that elevator optimisation won't give a performance gain.. > What has to be solved is the problem that no single block size fits all. About the only reason to use a block size at all is to keep the file block index a 32 bit integer, and to avoid having small objects cross more physical block boundaries than needed. > Note that mmap seems very attractive for that. But it has many objections. mmap is very attractive for writing sequencially in the main thread without blocking. For random I/O or reading it is not very attractive. > ( btw, msync(ASYNC) is not supported everywhere, eg. FreeBSD-30) Then use a separate sync thread when not supported. No big deal. > Please describe multilevel cyclical FS, or point me to relevant papers. I am planning on this. Need some time (which I don't have) to formulate the ideas and to collect relevant references. Basic idea is that writing is always done sequencially in a cyclical fashion (when the end is reached, restart at the beginning). Unfortunately we most likely needs something extra besides this to avoid throwing away objects that we are interested in keeping in the cache, and for this I have tree ideas which are emerging and I am trying to find a proper name for this. The full working name is something like chunked cyclical multilevel feedback file system which should give an idea of what is involved. multilevel is probably the wrong word here but I havent figured out another word to use for that part yet, maybe garbage collecting is more appropriate for describing that part but the truth is somewhere in between. /Henrik From squid-dev-request@ircache.net Tue Jan 26 22:00:16 1999 Date: Mon, 25 Jan 1999 19:50:28 -0700 (MST) From: Alex Rousskov To: squid-dev@ircache.net Subject: Re: memory-mapped files in Squid In-Reply-To: <36AD1BF4.2757E0CA@hem.passagen.se> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Hi there, I just want to back Henrik's opinion. I think that tuning/optimizing Squid disk I/O based on traditional "regular file system" paradigms is wrong. A design must account for such caching specifics as high tolerance to "lost" data and uneven QoS demands for reads and writes. A cyclic-like cache with, perhaps, a relatively small "protected" area for "popular" objects sounds like a reasonable idea worth investigating. On a side note, elevator-based optimizations are key to a good I/O performance. Seek and rotation are the bottleneck when it comes to raw disk I/O. There is a question on how much control over raw disk request order, raw disk blocks dataplacement, disk caching policies, etc. we have. However, I believe that an "elevator+" that works 75% of the time is still significantly better than random I/O. $0.02, Alex. > chunked cyclical multilevel feedback file system When a [spare time] hacker starts talking like a first-year Ph.D. student deciding on a thesis title, now that's scary! ;-) From squid-dev-request@ircache.net Tue Jan 26 22:00:49 1999 From: "Andres Kroonmaa" Organization: MicroLink Online To: Henrik Nordstrom Date: Tue, 26 Jan 1999 19:00:40 +0300 (EETDST) MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-transfer-encoding: 7BIT Subject: Re: memory-mapped files in Squid CC: squid-dev@ircache.net In-reply-to: <36AD1BF4.2757E0CA@hem.passagen.se> Message-ID: <6651436805@mail.lbi.ee> On 26 Jan 99, at 2:35, Henrik Nordstrom wrote: > The basic idea is a cyclical file system (similar to a huge FIFO) but > with some active recoupling to maintain a good hit rate for a cache. > > I'm afraid, sequential writes can be only possible when there is ample of > > free disk space, ie. cache is not full and fragmentation low. You can't hope > > on that. Perhaps there are ways to ensure sequential free space, but it > > wouldn't be trivial. > > I think you are stuck in the thinking of a normal filesystem here. It is What does it have to do with normal filesystem? I do not mean ufs block suballocation here. In any case you have to manage free space and allocated space, whatever you talk about: be it sectors, tracks, blocks or chunks. Of 10 objects Squid writes to disk, 4 expires in less time than other 6. So in this chunk of 10 we have a hole of 4 that is suboptimal to fill in one-go the next time the cycle wraps. And if we still do, then of these 4 1-2 again expires sooner than the rest, and of the former 6 perhaps another 1-2. You end up with chunk of 10 that has 2-3 small holes in it, and you have to handle this somehow. Else, after some time you have totally random disk access again. Yes you can ensure free space, by dropping objects that are left unexpired in such holy chunks, and overwriting them with new objects, keeping writes sequential. But by this you leave more that 40% of disks unused. Perhaps even worse, depending severely on how fast new objects are coming in, expiring, etc. Or you can move objects around to pack them, freeing space, or use read-modify-write on chunks thus concatenating multiple ios to 2. But I wouldn't call this trivial.. > not that hard to rewrite the recycling policy in such a way that there > always is a couple of minutes contigous free space for storing fresh > objects. If you say so. If you know how to solve the problem of noncontinous free space on a system that has random expiration of objects and without sacrifying some >20% of disk space for that, then its way cool. I'm all ears. I just can't think of a trivial way for that, and thats what I admit. > > In addition, I think that when accessing disk via cached interface, you > > have no guarantees that disk ops are scheduled in the same order as you > > dictate from application. It makes it harder to implement elevator > > optimisations. But then again, cached interface to raw partition might > > have other benefits. > > Above you effectively said that elevator optimisation won't give a > performance gain.. How's that? Let me try again. With raw direct disk io, io is not cached by OS, that is all io is done directly to/from process address space. As long as there are no competing requests to the same disk by other apps, disk ops are scheduled the same order they are requested, and position of disk arm is predictable. The whole elevator seeking optimisation relies on this. With using cached disk interface, IO goes through OS buffer cache, that is every write is buffered and delayed, every read is cached, but immediately scheduled. You don't have anymore strict predictability of disk arm location over the disk platters. Even if OS schedules writes in the same order they were presented, competing reads would move disk arms randomly. This contradicts elevator optimisation, and you effectively rely on OS to do the optimisation. If you meant my comment on readv, then I was not saying it makes elevator seeking useless, I meant to say that there is no need for using readv. I don't think OS/app context switch overhead is so big that you need to cluster these calls together into one syscall. I meant to say that there is no big difference whether you readv multiple blocks in a single syscall, or read multiple blocks in multiple non-blocking syscalls. > Basic idea is that writing is always done sequencially in a cyclical > fashion (when the end is reached, restart at the beginning). > Unfortunately we most likely need something extra besides this to avoid > throwing away objects that we are interested in keeping in the cache, This is exactly what I meant by being nontrivial... > and for this I have tree ideas which are emerging and I am trying to > find a proper name for this. The full working name is something like > chunked cyclical multilevel feedback file system which should give an > idea of what is involved. multilevel is probably the wrong word here but > I havent figured out another word to use for that part yet, maybe > garbage collecting is more appropriate for describing that part but the > truth is somewhere in between. Hmm. I'm really interested in hearing more on that. Sounds promising... ---------------------------------------------------------------------- Andres Kroonmaa mail: andre@online.ee Network Manager Organization: MicroLink Online Tel: 6308 909 Tallinn, Sakala 19 Pho: +372 6308 909 Estonia, EE0001 http://www.online.ee Fax: +372 6308 901 ---------------------------------------------------------------------- From squid-dev-request@ircache.net Thu Jan 28 21:15:50 1999 Message-ID: <199901281409.BAA16887@koro.off.connect.com.au> To: squid-dev@ircache.net From: Kevin Littlejohn Reply-To: Kevin Littlejohn Subject: Re: memory-mapped files in Squid In-reply-to: Your message of "Thu, 28 Jan 1999 20:03:28 +1100." <199901280903.UAA27315@nara.off.connect.com.au> References: <199901280903.UAA27315@nara.off.connect.com.au> <19990128102411.20065@is.co.za> <36AD1BF4.2757E0CA@hem.passagen.se> <6651436805@mail.lbi.ee> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-ID: <16880.917532544.1@koro.off.connect.com.au> Date: Fri, 29 Jan 1999 01:09:05 +1100 >>> Oskar Pearson wrote > Hi > > Reiserfs is optimised for many small files. It doesn't use block > allocations (if I remember correctly), so fragmentation is not really an > issue. The two problems are that it's: > > 1) complex. We don't really want this in Squid unless we know that it works > 2) best implemented at the kernel level. > 3) very processor intensive > Well, on the heels of that... I've been waiting, for part of this week, for a better web site to come up. It's taking too long, and I really want to get what code I have here out for people to pick on, so... http://www.bofh.net.au/~darius/squidfs contains a squidfs.tgz file. This has the code I've been hacking at. Some caveats: 1) it's not hooked into squid yet - as has previously been pointed out, the disk access code in squid is spread through several different places, and is not particularly clean. I have once already adjusted async-io to hook to this - I'm reasonably happy that you could hook either with or without async-io (with is kinda redundant, but might be interesting *shrug*) 2) It's interface needs work. In particular, I've just a few days ago come to the opinion that it should have an identical interface to the standard fs commands (read, write, open, etc). See point one ;) Anyway, some of it is half-way through being switched over. 3) It's probably buggy as all hell. I've been playing with it standalone, using a dd'ed blank file as a mock drive. I've only written small blocks, but it can read and write at least under 4K of data ;) 4) I'm not a C programmer - the code may be a _tiny_ bit crusty... In particular, the types used for fd's and inodes and so forth are a bit scrambled at the moment - I'm in the midst of re-organising them to be sane... That out of the way, I'm putting this forward because I'd love to see something done specific to squid, and I'm constantly running out of time to do it. There's numerous comments through the code about design decisions made - hopefully it's not too bizarre. My big hope, tho, is not that this is a 'perfect squid fs', but that this gives a good framework, both easily adaptable and easily measureable, for doing some serious work on squid-specific fs'es. From the various numbers that have been bandied around, I suspect every second cache is running under very different conditions. Given that, then getting something out there that can be heavily tweaked is probably worthwhile. (Having said that, Stew and I spent a couple of weeks beating up on each other's fs ideas, and I'm reasonably confident that this system, as it was designed if not as it's implemented, will provide good performance, with (hopefully) not too much fragmentation. I think it also addresses the other big points well - it's not complex to a ridiculous extent, and it is very portable - needs threads, and that's about it.) Ah well, anyway, if it's useful to people, good - if not, I don't lose anything by putting it in public (other than having put ugly code out for people to see... ;) Oh, one other thing - that machine is on the far end of a 64K link - please be gentle on it. I'm in the process of getting a better-connected site happening... (he says like people will be mobbing it ;) All feedback gratefully accepted... ;) KevinL (extremely nervous about throwing this code into public - it's nowhere near 'finished' :( --------------- qnevhf@obsu.arg.nh --------------- Kevin Littlejohn, Technical Architect, Connect.com.au Don't anthropomorphise computers - they hate that. From squid-dev-request@ircache.net Thu Jan 28 21:16:10 1999 From: "Andres Kroonmaa" Organization: MicroLink Online To: Alex Rousskov , Henrik Nordstrom Date: Thu, 28 Jan 1999 18:19:57 +0300 (EETDST) MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-transfer-encoding: 7BIT Subject: Re: memory-mapped files in Squid CC: squid-dev@ircache.net References: <6651436805@mail.lbi.ee> In-reply-to: Message-ID: <16B04E605B@mail.lbi.ee> On 26 Jan 99, at 10:16, Alex Rousskov wrote: > On Tue, 26 Jan 1999, Andres Kroonmaa wrote: > > > If you know how to solve the problem of noncontinous free space on a > > system that has random expiration of objects and without sacrifying > > some >20% of disk space for that, then its way cool. I'm all ears. > > I just can't think of a trivial way for that, and thats what I admit. > > The trivial way is a "cyclic" approach. You always write at the current > pointer, incrementing it after write completes. You do not care if the > objects you overwrite expire or not. You do not have any garbage > collection. Just one huge FIFO disk-resident queue. Writes are always > continuous. Inter-object reads for small objects are continues. > Intra-object reads are random (but can be optimized with Scan or whatnot). Yes, of course, mentioned that. But this would be too simplistic. It is hard to believe that fifo as Replacement Policy could be very effective in both reducing network traffic and disk utilisation. Sliding pointer of fifo leaves behind whatever it gets from source, including fast-expiring objects. We can consider expired objects as dead or deleted objects, because when (if ever) they are referenced, they are probably overwritten by refresh. Of 100% fresh objects that we fetch, some part will survive a cycle. Some part is expired long before. But just this part that survives is what increases squid' hit/miss ratio, and we want to keep it. The other part is a waste, and we'd want to replace it first. There are basically 2 strong desires that compete with speed and simplicity: 1) we want to have our disks as full as possible. 90% at least. 2) we want that these disks are full of _useful_ stuff, not waste. So, we have to look at allocation algoritms that integrates both 1) intelligent replacement policy 2) and does not suffer speed loss at high disk utilisation. If we just overwrite useful stuff on every pass, whats left behind? Perhaps some 40% of waste. Having effective use of 60% of available disk space seems very expensive. Performance win on disks can easily transform into performance loss on network. This basically means that we do NOT want to blindly overwrite objects that could be useful and longterm. And this leads us to unavoidable need to handle fragmented free space by either resorting to some randomness in free space reuse or some sort of free space compression and data reorganisation. Both add overhead and there must be found an acceptable compromise between speed-simplicity and intelligence+ effectiveness. btw, this cyclic idea reminded me log-structured FS alot, and I went on digging some papers on that. Of course I realise that squid has much more relaxed requirements on preserving ondisk objects than real FS, but still, there are lots of similar problems to solve, like batching writes, optimising reads and plugging or reordering holes. Henrik, there's a work done that seems extremely related to what you thinking about. You should read that, it may be useful for your work. http://now.cs.berkeley.edu/Papers2/Abstracts/sosp97.html > A non trivial (smart) part is to spend minimal efforts in maintaining > metadata and preserving "popular" objects. The latter may be needed to get > decent hit ratio. IMO, before coding any smartness into Squid, one has to > test it using simple trace-driven simulations because it is not obvious > how many/what objects have to be specially preserved on any decent size > FIFO queue. right. ---------------------------------------------------------------------- Andres Kroonmaa mail: andre@online.ee Network Manager Organization: MicroLink Online Tel: 6308 909 Tallinn, Sakala 19 Pho: +372 6308 909 Estonia, EE0001 http://www.online.ee Fax: +372 6308 901 ---------------------------------------------------------------------- From squid-dev-request@ircache.net Fri Jan 29 01:55:27 1999 Message-ID: <36B0F7A2.2EC8D10D@hem.passagen.se> Date: Fri, 29 Jan 1999 00:49:54 +0100 MIME-Version: 1.0 Subject: Re: memory-mapped files in Squid References: <6651436805@mail.lbi.ee> <16B04E605B@mail.lbi.ee> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit To: squid-dev@ircache.net From: Henrik Nordstrom Cc: Andres Kroonmaa Andres Kroonmaa wrote: > Yes, of course, mentioned that. But this would be too simplistic. > It is hard to believe that fifo as Replacement Policy could be very > effective in both reducing network traffic and disk utilisation. Not if you take into account the past evolution of disk technology. The size/cost ratio is increasing by a far higher rate than the speed/cost ratio and the networkspeed/cost ratio highest of them all. If this persists then we may soon see a situation where disk speed utilization is of far greater significance than size utilization in order to keep up with the network speed. You could say that the current approach in Squid is at one far end of the scale (or rather some of the earlier Squid versions.. see below), and pure FIFO at the other. My thinking at the moment is starting from FIFO and see what can be done to adapt this to the needs of a cache without sacrifying to much disk space. Also, Squid 2 does not really pay much attention expiration, it is almost pure LRU, so it is not that far from a FIFO replacement with respect to objects that are only used once. In fact it is very questionable if the expiration check in the store maintaince routine does anything useful at all other than destabilizing the LRU age (see storeMaintainSwapSpace, but ignore the description which is hopeless out of date). The reason to why Squid preserves expired objects this is that even an expired object is useful in a later refresh check as many objects haven't changed even if expired, so throwing away expired objects is a big waste close to identical to throwing away old (non-expired) objects as both needs to be validated with the origin source, and both are probably not modified. > This basically means that we do NOT want to blindly overwrite objects > that could be useful and longterm. And this leads us to unavoidable I am not exactly proposing to blindly overwrite objects but at the same time I don't agree in that a blind FIFO is to simplistic. There are however several approaches which can be used to preserve useful information without a serious I/O impact. You could say that what I am trying to do is to have a FIFO mimic some of the properties of LRU, and if neccesary with some unused space optimization, but my epasis is on optimized write performance. It would be nice if someone with some real life logs could calculate the ratio of TCP_REFRESH_HIT to TCP_REFRESH_MISS to guide me in my future thoughts. Hmm.. speaking about refreshes. If I am not mistaken Squid ignores using IMS on some classes of objects, even if it has a copy in cache (no content-length and some other criterias). Why is that? To mee It seems like a very stupid thing to do when many servers can be configured to support IMS on generated content (SSI and similar). > this cyclic idea reminded me log-structured FS alot, and I went on .. > Henrik, there's a work done that seems extremely related to what > you thinking about. You should read that, it may be useful for your > work. > http://now.cs.berkeley.edu/Papers2/Abstracts/sosp97.html Thanks. I'll look into that. > > test it using simple trace-driven simulations because it is not obvious > > how many/what objects have to be specially preserved on any decent size > > FIFO queue. On the question of traces: Is store.log fixed in Squid yet, or does it still forget to log some (most) releases? This has some importance in comparing trace driven results to Squid. /Henrik From squid-dev-request@ircache.net Tue Jan 26 22:00:54 1999 Date: Tue, 26 Jan 1999 10:16:32 -0700 (MST) From: Alex Rousskov To: Andres Kroonmaa cc: squid-dev@ircache.net Subject: Re: memory-mapped files in Squid In-Reply-To: <6651436805@mail.lbi.ee> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Tue, 26 Jan 1999, Andres Kroonmaa wrote: > If you know how to solve the problem of noncontinous free space on a > system that has random expiration of objects and without sacrifying > some >20% of disk space for that, then its way cool. I'm all ears. > I just can't think of a trivial way for that, and thats what I admit. The trivial way is a "cyclic" approach. You always write at the current pointer, incrementing it after write completes. You do not care if the objects you overwrite expire or not. You do not have any garbage collection. Just one huge FIFO disk-resident queue. Writes are always continuous. Inter-object reads for small objects are continues. Intra-object reads are random (but can be optimized with Scan or whatnot). A non trivial (smart) part is to spend minimal efforts in maintaining metadata and preserving "popular" objects. The latter may be needed to get decent hit ratio. IMO, before coding any smartness into Squid, one has to test it using simple trace-driven simulations because it is not obvious how many/what objects have to be specially preserved on any decent size FIFO queue. Alex. From squid-dev-request@ircache.net Thu Jan 28 21:12:22 1999 Message-ID: <19990128102411.20065@is.co.za> Date: Thu, 28 Jan 1999 10:24:12 +0200 From: Oskar Pearson To: Andres Kroonmaa Cc: squid-dev@ircache.net Subject: Re: memory-mapped files in Squid Reply-To: oskar@is.co.za References: <36AD1BF4.2757E0CA@hem.passagen.se> <6651436805@mail.lbi.ee> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: <6651436805@mail.lbi.ee>; from Andres Kroonmaa on Tue, Jan 26, 1999 at 07:00:40PM +0300 Hi Reiserfs is optimised for many small files. It doesn't use block allocations (if I remember correctly), so fragmentation is not really an issue. The two problems are that it's: 1) complex. We don't really want this in Squid unless we know that it works 2) best implemented at the kernel level. 3) very processor intensive I still have some testing to do as to speed etc. At some levels it's faster than a netapp nfs server (over nfs). My tests indicated that it's performance dropped dramatically with lots of files. I hope to prove (sometime) that this drop is because of a lack of memory in my test machine (and various other things that I have done wrong that Reiser told me to try) Info/design page http://www.idiom.com/~beverly/reiserfs.html Speed tests (across nfs, mainly) http://www.linux.org.za/oskar/ Oskar From squid-dev-request@ircache.net Thu Jan 28 21:16:26 1999 Date: Thu, 28 Jan 1999 10:12:54 -0700 (MST) From: Alex Rousskov To: Andres Kroonmaa cc: squid-dev@ircache.net Subject: Re: memory-mapped files in Squid In-Reply-To: <16B04E605B@mail.lbi.ee> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Thu, 28 Jan 1999, Andres Kroonmaa wrote: > Of 100% fresh objects that we fetch, some part will survive a cycle. Correct me if I am wrong, but the recommended LRU expiration age is 3 - 7 days. The "cycle" is not something short. If a cache receives 3GB of cachable traffic per day and has 30GB disk cache, it would take 10 days to complete a cycle (just an example). If an object was not referenced twice in 3-7 days, it is useless with a very high probability. Thus, FIFO may work for most of reasonably configured caches, especially if augmented with a small buffer for popular objects (say, those referenced at least twice). > Some part is expired long before. But just this part that survives is > what increases squid' hit/miss ratio, and we want to keep it. The > other part is a waste, and we'd want to replace it first. For a lot (most?) of reasonable configurations, it does not matter much what you are replacing as long as that stuff is several days old. The latter is the reason why all "traditional" caching policies perform the same on any reasonable size cache. Essentially, you are selecting between old garbage and very old garbage. > 1) we want to have our disks as full as possible. 90% at least. > 2) we want that these disks are full of _useful_ stuff, not waste. Sure. However, none of the traditional policies caches useful data! All traditional policies that I know of fill cache with garbage, on average. Provided your cache is of a reasonable size, of course. The real challenge is avoid caching waste in the first place. And not to reduce disk space requirements, but to decrease disk _bandwidth_ contention and hence speedup hits. > So, we have to look at allocation algorithms that integrates both > 1) intelligent replacement policy > 2) and does not suffer speed loss at high disk utilisation. IMO, all reasonable _replacement_ policies will perform the same no matter how intelligent they are. I have not seen a single study that shows an alternative super intelligent policy outperforming LRU-TH by a significant margin on a reasonable size cache. We may need an intelligent _caching_ (or cache admission) policy, but I am afraid we are not ready for that yet. > If we just overwrite useful stuff on every pass, whats left behind? We overwrite garbage on every pass. What is left is garbage. On average, only a small portion of your cache actually gives you hits. Preserving that portion is probably feasible. > Perhaps some 40% of waste. Having effective use of 60% of available > disk space seems very expensive. Performance win on disks can easily > transform into performance loss on network. Perhaps some 80% is waste? 90%? 95%? That is why I said that a few [simple] experiments/calculations are needed to (a) estimate disk "utilization" (portion of useful objects) and (b) average object age with FIFO policy. $0.02, Alex. From squid-dev-request@ircache.net Thu Jan 28 21:16:28 1999 Message-Id: <199901281755.JAA27205@granite.hpl.hp.com> To: Alex Rousskov cc: Andres Kroonmaa , squid-dev@ircache.net Subject: Re: memory-mapped files in Squid X-Organization: Hewlett-Packard Laboratories In-reply-to: Your message of Thu, 28 Jan 1999 10:12:54 -0700. Date: Thu, 28 Jan 1999 09:55:47 -0800 From: John Dilley > Thus, FIFO may work for most of reasonably configured caches, > especially if augmented with a small buffer for popular objects (say, > those referenced at least twice). Like you say below, most replacement policies will do about as well as LRU (plus or minus 10%). The question is whether FIFO will be much more efficient -- will it reduce the disk bottleneck? The web cache workload is heavily write intensive, like 1:3 read:write. This is in contrast to traditional (OLTP or database) workloads which are more like 4:1 read:write. Systems/disks tuned for OLTP will not work well as web caches. You bring up an interesting notion with the small buffer, see below. > > 1) we want to have our disks as full as possible. 90% at least. > > 2) we want that these disks are full of _useful_ stuff, not waste. > > Sure. However, none of the traditional policies caches useful data! All > traditional policies that I know of fill cache with garbage, on average. > Provided your cache is of a reasonable size, of course. And this is not likely to change, due to the nature of the web proxy workload. We and others have found in studies of long-term traces that something like 60% of objects are never re-referenced. So most of the objects that pass through the cache are garbage (i.e., have no value since they are never re-referenced. There was some interesting work at the last Web Cache Workshop from a group in Poland on how to identify these objects (which Martin Arlitt in our group here referred to as "one-timers"). By doing this they were able to reduce disk write rate significantly without reducing much the disk read rate (i.e., they did not suffer much loss in byte hit rate). > We may need an intelligent _caching_ (or cache admission) policy, but > I am afraid we are not ready for that yet. Why are we not ready for that yet? I believe there is room for improvement of replacement policies (and can say more about this at the San Diego workshop, I hope!), and believe there are other opportunities as well. Please help me understand if there are fundamental limitations to being able to address these, though... Thanks! -- jad -- John Dilley From squid-dev-request@ircache.net Fri Jan 29 01:17:33 1999 Date: Thu, 28 Jan 1999 15:41:39 -0700 (MST) From: Alex Rousskov To: John Dilley cc: squid-dev@ircache.net Subject: Re: memory-mapped files in Squid In-Reply-To: <199901281755.JAA27205@granite.hpl.hp.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Thu, 28 Jan 1999, John Dilley wrote: > Like you say below, most replacement policies will do about as > well as LRU (plus or minus 10%). The question is whether FIFO will be > much more efficient -- will it reduce the disk bottleneck? For writes, almost nothing could be more efficient than a SCAN. Cyclic FS or FIFO will emulate SCAN in the absence of reads. Of course, random reads will break the "perfect" order. However, we can still group some writes together and reduce seek and possibly rotation delays. > And this is not likely to change, due to the nature of the web > proxy workload. We and others have found in studies of long-term traces > that something like 60% of objects are never re-referenced. So most of > the objects that pass through the cache are garbage (i.e., have no value > since they are never re-referenced. Exactly. However, one does not have to write all the objects to disks! If there is a way to filter out useless objects, we can reduce disk traffic significantly. We did some "researchy" study on that (published in CNDS'99), and there are a couple of similar papers (you've mentioned the study from Poland). > Why are we not ready for that yet? The techniques proposed usually go beyond simple algorithms that everybody got used to. It is not clear how practical and robust those algorithms are. Humans are usually scared of anything new and controversial. > I believe there is room for > improvement of replacement policies (and can say more about this at the > San Diego workshop, I hope!), and believe there are other opportunities > as well. Please help me understand if there are fundamental limitations > to being able to address these, though... Thanks! IMO, there are no fundamental limitations. It just takes time (and more research) to move this new category of algorithms from research papers to practice. Alex. From oskar@is.co.za Fri Jan 29 11:18:51 1999 Message-ID: <19990129121034.55148@is.co.za> Date: Fri, 29 Jan 1999 12:10:34 +0200 From: Oskar Pearson To: Henrik Nordstrom Subject: Re: memory-mapped files in Squid Reply-To: oskar@is.co.za References: <6651436805@mail.lbi.ee> <16B04E605B@mail.lbi.ee> <36B0F7A2.2EC8D10D@hem.passagen.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: <36B0F7A2.2EC8D10D@hem.passagen.se>; from Henrik Nordstrom on Fri, Jan 29, 1999 at 12:49:54AM +0100 > It would be nice if someone with some real life logs could calculate the > ratio of TCP_REFRESH_HIT to TCP_REFRESH_MISS to guide me in my future > thoughts. Sure. Quite a simple request ;') sh-2.02$ grep TCP_REFRESH_HIT access.log |wc -l 3262 sh-2.02$ grep TCP_REFRESH_MISS access.log |wc -l 1569 sh-2.02$ wc -l access.log 682701 access.log From squid-dev-request@ircache.net Fri Jan 29 22:55:33 1999 Message-ID: <36B1A248.1E40562A@hem.passagen.se> Date: Fri, 29 Jan 1999 12:58:00 +0100 MIME-Version: 1.0 Subject: Re: memory-mapped files in Squid References: <6651436805@mail.lbi.ee> <16B04E605B@mail.lbi.ee> <36B0F7A2.2EC8D10D@hem.passagen.se> <19990129121034.55148@is.co.za> <36B19334.5F7F9254@hem.passagen.se> <19990129131757.21671@is.co.za> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit To: oskar@is.co.za From: Henrik Nordstrom Cc: squid-dev@ircache.net Oskar Pearson wrote: > If I understand this correctly, the major advantage of a fifo buffer is > that there is almost no fragmentation... right? That is one of the advantages, but there are more. * Almost no fragmentation * Writes can be optimized to use as few I/O ops as possible, even crossing object boundaries. * I/O can be fully elevator optimized to minimize seek times. No random seeking. * No or minimal filesystem overhead. A extremely simple layout of the store can be used. * No, or very small block sizes, giving a higher disk size utilization per stored object. * Quite obvious how to make checkpoints, so we can have instant restarts without risking cache corruption even if Squid crashes and leaves the cache in a "dirty" state. > I personally think that a fairly classic filesystem (something like the sfs > code) is probably the way to go. Lets see where a FIFO approach can end up. You have to admit that it is an interesting area to investigate. /Henrik From oskar@is.co.za Fri Jan 29 12:38:22 1999 Message-ID: <19990129131757.21671@is.co.za> Date: Fri, 29 Jan 1999 13:17:57 +0200 From: Oskar Pearson To: Henrik Nordstrom Cc: suqid-dev@ircache.net Subject: Re: memory-mapped files in Squid Reply-To: oskar@is.co.za References: <6651436805@mail.lbi.ee> <16B04E605B@mail.lbi.ee> <36B0F7A2.2EC8D10D@hem.passagen.se> <19990129121034.55148@is.co.za> <36B19334.5F7F9254@hem.passagen.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: <36B19334.5F7F9254@hem.passagen.se>; from Henrik Nordstrom on Fri, Jan 29, 1999 at 11:53:40AM +0100 Hi > Your TCP_REFRESH_HIT ratio is 67% of 0.0070%. > Of those few refreshes that you have there was fewer hits than I > expected, but still enought to suggest that I am thinking in the right > direction. However if the refresh ratio commonly is that small there is > no apparent need to bother with saving disk space on refreshes (less > than 0.5% estimated space saving). If I understand this correctly, the major advantage of a fifo buffer is that there is almost no fragmentation... right? This would mean that the 0.5% waste would be insignificant to the +- 10% of disk space I have to keep free on my disks at the moment to avoid fragmentation. I am not sure of the fifo buffer idea, though. I don't believe that any of the commercial systems are using this... and they have lots of money to throw at getting a filesystem right. I personally think that a fairly classic filesystem (something like the sfs code) is probably the way to go. Oskar --- "Haven't slept at all. I don't see why people insist on sleeping. You feel so much better if you don't. And how can anyone want to lose a minute - a single minute of being alive?" -- Think Twice From squid-dev-request@ircache.net Fri Jan 29 22:55:58 1999 Message-ID: <199901291439.BAA01871@nara.off.connect.com.au> To: Henrik Nordstrom cc: oskar@is.co.za, squid-dev@ircache.net From: Kevin Littlejohn Reply-To: Kevin Littlejohn Subject: Re: memory-mapped files in Squid In-reply-to: Your message of "Fri, 29 Jan 1999 12:58:00 BST." <36B1A248.1E40562A@hem.passagen.se> References: <36B1A248.1E40562A@hem.passagen.se> <6651436805@mail.lbi.ee> <16B04E605B@mail.lbi.ee> <36B0F7A2.2EC8D10D@hem.passagen.se> <19990129121034.55148@is.co.za> <36B19334.5F7F9254@hem.passagen.se> <19990129131757.21671@is.co.za> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-ID: <1864.917620778.1@nara.off.connect.com.au> Date: Sat, 30 Jan 1999 01:39:38 +1100 >>> Henrik Nordstrom wrote > Oskar Pearson wrote: > > > If I understand this correctly, the major advantage of a fifo buffer is > > that there is almost no fragmentation... right? > > That is one of the advantages, but there are more. > > * Almost no fragmentation > * Writes can be optimized to use as few I/O ops as possible, even > crossing object boundaries. > * I/O can be fully elevator optimized to minimize seek times. No random > seeking. I'm not sure how you achieve this - given the request pattern is random? Unless you're planning on collating requests and serving multiples. I had thought any sane fs would keep that in mind when seeking somewhere, but the implementation of this in such a way as to not starve any given request is interesting. No random seeking is a bold claim, any which way. (Incidentally, sfs has been built with 'elevator' writes in mind - I thought about it, and decided that you're always going to do less writes than reads, so simply picking one block out of any blocks between you and the block to read, and committing that on the way, is valid. Could conceivably lead to maxing out the fs cache, but that's addressable fairly easily. Read seeks are not optimised at all - the idea is simply to service each request as immediately as possible. Chains of requests can be dropped onto the service queue, so if you want a whole bunch of blocks from a given file, you can sorta encourage the drive service thread to read them all at once, or at least one after the other. > * No or minimal filesystem overhead. A extremely simple layout of the > store can be used. > * No, or very small block sizes, giving a higher disk size utilization > per stored object. But if you're serving random objects, you still have to hold an index of location somewhere. That index rises in size with the smaller block sizes. Not a killer, but worth bearing in mind. (Where, on a cyclic fs, do you store static information, btw? ;) > * Quite obvious how to make checkpoints, so we can have instant restarts > without risking cache corruption even if Squid crashes and leaves the > cache in a "dirty" state. This is true (and the last thing you want to be doing is a 'fsck' of some sort), but again you need to address the indexes of files. Unless you're pulling the same trick sfs does, and making the location on disk the 'name' of the file as well, and storing that in squid's own index. In which case, pick a _big_ data type, given you want the small block sizes ;) Either way, you need to pay close attention to what happens if the machine goes away at every point of committing a new file and it's relevant meta-data to disk. > > > I personally think that a fairly classic filesystem (something like the sfs > > code) is probably the way to go. > > Lets see where a FIFO approach can end up. You have to admit that it is > an interesting area to investigate. Personally, I'd like to see both. I'm still very sceptical of a circular fs design - I just don't think it matches the usage pattern or the expiry pattern (and yes, I _like_ being able to tweak my refresh patterns, ta very much ;), but if it works, then it's a worthy thing to have. KevinL (fs racing, anyone? ;) --------------- qnevhf@obsu.arg.nh --------------- Kevin Littlejohn, Technical Architect, Connect.com.au Don't anthropomorphise computers - they hate that. From squid-dev-request@ircache.net Fri Jan 29 13:18:53 1999 Message-ID: <36B19D47.7EB9822C@hem.passagen.se> Date: Fri, 29 Jan 1999 12:36:39 +0100 MIME-Version: 1.0 Subject: Re: memory-mapped files in Squid References: <6651436805@mail.lbi.ee> <16B04E605B@mail.lbi.ee> <36B0F7A2.2EC8D10D@hem.passagen.se> <19990129121034.55148@is.co.za> <36B19334.5F7F9254@hem.passagen.se> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit To: oskar@is.co.za From: Henrik Nordstrom Cc: squid-dev@ircache.net Thanks Oskar. Your TCP_REFRESH_HIT ratio is 67% of 0.0070%. Of those few refreshes that you had there was fewer hits than I expected, but still enought to suggest that I am thinking in the right direction. However if the refresh ratio commonly is that small there is no apparent need to bother with saving disk space on refreshes (less than 0.5% estimated space saving). How the ratios shown above is calculated: 67% = TCP_REFRESH_HIT / TCP_REFRESH* 0.0070% = TCP_REFRESH* / ALL_REQUESTS /Henrik From squid-dev-request@ircache.net Fri Jan 29 22:55:52 1999 From: "Andres Kroonmaa" Organization: MicroLink Online To: Henrik Nordstrom Date: Fri, 29 Jan 1999 16:38:55 +0300 (EETDST) MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-transfer-encoding: 7BIT Subject: Re: memory-mapped files in Squid CC: squid-dev@ircache.net In-reply-to: <36B0F7A2.2EC8D10D@hem.passagen.se> Message-ID: <2D01D40D9C@mail.lbi.ee> On 29 Jan 99, at 0:49, Henrik Nordstrom wrote: > Andres Kroonmaa wrote: > > > Yes, of course, mentioned that. But this would be too simplistic. > > It is hard to believe that fifo as Replacement Policy could be very > > effective in both reducing network traffic and disk utilisation. > [..] agree with everything. > > This basically means that we do NOT want to blindly overwrite objects > > that could be useful and longterm. And this leads us to unavoidable > > I am not exactly proposing to blindly overwrite objects but at the same > time I don't agree in that a blind FIFO is to simplistic. There are > however several approaches which can be used to preserve useful > information without a serious I/O impact. You could say that what I am > trying to do is to have a FIFO mimic some of the properties of LRU, and > if neccesary with some unused space optimization, but my epasis is on > optimized write performance. In fact, to mimic LRU should be easy. Whenever we fetch object from disk to service client hit, we can append that object to the write queue, that way moving all accessed objects to the front of LRU. This would make fifo into real LRU. With small objects it could even add little overhead. How are you planning to implement the metadata? We can integrate IO routines quite tightly with squid. This allows us to do tricks ordinary FS's can't, like let squidFS decide the file ID, based on (currently) optimal location. We could also allow squidFS to change the file ID if practical and notify squid about that fact. In fact, we could pack objects much tighter together by using allocation granularity of say 512 bytes. Into file ID code the chunk number and offset, from Squid object map we know the size, and the object itself should have some crosscheck prepended. Hmm, this way we could always write almost any-size object sequentially, assign it a number, and from this number alone we can always find the object on disk, and we need to know only a start of it. Fragmentation totally eliminated. We can manage many chunks concurrently in memory, to try place related objects together, eg. by URL hostname, this way also optimisting read performance. With sufficient ram we can manage hundreds of chunks concurrently. But then we'd need to cope with partially filled chunks.. Anyway, we could just write them all out in sequence, in whatever order, they are still all pretty close to each other, at least within a cylinder or few. I'm starting to like the beauty of its simplicity... > It would be nice if someone with some real life logs could calculate the > ratio of TCP_REFRESH_HIT to TCP_REFRESH_MISS to guide me in my future > thoughts. Bytes Request type Resolve Method Count Count-% Bytes-% 5995707919 TCP_MISS DIRECT 674345 32.68 % 47.71 % 2873310741 TCP_HIT NONE 367157 17.79 % 22.86 % 1348208061 TCP_MISS TIMEOUT_DIRECT 94084 4.56 % 10.73 % 992042854 TCP_MISS SIBLING_HIT 143092 6.93 % 7.89 % 398727919 ERR_CLIENT_ABORT DIRECT 43125 2.09 % 3.17 % 311738853 TCP_CLIENT_REFRESH DIRECT 68694 3.33 % 2.48 % 191591763 TCP_IMS_HIT NONE 147684 7.16 % 1.52 % 152717453 TCP_REFRESH_MISS DIRECT 13799 0.67 % 1.22 % 120455743 TCP_REFRESH_HIT DIRECT 16608 0.80 % 0.96 % 47846534 TCP_MISS NONE 11374 0.55 % 0.38 % But this cache is hardly able to cache a 1-days worth of objects. So the amount of refreshes is expected low. ---------------------------------------------------------------------- Andres Kroonmaa mail: andre@online.ee Network Manager Organization: MicroLink Online Tel: 6308 909 Tallinn, Sakala 19 Pho: +372 6308 909 Estonia, EE0001 http://www.online.ee Fax: +372 6308 901 ---------------------------------------------------------------------- From squid-dev-request@ircache.net Fri Jan 29 22:56:05 1999 From: "Andres Kroonmaa" Organization: MicroLink Online To: Henrik Nordstrom Date: Fri, 29 Jan 1999 16:54:13 +0300 (EETDST) MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-transfer-encoding: 7BIT Subject: Re: memory-mapped files in Squid CC: squid-dev@ircache.net In-reply-to: <36B19D47.7EB9822C@hem.passagen.se> Message-ID: <2D434B06D8@mail.lbi.ee> On 29 Jan 99, at 12:36, Henrik Nordstrom wrote: > Thanks Oskar. > > Your TCP_REFRESH_HIT ratio is 67% of 0.0070%. > > Of those few refreshes that you had there was fewer hits than I > expected, but still enought to suggest that I am thinking in the right > direction. However if the refresh ratio commonly is that small there is > no apparent need to bother with saving disk space on refreshes (less > than 0.5% estimated space saving). thats not that simple. refreshes happen only when the object is requested another time. refresh_hit is almost as pure win as plain tcp_hit. Squid just decided to go and check with the source. This can only speak about the fact that this object has been in cache for some time and/or refresh_rules required squid to refresh-check. This rule applies also to objects that are requested 1000 times a day, so that they remain on the top-10 of the LRU, but its just their age that requires squid to do refreshes after some time. some very old objects that are reused every 9 weeks can on the other hand give plain tcp_hit, and so are not covered with these stats. Very many gifs can be fresh for ages, and there's no need to refresh them before refresh_max time psaaes And, if the whole cache is flooded in a day-or-two, there's almost no good reason for these refreshes to appear. ---------------------------------------------------------------------- Andres Kroonmaa mail: andre@online.ee Network Manager Organization: MicroLink Online Tel: 6308 909 Tallinn, Sakala 19 Pho: +372 6308 909 Estonia, EE0001 http://www.online.ee Fax: +372 6308 901 ---------------------------------------------------------------------- From squid-dev-request@ircache.net Fri Jan 29 22:56:10 1999 Date: Fri, 29 Jan 1999 08:17:38 -0700 (MST) From: Alex Rousskov To: Oskar Pearson cc: squid-dev@ircache.net Subject: Re: memory-mapped files in Squid In-Reply-To: <19990129131757.21671@is.co.za> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Fri, 29 Jan 1999, Oskar Pearson wrote: > I am not sure of the fifo buffer idea, though. I don't believe that any of > the commercial systems are using this... and they have lots of money to > throw at getting a filesystem right. Perhaps they are throwing them in the wrong direction? Like optimizing a traditional file system? :) Alex. From squid-dev-request@ircache.net Sat Jan 30 18:24:57 1999 Message-ID: <36B24694.534A25F1@hem.passagen.se> Date: Sat, 30 Jan 1999 00:39:00 +0100 MIME-Version: 1.0 Subject: Re: memory-mapped files in Squid References: <36B1A248.1E40562A@hem.passagen.se> <6651436805@mail.lbi.ee> <16B04E605B@mail.lbi.ee> <36B0F7A2.2EC8D10D@hem.passagen.se> <19990129121034.55148@is.co.za> <36B19334.5F7F9254@hem.passagen.se> <19990129131757.21671@is.co.za> <199901291439.BAA01871@nara.off.connect.com.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit To: Kevin Littlejohn From: Henrik Nordstrom Cc: squid-dev@ircache.net Kevin Littlejohn wrote: > I'm not sure how you achieve this - given the request pattern is random? Reads are random, but elevator seeking eleminates many stupid seeks. Yes, there still is some randomness, but with a non-random pattern. Elevator seeking takes into account that disks are fast for requests on the same or a adjacent track, but slow on longer seeks. The cyclic nature also guarantees that no request starves. Elevator cycle time is also a good measure on how loaded the disk is. It may be possible to cluster related objects and use intelligent prefetching for hits to further minimize seeks on cache hits by sacrifying some memory. In a FIFO cache chances are high that related objects are close to each other. /Henrik From squid-dev-request@ircache.net Sat Jan 30 18:24:56 1999 Message-ID: <36B26009.33DD710E@hem.passagen.se> Date: Sat, 30 Jan 1999 02:27:37 +0100 MIME-Version: 1.0 Subject: Re: memory-mapped files in Squid References: <2D01D40D9C@mail.lbi.ee> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit To: Andres Kroonmaa From: Henrik Nordstrom Cc: squid-dev@ircache.net Andres Kroonmaa wrote: > In fact, to mimic LRU should be easy. Whenever we fetch object from disk > to service client hit, we can append that object to the write queue, that > way moving all accessed objects to the front of LRU. This would make fifo > into real LRU. With small objects it could even add little overhead. Sounds like you have read my first message on cyclic filesystems, sent some weeks ago ;-) Appending a HIT object to the write queue is almost for free with respect to service time. The object is already read in, and we probably need to write out other objects as well. To avoid having frequently requested objects taking up a unproportionably large portion of the cache my idea is to rewrite hits if they are located more than a certain (configurable) distance from the head. > How are you planning to implement the metadata? Ok. you haven't read the first message ;-) I'll try to summarise all current thoughts so far: * To minimize latency caused by writes, writes should be done sequentially in larger units (1MB at a time or something similar). * To be able to guarantee that writes can be done sequentially a FIFO type storage policy should be used, possibly with extensions to avoid throwing away objects we want to keep (probably) and to defragment unused space (less likely, early estimates puts this in the range of 2-5% and it is non-trivial to implement). * Also, with a FIFO style replacement policy, it is very hard to justify the overhead incurred by a standard filesystem. Probably better to use raw partitions, both from a I/O performance level and how much memory that is required by the OS for caches/buffers. * Storage space is maintained in larger chunks (lets say in the range of 10 minutes of data). * Disk store index is physical location. How closely objects can be packed inside each chunk is determined by the index key size, and key size is in turn determined by the amount of memory we can spend on it. A estimate is that 64 bits will be used, of which 16 is "cache_dir", but we may get by with 48 bits (40 bits index, 8 bits "cache_dir"). This is a increase by 4 bytes on 32 bit machines, but as no LRU list is maintained another 12 bytes (24 on 64bit machines) is saved by not having a dlink_node so memory requirements is actually less per GB of disk. On a second thought the dlink_node may be needed to cluster objects to their storage chunk, so it is perhaps not possible to eliminate this, so worst case is 4 bytes more per store entry on 32-bit machines and best case is 10 bytes less. * On-disk metadata is kept separately for each chunk together with the data, The exception is when objects has to be released without being replaced by another one. When this happens a release entry is kept in a later chunk (the chunk where Squid found that it needs to remove the object from cache by some reason). Another exception is when a (large) object crosses chunk boundaries, here the metadata log entry points to a earlier chunk. * On disk metadata for a storage chunk is static for the lifetime of the chunk. Once written it is not touched. * Storage space is reused one chunk at a time. * To mimic LRU, reused objects are rewritten at the head if old enough. A plus from this is that validated/refreshed objects can get their headers updated correctly even if the object isn't changed. * Checkpoints are kept as two storage chunk pointers: last fully completed storage chunk, and the oldest non-reused chunk. These checkpoints may actually be implemented as timestamps in the chunks themselves. * In-memory index are rebuilt by reading the meta data indexes from the storage blocks, beginning with the oldest. * One-way elevator optimization is used for I/O, possibly with periodic direction change to have an even wear on the drive heads. The reasons why one-way elevator optimization is used instead of two way are: 1. it gives a more even service time, 2. pending blocks are likely very sparse at the turn point, and if we are doing a multi-track seek then we may just as well seek over the whole drive. * Chaining of object fragments is needed to store larger objects where we could not afford to wait for the whole object until storing it on disk. * A write object queue buffer should be used to order the written objects somewhat. This is for avoiding storage of aborted objects, to limit object fragmenting/interleaving and perhaps to cluster related objects. * The current store object format already contains a certain level of metadata redundancy. There is probably no need to extend this further to reliably detect when things go wrong. /Henrik From squid-dev-request@ircache.net Mon Feb 1 22:19:51 1999 From: "Andres Kroonmaa" Organization: MicroLink Online To: Henrik Nordstrom Date: Mon, 1 Feb 1999 21:46:42 +0200 EET MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-transfer-encoding: 7BIT Subject: Re: memory-mapped files in Squid CC: squid-dev@ircache.net In-reply-to: <36B26009.33DD710E@hem.passagen.se> Message-ID: <44D40A62DE@mail.lbi.ee> On 30 Jan 99, at 2:27, Henrik Nordstrom wrote: > Sounds like you have read my first message on cyclic filesystems, sent > some weeks ago ;-) I guess i missed it then. In recent discussion I was just objecting throwing away objects from fifo tail, and you weren't clear enough ;) Only after I got a hint on how to make fifo into LRU, it occured to me... > To avoid having frequently > requested objects taking up a unproportionably large portion of the > cache my idea is to rewrite hits if they are located more than a certain > (configurable) distance from the head. Hmm. what you mean by "taking up large portion of cache"? I imagine that when you rewrite accessed object onto fifo head, its space at its old location is released? Aah, you mean that when some object hits 10/sec then it shouldn't be rewritten to the fifo head at the same rate? > * To be able to guarantee that writes can be done sequentially a FIFO > type storage policy should be used, possibly with extensions to avoid > throwing away objects we want to keep (probably) and to defragment > unused space (less likely, early estimates puts this in the range of > 2-5% and it is non-trivial to implement). btw, how did you estimate? > * Also, with a FIFO style replacement policy, it is very hard to justify > the overhead incurred by a standard filesystem. Probably better to use > raw partitions, both from a I/O performance level and how much memory > that is required by the OS for caches/buffers. Regarding fifo ontop of ufs, absolutely. But it may have some benefits to use OS-buffered raw partitions. As I understand, unbuffered partitions can't be poll()ed (always ready and block on read/write) while buffered can (OS handles disk access scheduling and write buffering). This could allow us continue to use select() style coding without threads if needed. > * Disk store index is physical location. How closely objects can be > packed inside each chunk is determined by the index key size, and key > size is in turn determined by the amount of memory we can spend on it. A > * Checkpoints are kept as two storage chunk pointers: last fully > completed storage chunk, and the oldest non-reused chunk. These > checkpoints may actually be implemented as timestamps in the chunks > themselves. > * In-memory index are rebuilt by reading the meta data indexes from the > storage blocks, beginning with the oldest. > * Chaining of object fragments is needed to store larger objects where > we could not afford to wait for the whole object until storing it on > disk. Few thoughts. * more than 90% of objects are <32K in size. Can we eliminate the need to handle chained multi-fragment large objects? Say we let UFS handle objects >32KB, and only put smaller into fifo-store. We'd need to add some code for handling that, but it gives us alot of more freedom. For eg. we can keep pending objects in ram until fully fetched, (keeping upto 32KB per object is not a big deal), but then we can write each object out virtually atomically. There's no worries about partially written objects. No worries about multiple disk ops per object, no worries about multiple object pieces on disks, chunk alignment, etc. As >32KB objects are relatively rare, the overhead of UFS is not so big. Besides, big files is what UFS is much better at. Also, makes it much more attractive to append recently used objects onto the write queue. It would be much more difficult to manage with wildly differing objects sizes, and increases the cost of doing so. Perhaps add cache_dir types (fifo/ufs) and max object sizes to put on each. Have different routines for each, and in configuration decide how large objects are serviced by UFS or fifo... * We need to rewrite squid store interface I guess. Fifo style does not need any open/close calls. No need to use filenames. Squid just swaps the object into memory, (like get_object(&StoreEntrywithMemObject)) If object fits into buffer, it will be done in single disk io. If not, request is perhaps handled by UFS interface instead, which would handle filenames and open/read/close stuff itself. So, at some abstraction level, squid should expect objects to be fetched from disk in one call. When we need the filename to locate object in UFS, then we calculate it when needed and handle as currently. Basically, we'd want to move from FD centric activity inside squid disk routines to more abstract object centric. * We don't want to add indirection layer between swap_file_number and physical location on disk. This only slows us down. Because we want to change physical location on disks occasionally, we'd like to give IO routines the freedom to change swap_file_number in squid metadata structures directly. This allows fifo writer to sort its queue after object is handled to it, and later, whenever needed. The most sense in this is that when squid creates new disk object, it does not know exactly what its swap_file_number will be. Object's fileno will be picekd by IO routines ideally, based on where the fifo pointer is at that exact time and which spindle is most idle. Perhaps disk IO routines should even decide which cache_dir to pick. In a sense, squid metadata becomes integral part of storage. The only key is URL/MD5 hash which resolves directly into (current) location on disk. (in case of fifo/lru storage, or into filenumber in case of UFS) As I now understand, this is how you mean it to be, ie. squid metadata is located directly on the disks near data, and URL database is rebuilt during startup directly from this disk data. Above I assumed that there was some global URL database in squid that implemented LRU and mapped URLs to cache_dirs/locations. If we drop global metadata, then we have per cache_dir LRU that is implemented inside FS. But perhaps there might be some benefit from being able to write just fetched disk object to another disk's write queue? > * On-disk metadata is kept separately for each chunk together with the > data, The exception is when objects has to be released without being > replaced by another one. When this happens a release entry is kept in a > later chunk (the chunk where Squid found that it needs to remove the > object from cache by some reason). Like sort of transaction log? If we had global URL database, we'd not need this. But this solution might be even better. hotswap the drive, restart squid, and it runs with other subset of URL database... > Another exception is when a (large) object crosses chunk boundaries, > here the metadata log entry points to a earlier chunk. uhh. can we avoid large multipart files on fifo storage? ---------------------------------------------------------------------- Andres Kroonmaa mail: andre@online.ee Network Manager Organization: MicroLink Online Tel: 6308 909 Tallinn, Sakala 19 Pho: +372 6308 909 Estonia, EE0001 http://www.online.ee Fax: +372 6308 901 ---------------------------------------------------------------------- From squid-dev-request@ircache.net Tue Feb 2 00:38:23 1999 From: "Andres Kroonmaa" Organization: MicroLink Online To: Alex Rousskov Date: Tue, 2 Feb 1999 00:31:30 +0200 EET MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-transfer-encoding: 7BIT Subject: Re: memory-mapped files in Squid CC: squid-dev@ircache.net References: <16B04E605B@mail.lbi.ee> In-reply-to: Message-ID: <47930C0749@mail.lbi.ee> Although we already perhaps all agree that fifo can be trivially changed into LRU-like replacement policy, I'd like to point out few bits that still seem to be missed in regards to replacement policies. While talking of pure FIFO-based replacement, On 28 Jan 99, at 10:12, Alex Rousskov wrote: > On Thu, 28 Jan 1999, Andres Kroonmaa wrote: > > > Of 100% fresh objects that we fetch, some part will survive a cycle. > > Correct me if I am wrong, but the recommended LRU expiration age is 3 - 7 > days. The "cycle" is not something short. If a cache receives 3GB of > cachable traffic per day and has 30GB disk cache, it would take 10 days to > complete a cycle (just an example). If an object was not referenced twice > in 3-7 days, it is useless with a very high probability. But given 10K average object size and 3G per day, it makes about 300K reqs per day. I'd like to see the cache manager who allocates 30G of disk for this small workload (even nowadays). if you also count the current RAM needs for such a box, it will be way far from cheap cache box. And if we have say 30G/day, to have a expiration age of 10 days, we'd love to have 300G of disks? What if we target 60 days for zillions of .gifs? LRU is a lovely hack that works amasingly well "inspite of common sense" ;) But it is totally based on locality, and it is awfully dumb. As long as we compare hit-counts or even dumb hit-bytes, it beats most other policies or is pretty close. But if you attach a price-tag for bytes or counts, it does not fit too well anymore. Suppose you need to have $$$/G saved, based on inter-ISP peering costs? would LRU fit at all? Or if you need RTT-mSec/G saved? LRU is "one-fits-all". But is doesn't fit all. It might be way cool to see 30-60% hitrate on URLs to your local web server, but it really gives anyone very little benefit, isn't it? If a really cheap object gets into cache, and it is popular, LRU will not drop it, it would rather drop all the rest. Is that smart? We need more inteligent replacement policies not to increase hit-rate, but to increase a value per hit, value per GB kept. > Thus, FIFO may work for most of reasonably configured caches, especially > if augmented with a small buffer for popular objects (say, those > referenced at least twice). Small buffer would be 30-40% of my disk space. Of 30G it makes 10G. I wouldn't call it small, and I wouldn't call maintaining it trivial. But FIFO can be converted into LRU, and the lovely hack comes in. And I'd agree it is much easier to use fifo/LRU, although it may be much harder to include some other criteria based policies later... > > Some part is expired long before. But just this part that survives is > > what increases squid' hit/miss ratio, and we want to keep it. The > > other part is a waste, and we'd want to replace it first. > > For a lot (most?) of reasonable configurations, it does not matter much > what you are replacing as long as that stuff is several days old. The > latter is the reason why all "traditional" caching policies perform the > same on any reasonable size cache. Essentially, you are selecting between > old garbage and very old garbage. If you ever need to pay $0.25 per 10K gif, you'd start thinking totally different. (its just example, but it exactly illustrates those who fight with payperbyte). Old (or very old) is not equal to garbage. Only if this "old" becomes more costly to keep than refetch do we assign it a tag of garbage. Determining the point of when it becomes more costly to keep is the whole point of different replacement policies. Last reference time is simplest of all, but it has nothing to do with any kind of _cost_. > > 1) we want to have our disks as full as possible. 90% at least. > > 2) we want that these disks are full of _useful_ stuff, not waste. > > Sure. However, none of the traditional policies caches useful data! All > traditional policies that I know of fill cache with garbage, on average. > Provided your cache is of a reasonable size, of course. Sure. But its not a matter of caching useful data or garbage. Its a matter of _keeping_ useful data and _dropping_garbage_. Please note the difference. LRU does that. It picks (what it thinks)useful data to keep, and ignores the rest. The only criteria is "last-reference". It just happens so that the rest gets dropped if not referenced intime. This is the charm of its simplicity. But this is also its dumbness. FIFO byitself does not do anything to keep useful data. It just happens that with 4-10 times the same amount of disk we have comparable hitrate as with LRU. > The real challenge is avoid caching waste in the first place. And not to > reduce disk space requirements, but to decrease disk _bandwidth_ > contention and hence speedup hits. No, we can't avoid caching waste. For that we'd need to "see" the future. We can try to predict it, but we can never be exactly right. And NO, not to decrease the _disk_ bandwidth or increase the hits, but to reduce the _cost_ of (international) network traffic and thus being able to purchase higher bandwidth links. Sure, collapsing cache box is worth little, but you can feel the difference in interests. Fast cache box is very cool, but disk-space-effective box might be much more important to some. > > So, we have to look at allocation algorithms that integrates both > > 1) intelligent replacement policy > > 2) and does not suffer speed loss at high disk utilisation. > > IMO, all reasonable _replacement_ policies will perform the same no matter > how intelligent they are. I have not seen a single study that shows an > alternative super intelligent policy outperforming LRU-TH by a significant > margin on a reasonable size cache. I hope I have succeeded to explain myself by now. * hit-rate is not everything. I agree that LRU performs well. Intelligent? No. Who needs this intelligence? Those who count bits and bytes in bucks. > We may need an intelligent _caching_ (or cache admission) policy, but I am > afraid we are not ready for that yet. In fact we were, sort of, in squid 1.0, by means of ttl_patterns... ;-) ---------------------------------------------------------------------- Andres Kroonmaa mail: andre@online.ee Network Manager Organization: MicroLink Online Tel: 6308 909 Tallinn, Sakala 19 Pho: +372 6308 909 Estonia, EE0001 http://www.online.ee Fax: +372 6308 901 ---------------------------------------------------------------------- From squid-dev-request@ircache.net Mon Feb 1 22:18:45 1999 Message-ID: <199902010439.PAA06193@koro.off.connect.com.au> To: oskar@linux.org.za Cc: squid-dev@ircache.net From: Kevin Littlejohn Reply-To: Kevin Littlejohn Subject: Re: Squidfs code In-reply-to: Your message of "Thu, 28 Jan 1999 23:53:48 +0200." <19990128235348.A5559@linux.org.za> Date: Mon, 01 Feb 1999 15:39:14 +1100 http://www.bofh.net.au/~darius/squidfs/sfs-0.1.tgz contains the below fixes, plus fixes for the mentioned problems - it no longer cores (or shouldn't), -Wall produces no warnings, and several memory leaks have been shut off. Next step is to go over the various interface routines, make sure they're sane - now that my other major in-house project has settled down (and I've found a house to live in, yay ;), I can sink my teeth into this properly again. More to come. KevinL >>> oskar@linux.org.za wrote > > --zhXaljGHf11kAtnf > Content-Type: text/plain; charset=us-ascii > > Hi > > Attached is a bunch of cosmetic changes to the sfs code. (I know, I know, > it's not really useful, but I am tired... getting rid of warnings etc is > the most I can do right now ;) > > Includes: > > o A real makefile > o Makefile includes linux and solaris sections. Defines _REENTRANT > for Linux. You have to manually specify this. > Currently linux just core-dumps with me immediately. I will try > and track this down. > o Compiles almost without warnings with 'gcc -Wall'. There are a > couple of things that I was not up to fixing immediately: noted them > with 'XXX'. I removed some unused variables (and commented out > others): if they were 'for future use', sorry. > o Now have $Id: squid-fs-messages.txt,v 1.1 2000/07/16 23:28:13 hno Exp $ entries in every file, so that changes can be > tracked. > o Fixed recursive includes of header files > o sfs_seek code doesn't match documentation. Kludged it's variable > list in the meantime > o replaced broken squid_curtime definition. was the same as saying > 'exit + 1;' > o Included headers to get rid of silly warnings > > Hopefully this makes the code more 'coder friendly'... I will actually > work through the code and documentation sometime.... > > Oskar From squid-dev-request@ircache.net Tue Feb 2 00:52:51 1999 Message-ID: <36B63213.3F669D3@hem.passagen.se> Date: Tue, 02 Feb 1999 00:00:35 +0100 MIME-Version: 1.0 Subject: Re: memory-mapped files in Squid References: <44D40A62DE@mail.lbi.ee> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit To: Andres Kroonmaa From: Henrik Nordstrom Cc: squid-dev@ircache.net Andres Kroonmaa wrote: > I guess i missed it then. In recent discussion I was just > objecting throwing away objects from fifo tail, and you weren't > clear enough ;) Only after I got a hint on how to make fifo > into LRU, it occured to me... Well. I assumed that people got the first message, there it was mentioned that reused objects can be written back to the head, and most of what I have said so far has been a random collection of thoughts on a changing subject. > Hmm. what you mean by "taking up large portion of cache"? I imagine > that when you rewrite accessed object onto fifo head, its space at > its old location is released? Released but not reused. It is storage policy that is FIFO not object policy. Writing is always done at the head so the space won't be refilled until we get there again. > Aah, you mean that when some object hits 10/sec then it shouldn't > be rewritten to the fifo head at the same rate? Yes, among other things. > btw, how did you estimate? A rought calculation of refreshrates (both TCP_REFRESH and TCP_CLIENT_REFRESH). But as I said it is an early estimate and I may have missed something important. As Alex said some simulations should be done to get a clearer picture of how things behaves. > Regarding fifo ontop of ufs, absolutely. But it may have some > benefits to use OS-buffered raw partitions. As I understand, > unbuffered partitions can't be poll()ed (always ready and block > on read/write) while buffered can (OS handles disk access > scheduling and write buffering). I doubt that poll can be used on any disk media without blocking, raw or filesystem. But you are welcome to prove me wrong. > Can we eliminate the need to handle chained multi-fragment > large objects? Not in a pure FIFO store, but as you say it may be done in a hybrid design. > Also, makes it much more attractive to append recently used > objects onto the write queue. It would be much more difficult > to manage with wildly differing objects sizes, and increases > the cost of doing so. Francly I does not see the difference. In my thoughts I only write recently used objects to the head when they are requested, where the write back sort of piggy backs on the swap in. Regardless of how the object is laid out you have to read it in for the client, and write it out on the head if deemed neccesary. > Perhaps add cache_dir types (fifo/ufs) and max object sizes to > put on each. Have different routines for each, and in > configuration decide how large objects are serviced by UFS or > fifo... Configuration possibilities are numerous, especially when combining different kind of storage. > We need to rewrite squid store interface I guess. Yes. Some kind of abstract store interface is needed. It is kind of hard to tweak the fd-centric design into a non-fd centric system.. This applies to all communication and not only disk I/O. > We don't want to add indirection layer between swap_file_number and > physical location on disk. No, and it has never been the intention to use such an redirection layer. That would be to essentially reimplement a kind of directory structure and that was one of the things I wanted to get rid of. > The most sense in this is that when squid creates new disk object, > it does not know exactly what its swap_file_number will be. Object's > fileno will be picekd by IO routines ideally, based on where the fifo > pointer is at that exact time and which spindle is most idle. Yes. disk "swap file number" has to be picked when the store layout manager decides to actually store the object. All communication to/from store has to be on object level and not swap file numer. Actually the store has to maintain the metadata, as the store decides on replacement policy, so this is not really a problem. The abstraction level should be above this, with operations like "create object", "open object", "write object data", "read object data", "stat object info" (timestams, ...). > Perhaps disk IO routines should even decide which cache_dir to pick. Or put in another way: Each cache_dir can decide on when it stores objects, and where these objects are stored. > In a sense, squid metadata becomes integral part of storage. The only > key is URL/MD5 hash which resolves directly into (current) location > on disk. (in case of fifo/lru storage, or into filenumber in case > of UFS) Well, this is already how it is, almost anyway. If we are peering then some additional metadata is needed to quickly determine if the object is fresh or not, both for ICP and HTTP peerings so we cant get the in-memory index down to simply a store index. Some additional meta data well be needed in memory. If we want to design a efficient store for a non-peering cache then a different set of rules is in effect. One interesting question is if it may be possible to drop the in-memory index all together on a non-peering cache, and use URL hash as store index, combined with a digest like design to hint if the object is there or not. But this is very much separated from the FIFO discussion. Neither LRU or FIFO would be used in such a system (more like a random replacement policy). The big plus is that there is almost no limit on cache size. > As I now understand, this is how you mean it to be, ie. squid > metadata is located directly on the disks near data, and URL > database is rebuilt during startup directly from this disk data. Yes. The on-disk database is what got stored together (or rather close to) the objects. > Above I assumed that there was some global URL database in squid > that implemented LRU and mapped URLs to cache_dirs/locations. In my desgin NO LRU list is used. A FIFO is used with some addons to mimic a some properties of a LRU, but there is no real LRU in action. > But perhaps there might be some benefit from being able to write > just fetched disk object to another disk's write queue? Writing and reading should be isolated from each other. Of course objects should be able to be written back on another disk than it's read from. Objects should always be written to the most suitable disk, regardless if the object came from network, disk or whatever. A object is a object. > Like sort of transaction log? If we had global URL database, we'd not > need this. But this solution might be even better. hotswap the drive, > restart squid, and it runs with other subset of URL database... Yes, I see metadata logs is a sort of transaction log. In theory Squid can be programmed to hotswap the drive without restart. What is needed is the functions "disable and release everything from cache_dir X" and "activate cache_dir X". > uhh. can we avoid large multipart files on fifo storage? As you said a hybrid design could be used, where some storage is FIFO and some filesystem based. A third possibility is a spool area for larger objecs, from which completed objects are written to the FIFO. This area can also managed using FIFO to automatically clean out any fragmentation. There are a couple of other possibilites as well. You may remember my wording "multilevel" in earlier messages. This came from the idea that store could be maintained at multiple levels, where the first level blindly writes object and the second (and third ...) level eats objects from the tail onto the next level fifo. This is a good idea if the first level wastes a lot of space on objects that then is thrown away (refreshed or aborted during storage), but it may be hard load balance such a system... /Henrik From squid-dev-request@ircache.net Wed Feb 3 00:02:36 1999 Message-ID: <199902020030.LAA10083@koro.off.connect.com.au> To: Henrik Nordstrom Cc: squid-dev@ircache.net From: Kevin Littlejohn Reply-To: Kevin Littlejohn Subject: Re: memory-mapped files in Squid In-reply-to: Your message of "Tue, 02 Feb 1999 00:00:35 BST." <36B63213.3F669D3@hem.passagen.se> References: <36B63213.3F669D3@hem.passagen.se> <44D40A62DE@mail.lbi.ee> Date: Tue, 02 Feb 1999 11:30:29 +1100 >>> Henrik Nordstrom wrote > > Aah, you mean that when some object hits 10/sec then it shouldn't > > be rewritten to the fifo head at the same rate? > > Yes, among other things. > > > btw, how did you estimate? > > A rought calculation of refreshrates (both TCP_REFRESH and > TCP_CLIENT_REFRESH). But as I said it is an early estimate and I may > have missed something important. As Alex said some simulations should be > done to get a clearer picture of how things behaves. I think this is going to be a big part of the tuning for the cyclic fs - how do you deal with, for instance, the next MS service pack - 50Mb of heavily-hit object. Personally, re-writing objects on hits doesn't give me warm fuzzy feelings... > > We need to rewrite squid store interface I guess. > > Yes. Some kind of abstract store interface is needed. It is kind of hard > to tweak the fd-centric design into a non-fd centric system.. This > applies to all communication and not only disk I/O. > > > We don't want to add indirection layer between swap_file_number and > > physical location on disk. > > No, and it has never been the intention to use such an redirection > layer. That would be to essentially reimplement a kind of directory > structure and that was one of the things I wanted to get rid of. > This stuff applies to any squidFS - the aim should be to have squid store, internally, a direct pointer to the location on disk for the object, rather than any level of indirection. I think that's one of the crucial points - and yeah, we could do with a slight abstraction of the current disk handling, so there's not read() and write() sprinkled through asyncio and disk.c, and so what's stored as a 'name' (and what's stored as a 'fd' for open files) is easily changeable. Is anyone out there looking into this yet? Because I'm about three days away from starting on the squid side of the sfs stuff - so if someone is already eyeing off cleaning up squid's disk IO, I'd like to talk to them ;) > Writing and reading should be isolated from each other. Of course > objects should be able to be written back on another disk than it's read > from. Objects should always be written to the most suitable disk, > regardless if the object came from network, disk or whatever. A object > is a object. Except if you've already incurred the overhead of writing to disk, why incur it again? I'm still not convinced that increasing the workload of the disk is a good thing to do in the process of attempting to speed up disk access. I know that there's many other things affecting disk access speeds - but that media is still the slow part of the chain (well, after network), so it makes sense to me to keep disk use low. > > > Like sort of transaction log? If we had global URL database, we'd not > > need this. But this solution might be even better. hotswap the drive, > > restart squid, and it runs with other subset of URL database... > > Yes, I see metadata logs is a sort of transaction log. > > In theory Squid can be programmed to hotswap the drive without restart. > What is needed is the functions "disable and release everything from > cache_dir X" and "activate cache_dir X". In fact, squid already handles a drive 'going away'. It doesn't currently handle a drive 'coming online', but that would be almost trivial - just get it to rebuild the cache_store index for that drive, and some way of signalling that the drive has come on-line... So long as all the data pertaining to a drive is on that drive, and nowhere else, then you've got a swappable setup. > > uhh. can we avoid large multipart files on fifo storage? > > As you said a hybrid design could be used, where some storage is FIFO > and some filesystem based. > > A third possibility is a spool area for larger objecs, from which > completed objects are written to the FIFO. This area can also managed > using FIFO to automatically clean out any fragmentation. I'd be curious to see what the 'best' lower limit for size of objects is before you start using this 'staging area'. I'd also be curious to see what impact it has on performance if the object sizes drift - if you're heavily hitting that area, you may start to loose some of the cyclic gains :( > > There are a couple of other possibilites as well. > > You may remember my wording "multilevel" in earlier messages. This came > from the idea that store could be maintained at multiple levels, where > the first level blindly writes object and the second (and third ...) > level eats objects from the tail onto the next level fifo. This is a > good idea if the first level wastes a lot of space on objects that then > is thrown away (refreshed or aborted during storage), but it may be hard > load balance such a system... Cute idea. There's definately some nifty ideas there, but I'm not convinced that the gains from a cyclic fs over a more traditional style fs are enough to warrant the extra management complexity - shuffling objects around on disk, etc. Guess the only real way to tell is to implement and see ;) KevinL (Who can see much experimentation coming up... time to requisition a new cache box ;) From squid-dev-request@ircache.net Wed Feb 3 01:10:47 1999 Message-ID: <36B7920B.2BA1CF25@hem.passagen.se> Date: Wed, 03 Feb 1999 01:02:19 +0100 MIME-Version: 1.0 Subject: Re: memory-mapped files in Squid References: <36B63213.3F669D3@hem.passagen.se> <44D40A62DE@mail.lbi.ee> <199902020030.LAA10083@koro.off.connect.com.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit To: Kevin Littlejohn From: Henrik Nordstrom Cc: squid-dev@ircache.net Kevin Littlejohn wrote: > I think this is going to be a big part of the tuning for the cyclic > fs - how do you deal with, for instance, the next MS service pack - > 50Mb of heavily-hit object. Personally, re-writing objects on hits > doesn't give me warm fuzzy feelings... I would probably use a rewrite rate limit of 30%. It the object are in the newest 30% of the store, then don't rewrite it to the head. > Except if you've already incurred the overhead of writing to disk, why > incur it again? As I see it there are three options in a cyclic file system: 1. Rewrite reused objects when reused, with a limit on by which rate the object is rewritten to the head. Penalties: * extra writes * changing store index numbers. 2. When rewriting a store chunk, preserve those objects that we are interested in keeping, but pack them together. Penalties: * extra reads * extra writes * no obvious way to determine what to keep and what to throw away * changing store index numbers. 3. When rewriting a store chunk, preserve those objects that we are interested in keeping where they are in the chunk. Penalties: * fragmentation * no obvious way to determine what to keep and what to throw away. Currently I think 1 is the way to go, but 2 or 3 allows for a more flexible or customisable replacement policy. > I know that there's many other things affecting disk access > speeds - but that media is still the slow part of the chain (well, after > network), so it makes sense to me to keep disk use low. There will always be some kinds of tradeoffs. Goal is to find a design that allows for some balancing between the different kinds. > In fact, squid already handles a drive 'going away'. In theory yes, in practice no. Squid may continue to function, but complains loudly. > I'd be curious to see what the 'best' lower limit for size of objects is before > you start using this 'staging area'. I'd also be curious to see what impact > it has on performance if the object sizes drift - if you're heavily hitting > that area, you may start to loose some of the cyclic gains :( I see this FIFO + standard fs as a possible future extension if deemed neccesary. Not as part of the basic cyclic fs design. Also the standard FS staging area should only be used if the object is larger than we want to queue in memory. Smaller objects should go directly to the cyclic store. > There's definately some nifty ideas there, but I'm not convinced that the > gains from a cyclic fs over a more traditional style fs are enough to warrant > the extra management complexity - shuffling objects around on disk, etc. I wouldn't say that there is much extra management complexity, but I admit that some of Squids interfaces does not fit well. > Guess the only real way to tell is to implement and see ;) Unfortunately I don't have the time needed to implement anything right now, and I am not sure if I will be able to later. /Henrik From squid-dev-request@ircache.net Wed Feb 3 00:02:52 1999 From: "Andres Kroonmaa" Organization: MicroLink Online To: darius@connect.com.au Date: Tue, 2 Feb 1999 10:56:59 +0200 EET MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-transfer-encoding: 7BIT Subject: Re: memory-mapped files in Squid CC: squid-dev@ircache.net In-reply-to: <199902020030.LAA10083@koro.off.connect.com.au> References: Your message of "Tue, 02 Feb 1999 00:00:35 BST." <36B63213.3F669D3@hem.passagen.se> Message-ID: <5200F810EB@mail.lbi.ee> On 2 Feb 99, at 11:30, Kevin Littlejohn wrote: > > > We need to rewrite squid store interface I guess. > > > > Yes. Some kind of abstract store interface is needed. It is kind of hard > > to tweak the fd-centric design into a non-fd centric system.. This > > applies to all communication and not only disk I/O. > > This stuff applies to any squidFS - the aim should be to have squid store, > internally, a direct pointer to the location on disk for the object, rather > than any level of indirection. I think that's one of the crucial points > - and yeah, we could do with a slight abstraction of the current disk > handling, so there's not read() and write() sprinkled through asyncio and > disk.c, and so what's stored as a 'name' (and what's stored as a 'fd' for > open files) is easily changeable. Actually, I believe it would be benefitial to change squid to be more object-oriented than hack-oriented ;) More and more stuff gets tied to it and it becomes increasingly difficult to see all interactions. I believe that everything should be turning around StoreEntry structure. It should have all and any bits needed to accomplish everything we want. Like vnodes in some OS'es, it should represent an in-memory version of cache object, it should have its own locks, temp areas, handlers, etc.etc. It should be arranged so that parts of it could be kept in ram, like currently memObject can be attached to it if needed. Focus of most activity should be towards this only pointer to StoreEntry. This should allow us to pass only pointers between calls, avoid lots of data copying around in memory, needless malloc/frees and if using adequate locking, allows to easily add threading later on. Squid 2 has done long step in that direction and I hope it will stay that way. > > Writing and reading should be isolated from each other. Of course > > objects should be able to be written back on another disk than it's read > > from. Objects should always be written to the most suitable disk, > > regardless if the object came from network, disk or whatever. A object > > is a object. > > Except if you've already incurred the overhead of writing to disk, why > incur it again? I'm still not convinced that increasing the workload of > the disk is a good thing to do in the process of attempting to speed up > disk access. I know that there's many other things affecting disk access > speeds - but that media is still the slow part of the chain (well, after > network), so it makes sense to me to keep disk use low. It has to do with increasing disk sequential bandwidth and little to no increase in disk access time performance. Calculate the cost of disk write with different disks and segment sizes, lets suppose: disk1: avg accesstime=20mS, sustained sequential write: 5MB/s disk2: avg accesstime=20mS, sustained sequential write: 20MB/s disk3: avg accesstime=10mS, sustained sequential write: 20MB/s disk4: avg accesstime=10mS, sustained sequential write: 5MB/s 1) segment size of 8KB time to write random segment: disk1: 20mSec + 8K/5M = 21.6 msec disk2: 20mSec + 8K/20M = 20.4 msec disk3: 10mSec + 8K/20M = 10.4 msec disk4: 10mSec + 8K/5M = 11.6 msec max random write bandwidth: disk1: 8KB/21.6mSec = 370KB/sec disk2: 8KB/20.4mSec = 392KB/sec disk3: 8KB/10.4mSec = 769KB/sec disk4: 8KB/11.6mSec = 689KB/sec As you see, although disk performance parameters differ considerably, the sustained transfer rate is pretty close, especially for a case of comparable accesstime disks. ** disk accesstime dictates max random bandwidth of the disk with small segments. This is the reason why elevator optimisation rulez, it minimises average seek times between successive disk operations. 2) segment size of 1MB time to write random segment: disk1: 20mSec + 1M/5M = 221.6 msec disk2: 20mSec + 1M/20M = 70.4 msec disk3: 10mSec + 1M/20M = 60.4 msec disk4: 10mSec + 1M/5M = 211.6 msec max random write bandwidth: disk1: 1M/221.6mSec = 4621KB/sec disk2: 1M/70.4mSec = 14545KB/sec disk3: 1M/60.4mSec = 16953KB/sec disk4: 1M/211.6mSec = 4839KB/sec As you can see, with 8KB segment size and random io, we can transfer only 400-800KB in a second, but if we cluster writes together, we can do near 15MB writes every second on disks with fast transfers. ** for large sequential segments, transfer bandwidth dominates overall write performance, and disk accesstimes influence it very little. As you can also see, elevator optimisation gives very little here. But elevator still gives us edge on read performance. We are not increasing disk workload, we are trying to move it to what disks are best at, we are trying to do more work during the same amount of time. We do not want to keep disk use low, we do want keep disk seek count low. We can't avoid random access for reads, but we can avoid it for writes. So, big and sequential is fast. Small and random is slow. Thats the whole reason of log-FS'es and fifo design. So, if you append some 64 8K objects (512K) to a write queue, overhead added only halfs (for new stuff) the write performance, keeping it still way higher than random access. And, because added to write queue only if object is already accessed by read, there is no added read overhead. As we need to do strictly sequential writes always, we have to keep free space sequential, and FIFO is best at this. Unfortunately, fifo drops objects at fifo tail and these objects could be quite popular. So here we basically only decide if we overwrite once written objects with new ones, thus probably loosing some hits on these, or we append these objects to the write queue, thus forming LRU from FIFO. As long as objects appended to the write queue do not impose higher cost than would be to keep totally random writes or refetching from the network, we are pretty cool. It should be trivial to leave config options to both disable LRU or to tweak max objects size that is appended to the write queue. > > > uhh. can we avoid large multipart files on fifo storage? > > > > As you said a hybrid design could be used, where some storage is FIFO > > and some filesystem based. > > > > A third possibility is a spool area for larger objecs, from which > > completed objects are written to the FIFO. This area can also managed > > using FIFO to automatically clean out any fragmentation. > > I'd be curious to see what the 'best' lower limit for size of objects is before > you start using this 'staging area'. I'd also be curious to see what impact > it has on performance if the object sizes drift - if you're heavily hitting > that area, you may start to loose some of the cyclic gains :( I guess so too. I believe we don't want to move large objects around too often. From some point there might be higher overhead in handling them than placing them once into the system fs and living with the overhead of fs. > There's definately some nifty ideas there, but I'm not convinced that the > gains from a cyclic fs over a more traditional style fs are enough to warrant > the extra management complexity - shuffling objects around on disk, etc. Agree that the beast should be simple. LRU is famous at being simple yet effective. For small objects, FIFO seems cool, for large objects, more traditional FS could be better, if not ufs, then perhaps something else. ---------------------------------------------------------------------- Andres Kroonmaa mail: andre@online.ee Network Manager Organization: MicroLink Online Tel: 6308 909 Tallinn, Sakala 19 Pho: +372 6308 909 Estonia, EE0001 http://www.online.ee Fax: +372 6308 901 ---------------------------------------------------------------------- From squid-dev-request@ircache.net Thu Feb 4 02:02:59 1999 Message-Id: <199902031626.IAA17903@granite.hpl.hp.com> To: Henrik Nordstrom cc: Kevin Littlejohn , squid-dev@ircache.net Subject: Re: memory-mapped files in Squid X-Organization: Hewlett-Packard Laboratories In-reply-to: Your message of Wed, 03 Feb 1999 01:02:19 +0100. <36B7920B.2BA1CF25@hem.passagen.se> Date: Wed, 03 Feb 1999 08:26:50 -0800 From: John Dilley > > I think this is going to be a big part of the tuning for the cyclic > > fs - how do you deal with, for instance, the next MS service pack - > > 50Mb of heavily-hit object. Personally, re-writing objects on hits > > doesn't give me warm fuzzy feelings... > > I would probably use a rewrite rate limit of 30%. It the object are in > the newest 30% of the store, then don't rewrite it to the head. We've found a partitioned policy for cache replacement can work better than a single policy alone. In this case it seems we could use two policies to handle two distinct parts of the working set: 1. The FIFO policy is optimized to handle writing many objects to disk as efficiently as possible. This matches the proxy cache workload, which consists of more writes than reads (in aggregate); furthermore most objects (over 50%) are never re-referenced, so they should be sent to disk and forgotten about as quickly as possible. Circular FIFO appears to accomplish this nicely, and can be optimized to lay out objects very efficiently with regards to disk blocks (as discussed here earlier). 2. If an object is re-referenced it is possible it will become part of the hot object set. In the proxy workload some objects are very popular, hence Squid's main-memory hot object cache. A separate cache partition for popular objects with a frequency based replacement policy can improve the byte (and object) hit rate of the cache without degrading performance. Basically when objects are first added to the cache (cold miss) they are added to the FIFO. If they are re-referenced, they are moved to a separate cache partition. The move can be "logical" by adjusting the cache metadata and leaving the data in the FIFO until "disk quite time", or until it is about to be replaced. In this way the extra writes of Henrik's option 1, below, are avoided some of the time. In addition, the popular objects can be maintained with a replacement policy that works better than FIFO for popular objects. A key question is how much better is a partitioned policy than a simple FIFO policy on a real proxy cache workload? Does the improvement justify implementation of a separate cache partition? Does this mixed policy help alleviate the cache bottleneck (disk access)? > > Except if you've already incurred the overhead of writing to disk, why > > incur it again? > > As I see it there are three options in a cyclic file system: > > 1. Rewrite reused objects when reused, with a limit on by which rate the > object is rewritten to the head. > Penalties: > * extra writes > * changing store index numbers. > > 2. When rewriting a store chunk, preserve those objects that we are > interested in keeping, but pack them together. > Penalties: > * extra reads > * extra writes > * no obvious way to determine what to keep and what to throw away > * changing store index numbers. > > 3. When rewriting a store chunk, preserve those objects that we are > interested in keeping where they are in the chunk. > Penalties: > * fragmentation > * no obvious way to determine what to keep and what to throw away. > > Currently I think 1 is the way to go, but 2 or 3 allows for a more > flexible or customisable replacement policy. ... > > There's definately some nifty ideas there, but I'm not convinced that the > > gains from a cyclic fs over a more traditional style fs are enough to warrant > > the extra management complexity - shuffling objects around on disk, etc. > > I wouldn't say that there is much extra management complexity, but I > admit that some of Squids interfaces does not fit well. I would encourage anyone looking into circular FS implementation not to shuffle data around. One value of the FIFO approach is that data are streamed to disk as efficiently as possible, thus alleviating the disk bottleneck. I don't fully understand the impact of subsequent (random access) reads of data from the FIFO, but as noted earlier, another cache structure may make sense for the popular objects. -- jad -- John Dilley From andre@ml.ee Mon Feb 1 22:19:48 1999 From: "Andres Kroonmaa" Organization: MicroLink Online To: Henrik Nordstrom Date: Mon, 1 Feb 1999 21:46:42 +0200 EET MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-transfer-encoding: 7BIT Subject: Re: memory-mapped files in Squid CC: squid-dev@ircache.net In-reply-to: <36B26009.33DD710E@hem.passagen.se> Message-ID: <44D40A62DE@mail.lbi.ee> On 30 Jan 99, at 2:27, Henrik Nordstrom wrote: > Sounds like you have read my first message on cyclic filesystems, sent > some weeks ago ;-) I guess i missed it then. In recent discussion I was just objecting throwing away objects from fifo tail, and you weren't clear enough ;) Only after I got a hint on how to make fifo into LRU, it occured to me... > To avoid having frequently > requested objects taking up a unproportionably large portion of the > cache my idea is to rewrite hits if they are located more than a certain > (configurable) distance from the head. Hmm. what you mean by "taking up large portion of cache"? I imagine that when you rewrite accessed object onto fifo head, its space at its old location is released? Aah, you mean that when some object hits 10/sec then it shouldn't be rewritten to the fifo head at the same rate? > * To be able to guarantee that writes can be done sequentially a FIFO > type storage policy should be used, possibly with extensions to avoid > throwing away objects we want to keep (probably) and to defragment > unused space (less likely, early estimates puts this in the range of > 2-5% and it is non-trivial to implement). btw, how did you estimate? > * Also, with a FIFO style replacement policy, it is very hard to justify > the overhead incurred by a standard filesystem. Probably better to use > raw partitions, both from a I/O performance level and how much memory > that is required by the OS for caches/buffers. Regarding fifo ontop of ufs, absolutely. But it may have some benefits to use OS-buffered raw partitions. As I understand, unbuffered partitions can't be poll()ed (always ready and block on read/write) while buffered can (OS handles disk access scheduling and write buffering). This could allow us continue to use select() style coding without threads if needed. > * Disk store index is physical location. How closely objects can be > packed inside each chunk is determined by the index key size, and key > size is in turn determined by the amount of memory we can spend on it. A > * Checkpoints are kept as two storage chunk pointers: last fully > completed storage chunk, and the oldest non-reused chunk. These > checkpoints may actually be implemented as timestamps in the chunks > themselves. > * In-memory index are rebuilt by reading the meta data indexes from the > storage blocks, beginning with the oldest. > * Chaining of object fragments is needed to store larger objects where > we could not afford to wait for the whole object until storing it on > disk. Few thoughts. * more than 90% of objects are <32K in size. Can we eliminate the need to handle chained multi-fragment large objects? Say we let UFS handle objects >32KB, and only put smaller into fifo-store. We'd need to add some code for handling that, but it gives us alot of more freedom. For eg. we can keep pending objects in ram until fully fetched, (keeping upto 32KB per object is not a big deal), but then we can write each object out virtually atomically. There's no worries about partially written objects. No worries about multiple disk ops per object, no worries about multiple object pieces on disks, chunk alignment, etc. As >32KB objects are relatively rare, the overhead of UFS is not so big. Besides, big files is what UFS is much better at. Also, makes it much more attractive to append recently used objects onto the write queue. It would be much more difficult to manage with wildly differing objects sizes, and increases the cost of doing so. Perhaps add cache_dir types (fifo/ufs) and max object sizes to put on each. Have different routines for each, and in configuration decide how large objects are serviced by UFS or fifo... * We need to rewrite squid store interface I guess. Fifo style does not need any open/close calls. No need to use filenames. Squid just swaps the object into memory, (like get_object(&StoreEntrywithMemObject)) If object fits into buffer, it will be done in single disk io. If not, request is perhaps handled by UFS interface instead, which would handle filenames and open/read/close stuff itself. So, at some abstraction level, squid should expect objects to be fetched from disk in one call. When we need the filename to locate object in UFS, then we calculate it when needed and handle as currently. Basically, we'd want to move from FD centric activity inside squid disk routines to more abstract object centric. * We don't want to add indirection layer between swap_file_number and physical location on disk. This only slows us down. Because we want to change physical location on disks occasionally, we'd like to give IO routines the freedom to change swap_file_number in squid metadata structures directly. This allows fifo writer to sort its queue after object is handled to it, and later, whenever needed. The most sense in this is that when squid creates new disk object, it does not know exactly what its swap_file_number will be. Object's fileno will be picekd by IO routines ideally, based on where the fifo pointer is at that exact time and which spindle is most idle. Perhaps disk IO routines should even decide which cache_dir to pick. In a sense, squid metadata becomes integral part of storage. The only key is URL/MD5 hash which resolves directly into (current) location on disk. (in case of fifo/lru storage, or into filenumber in case of UFS) As I now understand, this is how you mean it to be, ie. squid metadata is located directly on the disks near data, and URL database is rebuilt during startup directly from this disk data. Above I assumed that there was some global URL database in squid that implemented LRU and mapped URLs to cache_dirs/locations. If we drop global metadata, then we have per cache_dir LRU that is implemented inside FS. But perhaps there might be some benefit from being able to write just fetched disk object to another disk's write queue? > * On-disk metadata is kept separately for each chunk together with the > data, The exception is when objects has to be released without being > replaced by another one. When this happens a release entry is kept in a > later chunk (the chunk where Squid found that it needs to remove the > object from cache by some reason). Like sort of transaction log? If we had global URL database, we'd not need this. But this solution might be even better. hotswap the drive, restart squid, and it runs with other subset of URL database... > Another exception is when a (large) object crosses chunk boundaries, > here the metadata log entry points to a earlier chunk. uhh. can we avoid large multipart files on fifo storage? ---------------------------------------------------------------------- Andres Kroonmaa mail: andre@online.ee Network Manager Organization: MicroLink Online Tel: 6308 909 Tallinn, Sakala 19 Pho: +372 6308 909 Estonia, EE0001 http://www.online.ee Fax: +372 6308 901 ---------------------------------------------------------------------- From darius@connect.com.au Wed Feb 3 00:02:34 1999 Message-ID: <199902020030.LAA10083@koro.off.connect.com.au> To: Henrik Nordstrom Cc: squid-dev@ircache.net From: Kevin Littlejohn Reply-To: Kevin Littlejohn Subject: Re: memory-mapped files in Squid In-reply-to: Your message of "Tue, 02 Feb 1999 00:00:35 BST." <36B63213.3F669D3@hem.passagen.se> References: <36B63213.3F669D3@hem.passagen.se> <44D40A62DE@mail.lbi.ee> Date: Tue, 02 Feb 1999 11:30:29 +1100 >>> Henrik Nordstrom wrote > > Aah, you mean that when some object hits 10/sec then it shouldn't > > be rewritten to the fifo head at the same rate? > > Yes, among other things. > > > btw, how did you estimate? > > A rought calculation of refreshrates (both TCP_REFRESH and > TCP_CLIENT_REFRESH). But as I said it is an early estimate and I may > have missed something important. As Alex said some simulations should be > done to get a clearer picture of how things behaves. I think this is going to be a big part of the tuning for the cyclic fs - how do you deal with, for instance, the next MS service pack - 50Mb of heavily-hit object. Personally, re-writing objects on hits doesn't give me warm fuzzy feelings... > > We need to rewrite squid store interface I guess. > > Yes. Some kind of abstract store interface is needed. It is kind of hard > to tweak the fd-centric design into a non-fd centric system.. This > applies to all communication and not only disk I/O. > > > We don't want to add indirection layer between swap_file_number and > > physical location on disk. > > No, and it has never been the intention to use such an redirection > layer. That would be to essentially reimplement a kind of directory > structure and that was one of the things I wanted to get rid of. > This stuff applies to any squidFS - the aim should be to have squid store, internally, a direct pointer to the location on disk for the object, rather than any level of indirection. I think that's one of the crucial points - and yeah, we could do with a slight abstraction of the current disk handling, so there's not read() and write() sprinkled through asyncio and disk.c, and so what's stored as a 'name' (and what's stored as a 'fd' for open files) is easily changeable. Is anyone out there looking into this yet? Because I'm about three days away from starting on the squid side of the sfs stuff - so if someone is already eyeing off cleaning up squid's disk IO, I'd like to talk to them ;) > Writing and reading should be isolated from each other. Of course > objects should be able to be written back on another disk than it's read > from. Objects should always be written to the most suitable disk, > regardless if the object came from network, disk or whatever. A object > is a object. Except if you've already incurred the overhead of writing to disk, why incur it again? I'm still not convinced that increasing the workload of the disk is a good thing to do in the process of attempting to speed up disk access. I know that there's many other things affecting disk access speeds - but that media is still the slow part of the chain (well, after network), so it makes sense to me to keep disk use low. > > > Like sort of transaction log? If we had global URL database, we'd not > > need this. But this solution might be even better. hotswap the drive, > > restart squid, and it runs with other subset of URL database... > > Yes, I see metadata logs is a sort of transaction log. > > In theory Squid can be programmed to hotswap the drive without restart. > What is needed is the functions "disable and release everything from > cache_dir X" and "activate cache_dir X". In fact, squid already handles a drive 'going away'. It doesn't currently handle a drive 'coming online', but that would be almost trivial - just get it to rebuild the cache_store index for that drive, and some way of signalling that the drive has come on-line... So long as all the data pertaining to a drive is on that drive, and nowhere else, then you've got a swappable setup. > > uhh. can we avoid large multipart files on fifo storage? > > As you said a hybrid design could be used, where some storage is FIFO > and some filesystem based. > > A third possibility is a spool area for larger objecs, from which > completed objects are written to the FIFO. This area can also managed > using FIFO to automatically clean out any fragmentation. I'd be curious to see what the 'best' lower limit for size of objects is before you start using this 'staging area'. I'd also be curious to see what impact it has on performance if the object sizes drift - if you're heavily hitting that area, you may start to loose some of the cyclic gains :( > > There are a couple of other possibilites as well. > > You may remember my wording "multilevel" in earlier messages. This came > from the idea that store could be maintained at multiple levels, where > the first level blindly writes object and the second (and third ...) > level eats objects from the tail onto the next level fifo. This is a > good idea if the first level wastes a lot of space on objects that then > is thrown away (refreshed or aborted during storage), but it may be hard > load balance such a system... Cute idea. There's definately some nifty ideas there, but I'm not convinced that the gains from a cyclic fs over a more traditional style fs are enough to warrant the extra management complexity - shuffling objects around on disk, etc. Guess the only real way to tell is to implement and see ;) KevinL (Who can see much experimentation coming up... time to requisition a new cache box ;) From jad@granite.hpl.hp.com Thu Feb 4 02:02:56 1999 Message-Id: <199902031626.IAA17903@granite.hpl.hp.com> To: Henrik Nordstrom cc: Kevin Littlejohn , squid-dev@ircache.net Subject: Re: memory-mapped files in Squid X-Organization: Hewlett-Packard Laboratories In-reply-to: Your message of Wed, 03 Feb 1999 01:02:19 +0100. <36B7920B.2BA1CF25@hem.passagen.se> Date: Wed, 03 Feb 1999 08:26:50 -0800 From: John Dilley > > I think this is going to be a big part of the tuning for the cyclic > > fs - how do you deal with, for instance, the next MS service pack - > > 50Mb of heavily-hit object. Personally, re-writing objects on hits > > doesn't give me warm fuzzy feelings... > > I would probably use a rewrite rate limit of 30%. It the object are in > the newest 30% of the store, then don't rewrite it to the head. We've found a partitioned policy for cache replacement can work better than a single policy alone. In this case it seems we could use two policies to handle two distinct parts of the working set: 1. The FIFO policy is optimized to handle writing many objects to disk as efficiently as possible. This matches the proxy cache workload, which consists of more writes than reads (in aggregate); furthermore most objects (over 50%) are never re-referenced, so they should be sent to disk and forgotten about as quickly as possible. Circular FIFO appears to accomplish this nicely, and can be optimized to lay out objects very efficiently with regards to disk blocks (as discussed here earlier). 2. If an object is re-referenced it is possible it will become part of the hot object set. In the proxy workload some objects are very popular, hence Squid's main-memory hot object cache. A separate cache partition for popular objects with a frequency based replacement policy can improve the byte (and object) hit rate of the cache without degrading performance. Basically when objects are first added to the cache (cold miss) they are added to the FIFO. If they are re-referenced, they are moved to a separate cache partition. The move can be "logical" by adjusting the cache metadata and leaving the data in the FIFO until "disk quite time", or until it is about to be replaced. In this way the extra writes of Henrik's option 1, below, are avoided some of the time. In addition, the popular objects can be maintained with a replacement policy that works better than FIFO for popular objects. A key question is how much better is a partitioned policy than a simple FIFO policy on a real proxy cache workload? Does the improvement justify implementation of a sep