p2p-fs Design Document

P2P-FS Logo

Functionality

The aim is to create a peer-to-peer file system which users can mount and then browse other nodes' shares just as they would a normal file system. Additionally, when users read data, the system should detect replicas of that data and pull it in from these multiple sources, thus accomplishing 2 main objectives:
  1. Reduce the load on an individual machine.
  2. Continue serving data to the client(s) even in the face of node failure, provided the file is housed at multiple nodes.

Protocol Requirements

The protocol to be designed needs to support the following requirements:
  1. All the nodes in the running system need to be aware of all the other exports in the system.
  2. There should not be unbounded sending of messages on the network while accomplishing the former.

Files involved:

There is a single configuration file /etc/p2pfs.conf which contains:
  1. The list of HOSTS to which you wish to export your shares.
  2. The list of directories that you wish to share with those hosts (a hypothetical sample is sketched below).
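
The exact syntax of /etc/p2pfs.conf is not described here, so the sample below is only a guess at what such a file could look like; the key names and layout are hypothetical.

    # Hypothetical layout -- the real key names and separators may differ.
    HOSTS  = 192.168.1.100, 192.168.1.101
    EXPORT = /home/alice/music, /srv/shared/docs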

The protocol:


Note: We shall take up the case of non-boot-up messages later, once the inotify module has been implemented.

Internals:

To aid a quick 'ls' and 'find' implementation, all the file names exported by all the nodes are stored locally in memory at every other node. This means everyone has a list of all the files shared by everyone else in their local RAM.

The directory layout is as follows. Suppose you mount the FS on /mnt/fuse, and 192.168.1.100 and 192.168.1.101 are also sharing files; then doing an 'ls /mnt/fuse' will show:

127.0.0.1
192.168.1.100
192.168.1.101

Further, if you do an 'ls' on these directories, you can see the files (and directories) shared by the individual nodes.

Internally, there are 2 main data structures maintained:

  1. MFS: the in-memory file system, a GLL (a hash table of hash tables) which stores the directory tree that should be shown by an 'ls -R' command. This data structure is very important and is used for getting file attributes, listings, doing path name lookups, etc. Each file has an associated MD5 tagged to it.
  2. FR: File Replicas. This is a hash table which maps an MD5 sum to the list of complete paths whose files are replicas of each other. (Both structures are sketched below.)
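
For concreteness, here is a minimal sketch of how these two structures could be declared in C++; the names and exact container types are illustrative, not necessarily what the source uses.

    #include <string>
    #include <unordered_map>
    #include <vector>

    // MFS: host -> (path -> MD5).  A hash table of hash tables holding the
    // directory tree that an 'ls -R' on the mount point should show.
    typedef std::unordered_map<std::string, std::string> PathToMD5;
    typedef std::unordered_map<std::string, PathToMD5>   MFS;

    // FR: MD5 -> list of complete paths whose contents are replicas of each other.
    typedef std::unordered_map<std::string, std::vector<std::string> > FR;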

How are these 2 data structures used?

When one does an open, [1] is consulted and the MD5 of the file is obtained. Then, using this MD5, we index into [2] and get the associated replica list, which we attach to the FILE HANDLE we return to open(). Every subsequent read() passes us the FILE HANDLE (FH), which we can use to get the associated file data (in this case the list of replicas) for that file. Pretty simple, eh?
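
A rough sketch of that open() path, building on the type declarations sketched above (p2pfs_open() and the OpenFile struct are hypothetical names, not the real FUSE callbacks):

    // Per-open state handed back to FUSE as the file handle (FH).
    struct OpenFile {
        std::string              md5;       // MD5 of the opened file
        std::vector<std::string> replicas;  // complete paths of all known replicas
    };

    // 'host' and 'path' identify the file under the mount point,
    // e.g. host = "192.168.1.100", path = "/a.mp3".
    OpenFile* p2pfs_open(const MFS& mfs, const FR& fr,
                         const std::string& host, const std::string& path)
    {
        const std::string& md5 = mfs.at(host).at(path);  // [1]: path name -> MD5
        OpenFile* fh = new OpenFile;
        fh->md5      = md5;
        fh->replicas = fr.at(md5);                       // [2]: MD5 -> replica list
        return fh;   // FUSE hands this pointer back to us on every read()
    }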

So far so good... Now, suppose 192.168.1.101 (abbrev. 101) and 192.168.1.100 (abbrev. 100) are sharing the same file, but under different names:

100/a.mp3
101/b.mp3

Then the system will detect this (the two files have the same MD5, so they land in the same FR entry) and understand them to be replicas of each other. So, if a user opens 100/a.mp3 and starts reading from it, the read request will be split across both nodes to reduce the load on any one node and increase the download speed.
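
The splitting policy itself isn't spelled out in this document, but a simple round-robin assignment of fixed-size chunks over the replica list would look something like this (plan_read() and the chunk size are purely illustrative):

    #include <algorithm>
    #include <string>
    #include <sys/types.h>
    #include <vector>

    struct Chunk { std::string replica; off_t offset; size_t len; };

    // Split a read of 'size' bytes at 'offset' into fixed chunks and assign the
    // chunks round-robin across the replicas, so no single node serves it all.
    std::vector<Chunk> plan_read(const std::vector<std::string>& replicas,
                                 off_t offset, size_t size,
                                 size_t chunk_size = 64 * 1024)
    {
        std::vector<Chunk> plan;
        for (size_t done = 0, i = 0; done < size; done += chunk_size, ++i) {
            size_t len = std::min(chunk_size, size - done);
            Chunk c = { replicas[i % replicas.size()],
                        offset + static_cast<off_t>(done), len };
            plan.push_back(c);
        }
        return plan;
    }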

Read Ahead Caching (RAC):

There are 2 constants, HWM (high-water mark) and RSZ (refill size), which determine the maximum size of the RAC and how much to fetch at each cache-fill operation, respectively.

If the working size of the RAC is 0, we fill it up to HWM. Otherwise, once its working size drops below RSZ, it gets topped up with RSZ extra bytes and the oldest RSZ bytes (which are no longer needed) are discarded. Future read requests should be satisfied from the cache if possible; whenever the cache size falls below RSZ again, we repeat the cache-fill operation.
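
In code, the refill decision might look roughly like the following; the constant values are illustrative (the document doesn't give them), and the actual fetching and discarding of bytes is left to hypothetical helpers.

    #include <cstddef>

    static const size_t HWM = 256 * 1024;  // illustrative high-water mark (max RAC size)
    static const size_t RSZ = 64 * 1024;   // illustrative refill size

    // 'working' is the number of unconsumed bytes currently in the RAC.
    // Returns how many bytes to fetch from the replicas on this call.
    size_t bytes_to_fetch(size_t working)
    {
        if (working == 0)
            return HWM;   // empty cache: fill all the way up to the high-water mark
        if (working < RSZ)
            return RSZ;   // running low: top up by RSZ; the caller also drops the
                          // RSZ oldest bytes so the cache never exceeds HWM
        return 0;         // plenty cached: serve reads from the cache, no fetch
    }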

However, we also maintain 2 variables: total reads (TR) and cache hits (CH). If the hit ratio (CH/TR) falls below a certain constant (K, say 0.55), we stop using the RA logic and fall back to the normal on-demand read logic.

To aid initial caching, both variables are initialized to 32 so that any initial cache misses don't count towards stopping the RA logic. So the strategy we are adopting here is to enable the RA logic by default and then disable it if the application begins exhibiting non-sequential read behaviour. This is considerably simpler than the opposite strategy of conditionally enabling it only once we detect sequential reads happening. Additionally, most uses of p2p-fs want RA, so we enable it by default.

One thing to note is that once the application begins exhibiting a lot of random reads, and CH/TR falls below K, the variables TR and CH stop getting updated. This means that once RA stops, it will never start again unless the file is re-opened.
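
A sketch of this bookkeeping, using the K = 0.55 and the initial value of 32 mentioned above; the struct and function names are made up.

    struct RAStats {
        unsigned long total_reads;   // TR
        unsigned long cache_hits;    // CH
        bool          ra_enabled;
        RAStats() : total_reads(32), cache_hits(32), ra_enabled(true) {}
    };

    static const double K = 0.55;    // minimum acceptable cache-hit ratio

    // Called on every read() of the file.  Once the ratio drops below K the
    // counters freeze and RA stays off until the file is re-opened.
    void account_read(RAStats& s, bool served_from_cache)
    {
        if (!s.ra_enabled)
            return;                              // counters frozen; RA never restarts
        s.total_reads += 1;
        if (served_from_cache)
            s.cache_hits += 1;
        if ((double)s.cache_hits / s.total_reads < K)
            s.ra_enabled = false;                // fall back to on-demand reads
    }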

MD5 Caching for files:

In earlier versions of p2p-fs, a single thread read the files, computed their MD5 sums, and only then returned control to fuse. If you were to do an 'ls' on the mount point, it would simply block until fuse's mount code had finished executing. The next version improved this hugely by assigning a separate thread to these computations, so the directory would get filled in incrementally; doing an 'ls' would reveal more files each time! The current version is better still: it caches the MD5 values for files, and on file system mount creates 3 sets:

  1. Files on disk, not in the cache.
  2. Files in the cache, not on disk.
  3. Files in both the cache and on disk.

The MD5 for all files in [1] needs to be computed, whereas for those in [3] this is not necessary, since we already have it cached. However, we do need to recompute it for those files in [3] which have been modified since the last time their MD5 sum was computed.

We then take the UNION of [1] and [3] to get the current working set of files. These are the files displayed by an 'ls' on the fuse mount point. This reduces the run time by a _huge_ amount, since shared files generally don't change much between mounts.
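
One way the "modified since the MD5 was cached" check could work is to store each file's mtime next to its cached MD5; the document doesn't spell this out, so the struct and helper below are assumptions.

    #include <ctime>
    #include <map>
    #include <string>

    // Hypothetical cache entry: the MD5 sum plus the file's mtime at the time
    // the sum was computed.
    struct CacheEntry { std::string md5; time_t mtime; };
    typedef std::map<std::string, CacheEntry> Md5Cache;   // path -> entry

    // For a path in set [3], decide whether the cached MD5 can be reused or
    // must be recomputed because the file changed since it was hashed.
    bool can_reuse_md5(const Md5Cache& cache, const std::string& path,
                       time_t mtime_on_disk)
    {
        Md5Cache::const_iterator it = cache.find(path);
        return it != cache.end() && it->second.mtime >= mtime_on_disk;
    }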

I am using an algorithm that computes the 2-way set difference and the intersection in a single pass, and has O(n) complexity given that the input sequences are sorted. Sorting takes O(n lg n), so that is also the upper bound on the entire operation. You can find it in include/set_ops.hpp.
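
For reference, a single-pass merge over two sorted ranges that yields both differences and the intersection at once could look like the following; this is only a sketch of the idea, not the actual code in include/set_ops.hpp.

    #include <cstddef>
    #include <vector>

    // Given two sorted sequences A and B, produce A\B, B\A and A∩B in one
    // O(|A| + |B|) pass.  In the mount scenario above, A = files on disk and
    // B = files in the MD5 cache, so the outputs are sets [1], [2] and [3].
    template <typename T>
    void set_partition(const std::vector<T>& a, const std::vector<T>& b,
                       std::vector<T>& only_a, std::vector<T>& only_b,
                       std::vector<T>& both)
    {
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] < b[j])           only_a.push_back(a[i++]);
            else if (b[j] < a[i])      only_b.push_back(b[j++]);
            else { both.push_back(a[i]); ++i; ++j; }
        }
        while (i < a.size()) only_a.push_back(a[i++]);
        while (j < b.size()) only_b.push_back(b[j++]);
    }

Under this (assumed) interface, only_a would be the files to hash from scratch, only_b the stale cache entries, and the union of only_a and both the working set shown on 'ls'.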