p2p-fs Design Document
Functionality
The aim is to create a peer-to-peer file system which users can
mount, and then access other nodes' shares just as they would a normal
file system. Additionally, when users read data, the system should be
able to detect replicas of the data and bring the data in from these
multiple sources, thus accomplishing 2 main objectives:
- Reduce the load on any individual machine.
- Continue serving data to the client(s) even in the face of node
  failure, in case the file is housed at multiple nodes.
Protocol Requirements
The protocol to be designed needs to support the following
requirements:
- All the nodes in the running system need to be aware of all the
  other exports in the system.
- There should not be unbounded sending of messages on the network in
  the process of accomplishing the former.
Files involved:
There is a single configuration file, /etc/p2pfs.conf, which
contains:
- The list of HOSTS to which you wish to export your shares.
- The list of directories that you wish to share with those hosts.
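A hypothetical example of what /etc/p2pfs.conf could look like is
given below. The actual syntax is not fixed by this document, so the
HOST and EXPORT keywords and the paths are assumptions for
illustration; only the two kinds of entries (hosts and shared
directories) come from the design.

# /etc/p2pfs.conf -- illustrative only
HOST 192.168.1.100
HOST 192.168.1.101
EXPORT /home/alice/music
EXPORT /home/alice/papers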
The protocol:
- Every node is supposed to create the /etc/p2pfs.conf file and
  then mount the file system on a specified mount point.
- When this operation is performed, the file system will read the
  configuration file and begin exporting the shares to the specified
  hosts.
- On bootup, a unique identifier called the Instance
  IDentifier (IID) is associated with the running instance of
  p2p-fs. This is unique to every running instance of p2p-fs on the
  network.
- Simultaneously, the file system maintains a Message ID (MID)
  counter, which is updated for each message that it sends out.
- It also maintains an (IP, MID) mapping for each host that is
  sharing files. The key is the IP address of the host, and the value
  is the last (greatest) MID received from that host.
- A machine first sends out a file-send BEGIN request which contains
  the IP address of the HOST to which the files belong (and is empty
  if they belong to the current host), along with the MID. There is
  also a flag which specifies whether this is a boot-up message.
- If the IID associated with a particular host does not match the IID
  sent, then the MID counter is reset, and we execute the protocol as
  if this were the highest-numbered message received so far.
- If the IID sent matches the IID stored for this particular host,
  then the rest of the protocol is executed only if the MID sent is
  greater than the stored MID for that IP. This prevents a flood of
  messages on the network (see the sketch below).
- If it is a boot-up export, then the other machine will send the
  exports that it has to offer (locally) back to this machine;
  otherwise it won't. This is done to prevent continuous message
  sending. This sending back of its local exports happens only once
  all the local files have been read and their MD5 keys computed.
- This way, the newly added machine will get all the shares held by
  the machine it is exporting to.
- Then, the newly added machine starts sending out its exports to the
  machines on its HOSTS list. This ensures that the machines in the
  network get the newly added machine's exports.
- After the newly added machine is done exporting its shares to a
  machine, it sends an END message to the machine it is exporting
  to. At this point, the machine that has received the exports from
  the newly added machine also tries to propagate all of this
  machine's exports to all the machines on its HOSTS list (except for
  itself, and the machine to which the files originally belong). This
  is done only for boot-up messages, and only if the check performed
  while receiving the BEGIN message was cleared, i.e. we detect
  whether the corresponding BEGIN performed any action. If it did
  not, we refrain from sending/propagating this message to the HOSTS
  on our list. We note that the responses are sent to all HOSTS
  except the one that sent us the BEGIN in the first place.
- This ensures that all machines in the network are aware of all other
  exports.
Note: We shall take up the case of non-boot-up messages later,
when the inotify module has been implemented.
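To make the boot-up exchange concrete, here is a minimal C++ sketch of
a BEGIN message and the IID/MID duplicate-suppression check described
above. The struct and field names are assumptions for illustration;
they are not the actual p2p-fs wire format.

// Illustrative sketch only -- assumed names, not the real p2p-fs format.
#include <cstdint>
#include <map>
#include <string>

struct BeginMsg {
    std::string origin_ip;  // host the exports belong to ("" == sender itself)
    uint64_t    iid;        // sender's Instance IDentifier for this run
    uint64_t    mid;        // sender's Message ID counter value
    bool        bootup;     // true if this is a boot-up export
};

struct PeerState {
    uint64_t iid = 0;       // last IID seen from this peer
    uint64_t mid = 0;       // greatest MID seen from this peer
};

// Decide whether a BEGIN should be acted upon, per the rule above:
// a new IID resets the counter; otherwise the MID must strictly
// exceed the stored one.
bool should_process(std::map<std::string, PeerState>& peers,
                    const std::string& sender_ip, const BeginMsg& m)
{
    PeerState& st = peers[sender_ip];
    if (st.iid != m.iid) {      // peer rebooted: reset and accept
        st.iid = m.iid;
        st.mid = m.mid;
        return true;
    }
    if (m.mid > st.mid) {       // fresher message from the same instance
        st.mid = m.mid;
        return true;
    }
    return false;               // stale or duplicate: do not propagate
}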
Internals:
To aid a quick 'ls' and 'find' implementation, the names of all the
files exported by all the nodes are stored locally in memory at every
node. This means everyone has a list of all the files shared by
everyone else in their local RAM.
The directory layout is as follows. Suppose you mount the FS on
/mnt/fuse, and 192.168.1.100 and 192.168.1.101 are also sharing files;
doing an 'ls /mnt/fuse' will show:
127.0.0.1
192.168.1.100
192.168.1.101
Further, if you do an 'ls' on these directories, you can see the
files (and directories) shared by the individual nodes.
Internally, there are 2 main data structures maintained:
- MFS: the in-memory file system, which is a GLL (a hash table of
  hash tables) that stores the directory tree that should be shown by
  an 'ls -R' command. This data structure is very important and is
  used for getting file attributes, listings, doing path name lookups,
  etc. Each file has an associated MD5 tagged to it.
- FR: File Replicas. This is a hash table which maps an MD5 to the
  list of complete paths whose files are replicas of each other.
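As a rough illustration, the two structures could be declared along
the following lines in C++. The type and field names here are
assumptions, not the actual p2p-fs definitions.

// Sketch only -- assumed names, not the real p2p-fs types.
#include <string>
#include <unordered_map>
#include <vector>

// One entry per file visible under the mount point.
struct FileMeta {
    std::string md5;   // MD5 of the file's contents
    // ... size, mode, mtime and other attributes would live here
};

// MFS: host IP -> (path under that host's export -> metadata).
// This is the "hash table of hash tables" used for attributes,
// listings and path name lookups.
using MFS = std::unordered_map<std::string,
                               std::unordered_map<std::string, FileMeta>>;

// FR: MD5 -> list of complete paths ("host/path") that are replicas
// of one another.
using FR = std::unordered_map<std::string, std::vector<std::string>>;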
How are these 2 data structures used?
When one does an open, the MFS is consulted and the MD5 of the file is
obtained. Then, using this MD5, we index into the FR and get the
associated replica list, which we associate with the open FILE HANDLE
that we return to open(). Every subsequent read() passes us the
FILE HANDLE (FH), which we can use to get the associated file data (in
this case the list of replicas) for that file. Pretty simple, eh?
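A minimal sketch of that open() path, reusing the MFS and FR aliases
from the previous sketch; the handle type and function name are
hypothetical.

// Sketch only: resolve a path to its MD5 via the MFS, then to its
// replica list via the FR, and stash both in the handle given to FUSE.
#include <memory>
#include <string>
#include <vector>

struct OpenFile {
    std::string              md5;       // looked up in the MFS
    std::vector<std::string> replicas;  // looked up in the FR at open() time
    // ... read-ahead cache state would also hang off this handle
};

std::shared_ptr<OpenFile> p2p_open(const MFS& mfs, const FR& fr,
                                   const std::string& host,
                                   const std::string& path)
{
    auto fh = std::make_shared<OpenFile>();
    fh->md5      = mfs.at(host).at(path).md5;  // MFS: attributes and MD5
    fh->replicas = fr.at(fh->md5);             // FR: all paths with this MD5
    return fh;                                 // handed back on every read()
}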
So far so good.... Now, suppose 192.168.1.101 (abbrev. 101) and
192.168.1.100 (abbrev. 100) are sharing the same file, but under
different names:
100/a.mp3
101/b.mp3
Then the system will detect this and understand them to be replicas
of each other. So, if a user opens 100/a.mp3 and starts reading
from it, the read request will be split across both nodes to reduce
the load on any one node and increase the download speed.
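The document does not spell out exactly how a read is divided, so the
striping scheme below is only one possible interpretation, shown as a
sketch; the chunk size and function name are assumptions.

// Sketch only: stripe a read over the replica list in fixed-size
// chunks, round-robin, so no single node serves the whole request.
// Assumes the replica list is non-empty.
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Chunk {
    std::string replica;  // node to fetch this piece from
    std::size_t offset;   // offset within the file
    std::size_t length;   // number of bytes to fetch
};

std::vector<Chunk> plan_read(const std::vector<std::string>& replicas,
                             std::size_t offset, std::size_t length,
                             std::size_t chunk_size = 64 * 1024)
{
    std::vector<Chunk> plan;
    for (std::size_t i = 0; length > 0; ++i) {
        std::size_t n = std::min(chunk_size, length);
        plan.push_back({replicas[i % replicas.size()], offset, n});
        offset += n;
        length -= n;
    }
    return plan;
}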
Read Ahead Caching (RAC):
There are 2 constants, HWM (high-water mark) and RSZ (refill size),
which determine the maximum size of the RAC and how much to refill at
each cache-fill operation, respectively.
If the working size of the RAC is 0, then we fill it up to HWM.
Otherwise, once the working size drops below RSZ, the cache gets
filled up with RSZ extra bytes, and the first RSZ bytes (which are no
longer needed) are discarded. Future read requests should be satisfied
from the cache if possible; once its size falls below RSZ again, we
repeat the cache-fill operation.
However, we also maintain 2 variables: total reads (TR) and cache
hits (CH). If the hit ratio (CH/TR) falls below a certain constant (K,
say 0.55), we stop using the RA logic and fall back to the normal
on-demand read logic.
To aid initial caching, the 2 variables are both initialized to 32 so
that any initial cache misses don't count towards stopping the RA
logic. So, the strategy we are adopting here is to enable the RA
logic by default and then disable it if the application begins
exhibiting non-sequential read behaviour. This is considerably simpler
than the opposite strategy of conditionally enabling it only once we
detect sequential reads happening. Additionally, most uses of p2p-fs
benefit from RA, so we enable it by default.
One thing to note is that once the application begins exhibiting a lot
of random reads and CH/TR falls below K, the variables TR and CH stop
getting updated. This means that once RA stops, it will never start
again unless the file is re-opened.
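A minimal sketch of this policy is given below, assuming illustrative
values for HWM and RSZ (the document does not fix them) and
hypothetical member names.

// Sketch only: the refill and fall-back policy described above.
#include <cstddef>

class ReadAheadCache {
    static constexpr std::size_t HWM = 1 << 20;    // assumed: 1 MiB max cache
    static constexpr std::size_t RSZ = 128 * 1024; // assumed: 128 KiB refill
    static constexpr double      K   = 0.55;       // hit-ratio threshold

    std::size_t working_size_ = 0;  // unread bytes currently buffered
    std::size_t tr_ = 32;           // total reads, seeded so that early
    std::size_t ch_ = 32;           // cache misses don't disable RA
    bool        ra_enabled_ = true; // once false, stays false until re-open

public:
    // Called on every read(); 'hit' says whether the cache served it.
    void record_read(bool hit)
    {
        if (!ra_enabled_) return;   // TR/CH stop updating once RA is off
        ++tr_;
        if (hit) ++ch_;
        if (static_cast<double>(ch_) / tr_ < K)
            ra_enabled_ = false;    // fall back to on-demand reads
    }

    // How many bytes to fetch ahead before serving the next read
    // (the actual data movement is omitted from this sketch).
    std::size_t refill_amount() const
    {
        if (!ra_enabled_)        return 0;
        if (working_size_ == 0)  return HWM;  // empty: fill up to HWM
        if (working_size_ < RSZ) return RSZ;  // slide the window by RSZ
        return 0;                             // enough data buffered
    }
};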
MD5 Caching for files:
In earlier versions of p2p-fs, there used to be one thread that read
the files, computed the MD5s, and then returned control to FUSE. If
you were to do an 'ls' on the mount point, the system would block
until FUSE's mount code had finished executing. The next version
improved on this hugely by assigning a separate thread to perform
these computations, so the directory would get filled in
incrementally: doing an 'ls' would reveal more files every time! The
current version is even better in the sense that it caches the MD5
values for files, and on file system mount creates 3 sets:
[1] Files on disk, not in the cache.
[2] Files in the cache, not on disk.
[3] Files in both the cache and on disk.
The MD5 for all files in [1] needs to be computed, whereas for those
in [3] it is not necessary, since we already have it cached. However,
we do need to recompute it for those files in [3] which have been
modified since the last time their MD5 sum was computed.
We then take the UNION of [1] and [3] to get the current working set
of files. These are the files that are displayed on an 'ls' of the
FUSE mount point. This reduces the run time by a _huge_ amount, since
files that are shared generally don't change much between mounts.
I am using an algorithm that does a 2-way set difference and
intersection in a single pass and has O(n) complexity, given that the
input sequences are sorted. Sorting takes O(n.lg(n)), so this is also
the upper bound on the entire operation. You can find it in
include/set_ops.hpp.
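For illustration, a single-pass split of two sorted name lists into
the three sets above might look like the sketch below; the real code
lives in include/set_ops.hpp and its exact interface may differ.

// Sketch only: 2-way difference and intersection in one pass over
// sorted inputs, O(|a| + |b|).
#include <cstddef>
#include <string>
#include <vector>

struct SplitResult {
    std::vector<std::string> only_a;  // [1] on disk, not in the cache
    std::vector<std::string> only_b;  // [2] in the cache, not on disk
    std::vector<std::string> both;    // [3] in both
};

SplitResult split_sets(const std::vector<std::string>& a,  // files on disk
                       const std::vector<std::string>& b)  // files in cache
{
    SplitResult r;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j])      r.only_a.push_back(a[i++]);
        else if (b[j] < a[i]) r.only_b.push_back(b[j++]);
        else { r.both.push_back(a[i]); ++i; ++j; }
    }
    while (i < a.size()) r.only_a.push_back(a[i++]);
    while (j < b.size()) r.only_b.push_back(b[j++]);
    return r;
}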