p2p-fs Design Document
Functionality
The aim is to create a peer-to-peer file system which users can
mount, and then access other nodes' shares just as they would a normal
file system. Additionally, when users read data, the system should be
able to detect replicas of the data and bring the data in from these
multiple sources, thus accomplishing 2 main objectives:
- Reduce the load on any individual machine.
- Continue serving data to the client(s) even in the face of node
  failure, in case the file is housed at multiple nodes.
Protocol Requirements
The protocol to be designed needs to support the following
requirements:
- All the nodes in the running system need to be aware of all the
  other exports in the system.
- There should not be unbounded sending of messages on the network in
  the process of accomplishing the former.
Files involved:
There is a single configuration file, /etc/p2pfs.conf, which
contains:
- The list of HOSTS to which you wish to export your shares.
- The list of directories that you wish to share with those hosts.
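A hypothetical example of what /etc/p2pfs.conf could look like is
given below. The actual syntax is not fixed by this document, so the
HOST and EXPORT keywords and the paths are assumptions for
illustration; only the two kinds of entries (hosts and shared
directories) come from the design.

# /etc/p2pfs.conf -- illustrative only
HOST 192.168.1.100
HOST 192.168.1.101
EXPORT /home/alice/music
EXPORT /home/alice/papers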
The protocol:
- Every node is supposed to create the /etc/p2pfs.conf file and
  then mount the file system on a specified mount point.
- When this operation is performed, the file system will read the
  configuration file and begin exporting the shares to the specified
  hosts.
- On bootup, a unique identifier called the Instance
  IDentifier (IID) is associated with the running instance of
  p2p-fs. This is unique to every running instance of p2p-fs on the
  network.
- Simultaneously, the file system maintains a Message ID (MID)
  counter, which is updated for each message that it sends out.
- It also maintains an (IP, MID) mapping for each host that is
  sharing files. The key is the IP address of the host, and the value
  is the last (greatest) MID received from that host.
- A machine first sends out a file-send BEGIN request which contains
  the IP address of the HOST to which the files belong (and is empty
  if they belong to the current host), along with the MID. There is
  also a flag which specifies whether this is a boot-up message.
- If the IID associated with a particular host does not match the IID
  sent, then the MID counter is reset, and we execute the protocol as
  if this were the highest-numbered message received so far.
- If the IID sent matches the IID stored for this particular host,
  then the rest of the protocol is executed only if the MID sent is
  greater than the stored MID for that IP. This prevents a flood of
  messages on the network (see the sketch below).
- If it is a boot-up export, then the other machine will send the
  exports that it has to offer (locally) back to this machine;
  otherwise it won't. This is done to prevent continuous message
  sending. This sending back of its local exports happens only once
  all the local files have been read and their MD5 keys computed.
- This way, the newly added machine will get all the shares held by
  the machine it is exporting to.
- Then, the newly added machine starts sending out its exports to the
  machines on its HOSTS list. This ensures that the machines in the
  network get the newly added machine's exports.
- After the newly added machine is done exporting its shares to a
  machine, it sends an END message to the machine it is exporting
  to. At this point, the machine that has received the exports from
  the newly added machine also tries to propagate all of this
  machine's exports to all the machines on its HOSTS list (except for
  itself, and the machine to which the files originally belong). This
  is done only for boot-up messages, and only if the check performed
  while receiving the BEGIN message was cleared, i.e. we detect
  whether the corresponding BEGIN performed any action. If it did
  not, we refrain from sending/propagating this message to the HOSTS
  on our list. We note that the responses are sent to all HOSTS
  except the one that sent us the BEGIN in the first place.
- This ensures that all machines in the network are aware of all other
  exports.
Note: We shall take up the case of non-boot-up messages later,
when the inotify module has been implemented.
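To make the boot-up exchange concrete, here is a minimal C++ sketch of
a BEGIN message and the IID/MID duplicate-suppression check described
above. The struct and field names are assumptions for illustration;
they are not the actual p2p-fs wire format.

// Illustrative sketch only -- assumed names, not the real p2p-fs format.
#include <cstdint>
#include <map>
#include <string>

struct BeginMsg {
    std::string origin_ip;  // host the exports belong to ("" == sender itself)
    uint64_t    iid;        // sender's Instance IDentifier for this run
    uint64_t    mid;        // sender's Message ID counter value
    bool        bootup;     // true if this is a boot-up export
};

struct PeerState {
    uint64_t iid = 0;       // last IID seen from this peer
    uint64_t mid = 0;       // greatest MID seen from this peer
};

// Decide whether a BEGIN should be acted upon, per the rule above:
// a new IID resets the counter; otherwise the MID must strictly
// exceed the stored one.
bool should_process(std::map<std::string, PeerState>& peers,
                    const std::string& sender_ip, const BeginMsg& m)
{
    PeerState& st = peers[sender_ip];
    if (st.iid != m.iid) {      // peer rebooted: reset and accept
        st.iid = m.iid;
        st.mid = m.mid;
        return true;
    }
    if (m.mid > st.mid) {       // fresher message from the same instance
        st.mid = m.mid;
        return true;
    }
    return false;               // stale or duplicate: do not propagate
}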
Internals:
To aid a quick 'ls' and 'find' implementation, the names of all the
files exported by all the nodes are stored locally in memory at every
node. This means everyone has a list of all the files shared by
everyone else in their local RAM.
The directory layout is as follows. Suppose you mount the FS on
/mnt/fuse, and 192.168.1.100 and 192.168.1.101 are also sharing files;
doing an 'ls /mnt/fuse' will show:
127.0.0.1
192.168.1.100
192.168.1.101
Further, if you do an 'ls' on these directories, you can see the
files (and directories) shared by the individual nodes.
Internally, there are 2 main data structures maintained:
- MFS: the in-memory file system, which is a GLL (a hash table of
  hash tables) that stores the directory tree that should be shown by
  an 'ls -R' command. This data structure is very important and is
  used for getting file attributes, listings, doing path name lookups,
  etc. Each file has an associated MD5 tagged to it.
- FR: File Replicas. This is a hash table which maps an MD5 to the
  list of complete paths whose files are replicas of each other.
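As a rough illustration, the two structures could be declared along
the following lines in C++. The type and field names here are
assumptions, not the actual p2p-fs definitions.

// Sketch only -- assumed names, not the real p2p-fs types.
#include <string>
#include <unordered_map>
#include <vector>

// One entry per file visible under the mount point.
struct FileMeta {
    std::string md5;   // MD5 of the file's contents
    // ... size, mode, mtime and other attributes would live here
};

// MFS: host IP -> (path under that host's export -> metadata).
// This is the "hash table of hash tables" used for attributes,
// listings and path name lookups.
using MFS = std::unordered_map<std::string,
                               std::unordered_map<std::string, FileMeta>>;

// FR: MD5 -> list of complete paths ("host/path") that are replicas
// of one another.
using FR = std::unordered_map<std::string, std::vector<std::string>>;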
How are these 2 data structures used?
When one does an open, the MFS is consulted and the MD5 of the file is
obtained. Then, using this MD5, we index into the FR and get the
associated replica list, which we associate with the open FILE HANDLE
that we return to open(). Every subsequent read() passes us the
FILE HANDLE (FH), which we can use to get the associated file data (in
this case the list of replicas) for that file. Pretty simple, eh?
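A minimal sketch of that open() path, reusing the MFS and FR aliases
from the previous sketch; the handle type and function name are
hypothetical.

// Sketch only: resolve a path to its MD5 via the MFS, then to its
// replica list via the FR, and stash both in the handle given to FUSE.
#include <memory>
#include <string>
#include <vector>

struct OpenFile {
    std::string              md5;       // looked up in the MFS
    std::vector<std::string> replicas;  // looked up in the FR at open() time
    // ... read-ahead cache state would also hang off this handle
};

std::shared_ptr<OpenFile> p2p_open(const MFS& mfs, const FR& fr,
                                   const std::string& host,
                                   const std::string& path)
{
    auto fh = std::make_shared<OpenFile>();
    fh->md5      = mfs.at(host).at(path).md5;  // MFS: attributes and MD5
    fh->replicas = fr.at(fh->md5);             // FR: all paths with this MD5
    return fh;                                 // handed back on every read()
}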
So far so good.... Now, suppose 192.168.1.101 (abbrev. 101) and
192.168.1.100 (abbrev. 100) are sharing the same file, but under
different names:
100/a.mp3
101/b.mp3
Then the system will detect this and understand them to be replicas
of each other. So, if a user opens 100/a.mp3 and starts reading
from it, the read request will be split across both nodes to reduce
the load on any one node and increase the download speed.
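The document does not spell out exactly how a read is divided, so the
striping scheme below is only one possible interpretation, shown as a
sketch; the chunk size and function name are assumptions.

// Sketch only: stripe a read over the replica list in fixed-size
// chunks, round-robin, so no single node serves the whole request.
// Assumes the replica list is non-empty.
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Chunk {
    std::string replica;  // node to fetch this piece from
    std::size_t offset;   // offset within the file
    std::size_t length;   // number of bytes to fetch
};

std::vector<Chunk> plan_read(const std::vector<std::string>& replicas,
                             std::size_t offset, std::size_t length,
                             std::size_t chunk_size = 64 * 1024)
{
    std::vector<Chunk> plan;
    for (std::size_t i = 0; length > 0; ++i) {
        std::size_t n = std::min(chunk_size, length);
        plan.push_back({replicas[i % replicas.size()], offset, n});
        offset += n;
        length -= n;
    }
    return plan;
}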
Read Ahead Caching (RAC):
There are 2 constants, HWM (high-water mark) and RSZ (refill size),
which determine the maximum size of the RAC and how much to refill at
each cache-fill operation, respectively.
If the working size of the RAC is 0, then we fill it up to HWM.
Otherwise, once the working size drops below RSZ, the cache gets
filled up with RSZ extra bytes, and the first RSZ bytes (which are no
longer needed) are discarded. Future read requests should be satisfied
from the cache if possible; once its size falls below RSZ again, we
repeat the cache-fill operation.
However, we also maintain 2 variables: total reads (TR) and cache
hits (CH). If the hit ratio (CH/TR) falls below a certain constant (K,
say 0.55), we stop using the RA logic and fall back to the normal
on-demand read logic.
To aid initial caching, the 2 variables are both initialized to 32 so
that any initial cache misses don't count towards stopping the RA
logic. So, the strategy we are adopting here is to enable the RA
logic by default and then disable it if the application begins
exhibiting non-sequential read behaviour. This is considerably simpler
than the opposite strategy of conditionally enabling it only once we
detect sequential reads happening. Additionally, most uses of p2p-fs
benefit from RA, so we enable it by default.
One thing to note is that once the application begins exhibiting a lot
of random reads and CH/TR falls below K, the variables TR and CH stop
getting updated. This means that once RA stops, it will never start
again unless the file is re-opened.
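A minimal sketch of this policy is given below, assuming illustrative
values for HWM and RSZ (the document does not fix them) and
hypothetical member names.

// Sketch only: the refill and fall-back policy described above.
#include <cstddef>

class ReadAheadCache {
    static constexpr std::size_t HWM = 1 << 20;    // assumed: 1 MiB max cache
    static constexpr std::size_t RSZ = 128 * 1024; // assumed: 128 KiB refill
    static constexpr double      K   = 0.55;       // hit-ratio threshold

    std::size_t working_size_ = 0;  // unread bytes currently buffered
    std::size_t tr_ = 32;           // total reads, seeded so that early
    std::size_t ch_ = 32;           // cache misses don't disable RA
    bool        ra_enabled_ = true; // once false, stays false until re-open

public:
    // Called on every read(); 'hit' says whether the cache served it.
    void record_read(bool hit)
    {
        if (!ra_enabled_) return;   // TR/CH stop updating once RA is off
        ++tr_;
        if (hit) ++ch_;
        if (static_cast<double>(ch_) / tr_ < K)
            ra_enabled_ = false;    // fall back to on-demand reads
    }

    // How many bytes to fetch ahead before serving the next read
    // (the actual data movement is omitted from this sketch).
    std::size_t refill_amount() const
    {
        if (!ra_enabled_)        return 0;
        if (working_size_ == 0)  return HWM;  // empty: fill up to HWM
        if (working_size_ < RSZ) return RSZ;  // slide the window by RSZ
        return 0;                             // enough data buffered
    }
};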
MD5 Caching for files:
In earlier versions of p2p-fs, there used to be one thread that read
the files, computed the MD5s, and then returned control to FUSE. If
you were to do an 'ls' on the mount point, the system would block
until FUSE's mount code had finished executing. The next version
improved on this hugely by assigning a separate thread to perform
these computations, so the directory would get filled in
incrementally: doing an 'ls' would reveal more files every time! The
current version is even better in the sense that it caches the MD5
values for files, and on file system mount creates 3 sets:
[1] Files on disk, not in the cache.
[2] Files in the cache, not on disk.
[3] Files in both the cache and on disk.
The MD5 for all files in [1] needs to be computed, whereas for those
in [3] it is not necessary, since we already have it cached. However,
we do need to recompute it for those files in [3] which have been
modified since the last time their MD5 sum was computed.
We then take the UNION of [1] and [3] to get the current working set
of files. These are the files that are displayed on an 'ls' of the
FUSE mount point. This reduces the run time by a _huge_ amount, since
files that are shared generally don't change much between mounts.
I am using an algorithm that does a 2-way set difference and
intersection in a single pass and has O(n) complexity, given that the
input sequences are sorted. Sorting takes O(n.lg(n)), so this is also
the upper bound on the entire operation. You can find it in
include/set_ops.hpp.
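For illustration, a single-pass split of two sorted name lists into
the three sets above might look like the sketch below; the real code
lives in include/set_ops.hpp and its exact interface may differ.

// Sketch only: 2-way difference and intersection in one pass over
// sorted inputs, O(|a| + |b|).
#include <cstddef>
#include <string>
#include <vector>

struct SplitResult {
    std::vector<std::string> only_a;  // [1] on disk, not in the cache
    std::vector<std::string> only_b;  // [2] in the cache, not on disk
    std::vector<std::string> both;    // [3] in both
};

SplitResult split_sets(const std::vector<std::string>& a,  // files on disk
                       const std::vector<std::string>& b)  // files in cache
{
    SplitResult r;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j])      r.only_a.push_back(a[i++]);
        else if (b[j] < a[i]) r.only_b.push_back(b[j++]);
        else { r.both.push_back(a[i]); ++i; ++j; }
    }
    while (i < a.size()) r.only_a.push_back(a[i++]);
    while (j < b.size()) r.only_b.push_back(b[j++]);
    return r;
}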