GlusterFS on EC2
GlusterFS is a POSIX-compliant distributed file system. It is very flexible thanks to its modular "translators" and well suited for cloud computing. GlusterFS can replicate files, parallelize volume access, stripe data, distribute files (a server that stores a file will have the whole file) and much more. The actual configuration of the storage cluster depends on the desired specs. For example, if application data integrity and durability are a primary concern, striping is probably not an option; replication can improve read performance and durability but hurts write performance, and so on.
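To make the difference between distributing, replicating and striping concrete, here is a toy sketch of where the data ends up in each mode. It is only an illustration: the function names and the MD5-based placement are my own assumptions and are not GlusterFS's real hashing or translator code.

```python
import hashlib

def distribute(name, servers):
    """Whole file lands on exactly one server, picked by hashing its name.
    (Toy hash; not the algorithm GlusterFS itself uses.)"""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return {servers[index]: [name]}

def replicate(name, servers):
    """Every server holds a full copy of the file."""
    return {server: [name] for server in servers}

def stripe(data, servers, chunk_size):
    """The file is cut into chunks that are dealt round-robin to the servers,
    so no single server holds the whole file."""
    placement = {server: [] for server in servers}
    for offset in range(0, len(data), chunk_size):
        server = servers[(offset // chunk_size) % len(servers)]
        placement[server].append(data[offset:offset + chunk_size])
    return placement
```

With striping, losing one server damages every file that has a chunk on it, which is why it is a poor fit when durability comes first.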
Unlike in many other distributed file systems, the filesystem structure is defined on the client side. The translator stack should therefore be consistent between clients of the Gluster file system, or at least architecturally compatible.
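One cheap way to check that clients agree is to compare their volume files after stripping comments and whitespace. The sketch below assumes you have already collected each client's volfile text yourself (how you fetch it is up to you); the helper names are hypothetical, and treating everything after `#` as a comment is a simplification.

```python
import hashlib

def normalize_volfile(text):
    """Drop comments and blank lines and collapse whitespace, so that
    purely cosmetic differences between volfiles are ignored."""
    lines = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # simplification: '#' starts a comment
        if line:
            lines.append(" ".join(line.split()))
    return "\n".join(lines)

def group_clients_by_volfile(volfiles):
    """Map a digest of each normalized volfile to the clients using it.
    More than one group means the clients do not share the same stack."""
    groups = {}
    for client, text in volfiles.items():
        normalized = normalize_volfile(text).encode("utf-8")
        digest = hashlib.sha256(normalized).hexdigest()
        groups.setdefault(digest, []).append(client)
    return groups
```

If the result has a single key, all clients use an identical translator stack. Several keys only tell you the files differ, not whether the stacks are still compatible.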
Using GlusterFS for heavily accessed small files is infeasible. A good example is sharing the PHP scripts of a busy website: CPU usage and latency are so bad that it is practically unusable.
Software installation and configuration
The binary packages on gluster.com are for 64-bit platforms. There are no packages for version 3.1+ in the Debian repositories yet, so I needed to compile and package it myself to use on small and medium instances. Compilation and packaging are a snap: Gluster uses autotools, so building a Debian package with cdbs is straightforward.
Unlike in the 3.0 series, the packages are no longer split into client, server and lib packages (maybe the Debian maintainers will split them in the official deb, I don't know). There is one package containing everything, so be gentle with the postinst script.
Server configuration
Client configuration
On Gluster, almost everything works client side. It is very important to keep things synchronized between clients: time, user IDs, Gluster client configurations. Like NFSv3, Gluster does not translate POSIX uid numbers, so the clients must have the same ID numbers if you want permissions to work. This also implies the same security assumption as NFSv3: you have total control of all the systems that are capable of accessing the file system. See security below. NTP time synchronization is a must if you use IO caching, or else you will have a stale cache.
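Since mismatched uid numbers fail silently, it helps to check them before mounting. A minimal sketch, assuming you have copied the contents of /etc/passwd from each client into a dictionary keyed by a client name of your choosing (the function names are hypothetical). It only flags users that exist on several clients with different numbers, not users missing from some clients.

```python
def parse_passwd(text):
    """Return {username: uid} from passwd-format text
    (name:password:uid:gid:gecos:home:shell)."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 3:
            continue  # malformed entry, skip it
        users[fields[0]] = int(fields[2])
    return users

def uid_mismatches(passwd_by_client):
    """Return {username: {client: uid}} for every user whose uid
    differs between at least two clients."""
    seen = {}
    for client, text in passwd_by_client.items():
        for name, uid in parse_passwd(text).items():
            seen.setdefault(name, {})[client] = uid
    return {name: uids for name, uids in seen.items()
            if len(set(uids.values())) > 1}
```

An empty result means every shared user name maps to the same uid everywhere. The same approach works for /etc/group and gid numbers.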
Disasters and recovery
Performance
Performance varies greatly between different server and client configurations. As a general rule, native Gluster clients can utilize the cluster more efficiently, but this does not mean they will always perform better. For example, the Gluster FUSE mount client performs very poorly with many small files/blocks because of the many context switches to kernel space and back. If you are facing a use case which requires many small files, use heavy caching or switch to a different client like NFS or Booster.
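To see how bad the small-file case is on a given setup, a rough micro-benchmark is enough. The sketch below is my own assumption of a reasonable test, not an official tool: it writes and reads back many small files in a directory and reports the elapsed time. The file naming, count and size are arbitrary, and reads may be served from a cache, so treat the numbers as a relative comparison only.

```python
import os
import time

def time_small_files(directory, count=1000, size=1024):
    """Write, then read back, `count` files of `size` bytes in `directory`.
    Returns (write_seconds, read_seconds) and removes the files afterwards."""
    payload = b"x" * size
    paths = [os.path.join(directory, "bench_%06d.tmp" % i) for i in range(count)]

    start = time.perf_counter()
    for path in paths:
        with open(path, "wb") as handle:
            handle.write(payload)
    write_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for path in paths:
        with open(path, "rb") as handle:
            handle.read()
    read_seconds = time.perf_counter() - start

    for path in paths:
        os.remove(path)
    return write_seconds, read_seconds
```

Run it once against a local directory and once against the mount point of each client type you are considering, then compare the results.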
Caching
Security
--
AvishaiIshShalom - 31 Dec 2010