Finding the Right Distributed Cluster File System
Dear Habrasociety!
I'd like your advice on choosing a distributed cluster file system. I have no experience with them, and they differ widely from one another in design and features. There is also a real shortage of information on the subject: concrete details are simply hard to find.
The system is built on Linux.
What I need from this file system:
- Distribution of data across network nodes
- Automatic creation of replicas (I need 3; better still if the number is configurable)
- To clients and users it should look like a regular file system, POSIX-compliant or close to it
- Built-in high availability, automatic recovery, adding new nodes on the fly
- XFS support is desirable
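To make the first two requirements concrete, here is a rough Python sketch of the behaviour I have in mind. It is only an illustration under my own assumptions, not how any particular file system works: the function name `place_replicas`, the node names, and the choice of rendezvous hashing are all hypothetical.

```python
import hashlib


def place_replicas(path, nodes, replicas=3):
    """Pick `replicas` distinct nodes for a file (hypothetical helper).

    Uses rendezvous (highest-random-weight) hashing: every node gets a
    score for the given path, and the top-scoring nodes hold the copies.
    """
    if replicas > len(nodes):
        raise ValueError("not enough nodes for the requested replica count")

    def score(node):
        # Hash the node name together with the path to get a stable score.
        digest = hashlib.sha256(f"{node}:{path}".encode()).hexdigest()
        return int(digest, 16)

    return sorted(nodes, key=score, reverse=True)[:replicas]


# Hypothetical cluster of five storage nodes.
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas("/sites/example.com/index.html", nodes)
print(placement)
```

The point of the sketch: each file ends up on three different nodes, and adding a node on the fly moves at most one replica of any given file rather than reshuffling everything.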
The task:
The task is essentially simple: web hosting. The storage will hold the sites themselves and their files. The web server will be connected to the storage and work with it directly, as with an ordinary file system.
What I found on my own:
The first thing I came across was DRBD. It is replication only; it can be used for geographic replication, for example. In short, it is not a file system.
Next came GFS (Global File System). After studying the available information, I found that it is not distributed: it simply lets clients attach to a central storage and all work with it simultaneously. For small volumes this is generally quite suitable, and fault tolerance can be arranged by mirroring the data with the same DRBD. For large volumes, however, you have to resort to expensive storage systems, because GFS works with block devices attached via iSCSI, FC, InfiniBand, and so on. Costs rise sharply, since you need to buy expensive hardware, and two units of it at that, so that the second can act as a standby for the first. Of course, I don't know, maybe it is possible to build some kind of virtual block device out of a pack of servers, but in my opinion that is already a hack.
And then I finally got to GlusterFS (official site). Judging by the description, it is exactly what I need: a distributed cluster file system with data replication and distribution of data among network nodes, and it scales almost linearly. It has automatic recovery, adding nodes to the cluster on the fly, and so on; in general, a full-fledged, mature file system. It is used on many production clusters around the world.
So, my questions:
- Has anyone here worked with such systems? What should I expect from them, and what are the pitfalls?
- Does anyone know of other, more suitable file systems?
P.S. Please don't suggest Hadoop, MogileFS, and the like: they are more of a framework for embedding into applications. I need a solution purely at the file system level.
P.P.S. Please keep in mind that we are discussing fully functional, stable file systems that can be used in production. Many people suggest products that are in early development (PohmelFS) and/or have a lot of restrictions (GridFS, which has no permissions, no folders, and where even file creation is an experimental feature; GridFS is built on top of MongoDB).