Configuring chunk server load awareness¶
In a default setup the system tries to distribute all chunks evenly according to the predefined goals and/or EC configurations. The same applies to chunk servers within a defined label group (see the chunk server labeling documentation for details on labels). While this is the right setting for most cases, other cases may require a different one. For example, if your chunk servers also handle workloads unrelated to LizardFS, you will want the distribution to take into account the amount of I/O load each chunk server is handling at a given moment.
This functionality can be enabled by a setting in the chunk server’s mfschunkserver.cfg configuration file. By default it is disabled, and it is configured per chunk server.
Additionally, a penalty can be defined in the master configuration file (the setting for this is LOAD_FACTOR_PENALTY). It is applied to a chunk server if it is found to be under heavy I/O load.
Configuring rack awareness (network topology)¶
The topology of a LizardFS network can be defined in the mfstopology.cfg file. This configuration file consists of lines that each assign a switch number to an ADDRESS.
ADDRESS can be represented as:
| n.n.n.n | single IP address |
| n.n.n.n/b | IP class specified by network address and bits number |
| n.n.n.n/m.m.m.m | IP class specified by network address and mask |
| f.f.f.f-t.t.t.t | IP range specified by from-to addresses (inclusive) |
The switch number can be specified as a positive 32-bit integer.
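To make the four ADDRESS forms concrete, here is a minimal Python sketch that resolves each form to the inclusive range of addresses it covers. It is only an illustration built on the standard `ipaddress` module, not the parser LizardFS itself uses; the function names `address_range` and `covers` are hypothetical.

```python
import ipaddress


def address_range(spec):
    """Return the (first, last) IPv4 addresses covered by an ADDRESS spec."""
    if "-" in spec:
        # f.f.f.f-t.t.t.t: inclusive from-to range
        first, last = (ipaddress.IPv4Address(part) for part in spec.split("-"))
    elif "/" in spec:
        # n.n.n.n/b or n.n.n.n/m.m.m.m: IPv4Network accepts both a
        # prefix length and a dotted netmask after the slash
        net = ipaddress.IPv4Network(spec, strict=False)
        first, last = net.network_address, net.broadcast_address
    else:
        # n.n.n.n: a single address
        first = last = ipaddress.IPv4Address(spec)
    return first, last


def covers(spec, ip):
    """Check whether the address ip falls inside the ADDRESS spec."""
    first, last = address_range(spec)
    return first <= ipaddress.IPv4Address(ip) <= last
```

For example, `covers("192.168.1.0/24", "192.168.1.42")` and `covers("10.0.0.1-10.0.0.9", "10.0.0.9")` are both true under this sketch.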
Distances calculated from mfstopology.cfg are used to sort chunk servers during read/write operations. Chunk servers closer to a client (with a lower distance) are favored over those further away.
Please note that new chunks are still created at random to ensure their equal distribution. Rebalancing procedures ignore the topology configuration as well.
For now, the distance can take one of three values, 0, 1 or 2:
0 - IP addresses are the same
1 - IP addresses differ, but switch numbers are the same
2 - switch numbers differ
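The three distance values above can be sketched as a small Python function, together with the sort that favors closer chunk servers. This is an illustration of the rule as described here, not the LizardFS implementation; the function names and the `(ip, switch)` tuple layout are assumptions.

```python
def topology_distance(client_ip, client_switch, server_ip, server_switch):
    """Distance between a client and a chunk server, as defined above."""
    if client_ip == server_ip:
        return 0  # IP addresses are the same
    if client_switch == server_switch:
        return 1  # IP addresses differ, but switch numbers are the same
    return 2      # switch numbers differ


def sort_by_distance(client, servers):
    """Order (ip, switch) chunk servers so that the closest come first."""
    return sorted(servers, key=lambda server: topology_distance(*client, *server))
```

With a client at `("10.0.0.5", 1)`, a chunk server on the same host sorts first, one behind the same switch second, and one behind a different switch last.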
The topology feature works well with chunk server labeling. A combination of the two can be used to make sure that clients read from and write to the chunk servers best suited for them (e.g. those on the same network switch).
Quality of service can be configured in the /etc/mfs/globaliolimits.cfg file.
Configuration options consist of:
- subsystem <subsystem> - cgroups subsystem by which clients are classified
- limit <group> <throughput in KiB/s> - limit for clients in cgroup <group>
- limit unclassified <throughput in KiB/s> - limit for clients that do not match any specified group
If globaliolimits.cfg is not empty and limit unclassified is not set, unclassified clients will be prevented from performing I/O on LizardFS.
# All clients share 1MiB/s of bandwidth
limit unclassified 1024

# All clients in blkio/a group are limited to 1MiB/s, other clients are blocked
subsystem blkio
limit /a 1024

# The directory /a in the blkio group is allowed to transfer 1MiB/s
# /b/a group gets 2MiB/s
# unclassified clients share 256KiB/s of bandwidth.
subsystem blkio
limit unclassified 256
limit /a 1024
limit /b/a 2048
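The way these options combine can be sketched in Python. The sketch below parses text in the format shown above and looks up the limit that would apply to a client's group. It rests on several assumptions that this section does not confirm: groups are matched by exact name only, an empty configuration means no limiting at all, and a result of 0 stands for "blocked". The function names are hypothetical.

```python
def parse_io_limits(text):
    """Parse globaliolimits.cfg-style text into (subsystem, {group: KiB/s})."""
    subsystem, limits = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split()
        if fields[0] == "subsystem" and len(fields) == 2:
            subsystem = fields[1]
        elif fields[0] == "limit" and len(fields) == 3:
            limits[fields[1]] = int(fields[2])
        else:
            raise ValueError(f"unrecognized line: {raw!r}")
    return subsystem, limits


def effective_limit(limits, group):
    """Return the limit in KiB/s, 0 if blocked, or None if nothing is limited."""
    if not limits:
        return None  # assumption: an empty configuration limits nobody
    if group in limits:
        return limits[group]  # assumption: exact group name match only
    # No matching group: fall back to 'unclassified', blocked if it is not set
    return limits.get("unclassified", 0)
```

Applied to the third example above, `/a` resolves to 1024, `/b/a` to 2048, and any other group to 256. Applied to the second example, any group other than `/a` resolves to 0.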
The quota mechanism can be used to limit inode usage and space usage for users and groups. By default quotas can be set only by a superuser. Setting the SESFLAG_ALLCANCHANGEQUOTA flag in the mfsexports.cfg file allows everybody to change quotas.
To set a quota for a certain user or group, use the mfssetquota tool:
mfssetquota (-u UID/-g GID) SL_SIZE HL_SIZE SL_INODES HL_INODES MOUNTPOINT_PATH
- SL - soft limit
- HL - hard limit
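As an illustration of the argument order in the synopsis above, this Python sketch assembles an mfssetquota command line without running it. All values (UID 1001, the sizes, the inode counts and the mount point /mnt/lizardfs) are hypothetical, and the size values are passed through unchanged because the accepted units are not stated here. The helper name `mfssetquota_args` is also an assumption.

```python
import shlex


def mfssetquota_args(kind, ident, sl_size, hl_size, sl_inodes, hl_inodes, mountpoint):
    """Build the mfssetquota argument list in the order given by the synopsis."""
    if kind not in ("-u", "-g"):
        raise ValueError("kind must be '-u' (user) or '-g' (group)")
    return [
        "mfssetquota", kind, str(ident),
        str(sl_size), str(hl_size),      # soft and hard limit for space
        str(sl_inodes), str(hl_inodes),  # soft and hard limit for inodes
        mountpoint,
    ]


# Hypothetical values, for illustration only
args = mfssetquota_args("-u", 1001, 1000000000, 2000000000, 10000, 20000, "/mnt/lizardfs")
print(shlex.join(args))
```

The resulting list can be handed to `subprocess.run(args, check=True)` on a machine where the tool is installed.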
Mounting the meta data¶
LizardFS meta data can be managed through a special mount point called META. This mount point allows you to control trashed items (undelete them or delete them permanently) and to view files that are already deleted but still held open by clients.
To be able to mount meta data you need to add the “mfsmeta” parameter to the mfsmount command:
# mfsmount /mnt/lizardfs-meta -o mfsmeta
After that you will see the following line in mtab:
mfsmeta#10.32.20.41:9321 on /mnt/lizardfs-meta type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
The structure of the mounted meta data directory will look like this:
/mnt/lizardfs-meta/
├── reserved
└── trash
    └── undel
Each file with a ‘trashtime’ setting above zero will be present here. You can recover those files or delete them permanently.
Recovering files from the trash¶
In order to recover a file, just move it to the undel/ directory. Files are represented by their inode number and path, so the file dir1/dir2/file.txt with inode 5 will be present in trash as 5|dir1|dir2|file.txt, and recovering it would be performed like this:
$ cd trash
$ mv '5|dir1|dir2|file.txt' undel/
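The naming scheme used in the mv command above (inode number, then the path with "|" in place of "/") can be sketched in Python as follows. The helper names are hypothetical, and the parsing direction assumes that the original path contains no "|" character of its own.

```python
import os


def trash_entry_name(inode, path):
    """Build the trash entry name for a deleted file: inode|dir|...|file."""
    return f"{inode}|" + path.strip("/").replace("/", "|")


def parse_trash_entry(name):
    """Split a trash entry name back into (inode, original path)."""
    inode, _, encoded = name.partition("|")
    return int(inode), encoded.replace("|", "/")


def undelete(meta_mount, inode, path):
    """Recover a file by moving its trash entry into trash/undel/."""
    name = trash_entry_name(inode, path)
    trash = os.path.join(meta_mount, "trash")
    os.rename(os.path.join(trash, name), os.path.join(trash, "undel", name))
```

Under these assumptions, `undelete("/mnt/lizardfs-meta", 5, "dir1/dir2/file.txt")` performs the same move as the shell commands above.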
Removing files permanently¶
In order to delete a file permanently, just remove it from trash.
If you delete a file while someone else still uses it and keeps an open descriptor, you will see this file in the reserved directory until the descriptor is closed.
Deploying LizardFS as a HA Cluster¶
LizardFS can be run as a high-availability cluster on several nodes. In HA mode, a dedicated daemon watches the status of the meta data servers and performs a failover whenever it detects that a master server has crashed (e.g. due to a power outage). The state of the participating servers is constantly monitored via a lightweight protocol that performs a ‘heartbeat’-like check on the other nodes. Running a LizardFS installation as an HA cluster significantly increases its availability. Since uRaft uses a quorum, an HA installation should have at least 3 meta data servers, so that a proper election with a majority of votes can be held. For details on the underlying algorithm, check raft in the glossary.
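The reasoning behind the minimum of 3 servers can be checked with a short calculation. The sketch below uses the usual majority rule of Raft-style quorums (more than half of the nodes must agree); the function names are illustrative and are not part of uRaft.

```python
def majority(nodes):
    """Smallest number of votes that is more than half of the cluster."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can fail while a majority can still be formed."""
    return nodes - majority(nodes)


for n in (1, 2, 3, 5):
    print(n, "nodes: majority", majority(n), "tolerates", tolerated_failures(n), "failure(s)")
```

With 2 nodes the majority is still 2, so losing either node stops elections. 3 is the smallest cluster size that survives the loss of one node.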
In order to deploy LizardFS as a high-availability cluster, follow the steps below.
These steps should be performed on all machines chosen to be in a cluster.
Install the lizardfs-uraft package:
$ apt-get install lizardfs-uraft   # for Debian/Ubuntu
$ yum install lizardfs-uraft       # for CentOS/RedHat
Prepare your installation:
Fill in the lizardfs-master config file (/etc/mfs/mfsmaster.cfg) according to Configuring your Master. Details depend on your personal configuration; the only fields essential for uraft are:
PERSONALITY = ha-cluster-managed
ADMIN_PASSWORD = your-lizardfs-password
MASTER_HOST = the floating ip so that the participating hosts know where to sync the meta database from
For a fresh installation, execute the standard steps for the lizardfs-master (creating mfsexports file, empty meta data file etc.). Do not start the lizardfs-master daemon yet.
Fill the lizardfs-uraft config file (/etc/mfs/lizardfs-uraft.cfg). Configurable fields are:
- URAFT_NODE_ADDRESS - identifiers of all the machines in your cluster
- URAFT_ID - node address ordinal number; should be unique for each machine
- URAFT_FLOATING_IP - IP at which LizardFS will be accessible for the clients
- URAFT_FLOATING_NETMASK - a matching netmask for the floating IP
- URAFT_FLOATING_IFACE - network interface for the floating IP
- The address of the local master controlled by this uraft node, defaults to localhost.
- The port the local master listens on, defaults to 9421
- Minimum election timeout (ms), defaults to 400
- Maximum election timeout (ms), defaults to 600
- HEARTBEAT_PERIOD - period between heartbeat messages between uraft nodes (ms), defaults to 20.
- How often uRaft checks if local master is alive (ms), defaults to 250.
Example configuration for a cluster with 3 machines:¶
The first node, node1, is at 192.168.0.1, the second node has the hostname node2, and the third has the hostname node3 and operates under a non-default port number, 99427.
All machines are inside a network with a 255.255.255.0 netmask and use their network interface eth1 for the floating IP.
The LizardFS installation will be accessible at 192.168.0.100.
# Configuration for node1:
URAFT_NODE_ADDRESS = 192.168.0.1 # ip of first node
URAFT_NODE_ADDRESS = node2 # hostname of second node
URAFT_NODE_ADDRESS = node3:99427 # hostname and custom port of third node
URAFT_ID = 0 # URAFT_ID for this node
URAFT_FLOATING_IP = 192.168.0.100 # Shared (floating) ip address for this cluster
URAFT_FLOATING_NETMASK = 255.255.255.0 # Netmask for the floating ip
URAFT_FLOATING_IFACE = eth1 # Network interface for the floating ip on this node

# Configuration for node2:
URAFT_NODE_ADDRESS = 192.168.0.1 # ip of first node
URAFT_NODE_ADDRESS = node2 # hostname of second node
URAFT_NODE_ADDRESS = node3:99427 # hostname and custom port of third node
URAFT_ID = 1 # URAFT_ID for this node
URAFT_FLOATING_IP = 192.168.0.100 # Shared (floating) ip address for this cluster
URAFT_FLOATING_NETMASK = 255.255.255.0 # Netmask for the floating ip
URAFT_FLOATING_IFACE = eth1 # Network interface for the floating ip on this node

# Configuration for node3:
URAFT_NODE_ADDRESS = 192.168.0.1 # ip of first node
URAFT_NODE_ADDRESS = node2 # hostname of second node
URAFT_NODE_ADDRESS = node3:99427 # hostname and custom port of third node
URAFT_ID = 2 # URAFT_ID for this node
URAFT_FLOATING_IP = 192.168.0.100 # Shared (floating) ip address for this cluster
URAFT_FLOATING_NETMASK = 255.255.255.0 # Netmask for the floating ip
URAFT_FLOATING_IFACE = eth1 # Network interface for the floating ip on this node
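In the example above, the three configurations differ only in URAFT_ID. The following Python sketch generates the per-node file contents from one shared node list, which can help keep the files consistent. It is a convenience illustration, not a LizardFS tool; the function name `uraft_config` is hypothetical, and the node list is copied from the example as written.

```python
def uraft_config(nodes, node_id, floating_ip, netmask, iface):
    """Render lizardfs-uraft.cfg text for the node at position node_id."""
    if not 0 <= node_id < len(nodes):
        raise ValueError("node_id must be a valid position in the node list")
    # Every node lists all cluster members in the same order
    lines = [f"URAFT_NODE_ADDRESS = {node}" for node in nodes]
    lines += [
        f"URAFT_ID = {node_id}",  # the only line that differs per node
        f"URAFT_FLOATING_IP = {floating_ip}",
        f"URAFT_FLOATING_NETMASK = {netmask}",
        f"URAFT_FLOATING_IFACE = {iface}",
    ]
    return "\n".join(lines) + "\n"


# Node list and addresses taken from the example configuration above
nodes = ["192.168.0.1", "node2", "node3:99427"]
configs = [
    uraft_config(nodes, i, "192.168.0.100", "255.255.255.0", "eth1")
    for i in range(len(nodes))
]
```

Each entry of `configs` could then be written to /etc/mfs/lizardfs-uraft.cfg on the corresponding machine.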
Enable arp broadcasting in your system (for the floating IP to work):
$ echo 1 > /proc/sys/net/ipv4/conf/all/arp_accept
Start the lizardfs-uraft service:
Change “false” to “true” in /etc/default/lizardfs-uraft, then run:
$ service lizardfs-uraft start
You can check your uraft status via telnet on URAFT_STATUS_PORT (default: 9428):
$ telnet NODE-ADDRESS 9428
When running telnet locally on a node, it is sufficient to use:
$ telnet localhost 9428
Please check if you have the sudo package installed and that the ‘mfs’ user has been added with the right permissions to the /etc/sudoers file.
For more advanced concepts, like 2 active uRaft servers and a quorum node, please check the URAFT Cookbook.