Advanced configuration

Configuring chunk server load awareness

In a default setup the system will try to distribute all chunks evenly according to the predefined goals and/or ec configurations. This is also what happens to chunk servers in a defined label group (check here :ref:` labeling_chunk server` for labels). While this is the perfect setting for most cases, other case might require a different setting, for example if your chunk servers are doing other non LizardFS related workloads, you will want to have the distribution be based also on the amount of I/O load each chunk server is handling at a given moment.

This functionality can be achieved by setting the

ENABLE_LOAD_FACTOR

setting in the chunk server’s mfschunkserver.cfg configuration file. By default this is disabled and the setting is a per chunk server setting.

Additionally it is possible to define a penalty configuration option in the master configuration file (The setting for this is LOAD_FACTOR_PENALTY), if tit is found to be under heavy I/O load.

Configuring rack awareness (network topology)

The topology of a LizardFS network can be defined in the mfstopology.cfg file. This configuration file consists of lines matching the following syntax:

ADDRESS SWITCH-NUMBER

ADDRESS can be represented as:

* all addresses
n.n.n.n single IP address
n.n.n.n/b IP class specified by network address and bits number
n.n.n.n/m.m.m.m IP class specified by network address and mask
f.f.f.f-t.t.t.t IP range specified by from-to addresses (inclusive)

The switch number can be specified as a positive 32-bit integer.

Distances calculated from mfstopology.cfg are used to sort chunk servers during read/write operations. Chunk servers closer (with lower distance) to a client will be favored over further away ones.

Please note that new chunks are still created at random to ensure their equal distribution. Re balancing procedures ignore topology configuration as well.

As for now, distance between switches can be set to 0, 1, 2:

0 - IP addresses are the same

1 - IP addresses differ, but switch numbers are the same

2 - switch numbers differ

The topology feature works well with chunk server labeling - a combination of the two can be used to make sure that clients read to/write from chunk servers best suited for them (e.g. from the same network switch).

Configuring QoS

Quality of service can be configured in the /etc/mfs/globaliolimits.cfg file.

Configuration options consist of:

  • subsystem <subsystem> cgroups subsystem by which clients are classified
  • limit <group> <throughput in KiB/s>
  • limit for clients in cgroup <group>
  • limit unclassified <throughput in KiB/s>
  • limit for clients that do not match to any specified group.

If globaliolimits.cfg is not empty and this option is not set, not specifying limit unclassified will prevent unclassified clients from performing I/O on LizardFS

Example 1:

# All client share 1MiB/s bandwidth
limit unclassified 1024

Example 2:

# All clients in blkio/a group are limited to 1MiB/s, other clients are blocked
subsystem blkio
limit /a 1024

Example 3:

# The directory /a in the blkio group is allowed to transfer 1MiB/s
# /b/a group gets 2MiB/s
# unclassified  clients share 256KiB/s of bandwidth.
     subsystem blkio
     limit unclassified 256
     limit /a   1024
     limit /b/a 2048

Configuring Quotas

Quota mechanism can be used to limit inodes usage and space usage for users and groups. By default quotas can be set only by a superuser. Setting the SESFLAG_ALLCANCHANGEQUOTA flag in the mfsexports.cfg file would allow everybody to change quota.

In order to set quota for a certain user/group you can simply use mfssetquota tool:

mfssetquota  (-u UID/-g GID)   SL_SIZE   HL_SIZE   SL_INODES   HL_INODES   MOUNTPOINT_PATH

where:

  • SL - soft limit
  • HL - hard limit

Mounting the meta data

LizardFS meta data can be managed through a special mount point called META. This mount point allows to control trashed items (undelete/delete them permanently) and view files that are already deleted but still held open by clients.

To be able to mount meta data you need to add the “mfsmeta” parameter to the mfsmount command:

# mfsmount /mnt/lizardfs-meta -o mfsmeta

after that you will see the following line at mtab:

mfsmeta#10.32.20.41:9321 on /mnt/lizardfs-meta type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)

The structure of the mounted meta data directory will look like this:

/mnt/lizardfs-meta/
                   ├── reserved
                   └── trash
                   └── undel

Trash directory

Each file with a ‘trashtime’ setting above zero will be present here. You can recover those files or delete them permanently.

Recovering files from the trash

In order to recover a file, just must move it to the undel/ directory. Files are represented by their inode number and path, so the file dir1/dir2/file.txt with inode 5 will be present at:

trash/5|dir1|dir2|file.txt,

and recovering it would be performed like this:

$ cd trash
$ mv ‘5|dir1|dir2|file.txt’ undel/

Removing files permanently

In order to delete a file permanently, just remove it from trash.

Reserved directory

If you delete a file, but someone else uses this file and keeps an open descriptor, you will see this file in here until the descriptor is closed.

Deploying LizardFS as a HA Cluster

LizardFS can be run as a high-availability cluster on several nodes. When working in HA mode, a dedicated daemon watches the status of the meta data servers and performs a fail over whenever it detects a master server has crashed (e.g. due to a power outage). The state of the available participating servers is being constantly monitored via a lightweight protocol doing a ‘heartbeat’ like check on the other nodes. Running a LizardFS installation as a HA-cluster significantly increases its availability. Since uRaft uses quorum a reasonable minimum of meta data servers in a HA installation is at least 3, to make sure that a proper election with a ‘majority’ of voices can be done. For details on the underlying algorithm, check raft in the glossary.

In order to deploy LizardFS as a high-availability cluster, follow the steps below.

These steps should be performed on all machines chosen to be in a cluster.

Install the lizardfs-uraft package:

$ apt-get install lizardfs-uraft for Debian/Ubuntu
$ yum install lizardfs-uraft for CentOS/RedHat

Prepare your installation:

Fill lizardfs-master config file (/etc/mfs/mfsmaster.cfg) according to Configuring your Master. Details depend on your personal configuration, the only fields essential for uraft are:

PERSONALITY = ha-cluster-managed
ADMIN_PASSWORD = your-lizardfs-password
MASTER_HOST = the floating ip so that the participating hosts know where to sync the meta database from

For a fresh installation, execute the standard steps for the lizardfs-master (creating mfsexports file, empty meta data file etc.). Do not start the lizardfs-master daemon yet.

Fill the lizardfs-uraft config file (/etc/mfs/lizardfs-uraft.cfg). Configurable fields are:

URAFT_NODE_ADDRESS
identifiers of all the machines in your cluster
URAFT_ID
node address ordinal number; should be unique for each machine
URAFT_FLOATING_IP
IP at which LizardFS will be accessible for the clients
URAFT_FLOATING_NETMASK
a matching netmask for floating IP
URAFT_FLOATING_IFACE
network interface for the floating IP
URAFT_ELECTOR_MODE
...
LOCAL_MASTER_ADDRESS
The address of the local master controlled by this uraft node, defaults to localhost.
LOCAL_MASTER_MATOCL_PORT
The port the local master listens on, defaults to 9421
ELECTION_TIMEOUT_MIN
Minimum election timeout (ms), defaults to 400
ELECTION_TIMEOUT_MAX
Maximum election timeout (ms), defaults to 600
HEARTBEAT_PERIOD = 20
Period between heartbeat messages between uraft nodes (ms), defaults to 20.
LOCAL_MASTER_CHECK_PERIOD
How often uRaft checks if local master is alive (ms), defaults to 250.

Example configuration for a cluster with 3 machines:

The first, node1, is at 192.168.0.1, the second node gets hostname node2, and the third one gets hostname node3 and operates under a non-default port number - 99427.

All machines are inside a network with a 255.255.255.0 netmask and use their network interface eth1 for the floating ip.

The LizardFS installation will be accessible at 192.168.0.100

 # Configuration for node1:
 URAFT_NODE_ADDRESS = 192.168.0.1            # ip of first node
 URAFT_NODE_ADDRESS = node2                  # hostname of second node
 URAFT_NODE_ADDRESS = node3:99427            # hostname and custom port of third node
 URAFT_ID = 0                                # URAFT_ID for this node
 URAFT_FLOATING_IP = 192.168.0.100           # Shared (floating) ip address for this cluster
 URAFT_FLOATING_NETMASK = 255.255.255.0      # Netmask for the floating ip
 URAFT_FLOATING_IFACE = eth1                 # Network interface for the floating ip on this node

# Configuration for node2:
 URAFT_NODE_ADDRESS = 192.168.0.1            # ip of first node
 URAFT_NODE_ADDRESS = node2                  # hostname of second node
 URAFT_NODE_ADDRESS = node3:99427            # hostname and custom port of third node
 URAFT_ID = 1                                # URAFT_ID for this node
 URAFT_FLOATING_IP = 192.168.0.100           # Shared (floating) ip address for this cluster
 URAFT_FLOATING_NETMASK = 255.255.255.0      # Netmask for the floating ip
 URAFT_FLOATING_IFACE = eth1                 # Network interface for the floating ip on this node

 # Configuration for node3:
 URAFT_NODE_ADDRESS = 192.168.0.1            # ip of first node
 URAFT_NODE_ADDRESS = node2                  # hostname of second node
 URAFT_NODE_ADDRESS = node3:99427            # hostname and custom port of third node
 URAFT_ID = 2                                # URAFT_ID for this node
 URAFT_FLOATING_IP = 192.168.0.100           # Shared (floating) ip address for this cluster
 URAFT_FLOATING_NETMASK = 255.255.255.0      # Netmask for the floating ip
 URAFT_FLOATING_IFACE = eth1                 # Network interface for the floating ip on this node

Enable arp broadcasting in your system (for the floating IP to work):

$ echo 1 > /proc/sys/net/ipv4/conf/all/arp_accept

Start the lizardfs-uraft service:

Change “false” to “true” in /etc/default/lizardfs-uraft:

$ service lizardfs-uraft start

You can check your uraft status via telnet on URAFT_STATUS_PORT (default: 9428):

$ telnet NODE-ADDRESS 9428

When running telnet locally on a node, it is sufficient to use:

$ telnet localhost 9428

Please check if you have the sudo package installed and that the ‘mfs’ user has been added with the right permissions to the /etc/sudoers file.