NFS cluster status and HighlyAvailableNFS
While working on an NFS cluster setup, I came across these two articles, which may be helpful to someone:
http://billharlan.com/pub/papers/NFS_for_clusters.html
Saturated network?
First exercise your disk with your own code or with a simple write operation like:
$ time dd if=/dev/zero of=testfile bs=4k count=8182
8182+0 records in
8182+0 records out
real 0m8.829s
user 0m0.000s
sys 0m0.160s
Writing files should be enough to test network saturation. When profiling reads instead of writes, call umount and mount to flush caches, or the read will seem instantaneous:
$ cd /
$ umount /mnt/test
$ mount /mnt/test
$ cd /mnt/test
$ dd if=testfile of=/dev/null bs=4k count=8192
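To judge whether a dd run like the ones above is close to saturating the link, it can help to convert the elapsed time into a transfer rate and compare it with the nominal capacity of your network. The following is only a minimal sketch; the function name `throughput_mbit` is made up for this illustration and is not part of the original article.

```sh
# Convert a transfer of BYTES completed in SECONDS into megabits per second.
# Hypothetical helper, not part of any NFS tool.
throughput_mbit() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN {
        if (secs <= 0) exit 1          # avoid dividing by zero
        printf "%.2f\n", bytes * 8 / secs / 1000000
    }'
}

# Usage with the write test above: 8182 blocks of 4096 bytes in 8.829 s
#   throughput_mbit $((8182 * 4096)) 8.829
```

For the sample timing quoted above this prints 30.37, meaning roughly 30 Mbit/s.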
Check for failures on a client machine with:
$ nfsstat -c
or
$ nfsstat -o rpc
If more than 3% of calls are retransmitted, then there are problems with the network or NFS server. Look for NFS failures on a shared disk server with:
$ nfsstat -s
or
$ nfsstat -o rpc
It is not unreasonable to expect 0 badcalls. At most, “badcalls” should be a tiny fraction of the total number of “calls.”
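The 3% rule for retransmissions can be checked with a little arithmetic on the "calls" and "retrans" figures that nfsstat -c reports. This is a rough sketch under that assumption; the helper names `retrans_pct` and `retrans_too_high` are invented here, and the numbers in the usage comment are made up.

```sh
# Print the percentage of RPC calls that were retransmitted.
# Arguments: total calls, retransmissions (read off nfsstat -c by hand).
retrans_pct() {
    awk -v calls="$1" -v retrans="$2" 'BEGIN {
        if (calls <= 0) { print "0.00"; exit }
        printf "%.2f\n", 100 * retrans / calls
    }'
}

# Exit 0 when the retransmission rate exceeds the limit in percent
# (third argument, default 3 as quoted in the text), non-zero otherwise.
retrans_too_high() {
    pct=$(retrans_pct "$1" "$2")
    awk -v pct="$pct" -v limit="${3:-3}" 'BEGIN { exit !(pct > limit) }'
}

# Usage (made-up figures):
#   retrans_too_high 12000 480 && echo "check the network or NFS server"
```

With those made-up figures, 480 retransmissions out of 12000 calls is 4.00%, which is over the threshold.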
Lost packets
NFS must resend packets that are lost by a busy host. Look for permanently lost packets on the disk server with:
$ head -2 /proc/net/snmp | cut -d' ' -f17
ReasmFails
2
If you see this number increasing during NFS activity, then you are losing packets. You can reduce the number of lost packets on the server by increasing the buffer size for fragmented packets:
$ echo 524288 > /proc/sys/net/ipv4/ipfrag_low_thresh
$ echo 524288 > /proc/sys/net/ipv4/ipfrag_high_thresh
This is about double the default.
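Since what matters is whether ReasmFails grows during NFS activity, one way to watch it is to sample the counter twice and print the difference. The sketch below is an illustration only: the function names are hypothetical, and it looks the column up by name in the "Ip:" header line instead of relying on field 17.

```sh
# Print the ReasmFails counter from an snmp-style file
# (default: /proc/net/snmp). Hypothetical helper.
reasm_fails() {
    awk '
        # First "Ip:" line is the header: find the ReasmFails column.
        $1 == "Ip:" && !col {
            for (i = 2; i <= NF; i++) if ($i == "ReasmFails") col = i
            next
        }
        # Second "Ip:" line holds the values.
        $1 == "Ip:" && col { print $col; exit }
    ' "${1:-/proc/net/snmp}"
}

# Sample the counter twice, SECONDS apart (first argument, default 10),
# and print how much it grew. Optional second argument: file to read.
reasm_fails_delta() {
    before=$(reasm_fails "$2")
    sleep "${1:-10}"
    after=$(reasm_fails "$2")
    echo $((after - before))
}

# Usage: run on the disk server while clients are busy
#   reasm_fails_delta 30
```

A result that stays above zero across repeated runs suggests fragments are still being dropped.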
Server threads
See if your server is receiving too many overlapping requests with:
$ grep th /proc/net/rpc/nfsd
th 8 594 3733.140 83.850 96.660 0.000 73.510 30.560 16.330 2.380 0.000 2.150
The first number is the number of threads available for servicing requests, and the second number is the number of times that all threads have been needed. The remaining 10 numbers are a histogram showing how many seconds a certain fraction of the threads have been busy, starting with less than 10% of the threads and ending with more than 90% of the threads. If the last few numbers have accumulated a significant amount of time, then your server probably needs more threads.
Increase the number of threads used by the server to 16 by setting RPCNFSDCOUNT=16 in /etc/rc.d/init.d/nfs.
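To put a number on "the last few numbers" of the th histogram, the share of recorded time spent in the top buckets can be computed directly. This is a minimal sketch that assumes the line has the layout described above (thread count, a counter, then ten histogram buckets) and that the kernel actually fills in the histogram; the name `th_busy_pct` and the choice of the top three buckets are mine, not the article's.

```sh
# Read an nfsd "th" line on stdin and print the percentage of recorded
# time spent in the last three histogram buckets (more than 70% of the
# threads busy). Hypothetical helper.
th_busy_pct() {
    awk '$1 == "th" {
        total = 0; top = 0
        for (i = 4; i <= NF; i++) total += $i        # all ten buckets
        for (i = NF - 2; i <= NF; i++) top += $i     # last three buckets
        if (total > 0) printf "%.2f\n", 100 * top / total
        else print "0.00"
    }'
}

# Usage:
#   grep th /proc/net/rpc/nfsd | th_busy_pct
```

For the sample line shown above this prints 0.11, so that server spent very little time with most of its threads busy.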
Invisible or stale files
If separate clients are sharing information through NFS disks, then you have special problems. You may delete a file on one client node and cause a different client to get a stale file handle. Different clients may have cached inconsistent versions of the same file. A single client may even create a file or directory and be unable to see it immediately. If these problems sound familiar, then you may want to adjust NFS caching parameters and code multiple attempts in your applications.
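"Coding multiple attempts" can be as simple as wrapping a command in a retry loop. The following is one possible sketch, not the article's own method; the function name `retry` and the path in the usage comment are hypothetical.

```sh
# Run a command up to ATTEMPTS times, sleeping DELAY seconds between
# tries. Returns 0 as soon as the command succeeds, 1 if every try fails.
# Usage: retry ATTEMPTS DELAY command [args...]
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Usage: wait for a file written by another client to become visible
# (hypothetical path)
#   retry 5 1 test -e /mnt/test/results.dat
```

On the caching side, attribute caching is governed by NFS mount options such as actimeo and noac (see nfs(5)); which values are suitable depends on your workload.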
https://help.ubuntu.com/community/HighlyAvailableNFS
Introduction
In this tutorial we will set up a highly available server providing NFS services to clients. Should a server become unavailable, services provided by our cluster will continue to be available to users.
Our highly available system will resemble the following: