Archive for the ‘Performance’ Category

NFS cluster status and HighlyAvailableNFS

While working on an NFS cluster setup, I stumbled upon these two articles, which may be helpful for someone:

http://billharlan.com/pub/papers/NFS_for_clusters.html

Saturated network?

$ time dd if=/dev/zero of=testfile bs=4k count=8182
  8182+0 records in
  8182+0 records out
  real    0m8.829s
  user    0m0.000s
  sys     0m0.160s
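
That run wrote 8182 blocks of 4 KB, roughly 32 MB, in about 8.8 seconds of wall-clock time, i.e. around 3.6 MB/s. Compare that figure with the raw bandwidth of your network link (100 Mbit/s Ethernet, for example, peaks at roughly 12 MB/s) to judge whether the network, rather than the local disk, is the bottleneck.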

First exercise your disk with your own code or with a simple write operation like the one above; writing files should be enough to test for network saturation. When profiling reads instead of writes, call umount and mount to flush caches, or the read will seem instantaneous:

$ cd /
$ umount /mnt/test
$ mount /mnt/test
$ cd /mnt/test
$ dd if=testfile of=/dev/null bs=4k count=8192
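
On kernels that provide /proc/sys/vm/drop_caches (2.6.16 and later), the page cache can also be dropped without remounting; this is a convenience beyond the original recipe:

$ sync
$ echo 3 > /proc/sys/vm/drop_caches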

Check for failures on a client machine with:

  $ nfsstat -c
or
  $ nfsstat -o rpc

If more than 3% of calls are retransmitted, then there are problems with the network or NFS server. Look for NFS failures on a shared disk server with:

  $ nfsstat -s
or
  $ nfsstat -o rpc

It is not unreasonable to expect 0 badcalls. You should have very few “badcalls” out of the total number of “calls.”
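
To turn the raw counters into the retransmission percentage mentioned above, a small awk one-liner over the client statistics is enough (a sketch only; the column layout of nfsstat output can differ slightly between nfs-utils versions):

$ nfsstat -c | awk '/^calls/ { getline; if ($1 > 0) printf "retransmitted: %.2f%%\n", 100 * $2 / $1 }'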

Lost packets

NFS must resend packets that are lost by a busy host. Look for permanently lost packets on the disk server with:

$ head -2 /proc/net/snmp | cut -d' ' -f17
  ReasmFails
  2

If you see this number increasing during NFS activity, then you are losing packets. You can reduce the number of lost packets on the server by increasing the buffer size for fragmented packets:

$ echo 524288 > /proc/sys/net/ipv4/ipfrag_low_thresh
$ echo 524288 > /proc/sys/net/ipv4/ipfrag_high_thresh

This is about double the default.
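
The echo commands above do not survive a reboot. On distributions that read /etc/sysctl.conf at boot, the same values (adjust to taste) can be made persistent:

$ cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.ipfrag_low_thresh = 524288
net.ipv4.ipfrag_high_thresh = 524288
EOF
$ sysctl -p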

Server threads

See if your server is receiving too many overlapping requests with:

$ grep th /proc/net/rpc/nfsd
  th 8 594 3733.140 83.850 96.660 0.000 73.510 30.560 16.330 2.380 0.000 2.150

The first number is the number of threads available for servicing requests, and the second number is the number of times that all threads have been needed. The remaining 10 numbers are a histogram showing how many seconds a certain fraction of the threads has been busy, starting with less than 10% of the threads and ending with more than 90% of the threads. If the last few numbers have accumulated a significant amount of time, then your server probably needs more threads.
Increase the number of threads used by the server to 16 by changing RPCNFSDCOUNT=16 in /etc/rc.d/init.d/nfs
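
The change can usually also be applied to a running server, since rpc.nfsd simply tells the kernel how many nfsd threads to run; the init-script variable above just makes it permanent:

$ rpc.nfsd 16
$ grep th /proc/net/rpc/nfsd    # the first field should now read 16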

Invisible or stale files

If separate clients are sharing information through NFS disks, then you have special problems. You may delete a file on one client node and cause a different client to get a stale file handle. Different clients may have cached inconsistent versions of the same file. A single client may even create a file or directory and be unable to see it immediately. If these problems sound familiar, then you may want to adjust NFS caching parameters and code multiple attempts in your applications.
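
As an illustration of both suggestions (actimeo and noac are standard NFS client mount options; the export path, mount point, file name and retry count below are only examples):

$ mount -o actimeo=3 server:/export /mnt/test
or, for strict consistency at the cost of extra NFS traffic:
$ mount -o noac server:/export /mnt/test

A crude retry loop for a file that another client has just created:
$ for i in 1 2 3 4 5; do [ -e /mnt/test/newfile ] && break; sleep 1; done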

https://help.ubuntu.com/community/HighlyAvailableNFS

Introduction

In this tutorial we will set up a highly available server providing NFS services to clients. Should a server become unavailable, services provided by our cluster will continue to be available to users.

Our highly available system will resemble the diagram in the original tutorial (drbd.jpg).

Cluster Overview (RedHat suite)

I found this document, which also contains a lot of useful information and clustering basics, even if you do not plan to use the RedHat suite.

Cluster Basics

Failover Domains

A cluster is two or more computers (called nodes or members) that work together to perform a task. There are four major types of clusters:

1. Storage clusters

provide a consistent file system image across servers in a cluster, allowing the servers to simultaneously read and write to a single shared file system. A storage cluster simplifies storage administration by limiting the installation and patching of applications to one file system. Also, with a cluster-wide file system, a storage cluster eliminates the need for redundant copies of application data and simplifies backup and disaster recovery.

2. High availability clusters

provide highly available services by eliminating single points of failure and by failing over services from one cluster node to another in case a node becomes inoperative. Typically, services in a high-availability cluster read and write data (via read-write mounted file systems). Therefore, a high-availability cluster must maintain data integrity as one cluster node takes over control of a service from another cluster node. Node failures in a high-availability cluster are not visible from clients outside the cluster. (High-availability clusters are sometimes referred to as failover clusters.)

3. Load balancing clusters

dispatch network service requests to multiple cluster nodes to balance the request load among the cluster nodes. Load balancing provides cost-effective scalability because you can match the number of nodes according to load requirements. If a node in a load-balancing cluster becomes inoperative, the load-balancing software detects the failure and redirects requests to other cluster nodes. Node failures in a load-balancing cluster are not visible from clients outside the cluster.

4. High performance clusters

use cluster nodes to perform concurrent calculations. A high-performance cluster allows applications to work in parallel, therefore enhancing the performance of the applications. (High performance clusters are also referred to as computational clusters or grid computing.)

via Cluster Suite Overview.

32-bit or 64-bit JVM? How about a Hybrid?

Before x86-64 came along, the decision on whether to use 32-bit or 64-bit mode for architectures that supported both was relatively simple: use 64-bit mode if the application requires the larger address space, 32-bit mode otherwise. After all, no point in reducing the amount of data that fits into the processor cache while increasing memory usage and bandwidth if the application doesn’t need the extra addressing space.

When it comes to x86-64, however, there’s also the fact that the number of named general-purpose registers has doubled from 8 to 16 in 64-bit mode. For CPU intensive apps, this may mean better performance at the cost of extra memory usage. On the other hand, for memory intensive apps 32-bit mode might be better if you manage to fit your application within the address space provided. Wouldn’t it be nice if there was a single JVM that would cover the common cases?

It turns out that the HotSpot engineers have been working on doing just that through a feature called Compressed oops. The benefits:

  • Heaps up to 32GB (instead of the theoretical 4GB in 32-bit that in practice is closer to 3GB)
  • 64-bit mode so we get to use the extra registers
  • Managed pointers (including Java references) are 32-bit so we don’t waste memory or cache space

The main disadvantage is that encoding and decoding is required to translate from/to native addresses. HotSpot tries to avoid these operations as much as possible and they are relatively cheap. The hope is that the extra registers give enough of a boost to offset the extra cost introduced by the encoding/decoding.

Compressed Oops have been included (but disabled by default) in the performance release JDK6u6p (which requires you to fill out a survey), so I decided to try it in an internal application and compare it with 64-bit mode and 32-bit mode.
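
For reference, turning the feature on is just a matter of passing the flag on the command line (the jar name here is only a placeholder for whatever you are running; the heap settings match the ones used in the tests below):

$ java -server -Xms3328m -Xmx3328m -XX:+UseCompressedOops -jar app.jar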

The tested application has two phases, a single threaded one followed by a multi-threaded one. Both phases do a large amount of allocation so memory bandwidth is very important. All tests were done on a dual quad-core Xeon 5400 series with 10GB of RAM. I should note that a different JDK version had to be used for 32-bit mode (JDK6u10rc2) because there is no Linux x86 build of JDK6u6p. I chose the largest heap size that would allow the 32-bit JVM to run the benchmark to completion without crashing.

I started by running the application with a smaller dataset:

JDK6u10rc2 32-bit
Single-threaded phase: 6298ms
Multi-threaded phase (8 threads on 8 cores): 17043ms
Used Heap after full GC: 430MB
JVM Args: -XX:MaxPermSize=256m -Xms3328m -Xmx3328m -server -XX:+UseConcMarkSweepGC

JDK6u6p 64-bit with Compressed Oops
Single-threaded phase: 6345ms
Multi-threaded phase (8 threads on 8 cores): 16348ms
Used Heap after full GC: 500MB
JVM Args: -XX:MaxPermSize=256m -Xms3328m -Xmx3328m -server -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops

The performance numbers are similar and the memory usage of the 64-bit JVM with Compressed Oops is 16% larger.

JDK6u6p 64-bit
Single-threaded phase: 6463ms
Multi-threaded phase (8 threads on 8 cores): 18778ms
Used Heap after full GC: 700MB
JVM Args: -XX:MaxPermSize=256m -Xms3328m -Xmx3328m -server -XX:+UseConcMarkSweepGC

The performance is again similar, but the memory usage of the 64-bit JVM is much higher, over 60% higher than the 32-bit JVM one.

Let’s try the larger dataset now:

JDK6u10rc2 32-bit
Single-threaded phase: 14188ms
Multi-threaded phase (8 threads on 8 cores): 73451ms
Used Heap after full GC: 1.25GB
JVM Args: -XX:MaxPermSize=256m -Xms3328m -Xmx3328m -server -XX:+UseConcMarkSweepGC

JDK6u6p 64-bit with CompressedOops
Single-threaded phase: 13742ms
Multi-threaded phase (8 threads on 8 cores): 76664ms
Used Heap after full GC: 1.45GB
JVM Args: -XX:MaxPermSize=256m -Xms3328m -Xmx3328m -server -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops

The performance difference and memory overhead are the same as with the smaller dataset. The benefit of Compressed Oops here is that we still have plenty of headroom while the 32-bit JVM is getting closer to its limits. This may not be apparent from the heap size after a full GC, but during the multi-threaded phase the peak memory usage is quite a bit larger and the fact that the allocation rate is high does not help. This becomes more obvious when we look at the results for the 64-bit JVM.

JDK6u6p 64-bit
Single-threaded phase: 14610ms
Multi-threaded phase (8 threads on 8 cores): 104992ms
Used Heap after full GC: 2GB
JVM Args: -XX:MaxPermSize=256m -Xms4224m -Xmx4224m -server -XX:+UseConcMarkSweepGC

I had to increase the Xms/Xmx to 4224m for the application to run to completion. Even so, the multi-threaded phase took a substantial performance hit compared to the other two JVM configurations. All in all, the 64-bit JVM without compressed oops does not do well here.

In conclusion, it seems that compressed oops is a feature with a lot of promise and it allows the 64-bit JVM to be competitive even in cases that favour the 32-bit JVM. It might be interesting to test applications with different characteristics to compare the results. It’s also worth mentioning that since this is a new feature, it’s possible that performance will improve further before it’s integrated into the normal JDK releases. As it is though, it already hits a sweet spot and if it weren’t for the potential for instability, I would be ready to ditch my 32-bit JVM.

Update: The early access release of JDK 6 Update 14 also contains this feature.
Update 2: This feature is enabled by default since JDK 6 Update 23.

via 32-bit or 64-bit JVM? How about a Hybrid? « Ismael Juma.