Archive for the ‘Performance’ Category

5 subtle ways you’re using MySQL as a queue, and why it’ll bite you

Note: This guest post is from our friends at Percona about using MySQL as a job queue.

I work for Percona, a MySQL consulting company. To augment my memory, I keep a quick-reference text file with notes on interesting issues that customers ask us to solve. One of the categories of frequent problems is attempts to build a job queue in MySQL. I have so many URLs under this bullet point that I’ve stopped keeping track. Customers have endless problems with job queues in their databases. By “job queue” I simply mean some list of things they’ve inserted, which usually need to be processed and marked as completed. I’ve seen scores, maybe hundreds, of cases like this.

Many people realize the difficulties in building a good job queue or batch processing system, and try not to create one inside MySQL. Although the job queue is a great design pattern from the developer’s point of view, they know it’s often hard to implement well in a relational database. However, experience shows me that job queues sneak up in unexpected ways, even if you’re a seasoned developer.

Here are some of the most common ways I’ve seen the job-queue design pattern creep into an application’s database. Are you using MySQL for any of the following?

  1. Storing a list of messages to send: whether it’s emails, SMS messages, or friend requests, if you’re storing a list of messages in a table and then looking through the list for messages that need to be sent, you’ve created a job queue.
  2. Moderation, token claims, or approval: do you have a list of pending articles, comments, posts, email validations, or users? If so, you have a job queue.
  3. Order processing: If your order-processing system looks for newly submitted orders, processes them, and updates their status, it’s a job queue.
  4. Updating a remote service: does your ad-management software compute bid changes for ads, and then store them for some other process to communicate with the advertising service? That’s a job queue.
  5. Incremental refresh or synchronization: if you store a list of items that has changed and needs some background processing, such as files to sync for your new file-sharing service, well, by now you know what that is.

As you can see, queues are sly; they slip into your design without you realizing it. Frankly, many of them never really become a problem. But the potential is always there, and I’ve observed that it’s hard to predict which ones will. This is usually because it depends on behavior that you don’t know in advance, such as which parts of your application will get the most load, or what your users will promote to their friends.

Let’s dive a little bit into why job queues can cause trouble, and then I’ll show you some ways to help reduce the chance it’ll happen to you. The problem is usually very simple: performance. As time passes, the job queue table starts to either perform poorly, or cause other things to perform badly through collateral damage. There are three primary reasons for this:

  1. Polling. Many of the job queue systems I see have one or more worker processes checking for something to do. This starts to become a problem pretty quickly in a heavily loaded application, for reasons I’m about to explain.
  2. Locking. The specific implementation of the polling often looks like this: run a SELECT FOR UPDATE to see if there are items to process; if so, UPDATE them in some way to mark them as in-process; then process them and mark them as complete. There are variations on this, not necessarily involving SELECT FOR UPDATE, but often something with similar effects. The problem with SELECT FOR UPDATE is that it usually creates a single synchronization point for all of the worker processes, and you see a lot of processes waiting for the locks to be released with COMMIT. Bad implementations of this (not committing until the workers have processed the items, for example) are really horrible, but even “good” implementations can cause serious pile-ups.
  3. Data growth. I can’t tell you how many times I’ve seen email list management applications that have a single huge emails table. New emails go into the table with a “new” status, and then they get updated to mark them as sent. As time passes, these email tables can grow into millions or even billions of rows. Even though there might only be hundreds to thousands of new messages to send, that big bloated table makes all the queries really, really slow. If you combine this with polling and/or locking and lots of load on the server, you have a recipe for epic disaster.

The solutions to these problems are actually rather simple: 1) avoid polling; 2) avoid locking; and 3) avoid storing your queue in the same table as other data. Implementing these solutions can take a bit of creativity, however.

First, let’s look at how to avoid polling. I wish that MySQL had listen/notify functionality, the way that PostgreSQL and Microsoft SQL Server do (just to mention two). Alas, MySQL doesn’t, but you can simulate it. Here are three ideas: use GET_LOCK() and RELEASE_LOCK(), or write a plugin to communicate through Spread, or make the consumers run a SLEEP(100000) query, and then kill these queries to “signal” to the worker that there’s something to do. These can work quite well, although it’d be nice to have a more straightforward solution.

Locking is actually quite easy to avoid. Instead of SELECT FOR UPDATE followed by UPDATE, just UPDATE with a LIMIT, and then see if any rows were affected. The client protocol tells you that; there’s no need for another query to the database to check. Make sure autocommit is enabled for this UPDATE, so that you don’t hold the resultant locks open for longer than the duration of the statement. If you don’t have autocommit enabled, the application must follow up with a COMMIT to release any locks, and that is really no different from SELECT FOR UPDATE. (The rest of the work can be done with autocommit disabled; you need to enable it for only this statement.) While I’m wishing for things, I wish that SELECT FOR UPDATE had never been invented. I haven’t seen a case yet where it can’t be done a better way, nor have I seen a case where it has failed to cause problems.
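As a rough sketch of that single-statement claim, here is one way it could look. Everything named here is an assumption for illustration: the `jobs` table, its `status` and `owner` columns, and the `claim_jobs` helper are hypothetical, and Python's built-in sqlite3 module stands in for MySQL so the example runs anywhere. SQLite does not support UPDATE with LIMIT by default, so a subquery plays that role; the MySQL form is shown in a comment.

```python
import sqlite3

def claim_jobs(conn, worker_id, batch_size=10):
    """Claim up to batch_size new jobs with one UPDATE; return how many were claimed."""
    # MySQL form, under the assumed schema:
    #   UPDATE jobs SET status = 'claimed', owner = %s
    #   WHERE status = 'new' ORDER BY id LIMIT %s
    # SQLite stand-in: the subquery replaces UPDATE ... LIMIT.
    cur = conn.execute(
        "UPDATE jobs SET status = 'claimed', owner = ? "
        "WHERE id IN (SELECT id FROM jobs WHERE status = 'new' ORDER BY id LIMIT ?)",
        (worker_id, batch_size),
    )
    # The driver reports the affected-row count; no follow-up query is needed.
    return cur.rowcount

# isolation_level=None puts the sqlite3 module in autocommit mode,
# so each statement commits on its own and holds no locks afterwards.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL, owner TEXT)")
conn.executemany("INSERT INTO jobs (status) VALUES (?)", [("new",)] * 5)

first = claim_jobs(conn, "worker-a", batch_size=3)
second = claim_jobs(conn, "worker-b", batch_size=3)
third = claim_jobs(conn, "worker-c", batch_size=3)
print(first, second, third)
```

A worker that gets back a non-zero count can then read its own rows with a plain SELECT filtered on `owner`, without taking any locks that other workers wait on.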

It’s also really easy to avoid the one-big-table syndrome. Just create a separate table for new emails, and when you’re done processing them, INSERT them into long-term storage and then DELETE them from the queue table. The table of new emails will typically stay very small and operations on it will be fast. And if you do the INSERT before the DELETE, and use INSERT IGNORE or REPLACE, you don’t even need to worry about using a transaction across the two tables, in case your app crashes between. That further reduces locking and other overhead. If you fail to execute the DELETE, you can just have a regular cleanup task retry and purge the orphaned row. (Hmm, sounds like another job queue, no?) You can do much the same thing for any type of queue. For example, articles or comments that are pending approval can go into a separate table. This is really required on a large scale, although you shouldn’t worry that your WordPress blog doesn’t do things this way (unless you’ve been hired to rewrite CNN.com using WordPress as a backend).
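The INSERT-then-DELETE ordering can be sketched as follows. The table names `emails_queue` and `emails_archive`, their columns, and both helper functions are hypothetical, and sqlite3 again stands in for MySQL (SQLite spells INSERT IGNORE as INSERT OR IGNORE). The point of the sketch is only that repeating the copy after a crash is harmless.

```python
import sqlite3

def copy_to_archive(conn, email_id):
    # MySQL form: INSERT IGNORE INTO emails_archive (id, body) SELECT ...
    # A second attempt hits the primary key and is silently skipped.
    conn.execute(
        "INSERT OR IGNORE INTO emails_archive (id, body) "
        "SELECT id, body FROM emails_queue WHERE id = ?",
        (email_id,),
    )

def delete_from_queue(conn, email_id):
    conn.execute("DELETE FROM emails_queue WHERE id = ?", (email_id,))

conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE emails_queue (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE TABLE emails_archive (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO emails_queue (id, body) VALUES (1, 'hello')")

copy_to_archive(conn, 1)    # pretend the app crashes here, before the DELETE
copy_to_archive(conn, 1)    # the cleanup task retries; the duplicate is ignored
delete_from_queue(conn, 1)  # and then purges the orphaned queue row

archived = conn.execute("SELECT COUNT(*) FROM emails_archive").fetchone()[0]
queued = conn.execute("SELECT COUNT(*) FROM emails_queue").fetchone()[0]
print(archived, queued)
```

Because each step is safe to repeat, neither statement needs to share a transaction with the other.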

Finally, and I’ve saved perhaps the most obvious solution for last, don’t use the database at all! Use a real queueing system, such as Resque, ActiveMQ, RabbitMQ, or Gearman. Be careful, however, that you don’t enable persistence to a database and choose to use MySQL for that. Depending on the queue system, that can just reintroduce the problem in a generic way that’s even less optimal. Some queue systems use all of the database worst practices I enumerated above.

I hope this article has given you some insight into the variety of ways that job queues inside of MySQL can sneak up on you and bite you in the tendermost parts. And I hope you can learn to recognize and avoid this design pattern yourself, or at least implement it in a way that won’t hurt you. It really is such a common problem that it’s become one of the classic questions I see. Now, I’m off to check my list of pending consulting requests and see what I should work on next.

via 5 subtle ways you’re using MySQL as a queue, and why it’ll bite you.

Javascript performance: callback (async) vs Q ..

Promises/A+ Performance Hits You Should Be Aware Of

The Promises/A+ specification is a fresh and very interesting way of dealing with the asynchronous nature of JavaScript. It also provides a sensible way to deal with error handling and exceptions. In this article we will go through the performance hits you should be aware of and, as a side effect, compare the two most popular Promises/A+ implementations, When and Q, with each other and with Async, the lowest abstraction you can get on asynchronicity.

 

Figure: basic Node.js single-thread architecture.

 

The Case

My motivation for looking deeper into the performance of Promises/A+ was a job queuing system I’ve been working on named Kickq. The system is expected to get hammered when used in production, so stress testing was warranted. After stubbing all the database interactions, essentially making the operation of job creation synchronous, I was getting odd performance results.

The test was simple: create 500 jobs in a loop and measure how long it takes for all the jobs to finish.

The measurements were in the ~550ms range and my eyeballs started to roll. “That’s a synchronous operation, it should finish in less than 3ms, WHAT THE????!?!” After taking a few moments to let it sink in, the suspect was found: it was Promises. I used them as the only pattern to handle asynchronous ops and callbacks throughout the whole project. Brian Cavalier, one of the authors of When.js, helped me pinpoint the real culprit, it was the tick:

Promises/A+ Specification, Note 4.1: “In practical terms, an implementation must use a mechanism such as setTimeout, setImmediate, or process.nextTick to ensure that onFulfilled and onRejected are not invoked in the same turn of the event loop as the call to then to which they are passed.”

In other words, Promises, per the specification, must be resolved asynchronously! That comes with a cost, and apparently a heavy one.

In the process of studying performance I had to create a performance library (poor man’s profiling) and a benchmark test for Promises/A+ implementations, which is already being used to optimize future versions of When.

 

Creating The Promises/A+ Benchmark

I tried to broaden the definition of the test case. If an application uses the Promises pattern as the only way to manage how the internal parts interact, we can make a few assumptions:

  • There will be a series of promises chained together, representing the various operations that will be performed by your application.
  • The Deferred Object is used on each link of the chain to control resolution and how the promise object is exposed.
  • Throughout the whole chain of promises there can be operations that are actually synchronous; we will measure all cases.

Figure: Promises, Total Time to Resolve, 500 Loops

Figure: Promises, Memory Consumption

Difference to First Resolved Promise, 500 Loops

Perf Type               | Async   | When 2.1.0 | Q 0.9.5  | Promise 3.0.1
Sync Diff               | 0.01ms  | 36.62ms    | 186.43ms | 63.96ms
Mixed Diff              | 5.37ms  | 41.78ms    | 226.34ms | 83.83ms
Async Diff              | 22.42ms | 58.18ms    | 241.80ms | 93.68ms
Sync Diff vs AsyncLib   | 1x      | 3,662x     | 18,643x  | 6,396x
Mixed Diff vs AsyncLib  | 1x      | 7.78x      | 42.15x   | 15.61x
Async Diff vs AsyncLib  | 1x      | 2.60x      | 10.79x   | 4.18x

Libraries When.js v1.8.1 and Deferred are not included in this table because they resolve promises synchronously. This difference makes the Diff metric inapplicable.

Total Time of execution, 500 Loops

Perf Type                | Async   | When 1.8.1 | When 2.1.0 | Q 0.9.5  | Deferred 0.6.3 | Promise 3.0.1
Sync Total               | 5.15ms  | 12.35ms    | 72.35ms    | 301.47ms | 71.25ms        | 80.50ms
Mixed Total              | 18.94ms | 40.57ms    | 80.21ms    | 325.49ms | 94.58ms        | 95.67ms
Async Total              | 35.70ms | 50.63ms    | 90.52ms    | 337.82ms | 105.87ms       | 107.01ms
Sync Total vs AsyncLib   | 1x      | 2.40x      | 14.05x     | 58.54x   | 13.83x         | 15.63x
Mixed Total vs AsyncLib  | 1x      | 2.14x      | 4.23x      | 17.19x   | 4.99x          | 5.05x
Async Total vs AsyncLib  | 1x      | 1.42x      | 2.54x      | 9.46x    | 2.97x          | 3.00x
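
The “vs AsyncLib” rows appear to be each library's time divided by the Async baseline in the same row. A small sketch that recomputes the Sync Total ratios from the figures in the table (the function name and the dictionary layout are illustrative only):

```python
def ratio_vs_baseline(baseline_ms, measured_ms):
    """How many times slower a measurement is than the baseline, to 2 decimals."""
    return round(measured_ms / baseline_ms, 2)

# Sync Total row from the table above, in milliseconds.
sync_total_ms = {
    "Async": 5.15,
    "When 1.8.1": 12.35,
    "When 2.1.0": 72.35,
    "Q 0.9.5": 301.47,
    "Deferred 0.6.3": 71.25,
    "Promise 3.0.1": 80.50,
}

ratios = {
    name: ratio_vs_baseline(sync_total_ms["Async"], ms)
    for name, ms in sync_total_ms.items()
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```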

Average Memory Difference – Single 500 Loop Runs

Perf Type | Async   | When 1.8.1 | When 2.0.1 | When 2.1.x | Q        | Q longStack=0 | Deferred
Sync      | 113.29% | 160.98%    | 840.21%    | 866.88%    | 1106.67% | 684.56%       | 354.07%
Async     | 159.29% | 458.44%    | 811.32%    | 834.63%    | 1110.21% | 691.41%       | 429.18%

 

via http://thanpol.as/javascript/promises-a-performance-hits-you-should-be-aware-of/

NFS cluster status and HighlyAvailableNFS

While working on an NFS cluster setup, I stumbled upon these two articles, which may be helpful to someone:

http://billharlan.com/pub/papers/NFS_for_clusters.html

Saturated network?

$ time dd if=/dev/zero of=testfile bs=4k count=8182
  8182+0 records in
  8182+0 records out
  real    0m8.829s
  user    0m0.000s
  sys     0m0.160s

 

First exercise your disk with your own code or with a simple write operation like the dd command above. Writing files should be enough to test network saturation. When profiling reads instead of writes, call umount and mount to flush caches, or the read will seem instantaneous:

$ cd /
$ umount /mnt/test
$ mount /mnt/test
$ cd /mnt/test
$ dd if=testfile of=/dev/null bs=4k count=8192
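
To turn the timing from the write test into a throughput figure you can compare against your link speed, divide the bytes written by the elapsed time. This sketch uses the numbers from the dd output shown earlier (8182 blocks of 4k in 8.829 seconds); the function name is illustrative.

```python
def throughput_mb_per_s(block_count, block_size_bytes, elapsed_s):
    """Bytes transferred per second, expressed in megabytes (10**6 bytes)."""
    return block_count * block_size_bytes / elapsed_s / 1e6

# Values taken from the dd write test: bs=4k, count=8182, real 0m8.829s.
write_mb_s = throughput_mb_per_s(8182, 4 * 1024, 8.829)
write_mbit_s = write_mb_s * 8  # megabits per second, for comparing with NIC speed

print(f"{write_mb_s:.2f} MB/s, {write_mbit_s:.1f} Mbit/s")
```

That works out to about 3.8 MB/s, or roughly 30 Mbit/s.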

Check for failures on a client machine with:

  $ nfsstat -c
or
  $ nfsstat -o rpc

If more than 3% of calls are retransmitted, then there are problems with the network or NFS server. Look for NFS failures on a shared disk server with:

  $ nfsstat -s
or
  $ nfsstat -o rpc

It is not unreasonable to expect 0 badcalls. You should have very few “badcalls” out of the total number of “calls.”
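
The 3% rule of thumb for retransmissions can be checked with a few lines of code. The sample text below is an assumption: it only imitates a header line followed by a value line, in the style of the client RPC section of nfsstat output, and the real layout may differ between versions. The function names are illustrative.

```python
def parse_rpc_stats(text):
    """Pair a whitespace-separated header line with the value line below it."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    for header, values in zip(lines, lines[1:]):
        if "calls" in header and "retrans" in header:
            return dict(zip(header, map(int, values)))
    raise ValueError("no calls/retrans header found")

def retransmission_percent(calls, retrans):
    """Percentage of RPC calls that had to be retransmitted."""
    if calls == 0:
        return 0.0
    return 100.0 * retrans / calls

# Made-up sample in an assumed nfsstat-like layout.
sample = """Client rpc stats:
calls      retrans    authrefrsh
20000      800        20000
"""

stats = parse_rpc_stats(sample)
pct = retransmission_percent(stats["calls"], stats["retrans"])
needs_attention = pct > 3.0  # threshold from the text above
print(f"{pct:.1f}% retransmitted, needs attention: {needs_attention}")
```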

Lost packets

NFS must resend packets that are lost by a busy host. Look for permanently lost packets on the disk server with:

$ head -2 /proc/net/snmp | cut -d' ' -f17
  ReasmFails
  2

If you can see this number increasing during NFS activity, then you are losing packets. You can reduce the number of lost packets on the server by increasing the buffer size for fragmented packets:

$ echo 524288 > /proc/sys/net/ipv4/ipfrag_low_thresh
$ echo 524288 > /proc/sys/net/ipv4/ipfrag_high_thresh

This is about double the default.
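
The cut command shown earlier picks ReasmFails by field position, which depends on the column order in /proc/net/snmp. A sketch that looks the counter up by name instead is below. The sample text is made up for illustration, and the function name is hypothetical; on a real server you would pass in the contents of /proc/net/snmp.

```python
def read_ip_counter(snmp_text, name):
    """Return the named counter from the 'Ip:' header/value line pair."""
    ip_lines = [ln.split() for ln in snmp_text.splitlines() if ln.startswith("Ip:")]
    if len(ip_lines) < 2:
        raise ValueError("expected an Ip: header line and an Ip: value line")
    header, values = ip_lines[0], ip_lines[1]
    return int(values[header.index(name)])

# Illustrative sample with invented values, not real measurements.
sample = (
    "Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams "
    "InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes "
    "ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates\n"
    "Ip: 2 64 1000 0 0 0 0 0 990 800 0 0 0 40 18 2 0 0 0\n"
    "Icmp: InMsgs InErrors\n"
    "Icmp: 5 0\n"
)

reasm_fails = read_ip_counter(sample, "ReasmFails")
print(reasm_fails)
```

Sampling this value twice, before and after a burst of NFS traffic, shows whether the counter is growing.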

Server threads

See if your server is receiving too many overlapping requests with:

$ grep th /proc/net/rpc/nfsd
  th 8 594 3733.140 83.850 96.660 0.000 73.510 30.560 16.330 2.380 0.000 2.150

The first number is the number of threads available for servicing requests, and the second number is the number of times that all threads have been needed. The remaining 10 numbers are a histogram showing how many seconds a certain fraction of the threads have been busy, starting with less than 10% of the threads and ending with more than 90% of the threads. If the last few numbers have accumulated a significant amount of time, then your server probably needs more threads.

Increase the number of threads used by the server to 16 by changing RPCNFSDCOUNT=16 in /etc/rc.d/init.d/nfs.
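
The th line can be read programmatically along the lines described above. In this sketch the function names are illustrative, and treating the last three histogram buckets as “the last few numbers” is an assumption, not a rule from the article.

```python
def parse_th_line(line):
    """Split a 'th' line into thread count, all-threads-busy count, and histogram."""
    fields = line.split()
    if not fields or fields[0] != "th":
        raise ValueError("not a th line")
    threads = int(fields[1])
    all_busy_count = int(fields[2])
    histogram = [float(x) for x in fields[3:]]  # seconds per 10% busy bucket
    return threads, all_busy_count, histogram

def busy_share(histogram, top_buckets=3):
    """Fraction of recorded time spent in the busiest buckets (assumed: last 3)."""
    total = sum(histogram)
    return sum(histogram[-top_buckets:]) / total if total else 0.0

# The sample line from the grep output above.
line = "th 8 594 3733.140 83.850 96.660 0.000 73.510 30.560 16.330 2.380 0.000 2.150"
threads, all_busy_count, histogram = parse_th_line(line)
share = busy_share(histogram)
print(threads, all_busy_count, f"{share:.4%}")
```

For the sample line the busiest buckets account for well under 1% of the recorded time, so by this measure that server would not obviously need more threads.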

Invisible or stale files

If separate clients are sharing information through NFS disks, then you have special problems. You may delete a file on one client node and cause a different client to get a stale file handle. Different clients may have cached inconsistent versions of the same file. A single client may even create a file or directory and be unable to see it immediately. If these problems sound familiar, then you may want to adjust NFS caching parameters and code multiple attempts in your applications.

 

https://help.ubuntu.com/community/HighlyAvailableNFS

Introduction

 

In this tutorial we will set up a highly available server providing NFS services to clients. Should a server become unavailable, services provided by our cluster will continue to be available to users.

Our highly available system will resemble the following (figure: drbd.jpg).