Archive for the ‘Cloud’ Category

Post-mortem Windows Azure

A curious software bug that caused a two-day outage of Windows Azure:

Introduction

As a follow-up to my March 1 posting, I want to share the findings of our root cause analysis of the service disruption of February 29th. We know that many of our customers were impacted by this event and we want to be transparent about what happened, what issues we found, how we plan to address these issues, and how we are learning from the incident to prevent a similar occurrence in the future.

Again, we sincerely apologize for the disruption, downtime and inconvenience this incident has caused. We will be proactively issuing a service credit to our impacted customers as explained below. Rest assured that we are already hard at work using our learnings to improve Windows Azure.

Overview of Windows Azure and the Service Disruption

Windows Azure comprises many different services, including Compute, Storage, Networking and higher-level services like Service Bus and SQL Azure. This partial service outage impacted Windows Azure Compute and dependent services: Access Control Service (ACS), Windows Azure Service Bus, SQL Azure Portal, and Data Sync Services. It did not impact Windows Azure Storage or SQL Azure.

While the trigger for this incident was a specific software bug, Windows Azure consists of many components and there were other interactions with normal operations that complicated this disruption. There were two phases to this incident. The first phase was focused on the detection, response and fix of the initial software bug. The second phase was focused on the handful of clusters that were impacted due to unanticipated interactions with our normal servicing operations that were underway. Understanding the technical details of the issue requires some background on the functioning of some of the low-level Windows Azure components.

Fabric Controllers, Agents and Certificates

In Windows Azure, cloud applications consist of virtual machines running on physical servers in Microsoft datacenters. Servers are grouped into “clusters” of about 1000 that are each independently managed by a scaled-out and redundant platform software component called the Fabric Controller (FC), as depicted in Figure 1. Each FC manages the lifecycle of applications running in its cluster, and provisions and monitors the health of the hardware under its control. It executes both autonomic operations, like reincarnating virtual machine instances on healthy servers when it determines that a server has failed, and application-management operations, like deploying, updating and scaling out applications. Dividing the datacenter into clusters isolates faults at the FC level, preventing certain classes of errors from affecting servers beyond the cluster in which they occur.

Figure 1. Clusters and Fabric Controllers

Part of Windows Azure’s Platform as a Service (PaaS) functionality requires its tight integration with applications that run in VMs through the use of a “guest agent” (GA) that it deploys into the OS image used by the VMs, shown in Figure 2. Each server has a “host agent” (HA) that the FC leverages to deploy application secrets, like SSL certificates that an application includes in its package for securing HTTPS endpoints, as well as to “heart beat” with the GA to determine whether the VM is healthy or if the FC should take recovery actions.

Figure 2. Host Agent and Guest Agent Initialization

So that the application secrets, like certificates, are always encrypted when transmitted over the physical or logical networks, the GA creates a “transfer certificate” when it initializes. The first step the GA takes during the setup of its connection with the HA is to pass the HA the public key version of the transfer certificate. The HA can then encrypt secrets and because only the GA has the private key, only the GA in the target VM can decrypt those secrets.

There are several cases that require generation of a new transfer certificate. Most of the time that’s only when a new VM is created, which occurs when a user launches a new deployment, when a deployment scales out, or when a deployment updates its VM operating system. The fourth case is when the FC reincarnates a VM that was running on a server it has deemed unhealthy to a different server, a process the platform calls “service healing.”

The Leap Day Bug

When the GA creates the transfer certificate, it gives it a one-year validity range. It uses midnight UTC of the current day as the valid-from date and one year from that date as the valid-to date. The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail.
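The failure mode is easy to reproduce. The sketch below is Python, not the GA's actual code (which the post does not show), and the function names are hypothetical. It shows the naive "add one to the year" calculation failing on leap day, followed by one possible safe alternative that falls back to February 28. That fallback is an assumption for illustration, not necessarily the fix Microsoft shipped.

```python
from datetime import date


def naive_valid_to(valid_from: date) -> date:
    """Add one to the year and keep month and day, as the post describes."""
    # Raises ValueError when valid_from is February 29, because the
    # following year has no such day.
    return valid_from.replace(year=valid_from.year + 1)


def safe_valid_to(valid_from: date) -> date:
    """One possible fix (an assumption, not Microsoft's actual patch)."""
    try:
        return valid_from.replace(year=valid_from.year + 1)
    except ValueError:
        # February 29 has no counterpart next year; fall back to February 28.
        return valid_from.replace(year=valid_from.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    naive_valid_to(leap_day)
    naive_failed = False
except ValueError:
    naive_failed = True

print(naive_failed)             # True
print(safe_valid_to(leap_day))  # 2013-02-28
```

An alternative design is to add a fixed duration such as 365 days instead of manipulating the year field, which can never produce an invalid calendar date.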

As mentioned, transfer certificate creation is the first step of the GA initialization and is required before it will connect to the HA. When a GA fails to create its certificates, it terminates. The HA has a 25-minute timeout for hearing from the GA. When a GA doesn’t connect within that timeout, the HA reinitializes the VM’s OS and restarts it.

If a clean VM (one in which no customer code has executed) times out its GA connection three times in a row, the HA decides that a hardware problem must be the cause since the GA would otherwise have reported an error. The HA then reports to the FC that the server is faulty and the FC moves it to a state called Human Investigate (HI). As part of its standard autonomic failure recovery operations for a server in the HI state, the FC will service heal any VMs that were assigned to the failed server by reincarnating them to other servers. In a case like this, when the VMs are moved to available servers the leap day bug will reproduce during GA initialization, resulting in a cascade of servers that move to HI.

To prevent a cascading software bug from causing the outage of an entire cluster, the FC has an HI threshold that, when hit, essentially moves the whole cluster to a similar HI state. At that point the FC stops all internally initiated software updates and automatic service healing is disabled. This state, while degraded, gives operators the opportunity to take control and repair the problem before it progresses further.
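To make the cascade and the threshold concrete, here is a toy simulation in Python. It is a deliberately simplified sketch: it assumes one affected VM per server, the function name is hypothetical, and the threshold values are invented for illustration, since the post does not state the real one. Only the 25-minute timeout and the three-timeout rule come from the post.

```python
GA_TIMEOUT_MINUTES = 25   # from the post
TIMEOUTS_BEFORE_HI = 3    # from the post


def simulate_cascade(total_servers, initially_affected, hi_threshold):
    """Return (minutes_elapsed, servers_in_hi, healing_disabled)."""
    healthy = total_servers - initially_affected
    affected = initially_affected  # servers hosting a VM whose GA cannot start
    in_hi = 0
    minutes = 0
    while affected > 0:
        # Three consecutive GA timeouts on a clean VM mark the server faulty.
        minutes += GA_TIMEOUT_MINUTES * TIMEOUTS_BEFORE_HI
        in_hi += affected
        if in_hi >= hi_threshold:
            # Cluster-wide safeguard: service healing stops, so the bug
            # cannot be carried to further servers.
            return minutes, in_hi, True
        # Service healing reincarnates each VM on a healthy server, where
        # the same bug reproduces during GA initialization.
        moved = min(affected, healthy)
        healthy -= moved
        affected = moved
    return minutes, in_hi, False


# With a threshold, the cascade is cut off early.
print(simulate_cascade(1000, 5, 20))      # (300, 20, True)
# With an effectively unlimited threshold, a small cluster is consumed entirely.
print(simulate_cascade(10, 5, 10**9))     # (150, 10, False)
```

The second call shows why the safeguard matters: without it, service healing keeps moving VMs until no healthy server is left.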

The Leap Day Bug in Action

The leap day bug triggered immediately at 4:00PM PST, February 28th (00:00 UTC February 29th) when GAs in new VMs tried to generate certificates. Storage clusters were not affected because they don’t run with a GA, but normal application deployment, scale-out and service healing would have resulted in new VM creation. At the same time many clusters were also in the midst of the rollout of a new version of the FC, HA and GA. That ensured that the bug would be hit immediately in those clusters and the server HI threshold hit precisely 75 minutes (3 times the 25-minute timeout) later at 5:15PM PST. The bug worked its way more slowly through clusters that were not being updated, but the critical alarms on the updating clusters automatically stopped the updates and alerted operations staff to the problem. They in turn notified on-call FC developers, who researched the cause and at 6:38PM PST our developers identified the bug.
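The times in this paragraph can be checked with a few lines of Python. This is only an arithmetic sketch: the variable names are made up, and PST is modelled as a fixed UTC-8 offset, which is valid for a date in February.

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8), "PST")  # fixed offset; no DST in February

GA_TIMEOUT = timedelta(minutes=25)  # from the post
TIMEOUTS_BEFORE_HI = 3              # from the post

# The bug starts at midnight UTC on leap day.
bug_start_utc = datetime(2012, 2, 29, 0, 0, tzinfo=timezone.utc)
bug_start_pst = bug_start_utc.astimezone(PST)

# Three consecutive timeouts later, servers start moving to HI.
hi_threshold_pst = bug_start_pst + TIMEOUTS_BEFORE_HI * GA_TIMEOUT

print(bug_start_pst.strftime("%Y-%m-%d %H:%M %Z"))     # 2012-02-28 16:00 PST
print(hi_threshold_pst.strftime("%Y-%m-%d %H:%M %Z"))  # 2012-02-28 17:15 PST
```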

By this time some applications had single VMs offline and some also had multiple VMs offline, but most applications with multiple VMs maintained availability, albeit with some reduced capacity. To prevent customers from inadvertently causing further impact to their running applications, unsuccessfully scaling out their applications, and fruitlessly trying to deploy new applications, we disabled service management functionality in all clusters worldwide at 6:55PM PST. This is the first time we’ve ever taken this step. Service management allows customers to deploy, update, stop and scale their applications but isn’t necessary for the continued operation of already deployed applications. However, stopping service management prevents customers from modifying or updating their currently deployed applications.

We created a test and rollout plan for the updated GA by approximately 10:00PM PST, had the updated GA code ready at 11:20PM PST, and finished testing it in a test cluster at 1:50AM PST, February 29th. In parallel, we successfully tested the fix in production clusters on the VMs of several of our own applications. We next initiated rollout of the GA to one production cluster and that completed successfully at 2:11AM PST, at which time we pushed the fix to all clusters. As clusters were updated we restored service management functionality for them and at 5:23AM PST we announced service management had been restored to the majority of our clusters.

 
Secondary Outage ……

Read the rest of the story:

via Summary of Windows Azure Service Disruption on Feb 29th, 2012 – Windows Azure – Site Home – MSDN Blogs.

Google App Engine GAE vs Amazon Elastic Computing EC2 vs Microsoft Azure


Almost a year ago, I compared Google App Engine and Microsoft Windows Azure, trying to decide which platform I should write and host my blog (and some other small projects) on. The comparison was about more than hosting – the languages and frameworks used would be influenced by the platform I was hosting on. There were also APIs available only to one platform, or easier to use on one platform compared to the other (such as the App Engine authentication).

Due to the huge differences, I did a little homework on each platform, and ultimately it came down to price. The difference in pricing between Google App Engine and Windows Azure was so enormous that there wasn’t really a decision to make. App Engine hosts this blog for free. Windows Azure would’ve cost around $100/month minimum.

One Year On

Fast forward a year and things have changed a little. App Engine has become more mature, Amazon has introduced Micro instances, and Microsoft has done, well, not a lot. There’s been seemingly zero change in the pricing for Windows Azure, meaning there’s still a significant minimum cost in using it.


Windows Azure vs Amazon EC2 vs Google App Engine – Stack Overflow

Reasons to use GAE:

  • You don’t pay until your app grows quite a bit. With Azure, you pay almost $100 each month, even if you don’t have a single website visitor. If your db goes over 1GB, you pay an extra $90 ($9->$99) for storage.
  • GAE’s payment is also very fine-grained – only for the resources you use. Azure (and AWS) is “blocky” – you pay something for each server instance you’ll run (plus resources), irrespective of whether it gets any use at all.
  • GAE has the lightest admin load. Once you’re set up, deploying and re-deploying is quick and they’ll auto-everything. For example, you don’t worry about how many servers your app is using, how to shard the data, how to load-balance.
  • Mail just works. At the time of writing, Azure doesn’t offer SMTP out so you need a 3rd party server.
  • Great integration with many of the Google offerings – calendars, mail, whatever. You can delegate user management to Google if you don’t want control over your user base.
  • With GAE you know any features they add to the store, you’ll get. With Azure, you get the feeling SQL Azure Database will get most of the love but it’ll be more expensive. Azure Storage is likely to have the most gotchas: no relational integrity, no order-by, and you’ll fiddle with the in-memory context more. GAE’s store has far fewer restrictions and more features than Azure Tables.
  • Good choice if you’re using Python or JVM-based languages already. Many languages compile to Java bytecode nowadays.
  • Updating the app is very fast. For Python, I had a shortcut key set up and it took no time at all. I now use the Eclipse Plugin for Java and it works very well. Azure is more fiddly.
  • A locally tested app will probably run on the cloud without (much or any) changes. With Azure, the config is different and I spent some time stopping-deleting-building-uploading-starting before I got it right.
  • GAE has a great UI that includes a log viewer and a data editor. With Azure, you currently have to find external viewers/editors for this.
  • GAE lets you have multiple versions of your application running on the same datastore. You can deploy, test a version and then set the current ‘live’ version when you’re ready. You can change back if something goes wrong.

Reasons to use Azure:

  • I’ve read of GAE users getting “I think you’re a robot” messages, like those that sometimes pop up on other Google properties. I haven’t seen it, but it would alarm me.
  • Azure seems to be better designed if you have a SOA-type approach. Their architectures seem to benefit from experience in the enterprise world. GAE seems more focused on simply serving web pages.
  • You can run the app under debug, put in breakpoints, etc.
  • Azure has a “staging” environment where you can deploy to the cloud, but not make it live until you’re happy it works.
  • I’m using .Net for other things, and integrating them with .Net on the backend is much easier than with GAE. (Update – using Java on GAE works fine, and the 10-second timeout is now 30 seconds).
  • Azure has two approaches to storage, offering more choice. They are SQL Azure Database (SAD) which is a relational DB, and Azure Storage, which consists of non-relational tables, blobs and queues. If you have an investment in SQL Server then SAD will be easy to move to, but is quite costly and might be less scalable.
  • Integration with many MS “Live” offerings.

So, no obvious answers. I’m defaulting to App Engine at the moment because of costs and ease of use. I might use Azure for very MS-oriented apps. I use Amazon S3 for downloads but likely won’t use EC2 because I prefer leaving everything under the application level to the experts.

Map-Reduce With Ruby Using Hadoop

High Scalability – Map-Reduce With Ruby Using Hadoop.

Map-Reduce With Hadoop Using Ruby

A demonstration, with repeatable steps, of how to quickly fire up a Hadoop cluster on Amazon EC2, load data onto the HDFS (Hadoop Distributed File System), write map-reduce scripts in Ruby and use them to run a map-reduce job on your Hadoop cluster. You will not need to ssh into the cluster, as all tasks are run from your local machine. Below I am using my MacBook Pro as my local machine, but the steps I have provided should be reproducible on other platforms running bash and Java.
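The original post writes its map and reduce scripts in Ruby, and they are not reproduced in this excerpt. As a rough illustration of what such scripts do, here is a minimal word-count sketch in Python that mimics the map, sort and reduce stages locally, without a cluster. The function names and the word-count task are assumptions for illustration, not the post's actual job.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map stage: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Reduce stage: sum the counts for each run of identical keys."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_job(lines):
    # Hadoop sorts mapper output by key before the reduce stage;
    # sorted() stands in for that step here.
    return dict(reducer(sorted(mapper(lines))))


counts = run_job(["Hadoop runs map tasks", "then Hadoop runs reduce tasks"])
print(counts)
```

The reducer relies on its input being sorted by key, which is why the sort sits between the two stages.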

Fire-Up Your Hadoop Cluster

I chose the Cloudera distribution of Hadoop, which is still 100% Apache licensed but has some additional benefits. One of these benefits is that it is released by Doug Cutting, who started Hadoop and drove its development at Yahoo! He also started Lucene, which is another of my favourite Apache Projects, so I have good faith that he knows what he is doing. Another benefit, as you will see, is that it is simple to fire up a Hadoop cluster……