Archive for the ‘Performance’ Category

A Practical Guide to Varnish – Why Varnish Matters

What is Varnish?

Varnish is an open source, high-performance HTTP accelerator that sits in front of a web stack and caches pages. This caching layer is highly configurable and can be used for both static and dynamic content.

One great thing about Varnish is that it can improve the performance of your website without requiring any code changes. If you haven’t heard of Varnish (or have heard of it, but haven’t used it), please read on. Adding Varnish to your stack can be completely noninvasive, but if you tweak your stack to play along with some of Varnish’s more advanced features, you’ll be able to increase performance by orders of magnitude.

Some of the high-profile companies using Varnish include Twitter, Facebook, Heroku and LinkedIn.

Our Use Case

One of Factual’s first high-profile projects was Newsweek’s “America’s Best High Schools: The List”. After realizing that we had only a few weeks to increase our throughput tenfold, we looked into a few options. We decided to go with Varnish because it was noninvasive, extremely fast and battle-tested by other companies. The result was a system that performed 15 times faster and a successful launch that hit the front page of msn.com. Varnish now plays a major role in our stack and we’re looking to implement more performance tweaks designed with Varnish in mind.

A Simple Use Case

The easiest and safest way to add Varnish to your stack is to serve and cache static content. Aside from using a CDN, Varnish is probably the next best thing that you can use for free. However, dynamic content is where you can squeeze real performance out of your stack if you know where and how to use it. This guide will only scratch the surface of how Varnish can drastically improve performance. Advanced features such as edge side includes and header manipulation allow you to leverage Varnish for even higher throughput. Hopefully, we’ll get to more of these advanced features in future blog posts, but for now, we’ll just give you an introduction.
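To make the idea of a caching layer a bit more concrete, here is a minimal Python sketch of a cache that sits in front of a slow backend and reuses responses until they expire. This is not Varnish code or Varnish configuration, and it does not reflect how Varnish is implemented; the names `TinyCache`, `slow_backend` and `ttl_seconds` are made up for illustration only.

```python
import time
from typing import Callable, Dict, Tuple


class TinyCache:
    """Toy time-to-live cache in front of a backend (illustration only)."""

    def __init__(self, backend: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.backend = backend          # function that builds a response for a URL
        self.ttl = ttl_seconds          # how long a cached response stays fresh
        self.clock = clock              # injectable clock, handy for testing
        self.store: Dict[str, Tuple[float, str]] = {}  # url -> (expiry, body)
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> str:
        now = self.clock()
        entry = self.store.get(url)
        if entry is not None and entry[0] > now:
            # Fresh copy available: answer without touching the backend.
            self.hits += 1
            return entry[1]
        # Missing or expired: ask the backend and remember the answer.
        self.misses += 1
        body = self.backend(url)
        self.store[url] = (now + self.ttl, body)
        return body


def slow_backend(url: str) -> str:
    """Hypothetical stand-in for the web stack behind the cache."""
    return f"<html>content for {url}</html>"


if __name__ == "__main__":
    cache = TinyCache(slow_backend, ttl_seconds=60)
    cache.get("/logo.png")   # first request goes to the backend
    cache.get("/logo.png")   # second request is served from the cache
    print(cache.hits, cache.misses)
```

The point of the sketch is that the backend is left untouched: the cache wraps it from the outside, which is the sense in which adding such a layer can be noninvasive.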

Open source web-crawler: Web-Harvest

Web-Harvest Project Home Page.

Web-Harvest is an open source web data extraction tool written in Java. It offers a way to collect desired web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites, which still make up the vast majority of web content. It can also be easily supplemented by custom Java libraries in order to augment its extraction capabilities.

The process of extracting data from web pages is also referred to as web scraping or web data mining. The World Wide Web, as the largest database, often contains data that we would like to consume for our own needs. The problem is that this data is in most cases mixed together with formatting code, which makes the content human-friendly but not machine-friendly. Manual copy-paste is error-prone, tedious and sometimes even impossible. Web software designers usually discuss how to cleanly separate content from style, using various frameworks and design patterns to achieve that. Even so, some kind of merge usually occurs on the server side, so that a single bunch of HTML is delivered to the web client.
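As a small illustration of pulling data back out of formatting code, the following Python sketch collects the text of table cells from an HTML fragment using only the standard library. It is not how Web-Harvest works (Web-Harvest is a Java tool built around XSLT, XQuery and regular expressions); the class `CellExtractor`, the function `extract_cells` and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser
from typing import List


class CellExtractor(HTMLParser):
    """Collects the text inside every <td>, ignoring the markup around it."""

    def __init__(self) -> None:
        super().__init__()
        self.in_cell = False
        self.cells: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # Start a new, empty cell and begin recording text.
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            # Text may arrive in pieces (e.g. inside <b>), so append.
            self.cells[-1] += data


def extract_cells(html: str) -> List[str]:
    """Return the stripped text of each table cell in the given HTML."""
    parser = CellExtractor()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


if __name__ == "__main__":
    page = '<table><tr><td class="name"><b>Alice</b></td><td> 42 </td></tr></table>'
    print(extract_cells(page))
```

The styling tags and attributes are discarded and only the data survives, which is the machine-friendly form that manual copy-paste struggles to produce reliably.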

Piwik – Web analytics – Open source

Piwik is a downloadable, open source (GPL-licensed), real-time web analytics program. It provides you with detailed reports on your website visitors: the search engines and keywords they used, the language they speak, your popular pages… and so much more.

Piwik aims to be an open source alternative to Google Analytics.

Piwik is a PHP/MySQL program that you download and install on your own web server. At the end of the five-minute installation process you will be given a JavaScript tag. Simply copy and paste this tag into the websites you wish to track (or use an existing plugin to do it automatically for you) and access your analytics reports in real time.