Archive for the ‘Performance’ Category

Open source web-crawler: Web-Harvest

Web-Harvest Project Home Page.

Web-Harvest is an open source Web data extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites, which still make up the vast majority of Web content. It can also be easily supplemented with custom Java libraries to augment its extraction capabilities.

The process of extracting data from Web pages is also referred to as Web scraping or Web data mining. The World Wide Web, as the largest database, often contains data that we would like to consume for our own needs. The problem is that this data is in most cases mixed together with formatting code, which makes the content human-friendly but not machine-friendly. Manual copy-paste is error-prone, tedious and sometimes even impossible. Web software designers usually discuss how to cleanly separate content from style, using various frameworks and design patterns to achieve that. Even so, some kind of merge usually occurs on the server side, so that a bundle of HTML is delivered to the web client.
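To make the "data mixed with formatting code" problem concrete, here is a minimal shell sketch of regular-expression extraction. It is not Web-Harvest itself: the HTML fragment, the "price" class and the extract_prices function are all made up for illustration, and real pages usually need a proper HTML/XML parser rather than a regex.

```shell
# Hypothetical HTML fragment: the data (prices) is buried in markup.
html='<ul><li class="item"><b>Widget</b> <span class="price">$9.99</span></li><li class="item"><b>Gadget</b> <span class="price">$24.50</span></li></ul>'

# Print the text of every <span class="price"> element, one per line.
extract_prices() {
  printf '%s\n' "$1" |
    grep -o '<span class="price">[^<]*</span>' |  # isolate the matching elements
    sed -e 's/<[^>]*>//g'                         # strip the surrounding tags
}

extract_prices "$html"
```

The pattern breaks as soon as the markup changes (extra attributes, nested tags), which is one reason tools in this space lean on XPath, XQuery and XSLT as well.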

Piwik – Web analytics – Open source


Piwik is a downloadable, open source (GPL-licensed), real-time web analytics program. It provides you with detailed reports on your website visitors: the search engines and keywords they used, the language they speak, your popular pages… and so much more.

Piwik aims to be an open source alternative to Google Analytics.

Piwik is a PHP/MySQL program that you download and install on your own web server. At the end of the five-minute installation process you are given a JavaScript tag. Simply copy and paste this tag into the websites you wish to track (or use an existing plugin to do it automatically for you) and access your analytics reports in real time.

How to change default I/O scheduler

How to change default I/O scheduler? | Planet Admon.

Red Hat Enterprise Linux 3 with a 2.4 kernel base uses a single, robust, general purpose I/O elevator. The I/O schedulers provided in Red Hat Enterprise Linux 4, embedded in the 2.6 kernel, have advanced the I/O capabilities of Linux significantly. With Red Hat Enterprise Linux 4, applications can now optimize the kernel I/O at boot time, by selecting one of four different I/O schedulers to accommodate different I/O usage patterns:

* Completely Fair Queuing—elevator=cfq (default)
* Deadline—elevator=deadline
* NOOP—elevator=noop
* Anticipatory—elevator=as

The I/O scheduler can be selected at boot time using the “elevator” kernel parameter. In the following example, the system has been configured to use the deadline scheduler in the grub.conf file.

title Red Hat Enterprise Linux Server (2.6.18-8.el5)
root (hd0,0)
kernel /vmlinuz-2.6.18-8.el5 ro root=/dev/vg0/lv0 elevator=deadline
initrd /initrd-2.6.18-8.el5.img
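After a reboot it can be useful to confirm that the parameter actually reached the kernel. The following is a small sketch, not part of the original article: elevator_param is a hypothetical helper that pulls the elevator= value out of a kernel command line string. As a simplification it reports only the first occurrence.

```shell
# Print the value of the first elevator= word in a kernel command line.
# Returns 1 if the parameter is absent.
elevator_param() {
  for word in $1; do            # unquoted on purpose: split on whitespace
    case $word in
      elevator=*) echo "${word#elevator=}"; return 0 ;;
    esac
  done
  return 1
}

# Example with the command line from the grub.conf entry above.
elevator_param "ro root=/dev/vg0/lv0 elevator=deadline"   # prints: deadline

# On a running system the input would come from /proc/cmdline:
#   elevator_param "$(cat /proc/cmdline)"
```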

In Red Hat Enterprise Linux 5, it is also possible to change the I/O scheduler for a particular disk on the fly.

# cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
# echo 'deadline' > /sys/block/sdb/queue/scheduler
# cat /sys/block/sdb/queue/scheduler
noop anticipatory [deadline] cfq
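The bracket notation above is easy to script against. Below is a minimal sketch, assuming the sysfs layout shown in the example; active_scheduler and set_scheduler are hypothetical helper names, and the optional third argument (an alternative sysfs root) exists only so the function can be tried against a scratch directory.

```shell
# Print the active scheduler from a sysfs scheduler line,
# e.g. "noop anticipatory [deadline] cfq" -> "deadline".
active_scheduler() {
  printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Switch device $1 to scheduler $2, but only if the kernel lists it.
# $3 is an optional sysfs root (defaults to /sys); an assumption for testing.
set_scheduler() {
  file="${3:-/sys}/block/$1/queue/scheduler"
  [ -w "$file" ] || { echo "cannot write $file" >&2; return 1; }
  # Offered names are space-separated; drop the brackets before matching.
  case " $(sed 's/[][]//g' "$file") " in
    *" $2 "*) echo "$2" > "$file" ;;
    *) echo "scheduler $2 not offered for $1" >&2; return 1 ;;
  esac
}

active_scheduler "noop anticipatory [deadline] cfq"   # prints: deadline

# As root, on a real system:
#   set_scheduler sdb deadline
#   active_scheduler "$(cat /sys/block/sdb/queue/scheduler)"
```

Checking the offered list first avoids a confusing write error when a scheduler is not compiled into the running kernel. A change made this way does not survive a reboot.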

The following are the tunable files for the deadline scheduler. They can be tuned to any suitable value according to hardware performance and software requirements:

/sys/block/DEVNAME/queue/iosched/read_expire
/sys/block/DEVNAME/queue/iosched/write_expire
/sys/block/DEVNAME/queue/iosched/fifo_batch
/sys/block/DEVNAME/queue/iosched/writes_starved
/sys/block/DEVNAME/queue/iosched/front_merges

DEVNAME is the name of the block device (such as sda, sdb or hda).
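Before changing any of these values it helps to record the current ones. Here is a short sketch, assuming the device is already using the deadline scheduler (the iosched directory only holds these files in that case). show_deadline_tunables is a hypothetical helper, and its optional second argument (an alternative sysfs root) is there purely for experimenting outside /sys.

```shell
# Print each deadline tunable of device $1 as "name = value".
# $2 is an optional sysfs root (defaults to /sys); an assumption for testing.
show_deadline_tunables() {
  dir="${2:-/sys}/block/$1/queue/iosched"
  [ -d "$dir" ] || { echo "no iosched directory for $1" >&2; return 1; }
  for name in read_expire write_expire fifo_batch writes_starved front_merges; do
    # Skip files that are absent (e.g. another scheduler is active).
    [ -r "$dir/$name" ] && printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
  done
  return 0
}

# On a real system:
#   show_deadline_tunables sdb
# A value is changed by writing to the file as root, for example:
#   echo 250 > /sys/block/sdb/queue/iosched/read_expire
```

The value 250 is only a placeholder; suitable settings depend on the hardware and workload, as noted above.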

A detailed description of the deadline I/O scheduler can be found at:
/usr/share/doc/kernel-[version]/Documentation/block/deadline-iosched.txt.

http://www.redhat.com/magazine/008jun05/features/schedulers/

http://www.redbooks.ibm.com/abstracts/redp4285.html

Figure: Random read performance per I/O elevator (synchronous)

Figure: CPU utilization by I/O elevator (asynchronous)

Figure: Impact of nr_requests on the Deadline elevator (random write ReiserFS)

Figure: Impact of nr_requests on the CFQ elevator (random write Ext3)

Figure: Random write throughput comparison between Ext and ReiserFS (synchronous)

Figure: Random write throughput comparison between Ext3 and ReiserFS (asynchronous)