Welcome

This project sets out to create a powerful, robust, quick-to-deploy web scraper. A web scraper is a tool that extracts specific parts of web pages, rather than the entire HTML page as a crawler would.

The outcome of this project will be a tool, written in Java, that provides the following:

A powerful, fast, reliable way of scraping the web
Simple setup, with no cumbersome XML configuration to write
Full automation
Extension points for parsing and exporting

See our road map to view our progress.

Syntax

Parser - the extension that parses the HTML and creates a subset of that HTML (chunk) for export.

Chunk Name - the label assigned to an HTML chunk.

Chunk - a part of the HTML we wish to capture.

The configuration file syntax defines which parts of the HTML we are going to capture. Each entry starts with the chunk name we choose for the chunk we want to capture, followed by the separator <|>. Next comes the name of the parser followed by a colon; let's say we want to use the token parser. Then come the parameters for this parser, followed by the final separator <||>. This pattern repeats for each piece of information you want to capture per page.

For example, if we wish to capture the title of an HTML page, the following configuration will work for most pages:

title <|>token:<title>####</title><||>
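To make the entry layout concrete, here is a rough sketch that splits a configuration string into chunk name, parser name and parameters. It is not the tool's actual configuration reader: the names ConfigSyntaxSketch, Entry and parse are hypothetical, and it assumes that the parser name ends at the first colon and that the parameters never contain the <||> separator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not the tool's actual configuration reader.
class ConfigSyntaxSketch {

    // One "chunk name <|>parser:parameters<||>" entry.
    static final class Entry {
        final String chunkName;
        final String parserName;
        final String parameters;

        Entry(String chunkName, String parserName, String parameters) {
            this.chunkName = chunkName;
            this.parserName = parserName;
            this.parameters = parameters;
        }
    }

    static List<Entry> parse(String config) {
        List<Entry> entries = new ArrayList<>();
        // Every entry is terminated by the final separator <||>.
        for (String raw : config.split("<\\|\\|>")) {
            if (raw.trim().isEmpty()) {
                continue; // ignore blank space between or after entries
            }
            // The first <|> separates the chunk name from the parser part.
            int sep = raw.indexOf("<|>");
            if (sep < 0) {
                throw new IllegalArgumentException("Missing <|> in: " + raw);
            }
            String chunkName = raw.substring(0, sep).trim();
            String parserPart = raw.substring(sep + 3);
            // Assumption: the parser name ends at the first colon.
            int colon = parserPart.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Missing ':' in: " + raw);
            }
            entries.add(new Entry(chunkName,
                    parserPart.substring(0, colon).trim(),
                    parserPart.substring(colon + 1)));
        }
        return entries;
    }

    public static void main(String[] args) {
        for (Entry e : parse("title <|>token:<title>####</title><||>")) {
            System.out.println(e.chunkName + " / " + e.parserName + " / " + e.parameters);
        }
    }
}
```

Splitting on <||> first and then on the first <|> keeps the sketch short, at the cost of not supporting parameters that themselves contain a separator.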

Show me how it works

The fastest way to start exploring how the tool works is to launch start.{bat,sh} and click the process button. When you click the button, the crawler starts. It downloads the three URLs in the URL text field and caches them on disk. It parses the HTML as it downloads, creates values for the named fields as specified in the Scraper Config text box, and puts three results in the results box. You can play around with the configuration to bring out different parts of the Wikipedia pages being scraped. I suggest you look at the parser documentation.

First full crawl

Once you've created the configuration file and tested it on some pages using the scraper tool, it is time to create a full crawl. Open the examples folder and look at wikipedia-config.txt and wikipedia-seeds.txt. The config file holds the scraper configuration used to parse HTML, and the seed file contains a list of all of the URLs that we wish to crawl, in this case the three pages from Wikipedia about crawling and scraping. (As a footnote, it is inadvisable to aggressively crawl Wikipedia, as they will blackhole your IP ;)

crawl.{bat,sh} is a command line tool that runs a Java process to kick off the crawl. Running crawl.{bat,sh} will start the Wikipedia crawl, and you will end up with a results file containing a vertical layout of the information extracted from the pages.

Automation

Once you've got your crawl running successfully, you can use a scheduled task (Windows) or cron (*nix) to run the required crawl each night.

Coming soon to this section is documentation about how to generate seed lists and how to run continuous tests to make sure that the sources you are crawling have not changed in ways that you should be alerted about.

Parser Documentation

Token Parser
RegEx Parser
BeanShell Parser
XPath Parser

Token Parser

The token parser takes two parameters, separated by four hash marks (####). The first parameter is a regex that locates the index where the capture starts. The second is a regex that locates the index where the capture ends. Everything in between the two capture points is assigned to the chunk name.
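The capture logic described above could look roughly like the following sketch. It assumes plain java.util.regex patterns without the tool's HTML shortcuts. The class and method names are hypothetical, and returning null when either pattern fails to match is an assumption rather than documented behaviour.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a single token capture; not the tool's source.
class TokenCaptureSketch {

    // parameters has the form "<begin regex>####<end regex>".
    static String capture(String html, String parameters) {
        String[] parts = parameters.split("####", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected begin####end: " + parameters);
        }
        // The capture starts right after the first match of the begin regex.
        Matcher begin = Pattern.compile(parts[0]).matcher(html);
        if (!begin.find()) {
            return null; // assumption: no start point means no chunk
        }
        // The capture ends where the end regex next matches after the start.
        Matcher end = Pattern.compile(parts[1]).matcher(html);
        if (!end.find(begin.end())) {
            return null; // assumption: no end point means no chunk
        }
        return html.substring(begin.end(), end.start());
    }

    public static void main(String[] args) {
        String html = "<html><title>Web scraping</title></html>";
        System.out.println(capture(html, "<title>####</title>"));
    }
}
```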

Syntax

<|>token:<RegEx Begin Index>####<RegEx End Index><||>

Example

If we wish to capture the words 'my link' from the following HTML:

<div id="interestingContent"> <a href="somerandomlink.txt"> my link </a> </div>

we could use the following token parser configuration:

link <|>token:<div id="interestingContent"><a*>####</a></div><||>

Variations

Two flavours of this parser are also available: tokenmulti, which captures repeating elements on a page, and tokenmulticount, which captures repeating sections and indexes the output. When these parsers are used, your vertical export file will contain multiple instances of this chunk.

tokenmulti - will find multiple matches of the expression on a page.

<|>tokenmulti:<RegEx Begin Index>####<RegEx End Index><||>

tokenmulticount - identical to tokenmulti, except that the export file will have a zero-indexed suffix appended to the chunk name.

<|>tokenmulticount:<RegEx Begin Index>####<RegEx End Index><||>
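The difference between tokenmulti and tokenmulticount can be illustrated with a small sketch. It assumes plain java.util.regex patterns, and it assumes that the zero-indexed suffix is the index appended directly to the chunk name (item0, item1, ...); the exact suffix format used in the export file is not documented here. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the multi variants; not the tool's source.
class TokenMultiSketch {

    // tokenmulti: collect every begin/end capture on the page, in order.
    static List<String> captureAll(String html, String beginRegex, String endRegex) {
        List<String> values = new ArrayList<>();
        Matcher begin = Pattern.compile(beginRegex).matcher(html);
        Matcher end = Pattern.compile(endRegex).matcher(html);
        int from = 0;
        while (from <= html.length() && begin.find(from) && end.find(begin.end())) {
            values.add(html.substring(begin.end(), end.start()));
            // Continue after the end match; always advance to avoid looping forever.
            from = Math.max(end.end(), from + 1);
        }
        return values;
    }

    // tokenmulticount: the same values, keyed by chunk name plus a zero-based index.
    static Map<String, String> withIndexedNames(String chunkName, List<String> values) {
        Map<String, String> named = new LinkedHashMap<>();
        for (int i = 0; i < values.size(); i++) {
            named.put(chunkName + i, values.get(i)); // assumed suffix format
        }
        return named;
    }

    public static void main(String[] args) {
        List<String> items = captureAll("<li>one</li><li>two</li>", "<li>", "</li>");
        System.out.println(items);
        System.out.println(withIndexedNames("item", items));
    }
}
```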

RegEx Parser

The regex parser takes one parameter, a regular expression. The first matching section is what is assigned to the chunk name. A matching section, as defined in RegEx, is surrounded by parentheses.

Syntax

<|>regex:(<RegEx>)<||>

Example

If we wish to capture the price '9.99', excluding the dollar sign, from this HTML:

<div id="interestingContent"> <a href="somerandomlink.txt"> $9.99 on sale! </a> </div>

we could use the following regex parser configuration:

price <|>regex:<div id="interestingContent"><a*>\$([0-9]+\.[0-9]+)<||>

Variations

Two flavours of this parser are also available: regexmulti, which captures repeating elements on a page, and regexmulticount, which captures repeating elements and indexes them in the output. When these parsers are used, your vertical export file will contain multiple instances of these chunks.

regexmulti - will find multiple matches of the expression on a page.

<|>regexmulti:(<RegEx>)<||>

regexmulticount - identical to regexmulti, except that the export file will have a zero-indexed suffix appended to the chunk name.

<|>regexmulticount:(<RegEx>)<||>
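In plain Java terms, the regex parser and its multi variant might behave like the sketch below. It uses ordinary java.util.regex patterns without the tool's HTML shortcuts. The class and method names are hypothetical, and returning null when nothing matches is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of group-based capture; not the tool's source.
class RegexCaptureSketch {

    private static Matcher matcherFor(String html, String regex) {
        Matcher m = Pattern.compile(regex).matcher(html);
        if (m.groupCount() < 1) {
            throw new IllegalArgumentException("Regex needs a (group): " + regex);
        }
        return m;
    }

    // regex: the first parenthesised group of the first match.
    static String first(String html, String regex) {
        Matcher m = matcherFor(html, regex);
        return m.find() ? m.group(1) : null; // assumption: null when nothing matches
    }

    // regexmulti: the first parenthesised group of every match on the page.
    static List<String> all(String html, String regex) {
        List<String> values = new ArrayList<>();
        Matcher m = matcherFor(html, regex);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<a href=\"somerandomlink.txt\"> $9.99 on sale! </a>";
        System.out.println(first(html, "\\$([0-9]+\\.[0-9]+)"));
    }
}
```

Only the parenthesised group is returned, which is why the dollar sign can appear in the expression without ending up in the captured value.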

BeanShell Parser (To Be Implemented)

The BeanShell parser will take BeanShell expressions and have access to all the other chunks that have been processed, as well as the ability to create completely new chunks based on the page.

Syntax

TBI

XPath Parser (To Be Implemented)

The XPath parser will take XPath expressions to create specific chunks of HTML from a well-formed DOM.

Syntax

TBI

Regular Expressions

This tool uses a modified regular expression language - read below to find out more.

Regular Expressions (RegEx) is a language built for parsing text. You can read a great tutorial about RegEx here: http://www.regular-expressions.info/tutorial.html

As RegEx was not built with XML-type text in mind, this tool provides some handy shortcuts specifically for HTML work. For example, the boundary between adjacent tags is replaced with a looser expression that matches any whitespace between them. That is, in the Scraper RegEx, "<span><div>" is the same as "<span> <div>", which is the same as

"<span>

<div>"

Another shortcut is that if you don't care about the attributes of a tag, but do care that the tag is present, you can use the * notation. For example, the HTML <a href="someUrlThatChangesAllTheTime.html?random=234132" class="mouseoverclass"> can be matched with <a*>
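How the tool rewrites these shortcuts internally is not documented here. As an illustration only, the sketch below shows one way the two shortcuts could be translated into a standard Java regular expression. The class name, the method name and the exact replacement patterns are assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical translation of the HTML shortcuts; not the tool's source.
class ScraperRegexSketch {

    static String toJavaRegex(String scraperRegex) {
        // <a*> becomes "an <a> tag with any attributes". The doubled
        // backslashes keep a literal \s in the replacement string.
        String expanded = scraperRegex.replaceAll("<(\\w+)\\*>", "<$1(?:\\\\s[^>]*)?>");
        // Adjacent tags: allow optional whitespace between ">" and "<".
        return expanded.replace("><", ">\\s*<");
    }

    public static void main(String[] args) {
        String regex = toJavaRegex("<div id=\"interestingContent\"><a*>");
        String html = "<div id=\"interestingContent\">\n  <a href=\"somerandomlink.txt\">";
        System.out.println(regex);
        System.out.println(Pattern.compile(regex).matcher(html).find());
    }
}
```

The non-capturing group (?: ) is used so that the translation does not shift the numbering of the parenthesised groups that the regex parser relies on.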

Road Map

Other tools

Here is a list of other tools. Feel free to suggest something that you think covers something important.

WebScraper