Crawling web sites with HtmlAgilityPack
Introduction
This is the first post of a small series in which I'm going to describe the implementation and design of a crawler I've done recently for TDD demand analysis. I'll split it up into several parts, covering its major architectural pieces.
- Part 1 - Crawling web sites with HtmlAgilityPack
- Part 2 - Regex to match dictionary words in the page body
- Part 3 - EF4 Code First approach to store data
For reference, you can use the source code: http://github.com/alexbeletsky/tdd.demand
Warning: it's quite a long post because it contains code examples. Once you understand the basic ideas presented here, the best way forward is to go directly to the repository and read the code itself; it is the best explanation material.
Using HtmlAgilityPack
HtmlAgilityPack is one of the greatest open source projects I have ever worked with. It is an HTML parser for .NET applications that performs well and handles malformed HTML. I successfully used it in one of my projects and really liked it. It has very little documentation, but it is designed so well that you can get a basic understanding just by browsing it in the Visual Studio Object Browser.
So, when you need to deal with HTML in .NET, HtmlAgilityPack is definitely the framework of choice.
I downloaded the latest version and was very pleased to see that it now supports Linq to Objects, which makes HtmlAgilityPack simpler and more fun to use. I'll give you a quick idea of how it works. The task of every crawler is to extract some information from a particular HTML page. Say we need to get the inner text of a div element with the class "required". We have two options here: the classical one, using XPath, and the brand new one, using Linq to Objects.
XPath approach
public string GetInnerTextWithXpath()
{
    var document = new HtmlDocument();
    document.Load(new FileStream("test.html", FileMode.Open));

    var node = document.DocumentNode.SelectSingleNode(@"//div[@class=""required""]");

    return node.InnerText;
}
Linq to Objects approach
public string GetInnerTextWithLinq()
{
    var document = new HtmlDocument();
    document.Load(new FileStream("test.html", FileMode.Open));

    var node = document.DocumentNode.Descendants("div")
        .Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("required"))
        .SingleOrDefault();

    return node.InnerText;
}
Although I personally prefer the Linq to Objects approach, sometimes XPath is more convenient and elegant (especially when you refer to page elements without ids or special attributes).
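For instance, when a page has no ids or classes to hang onto, XPath can still address an element by its position in the document structure. The snippet below is purely illustrative and not taken from the crawler itself:

// Hypothetical example: select the third item of the second unordered
// list on the page, when there are no ids or classes to select by.
public string GetThirdItemOfSecondList(HtmlAgilityPack.HtmlDocument document)
{
    var node = document.DocumentNode.SelectSingleNode("(//ul)[2]/li[3]");
    return node == null ? null : node.InnerText;
}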
Loading pages using WebRequest
In the previous example I loaded the page content from a file on disk. Now our goal is to load pages by URL over HTTP. The .NET framework has a special WebRequest class for that. I've created a separate class HtmlDocumentLoader (implementing the IHtmlDocumentLoader interface) that keeps all the details inside.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.Threading;

namespace Crawler.Core.Model
{
    public class HtmlDocumentLoader : IHtmlDocumentLoader
    {
        private WebRequest CreateRequest(string url)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Timeout = 5000;
            request.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5";

            return request;
        }

        public HtmlAgilityPack.HtmlDocument LoadDocument(string url)
        {
            var document = new HtmlAgilityPack.HtmlDocument();

            try
            {
                using (var responseStream = CreateRequest(url).GetResponse().GetResponseStream())
                {
                    document.Load(responseStream, Encoding.UTF8);
                }
            }
            catch (Exception)
            {
                // just do a second try
                Thread.Sleep(1000);

                using (var responseStream = CreateRequest(url).GetResponse().GetResponseStream())
                {
                    document.Load(responseStream, Encoding.UTF8);
                }
            }

            return document;
        }
    }
}
Several comments here. First, you can see that we set the UserAgent property of the WebRequest; we make our request look the same as if it came from the Firefox web browser. Some web servers reject requests from "unknown" agents, so this is a kind of preventive measure. Second, note how the document object is initialized: we have a try/catch block and simply repeat the same initialization steps in the catch block. The web server may fail to process a request (for different reasons), in which case the WebRequest object throws an exception. We just wait for one second and retry. I've noticed that this simple approach really improves the robustness of the crawler.
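The IHtmlDocumentLoader interface itself is trivial. Here is a minimal sketch of how it could be declared and used; the exact declaration in the repository may differ slightly, and the CountLinks helper is purely a hypothetical usage example:

using System.Linq;

namespace Crawler.Core.Model
{
    // A minimal sketch of the loader abstraction, assuming a single LoadDocument method.
    public interface IHtmlDocumentLoader
    {
        HtmlAgilityPack.HtmlDocument LoadDocument(string url);
    }

    public static class LoaderUsageExample
    {
        // Hypothetical usage: load a page and count the links on it.
        public static int CountLinks(IHtmlDocumentLoader loader, string url)
        {
            var document = loader.LoadDocument(url);
            return document.DocumentNode.Descendants("a").Count();
        }
    }
}

Keeping the loader behind an interface also means a crawler can be fed a fake loader in tests, without hitting the network.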
Generic Crawler
So, now we know how to load HTML documents with WebRequest given a document URL, and we know how to use HtmlAgilityPack to extract data from a document. Next, we have to create an engine that automatically goes through the documents, extracts the links to the next portion of data, processes the data and stores it. That is what is called a crawler.
Having implemented and tested several crawlers, I've seen that all of them have the same structure and operations and differ only in the particular details of how data is extracted from pages. So I came up with a generic crawler, implemented as an abstract class. If you need to build the next crawler, you just inherit from the generic crawler and implement all the abstract operations. Let's look at the heart of the crawler, the StartCrawling() method.
protected virtual void StartCrawling()
{
    Logger.Log(BaseUrl + " crawler started...");

    CleanUp();

    for (var nextPage = 1; ; nextPage++)
    {
        var url = CreateNextUrl(nextPage);
        var document = Loader.LoadDocument(url);

        Logger.Log("processing page: [" + nextPage.ToString() + "] with url: " + url);

        var rows = GetJobRows(document);
        var rowsCount = rows.Count();

        Logger.Log("extracted " + rowsCount + " vacancies on page");

        if (rowsCount == 0)
        {
            Logger.Log("no more vacancies to process, breaking main loop");
            break;
        }

        Logger.Log("starting to process all vacancies");

        foreach (var row in rows)
        {
            Logger.Log("starting processing div, extracting vacancy href...");

            var vacancyUrl = GetVacancyUrl(row);
            if (vacancyUrl == null)
            {
                Logger.Log("FAILED to extract vacancy href, not stopped, proceed with next one");
                continue;
            }

            Logger.Log("started to process vacancy with url: " + vacancyUrl);

            var vacancyBody = GetVacancyBody(Loader.LoadDocument(vacancyUrl));
            if (vacancyBody == null)
            {
                Logger.Log("FAILED to extract vacancy body, not stopped, proceed with next one");
                continue;
            }

            var position = GetPosition(row);
            var company = GetCompany(row);
            var technology = GetTechnology(position, vacancyBody);
            var demand = GetDemand(vacancyBody);

            var record = new TddDemandRecord()
            {
                Site = BaseUrl,
                Company = company,
                Position = position,
                Technology = technology,
                Demand = demand,
                Url = vacancyUrl
            };

            Logger.Log("new record has been created and initialized");

            Repository.Add(record);
            Repository.SaveChanges();

            Logger.Log("record has been successfully stored to database.");
            Logger.Log("finished to process vacancy");
        }

        Logger.Log("finished to process page");
    }

    Logger.Log(BaseUrl + " crawler has successfully finished");
}
It uses the abstract Loader, Logger and Repository members. We have already reviewed the Loader functionality; Logger is a simple interface with a Log method (I've created one implementation that writes log messages to the console, which is enough for me), and we will review Repository next time.
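For completeness, the logging abstraction can be as small as this. It is a minimal sketch: the interface shape and the Crawler.Core namespace are assumptions, and the real console implementation in the repository may format messages differently.

using System;

namespace Crawler.Core
{
    // A minimal sketch of the logging abstraction, assuming a single Log method.
    public interface ILogger
    {
        void Log(string message);
    }

    // Console implementation: timestamp plus message, nothing more.
    public class ConsoleLogger : ILogger
    {
        public void Log(string message)
        {
            Console.WriteLine("{0:HH:mm:ss} {1}", DateTime.Now, message);
        }
    }
}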
The GetTechnology and GetDemand methods are the same for all crawlers, so they are part of the generic crawler; the rest of the operations are "site-dependent", so each crawler overrides them.
protected abstract IEnumerable<HtmlAgilityPack.HtmlNode> GetJobRows(HtmlAgilityPack.HtmlDocument document);
protected abstract string CreateNextUrl(int nextPage);
protected abstract string GetVacancyUrl(HtmlAgilityPack.HtmlNode row);
protected abstract string GetVacancyBody(HtmlAgilityPack.HtmlDocument htmlDocument);
protected abstract string GetPosition(HtmlAgilityPack.HtmlNode row);
protected abstract string GetCompany(HtmlAgilityPack.HtmlNode row);
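To give an idea of the shared part, GetTechnology and GetDemand could look roughly like the hypothetical sketch below. The signatures, return types and keyword lists here are my assumptions for illustration only; the real keyword matching with Regex and a dictionary of terms is the topic of Part 2.

// Hypothetical sketch of the shared helpers: scan the position and body text
// for known keywords. Return types and keyword lists are assumptions;
// the actual implementation uses Regex and is covered in Part 2.
protected virtual string GetTechnology(string position, string vacancyBody)
{
    var technologies = new[] { ".NET", "Java", "Ruby", "Python" };
    var text = position + " " + vacancyBody;

    return technologies.FirstOrDefault(t => text.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0);
}

protected virtual bool GetDemand(string vacancyBody)
{
    // "Demand" here means the vacancy mentions TDD or unit testing.
    var keywords = new[] { "TDD", "test-driven", "unit test" };

    return keywords.Any(k => vacancyBody.IndexOf(k, StringComparison.OrdinalIgnoreCase) >= 0);
}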
Here we'll review one of the crawlers and see how it implements all the methods required by the CrawlerImpl class.
namespace Crawler.Core.Crawlers
{
    public class RabotaUaCrawler : CrawlerImpl, ICrawler
    {
        private string _baseUrl = @"http://rabota.ua";
        private string _searchBaseUrl = @"http://rabota.ua/jobsearch/vacancy_list?rubricIds=8,9&keyWords=&parentId=1";

        public RabotaUaCrawler(ILogger logger)
        {
            Logger = logger;
        }

        public void Crawle(IHtmlDocumentLoader loader, ICrawlerRepository context)
        {
            Loader = loader;
            Repository = context;

            StartCrawling();
        }

        protected override string BaseUrl
        {
            get { return _baseUrl; }
        }

        protected override string SearchBaseUrl
        {
            get { return _searchBaseUrl; }
        }

        protected override IEnumerable<HtmlAgilityPack.HtmlNode> GetJobRows(HtmlAgilityPack.HtmlDocument document)
        {
            var vacancyDivs = document.DocumentNode.Descendants("div")
                .Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("vacancyitem"));

            return vacancyDivs;
        }

        protected override string GetVacancyUrl(HtmlAgilityPack.HtmlNode div)
        {
            var vacancyHref = div.Descendants("a")
                .Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("vacancyDescription"))
                .Select(d => d.Attributes["href"].Value).SingleOrDefault();

            return BaseUrl + vacancyHref;
        }

        private static string GetVacancyHref(HtmlAgilityPack.HtmlNode div)
        {
            var vacancyHref = div.Descendants("a")
                .Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("vacancyDescription"))
                .Select(d => d.Attributes["href"].Value).SingleOrDefault();

            return vacancyHref;
        }

        protected override string CreateNextUrl(int nextPage)
        {
            return SearchBaseUrl + "&pg=" + nextPage;
        }

        protected override string GetVacancyBody(HtmlAgilityPack.HtmlDocument vacancyPage)
        {
            if (vacancyPage == null)
            {
                //TODO: log event here and skip this page
                return null;
            }

            var description = vacancyPage.DocumentNode.Descendants("div")
                .Where(d => d.Attributes.Contains("id") && d.Attributes["id"].Value.Contains("ctl00_centerZone_vcVwPopup_pnlBody"))
                .Select(d => d.InnerHtml).SingleOrDefault();

            return description;
        }

        protected override string GetPosition(HtmlAgilityPack.HtmlNode div)
        {
            return div.Descendants("a")
                .Where(d => d.Attributes.Contains("class") &&
                    d.Attributes["class"].Value.Contains("vacancyName") || d.Attributes["class"].Value.Contains("jqKeywordHighlight"))
                .Select(d => d.InnerText).First();
        }

        protected override string GetCompany(HtmlAgilityPack.HtmlNode div)
        {
            return div.Descendants("div")
                .Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("companyName"))
                .Select(d => d.FirstChild.InnerText).First();
        }
    }
}
To complete the picture, just review the implementation of the rest of the crawlers: http://github.com/alexbeletsky/tdd.demand/tree/master/src/Crawler/Core/Crawlers/
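Wiring everything together then comes down to a few lines. This is a minimal sketch, assuming the ConsoleLogger shown above and some ICrawlerRepository implementation; the actual entry point in the repository may be organized differently, and CreateRepository() here is only a placeholder.

using System;

namespace Crawler.App
{
    // Hypothetical composition root: create the dependencies and run one crawler.
    // ICrawlerRepository is the EF4 Code First repository covered in Part 3;
    // here it is only assumed to exist.
    public class Program
    {
        public static void Main(string[] args)
        {
            var logger = new ConsoleLogger();
            var loader = new HtmlDocumentLoader();
            var repository = CreateRepository();

            var crawler = new RabotaUaCrawler(logger);
            crawler.Crawle(loader, repository);
        }

        private static ICrawlerRepository CreateRepository()
        {
            // Placeholder: the real implementation is the subject of Part 3.
            throw new NotImplementedException();
        }
    }
}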
Conclusions
You can see that implementing a simple crawler is a simple thing once you have good tools for it. Of course, its functionality is very specific and limited, but I hope it gives you ideas for your own crawlers.
In the next blog post I'll cover the usage of Regex in .NET and the brand-new Entity Framework 4 Code First approach to working with databases.