Scraping prices off Walmart – A simple Regex Example

February 24, 2011

Scraping is the process of parsing an HTML web page and extracting specific pieces of text for your own use. By coding a simple scraping program in ASP.NET, you can put freely available data to work for purposes such as monitoring pages for changes, collecting pricing information from online retailers, or verifying links.

This form of automated browsing raises legal issues for webmasters and developers, particularly if you obtain copyrighted data for your own use without the author’s permission. A good use for scraping, and the example in this post, is to monitor a competitor’s web site for the price of a product and send an alert or perform some other action when the price changes.

For this example, our goal is to scrape the Walmart web site (www.walmart.com) and extract the price of a particular product, in this case a Nikon D3100 14.2MP DSLR Camera.

The product page is at http://www.walmart.com/ip/Nikon-Nikon-D3100-Kit/15222286, but you’ll notice that the price isn’t shown there. It is on a separate page, reached from the highlighted link below:

[Screenshot: Walmart product page with the ‘See Our Price’ link highlighted]

Clicking on ‘See Our Price’ brings up http://www.walmart.com/catalog/submapPricingPopup.do?product_id=15222286&isQL=false, which is shown below:

[Screenshot: the Walmart pricing popup page]

So the first step is to extract all the HTML in this page and save it in a string.

Step 1 – Read in the HTML

For simplicity we’ll put this code into the Page_Load function of the default.aspx.cs file, but first we need to import the namespaces that we’ll be using:

using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

Now we make a request to a Uniform Resource Identifier (URI) using the HttpWebRequest class, which provides an HTTP-specific implementation of the WebRequest class:

// The URL must stay on one line: a regular C# string literal cannot span lines
HttpWebRequest webRequest = WebRequest.Create(
    "http://www.walmart.com/catalog/submapPricingPopup.do?product_id=15222286&isQL=false")
    as HttpWebRequest;

Then we create a StreamReader object, which reads the response to the web request defined above. We store the result in a string called responseData and then close everything we’ve opened.

// Get the response once; the using blocks close the reader and the response
string responseData;
using (WebResponse response = webRequest.GetResponse())
using (StreamReader responseReader = new StreamReader(response.GetResponseStream()))
{
    responseData = responseReader.ReadToEnd();
}

Now the HTML that makes up the Walmart pricing page for the Nikon camera is in the responseData string.
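If you want to reuse Step 1 outside Page_Load, the request and read can be wrapped in a small helper. This is only a sketch: the class name PageFetcher, the method names, the UTF-8 encoding and the 10 second timeout are all assumptions made for illustration, not part of the original example. ReadAll is split out so the stream-reading part can be tried without touching the network.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class PageFetcher
{
    // Reads an entire stream into a string (UTF-8 is assumed for this sketch).
    public static string ReadAll(Stream stream)
    {
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Requests the URL once and returns the response body as a string.
    public static string FetchHtml(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 10000; // milliseconds; an arbitrary choice for this sketch

        // The using block closes the response even if reading throws.
        using (WebResponse response = request.GetResponse())
        {
            return ReadAll(response.GetResponseStream());
        }
    }
}
```

With this in place, the code in Page_Load would shrink to a single call such as string responseData = PageFetcher.FetchHtml(url);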

Step 2 – Extract the price

Looking at the source of the HTML we’ve grabbed from http://www.walmart.com/catalog/submapPricingPopup.do?product_id=15222286&isQL=false, you can see the price ($599.00) is in a DIV with the class Price4XL:

[Screenshot: page source showing the price inside the Price4XL DIV]

Now we use a regular expression (regex) to parse the HTML for the text located in the Price4XL DIV tag.

If you’re not familiar with regular expressions, then start with this Microsoft article on Regular Expressions in ASP.NET: http://msdn.microsoft.com/en-us/library/ms972966.aspx. The regex string required to extract the text between the opening and closing DIV tags ends up being:

(?<=<div class="Price4XL"[^>]*>).*?(?=</div>)

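This regex uses a positive lookbehind and a positive lookahead, so the match must sit between two patterns: the opening DIV tag before it and the closing DIV tag after it. The tags themselves are not part of the match, which leaves just the price in our example.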
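To see the lookbehind and lookahead at work without hitting the live site, here is a small sketch that runs the same pattern against a hard-coded fragment. The class name LookaroundDemo, the method ExtractPrice and the sample HTML are made up for illustration; the real page source will look different.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Same pattern as in the post: lookbehind for the opening tag,
    // lookahead for the closing tag, lazy match in between.
    private const string Pattern = @"(?<=<div class=""Price4XL""[^>]*>).*?(?=</div>)";

    // Returns the text between the tags, or null if the DIV is not present.
    public static string ExtractPrice(string html)
    {
        Match match = Regex.Match(html, Pattern, RegexOptions.Singleline);
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        // Hypothetical fragment standing in for the real page source.
        string sampleHtml =
            "<p>Our price</p><div class=\"Price4XL\">$599.00</div><p>Shipping</p>";

        Console.WriteLine(ExtractPrice(sampleHtml)); // prints $599.00
    }
}
```

Note that the match value contains only the price, because neither lookaround consumes any characters.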

The ASP.NET code to match the string and extract the price:

MatchCollection m1 = Regex.Matches(
    responseData,
    @"(?<=<div class=""Price4XL""[^>]*>).*?(?=</div>)",
    RegexOptions.Singleline);

foreach (Match m in m1)
{
    Response.Write(m.ToString());
}

In the above Regex.Matches statement, we are using RegexOptions.Singleline. MSDN states that Singleline “Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).”
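The effect of Singleline is easiest to see when the price sits on its own line inside the DIV. The sketch below counts matches with and without the option; the class name SinglelineDemo, the CountMatches helper and the sample markup are assumptions for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    private const string Pattern = @"(?<=<div class=""Price4XL""[^>]*>).*?(?=</div>)";

    // Counts matches of the pattern under the given options.
    public static int CountMatches(string html, RegexOptions options)
    {
        return Regex.Matches(html, Pattern, options).Count;
    }

    public static void Main()
    {
        // Hypothetical markup where line breaks surround the price.
        string html = "<div class=\"Price4XL\">\n$599.00\n</div>";

        // Without Singleline the dot stops at \n, so nothing matches.
        Console.WriteLine(CountMatches(html, RegexOptions.None));       // prints 0

        // With Singleline the dot also matches \n, so the DIV content is found.
        Console.WriteLine(CountMatches(html, RegexOptions.Singleline)); // prints 1
    }
}
```

In the Singleline case the match value includes the surrounding line breaks, so you would normally call Trim() on it before using it.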

This simple example displays the price on a blank screen, but you can just as easily compare the data, store it, or write it elsewhere. If you had a table of different Walmart product IDs, you could loop through those products and flag any price changes. The product ID in the URL (in this case 15222286, which maps to a Nikon camera) is the only item that needs to change to find the price for other products.
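As a rough sketch of that idea, the code below builds the pricing URL for each product ID and compares a scraped price with the last one recorded. Everything named here (PriceMonitor, BuildPricingUrl, ParsePrice, HasChanged, the dictionary of last known prices and the stand-in scraped value) is hypothetical, and the price text is assumed to be in US dollar format such as $599.00.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class PriceMonitor
{
    // URL format taken from the post; only the product ID changes.
    public static string BuildPricingUrl(string productId)
    {
        return "http://www.walmart.com/catalog/submapPricingPopup.do?product_id="
               + productId + "&isQL=false";
    }

    // Converts scraped text such as "$599.00" to a decimal.
    // Assumes a leading dollar sign and US-style separators.
    public static decimal ParsePrice(string priceText)
    {
        string digits = priceText.Trim().TrimStart('$');
        return decimal.Parse(digits, NumberStyles.Number, CultureInfo.InvariantCulture);
    }

    // True when the scraped price differs from the price recorded earlier.
    public static bool HasChanged(decimal lastKnownPrice, string scrapedPriceText)
    {
        return ParsePrice(scrapedPriceText) != lastKnownPrice;
    }

    public static void Main()
    {
        // Hypothetical table of product IDs and the prices last recorded for them.
        Dictionary<string, decimal> lastKnownPrices = new Dictionary<string, decimal>
        {
            { "15222286", 599.00m }
        };

        foreach (KeyValuePair<string, decimal> entry in lastKnownPrices)
        {
            Console.WriteLine(BuildPricingUrl(entry.Key));

            // In a real run this text would come from fetching the URL and
            // applying the regex; a stand-in value is used here.
            string scrapedPriceText = "$549.00";

            if (HasChanged(entry.Value, scrapedPriceText))
            {
                Console.WriteLine("Price changed for product " + entry.Key);
            }
        }
    }
}
```

Parsing to a decimal before comparing avoids false alarms from formatting differences such as stray whitespace around the price.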

If there are multiple matches of the regex, they can be stored in a string array in the foreach loop above and the value you need picked out by index. For example, if there are three matches of the regex string and you want the last one, use MyArray[2]. The MatchCollection can also be indexed directly, as in m1[2].
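Here is a minimal sketch of collecting the matches into an array and picking one by index. The class name MatchArrayDemo, the ExtractAll helper and the three-DIV sample page are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchArrayDemo
{
    private const string Pattern = @"(?<=<div class=""Price4XL""[^>]*>).*?(?=</div>)";

    // Copies every match value into a string array, in document order.
    public static string[] ExtractAll(string html)
    {
        MatchCollection matches = Regex.Matches(html, Pattern, RegexOptions.Singleline);
        List<string> values = new List<string>();
        foreach (Match m in matches)
        {
            values.Add(m.Value);
        }
        return values.ToArray();
    }

    public static void Main()
    {
        // Hypothetical page with three Price4XL DIVs.
        string html =
            "<div class=\"Price4XL\">$599.00</div>" +
            "<div class=\"Price4XL\">$649.00</div>" +
            "<div class=\"Price4XL\">$699.00</div>";

        string[] prices = ExtractAll(html);
        Console.WriteLine(prices[2]); // third match: $699.00
    }
}
```

It is worth checking the array length before indexing, since a layout change on the page could leave you with fewer matches than expected.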

Updated 25/2/2010

Here is the full source code, just put this into your default.aspx.cs file:

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Request the pricing page (the URL must stay on one line)
        HttpWebRequest webRequest = WebRequest.Create(
            "http://www.walmart.com/catalog/submapPricingPopup.do?product_id=15222286&isQL=false")
            as HttpWebRequest;

        // Read the whole response; the using blocks close the reader and the response
        string responseData;
        using (WebResponse response = webRequest.GetResponse())
        using (StreamReader responseReader = new StreamReader(response.GetResponseStream()))
        {
            responseData = responseReader.ReadToEnd();
        }

        // Extract the text inside the Price4XL DIV
        MatchCollection m1 = Regex.Matches(responseData,
            @"(?<=<div class=""Price4XL""[^>]*>).*?(?=</div>)",
            RegexOptions.Singleline);

        foreach (Match m in m1)
        {
            Response.Write(m.ToString());
        }
    }
}

Hope this helps!

© Copyright - Evonet