Web scraping is the automated copying of content from a web page (or from targeted parts of an entire site) by a programmed robot (or a pre-programmed application). I use Python (and an open-source library) to tackle virtually any web scraping task.
An interesting use of web scraping: automatically retrieving (even hourly) the prices of products sold by the competition
In this article I presume the reader runs a business that sells to consumers (B2C) and, like any normal business, needs to diligently monitor the competition, with data about its competitors' prices readily available. I also presume that each competitor has a website (an eCommerce site, for example) that displays the price of each and every product it offers to consumers.
Disclaimer: it is debatable whether web scraping is legal or illegal. A recent court case showed that web scraping is not clearly legal, although legal issues such as copyright infringement or breach of contract were not addressed in that case (according to its author, Mr Eric Goldman, a professor at Santa Clara University School of Law, where he teaches and writes about Internet Law, Intellectual Property and Advertising Law). Therefore, before pursuing web scraping, I strongly recommend that readers consult their lawyer.
I decline any responsibility whatsoever for anything the reader might pursue after reading this article. Use web scraping at your own risk!
What I think is still legal: no competitor can forbid you from looking. You open your browser, enter the URL of the competitor’s website, look at the figures, and write down with a pencil the prices that website displays for a specific product. But this is a manual approach that is obviously far from efficient.
Remark: the court case mentioned in the Disclaimer above dealt with an automated tool. Apparently, that tool was the root cause of the problem, because it made the target website stop working.
Let’s be clear about one thing: I would never recommend anything like that. Blocking websites from functioning properly is not my intention.
Now that we are past the disclaimer, a few words about what this article provides. I explain a convenient, cost-effective and flexible method to gather public information from your competitors’ websites (without disrupting any target website) and build a database for further analysis. Such a database helps any business assess the competition and improve internal decisions.
Data gathered from competitors or from the market thus falls into the “business intelligence” category. It is normally used when drafting a pricing strategy or setting the size of the discounts you are willing to offer consumers (to set yourself apart from your main competitors), all with the aim of maximising your sales volume and profit.
Advantage: gathering market data into your company’s database, where it is available to your analysts, is what keeps a business adaptable to the market. You might decide to reduce prices for one category of products and/or increase prices for others.
A manual approach won’t work for, say, hundreds or thousands of products. If a large volume of market data from competitors is to be gathered, the process becomes inefficient: you need additional staff to do it manually (or you can do it with less staff, but it takes much more time).
A pre-programmed application (or robot) designed to extract the right data at the right time has the following advantages:
- it extracts data at whatever frequency you desire (even hourly), so you are always up to date with the prices charged by the competition;
- no manual errors;
- the data obtained can be saved directly in your database, or in whatever format is needed (CSV, for example), ready to be imported into your company’s database;
- the accumulated data allows various models to be built: scenarios of price changes, with the corresponding sales volumes and the resulting profit per product and/or total profit maximisation.
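As a minimal sketch of the CSV point above, here is how scraped prices could be written out with Python’s standard library. The product names and prices are invented sample data, and the filename is an assumption:

```python
import csv
from datetime import datetime, timezone

# Hypothetical prices collected from a competitor's site (invented sample data).
# Prices are kept as strings to preserve their displayed formatting.
scraped = [
    ("Garden Chair", "24.99"),
    ("Garden Table", "89.50"),
]

# One timestamp per scraping run, so hourly runs can be told apart later.
timestamp = datetime.now(timezone.utc).isoformat()

with open("competitor_prices.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["scraped_at", "product", "price"])  # header row
    for product, price in scraped:
        writer.writerow([timestamp, product, price])
```

A file produced this way can be appended to on every run, or imported straight into the company’s database.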
Now, the technical part: what’s behind the scenes at the competition? Web pages and HTML
Any website is a collection of web pages. When we visit a web page, our browser sends a request to a web server, which sends back files that tell the browser how to render the page for us.
The files our browser receives fall into a few main types:
- HTML files, which carry the main content of the page;
- CSS (Cascading Style Sheets) files, which add styling to make the page look nicer;
- images, in formats such as JPG or PNG, which allow web pages to show pictures.
HTML is the main focus of web scraping, because it holds the content we want to obtain.
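That request/response cycle can be sketched in a few lines of standard-library Python. A `data:` URL stands in for a real competitor’s server here (an assumption of this sketch, so it runs without network access); in practice you would pass the competitor’s page URL instead:

```python
import urllib.request

# In practice this would be a competitor's product page, e.g. a (hypothetical)
# "https://competitor.example/product/123". A data: URL stands in here so the
# sketch works offline: it simply echoes back the HTML embedded in the URL.
url = "data:text/html,<html><body><p>Price:19.99</p></body></html>"

with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8")

print(html)  # the raw HTML that a browser (or a scraper) receives
```

What the browser renders as a styled page, a scraper sees as this raw HTML text.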
HyperText Markup Language (HTML) lets you do things similar to what you do in a word processor like Microsoft Word: make text bold, create paragraphs, and so on. HTML is not as complex as Python.
HTML consists of elements called tags. Wherever you see “<” followed by one or more words and then a matching “>” that closes them, you are looking at a tag.
<p> – marks the beginning of a paragraph
<a> – marks a link; it is followed by the link’s text (and finally by </a> at the end)
<body> – marks the start of the web page’s body
<head> – marks the start of the web page’s header section
and so on.
Why go into such detail? Because virtually any web scraping tool uses these tags as the main landmarks for identifying and extracting the content we are interested in.
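To illustrate how tags serve as landmarks, here is a sketch using only Python’s standard-library `html.parser` (the open-source library covered in the next article makes this easier; the HTML fragment and the `class="price"` attribute are invented for illustration):

```python
from html.parser import HTMLParser

# Hypothetical fragment of a competitor's product page (invented example).
HTML = """
<html>
  <body>
    <p class="product-name">Garden Chair</p>
    <p class="price">24.99</p>
  </body>
</html>
"""

class PriceParser(HTMLParser):
    """Collects the text inside every <p class="price"> tag."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # The tag name and its attributes are exactly the landmarks we target.
        if tag == "p" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_price = False

parser = PriceParser()
parser.feed(HTML)
print(parser.prices)  # ['24.99']
```

The parser walks the tags one by one and keeps only the text found between the opening and closing tags we asked for.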
The technical side of web scraping: Python is the main actor
According to this link, Python is one of the most liked programming languages.
In the next article I will continue with Python and a well-known open-source module (or library) written in Python, explaining how to create and approach a web scraping project aimed at extracting the relevant content from your competitor’s website.