Web scraping, in essence, is the process of extracting data from websites using automated software. The web-scraping pipeline goes as follows:
- Setup - Understand what we want to do and find sources to help us do it.
- Acquisition
  - Access and read in the raw data from online sources.
  - Parse this information.
  - Extract and format the data into meaningful and useful data structures.
- Processing - There are many options here, but the main goal is to run the downloaded data through whatever analyses or processes are needed to achieve the desired goal.
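To make the pipeline concrete, here is a minimal sketch of the acquisition, parsing, and processing steps using only the Python standard library. The URL and the data it extracts are placeholders for illustration, not one of the project's actual sources; the real work is done with the tools described below.

```python
from urllib.request import urlopen
from html.parser import HTMLParser

# Acquisition: read in the raw HTML from a (placeholder) web page.
raw_html = urlopen("https://example.com").read().decode("utf-8")

# Parsing/extraction: pull every link target into a usable data structure.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkExtractor()
parser.feed(raw_html)

# Processing: run whatever analysis is needed -- here, just a simple count.
print(f"Found {len(parser.links)} links:", parser.links)
```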
There are many web scraping tools available, but two of the ones that I've had the most success with are Scrapy and Selenium.
Scrapy is a Python package that is primarily used for web scraping. Scrapy provides a framework that offers a simple web crawling tool to extract data from websites. These tools are commonly known as spiders and allow for easy extraction of desired information. A spider program can crawl through websites and extract data from them (just like real spiders crawl through their webs). Scrapy works by sending HTTP requests to websites and receiving HTML responses; the spider then parses the HTML to extract the data it needs (a minimal spider is sketched after the pros and cons below). These are some of the benefits of using Scrapy that I've found:
- Scrapy is designed to be fast and can handle large volumes of data. It's built on top of Twisted, an asynchronous networking framework, which lets it keep many requests in flight at once instead of waiting on each response. It can handle a large number of concurrent requests without slowing down, which makes it really handy for the scope of this project.
- Scrapy is highly customizable and can be configured to work with many different websites and data formats, allowing us to specify how the spider should behave when navigating through websites and which data it should collect. Scrapy also provides features for handling cookies, redirects, and other HTTP features.
- Scrapy comes with built-in support for XPath and CSS selectors. With the right skill set these tools provide a powerful and easy way to extract data from HTML and XML documents.
- Scrapy comes with built-in support for data cleaning, making it much easier to clean and normalize scraped data.
As for the cons of Scrapy:
- Scrapy does have a relatively steep learning curve, especially if you are new to web scraping, XPath/CSS selectors, or asynchronous programming. Unless you have more than a week to spend learning and doing web scraping, it is best to spend that time on another method. Not to mention that you need some knowledge of Python to use it effectively.
- Scrapy does not have built-in support for JavaScript, which means it cannot scrape websites and other dynamic web content that rely heavily on JavaScript to render content, something a vast number of websites rely on to function.
- This tool is neither perfect nor undetectable. Some websites may block Scrapy from accessing their content, especially if they detect that the traffic is coming from a bot.
- Scrapy does help you clean data, but it is not perfect. You still need to spend some time cleaning the collected data, and you may need additional libraries, such as Pandas or NumPy, to process and analyze the scraped data.
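To give a sense of what a spider looks like, here is a minimal sketch. It targets quotes.toscrape.com, a public practice site rather than one of this project's sources, and the field names are illustrative only.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each div.quote block holds one quote; CSS and XPath selectors
        # both work for pulling out the individual fields.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.xpath(".//small[@class='author']/text()").get(),
                "tags": quote.css("a.tag::text").getall(),
            }
        # Follow the pagination link, if there is one, and parse that page too.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Running something like `scrapy runspider quotes_spider.py -o quotes.json` would crawl the site and write the extracted items out to a JSON file.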
Selenium is another Python package, which is used for automating web browsers. It allows you to control a browser programmatically, mimicking the actions a user would take. Selenium uses a web driver to control your web browser and can interact with several different browsers (such as Chrome, Firefox, and Safari). Selenium works by opening a web browser and navigating to the target website. The user can then interact with the website using Selenium commands. Just like Scrapy, Selenium can also extract data from the website using XPath and CSS selectors (a short sketch follows the pros and cons below). As for some of the benefits of using Selenium:
- Selenium has the ability to handle JavaScript and other dynamic web content, allowing it to handle websites that Scrapy is unable to.
- Selenium can handle more complex interactions with websites, such as clicking buttons, filling out forms, scrolling, etc., which can simulate user behavior. This makes Selenium more versatile and a powerful tool for specific tasks that require a more human-like approach.
As for the cons of Selenium:
- Selenium can be really slow and very resource-intensive, especially when automating complex tasks or interacting with multiple pages. It requires your web browser to be open to run, meaning that if your internet or computer is slow it has a really hard time doing anything you want it to. The time you have to wait for the program to perform even simple tasks is not ideal when working through large data sets.
- It is prone to errors and requires extensive debugging and testing. For example, if you close your browser or it freezes for a long period of time, the program will not function correctly and will throw an error.
- It can be very challenging to set up and configure. Not to mention that it can be difficult to maintain scripts when the web application is updated or changed, meaning that if you are not dedicated to learning and maintaining your code it will come with a bunch of headaches down the road.
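As a rough illustration, here is a minimal Selenium sketch along the lines described above. It uses the JavaScript-rendered version of the same practice site as the Scrapy example (not one of this project's actual sources) and assumes Chrome and its web driver are available on the machine.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Launch a Chrome session through its web driver.
driver = webdriver.Chrome()
try:
    driver.get("https://quotes.toscrape.com/js/")  # JavaScript-rendered page

    # Wait until the dynamically rendered quotes appear before scraping them.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.quote"))
    )

    # Extract text with the same kind of CSS selectors Scrapy uses.
    for quote in driver.find_elements(By.CSS_SELECTOR, "div.quote"):
        text = quote.find_element(By.CSS_SELECTOR, "span.text").text
        author = quote.find_element(By.CSS_SELECTOR, "small.author").text
        print(author, text)

    # Simulate a user clicking the "Next" pagination button.
    next_button = driver.find_element(By.CSS_SELECTOR, "li.next a")
    next_button.click()
finally:
    driver.quit()
```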
Ideally for this project I strongly recommend using Scrapy spiders wherever possible. Scrapy's speed, efficiency, and overall better performance make building spiders a much more tolerable part of the process. However, when it comes to more complex tasks and websites, Selenium is the way to go. I would also like to acknowledge that there are other tools and resources that could have been explored and used which may be better or more efficient for the scope of the project. These two tools are just some that I found and was able to implement in the time frame I was given. For example, there is a Python package called Scrapy Splash. It works as an extension for Scrapy that provides a JavaScript rendering service, allowing Scrapy to handle JavaScript and other dynamic web content, similar to Selenium. I was not able to master this tool, nor could I successfully get it to work. If I had more time to develop a clear comprehension of this utility and fully understand how to get it to work, it would have been the tool of choice to handle such websites. I used Selenium since I had more success in getting it to work in comparison.
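For reference, this is roughly the shape a Scrapy Splash spider would take. I never got this working myself, so treat it purely as a sketch: it assumes a Splash instance is running (for example via Docker) and that the scrapy-splash middleware is enabled in the project settings.

```python
import scrapy
from scrapy_splash import SplashRequest

class JsQuotesSpider(scrapy.Spider):
    name = "js_quotes"

    def start_requests(self):
        # Render the page through Splash so the JavaScript content exists
        # by the time the response reaches the spider.
        yield SplashRequest(
            "https://quotes.toscrape.com/js/",
            callback=self.parse,
            args={"wait": 1.0},  # give the page a moment to finish rendering
        )

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```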