Lessons Learned While Scraping Data From Dynamic Sites for my Regression ML Project

Serdar
5 min read · Jan 11, 2022


Photo Credit: Shutterstock

I wish this article were about how much money I made with “2 lines of code” to scrape a website, or how I built a crawler that has been running for years. Instead, my goal was to scrape salary data from Indeed.com and real estate data from Zillow.com, but because of their dynamic HTML content I was not able to scrape the data successfully, and none of the tutorials available on YouTube or Medium.com were helpful.

My motivation to go after Zillow was to add more ‘features’ to my data, thanks to the school rating info they have. Since the scraped data would be used to build a linear regression model exploring the relationship between school ratings and house prices, Zillow made much more sense.
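For context, the end goal looked roughly like this: a minimal scikit-learn sketch, assuming the scraped listings end up in a DataFrame with hypothetical school_rating and price columns.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical file and column names for the scraped listings
df = pd.read_csv('listings.csv')
X = df[['school_rating']]   # feature: school rating
y = df['price']             # target: house price

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # slope and intercept of the fit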

Another development on the Zillow side is that they shut down their free API roughly a year ago. It is now offered through another platform that you need to be invited to.

Since this had become a challenge I wanted to overcome, I had to find a way to extract at least some data from the site. In this case, I used Apify, which uses Puppeteer on the backend to crawl the page you want to scrape. You can check this video out for detailed usage info.

Warning: One thing I realized is that most, if not all, tutorials on YouTube are more than a year old, so by the time you check them out, the dynamic content of the page they scrape may have already changed. They also mainly tackle static sites, tables, and the like, so most of them are just clickbait at this point.

Agenda

  1. Web Scraping vs Web Crawling
  2. What can you do?
  3. “/robots.txt” thing

My aim in writing this article is not to teach you how to scrape a page, but to help you overcome some of the pitfalls you will likely encounter with some of the most commonly used tools.

1. Web Scraping vs Web Crawling

Essentially, we want to do web scraping, but since I had this confusion myself, I wanted to touch upon the difference. According to Wikipedia:

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web.

It has some elements of web crawling in it as well. But web crawling usually refers to what a search engine does: it is more about indexing pages than bringing up the full content of a web page.

2. What can you do?

By using the three most commonly applied libraries: BeautifulSoup, Requests, and Selenium.

By using the BeautifulSoup library:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = ''  # page you want to scrape
html = urlopen(url)
bs = BeautifulSoup(html, 'html.parser')

# Print every child element of the table with id 'giftList'
for child in bs.find('table', {'id': 'giftList'}).children:
    print(child)
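If you eventually want that table in a DataFrame for the regression work, a minimal sketch like this can follow (assuming the table uses ordinary tr/td rows and that bs is the soup object from the snippet above):

import pandas as pd

rows = []
for tr in bs.find('table', {'id': 'giftList'}).find_all('tr'):
    # Collect the text of every cell in the row
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    if cells:
        rows.append(cells)

# Use the first row as the header
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.head())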

As a natural Selenium user, my first impression of BeautifulSoup was that I did not feel the need for it. But as I started using it in more projects and combining it with Selenium, I realized that something beautiful can come out of it.

By using the Requests library:

import requests
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated

url = 'url you want to scrape'
jsonData = requests.get(url).json()

# Flatten the nested JSON under the 'data' key into a flat table
table = json_normalize(jsonData['data'])

You can check out the Requests documentation here. If the endpoint returns JSON, the .json() call parses it for you; you just need to take it from there.
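Keep in mind that .json() only works when the endpoint actually returns JSON; here is a minimal defensive sketch:

import requests

resp = requests.get(url, timeout=10)
resp.raise_for_status()  # fail loudly on 4xx/5xx responses

if 'application/json' in resp.headers.get('Content-Type', ''):
    jsonData = resp.json()
else:
    print('Endpoint did not return JSON:', resp.headers.get('Content-Type'))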

Or by using Selenium WebDriver:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = ''  # page you want to scrape
driver = webdriver.Chrome(executable_path='./chromedriver.exe')
driver.get(url)

# Wait up to 10 seconds for the target elements to show up in the DOM
rating = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, 'your xpath locator'))
)

Selenium WebDriver is an automation tool that helps you automate browser interactions by targeting the website’s DOM with locators such as CSS selectors, XPath, id, name, and so on. It is a commonly used tool amongst test automation engineers, and the tests written with it can be flaky, because if anything in the DOM changes the tool can no longer locate the element. Above is some boilerplate code you can use to begin your journey. Also, with the ChroPath Chrome extension you can easily locate elements by finding relative and absolute XPaths.
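As a quick illustration of those locator strategies (the locators below are hypothetical; replace them with ones from your target page):

from selenium.webdriver.common.by import By

# Hypothetical locators using the different strategies
driver.find_element(By.ID, 'search-box')
driver.find_element(By.NAME, 'q')
driver.find_element(By.CSS_SELECTOR, 'div.listing > span.price')
driver.find_element(By.XPATH, '//div[@class="listing"]//span[@class="price"]')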

Another issue with the tool is how it deals with AJAX calls. AJAX stands for Asynchronous JavaScript And XML. You can get more information on it here, but to sum up:

AJAX allows web pages to be updated asynchronously by exchanging data with a web server behind the scenes. This means that it is possible to update parts of a web page, without reloading the whole page.

This is also the nightmare of test automation engineers. Since Selenium (unlike Cypress) runs outside the browser, it does not have any control over the AJAX calls. Here the waits come to the rescue. Waits simply let you wait, explicitly or implicitly, until the expected element(s) have loaded.

Sample code:

from selenium.webdriver.support.ui import WebDriverWait

driver.get("file:///race_condition.html")
# Wait until the <p> element rendered by JavaScript is present
el = WebDriverWait(driver, 10).until(lambda d: d.find_element_by_tag_name("p"))
assert el.text == "Hello from JavaScript!"
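For comparison, an implicit wait sets a single global timeout on the driver instead of waiting for a specific condition. A minimal sketch, reusing the driver from above:

from selenium.webdriver.common.by import By

# Poll the DOM for up to 10 seconds on every element lookup
driver.implicitly_wait(10)
driver.get("file:///race_condition.html")
el = driver.find_element(By.TAG_NAME, "p")  # waits up to 10 seconds before failing

Note that the Selenium documentation advises against mixing implicit and explicit waits, as doing so can cause unpredictable wait times.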

Another thing before I forget to mention: I also got banned from Zillow and Realtor while using Selenium, so make sure you add sleep() calls between your requests. An even better option is to use proxies, but be warned that proxies come with security vulnerabilities of their own. Here is a code example that checks whether a proxy works, so you can then use it to run your code against the site.

# Import the required modules
import requests

# A pool of proxies to try (placeholders; fill in real proxy addresses)
proxies = [
    'http://',
    'http://',
    'http://',
]

url = ''  # site you want to reach through the proxies

# Iterate over the proxies and check whether each one works
for proxy in proxies:
    try:
        page = requests.get(url, proxies={"http": proxy, "https": proxy})
        # Prints the page body if the proxy is alive
        print("Status OK, Output:", page.text)
    except OSError as e:
        # Proxy returned a connection error
        print(e)
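And for the sleep() advice above, here is a minimal sketch that adds a randomized pause between requests (the URL list is a placeholder):

import random
import time

import requests

urls = ['']  # placeholder list of pages to fetch
for u in urls:
    page = requests.get(u)
    print(page.status_code)
    # Pause for 2 to 5 seconds so the requests do not hammer the site
    time.sleep(random.uniform(2, 5))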

3. “/robots.txt” thing 🤔

[Screenshot of the robots.txt files of the sites I tried to scrape. Credit: the article writer]

When you append “/robots.txt” to a site’s URL, it tells you what is and is not allowed to be scraped from that website. As you can see from the three websites I tried to scrape, they tend not to allow anything to be scraped?!
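You can also check this programmatically with Python’s standard library before you start scraping; a minimal sketch (the URL and path below are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://www.example.com/robots.txt')  # placeholder site
rp.read()

# Check whether your user agent is allowed to fetch a given path
print(rp.can_fetch('*', 'https://www.example.com/homedetails/'))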

Conclusion

Should you need to work with real estate data, my suggestion is to check out other real estate websites like Century21 or Realtor.com; you may have better luck. As for scraping Indeed.com, my honest feedback is: don’t even try. Indeed.com is a very messy place, and most job postings don’t even share salary info or a range. But if you want to scrape a site like Craigslist, you can find a small example I worked on here. Also, please be mindful of each site’s web scraping rules.

Thanks for reading!
