Code Fellows reading notes

A repository for organizing notes from my learning.

Web Scraping

How to Scrape

Source: Toward Data Science

Web scraping is a technique to scan through and access large amounts of data from a website. The first step is finding a website and some data you want to collect.

The easiest way to find data to scrape is by inspecting a website with the browser.

There are some useful libraries to help with scraping once we find some data:

# libraries from the reading
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

Next we’ll request the page and use BeautifulSoup to pick out the <a> tags that link to the files we want to download.

url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
soup.find_all('a')

# the 38th <a> tag happens to link to the first data file
one_a_tag = soup.find_all('a')[38]
link = one_a_tag['href']

download_url = 'http://web.mta.info/developers/' + link
urllib.request.urlretrieve(download_url, './' + link[link.find('/turnstile_')+1:])

time.sleep(1)

It’s also important to slow the scraper down with something like time.sleep(1). Rapid, repeated requests can get our scraper flagged and blocked, or overload the site if we’re grabbing a particularly large amount of data.
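Putting the pieces together, here’s a minimal sketch of a polite download loop under the same assumptions as the snippets above (the MTA index page and its turnstile_ file links). The names filename_from_link and download_all are just illustrative helpers, not from the reading:

```python
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://web.mta.info/developers/'


def filename_from_link(link):
    # 'data/nyct/turnstile/turnstile_200627.txt' -> 'turnstile_200627.txt'
    return link[link.find('/turnstile_') + 1:]


def download_all(index_url, delay=1):
    """Fetch the index page, then download each linked data file,
    pausing between requests to avoid hammering the server."""
    response = requests.get(index_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for a_tag in soup.find_all('a'):
        link = a_tag.get('href') or ''
        if 'turnstile_' not in link:
            continue  # skip links that aren't data files
        data = requests.get(BASE_URL + link)
        with open(filename_from_link(link), 'wb') as f:
            f.write(data.content)
        time.sleep(delay)  # be polite: at most one request per `delay` seconds
```

Making delay a parameter keeps the rate limit explicit instead of burying a time.sleep(1) at the bottom of a script.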

More on Scraping

Source: Wikipedia

Web scraping has been around since 1993, when the first web robot was created to measure the size of the web.

Since then, the internet has grown considerably, and web scrapers are now used by major search engines to collect and aggregate website data from across the web.

Techniques

Not Getting Blocked

Source: Scrape Hero

Web scraping is a powerful tool, but it often means making many requests in a short period of time. If you don’t want to get blocked, there are some rules to follow:

Fundamental: Be Nice - Respect a website’s crawling policies, starting with its robots.txt.
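Python’s standard library can check those policies for us. Here’s a small sketch using urllib.robotparser with a made-up robots.txt parsed in memory; against a real site you’d call rp.set_url(...) and rp.read() to fetch the actual file:

```python
from urllib import robotparser

# Hypothetical robots.txt contents for illustration; a real scraper would
# fetch the site's own file via rp.set_url(...) and rp.read()
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch('MyScraper', 'https://example.com/data.html'))      # True
print(rp.can_fetch('MyScraper', 'https://example.com/private/x.csv'))  # False
print(rp.crawl_delay('MyScraper'))                                     # 10
```

Checking can_fetch() before every request, and honoring crawl_delay(), covers the “Be Nice” rule with a few lines of stdlib code.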

Additional best practices

How Crawlers Get Spotted

Quick Summary

Do your best not to get banned! Be a good neighbor, but also keep a low profile. Just because a site lets Google crawl it doesn’t necessarily mean it’s inviting any old crawler to do the same.