Web Scraping
How to Scrape
Web scraping is a technique for programmatically collecting large amounts of data from a website. The first step is finding a website with some data you want to collect.
The easiest way to find data to scrape is by inspecting the page with your browser's developer tools.
There are some useful libraries to help with scraping once we find some data:
import requests                # fetch pages over HTTP
import urllib.request          # download files from URLs
import time                    # pause between requests
from bs4 import BeautifulSoup  # parse HTML
Next we'll request the page and use BeautifulSoup to pick out the <a> tags that link to the files we want to scrape.
url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Grab every <a> tag on the page, then pick out one that points at a data file
a_tags = soup.find_all('a')
one_a_tag = a_tags[38]  # index of the first data file link on this page
link = one_a_tag['href']

# Build the full URL and save the file locally
download_url = 'http://web.mta.info/developers/' + link
urllib.request.urlretrieve(download_url, './' + link[link.find('/turnstile_') + 1:])
time.sleep(1)
It's also important to slow our process down with something like time.sleep(1). This avoids making rapid repeated requests, which can get our scraper flagged and blocked, or overload the site if we're grabbing a particularly large amount of data.
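Putting the pieces together, here's a minimal sketch of a polite loop that downloads every data file on the page. It assumes the imports and the soup object from above, and the 'turnstile_' filter is an assumption about how this particular page names its links:

for a_tag in soup.find_all('a'):
    link = a_tag.get('href')
    # Only follow links that look like data files (assumption about this page)
    if link and 'turnstile_' in link:
        download_url = 'http://web.mta.info/developers/' + link
        filename = link[link.find('/turnstile_') + 1:]
        urllib.request.urlretrieve(download_url, './' + filename)
        time.sleep(1)  # pause so we don't hammer the server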
More on Scraping
Web scraping has been around since 1993, when the first web robot was created to measure the size of the web.
Since then, the internet has grown considerably, and web scrapers are now used by major search providers to collect and aggregate website data from across the web.
Techniques
- Human copy-paste - The slowest and most manual way of scraping: literally copying and pasting, but with a human level of discretion.
- Text pattern matching - Using regular expressions or grep to find and extract information (see the sketch after this list)
- HTTP programming - Using HTTP requests to retrieve static or dynamic web pages for parsing
- HTML parsing - Exploiting a site's common page template (its "wrapper") to get at the underlying HTML and extract content
- DOM parsing - Using an embedded browser to render a page and parse its DOM tree
- Vertical aggregation - Platforms that harvest data at scale for a specific industry vertical, with little human intervention
- Semantic annotation recognizing - Using metadata or semantic markup embedded in pages to locate data
- Computer vision web-page analysis - Using machine learning and computer vision to identify information on a page visually, the way a human would
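As a small illustration of text pattern matching, here's a hedged sketch that pulls the turnstile file links out of raw HTML with a regular expression. The pattern is an assumption about how this page formats its links, and regexes on HTML are brittle, so prefer a real parser for anything complex:

import re
import requests

html = requests.get('http://web.mta.info/developers/turnstile.html').text

# The href pattern below is an assumption about this page's link format
links = re.findall(r'href="(data/nyct/turnstile/turnstile_\d+\.txt)"', html)
print(links[:5])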
Not Getting Blocked
Web scraping is a cool tool, but it often involves making many requests in quick succession. If you don't want to get blocked, there are some rules to follow:
Fundamental: Be Nice - Be good and follow a website’s crawling policies.
Additional best practices
- Check and respect robots.txt (see the sketch after this list)
- Make a crawler slower to avoid slamming a server
- Mix up your crawl pattern to appear human
- Make requests through proxies and rotate as needed
- Rotate user agents and request headers
- Use a headless browser
- Watch out for honeypots (traps)
- Check if a website is changing layouts to throw off crawlers
- Don’t scrape behind a login
- Use captcha solving services
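Here's a hedged sketch combining a few of these practices: checking robots.txt with the standard library's urllib.robotparser, rotating through a small pool of user agents, and varying the delay between requests. The user-agent strings and the site URL are placeholders, not recommendations for any particular site:

import random
import time
import urllib.robotparser
import requests

# Check robots.txt before crawling (site URL here is illustrative)
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://web.mta.info/robots.txt')
rp.read()

# A small pool of user agents to rotate through (placeholder strings)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

def polite_get(url):
    """Fetch a URL only if robots.txt allows it, with a rotated UA and a pause."""
    user_agent = random.choice(USER_AGENTS)
    if not rp.can_fetch(user_agent, url):
        return None  # disallowed by the site's crawling policy
    response = requests.get(url, headers={'User-Agent': user_agent})
    time.sleep(random.uniform(1, 3))  # vary the delay to look less robotic
    return response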
How Crawlers Get Spotted
- Unusual traffic or a high download rate
- Repetitive tasks that follow a pattern
- Checking for a real browser
- Honeypots (hyperlinks hidden from humans but visible to a scraper; see the sketch below)
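As a rough illustration of the honeypot problem, here's a hedged sketch that filters out links hidden with inline styles. Real honeypots can also be hidden via CSS classes or off-screen positioning, so this check is only a partial heuristic:

from bs4 import BeautifulSoup

html = '''<a href="/real">Data</a>
<a href="/trap" style="display:none">Hidden</a>'''
soup = BeautifulSoup(html, 'html.parser')

visible_links = []
for a_tag in soup.find_all('a'):
    style = (a_tag.get('style') or '').replace(' ', '')
    # Skip links a human could never see; these are classic honeypots
    if 'display:none' in style or 'visibility:hidden' in style:
        continue
    visible_links.append(a_tag['href'])

print(visible_links)  # ['/real']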
Quick Summary
Do your best not to get banned! Be a good neighbor, but also remember to be sneaky. Just because a site lets Google scrape it doesn't necessarily mean it's inviting any old crawler to do the same.