Web Scraping
How to Scrape
Web scraping is a technique for programmatically collecting large amounts of data from a website. The first step is finding a website with some data you want to collect.
The easiest way to find data to scrape is by inspecting the page with your browser's developer tools.
There are some useful libraries to help with scraping once we find some data:
import requests                # fetch pages over HTTP
import urllib.request          # download files from URLs
import time                    # pause between requests
from bs4 import BeautifulSoup  # parse HTML
Next we'll request the page and use BeautifulSoup to pick out the <a> tags that link to the files we want to scrape.
url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Grab every <a> tag on the page, then pick out one that points at a data file
a_tags = soup.find_all('a')
one_a_tag = a_tags[38]  # index of the first data file link on this page
link = one_a_tag['href']

# Build the full URL and save the file locally
download_url = 'http://web.mta.info/developers/' + link
urllib.request.urlretrieve(download_url, './' + link[link.find('/turnstile_') + 1:])
time.sleep(1)
It's also important to slow our process down with something like time.sleep(1). This avoids making rapid repeated requests, which can get our scraper flagged and blocked, or overload the site if we're grabbing a particularly large amount of data.
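Putting the pieces together, here's a minimal sketch of a polite loop that downloads every data file on the page. It assumes the imports and the soup object from above, and the 'turnstile_' filter is an assumption about how this particular page names its links:

for a_tag in soup.find_all('a'):
    link = a_tag.get('href')
    # Only follow links that look like data files (assumption about this page)
    if link and 'turnstile_' in link:
        download_url = 'http://web.mta.info/developers/' + link
        filename = link[link.find('/turnstile_') + 1:]
        urllib.request.urlretrieve(download_url, './' + filename)
        time.sleep(1)  # pause so we don't hammer the server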
More on Scraping
Web scraping has been around since 1993, when the first web robot was created to measure the size of the web.
Since then, the internet has grown considerably, and web scrapers are now used by major search providers to collect and aggregate website data from across the web.
Techniques
- Human copy-paste - The slowest and most manual way of scraping: literally copying and pasting, but with a human level of discretion.
- Text pattern matching - Using regular expressions or grep to find and extract information (see the sketch after this list)
- HTTP programming - Using HTTP requests to retrieve static or dynamic web pages for parsing
- HTML parsing - Exploiting a site's common page template (its "wrapper") to get at the underlying HTML and extract content
- DOM parsing - Using an embedded browser to render a page and parse its DOM tree
- Vertical aggregation - Platforms that harvest data at scale for a specific industry vertical, with little human intervention
- Semantic annotation recognizing - Using metadata or semantic markup embedded in pages to locate data
- Computer vision web-page analysis - Using machine learning and computer vision to identify information on a page visually, the way a human would
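As a small illustration of text pattern matching, here's a hedged sketch that pulls the turnstile file links out of raw HTML with a regular expression. The pattern is an assumption about how this page formats its links, and regexes on HTML are brittle, so prefer a real parser for anything complex:

import re
import requests

html = requests.get('http://web.mta.info/developers/turnstile.html').text

# The href pattern below is an assumption about this page's link format
links = re.findall(r'href="(data/nyct/turnstile/turnstile_\d+\.txt)"', html)
print(links[:5])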
Not Getting Blocked
Web scraping is a cool tool, but it often involves making many requests in quick succession. If you don't want to get blocked, there are some rules to follow:
Fundamental: Be Nice - Be good and follow a website’s crawling policies.
Additional best practices
- Check and respect robots.txt (see the sketch after this list)
- Make a crawler slower to avoid slamming a server
- Mix up your crawl pattern to appear human
- Make requests through proxies and rotate as needed
- Rotate user agents and request headers
- Use a headless browser
- Watch out for honeypots (traps)
- Check if a website is changing layouts to throw off crawlers
- Don’t scrape behind a login
- Use captcha solving services
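Here's a hedged sketch combining a few of these practices: checking robots.txt with the standard library's urllib.robotparser, rotating through a small pool of user agents, and varying the delay between requests. The user-agent strings and the site URL are placeholders, not recommendations for any particular site:

import random
import time
import urllib.robotparser
import requests

# Check robots.txt before crawling (site URL here is illustrative)
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://web.mta.info/robots.txt')
rp.read()

# A small pool of user agents to rotate through (placeholder strings)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

def polite_get(url):
    """Fetch a URL only if robots.txt allows it, with a rotated UA and a pause."""
    user_agent = random.choice(USER_AGENTS)
    if not rp.can_fetch(user_agent, url):
        return None  # disallowed by the site's crawling policy
    response = requests.get(url, headers={'User-Agent': user_agent})
    time.sleep(random.uniform(1, 3))  # vary the delay to look less robotic
    return response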
How Crawlers Get Spotted
- Unusual traffic or a high download rate
- Repetitive tasks that follow a pattern
- Checking for a real browser
- Honeypots (hyperlinks hidden from humans but visible to a scraper; see the sketch below)
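As a rough illustration of the honeypot problem, here's a hedged sketch that filters out links hidden with inline styles. Real honeypots can also be hidden via CSS classes or off-screen positioning, so this check is only a partial heuristic:

from bs4 import BeautifulSoup

html = '''<a href="/real">Data</a>
<a href="/trap" style="display:none">Hidden</a>'''
soup = BeautifulSoup(html, 'html.parser')

visible_links = []
for a_tag in soup.find_all('a'):
    style = (a_tag.get('style') or '').replace(' ', '')
    # Skip links a human could never see; these are classic honeypots
    if 'display:none' in style or 'visibility:hidden' in style:
        continue
    visible_links.append(a_tag['href'])

print(visible_links)  # ['/real']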
Quick Summary
Do your best not to get banned! Be a good neighbor, but also remember to be sneaky. Just because a site lets Google scrape it doesn't necessarily mean it's inviting any old crawler to do the same.