Web Scraping

Table of Contents

1 Background Research

  • robots.txt
  • sitemap
  • Google search: site:xxx.com

1.1 Identifying the technology

1.1.1 detectem

docker pull scrapinghub/splash
pip install detectem

det http://example.webscraping.com

1.1.2 wappalyzer

docker pull wappalyzer/cli
docker run wappalyzer/cli http://example.webscraping.com

1.1.3 whois

pip install python-whois
from whois import whois
whois("xxx.com")

2 urllib

import urllib.request
urllib.request.urlopen(url).read()

2.1 setting a user agent

user_agent = 'mybot'  # any string identifying your crawler
request = urllib.request.Request(url)
request.add_header('User-agent', user_agent)
urllib.request.urlopen(request).read()

2.2 urlparse

from urllib.parse import urlparse
urlparse(url).netloc

2.3 urljoin

from urllib.parse import urljoin
abs_link = urljoin(start_url, link)

2.4 robots.txt parser

from urllib import robotparser
rp = robotparser.RobotFileParser(robots_url)
rp.read()
rp.can_fetch(useragent, url)
  • useful helper function:

    def get_robots_parser(robots_url):
        """Return the robots parser object using the robots_url."""
        rp = robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        return rp
    

2.5 Supporting proxies

proxy = 'http://myproxy.net:1234' # example string
proxy_handler = urllib.request.ProxyHandler({'http': proxy})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)
# now requests via urllib.request will be handled via proxy

2.6 Avoiding spider traps

Track the depth at which each link is found. When the maximum depth is reached, the crawler does not add links from that page to the queue (see the sketch in the next section).

2.7 Full-feature crawler
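
A minimal sketch tying the pieces above together (user agent header, robots.txt check, depth limit). The link_crawler name and the regex-based link extraction are illustrative assumptions, not a canonical implementation:

import re
import urllib.request
from urllib.parse import urljoin
from urllib import robotparser


def get_robots_parser(robots_url):
    """Return the robots parser object using the robots_url."""
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp


def link_crawler(start_url, link_regex, user_agent='mybot', max_depth=3):
    """Crawl from start_url, following links whose href matches link_regex."""
    rp = get_robots_parser(urljoin(start_url, '/robots.txt'))
    queue = [start_url]
    seen = {start_url: 0}                      # url -> depth; also avoids re-visits
    while queue:
        url = queue.pop()
        if not rp.can_fetch(user_agent, url):  # respect robots.txt
            continue
        request = urllib.request.Request(url, headers={'User-agent': user_agent})
        html = urllib.request.urlopen(request).read().decode('utf-8', errors='ignore')
        depth = seen[url]
        if depth == max_depth:                 # spider-trap protection (section 2.6)
            continue
        for link in re.findall(r'<a[^>]+href=["\'](.*?)["\']', html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    seen[abs_link] = depth + 1
                    queue.append(abs_link)

# e.g. link_crawler('http://example.webscraping.com', '/(index|view)')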

3 Requests

3.1 Basic use

import requests
params = {'firstname': 'Ryan', 'lastname': 'Mitchell'}
r = requests.post("http://pythonscraping.com/files/processing.php", data=params)
print(r.text)

3.2 cookies

import requests
params = {'username': 'Ryan', 'password': 'password'}
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print(r.cookies.get_dict())
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php", cookies=r.cookies)
print(r.text)

3.3 Session

session = requests.Session()
params = {'username': 'username', 'password': 'password'}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print(s.cookies.get_dict())
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)

4 Parsing Tools

  • BeautifulSoup: convenient API; can use lxml or html.parser as its backend
  • lxml: fast
  • html.parser: built-in
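
A quick BeautifulSoup sketch (the HTML snippet is made up; swap 'html.parser' for 'lxml' if speed matters and the lxml package is installed):

from bs4 import BeautifulSoup

html = '<html><body><h1 id="title">Hello</h1><a href="/next">Next</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')  # built-in parser; 'lxml' is faster
print(soup.h1.get_text())      # Hello
print(soup.find('a')['href'])  # /next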

5 Scrapy

docs: https://docs.scrapy.org/en/latest/index.html

5.1 Project Structure

├── scrapy.cfg # deploy configuration file
└── myproject # import your code from here
    ├── __init__.py
    ├── items.py # item definitions: containers for the scraped data
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── spider1.py
        ├── spider2.py
        └── __init__.py
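
For reference, a minimal items.py sketch (the QuoteItem name and its fields are assumptions, chosen to match the quotes spider below):

import scrapy

class QuoteItem(scrapy.Item):
    # Items declare the fields that spiders and pipelines will fill in
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()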

5.2 Spider Example

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # start_urls = [
    #     'http://quotes.toscrape.com/page/1/',
    #     'http://quotes.toscrape.com/page/2/',
    # ]
    # the start_urls attribute is a shortcut to the start_requests method

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):  # response is a TextResponse object
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

5.2.1 Interactive Shell

scrapy shell 'url'
To inspect a response from inside a spider, add:
def parse(self, response):
    from scrapy.shell import inspect_response
    inspect_response(response, self)

5.2.2 CSS selector

response.css('title').extract()
response.css('title::text').extract_first()
response.css('title::text')[0].extract()
response.css('title::text').re(r'Q\w+')

5.2.3 XPath selector

response.xpath('//title')
response.xpath('//title/text()').extract_first()

5.2.4 Following links

next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)

shortcut:

next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse) # supports relative URLs directly

simpler version:

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

simplest version:

for a in response.css('li.next a'): # for <a> elements, response.follow uses their href attribute automatically
    yield response.follow(a, callback=self.parse)

5.3 Feed Export

scrapy crawl spider_name -o some_items.csv -t csv
scrapy crawl spider_name -o some_items.json -t json
scrapy crawl spider_name -o some_items.xml -t xml
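
The output format is inferred from the file extension; -t forces it explicitly. In newer Scrapy versions (2.1+), feeds can also be configured in settings.py through the FEEDS setting (the file names below are just examples):

# settings.py (hypothetical feed configuration)
FEEDS = {
    'some_items.json': {'format': 'json'},
    'some_items.csv': {'format': 'csv'},
}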

5.4 Settings

Settings are applied in the following order of precedence (highest first):

  • Command line (highest precedence)

    scrapy crawl myspider -s LOG_FILE=scrapy.log

  • Settings per spider

    class MySpider(scrapy.Spider):
        name = 'myspider'

        custom_settings = {
            'SOME_SETTING': 'some value',
        }

  • Project settings module: settings.py in the project root
  • Default settings per command: the default_settings attribute of the command class
  • Default global settings (lowest precedence): scrapy.settings.default_settings

Access settings

In a spider, the settings are available through self.settings.

Useful settings
  • DOWNLOAD_DELAY
  • ROBOTSTXT_OBEY
  • USER_AGENT
  • DEFAULT_REQUEST_HEADERS
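
For example, in settings.py (the values below are illustrative placeholders, not recommendations):

# settings.py
DOWNLOAD_DELAY = 2                  # seconds between requests to the same site
ROBOTSTXT_OBEY = True               # respect robots.txt
USER_AGENT = 'mybot (+http://www.example.com)'  # identify your crawler
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}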

5.5 Deploy (Scrapyd)
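
As a hedged sketch: with scrapyd running (it listens on port 6800 by default) and the project deployed with scrapyd-deploy from the scrapyd-client package, a crawl can be scheduled through Scrapyd's schedule.json endpoint; the project and spider names below are placeholders.

import requests

# assumes a local scrapyd instance and an already-deployed project
r = requests.post('http://localhost:6800/schedule.json',
                  data={'project': 'myproject', 'spider': 'quotes'})
print(r.json())  # e.g. {'status': 'ok', 'jobid': '...'}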

6 Selenium

Often used with the headless (non-GUI) engine PhantomJS, which is no longer maintained (a headless Chrome sketch follows below).

from selenium import webdriver
import time

driver = webdriver.PhantomJS(executable_path='<path to PhantomJS>')
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()
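
Since PhantomJS is deprecated in newer Selenium releases, here is a roughly equivalent sketch with headless Chrome (assuming chromedriver is installed and on the PATH):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)  # crude wait; see the explicit wait in 6.1
print(driver.find_element(By.ID, "content").text)
driver.quit()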

6.1 Waiting until fully loaded

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS(executable_path='<path to PhantomJS>')
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    # wait up to 10 seconds for the element with id="loadedButton" to appear
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    print(driver.find_element_by_id("content").text)
    driver.close()

6.2 Selectors

The following locator selection strategies can be used with the By object:

  • ID: Used in the example; finds elements by their HTML id attribute.
  • CLASS_NAME

Used to find elements by their HTML class attribute. Why is this CLASS_NAME and not simply CLASS? Using the form object.CLASS would create problems for Selenium’s Java library, where .class is a reserved method. To keep the Selenium syntax consistent between languages, CLASS_NAME is used instead.

  • CSS_SELECTOR

Find elements by their class, id, or tag name, using the #idName, .className, tagName convention.

  • LINK_TEXT

Finds HTML <a> tags by the text they contain. For example, a link that says “Next” can be selected using (By.LINK_TEXT, "Next").

  • PARTIAL_LINK_TEXT: Similar to LINK_TEXT, but matches on a partial string.
  • NAME: Finds HTML tags by their name attribute. This is handy for HTML forms.
  • TAG_NAME: Finds HTML tags by their tag name.
  • XPATH: Uses an XPath expression (the syntax of which is described in the upcoming sidebar) to select matching elements.
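
A small sketch showing a few of these strategies with the By-based find_element API (it assumes a driver created as above; the selector values are hypothetical):

from selenium.webdriver.common.by import By

driver.find_element(By.ID, "content")
driver.find_element(By.CLASS_NAME, "result")
driver.find_element(By.CSS_SELECTOR, "#content .result")
driver.find_element(By.LINK_TEXT, "Next")
driver.find_element(By.XPATH, "//div[@id='content']")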

6.3 Handling Redirects

We can detect the redirect by “watching” an element in the DOM when the page initially loads, then repeatedly accessing that element until Selenium throws a StaleElementReferenceException; at that point the element is no longer attached to the page's DOM, which means the site has redirected:

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
import time

def waitForLoad(driver):
    elem = driver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 20:
            print("Timing out after 10 seconds and returning")
            return
        time.sleep(.5)
        try:
            # touching the old <html> element raises StaleElementReferenceException
            # once the page has been replaced, i.e. the redirect has happened
            elem == driver.find_element_by_tag_name("html")
        except StaleElementReferenceException:
            return

driver = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
waitForLoad(driver)
print(driver.page_source)

7 Avoiding Scraping Traps

7.1 Adjust Your Headers

User-Agent:Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X)
AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257
Safari/9537.53
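
For example, with requests (a minimal sketch; the User-Agent string above is just one example value copied from a browser):

import requests

headers = {
    'User-Agent': ('Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) '
                   'AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 '
                   'Mobile/11D257 Safari/9537.53'),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}
r = requests.get('http://example.webscraping.com', headers=headers)
print(r.request.headers)  # verify what was actually sent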

7.2 Handling Cookies

7.3 Timing is Everything

7.4 Hidden Input Field Values (or fields hidden by CSS)
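
A minimal sketch for inspecting hidden form fields before submitting (the login URL is hypothetical; a well-behaved bot should echo hidden fields back unchanged and avoid filling in fields that are hidden from human users with CSS):

import requests
from bs4 import BeautifulSoup

html = requests.get('http://example.webscraping.com/login').text
soup = BeautifulSoup(html, 'html.parser')
for field in soup.select('input[type=hidden]'):
    # these values usually need to be sent back in the POST request
    print(field.get('name'), '=', field.get('value'))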

8 Scraping Rule Checklist

If you keep getting blocked by websites and you don’t know why, here’s a checklist you can use to remedy the problem:

  • First, if the page you are receiving from the web server is blank, missing information, or is otherwise not what you expect (or have seen in your own browser), it is likely caused by JavaScript being executed on the site to create the page.
  • If you are submitting a form or making a POST request to a website, check the page to make sure that everything the website is expecting you to submit is being submitted and in the correct format. Use a tool such as Chrome’s Network inspector to view an actual POST command sent to the site to make sure you’ve got everything.
  • If you are trying to log into a site and can’t make the login “stick,” or the website is experiencing other strange “state” behavior, check your cookies. Make sure that cookies are being persisted correctly between each page load and that your cookies are sent to the site for every request.
  • If you are getting HTTP errors from the client, especially 403 Forbidden errors, it might indicate that the website has identified your IP address as a bot and is unwilling to accept any more requests. You will need to either wait until your IP address is removed from the list, or obtain a new IP address. To make sure you don’t get blocked again, try the following:
    • Make sure that your scrapers aren’t moving through the site too quickly. Fast scraping is a bad practice that places a heavy burden on the web administrator’s servers, can land you in legal trouble, and is the number-one cause of scrapers getting blacklisted. Add delays to your scrapers and let them run overnight. Remember: Being in a rush to write programs or gather data is a sign of bad project management; plan ahead to avoid messes like this in the first place.
    • The obvious one: change your headers! Some sites will block anything that advertises itself as a scraper. Copy your own browser’s headers if you’re unsure about what some reasonable header values are.
    • Make sure you’re not clicking on or accessing anything that a human normally would not be able to.
    • If you find yourself jumping through a lot of difficult hoops to gain access, consider contacting the website administrator to let them know what you’re doing. Try emailing webmaster@<domain name> or admin@<domain name> for permission to use your scrapers. Admins are people, too!

9 Tor

sudo apt-get install tor tor-geoipdb
tor Socks5Proxy host:port # make Tor connect to the network through an upstream SOCKS5 proxy

# test
sudo apt-get install torsocks
torsocks curl 'https://api.ipify.org' #or http://icanhazip.com/

9.1 PySocks

pip install pysocks

import socks
import socket
from urllib.request import urlopen

socks.set_default_proxy(socks.SOCKS5, "localhost", 9050)  # Tor's default SOCKS port
socket.socket = socks.socksocket  # route every new socket through the proxy
print(urlopen('https://api.ipify.org').read())
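
Alternatively, requests can talk to Tor's SOCKS proxy directly (a sketch assuming `pip install requests[socks]`; the socks5h scheme resolves DNS through the proxy):

import requests

proxies = {
    'http': 'socks5h://localhost:9050',
    'https': 'socks5h://localhost:9050',
}
print(requests.get('https://api.ipify.org', proxies=proxies).text)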

9.2 polipo

An HTTP proxy that can forward requests through a SOCKS parent proxy; useful for clients that only support HTTP proxies.

sudo apt-get install polipo

add following config to /etc/polipo/config

socksParentProxy = "localhost:1080"
socksProxyType = socks5

default port: 8123
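
With polipo running, plain HTTP clients can be pointed at that port, mirroring the ProxyHandler example from section 2.5 (a sketch assuming the default polipo configuration):

import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://localhost:8123'})
opener = urllib.request.build_opener(proxy_handler)
print(opener.open('http://icanhazip.com/').read())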