Web Scraping
1 Background Research
- robots.txt
- sitemap (see the sketch after this list)
- Google search: site:xxx.com
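A quick way to use the sitemap is to download it and pull out the URLs listed in its <loc> tags; a minimal sketch (the sitemap URL is only an example):

import re
import urllib.request

def get_sitemap_links(sitemap_url):
    # download the sitemap and return every URL listed in a <loc> tag
    xml = urllib.request.urlopen(sitemap_url).read().decode('utf-8')
    return re.findall(r'<loc>(.*?)</loc>', xml)

# example sitemap URL, matching the site used elsewhere in these notes
links = get_sitemap_links('http://example.webscraping.com/sitemap.xml')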
1.1 Identifying the technology
1.1.1 detectem
docker pull scrapinghub/splash
pip install detectem
det http://example.webscraping.com
1.1.2 wappalyzer
docker pull wappalyzer/cli
docker run wappalyzer/cli http://example.webscraping.com
1.1.3 whois
pip install python-whois
from whois import whois
whois("xxx.com")
2 urllib
import urllib.request
urllib.request.urlopen(url).read()
2.1 setting a user agent
request = urllib.request.Request(url)
request.add_header('User-agent', user_agent)
urllib.request.urlopen(request).read()
2.2 urlparse
from urllib.parse import urlparse
urlparse(url).netloc
2.3 urljoin
from urllib.parse import urljoin
abs_link = urljoin(start_url, link)
2.4 robots.txt parser
from urllib import robotparser
rp = robotparser.RobotFileParser(robots_url)
rp.read()
rp.can_fetch(useragent, url)
Useful helper:
def get_robots_parser(robots_url):
    """Return a robots parser object for the given robots_url."""
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp
2.5 Supporting proxies
proxy = 'http://myproxy.net:1234'  # example string
proxy_handler = urllib.request.ProxyHandler({'http': proxy})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)
# now requests made via urllib.request go through the proxy
2.6 Avoiding spider traps
Track the crawl depth: once a maximum depth is reached, the crawler stops adding links from that page to the queue (see the combined sketch in 2.7 below).
2.7 full feature code
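This section is empty in the original notes; below is one possible sketch that combines the pieces above (robots.txt check, custom user agent, optional proxy support, and a depth limit for trap avoidance). Names such as link_crawler and max_depth are illustrative, not from the original.

import re
import urllib.request
from urllib import robotparser
from urllib.parse import urljoin

def link_crawler(start_url, link_regex, user_agent='wswp', proxy=None, max_depth=4):
    # robots.txt parser for the site
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(start_url, '/robots.txt'))
    rp.read()
    # optional proxy support
    if proxy:
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({'http': proxy, 'https': proxy}))
        urllib.request.install_opener(opener)
    seen = {start_url: 0}            # url -> depth at which it was discovered
    queue = [start_url]
    while queue:
        url = queue.pop(0)
        if not rp.can_fetch(user_agent, url):
            continue
        request = urllib.request.Request(url, headers={'User-agent': user_agent})
        html = urllib.request.urlopen(request).read().decode('utf-8')
        if seen[url] >= max_depth:   # spider-trap avoidance: stop going deeper
            continue
        for link in re.findall(r'<a[^>]+href=["\'](.*?)["\']', html):
            if re.match(link_regex, link):
                abs_link = urljoin(url, link)
                if abs_link not in seen:
                    seen[abs_link] = seen[url] + 1
                    queue.append(abs_link)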
3 Requests
3.1 Basic use
import requests

params = {'firstname': 'Ryan', 'lastname': 'Mitchell'}
r = requests.post("http://pythonscraping.com/files/processing.php", data=params)
print(r.text)
3.2 cookies
import requests

params = {'username': 'Ryan', 'password': 'password'}
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print(r.cookies.get_dict())
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php", cookies=r.cookies)
print(r.text)
3.3 Session
import requests

session = requests.Session()
params = {'username': 'username', 'password': 'password'}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print(s.cookies.get_dict())
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)
4 Parsing Tools
- BeautifulSoup
- lxml: fast
- html.parser: built-in (see the sketch after this list)
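A minimal comparison sketch, assuming beautifulsoup4 and lxml are installed (the sample HTML is made up):

from bs4 import BeautifulSoup
from lxml import html as lxml_html
from html.parser import HTMLParser

sample = '<ul><li class="item">first</li><li class="item">second</li></ul>'

# BeautifulSoup (here with the lxml backend; 'html.parser' also works)
soup = BeautifulSoup(sample, 'lxml')
print([li.text for li in soup.find_all('li', class_='item')])

# lxml directly, via XPath
tree = lxml_html.fromstring(sample)
print(tree.xpath('//li[@class="item"]/text()'))

# the built-in html.parser is lower level: you implement callback methods
class ItemParser(HTMLParser):
    def handle_data(self, data):
        print(data)

ItemParser().feed(sample)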
5 Scrapy
docs: https://docs.scrapy.org/en/latest/index.html
myproject/
├── scrapy.cfg            # deploy configuration file
└── myproject/            # project's Python module; import your code from here
    ├── __init__.py
    ├── items.py          # Item definitions: the structured data you want to scrape
    ├── middlewares.py    # project middlewares
    ├── pipelines.py      # project pipelines
    ├── settings.py       # project settings
    └── spiders/          # directory where your spiders live
        ├── __init__.py
        ├── spider1.py
        └── spider2.py
5.1 Commands
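This section is empty in the original; a few commonly used commands for reference (project and spider names are illustrative):

scrapy startproject myproject                  # create a new project skeleton
scrapy genspider quotes quotes.toscrape.com    # generate a spider from a template
scrapy list                                    # list the spiders in the current project
scrapy crawl quotes                            # run a spider by name
scrapy shell 'http://quotes.toscrape.com'      # open the interactive scraping shell
scrapy runspider standalone_spider.py          # run a self-contained spider file without a project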
5.2 Spider Example
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    # start_urls = [
    #     'http://quotes.toscrape.com/page/1/',
    #     'http://quotes.toscrape.com/page/2/',
    # ]
    # the start_urls attribute is a shortcut for the start_requests method

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # response is a TextResponse object
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
5.2.1 Interactive Shell
scrapy shell 'url'
Inspecting a response from inside a spider:
def parse(self, response):
    from scrapy.shell import inspect_response
    inspect_response(response, self)
5.2.2 CSS selector
response.css('title').extract()
response.css('title::text').extract_first()
response.css('title::text')[0].extract()
response.css('title::text').re(r'Q\w+')
5.2.3 XPath selector
response.xpath('//title')
response.xpath('//title/text()').extract_first()
5.2.4 Following links
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
shortcut:
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    # response.follow supports relative URLs directly
    yield response.follow(next_page, callback=self.parse)
simpler version:
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)
simplest version:
for a in response.css('li.next a'):
    # for <a> elements, response.follow uses their href attribute automatically
    yield response.follow(a, callback=self.parse)
5.3 Feed Export
scrapy crawl spider_name -o some_items.csv -t csv
scrapy crawl spider_name -o some_items.json -t json
scrapy crawl spider_name -o some_items.xml -t xml
(-t is optional: the output format is inferred from the file extension)
5.4 Settings
Command line (highest precedence)
scrapy crawl myspider -s LOG_FILE=scrapy.log
Settings per-spider
class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }
Project settings module
settings.py in project root
Default settings per-command
default_settings attribute of the command class.
Default global settings (lowest precedence)
scrapy.settings.default_settings
Access settings
In a spider, the settings are available through self.settings
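For example (a minimal sketch; the setting name is only an illustration):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        # self.settings is available once the spider is bound to a crawler
        self.log('DOWNLOAD_DELAY is %s' % self.settings.get('DOWNLOAD_DELAY'))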
Useful settings (example values in the sketch after this list)
- DOWNLOAD_DELAY
- ROBOTSTXT_OBEY
- USER_AGENT
- DEFAULT_REQUEST_HEADERS
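A settings.py fragment exercising these (the values are only examples):

# settings.py
DOWNLOAD_DELAY = 2          # seconds to wait between requests to the same site
ROBOTSTXT_OBEY = True       # respect robots.txt
USER_AGENT = 'mybot (+http://www.example.com)'
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}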
5.5 Deploy (Scrapyd)
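The original leaves this section empty; a minimal Scrapyd workflow looks roughly like this (project and spider names are illustrative):

pip install scrapyd scrapyd-client    # the server plus the deploy client
scrapyd                               # start the server (web UI/API on http://localhost:6800)
# point the [deploy] section of scrapy.cfg at the server, then, from the project root:
scrapyd-deploy                        # package the project as an egg and upload it
curl http://localhost:6800/schedule.json -d project=myproject -d spider=quotes   # schedule a run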
6 Selenium
Often used with the headless (non-GUI) engine PhantomJS:
from selenium import webdriver
import time

driver = webdriver.PhantomJS(executable_path='')
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()
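PhantomJS is no longer maintained and recent Selenium releases have dropped support for it; a roughly equivalent sketch using headless Chrome instead (assumes chromedriver is installed and on PATH):

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_argument('--headless')    # run without opening a browser window
driver = webdriver.Chrome(options=options)
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
print(driver.find_element(By.ID, "content").text)
driver.close()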
6.1 Waiting until fully loaded
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS(executable_path='')
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    print(driver.find_element_by_id("content").text)
    driver.close()
6.2 Selectors
The following locator strategies can be used with the By object (a short usage sketch follows the list):
- ID: Used in the example; finds elements by their HTML id attribute.
- CLASS_NAME
Used to find elements by their HTML class attribute. Why is this CLASS_NAME and not simply CLASS? Using the form object.CLASS would create problems for Selenium's Java library, where .class is a reserved method. To keep the Selenium syntax consistent between languages, CLASS_NAME is used instead.
- CSS_SELECTOR
Find elements by their class, id, or tag name, using the #idName, .className, tagName convention.
- LINK_TEXT
Finds HTML <a> tags by the text they contain. For example, a link that says “Next” can be selected using (By.LINK_TEXT, "Next").
- PARTIAL_LINK_TEXT: Similar to LINK_TEXT, but matches on a partial string.
- NAME: Finds HTML tags by their name attribute. This is handy for HTML forms.
- TAG_NAME: Finds HTML tags by their tag name.
- XPATH: Uses an XPath expression to select matching elements.
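A few of these in use, continuing with the driver from the example above (the selectors themselves are illustrative):

from selenium.webdriver.common.by import By

driver.find_element(By.ID, "content")
driver.find_element(By.CLASS_NAME, "gift")
driver.find_element(By.CSS_SELECTOR, "#content .gift")
driver.find_element(By.LINK_TEXT, "Next")
driver.find_element(By.XPATH, "//a[contains(@href, 'page')]")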
6.3 Handling Redirects
We can detect that redirect in a clever way by “watching” an element in the DOM when the page initially loads, then repeatedly calling that element until Selenium throws a StaleElementReferenceException; that is, the element is no longer attached to the page's DOM and the site has redirected:
from selenium import webdriver
import time
from selenium.common.exceptions import StaleElementReferenceException

def waitForLoad(driver):
    elem = driver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 20:
            print("Timing out after 10 seconds and returning")
            return
        time.sleep(.5)
        try:
            # once the page has been replaced, the old element is stale
            elem == driver.find_element_by_tag_name("html")
        except StaleElementReferenceException:
            return

driver = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
waitForLoad(driver)
print(driver.page_source)
7 Avoiding Scraping Traps
7.1 Adjust Your Headers
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53
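With requests, headers are passed per request; a minimal sketch (httpbin.org simply echoes back the headers it receives):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) '
                  'AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 '
                  'Mobile/11D257 Safari/9537.53',
    'Accept-Language': 'en-US,en;q=0.8',
}
r = requests.get('https://httpbin.org/headers', headers=headers)
print(r.text)   # shows the headers the server saw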
7.2 Handling Cookies
7.3 Timing is Everything
7.4 Hidden Input Field Values (or fields hidden by CSS)
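A sketch of the honeypot check with Selenium: hidden inputs and fields made invisible with CSS are there to catch bots, so leave them alone (the page URL is only a hypothetical example):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://example.com/some-form-page')   # hypothetical form page
for field in driver.find_elements(By.TAG_NAME, 'input'):
    if field.get_attribute('type') == 'hidden' or not field.is_displayed():
        # hidden or CSS-invisible field: a human would never fill this in
        print('Do not touch:', field.get_attribute('name'))
driver.close()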
8 Scraping Rule Checklist
If you keep getting blocked by websites and you don't know why, here's a checklist you can use to remedy the problem:
- First, if the page you are receiving from the web server is blank, missing information, or is otherwise not what you expect (or have seen in your own browser), it is likely caused by JavaScript being executed on the site to create the page.
- If you are submitting a form or making a POST request to a website, check the page to make sure that everything the website is expecting you to submit is being submitted and in the correct format. Use a tool such as Chrome’s Network inspector to view an actual POST command sent to the site to make sure you’ve got everything.
- If you are trying to log into a site and can’t make the login “stick,” or the website is experiencing other strange “state” behavior, check your cookies. Make sure that cookies are being persisted correctly between each page load and that your cookies are sent to the site for every request.
- If you are getting HTTP error responses, especially 403 Forbidden errors, it might indicate that the website has identified your IP address as a bot and is unwilling to accept any more requests. You will need to either wait until your IP address is removed from the list, or obtain a new IP address. To make sure you don't get blocked again, try the following:
- Make sure that your scrapers aren’t moving through the site too quickly. Fast scraping is a bad practice that places a heavy burden on the web administrator’s servers, can land you in legal trouble, and is the number-one cause of scrapers getting blacklisted. Add delays to your scrapers and let them run overnight. Remember: Being in a rush to write programs or gather data is a sign of bad project management; plan ahead to avoid messes like this in the first place.
- The obvious one: change your headers! Some sites will block anything that advertises itself as a scraper. Copy your own browser’s headers if you’re unsure about what some reasonable header values are.
- Make sure you’re not clicking on or accessing anything that a human normally would not be able to
- If you find yourself jumping through a lot of difficult hoops to gain access, consider contacting the website administrator to let them know what you’re doing. Try emailing webmaster@<domain name> or admin@<domain name> for permission to use your scrapers. Admins are people, too!
9 Tor
sudo apt-get install tor tor-geoipdb
tor Socks5Proxy host:port        # make Tor connect through an upstream SOCKS5 proxy
# test
sudo apt-get install torsocks
torsocks curl 'https://api.ipify.org'    # or http://icanhazip.com/
9.1 PySocks
import socks
import socket
from urllib.request import urlopen

# route all socket connections through Tor's SOCKS5 listener
socks.set_default_proxy(socks.SOCKS5, "localhost", 9050)
socket.socket = socks.socksocket
print(urlopen('https://api.ipify.org').read())
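requests can go through the same Tor SOCKS port directly if the socks extra is installed (pip install requests[socks]); a sketch:

import requests

proxies = {
    'http': 'socks5h://localhost:9050',   # socks5h also resolves DNS through the proxy
    'https': 'socks5h://localhost:9050',
}
print(requests.get('https://api.ipify.org', proxies=proxies).text)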
9.2 polipo
An HTTP proxy that can forward to a SOCKS parent proxy (useful for clients that only speak HTTP proxies).
sudo apt-get install polipo
Add the following config to /etc/polipo/config:
socksParentProxy = "localhost:1080"
socksProxyType = socks5
Polipo listens on port 8123 by default.
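With polipo running in front of Tor, any client that understands HTTP proxies can then be pointed at port 8123; for example (a sketch):

import requests

proxies = {'http': 'http://localhost:8123', 'https': 'http://localhost:8123'}
print(requests.get('https://api.ipify.org', proxies=proxies).text)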