Preface
Notes from studying Scrapy.
Installing the dependencies
Python
MacOS
Linux
Debian
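A minimal installation sketch, assuming Homebrew on macOS and APT on Debian; Scrapy itself is installed with pip:

```bash
# macOS (assumes Homebrew is already installed)
brew install python3

# Debian
sudo apt install python3 python3-pip

# either platform: install Scrapy
pip3 install scrapy
```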
Interactive shell
<url>: the URL to crawl
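The interactive shell is started with the scrapy shell command:

```bash
scrapy shell <url>
```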
Creating a project
<project_name>: the project name
```bash
scrapy startproject <project_name>
cd <project_name>
```
Modifying the configuration file
robots.txt configuration
True: default; obey robots.txt when collecting data
False: do not obey robots.txt when collecting data
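In Scrapy this switch is the ROBOTSTXT_OBEY setting:

```python
# settings.py
# obey the target site's robots.txt (set to False to ignore it)
ROBOTSTXT_OBEY = True
```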
Request header settings
Setting the default request headers
```python
# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
```
Changing the User-Agent
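The setting is USER_AGENT; the value below is only a placeholder example:

```python
# settings.py
# example value; replace with the User-Agent string you want Scrapy to send
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
```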
Enabled pipelines
Loading pipeline classes
```python
# settings.py
ITEM_PIPELINES = {
    "demo.pipelines.DemoPipeline": 300,
}
```
Loading middleware
Defining the enabled spider middleware
```python
# settings.py
SPIDER_MIDDLEWARES = {
    "demo.middlewares.DemoSpiderMiddleware": 543,
}
```
Defining the enabled downloader middleware
```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    "demo.middlewares.DemoDownloaderMiddleware": 543,
}
```
Cookie settings
Whether cookies are saved and resent automatically
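This is controlled by the COOKIES_ENABLED setting (True by default):

```python
# settings.py
# keep and resend cookies automatically between requests
COOKIES_ENABLED = True
```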
Whether the cookies being used are written to the log
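This is controlled by the COOKIES_DEBUG setting (False by default):

```python
# settings.py
# log the cookies sent with requests and received in responses
COOKIES_DEBUG = True
```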
Logging settings
Setting the log level
LOG_LEVEL: the log level
DEBUG: default; debugging messages
INFO: informational messages
WARNING: warnings
ERROR: errors
CRITICAL: critical errors
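For example, to silence everything below warnings:

```python
# settings.py
LOG_LEVEL = "WARNING"
```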
Setting where the log file is saved
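The corresponding setting is LOG_FILE, which takes a file path; the path below is only an example:

```python
# settings.py
# write the log to this file instead of the console
LOG_FILE = "./log/scrapy.log"
```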
Concurrency settings
Maximum number of concurrent requests
```python
CONCURRENT_REQUESTS = 16
```
Maximum number of concurrent requests per domain
```python
CONCURRENT_REQUESTS_PER_DOMAIN = 16
```
Maximum number of concurrent requests per IP address
```python
CONCURRENT_REQUESTS_PER_IP = 16
```
Delay between requests
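The delay is set with DOWNLOAD_DELAY, in seconds:

```python
# settings.py
# wait 3 seconds between requests to the same site
DOWNLOAD_DELAY = 3
```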
Creating a spider
Basic spider
Generating the initial spider code from the command line
<spider_name>: the name of the spider
<domain>: the domain allowed to be crawled
-t <template>: the template to use
basic: default; a plain spider
crawl: a crawl spider
```bash
scrapy genspider <spider_name> <domain>
```
- This creates a <spider_name>.py file in the <project_name>/<project_name>/spiders directory
Data modeling
```python
# items.py
import scrapy


class DemoItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
```
Defining the crawl targets and handling the responses
- Import the custom item model to be scraped
- Handle the returned response in the parse() method
- Scrapy's wrapped xpath() method returns Selector objects; use the extract() method to get list data, or the extract_first() method to get a single value
extract() is used to parse a list
extract_first() is used to parse a single element
- If data is returned from a list, only the first element is returned
- If data is returned from an empty list, None is returned
- Return the data with the yield keyword
allowed_domains: the domains that are allowed to be crawled
start_urls: the start URLs
```python
# spiders/<spider_name>.py
import scrapy

from <project_name>.items import DemoItem


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["loli.fj.cn"]
    start_urls = ["https://loli.fj.cn"]

    def parse(self, response, *args, **kwargs):
        node_list = response.xpath('//article')
        for node in node_list:
            item = DemoItem()
            item["title"] = node.xpath('./header/h2/a/text()').extract_first()
            item["content"] = node.xpath('./div[@itemprop="articleBody"]/p/text()').extract_first()
            yield item
```
Simulating pagination
- Yield a scrapy.Request object to send a GET request, which is how pagination is simulated
```python
import scrapy

from <project_name>.items import DemoItem


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["loli.fj.cn"]
    start_urls = ["https://loli.fj.cn/"]

    def parse(self, response, *args, **kwargs):
        node_list = response.xpath('//article')
        for node in node_list:
            item = DemoItem()
            item["title"] = node.xpath('./header/h2/a/text()').extract_first()
            item["content"] = node.xpath('./div[@itemprop="articleBody"]/p/text()').extract_first()
            yield item

        # request the next page if the last pagination link carries rel="next"
        last_a = response.xpath('/html/body/main/div[2]/nav/a[last()]')
        if last_a.xpath('./@rel').extract_first() == "next":
            next_url = response.urljoin(last_a.xpath('./@href').extract_first())
            yield scrapy.Request(
                url=next_url,
                callback=self.parse
            )
```
Arguments of the scrapy.Request constructor (see the sketch after this list)
url: the URL to request
callback: the callback that handles the returned response
self.parse: default; the current method handles the response (effectively a recursive operation)
meta={"<key>": "<value>"}: data passed to the callback, so it can be read back inside the callback
dont_filter: whether duplicate URLs are filtered
False: default; duplicate URLs are filtered out
True: duplicate URLs are not filtered
method: the request method
GET: default; a GET request
POST: a POST request
headers={}: the request headers
cookies={}: the cookies
body="<json>": the request body
Passing parameters between pages
- Read the parameters passed in by the caller from the response object's meta attribute
- Do not use the framework's reserved keywords as meta keys
```python
import scrapy

from <project_name>.items import DemoItem


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["loli.fj.cn"]
    start_urls = ["https://loli.fj.cn/"]

    def parse(self, response, *args, **kwargs):
        node_list = response.xpath('//article')
        for node in node_list:
            item = DemoItem()
            item["title"] = node.xpath('./header/h2/a/text()').extract_first()
            # follow the article's detail page and pass the partly filled item along via meta
            inner_a = response.urljoin(node.xpath('./header/h2/a/@href').extract_first())
            print(inner_a)
            yield scrapy.Request(
                url=inner_a,
                callback=self.parse_inner,
                meta={"item": item}
            )

    def parse_inner(self, response, *args, **kwargs):
        item = response.meta["item"]
        item["content"] = response.xpath('//article/div[@itemprop="articleBody"]/p/text()').extract_first()
        yield item
```
Carrying cookies
- Override the start_requests() method and pass the cookies when building the scrapy.Request object
```python
import scrapy

from <project_name>.items import DemoItem


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["loli.fj.cn"]
    start_urls = ["https://loli.fj.cn"]

    def start_requests(self):
        url = self.start_urls[0]
        # turn a raw cookie string into the dict that scrapy.Request expects
        cookies_str = "key1=value; key2=value"
        cookies_dic = {item.split("=")[0]: item.split("=")[-1] for item in cookies_str.split("; ")}
        yield scrapy.Request(
            url=url,
            callback=self.parse,
            cookies=cookies_dic,
        )

    def parse(self, response, *args, **kwargs):
        node_list = response.xpath('//article')
        for node in node_list:
            item = DemoItem()
            item["title"] = node.xpath('./header/h2/a/text()').extract_first()
            item["content"] = node.xpath('./div[@itemprop="articleBody"]/p/text()').extract_first()
            yield item
```
Sending POST requests
- Yield a scrapy.FormRequest object to send a POST request; the parameters are sent in x-www-form-urlencoded format
```python
import scrapy

from <project_name>.items import DemoItem


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["loli.fj.cn"]
    start_urls = ["https://loli.fj.cn"]

    def parse(self, response, *args, **kwargs):
        yield scrapy.FormRequest(
            url="",
            callback=self.login,
            formdata={
                "username": "",
                "password": ""
            }
        )

    def login(self, response, *args, **kwargs):
        pass
```
Crawl spider
- A crawl spider extracts links quickly through rules
- A crawl spider can only scrape data from a single page; it cannot combine data across multiple URLs. If data has to be gathered across URLs, use a basic spider instead
- Do not override the parse() method in a crawl spider
Generating the initial spider code from the command line
```bash
scrapy genspider <spider_name> <domain> -t crawl
```
Data modeling
```python
# items.py
import scrapy


class DemoItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
```
Defining the crawl targets and handling the responses
allow=r"": the regular expression that extracted URLs must match
callback="": the name of the callback that handles the response once an extracted URL has been requested
follow="": whether to operate recursively
False: default; do not recurse
True: recurse; keep applying the current rules to match URLs on the newly fetched pages
```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from demo.items import DemoItem


class TestSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["loli.fj.cn"]
    start_urls = ["https://loli.fj.cn"]

    rules = (
        # article detail pages: parse them with parse_item
        Rule(LinkExtractor(allow=r"/\d{4}/\d{2}/\d{2}/.*?"), callback="parse_item", follow=False),
        # pagination pages: follow them so their links are extracted as well
        Rule(LinkExtractor(allow=r"/page/\d+/"), follow=True),
    )

    def parse_item(self, response):
        item = DemoItem()
        item['title'] = response.xpath('//article').xpath('./header/h1/text()').extract_first().strip()
        item['content'] = response.xpath('//article').xpath('./div[@itemprop="articleBody"]/p/text()').extract_first()
        return item
```
Saving data with pipelines
Defining how data is saved in the pipelines file
open_spider(): runs when the spider starts
close_spider(): runs when the spider finishes
```python
# pipelines.py
import json


class DemoPipeline:

    def open_spider(self, spider):
        self.file = open("data.json", "w")

    def process_item(self, item, spider):
        item = dict(item)
        json_data = json.dumps(item, ensure_ascii=False)
        self.file.write(f"{json_data},\n")
        return item

    def close_spider(self, spider):
        self.file.close()
```
Processing only a specific spider's data in a pipeline
```python
# pipelines.py
class DemoPipeline:

    def open_spider(self, spider):
        if spider.name == "test":
            pass

    def process_item(self, item, spider):
        if spider.name == "test":
            pass
        return item

    def close_spider(self, spider):
        if spider.name == "test":
            pass
```
Defining multiple pipelines
```python
# pipelines.py
class DemoPipelineForFile:

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        return item

    def close_spider(self, spider):
        pass


class DemoPipelineForDatabase:

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        return item

    def close_spider(self, spider):
        pass
```
Enabling pipeline classes in the settings file
- Uncomment lines 65-67 of the generated settings.py and configure the fully qualified names of the pipeline classes to enable
demo.pipelines.DemoPipeline: the pipeline class's fully qualified name, made up of the package name, module name, and class name
300: the weight; smaller values mean higher priority, and keeping values below 1000 is recommended
```python
# settings.py
ITEM_PIPELINES = {
    "demo.pipelines.DemoPipeline": 300,
}
```
- After running, the enabled pipeline classes are listed in the log under INFO: Enabled item pipelines:; if no pipeline class is enabled, [] is shown
Enabling multiple pipeline classes
```python
# settings.py
ITEM_PIPELINES = {
    "demo.pipelines.DemoPipelineForFile": 300,
    "demo.pipelines.DemoPipelineForDatabase": 301,
}
```
Running a specific spider
--nolog: suppress log output (see the second example below)
```bash
scrapy crawl <spider_name>
```
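For example, to run the same spider without any log output:

```bash
scrapy crawl <spider_name> --nolog
```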
Response object attributes
Getting the current URL
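This is the response's url attribute:

```python
response.url
```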
Getting the URL of the request that produced the response
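This is exposed as:

```python
response.request.url
```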
Getting the response headers
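These are available as:

```python
response.headers
```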
Getting the request headers of the request that produced the response
```python
response.request.headers
```
Getting the response body
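The raw body bytes are available as:

```python
response.body
```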
Getting the response status code
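The status code is available as:

```python
response.status
```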
Response object methods
Joining a relative URL path with the response's URL to form a complete URL
```python
response.urljoin("/index.html")
# -> 'https://loli.fj.cn/index.html'
```
Middleware
- In the middlewares.py file, override the middleware class's process_request() method (request middleware) and process_response() method (response middleware)
Downloader middleware
process_request()
- If the downloader middleware returns None, or returns nothing, the next middleware runs; once all middlewares have run, the downloader handles the request
- If the downloader middleware returns a Request object, the request is handed to the scheduler
- If the downloader middleware returns a Response object, the Response object is handed straight to the spider
process_response()
- If the downloader middleware returns a Request object, the request is handed to the scheduler
- If the downloader middleware returns a Response object, the Response object is handed straight to the spider
Implementing a random User-Agent with a downloader middleware
```python
# middlewares.py
import random

# fill this with the User-Agent strings to rotate through
UA_LIST = ()


class RandomUserAgentMiddleware:

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(UA_LIST)
```
Implementing a random proxy with a downloader middleware
```python
# middlewares.py
import base64
import random

# candidate proxies; an entry may optionally carry credentials
PROXY_LIST = (
    {"ip": "", "port": ""},
    {"ip": "", "port": "", "username": "", "password": ""},
)


class RandomProxyMiddleware:

    def process_request(self, request, spider):
        proxy = random.choice(PROXY_LIST)
        if "username" in proxy:
            auth = base64.b64encode(f'{proxy["username"]}:{proxy["password"]}'.encode()).decode()
            request.headers["Proxy-Authorization"] = f'Basic {auth}'
        request.meta["proxy"] = f'{proxy["ip"]}:{proxy["port"]}'
```
Scraping dynamic pages by combining a downloader middleware with Selenium
```python
# middlewares.py
import time

from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware:

    def process_request(self, request, spider):
        driver = webdriver.Chrome()
        driver.get(request.url)
        time.sleep(3)  # wait for the JavaScript-rendered content to load
        body = driver.page_source
        driver.close()
        # returning a Response here skips the downloader and hands the page to the spider
        return HtmlResponse(url=request.url, body=body, encoding='utf-8', request=request)
```
Enabling middleware classes in the settings file
- Uncomment lines 53-55 of the generated settings.py and configure the fully qualified names of the middleware classes to enable
```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    "demo.middlewares.RandomUserAgentMiddleware": 543,
    "demo.middlewares.RandomProxyMiddleware": 543,
    "demo.middlewares.SeleniumMiddleware": 543,
}
```
Done
References
Bilibili: 莉莉的茉莉花