[Notes] Scrapy-Splash Study Notes

Preface

Notes from learning how to use Scrapy with a Splash service to render JavaScript-heavy pages.

Prerequisites

  • A Splash service is already deployed
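One common way to deploy Splash, assuming Docker is available, is to run the official image; port 8050 below matches the SPLASH_URL used later in settings.py:

```shell
# Run the official Splash image in the background,
# exposing its HTTP API on port 8050
docker run -d -p 8050:8050 scrapinghub/splash
```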

Install dependencies

pip3 install scrapy
pip3 install scrapy-splash

Create the project

scrapy startproject <project_name>
cd <project_name>

Edit the configuration file

Splash service URL

<project_name>/<project_name>/settings.py
SPLASH_URL = 'http://127.0.0.1:8050'

Add the downloader middlewares

<project_name>/<project_name>/settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

Deduplication filter

<project_name>/<project_name>/settings.py
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

Use the Splash-aware HTTP cache

<project_name>/<project_name>/settings.py
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Create the spider

Generate the spider's initial code with a command

scrapy genspider <spider_name> <domain>
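For the example spider used later in these notes (the spider name and domain here are just this tutorial's example values), the command would be:

```shell
scrapy genspider test loli.fj.cn
```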

Data modeling

  • Predefine the fields to be scraped in the items file
<project_name>/<project_name>/items.py
import scrapy


class DemoItem(scrapy.Item):
    # Article title
    title = scrapy.Field()
    # Article content
    content = scrapy.Field()

Define the request targets and handle the responses

  • In the start_requests() method, yield requests built with the scrapy_splash.SplashRequest class back to the engine

wait: how long Splash waits for the page to render before returning the response, in seconds

import scrapy
from demo.items import DemoItem
from scrapy_splash import SplashRequest


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["loli.fj.cn"]
    start_urls = ["https://loli.fj.cn"]

    def start_requests(self):
        yield SplashRequest(self.start_urls[0], callback=self.parse, args={'wait': 10}, endpoint='render.html')

    def parse(self, response, *args, **kwargs):
        # Parse the rendered page with XPath
        node_list = response.xpath('//article')
        # Iterate over the matched nodes
        for node in node_list:
            # Instantiate the item model
            item = DemoItem()
            item['title'] = node.xpath('./header/h2/a/text()').extract_first()
            item['content'] = node.xpath('./div[@itemprop="articleBody"]/p/text()').extract_first()
            # Yield the item to the engine
            yield item
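For reference, the render.html endpoint that SplashRequest targets is a plain HTTP API; a minimal sketch of the equivalent request URL, assuming the local Splash instance from SPLASH_URL:

```python
from urllib.parse import urlencode

# Assumed local Splash instance (same as SPLASH_URL in settings.py)
splash_url = 'http://127.0.0.1:8050'
# url and wait mirror the SplashRequest arguments above
params = {'url': 'https://loli.fj.cn', 'wait': 10}
render_url = f"{splash_url}/render.html?{urlencode(params)}"
print(render_url)
```

Fetching that URL in a browser returns the same rendered HTML that the spider's parse() method receives.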

Done
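With everything above in place, the spider can be run from the project root; -o appends the scraped items to an output file (the filename here is just an example):

```shell
scrapy crawl test -o items.json
```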

References

Bilibili — 莉莉的茉莉花