Preface
The scrapy-splash library provides Scrapy and JavaScript integration using Splash, and is released under the BSD 3-clause license. (Wikipedia)
Preparation

Start the Splash rendering service with Docker:
```bash
docker run -p 8050:8050 scrapinghub/splash
```
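Once the container is running, you can optionally check that Splash is reachable before wiring it into Scrapy. The request below is a minimal sketch against Splash's render.html endpoint; the target URL https://example.com and the 2-second wait are placeholder values, not part of the original tutorial.

```bash
# Ask the local Splash instance to render a page.
# An HTML response means the service is up; example.com and wait=2 are placeholders.
curl "http://127.0.0.1:8050/render.html?url=https://example.com&wait=2"
```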
Install the dependencies
```bash
pip3 install scrapy
pip3 install scrapy-splash
```
Create the project
```bash
scrapy startproject <project_name>
cd <project_name>
```
Edit the configuration file

Splash service URL
<project_name>/<project_name>/settings.py

```python
SPLASH_URL = 'http://127.0.0.1:8050'
```
Add the downloader middlewares
<project_name>/<project_name>/settings.py

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
```
Deduplication filter
<project_name>/<project_name>/settings.py

```python
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```
Use Splash-aware HTTP caching
<project_name>/<project_name>/settings.py

```python
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```
Create the spider

Generate the initial spider code with the following command:
```bash
scrapy genspider <spider_name> <domain>
```
Data modeling
<project_name>/<project_name>/items.py

```python
import scrapy


class DemoItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
```
Define the request targets and handle the responses

- In the start_requests() method, yield requests to the engine that are built with the scrapy_splash.SplashRequest class (rather than the default scrapy.Request).
- wait: the number of seconds Splash waits for the page to render before returning the response.
```python
import scrapy
from demo.items import DemoItem
from scrapy_splash import SplashRequest


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["loli.fj.cn"]
    start_urls = ["https://loli.fj.cn"]

    def start_requests(self):
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse,
                            args={'wait': 10},
                            endpoint='render.html')

    def parse(self, response, *args, **kwargs):
        # parse the rendered page with XPath
        node_list = response.xpath('//article')
        # iterate over the article nodes
        for node in node_list:
            # instantiate the item model
            item = DemoItem()
            item['title'] = node.xpath('./header/h2/a/text()').extract_first()
            item['content'] = node.xpath('./div[@itemprop="articleBody"]/p/text()').extract_first()
            # hand the item back to the engine
            yield item
```
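With the settings and spider in place, the project can be run from its root directory. The command below is a minimal sketch: test matches the name attribute of the spider above, and items.json is an arbitrary output path for Scrapy's feed export of the yielded items.

```bash
# Run the spider and export the yielded items as JSON.
# "test" is the spider's name attribute; items.json is a placeholder output file.
scrapy crawl test -o items.json
```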
Done