Preface
Study notes on distributed crawling with Scrapy and scrapy-redis.
Download the project
```bash
git clone https://github.com/rolando/scrapy-redis.git
cd scrapy-redis/example-project
```
Configure the settings specific to the distributed crawler
Set the duplicate filter module
example-project/example/settings.py
```python
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
```
Set the scheduler
example-project/example/settings.py
```python
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
```
Set whether the deduplication set and request queue in Redis are kept when the spider finishes
example-project/example/settings.py
```python
SCHEDULER_PERSIST = True
```
Set the item pipeline that writes scraped items to Redis
example-project/example/settings.py
```python
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 400,
}
```
Set the Redis connection address, either as a single URL:
example-project/example/settings.py
```python
REDIS_URL = "redis://127.0.0.1:6379"
```
or as a separate host and port:
example-project/example/settings.py
```python
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
```
Start a spider that can resume an interrupted crawl
<name>: the spider name
With SCHEDULER_PERSIST enabled, launch the spider with `scrapy crawl <name>`; if it is stopped and started again, it resumes from the request queue and deduplication set left in Redis.
Start the distributed crawler
Define the spider
Basic
redis_key: the Redis key that holds the start URLs
example-project/example/spiders/myspider_redis.py
```python
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    name = "myspider_redis"
    redis_key = "start_urls"

    def __init__(self, *args, **kwargs):
        domain = kwargs.pop("domain", "")
        self.allowed_domains = list(filter(None, domain.split(",")))
        super().__init__(*args, **kwargs)

    def parse(self, response):
        return {
            "name": response.css("title::text").extract_first(),
            "url": response.url,
        }
```
Crawl
example-project/example/spiders/mycrawler_redis.py
```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

from scrapy_redis.spiders import RedisCrawlSpider


class MyCrawler(RedisCrawlSpider):
    name = "mycrawler_redis"
    redis_key = "start_urls"

    rules = (
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def __init__(self, *args, **kwargs):
        domain = kwargs.pop("domain", "")
        self.allowed_domains = list(filter(None, domain.split(",")))
        super().__init__(*args, **kwargs)

    def parse_page(self, response):
        return {
            "name": response.css("title::text").extract_first(),
            "url": response.url,
        }
```
Define the start URLs in Redis
<url>: a start URL
```
127.0.0.1:6379> lpush start_urls <url>
```
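The seed URL can also be pushed from Python with the redis-py client. A minimal sketch, assuming the Redis address from settings.py above and a key matching the spider's redis_key:

```python
import redis

# Connect to the same Redis instance configured in settings.py.
r = redis.Redis(host="127.0.0.1", port=6379)

# Push a seed URL onto the list named by the spider's redis_key
# ("start_urls" in the examples above). Idle spiders pick it up and start crawling.
r.lpush("start_urls", "https://example.com")
```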
Run the distributed crawler
<file>.py: the spider file
domain="": the allowed domains, separated by commas; pass it as a spider argument, e.g. -a domain="example.com,example.org"
```bash
scrapy runspider <file>.py
```
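Every worker started this way pulls requests from the same Redis queue, and RedisPipeline pushes the scraped items back into Redis. A quick way to inspect them from Python, a sketch assuming the pipeline's default items key of the form `<spider name>:items`:

```python
import json

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# RedisPipeline serializes items to JSON and pushes them onto a list; by default
# the key is "<spider name>:items", e.g. "myspider_redis:items" for the Basic example.
for raw in r.lrange("myspider_redis:items", 0, 9):
    print(json.loads(raw))
```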
Convert an ordinary spider into a distributed spider
Modify the spider class
Import the distributed spider class
```python
from scrapy_redis.spiders import RedisSpider
```
Make the spider inherit from the distributed spider class
Comment out the start_urls and allowed_domains attributes
Use a redis_key attribute for the start URLs instead
```python
redis_key = "start_urls"
```
Add an __init__ method
```python
def __init__(self, *args, **kwargs):
    domain = kwargs.pop("domain", "")
    self.allowed_domains = list(filter(None, domain.split(",")))
    super().__init__(*args, **kwargs)
```
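Taken together, a converted spider might look like the sketch below; the spider name, item fields, and parse logic here are illustrative, not taken from the original project:

```python
from scrapy_redis.spiders import RedisSpider


class BookSpider(RedisSpider):
    name = "books"  # hypothetical spider name
    # start_urls and allowed_domains are no longer hard-coded;
    # seeds come from the Redis list named by redis_key.
    redis_key = "start_urls"

    def __init__(self, *args, **kwargs):
        # Allowed domains are supplied at launch time, e.g. -a domain="example.com"
        domain = kwargs.pop("domain", "")
        self.allowed_domains = list(filter(None, domain.split(",")))
        super().__init__(*args, **kwargs)

    def parse(self, response):
        yield {
            "title": response.css("title::text").get(),
            "url": response.url,
        }
```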
Modify the configuration file: add the scrapy-redis settings from the configuration section above to the project's settings.py, as gathered in the sketch below.
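A consolidated sketch of the scrapy-redis settings from the steps above; the rest of settings.py stays as it is:

```python
# Use the scrapy-redis duplicate filter, scheduler, and pipeline.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True  # keep the dedup set and request queue when the spider closes

ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 400,
}

# Point at the shared Redis instance (the REDIS_HOST/REDIS_PORT pair works too).
REDIS_URL = "redis://127.0.0.1:6379"
```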
That completes the conversion.
References
Bilibili — 莉莉的茉莉花