[Notes] Scrapy Distributed Crawling Study Notes

Preface

Study notes on distributed crawling with Scrapy, based on scrapy-redis.

Download the project

git clone https://github.com/rolando/scrapy-redis.git
cd scrapy-redis/example-project

Configure the settings specific to the distributed crawler

Set the duplicate-filter class

example-project/example/settings.py
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

Set the scheduler

example-project/example/settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

Set whether the dedup set and request queue in Redis are kept when the spider finishes

example-project/example/settings.py
SCHEDULER_PERSIST = True
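
A related option, if I remember the scrapy-redis defaults correctly, is SCHEDULER_FLUSH_ON_START, which clears the queue and dedup set when the spider starts:

example-project/example/settings.py
SCHEDULER_FLUSH_ON_START = False  # set to True to start from a clean queue on every run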

Set the pipeline that writes items to Redis

example-project/example/settings.py
ITEM_PIPELINES = {
"scrapy_redis.pipelines.RedisPipeline": 400,
}
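
The number (400) is the usual Scrapy pipeline priority, so RedisPipeline can run alongside the project's own pipelines. For example, as far as I remember the example project also registers its own ExamplePipeline in settings.py:

example-project/example/settings.py
ITEM_PIPELINES = {
    "example.pipelines.ExamplePipeline": 300,
    "scrapy_redis.pipelines.RedisPipeline": 400,
}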

Set the Redis connection address

example-project/example/settings.py
REDIS_URL = "redis://127.0.0.1:6379"

Alternatively, set the host and port separately:

example-project/example/settings.py
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
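
If the Redis server requires a password or a specific database number, REDIS_URL can carry them in the standard redis URL form; a sketch where the password and db number are placeholders:

example-project/example/settings.py
REDIS_URL = "redis://:mypassword@127.0.0.1:6379/0"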

Start a spider that supports resuming an interrupted crawl

<name>: the spider name

scrapy crawl <name>

Start a distributed spider

Define the spider

Basic

redis_key: the Redis key that holds the start URLs

example-project/example/spiders/myspider_redis.py
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):

    name = "myspider_redis"
    redis_key = "start_urls"

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop("domain", "")
        self.allowed_domains = list(filter(None, domain.split(",")))
        super().__init__(*args, **kwargs)

    def parse(self, response):
        return {
            "name": response.css("title::text").extract_first(),
            "url": response.url,
        }

Crawl

example-project/example/spiders/mycrawler_redis.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

from scrapy_redis.spiders import RedisCrawlSpider


class MyCrawler(RedisCrawlSpider):

    name = "mycrawler_redis"
    redis_key = "start_urls"

    rules = (
        # follow all links
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop("domain", "")
        self.allowed_domains = list(filter(None, domain.split(",")))
        super().__init__(*args, **kwargs)

    def parse_page(self, response):
        return {
            "name": response.css("title::text").extract_first(),
            "url": response.url,
        }
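
The LinkExtractor() above follows every link on every page. In a real crawl you would normally restrict it; a minimal sketch, where the allow pattern is hypothetical:

rules = (
    # only follow links whose URL matches /books/
    Rule(LinkExtractor(allow=r"/books/"), callback="parse_page", follow=True),
)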

Define the start URLs in Redis

<url>: a start URL

127.0.0.1:6379> lpush start_urls <url>
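
For example, pushing a placeholder URL and then checking the queue length:

127.0.0.1:6379> lpush start_urls https://example.com/
127.0.0.1:6379> llen start_urls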

Run the distributed spider

<file>.py: the spider file
domain="": the allowed domains; separate multiple domains with commas

scrapy runspider <file>.py
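
For example, every worker machine runs the same spider file, optionally passing the allowed domains as a spider argument with -a (the domain values here are placeholders):

scrapy runspider myspider_redis.py -a domain="example.com,example.org"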

Convert an ordinary spider into a distributed spider

Modify the spider class

Import the distributed spider class

from scrapy_redis.spiders import RedisSpider

Inherit from the distributed spider class

class <ClassName>(RedisSpider):

Comment out the start_urls and allowed_domains variables

# allowed_domains = [""]
# start_urls = [""]

Use the redis_key variable for the start URLs instead

redis_key = "start_urls"

Add an __init__ method

def __init__(self, *args, **kwargs):
    # Dynamically define the allowed domains list.
    domain = kwargs.pop("domain", "")
    self.allowed_domains = list(filter(None, domain.split(",")))
    super().__init__(*args, **kwargs)
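
Putting these steps together, a minimal sketch of a converted spider; the class name, spider name, domain, and parse logic are hypothetical:

from scrapy_redis.spiders import RedisSpider


class BookSpider(RedisSpider):

    name = "book"
    # allowed_domains = ["example.com"]
    # start_urls = ["https://example.com/"]
    redis_key = "start_urls"

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop("domain", "")
        self.allowed_domains = list(filter(None, domain.split(",")))
        super().__init__(*args, **kwargs)

    def parse(self, response):
        yield {
            "title": response.css("title::text").get(),
            "url": response.url,
        }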

Modify the settings file
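
Add the scrapy-redis settings from the configuration section above to the project's settings.py; collected here in one place:

settings.py
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True

ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 400,
}

REDIS_URL = "redis://127.0.0.1:6379"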

Done

References

Bilibili: 莉莉的茉莉花