[Notes] Scrapy Distributed Crawling Study Notes

Preface

Study notes on distributed crawling with Scrapy, based on scrapy-redis.

Download the project

git clone https://github.com/rolando/scrapy-redis.git
cd scrapy-redis/example-project

Configure the settings specific to the distributed crawler

Set the duplicate-filter class

example-project/example/settings.py
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

Set the scheduler

example-project/example/settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

Set whether the dedup set and request queue in Redis are kept when the spider finishes

example-project/example/settings.py
SCHEDULER_PERSIST = True
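
A related option, if I remember the scrapy-redis defaults correctly, is SCHEDULER_FLUSH_ON_START, which clears the queue and dedup set when the spider starts:

example-project/example/settings.py
SCHEDULER_FLUSH_ON_START = False  # set to True to start from a clean queue on every run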

Set the pipeline that writes items to Redis

example-project/example/settings.py
ITEM_PIPELINES = {
"scrapy_redis.pipelines.RedisPipeline": 400,
}
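
The number (400) is the usual Scrapy pipeline priority, so RedisPipeline can run alongside the project's own pipelines. For example, as far as I remember the example project also registers its own ExamplePipeline in settings.py:

example-project/example/settings.py
ITEM_PIPELINES = {
    "example.pipelines.ExamplePipeline": 300,
    "scrapy_redis.pipelines.RedisPipeline": 400,
}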

Set the Redis connection address

example-project/example/settings.py
REDIS_URL = "redis://127.0.0.1:6379"

Alternatively, set the host and port separately:

example-project/example/settings.py
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
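
If the Redis server requires a password or a specific database number, REDIS_URL can carry them in the standard redis URL form; a sketch where the password and db number are placeholders:

example-project/example/settings.py
REDIS_URL = "redis://:mypassword@127.0.0.1:6379/0"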

Start a spider that supports resuming an interrupted crawl

<name>: the spider name

scrapy crawl <name>

Start a distributed spider

Define the spider

Basic

redis_key: the Redis key that holds the start URLs

example-project/example/spiders/myspider_redis.py
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):

    name = "myspider_redis"
    redis_key = "start_urls"

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop("domain", "")
        self.allowed_domains = list(filter(None, domain.split(",")))
        super().__init__(*args, **kwargs)

    def parse(self, response):
        return {
            "name": response.css("title::text").extract_first(),
            "url": response.url,
        }

Crawl

example-project/example/spiders/mycrawler_redis.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

from scrapy_redis.spiders import RedisCrawlSpider


class MyCrawler(RedisCrawlSpider):

    name = "mycrawler_redis"
    redis_key = "start_urls"

    rules = (
        # follow all links
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop("domain", "")
        self.allowed_domains = list(filter(None, domain.split(",")))
        super().__init__(*args, **kwargs)

    def parse_page(self, response):
        return {
            "name": response.css("title::text").extract_first(),
            "url": response.url,
        }
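
The LinkExtractor() above follows every link on every page. In a real crawl you would normally restrict it; a minimal sketch, where the allow pattern is hypothetical:

rules = (
    # only follow links whose URL matches /books/
    Rule(LinkExtractor(allow=r"/books/"), callback="parse_page", follow=True),
)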

Define the start URLs in Redis

<url>: a start URL

127.0.0.1:6379> lpush start_urls <url>
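
For example, pushing a placeholder URL and then checking the queue length:

127.0.0.1:6379> lpush start_urls https://example.com/
127.0.0.1:6379> llen start_urls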

Run the distributed spider

<file>.py: the spider file
domain="": the allowed domains; separate multiple domains with commas

scrapy runspider <file>.py
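
For example, every worker machine runs the same spider file, optionally passing the allowed domains as a spider argument with -a (the domain values here are placeholders):

scrapy runspider myspider_redis.py -a domain="example.com,example.org"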

Convert an ordinary spider into a distributed spider

Modify the spider class

Import the distributed spider class

from scrapy_redis.spiders import RedisSpider

Inherit from the distributed spider class

class <ClassName>(RedisSpider):

Comment out the start_urls and allowed_domains variables

# allowed_domains = [""]
# start_urls = [""]

Use the redis_key variable for the start URLs instead

redis_key = "start_urls"

Add an __init__ method

def __init__(self, *args, **kwargs):
    # Dynamically define the allowed domains list.
    domain = kwargs.pop("domain", "")
    self.allowed_domains = list(filter(None, domain.split(",")))
    super().__init__(*args, **kwargs)
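
Putting these steps together, a minimal sketch of a converted spider; the class name, spider name, domain, and parse logic are hypothetical:

from scrapy_redis.spiders import RedisSpider


class BookSpider(RedisSpider):

    name = "book"
    # allowed_domains = ["example.com"]
    # start_urls = ["https://example.com/"]
    redis_key = "start_urls"

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop("domain", "")
        self.allowed_domains = list(filter(None, domain.split(",")))
        super().__init__(*args, **kwargs)

    def parse(self, response):
        yield {
            "title": response.css("title::text").get(),
            "url": response.url,
        }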

Modify the settings file
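
Add the scrapy-redis settings from the configuration section above to the project's settings.py; collected here in one place:

settings.py
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True

ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 400,
}

REDIS_URL = "redis://127.0.0.1:6379"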

Done

References

Bilibili: 莉莉的茉莉花