
Scrapy Tutorial

2022-2-21 05:57 | Publisher: 笨鸟自学网

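Summary: This tutorial assumes that Scrapy is already installed on your system. If it is not, see the installation guide. We will scrape quotes.toscrape.com (http://quotes.toscrape.com/), a website that lists quotes from famous authors. This tutorial will walk you through ...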

A shortcut for creating Requests

As a shortcut for creating Request objects, you can use response.follow:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

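Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow only returns a Request instance; you still have to yield that Request.

You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attribute: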
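To make the relative-URL handling concrete, here is a minimal sketch that uses only the standard library's urljoin to show how a relative href resolves against the URL of the page it was found on. It assumes response.follow resolves links in a comparable way and is not Scrapy's own code. The ../2/ and /tag/love/ hrefs are illustrative values, not taken from the tutorial.

```python
from urllib.parse import urljoin

# URL of the page that was just downloaded (what response.url would hold).
page_url = "http://quotes.toscrape.com/page/1/"

# A root-relative href, like the one in the "next" link of the tutorial site.
root_relative = urljoin(page_url, "/page/2/")

# A path-relative href (illustrative) resolves against the page's directory.
path_relative = urljoin(page_url, "../2/")

# An href that is already absolute (illustrative) is returned unchanged.
absolute = urljoin(page_url, "http://quotes.toscrape.com/tag/love/")

print(root_relative)
print(path_relative)
print(absolute)
```

With scrapy.Request you would have to do this joining step yourself (for example with response.urljoin) before building the request; response.follow spares you that step.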

for href in response.css('ul.pager a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:

for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)

To create multiple requests from an iterable, you can use response.follow_all instead:

anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)

Or, shortening it further:

yield from response.follow_all(css='ul.pager a', callback=self.parse) 
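As a rough illustration of what following every pager link amounts to, the sketch below uses only the standard library to collect the href of each <a> inside the pager and resolve it against the page URL. This is a conceptual stand-in, not how Scrapy implements follow_all. PagerLinkCollector is a hypothetical helper, and the HTML snippet is an assumed example of the pager markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PagerLinkCollector(HTMLParser):
    """Hypothetical helper: collects href values of <a> tags inside <ul class="pager">."""

    def __init__(self):
        super().__init__()
        self.in_pager = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul" and "pager" in (attrs.get("class") or "").split():
            self.in_pager = True
        elif tag == "a" and self.in_pager and attrs.get("href"):
            # Same idea as the <a> shortcut: take the href attribute.
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_pager = False


# Assumed page URL and pager markup, for illustration only.
page_url = "http://quotes.toscrape.com/page/2/"
html = """
<ul class="pager">
  <li class="previous"><a href="/page/1/">Previous</a></li>
  <li class="next"><a href="/page/3/">Next</a></li>
</ul>
"""

collector = PagerLinkCollector()
collector.feed(html)

# One absolute URL per anchor, analogous to one Request per anchor.
absolute_urls = [urljoin(page_url, href) for href in collector.hrefs]
print(absolute_urls)
```

In a real spider, each of these URLs would correspond to one Request produced by response.follow_all, and Scrapy's duplicate filter would normally skip pages that were already visited.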
