
Requests and Responses

2022-2-21 06:12 | Published by: 笨鸟自学网

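Summary: Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object that travels back to the spider that issued the request. Both Request and Response ...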


Using errbacks to catch exceptions in request processing

The errback of a request is a function that is called when an exception is raised while the request is being processed.

It receives a Failure as its first argument, which can be used to track connection establishment timeouts, DNS errors, and similar problems.

Here is an example spider that logs all errors and, where needed, catches some specific ones:

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "https://example.invalid/",             # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url) 
