Analyzing a Problem with Scrapy Response Subclasses in Practice (Python)

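Today, while using Scrapy to scrape wallpapers (URL: http://pic.netbian.com/4kmein...), I ran into some problems. I am recording them here for later discussion and as a lesson learned.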

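Because the site is rendered dynamically, I chose to hook Scrapy up to Selenium. (Scrapy fetches pages much like the requests library does: both issue HTTP requests directly, so Scrapy on its own cannot fetch pages that are rendered by JavaScript.)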

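The Downloader Middleware therefore has to take the Request and return a Response, and the problem turned out to be in that Response. The official documentation gives the signature class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None]), so I imported Response with from scrapy.http import Response.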

Running scrapy crawl girl produced the following error:

results=response.xpath('//*[@id="main"]/div[3]/ul/li/a/img')
raise NotSupported("Response content isn't text")
scrapy.exceptions.NotSupported: Response content isn't text

Checking the relevant code:

    # middlewares.py
    from scrapy import signals
    from scrapy.http import Response
    from scrapy.exceptions import IgnoreRequest
    import selenium
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    class Pic4KgirlDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.

        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            try:
                self.browser=selenium.webdriver.Chrome()
                self.wait=WebDriverWait(self.browser,10)
                self.browser.get(request.url)
                self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#main > div.page > a:nth-child(10)')))
                return Response(url=request.url,status=200,request=request,body=self.browser.page_source.encode('utf-8'))
            #except:
                #raise IgnoreRequest()
            finally:
                self.browser.close()

The problem appears to lie in this line:

return Response(url=request.url,status=200,request=request,body=self.browser.page_source.encode('utf-8'))

Looking at the definition of the Response class:

    @property
    def text(self):
        """For subclasses of TextResponse, this will return the body
        as text (unicode object in Python 2 and str in Python 3)
        """
        raise AttributeError("Response content isn't text")

    def css(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")

    def xpath(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")

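This shows that the base Response class cannot be used directly for parsing: its text, css and xpath members only raise errors, and they work only on subclasses that override them.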

Response subclasses

    TextResponse objects
    class scrapy.http.TextResponse(url[, encoding[, ...]])

    HtmlResponse objects
    class scrapy.http.HtmlResponse(url[, ...])

    XmlResponse objects
    class scrapy.http.XmlResponse(url[, ...])

Take TextResponse as an example. After importing it with from scrapy.http import TextResponse and opening its definition, we find:

    class TextResponse(Response):

        _DEFAULT_ENCODING = 'ascii'

        def __init__(self, *args, **kwargs):
            self._encoding = kwargs.pop('encoding', None)
            self._cached_benc = None
            self._cached_ubody = None
            self._cached_selector = None
            super(TextResponse, self).__init__(*args, **kwargs)

In this class the xpath method has been overridden:

    @property
    def selector(self):
        from scrapy.selector import Selector
        if self._cached_selector is None:
            self._cached_selector = Selector(self)
        return self._cached_selector

    def xpath(self, query, **kwargs):
        return self.selector.xpath(query, **kwargs)

    def css(self, query):
        return self.selector.css(query)