Mastering Scrapy Web Crawling [9]: Downloading Files and Images in Practice

Love The Way You Lie 2021-09-08 12:52

## FilesPipeline and ImagesPipeline ##

### Using FilesPipeline ###

1.  Enable FilesPipeline in the configuration file settings.py. It is usually placed ahead of the other item pipelines:

        ITEM_PIPELINES = {
            'scrapy.pipelines.files.FilesPipeline': 1,
        }

2.  In settings.py, set FILES\_STORE to the directory where downloaded files are saved:

        FILES_STORE = 'C:/Users/30452/PycharmProjects/untitled10'

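3.  When the Spider parses a page that contains file download links, collect the URLs of all files to download into a list and assign it to the item's file\_urls field (`item['file_urls']`). For each item it processes, FilesPipeline reads `item['file_urls']` and downloads every URL in it. Example Spider code:

        import scrapy


        class DownloadBookSpider(scrapy.Spider):
            name = 'download_book'  # placeholder name; every Spider needs one

            def parse(self, response):
                item = {}
                item['file_urls'] = []
                # Collect every link on the page as an absolute URL.
                for url in response.xpath('//a/@href').extract():
                    download_url = response.urljoin(url)
                    item['file_urls'].append(download_url)
                yield item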

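Once FilesPipeline has downloaded every file in `item['file_urls']`, it collects the download result of each file into another list and assigns it to the item's files field (`item['files']`).  
Each download result contains the following:  
● `path`: the local path of the downloaded file, relative to FILES\_STORE.  
● `checksum`: the checksum of the file.  
● `url`: the URL of the file.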
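
To make the result structure concrete, here is a minimal standalone sketch of an item after FilesPipeline has run, and of how the relative `path` combines with FILES\_STORE. It does not call Scrapy. The URL, hashed file name, and checksum are made-up placeholder values, and `local_paths` is a hypothetical helper name.

```python
import posixpath

# Same value as FILES_STORE in settings.py above.
FILES_STORE = 'C:/Users/30452/PycharmProjects/untitled10'

# Illustrative item after FilesPipeline has processed it.
# The URL, hashed file name, and checksum are placeholders, not real results.
item = {
    'file_urls': ['http://example.com/books/a.pdf'],
    'files': [
        {
            'url': 'http://example.com/books/a.pdf',
            'path': 'full/0123456789abcdef0123456789abcdef01234567.pdf',
            'checksum': 'd41d8cd98f00b204e9800998ecf8427e',
        },
    ],
}


def local_paths(item, store=FILES_STORE):
    """Return the on-disk location of every downloaded file in the item."""
    # 'path' is relative to the store directory, so join the two.
    return [posixpath.join(store, result['path'])
            for result in item.get('files', [])]


for location in local_paths(item):
    print(location)
```

An item whose downloads all failed simply has an empty files list, so `local_paths` returns an empty list for it.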

### Using ImagesPipeline ###

ImagesPipeline is a subclass of FilesPipeline and is used in much the same way. The two differ only in the item fields and configuration options they use:

<table> 
 <thead> 
  <tr> 
   <th></th> 
   <th>FilesPipeline</th> 
   <th>ImagesPipeline</th> 
  </tr> 
 </thead> 
 <tbody> 
  <tr> 
   <td>Import path</td> 
   <td>scrapy.pipelines.files.FilesPipeline</td> 
   <td>scrapy.pipelines.images.ImagesPipeline</td> 
  </tr> 
  <tr> 
   <td>Item fields</td> 
   <td>file_urls,files</td> 
   <td>image_urls,images</td> 
  </tr> 
  <tr> 
   <td>Download directory setting</td> 
   <td>FILES_STORE</td> 
   <td>IMAGES_STORE</td> 
  </tr> 
 </tbody> 
</table>

Features specific to ImagesPipeline:

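Generating thumbnails: set IMAGES\_THUMBS in settings.py. It is a dictionary in which each key names a thumbnail variant and each value is that thumbnail's size as (width, height):

    IMAGES_THUMBS = {
        'small': (50, 50),
        'big': (270, 270),
    }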

Filtering out images that are too small: set IMAGES\_MIN\_WIDTH and IMAGES\_MIN\_HEIGHT in settings.py. They specify the minimum width and height of an image, in pixels:

    IMAGES_MIN_WIDTH = 200
    IMAGES_MIN_HEIGHT = 200
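
The effect of these settings can be pictured with a small standalone sketch that does not call Scrapy. The storage layout it prints (`full/<hash>.jpg` for the image and `thumbs/<name>/<hash>.jpg` for each thumbnail, with the SHA-1 of the image URL as the hash) is my understanding of ImagesPipeline's default behavior and should be treated as an assumption to check against your Scrapy version. `passes_size_filter` and `expected_paths` are hypothetical helper names, and the URL is a placeholder.

```python
import hashlib

# Same values as in the settings shown above.
IMAGES_THUMBS = {'small': (50, 50), 'big': (270, 270)}
IMAGES_MIN_WIDTH = 200
IMAGES_MIN_HEIGHT = 200


def passes_size_filter(width, height):
    """Return True if an image meets both configured minimums."""
    return width >= IMAGES_MIN_WIDTH and height >= IMAGES_MIN_HEIGHT


def expected_paths(url):
    """Paths relative to IMAGES_STORE, assuming the default naming scheme."""
    guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    paths = {'full': f'full/{guid}.jpg'}
    # One extra file per entry in IMAGES_THUMBS.
    for thumb_id in IMAGES_THUMBS:
        paths[thumb_id] = f'thumbs/{thumb_id}/{guid}.jpg'
    return paths


print(passes_size_filter(640, 480))   # large enough on both axes
print(passes_size_filter(640, 100))   # too short, would be dropped
for name, path in expected_paths('http://example.com/cover.png').items():
    print(name, path)
```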

## Hands-on project: crawling the matplotlib example source files ##

Open [http://matplotlib.org/examples/index.html][http_matplotlib.org_examples_index.html] in a browser.

### Analyzing the pages ###

![Screenshot][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjQwMzYzMg_size_16_color_FFFFFF_t_70]  
The links to all example pages are under `<div class="toctree-wrapper compound">`, one in each `<li class="toctree-l2">`.  
![Screenshot][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjQwMzYzMg_size_16_color_FFFFFF_t_70 1]  
On an example page, the download URL of the example source file can be found in an `<a class="reference external">` element.

### Writing the code ###

Create a Scrapy project named matplotlib\_examples, then use the scrapy genspider command to create the Spider:

    scrapy startproject matplotlib_examples
    cd matplotlib_examples
    scrapy genspider examples matplotlib.org

In settings.py, enable FilesPipeline and specify the download directory:

    ITEM_PIPELINES = {
        'scrapy.pipelines.files.FilesPipeline': 1,
    }
    FILES_STORE = 'C:/Users/30452/PycharmProjects/untitled10'

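Implement ExampleItem in items.py. It needs to define two fields, file\_urls and files:

    import scrapy


    class ExampleItem(scrapy.Item):
        file_urls = scrapy.Field()  # URLs to download, filled in by the Spider
        files = scrapy.Field()      # download results, filled in by FilesPipeline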

Implement ExamplesSpider:

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from ..items import ExampleItem


    class ExamplesSpider(scrapy.Spider):
        name = 'examples'
        allowed_domains = ['matplotlib.org']
        start_urls = ['https://matplotlib.org/2.0.2/examples/index.html']

        def parse(self, response):
            # Follow every example link in the table of contents, skipping
            # the index.html pages (raw string, with the dot escaped).
            le = LinkExtractor(restrict_css='div.toctree-wrapper.compound',
                               deny=r'/index\.html$')
            links = le.extract_links(response)
            self.logger.info('Found %d example pages', len(links))
            for link in links:
                yield scrapy.Request(link.url, callback=self.parse_example)

        def parse_example(self, response):
            # The source file link is an <a class="reference external">;
            # take the first match and make it absolute.
            href = response.css('a.reference.external::attr(href)').extract_first()
            url = response.urljoin(href)
            example = ExampleItem()
            example['file_urls'] = [url]
            return example
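
The `deny` argument of the LinkExtractor above is a regular expression that is tested against each extracted URL. The standalone sketch below shows which links such a pattern lets through, using only the `re` module. It assumes the pattern is applied with a search rather than a full match, which is my understanding of LinkExtractor. The two URLs are hypothetical, and `is_example_page` is a helper name made up for illustration.

```python
import re

# Equivalent in intent to the deny pattern in the spider, with the dot escaped.
DENY = re.compile(r'/index\.html$')

# Hypothetical URLs, for illustration only.
urls = [
    'https://matplotlib.org/2.0.2/examples/animation/index.html',
    'https://matplotlib.org/2.0.2/examples/animation/animate_decay.html',
]


def is_example_page(url):
    """Keep a link only if the deny pattern does not match it."""
    return DENY.search(url) is None


kept = [u for u in urls if is_example_page(u)]
for u in kept:
    print(u)
```

Only the animate\_decay.html link survives; the index.html link is dropped because the pattern matches the end of its URL.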

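The parse method is the callback for the example index page. It extracts the link to each example page, builds a Request object from it, and submits it.  
The parse\_example method is the callback for an example page. It takes the download URL of the source file and stores it in the item's file\_urls field.  
Run the crawler (for example with `scrapy crawl examples`), then look at the download directory:  
![Screenshot][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjQwMzYzMg_size_16_color_FFFFFF_t_70 2]  
By default, FilesPipeline names each file after the SHA-1 hash of its URL and stores it under `full/`, so the file names say nothing about the examples. The next step changes the rule FilesPipeline uses to name files.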

In pipelines.py, implement a subclass of FilesPipeline and override its file\_path method to produce the desired file names:

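    from os.path import basename, dirname, join
    from urllib.parse import urlparse

    from scrapy.pipelines.files import FilesPipeline


    class MyFilesPipeline(FilesPipeline):
        # item=None keeps the override compatible with Scrapy versions that
        # pass the item as a keyword argument.
        def file_path(self, request, response=None, info=None, *, item=None):
            # Name the file <parent directory>/<file name>, taken from the URL path.
            path = urlparse(request.url).path
            return join(basename(dirname(path)), basename(path))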
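
The naming rule can be tried on its own, without running a crawl. The sketch below applies the same `urlparse`, `dirname`, and `basename` steps to a single URL and, for comparison, approximates the default hash-based name. The URL is a hypothetical example, `custom_name` and `default_name` are helper names made up for illustration, and `default_name` is only my approximation of FilesPipeline's default scheme.

```python
import hashlib
import os
from os.path import basename, dirname, join
from urllib.parse import urlparse


def custom_name(url):
    """Same rule as the file_path override: <parent directory>/<file name>."""
    path = urlparse(url).path
    return join(basename(dirname(path)), basename(path))


def default_name(url):
    """Approximation of the default scheme: full/<SHA-1 of URL><extension>."""
    guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    ext = os.path.splitext(url)[1]
    return f'full/{guid}{ext}'


# Hypothetical download URL, for illustration only.
url = 'https://matplotlib.org/2.0.2/mpl_examples/animation/animate_decay.py'
print(custom_name(url))
print(default_name(url))
```

Because the parent directory is part of the returned path, files from different example categories land in separate subdirectories of FILES\_STORE and keep their original names.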

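Update the configuration file to use MyFilesPipeline in place of FilesPipeline. Since the class is defined in pipelines.py, its import path is matplotlib\_examples.pipelines.MyFilesPipeline:

    ITEM_PIPELINES = {
        # 'scrapy.pipelines.files.FilesPipeline': 1,
        'matplotlib_examples.pipelines.MyFilesPipeline': 1,
    }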

Run the crawler again:  
![Screenshot][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjQwMzYzMg_size_16_color_FFFFFF_t_70 3]  
![Screenshot][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjQwMzYzMg_size_16_color_FFFFFF_t_70 4]

[http_matplotlib.org_examples_index.html]: http://matplotlib.org/examples/index.html
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjQwMzYzMg_size_16_color_FFFFFF_t_70]: /images/20210813/0fa45b99c7e84866bc4735bc69b44d75.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjQwMzYzMg_size_16_color_FFFFFF_t_70 1]: /images/20210813/c10fe11f49164f81b6a35e6f401c929b.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjQwMzYzMg_size_16_color_FFFFFF_t_70 2]: /images/20210813/c8ca91d041d340b7972060bda030b199.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjQwMzYzMg_size_16_color_FFFFFF_t_70 3]: /images/20210813/3af76e165d8e4183a6f549e540f49743.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MjQwMzYzMg_size_16_color_FFFFFF_t_70 4]: /images/20210813/5a619ddfcdd7446696b2c6f476e93dfe.png