Scraping Images with Scrapy

Requirement

Scrape the images from 站长素材 (sc.chinaz.com).

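Scraping image data with ImagesPipeline

Scraping images with plain Scrapy

Use xpath to parse out the value of each image's src attribute, then send a separate request to each image URL to get the image's binary data.

ImagesPipeline

You only need to parse the src attribute of each img and submit it to the pipeline. The pipeline then sends a request to that src, fetches the image's binary data, and persists it to disk.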

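Workflow

Parse the data (get the image URLs)
Submit the item that holds the URL to the designated pipeline class
In the pipeline file, define a custom pipeline class based on ImagesPipeline
- method get_media_requests: sends the requests
- method file_path: customizes the image file name
- method item_completed: passes the item on to the next pipeline class to be executed
In the settings file
- set where images are stored: IMAGES_STORE = "./photo"
- enable the pipeline: add the custom pipeline class to ITEM_PIPELINES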
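
The settings step above might look like the following in settings.py. This is only a minimal sketch: the project name photopro and the class name ImgPipeline are the ones used in the code further down, and the priority value 300 is an arbitrary choice for illustration. Note that ImagesPipeline also requires the Pillow library to be installed.

```python
# settings.py (fragment)

# Directory where ImagesPipeline stores the downloaded images
IMAGES_STORE = "./photo"

# Enable the custom pipeline; the number is its priority (lower values run first).
# 300 is an arbitrary value chosen for this sketch.
ITEM_PIPELINES = {
    "photopro.pipelines.ImgPipeline": 300,
}
```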

Caveats

When we parse the image URLs with the xpath expression below, the list of scraped URLs comes back empty.

The spider file looks like this:

import scrapy

class PhotoSpider(scrapy.Spider):
    name = 'photo'

    # allowed_domains = ['www.xxx.com']

    start_urls = ['https://sc.chinaz.com/tupian/']

    def parse(self, response):
        src_list = response.xpath('//div[@id="container"]/div/div/a/img/@src').extract()
        print(src_list)
        for src in src_list:
            src = "https:" + src
            print(src)

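Why is that?

Opening the page and inspecting its source shows that the image URL lives in one of two attributes: src or src2.

What is the difference between the two?

With the inspector open, scrolling down the page shows that once an image enters the visible area, its attribute changes from src2 to src. In other words, the images are lazy-loaded. Scrapy does not run the page's JavaScript and nothing is ever scrolled into view, so in the response it downloads the images still carry src2.

So when parsing the scraped page with xpath, the attribute to select is src2.

Spider file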
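
To make the src2/src distinction concrete, here is a small standalone sketch that uses only the standard library rather than Scrapy. The markup in SAMPLE_HTML and the host img.example.com are made up for illustration, and falling back to src when src2 is missing is a defensive assumption, not something the site is known to require.

```python
from html.parser import HTMLParser


class LazyImageParser(HTMLParser):
    """Collect image URLs, preferring the lazy-load attribute src2 over src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Images not yet displayed only carry src2; displayed ones carry src
        url = attr_map.get("src2") or attr_map.get("src")
        if url:
            self.urls.append(url)


# Hypothetical markup: one image not yet displayed, one already displayed
SAMPLE_HTML = """
<div id="container">
  <div><div><a href="#"><img src2="//img.example.com/a_s.jpg" alt="a"></a></div></div>
  <div><div><a href="#"><img src="//img.example.com/b_s.jpg" alt="b"></a></div></div>
</div>
"""

parser = LazyImageParser()
parser.feed(SAMPLE_HTML)
image_urls = parser.urls
print(image_urls)
```

In the spider itself the same idea is expressed by selecting @src2 in the xpath expression.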

import scrapy
from photopro.items import PhotoproItem
import re

class PhotoSpider(scrapy.Spider):
    name = 'photo'

    # allowed_domains = ['www.xxx.com']

    start_urls = ['https://sc.chinaz.com/tupian/']

    def parse(self, response):
        src_list = response.xpath('//div[@id="container"]/div/div/a/img/@src2').extract()
        for src in src_list:
            src = "https:" + src
            print(src)

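Also, clicking through to a detail page to view an image shows that the scraped URL is only a thumbnail. The URL of the original image differs from it only by the _s suffix before the file extension.

Scraped URL: https://scpic3.chinaz.net/Files/pic/pic9/202107/bpic23822_s.jpg
Full-size URL: https://scpic3.chinaz.net/Files/pic/pic9/202107/bpic23822.jpg

OK, with these two caveats in mind, we can write the code.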
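
The thumbnail-to-original mapping can be sketched as a small helper. The name to_full_size is hypothetical, and the pattern assumes the _s marker always sits directly before the file extension, as in the two URLs above. Anchoring the pattern there avoids touching any other "_s" that might appear earlier in a URL.

```python
import re

# Matches a "_s" that sits directly before the file extension at the end of the URL
_THUMB_SUFFIX = re.compile(r"_s(?=\.[A-Za-z0-9]+$)")


def to_full_size(url: str) -> str:
    """Return the full-size URL for a thumbnail URL; other URLs are returned unchanged."""
    return _THUMB_SUFFIX.sub("", url)


thumb = "https://scpic3.chinaz.net/Files/pic/pic9/202107/bpic23822_s.jpg"
full = to_full_size(thumb)
print(full)
```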

Implementation

Custom pipeline class: pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


# class PhotoproPipeline:
#     def process_item(self, item, spider):
#         return item

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class ImgPipeline(ImagesPipeline):
    # Issue a download request for the image URL stored in the item
    def get_media_requests(self, item, info):
        yield scrapy.Request(url=item["src"])

    # Customize the file name the image is saved under (relative to IMAGES_STORE)
    def file_path(self, request, response=None, info=None, *, item=None):
        # Use the last segment of the URL, e.g. "bpic23822.jpg"
        return request.url.split("/")[-1]

    def item_completed(self, results, item, info):
        return item  # the return value is passed to the next pipeline class to be executed
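
Taking the last "/"-separated segment works for the URLs on this site. If an image URL ever carried a query string, that string would end up in the file name. A slightly more defensive variant could look like the sketch below. The helper name image_name_from_url and the ?v=1 suffix in the example are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def image_name_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring any query string or fragment."""
    return posixpath.basename(urlparse(url).path)


name = image_name_from_url(
    "https://scpic3.chinaz.net/Files/pic/pic9/202107/bpic23822.jpg?v=1"
)
print(name)
```

Inside file_path, such a helper would be called with request.url.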

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class PhotoproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    src = scrapy.Field()
    pass

Spider code

import scrapy
from photopro.items import PhotoproItem
import re

class PhotoSpider(scrapy.Spider):
    name = 'photo'

    # allowed_domains = ['www.xxx.com']

    start_urls = ['https://sc.chinaz.com/tupian/']

    def parse(self, response):
        src_list = response.xpath('//div[@id="container"]/div/div/a/img/@src2').extract()
        for src in src_list:
            src = "https:" + src
            # Drop the "_s" thumbnail suffix right before the extension to get the full-size image
            src = re.sub(r"_s(?=\.\w+$)", "", src)
            item = PhotoproItem()
            item["src"] = src
            yield item