Scraping Xici proxy IPs with Scrapy

柔光的暖阳◎ 2021-11-29 04:56

Target URL: [https://www.xicidaili.com/][https_www.xicidaili.com]

Define the item fields to scrape in items.py:

    import scrapy


    class GetipsItem(scrapy.Item):
        ip = scrapy.Field()  # proxy IP address
        port = scrapy.Field()  # port
        position = scrapy.Field()  # server location
        type = scrapy.Field()  # proxy type
        speed = scrapy.Field()  # speed
        last_check_time = scrapy.Field()  # time of last verification

Write the spider (spider.py):

    # -*- coding: utf-8 -*-
    import scrapy
    from getips.items import GetipsItem


    class GetipSpider(scrapy.Spider):
        name = 'getip'
        allowed_domains = ['xicidaili.com']
        start_urls = ['http://xicidaili.com/']

        custom_settings = {
            "DEFAULT_REQUEST_HEADERS": {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
            }
        }

        def start_requests(self):
            # Request list pages 1 through 19; this overrides start_urls.
            for i in range(1, 20):
                yield scrapy.Request("https://www.xicidaili.com/nn/%s" % i)

        def parse(self, response):
            # Rows of the proxy table; the first row is the header.
            rows = response.xpath('//table[@id="ip_list"]/tr')

            for row in rows[1:]:
                item = GetipsItem()
                item['ip'] = row.xpath('td[2]/text()').get()
                item['port'] = row.xpath('td[3]/text()').get()
                # string() also collects text nested inside child tags
                item['position'] = row.xpath('string(td[4])').get().strip()
                item['type'] = row.xpath('td[6]/text()').get()
                # Speed extraction is left disabled, as in the original:
                # item['speed'] = row.xpath('td[7]/div[@class="bar"]/@title').re(r'\d{0,2}\.\d{0,}')[0]
                item['last_check_time'] = row.xpath('td[10]/text()').get()
                yield item
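
The column positions used in `parse` can be checked offline, without sending any requests. The following is a minimal sketch that uses the standard library's `xml.etree.ElementTree` in place of Scrapy's selectors. The sample table, its values, and the helper name `parse_rows` are made up for illustration; the only thing assumed about the real page is the column layout implied by the XPath expressions above (IP in column 2, port in 3, location in 4, type in 6, check time in 10).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics the assumed table layout; not real data.
SAMPLE = """
<table id="ip_list">
  <tr><th>Country</th><th>IP</th><th>Port</th><th>Location</th><th>Anonymity</th>
      <th>Type</th><th>Speed</th><th>Connect time</th><th>Uptime</th><th>Checked</th></tr>
  <tr><td></td><td>192.0.2.10</td><td>8080</td><td> <a href="#">Example City</a> </td>
      <td>high</td><td>HTTP</td><td></td><td></td><td>1 day</td><td>21-11-29 04:00</td></tr>
</table>
"""


def parse_rows(table_xml):
    """Return one dict per data row, mirroring the fields set in parse()."""
    table = ET.fromstring(table_xml.strip())
    rows = table.findall('tr')
    items = []
    for row in rows[1:]:  # skip the header row
        items.append({
            'ip': row.find('td[2]').text,
            'port': row.find('td[3]').text,
            # itertext() gathers nested text, similar to XPath string(td[4])
            'position': ''.join(row.find('td[4]').itertext()).strip(),
            'type': row.find('td[6]').text,
            'last_check_time': row.find('td[10]').text,
        })
    return items


items = parse_rows(SAMPLE)
print(items[0])
```

ElementTree only supports a small XPath subset, so this checks the index mapping rather than reproducing Scrapy's selector behaviour exactly.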

Write pipelines.py:

    import csv


    class GetipsPipeline:
        def process_item(self, item, spider):
            # Append one CSV row per item; utf-8-sig helps Excel detect the encoding.
            with open('打开.csv', 'a', encoding='utf-8-sig', newline='') as f:
                writer = csv.writer(f)
                writer.writerow((item['ip'], item['port'], item['position'],
                                 item['type'], item['last_check_time']))
            return item
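
The pipeline above reopens the CSV file for every item. One possible variation, sketched below, keeps the file open for the whole crawl by using the `open_spider` and `close_spider` hooks that Scrapy calls on item pipelines. The class name `CsvExportPipeline`, the `path` argument, the default file name, and the header row are illustrative assumptions and not part of the original project. Unlike the original, this sketch overwrites the file on each run instead of appending to it.

```python
import csv

FIELDS = ('ip', 'port', 'position', 'type', 'last_check_time')


class CsvExportPipeline:
    """Hypothetical variant of the pipeline: one file handle per crawl."""

    def __init__(self, path='proxies.csv'):
        self.path = path  # assumed output path, not from the original post
        self.file = None
        self.writer = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the file and write a header.
        self.file = open(self.path, 'w', encoding='utf-8-sig', newline='')
        self.writer = csv.writer(self.file)
        self.writer.writerow(FIELDS)

    def process_item(self, item, spider):
        # Write the fields in a fixed order, then pass the item along.
        self.writer.writerow([item[name] for name in FIELDS])
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

Because the class only relies on dict-style access, it can be exercised without Scrapy by calling the three methods by hand with a plain dict.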

Enable the pipeline in settings.py:

    ITEM_PIPELINES = {
        'getips.pipelines.GetipsPipeline': 300,
    }

## Result ##

![Result][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80NDg0MTMxMg_size_16_color_FFFFFF_t_70]

[https_www.xicidaili.com]: https://www.xicidaili.com/
[item.py]: http://xn--item-4z8fq2jgu7ehjhc6tb0x.py
[spider.py]: http://xn--spider-hw2j11m7s0g0cjpwxps2a.py
[piplines.py]: http://xn--piplines-ts6mn078a.py
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80NDg0MTMxMg_size_16_color_FFFFFF_t_70]: /images/20211129/d0b9052dbe6b4b098a834cd6dcec731f.png