Scraping images from XKCD with Scrapy

I'm trying to scrape xkcd.com to get every image they host. When I run my scraper, it downloads only 7-8 seemingly random images from the range www.xkcd.com/1 to /1461. I would like it to go through each page sequentially and save the image, so I can be sure I have the complete set.

Below are the spider I wrote to do the crawl and the output I get from Scrapy:

SPIDER:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from xkcd.items import XkcdItem

class XkcdimagesSpider(CrawlSpider):
    name = "xkcdimages"
    allowed_domains = ["xkcd.com"]
    start_urls = ['http://www.xkcd.com']
    rules = [Rule(LinkExtractor(allow=['\d+']), 'parse_xkcd')]

    def parse_xkcd(self, response):
        image = XkcdItem()
        image['title'] = response.xpath(
            "//div[@id='ctitle']/text()").extract()
        image['image_urls'] = response.xpath(
            "//div[@id='comic']/img/@src").extract()
        return image

OUTPUT:

2014-12-18 19:57:42+1300 [scrapy] INFO: Scrapy 0.24.4 started (bot: xkcd) 
2014-12-18 19:57:42+1300 [scrapy] INFO: Optional features available: ssl, http11, django 
2014-12-18 19:57:42+1300 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'xkcd.spiders', 'SPIDER_MODULES': ['xkcd.spiders'], 'DOWNLOAD_DELAY': 1, 'BOT_NAME': 'xkcd'} 
2014-12-18 19:57:42+1300 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState 
2014-12-18 19:57:43+1300 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2014-12-18 19:57:43+1300 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2014-12-18 19:57:43+1300 [scrapy] INFO: Enabled item pipelines: ImagesPipeline 
2014-12-18 19:57:43+1300 [xkcdimages] INFO: Spider opened 
2014-12-18 19:57:43+1300 [xkcdimages] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-12-18 19:57:43+1300 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2014-12-18 19:57:43+1300 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080 
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com> (referer: None) 
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Filtered offsite request to 'creativecommons.org': <GET http://creativecommons.org/licenses/by-nc/2.5/> 
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://xkcd.com/1461/large/> (referer: http://www.xkcd.com) 
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Scraped from <200 http://xkcd.com/1461/large/> 
    {'image_urls': [], 'images': [], 'title': []} 
2014-12-18 19:57:45+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/1/> (referer: http://www.xkcd.com) 
2014-12-18 19:57:45+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg> referred in <None> 
2014-12-18 19:57:45+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/1/> 
    {'image_urls': [u'http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg'], 
    'images': [{'checksum': '953bf3bf4584c2e347eaaba9e4703c9d', 
       'path': 'full/ab31199b91c967a29443df3093fac9c97e5bbed6.jpg', 
       'url': 'http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg'}], 
    'title': [u'Barrel - Part 1']} 
2014-12-18 19:57:46+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/556/> (referer: http://www.xkcd.com) 
2014-12-18 19:57:46+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/alternative_energy_revolution.jpg> referred in <None> 
2014-12-18 19:57:46+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/556/> 
    {'image_urls': [u'http://imgs.xkcd.com/comics/alternative_energy_revolution.jpg'], 
    'images': [{'checksum': 'c88a6e5a3018bce48861bfe2a2255993', 
       'path': 'full/b523e12519a1735f1d2c10cb8b803e0a39bf90e5.jpg', 
       'url': 'http://imgs.xkcd.com/comics/alternative_energy_revolution.jpg'}], 
    'title': [u'Alternative Energy Revolution']} 
2014-12-18 19:57:47+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/688/> (referer: http://www.xkcd.com) 
2014-12-18 19:57:47+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/self_description.png> referred in <None> 
2014-12-18 19:57:47+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/688/> 
    {'image_urls': [u'http://imgs.xkcd.com/comics/self_description.png'], 
    'images': [{'checksum': '230b38d12d5650283dc1cc8a7f81469b', 
       'path': 'full/e754ff4560918342bde8f2655ff15043e251f25a.jpg', 
       'url': 'http://imgs.xkcd.com/comics/self_description.png'}], 
    'title': [u'Self-Description']} 
2014-12-18 19:57:48+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/162/> (referer: http://www.xkcd.com) 
2014-12-18 19:57:48+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/angular_momentum.jpg> referred in <None> 
2014-12-18 19:57:48+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/162/> 
    {'image_urls': [u'http://imgs.xkcd.com/comics/angular_momentum.jpg'], 
    'images': [{'checksum': '83050c0cc9f4ff271a9aaf52372aeb33', 
       'path': 'full/7c180399f2a2cffeb321c071dea2c669d83ca328.jpg', 
       'url': 'http://imgs.xkcd.com/comics/angular_momentum.jpg'}], 
    'title': [u'Angular Momentum']} 
2014-12-18 19:57:49+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/730/> (referer: http://www.xkcd.com) 
2014-12-18 19:57:49+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/circuit_diagram.png> referred in <None> 
2014-12-18 19:57:49+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/730/> 
    {'image_urls': [u'http://imgs.xkcd.com/comics/circuit_diagram.png'], 
    'images': [{'checksum': 'd929f36d6981cb2825b25c9a8dac7c9e', 
       'path': 'full/15ad254b5cd5c506d701be67f525093af79e6ac0.jpg', 
       'url': 'http://imgs.xkcd.com/comics/circuit_diagram.png'}], 
    'title': [u'Circuit Diagram']} 
2014-12-18 19:57:50+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/150/> (referer: http://www.xkcd.com) 
2014-12-18 19:57:50+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/grownups.png> referred in <None> 
2014-12-18 19:57:50+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/150/> 
    {'image_urls': [u'http://imgs.xkcd.com/comics/grownups.png'], 
    'images': [{'checksum': '9d165fd0b00ec88bcc953da19d52a3d3', 
       'path': 'full/57fdec7b0d3b2c0a146ea77937c776994f631a4a.jpg', 
       'url': 'http://imgs.xkcd.com/comics/grownups.png'}], 
    'title': [u'Grownups']} 
2014-12-18 19:57:52+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/1460/> (referer: http://www.xkcd.com) 
2014-12-18 19:57:52+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/smfw.png> referred in <None> 
2014-12-18 19:57:52+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/1460/> 
    {'image_urls': [u'http://imgs.xkcd.com/comics/smfw.png'], 
    'images': [{'checksum': '705b029ffbdb7f2306ccb593426392fd', 
       'path': 'full/93805911ad95e7f5c2f93a6873a2ae36c0d00f86.jpg', 
       'url': 'http://imgs.xkcd.com/comics/smfw.png'}], 
    'title': [u'SMFW']} 
2014-12-18 19:57:52+1300 [xkcdimages] INFO: Closing spider (finished) 
2014-12-18 19:57:52+1300 [xkcdimages] INFO: Dumping Scrapy stats: 
    {'downloader/request_bytes': 2173, 
    'downloader/request_count': 9, 
    'downloader/request_method_count/GET': 9, 
    'downloader/response_bytes': 26587, 
    'downloader/response_count': 9, 
    'downloader/response_status_count/200': 9, 
    'file_count': 7, 
    'file_status_count/uptodate': 7, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2014, 12, 18, 6, 57, 52, 133428), 
    'item_scraped_count': 8, 
    'log_count/DEBUG': 27, 
    'log_count/INFO': 7, 
    'offsite/domains': 1, 
    'offsite/filtered': 1, 
    'request_depth_max': 1, 
    'response_received_count': 9, 
    'scheduler/dequeued': 9, 
    'scheduler/dequeued/memory': 9, 
    'scheduler/enqueued': 9, 
    'scheduler/enqueued/memory': 9, 
    'start_time': datetime.datetime(2014, 12, 18, 6, 57, 43, 153440)} 
2014-12-18 19:57:52+1300 [xkcdimages] INFO: Spider closed (finished) 

Answer


You need to set the follow parameter to True in your crawling rules. Try something like this:

linkextractor = LinkExtractor(allow=('\d+'), unique=True) 
rules = [Rule(linkextractor, callback='parse_xkcd', follow=True)] 
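For reference, here is a minimal sketch of the whole spider with that change applied (same item and XPaths as in the question; only the rule differs). When a Rule is given a callback, follow defaults to False, so the original spider only visited the comics linked directly from the front page; follow=True makes CrawlSpider keep extracting links from every crawled page, which is what lets it eventually reach all of /1 through /1461:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from xkcd.items import XkcdItem

class XkcdimagesSpider(CrawlSpider):
    name = "xkcdimages"
    allowed_domains = ["xkcd.com"]
    start_urls = ['http://www.xkcd.com']

    # follow=True: keep extracting links from the pages matched by this
    # rule, not only from the start_urls response.
    rules = [Rule(LinkExtractor(allow=(r'\d+',), unique=True),
                  callback='parse_xkcd', follow=True)]

    def parse_xkcd(self, response):
        image = XkcdItem()
        image['title'] = response.xpath(
            "//div[@id='ctitle']/text()").extract()
        image['image_urls'] = response.xpath(
            "//div[@id='comic']/img/@src").extract()
        return image

Note that the pages will not necessarily be crawled in numeric order; if strict sequential order matters you could generate the comic URLs yourself (for example in start_requests()), but for simply getting the complete set the follow=True rule is enough.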

That did it, thanks; I went through the documentation to work out what was happening. Cheers – Duncan


Great :) Please consider accepting the answer. –


All done, apologies, I hadn't noticed the little arrow. Thanks again. – Duncan
