Crawlera not working with Scrapy, downloader not working

I was trying to implement the Common Practices in Scrapy, so I attempted to integrate the crawlera library.

I installed and configured Crawlera as described here. (I can see the scrapylib library on my system by running help('modules').)

This is my settings.py for Scrapy:

BOT_NAME = 'cnn' 

SPIDER_MODULES = ['cnn.spiders'] 
NEWSPIDER_MODULE = 'cnn.spiders' 
COOKIES_ENABLED = False 
DOWNLOADER_MIDDLEWARES = { 
    'scrapylib.crawlera.CrawleraMiddleware': 600, 
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None, 
} 
CRAWLERA_ENABLED = True 
CRAWLERA_USER = 'abc' 
CRAWLERA_PASS = '[email protected]' 
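
Since CRAWLERA_USER and CRAWLERA_PASS are presumably sent as ordinary Basic proxy credentials (standard HTTP proxy authentication, which is what a 407 response negotiates), one sanity check is to build the Proxy-Authorization value by hand and inspect it for stray characters. A minimal sketch, assuming Python 2 to match Scrapy 0.20; the middleware's actual internals may differ:

from base64 import b64encode

user = 'abc'
password = '[email protected]'
# Basic proxy auth is just base64("user:password"); a stray space or a
# wrong character here is enough to make the proxy answer 407.
print('Proxy-Authorization: Basic ' + b64encode('%s:%s' % (user, password)))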

But when I run the spider, nothing happens.

I can see in my Scrapy log that CrawleraMiddleware is loaded:

2013-12-23 20:12:54+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CrawleraMiddleware, ChunkedTransferMiddleware, DownloaderStats 

Why isn't it crawling?

This is the log with Crawlera enabled:

2013-12-23 21:58:14+0530 [scrapy] INFO: Scrapy 0.20.2 started (bot: cnn) 
2013-12-23 21:58:14+0530 [scrapy] DEBUG: Optional features available: ssl, http11 
2013-12-23 21:58:14+0530 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'cnn.spiders', 'FEED_URI': 'news.json', 'MEMDEBUG_ENABLED': True, 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['cnn.spiders'], 'BOT_NAME': 'cnn', 'DOWNLOAD_TIMEOUT': 240, 'COOKIES_ENABLED': False, 'FEED_FORMAT': 'json', 'MEMUSAGE_REPORT': True, 'REDIRECT_ENABLED': False, 'MEMUSAGE_ENABLED': True} 
2013-12-23 21:58:14+0530 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, MemoryDebugger, SpiderState 
2013-12-23 21:58:14+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CrawleraMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2013-12-23 21:58:14+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2013-12-23 21:58:14+0530 [scrapy] DEBUG: Enabled item pipelines: 
2013-12-23 21:58:14+0530 [cnn] INFO: Spider opened 
2013-12-23 21:58:14+0530 [cnn] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2013-12-23 21:58:14+0530 [cnn] INFO: Using crawlera at http://proxy.crawlera.com:8010 (user: xmpirate) 
2013-12-23 21:58:14+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023 
2013-12-23 21:58:14+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080 
2013-12-23 21:58:15+0530 [cnn] DEBUG: Crawled (407) <GET http://www.example1.com> (referer: None) 
2013-12-23 21:58:15+0530 [cnn] DEBUG: Crawled (407) <GET http://www.example2.com> (referer: None) 
2013-12-23 21:58:15+0530 [cnn] INFO: Closing spider (finished) 
2013-12-23 21:58:15+0530 [cnn] INFO: Dumping Scrapy stats: 
    {'downloader/request_bytes': 464, 
    'downloader/request_count': 2, 
    'downloader/request_method_count/GET': 2, 
    'downloader/response_bytes': 364, 
    'downloader/response_count': 2, 
    'downloader/response_status_count/407': 2, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2013, 12, 23, 16, 28, 15, 679961), 
    'log_count/DEBUG': 8, 
    'log_count/INFO': 4, 
    'memusage/max': 30236737536, 
    'memusage/startup': 30236737536, 
    'response_received_count': 2, 
    'scheduler/dequeued': 2, 
    'scheduler/dequeued/memory': 2, 
    'scheduler/enqueued': 2, 
    'scheduler/enqueued/memory': 2, 
    'start_time': datetime.datetime(2013, 12, 23, 16, 28, 14, 853975)} 
2013-12-23 21:58:15+0530 [cnn] INFO: Spider closed (finished) 

and this is with Crawlera disabled:

2013-12-23 22:00:45+0530 [scrapy] INFO: Scrapy 0.20.2 started (bot: cnn) 
2013-12-23 22:00:45+0530 [scrapy] DEBUG: Optional features available: ssl, http11 
2013-12-23 22:00:45+0530 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'cnn.spiders', 'FEED_URI': 'news.json', 'MEMDEBUG_ENABLED': True, 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['cnn.spiders'], 'BOT_NAME': 'cnn', 'DOWNLOAD_TIMEOUT': 240, 'COOKIES_ENABLED': False, 'FEED_FORMAT': 'json', 'MEMUSAGE_REPORT': True, 'REDIRECT_ENABLED': False, 'MEMUSAGE_ENABLED': True} 
2013-12-23 22:00:46+0530 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, MemoryDebugger, SpiderState 
2013-12-23 22:00:46+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CrawleraMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2013-12-23 22:00:46+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2013-12-23 22:00:46+0530 [scrapy] DEBUG: Enabled item pipelines: 
2013-12-23 22:00:46+0530 [cnn] INFO: Spider opened 
2013-12-23 22:00:46+0530 [cnn] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2013-12-23 22:00:46+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023 
2013-12-23 22:00:46+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080 
2013-12-23 22:00:46+0530 [cnn] DEBUG: Crawled (200) <GET http://www.example1.com> (referer: None) 
2013-12-23 22:00:47+0530 [cnn] DEBUG: Crawled (200) <GET http://www.example2.com> (referer: None) 
**Pages are crawled here** 
2013-12-23 22:01:00+0530 [cnn] INFO: Closing spider (finished) 
2013-12-23 22:01:00+0530 [cnn] INFO: Stored json feed (7 items) in: news.json 
2013-12-23 22:01:00+0530 [cnn] INFO: Dumping Scrapy stats: 
    {'downloader/request_bytes': 10151, 
    'downloader/request_count': 36, 
    'downloader/request_method_count/GET': 36, 
    'downloader/response_bytes': 762336, 
    'downloader/response_count': 36, 
    'downloader/response_status_count/200': 35, 
    'downloader/response_status_count/404': 1, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2013, 12, 23, 16, 31, 0, 376888), 
    'item_scraped_count': 7, 
    'log_count/DEBUG': 49, 
    'log_count/INFO': 4, 
    'memusage/max': 30157045760, 
    'memusage/startup': 30157045760, 
    'request_depth_max': 1, 
    'response_received_count': 36, 
    'scheduler/dequeued': 36, 
    'scheduler/dequeued/memory': 36, 
    'scheduler/enqueued': 36, 
    'scheduler/enqueued/memory': 36, 
    'start_time': datetime.datetime(2013, 12, 23, 16, 30, 46, 61019)} 
2013-12-23 22:01:00+0530 [cnn] INFO: Spider closed (finished) 

Can you show your spider code? – ajkumar25

@ajkumar25 Does this have anything to do with the spider code? I think importing the Crawlera module is what is failing here. Maybe I'm wrong. But still, if you need it, you can take any basic **CrawlSpider** example like the one **[here](http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider-example)**, or see the sketch below. –
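
For reference, a minimal sketch of such a spider against Scrapy 0.20 (the spider name, domain, and callback below are placeholders, not the asker's actual code):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class NewsSpider(CrawlSpider):
    name = 'cnn'  # placeholder; matches BOT_NAME in the settings above
    allowed_domains = ['example1.com']
    start_urls = ['http://www.example1.com']

    # Follow every extracted link and hand each page to parse_item.
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log('Crawled %s' % response.url)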

Does the spider crawl with the crawlera middleware disabled? Showing the full log with and without the crawlera middleware enabled may help debug your problem. – dangra

Answer

The 407 error code from Crawlera is an authentication error; there may be a typo in your API key, or you may not be using the right one.
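
One quick way to verify this outside of Scrapy is to send a single request through the Crawlera proxy directly. A minimal sketch, assuming the requests library is installed; httpbin.org is just a convenient echo service, and the user/password placeholders stand in for the real credentials:

import requests

# The credentials go into the proxy URL; if this also returns 407, the
# user/password is wrong, while a 200 means the Scrapy-side configuration
# is what needs fixing. Note: special characters in the password, such as
# the '@' in '[email protected]', must be percent-encoded (%40) inside the URL.
proxies = {'http': 'http://CRAWLERA_USER:[email protected]:8010'}
response = requests.get('http://httpbin.org/ip', proxies=proxies)
print(response.status_code)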

Source
