
I am trying to crawl a website and parse only the /browse/ pages, but Scrapy seems to be processing other page types such as /ip/ as well. My code and console log are copied below. The problem is with my rules. Essentially, I want to crawl the entire site, every page type, but only parse URLs containing /browse/. How do I set up Scrapy rules so that only /browse/ pages are parsed?

Below is my log:

2014-01-28 18:11:40-0800 [newbrowsepages] DEBUG: Scraped from <200 http://www.mydomain.com/ip/Danksin-Now-Women-s-Maternity-Microfleece-Hoodie/27582877> 
    {'canonical': [u'http://www.mydomain.com/ip/Danksin-Now-Women-s-Maternity-Microfleece-Hoodie/27582877'], 
    'class_text': '', 
    'referer': 'http://www.mydomain.com/browse/apparel/activewear-for-the-family/5438_1156558/browse/apparel/activewear/5438_133284_656659/?_refineresult=true&browsein=true&povid=cat5438-env999999-moduleBR122713-lLink10UpActivewearMaternity&search_sort=6', 
    'title': [u"Danksin Now Women's Maternity Microfleece Hoodie: Maternity : mydomain.com "], 
    'url': 'http://www.mydomain.com/ip/Danksin-Now-Women-s-Maternity-Microfleece-Hoodie/27582877'} 
2014-01-28 18:11:40-0800 [newbrowsepages] DEBUG: Scraped from <200 http://www.mydomain.com/ip/FAST-TRACK-Loving-Moments-by-Leading-Lady-Maternity-Adjustable-Legging-Wear-From-Maternity-Back-To-Your-Regular-Size/29740820> 
    {'canonical': [u'http://www.mydomain.com/ip/FAST-TRACK-Loving-Moments-by-Leading-Lady-Maternity-Adjustable-Legging-Wear-From-Maternity-Back-To-Your-Regular-Size/29740820'], 
    'class_text': '', 
    'referer': 'http://www.mydomain.com/browse/apparel/activewear-for-the-family/5438_1156558/browse/apparel/activewear/5438_133284_656659/?_refineresult=true&browsein=true&povid=cat5438-env999999-moduleBR122713-lLink10UpActivewearMaternity&search_sort=6', 
    'title': [u'Loving Moments by Leading Lady Maternity Adjustable Legging: Maternity : mydomain.com '], 
    'url': 'http://www.mydomain.com/ip/FAST-TRACK-Loving-Moments-by-Leading-Lady-Maternity-Adjustable-Legging-Wear-From-Maternity-Back-To-Your-Regular-Size/29740820'} 
2014-01-28 18:11:40-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/Danskin-Now-Maternity-Microfleece-Pants-2-Pack-Value-Bundle/31022736> (referer: http://www.mydomain.com/browse/apparel/activewear-for-the-family/5438_1156558/browse/apparel/activewear/5438_133284_656659/?_refineresult=true&browsein=true&povid=cat5438-env999999-moduleBR122713-lLink10UpActivewearMaternity&search_sort=6) 
2014-01-28 18:11:40-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/Danskin-Now-Maternity-Performance-Jacket/32360420> (referer: http://www.mydomain.com/browse/apparel/activewear-for-the-family/5438_1156558/browse/apparel/activewear/5438_133284_656659/?_refineresult=true&browsein=true&povid=cat5438-env999999-moduleBR122713-lLink10UpActivewearMaternity&search_sort=6) 
2014-01-28 18:11:40-0800 [newbrowsepages] DEBUG: Scraped from <200 http://www.mydomain.com/ip/Danskin-Now-Maternity-Microfleece-Pants-2-Pack-Value-Bundle/31022736> 
    {'canonical': [u'http://www.mydomain.com/ip/Danskin-Now-Maternity-Microfleece-Pants-2-Pack-Value-Bundle/31022736'], 
    'class_text': '', 
    'referer': 'http://www.mydomain.com/browse/apparel/activewear-for-the-family/5438_1156558/browse/apparel/activewear/5438_133284_656659/?_refineresult=true&browsein=true&povid=cat5438-env999999-moduleBR122713-lLink10UpActivewearMaternity&search_sort=6', 
    'title': [u'Danskin Now Maternity Microfleece Pants, 2-Pack Value Bundle: Maternity : mydomain.com '], 
    'url': 'http://www.mydomain.com/ip/Danskin-Now-Maternity-Microfleece-Pants-2-Pack-Value-Bundle/31022736'} 

Below is my code:

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.selector import HtmlXPathSelector 
from scrapy.http import Request 
from wallspider.items import Website 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 

class Myspider(CrawlSpider): 
    name = "newbrowsepages" 
    allowed_domains = ["mydomain.com"] 
    start_urls = ["http://www.mydomain.com/"] 

    rules = ( 
        Rule(SgmlLinkExtractor(allow=('/browse/',)), 
             callback='parse_links', follow=True, 
             process_links=lambda links: [link for link in links if not link.nofollow]), 
        Rule(SgmlLinkExtractor(allow=('/browse/',), 
                               deny=('/[1-9]$', '(bti=)[1-9]+(?:\.[1-9]*)?', 
                                     '(sort_by=)[a-zA-Z]', '(sort_by=)[1-9]+(?:\.[1-9]*)?', 
                                     '(ic=32_)[1-9]+(?:\.[1-9]*)?', '(ic=60_)[0-9]+(?:\.[0-9]*)?', 
                                     '(search_sort=)[1-9]+(?:\.[1-9]*)?', 'browse-ng.do\?', 
                                     '/page/', '/ip/', 'out\+value', 'fn=', 'customer_rating', 
                                     'special_offers', 'search_sort=&', 'facet='))), 
    ) 

    def parse_start_url(self, response): 
        # Return the generator so Scrapy consumes the requests it yields. 
        return self.parse_links(response) 

    def parse_links(self, response): 
        hxs = HtmlXPathSelector(response) 
        links = hxs.select('//a') 
        domain = 'http://www.mydomain.com' 
        for link in links: 
            class_text = ''.join(link.select('./@class').extract()) 
            title = ''.join(link.select('./@title').extract())  # was @class (copy-paste bug) 
            url = ''.join(link.select('./@href').extract()) 
            # Build one meta dict; assigning meta twice discarded the title. 
            meta = {'title': title, 'class_text': class_text} 
            yield Request(domain + url, callback=self.parse_page, meta=meta) 

    def parse_page(self, response): 
        hxs = HtmlXPathSelector(response) 
        sites = hxs.select('//html') 
        item = Website() 
        for site in sites: 
            item['class_text'] = response.meta['class_text'] 
            item['url'] = response.url 
            item['title'] = site.select('/html/head/title/text()').extract() 
            item['referer'] = response.request.headers.get('Referer') 
            item['canonical'] = site.select('//head/link[@rel="canonical"]/@href').extract() 

        return item 
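For what it's worth, the /ip/ items in the log come from parse_links rather than from the rules: it yields a Request for every <a> on a matched page, so the extractor's allow/deny patterns never see those links. A minimal, dependency-free sketch of filtering the hrefs before yielding them (the helper name is hypothetical):

```python
import re

# Keep only hrefs that point at /browse/ pages, mirroring the
# allow=('/browse/',) pattern used in the rules.
BROWSE_RE = re.compile(r'/browse/')

def filter_browse_hrefs(hrefs):
    """Return only the hrefs whose path contains /browse/."""
    return [h for h in hrefs if BROWSE_RE.search(h)]

hrefs = [
    '/browse/apparel/activewear-for-the-family/5438_1156558/',
    '/ip/Danskin-Now-Maternity-Performance-Jacket/32360420',
    '/browse/baby/cribs/5427_414099_1101429',
]
print(filter_browse_hrefs(hrefs))  # the /ip/ href is dropped
```

Inside parse_links the same filter would be applied to each extracted href before yielding the Request.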

Updated log:

2014-01-28 18:44:21-0800 [newbrowsepages] DEBUG: Scraped from <200 http://www.mydomain.com/ip/Dorel-Twin-Over-Full-Metal-Black-Bunk-Bed-with-Optional-Mattresses/20690436> 
    {'canonical': [u'http://www.mydomain.com/ip/Dorel-Twin-Over-Full-Metal-Black-Bunk-Bed-with-Optional-Mattresses/20690436'], 
    'class_text': '', 
    'referer': 'http://www.mydomain.com/browse/home/teen-furniture/4044_1156136_1156142/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L102', 
    'title': [u"Dorel Twin-Over-Full Metal Black Bunk Bed with Optional Mattresses: Kids' & Teen Rooms : mydomain.com "], 
    'url': 'http://www.mydomain.com/ip/Dorel-Twin-Over-Full-Metal-Black-Bunk-Bed-with-Optional-Mattresses/20690436'} 
2014-01-28 18:44:21-0800 [newbrowsepages] DEBUG: Scraped from <200 http://www.mydomain.com/ip/Mainstays-Twin-over-Twin-Wood-Bunk-Bed-Multiple-Finishes/20563913> 
    {'canonical': [u'http://www.mydomain.com/ip/Mainstays-Twin-over-Twin-Wood-Bunk-Bed-Multiple-Finishes/20563913'], 
    'class_text': '', 
    'referer': 'http://www.mydomain.com/browse/home/teen-furniture/4044_1156136_1156142/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L102', 
    'title': [u'Shop for the Mainstays Twin Over Twin Wood Bunk Bed at mydomain.com. Save money. Live better.'], 
    'url': 'http://www.mydomain.com/ip/Mainstays-Twin-over-Twin-Wood-Bunk-Bed-Multiple-Finishes/20563913'} 
2014-01-28 18:44:21-0800 [newbrowsepages] DEBUG: Scraped from <200 http://www.mydomain.com/ip/Office-Task-Chair-with-Arms-Black/13007418> 
    {'canonical': [u'http://www.mydomain.com/ip/Office-Task-Chair-with-Arms-Black/13007418'], 
    'class_text': '', 
    'referer': 'http://www.mydomain.com/browse/home/teen-furniture/4044_1156136_1156142/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L102', 
    'title': [u'Student Task Chair with Arms - mydomain.com'], 
    'url': 'http://www.mydomain.com/ip/Office-Task-Chair-with-Arms-Black/13007418'} 
2014-01-28 18:44:22-0800 [newbrowsepages] INFO: Crawled 92 pages (at 92 pages/min), scraped 11 items (at 11 items/min) 
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/apparel/activewear-for-the-family/5438_1156558/browse/apparel/activewear/5438_133284_656659/?_refineresult=true&browsein=true&povid=cat5438-env999999-moduleBR122713-lLink10UpActivewearMaternity&search_sort=6> (referer: http://www.mydomain.com/browse/apparel/activewear-for-the-family/5438_1156558/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L127&search_sort=6) 
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/apparel/boys-shoes/5438_1045804_1045805_624079> (referer: http://www.mydomain.com/browse/apparel/baby-kids/5438_1045804_1045805/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L132) 
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/jewelry/jewelry-storage/3891_132987/?amp;ic=48_0&amp;ref=125876.183604&amp;tab_value=10874_All&catNavId=3891&povid=P1171-C1110.2784+1455.2776+1115.2956-L147> (referer: http://www.mydomain.com/) 
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/apparel/girls-shoes/5438_1045804_1045805_605881> (referer: http://www.mydomain.com/browse/apparel/baby-kids/5438_1045804_1045805/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L132) 
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/apparel/baby-toddler-shoes/5438_1045804_1045805_587407> (referer: http://www.mydomain.com/browse/apparel/baby-kids/5438_1045804_1045805/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L132) 
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/South-Shore-Smart-Basics-3-Drawer-Chest-Chocolate/12480393> (referer: http://www.mydomain.com/browse/home/teen-furniture/4044_1156136_1156142/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L102) 
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/South-Shore-Country-Double-Dresser-Cream/3921886> (referer: http://www.mydomain.com/browse/home/teen-furniture/4044_1156136_1156142/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L102) 
2014-01-28 18:44:22-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/Mainstays-Twin-Platform-Bed-with-Headboard-Cinnamon-Cherry/23735992> (referer: http://www.mydomain.com/browse/home/teen-furniture/4044_1156136_1156142/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L102) 
2014-01-28 18:44:23-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/apparel/handbags/5438_1045799_1045800_163873/?_refineresult=true&povid=cat661959-env498314-moduleB020613-lLinkPOV2_Handbags> (referer: http://www.mydomain.com/browse/apparel/bags/5438_1045799/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L135) 
2014-01-28 18:44:23-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/baby/cribs/5427_414099_1101429> (referer: http://www.mydomain.com/browse/baby/5427/?_refineresult=true&facet=customer_rating%3A4+-+5+Stars&povid=P1171-C1110.2784+1455.2776+1115.2956-L148) 
2014-01-28 18:44:23-0800 [newbrowsepages] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/Charleston-Storage-Loft-Bed-with-Desk-White/12338217> (referer: http://www.mydomain.com/browse/home/teen-furniture/4044_1156136_1156142/?_refineresult=true&povid=P1171-C1110.2784+1455.2776+1115.2956-L102) 

Can you post the unmodified log (before and after the first incorrectly crawled page)? – Blender


@Blender I updated the log and the code I used; let me know if you want to see a longer log. –


Some '/browse/' URLs redirect to '/ip/' ones. Look at the 'referer:' lines above. You'll need to find which ones do that and exclude them. – Blender
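If redirects are indeed the cause, the rules filter the request URL, so a /browse/ request that redirects to an /ip/ page still reaches the callback. One hedged workaround is to check the final response URL in the callback; a dependency-free sketch (is_ip_page is a hypothetical helper):

```python
def is_ip_page(url):
    """True if the final URL landed on an /ip/ product page."""
    return '/ip/' in url

# Inside the spider callback this would look like:
# def parse_page(self, response):
#     if is_ip_page(response.url):
#         return  # skip pages that redirected from /browse/ to /ip/
#     ...

print(is_ip_page('http://www.mydomain.com/ip/Foo-Bar/123'))
print(is_ip_page('http://www.mydomain.com/browse/apparel/5438'))
```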

Answer


What happens if you merge these two rules?

rules = ( 
    Rule( 
        SgmlLinkExtractor( 
            allow=('/browse/',), 
            deny=('/[1-9]$', 
                  '(bti=)[1-9]+(?:\.[1-9]*)?', 
                  '(sort_by=)[a-zA-Z]', 
                  '(sort_by=)[1-9]+(?:\.[1-9]*)?', 
                  '(ic=32_)[1-9]+(?:\.[1-9]*)?', 
                  '(ic=60_)[0-9]+(?:\.[0-9]*)?', 
                  '(search_sort=)[1-9]+(?:\.[1-9]*)?', 
                  'browse-ng.do\?', 
                  '/page/', 
                  '/ip/', 
                  'out\+value', 
                  'fn=', 
                  'customer_rating', 
                  'special_offers', 
                  'search_sort=&', 
                  'facet='), 
        ), 
        follow=True, 
        process_links=lambda links: [ 
            link for link in links if not link.nofollow], 
        callback='parse_page'), 
) 
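Merging the rules matters because a CrawlSpider hands each extracted link to the first rule whose extractor matches it; with two rules, the allow-only rule claims every /browse/ link first and the deny list in the second rule is never consulted. As a sanity check, a subset of the deny patterns above can be exercised with plain re (dependency-free sketch, pattern list abbreviated):

```python
import re

# A few of the deny patterns from the merged rule, checked directly.
deny_patterns = [r'/ip/', r'/page/', r'browse-ng\.do\?', r'facet=']

def is_denied(url):
    """True if any deny pattern matches the URL."""
    return any(re.search(p, url) for p in deny_patterns)

print(is_denied('http://www.mydomain.com/ip/Office-Task-Chair/13007418'))   # True
print(is_denied('http://www.mydomain.com/browse/baby/cribs/5427_414099'))   # False
```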

`rules = ( Rule(SgmlLinkExtractor(allow=('/browse/',)), callback='parse_links', follow=True, process_links=lambda links: [link for link in links if not link.nofollow]), Rule(SgmlLinkExtractor(allow=(), deny=('/[1-9]$', '(bti=)[1-9]+(?:\.[1-9]*)?', '(sort_by=)[a-zA-Z]', '(sort_by=)[1-9]+(?:\.[1-9]*)?', '(ic=32_)[1-9]+(?:\.[1-9]*)?', '(ic=60_)[0-9]+(?:\.[0-9]*)?', '(search_sort=)[1-9]+(?:\.[1-9]*)?', 'browse-ng.do\?', '/page/', '/ip/', 'out\+value', 'fn=', 'customer_rating', 'special_offers', 'search_sort=&', 'facet='))), )` –


The rule above is the one I would use. I think I need two rules so that it crawls the entire site but only parses the /browse/ pages. In the second rule I used allow=('/browse/',) because it made it easier to produce a console log demonstrating my problem. I'll try the rules as you suggest and report my results. Appreciate your time –


I tried combining the rules and it runs into the same problem. –
