
Nutch crawling not working for a particular URL

I am using Apache Nutch for crawling. When I crawl the page http://www.google.co.in, it crawls the page correctly and returns results. But when I add a parameter to that URL, it gets no results for http://www.google.co.in/search?q=bill+gates.

solrUrl is not set, indexing will be skipped... 
crawl started in: crawl 
rootUrlDir = urls 
threads = 10 
depth = 3 
solrUrl=null 
topN = 100 
Injector: starting at 2013-05-27 08:01:57 
Injector: crawlDb: crawl/crawldb 
Injector: urlDir: urls 
Injector: Converting injected urls to crawl db entries. 
Injector: total number of urls rejected by filters: 0 
Injector: total number of urls injected after normalization and filtering: 1 
Injector: Merging injected urls into crawl db. 
Injector: finished at 2013-05-27 08:02:11, elapsed: 00:00:14 
Generator: starting at 2013-05-27 08:02:11 
Generator: Selecting best-scoring urls due for fetch. 
Generator: filtering: true 
Generator: normalizing: true 
Generator: topN: 100 
Generator: jobtracker is 'local', generating exactly one partition. 
Generator: Partitioning selected urls for politeness. 
Generator: segment: crawl/segments/20130527080219 
Generator: finished at 2013-05-27 08:02:26, elapsed: 00:00:15 
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property. 
Fetcher: starting at 2013-05-27 08:02:26 
Fetcher: segment: crawl/segments/20130527080219 
Using queue mode : byHost 
Fetcher: threads: 10 
Fetcher: time-out divisor: 2 
QueueFeeder finished: total 1 records + hit by time limit :0 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
fetching http://www.google.co.in/search?q=bill+gates 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
Fetcher: throughput threshold: -1 
Fetcher: throughput threshold retries: 5 
-finishing thread FetcherThread, activeThreads=8 
-finishing thread FetcherThread, activeThreads=7 
-finishing thread FetcherThread, activeThreads=1 
-finishing thread FetcherThread, activeThreads=2 
-finishing thread FetcherThread, activeThreads=3 
-finishing thread FetcherThread, activeThreads=4 
-finishing thread FetcherThread, activeThreads=5 
-finishing thread FetcherThread, activeThreads=6 
-finishing thread FetcherThread, activeThreads=1 
-finishing thread FetcherThread, activeThreads=0 
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0 
-activeThreads=0 
Fetcher: finished at 2013-05-27 08:02:33, elapsed: 00:00:07 
ParseSegment: starting at 2013-05-27 08:02:33 
ParseSegment: segment: crawl/segments/20130527080219 
ParseSegment: finished at 2013-05-27 08:02:40, elapsed: 00:00:07 
CrawlDb update: starting at 2013-05-27 08:02:40 
CrawlDb update: db: crawl/crawldb 
CrawlDb update: segments: [crawl/segments/20130527080219] 
CrawlDb update: additions allowed: true 
CrawlDb update: URL normalizing: true 
CrawlDb update: URL filtering: true 
CrawlDb update: 404 purging: false 
CrawlDb update: Merging segment data into db. 
CrawlDb update: finished at 2013-05-27 08:02:54, elapsed: 00:00:13 
Generator: starting at 2013-05-27 08:02:54 
Generator: Selecting best-scoring urls due for fetch. 
Generator: filtering: true 
Generator: normalizing: true 
Generator: topN: 100 
Generator: jobtracker is 'local', generating exactly one partition. 
Generator: 0 records selected for fetching, exiting ... 
Stopping at depth=1 - no more URLs to fetch. 
LinkDb: starting at 2013-05-27 08:03:01 
LinkDb: linkdb: crawl/linkdb 
LinkDb: URL normalize: true 
LinkDb: URL filter: true 
LinkDb: internal links will be ignored. 
LinkDb: adding segment: file:/home/muthu/workspace/webcrawler/crawl/segments/20130527080219 
LinkDb: finished at 2013-05-27 08:03:08, elapsed: 00:00:07 
crawl finished: crawl 
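For reference, a one-step crawl invocation that would produce a log like the one above might look roughly as follows in Nutch 1.x. The "urls" seed directory, "crawl" output directory, thread count, depth and topN are taken from the log; the exact option syntax depends on the Nutch version, so treat this as a sketch rather than the command that was actually run:

bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 100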

I have already added this code:

# skip URLs containing certain characters as probable queries, etc. 
-.*[?*!@=].* 
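That is the stock rule from conf/regex-urlfilter.txt, and with it active any URL containing '?', '*', '!', '@' or '=' is rejected before fetching. As a minimal sketch (assuming the default Nutch 1.x filter file, where rules are applied top-down and the first match wins), query URLs like this one can be let through by adding an explicit accept rule above the catch-all reject:

# accept the search URL explicitly; this must appear before the
# general reject rule below, since the first matching rule wins
+^http://www\.google\.co\.in/search
# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*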

Why is this happening? Can I fetch such URLs if I add a parameter? Thanks in advance for your help.

Answer

The Nutch crawler obeys robots.txt, and if you look at the robots.txt file located at http://www.google.co.in/robots.txt, you will find that /search is disallowed for crawling.
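As a quick check outside of Nutch, a minimal sketch using Python's standard urllib.robotparser can confirm that a generic crawler is not allowed to fetch that /search URL under Google's robots.txt:

import urllib.robotparser

# Fetch and parse the site's robots.txt, then ask whether a generic
# user agent ("*") may fetch the query URL from the question.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.google.co.in/robots.txt")
rp.read()
print(rp.can_fetch("*", "http://www.google.co.in/search?q=bill+gates"))  # expected: False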

Is there any way to allow /search anyway? – muthu

Not unless you write something yourself to do it. However, that would violate the Google TOS. – abhinav

But in this question (http://stackoverflow.com/questions/11842913/apache-nutch-dont-crawl-website) they said you can crawl by disabling the robots check. – muthu