
I am trying to pass the data crawled by the Nutch web crawler to the Solr search and indexing platform with the following command:

bin/nutch index -Dsolr.server.url=http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/ -dir crawl/segments/20161124145935/ crawl/segments/20161124150145/ -filter -normalize 

But I get the following error:

The input path at segments is not a segment... skipping 
The input path at content is not a segment... skipping 
The input path at crawl_fetch is not a segment... skipping 
Skipping segment: file:/Users/cell/Desktop/usi/information-retrieval/project/apache-nutch-1.12/crawl/segments/20161124145935/crawl_generate. Missing sub directories: parse_data, parse_text, crawl_parse, crawl_fetch 
The input path at crawl_parse is not a segment... skipping 
The input path at parse_data is not a segment... skipping 
The input path at parse_text is not a segment... skipping 
Segment dir is complete: crawl/segments/20161124150145. 
Indexer: starting at 2016-11-25 05:02:17 
Indexer: deleting gone documents: false 
Indexer: URL filtering: true 
Indexer: URL normalizing: true 
Active IndexWriters : 
SOLRIndexWriter 
    solr.server.url : URL of the SOLR instance 
    solr.zookeeper.hosts : URL of the Zookeeper quorum 
    solr.commit.size : buffer size when sending to SOLR (default 1000) 
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml) 
    solr.auth : use authentication (default false) 
    solr.auth.username : username for authentication 
    solr.auth.password : password for authentication 


Indexing 250/250 documents 
Deleting 0 documents 
Indexing 250/250 documents 
Deleting 0 documents 
Indexer: java.io.IOException: Job failed! 
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836) 
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145) 
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) 
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237) 

Here is the log from Nutch:

2016-11-25 06:05:03,378 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
2016-11-25 06:05:03,500 WARN segment.SegmentChecker - The input path at segments is not a segment... skipping 
2016-11-25 06:05:03,506 WARN segment.SegmentChecker - The input path at content is not a segment... skipping 
2016-11-25 06:05:03,506 WARN segment.SegmentChecker - The input path at crawl_fetch is not a segment... skipping 
2016-11-25 06:05:03,507 WARN segment.SegmentChecker - Skipping segment: file:/Users/cell/Desktop/usi/information-retrieval/project/apache-nutch-1.12/crawl/segments/20161124145935/crawl_generate. Missing sub directories: parse_data, parse_text, crawl_parse, crawl_fetch 
2016-11-25 06:05:03,507 WARN segment.SegmentChecker - The input path at crawl_parse is not a segment... skipping 
2016-11-25 06:05:03,507 WARN segment.SegmentChecker - The input path at parse_data is not a segment... skipping 
2016-11-25 06:05:03,507 WARN segment.SegmentChecker - The input path at parse_text is not a segment... skipping 
2016-11-25 06:05:03,509 INFO segment.SegmentChecker - Segment dir is complete: crawl/segments/20161124150145. 
2016-11-25 06:05:03,510 INFO indexer.IndexingJob - Indexer: starting at 2016-11-25 06:05:03 
2016-11-25 06:05:03,512 INFO indexer.IndexingJob - Indexer: deleting gone documents: false 
2016-11-25 06:05:03,512 INFO indexer.IndexingJob - Indexer: URL filtering: true 
2016-11-25 06:05:03,512 INFO indexer.IndexingJob - Indexer: URL normalizing: true 
2016-11-25 06:05:03,614 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter 
2016-11-25 06:05:03,615 INFO indexer.IndexingJob - Active IndexWriters : 
SOLRIndexWriter 
    solr.server.url : URL of the SOLR instance 
    solr.zookeeper.hosts : URL of the Zookeeper quorum 
    solr.commit.size : buffer size when sending to SOLR (default 1000) 
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml) 
    solr.auth : use authentication (default false) 
    solr.auth.username : username for authentication 
    solr.auth.password : password for authentication 


2016-11-25 06:05:03,616 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb 
2016-11-25 06:05:03,616 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb 
2016-11-25 06:05:03,617 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161124150145 
2016-11-25 06:05:04,006 WARN conf.Configuration - file:/tmp/hadoop-cell/mapred/staging/cell1463380038/.staging/job_local1463380038_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 
2016-11-25 06:05:04,010 WARN conf.Configuration - file:/tmp/hadoop-cell/mapred/staging/cell1463380038/.staging/job_local1463380038_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 
2016-11-25 06:05:04,088 WARN conf.Configuration - file:/tmp/hadoop-cell/mapred/local/localRunner/cell/job_local1463380038_0001/job_local1463380038_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 
2016-11-25 06:05:04,090 WARN conf.Configuration - file:/tmp/hadoop-cell/mapred/local/localRunner/cell/job_local1463380038_0001/job_local1463380038_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 
2016-11-25 06:05:04,258 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off 
2016-11-25 06:05:04,272 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default 
2016-11-25 06:05:08,950 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default 
2016-11-25 06:05:09,344 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default 
2016-11-25 06:05:09,734 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default 
2016-11-25 06:05:10,908 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default 
2016-11-25 06:05:11,376 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default 
2016-11-25 06:05:11,686 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter 
2016-11-25 06:05:11,775 INFO solr.SolrMappingReader - source: content dest: content 
2016-11-25 06:05:11,775 INFO solr.SolrMappingReader - source: title dest: title 
2016-11-25 06:05:11,775 INFO solr.SolrMappingReader - source: host dest: host 
2016-11-25 06:05:11,775 INFO solr.SolrMappingReader - source: segment dest: segment 
2016-11-25 06:05:11,775 INFO solr.SolrMappingReader - source: boost dest: boost 
2016-11-25 06:05:11,775 INFO solr.SolrMappingReader - source: digest dest: digest 
2016-11-25 06:05:11,775 INFO solr.SolrMappingReader - source: tstamp dest: tstamp 
2016-11-25 06:05:11,940 INFO solr.SolrIndexWriter - Indexing 250/250 documents 
2016-11-25 06:05:11,940 INFO solr.SolrIndexWriter - Deleting 0 documents 
2016-11-25 06:05:12,139 INFO solr.SolrIndexWriter - Indexing 250/250 documents 
2016-11-25 06:05:12,139 INFO solr.SolrIndexWriter - Deleting 0 documents 
2016-11-25 06:05:12,207 WARN mapred.LocalJobRunner - job_local1463380038_0001 
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html> 
<head> 
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/> 
<title>Error 404 Not Found</title> 
</head> 
<body><h2>HTTP ERROR 404</h2> 
<p>Problem accessing /solr/update. Reason: 
<pre> Not Found</pre></p> 
</body> 
</html> 

    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) 
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529) 
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html> 
<head> 
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/> 
<title>Error 404 Not Found</title> 
</head> 
<body><h2>HTTP ERROR 404</h2> 
<p>Problem accessing /solr/update. Reason: 
<pre> Not Found</pre></p> 
</body> 
</html> 

    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:543) 
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241) 
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230) 
    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1220) 
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:209) 
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:173) 
    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85) 
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50) 
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41) 
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493) 
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:422) 
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:367) 
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56) 
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444) 
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392) 
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319) 
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 
2016-11-25 06:05:12,293 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed! 
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836) 
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145) 
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) 
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237) 

I have not created a core or collection, but I am also not sure exactly what that means for this command that passes data to Solr...
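One way to check which cores the server actually has (this is plain Solr, nothing Nutch-specific; the port is the default 8983) is the cores admin API:

curl "http://localhost:8983/solr/admin/cores?action=STATUS"

An empty status list in the response would confirm that no core exists yet.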

Since I am very new to Nutch and Solr, this is hard to debug...


You should also specify the name of the Solr core. – Shafiq
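For example, with a hypothetical core named nutch, the comment above would mean pointing the indexer at the core rather than at the bare server root:

-Dsolr.server.url=http://localhost:8983/solr/nutch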

Answers


The log shows the error: since you have not created any core/collection, the SolrJ library complains that it cannot find the /solr/update handler, which makes the indexing job fail. Just create a core/collection and update the Solr URL that you pass to the bin/crawl script. To run your first crawl, follow these steps: https://wiki.apache.org/nutch/NutchTutorial.
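A minimal sketch of that fix, assuming a standalone Solr install (the bin/solr create command exists in Solr 5+) and an arbitrary example core name of nutch — any name works as long as the URL matches it:

# From the Solr install directory: create a core named "nutch"
bin/solr create -c nutch

# Re-run the indexer with the core name appended to the URL; pointing
# -dir at the segments parent directory also avoids the
# "not a segment... skipping" warnings seen above
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch \
  crawl/crawldb/ -linkdb crawl/linkdb/ \
  -dir crawl/segments/ -filter -normalize

The NutchTutorial linked above also walks through copying Nutch's schema.xml into the new core's conf/ directory, so that the mapped fields (title, host, segment, ...) exist on the Solr side.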


See this link. I ran into the same problem you did; that step-by-step process will definitely work.