2014-02-07

Crawler4j missing outgoing links?

I am trying to crawl the Apache mailing lists with Crawler4j to fetch all archived messages. I supply a seed URL and try to collect the links to the other messages from it. However, Crawler4j does not seem to extract all of the links.

Below is the HTML of my seed page (http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3CCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3%3DAo_J8Linhpnc%2B6y7tOcxg%40mail.gmail.com%3E):

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 

<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
    <title>Re: some healthy broker disappear from zookeeper</title> 
    <link rel="stylesheet" type="text/css" href="/archives/style.css" /> 
</head> 

<body id="archives"> 
    <h1>kafka-users mailing list archives</h1> 

    <h5> 
<a href="http://mail-archives.apache.org/mod_mbox/" title="Back to the archives depot">Site index</a> &middot; <a href="/mod_mbox/kafka-users" title="Back to the list index">List index</a></h5> <table class="static" id="msgview"> 
    <thead> 
    <tr> 
    <th class="title">Message view</th> 
    <th class="nav"><a href="/mod_mbox/kafka-users/201211.mbox/%[email protected].com%3e" title="Previous by date">&laquo;</a> <a href="/mod_mbox/kafka-users/201211.mbox/date" title="View messages sorted by date">Date</a> <a href="/mod_mbox/kafka-users/201211.mbox/%[email protected].com%3e" title="Next by date">&raquo;</a> &middot; <a href="/mod_mbox/kafka-users/201211.mbox/%[email protected].com%3e" title="Previous by thread">&laquo;</a> <a href="/mod_mbox/kafka-users/201211.mbox/thread" title="View messages sorted by thread">Thread</a> <a href="/mod_mbox/kafka-users/201211.mbox/%[email protected].com%3e" title="Next by thread">&raquo;</a></th> 
    </tr> 
    </thead> 

    <tfoot> 
    <tr> 
    <th class="title"><a href="#archives">Top</a></th> 
    <th class="nav"><a href="/mod_mbox/kafka-users/201211.mbox/%[email protected].com%3e" title="Previous by date">&laquo;</a> <a href="/mod_mbox/kafka-users/201211.mbox/date" title="View messages sorted by date">Date</a> <a href="/mod_mbox/kafka-users/201211.mbox/%[email protected].com%3e" title="Next by date">&raquo;</a> &middot; <a href="/mod_mbox/kafka-users/201211.mbox/%[email protected].com%3e" title="Previous by thread">&laquo;</a> <a href="/mod_mbox/kafka-users/201211.mbox/thread" title="View messages sorted by thread">Thread</a> <a href="/mod_mbox/kafka-users/201211.mbox/%[email protected].com%3e" title="Next by thread">&raquo;</a></th> 
    </tr> 
    </tfoot> 

    <tbody> 
    <tr class="from"> 
    <td class="left">From</td> 
    <td class="right">Neha Narkhede &lt;[email protected]&gt;</td> 
    </tr> 
    <tr class="subject"> 
    <td class="left">Subject</td> 
    <td class="right">Re: some healthy broker disappear from zookeeper</td> 
    </tr> 
    <tr class="date"> 
    <td class="left">Date</td> 
    <td class="right">Tue, 20 Nov 2012 19:01:56 GMT</td> 
    </tr> 
    <tr class="contents"><td colspan="2"><pre> 
zookeeper server version is 3.3.3 is pretty buggy and has known 
session expiration and unexpected ephemeral node deletion bugs. 
Please upgrade to 3.3.4 and retry. 

Thanks, 
Neha 

On Tue, Nov 20, 2012 at 10:42 AM, Xiaoyu Wang &lt;[email protected]&gt; wrote: 
&gt; Hello everybody, 
&gt; 
&gt; We have run into this problem a few times in the past week. The symptom is 
&gt; some broker disappear from zookeeper. The broker appears to be healthy. 
&gt; After that, producers start producing lots of ZK producer cache stale log 
&gt; and stop making any progress. 
&gt; "logger.info("Try #" + numRetries + " ZK producer cache is stale. 
&gt; Refreshing it by reading from ZK again")" 
&gt; 
&gt; We are running kafka 0.7.1 and the zookeeper server version is 3.3.3. 
&gt; 
&gt; The missing broker will show up in zookeeper after we restart it. My 
&gt; question is 
&gt; 
&gt; 1. Did anyone encounter the same problem? how did you fix it? 
&gt; 2. Why producer is not making any progress? Can we make the producer 
&gt; work with those brokers that are listed in zookeeper. 
&gt; 
&gt; 
&gt; Thanks, 
&gt; 
&gt; -Xiaoyu 

</pre></td></tr> 
    <tr class="mime"> 
    <td class="left">Mime</td> 
    <td class="right"> 
<ul> 
<li><a rel="nofollow" href="/mod_mbox/kafka-users/201211.mbox/raw/%[email protected].com%3e/">Unnamed text/plain</a> (inline, None, 1037 bytes)</li> 
</ul> 
</td> 
</tr> 
    <tr class="raw"> 
    <td class="left"></td> 
    <td class="right"><a href="/mod_mbox/kafka-users/201211.mbox/raw/%[email protected].com%3e" rel="nofollow">View raw message</a></td> 
    </tr> 
    </tbody> 
    </table> 
</body> 
</html> 

These are the outgoing URLs that Crawler4j identified:

http://mail-archives.apache.org/archives/style.css 
http://mail-archives.apache.org/mod_mbox/ 
http://mail-archives.apache.org/mod_mbox/kafka-users 
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/date 
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/thread 
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3CCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3%3DAo_J8Linhpnc%2B6y7tOcxg%40mail.gmail.com%3E 
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/date 
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/thread 

However, the URLs I am actually interested in are missing:

http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%[email protected].com%3e 
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%[email protected].com%3e 
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%[email protected].com%3e 
http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%[email protected].com%3e 

What am I doing wrong? How can I get Crawler4j to extract the URLs I need?

Answer

You may have chosen the wrong seed page. I think your seed page should be:

http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/thread

and then use:

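// FILTERS is assumed to be a Pattern field of the crawler that excludes static resources.
public boolean shouldVisit(WebURL url) { 
    // The URL is lower-cased here, so the prefix below must be lower case as well
    // ("%3cca" rather than "%3cCA"); otherwise contains() can never match.
    String href = url.getURL().toLowerCase(); 
    return !FILTERS.matcher(href).matches()
            && href.contains("http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cca"); 
} 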
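
To check the URL test without running a crawl, the same logic can be pulled out into a plain static method. This is only a minimal sketch: the class name `MessageUrlFilter`, the method `isMessageUrl`, and the `FILTERS` pattern are hypothetical and not part of Crawler4j, and the sketch uses `startsWith` instead of `contains` on the assumption that only URLs on this host are of interest.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Crawler4j: the URL test from shouldVisit
// as a plain static method so it can be checked in isolation.
final class MessageUrlFilter {

    // Assumed exclusion list for static resources; adjust it to your crawl.
    private static final Pattern FILTERS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|ico)$");

    // Lower-case prefix of message pages whose encoded Message-ID starts with "<CA".
    private static final String MESSAGE_PREFIX =
            "http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/%3cca";

    static boolean isMessageUrl(String url) {
        // Lower-case first so that "%3C" and "%3c" are treated the same.
        String href = url.toLowerCase(Locale.ROOT);
        return !FILTERS.matcher(href).matches() && href.startsWith(MESSAGE_PREFIX);
    }

    public static void main(String[] args) {
        List<String> samples = List.of(
                "http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/"
                        + "%3CCAOG_4QZ-yyrcwTpRu-8eu6VoUoM3%3DAo_J8Linhpnc%2B6y7tOcxg%40mail.gmail.com%3E",
                "http://mail-archives.apache.org/mod_mbox/kafka-users/201211.mbox/thread",
                "http://mail-archives.apache.org/archives/style.css");
        for (String url : samples) {
            System.out.println(isMessageUrl(url) + "  " + url);
        }
    }
}
```

Note that a filter like this would also reject the `thread` seed page itself if it were applied to seeds. It is meant for the links discovered on that page.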

I hope that helps.
