hadoop - Difference between Nutch crawl giving depth='N' and crawling in loop N times with depth='1'


Background of my problem: I'm running Nutch 1.4 on Hadoop 0.20.203. There is a series of MapReduce jobs that I perform on the Nutch segments to get the final output. But waiting for the whole crawl to finish before running MapReduce makes the solution take much longer, so I am now triggering the MapReduce jobs on the segments as soon as they are dumped. I run the crawl in a loop ('N = depth' times), giving depth = 1 each time. However, I am losing some URLs when I crawl in a loop with depth 1 compared to a single crawl with depth = N.

Please find the pseudo code below:

Case 1: Nutch crawl with depth = 3

// Create the list object to store the arguments which we are going to pass to Nutch

List<String> nutchArgsList = new ArrayList<String>();

nutchArgsList.add("-depth");

nutchArgsList.add(Integer.toString(3));

<...other nutch args...>

ToolRunner.run(nutchConf, new Crawl(), nutchArgsList.toArray(new String[nutchArgsList.size()]));

Case 2: Crawling in a loop 3 times with depth = 1

for (int depthRun = 0; depthRun < 3; depthRun++) {

// Create the list object to store the arguments which we are going to pass to Nutch

List<String> nutchArgsList = new ArrayList<String>();

nutchArgsList.add ("- Depth");

nutchArgsList.add(Integer.toString(1)); // Note: I have given depth as 1 here

& lt; ... other nutch args ... & gt;

ToolRunner.run(nutchConf, new Crawl(), nutchArgsList.toArray(new String[nutchArgsList.size()]));

}

I am losing some URLs (db_unfetched) when I crawl in a loop as many times as the depth value.

I have tried this on standalone Nutch, running once with depth 3 versus running 3 times over the same URLs with depth 1. I compared the resulting crawldbs, and the difference is only 12 URLs. But when I do the same on Hadoop using ToolRunner, I get around 1000 URLs as db_unfetched.
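(For reference, the per-status counts in a crawldb can be dumped with Nutch's readdb tool; crawl/crawldb below is only a placeholder for the actual crawldb path:

bin/nutch readdb crawl/crawldb -stats

The output includes a db_unfetched count, which makes the standalone and Hadoop runs easy to compare.)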

As far as I have understood so far, Nutch triggers the crawl in a loop as many times as the depth value. Please suggest.
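For context, a depth = N crawl in Nutch 1.x is itself a loop over the generate/fetch/parse/updatedb cycle. The sketch below is a simplified paraphrase of the Crawl tool, with variable names abbreviated, not its exact source:

// Paraphrase of the Nutch 1.x Crawl tool (simplified, not the exact source)
injector.inject(crawlDb, rootUrlDir);       // seed the crawldb with the root URLs
for (int i = 0; i < depth; i++) {
    // the generator applies -topN and score filtering on every iteration
    Path[] segs = generator.generate(crawlDb, segments, -1, topN, System.currentTimeMillis());
    if (segs == null) break;                // nothing left to fetch
    fetcher.fetch(segs[0], threads);        // fetch the new segment
    parseSegment.parse(segs[0]);            // parse it (if not parsed while fetching)
    crawlDbTool.update(crawlDb, segs, true, true); // merge results back into the crawldb
}

So a single depth = 3 crawl and three depth = 1 crawls run the same cycle the same number of times; any difference in the resulting crawldbs comes from what the generator selects (or filters out) at each iteration.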

Also, please let me know why the difference is so large when I do the same thing on Hadoop using ToolRunner versus on standalone Nutch.

Thanks in advance.

I have found that the behaviour of Nutch changes between running standalone (directly off the hard disk) and running on a Hadoop cluster. The generator score filtering appears to be much more aggressive on a Hadoop cluster, so the "-topN" setting needs to be high enough.

I recommend running your crawls with a high "-topN" (at least 1000). This is similar to my answer to a related question.

After doing this, I found that my standalone and HDFS Nutch crawls started to match much more closely.
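Applied to the pseudo code in the question, that means adding "-topN" when building the argument list; a minimal sketch (1000 is just the suggested floor, tune it for your crawl):

List<String> nutchArgsList = new ArrayList<String>();
nutchArgsList.add("-depth");
nutchArgsList.add(Integer.toString(1));
nutchArgsList.add("-topN");
nutchArgsList.add(Integer.toString(1000)); // at least 1000, as recommended above
<...other nutch args...>
ToolRunner.run(nutchConf, new Crawl(), nutchArgsList.toArray(new String[nutchArgsList.size()]));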
