Thursday, December 6, 2007

Nutch: try local search

2006/11/01
Sometimes Google cannot search a place that you still need to search, so why not try local search?

(The notes below skip Chinese-language support.)

Google "nutch" and you will find plenty of write-ups about it. Here I just note the things I ran into:
java: install J2SE (someone said it only worked for him with J2EE; just passing that on). Note: you need the JDK, which includes the JRE, NOT the JRE alone. Set JAVA_HOME.
tomcat: also part of the environment.
resin: an alternative to tomcat, especially if tomcat fails
cygwin: contains a Linux shell emulator. One good article about it is enough.
nutch: 0.8.1 (alternative: 0.7.2). NUTCH_JAVA_HOME: same as JAVA_HOME
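
The Java part of the setup above can be sketched roughly as follows. This is a minimal sketch, not a definitive recipe: the fallback JDK path is an assumption (under cygwin it would more likely be something below /cygdrive/c/), and the javac check is just one simple way to tell a JDK from a bare JRE.

```shell
# Assumed JDK location; replace with wherever your JDK is really installed.
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/default-java}"
# Point NUTCH_JAVA_HOME at the same JDK, as noted above.
export NUTCH_JAVA_HOME="$JAVA_HOME"

# A JDK ships the javac compiler; a bare JRE does not.
if [ -x "$JAVA_HOME/bin/javac" ]; then
  echo "JDK found at $JAVA_HOME"
else
  echo "no javac under $JAVA_HOME: JRE only, or wrong path" >&2
fi
```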

0.8.1: the URLs go in a flat file (e.g. abc.txt) inside a directory (e.g. def) under the root dir where nutch is unpacked
0.7.2: the URLs go directly in a flat file, e.g. def
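
For the 0.8.1 layout, creating the seed list might look like the sketch below. The names def and abc.txt come from the notes above; the URL is only a placeholder and is my assumption, so put your own start URLs there. Run it from the nutch root dir.

```shell
# 0.8.1 layout: a directory ("def") that holds a flat file ("abc.txt").
mkdir -p def
# One URL per line. Placeholder URL; replace it with the site you want crawled.
printf '%s\n' 'http://lucene.apache.org/nutch/' > def/abc.txt
# Show what the crawler will start from.
cat def/abc.txt
```

For 0.7.2 the same line would go straight into a flat file named def, with no directory around it.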

conf/crawl-urlfilter.txt: a line starting with + includes the URLs that match its pattern; a line starting with - excludes the URLs that match it
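
A small filter in that style could look like the sketch below. It writes a sample file next to the real one rather than overwriting anything; the sample file name and the example.com domain are assumptions for illustration only. As far as I know, the first pattern that matches a URL decides its fate, so the catch-all goes last.

```shell
# Write a sample filter instead of touching the real conf/crawl-urlfilter.txt.
# example.com is a placeholder domain, not one from these notes.
cat > crawl-urlfilter.sample.txt <<'EOF'
# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# accept anything on example.com and its subdomains
+^http://([a-z0-9]*\.)*example\.com/
# skip everything else
-.
EOF
cat crawl-urlfilter.sample.txt
```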

0.8.1: conf/nutch-site.xml

In cygwin, cd to the nutch root dir and run: bin/nutch crawl def -dir just-test-storage-house -depth 3 >& a.log
(note: just-test-storage-house/ must not exist before this command runs)
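
The crawl step, together with the "must not exist" rule, can be wrapped in a small helper. This is only a sketch: run_crawl is a hypothetical name of mine, not something nutch ships, and it assumes you are already in the nutch root dir.

```shell
# run_crawl SEEDS OUTDIR [DEPTH]  (hypothetical helper, not part of nutch)
run_crawl() {
  seeds="$1"; out="$2"; depth="${3:-3}"
  # The output dir must not exist before the crawl starts.
  if [ -e "$out" ]; then
    echo "refusing to crawl: $out already exists" >&2
    return 1
  fi
  # "> a.log 2>&1" is the portable spelling of bash's ">& a.log".
  bin/nutch crawl "$seeds" -dir "$out" -depth "$depth" > a.log 2>&1
}

# Usage, from the nutch root dir:
# run_crawl def just-test-storage-house 3
```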

If it went OK (check a.log), deploy tomcat or resin to show the UI.


tomcat: do not use the NT/2K/XP tray icon (the GUI interface). The command line is enough, and the GUI causes confusion.
delete the ROOT webapp
copy the .war file there and rename it to ROOT.war
start tomcat, and ROOT.war will be extracted automatically. Move the .war away. Stop tomcat.
edit ROOT/WEB-INF/classes/nutch-site.xml (OR no need: just make the just-test-storage-house dir created by the previous command the current working directory)
start tomcat. In a browser, http://localhost:8080/ should show the nutch search page; anything else is an error.
(or perhaps http://localhost:8080/search.jsp ?)
As for me, I could only see the default page, so I tried resin.
resin: all the same as tomcat, plus: change the Xerces settings in conf/resin.conf
then, if you see an error such as a missing en/include/footer.html, search for that file under resin and copy it.
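
The tomcat steps above can be sketched as two small helpers. Everything named here is an assumption to adjust: the CATALINA_HOME default, the war file name, and the helper names themselves are mine. searcher.dir is, to my knowledge, the nutch property that tells the search page where the crawl output lives.

```shell
# Hypothetical locations; change both to match your machine.
CATALINA_HOME="${CATALINA_HOME:-$HOME/tomcat}"
NUTCH_WAR="${NUTCH_WAR:-nutch-0.8.1.war}"

# Steps 1-2: delete ROOT, then copy the war in as ROOT.war.
deploy_nutch_war() {
  webapps="$CATALINA_HOME/webapps"
  [ -f "$NUTCH_WAR" ] || { echo "war file not found: $NUTCH_WAR" >&2; return 1; }
  mkdir -p "$webapps"
  rm -rf "$webapps/ROOT"               # remove the stock ROOT application
  cp "$NUTCH_WAR" "$webapps/ROOT.war"  # tomcat unpacks ROOT.war on startup
}

# Step 4: point the unpacked webapp at the crawl output directory.
# Run only after tomcat has unpacked ROOT.war and been stopped again.
# Note: this overwrites the webapp's nutch-site.xml.
point_search_at_crawl() {
  crawl_dir="$1"
  classes="$CATALINA_HOME/webapps/ROOT/WEB-INF/classes"
  [ -d "$classes" ] || { echo "not unpacked yet: $classes" >&2; return 1; }
  cat > "$classes/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>$crawl_dir</value>
  </property>
</configuration>
EOF
}
```

In between the two calls you still start and stop tomcat by hand from the command line, and the path passed to point_search_at_crawl should be the full path of just-test-storage-house.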

It now works with the test URL given in "Nutch 使用之锋芒初试",
but it fails with http://sb.blog.sohu.com/
It also fails with the tasktracker system I work with: 401 error. The Nutch wiki explains that authorization will be supported in the future.
