SearchEngine in a Box using Combine/Zebra

SearchEngine in a Box grew out of development in the EU project ALVIS (IST-1-002068-STP), with the help of .SE:s Internetfond. It is based on two systems: the Combine Focused Crawler and the Zebra text indexing and retrieval engine. The system lets you build a vertical search engine for your favorite topic in just 5 easy steps. Before that, you have to install the system on your machine (or you can try it out online before installing).

Installation and testing instructions

  1. Edit /etc/apt/sources.list and add
    deb http://combine.it.lth.se/ debian/
    deb http://ftp.indexdata.dk/debian sarge main
    deb-src http://ftp.indexdata.dk/debian sarge main
    
  2. Get the crawler, indexer and XSLT tools. Run:
    sudo apt-get update
    sudo apt-get install combine idzebra2.0 yaz xsltproc
    
    Make sure you have combine version 3.4 or later.
  3. Download the 'SearchEngine in a Box' system, unpack it, and change to the directory where the software was unpacked. Run:
    tar zxf SEbox.tgz
    cd SearchEngineBox
    
  4. Initialize crawler for simple test. Run:
    sudo combineINIT --jobname atest
    combineCtrl --jobname atest load < seeds.txt
    
  5. Change to the Zebra configuration directory and run make Combine:
    cd ZebraConf
    make Combine
    
  6. Tell Zebra where it should run. Edit ZebraConf.xml and change
        <host>ldbkit06</host>
        <port>3003</port>
    
    to the host you are running on and your preferred port.
  7. Tell the crawler where the indexer is. Edit /etc/combine/atest/combine.cfg and add
    ZebraHost = <host>:<port>
    
    at the end. For example, with the original ZebraConf.xml the line would be
    ZebraHost = ldbkit06:3003
    
  8. Generate the Zebra configuration. Run:
    make rmConfs
    make
    
  9. Start the Zebra indexing and database server. Run:
    rm -f server.log
    zebrasrv -f yazserver.xml -l server.log &
    
  10. You might consider copying the simple UI to a web server (see the instructions at the end of the README file in this directory).
  11. Test it all by starting the simple test crawl. Run:
    combineCtrl --jobname atest start
    
    You should see activity in the Zebra log, ZebraConf/server.log.
  12. Test searching your new database. Use either or both of these possibilities
  13. Kill the crawler and the Zebra server. Run:
    combineCtrl --jobname atest kill
    kill `cat lock/zebrasrv.pid`
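
Steps 6 and 7 must agree on the host and port. As a minimal sketch, assuming ZebraConf.xml holds one <host> and one <port> element, each on a single line as in the excerpt in step 6, the shell functions below derive the ZebraHost line from ZebraConf.xml and append it to a configuration file. The function names zebra_host_line and set_zebra_host are hypothetical; they are not part of Combine or Zebra.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Combine or Zebra.

# Print "ZebraHost = <host>:<port>" built from the first <host> and <port>
# elements found in the given ZebraConf.xml.
# Assumes each element sits on a single line.
zebra_host_line() {
    conf=$1
    host=$(sed -n 's:.*<host>\(.*\)</host>.*:\1:p' "$conf" | head -n 1)
    port=$(sed -n 's:.*<port>\(.*\)</port>.*:\1:p' "$conf" | head -n 1)
    if [ -z "$host" ] || [ -z "$port" ]; then
        echo "host or port not found in $conf" >&2
        return 1
    fi
    printf 'ZebraHost = %s:%s\n' "$host" "$port"
}

# Append the ZebraHost line to a Combine configuration file unless the
# exact same line is already present.
set_zebra_host() {
    conf=$1
    cfg=$2
    line=$(zebra_host_line "$conf") || return 1
    if ! grep -qxF "$line" "$cfg" 2>/dev/null; then
        printf '%s\n' "$line" >> "$cfg"
    fi
}
```

Called from the ZebraConf directory as set_zebra_host ZebraConf.xml /etc/combine/atest/combine.cfg (with permission to write that file), this would add ZebraHost = ldbkit06:3003 for the original ZebraConf.xml. A ZebraHost line with a different host or port that is already in the file is left in place, so check for stale entries by hand.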
    
Now you are ready to tailor it to your own application:

Build a vertical search engine in just 5 easy steps

So once the software is installed and tested ...
  1. Create a configuration for Zebra (see the ZebraConf directory).
  2. Configure Combine for the crawl you want. Refer to the 'Configuration' and 'Use Scenarios' sections of the Combine documentation. Specifically, you have to create a topic definition (section 'Crawler operation') for your particular topic.
  3. Create the crawler. Run:
    sudo combineINIT --jobname atest --topic YourTopicDefFile.txt
    combineCtrl --jobname atest load < seeds.txt
    
    Tell the crawler where the indexer is. Edit /etc/combine/atest/combine.cfg and add
    ZebraHost = <host>:<port>
    at the end, where <host> and <port> correspond to your Zebra configuration.
  4. Start Zebra and the crawler. Run:
    zebrasrv -f yazserver.xml -l server.log &
    combineCtrl --jobname atest start
    
  5. Make your own UI.
Your search engine is now ready for use, and its database grows as the crawl proceeds.
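
The commands from steps 3 and 4 can be collected in a small script. The sketch below uses only the commands shown in those steps. The function names, the DRY_RUN variable, and the argument order are assumptions made for illustration. With DRY_RUN=1 each command is printed instead of executed, so the sequence can be reviewed first.

```shell
#!/bin/sh
# Hypothetical wrapper around the commands from steps 3 and 4.
# Set DRY_RUN=1 to print the commands instead of executing them.

# Run a simple command, or just print it in dry-run mode.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# create_crawler JOBNAME TOPICFILE SEEDFILE
# Initializes the job with a topic definition and loads the seed URLs.
create_crawler() {
    job=$1 topic=$2 seeds=$3
    run sudo combineINIT --jobname "$job" --topic "$topic" || return 1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "combineCtrl --jobname $job load < $seeds"
    else
        combineCtrl --jobname "$job" load < "$seeds"
    fi
}

# start_all JOBNAME
# Starts the Zebra server in the background, then the crawler.
# Expects to be run from the directory that holds yazserver.xml.
start_all() {
    job=$1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "zebrasrv -f yazserver.xml -l server.log &"
    else
        zebrasrv -f yazserver.xml -l server.log &
    fi
    run combineCtrl --jobname "$job" start
}
```

For example, after setting DRY_RUN=1, calling create_crawler atest YourTopicDefFile.txt seeds.txt followed by start_all atest prints the four commands in order. The ZebraHost entry in combine.cfg still has to be added between the two calls, as described in step 3.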

Demos

Simple demonstrators of Vertical Search Engines are available here.

Create your own Vertical Search Engine.


Last updated 2009-06-16 by Anders Ardö