The Combine Harvesting Robot

What is Combine?

A Focused Crawler System for the WEB

If you want to download
- all Web-pages from a list of servers (like all servers at Lund University)
or
- all Web-pages pertaining to a particular topic (like 'Carnivorous Plants')

then the Combine focused crawler is the system for you!

Combine is an open system for crawling [harvesting and threshing (indexing)] Internet resources. It can be used both as a general crawler and as a focused crawler. The name is derived from the combine harvester, since the two perform their jobs in a similar way.

It is implemented in Perl and is easily configured for most types of Web crawling.

Together with the Zebra text indexing and retrieval engine, Combine makes an easy-to-use search engine in a box. In a few simple steps you can create a vertical search engine with structured searching.

Features

  • part of the SearchEngine-in-a-Box system
  • highly configurable
  • integrated automated topic classifier for focused crawling mode
  • possibility to use any topic classifier (if provided as a Perl PlugIn module) in focused crawling mode
  • crawl scope controlled by regular expressions on URLs (both include and exclude patterns)
  • character set detection/normalization (UTF-8)
  • language detection
  • HTML cleaning
  • metadata extraction
  • many document types (text, HTML, PDF, PostScript, MsWord, PowerPoint, Excel, RTF, TeX, images)
  • obeys robot exclusion
  • SQL database for data storage and administration
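
To illustrate the idea behind include and exclude patterns on URLs, here is a minimal shell sketch. It is not Combine's configuration syntax or implementation: the function names url_allowed and filter_urls and the example patterns are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Illustrative sketch only; NOT Combine's configuration format or code.
# The function names and patterns are assumptions made for this example.

# url_allowed URL INCLUDE_REGEX EXCLUDE_REGEX
# Succeeds (exit status 0) when URL matches INCLUDE_REGEX and does not
# match EXCLUDE_REGEX. An empty EXCLUDE_REGEX excludes nothing.
url_allowed() {
    url=$1 include=$2 exclude=$3
    # The URL must match the include pattern (extended regular expression).
    printf '%s\n' "$url" | grep -Eq -- "$include" || return 1
    # Reject the URL if an exclude pattern is given and it matches.
    if [ -n "$exclude" ] && printf '%s\n' "$url" | grep -Eq -- "$exclude"; then
        return 1
    fi
    return 0
}

# filter_urls INCLUDE_REGEX EXCLUDE_REGEX
# Reads URLs on standard input, one per line, and prints the allowed ones.
filter_urls() {
    while IFS= read -r candidate; do
        if url_allowed "$candidate" "$1" "$2"; then
            printf '%s\n' "$candidate"
        fi
    done
}
```

For example, piping a list of URLs through filter_urls '^https?://([^/]+\.)?lth\.se/' '\.(jpg|png|zip)$' would keep pages on lth.se servers while dropping image and archive files.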

Where can I get further information?

Documentation

Mailing list at SourceForge

Installation

Installation on Debian stable

The recommended procedure to install Combine is to use the Debian package system - apt.
  • Add the line
    deb http://combine.it.lth.se/ debian/
    to your file /etc/apt/sources.list
  • apt-get update
  • apt-get install combine
    this also pulls in MySQL and many Perl modules as dependencies
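
The first step can be scripted so that running it twice does not duplicate the repository line. The following is a small sketch under that assumption; the function name add_combine_source and the variable COMBINE_APT_LINE are hypothetical, and only the repository line itself comes from the instructions above.

```shell
#!/bin/sh
# Sketch: add the Combine repository line to an apt sources file only once.
# The function and variable names are assumptions made for this example.
COMBINE_APT_LINE='deb http://combine.it.lth.se/ debian/'

# add_combine_source [FILE]
# Appends the repository line to FILE (default: /etc/apt/sources.list)
# unless an identical line is already present.
add_combine_source() {
    list_file=${1:-/etc/apt/sources.list}
    # -x matches whole lines, -F treats the pattern as a fixed string.
    if grep -qxF -- "$COMBINE_APT_LINE" "$list_file" 2>/dev/null; then
        return 0
    fi
    printf '%s\n' "$COMBINE_APT_LINE" >> "$list_file"
}

# Typical use, as root (not executed here):
#   add_combine_source && apt-get update && apt-get install combine
```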

Installation from CPAN

Combine is also available from CPAN.

Manual installation for the impatient

  • Download the source version 4.005 (gzipped tar).
  • Unpack it and cd into combine-4.005.
  • Make sure you have all the dependencies installed.
  • The following command sequence will install Combine:
    perl Makefile.PL
    make
    make test
    make install
    mkdir /etc/combine
    cp conf/* /etc/combine/
    mkdir /var/run/combine
  • Test that it all works (run as root):
    perl ./doc/InstallationTest.pl

Test the installation

  • sudo combineINIT --jobname aatest (must be run as root)
    will create a MySQL database called 'aatest' and create and initialize the job-specific configuration directory '/etc/combine/aatest/'
  • load at least one seed URL, for example
    echo 'http://www.eit.lth.se/' | combineCtrl --jobname aatest load
  • combineCtrl --jobname aatest start
    will start 1 crawler process
  • combineCtrl --jobname aatest stat
    will give the status of the frontier queue (the list of URLs currently being crawled).
  • combineCtrl --jobname aatest kill
    kills all running crawlers for the job 'aatest'
  • combineExport --jobname aatest --profile combine
    will export to STDOUT, formatted as XML, all records for job 'aatest' stored in the MySQL database.
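
The steps above can be chained into one script. The sketch below uses only the commands and options shown in the list; the function names (run, load_seed, combine_smoke_test), the DRY_RUN variable and the waiting time between start and stat are assumptions added for illustration. With DRY_RUN set, the commands are printed instead of executed, which makes it possible to review the sequence on a machine without Combine.

```shell
#!/bin/sh
# Sketch of the installation test as one script. Function names, DRY_RUN and
# the wait time are assumptions; the Combine commands are those listed above.

# run CMD [ARGS...]
# Executes the command, or only prints it when DRY_RUN is non-empty.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# load_seed JOBNAME URL
# Feeds one seed URL to 'combineCtrl load' on standard input.
load_seed() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf "echo '%s' | combineCtrl --jobname %s load\n" "$2" "$1"
    else
        printf '%s\n' "$2" | combineCtrl --jobname "$1" load
    fi
}

# combine_smoke_test JOBNAME SEED_URL [WAIT_SECONDS]
# Must be run as root, because combineINIT requires it.
combine_smoke_test() {
    job=$1 seed=$2 wait_seconds=${3:-60}
    run combineINIT --jobname "$job" || return 1
    load_seed "$job" "$seed" || return 1
    run combineCtrl --jobname "$job" start || return 1
    # Give the crawler some time to fetch pages (skipped in dry-run mode).
    [ -n "${DRY_RUN:-}" ] || sleep "$wait_seconds"
    run combineCtrl --jobname "$job" stat || return 1
    run combineCtrl --jobname "$job" kill || return 1
    run combineExport --jobname "$job" --profile combine
}
```

After sourcing the script, a call such as combine_smoke_test aatest 'http://www.eit.lth.se/' runs the whole sequence; the exported XML goes to STDOUT and can be redirected to a file.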

Where can I get help?

Examples, HowTo's and hints.

For reporting bugs, getting help or information, or just to let us know that you are using Combine, send an e-mail to the Combine general discussion list, which is a mailing list at SourceForge. Subscribe to the list here.


Downloads

Also available from CPAN and SourceForge.

History, acknowledgements

The Combine system was initially developed as part of the Development of a European Service for Information on Research and Education (DESIRE) project, which was funded by the European Commission within the Telematics for Science Program.

It was later modified for focused crawling by integrating the automated topic classification algorithms, also developed in DESIRE, with the crawler. This work was funded by Vinnova, the Swedish Agency for Innovation Systems (project P22504-1 A), and the EU project ALVIS (IST-1-002068-STP). Combine is currently supported by .SEs Internetfond in the project 'Vertical Search Engines'.


Copyright

From version 1.1, Combine is distributed under the GNU General Public License.

Acknowledgements: The Combine Harvester was created at NetLab and is maintained by the KnowLib group at the Dept. of Electrical and Information Technology, Lund University.

Last modified
2012-03-07