This page contains information about the new scalable web crawler called Raven
Spider Bot and our RavenMeta
Search engine. Perhaps you are here because Raven crawled your site, or
because you are doing research on search engines, or perhaps you just wandered
in off the web. The information superhighway has many small roads and back
alleys. We hope you find this small path a pleasant diversion.
Nov 25, 2000
We are pleased to be releasing our RavenMeta Search
engine within the next few weeks. It is made for small businesses that wish
to have their own meta search without all the bother of traditional search
engine set-up. It has a truly point-n-click configuration program that runs on
Windows. The RavenMeta Search
engine itself is an executable that sits in your server's CGI-BIN, so there
are no bizarre programming languages to install or modules to fool with! It is a
sequential search that has 15 search engines to choose from, as well as stock
quotes. The configuration program is extremely easy to use and requires only
very basic HTML knowledge. If you're looking to add a
search engine to your site in under 30 minutes, this is it!
Our RavenSearch Spider Bot is still coming along nicely. We will be posting
details on this in the coming months.
What is RavenSearch Spider Bot:
Raven is a scalable search engine capable of processing 200+ web pages per
second. We call it scalable because it can easily be "upgraded" to process many
more pages per second with the addition of more computers. Its purpose is to
crawl the entire web (all 300+ million pages) and make an inverted index that
can be searched via a simple search query form on a web page. Some estimates of
how many web pages are actually on the web go as high as a billion.
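The inverted index mentioned above is, at its core, a map from each word to the
set of pages containing it, so that a query becomes a lookup plus an
intersection. Raven itself is written in Perl; the toy sketch below uses Python,
and the example URLs and page texts are invented for illustration:

```python
from collections import defaultdict

# Toy corpus: URL -> page text (invented for illustration).
pages = {
    "http://example.com/a": "raven is a scalable web crawler",
    "http://example.com/b": "a web search engine crawls pages",
    "http://example.com/c": "scalable search for the whole web",
}

# Build the inverted index: word -> set of URLs containing that word.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

def search(*words):
    """Return the URLs containing every query word (a simple AND query)."""
    results = [index.get(w, set()) for w in words]
    return set.intersection(*results) if results else set()

print(sorted(search("scalable", "web")))
```

A real index would also store positions and rank the results, but the
word-to-pages mapping is the part that makes query-time lookups fast.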
We have made it our goal to provide the best Internet Web Search Site. We will
offer more than one type of search at different levels of complexity and
exactness. While our main premise will be to search Web Pages, we may also offer
other types of search. We will do this within an atmosphere of 'Search Only'
related pages. This will allow us to concentrate on the one thing we do best and
to not get bogged down with peripheral projects that are non-search related. It
should also be noted that our spider will catalog all sites it comes across. We
have no wish to create yet another database of 'Popular' sites like so many
other search engines. We have the enlightened idea that no software or computer
hardware can decide for us what sites are 'Best'. No attempt will be made to
remove this choice from the user.
What Language Do You Use:
We use Perl. Perl is an interpreted language that is ideally suited to creating
web crawlers. We get around the speed issue by compiling our code and running it
on very fast computers (P3 500+ MHz). We hope to use pure Perl to keep the
source code from getting too large, as can happen with other languages. It just
seems to be the perlfect language for this project.
Source Code Availability:
Sorry, we cannot make the source code available at this time. We would like to
explore all possibilities before we simply give the code away. If we find this
is not a viable product then we may release the code under the GNU license. If
we do it will be posted here and on other sites (mirrors).
How Does It Work:
We don't wish to give everything away at this point. However, the following
synopsis should be acceptable:
1) We start with a page of seed URLs (Bookmarks) and load all the URLs into the
"URL Frontier". This is just a big cache of URLs that we will fetch and process.
2) We load hundreds of the URLs at a time and fetch the web pages that they
point to.
3) We then get the URLs that are contained on those web pages and put them in
the "URL Frontier" so they will also be processed.
4) The whole process is repeated over and over until all web pages everywhere
have been processed.
Sounds simple, doesn't it! ;-)
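The steps above amount to a loop over the URL Frontier. The sketch below (in
Python, not the Perl that Raven actually uses) runs the loop over an invented
in-memory link graph instead of real HTTP fetches; the batch size and the graph
itself are assumptions for illustration:

```python
from collections import deque

# Invented link graph standing in for the web: URL -> outgoing links.
links = {
    "seed1": ["p1", "p2"],
    "seed2": ["p2", "p3"],
    "p1": ["p4"],
    "p2": [],
    "p3": ["p1"],
    "p4": [],
}

def crawl(seeds, batch_size=2):
    frontier = deque(seeds)       # 1) load the seed URLs into the URL Frontier
    seen = set(seeds)
    processed = []
    while frontier:
        # 2) take a batch of URLs and "fetch" the pages they point to
        batch = [frontier.popleft()
                 for _ in range(min(batch_size, len(frontier)))]
        for url in batch:
            processed.append(url)
            # 3) extract the URLs on each page; queue any we have not seen
            for out in links.get(url, []):
                if out not in seen:
                    seen.add(out)
                    frontier.append(out)
        # 4) repeat until the frontier is empty
    return processed

print(crawl(["seed1", "seed2"]))
```

The `seen` set is what keeps the crawler from fetching the same page twice;
without it, cycles in the link graph would loop forever.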
Raven is a very friendly robot. Raven adheres to
The Robots Exclusion Protocol whereby any System Administrator can indicate
which parts of the site should not be visited by a robot. It's as simple as
making a text file with just a few lines in it.
Raven will also use
The Robots META tag to determine what it can access.
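As a concrete example of the Exclusion Protocol at work, Python's standard
library can parse such a text file and answer the same question Raven asks
before each fetch. The robots.txt content and paths below are invented for
illustration:

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt -- just a few lines, as the protocol intends.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Raven", "http://example.com/private/page.html"))
print(rp.can_fetch("Raven", "http://example.com/index.html"))
```

A polite crawler fetches each site's /robots.txt once, then consults the parsed
rules before requesting any other page from that site.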
Beyond that, we make sure that no server is accessed by Raven more than once a
minute. Also, Raven has some pretty smart code that looks for traps, accidental
or not, and implements a banned URL/Domain queue. In practice this means that some
sites are just too scary for Raven and it will leave them alone.
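The once-a-minute rule comes down to remembering, per host, when it was last
contacted. A minimal sketch of that bookkeeping follows (Raven is Perl; this is
a Python illustration, with the 60-second interval taken from the text and the
class name and injectable clock being our own assumptions):

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Tracks the last access time per host and enforces a minimum interval."""

    def __init__(self, min_interval=60.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.last_access = {}   # host -> timestamp of the last fetch
        self.banned = set()     # hosts the crawler has decided to leave alone

    def may_fetch(self, url):
        host = urlparse(url).netloc
        if host in self.banned:
            return False
        now = self.clock()
        last = self.last_access.get(host)
        if last is not None and now - last < self.min_interval:
            return False        # too soon; leave the URL in the frontier
        self.last_access[host] = now
        return True
```

The fetcher checks `may_fetch` before each request; a URL that is refused
simply waits in the frontier until its host's interval has elapsed.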
See the Web Robots FAQ for more information about web robots in general.
The Anatomy of a Large-Scale Hypertextual Web Search Engine, (Sergey Brin and
Lawrence Page).
Mercator: A Scalable, Extensible Web Crawler, (Allan Heydon and Marc Najork).
Measuring Index Quality Using Random Walks on the Web, (Monika R. Henzinger,
Allan Heydon, Michael Mitzenmacher, Marc Najork).
Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery,
(Soumen Chakrabarti, Martin van den Berg, Byron Dom).
Efficient Crawling Through URL Ordering, (Junghoo Cho, Hector Garcia-Molina,
and Lawrence Page).
And so many others.....