Raven Search (tm)

This page contains information about our new scalable web crawler, the Raven Spider Bot, and our RavenMeta Search engine. Perhaps you are here because Raven crawled your site, because you are doing research on search engines, or perhaps you just wandered in off the web. The information super-highway has many small roads and back alleys; we hope you find this small path a pleasant diversion.



Current News

Nov 25, 2000
We are pleased to be releasing our RavenMeta Search engine within the next few weeks. It is made for small businesses that want their own meta search without the bother of a traditional search engine set-up. It has a truly point-and-click configuration program that runs on Windows. The RavenMeta Search engine itself is an executable that sits in your server's CGI-BIN, so there are no unusual programming languages to install or modules to fool with! It performs a sequential search, with 15 search engines to choose from as well as stock quotes. The configuration program is extremely easy to use and requires only very basic HTML knowledge. If you're looking to add a search engine to your site in under 30 minutes, this is it!
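To make the "sits in your CGI-BIN" idea concrete, here is a minimal sketch of how a sequential meta-search CGI script could dispatch one query to several engines in turn. It is only an illustration: the engine names, URL templates, and the 'q' parameter are assumptions, not RavenMeta's actual configuration.

    #!/usr/bin/perl
    # Hypothetical sketch of a sequential meta-search CGI script, in the
    # spirit of the description above; it is NOT the RavenMeta code.  The
    # engine names and URL templates are illustrative assumptions only.
    use strict;
    use CGI;
    use LWP::UserAgent;
    use URI::Escape;

    my $q     = CGI->new;
    my $query = $q->param('q') || '';

    # Each entry maps an engine name to a URL template; %s is replaced
    # with the URL-escaped query.  Real engines use their own parameters.
    my %engines = (
        'ExampleEngineA' => 'http://search-a.example.com/find?q=%s',
        'ExampleEngineB' => 'http://search-b.example.com/query?term=%s',
    );

    my $ua = LWP::UserAgent->new(timeout => 10);

    print $q->header('text/html'), "<html><body>\n";
    for my $name (sort keys %engines) {
        (my $url = $engines{$name}) =~ s/%s/uri_escape($query)/e;
        my $resp = $ua->get($url);              # query each engine in turn
        printf "<h2>%s: %s</h2>\n", $name,
               $resp->is_success ? 'results fetched' : 'no response';
    }
    print "</body></html>\n";

A real configuration would of course merge and format the results from each engine rather than just reporting success or failure.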

Our RavenSearch Spider Bot is still coming along nicely. We will be posting details on this in the coming months.

What is RavenSearch Spider Bot:

Raven is a scalable search engine capable of processing 200+ web pages per second. We call it scalable because it can easily be "upgraded" to process many more pages per second by adding more computers. Its purpose is to crawl the entire web (all 300+ million pages) and build an inverted index that can be searched via a simple query form on a web page. Some estimates put the number of pages actually on the web as high as a billion.
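For readers unfamiliar with the term, an inverted index simply maps each word to the set of pages that contain it, so a query becomes a direct lookup. The toy sketch below shows the idea; it is not Raven's actual code, and the URLs and text are made up.

    #!/usr/bin/perl
    # Minimal illustration of an inverted index: for each fetched page we
    # record, per word, which URLs contain it, so a query can be answered
    # with a single hash lookup.
    use strict;

    my %index;   # word => { url => 1, ... }

    sub index_page {
        my ($url, $text) = @_;
        for my $word (map { lc } $text =~ /(\w+)/g) {
            $index{$word}{$url} = 1;
        }
    }

    index_page('http://example.com/a', 'Raven is a scalable web crawler');
    index_page('http://example.com/b', 'A raven is a large black bird');

    # A one-word query is then just a hash lookup.
    my @hits = sort keys %{ $index{'raven'} || {} };
    print "Pages containing 'raven': @hits\n";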

Our Mission:

We have made it our goal to provide the best Internet web search site. We will offer more than one type of search, at different levels of complexity and exactness. While our main premise is to search web pages, we may also offer other types of search. We will do this within an atmosphere of search-only pages, which allows us to concentrate on the one thing we do best and not get bogged down with peripheral projects that are not search related. It should also be noted that our spider will catalog every site it comes across. We have no wish to create yet another database of 'popular' sites like so many other search engines. We hold the view that no software or computer hardware can decide for us which sites are 'best', and no attempt will be made to take this choice away from the user.


What Language Do You Use:

We use Perl. Perl is an interpreted language that is ideally suited to creating web crawlers. We get around the speed issue by compiling our code and running it on very fast computers (P3 500+ MHz). We hope that sticking to pure Perl will keep the source code from growing as large as it would in other languages. It just seems to be the perlfect language for this project.

Source Code Availability:

Sorry, we cannot make the source code available at this time. We would like to explore all possibilities before we simply give the code away. If we find this is not a viable product, we may release the code under the GNU license. If we do, it will be posted here and on mirror sites.

How Does It Work:

We don't wish to give everything away at this point. However, the following synopsis should be acceptable (a simplified code sketch follows below):
1) We start with a page of seed URLs (bookmarks) and load them all into the "URL Frontier". This is just a big cache of URLs waiting to be fetched and processed.
2) We load hundreds of these URLs at a time and fetch the web pages they point to.
3) We extract the URLs contained on those pages and add them to the "URL Frontier" so they will also be processed.
4) The whole process is repeated over and over until all web pages everywhere have been processed. Sounds simple, doesn't it! ;-)
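As a rough illustration of steps 1) through 4), here is a simplified, single-threaded sketch of such a frontier loop. It is a toy under obvious assumptions (one URL at a time, a made-up seed URL, a small page limit, no politeness or error handling) and is not Raven's actual code.

    #!/usr/bin/perl
    # Simplified, single-threaded sketch of the frontier loop described
    # above; Raven itself fetches hundreds of URLs at a time.  The seed
    # URL and the page limit are illustrative only.
    use strict;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my @frontier = ('http://www.example.com/');   # 1) seed URLs
    my %seen     = map { $_ => 1 } @frontier;
    my $ua       = LWP::UserAgent->new(agent => 'RavenBot-sketch/0.1');
    my $fetched  = 0;

    while (@frontier and $fetched < 100) {         # 4) repeat until done
        my $url  = shift @frontier;
        my $resp = $ua->get($url);                 # 2) fetch the page
        next unless $resp->is_success;
        $fetched++;

        # 3) pull out the links and push unseen ones onto the frontier
        my $extor = HTML::LinkExtor->new(undef, $url);
        $extor->parse($resp->content);
        for my $link ($extor->links) {
            my ($tag, %attr) = @$link;
            next unless $tag eq 'a' and $attr{href};
            my $abs = URI->new_abs($attr{href}, $url)->canonical->as_string;
            push @frontier, $abs unless $seen{$abs}++;
        }
    }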

Netiquette:

Raven is a very friendly robot. Raven adheres to the Robots Exclusion Protocol, whereby any system administrator can indicate which parts of a site should not be visited by a robot. It's as simple as making a text file with just a few lines in it.
Raven will also use the Robots META tag to determine what it can access.
Beyond that, we make sure that no server is accessed by Raven more than once a minute. Raven also has some pretty smart code that looks for traps, accidental or not, and maintains a banned URL/domain queue. The latter means that some sites are just too scary for Raven, and it will leave them alone. A sketch of these politeness checks appears below.
See the Web Robots FAQ for more information about web robots in general.
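To make the above concrete, here is a hedged sketch of those politeness checks: robots.txt handling through the CPAN module WWW::RobotRules, a one-minute gap between requests to the same host, and a plain hash standing in for the banned URL/domain queue. The host names are made up, and the real trap-detection code is of course more involved.

    #!/usr/bin/perl
    # Sketch of the politeness rules described above (not Raven's actual
    # code): robots.txt via WWW::RobotRules, a one-minute per-host delay,
    # and a simple banned-domain hash standing in for the trap detector.
    use strict;
    use LWP::UserAgent;
    use WWW::RobotRules;
    use URI;

    my $ua    = LWP::UserAgent->new(agent => 'RavenBot-sketch/0.1');
    my $rules = WWW::RobotRules->new('RavenBot-sketch/0.1');
    my %last_hit;                              # host => time of last request
    my %robots_done;                           # host => robots.txt already read
    my %banned = ('trap.example.com' => 1);    # illustrative banned domain

    sub polite_fetch {
        my ($url) = @_;
        my $host  = URI->new($url)->host;
        return undef if $banned{$host};        # too scary, leave it alone

        # Read this host's robots.txt the first time we see the host.
        my $robots_url = "http://$host/robots.txt";
        unless ($robots_done{$host}++) {
            my $resp = $ua->get($robots_url);
            $rules->parse($robots_url, $resp->is_success ? $resp->content : '');
        }
        return undef unless $rules->allowed($url);   # the site said keep out

        # Never hit the same server more than once a minute.
        my $wait = 60 - (time - ($last_hit{$host} || 0));
        sleep $wait if $wait > 0;
        $last_hit{$host} = time;

        return $ua->get($url);                 # finally fetch the page
    }

    my $resp = polite_fetch('http://www.example.com/');
    print $resp && $resp->is_success ? "fetched\n" : "skipped or failed\n";

CPAN's LWP::RobotUA module bundles robots.txt handling and per-host delays, so a real crawler could also build on that instead of rolling its own.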

References:

The Anatomy of a Large-Scale Hypertextual Web Search Engine (Sergey Brin and Lawrence Page).
Mercator: A Scalable, Extensible Web Crawler (Allan Heydon and Marc Najork).
Measuring Index Quality Using Random Walks on the Web (Monika R. Henzinger, Allan Heydon, Michael Mitzenmacher, Marc Najork).
Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery (Soumen Chakrabarti, Martin van den Berg, Byron Dom).
Efficient Crawling Through URL Ordering (Junghoo Cho, Hector Garcia-Molina, Lawrence Page).
And many others.





Feedback:

Ravensearch@hotmail.com