Spider crawls the Web, then search is on
GoogleBots continuously scan and store billions of Web pages.
KNIGHT RIDDER NEWSPAPERS
WASHINGTON -- If you're considering investing in Google or you use this popular Internet search system, you may wonder how in the world the amazing thing works.
What computer magic makes it possible for Google to pick out, in a fraction of a second, the information you want from the incredible mass of material heaped on the Web?
To answer users' queries, the system created six years ago by two Stanford University graduate students has scanned and stored nearly 4.3 billion Web pages. If all those documents were printed, they'd make a stack of paper 300 miles high.
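A rough back-of-envelope check supports that figure, assuming one printed sheet per Web page and about 250 sheets to the inch (roughly the thickness of a standard ream):

```python
# Back-of-envelope check of the 300-mile stack claim.
# Assumptions (not from the article): one printed sheet per Web page,
# and about 250 sheets per inch of stack.
pages = 4.3e9
sheets_per_inch = 250
inches_per_mile = 5280 * 12  # feet per mile times inches per foot

stack_miles = pages / sheets_per_inch / inches_per_mile
print(f"{stack_miles:.0f} miles")  # roughly 270 miles, in the article's ballpark
```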
Some details of Google's methods are closely held trade secrets, but the broad outlines of how it and its competitors work are well known to computer scientists.
In computer jargon, Google's "search engines" use robotic "spiders" -- special software programs -- that "crawl" continuously along the myriad trails of the World Wide Web, "harvesting" documents as they go.
A separate piece of software builds an index of every word the spiders find.
Crawling, indexing, sorting
When a user submits a query -- such as "Mount Everest" or "Bill Clinton" -- the search engine checks the index, fetches each document that contains those words, sorts them by relevance and returns the most pertinent ones first.
"For Google, the major operations are crawling, indexing and sorting," the system's founders, Sergey Brin and Lawrence Page, wrote in their original paper describing the system.
To improve the results, Google uses a patented method called "PageRank," a sort of popularity contest that tries to determine which documents are likely to be most valuable to the user.
For each page, the PageRank system counts the number of other pages that link to it.
In essence, Google interprets a link from Page A to Page B as a "vote" by Page A for Page B.
In addition to the number of votes a page receives, the system analyzes the status of the pages that cast the votes. Popular pages weigh more heavily in the calculation.
"Pages that are well cited from many places around the Web are worth looking at," Brin and Page wrote.
Google uses other tricks as well to determine a document's ranking. Words in a special typeface, such as bold, underlined or all-capital text, get extra credit.
Words occurring close together -- such as "George" and "Bush" -- count for more than those that are far apart.
Finally, Google returns the documents that match a user's query, ranked in order of relevance as determined by PageRank and these other signals.
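One crude way such signals could be combined is sketched below. The weights, and the idea of simply adding the bonuses together, are assumptions made for illustration; Google's actual formula is not public:

```python
# Crude relevance scoring: query words earn extra credit when they
# appear in emphasized text, and when they occur close together.
# The weights here are invented for illustration.

def score(doc_words, emphasized, query_words):
    """doc_words: words in order; emphasized: set of bold/capitalized words."""
    positions = [i for i, w in enumerate(doc_words) if w in query_words]
    if not positions:
        return 0.0
    s = float(len(positions))  # base credit per occurrence
    # Typeface bonus for query words that appear emphasized.
    s += sum(2.0 for w in set(doc_words) & emphasized & query_words)
    # Proximity bonus: the tighter the span of query words, the more credit.
    span = max(positions) - min(positions)
    s += 3.0 / (1 + span)
    return s

doc = ["george", "bush", "visited", "texas"]
print(score(doc, emphasized={"george"}, query_words={"george", "bush"}))
# Adjacent query words, one of them emphasized -> a high score of 5.5
```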
How GoogleBots work
Here's how Google's stable of spiders, known as GoogleBots, goes about its business:
A spider visits every Web page that isn't marked private, reads it and stores it in compressed form. The spider looks for any links that the page may contain to other pages. It follows those links to pages it hasn't seen before, and continues the process until there are no more links to visit.
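Stripped to its essentials, that loop is a graph traversal. Here is a minimal sketch over a made-up in-memory "web"; a real spider would fetch pages over HTTP, respect pages marked private and throttle itself:

```python
# Skeleton of the crawl loop: visit a page, store it, queue every
# link you haven't seen yet, repeat until nothing new turns up.
# The "web" below is an invented in-memory link graph.
from collections import deque
import zlib

web = {  # url -> (page text, links on that page)
    "a.com": ("welcome to a", ["b.com", "c.com"]),
    "b.com": ("all about b", ["c.com"]),
    "c.com": ("c's home page", ["a.com"]),
}

def crawl(seed):
    seen, queue, store = set(), deque([seed]), {}
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        text, links = web[url]
        store[url] = zlib.compress(text.encode())  # store in compressed form
        queue.extend(link for link in links if link not in seen)
    return store

print(list(crawl("a.com")))  # ['a.com', 'b.com', 'c.com']
```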
While the spider is chugging along, an "indexer" is creating a catalog or dictionary of every word it encounters, except for short grammatical terms such as "the," "in" or "where."
For each word, the system keeps a list of all the pages that word appears in.
It also records the exact position of the word in each document so it can be found quickly later on.
The lists can be extremely long, since some words appear in millions of documents.
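A bare-bones version of such a positional index, with an invented stop-word list, might look like this:

```python
# Bare-bones positional inverted index: for every word, remember
# which documents it appears in and at which positions. Short
# grammatical terms are skipped, as the article describes.
from collections import defaultdict

STOP_WORDS = {"the", "in", "where", "a", "of"}  # illustrative list

def build_index(docs):
    index = defaultdict(lambda: defaultdict(list))  # word -> doc -> [positions]
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if word not in STOP_WORDS:
                index[word][doc_id].append(pos)
    return index

docs = {
    "doc1": "the carnival in Rio",
    "doc2": "George Bush opened the carnival",
}
index = build_index(docs)
print(dict(index["carnival"]))  # {'doc1': [1], 'doc2': [4]}
```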
For example, a search for the word "carnival" returns about 5.6 million entries, far more than anyone could possibly use.
The combination of "George" and "Bush" gets about 7.4 million hits.