How Does a Search Engine Work?

         Well, I am very interested in writing technical blogs, so I started by Googling the query "Which topic should I start writing technical blogs about?" and submitted it. Google returned results spread across three or four pages, but I was too lazy to look past even the last result on the first page; who ever looks at the second page of results? I followed several links from the search results, but none of them satisfied me. Then, suddenly, a sparkling thought came to my mind!
         Let's go to www.google.com! You may wonder how going to the Google home page counts as a thought, but this is the page that most scientists, engineers, doctors, and just about everyone else depends on. A question popped into my mind: why has Google become such a major part of our lives? The answer is its powerful tool, the SEARCH ENGINE. As soon as the thought struck me, I decided to explore how a search engine works. So let's start exploring: how does a search engine work?
           Frankly speaking, I am just a toddler in this competitive world, so don't expect highly advanced technical information from me for now. Still, I will try my best to share as much as I can. No more waiting; let's jump into the search engine architecture. First, I will explain the components of a general search engine, and then I will introduce Google's search engine architecture.
           
          Any general (dynamic) search engine consists of five main components. They are:
  1. The Crawler
  2. The Parser
  3. The Indexer
  4. The Rank Engine
  5. The Front End
  1. The crawler. This is the part that goes through the web, grabs pages, and stores information about them in a central data store. In addition to the text itself, you will want things like the time you accessed the page. The crawler needs to be smart enough to know how often to hit certain domains, to obey the robots.txt convention, and so on (a minimal sketch follows this list).
  2. The parser. This reads the data fetched by the crawler, parses it, saves whatever metadata it needs, throws away junk, and possibly makes suggestions to the crawler on what to fetch next time around.
  3. The indexer. This reads what the parser produced and creates inverted indexes of the terms found on the web pages. It can be as smart as you want it to be: apply NLP (Natural Language Processing) techniques to build indexes of concepts, cross-link things, throw in synonyms, and so on.
  4. The ranking engine. Given a few thousand URLs matching "apple", how do you decide which result is the best? Just the index doesn't give you that information. You need to analyze the text, the linking structure, and whatever other signals you want to look at, and compute scores. This may be done completely on the fly (which is really hard), or based on some pre-computed notion of "experts", like PageRank.
  5. The front end. Something needs to receive user queries, hit the central engine, and respond; this something needs to be smart about caching results, possibly mixing in results from other sources, and so on. It has its own set of problems.
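To make the crawler and parser parts more concrete, here is a minimal sketch in Python. It checks a site's robots.txt before fetching, downloads the page, records when it was fetched, and extracts links for the next round. The seed URL, user agent, and politeness delay are placeholder assumptions; a real crawler would also need persistence, deduplication, and much more careful error handling.

```python
import time
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed_by_robots(url, user_agent="toy-crawler"):
    """Check the site's robots.txt before fetching the URL."""
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
    except OSError:
        return True  # if robots.txt is unreachable, assume allowed
    return rp.can_fetch(user_agent, url)

def crawl(seed_url, max_pages=10, delay_seconds=1.0):
    """Very small breadth-first crawl: fetch, record, follow links."""
    frontier = [seed_url]
    seen = {seed_url}
    store = {}  # url -> (fetch time, raw html)
    while frontier and len(store) < max_pages:
        url = frontier.pop(0)
        if not allowed_by_robots(url):
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        store[url] = (time.time(), html)
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        time.sleep(delay_seconds)  # be polite to the server
    return store

if __name__ == "__main__":
    pages = crawl("https://example.com", max_pages=3)
    print(f"Fetched {len(pages)} pages")
```

The one-second delay between requests is just the simplest possible politeness policy; production crawlers keep per-domain fetch schedules instead.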
All Components Together

          
                  Before a search engine can tell you where a file or document is, it must be found. To find information on the hundreds of millions of web pages that exist, a typical search engine employs special software robots, called spiders, to build lists of the words found on websites. When a spider is building its lists, the process is called web crawling. A web crawler is a program that automatically traverses the web by downloading documents and following links from page to page. Crawlers are mainly used by web search engines to gather data for indexing; other possible applications include page validation, structural analysis and visualization, update notification, mirroring, and personal web assistants/agents. Web crawlers are also known as spiders, robots, or worms; whatever the name, they are automated programs that follow the links found on web pages.
           There is a URL Server that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the store server, which compresses them and stores them in a repository. Every web page has an associated ID number called a doc ID, which is assigned whenever a new URL is parsed out of a web page. The indexer and the sorter perform the indexing function. The indexer performs a number of tasks: it reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. A hit records the word, its position in the document, an approximation of its font size, and its capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer also performs another important function: it parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link. (A small sketch of these data structures follows this paragraph.)
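As an illustration of hits, barrels, and the forward index, here is a rough Python sketch. The field names, the toy tokenizer, and the hash-based choice of barrel are my own simplifying assumptions, not Google's actual on-disk format.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Hit:
    """One occurrence of a word in a document."""
    word: str
    position: int      # token offset within the document
    font_size: int     # rough approximation, e.g. 0-7
    capitalized: bool

def make_hits(text):
    """Turn a document's text into a list of hits (toy tokenizer)."""
    hits = []
    for position, token in enumerate(text.split()):
        word = token.strip(".,!?\"'()").lower()
        if word:
            hits.append(Hit(word, position, font_size=3,
                            capitalized=token[0].isupper()))
    return hits

def build_forward_index(documents, num_barrels=4):
    """Distribute hits into barrels: barrel -> doc_id -> list of hits.
    This is the 'partially sorted forward index' (grouped by doc ID,
    not yet sorted by word)."""
    barrels = [defaultdict(list) for _ in range(num_barrels)]
    for doc_id, text in documents.items():
        for hit in make_hits(text):
            barrel = hash(hit.word) % num_barrels  # assumed partitioning scheme
            barrels[barrel][doc_id].append(hit)
    return barrels

if __name__ == "__main__":
    docs = {1: "Apple pie recipes", 2: "Apple releases a new phone"}
    for i, barrel in enumerate(build_forward_index(docs)):
        print(f"barrel {i}: {dict(barrel)}")
```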
           The URL Resolver reads the anchors file, converts relative URLs into absolute URLs, and in turn into doc IDs. It puts the anchor text into the forward index, associated with the doc ID that the anchor points to. It also generates a database of links, which are pairs of doc IDs. The links database is used to compute PageRanks for all the documents. The sorter takes the barrels, which are sorted by doc ID, and re-sorts them by word ID to generate the inverted index. This is done in place so that little temporary space is needed for the operation. The sorter also produces a list of word IDs and offsets into the inverted index. A program called DumpLexicon takes this list, together with the lexicon produced by the indexer, and generates a new lexicon to be used by the searcher. A lexicon lists all the terms occurring in the index along with some term-level statistics (e.g., the total number of documents in which a term occurs) that are used by the ranking algorithms. The searcher runs on a web server and uses the lexicon built by DumpLexicon, together with the inverted index and the PageRanks, to answer queries. The two sketches below illustrate the PageRank computation over the links database and the step from the forward index to an inverted index that a searcher can query.
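To give a feel for how the links database feeds into ranking, here is a small sketch of the classic iterative PageRank computation over pairs of doc IDs. The damping factor of 0.85 and the fixed iteration count are the usual textbook choices, not anything specific to Google's production system.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Compute PageRank from a list of (from_doc_id, to_doc_id) pairs."""
    nodes = {doc for pair in links for doc in pair}
    out_links = {doc: [] for doc in nodes}
    for src, dst in links:
        out_links[src].append(dst)
    rank = {doc: 1.0 / len(nodes) for doc in nodes}
    for _ in range(iterations):
        new_rank = {doc: (1.0 - damping) / len(nodes) for doc in nodes}
        for src in nodes:
            targets = out_links[src]
            if not targets:
                # dangling page: spread its rank over every page
                share = damping * rank[src] / len(nodes)
                for doc in nodes:
                    new_rank[doc] += share
            else:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank

if __name__ == "__main__":
    links = [(1, 2), (2, 1), (3, 1), (3, 2)]
    for doc, score in sorted(pagerank(links).items()):
        print(doc, round(score, 3))
```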
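And here is a rough sketch of what the sorter and searcher do: turn the forward index (doc ID to words) into an inverted index (word to doc IDs), build a tiny lexicon with document frequencies, and answer a query by intersecting posting lists and ordering the matches by a pre-computed score such as PageRank. The data structures are deliberately simplified assumptions; a real system works with word IDs, on-disk barrels, and far richer ranking signals.

```python
from collections import defaultdict

def build_inverted_index(forward_index):
    """forward_index: doc_id -> list of words.
    Returns word -> sorted list of doc IDs (the posting list)."""
    inverted = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            inverted[word].add(doc_id)
    return {word: sorted(docs) for word, docs in inverted.items()}

def build_lexicon(inverted_index):
    """Lexicon: every term with its document frequency."""
    return {word: len(postings) for word, postings in inverted_index.items()}

def search(query, inverted_index, pageranks):
    """Intersect posting lists for the query terms, rank by PageRank."""
    terms = query.lower().split()
    postings = [set(inverted_index.get(t, [])) for t in terms]
    if not postings:
        return []
    matches = set.intersection(*postings)
    return sorted(matches, key=lambda doc: pageranks.get(doc, 0.0), reverse=True)

if __name__ == "__main__":
    forward = {1: ["apple", "pie"], 2: ["apple", "phone"], 3: ["banana"]}
    inverted = build_inverted_index(forward)
    print(build_lexicon(inverted))
    ranks = {1: 0.2, 2: 0.5, 3: 0.3}   # assumed pre-computed PageRanks
    print(search("apple", inverted, ranks))
```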