How do search engines work?

Search engines are the lifeblood of the internet. They played an important role in making the web as simple to use as it is now; without them, the web would still be a geeks-only land rather than a playground for everyone. It is safe to say the web would not be what it is today without search engines.
A search engine can be defined simply as a piece of software that maintains a database of internet resources. When a user enters a query in the search box, the engine searches this database and returns relevant results.

The first search engine in the world was Archie, created in 1990 by Alan Emtage. It was not a full-fledged search engine like the ones we see today. FTP was the primary mode of sharing files online at that time: a user who wanted to share files would run an FTP server, and anyone who wanted to download a file would run an FTP client, connect to that server, and retrieve the file; the client machine did not need to be a server itself. Later, anonymous FTP sites came into existence, which allowed users to post and retrieve content from FTP servers. But it was not as simple as it sounds. There were plenty of problems in using the FTP protocol for file sharing. First, a user had to know that a particular FTP server existed and then visit it to download files. FTP servers were not well organized, and finding them at that time was a tedious task.

Emtage created a searchable directory of FTP file listings called Archie. It used a script that scoured FTP sites and indexed their contents on each run, and it also allowed users to submit the locations of FTP files and servers. Using it required Telnet and FTP, and its regular-expression matcher let users query its database. Many consider Archie the first search engine because it was the first tool developed to search for files on the Internet.

Another early search engine was Veronica, developed in 1992 to search and index Gopher files. Gopher is a protocol used to share plain-text documents, so the files indexed by Veronica were plain-text documents. Another Gopher search tool popular at the time was Jughead, which was similar to Veronica but more limited in scope.
Wanderer, developed by Matthew Gray, was the first web search tool to use the robot concept, and the first to index the contents of the World Wide Web. A robot is a software program that accesses web pages by following the links contained in them. Wanderer was actually designed to count the number of web servers in the world and so measure the size of the web. It captured one URL after another and ultimately produced the first web database. So we can say that Wanderer laid the foundation for later search engines such as Google, Yahoo, Ask and AltaVista, which employ much the same concept.
What is a search spider?

A search spider is the component of a search engine that crawls web pages, indexes the links contained in them, and stores them in a database. The spider recrawls the internet from time to time to keep the database up to date. A spider works on the principle that web pages are linked to one another through hyperlinks, and it behaves much like a web browser: it starts by requesting a URL and then spreads out by following the hyperlinks contained in successive pages. A spider does not index all the contents of a website; it is programmed to filter content based on certain criteria, which makes crawling fast and efficient. How often a spider revisits a page depends on how often the site is updated. A spider follows a number of policies to govern all this (a minimal crawler sketch follows the list below). These policies include the following:
1. Selection policy: A spider uses a selection policy to decide which websites to index and which to skip. The web is huge and it is not possible to index all of its contents, so spiders use a selection policy to ensure that the links added to the database point to useful content. The spider applies different signals to decide which pages to include and which to neglect: the quality of the page, the number of inbound links, the amount of traffic to the website, and so on. Google's spider is widely regarded as having one of the best selection policies, which is part of why Google is considered the best engine.
2. Revisit policy: A spider uses a revisit policy to decide when to visit a website again to check for updates. This depends on how frequently the website is updated: if a site is updated daily, the spider will likely revisit it often. Getting the revisit policy right is a complex problem. Considering the amount of information on the web and the tremendous speed at which it is expanding, revisiting websites for updates is time consuming, so here too spiders use different techniques. A spider always looks for the most recent content; it visits pages that serve fresh content frequently, and if a site's content is neither fresh nor relevant, the chance of the spider revisiting it often is meagre.

3. Politeness policy: The behaviour of a search engine also depends on the politeness policy it uses. A politeness policy prevents spiders from overloading websites. Spiders consume bandwidth while crawling, which can affect server performance, and because of the very nature of their work they run in parallel, which can load servers even further. To avoid this, search engines use politeness policies. A spider normally crawls a site as if the site were made for crawling, but sometimes it cannot access the links in a website properly and may even bring a server down. The first safeguard is the robots exclusion protocol, which allows a web administrator to specify which pages may be crawled and which may not: a file named robots.txt placed at the root of the site invokes the robots exclusion protocol, and the same information can also be specified in the meta tags of a web page. Using either method we can inform the spider that the links contained in a page are not for indexing, and a well-behaved spider is programmed to obey.
4. Parallelization policy: A spider is a multitasking application: it runs multiple instances of itself at the same time to increase the download speed and the number of pages crawled at a time. This also helps reduce the load on any single server, because the monotonous task of crawling websites is divided into lighter processes. Hence it improves the overall performance of the spider while reducing the burden on servers. There are mainly two types of parallelization policies used by search spiders:
1. Dynamic assignment: A central server takes responsibility for assigning URLs to each instance of the spider dynamically. This helps keep the load across the spider instances in equilibrium.
2. Static assignment: URLs are assigned to spider instances up front by a fixed rule. These rules specify the conditions that must be satisfied for a spider process to start crawling a web page. Different search engines use different algorithms for this purpose, but the concepts they implement are much the same.
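To make the crawling ideas above concrete, here is a minimal sketch of a polite breadth-first crawler in Python. The seed URL, the one-second delay, the "MyToySpider" user-agent string and the use of the standard urllib and html.parser modules are assumptions for illustration only; a real spider also applies selection and revisit policies and splits the URL frontier across many parallel instances.

# A minimal sketch of a polite breadth-first crawler (illustrative only).
import time
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def allowed_by_robots(url):
    """Politeness policy: honour robots.txt before fetching a page."""
    base = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = urllib.robotparser.RobotFileParser(base + "/robots.txt")
    try:
        rp.read()
    except OSError:
        return True  # robots.txt unreachable: assume crawling is allowed
    return rp.can_fetch("MyToySpider", url)


def crawl(seed, max_pages=20, delay=1.0):
    """Breadth-first crawl starting from a seed URL."""
    queue, seen, pages = deque([seed]), {seed}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if not allowed_by_robots(url):
            continue
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:      # a selection policy would filter here
                seen.add(absolute)
                queue.append(absolute)
        time.sleep(delay)                 # politeness: do not hammer the server
    return pages

In practice the frontier of discovered URLs would be partitioned across many such workers, for example by hashing the host name (a simple form of static assignment), but this single loop already shows the fetch, filter and politeness steps described above.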
Indexing the contents
After the spider has crawled the contents of a web page, those contents and links are indexed. Indexing is the key process that decides the quality of a search engine: the content fetched by the spider must be indexed in such a way that the user gets relevant links when searching; otherwise, neither the user nor the search engine benefits, however innovative and powerful the spider is. Indexing occurs in a number of steps and is carried out by a component called the indexer, which does the following things (a small indexing sketch follows this list):
1. It decompresses the contents fetched by the spider.
2. It parses the contents and links of each web page fetched by the spider and stores the link information in an anchors file.
3. It converts each document into a set of word occurrences called hits, which record where a particular keyword appears in a web page.
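As a rough illustration of this step, the sketch below builds a tiny inverted index that records, for each word, the pages it occurs in and the positions of those occurrences (the "hits" mentioned above). The tokenization rule and the in-memory dictionary are simplifying assumptions; a real indexer uses compressed, disk-based structures and tracks anchor text separately.

# A simplified sketch of building an inverted index with word positions ("hits").
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """pages: dict mapping URL -> page text. Returns word -> {url: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        for position, word in enumerate(tokenize(text)):
            index[word][url].append(position)
    return index


# Example usage with two toy "pages".
pages = {
    "http://example.com/a": "search engines index the web",
    "http://example.com/b": "a web spider crawls the web",
}
index = build_index(pages)
print(index["web"])   # each URL where "web" occurs, with its positions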
Displaying the contents

The next and final phase is to display the indexed contents based on the keywords entered by the user. This is the visible part of a search engine, and it is the part on which users tend to judge it: a user does not care how the results are produced, only that the displayed results are useful, so the popularity of a search engine depends heavily on this stage. Most search engines use a page-ranking technique to order results, and the effectiveness of a search engine depends on that technique as well. Many people currently consider Google's page-ranking technique the best. Different algorithms are used to rank web pages, each based on criteria defined by the algorithm, so understanding how these algorithms work helps people figure out how to get their pages displayed among the first ten results, since most users only visit the sites listed on the first few result pages. Bad actors use illegitimate techniques to get listed in the top ten; some of these sites are called scraper sites because they scrape search results produced by web spiders and republish them as their own content. The quality of a ranking algorithm therefore also depends on how well it counters these techniques, and search engine companies are constantly improving their ranking methods, and with them their popularity.
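The exact ranking formulas used by commercial engines are not public, but the original PageRank idea, ranking a page by the ranks of the pages that link to it, can be sketched in a few lines. The toy link graph, the damping factor of 0.85 and the fixed iteration count below are illustrative assumptions; real ranking systems combine many more signals than links alone.

# A simplified power-iteration sketch of PageRank-style link ranking.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outgoing in links.items():
            if not outgoing:              # dangling page: spread its rank evenly
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))   # higher values indicate pages with more link authority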
Types of search engines
The search engines we use today can be classified into different categories based on their functionality and the way they work. The most popular types of search engine are the following:

1. Meta search engine: A meta search engine searches more than one search engine at a time and then displays the combined results (a small sketch of the idea follows this list). The advantage of using a meta search engine is that you get results from all the major search engines without searching each of them individually. Many meta search engines allow users to specify which engines they want to search, and they are available both as client-side and server-side applications. The disadvantage is that a meta search engine cannot return the entire result set of every engine it queries.
2. Subject directories: A subject directory is a human-created database of internet resources. It is called a directory-based search engine because the webmaster maintains a directory containing links to different websites, and users can also submit websites to it. The links in a directory are divided into subcategories based on the topics they relate to, and a subject directory contains a number of such categories. The disadvantage of a subject directory is that it can be difficult to pull specific information out of it.
3. Searchable databases: A searchable database is another type of search engine. The pages contained in a searchable database are often not indexed by general search engines, which is why this content is also called the invisible web. These databases generally include research papers, library catalogs and the like. They are specialized databases, and to get results from them you have to visit the site itself and enter your query in its search box; results relevant to that keyword are then displayed in a web interface. Searchable databases are generally created and maintained by universities, government organizations and similar institutions.
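As a rough sketch of the meta search idea in item 1 above, the function below fans a query out to several engines and merges the results. The fetch_results helper and the engine names are hypothetical placeholders, since each real engine exposes its own API with its own keys, quotas and result formats.

# A sketch of the meta search idea: query several engines, merge, de-duplicate.
def fetch_results(engine, query):
    """Hypothetical stand-in: return a ranked list of URLs from one engine."""
    canned = {
        "engine-a": ["http://example.com/1", "http://example.com/2"],
        "engine-b": ["http://example.com/2", "http://example.com/3"],
    }
    return canned.get(engine, [])


def meta_search(query, engines):
    """Interleave results from each engine, dropping duplicates."""
    merged, seen = [], set()
    per_engine = [fetch_results(engine, query) for engine in engines]
    for rank in range(max((len(r) for r in per_engine), default=0)):
        for results in per_engine:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged


print(meta_search("c++ tutorial", ["engine-a", "engine-b"]))

Interleaving by rank is only one possible merge strategy; real meta search engines also re-score results, for example by how many engines returned the same URL.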
Which search engine is the best?

There are lots of search engines in the world. You probably do not know the names of most of them beyond a few top ones such as Google, Yahoo, Ask, AltaVista, Alltheweb and Alexa, but there are plenty of crawler-based and meta search engines out there that you have never noticed; a quick search in Google or another popular engine will reveal them. Confused about which search engine is the best? Here are some things that will help you decide; the final answer to that question is yours to find.
You have already seen how a search engine works, and all search engines have the components described above. What makes some search engines stand out and others fade away is the way those components are implemented. Still, here are some of the features that will help you determine the best search engine.
Database functionality

The database is the core of a search engine, and its size plays a very important role in determining where a search engine stands. A good search engine database has answers to almost everything a user might query. Google has the largest database of indexed websites, so it surpasses all others in database size; this is evident from the results Google returns, and it is one reason Google is more dependable than any other search engine.

The size of the database alone does not determine its quality; freshness matters too. If you search for something and get exactly the same results every time over a long period, the database is not being updated efficiently. A good search engine has its database updated regularly and constantly. Another feature of a good search database is that its results cover different file formats. If you search for something like "C++" and get only links to a few HTML tutorials, the search engine cannot be considered good. Search Google for "C++", for example, and you get links to C++ books from Google Books, links from Google Scholar, PDF tutorials, ordinary web pages and more. The diversity of the links a search engine provides counts. Google is currently able to provide the most relevant content because it offers a lot of services that are intelligently woven into its basic service, the Google search engine; many other search engines are still miles behind it in that respect.

Another feature to look out for is consistency in search results. Search for something and, if you get broadly the same results each time, the search engine can be considered consistent; if the results change erratically, it cannot be considered reliable.
The size and diversity of the index definitely count, but much of a search engine's database functionality comes down to its support for Boolean search. A good search engine supports searching with all the Boolean operators. There are three main Boolean operators you can use while searching online: AND, OR and NOT. Most of the time we do not use Boolean operators, but they are very handy when searching for a specific kind of information. If you search using AND, say cricket AND football, you get results related to both; if you use OR, you get results related to either of them; and if you use cricket NOT football, you get results related to cricket with football excluded. That is how these operators work.
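Under the hood, Boolean operators map naturally onto set operations over an inverted index like the one sketched earlier: AND is an intersection, OR is a union, and NOT is a set difference. The tiny index below is an illustrative assumption.

# Boolean search as set operations over an inverted index (illustrative sketch).
index = {
    "cricket":  {"page1", "page2", "page3"},
    "football": {"page2", "page4"},
}

cricket, football = index["cricket"], index["football"]

print(cricket & football)   # cricket AND football -> pages containing both
print(cricket | football)   # cricket OR football  -> pages containing either
print(cricket - football)   # cricket NOT football -> cricket pages without football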

Google supports AND and OR by default. Yahoo supports AND, OR and NOT, as well as nesting of these operators, so you can combine them while searching in Yahoo.
Instead of NOT you can use "-" in Google: the results then contain links related to the main keyword minus the keyword to be excluded. For example, if you search for cricket -football you get results related to cricket with football-related results excluded. Similarly, you can use "+" when you want an additional term to be required. For example, you can search for Pamela +Jolly so that results related to Jolly are also included while searching for Pamela. Google supports both operators; try them in the search engine you want to test and compare the results.
Another thing to watch out for is how accurate the search engine is. All the functionality above serves to improve the accuracy of your searches, but there are some additional tools you can use to get more precise results, and these also play an important role in determining the quality of a search engine. For example, you can use quotation marks to search for an exact phrase: searching for C++ "book" returns links that contain the word book together with the term C++. Another option is wildcards: if you want to cover a large number of related terms, or you do not know the exact keyword, you can use the "*" operator; searching for Christmas *, for instance, returns everything related to Christmas. It is a very useful option and a key feature to look out for. Google supports the features above but does not support truncation; Yahoo supports quotation marks and truncation but not wildcard search.
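A positional index like the one sketched in the indexing section is also what makes phrase queries and simple trailing wildcards possible: a phrase match requires consecutive positions, and a trailing wildcard can be answered by a prefix scan over the vocabulary. The helpers below are a simplified illustration of both ideas, assuming the word -> {url: [positions]} structure built earlier.

# Simplified sketch of phrase and trailing-wildcard matching over a positional index.
def phrase_match(index, words):
    """Return URLs where the given words appear consecutively, in order."""
    if not words or any(w not in index for w in words):
        return set()
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])       # keep URLs containing every word
    matches = set()
    for url in candidates:
        starts = index[words[0]][url]
        if any(all(p + i in index[w][url] for i, w in enumerate(words))
               for p in starts):          # words must occupy consecutive positions
            matches.add(url)
    return matches


def wildcard_terms(index, prefix):
    """Return vocabulary terms matching a trailing wildcard such as 'christ*'."""
    return [term for term in index if term.startswith(prefix)]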

The list of features to consider when deciding which search engine is best does not stop there. There are many other useful search options, such as inurl:, intitle:, site: and link:. If you search with inurl:, you get results with the given term in the URL; for example, a search for inurl:download returns links whose URLs contain the word download. Try the other options in the same way. All of these are supported by Google and Yahoo, and Yahoo offers an additional hostname option that lets users search for hosts, which is useful too. Ask also supports these features, along with some very useful ones of its own such as geoloc:, last:, lang:, afterdate:, beforedate: and betweendate:.
These are some of the basic things that help you determine which search engine is the best. There is additional functionality that shows how useful a search engine is, because usability also matters in deciding whether a search engine is the best or the worst. These extras include a spell checker, language translation, calculators, definitions, maps, phone books, stock quotes and so on. Google supports all of the above: Google Maps is the best map tool you can get, and you can translate web pages into about 50 different languages. Yahoo is not far behind Google in this respect, and it offers additional search features such as searching for ISBNs, patents, synonyms, traffic and encyclopedia entries. Ask also offers all of the above features.

The quest for the best search engine continues, because there are plenty of things left to mention. Try to learn about the other search options these engines support, watch out for new features as they are added, and evaluate how useful they are. Search using all the options you know, discover the ones you do not, and search with those too. Ultimately you will find your own answer to which search engine is the best; until then, the question remains open.
