The Google search engine is undoubtedly the most widely used search engine. Google was founded by Larry Page and Sergey Brin. It is worth knowing the basic working and methodology of the Google search engine, and I have explained it here in very simple words.
Read Carefully
Overview:
Okay, let's assume you want to design a little search engine that searches for the requested keywords in a few websites (say 5 websites). What would our approach be? First of all, we would store the contents, that is, the web pages of those 5 websites, in our database. Then we would build an index of the important parts of these web pages, such as titles, headings and meta tags. Next we would make a simple search box where users can enter a search query or keyword. The user's query would be processed to match it against the keywords in the index, and the results would be returned accordingly. We would return a list of links to the actual websites, ordered by some ranking algorithm. I hope this basic overview of how a search engine works is clear to you.
Now read on for more detail.
A web search engine basically works in the following manner. There are three main parts:
1. Web Crawling
2. Indexing
3. Query processing or searching
1. The first step is web crawling. A web crawler, or web spider, is software that travels across the World Wide Web, downloading and saving web pages. A web crawler is fed the URLs of websites and starts from there, downloading and saving the web pages associated with those websites. Want to get a feel for a web crawler? Download one, feed it the links of some websites, and it will start downloading the web pages, images etc. associated with those websites. The name of Google's web crawler is Googlebot. Want to see the copies of web pages saved in Google's database? (Actually, not exactly.)
Let's take any website as an example, say www.wikipedia.org
Do this: go to Google and search for 'wikipedia'. Hopefully you will get the link to Wikipedia at the top.
OR
Directly search for 'cache:wikipedia.org'
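To make the crawling step concrete, here is a minimal sketch in Python. It is not how Googlebot works internally; it only illustrates the idea of starting from seed URLs, saving each page, and following its links. The `fetch` parameter, the `FAKE_WEB` dictionary and the `example.test` URLs are all assumptions made up for this example, so that it runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl starting from seed_urls.

    fetch is a caller-supplied function (hypothetical here) that takes a URL
    and returns the page HTML as a string, or None if the page is unavailable.
    Returns a dict mapping each visited URL to its saved HTML.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html  # "download and save" the page
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link, _ = urldefrag(urljoin(url, href))
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# A tiny in-memory "web" (made up) so the sketch runs offline.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a>',
    "http://example.test/b": "<p>no links here</p>",
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
print(sorted(pages))
```

To crawl real sites you would pass a `fetch` function that performs an HTTP request, and you would also want politeness delays and a robots.txt check, which this sketch leaves out.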
2. After Googlebot has saved the pages, it submits them to the Google indexer. Indexing means extracting the words from titles, headings, meta tags etc. The indexed pages are stored in Google's index database. The contents of the index database are similar to the index at the back of a book. Google ignores common or insignificant words such as as, for, the, is, or, on (called stop words), which appear in almost every web page. Indexing is done basically to improve the speed of searching.
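The "index at the back of a book" idea is usually called an inverted index: a mapping from each word to the pages that contain it. Below is a small sketch, assuming the pages are already available as plain text. The stop-word list, the function names and the sample pages are illustrative choices, not Google's actual ones.

```python
import re
from collections import defaultdict

# A small illustrative stop-word list; the list a real engine uses is not known here.
STOP_WORDS = {"as", "for", "the", "is", "or", "on", "a", "an", "and", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def build_index(pages):
    """Map each remaining word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


# Made-up sample pages (URL -> extracted text).
pages = {
    "http://example.test/a": "The history of the search engine",
    "http://example.test/b": "Search tips for the web",
}
index = build_index(pages)
print(sorted(index["search"]))  # both pages contain "search"
print("the" in index)           # False: stop words are never indexed
```

With such an index, finding the pages for a word is a single dictionary lookup instead of a scan through every saved page, which is where the speed-up comes from.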
3. The third part is query processing, or searching. It includes the search box where we enter the search query or keyword we are looking for. When a user enters a search query, Google matches the entered keywords against the pages in the index database and returns links to the actual web pages from which those pages were retrieved. Priority is given to the best-matching results. Google uses a patented algorithm called PageRank that helps rank the web pages that match a given search string.
The above three steps are followed not only by Google search but by most web search engines. Of course there are many variations, but the methodology is the same.
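Here is a rough sketch of matching and ranking together. The `pagerank` function is a simplified textbook version of the PageRank idea (power iteration with a damping factor), not Google's production ranking, and the link graph, index and page names `a`, `b`, `c` are made up for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to; every linked
    page must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on in equal shares to the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


def search(query, index, rank):
    """Return the pages containing every query word, highest rank first."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(set(index.get(w, ())) for w in words))
    return sorted(matches, key=lambda p: rank[p], reverse=True)


# Made-up link graph and index for three pages.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
index = {"search": {"a", "b", "c"}, "engine": {"b", "c"}}

rank = pagerank(links)
print(search("search engine", index, rank))  # ['c', 'b']
```

Page `c` comes first because both `a` and `b` link to it, so it collects more rank than `b`, which is linked only from `a`.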
What is robots.txt?
Web administrators do not always want web crawlers or web spiders to fetch every page or file of a website and show the links in search results. robots.txt is a simple text file, placed in the top-level directory of the website, that lists the paths web administrators do not want web crawlers to fetch. The first step of a well-behaved web crawler is to check the contents of robots.txt.
Example of the contents of robots.txt:
User-agent: *                          # applies to the web crawlers of all search engines
Disallow: /directory_name/file_name    # a specific file in a particular directory
Disallow: /directory_name/             # all files in a particular directory
You can view the robots.txt of a website, if it exists. Example: http://www.microsoft.com/robots.txt
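If you are writing your own crawler in Python, the standard library's `urllib.robotparser` module can read these rules for you. The sketch below parses a rule set from a string rather than downloading it; the paths, the `example.test` host and the crawler name `MyCrawler` are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules in robots.txt syntax.
RULES = """\
User-agent: *
Disallow: /private/report.html
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(url, agent="MyCrawler"):
    """Return True if the rules above let the given crawler fetch url."""
    return parser.can_fetch(agent, url)


print(allowed("http://example.test/index.html"))           # True
print(allowed("http://example.test/private/report.html"))  # False
print(allowed("http://example.test/tmp/anything.html"))    # False
```

For a live site you would call `set_url()` with the address of its robots.txt and then `read()` instead of `parse()`.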