Internet
A global computer network providing a variety of information and communication facilities, consisting of interconnected networks using standardized communication protocols.
There are about one billion websites. The size of the internet is approximately 1.2 million terabytes (one terabyte is 1,000 gigabytes). On the internet, the Google search engine is used by about 80% of the world. So how does Google give you results in just milliseconds from this huge database? The answer is the search engine. So how do these search engines work?
Search engines answer tens of millions of queries every day.
When you search for something on Google, you are not actually searching the web; you are searching Google's index of the web. There are about 60 trillion web pages, and the number grows every second. Google indexes every web page whose owner has allowed it.
So let's see how this works.
The most important measures for a search engine are its search performance, the quality of its results, and its ability to crawl and index the web efficiently. The primary goal is to provide high quality search results over a rapidly growing World Wide Web. Some of the efficient and recommended search engines are Google, Yahoo and Teoma, which share some common features and are standardized to some extent.
A search engine works in three steps:
- Crawling: collecting data from web pages.
- Indexing: analysing the collected data and storing it on servers.
- Retrieval: delivering results for a search query.
What is crawling of a website or a page?
A web crawler is a computer program that travels the web automatically, downloading and storing web pages, often for a web search engine. Crawlers, also called spiders, build lists of the words found on websites. When a spider is building its lists, the process is called web crawling.
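To make the idea of "building lists of words" concrete, here is a minimal sketch using only Python's standard library. It assumes the page has already been downloaded as an HTML string; the class name WordCollector, the function words_on_page and the sample page are hypothetical names chosen for illustration, not part of any real crawler.

```python
from html.parser import HTMLParser
import re


class WordCollector(HTMLParser):
    """Collects the visible words of a page, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside <script>/<style> counts as page content.
        if self._skip_depth == 0:
            self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def words_on_page(html):
    """Return the list of lowercase words found in an HTML string."""
    collector = WordCollector()
    collector.feed(html)
    collector.close()
    return collector.words


# Hypothetical sample page, used only for this illustration.
sample_page = (
    "<html><head><title>Search Basics</title>"
    "<style>p {color: red}</style></head>"
    "<body><p>Crawlers read pages.</p></body></html>"
)
print(words_on_page(sample_page))
# → ['search', 'basics', 'crawlers', 'read', 'pages']
```

A real spider does far more (handling encodings, broken markup, duplicate pages), but the core step of turning a page into a word list looks roughly like this.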
This should not be confused with WebCrawler, a metasearch engine that blends the top search results from Google Search and Yahoo! Search.
Crawling a website means acquiring data from it. Crawling is done by computer software called spiders or bots; Google's crawler is known as Googlebot. Web crawlers are fast and can scan hundreds of pages in a very short time.
There is a URL server that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the store server. The store server then compresses and stores the web pages into a repository.
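The flow described above (URL server, crawler, store server, repository) can be sketched in a few lines. This is only a toy model under stated assumptions: the class names UrlServer and StoreServer, the FAKE_WEB dictionary standing in for the real web, and the choice of zlib for compression are all assumptions made for illustration, not how any particular search engine does it.

```python
import zlib
from collections import deque

# Hypothetical stand-in for the web: URL -> page content.
FAKE_WEB = {
    "http://example.com/": "<html>Home page</html>",
    "http://example.com/about": "<html>About page</html>",
}


class UrlServer:
    """Hands out the URLs to be fetched, one at a time."""

    def __init__(self, urls):
        self._queue = deque(urls)

    def next_url(self):
        return self._queue.popleft() if self._queue else None


class StoreServer:
    """Compresses fetched pages and keeps them in a repository."""

    def __init__(self):
        self.repository = {}  # URL -> compressed bytes

    def store(self, url, html):
        self.repository[url] = zlib.compress(html.encode("utf-8"))

    def load(self, url):
        return zlib.decompress(self.repository[url]).decode("utf-8")


def crawl(url_server, store_server, fetch):
    """Fetch every URL the URL server hands out and pass it to the store server."""
    while (url := url_server.next_url()) is not None:
        html = fetch(url)
        if html is not None:  # skip pages that could not be fetched
            store_server.store(url, html)


url_server = UrlServer(FAKE_WEB)
store = StoreServer()
crawl(url_server, store, FAKE_WEB.get)  # FAKE_WEB.get plays the role of a download
print(sorted(store.repository))
print(store.load("http://example.com/about"))
```

Passing the fetch function in as a parameter keeps the sketch runnable without network access; a real crawler would put an HTTP download there.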
Crawlers visit a page as we do and scan everything on it: the page title, keywords, page links, etc. Modern crawlers also scan the page layout, the advertising space on the page, and so on.
The first thing a crawler does when it visits a website is look for a file named "robots.txt" (the robots exclusion protocol). The robots.txt file tells crawlers which pages may be crawled and which pages must not be crawled. If a site has no robots.txt file, crawlers generally assume that every page may be crawled.
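Example of robots.txt:
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
Here every crawler is kept out of /yoursite/temp/, while the crawler named searchengine may fetch everything (an empty Disallow line means nothing is disallowed).
Have a look at Facebook's robots.txt file at www.facebook.com/robots.txt
or my blog's at techtysechyblog.blogspot.in/robots.txt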
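Python's standard library ships a robots.txt parser, so the example rules above can be checked directly. The sketch below feeds the same rules to urllib.robotparser as a string instead of downloading them; the crawler name "otherbot" and the example.com URLs are hypothetical and only serve the illustration.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the example robots.txt above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""


def make_parser(robots_text):
    """Build a parser from robots.txt text (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# "otherbot" is a made-up crawler name; it falls under the "*" rules.
print(parser.can_fetch("otherbot", "http://example.com/yoursite/temp/a.html"))
# → False
print(parser.can_fetch("otherbot", "http://example.com/index.html"))
# → True
# The crawler named "searchengine" has an empty Disallow, so nothing is blocked.
print(parser.can_fetch("searchengine", "http://example.com/yoursite/temp/a.html"))
# → True
```

A polite crawler would run a check like can_fetch before every download and skip any URL for which it returns False.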
Google's crawlers fetch an entire page, then fetch the pages linked from it, then the pages linked from those pages, and so on until the crawlers decide to stop.
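This "follow the links, then the links on those pages" behaviour is essentially a breadth-first traversal. The following is a minimal sketch, not Google's actual algorithm: the FAKE_WEB dictionary replaces real downloads, max_pages is an assumed stopping rule, and the names LinkCollector, extract_links and crawl are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for all links found in the HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.links]


def crawl(start_url, fetch, max_pages=100):
    """Visit pages breadth-first, stopping after max_pages pages."""
    seen = {start_url}          # URLs already queued, to avoid loops
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:        # page could not be fetched
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical four-page site used instead of real downloads.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.com/b": '<a href="/">home</a>',
    "http://example.com/c": "no links here",
}
print(crawl("http://example.com/", FAKE_WEB.get))
# → ['http://example.com/', 'http://example.com/a', 'http://example.com/b', 'http://example.com/c']
```

The seen set is what keeps the crawler from going around in circles when pages link back to each other, and max_pages is one simple way for the crawler to "decide to stop".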
Every search engine, whether it is Google, Yahoo, Bing, Baidu, etc., uses crawlers.
What is indexing of a web page?
When you search for something on Google, you are actually searching Google's index of the web. It is the search engine index that provides the results for your query. Without an index, a search engine could not return results within a second; it would take much more time and effort.
The search engine index is the place where all the data collected by the crawlers is stored on servers. The indexer is a program that reads the data collected by the web crawlers and decides what each web page is about.
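How an indexer "decides what a page is about" is not spelled out here, and real indexers use many signals. As a rough illustration only, one very simple approach is to count the words on a page, ignore common filler words, and treat the most frequent remaining words as the page's topic. The stop-word list and the function name top_keywords below are assumptions made for this sketch.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "on", "for", "it"}


def top_keywords(text, count=3):
    """Return the most frequent non-stop words in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    frequencies = Counter(word for word in words if word not in STOP_WORDS)
    return [word for word, _ in frequencies.most_common(count)]


page_text = "The crawler visits pages. The crawler stores pages. Pages are indexed."
print(top_keywords(page_text, 2))
# → ['pages', 'crawler']
```

Word frequency alone is easy to fool, which is one reason search engines combine it with other signals such as titles and links.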
Web indexing includes back-of-book-style indexes to individual websites or web documents and the creation of metadata (subject keywords and description tags) to provide a more useful vocabulary for Internet search engines.
How are Google results ordered?
So after crawling and indexing, how does Google answer a search query? The results are ordered by web page ranking. The ranking of a web page is decided by at least 200 factors, and the pages that rank well appear at the top of the search results.
The ranking algorithm is set up by humans, but no human manually adjusts the ranking of an individual web page.
Some of the factors that decide page rank:
- Keyword usage
- Site structure
- Site speed
- Time spent on site
- Number of inbound links
- Quality of inbound links
Almost all search engines use this method for crawling, indexing, and answering queries. Here Google has been taken as the example because most of the world uses the Google search engine.
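To give a feel for how several ranking factors might be combined into one ordering, here is a toy weighted-score sketch. Everything in it is an assumption: search engines do not publish their weights or formulas, the four signals are just a subset of the factors named in this post, each signal is assumed to be already scaled between 0 and 1, and the page URLs are made up.

```python
# Hypothetical weights: real search engines do not publish theirs.
WEIGHTS = {
    "keyword_usage": 0.3,
    "site_speed": 0.2,
    "inbound_links": 0.3,
    "inbound_link_quality": 0.2,
}


def score(signals):
    """Combine per-page signals (each between 0 and 1) into a single score."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)


def rank(pages):
    """Return page URLs ordered from highest to lowest score."""
    return sorted(pages, key=lambda url: score(pages[url]), reverse=True)


# Made-up pages with made-up signal values.
pages = {
    "http://example.com/slow": {
        "keyword_usage": 0.9,
        "site_speed": 0.1,
        "inbound_links": 0.2,
        "inbound_link_quality": 0.5,
    },
    "http://example.com/popular": {
        "keyword_usage": 0.7,
        "site_speed": 0.8,
        "inbound_links": 0.9,
        "inbound_link_quality": 0.8,
    },
}
for url in rank(pages):
    print(url, round(score(pages[url]), 2))
```

In this example the page with better speed and more inbound links comes first even though the other page uses the keyword more, which shows why no single factor decides the ranking.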
Once the crawling process is over, the search engine compiles a massive index of all the words it has found and stores this data on Google's servers, remembering the location of each web page. The stored data is then organised and interpreted by the search engine's algorithm to measure each page's importance compared to similar pages.
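An index of "all the words" that remembers where each word occurs is usually called an inverted index. The sketch below is a minimal in-memory version under simple assumptions: pages are plain text, words are split on non-alphanumeric characters, and the function names build_index and search as well as the sample pages are hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to {url: [positions of the word in that page]}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word].setdefault(url, []).append(position)
    return index


def search(index, query):
    """Return the set of URLs that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], {}))
    for word in words[1:]:
        results &= set(index.get(word, {}))
    return results


# Made-up pages standing in for crawled content.
pages = {
    "http://example.com/crawl": "Crawlers fetch web pages",
    "http://example.com/index": "The index stores words from web pages",
}
index = build_index(pages)
print(index["web"])
# → {'http://example.com/crawl': [2], 'http://example.com/index': [5]}
print(sorted(search(index, "web pages")))
# → ['http://example.com/crawl', 'http://example.com/index']
print(sorted(search(index, "index words")))
# → ['http://example.com/index']
```

Because the index is built once in advance, answering a query only needs a few dictionary lookups instead of rereading every page, which is why results can come back so quickly.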
If you find this post useful, please share it.
If you have any recommendations, please comment below.