What is Googlebot?
Googlebot is the generic name for Google's web crawler: the software that collects documents from webpages to build a searchable index for Google Search. Googlebot constantly visits billions of webpages across the web.
What is a Web Crawler?
Web crawlers (also known as bots, robots, or spiders) are programs designed to follow links, gather information from the pages they visit, and send that information back to be processed.
How does Googlebot work?
Googlebot uses sitemaps and databases of links discovered during previous crawls to determine where to go next. Whenever the crawler finds new links on a site, it adds them to the list of pages to visit. If Googlebot finds changed or broken links, it makes a note of that so the index can be updated. The program determines how often it will recrawl each page. To make sure Googlebot can correctly index your site, you need to check its crawlability: if your site is accessible to crawlers, they will come around more often.
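The crawl loop described above can be sketched as a frontier of URLs that grows as new links are discovered. The sketch below stubs out HTTP fetching with hardcoded pages so it is self-contained; all URLs and page contents are hypothetical, and a real crawler would of course fetch pages over the network.

```python
# Minimal sketch of a crawl loop: keep a frontier of URLs to visit,
# and add newly discovered links to it. Fetching is stubbed out with
# a dictionary of hypothetical pages.
from html.parser import HTMLParser
from urllib.parse import urljoin

PAGES = {  # stand-in for real HTTP fetches
    "https://example.com/": '<a href="/a.html">A</a><a href="/b.html">B</a>',
    "https://example.com/a.html": '<a href="/b.html">B again</a>',
    "https://example.com/b.html": "no links here",
}

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(urljoin(self.base_url, value))

def crawl(start_url):
    frontier, seen, order = [start_url], set(), []
    while frontier:
        url = frontier.pop(0)
        if url in seen or url not in PAGES:
            continue  # skip already-visited or unknown pages
        seen.add(url)
        order.append(url)
        extractor = LinkExtractor(url)
        extractor.feed(PAGES[url])
        frontier.extend(extractor.links)  # newly found links join the queue
    return order

print(crawl("https://example.com/"))
# → ['https://example.com/', 'https://example.com/a.html', 'https://example.com/b.html']
```

Real crawlers add much more on top of this loop (politeness delays, robots.txt checks, deduplication at scale), but the frontier-of-links structure is the same.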
Googlebot was designed to run simultaneously on thousands of machines to improve performance and scale as the web grows. Generally, Googlebot crawls over HTTP/1.1. However, starting in November 2020, Googlebot may crawl over HTTP/2 for sites that support it and may benefit from it. This can save computing resources (for example, CPU and RAM) for both the site and Googlebot, but otherwise it does not affect indexing or ranking of your site.
The difference between Googlebot and the Google index
Googlebot
- Googlebot retrieves content from the web.
- Googlebot does not judge the content in any way; it only retrieves it.
- The only concerns Googlebot has are “Can I access this content?” and “Is there any further content that I can access?”
The Google index
- The Google index takes the content it receives from Googlebot and uses it to rank pages.
- The first step of being ranked by Google is to be retrieved by Googlebot.
How to Block Googlebot from Visiting Your Site
To block Googlebot from gathering information available on your site, you can use the following methods:
- Use appropriate directives in robots.txt, as Googlebot follows the instructions in it.
- Add robot instructions in a meta tag, such as <meta name="Googlebot" content="nofollow" />, to the web page.
- Exclude pages you do not want crawled from your XML sitemap file.
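To check what a given robots.txt rule actually blocks, Python's standard-library robots.txt parser can be used as a quick test. This is a minimal sketch: the robots.txt content and URLs below are hypothetical, and real Googlebot behavior is ultimately decided by Google's own parser.

```python
# Verify how robots.txt directives affect Googlebot, using Python's
# built-in robots.txt parser. The rules and URLs are hypothetical.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Googlebot is blocked from the whole site...
print(parser.can_fetch("Googlebot", "https://example.com/page.html"))     # False
# ...while other crawlers remain allowed.
print(parser.can_fetch("SomeOtherBot", "https://example.com/page.html"))  # True
```

In practice you would point `RobotFileParser` at your live robots.txt with `set_url()` and `read()` instead of parsing an inline string.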
Types of Google crawlers (user agents)
There are several types of Google crawlers. Google’s main crawler is called Googlebot.
- Crawler is a generic term for any program (such as a robot or spider) that is used to automatically discover and scan websites by following links from one webpage to another.
- User agent token is used in the User-agent: line in robots.txt to match a crawler type when writing crawl rules for your site.
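For example, a site can use a product token to write rules that apply to only one crawler type. The fragment below is a hypothetical robots.txt rule that blocks only Google's image crawler from a /photos/ directory, while leaving other crawlers unaffected:

```
User-agent: Googlebot-Image
Disallow: /photos/
```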
| Crawler | User agent token (product token) |
| --- | --- |
| APIs-Google | APIs-Google |
| AdSense | Mediapartners-Google |
| AdsBot Mobile Web Android (checks Android web page ad quality) | AdsBot-Google-Mobile |
| AdsBot Mobile Web (checks iPhone web page ad quality) | AdsBot-Google-Mobile |
| AdsBot (checks desktop web page ad quality) | AdsBot-Google |
| Googlebot Image | Googlebot-Image, Googlebot |
| Googlebot News | Googlebot-News, Googlebot |
| Googlebot Video | Googlebot-Video, Googlebot |
| Googlebot (Desktop) | Googlebot |
| Googlebot (Smartphone) | Googlebot |
| Mobile AdSense | Mediapartners-Google |
| Mobile Apps Android (checks Android app page ad quality; obeys AdsBot-Google robots rules) | AdsBot-Google-Mobile-Apps |
| Feedfetcher | FeedFetcher-Google |
| Google Read Aloud | Google-Read-Aloud |
| Duplex on the web | DuplexWeb-Google |
| Google Favicon (retrieves favicons for various services) | Google Favicon |
| Web Light | googleweblight |