Timing Google's Crawl
Google crawls the Web at varying depths and on more than one schedule. The so-called deep crawl occurs roughly once a month. This extensive reconnaissance of Web content requires more than a week to complete and an undisclosed length of time after completion to build the results into the index. For this reason, it can take up to six weeks for a new page to appear in Google. Brand new sites at new domain addresses that have never been crawled before might not even be indexed at first.
If Google relied entirely on the deep crawl, its index would quickly become outdated in the rapidly shifting Web. To stay current, Google launches various supplemental fresh crawls that skim the Web more shallowly and frequently than the deep crawl. These supplementary spiders (automated software programs that travel from link to link on the Web, collecting content from online pages) do not update the entire index, but they freshen it by updating the content of some sites. Google does not divulge its fresh-crawling schedules or targets, but Webmasters can get an indication of the crawl's frequency through sharp observance.
Google has no obligation to touch any particular URL with a fresh crawl. Sites can increase their chance of being crawled often, however, by changing their content and adding pages frequently. Remember the shallowness aspect of the fresh crawl; Google might dip into the home page of your site (the front page, or index page) but not dive into a deep exploration of the site's inner pages. (You may, for example, notice that a new index page of your site appears in Google within a day of your updates, while a new inner page added at the same time may be missing.) But Google's spider can compare previous crawl results with the current crawl, and if it learns from the top navigation page that new content is added regularly, it might start crawling the entire site during its frequent visits.
The deep crawl is more automatic and mindlessly thorough than the fresh crawl. Chances are good that in a deep crawl cycle, any URL already in the main index will be reassessed down to its last page. However, Google does not necessarily include every page of a site. As usual, the reasons and formulas involved in excluding certain pages are not divulged. The main fact to remember is that Google applies
End of article
The following is an extract from How Google Works - adapted from Chris Sherman and Gary Price’s description of How Search Engines Work in Chapter 2 of The Invisible Web (CyberAge Books, 2001).
Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer.
Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it’s capable of doing.
Googlebot finds pages in two ways: through an add URL form, https://www.google.com/ webmasters/ tools/ submit-url, and through finding links by crawling the web.
When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of the web. This technique, known as deep crawling, also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can reach almost every page in the web.
To keep the index current, Google continuously recrawls popular frequently changing web pages at a rate roughly proportional to how often the pages change. Such crawls keep an index current and are known as fresh crawls. Newspaper pages are downloaded daily, pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the two types of crawls allows Google to both make efficient use of its resources and keep its index reasonably current.
End of article