How Google Works

How do Search Engines Work?

Extract from http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SearchEngines.html

Search Engines for the general web do not really search the World Wide Web directly. Each one searches a database of the full text of web pages automatically harvested from the billions of web pages out there residing on servers. When you search the web using a search engine, you are always searching a somewhat stale copy of the real web page. When you click on links provided in a search engine's search results, you retrieve from the server the current version of the page.

Search engine databases are selected and built by computer robot programs called spiders. These "crawl" the web, finding pages for potential inclusion by following the links in the pages they already have in their database (i.e., already "know about"). They cannot think or type a URL or use judgment to "decide" to go look something up and see what's on the web about it. (Computers are getting more sophisticated all the time, but they are still brainless.)

If a web page is never linked to in any other page, search engine spiders cannot find it. The only way a brand new page - one that no other page has ever linked to - can get into a search engine is for its URL to be sent by some human to the search engine companies as a request that the new page be included. All search engine companies offer ways to do this.
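The crawling behaviour described above can be illustrated with a small sketch. It is not how any real search engine's spider is written: the link graph, the example.org addresses, and the names crawl and fetch_links are all invented for illustration, and a real spider would download pages over the network instead of reading a dictionary.

```python
from collections import deque

# Hypothetical link graph: each URL maps to the URLs its page links to.
# These addresses are invented for illustration only.
WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
    "http://example.org/orphan": ["http://example.org/a"],  # no page links here
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting the links it contains."""
    return WEB.get(url, [])


def crawl(seeds, fetch_links):
    """Breadth-first crawl: start from known pages and follow their links."""
    known = set(seeds)
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for link in fetch_links(url):
            if link not in known:
                known.add(link)
                queue.append(link)
    return known


# Starting from the home page, the spider never reaches the orphan page.
found = crawl(["http://example.org/"], fetch_links)
print(sorted(found))
print("http://example.org/orphan" in found)

# Once a human submits the orphan's URL, it becomes a starting point too.
found_after_submission = crawl(
    ["http://example.org/", "http://example.org/orphan"], fetch_links
)
print("http://example.org/orphan" in found_after_submission)
```

The orphan page links out to other pages, but because nothing links to it, the crawl only includes it when its URL is supplied as a starting point.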

After spiders find pages, they pass them on to another computer program for "indexing." This program identifies the text, links, and other content in the page and stores it in the search engine database's files so that the database can be searched by keyword and whatever more advanced approaches are offered, and the page will be found if your search matches its content.
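One common way to make pages searchable by keyword is an inverted index, which maps each word to the pages that contain it. The sketch below assumes that approach; the extract does not say which data structure any particular search engine uses, and the page texts and function names are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages handed over by the spider.
PAGES = {
    "http://example.org/a": "Search engines crawl the web with spiders",
    "http://example.org/b": "Spiders follow links between web pages",
    "http://example.org/c": "Library catalogs are part of the invisible web",
}

index = build_index(PAGES)
print(sorted(search(index, "web spiders")))
print(sorted(search(index, "invisible")))
```

A query is answered from the index alone, which is why a search runs against the stored copy of a page and not against the live page.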

Many web pages are excluded from most search engines by policy. The contents of most of the searchable databases mounted on the web, such as library catalogs and article databases, are excluded because search engine spiders cannot access them. All this material is referred to as the "Invisible Web" -- what you don't see in search engine results.

Google Data Centre FAQ

March 27th, 2008 : Rich Miller

Extract from the original article: http://www.datacenterknowledge.com/archives/2008/03/27/google-data-center-faq

How many data centres does Google have?
Nobody knows for sure, and the company isn’t saying. The conventional wisdom is that Google has dozens of data centres. We’re aware of at least 12 significant Google data centre installations in the United States, with another three under construction. In Europe, Google is known to have equipment in at least five locations, with new data centres being built in two other venues.

Where are Google’s data centres located?
Google has disclosed the sites of four new facilities announced in 2007, but many of its older data centre locations remain under wraps. Much of Google’s data centre equipment is housed in the company’s own facilities, but it also continues to lease space in a number of third-party facilities. Much of its third-party data centre space is focused around peering centres in major connectivity hubs. Here’s our best information about where Google is operating data centres, building new ones, or maintaining equipment for network peering. Facilities we believe to be major data centres are bold-faced.

[Photo: Google Data Centre, The Dalles, Oregon]

Most of the international locations likely are for network peering or to house servers supporting the more than 30 country-specific versions of the Google search engine.

How big are Google’s data centres?
Google doesn’t disclose the size of individual data centre buildings, but journalists have managed to learn details of several sites from site plans filed with local planning boards: 

Data centre operators often standardize some of their construction process. The difference in the square footage reports for the data centres in The Dalles and Lenoir suggests that Google doesn’t standardize on a single data centre size (at least not on the level of MCI/WorldCom, which once built identical 109,000 square foot data centres in 25 cities). Google spokesman Barry Schnitt says Google data centres are not cookie-cutter designs, as the company is constantly updating its data centre design and equipment to take advantage of the latest technological advances and efficiencies.

How much do Google data centres cost?
Each of the four new Google data centre projects unveiled in 2007 cost an estimated $600 million. That figure includes capital investment for construction, infrastructure and computers for two data centre buildings, according to Schnitt. Each project budget includes two data centre facilities, with the option of adding a third, which would require additional expense beyond the $600 million. Those expenses will be realized over time. In its earnings reports, Google reported $1.9 billion in spending on data centres in 2006 and $2.4 billion in 2007.

How does Google decide where to build its data centres?
Here are the factors that are known to influence Google’s data centre site location process:

What kind of hardware and software does Google use in its data centres?
Google uses commodity web servers that it customizes with highly efficient power supplies. The company’s engineers have filed a patent on a power supply that integrates a battery, allowing it to function as an uninterruptible power supply (UPS). Google is also reported to be building its own energy-efficient 10 Gigabit Ethernet switches for its data centres.

Google is known to use in-house software for its operations. These programs include: 

Do Google’s sites ever go offline?

Not very often. The web site monitoring service Pingdom tracked Google’s worldwide network of search sites for a one-year period ending in October 2007, and found that all 32 of Google’s worldwide search portals (including google.co.uk, google.in, etc.) maintained uptime of at least “four nines” – 99.99 percent. The main site at google.com was down for 31 minutes in the 12-month monitoring period. The best performer was Google Brazil (google.com.br), with 3 minutes of downtime. Some Google services (notably Blogger) experience performance problems more often.
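As a rough check of these figures, the arithmetic below converts between an uptime percentage and minutes of downtime. It assumes the monitoring period was exactly 365 days, which the article does not state, and the function names are illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day period


def allowed_downtime_minutes(uptime_percent):
    """Minutes of downtime per year permitted by a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def uptime_percent(downtime_minutes):
    """Uptime percentage implied by a number of minutes of downtime per year."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


print(round(allowed_downtime_minutes(99.99), 2))  # budget for "four nines"
print(round(uptime_percent(31), 4))               # google.com, 31 minutes down
print(round(uptime_percent(3), 4))                # google.com.br, 3 minutes down
```

Under that assumption, "four nines" allows about 52.56 minutes of downtime in a year, so 31 minutes corresponds to roughly 99.9941 percent uptime and 3 minutes to roughly 99.9994 percent.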

Does Google lease space in other companies’ data centres?
In recent years Google has made a concerted effort to reduce its use of data centre space in multi-tenant facilities. Google is reported to have reduced its use of space at Equinix (EQIX) and Savvis (SVVS). Operating its own facilities allows Google more flexibility in its design and more space to grow. Sharing a data centre building with other providers also makes it harder for Google to maintain secrecy around its operations, as data centre admins and contractors for other companies will be coming and going in a multi-tenant site. As an example, blogger Robert Scoble once got a look at some of Google’s data centre space when he visited the cages of a photo sharing startup, which happened to be housed in the same facility.

Are there pictures of Google’s servers and data centres?
There are many pictures of the exterior of Google’s data centres, and a smaller selection of images of the equipment inside. A Google Image Search turns up many photos around the web, and many others have been posted on Flickr.

Equipment: The Computer History Museum in Silicon Valley has a display of the first Google production server rack from 1999. Several more recent photos of Google racks from presentations have appeared on the web, one showing an image from a slide from Google Developer Day in 2007 and another from 2006 that has been widely circulated.

Exteriors: Google’s facility in The Dalles has been the subject of many photos, most notably collections posted online by Dalles resident John Nelson in 2006 and Information Week editor John Foley in 2007.

End of Rich Miller's FAQ

Google's Network Topology

Below is a message thread from lammert on WebmasterWorld

Google operates a number of data centres around the world. I am not sure of the exact number, but at the moment there are about 15. Each data centre has one or more clusters, and each cluster consists of thousands of computers calculating the SERPs (Search Engine Results Pages) for your search query. When you do a query, you are connected to one of these data centres. Which one is determined by the DNS settings of Google's nameservers, ns1.google.com ... ns4.google.com.

The DNS servers play an important role in the load distribution and disaster recovery. When you request the IP address for www.google.com, the DNS server first replies with a canonical name. This name has the form www.X.google.com where X is a letter. At this moment the name www.l.google.com is returned from the location where I am working, but this can vary depending on location and time.

Then a second query is done to translate this canonical name to an IP address. Every canonical name of the form www.X.google.com returns 3 IP addresses which can be used by the browser to attach to the search engine.
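The two-step lookup can be sketched with a toy record table. This is a simulation, not a real DNS client: the canonical name follows the www.X.google.com pattern from the post, the three addresses are reserved documentation addresses rather than real Google servers, and the resolve function and RECORDS table are hypothetical.

```python
# Toy record table. The record values are invented for illustration: the
# addresses come from the reserved documentation range 192.0.2.0/24.
RECORDS = {
    ("www.google.com", "CNAME"): ["www.l.google.com"],
    ("www.l.google.com", "A"): ["192.0.2.10", "192.0.2.11", "192.0.2.12"],
}


def resolve(name, records, max_hops=8):
    """Follow CNAME records until a name with address (A) records is reached."""
    chain = [name]
    for _ in range(max_hops):
        addresses = records.get((name, "A"))
        if addresses:
            return chain, addresses
        alias = records.get((name, "CNAME"))
        if not alias:
            raise LookupError(f"no records for {name}")
        name = alias[0]          # first step: learn the canonical name
        chain.append(name)       # next loop pass: look up its addresses
    raise LookupError("too many CNAME hops")


chain, addresses = resolve("www.google.com", RECORDS)
print(" -> ".join(chain))
print(addresses)
```

The browser can then connect to any of the returned addresses.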

Throughout the day, you are not always connected to the same data centre or cluster. This is because Google has set an extremely short TTL (time to live) for the canonical name and IP addresses. They have a good reason for it: if a cluster is overloaded or breaks down, they can route requests to another cluster or data centre. Within 5 minutes (the TTL of the IP addresses) all clients will request a new IP address for www.google.com and all traffic is rerouted.
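To see why a short TTL makes rerouting quick, here is a minimal simulation of a client-side cache that keeps an answer only until its TTL runs out. The 300-second TTL matches the 5 minutes mentioned above; everything else (the CachingResolver class, the fake clock, and the documentation-range addresses) is an assumption made for the example.

```python
class CachingResolver:
    """Toy client-side DNS cache that honours a time to live (TTL)."""

    def __init__(self, lookup, ttl_seconds, clock):
        self.lookup = lookup            # function: name -> list of IP addresses
        self.ttl_seconds = ttl_seconds
        self.clock = clock              # function returning the time in seconds
        self.cache = {}                 # name -> (expiry_time, addresses)

    def resolve(self, name):
        now = self.clock()
        entry = self.cache.get(name)
        if entry is not None and now < entry[0]:
            return entry[1]             # cached answer is still fresh
        addresses = self.lookup(name)   # expired or missing: ask again
        self.cache[name] = (now + self.ttl_seconds, addresses)
        return addresses


# Simulated authoritative answers; reserved documentation addresses,
# not real Google servers.
authoritative = {"www.google.com": ["192.0.2.10", "192.0.2.11", "192.0.2.12"]}
current_time = [0]  # fake clock, in seconds

resolver = CachingResolver(
    lookup=lambda name: list(authoritative[name]),
    ttl_seconds=300,
    clock=lambda: current_time[0],
)

first = resolver.resolve("www.google.com")

# The operator points the name at another cluster.
authoritative["www.google.com"] = ["198.51.100.20", "198.51.100.21", "198.51.100.22"]

current_time[0] = 299
still_cached = resolver.resolve("www.google.com")  # TTL not yet expired

current_time[0] = 300
rerouted = resolver.resolve("www.google.com")      # TTL expired, new cluster

print(first, still_cached, rerouted, sep="\n")
```

With a longer TTL the client would keep using the old cluster for correspondingly longer after the change.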

Here is a test you can do yourself. It works on Windows 2000 and also on XP.

Start the command line program: nslookup
Type the command: set d2
Type: www.google.com

The program will now query the Google nameservers for the canonical name and the IP addresses for www.google.com. Because debugging is switched on with the set d2 command, you will also see the TTL values for the canonical name and IP addresses.

** End of Report