How Google Works

Click here for a map to the Googleplex (Google's headquarters)

In July 2016 Gartner estimated that Google had 2.5 million servers (individual computers) at the time. That figure has almost certainly grown since, but the company remains tight-lipped on the subject.

How do Search Engines Work?
Extract from www.lib.berkeley.edu (archive)

Search Engines for the general web do not really search the World Wide Web directly. Each one searches a database of the full text of web pages automatically harvested from the billions of web pages out there residing on servers. When you search the web using a search engine, you are always searching a somewhat stale copy of the real web page. When you click on links provided in a search engine's search results, you retrieve from the server the current version of the page.

Search engine databases are selected and built by computer robot programs called spiders. These "crawl" the web, finding pages for potential inclusion by following the links in the pages they already have in their database (i.e., already "know about"). They cannot think or type a URL or use judgment to "decide" to go look something up and see what's on the web about it. (Computers are getting more sophisticated all the time, but they are still brainless.)

If a web page is never linked to in any other page, search engine spiders cannot find it. The only way a brand new page - one that no other page has ever linked to - can get into a search engine is for its URL to be sent by some human to the search engine companies as a request that the new page be included. All search engine companies offer ways to do this.
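The link-following behaviour described above can be sketched as a breadth-first traversal. This is a minimal illustration and not any search engine's actual crawler: the `WEB` dictionary, its URLs, and the `crawl` function are all hypothetical, and a real spider would fetch pages over the network rather than read them from memory.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
WEB = {
    "a.example/home": ["a.example/about", "b.example/news"],
    "a.example/about": ["a.example/home"],
    "b.example/news": ["b.example/archive"],
    "b.example/archive": [],
    "c.example/orphan": ["a.example/home"],  # no other page links TO this one
}

def crawl(seeds, web):
    """Breadth-first crawl: only follow links on pages already known."""
    known = set(seeds)
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for link in web.get(url, []):
            if link not in known:
                known.add(link)
                queue.append(link)
    return known

# Starting from one known page, the orphan page is never discovered.
found = crawl(["a.example/home"], WEB)

# Once a human submits the orphan's URL as a seed, it is included.
submitted = crawl(["a.example/home", "c.example/orphan"], WEB)
```

Here `found` contains the four linked pages only, while `submitted` also contains `c.example/orphan`, mirroring the point that an unlinked page has to be submitted by a person.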

After spiders find pages, they pass them on to another computer program for "indexing." This program identifies the text, links, and other content in the page and stores it in the search engine database's files so that the database can be searched by keyword and whatever more advanced approaches are offered, and the page will be found if your search matches its content.
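One common way to make stored pages searchable by keyword is an inverted index, which maps each word to the pages containing it. The sketch below assumes that approach for illustration; the source does not say which data structure any particular search engine uses, and the page texts, `build_index`, and `search` are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages handed over by the spider.
PAGES = {
    "a.example/home": "Search engines crawl the web",
    "b.example/news": "Spiders crawl pages by following links",
}
INDEX = build_index(PAGES)
```

A query such as `search(INDEX, "crawl links")` consults only the index, never the live pages, which is why results can lag behind the current version of a page.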

Many web pages are excluded from most search engines by policy. The contents of most of the searchable databases mounted on the web, such as library catalogs and article databases, are excluded because search engine spiders cannot access them. All this material is referred to as the "Invisible Web" -- what you don't see in search engine results.

Google in Australia

Google launched in Australia in 2002 with one employee, establishing its presence via key partner networks and providing both web search and advertising services. In 2017 it launched its first Google Cloud Platform region in Sydney, thus speeding up traffic, and it opened a second region in Melbourne in 2021. As of 2024 it employs approximately 2,130 people throughout Australia. It is administered from its head office in Pyrmont, New South Wales.

Google Data Centre FAQ

Go to further background on how Google keeps its servers from getting overloaded.

BY DATA CENTER KNOWLEDGE ON MARCH 16, 2017

Extract from original article: www.datacenterknowledge.com/archives/2017/03/16/google-data-center-faq

Google is the largest, most-used search engine in the world, with a global market share that has held steady at about 90 percent since Google Search launched in 1997 as Backrub. In 2017, Google became the most valuable brand in the world, topping Apple, according to the Brand Finance Global 500 report. Google's position is due mainly to its core business as a search engine and its ability to transform users into payers via advertising.

About 32 percent of Google visitors come from the US, where the company holds 63.9 percent of the search engine market, according to statista.com. Google had 247 million unique US users in November 2015. Globally, it boasts 1.5 billion search engine users and more than 1 billion users of Gmail.

Google data centers process an average of 40,000 searches per second, resulting in 3.5 billion searches per day and 1.2 trillion searches per year, Internet Live Stats reports. That's up from 795.2 million searches per year in 1999, one year after Google was launched.
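The daily and yearly totals quoted above can be cross-checked with simple arithmetic. The sketch below starts from the 3.5 billion searches per day figure and derives the implied per-second and per-year rates; the function names are illustrative, and a 365-day year is assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def per_second(searches_per_day):
    """Average searches per second implied by a daily total."""
    return searches_per_day / SECONDS_PER_DAY

def per_year(searches_per_day, days=365):
    """Yearly total implied by a daily total (365-day year assumed)."""
    return searches_per_day * days

daily = 3.5e9  # 3.5 billion searches per day, as quoted in the text
rate = per_second(daily)
yearly = per_year(daily)

print(f"{rate:,.0f} searches per second")
print(f"{yearly / 1e12:.2f} trillion searches per year")
```

The implied rate comes out in the tens of thousands of searches per second, and the yearly total lands a little above 1.2 trillion.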

In a reorganization in October 2015, Google became a subsidiary of a new company it created called Alphabet. Since then, several projects have been canceled or scaled back, including a halt to the further rollout of Google Fiber. Following the reorg, however, Google has focused heavily on (and dedicated a lot of resources to) selling cloud services to enterprises, going head-to-head against the market giant Amazon Web Services and the second-largest player in the space, Microsoft Azure.

That has meant a major expansion of Google data centers specifically to support those cloud services. At the Google Cloud Next conference in San Francisco in March 2017, the company's execs revealed that it spent nearly $30 billion on data centers over the preceding three years. While the company already has what is probably the world's largest cloud, it was not built to support enterprise cloud services. To do that, the company needs to have data centers in more locations, and that's what it has been doing, adding new locations to support cloud services and adding cloud data center capacity wherever it makes sense in existing locations.

How Many Servers Does Google Have?

There's no official data on how many servers there are in Google data centers, but Gartner estimated in a July 2016 report that Google at the time had 2.5 million servers. This number, of course, is always changing as the company expands capacity and refreshes its hardware.

How Many Google Data Centers are There?

Few outside Google know exactly how many data centers Google operates. There are the massive Google data center campuses, of which it says it has 15. Some of its enterprise cloud regions are on those campuses, and some are elsewhere. As of March 2017, the company had six enterprise cloud regions online and 11 in the works. Most if not all of these locations have or will have multiple data centers each. Google has not shared publicly exactly how many there are in each location.

Also unclear is the number of caching sites, also referred to as edge Points of Presence, that Google has around the world. These are small-capacity deployments in leased spaces inside colocation facilities operated by data center providers like Equinix, Interxion, or NTT. The company says there are more than 100 such sites but doesn't share the exact number.

Where are Google Data Centers Located?

Google lists eight data center locations in the U.S., one in South America, four in Europe and two in Asia. Its cloud sites, however, are expanding, and Google's cloud map shows many points of presence worldwide. The company also has many caching sites in colocation facilities throughout the world, whose locations it does not share.

This far-flung network is necessary not only to support operations that run 24/7, but also to meet the specific regulations of certain regions (like the EU's privacy regulations) and to ensure business continuity in the face of risks like natural disasters.

In the works as of March 2017 are Google data centers for cloud services in California, Canada, the Netherlands, Northern Virginia, Sao Paulo, London, Finland, Frankfurt, Mumbai, Singapore, and Sydney.

Click here for a map of the world, highlighting Google's published Data Centres.

Google Data Centre, The Dalles, Oregon

UNITED STATES

The Dalles, Oregon (see picture on right)
Atlanta, Georgia
Jackson County, Alabama (announced in 2015)
Montgomery County, Tennessee (also announced in 2015)
Lenoir, North Carolina
Goose Creek, South Carolina
Pryor, Oklahoma
Council Bluffs, Iowa

INTERNATIONAL SITES

Mons, Belgium
Eemshaven, Netherlands
Dublin, Ireland
Hamina, Finland
Taiwan
Singapore
Quilicura, Chile

Where are Google's Cloud data centers?

Google has several cloud data centers throughout the world and is bringing more online in 2017. In addition to data centers in the Western US, Central US, Eastern US, Western Europe, and Eastern Asia, the company announced that new "cloud regions" will come online in 2017 in Frankfurt, London, Mumbai, Singapore, Sydney, and Sao Paulo, and in undisclosed areas in Finland, California, The Netherlands, and Northern Virginia.

The proliferation of new cloud data centers reduces latency for Google customers and enables customers to address data sovereignty issues, as different countries have different laws governing storage and transfer of citizens' personal data.

How Big are Google Data Centers?

A paper presented to the IEEE 802.3bs Task Force in May 2014 estimates the size of five of Google's US facilities as:

Pryor Creek (Mayes County), Oklahoma, 90,000 square metres
Lenoir, North Carolina, 30,000 square metres
The Dalles, Oregon, 18,580 square metres initially (in 2016, an extra 15,240 square metres expansion)
Council Bluffs, Iowa, 18,580 square metres
Berkeley County, South Carolina, 18,580 square metres

Many of these sites have multiple data center buildings, as Google prefers to build additional structures as sites expand rather than containing operations in a single massive building.

Google itself doesn't disclose the size of its data centers. Instead, it mentions the cost of the sites or number of employees. Sometimes, facility size slips out. For example, the announcement about the opening of The Dalles in Oregon said the initial building was 15,240 square metres. The size of subsequent expansions, however, has been kept tightly under wraps.

Reports discussing Google's new data center in Eemshaven, Netherlands, which opened in December 2016, didn't mention size. Instead, they said the company has contracted for the entire 62-megawatt output of a nearby wind farm and ran 16,000 kilometres of computer cable within the facility. The data center employs 150 people.

How much do Google data centres cost to build?

Google's newest data center at The Dalles in Oregon, a 15,240 square metre building that opened in 2016, brought its total investment in that site to $1.2 billion. The overall size totals 32,710 square metres of data center divided among three buildings. The site first opened in 2006 and currently employs 175 people. Google has announced plans to add another $600 million data center about a mile away, bringing the investment to $1.8 billion. That center is expected to employ about 50 people.

Likewise, the Pryor Creek, Oklahoma, data center is continuing to expand. It first went online in 2011 with a 12,000 square metre, $600 million facility, and soon afterwards another building was added for staff offices. When the expansion announced in 2016 is completed, Google's Pryor Creek data center will represent a $2 billion investment.

The new data center under construction in 2016 in Eemshaven, Netherlands, is expected to cost $773 million. In typical Google fashion, there's no word on size.

Overall, Google's capital expenditures for 2016 were just under $10.2 billion. Most of that can be accounted for by its data centers and land acquisitions.

What Kind of Hardware and Software Does Google Use in Its Data Centers?

It's no secret that Google has built its own Internet infrastructure since 2004 from commodity components, resulting in nimble, software-defined data centers. The resulting hierarchical mesh design is standard across all its data centers.

The hardware is dominated by Google-designed custom servers and Jupiter, the switch Google introduced in 2012. With its economies of scale, Google contracts directly with manufacturers to get the best deals.

Google's servers and networking software run a hardened version of the Linux open source operating system. Individual programs have been written in-house. They include, to the best of our knowledge:

Google Web Server (GWS) - custom Linux-based Web server that Google uses for its online services.
Storage systems:
1. Colossus - the cluster-level file system that replaced the Google File System
2. BigTable - a high performance NoSQL database service for large analytical and operational workloads
3. Spanner - a globally distributed NewSQL database
4. Google F1 - a distributed, relational database that replaced MySQL
5. Chubby lock service - provides coarse-grained locking and reliable, low-volume storage for loosely coupled distributed systems
Programming languages - C++, Java and Python dominate
Indexing/search systems:
1. Caffeine - a continuous indexing system launched in 2010 to replace TeraGoogle
2. Hummingbird - a major search algorithm introduced in 2013
Borg - a cluster manager that runs hundreds of thousands of jobs from thousands of applications across multiple clusters on thousands of machines

Google also has developed several abstractions that it uses for storing most of its data:

Protocol Buffers - a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more
SSTable (Sorted Strings Table) - a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings. It is also used as one of the building blocks of BigTable
RecordIO - a file defining IO interfaces compatible with Google's IO specifications
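To make the SSTable description above more concrete, here is a toy sketch of an ordered, immutable map from byte-string keys to byte-string values. It is an assumption-laden illustration only: the class and its `get` and `scan` methods are hypothetical names, the data lives in memory rather than in a persistent file, and nothing here reflects Google's actual file format.

```python
import bisect

class SSTable:
    """Toy in-memory stand-in for a sorted, immutable bytes-to-bytes map."""

    def __init__(self, items):
        # Sort once at build time; tuples cannot be modified afterwards.
        pairs = sorted(dict(items).items())
        self._keys = tuple(k for k, _ in pairs)
        self._values = tuple(v for _, v in pairs)

    def get(self, key, default=None):
        """Binary-search the sorted keys for an exact match."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for i in range(lo, hi):
            yield self._keys[i], self._values[i]

# Hypothetical keys and values, inserted out of order on purpose.
table = SSTable([
    (b"com.example/b", b"page b"),
    (b"org.example/x", b"page x"),
    (b"com.example/a", b"page a"),
])
```

Because the keys are kept sorted, both point lookups and range scans such as `table.scan(b"com.", b"com/")` reduce to binary searches, which is one reason a sorted, immutable layout is a convenient building block.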

End of FAQ

Google's Network Topology

Below is an extract from a 2005 message thread on WebmasterWorld, posted by lammert:

Google operates a number of datacentres around the world. I am not sure about the exact number, but at the moment there are about 15. Each datacentre has one or more clusters, and each cluster consists of thousands of computers calculating the SERPs (Search Engine Results Pages) for your search query. When you do a query, you are connected with one of these data centres. Which one is determined by internal DNS settings of the nameservers of Google called ns1.google.com ... ns4.google.com.

These DNS servers play an important role in load distribution and disaster recovery. When you request the IP address for www.google.com, their DNS server replies with an address that your browser uses to connect to the search engine, and that your ISP can cache for subsequent queries.

But you are not connected to the same data centre or cluster throughout the day. This is because Google has set an extremely short TTL (time to live) on its name-to-IP-address records, and for good reason: if a cluster is overloaded or breaks down, requests can be routed to another cluster or datacentre. Because of the short TTL, within 5 minutes clients request a new IP address for www.google.com and all traffic is rerouted.
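The effect of a short TTL on rerouting can be simulated without touching the network. In the sketch below, the `TtlCache` class, the `authoritative` function, and both IP addresses (taken from ranges reserved for documentation) are hypothetical; the 300-second TTL is an assumption chosen to match the 5 minutes mentioned above.

```python
class TtlCache:
    """Caches one answer per hostname until its TTL expires."""

    def __init__(self, resolver):
        self._resolver = resolver  # callable: hostname -> (ip, ttl_seconds)
        self._cache = {}           # hostname -> (ip, expires_at)

    def lookup(self, hostname, now):
        cached = self._cache.get(hostname)
        if cached and now < cached[1]:
            return cached[0]                      # still fresh: reuse it
        ip, ttl = self._resolver(hostname)        # expired: ask again
        self._cache[hostname] = (ip, now + ttl)
        return ip

# Hypothetical authoritative nameserver that points at the current cluster.
current_cluster = {"ip": "192.0.2.10"}

def authoritative(hostname):
    return current_cluster["ip"], 300  # assumed TTL of 300 seconds

cache = TtlCache(authoritative)
first = cache.lookup("www.google.com", now=0)

# The first cluster fails, so the nameserver starts answering with another.
current_cluster["ip"] = "198.51.100.20"

still_cached = cache.lookup("www.google.com", now=299)  # TTL not yet expired
rerouted = cache.lookup("www.google.com", now=300)      # TTL expired
```

Until the cached record expires the client keeps using the old address, and from the 300-second mark onward it receives the new one, which is the trade-off a short TTL buys: more DNS queries in exchange for faster failover.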

Some tests you can do yourself.

In Windows, start the command line program and type: nslookup
Type the command: set d2
Type: www.google.com

The program will now query the Google nameservers for www.google.com. Because debugging is switched on with the set d2 command, you will also see the TTL times for these IP addresses.

Click here for an article on "Timing Google's Crawl"

** End of Report