Monday, June 05, 2006

Google SEO Algorithm Problems

Have you noticed anything different with Google lately? We webmasters certainly have, and if recent talk on SEO forums is an indicator, we're very frustrated! Over the last two years, Google has introduced a series of algorithm and filter changes that have led to unpredictable search engine results, dropping the rankings of many clean (non-spamming) web sites.

Google's algorithm changes started in November 2003 with the Florida update, now remembered as a legendary event in the webmaster community. This was followed by the Austin, Brandy, Bourbon, and Jagger updates. Google updates used to occur monthly; they're now carried out quarterly. But with so many data centers, there seem to be several different sets of results rolling through Google's servers at any given time during a quarter.

BigDaddy, Google's most recent update, is partly to blame. Believed to run on a 64-bit architecture, BigDaddy is an update of Google's infrastructure as much as it is an update of its search algorithm. Since its rollout, pages have been losing their first-page rankings and dropping to the 100th page, or worse still, into the Supplemental index!

BigDaddy's algorithm problems fall into four categories: canonical issues, duplicate content issues, the Sandbox, and supplemental page issues.


  1. Canonical Issues. These occur when a search engine treats www.yourdomain.com, yourdomain.com, and yourdomain.com/index.html as different web sites. When Google does this, it flags the copies as duplicate content and penalizes them. If yourdomain.com is the version left unpenalized, but every other site links to you as www.yourdomain.com, then the copy remaining in the index has none of your incoming links and therefore no ranking. These are basic issues that other major search engines, such as Yahoo and MSN, have no problem dealing with. Google's reputation as the world's greatest search engine (self-ranked as a ten on a scale of one to ten) is hindered by its inability to resolve basic indexing issues. (A small sketch of the URL normalization involved follows this list.)
  2. The Sandbox. It's believed that Google imposes a time penalty on new links and sites before giving them full rank in the index, based on the presumption that 100,000-page websites can't be created overnight. Certain web sites, or links to them, are "sandboxed" for a period of time before they earn their full weight in the results. Speculation is that only a set of competitive keywords (the ones that are manipulated the most) are sandboxed. Something of a legend in the search engine world, the Sandbox's existence has been debated, and is yet to be confirmed by Google.
  3. Duplicate Content Issues. Since web pages drive search engine rankings, Black Hat SEOs began duplicating the content of entire web sites under their own domain names, instantly producing a ton of web pages (kind of like downloading an encyclopedia onto your web site). Because of this abuse, Google attacked duplicate content aggressively in its algorithm updates, knocking out many legitimate websites as collateral damage in the process. For example, when someone scrapes your site, Google will look at both renditions of the content, and in some cases it may decide that the legitimate one is the duplicate. The only way to prevent this is to track down sites as they scrape you and submit spam reports to Google. Duplicate content is also tricky because there are plenty of legitimate uses for it. News feeds are the most obvious example: a news story is covered by many websites because it's the content that viewers want to see. Any filter will inevitably catch some legitimate uses. (A rough sketch of how such a duplicate filter might compare two pages also follows this list.)


  4. Supplemental Page Issues. Known as "Supplemental Hell" to webmasters, the issue had been lurking in places like Webmasterworld for over a year, but it was the major shake-up in late February (coinciding with the ongoing BigDaddy rollout) that finally caused all hell to break loose in the webmaster community. You may be aware that Google has two indexes: the main index, which is the one you see when you search; and the Supplemental index, a graveyard where old, erroneous, and obsolete pages (among others) are laid to rest. Nobody disputes the need for a Supplemental index; it serves a worthy purpose. But being buried alive is another story, and that's exactly what's been happening: active, recent, and clean pages have been showing up in the Supplemental index. The true nature of the issue is unclear, and no common cause has been identified.
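
To make the canonical issue in point 1 concrete, here is a minimal sketch, in Python, of the kind of URL normalization that treats those three addresses as one page. The preferred host and the list of index filenames are assumptions chosen for illustration; in practice, site owners usually enforce this on their own servers with a permanent (301) redirect to a single preferred form.

    # Minimal sketch: collapse www/non-www and index.html variants into one canonical URL.
    # PREFERRED_HOST and INDEX_PAGES are illustrative assumptions, not Google's rules.
    from urllib.parse import urlparse

    PREFERRED_HOST = "www.yourdomain.com"   # pick one form of the domain and stick to it
    INDEX_PAGES = {"index.html", "index.htm", "default.htm"}

    def canonicalize(url):
        parts = urlparse(url if "://" in url else "http://" + url)
        host = parts.netloc.lower()
        if host == PREFERRED_HOST.replace("www.", ""):
            host = PREFERRED_HOST             # fold the bare domain into the www form
        path = parts.path or "/"
        if path.lstrip("/").lower() in INDEX_PAGES:
            path = "/"                        # index.html is the same page as /
        return "http://" + host + path

    # All three variants from point 1 collapse to the same canonical URL.
    for u in ("www.yourdomain.com", "yourdomain.com", "yourdomain.com/index.html"):
        print(canonicalize(u))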
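
As for the duplicate-content filtering in point 3, nobody outside Google knows how its filter actually works, but a rough sketch of one common approach is to compare the overlapping word "shingles" of two pages; the shingle length and the similarity threshold below are arbitrary assumptions. Note that nothing in a check like this says which copy is the original, which is exactly where legitimate sites get hurt.

    # Rough sketch of a generic near-duplicate check based on word shingles.
    # The shingle length and the 0.9 threshold are arbitrary assumptions.

    def shingles(text, length=5):
        """Return the set of overlapping `length`-word sequences in the text."""
        words = text.lower().split()
        return {tuple(words[i:i + length]) for i in range(len(words) - length + 1)}

    def similarity(page_a, page_b):
        """Jaccard similarity of the two pages' shingle sets (0.0 to 1.0)."""
        a, b = shingles(page_a), shingles(page_b)
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)

    def looks_like_duplicate(page_a, page_b, threshold=0.9):
        return similarity(page_a, page_b) >= threshold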

Google's monthly updates, once fairly predictable, were anticipated by webmasters with both joy and angst. Google followed a well-publicized algorithm that gave each web page a Page Rank (a numerical ranking based on the number and rank of the web pages that link to it). When you searched for a term, Google ordered all the web pages deemed relevant to your search by their Page Rank. A number of factors were used to determine the relevancy of pages, including keyword density, page titles, meta tags, and header tags.
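
For readers who haven't seen it written out, here is a minimal sketch of how that published Page Rank calculation is usually described, using a made-up three-page link graph; the damping factor of 0.85 and the iteration count are illustrative assumptions rather than Google's actual settings.

    # Minimal sketch of the published Page Rank iteration, for illustration only.
    # PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T that link to A,
    # where C(T) is the number of outgoing links on page T.

    def pagerank(links, damping=0.85, iterations=20):
        """links maps each page to the list of pages it links to."""
        pages = set(links) | {p for targets in links.values() for p in targets}
        pr = {page: 1.0 for page in pages}   # every page starts with the same rank

        for _ in range(iterations):
            new_pr = {}
            for page in pages:
                incoming = sum(
                    pr[src] / len(targets)
                    for src, targets in links.items()
                    if page in targets
                )
                new_pr[page] = (1 - damping) + damping * incoming
            pr = new_pr
        return pr

    # Hypothetical three-page site: more (and better-ranked) inbound links mean higher PR.
    example = {
        "home": ["about", "articles"],
        "about": ["home"],
        "articles": ["home"],
    }
    print(pagerank(example))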

This original algorithm favored incoming links that used selected keywords as anchor text. The more sites that linked to yours using that keyword-rich anchor text, the better your search rank for those keywords. As Google became the dominant search force in the early part of the decade, site owners fought for high rankings in its SERPs. The release of Google's Adsense program made winning very lucrative: if site owners ranked highly for a popular keyword, they could run Google ads under Adsense and split the revenue with Google! This set off an SEO epidemic the likes of which the webmaster world had never seen.


The nature of links between web sites changed. Webmasters came to see outgoing links as doing little for their own rankings while boosting those of their competitors. Under Google's algorithm, links coming into your web site raise its Page Rank (PR), while each outgoing link passes a share of PR along to the site it points to. Attempts to boost rankings led to link farms, reciprocal link partnerships, and the buying and selling of links. Instead of linking out to point visitors to quality content, webmasters now placed links to prop up PR and for monetary gain.

This led to the wholesale scraping of web sites, as I mentioned earlier. Black Hat SEOs could combine the content of an entire web site with Google's ads and a few high-powered incoming links to produce high page rankings and generate revenue from Google's Adsense program -- all without providing any unique content themselves! Aware of the manipulation taking place, Google aggressively altered its algorithms to prevent it. Thus began the cat-and-mouse game that the Google algorithm has become: in trying to blacklist the duplicate sites and serve users the most relevant results, the algorithm sometimes attacked the original site instead of the scraped one.

This led to a period of unstable updates that caused many top-ranking, authentic web sites to tumble down the results. Most end users may not perceive this as a problem. As far as they're concerned, Google's job is to provide the most relevant listings for their search, which it still does. For this reason, the problem hasn't made an immediate major impact on Google. However, if Google continues to produce unintended results as it evolves, problems will surface slowly but surely. As they escalate, the webmaster community will lose faith in Google, making it vulnerable to the growing competition.

Webmasters are the word-of-mouth experts, and we run the web sites that carry Google's Adsense ads. Fluctuations in ranking are part of the internet business, and most webmasters realize this – we're simply calling on Google to fix the bugs that unfairly strip our websites of their rightful rankings.

Of course, we understand that the reason the bugs surfaced in the first place is that not all webmasters are innocent. Some have violated the guidelines laid out by Google, and continue to do so. We support Google's need to fight spam and Black Hat SEO manipulation, and accept that there's probably no easy fix.

We don't expect Google to reveal its algorithm, or the changes that have been made to it. But given the impact Google's rankings have on companies, we webmasters would like to see more communication around the known issues, and a chance to help identify future algorithm problems rather than speculate about them.

The most recent of these speculations suggests that Google is now weighing attributes such as the age of a domain name, the number of web sites hosted on the same IP address, and the frequency of fresh content when ranking search results. As webmasters, we'd appreciate the ability to report potential bugs to Google, and to receive a response to our feedback.

After all, it's not just in Google's best interests to have a bug-free algorithm. Ultimately, this will provide the best search results for everyone.