2/13/2008

Potential Solutions & Updates to the Google Sandbox Theory

The Google 'dance' or update that began on February 4, 2005 - the Friday before the Super Bowl - saw the first mass exodus of sites trapped in what SEOs have dubbed Google's "Sandbox". This release did not include ALL sandboxed sites, and indeed, many SEOs are still struggling with this strange filter, which prevents their sites from ranking for even some non-competitive phrases in the Google SERPs.

There are two major problems with this release from my perspective. The first is that Google did not release the sites on all datacenters, suggesting a trial or test period was involved to decide whether to keep the changes that were made on 02/04, or whether to roll back to the previous results. The second major issue is that Google themselves released these sites, rather than individual SEOs or SEOs as a group finding a solution to the problem. This means that SEOs still have no known solutions to the sandbox, although we can speculate and hypothesize on what helped certain sites escape, while others remain 'boxed'.

Datacenter Rank-Monitoring Tools:
Use these tools to check your rank across multiple Google datacenters simultaneously (a rough sketch of the underlying approach follows the list)

1. SEOmoz Rank Check Tool
2. McDar DataCenter Watch Tool
3. Google Dance Machine
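
For readers curious what tools like these do under the hood, below is a minimal, purely illustrative Python sketch of the general approach: send the same query to several datacenter hosts and compare where a given domain appears. The hostnames, query, and parsing are placeholder assumptions of mine, not how any of the tools above actually work.

```python
# Illustrative sketch only: query several Google datacenter hosts with the same
# search and report the position of a target domain on each. The hostnames are
# placeholders (real datacenter addresses change), and the result parsing is
# deliberately naive.
import re
import urllib.parse
import urllib.request

DATACENTERS = ["www.google.com", "www2.google.com"]  # placeholder hosts
QUERY = "seo tools"
DOMAIN = "socengine.com"

def rank_on_datacenter(host, query, domain):
    """Return the 1-based position of `domain` in the first 100 results, or None."""
    url = "http://%s/search?%s" % (host, urllib.parse.urlencode({"q": query, "num": 100}))
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req, timeout=10).read().decode("utf-8", "ignore")
    result_urls = re.findall(r'href="(https?://[^"]+)"', html)  # crude extraction
    for position, result_url in enumerate(result_urls, start=1):
        if domain in result_url:
            return position
    return None

for dc in DATACENTERS:
    print(dc, rank_on_datacenter(dc, QUERY, DOMAIN))
```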

Topics Covered:

In this article I will discuss the current possibilities for the cause of Google's sandbox, examine each theory individually and attempt to prove or disprove its validity based on my own experiences and the posted experiences of others. I will also outline several techniques that can potentially help SEOs avoid this factor and share details on the sites I have personal knowledge of that have escaped.

Current Theories About Google's Sandbox:

The theories below have been hinted at or suggested on the major SEO forums but do not yet have formalized names, so I have given each a working title here. Please feel free to email me if you have additional theories you would like to share or have me examine.

TrustRank Box
This theory, named for the TrustRank paper, holds that some measure of spam-blocking is applied to new domains: those that trip a specific set of filters enter the sandbox. From there, only a combination of links from sites and/or web clusters (sites which are 1-3 links away from one another) that have been hand-picked (by an engineer, Google employee, etc.) as trustworthy, reputable link sources can release a site from ranking purgatory.

Based on my knowledge, TrustRank is among the more likely contributing factors to Google's current algorithm, if not a direct cause for the sandbox. I have no information to prove this case, but, unlike many of the other theories, it does not directly contradict existing evidence. Note that TrustRank relies on a manual review of many tens of thousands of sites - a difficult task, but not impossible, especially given Google's Craigslist ad from 6 months ago (no longer online) requesting experts from a multitude of fields to contribute to a remote surfing & review project.
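
The propagation idea at the heart of the TrustRank paper can be shown on a toy example. The sketch below is my own simplification: a hand-reviewed seed set receives all of the initial trust, which then spreads along outlinks with a damping factor, so pages with no path from a trusted seed accumulate essentially none. The graph and page names are invented for illustration.

```python
# Simplified, illustrative TrustRank-style propagation on a toy link graph.
def trustrank(graph, seeds, damping=0.85, iterations=50):
    """graph: {page: [pages it links to]}; seeds: the manually reviewed, trusted pages."""
    pages = list(graph)
    # All initial trust mass sits on the hand-picked seed pages.
    seed_mass = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_mass)
    for _ in range(iterations):
        # Biased PageRank step: teleport back to the seed set, then follow links.
        new_trust = {p: (1 - damping) * seed_mass[p] for p in pages}
        for page, outlinks in graph.items():
            if not outlinks:
                continue
            share = damping * trust[page] / len(outlinks)
            for target in outlinks:
                new_trust[target] += share
        trust = new_trust
    return trust

# Toy web: "trusted-directory" is the reviewed seed; "newsite" inherits trust only
# because trusted neighbors link to it, while the isolated spam pair gets none.
web = {
    "trusted-directory": ["established", "newsite"],
    "established": ["newsite"],
    "newsite": ["established"],
    "spam-a": ["spam-b"],
    "spam-b": ["spam-a"],
}
print(trustrank(web, seeds={"trusted-directory"}))
```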

Manual Inclusion Box
This theory suggests that sites are placed into the sandbox based on a triggering of Google algorithm filters and the sites are not removed until they have been briefly looked over, manually, by a person in Google's employ.

I find this theory nearly impossible to take seriously, simply given the time commitment required to perform such a task. Based on the number of queries sent to the sandbox detection tool on this site, there are probably many tens of thousands of 'sandboxed' sites, and it is much more plausible that Google is incorporating something like TrustRank, rather than manually approving every sandboxed site.

Broken Index Box
Broken Index suggests that Google has/had been experiencing problems with the technical ranking aspects of sites in its index, and that the sandbox was merely the result of sites not being 'properly' accounted for by Google - through accident or negligence. This theory assumes that Google's own software created the 'sandbox' issue without the express design of the engineers.

This is the most distasteful of the theories I have seen. It is not supported by any evidence, and is well refuted by a number of experts on the subject, including Xan Porter, an IR researcher who dismisses the 'broken Google' theory exceptionally well:

"Its just a search engine for pete's sake. What about software found in planes, ships, medical equipment, military equipment, aerospace equipment (right down to rocket control), etc... I doubt very much that they have lost control of it, there are processes and system engineering in place."

Link Threshold Box
Link Threshold presumes that once a certain quantity and quality of inbound links has been acquired by a website, it is released from the sandbox. This suggests that quality incoming links are a good measure of site quality and an indication that the site is not spam.

Link Threshold by itself is not fully convincing to me. There are several sites I know of that escaped the sandbox during the past run without massive numbers or incredible quality of inbound links. I suspect that link analysis may be part of what releases sites from the sandbox, but I surmise that there is more to it than simply a calculated number of inbounds, amount of PR, etc. This theory, however, is in my 'more likely' category.
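
To illustrate why I doubt a raw count is the whole story, here is a purely hypothetical toy scoring function of my own; nothing about it reflects Google's actual scoring. It only shows that two sites with identical link counts can look very different once source quality and topical relevance are weighted.

```python
# Hypothetical illustration only: the same number of inbound links can produce
# very different scores once each link's source quality and topical relevance
# are weighted. This is not Google's formula.
def weighted_link_score(inbound_links):
    """inbound_links: list of (source_pagerank, is_on_topic) tuples."""
    score = 0.0
    for source_pagerank, is_on_topic in inbound_links:
        score += source_pagerank * (2.0 if is_on_topic else 1.0)
    return score

site_a = [(6, True), (5, True), (4, True)]     # three strong, on-topic links
site_b = [(1, False), (1, False), (1, False)]  # three weak, off-topic links
print(weighted_link_score(site_a), weighted_link_score(site_b))  # 30.0 vs 3.0
```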

Time Delay Box
The Time Delay Box is one of the simplest explanations for the sandbox. It suggests that Google will continue to do large-scale 'releases' for sandboxed sites at various times, letting out those sites which have been 'in the box' for a set amount of time. Since the release of most sites boxed in March-April of 2004 was done in February of 2005, this figure is guessed to be 10-11 months. However, with only a single data-point to plot, reliability is suspect.

I don't see Time Delay as being a likely candidate for the simple reason that it does not serve any purpose. Time Delay by itself is an artificially imposed penalty that cannot, in my opinion, improve relevancy. Google wants to provide the best relevancy so it can continue to take share of the market and grow. This theory does not fit well with the business model of a major company like Google.

AlgoShift AntiBox
This theory presumes that the sandbox does not and has not ever existed, and that those sites which 'escaped' the filter on Feb. 4 are, in fact, simply the beneficiaries of Google's latest algorithm shift. Several long-time, highly experienced SEOs have proposed this, although they usually maintain that it is only a possibility. Longtime sandbox naysayers are also fond of this explanation.

I find this theory exceptionally difficult to believe and would be unlikely to put any stock into it at all if the sources suggesting it were not so renowned in the field. To my mind, no slight or massive (across-the-board-type) algorithm shift could have the kind of impact the Feb. 4 update did. Certainly Google's algorithm technically changed, but sites that were 'sandboxed' suddenly jumped to top 5 and top 3 positions, while the remainder of the SERPs were relatively stable. To me, this evidence suggests that a penalty on these sites was released, rather than that an entirely new method for calculating position was introduced.

These theories are not comprehensive, but do give a good idea of the most likely and most talked-about reasons for the existence of the sandbox. One item to note is that in each case, the process for release is similar to the way in which Google conducts their other types of updates - PageRank in the Toolbar, directory PR, backlinks updates, etc. On a specific date, the engineers make an index-wide change that pushes the previously trapped sites out of the sandbox and enables them to compete normally with older sites.

Techniques to Avoid the Sandbox

There are 6 primary methods I have detailed below for dodging the sandbox's influence and keeping your website out of the 'penalty zone'. These are not authoritative and do not have direct evidence to back them up. They are composed based on the likely explanations for the existence of the sandbox and the SEO best practices that appear to have kept some sites completely unaffected.

  1. Build Only Quality, User-Focused Content

    This means no spammy, keyword-dense pages that are simply there to attract searchers. Build with customers and traffic in mind, not search engines. That said, it is important to still use basic optimization techniques - keywords in title tags & internal anchor text, use positive term weight and related terms, etc.
  2. Build Only Quality, Natural Inbound Links

    Don't let your link greed get ahead of you. Link farms, large sitewide links & big link packages from seedy text link dealers should be dodged. Change anchor text constantly and descriptions even more frequently. Stick to pages where the related terms count is high and the topic is clearly related to your own site. This does not mean you shouldn't buy links, it just means you should buy them individually, from site owners who can send you real traffic through them, rather than just as an attempt to boost link popularity.
  3. Keep Link-Building Slow and Steady

    Don't get 400 links in a week. Build links at the rate of 2-10 per day depending on your size and market. When possible, use content to attract naturally built links and stick to a believable pace.
  4. Acquire Links from Sites/Pages That Already Rank Well (Top 50-100) for Your Major Search Terms

    This is one of the best ways to ensure that you build 'local popularity' - topic-specific popularity, as opposed to PageRank's measure of global popularity. Not only will this bring targeted traffic from people who click on the links, it will also help to signal to Google that your site is on-topic and in the 'cluster' of trusted sites for a particular topic.
  5. Conduct Frequent Checks for Duplicate Content

    Many have suspected that duplicate content is responsible for some of the sandbox-like factors. Run Google searches daily with a long sentence from your home page and major sub-pages to make sure that no one is copying your content (a rough sketch of such a check follows this list).
  6. Stick to White-Hat Only Techniques

    This one is obvious, but the use of cloaking, doorway pages, page-hijacking, and other so-called 'black hat' techniques is not going to be smiled upon by Google. In addition, unless you can retire on your earnings fast, the potential for long-term penalization could ruin your business.
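
Regarding point 5 above, here is a minimal sketch of one way such a duplicate-content check could be scripted. The phrase, domain, and parsing are my own placeholder assumptions, and Google's result markup changes over time, so treat it as illustrative only.

```python
# Minimal sketch of the duplicate-content check in point 5: search Google for an
# exact phrase from your page and flag results on other domains. The parsing is
# naive, so treat this as illustrative only.
import re
import urllib.parse
import urllib.request

def find_possible_copies(sentence, own_domain):
    """Return result-page URLs that are not on your own domain."""
    query = urllib.parse.urlencode({"q": '"%s"' % sentence})
    req = urllib.request.Request(
        "http://www.google.com/search?" + query,
        headers={"User-Agent": "Mozilla/5.0"},
    )
    html = urllib.request.urlopen(req, timeout=10).read().decode("utf-8", "ignore")
    result_urls = re.findall(r'href="(https?://[^"]+)"', html)  # crude extraction
    return [u for u in result_urls if own_domain not in u and "google." not in u]

suspects = find_possible_copies(
    "a long, distinctive sentence copied from your home page",
    own_domain="example.com",  # placeholder for your own domain
)
print(suspects)
```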

Details on Sites Known to Escape

Sadly, there are only two recently escaped sites that I am permitted to comment on and mention by name. I have knowledge of many others, but cannot share them except anonymously.

The first site I'll mention is one I have not been shy about mentioning in the past - www.avatarfinancial.com. It's a site that I have been optimizing since April of 2004, and it was actually my first foray into professional SEO on my own. The site has the following features that have allowed it to climb into many top rankings at multiple engines:

  1. Links - 7100 @ Yahoo!, 1530 @ Google, 16,360 @ MSN (although that count seems erroneous)
  2. Anchor text - #1 allinanchor positions for every keyword targeted
  3. Links from top sites - 25+ of the top 100 pages in Google's SERPs for various kw phrases link to the site, including many in the top 30 and several in the top 10.
  4. On-page optimization, including high term weight on the page for primary terms
  5. Use of high C-Index related terms on the site (mostly naturally occurring) and throughout the link building process, especially in the latter stages.
  6. Links from major hubs - there are 6-10 major hub sites in my SERPs, of which nearly all (or perhaps all) link to avatarfinancial.com
  7. Content - my boss and I write the articles for the site; the early ones tend to be of poorer quality, but the later ones have improved steadily (since I learned about the importance of content)

The second site is this one - socengine.com, which appears to be ranking very highly for terms like 'msn beta search' and 'seo tools'. I have little explanation, but I assume it is because of you, my faithful readers, who have built the links for me. I certainly have not added any manually, except for my signature links at SEO forums (but I don't do that for the search engines' sake). I like to think of SEOmoz/SOCEngine as living proof that a site can succeed with purely natural link building.

Conclusions

The sandbox is still a major issue for many websites, but with the release of the first batch through the Feb. 4 update, there is more hope than at any time since spring of '04 that the problem can be solved and that new sites can eventually gain ranking positions in Google.

My general feeling about Google's results on the whole after the update is that they are slightly more relevant and contain slightly less spam. As I have been monitoring a great variety of terms, I have noticed more and more newer sites appearing in Google's top SERPs - an encouraging sign that age is a factor that can be overcome.

If you would like to discuss this article or read what others have written in response, please visit the SEOChat forums, where I have posted a thread on the subject.

SEOmoz
