Google’s Indexation Cap

Posted by randfish

Over the past 2 years, SEOmoz has worked with quite a number of websites whose primary goal (or primary problem) in SEO has been indexation – getting more of their pages included in Google’s index so they have the opportunity to rank well. These are, obviously, long-tail-focused sites that earn the vast majority of their visits from queries that bring in 5 or fewer searches each day. In this post, I’m going to tackle the question of how Google determines the quantity of pages to index on a site and how sites can go about improving this metric.

First, a quick introduction to a truth that I’m not sure Google’s shared very publicly (though they may have discussed it on panels or formally on the web somewhere I haven’t seen) – that is – the concept that there’s an "indexation cap" on the number of URLs from a website that Google will maintain in their main index. I was skeptical about this until I heard it firsthand from a Googler being described to a webmaster. Even then, I didn’t feel like the principle was "confirmed," but after talking to a lot of SEOs working at very large companies, some of whom have more direct interactions with the search quality team, this is, apparently, a common point of discussion and something Google’s been more open about recently.

The "indexation cap" makes sense, particularly as the web is growing exponentially in size every few years, often due to the production of spam and of more legitimate, but no more index-worthy, content on sites of all sizes and shapes. I believe that many site owners started noticing that the more pages they produced, even with very little "unique" content, the more traffic Google would send, and thus abuse was born. As an example, try searching using Google’s "last 24 hours" function:

SEOmoz blog post search on Google in the past 24 hours
Seriously, go have a look; the quantity of "junk" you wouldn’t want in your search engine’s index is remarkable

Since Tom published the post on Xenu’s Link Sleuth last night, Google’s already discovered more than 250 pages around the web that include that content or mentions of it. If, according to Technorati, the blogosphere is still producing 1.5 million+ posts each week, that’s conservatively growing the web by ~20 billion pages each year. It should come as no surprise that Google, along with every other search engine, has absolutely no desire to keep more than, possibly, 10-20% of this type of content (and anyone who’s tried re-publishing in this fashion for SEO has likely felt that effect). Claiming to have the biggest index size may actually be a strike against relevancy in this world (according to Danny Sullivan, it’s been a dead metric for a long time).
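
The ~20 billion figure above only works if each post spawns a few hundred copies or mentions. A rough back-of-the-envelope sketch, assuming (this multiplier is my assumption, not a measured number) that every post spreads like the Xenu example and produces about 250 pages:

```python
# Back-of-the-envelope check of the "~20 billion pages each year" estimate.
POSTS_PER_WEEK = 1_500_000   # Technorati figure quoted in the post
WEEKS_PER_YEAR = 52
COPIES_PER_POST = 250        # ASSUMPTION: every post spreads like the Xenu example

posts_per_year = POSTS_PER_WEEK * WEEKS_PER_YEAR
pages_per_year = posts_per_year * COPIES_PER_POST

print(f"{posts_per_year:,} posts per year")
print(f"{pages_per_year:,} pages per year")
```

With those inputs the total lands a little under 20 billion; a smaller copies-per-post multiplier shrinks it proportionally.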

So – long story short – Google (very likely) has a limit it places on the number of URLs it will keep in its main index and potentially return in the search results for domains.

The interesting part is that, in the past 3 months, the number of big websites (I’ll use that to refer to sites with in excess of 1 million unique pages) that we’ve talked to, helped through Q+A or consulted with, and that have lost wide swaths of indexation, has skyrocketed, and we’re not alone. The pattern is usually the same:

  • One morning, you wake up, and 40% of your search traffic is gone with no signal as to what’s happened
  • Cue panicking executives, investors and employees (oh, and usually the poor SEO team, too)
  • Enter statistics data, showing that rankings for big terms aren’t down (or, maybe down a little), but that the long tail has gotten a lot shorter
  • Re-consideration request goes to Google
  • Somewhere between 10 and 40 days later, a message arrives saying:

We’ve processed your reconsideration request for http://xyz.com.

We received a request from a site owner to reconsider how we index the following site: http://xyz.com

We’ve now reviewed your site. When we review a site, we check to see if it’s in violation of our Webmaster Guidelines. If we don’t find any problems, we’ll reconsider our indexing of your site. If your site still doesn’t appear in our search results, check our Help Center for steps you can take.

  • This email, soon to be recognized by the Academy of Nonsense for its pre-eminent place among the least helpful collection of words ever assembled, spurs bouts of cursing and sometimes, tragically, termination of SEO or marketing managers. Hence, we at SEOmoz take it pretty personally (as this group includes many close friends & colleagues).
  • Calls go out to the Google AdWords reps, typically consisting of a conversation that goes something like:
    Exec: "We spent $10 million @#$%ing dollars with you last month and you can’t help?"
    AdWords Rep: "I’m sorry. We wish we could help. We just don’t have any influence on that side of the business. We don’t know anyone there or talk to anyone there."
    Exec: "Get me your boss on the phone. Now."
    Repeat ad nauseam until you reach a level of management commensurate with the spend of the exec’s company (or their connections)
    Exec: "Can you get me some answers?"
    AdWords Boss: "They won’t tell me much, but apparently they’re not keeping as many pages in the index from your site as they were before."
    Exec: "Yeah, we kind of figured that part out. Are they going to put us back in?"
    AdWords Boss: "My understanding is no."
    Exec: "So what am I supposed to do? We’re not going to have money to buy those $10 million in ads next month, you know."
    AdWords Boss: "You might try talking to someone who does SEO."
  • At this point, consultants receive desperate email or phone messages

To help site owners facing these problems, let’s examine some of the potential metrics Google looks at to determine indexation (note that these are my opinions, and I don’t have statistical or quantitative data to back them up at this time):

  1. Importance on the Web’s Link Graph
    We’ve talked previously about metrics like a domain-level calculation of PageRank (Domain mozRank is an example of this). It’s likely that Google would make this a backbone of the indexation cap estimate, as sites that tend to be more important and well-linked-to by other important sites tend to also have content worthy of being in the index.
  2. Backlink Profile of the Domain
    The profile of a site’s links can look at metrics like where those links come from, the diversity of the different domains sending links (more is better) and why those links might exist (methods that violate guidelines are often getting caught and filtered so as not to provide value).
  3. Trustworthiness of the Domain
    Calculations like TrustRank (or Domain mozTrust in Linkscape) may make their way into the determination. You may not have as many links, but if they come from sites and pages that Google trusts heavily, your chances for raising the indexation cap likely go up.
  4. Rate of Growth in Pages vs. Backlinks
    If your site’s content is growing dramatically, but you’re not earning many new links, this can be a signal to the engine that your content isn’t "worthy" of ongoing attention and inclusion.
  5. Depth & Frequency of Linking to Pages on the Domain
    If your home page and a few pieces of link-targeted content are earning external links while the rest of the site flounders in link poverty, that may be a signal to Google that although users like your site, they’re not particularly keen on the deep content – which is why the index may toss it out.
  6. Content Uniqueness
    Uniqueness is a constantly moving target and hard to nail down, but basically, if you don’t have a solid chunk of words and images that are uniquely found on one URL (ignoring scrapers and spam publishers), you’re at risk. Google likely runs a number of sophisticated calculations to help determine uniqueness, and they’re also, in my experience, much tougher on pages and sites that don’t earn high quantities of external links to their deep content with this analysis.
  7. Visitor, CTR and Usage Data Metrics
    If Google sees that clicks to your site frequently result in a click of a back button, a return to the SERPs and the selection of another result (or another query) in a very short time frame, that can be a negative signal. Likewise, metrics they gather from the Google toolbar, from ISP data and other web surfing analyses could enter into this mix. While CTR and usage metrics are noisy signals (one spammer with a Mechanical Turk account can swing the usage graph pretty significantly), they may be useful to decide which sites need higher levels of scrutiny.
  8. Search Quality Rater Analysis + Manual Spam Reports
    If your content is consistently reported as being low value or spam by users and/or quality raters, expect a visit from the low indexation cap fairy. This may even be done on a folder-by-folder basis if certain portions of your site are particularly egregious while other material is index-worthy (and that phenomenon probably holds true for all of the criteria above as well).

Now let’s talk about some leading indicators that can help to show if you’re at risk:

  • Deep pages rarely receive external links – if you’re producing hundreds or thousands of pages of new content and fewer than "dozens" earn any external link at all, you’re in a sticky situation. Sites like Wikipedia, the NYTimes, About.com, Facebook, Twitter and Yahoo! have millions of pages, but they also have dozens to hundreds of millions of links, and relatively few pages that have no external links. Compare that against your 10 million page site with 400K pages in the index (which is more pages than what Google reports indexing on Adobe.com, one of the best linked-to domains on the web).
  • Deep pages don’t appear in Google Alerts – if Google Alerts is consistently passing you by (not reporting your new pages), this can be (but isn’t universally) an indication that they’re not perceiving your pages as being unique or worthy enough of the main index in the long run.
  • Rate of crawling is slow – if you’re updating content, links and launching new pages multiple times per day, and Google’s coming by every week, you’re likely in trouble. XML Sitemaps might help, but it’s likely you’re going to need to improve some of those factors described above to get in good graces for the long term.
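
To make the first and third indicators concrete, here is a minimal sketch of how you might measure them on your own site. Everything in it is hypothetical: the `Page` record, the sample URLs and the dates are made up for illustration, and you would fill them from your own link data and server logs. It says nothing about how Google computes anything.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass
class Page:
    url: str
    external_links: int   # links from other domains (from your own link data)
    last_crawled: date    # last Googlebot visit (e.g. from your server logs)


def linked_share(pages):
    """Fraction of pages that have earned at least one external link."""
    if not pages:
        return 0.0
    return sum(1 for p in pages if p.external_links > 0) / len(pages)


def median_days_since_crawl(pages, today):
    """Median number of days since each page was last crawled."""
    return median((today - p.last_crawled).days for p in pages)


# Hypothetical sample of deep pages.
sample = [
    Page("/widgets/1", 3, date(2009, 11, 9)),
    Page("/widgets/2", 0, date(2009, 11, 3)),
    Page("/widgets/3", 0, date(2009, 10, 31)),
    Page("/widgets/4", 0, date(2009, 10, 27)),
]
today = date(2009, 11, 10)

share = linked_share(sample)
crawl_age = median_days_since_crawl(sample, today)
print(f"{share:.0%} of deep pages have an external link")
print(f"median days since last crawl: {crawl_age}")
```

Tracking these two numbers over time, per folder if your site is organized that way, gives you an early warning rather than a verdict; where the danger line sits is a judgment call.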

There’s no doubt that indexation can be a vexing problem, and one that’s tremendously challenging to conquer. When the answer to "how do we get those pages back?" is "make the content better, more unique, stickier and get a good number of diverse domains to link regularly to each of those millions of URLs," there’s going to be resistance and a search for easier answers. But, like most things in life, what’s worth having is hard to get.

As always, I’m looking forward to your thoughts (and your shared experiences) on this tough issue. I’m also hopeful that, at some point in the future, we’ll be able to run some correlations on sites that aren’t fully indexed to show how metrics like link counts or domain importance may relate to indexation numbers.

Link Building Has Changed

Posted by randfish

When I first started in SEO, link acquisition was almost always a manual process. I’d search the engines for links that pointed to the competition, find relevant directories and link lists, email relevant sites and beg, borrow or bribe (aka buy advertising) to get a link. I tried reciprocal link building (and did some pretty dumb stuff). Then, as I got more intertwined in the SEO community, I found vendors who built large networks of sites, spammed blogs/forums/guestbooks and ran text link sales operations. I leveraged these services to help clients rank better, almost always with great success. Then I met Matt Cutts, found out more about Google’s webspam team, saw penalties and their impact (remember Florida?) and even found some sites we worked on in the Sandbox.

Over time, I got smarter. I read papers about Hilltop, Trustrank, Anti-Trustrank and many more. I saw sites escaping the sandbox once they’d earned greater quantities of trusted links. I started understanding that Google’s search quality team was only going to get better at recognizing and counting legitimate links (and tossing out the junk), so I focused exclusively on more "white hat" kinds of links. That’s when I discovered linkbaiting and the power of Digg, Reddit & StumbleUpon to drive traffic that would naturally link. We had success with quizzes (and after Matt left SEOmoz, he had a little too much success) and with viral content that earned thousands of links overnight, and we started offering it as a service.

As our clientele and foci changed, we changed again. Linkbait gave way to broader viral marketing efforts. Social media marketing arose as a practical and high quality way to earn links. Our clients became larger brands and organizations and one-off link projects weren’t scalable, so we consulted on tactics like content and technology licensing, training editorial staff to earn links & participate in the social media world themselves, and incentivizing user-generated content, which in turn brought links from those users. We found ways to drive natural links to deep pages on huge sites targeting the long tail, and ways to combine embeddable content and user-adopted brand affinity to drive link growth. And we stopped buying links entirely.

I figured a visual history might make for a compelling view:

A History of Link Building Tactics

Now, link building is changing again. I’m of the distinct impression that the engines (nowadays referring to Bing & Google, since the others are all but out of the picture) are evolving to keep up with the web’s breakneck speed, and that new forms of data, along with new ways of analyzing links, are making themselves felt in the SERPs. My guesses/observations would include:

  • Twitter really is cannibalizing the web’s link graph, or at least the blogosphere’s, and Google seems to be using Tweet counts in some way (though possibly only in the QDF algo).
  • The acceleration rate of link acquisition and the freshness of new links is having a more dramatic impact than before, and the "old crusty links" paradigm may be fading a bit.
  • Brand mentions and keyword associations with brand names are influencing the rankings more and more.
  • Untrustworthy link patterns are triggering more filters and penalties than ever before.
  • QDD is as strong as ever, and vertical results are more prominent than at any time in the engines’ histories.
  • Google and Microsoft both know more about traffic and surfing habits than ever before, and this data is likely being used, at the least, as quality control for potential algorithmic misses.
  • Ad blindness is worse than ever (16% of Internet users are responsible for 85% of all ad clicks on the web), forcing the engines to make ads more relevant and more obvious to continue earning revenue.
  • Paid inclusion is going away, and talk of potentially paying sites to be in the indices (the reverse model) is in the air (or maybe not).
  • Billions of non-linked "references" flow out across the web through social media messages, emails, tweets and IMs. Someone, at some search engine, is undoubtedly mining this data to see how they can derive value and relevancy from it.

As marketers, we have to evolve or be left behind by those who can better adapt. It’s hard to see the forest for the trees right now, but I think we’re closing in on a time when real-time, social and traditional web references are all a part of the rankings equation. The future may be less about links and more about brand building and brand participation. I don’t want to be the most-linked-to site in my niche; I want to be the site that’s synonymous with my niche.

Now we just have to figure out the tactics…
