Indexation for SEO: Real Numbers in 5 Easy Steps

Posted by randfish

How many pages has Google indexed?

This question and the problems surrounding it run rampant through the SEO world. It usually arises when someone starts doing searches like this:

Indexation of SEOmoz According to Google

Google claims to have 93,800 pages indexed on the root domain, seomoz.org. That sounds pretty good, but when I ran that search query last week, the number was closer to 75,000 and when I run it again from Google.co.uk 60 seconds later, the number changes even more dramatically:

Indexation of SEOmoz.org on Google.co.uk

How about if I hit refresh on my Google.com results again:

Indexation on Google.com 3 minutes later

Doh! Google just dropped 8,500 of my pages out of their index. That sucks – but not nearly as much as managers, marketing directors and CEOs who use these numbers as actual KPIs! Can you imagine? A number that means nothing, fluctuates 300% between data centers, can change at a moment’s notice and provides no actionable insight being used as a business metric?

And yet… It happens.

Fortunately, there’s an easy way to get much, much better data than what the search engines provide through "site:" queries and this post is here to walk you through that process step-by-step.

Step 1: Go to Traffic Sources in Your Analytics

Google Analytics Step 1

Click the "traffic sources" link in Google Analytics or Omniture (it can also be called "referring sources" in other analytics packages).

Step 2: Head to the Search Engines Section

Step 2 of the Indexation Process

We want to find out how many pages the search engines have indexed, so the obvious next step is to go to the "search engines" sub-section.

Step 3: Choose an Engine

Step 3: Choose an Engine 

Choose the engine you want indexation data on and click. If you have both paid and organic traffic from this engine, you’ll want to display organic only at this step, too.

Step 4: Filter by Landing Pages

Step 4: Filter by Landing Page

The "Landing Page" filter in the dropdown will show you the traffic each individual page on your site received from the engine you’ve selected. This also produces the magical "total" number of pages that have received traffic, described in the last step.

Step 5: Record the Number at the Bottom

Step 5: Indexation Count Arrives

That count tells you the unique number of pages that received at least one visit from searches performed on Google. It’s the Holy Grail of indexation – a number you can accurately track over time to see how the search engine is indexing your site. On its own, it isn’t particularly useful, but over time (I usually recommend recording monthly, but for some sites, every 2-3 months can make more sense), it gives you insight into whether your pages are doing better or worse at drawing in traffic from the engine.
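If you'd rather compute that count from an export than read it off the bottom of the report, a rough sketch might look like the following. It assumes a CSV export with "Landing Page" and "Visits" columns; those column names and the sample rows are illustrative assumptions, not the exact format any particular analytics package produces.

```python
import csv
import io

def count_pages_with_visits(csv_text, page_col="Landing Page", visits_col="Visits"):
    """Count distinct landing-page URLs that received at least one visit.

    The column names are assumptions; adjust them to match your export.
    """
    pages = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Exports sometimes format numbers with thousands separators.
        visits = int(row[visits_col].replace(",", ""))
        if visits >= 1:
            pages.add(row[page_col].strip())
    return len(pages)

# Hypothetical export rows, for illustration only.
sample_export = """Landing Page,Visits
/blog/post-a,120
/blog/post-b,1
/tools,0
/blog/post-a,3
"""

print(count_pages_with_visits(sample_export))
```

Using a set means a URL that appears on more than one row is still counted once, which matches the "unique number of pages" idea above.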

Now, technically I’m being a bit cheeky here. This number doesn’t tell you the full story – it’s not showing the actual number of pages a search engine has crawled or indexed on your site, but it does tell you the unique number of URLs that received at least 1 visit from the engine. In my opinion this data is far more accurate and more actionable. The first adjective – accurate – is hard to argue (particularly given the visual evidence atop this post), but the second requires a bit of an explanation.

Why is Number of Pages Receiving ≥1 Visit Actionable?

Indexation numbers alone are useless. Businesses and websites use them as KPIs because they want to know if, over time, more of their pages are making their way into the engines’ indices. I’d argue that actually, you don’t care if your pages are in the indices – you care if your pages have the opportunity to EARN TRAFFIC!

Being a row in a search index means nothing if your page is:

  • too low in PageRank/link juice to appear in any results
  • displaying content the engines can’t properly parse
  • devoid of keywords or content that could send traffic
  • broken, misdirected or unavailable
  • a duplicate of other pages that the engine will rank instead

Thus, the metric you want to count over time isn’t (in most cases) number of pages indexed, it’s number of pages that earned traffic. Over time, that’s the number you want to rise, the number you want marketers to concentrate on and the KPI that’s meaningful. It tells you whether the engine is crawling, indexing AND listing your pages in the results where someone might (has) actually click(ed) them.

If the number drops, you can investigate the actual pages that are no longer receiving traffic by exporting the data to Excel and doing a side-by-side with the previous month. If the number rises, you can see the new pages getting traffic. Those individual URLs will tell a story – of pages that broke, that stopped being linked-to, that fell too far down in paginated results or lost their unique content. It’s so much better than playing the mystery game that SEOs so often confront in the face of "lower indexation numbers" from the site: command.
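The side-by-side comparison can also be done outside Excel. Here's a minimal sketch, assuming you've already pulled the list of landing-page URLs with at least one visit for each period; the function name, month labels and URLs are made up for illustration.

```python
def compare_periods(previous_pages, current_pages):
    """Split URLs into those that lost, gained, or kept search traffic."""
    previous, current = set(previous_pages), set(current_pages)
    return {
        "lost": sorted(previous - current),    # had visits before, none now
        "gained": sorted(current - previous),  # newly receiving visits
        "kept": sorted(previous & current),    # received visits in both periods
    }

# Hypothetical landing pages that earned at least one visit in each month.
march = ["/blog/post-a", "/blog/post-b", "/tools"]
april = ["/blog/post-a", "/tools", "/blog/post-c"]

report = compare_periods(march, april)
print("Lost:", report["lost"])
print("Gained:", report["gained"])
```

The "lost" list is the one to investigate first: those are the individual URLs that can tell you what broke or stopped being linked to.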

Some Necessary Caveats

This methodology certainly isn’t perfect, and there are some important points to be aware of (thanks especially to some folks in the comments who brought these up):

  • Google Analytics (and many other analytics packages) use sampled data at times to make guesstimates. If you want to be sure you’re getting the absolute best number, export to CSV and do the side-by-side in Excel. You can even expunge the results common to two time periods to see only those pages that uniquely did/didn’t receive traffic. In many of these cases, you might also only care about pages that gained/lost 5/10/20+ visits.
  • Greater accuracy can be found from shrinking the time period in the analytics, but it also reduces the likelihood that a page receiving very long tail query traffic once in a blue moon will be properly listed, so adjust accordingly, and plan for imperfect data. This method isn’t foolproof, but it is (in my opinion) better than the random roulette wheel of site: queries.
  • This technique isn’t going to help you catch other kinds of SEO issues like duplicate content (it can in some cases, but it’s not as good as something like Google Webmaster Tools reporting) or 301s, 302s, etc., which can require a crawling solution.
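For the first caveat, where you only care about pages that gained or lost some minimum number of visits, a small sketch along these lines could work on two exported periods. The dictionaries of URL to visit count, the function name and the threshold of 5 are all illustrative assumptions.

```python
def significant_changes(previous, current, min_change=5):
    """Return {url: change} for pages whose visits moved by at least min_change.

    `previous` and `current` map landing-page URL -> visits for each period.
    A page missing from one period is treated as having 0 visits there.
    """
    changes = {}
    for url in set(previous) | set(current):
        delta = current.get(url, 0) - previous.get(url, 0)
        if abs(delta) >= min_change:
            changes[url] = delta
    return changes

# Hypothetical visit counts for two periods.
prev = {"/a": 40, "/b": 6, "/c": 2}
curr = {"/a": 41, "/b": 0, "/d": 25}

result = significant_changes(prev, curr, min_change=5)
for url, delta in sorted(result.items()):
    print(url, delta)
```

Raising `min_change` to 10 or 20 filters out more of the once-in-a-blue-moon long tail noise, at the cost of hiding small but real losses.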

I’d, of course, love your feedback. I know many SEOs are addicted to and supportive of the site: command numbers as a way to measure progress, so maybe there are things I’m not considering or situations where it makes sense. I also know that many of you like the number reported in Google Webmaster Tools under the Sitemaps crawl data (I’m skeptical of this too, for the record) and I’d like to hear how you find value with that data as well.

p.s. Tomorrow we’ll be announcing two webinars (open to all) about using Open Site Explorer to get ACTIONABLE data. Be sure to leave either Wednesday the 27th at 2pm Pacific or Thursday the 28th at 10am Pacific free :-)

Diagrams for Solving Crawl Priority & Indexation Issues

Posted by randfish

Last night I stayed up way too late authoring a post on Google’s Indexation Cap. Today, despite getting up way too early, I wanted to follow up and answer some of the questions from the comments, Twitter and my email. I think SEOs who read the post rightly asked for more direction in solving this problem – a fair request. Below, I’ve done my best to tackle these problems visually, as I believe we all think about site architecture and crawling issues in a visual structure.

First off, here’s a sample site hierarchy to set down the concept and give the colors I’m using in the following diagrams more context:

A Sample Site Architecture

Next, I’ve illustrated, in a more representative fashion, how those hierarchies might look on a website, and noted the external link potential of each:

Typical Site's Link Earning Potential by Content Section

In this next piece, I’m trying to explain a very important concept and something that’s frequently misunderstood by SEOs. Once upon a time, search spiders would crawl the web largely recursively – hitting a homepage that had been submitted to its index (remember way back when search engines had submission?!), then crawling in an outward fashion based on the links they discovered there. That hasn’t been the case for a long time, and as we all see with crawl paths (if you’re looking at the requests Google/Yahoo!/Bing make to your domain), multiple entry points are nearly universal and crawling pushes "outward" from those priority URLs. It looks a bit like Minesweeper, right? :-)
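To make the "multiple entry points" idea concrete, here's a toy sketch that measures how many link hops each page sits from the nearest entry point, first with only the homepage and then with a second, well-linked entry URL. This is purely an illustration using a breadth-first traversal over a made-up link graph; it is not a description of how any search engine actually schedules its crawling.

```python
from collections import deque

def crawl_distances(links, entry_points):
    """Breadth-first traversal starting from several entry URLs at once.

    Returns {url: hops from the nearest entry point} for every reachable URL.
    `links` maps each URL to the URLs it links to.
    """
    distance = {url: 0 for url in entry_points}
    queue = deque(entry_points)
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in distance:
                distance[target] = distance[url] + 1
                queue.append(target)
    return distance

# Hypothetical site: the blog is not linked from the homepage at all.
site = {
    "/": ["/category", "/about"],
    "/category": ["/category/item-1", "/category/item-2"],
    "/blog/popular-post": ["/blog", "/category/item-2"],
    "/blog": ["/blog/old-post"],
}

single = crawl_distances(site, ["/"])
multi = crawl_distances(site, ["/", "/blog/popular-post"])

print("/blog/old-post" in single)   # reachable from the homepage alone?
print(multi["/blog/old-post"])      # hops from the nearest entry point
```

In this toy graph, pages near a second entry point end up both reachable and "closer", which is the intuition behind the Minesweeper-style picture.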

Spider Crawl Priority Paths Graphic

Finally, I’ve got a graphic to help understand how to positively approach these problems and solve them.

Methods to Improve Crawling, Indexing & Ranking

There are certainly more recommendations that can be provided around these issues, and I look forward to a discussion of them in the comments.

p.s. I covered site architecture and navigation in a good bit of detail at the PRO Training this summer, but I like this image format so much, I think I might re-craft something new for next year. It feels like structuring sites properly is still a big pain point for SEOs (but possibly that’s less to do with lack of knowledge and more to do with lack of influence during the design phase?)

90-Minute PRO Webinar: Link Building Strategies on Thursday Dec. 10th

Posted by randfish

Thanks so much for all your votes and feedback on our PRO Webinar Series over the holiday weekend. We received 285 responses and we’re taking your suggestions very seriously and conducting the webinar as you’ve requested :-)

Here are the stats from the questionnaire/form (you can still fill it out if you’d like to give more input):

Will you be able to attend the PRO Webinar on Dec. 10th at 11am Pacific (2pm Eastern, 7pm London)?

  • 74% – Yes, I’m planning to attend!
  • 20% – I’m unsure if I can make it (but would like to)
  • 6% – No, I’m busy at that time (but would like to join in others in the future)
  • 0% – I’m not attending (because I’m not a fan of webinars or uninterested in the subject matter)

What topics most interest you for the webinar (check all that apply)?

  • 79% – Link Building & Link Acquisition
  • 51% – SEO Metrics, Analytics and Key Performance Indicators
  • 44% – Social Media Marketing for SEO
  • 41% – Keyword Research Tools & Processes
  • 40% – Navigation & Site Architecture for SEO
  • 35% – Content Creation & Optimization
  • 24% – Avoiding Spam, Penalties & Filters
  • 20% – Incenting UGC & User Participation for SEO

What webinar format would you prefer?

  • 47% – 45 min. presentation, 45 min. Q+A (90 min. total)
  • 37% – 30 min. presentation, 30 min. Q+A (60 min. total)
  • 12% – 30 min. presentation, 60 min. Q+A (90 min. total)
  • 4% – All Q+A (60 min. total)

Based on this, we’re going to be running a 90 minute webinar, with a 45 minute slide deck presentation (and possibly video as well, though it will likely just be of me on the webcam) from 11am – 12:30pm Pacific (2pm – 3:30pm Eastern, 7pm-8:30pm London) on Thursday December 10th. The webinar will cover the following rough outline (obviously, in more detail):

  • Link Building Strategies for 2010
  • What Goals Can Link Building Help Us Achieve?
    • Bolster Individual Rankings
    • Improve a Domain’s Ability to Rank Pages
    • Achieve Full(er) Indexation
    • Drive Direct Traffic & Branding
  • The 8 Basic Link Building Food Groups (with examples)
    • Manual Link Submissions/Requests
    • Competitive Link Research + Acquisition
    • Links via Embedded Content
    • Content Based, Linkbait & Viral Link Attraction
    • Content, Technology & API Licensing
    • Link Exchanges & Trades-in-Kind
    • Paid Links
    • Link Reclamation
  • What are the Right Kinds of Links to Accomplish my Goals?
    • Links for Individual Rankings
    • Links for Domain "Authority"
    • Links for Indexation
    • Links for Traffic & Branding
  • How to Use Tools & Processes to Make Link Building Easier
    • Tools for Competitive Link Research
    • Metrics for Evaluating a Link’s Value
    • Building a Link Acquisition Process (i.e. the "Link Conversion Funnel")
    • Making Processes Scalable
  • Link Building Shortcuts to Take (and Avoid)
    • How to Get Your Community Link Building for You
    • How to Get the Anchor Text and Target You Want
    • How to Avoid Links that You Think Are Helping Your Competition (but really aren’t)
    • How to Spot Strategies that the Engines May Devalue
  • Wrap-Up / Q+A

I’m certainly open to feedback about what you’d like to see in there, and happy to make some inclusions where possible. All PRO members will receive an invite via email in the next 2-3 days with a link to register. You’ll be able to dial in or hear the webinar via your computer speakers/headphones and ask questions via a chat interface. You can see an example of a past presentation I’ve made below:

This lengthy one came from my HostingCon keynote and serves as a fun introduction to SEO (BTW – let me strongly recommend against creating slide decks using photos of a whiteboard; it’s fun and the audience likes it, but it took about 12 solid hours of surprisingly intensive whiteboard drawing and erasing, never mind the editing, cropping and pasting):

I’m very much looking forward to spending the morning with our PRO members next week! If you’re not yet PRO, Scott’s got some pretty sweet offers still available including the SES Chicago ticket + 1 year of PRO for $799 (and you can trade in the Chicago pass for any SES event in 2010) and the Advanced Training DVD for PRO members at $199.

Note that the other topics that received lots of votes – SEO Metrics & KPIs, Social Media Marketing, etc. will likely be the topics for webinars in January, February and March.

Google’s Indexation Cap

Posted by randfish

Over the past 2 years, SEOmoz has worked with quite a number of websites whose primary goal (or primary problem) in SEO has been indexation – getting more of their pages included in Google’s index so they have the opportunity to rank well. These are, obviously, long tail focused sites that earn the vast majority of their visits from queries that bring in 5 or fewer searches each day. In this post, I’m going to tackle the question of how Google determines the quantity of pages to index on a site and how sites can go about improving this metric.

First, a quick introduction to a truth that I’m not sure Google’s shared very publicly (though they may have discussed it on panels or formally on the web somewhere I haven’t seen) – that is – the concept that there’s an "indexation cap" on the number of URLs from a website that Google will maintain in their main index. I was skeptical about this until I heard it firsthand from a Googler being described to a webmaster. Even then, I didn’t feel like the principle was "confirmed," but after talking to a lot of SEOs working at very large companies, some of whom have more direct interactions with the search quality team, this is, apparently, a common point of discussion and something Google’s been more open about recently.

The "indexation cap" makes sense, particularly as the web is growing exponentially in size every few years, often due to the production of spam and more legitimate, but no less index-worthy content on sites of all sizes and shapes. I believe that many site owners started noticing that the more pages they produced, even with very little "unique" content, the more traffic Google would send and thus, abuse was born. As an example, try searching using Google’s "last 24 hours" function:

SEOmoz blog post search on Google in the past 24 hours
Seriously, go have a look; the quantity of "junk" you wouldn’t want in your search engine’s index is remarkable

Since Tom published the post on Xenu’s Link Sleuth last night, Google’s already discovered more than 250 pages around the web that include that content or mentions of it. If, according to Technorati, the blogosphere is still producing 1.5 million+ posts each week, that’s conservatively growing the web by ~20 billion pages each year. It should come as no surprise that Google, along with every other search engine, has absolutely no desire to keep more than, possibly, 10-20% of this type of content (and anyone who’s tried re-publishing in this fashion for SEO has likely felt that effect). Claiming to have the biggest index size may actually be a strike against relevancy in this world (according to Danny Sullivan, it’s been a dead metric for a long time).
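The jump from 1.5 million posts a week to ~20 billion pages a year isn't spelled out above. One way to arrive at a figure of that size is to assume each post spawns roughly 250 pages around the web, as in the Xenu example. That per-post multiplier is an assumption made here for illustration, not a number from Technorati.

```python
# Figure quoted above (attributed to Technorati).
posts_per_week = 1_500_000
weeks_per_year = 52

# Assumption: every post is copied or mentioned on about 250 pages,
# borrowing the count from the Xenu's Link Sleuth example.
copies_per_post = 250

posts_per_year = posts_per_week * weeks_per_year
pages_per_year = posts_per_year * copies_per_post

print(f"{posts_per_year:,} posts per year")
print(f"{pages_per_year:,} pages per year")
```

Under that assumption the total works out to a little under 20 billion pages a year, in line with the rough figure quoted above.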

So – long story short – Google (very likely) has a limit it places on the number of URLs it will keep in its main index and potentially return in the search results for domains.

The interesting part is that, in the past 3 months, the number of big websites (I’ll use that to refer to sites with an excess of 1 million unique pages) we’ve talked to, helped through Q+A or consulted with that have lost wide swaths of indexation has skyrocketed, and we’re not alone. The pattern is usually the same:

  • One morning, you wake up, and 40% of your search traffic is gone with no signal as to what’s happened
  • Cue panicking executives, investors and employees (oh, and usually the poor SEO team, too)
  • Enter statistics data, showing that rankings for big terms aren’t down (or, maybe down a little), but that the long tail has gotten a lot shorter
  • Re-consideration request goes to Google
  • Somewhere between 10 and 40 days later, a message arrives saying:

We’ve processed your reconsideration request for http://xyz.com.

We received a request from a site owner to reconsider how we index the following site: http://xyz.com

We’ve now reviewed your site. When we review a site, we check to see if it’s in violation of our Webmaster Guidelines. If we don’t find any problems, we’ll reconsider our indexing of your site. If your site still doesn’t appear in our search results, check our Help Center for steps you can take.

  • This email, soon to be recognized by the Academy of Nonsense for its pre-eminent place among the least helpful collection of words ever assembled, spurs bouts of cursing and sometimes, tragically, termination of SEO or marketing managers. Hence, we at SEOmoz take it pretty personally (as this group includes many close friends & colleagues).
  • Calls go out to the Google AdWords reps, typically consisting of a conversation that goes something like:
    Exec: "We spent $10 million @#$%ing dollars with you last month and you can’t help?"
    AdWords Rep: "I’m sorry. We wish we could help. We just don’t have any influence on that side of the business. We don’t know anyone there or talk to anyone there."
    Exec: "Get me your boss on the phone. Now."
    Repeat ad nauseam until you reach a level of management commensurate with the spend of the exec’s company (or their connections)
    Exec: "Can you get me some answers?"
    AdWords Boss: "They won’t tell me much, but apparently they’re not keeping as many pages in the index from your site as they were before."
    Exec: "Yeah, we kind of figured that part out. Are they going to put us back in?"
    AdWords Boss: "My understanding is no."
    Exec: "So what am I supposed to do? We’re not going to have money to buy those $10 million in ads next month, you know."
    AdWords Boss: "You might try talking to someone who does SEO."
  • At this point, consultants receive desperate email or phone messages

To help site owners facing these problems, let’s examine some of the potential metrics Google looks at to determine indexation (note that these are my opinions, and I don’t have statistical or quantitative data to back them up at this time):

  1. Importance on the Web’s Link Graph
    We’ve talked previously about metrics like a domain-level calculation of PageRank (Domain mozRank is an example of this). It’s likely that Google would make this a backbone of the indexation cap estimate, as sites that tend to be more important and well-linked-to by other important sites tend to also have content worthy of being in the index.
  2. Backlink Profile of the Domain
    The profile of a site’s links can look at metrics like where those links come from, the diversity of the different domains sending links (more is better) and why those links might exist (methods that violate guidelines are often getting caught and filtered so as not to provide value).
  3. Trustworthiness of the Domain
    Calculations like TrustRank (or Domain mozTrust in Linkscape) may make their way into the determination. You may not have as many links, but if they come from sites and pages that Google trusts heavily, your chances for raising the indexation cap likely go up.
  4. Rate of Growth in Pages vs. Backlinks
    If your site’s content is growing dramatically, but you’re not earning many new links, this can be a signal to the engine that your content isn’t "worthy" of ongoing attention and inclusion.
  5. Depth & Frequency of Linking to Pages on the Domain
    If your home page and a few pieces of link-targeted content are earning external links while the rest of the site flounders in link poverty, that may be a signal to Google that although users like your site, they’re not particularly keen on the deep content – which is why the index may toss it out.
  6. Content Uniqueness
    Uniqueness is a constantly moving target and hard to nail down, but basically, if you don’t have a solid chunk of words and images that are uniquely found on one URL (ignoring scrapers and spam publishers), you’re at risk. Google likely runs a number of sophisticated calculations to help determine uniqueness, and they’re also, in my experience, much tougher on pages and sites that don’t earn high quantities of external links to their deep content with this analysis.
  7. Visitor, CTR and Usage Data Metrics
    If Google sees that clicks to your site frequently result in a click of a back button, a return to the SERPs and the selection of another result (or another query) in a very short time frame, that can be a negative signal. Likewise, metrics they gather from the Google toolbar, from ISP data and other web surfing analyses could enter into this mix. While CTR and usage metrics are noisy signals (one spammer with a Mechanical Turk account can swing the usage graph pretty significantly), they may be useful to decide which sites need higher levels of scrutiny.
  8. Search Quality Rater Analysis + Manual Spam Reports
    If your content is consistently reported as being low value or spam by users and/or quality raters, expect a visit from the low indexation cap fairy. This may even be done on a folder-by-folder basis if certain portions of your site are particularly egregious while other material is index-worthy (and that phenomenon probably holds true for all of the criteria above as well).

Now let’s talk about some leading indicators that can help to show if you’re at risk:

  • Deep pages rarely receive external links – if you’re producing hundreds or thousands of pages of new content and fewer than "dozens" earn any external link at all, you’re in a sticky situation. Sites like Wikipedia, the NYTimes, About.com, Facebook, Twitter and Yahoo! have millions of pages, but they also have dozens to hundreds of millions of links, and relatively few pages that have no external links. Compare that against your 10 million page site with 400K pages in the index (which is more pages than what Google reports indexing on Adobe.com, one of the best linked-to domains on the web).
  • Deep pages don’t appear in Google Alerts – if Google Alerts is consistently passing you by (not reporting your new pages), this can be (but isn’t universally) an indication that they’re not perceiving your pages as being unique or worthy enough of the main index in the long run.
  • Rate of crawling is slow – if you’re updating content, links and launching new pages multiple times per day, and Google’s coming by every week, you’re likely in trouble. XML Sitemaps might help, but it’s likely you’re going to need to improve some of those factors described above to get in good graces for the long term.
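
The first and third indicators above can be tracked with a few lines of code. The sketch below is illustrative only: it assumes you can export, from your own link-analysis and server-log tools, a mapping of deep URLs to external link counts and a list of dates on which Googlebot visited. The function names and sample data are hypothetical, and none of the numbers are thresholds published by Google.

```python
from datetime import date


def deep_link_coverage(external_links_by_url: dict[str, int]) -> float:
    """Share of deep pages that have earned at least one external link."""
    if not external_links_by_url:
        return 0.0
    linked = sum(1 for count in external_links_by_url.values() if count > 0)
    return linked / len(external_links_by_url)


def average_crawl_gap_days(visits: list[date]) -> float:
    """Average number of days between consecutive crawler visits."""
    ordered = sorted(visits)
    if len(ordered) < 2:
        raise ValueError("need at least two visits")
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical sample data: four deep pages and three weekly crawler visits.
links = {"/p/1": 0, "/p/2": 3, "/p/3": 0, "/p/4": 1}
visits = [date(2010, 1, 4), date(2010, 1, 11), date(2010, 1, 18)]

print(deep_link_coverage(links))       # 0.5
print(average_crawl_gap_days(visits))  # 7.0
```

A low coverage share alongside a long crawl gap on a site that publishes several times a day would match the at-risk pattern described above.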

There’s no doubt that indexation can be a vexing problem, and one that’s tremendously challenging to conquer. When the answer to "how do we get those pages back?" is "make the content better, more unique, stickier and get a good number of diverse domains to link regularly to each of those millions of URLs," there’s going to be resistance and a search for easier answers. But, like most things in life, what’s worth having is hard to get.

As always, I’m looking forward to your thoughts (and your shared experiences) on this tough issue. I’m also hopeful that, at some point in the future, we’ll be able to run some correlations on sites that aren’t fully indexed to show how metrics like link counts or domain importance may relate to indexation numbers.

Relationship Between Link Growth And Indexation

With every passing day, the number of websites, and hence the number of web pages, is growing at an explosive rate on the internet. This can cause a major headache for the search engines as they gear up to meet the challenge of crawling and subsequently indexing the new sites popping up everywhere in the cybersphere.

Today, when a new web site is launched, it will take a while before its pages get crawled and indexed in Google. With the increasing strain on hardware and resources due to the rapid growth of new sites, Google has become very strict in its policy of admitting sites and retaining web pages of sites in its index. It is a case of survival of the fittest in cyberspace.

Some of the basic facts to be borne in mind when looking at the issue in its entirety are:

  • The total PageRank available is proportional to the total number of pages in Google’s index
  • The PageRank a page gains depends on the links pointing to it and on how many outbound links each linking page carries
  • To increase its total PageRank, a site can build more pages and increase its virtual real estate
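
To make the link-based reasoning concrete, here is a minimal sketch of the classic, publicly described PageRank iteration, in the normalised form where all scores sum to 1. It is not Google's production algorithm. The damping factor of 0.85, the function name and the four-page graph are illustrative assumptions.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Power iteration over a small link graph; scores sum to 1.

    `links` maps every page to the list of pages it links to. Every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its score evenly across its outbound links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its score evenly over every page.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page site: every page links to "home",
# and nothing links to "orphan".
graph = {
    "home": ["products", "blog"],
    "products": ["home"],
    "blog": ["home", "products"],
    "orphan": ["home"],
}
scores = pagerank(graph)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the page with no inbound links keeps only the baseline score, while the page that every other page links to ends up with the highest one.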

When a new site is launched, the number of backlinks to that site is negligible unless the business is well known, has a credible following offline and is launching its brand online. The average site owner has to set about building an external link profile by submitting to directories, guest blogging on well-established, industry-relevant blogs, providing a platform for user-generated content on her site, promoting site badges and so on.

All this takes time and effort, and the resulting growth is slow, steady and natural. There are several link building software programs that promise instant deliverance by helping you build a multitude of links in no time. The problem with this approach is that an average human cannot acquire 100 links in a day (read 8 hours of work time). Google also knows this, so it is an easy recipe for raising a red flag.

Coming to the crux of the issue, creating and growing the number of pages on your site is relatively easy as you, the site owner, have full control over it. If you are passionate about your industry with good working knowledge, you can build lots of content over a short span of time. But this alone will not make the cut in today’s circumstances for making it into the Google index and being retained and ranked over time.

The most powerful links that can be obtained today are editorial links. When another site owner regards your site content as high quality and decides to link to it from her blog or site, it is clearly a double thumbs up for your content, and Google will also consider it seriously. A great linkbait program can help your site gain lots of natural inbound links from the linkerati.

If you have votes from other sites in the form of backlinks to the various pages of your site, this is crucial to Google retaining those pages in its index. Then again, you cannot always produce top-quality content across all pages of your site, as the subject being discussed can be limited in scope or not very popular in the eyes of users.

I have been noticing of late that even on powerful domains, product pages with wafer-thin content and footer-heavy links do not pass muster to be admitted to or retained in the index. It is becoming increasingly clear that each individual page must attain a certain PageRank threshold to be retained in the index. The lesson is that indexation cannot be taken for granted, and well-established sites cannot afford to rest on their laurels any more.

To achieve a minimum PageRank threshold, internal linking can help to an extent, and you as the site owner can do your bit to this end. But it is vital to get links from external, unbiased sources to derive some link juice that can boost the PageRank of the page in question.

If your website’s “natural” external link building keeps a steady momentum from the inception of the site to its current state, you can expect Google to maintain a decent indexation level for your site and to update its index regularly with the fresh content and growing number of pages your site offers.

Rand, in his Whiteboard presentation on Link Growth Patterns, explains the relationship between link growth patterns and indexation levels.

Eric Enge, in his post on The Disproportionate Value of Deep Links, talks about improving the PageRank flow to areas of the site that hitherto received no link juice.

Ravi Venkatesan is a senior SEO consultant at Netconcepts, an Auckland SEO firm offering both search engine optimisation and PPC services to its customers in New Zealand and Australia.

Related posts:

  1. Exciting News — Netconcepts Acquired by Covario
  2. Increasing The Scope Of Existing PPC Campaigns Effectively
  3. LinkedIn, But NoFollow Link Love
  4. Relationship Between Link Growth And Indexation
  5. Inbound Deep Links Benefit Page Rank Distribution Sitewide
  6. New Tool to Annualize Google Keyword Data
  7. How To Breathe Life Into A Lacklustre PPC Campaign
  8. Good Practices SEO With A Tinge Of Creativity
  9. SEO Tools: Using Xenu and Excel – Blindfolded SEO Audit Part 2
  10. Blindfolded SEO Audit Part 1

New Tool to Annualize Google Keyword Data

Do you use Google’s AdWords Keyword Tool for your keyword research? If not, you might be missing out. Like all keyword research tools, it may not be the be-all and end-all, and it isn’t without its own little quirks, but it still offers rich keyword data whether you use it on its own or alongside the other keyword tools you are using.

Google has modified the tool over time, and one of the great additions was the ability to see the monthly demand via a small bar chart. This can be very useful for factoring in seasonality or growing demand for certain phrases. Wrapping your head around the actual numerical data is a bit more challenging. The Local number is just for the most recent month, while the Global number is a monthly average. This is further complicated in that the Global number essentially covers the whole world, while the Local number may factor in your campaign settings and locality (based on your AdWords campaign configuration).

To help tighten up data and provide a little more insight into the Local numbers, I just released an Excel spreadsheet that can take your Google Keyword Tool’s export and annualize the Local demand numbers. In some cases, this may dramatically change the order of importance of keywords to target.
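
The spreadsheet's actual formula is not described here, so the sketch below is only one plausible way to annualize a single-month figure, not the tool's method. It assumes you have the most recent month's Local volume plus twelve relative monthly values (for example, bar heights read off the trend chart), with the most recent month last. The function name and sample numbers are hypothetical.

```python
def annualize_local_demand(latest_month_volume: int,
                           monthly_trend: list[float]) -> int:
    """Scale the latest month's volume by a 12-month relative trend.

    `monthly_trend` holds relative demand for 12 consecutive months, with
    the most recent month last. Only the ratios between values matter.
    """
    if len(monthly_trend) != 12:
        raise ValueError("expected 12 monthly trend values")
    latest = monthly_trend[-1]
    if latest <= 0:
        raise ValueError("latest month needs a positive trend value")
    # Annual total = latest volume * (sum of all months / latest month).
    return round(latest_month_volume * sum(monthly_trend) / latest)


# Hypothetical keywords: one with flat demand, one at its seasonal peak.
flat = annualize_local_demand(1000, [1.0] * 12)
seasonal = annualize_local_demand(3000, [1.0] * 11 + [3.0])
print(flat)      # 12000
print(seasonal)  # 14000
```

Multiplying the latest month by 12 would put the seasonal keyword at 36,000 against 12,000 for the flat one. With the trend factored in, the gap narrows to 14,000 against 12,000, which is the kind of shift in keyword priority that annualizing can reveal.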

Best of all, this tool is free to use, so give it a try. The link below will take you to the download page for the tool, along with more detail about how it works and an example.

Google Keyword Tool Annualizer
