Posted by randfish
How many pages has Google indexed?
This question and the problems surrounding it run rampant through the SEO world. It usually arises when someone starts doing searches like this:
Google claims to have 93,800 pages indexed on the root domain, seomoz.org. That sounds pretty good, but when I ran that search query last week, the number was closer to 75,000 and when I run it again from Google.co.uk 60 seconds later, the number changes even more dramatically:
How about if I hit refresh on my Google.com results again:
Doh! Google just dropped 8,500 of my pages out of their index. That sucks – but not nearly as much as managers, marketing directors and CEOs who use these numbers as actual KPIs! Can you imagine? A number that means nothing, fluctuates 300% between data centers, can change at a moment’s notice and provides no actionable insight being used as a business metric?
And yet… It happens.
Fortunately, there’s an easy way to get much, much better data than what the search engines provide through "site:" queries and this post is here to walk you through that process step-by-step.
Step 1: Go to Traffic Sources in Your Analytics
Click the "traffic sources" link in Google Analytics or Omniture (it can also be called "referring sources" in other analytics packages).
Step 2: Head to the Search Engines Section
We want to find out how many pages the search engines have indexed, so the obvious next step is to go to the "search engines" sub-section.
Step 3: Choose an Engine
Choose the engine you want indexation data on and click. If you have both paid and organic traffic from this engine, you’ll want to display organic only at this step, too.
Step 4: Filter by Landing Pages
The "Landing Page" filter in the dropdown will show you the traffic each individual page on your site received from the engine you’ve selected. This also produces the magical "total" number of pages that have received traffic, described in Step 5.
Step 5: Record the Number at the Bottom
That count tells you the unique number of pages that received at least one visit from searches performed on Google. It’s the Holy Grail of indexation – a number you can accurately track over time to see how the search engine is indexing your site. On its own, it isn’t particularly useful, but over time (I usually recommend recording monthly, but for some sites, every 2-3 months can make more sense), it gives you insight into whether your pages are doing better or worse at drawing in traffic from the engine.
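If you'd rather compute this count from an exported report than read it off the screen, here is a minimal Python sketch. It assumes a CSV export with "Landing Page" and "Visits" columns; those column names, the sample rows and the month label are hypothetical, so adjust them to whatever your analytics package actually produces.

```python
import csv
import io


def count_pages_with_visits(csv_text, page_col="Landing Page", visits_col="Visits"):
    """Count unique landing pages that received at least one visit.

    The column names are assumptions about the export format.
    """
    pages = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Exports sometimes format large numbers as "1,200".
        visits = int(row[visits_col].replace(",", ""))
        if visits >= 1:
            pages.add(row[page_col])
    return len(pages)


# Hypothetical export of one month of organic traffic from a single engine.
sample = """Landing Page,Visits
/blog/indexation,120
/tools,45
/about,1
/old-page,0
"""

# Record one number per period so the trend can be followed over time.
monthly_counts = {"2010-01": count_pages_with_visits(sample)}
print(monthly_counts)
```

Adding one entry per month (or per quarter) to a record like `monthly_counts` gives you the trend line this step is about.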
Now, technically I’m being a bit cheeky here. This number doesn’t tell you the full story – it’s not showing the actual number of pages a search engine has crawled or indexed on your site, but it does tell you the unique number of URLs that received at least 1 visit from the engine. In my opinion this data is far more accurate and more actionable. The first adjective – accurate – is hard to argue (particularly given the visual evidence atop this post), but the second requires a bit of an explanation.
Why is Number of Pages Receiving ≥1 Visit Actionable?
Indexation numbers alone are useless. Businesses and websites use them as KPIs because they want to know if, over time, more of their pages are making their way into the engines’ indices. I’d argue that actually, you don’t care if your pages are in the indices – you care if your pages have the opportunity to EARN TRAFFIC!
Being a row in a search index means nothing if your page is:
- too low in PageRank/link juice to appear in any results
- displaying content the engines can’t properly parse
- devoid of keywords or content that could send traffic
- broken, misdirected or unavailable
- a duplicate of other pages that the engine will rank instead
Thus, the metric you want to count over time isn’t (in most cases) number of pages indexed, it’s number of pages that earned traffic. Over time, that’s the number you want to rise, the number you want marketers to concentrate on and the KPI that’s meaningful. It tells you whether the engine is crawling, indexing AND listing your pages in the results where someone actually clicked them.
If the number drops, you can investigate the actual pages that are no longer receiving traffic by exporting the data to Excel and doing a side-by-side with the previous month. If the number rises, you can see the new pages getting traffic. Those individual URLs will tell a story – of pages that broke, that stopped being linked-to, that fell too far down in paginated results or lost their unique content. It’s so much better than playing the mystery game that SEOs so often confront in the face of "lower indexation numbers" from the site: command.
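The side-by-side comparison doesn't have to happen in Excel. As a rough sketch, assuming two exported CSV reports with "Landing Page" and "Visits" columns (hypothetical names, as is the sample data), a few lines of Python can list the URLs that stopped or started receiving traffic:

```python
import csv
import io


def pages_with_visits(csv_text, page_col="Landing Page", visits_col="Visits"):
    """Return the set of landing pages with at least one visit."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[page_col] for row in reader if int(row[visits_col]) >= 1}


def compare_months(previous_csv, current_csv):
    """Return (pages that lost all traffic, pages that newly gained traffic)."""
    previous = pages_with_visits(previous_csv)
    current = pages_with_visits(current_csv)
    return sorted(previous - current), sorted(current - previous)


# Hypothetical exports for two consecutive months.
last_month = "Landing Page,Visits\n/blog/a,10\n/blog/b,3\n/tools,7\n"
this_month = "Landing Page,Visits\n/blog/a,12\n/tools,5\n/blog/c,2\n"

lost, gained = compare_months(last_month, this_month)
print("No longer receiving traffic:", lost)
print("Newly receiving traffic:", gained)
```

The two lists are the starting point for the investigation: each URL in `lost` is a page worth checking for breakage, lost links or lost content.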
Some Necessary Caveats
This methodology certainly isn’t perfect, and there are some important points to be aware of (thanks especially to some folks in the comments who brought these up):
- Google Analytics (and many other analytics packages) uses sampled data at times to make guesstimates. If you want to be sure you’re getting the absolute best number, export to CSV and do the side-by-side in Excel. You can even remove the URLs that appear in both time periods to see only those pages that uniquely did/didn’t receive traffic. In many of these cases, you might also only care about pages that gained/lost 5/10/20+ visits.
- You can get greater accuracy by shrinking the time period in the analytics, but that also reduces the likelihood that a page receiving very long tail query traffic once in a blue moon will be properly listed, so adjust accordingly, and plan for imperfect data. This method isn’t foolproof, but it is (in my opinion) better than the random roulette wheel of site: queries.
- This technique isn’t going to help you catch other kinds of SEO issues like duplicate content (it can in some cases, but it’s not as good as something like Google Webmaster Tools reporting) or 301s, 302s, etc., which can require a crawling solution.
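For the first caveat, where you only care about pages that gained or lost a meaningful number of visits, a small filter is enough. The sketch below assumes you've already loaded each period's export into a dictionary mapping URL to visits; the function name, the sample numbers and the default threshold of 5 are all illustrative.

```python
def significant_changes(previous, current, min_change=5):
    """Return {page: visit delta} for pages whose visits moved by at least min_change.

    `previous` and `current` map landing-page URLs to visit counts.
    A page missing from one period is treated as having 0 visits there.
    """
    changes = {}
    for page in set(previous) | set(current):
        delta = current.get(page, 0) - previous.get(page, 0)
        if abs(delta) >= min_change:
            changes[page] = delta
    return changes


# Hypothetical visit counts for two periods.
previous = {"/blog/a": 40, "/blog/b": 6, "/tools": 12}
current = {"/blog/a": 18, "/blog/b": 4, "/tools": 13, "/blog/c": 9}

for page, delta in sorted(significant_changes(previous, current).items()):
    print(f"{page}: {delta:+d}")
```

Raising `min_change` to 10 or 20 narrows the list to the biggest movers, which is usually where the investigation should start.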
I’d, of course, love your feedback. I know many SEOs are addicted to and supportive of the site: command numbers as a way to measure progress, so maybe there are things I’m not considering or situations where it makes sense. I also know that many of you like the number reported in Google Webmaster Tools under the Sitemaps crawl data (I’m skeptical of this too, for the record) and I’d like to hear how you find value with that data as well.
p.s. Tomorrow we’ll be announcing two webinars (open to all) about using Open Site Explorer to get ACTIONABLE data. Be sure to leave either Wednesday the 27th at 2pm Pacific or Thursday the 28th at 10am Pacific free.