Some Google News RSS Subscriptions Not Validating?

The Google News Help forums have a couple of reports that Google News RSS feeds are not working properly. Some users complain that they cannot subscribe to Google News searches, and others that the feeds are not validating properly.

I know that my tests seem to work just fine, and I am able to subscribe to Google News searches via Google Reader. However, when I plug those RSS URLs into FeedValidator.org, the feeds are reported as invalid. Here is a sample showing the errors of the Google News RSS searches.
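For readers who want a rough local check before reaching for FeedValidator.org, here is a minimal sketch in Python. It is not what FeedValidator.org runs and covers far less: it only tests that the feed is well-formed XML and that it carries the elements RSS 2.0 requires (a channel with title, link and description, and a title or description on every item). The function name `check_rss` and the sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Elements that RSS 2.0 requires inside <channel>.
REQUIRED_CHANNEL_ELEMENTS = ("title", "link", "description")


def check_rss(feed_xml):
    """Return a list of problems; an empty list means these basic checks passed."""
    try:
        root = ET.fromstring(feed_xml)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    if root.tag != "rss":
        return [f"root element is <{root.tag}>, expected <rss>"]

    channel = root.find("channel")
    if channel is None:
        return ["missing <channel> element"]

    problems = []
    for name in REQUIRED_CHANNEL_ELEMENTS:
        if channel.find(name) is None:
            problems.append(f"<channel> is missing <{name}>")

    # Every item needs at least one of <title> or <description>.
    for index, item in enumerate(channel.findall("item"), start=1):
        if item.find("title") is None and item.find("description") is None:
            problems.append(f"item {index} has neither <title> nor <description>")
    return problems


# Hypothetical sample feed: the channel lacks a <description>.
sample = (
    '<rss version="2.0"><channel>'
    "<title>News</title><link>http://example.com/</link>"
    "<item><title>Story</title></item>"
    "</channel></rss>"
)
print(check_rss(sample))
```

A feed can pass these checks and still fail a full validator, so treat an empty result as "nothing obviously broken" rather than "valid".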

There is no word from Google on this as of yet. We know Google News had issues with RSS feeds in the past.

Forum discussion at Google News Help.





Google Increases Sitemaps Limit to 50,000 from 1,000

I am not sure when this happened, but fairly recently Google changed the number of Sitemaps you can reference in a Sitemap index file. The limit used to be 1,000 Sitemaps per Sitemap index file; now it is 50,000. This is a huge increase in capacity.

Still, each Sitemap file can contain up to 50,000 URLs, so technically 50,000 multiplied by 50,000, or 2,500,000,000 (2.5 billion) URLs, can be submitted to Google via a single Sitemap index file. That is, if I can multiply correctly.
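To double-check the multiplication, here is a small Python sketch. The two limits come from this post; the function names are made up for illustration, and the helper simply works out how many Sitemap files a given number of URLs would need.

```python
MAX_SITEMAPS_PER_INDEX = 50_000  # new limit per Sitemap index file
OLD_MAX_SITEMAPS_PER_INDEX = 1_000  # previous limit
MAX_URLS_PER_SITEMAP = 50_000  # limit per individual Sitemap file


def max_urls_per_index(sitemaps_per_index=MAX_SITEMAPS_PER_INDEX):
    """Upper bound on URLs reachable through one Sitemap index file."""
    return sitemaps_per_index * MAX_URLS_PER_SITEMAP


def sitemaps_needed(url_count):
    """Number of Sitemap files needed for url_count URLs (ceiling division)."""
    needed = -(-url_count // MAX_URLS_PER_SITEMAP)
    if needed > MAX_SITEMAPS_PER_INDEX:
        raise ValueError("too many URLs for a single Sitemap index file")
    return needed


print(max_urls_per_index())                            # new capacity
print(max_urls_per_index(OLD_MAX_SITEMAPS_PER_INDEX))  # old capacity
print(sitemaps_needed(120_001))
```

Under the old limit the same arithmetic gives 50 million URLs per index file, so the change is a 50x increase.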

Googler Jonathan Simon said this in a Google Webmaster Help thread:

Thanks for resurfacing this thread as we’ve improved our capacity a bit since then. The limit used to be 1,000.

The Help Center article you point to is correct. The current maximum number of Sitemaps that can be referenced in a Sitemap Index file is 50,000.

Forum discussion at Google Webmaster Help.





Guide to a (hopefully) successful 301

I just made a major change for me and 301′d PatrickGavin.com to Searchengineoptimization.net. The focus of this site moving forward will be more of an SEO resource than my personal blog, so I wanted a name that says this loud and clear. I wanted to kick off the new site (new design coming soon btw) with a quick guide on how to properly 301 a domain. If you would have done anything differently, please let me know in the comments! In the end, the judge will be Google and how it handles this 301.

Goal is to 301 redirect www.abc.com to www.123.com

1. Register both www.abc.com and www.123.com with Google webmaster tools.  Note that you will have to insert a snippet of code or upload a file to both of your sites to verify you do own or control both websites.

2. Make sure www.123.com takes on the exact design, look and feel, etc. of the original www.abc.com site. This is important because Google doesn’t like too much change at once, so keep the design the same during a 301.

3. Make sure you keep the URL structure the same, i.e., www.patrickgavin.com/2009/04/19/april-sandbox-update/ should be moved to www.searchengineoptimization.net/2009/04/19/april-sandbox-update/

4. Follow Google’s technical guidelines on completing a 301.

5. Double-check your redirects. Use a 301 redirect checker to make sure your redirects are SEO friendly (http://www.internetofficer.com/seo-tool/redirect-check/). Also, test a handful of URLs from your old domain to make sure they are getting redirected properly. You can easily do this by googling your old domain name and clicking through your site’s indexed pages. Be very sure that every single page of your old domain is getting redirected to an active page at your new domain. You definitely don’t want to be left with a bunch of 404 Not Found errors. Additionally, it is recommended that you choose to redirect everything to either the www or non-www version of your new domain.

6. Let Google Webmaster Tools know about the 301 by submitting a "Change of Address" request. 

7. After the 301, it is highly recommended you build some strong links and add some content to the blog to show the site is still active & growing. If you want to go further, have some of your old incoming links updated to point to the new site (this is not required, but it shows that the sites that linked to you before still want to link to you now, indicating it’s the same site as before). Remember Google logs redirects just like it logs your backlink data, so don’t rely on 301s as your sole link strategy.

8. Wait patiently and have some faith!  Your original site can disappear for a period of time from Google’s index leaving you with NOTHING for a period of time.  This could be days or weeks.  What should happen is the new site you 301′d to should appear taking on similar rankings that you had for your original site give or take a few spots up or down.
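Steps 3 and 5 above can be sketched in a few lines of Python. This is only an illustration, not a replacement for a redirect checker: `map_to_new_domain` and `is_clean_301` are hypothetical helper names, and the second function assumes you have already fetched the old URL without following redirects and have its status code and `Location` header in hand.

```python
from urllib.parse import urlsplit, urlunsplit


def map_to_new_domain(old_url, new_host):
    """Swap only the host, keeping scheme, path, query and fragment (step 3)."""
    parts = urlsplit(old_url)
    return urlunsplit((parts.scheme, new_host, parts.path, parts.query, parts.fragment))


def is_clean_301(status, location, expected_url):
    """True if the response is a single permanent redirect to the expected URL (step 5)."""
    return status == 301 and location == expected_url


old = "http://www.patrickgavin.com/2009/04/19/april-sandbox-update/"
expected = map_to_new_domain(old, "www.searchengineoptimization.net")
print(expected)

# A 302, or a 301 that drops the path and lands on the home page, would fail this check.
print(is_clean_301(301, expected, expected))
print(is_clean_301(302, expected, expected))
```

Running the mapping over every URL you know about on the old domain gives you a list of expected targets to compare against what the server actually returns.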

Now we will wait and see what happens with my 301 and I will keep you updated.  The PatrickGavin.com domain is currently ranking #6 for the coveted "search engine optimization" query so this is a bit of a gamble and it will be fun to see if it pays off!

UPDATE! +3 days after the 301. My PatrickGavin.com site has lost all Google keyword rankings! I have lost all rankings including #1 for "patrick gavin", #6 for "search engine optimization", etc. Here is the good news: this is what happens when you do a 301. Fingers crossed but the SearchEngineOptimization.net domain should* reappear in the coming days/weeks…

Google & Lightbox JavaScript: Can GoogleBot Index Images in Lightbox JS?

A WebmasterWorld thread has discussion around getting Google to index a popular image feature sites use to show off images on their web site. It is called Lightbox JS and it basically uses JavaScript to open up a neat larger view of the image on the page.

I use it on many sites, but you can see a quick example on the RustyBrick Mobile Portfolio. Just click on the image and it opens up a larger picture of that image. Here is a screen capture showing the larger image as it overlays on top of the page:

Lightbox & Google

The issue is, GoogleBot is having a tough time capturing these images in their index. WebmasterWorld administrator, Tedster, explained:

I’ve been up against the same challenge. Even though regular Google search is aggressively discovering URLs and content by spidering JavaScript, apparently the image bot is not so inquisitive at this point. This surprised me, because there are many images being displayed through Lightbox scripts these days.

Yes, GoogleBot is able to execute JavaScript, but is GoogleImageBot able to at the same pace?

Tedster is exploring other ways to get GoogleBot to index Lightbox JS. He tried the following method, but it doesn’t seem to work:

My latest attempt involves making the anchor part of the link a thumbnail image – but the thumbnail is not just a smaller version of the larger image. I use the same exact image file for the anchor, but I resize it on the page with the HTML width and height attributes. This means that the page loads more slowly, but at least the image bot gets a direct <img src=[url]> style mark-up.
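For anyone trying to picture the mark-up Tedster describes, here is a rough Python sketch that generates it. Keep in mind that this approach did not seem to get the images indexed, so it is an illustration of the attempt and not a fix. The function name is made up, and `rel="lightbox"` is assumed to be the attribute the Lightbox script hooks into; adjust it to whatever your Lightbox version expects.

```python
from html import escape


def lightbox_thumbnail(image_url, width, height, alt=""):
    """Build an anchor whose thumbnail is the full-size file, scaled via HTML attributes."""
    src = escape(image_url, quote=True)
    alt_text = escape(alt, quote=True)
    return (
        f'<a href="{src}" rel="lightbox">'  # rel value is an assumption
        f'<img src="{src}" width="{width}" height="{height}" alt="{alt_text}">'
        "</a>"
    )


# Hypothetical image URL and thumbnail size.
print(lightbox_thumbnail("/images/portfolio-large.jpg", 150, 100, alt="Portfolio"))
```

The trade-off is the one Tedster mentions: the browser downloads the full-size file just to draw the thumbnail, so the page loads more slowly.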

If you have a solution, let us know.

Forum discussion at WebmasterWorld.





Bug: Bing Webmaster Tools Not Accepting URLs with Hyphens

There are a few reports in a Bing Forum thread that adding a site to Bing Webmaster Tools might not work, specifically if the URL or domain contains a hyphen (dash), such as www.best-domain.com.

Brett Yount from the Bing Webmaster team confirmed the bug, saying:

Currently, we are having a few difficulties which I just received confirmation from the indexing team. They are currently working on it, but said that if you try a couple times, it should work. If not, and your site isn’t in the index at all, please post on the not in index thread and I will work to get your home page (only) into the index.

I personally tried adding a domain with a hyphen and it worked for me on the first try. So maybe it is resolved or maybe those specific domains have other issues?

Forum discussion at Bing Forum.





Google To Add “Trustworthy Indicator” to Site Performance Tool

A Google Webmaster Help thread has reports of page load times spiking up to ridiculous numbers in the new Google site performance reports. Google’s response to these reports was pretty interesting. I’ll get to that soon; first, the context.

A Top Contributor in the forum wrote:

After about 6 months of “flat line” Site Performance reports of average page load time around 1 or 2 seconds, I am now seeing in Tools a report that: “On average, pages in your site take 83.1 seconds to load (updated on Dec 7, 2009).” and of course the graph has shot up and I’m told that my site’s average page load time is “slower than 100% of sites”.

However, the only two pages listed in that report both show load times of 1 to 2 seconds.

Now a Googler with the code name “sreeram” replied saying:

The 83s number is bogus. Your site’s toolbar traffic dropped by more than an order of magnitude in the last few days. You should ignore the average for now. We’ll soon be showing site owners some indication of how trustworthy the numbers are, so you can decide when to ignore it and when not to.

Not all URLs may have toolbar traffic, so it’s possible to have many URLs indexed, and even visited by users, but only a couple may show up on Site Performance. In addition, when there’s very little data for a given URL, we won’t display it (for privacy reasons), though it will be included in the overall site average.

So in this case, the site’s traffic as seen by the Google Toolbar dropped significantly, which caused a weird spike in the webmaster’s site performance reports. Thus, Google promised to provide an “indication of how trustworthy the numbers are” in this report.
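To see how a drop in sample size can produce a number like this, here is a toy Python sketch. Everything in it is hypothetical: the load times are invented, the site average is assumed to be a plain mean over all samples, and the minimum-sample threshold stands in for whatever privacy rule Google actually applies. It only illustrates the mechanism sreeram describes, where a URL with very little data is hidden from the report but still counted in the site average.

```python
def site_average(samples_by_url):
    """Mean load time over every sample, including URLs too sparse to display."""
    all_samples = [t for samples in samples_by_url.values() for t in samples]
    return sum(all_samples) / len(all_samples)


def displayed_urls(samples_by_url, min_samples):
    """Per-URL averages, shown only for URLs with at least min_samples samples."""
    return {
        url: sum(samples) / len(samples)
        for url, samples in samples_by_url.items()
        if len(samples) >= min_samples
    }


# Hypothetical load times in seconds after toolbar traffic has dropped off.
samples = {
    "/": [1.0, 2.0, 1.5, 1.5],
    "/about": [1.0, 2.0],
    "/rare": [250.0],  # a single slow sample, too little data to display
}

print(displayed_urls(samples, min_samples=2))
print(site_average(samples))
```

With only seven samples in total, one slow page view drags the site average far above anything shown in the per-URL list, which matches the situation the Top Contributor reported.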

Clearly, some of these numbers are not trustworthy, such as factoring in Toolbar fluctuations or Google Analytics speed.

Forum discussion at Google Webmaster Help.

Update: John Mueller from Google sent me a note about this:

The Webmaster Tools team is constantly working on ways to improve the product as well as the data provided there. In general, we prefer not to comment on possible future releases. The Labs section in Webmaster Tools allows us to easily try out and iterate on new and innovative features, which is one reason we launched the Site Performance tool there. Personally, I found the data provided there quite actionable and have seen a lot of positive feedback from webmasters around this tool. To fine-tune a website with regards to speed, it can be useful to start with the information provided here and then to look into the details using browser-based tools such as Page Speed and Speed Tracer.

We’re always looking into ways we can take our products and services to the next level. We appreciate all of the feedback and coverage that you provide! I’ll get in touch with you once I have more information that I can share.





How To Find a Domain’s # of Indexed Pages In Google Post-Caffeine

In the olden days, as in before this week, you used to be able to get an idea of how many pages you had in Google’s index by searching “site:<yourdomain>”. The resulting page would say something like “results 1-10 of 1,390,000”, which, while not entirely accurate, gave you a general idea of how well indexed your site was. Now with the official launch of Google Caffeine (update: I stand corrected, this is not a Caffeine issue but a new GOOG UI issue that I neglected to stay on top of – thanks Rhaghavan), the site: query no longer displays the number of total results (update: at least it doesn’t work for me but as you can see in the comments others have not experienced this yet).

While many people were unduly obsessed with this number, it did have its uses. For example, while big swings in the reported number, say from 10,000,000 to 236,000, were scary but irrelevant, small changes in the reported number seemed to be more in sync with SEO problems or fixes.

So if you still want to find out how many pages your domain has in the index how do you do it?

  1. Sign up for Google Webmaster Tools and submit xml sitemaps for every URL on your domain.  The Sitemaps report in GWT will then show the number of indexed URLs from your sitemaps (btw it’s not clear that this number is accurate either).  My guess is getting more xml sitemaps submitted was one of the primary reasons that GOOG stopped reporting this number.  That and maybe saving bandwidth from all of those site: queries that nervous site owners did all day long.
  2. If you don’t want to give GOOG your data via GWT, then you can still do a fake site: query by using “inurl:<yourdomain>”. Make sure you don’t use “www” in the query (e.g. inurl:localseoguide.com).  This isn’t a perfect query – sites that incorporate your domain into their URLs will show up (e.g. www.alexa.com/siteinfo/localseoguide.com), but for most sites this shouldn’t be a huge number of URLs.  It’s hard to judge how accurate this query is but I have tried it for several client sites and it seems to square up pretty well with how many pages they seem to have.

If anyone has any other ideas feel free to add them to the comments and/or put them on your blog, link back here and it will show up in the trackbacks.
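The inurl: workaround from option 2 is easy to script if you check a lot of domains. Here is a minimal Python sketch; the function names are made up, and it only builds the query and the search URL, it does not fetch or parse Google’s results.

```python
from urllib.parse import quote_plus


def inurl_query(domain):
    """Build the inurl: query, dropping a leading "www." as recommended above."""
    host = domain.strip().lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return f"inurl:{host}"


def google_search_url(query):
    """URL-encode the query into a Google search URL."""
    return "https://www.google.com/search?q=" + quote_plus(query)


query = inurl_query("www.localseoguide.com")
print(query)
print(google_search_url(query))
```

As noted above, pages on other sites that contain your domain in their URLs will also match, so read the result as a rough estimate.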

Looking Back at Linkscape’s Trillion + URLs (and Announcing our Latest Index Update)

Posted by Nick Gerner

As we rapidly approach the end of 2009 and the opening of 2010, we’ve got a much anticipated index update ready to roll out, gang. Say it with me: "twenty-ten". Oh yeah, I’m so gonna get a flying car and a cyberpunk android :) …Ahem. I thought this would be a great time to take a look back at the year and ask, "where did all those pages go?" Being a data-driven kind of guy, I want to take a look at some numbers about churn, freshness and what it means for the size of the web and web indexes over the last year, and the hundreds of billions, indeed trillion-plus URLs we’ve gotten our hands on.

This index update has a lot going on, so I’ve broken things out section by section:

An Analysis of the Web’s Churn Rate

Not too long ago, at SMX East, I heard Joachim Kupke (senior software engineer on Google’s indexing team) say that "a majority of the web is duplicate content". I made great use of that point at a Jane and Robot meet up shortly after.  Now, I’d like to add my own corollary to that statement: "most of the web is short-lived".

Churn on the Web

 

After just a single month, a full 25% of the URLs are what we call "unverifiable". By that I mean that the content was either duplicate, included session parameters, or for some reason could not be retrieved (verified) again (404s, 500s, etc.). Six months later, 75% of the tens of billions of URLs we’ve seen are "unverifiable", and a year later, only 20% qualify for "verified" status. As Rand noted earlier this week, Google’s doing a lot of verifying themselves.
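To make the "unverifiable" definition concrete, here is a rough Python sketch of the three conditions named above. It is not Linkscape’s actual code: the function name, the list of session parameter names, and the use of a content hash to spot duplicates are all assumptions made for illustration.

```python
from typing import Optional, Set
from urllib.parse import parse_qsl, urlsplit

# Hypothetical list of query parameters that usually carry a session ID.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def is_unverifiable(url: str, status: Optional[int],
                    content_hash: Optional[str], seen_hashes: Set[str]) -> bool:
    """Apply the three conditions: not retrievable, session parameters, duplicate content."""
    # Could not be retrieved again: no response at all, or a 404/500-style error.
    if status is None or status >= 400:
        return True
    # URL includes a session parameter in its query string.
    query = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    if any(name.lower() in SESSION_PARAMS for name, _ in query):
        return True
    # Content duplicates a page that has already been kept.
    if content_hash is not None and content_hash in seen_hashes:
        return True
    return False


seen = {"hash-of-page-a"}
print(is_unverifiable("http://example.com/gone", 404, None, seen))
print(is_unverifiable("http://example.com/cart?PHPSESSID=abc123", 200, "hash-x", seen))
print(is_unverifiable("http://example.com/copy", 200, "hash-of-page-a", seen))
print(is_unverifiable("http://example.com/fresh", 200, "hash-y", seen))
```

Session IDs embedded in the path (for example `;jsessionid=...`) are not caught by this sketch, since it only looks at the query string.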

To visualize this dramatic churn, imagine the web six months ago…

the web six months ago

Using Joachim’s point, plus what we’ve observed, that six-month old content today looks something like this:

what remains of the six-month-old web

What this means for you as a marketer is that some of the links you build and content you share across the web are not permanent. If you engage heavily with high-churn portions of the web, the statistics you monitor over time can vary pretty wildly. It’s important to understand the difference between getting links (and republishing content) in places that will make a splash now, but fade away, versus engaging in lasting ways. Of course, both are important (as high-churn areas may drive traffic that turns into more permanent value), but the distinction shouldn’t be overlooked.

Canonicalization, De-Duping & Choosing Which Pages to Keep

Regarding Linkscape’s indices, we capture both of these cases:

  • We’ve got an up-to-date crawl including fresh content that’s making waves right now. Blogscape helps power this, monitoring 10 million+ feeds and sending those back to Linkscape for inclusion in our crawl.
  • We include the lasting content which will continue to support your SEO efforts by analyzing which sites and pages are "unverifiable" and removing these from each new index. This is why our index growth isn’t cumulative — we re-crawl the web each cycle to make sure that the links + data you’re seeing are fresh and verifiable.

To put it another way, consider the quality of most of the pages on the web, as measured, for instance, by mozRank:

Most Pages are Junk (via mozRank)

I think the graph speaks for itself. The vast majority of pages have very little "importance" as defined by a measure of link juice. So it doesn’t surprise me (now at least) that most of these junk pages are disappearing after not too long.  Of course, there are still plenty of really important pages that do stick around.

But what does this say about the pages we’re keeping? First off, let’s take out any discussion of the pages that we saw over a year ago (as we’ve seen above, there’s likely less than 1/5th of them remaining on the web). In just the past 12 months, we’ve seen between 500 billion and well over 1 trillion pages depending on how you count it (via Danny at Search Engine Land).

Linkscape URLs in the last year

So in just a year we’ve provided 500 billion unique URLs through Linkscape and the Linkscape-powered tools (Competitive Link Finder, Visualization, Backlink Analysis, etc.). And what’s more, this represents less than half of the URLs we’ve seen in total, as the "scrubbing" we do for each index cuts approx. 50% of the "junk" (including canonicalization, de-duping, and straight tossing for spam and other reasons). There are likely many trillions of URLs out there, but the engines (and Linkscape) certainly don’t want anything close to all of these in an index.

Linkscape’s December Index Update:

From this latest index (compiled over approx. the last 30 days) we’ve included:

  • 47,652,586,788 unique URLs (47.6 billion)
  • 223,007,523 subdomains (223 million)
  • 58,587,013 root domains (58.6 million)
  • 547,465,598,586 links (547 billion)

We’ve checked that all of these URLs and links existed within the last month or so.  And I call out this notion of "verified" because we believe that’s what matters for a lot of reasons:

I hope you’ll agree. Or, at least, share your thoughts :)

New Updates to the Free & Paid Versions of our API

I also want to give a shout-out to Sarah, who’s been hard at work on repackaging our site intelligence API suite. She’s got all kinds of great stuff planned for early in the coming year, including tons of data in our free APIs. Plus she’s dropped the prices on our paid suite by nearly 90%.

Both of these items are great news to some of our many partners, including:

Thanks to these partners we’ve doubled the traffic to our APIs to over 4 million hits per day, more than half of which are from external partners!  We’re really excited to be working with so many of you.

Do you like this post? Yes No

Posted by Nick Gerner

As we rapidly approach the end of 2009 and opening of 2010, we’ve got a much anticipated index update ready to roll out gang.  Say it with me "twenty-ten".  Oh yeah, I’m so gonna get a flying car and a cyberpunk android :)    …Ahem.  I thought this would be a great time to take a look back at the year and ask, "where did all those pages go?"  Being a data-driven kind of guy, I want to take a look at some numbers about churn, freshness and what it means for the size of the web and web indexes over the last year, and the hundreds of billions, indeed trillion plus urls we’ve gotten our hands on.

This index update has a lot going on, so I’ve broken things out section by section:

An Analysis of the Web’s Churn Rate

Not too long ago, at SMX East, I heard Joachim Kupke (senior software engineer on Google’s indexing team) say that "a majority of the web is duplicate content". I made great use of that point at a Jane and Robot meet up shortly after.  Now, I’d like to add my own corollary to that statement: "most of the web is short-lived".

Churn on the Web

 

After just a single month, a full 25% of the URLs are what we call "unverifiable".  By that I mean that the content was either duplicate, included session parameters, or for some reason could not be retrieved (verified) again (404s, 500s, etc.).  Six months later, 75% of the tens of billions of URLs we’ve seen are "unverifiable" and a year later, only 20% qualifies for "verified" status. As Rand noted earlier this week, Google’s doing a lot of verifying themselves.

To visualize this dramatic churn, imagine the web six months ago…

the web six months ago

Using Joachim’s point, plus what we’ve observed, that six-month old content today looks something like this:

what remains of the the six month old web

What this means for you as a marketer is that some of the links you build and content you share across the web is not permanent. If you engage heavily with high-churn portions of the web, the statistics you monitor over time can vary pretty wildly. It’s important to understand the difference between getting links (and republishing content) in places that will make a splash now, but fade away, versus engaging in lasting ways.  Of course, both are important (as high-churn areas may drive traffic that turns into more permanent value), but the distinction shouldn’t be overlooked. 

Canonicalization, De-Duping & Choosing Which Pages to Keep

Regarding Linkscape’s indices, we capture both of these cases:

  • We’ve got an up-to-date crawl including fresh content that’s making waves right now. Blogscape helps power this, monitoring 10 million+ feeds and sending those back to Linkscape for inclusion in our crawl.
  • We include the lasting content which will continue to support your SEO efforts by analyzing which sites and pages are "unverifiable" and removing these from each new index. This is why our index growth isn’t cumulative — we re-crawl the web each cycle to make sure that the links + data you’re seeing are fresh and verifiable.

To put it another way, consider the quality of most of the pages on the web, as measured, for instance, by mozRank:

Most Pages are Junk (via mozRank)

I think the graph speaks for itself. The vast majority of pages have very little "importance" as defined by a measure of link juice. So it doesn’t surprise me (now at least) that most of these junk pages are disappearing after not too long.  Of course, there are still plenty of really important pages that do stick around.

But what does this say about the pages we’re keeping?  First of let’s take out any discussion of the pages that we saw over a year ago (as we’ve seen above, there’s likely less than 1/5th of them remaining on the web).  In just the past 12 months, we’ve seen between 500 billion and well over 1 trillion pages depending on how you count it (via Danny at Search Engine Land).

Linkscape URLs in the last year

So in just a year we’ve provided 500 billion unique urls through Linkscape and the Linkscape powered tools (Competitive Link Finder, Visualization, Backlink Analysis, etc.). And what’s more, this represents less than half of the URLs we’ve seen in total, as the "scrubbing" we do for each index cuts approx. 50% of the "junk" (including canonicalization, de-duping, and straight tossing for spam and other reasons). There’s likely many trillions of URLs out there, but the engines (and Linkscape) certainly don’t want anything close to all of these in an index.

Linkscape’s December Index Update:

From this latest index (compiled over approx. the last 30 days) we’ve included:

  • 47,652,586,788 unique URLs (47.6 billion)
  • 223,007,523 subdomains (223 million)
  • 58,587,013 root domains (59.5 billion)
  • 547,465,598,586 links (547 billion)

We’ve checked that all of these URLs and links existed within the last month or so.  And I call out this notion of "verified" because we believe that’s what matters for a lot of reasons:

I hope you’ll agree. Or, at least, share your thoughts :)

New Updates to the Free & Paid Versions of our API

I also want to call a shout out to Sarah who’s been hard at work on repackaging our site intelligence API suite.  She’s got all kinds of great stuff planned for early the coming year, including tons of data in our free APIs.  Plus she’s dropped the prices on our paid suite by nearly 90%.

Both of these items are great news to some of our many partners, including:

Thanks to these partners we’ve doubled the traffic to our APIs to over 4 million hits per day, more than half of which are from external partners!  We’re really excited to be working with so many of you.

Google’s Indexation Cap

Posted by randfish

Over the past 2 years, SEOmoz has worked with quite a number of websites whose primary goal (or primary problem) in SEO has been indexation – getting more of their pages included in Google’s index so they have the opportunity to rank well. These are, obviously, long-tail-focused sites that earn the vast majority of their visits from queries that bring in 5 or fewer searches each day. In this post, I’m going to tackle the question of how Google determines the quantity of pages to index on a site and how sites can go about improving this metric.

First, a quick introduction to a truth that I’m not sure Google’s shared very publicly (though they may have discussed it on panels or formally on the web somewhere I haven’t seen) – that is – the concept that there’s an "indexation cap" on the number of URLs from a website that Google will maintain in their main index. I was skeptical about this until I heard it firsthand from a Googler being described to a webmaster. Even then, I didn’t feel like the principle was "confirmed," but after talking to a lot of SEOs working at very large companies, some of whom have more direct interactions with the search quality team, this is, apparently, a common point of discussion and something Google’s been more open about recently.

The "indexation cap" makes sense, particularly as the web is growing exponentially in size every few years, often due to the production of spam and more legitimate, but no less index-worthy content on sites of all sizes and shapes. I believe that many site owners started noticing that the more pages they produced, even with very little "unique" content, the more traffic Google would send and thus, abuse was born. As an example, try searching using Google’s "last 24 hours" function:

SEOmoz blog post search on Google in the past 24 hours
Seriously, go have a look; the quantity of "junk" you wouldn’t want in your search engine’s index is remarkable

Since Tom published the post on Xenu’s Link Sleuth last night, Google’s already discovered more than 250 pages around the web that include that content or mentions of it. If, according to Technorati, the blogosphere is still producing 1.5 million+ posts each week, that’s conservatively growing the web by ~20 billion pages each year. It should come as no surprise that Google, along with every other search engine, has absolutely no desire to keep more than, possibly, 10-20% of this type of content (and anyone who’s tried re-publishing in this fashion for SEO has likely felt that effect). Claiming to have the biggest index size may actually be a strike against relevancy in this world (according to Danny Sullivan, it’s been a dead metric for a long time).
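For anyone who wants to check the ~20 billion figure, here’s a rough back-of-the-envelope sketch. It assumes (and this is purely an assumption for illustration, not a measured number) that every blog post spawns about as many copies or mentions as the 250+ pages seen for the Xenu post:

```python
# Back-of-the-envelope check of the "~20 billion pages each year" estimate.
# Assumption: every post is copied/mentioned about as often as the Xenu's
# Link Sleuth post (250+ pages), which is almost certainly not uniform.
POSTS_PER_WEEK = 1_500_000   # Technorati figure quoted above
COPIES_PER_POST = 250        # pages Google found for one post overnight
WEEKS_PER_YEAR = 52

posts_per_year = POSTS_PER_WEEK * WEEKS_PER_YEAR
pages_per_year = posts_per_year * COPIES_PER_POST

print(f"{posts_per_year:,} posts per year")
print(f"{pages_per_year / 1e9:.1f} billion pages per year")
```

Under those assumptions the total comes out at roughly 19.5 billion pages a year, which is where the ~20 billion estimate lands.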

So – long story short – Google (very likely) has a limit it places on the number of URLs it will keep in its main index and potentially return in the search results for domains.

The interesting part is that, in the past 3 months, the number of big websites (I’ll use that to refer to sites with in excess of 1 million unique pages) we’ve talked to, helped through Q+A or consulted with that have lost wide swaths of indexation has skyrocketed, and we’re not alone. The pattern is usually the same:

  • One morning, you wake up, and 40% of your search traffic is gone with no signal as to what’s happened
  • Cue panicking executives, investors and employees (oh, and usually the poor SEO team, too)
  • Enter statistics data, showing that rankings for big terms aren’t down (or, maybe down a little), but that the long tail has gotten a lot shorter
  • Re-consideration request goes to Google
  • Somewhere between 10 and 40 days later, a message arrives saying:

We’ve processed your reconsideration request for http://xyz.com.

We received a request from a site owner to reconsider how we index the following site: http://xyz.com

We’ve now reviewed your site. When we review a site, we check to see if it’s in violation of our Webmaster Guidelines. If we don’t find any problems, we’ll reconsider our indexing of your site. If your site still doesn’t appear in our search results, check our Help Center for steps you can take.

  • This email, soon to be recognized by the Academy of Nonsense for its pre-eminent place among the least helpful collection of words ever assembled, spurs bouts of cursing and sometimes, tragically, termination of SEO or marketing managers. Hence, we at SEOmoz take it pretty personally (as this group includes many close friends & colleagues).
  • Calls go out to the Google AdWords reps, typically consisting of a conversation that goes something like:
    Exec: "We spent $10 million @#$%ing dollars with you last month and you can’t help?"
    AdWords Rep: "I’m sorry. We wish we could help. We just don’t have any influence on that side of the business. We don’t know anyone there or talk to anyone there."
    Exec: "Get me your boss on the phone. Now."
    Repeat ad nauseam until you reach level of management commensurate with spend of the exec’s company (or their connections)
    Exec: "Can you get me some answers?"
    AdWords Boss: "They won’t tell me much, but apparently they’re not keeping as many pages in the index from your site as they were before."
    Exec: "Yeah, we kind of figured that part out. Are they going to put us back in?"
    AdWords Boss: "My understanding is no."
    Exec: "So what am I supposed to do? We’re not going to have money to buy those $10 million in ads next month, you know."
    AdWords Boss: "You might try talking to someone who does SEO."
  • At this point, consultants receive desperate email or phone messages

To help site owners facing these problems, let’s examine some of the potential metrics Google looks at to determine indexation (note that these are my opinions, and I don’t have statistical or quantitative data to back them up at this time):

  1. Importance on the Web’s Link Graph
    We’ve talked previously about metrics like a domain-level calculation of PageRank (Domain mozRank is an example of this). It’s likely that Google would make this a backbone of the indexation cap estimate, as sites that tend to be more important and well-linked-to by other important sites tend to also have content worthy of being in the index.
  2. Backlink Profile of the Domain
    The profile of a site’s links can look at metrics like where those links come from, the diversity of the different domains sending links (more is better) and why those links might exist (methods that violate guidelines are often getting caught and filtered so as not to provide value).
  3. Trustworthiness of the Domain
    Calculations like TrustRank (or Domain mozTrust in Linkscape) may make their way into the determination. You may not have as many links, but if they come from sites and pages that Google trusts heavily, your chances for raising the indexation cap likely go up.
  4. Rate of Growth in Pages vs. Backlinks
    If your site’s content is growing dramatically, but you’re not earning many new links, this can be a signal to the engine that your content isn’t "worthy" of ongoing attention and inclusion.
  5. Depth & Frequency of Linking to Pages on the Domain
    If your home page and a few pieces of link-targeted content are earning external links while the rest of the site flounders in link poverty, that may be a signal to Google that although users like your site, they’re not particularly keen on the deep content – which is why the index may toss it out.
  6. Content Uniqueness
    Uniqueness is a constantly moving target and hard to nail down, but basically, if you don’t have a solid chunk of words and images that are uniquely found on one URL (ignoring scrapers and spam publishers), you’re at risk. Google likely runs a number of sophisticated calculations to help determine uniqueness, and they’re also, in my experience, much tougher on pages and sites that don’t earn high quantities of external links to their deep content with this analysis.
  7. Visitor, CTR and Usage Data Metrics
    If Google sees that clicks to your site frequently result in a click of a back button, a return to the SERPs and the selection of another result (or another query) in a very short time frame, that can be a negative signal. Likewise, metrics they gather from the Google toolbar, from ISP data and other web surfing analyses could enter into this mix. While CTR and usage metrics are noisy signals (one spammer with a Mechanical Turk account can swing the usage graph pretty significantly), they may be useful to decide which sites need higher levels of scrutiny.
  8. Search Quality Rater Analysis + Manual Spam Reports
    If your content is consistently reported as being low value or spam by users and/or quality raters, expect a visit from the low indexation cap fairy. This may even be done on a folder-by-folder basis if certain portions of your site are particularly egregious while other material is index-worthy (and that phenomenon probably holds true for all of the criteria above as well).
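Google doesn’t expose any of these metrics, but a couple of them can be roughly approximated from your own data. Here’s a minimal sketch for #4 and #5, assuming you can export total page and external link counts per month, plus a per-URL count of external links from a backlink tool. The field names, the sample numbers and the idea of comparing the two growth rates are my own illustration, not anything Google has confirmed using:

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical monthly export: total pages and total external links."""
    pages: int
    external_links: int


def growth(old: int, new: int) -> float:
    """Fractional growth from old to new (0.25 means +25%)."""
    return (new - old) / old


def deep_link_share(links_per_url: dict[str, int], home: str = "/") -> float:
    """Share of non-home-page URLs that have at least one external link."""
    deep = {url: n for url, n in links_per_url.items() if url != home}
    if not deep:
        return 0.0
    return sum(1 for n in deep.values() if n > 0) / len(deep)


# Made-up numbers for illustration only.
last_month = Snapshot(pages=1_000_000, external_links=50_000)
this_month = Snapshot(pages=1_400_000, external_links=51_000)

page_growth = growth(last_month.pages, this_month.pages)
link_growth = growth(last_month.external_links, this_month.external_links)
print(f"pages grew {page_growth:.0%}, external links grew {link_growth:.0%}")

sample_links = {"/": 900, "/a": 3, "/b": 0, "/c": 0, "/d": 1}
share = deep_link_share(sample_links)
print(f"{share:.0%} of deep pages have an external link")
```

If pages are growing many times faster than links, or only a small share of deep pages has any external link at all, that’s the pattern described in #4 and #5 above. Where the danger threshold sits is anyone’s guess.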

Now let’s talk about some leading indicators that can help to show if you’re at risk:

  • Deep pages rarely receive external links – if you’re producing hundreds or thousands of pages of new content and fewer than "dozens" earn any external link at all, you’re in a sticky situation. Sites like Wikipedia, the NYTimes, About.com, Facebook, Twitter and Yahoo! have millions of pages, but they also have dozens to hundreds of millions of links, and relatively few pages that have no external links. Compare that against your 10 million page site with 400K pages in the index (which is more pages than what Google reports indexing on Adobe.com, one of the best linked-to domains on the web).
  • Deep pages don’t appear in Google Alerts – if Google Alerts is consistently passing you by (not reporting your pages), this can be (but isn’t universally) an indication that they’re not perceiving your pages as being unique or worthy enough of the main index in the long run.
  • Rate of crawling is slow – if you’re updating content, links and launching new pages multiple times per day, and Google’s coming by every week, you’re likely in trouble. XML Sitemaps might help, but it’s likely you’re going to need to improve some of those factors described above to get in good graces for the long term.
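The crawl-rate indicator is the easiest of the three to check yourself. Here’s a minimal sketch that pulls the days on which Googlebot showed up out of a server access log and reports the longest gap between visits. It assumes combined-format log lines and simply trusts the user agent string (which can be spoofed); the sample lines and function names are made up for illustration:

```python
import re
from datetime import datetime

# Matches the timestamp in a combined-format access log line, e.g.
# [10/Dec/2009:06:25:14 +0000]
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def googlebot_visit_days(log_lines):
    """Return the sorted dates on which a Googlebot user agent appears."""
    days = set()
    for line in log_lines:
        if "Googlebot" not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
            days.add(stamp.date())
    return sorted(days)


def longest_gap_days(days):
    """Longest gap in days between consecutive crawl dates (0 if under 2 dates)."""
    return max(((b - a).days for a, b in zip(days, days[1:])), default=0)


# Made-up sample lines; in practice, read these from your access log file.
sample_log = [
    '66.249.66.1 - - [01/Dec/2009:06:25:14 +0000] "GET /page-1 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.7 - - [02/Dec/2009:09:00:00 +0000] "GET /page-2 HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '66.249.66.1 - - [08/Dec/2009:07:10:02 +0000] "GET /page-3 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

days = googlebot_visit_days(sample_log)
gap = longest_gap_days(days)
print("Googlebot crawl days:", [d.isoformat() for d in days])
print("Longest gap between crawl days:", gap)
```

If you’re publishing several times a day and the longest gap is a week or more, that matches the warning sign above.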

There’s no doubt that indexation can be a vexing problem, and one that’s tremendously challenging to conquer. When the answer to "how do we get those pages back?" is "make the content better, more unique, stickier and get a good number of diverse domains to link regularly to each of those millions of URLs," there’s going to be resistance and a search for easier answers. But, like most things in life, what’s worth having is hard to get.

As always, I’m looking forward to your thoughts (and your shared experiences) on this tough issue. I’m also hopeful that, at some point in the future, we’ll be able to run some correlations on sites that aren’t fully indexed to show how metrics like link counts or domain importance may relate to indexation numbers.

More Screen Shots of Google’s New Beta AdSense Interface

A couple of weeks ago, we reported about Google’s new AdSense interface and posted one screen shot, provided by Google. I just gained access to the new interface myself and I took many screen shots. Before I provide the screen shots, I wanted to share with you both of the URLs they gave me to access the beta interface.

The link at the top right is in red and says “Try new AdSense.” When you click it, it takes you to https://www.google.com/adsense/enablebeta and then redirects you to https://www.google.com/adsense/v3/app. I believe you need to be added to the beta to gain access, but those are the URLs.

Here are screen shots, with sensitive info blocked out:

New Google AdSense Interface

New Google AdSense Interface

New Google AdSense Interface

New Google AdSense Interface

New Google AdSense Interface

New Google AdSense Interface

New Google AdSense Interface

Forum discussion continued at Google AdSense Help, DigitalPoint Forums, and WebmasterWorld.
