Duplicate Content Between HTML & PDF Pages? Google Should Figure It Out

A Google Webmaster Help thread discusses potential duplicate content issues between HTML and PDF documents. In this case, the content of the HTML pages is the same as that of the PDFs, whether the PDFs come from an automated “print as PDF” feature or are offered as a manual download.

How does Google handle the duplicate nature of such content available on the web?

JohnMu at Google chimed in, saying that in most cases Google will use the HTML file. If that causes a problem for your site, he suggests blocking the PDFs from being indexed, but he would not do so without confirming that it is really necessary. Ultimately, he said, that is your call. Google will likely just want to keep the HTML version in its index.

John said:

If you have the same content in PDF as in HTML pages, in most cases we’ll probably show the HTML versions above (or in place of) the PDF versions. If this is a problem for your specific situation, I’d consider using the robots.txt or x-robots-tag to prevent the PDF files from getting indexed. I imagine for most sites this is not really a problem, so I wouldn’t suggest blocking indexing of PDF files without confirming that it’s really necessary.

The only situation where I would consider doing something in advance is when the CMS automatically creates PDF-copies of normal HTML pages. Generally speaking, this shouldn’t cause any problems, but those PDF versions are likely not compelling enough to merit getting indexed separately (and crawling them will possibly put a load on your server that you could avoid). Ultimately, it’s up to you to determine which content you wish to have crawled and indexed :-) — if you feel that PDF-copies of your content are compelling enough for users who search for your content, feel free to make them available.
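To make John's two options concrete, here is a minimal Python sketch. It assumes a hypothetical site layout where the generated PDF copies live under `/pdf/`; the helper names `is_crawlable` and `robots_headers` are made up for illustration and are not part of any Google tooling. The first helper checks a robots.txt rule with the standard library's parser, and the second returns the `X-Robots-Tag` header a server could attach to PDF responses. Keep in mind that these are alternatives rather than a pair: a crawler that is blocked by robots.txt never fetches the file, so it never sees the header.

```python
from urllib.robotparser import RobotFileParser

# Assumption: generated PDF copies are served from /pdf/ (hypothetical layout).
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdf/
"""


def is_crawlable(path, user_agent="Googlebot"):
    """Check whether ROBOTS_TXT allows user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


def robots_headers(path):
    """Return extra response headers for path.

    PDF responses get "X-Robots-Tag: noindex"; everything else gets none.
    """
    if path.lower().endswith(".pdf"):
        return {"X-Robots-Tag": "noindex"}
    return {}


if __name__ == "__main__":
    for path in ("/articles/post.html", "/pdf/post.pdf"):
        print(path, is_crawlable(path), robots_headers(path))
```

Which option fits depends on the goal: the robots.txt rule saves the crawl load John mentions, while the header lets the PDFs be crawled but keeps them out of the index.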

Forum discussion at Google Webmaster Help.





Bing Doesn’t Support the Canonical Tag At All Right Now

There is this old and upsetting thread in the Bing Forums about how Bing handles the canonical tag. The thread is filled with misinformation. Matt McGee’s post at Search Engine Land a week ago says it clearly.

Bing says it’s still working on supporting the canonical tag on a single domain, and suggests webmasters should rely on other means to manage duplicate content.

You got that right: 11 months ago, Google, Yahoo and Bing announced support for the canonical tag. As far as I know, only Google really uses it, and it even added cross-domain canonical support this month. Where does Bing stand on this? Well, in the next several months it hopes to support single-domain use of the canonical tag and, hopefully soon after, cross-domain use. So it will have taken Bing over a year from announcing support for this tag to actually supporting it?

I am not too upset about that, to be honest. What I am more upset about is that official Bing support representatives are pretty much lying in the Bing Forums. Brett Yount, the Product Manager of Bing Webmaster Center, said:

According to our blog post, http://www.bing.com/community/blogs/webmaster/archive/2009/02/12/partnering-to-help-solve-duplicate-content-issues.aspx, the canonical tag is used as a hint only.

No, it is not used as a hint or anything. It is not used, period. Not yet. Maybe in four months, but not yet.
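For anyone who has not worked with it, the canonical tag is a `<link rel="canonical" href="...">` element in a page's head that names the preferred URL for that content. If you want to audit which of your pages declare one, a rough Python sketch using only the standard library could look like the following. The class name `CanonicalFinder`, the helper `find_canonical`, and the example.com URL are made up for illustration, and this says nothing about how any search engine actually processes the tag.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Only <link> elements matter, and only the first canonical one.
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values; compare case-insensitively.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in html, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


if __name__ == "__main__":
    # Hypothetical page: a tracking-parameter URL pointing at its clean version.
    page = (
        "<html><head>"
        '<link rel="canonical" href="http://www.example.com/product">'
        "</head><body>Product page</body></html>"
    )
    print(find_canonical(page))
```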

Forum discussion at Bing Forums.





Whiteboard Friday – Content & Technology Licensing

Posted by great scott!

Looking for a super-powerful tactic to build lots of high-quality links? Well, we’ve got a winner for you! Licensing your content and/or data can be an immensely effective, highly scalable strategy for building links and brand awareness alike. It works especially well for folks who have quality content or data and want to leverage that material into a great link building solution.

Be warned though: there are important rules to consider in order to avoid potential duplicate content issues as well as cannibalization. You want your content licensing working for you, not against you, so watch this week’s WBF to learn how you can manage licensing arrangements to best reap the benefits…

SEOmoz Whiteboard Friday – Content & Technology Licensing from Scott Willoughby on Vimeo.


SMX East Lands in NYC

Search Marketing Expo East kicks off in New York City on Monday, with many Yahoos on various panels to talk about important issues in the SEO and SEM community. We hope you’ll stop by SMX East and check out what we’re up to!

Monday, October 5, 2009

Time: 10:45 a.m. – 12 p.m.
Panel: Duplicate Content Issues: The Search Engine Edition
Speaker: Cris Pierry, Senior Director, Search, Yahoo! Search

Time: 3:45 p.m. – 5 p.m.
Panel: Maps, Maps, Maps!
Speaker: Atif Rafiq, Director, Product Marketing, Yahoo! Local, Yahoo!

Time: 3:45 p.m. – 5 p.m.
Panel: Trademarks & Paid Search: How Have Things Changed
Speaker: Laura Covington, Associate General Counsel, Global Brand and Trademarks, Yahoo!

Tuesday, October 6, 2009

Time: 12 p.m. – 1:30 p.m.
Panel: Ask The Search Engines: Best Practices Edition
Speaker: Cris Pierry, Senior Director, Search, Yahoo! Search

Time: 4:45 p.m. – 6 p.m.
Panel: Universal & Blended Search Opportunities
Speaker: Larry Cornett, Vice President of Consumer Products, Yahoo! Search

Wednesday, October 7, 2009

Time: 9 a.m. – 10:15 a.m.
Panel: Search Meet Display; Display Meet Search
Speaker: Antony Taylor, VP, Display Platforms, Yahoo!

Time: 11:45 a.m. – 12:45 p.m.
Panel: Managing Search Across Business Units
Speaker: David Roth, Director of Search Marketing, Yahoo!

Time: 2 p.m. – 3 p.m.
Panel: Ask The Paid Search Reps
Speaker: David Miller, Director, Sponsored Search Product Management, Yahoo! Inc.

See the complete list of panelists and location.
