Duplicate Content Between HTML & PDF Pages? Google Should Figure It Out

A Google Webmaster Help thread discusses a potential duplicate content issue between HTML and PDF documents. In this case, the content on the HTML pages is the same as in the PDFs, whether the PDF comes from an automated “print as PDF” feature or from a manual download of the content in PDF format.

How does Google handle the duplicate nature of such content available on the web?

JohnMu at Google chimed in, saying that in most cases Google will use the HTML version. If that is a problem for your site, he suggests blocking the PDFs from being indexed, but only after confirming it is really necessary. Ultimately, he said, that is your call. Google will most likely just keep the HTML version in its index.

John said:

If you have the same content in PDF as in HTML pages, in most cases we’ll probably show the HTML versions above (or in place of) the PDF versions. If this is a problem for your specific situation, I’d consider using the robots.txt or x-robots-tag to prevent the PDF files from getting indexed. I imagine for most sites this is not really a problem, so I wouldn’t suggest blocking indexing of PDF files without confirming that it’s really necessary.

The only situation where I would consider doing something in advance is when the CMS automatically creates PDF-copies of normal HTML pages. Generally speaking, this shouldn’t cause any problems, but those PDF versions are likely not compelling enough to merit getting indexed separately (and crawling them will possibly put a load on your server that you could avoid). Ultimately, it’s up to you to determine which content you wish to have crawled and indexed :-) — if you feel that PDF-copies of your content are compelling enough for users who search for your content, feel free to make them available.
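As a rough illustration of the two options John mentions, the Python sketch below checks a robots.txt rule and builds an X-Robots-Tag header. The /pdf/ directory, the example.com URLs, and the function names are assumptions made for this example, not anything from the thread. Python’s standard-library parser only does simple prefix matching, so the rule uses a directory path rather than a wildcard.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: assumes the auto-generated PDF copies all live under /pdf/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdf/
"""


def is_crawl_allowed(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def extra_headers_for(path):
    """Return extra HTTP response headers; PDF files get 'X-Robots-Tag: noindex'."""
    if path.lower().endswith(".pdf"):
        return {"X-Robots-Tag": "noindex"}
    return {}


if __name__ == "__main__":
    print(is_crawl_allowed(ROBOTS_TXT, "https://example.com/pdf/page.pdf"))
    print(is_crawl_allowed(ROBOTS_TXT, "https://example.com/page.html"))
    print(extra_headers_for("/pdf/page.pdf"))
```

Keep in mind that the two mechanisms differ: robots.txt stops crawling, while an X-Robots-Tag header is only seen when the file can actually be crawled, so you would normally pick one or the other for a given URL.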

Forum discussion at Google Webmaster Help.





Google to Add PDF Support to Fetch As Googlebot?

A couple of months ago, Google released an incredibly useful feature in Webmaster Tools Labs named Fetch as Googlebot. It lets you see what Googlebot sees, which helps you spot crawl issues, hacks, injected links and other webmaster-related issues. But when it comes to PDFs, I don’t think the tool works properly (yes, it is in Labs).

A thread in the Google Webmaster Help forums has one webmaster asking why the feature doesn’t work with his PDFs. He asked:

For example, in the URL in question, http://www.knowitall.com/literature/spec/95731_Pharmaceutical_Excipients_Spec_Sheet.pdf, the text “Pharmaceutical Excipients Database” is in the pdf, but in the “Fetch as GoogleBot” results window, none of those terms are found–the results are basically in binary format. The document is found by the Google Search engine so it is apparently extracting the human readable text.
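One plausible reason for this, offered here as an assumption rather than anything confirmed in the thread, is that the tool shows the raw bytes of the file, and text inside a PDF is commonly stored in compressed streams. The Python sketch below mimics that with zlib: the phrase is not readable in the compressed bytes, but it is there once the stream is decompressed. The sample content and function names are made up for illustration, and this is not how Google extracts text.

```python
import zlib

PHRASE = b"Pharmaceutical Excipients Database"

# Hypothetical page content; a real PDF wraps text in its own stream syntax.
PAGE_TEXT = b"Title: " + PHRASE + b"\nThis sheet describes the " + PHRASE + b".\n"


def make_compressed_stream(text):
    """Compress text with zlib, similar in spirit to a deflate-compressed PDF stream."""
    return zlib.compress(text)


def phrase_visible(raw, phrase):
    """Return True if phrase appears as plain bytes inside raw."""
    return phrase in raw


if __name__ == "__main__":
    raw = make_compressed_stream(PAGE_TEXT)
    print(phrase_visible(raw, PHRASE))                   # searching the raw bytes
    print(phrase_visible(zlib.decompress(raw), PHRASE))  # searching after decompression
```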

I couldn’t run a test on that document, but I used a PDF on my server to compare. I ran five different tests on two different domains, with a bunch of different types of PDF documents, and they all came back as gibberish, binary-looking results in Fetch as Googlebot. Here is one sample screenshot:

[Screenshot: Fetch as Googlebot PDFs]

Now, Susan Moskwa from Google said in that thread:

FYI we’re looking into this issue, so sit tight. If it looks okay in search results the problem is probably not with your site (we’ve been able to reproduce it for other sites as well). Thanks for letting us know.

So it seems like they may get the Fetch as Googlebot feature working for PDF documents?

Forum discussion at Google Webmaster Help.




