PDF Document Management Software, Services & Support

Make Your PDFs Work Well with Google (and other search engines)

by Duff Johnson

During any given business day, I use Google hourly, if not more often. I also search local and network hard drives looking for proposals, client files and so on. Whether I think about it or not, full-text search is a big part of how I do my job.

On many of my searches, naturally enough, lots of PDF files come up in the search results. This makes sense - Google does index PDF files, and PDFs represent a large volume of the pages actually accessed online. So far, so good.

Now for the problem. In Google's search results, and in the results of most other search engines, the listings of most PDF files appear at best unprofessional, and at worst, downright embarrassing.

How bad is the problem?

I performed a simple experiment. Your mileage may vary, but I doubt the results will be significantly different.

I conducted 10 more-or-less random searches. Google's search results included an average of 4.3 PDF files on the first search results page of each search. Of those PDFs, an average of 60 percent were displayed with totally meaningless Titles.

Let's look at why that happens, and how you can fix this problem with PDFs you make available for indexing and searching online.

 

The anatomy of Google's search results

The blue underlined text in Google's search results comes from one of two places in a PDF. First, Google looks in the "Title" document information field. While it is simple for document creators to add this information to their PDFs, real-world search results demonstrate that most PDF Title fields are either empty, bogus or otherwise malformed. To make things worse, many authoring applications place nonsensical information in, or even discard data from, the document information fields, creating a search-results "look and feel" that can range from confusing to totally meaningless.

(To check a PDF's Title information in Acrobat, use the Control-D keyboard shortcut or go to File > Document Properties, then click the Description tab. You can add or correct the document's title, author, and other fields as desired. But Title is essential!)
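Checking Titles by hand works for a few files. For a larger batch, a rough screen along the following lines could flag the ones worth opening in Acrobat. This is a minimal sketch: the function name and the list of suspect patterns are illustrative assumptions, not rules used by Google or Acrobat, and it expects the Title strings to have been extracted already by whatever tool you use.

```python
import re

# Hypothetical patterns for Titles that tell a searcher nothing.
# They are illustrative assumptions, not a list taken from any product.
_SUSPECT_PATTERNS = [
    re.compile(r"^\s*$"),                               # empty or whitespace only
    re.compile(r"^untitled", re.IGNORECASE),            # application default
    re.compile(r"^microsoft word - ", re.IGNORECASE),   # authoring-tool prefix
    re.compile(r"\.(docx?|xlsx?|pptx?|indd|qxd|pdf)$", re.IGNORECASE),  # bare file name
]


def title_needs_attention(title):
    """Return True if a Title value looks empty, defaulted, or file-name-like."""
    if title is None:
        return True
    return any(p.search(title) for p in _SUSPECT_PATTERNS)
```

A Title such as "Microsoft Word - report_v3.doc" would be flagged, while "2011 Annual Report" would pass. Passing the screen only means the Title is not obviously broken; whether it actually describes the document still needs a human eye.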

Be sure your PDF's document information fields

If Google finds nothing in a PDF file's Title field, the second place it looks is more or less the first chunk of text it encounters in the document. This might be the title (if that's the first text on the page), but it's just as likely to be unhelpful code or simply miscellaneous text from somewhere on the first page of the document. Google uses this text as a "stand-in" for the Title in search results - an approach that fails far more often than it succeeds.
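The two-step lookup described above can be modeled in a few lines. This is only a sketch of the behavior as observed from the outside, not Google's actual code; the function name and the 60-character cut-off (max_len) are assumptions made for illustration.

```python
def search_result_title(info_title, first_page_text, max_len=60):
    """Model the two-step lookup: Title field first, else leading page text."""
    title = (info_title or "").strip()
    if title:
        return title
    # Fallback: collapse whitespace and take the first chunk of page text.
    # max_len is a hypothetical cut-off chosen for this example.
    leading = " ".join((first_page_text or "").split())
    return leading[:max_len]
```

With a Title of "Annual Report 2005" the function returns that Title. With an empty Title and a first page that begins "Page 1 of 12" followed by "CONFIDENTIAL  DRAFT", it returns "Page 1 of 12 CONFIDENTIAL DRAFT", which is exactly the kind of listing that makes a document look unprofessional.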

When you fail to ensure a valid Title in a PDF, search results won't show the vital information that can assist users in choosing the correct document to open. The result is slower, less-reliable searches for every user, every time they search.

Other considerations for optimizing PDFs for search-engine use

PDF Specification: As of this writing (January 2006), it appears that Google doesn't index PDF files saved to version 1.6 of the specification, the latest version fully compatible with Acrobat 7.0. To work around this, use PDF Optimizer in Acrobat 7.x Professional (Advanced > PDF Optimizer...) to set your PDF version to 1.5 or 1.4, which makes your file's contents available to Google's indexing engine.
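To find out which files need this treatment without opening each one, the version declared in the file header can be read directly. The sketch below assumes the usual "%PDF-x.y" header near the start of the file. The function names are made up for this example, and the check ignores the optional Version entry in the document catalog, which can override the header in PDF 1.4 and later, so treat the result as a first pass only.

```python
import re

_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data):
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    # The header is expected at the very start; some files carry a few
    # stray bytes first, so search a small leading window instead.
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def needs_downgrade(data, max_minor=5):
    """True if the header declares a version newer than 1.<max_minor>."""
    version = pdf_header_version(data)
    return version is not None and version > (1, max_minor)
```

Typical use would be to read the first couple of kilobytes of a file opened in binary mode and pass them to needs_downgrade. Files that come back True are candidates for PDF Optimizer.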

To ensure that all search engines can index your content,

File-size limits: Google does not index every word in every PDF. There's a size limit - variously reported as being between 100 and 500 KB - to the text that Google will attempt to extract and index from any given file. If you are posting large PDFs and it's critical that Google indexes all of the content, consider posting documents by chapter or splitting them at another natural breaking point. This way, Google is less likely to stop indexing at, say, page 57 of a 112-page document.
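To see roughly where a limit like this would bite, you can estimate the amount of extractable text per page and group pages so that no posted file exceeds a budget. The sketch below is an assumption-laden illustration: the function name is invented, the per-page byte counts must come from your own text-extraction tool, and the default budget simply takes the low end of the reported range. Chapter boundaries remain the better place to split; this only shows how many pieces to expect.

```python
def plan_splits(page_text_bytes, budget=100_000):
    """Group consecutive pages into (first, last) ranges, 1-based inclusive,
    so each range's extracted text stays within the byte budget."""
    ranges = []
    start = 1
    running = 0
    for page, size in enumerate(page_text_bytes, start=1):
        # Start a new range when adding this page would exceed the budget,
        # unless the range is still empty (an oversized page stands alone).
        if running and running + size > budget:
            ranges.append((start, page - 1))
            start, running = page, 0
        running += size
    if page_text_bytes:
        ranges.append((start, len(page_text_bytes)))
    return ranges
```

For five pages of about 40,000 bytes each, the plan is pages 1-2, 3-4 and 5. Nudging those boundaries to the nearest chapter start gives files that are both fully indexable and sensible to a reader.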

Content Reading Order: If controlling and optimizing the way search engines index your PDFs matters to you, you'll eventually want to get familiar with the content reading order - the order in which search engines extract text from the document for indexing. Content ordering is not a casual process, but it can result in dramatically improved search results, especially for search engines that display search terms in context.

To begin defining content order in Acrobat Professional, first find out whether your file is Tagged. (Control-D keyboard shortcut, then check the "Description" tab)....

This little tell-tale is prima-facie evidence of inaccessible content.

If your PDF isn't tagged, you can quickly tag it using the Advanced > Accessibility > Add Tags to Document command. To view how the content is currently ordered, open the Touch Up Reading Order Tool in the same Accessibility menu.

Get the reading-order right, and so will Google.

Conclusion

Most organizations posting documents to their intranet or Internet file servers want those documents to be found by other people. Corporate intranets rely on search engines to index and retrieve all manner of internal documents for use every day. To the extent that PDF files make up a meaningful volume of your searchable content (and you wouldn't have read this far unless they do), you owe it to yourself to make sure your PDFs will look their best under the relentless gaze of the search engines.

Key Takeaways:

  • Check each PDF file's "Description" (in Document Properties) before posting.
  • Break large PDFs into chapters before posting to ensure Google indexes all the content.
  • Add structure to PDF files so Google indexes the content you want displayed in search results.

Originally posted on Duff Johnson's PDF Perspective blog for acrobatusers.com.

