In Part 2 of this post, we continue our look at PDF conversion and discuss the state of the art in preparing PDFs for translation.
What’s the big deal with PDF conversion?
Why not simply cut and paste? It’s easy to do and doesn’t cost anything.
Extracting textual information from PDFs – though time-consuming – can seem relatively easy at first glance. You can copy and paste, take screenshots and even manually retype any needed information. However, it becomes nearly impossible when copying from the PDF isn’t allowed or when a pasted section produces unusable results. Also, like an iceberg, a PDF hides most of its complexity below the surface: the visual part of a PDF document – the look and feel – is only the tip.
Are all PDFs created equal?
Every PDF has its own shape and features – no two are the same. There are several different flavors of PDF, but they all reduce to essentially two types: Distilled PDF and Scanned PDF. You get a Distilled PDF when you produce a PDF document from a text publishing tool (via Acrobat Distiller or other PDF writers). A PDF can also contain a raster image of each page of the document (with or without some text behind the image to allow text searching). These PDF documents are referred to as Scanned PDFs. You get these when you scan paper documents (via Acrobat Exchange or some other method).
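To make the distinction concrete, here is a minimal sketch of how the two types could be told apart automatically. It assumes you have already pulled two numbers per page out of the file with a PDF library of your choice: how many characters sit in the text layer, and what fraction of the page is covered by raster images. The names `PageStats` and `classify_pdf` and both thresholds are our own illustrative choices, not part of any Adobe product or standard.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Per-page numbers obtained beforehand from a PDF library (hypothetical input)."""
    text_chars: int        # characters found in the page's text layer
    image_coverage: float  # fraction of the page covered by raster images, 0.0 to 1.0


def classify_page(page, min_chars=20, full_page=0.9):
    """Classify one page; both thresholds are illustrative guesses."""
    has_text = page.text_chars >= min_chars
    is_page_image = page.image_coverage >= full_page
    if is_page_image and has_text:
        return "scanned+text"   # page image with a searchable text layer behind it
    if is_page_image:
        return "scanned"        # page image only, OCR needed
    if has_text:
        return "distilled"      # real text produced by a publishing tool
    return "empty"


def classify_pdf(pages):
    """Return the single page type if all non-empty pages agree, else 'mixed'."""
    kinds = {classify_page(p) for p in pages} - {"empty"}
    if not kinds:
        return "empty"
    return kinds.pop() if len(kinds) == 1 else "mixed"


if __name__ == "__main__":
    brochure = [PageStats(1800, 0.15), PageStats(2100, 0.0)]
    old_contract = [PageStats(0, 1.0), PageStats(0, 1.0)]
    print(classify_pdf(brochure))
    print(classify_pdf(old_contract))
```

A check like this is only a first triage step. A "mixed" result usually means someone inserted scanned pages into an otherwise distilled document, which is worth knowing before quoting a conversion effort.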
Does the type of PDF created matter?
Yes, it does. When it comes to converting PDFs into an editable format, the nature of the PDF does matter. Extracting text from a Scanned PDF is not that simple: it requires good OCR software and at least some tailoring to the problem at hand. Complications arise when, for instance, the image is noisy or text pixels cannot be clearly distinguished from the background. In such cases the OCR process does not work as smoothly, because its accuracy depends on the quality of the provided PDF, and the converted documents will usually require a lot of cleanup.
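One way to put a number on "text pixels cannot be distinguished from the background" is Otsu's thresholding method, a classic technique for splitting a grayscale image into dark and light pixels. The sketch below is our own plain-Python illustration, not what any particular OCR product does internally. It assumes the page is given as a flat list of integer gray levels from 0 (black) to 255 (white), and it also returns a separability score between 0 and 1, where values near 1 mean ink and paper are cleanly separated.

```python
def otsu_threshold(pixels):
    """Return (threshold, separability) for integer gray levels 0-255.

    Pixels <= threshold are treated as 'dark' (ink), the rest as 'light' (paper).
    Separability is between-class variance divided by total variance.
    """
    hist = [0] * 256
    for v in pixels:
        hist[v] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark, sum_dark = 0, 0
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0:
            continue          # no dark pixels yet
        if n_light == 0:
            break             # every pixel is already on the dark side
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        # Between-class variance, scaled by total**2 to stay in integer counts.
        var_between = n_dark * n_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t

    mean_all = sum_all / total
    spread = sum(count * (level - mean_all) ** 2 for level, count in enumerate(hist))
    separability = best_var / (total * spread) if spread > 0 and best_var > 0 else 0.0
    return best_t, separability


if __name__ == "__main__":
    crisp_scan = [20] * 30 + [230] * 70        # black ink on white paper
    muddy_scan = list(range(100, 150)) * 2     # faded ink on gray paper
    print(otsu_threshold(crisp_scan))
    print(otsu_threshold(muddy_scan))
```

A low separability score is a useful early warning: it suggests the scan should be cleaned up or rescanned before OCR, rather than fixed by hand afterwards.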
What types of documents will convert easily?
It is important to note that a fully optimized process is unrealistic when it comes to translating PDFs, but as a general rule, the simpler the layout of the source documents, the better the converted documents will be. For instance, if you are converting novels, since there is typically not much layout in the source documents, you can expect a lot of success (and hence very little cleanup) in converting these to editable format. If, on the other hand, you’ve got complex pages such as scanned scientific journal pages, which are likely to contain multiple columns, lots of complex tables, math, footnotes and bibliographies, you should expect to do a fair amount of cleanup on the converted documents.
Is there anything happening to make PDF conversion easier in the future?
Several tools have been designed and developed to interact with PDF documents. Besides the common Adobe products and solutions, third-party developers offer many different software packages and APIs, either under license or as freeware. Consequently, a wide range of PDF tools is available on the market. Most of them allow for the extraction of textual content, but their practical use is limited in the sense that the text’s reading order is not necessarily preserved, especially when handling multi-column documents or complex layouts.
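The reading-order problem is easy to reproduce. The sketch below assumes an extraction tool has handed us text blocks as `(x0, y0, x1, y1, text)` tuples, with `y` growing down the page; that tuple layout and both function names are our own assumptions for illustration. Sorting purely top to bottom interleaves the two columns, while grouping blocks into columns first restores the order a reader would follow.

```python
def naive_order(blocks):
    """Sort top to bottom, then left to right: this interleaves columns."""
    return [b[4] for b in sorted(blocks, key=lambda b: (b[1], b[0]))]


def column_aware_order(blocks):
    """Group blocks into columns by horizontal overlap, then read each column down.

    Simplifying assumption: no block (such as a full-width heading) spans
    several columns. Real layouts need extra handling for those.
    """
    columns = []  # each column: {"right": rightmost x1 so far, "blocks": [...]}
    for block in sorted(blocks, key=lambda b: b[0]):
        x0, _, x1, _, _ = block
        if columns and x0 < columns[-1]["right"]:
            # Starts inside the current column's horizontal extent: same column.
            columns[-1]["blocks"].append(block)
            columns[-1]["right"] = max(columns[-1]["right"], x1)
        else:
            columns.append({"right": x1, "blocks": [block]})

    ordered = []
    for column in columns:
        ordered.extend(sorted(column["blocks"], key=lambda b: b[1]))
    return [b[4] for b in ordered]


if __name__ == "__main__":
    page = [
        (310, 100, 550, 180, "right column, paragraph 1"),
        (50, 220, 290, 320, "left column, paragraph 2"),
        (50, 100, 290, 200, "left column, paragraph 1"),
        (310, 200, 550, 320, "right column, paragraph 2"),
    ]
    print(naive_order(page))
    print(column_aware_order(page))
```

Even this simple grouping breaks down on pages with headings that span both columns, sidebars or tables, which is part of why complex layouts still need manual cleanup after conversion.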
Adobe Acrobat X Pro [ https://acrobat.adobe.com/us/en/acrobat/acrobat-pro.html ] does a startlingly good job of exporting PDF files into editable Word or Excel documents. It isn’t perfect, and didn’t select the correct fonts when exporting my test documents, but it did a far better job of preserving the original format than anything I’ve seen in third-party software. This export function worked best when I used Distilled PDFs rather than PDFs created from a scanned image. Scanned PDFs, in contrast, contain only a picture of the original text, and Acrobat can only extract the text by using its built-in Optical Character Recognition (OCR) software. Acrobat X has more accurate OCR than previous versions did, but it still lags far behind the best third-party OCR software, such as ABBYY FineReader 10 Professional Edition [ https://www.abbyy.com/finereader/ ].
Our experience is that you need to experiment with various options to see which ones best fit your needs and work best with your PDF documents. Our approach is to constantly re-evaluate the various tools, methods and techniques available and to incorporate the best of what’s out there into what we do.
The fact is that all PDF files used as a source for translation need reworking before they can be translated into several languages. By making the native source documents available to your translation partner, you avoid that rework and any unnecessary preparation of the documents before translation can start. It will allow us to perform a full analysis, and it will let you stay in control of your budget and schedule without any surprises down the road. PDFs serve a purpose, but when it comes to translation, there is nothing better than the real thing: native source documents (such as FrameMaker, InDesign, QuarkXPress, etc.).