— I have to translate PDF documents into several languages.
Can this be done easily?
Perhaps you’ve heard a question like this before and are still looking for an answer. Before going any further, let’s take a closer look at the PDF format.
These days a huge number of documents are either created in or converted to PDF format. It has become the universal exchange format for electronic documents. The main advantage (and attraction) of PDF is its ability to preserve the look and feel of the original document by describing the low-level structural objects. Most PDF documents, however, are untagged and do not contain the basic high-level logical structure information. One consequence, among others, is that this makes extracting information particularly difficult: a true disadvantage in an age of open and flexible data structures.
Now, back to the question… how can I easily translate PDF documents?
The answer is an unqualified MAYBE. Unfortunately, people are often led to believe that the PDF format is a publishing format and should therefore be easy to convert to MS Word. In fact, PDF is more a description of what the printed page looks like than a description of the document’s structure.
So, from a translation process perspective, the challenge is to develop techniques that allow the content to be changed without losing all the formatting work already done. At this point, though, we have to say that process optimization is a utopia when it comes to PDF document translation. Preparing PDF files that are suitable for translation will continue to be a major issue, and it will get worse as more PDF creation methods become available.
— So, you’re saying that this can’t be done?
No, not exactly. What we are saying is that for typical PDF documents, file preparation can be partially automated, but you should expect that other parts will need to be done manually. Of course, the supported features depend greatly on the quality, the complexity, and the level of markup required in the target documents.
The page elements that a human reader usually recognizes easily, such as headers and footers, text flows, page grid, margins, line art, raster images, tables, headings, and callouts, can be challenging for an automated tool to recognize. File preparation should be done carefully, taking into account all the particular structures that exist in the source PDF documents. If the documents do not share a consistent appearance, it becomes an even more difficult and time-consuming task.
To give you an idea of what we are talking about, let’s take a look at one simple layout feature that can cause a lot of problems: columns. Many publications are printed in a multiple-column layout. What this means, of course, is that the PDF also contains the multiple columns. But since PDF is basically a page layout format, it only contains information about the letters on the page and where they are to be printed. There is nothing in the PDF that specifies that some copy is in column one and other copy is in column two (or even that there are in fact two columns). The conversion tool must therefore analyze the geometry of the page and attempt to recognize a column layout. When the margin is tight and two columns are quite close together, conversion tools can often get confused and miss a multiple-column layout, thereby horribly mangling together the text from two totally unrelated paragraphs. In this case, it will be virtually impossible to extract any usable text from such a PDF, so the linguist will have to retype the entire source document before translation can start.
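To make the column problem concrete, here is a minimal Python sketch of the kind of geometric analysis described above. It is an illustration under stated assumptions, not how any particular conversion tool works: each word is reduced to a hypothetical (x0, x1) pair giving its left and right edges in points, and the names `find_gutters`, `assign_columns`, and the `min_gap` threshold are invented for this example.

```python
from typing import List, Tuple

# Assumption: each word on the page is reduced to (x0, x1),
# its left and right edges in points. Real PDFs need a parser for this.
Box = Tuple[float, float]


def find_gutters(boxes: List[Box], min_gap: float = 12.0) -> List[Tuple[float, float]]:
    """Return horizontal gaps of at least min_gap that no word overlaps."""
    if not boxes:
        return []
    spans = sorted(boxes)          # sweep from left to right
    gutters = []
    covered_to = spans[0][1]       # right edge of the area covered so far
    for x0, x1 in spans[1:]:
        if x0 - covered_to >= min_gap:
            gutters.append((covered_to, x0))
        covered_to = max(covered_to, x1)
    return gutters


def assign_columns(boxes: List[Box], gutters: List[Tuple[float, float]]) -> List[int]:
    """Give each word a column index: the number of gutter midpoints to its left."""
    cuts = [(a + b) / 2 for a, b in gutters]
    return [sum(x0 >= c for c in cuts) for x0, _ in boxes]


# Two columns separated by a comfortable 30-point gutter.
wide = [(50, 120), (130, 280), (310, 400), (410, 540)]
# The same page with a tight 5-point gutter.
tight = [(50, 120), (130, 295), (300, 400), (410, 540)]

print(find_gutters(wide))                        # [(280, 310)]
print(assign_columns(wide, find_gutters(wide)))  # [0, 0, 1, 1]
print(find_gutters(tight))                       # []
```

With the wide gutter the two columns are separated correctly. With the tight gutter the gap between columns is no larger than ordinary word spacing, so the sketch sees a single column and the text of both columns would be read straight across. A heading that spans both columns would also hide the gutter in this simple version, which is one reason real pages are so much harder.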
Our next posts will continue covering some aspects and issues related to PDF conversion.
Stay tuned! The game is not over yet!