This is a guest post by Reed James. Enjoy, and please share!
I am surprised by the number of discussions in the ProZ.com forums about how to convert PDFs to an editable format such as Microsoft Word. In my opinion, the shortest and best answer is: it cannot be done. I say this because to my mind, conversion entails an almost perfect transfer of page layout, fonts and formatting such as bold, underline and italics. And that is when the formatting is simple. Tables and images deserve an explanation in and of themselves.
There are two kinds of PDFs, each with its own difficulties. A live PDF is one whose text is (sort of) extractable and therefore editable right off the bat: you can copy the text from the PDF and paste it into a Word document. This does not necessarily mean that you will succeed in completely converting the live PDF into a Word document (again, I can't stress enough that those tables and images wreak havoc during the conversion process).
Then you have the dead PDF. Think of this variety as digital paper. Someone had a piece of paper (hopefully several pieces of paper) they wanted to put on their computer. So they scanned it and it became a PDF. But to you the translator, it is still a piece of paper; it just happens to live on your computer screen. You will find all of the characteristic flaws of a piece of paper: crooked pages, smudges, illegible handwriting and some more fun stuff thrown in for your pleasure. In fact, these PDFs can be worse than paper because many times they are at least once removed from that original piece of paper. My guess is that they are photocopies or faxes. This can make Arabic numerals especially hard to discern. Was that a number 6 or an 8? (Did I mention that when you are squinting at these numbers and trying to make a sound decision, it is 1:00 AM and you are dying to get some shuteye?)
Let's start with those live PDFs. I'm afraid you can't kill them, but I do recommend that you anesthetize them and make quick work of them with your scalpel in the operating room. First of all, you have to decide how good or bad the formatting is. If it is a simple, straightforward text, you can probably convert it. There is some pretty good software such as Solid PDF or ABBYY FineReader. If the output is a clean-looking Word document with no text boxes or paragraph marks at the end of each line, you're home free and ready to start translating in your CAT tool of choice. If not, a simple but somewhat time-consuming solution is to copy all of the text or save it to a text file and then either paste or open it in Word. Then you will have to format the document yourself. Depending on what you've agreed upon with the client, you may not have to reproduce the formatting exactly. Fortunately, your text is clean and the formatting job is just a matter of time. Maybe you want to put on your favorite music to lessen the drudgery.
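Text copied or saved out of a PDF often arrives with a hard line break at the end of every line, which is exactly what trips up a CAT tool's segmentation. Here is a minimal Python sketch of one way to clean that up before pasting into Word. The function name `unwrap_hard_breaks` is my own, and the sketch assumes that paragraphs are separated by blank lines, which will not hold for every PDF. It also rejoins words hyphenated across a line break, so a genuine compound such as "time-consuming" that happens to break at the hyphen will need fixing by hand.

```python
import re


def unwrap_hard_breaks(text):
    """Join lines broken mid-paragraph; keep blank lines as paragraph breaks."""
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        lines = [line.strip() for line in para.split("\n") if line.strip()]
        if not lines:
            continue
        joined = lines[0]
        for line in lines[1:]:
            # Rejoin a word hyphenated across a line break ("para-" + "graph").
            if joined.endswith("-") and line[0].islower():
                joined = joined[:-1] + line
            else:
                joined += " " + line
        cleaned.append(joined)
    return "\n\n".join(cleaned)


raw = "This is a para-\ngraph broken\nacross lines.\n\nSecond paragraph\nhere."
print(unwrap_hard_breaks(raw))
```

Save the result as a plain text file and open it in Word; the formatting still has to be done by hand, but at least each paragraph is now a single unit.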
If your PDF is dead, you will have to extract its text; there is no chance of copying and pasting here. Whatever its condition (and I've seen mangled and mutilated dead PDFs that need an autopsy), make sure you run it through an OCR (optical character recognition) program. I can only recommend ABBYY FineReader, because it's the only software I've systematically used over the years. Now take a look at the output. Is the target document riddled with text boxes, funny characters or chunks of images of text? If you see any of the above, run for your life! No, wait! You already agreed to translate this document, so here is what you can do:
Of course, if you have a clean target document, the only thing you need to do is go over it and check that the spelling, and the words themselves, are in order. After a while, you get a feel for what mistakes ABBYY FineReader will throw your way: l instead of 1, garbage text and nn for m, to give you a few examples. If you like, you can use the built-in spell checker and correct the text highlighted in blue. The key here is to never lose track of the source text and check it thoroughly against the document that underwent OCR.
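As a rough aid for that proofreading pass, here is a minimal Python sketch that flags number-like tokens containing letters that OCR tends to confuse with digits, such as the "l instead of 1" case. The function name and the `DIGIT_LOOKALIKES` list are my own assumptions, not anything FineReader provides, and the check will not catch errors like "nn" for "m" or garbage text. It only tells you where to look first; it does not replace checking against the source.

```python
import re
from typing import List

# Assumed list of letters that OCR may confuse with digits; extend as needed.
DIGIT_LOOKALIKES = "lIOo"


def flag_suspect_tokens(text: str) -> List[str]:
    """Return number-like tokens that contain look-alike letters, e.g. '2O17'."""
    suspects = []
    for token in re.findall(r"\S+", text):
        # Drop surrounding punctuation so "l5.00," is checked as "l5.00".
        core = token.strip(".,;:()[]\"'")
        if not core:
            continue
        has_digit = any(ch.isdigit() for ch in core)
        has_lookalike = any(ch in DIGIT_LOOKALIKES for ch in core)
        # Only consider tokens made of digits, look-alikes and number separators.
        number_like = all(
            ch.isdigit() or ch in DIGIT_LOOKALIKES or ch in ".,/-" for ch in core
        )
        if has_digit and has_lookalike and number_like:
            suspects.append(core)
    return suspects


sample = "Invoice 2O17 total l5.00, paid on 12/05/2016."
print(flag_suspect_tokens(sample))
```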
Before you bury that scarred and wounded dead PDF littered with rubbish that makes it unfeasible to format by hand, save it and keep it on ice for future use. Think of it as your Frankenstein project. You can easily use bits and pieces of it — or limbs and organs in keeping with the leitmotif — in the target document. Look for proper nouns and numbers, which will no doubt be similar or the same in the target language. In other words, let nothing go to waste.
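Since numbers usually carry over unchanged, the salvaged OCR text can also serve as a checklist once the translation is typed. The following Python sketch, with hypothetical function names of my own, pulls the numbers out of both texts and lists any source number that has no match in the target. It compares digits only, so "1.250,75" and "1,250.75" count as the same number. A date whose day and month are swapped in the target language will be reported, which is a prompt to check it, not proof of a mistake.

```python
import re
from collections import Counter


def digit_groups(text):
    """Return every number in the text reduced to its digits only."""
    # Matches runs such as 1,250.75 or 1.250,75 or 12/05/2016.
    found = re.findall(r"\d[\d.,/\-]*\d|\d", text)
    # Strip separators so different number formats compare equal.
    return [re.sub(r"\D", "", item) for item in found]


def missing_numbers(source_text, target_text):
    """List source numbers (as digit strings) not found in the target."""
    remaining = Counter(digit_groups(source_text)) - Counter(digit_groups(target_text))
    return sorted(remaining.elements())


source = "Total: 1.250,75 EUR el 12/05/2016, factura 4471."
target = "Total: EUR 1,250.75 on 12/05/2016, invoice 4417."
print(missing_numbers(source, target))
```

In this made-up example, the invoice number was mistyped in the translation, so it is the only one reported.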
Which brings me to the fix for that dead PDF slaughtered beyond recognition, you know, one that would make roadkill look good! I'm afraid, folks, you'll just have to limber up your fingers and type away your translation. There's a quick way to arrange your windows so that the source is on the left and the target is on the right (unless you want to do it backwards). With the PDF selected, press Windows key + left arrow so that it is positioned on the left half of the screen; with the Word document selected, press Windows key + right arrow so that it is positioned on the right half. Of course, if you don't like to type, you can always dictate your translation with Dragon NaturallySpeaking (Editor's note: Learn more about dictation methods in this Q&A.). Typing Assistant is pretty good if you just want to stick to keyboarding and opt out of vocal cording.
If there are a lot of text boxes, stamps, seals, tables and graphs or any mixture thereof, consider breaking each page into smaller chunks in the form of a screen snippet. When I am looking at a tax form, for instance, it is hard on my eyes and more daunting if I see the complete page all at once. However, if I isolate each chunk or paragraph and focus on that alone, it is much more pleasant and hence I am more productive.
There are many ways to snip PDFs: Evernote, SnagIt and Microsoft Windows' own Snipping Tool. Yes, I know, it's an extra step and somewhat tedious if you have a long document, but I find it worthwhile. There's nothing worse than a translation with missing portions. (Well, actually there is: a poor translation.)
If your budget and schedule allow for it, you can always outsource the dirty work involved in formatting dead or live PDF documents. Seek the services of a DTP expert on Fiverr or Upwork. For a reasonable price and a fairly quick turnaround time, you can get someone else to prep your source document so that you can wake up the next morning or the morning after that and find it in your mailbox. Even so, don't forget to give it the once-over in case your helper left something out or unresolved.
Besides those translators on the various forums who say that they refuse to accept PDFs (which is not my case), there are others who suggest asking the client if he/she has a Word copy of the document. I myself have asked the same question and usually the answer is negative. But it's worth a try as this would be the ideal and ultimate solution. If this is not possible, you should also ask the client how accurately the source document has to be formatted. Sometimes they will just say that minimal formatting is acceptable. Other times, they will demand more from the translator. Just make sure you are charging enough to make it worth your while. I myself add a certain percentage of my regular fee to all PDF translations. If you prefer, you could also add a given fixed hourly rate for dealing with these monsters.
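To compare the two ways of charging, here is a small Python sketch of the arithmetic. Every figure in it is hypothetical and chosen only to make the sums easy to follow; none of them is my rate or a recommended one, and the function names are made up for the example.

```python
def fee_with_percentage(base_fee, surcharge_percent):
    """Regular fee plus a percentage surcharge for PDF handling."""
    return round(base_fee * (1 + surcharge_percent / 100), 2)


def fee_with_hourly(base_fee, prep_hours, hourly_rate):
    """Regular fee plus an hourly charge for time spent preparing the PDF."""
    return round(base_fee + prep_hours * hourly_rate, 2)


# Hypothetical job: a regular fee of 200.00 for the translation itself.
print(fee_with_percentage(200.0, 25))    # 25% surcharge on the regular fee
print(fee_with_hourly(200.0, 1.5, 30.0))  # 1.5 hours of prep at 30.00 per hour
```

With these invented numbers, the percentage surcharge comes to 250.00 and the hourly option to 245.00. The percentage is easier to quote up front, while the hourly rate tracks how mangled the PDF actually turned out to be.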
Whatever you choose to do, and however you choose to go about it, it is extremely important to decide what solution or solutions you are going to apply before you start translating. At the beginning of my career, I relied too heavily on OCR programs, and in my ignorance, I would start translating on a less than perfect Word conversion with funny lines and squiggles. Then the client would mention this fact and I would have to scramble and fix it the best way I knew how. Conversely, sometimes I would type out the translation when I probably could have benefited from conversion software, saving me several hours. PDFs can be challenging, and a pain at times, but without them, our options would be limited.