How to extract text from a PDF with embedded fonts

Question


Nishanth Lawrence 2013-10-08 at 09:20

pdftotext from xpdf works fine on files with ordinary embedded fonts, but fails on files where the fonts are embedded as subsets. Is there a workaround for this problem?


2 Answers


Damon 2013-10-08 at 09:45

In this situation, I have printed the PDFs through the Adobe PDF printer as high-resolution (1200 dpi or more), high-quality images (turn up any settings you can). Then I OCR the image PDF, which leaves me with a searchable and workable PDF.

When I have many PDFs running to thousands of pages, I open several PDF windows at once so that multiple cores work on multiple PDFs simultaneously. It is a PITA, but it works.

Hopefully your files are small! I've done this to upwards of 10,000 pages once (building code books). Not fun.
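
For anyone who wants to script the same print-to-image-then-OCR idea instead of clicking through windows, here is a rough sketch. It is not what I actually ran: it assumes the `pdftoppm` (poppler) and `tesseract` command-line tools are installed and on the PATH, and the function names, the per-PDF output folders and the 300 dpi default are all made up for illustration (raise `dpi` if the OCR quality is poor).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def rasterize_cmd(pdf: Path, out_prefix: Path, dpi: int = 300) -> list:
    """Build a pdftoppm command that renders every page of `pdf` to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def ocr_cmd(image: Path) -> list:
    """Build a tesseract command that prints the recognized text to stdout."""
    return ["tesseract", str(image), "stdout"]


def ocr_pdf(pdf: Path, workdir: Path, dpi: int = 300) -> str:
    """Rasterize one PDF, OCR each page image, and return the joined text."""
    # Hypothetical layout: one sub-folder per PDF so page images never collide.
    outdir = workdir / pdf.stem
    outdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, outdir / "page", dpi), check=True)

    texts = []
    for image in sorted(outdir.glob("page-*.png")):
        result = subprocess.run(
            ocr_cmd(image), check=True, capture_output=True, text=True
        )
        texts.append(result.stdout)
    return "\n".join(texts)


def ocr_many(pdfs, workdir: Path, workers: int = 4) -> dict:
    """OCR several PDFs at once; threads suffice because the heavy work
    happens in the external processes, not in Python."""
    pdfs = list(pdfs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(lambda p: ocr_pdf(p, workdir), pdfs))
    return dict(zip(pdfs, texts))


if __name__ == "__main__":
    # Assumed locations; adjust to wherever the PDFs actually live.
    results = ocr_many(Path("pdfs").glob("*.pdf"), Path("ocr_work"))
    for pdf, text in results.items():
        pdf.with_suffix(".txt").write_text(text, encoding="utf-8")
```

Running several PDFs through `ocr_many` at once is the scripted equivalent of opening multiple windows so that each core gets its own file.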

Thanks for the answer. But why can a PDF viewer interpret it correctly? Nishanth Lawrence 11 years ago

Probably because the encoding is embedded in the PDF, not in the program. Damon 11 years ago

Accepted Answer · 2013-10-08 09:23:26

The issue is probably that the characters which are rendered using the subset font have a custom encoding - the numeric representation of the characters does not correspond to ASCII, Latin-1 or any other common encoding.

See

This means there isn't an easy workaround.
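
To make the encoding problem concrete, here is a toy illustration, not taken from any real file. The font, the code assignments and the function names are all hypothetical. The idea is that a subset font may number its glyphs in whatever order the producer chose, so the bytes in the page content only make sense with a code-to-Unicode table (in a PDF this role is played by a ToUnicode CMap, when the producer includes one). A viewer can still draw the right shapes, because it only needs the glyph outlines for each code, not their meaning.

```python
# Hypothetical subset font: glyph codes were handed out in order of first use,
# so they bear no relation to ASCII or Latin-1.
SUBSET_TO_UNICODE = {
    0x01: "H",
    0x02: "e",
    0x03: "l",
    0x04: "o",
}

# The bytes a page might contain for the word "Hello" under that font.
raw_codes = bytes([0x01, 0x02, 0x03, 0x03, 0x04])


def decode_naive(codes: bytes) -> str:
    """Treat the codes as Latin-1, as an extractor with no mapping has to."""
    return codes.decode("latin-1")


def decode_with_map(codes: bytes, mapping: dict) -> str:
    """Translate each code through the table; unknown codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


if __name__ == "__main__":
    print(repr(decode_naive(raw_codes)))                  # control characters, not text
    print(decode_with_map(raw_codes, SUBSET_TO_UNICODE))  # Hello
```

When the table is missing or wrong, an extractor is left with the first result, and recognizing the rendered glyph shapes (OCR) is about the only way to recover the text.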
