Blog: Example

Extract some pages from a pdf (qpdf)

Install qpdf (FOSS), then run

qpdf originalDoc.pdf --pages . 1-10 -- outputDoc.pdf

For multiple sets of pages, repeat the . (which stands for the input file) with each range

qpdf input.pdf --pages . 1-8 . 53-70 -- output.pdf
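If you extract pages often, a small wrapper can save some typing. This is only a sketch: the function name extract_pages and the file names are made up for illustration, and it assumes qpdf is installed. It uses the comma form of a range (1-8,53-70), which qpdf accepts for several ranges from the same file.

```shell
# Hypothetical helper (not part of qpdf): extract_pages INPUT.pdf OUTPUT.pdf RANGE
# RANGE uses qpdf's page-range syntax, e.g. "1-10" or "1-8,53-70".
extract_pages() {
    if [ "$#" -ne 3 ]; then
        echo "usage: extract_pages input.pdf output.pdf range" >&2
        return 2
    fi
    # "." means "the input file already named on the command line".
    qpdf "$1" --pages . "$3" -- "$2"
}
```

Paste the function into your Terminal (or your ~/.bashrc), then call it like: extract_pages input.pdf output.pdf 1-8,53-70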



Convert images to pdf (img2pdf)

(There is also a way described for ImageMagick, but it didn't work for me: https://linuxhint.com/convert-image-to-pdf-command-line/ )

sudo apt-get install img2pdf

Open a Terminal in the folder with the images and run

img2pdf *.png -o output_imgs.pdf

(assuming the images are PNGs. sudo is not needed here, and using it would leave the output file owned by root.)
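One thing to watch with *.png: the shell expands it in plain alphabetical order, so pg10.png comes before pg2.png and the pages can end up shuffled. The sketch below sorts the names "naturally" first. It assumes GNU sort (for -V) and xargs (for -0), which Ubuntu has by default; the function name pngs_to_pdf_sorted is just something I made up, and it won't cope with file names containing line breaks.

```shell
# Hypothetical helper: pngs_to_pdf_sorted [OUTPUT.pdf]
# Combines all PNGs in the current folder in natural order (pg2 before pg10).
pngs_to_pdf_sorted() {
    out=${1:-output_imgs.pdf}
    # Stop early if the folder has no PNGs (the pattern stays unexpanded).
    set -- *.png
    [ -e "$1" ] || { echo "no PNG files in this folder" >&2; return 1; }
    # sort -V sorts numbers inside names by value; tr and xargs -0 keep
    # names with spaces intact when handing them to img2pdf.
    printf '%s\n' "$@" | sort -V | tr '\n' '\0' | xargs -0 img2pdf -o "$out"
}
```

If your images are already zero-padded (pg01.png, pg02.png, ...) the plain *.png command works fine and you don't need this.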



Convert image to text (tesseract-ocr)

sudo apt install tesseract-ocr

or (although I don't think this is necessary)

sudo apt install tesseract-ocr libtesseract-dev tesseract-ocr-eng

Try an example. Name your file existingimage.png, open a Terminal in that folder and run

tesseract existingimage.png output_from_ocr -l eng

cat output_from_ocr.txt

(where -l specifies the language, and tesseract adds the .txt extension to the output name itself. To see the installed languages, run tesseract --list-langs)
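To OCR a whole folder of images rather than one file, a loop along these lines should work. It's a sketch, not something from the tesseract docs: the function name ocr_all_pngs is made up, it assumes the images are PNGs, and it takes the language as an optional argument (eng if you give none).

```shell
# Hypothetical helper: ocr_all_pngs [LANG]
# Each NAME.png in the current folder produces NAME.txt next to it.
ocr_all_pngs() {
    lang=${1:-eng}
    for img in *.png; do
        # Skip the unexpanded pattern if the folder has no PNGs.
        [ -e "$img" ] || continue
        # Second argument is the output base; tesseract appends .txt itself.
        tesseract "$img" "${img%.png}" -l "$lang"
    done
}
```

For example, ocr_all_pngs spa would run every PNG through the Spanish language data, if you have it installed.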


OCR means Optical Character Recognition

Convert image to pdf (not to txt)

tesseract -l eng input_for_ocr.png output_from_ocr pdf


If you get the error 'Tesseract couldn't load any languages!', see https://github.com/tesseract-ocr/tesseract/issues/1309

Spanish: download spa.traineddata from here https://github.com/tesseract-ocr/tessdata/blob/c2b2e0df86272ce11be323f23f96cf656565ed41/spa.traineddata

Put it in /usr/share/tesseract-ocr/4.00/tessdata/ next to eng.traineddata (you will have to open that folder as root)

Or maybe you can just use: sudo apt-get install tesseract-ocr-spa (although it might not save it to the location you want)
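To check whether a language actually got picked up after installing it, you can search the output of tesseract --list-langs. A minimal sketch, where the function name has_tess_lang is my own invention; the 2>&1 is there because, as far as I know, some older tesseract versions print the list to stderr.

```shell
# Hypothetical helper: has_tess_lang CODE
# Succeeds if CODE (e.g. spa) appears as its own line in the language list.
has_tess_lang() {
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}
```

Use it like: has_tess_lang spa && echo "Spanish is installed"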


NOTE: After you install a language (or even if you don't), you might overwrite the same output file and see an error message, but it's working anyway.


Convert PDF to TXT with pdftotext (poppler-utils)

  1. sudo apt install poppler-utils
  2. open a Terminal in the folder with the pdf
  3. pdftotext -layout pdfname.pdf documenttocreate.txt (where -layout tries to preserve the formatting of the pdf; it is an optional flag)
  4. pdftotext -layout -f 1 -l 20 pdfname.pdf documenttocreate.txt (where -f and -l set the first and last pages; this creates a txt file from pages 1-20 of the pdf)

pdftotext doesn't support batch conversion. To convert a whole folder full of pdfs, use a Bash for loop. This should convert every pdf in the folder to a txt file with the same name (I haven't tested this):

for file in *.pdf; do pdftotext -layout "$file"; done
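A slightly more careful version of that loop names each output file explicitly and leaves existing txt files alone. Treat it as a sketch: the function name pdfs_to_txt is made up, and skipping existing files is my own choice, not something pdftotext does for you.

```shell
# Hypothetical helper: convert every PDF in the current folder to text.
# report.pdf becomes report.txt; an existing report.txt is not overwritten.
pdfs_to_txt() {
    for file in *.pdf; do
        # Skip the unexpanded pattern if the folder has no PDFs.
        [ -e "$file" ] || continue
        txt=${file%.pdf}.txt
        if [ -e "$txt" ]; then
            echo "skipping $file: $txt already exists" >&2
            continue
        fi
        pdftotext -layout "$file" "$txt"
    done
}
```

Quoting "$file" and "$txt" matters here, since pdf names often contain spaces.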


Convert PDF all pages (or some) into images (pdftoppm)

https://askubuntu.com/questions/150100/extracting-embedded-images-from-a-pdf/1187844#1187844

Check you have it (often pre-installed) with

pdftoppm -v

Example you can try without overwriting anything. Open a Terminal in the folder and do (you can, if you want, change your pdf name to imgtesttt.pdf and just copy-paste these 2 commands).

mkdir -p images && pdftoppm -jpeg -jpegopt quality=100 -r 300 imgtesttt.pdf images/pg

(This is the highest jpeg quality available; you can set it anywhere from 0 to 100. Jpegs will be around 0.2 to 2 MB for 8.5x11" pages.)

or for png

mkdir -p images && pdftoppm -png -r 300 imgtesttt.pdf images/newimagename

(Note that this does two things. First it creates a folder called 'images', then it writes the page images into that folder. So to use a folder called 'book' instead, you need to change both the mkdir part and the last part of the command.)

To do several pdfs that are in the same folder

mkdir -p XXXXX && pdftoppm -png -r 300 XXXXX.pdf XXXXX/XXXXX

mkdir -p YYYYY && pdftoppm -png -r 300 YYYYY.pdf YYYYY/YYYYY
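Rather than typing one line per pdf, the same pattern can go in a loop that makes a folder named after each pdf. This is a sketch under my own assumptions: the function name pdfs_to_pngs is invented, and the dpi is an optional argument that defaults to 300.

```shell
# Hypothetical helper: pdfs_to_pngs [DPI]
# For each NAME.pdf, creates folder NAME and writes NAME/NAME-<page>.png.
pdfs_to_pngs() {
    dpi=${1:-300}
    for pdf in *.pdf; do
        # Skip the unexpanded pattern if the folder has no PDFs.
        [ -e "$pdf" ] || continue
        name=${pdf%.pdf}
        mkdir -p "$name" && pdftoppm -png -r "$dpi" "$pdf" "$name/$name"
    done
}
```

Running pdfs_to_pngs 150 would do the same at 150dpi, which is quicker and gives smaller files.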

A simpler command

mkdir -p images && pdftoppm imgtesttt.pdf images/pg

(This creates a folder called 'images' inside the current folder and makes a .ppm image file of every page. In the earlier commands, -r 300 means 300dpi; the default is 150dpi if you don't specify it.)

Note: you can make:

  • PPM (default)
  • PNG (with -png)
  • JPEG (with -jpeg)
  • TIFF (with -tiff)

A tiff example:

mkdir -p images && pdftoppm -tiff -r 300 mypdf.pdf images/pg

(300dpi, where each image takes 15-45 seconds. It is a single-core process, so a machine with more cores won't be any faster.)

A simpler jpeg example:

mkdir -p images && pdftoppm -jpeg -r 300 mypdf.pdf images/pg

I just did png this way:

mkdir -p images && pdftoppm -png -r 300 mypdf.pdf images/pg

(Image: jpeg 300dpi and png 300dpi at 100% and 200%)

