Lighten, darken, increase contrast on (text) images, for readability (ImageMagick)

Lighten or darken those pages

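  • convert output.pdf -function polynomial 1,0,0,0 darkened.pdf (cubes each pixel value, which darkens the midtones)
  • convert output.pdf -contrast-stretch 2%x20% music1C.pdf (use convert here: mogrify overwrites its input files and does not take an output filename)
  • convert -density 600 output.pdf output-%02d.jpg (renders each page as a 600 dpi JPEG)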

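(ImageMagick may refuse to read PDFs with a "not authorized" error. The workaround below uses mv to move the security policy file out of the way, then restores it afterwards: https://stackoverflow.com/questions/52861946/imagemagick-not-authorized-to-convert-pdf-to-an-image)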

sudo mv /etc/ImageMagick-6/policy.xml /etc/ImageMagick-6/policy.xml.off

When done, you can restore the original with

sudo mv /etc/ImageMagick-6/policy.xml.off /etc/ImageMagick-6/policy.xml
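The move-and-restore steps above could be wrapped so the policy file is always put back, even when the conversion fails. This is only a sketch: the function name with_policy_disabled is made up for illustration, and it has to be run from a shell that is allowed to move the file (for example a root shell).

```shell
# Hypothetical helper (not part of ImageMagick): move a policy file aside,
# run one command, then restore the file whatever the command returned.
with_policy_disabled() {
  policy_file=$1
  shift
  mv "$policy_file" "$policy_file.off" || return 1
  status=0
  "$@" || status=$?
  mv "$policy_file.off" "$policy_file"
  return "$status"
}

# Example, from a root shell (path as in the commands above):
# with_policy_disabled /etc/ImageMagick-6/policy.xml convert output.pdf output-%02d.jpg
```

Files created by a command run this way will belong to root, so you may need to chown them afterwards.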

3 step process:

  • convert your_pdf_filename.pdf output-%02d.jpg
  • convert output*.jpg -level 25% final-%02d.jpg
  • convert final*.jpg very_readable.pdf

(change the level value)

The -threshold argument gives a purely black-and-white image, as in: convert output*.jpg -normalize -threshold 80% final-%02d.jpg. I want to keep the gray scale, which -level allows: the grays are kept and only made darker or lighter.
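The three steps above can be sketched as one function, with the -level value as an optional argument so it is easy to try different settings. The function name relevel_pdf and the temporary page-/final- file names are my own choices for illustration; the page numbers are padded to three digits so the glob keeps pages in order up to 999 pages.

```shell
# Hypothetical wrapper around the three convert steps (needs ImageMagick).
# Usage: relevel_pdf input.pdf output.pdf [level]   (level defaults to 25%)
relevel_pdf() {
  input=$1
  output=$2
  level=${3:-25%}
  workdir=$(mktemp -d) || return 1

  # Step 1: one JPEG per page. Step 2: adjust levels, keeping the gray scale.
  # Step 3: join the adjusted pages into one PDF. Stop at the first failure.
  convert "$input" "$workdir/page-%03d.jpg" &&
    convert "$workdir"/page-*.jpg -level "$level" "$workdir/final-%03d.jpg" &&
    convert "$workdir"/final-*.jpg "$output"
  status=$?

  rm -rf "$workdir"
  return "$status"
}
```

For example, relevel_pdf your_pdf_filename.pdf very_readable.pdf 35% would rerun the whole process with a different level value.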

TTTThis

Extract some pages from a pdf (qpdf)

install qpdf (FOSS)

qpdf originalDoc.pdf --pages . 1-10 -- outputDoc.pdf

For multiple sets of pages

qpdf input.pdf --pages . 1-8 . 53-70 -- output.pdf
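If you extract page sets often, a small helper can build the qpdf command for you. This is a sketch, and extract_pages is a made-up name: it joins the ranges with commas, which qpdf accepts as a single page range (1-8,53-70), and uses "." to mean the input file as in the commands above.

```shell
# Hypothetical helper: extract_pages input.pdf output.pdf RANGE [RANGE...]
extract_pages() {
  input=$1
  output=$2
  shift 2
  # Join the remaining arguments with commas: "1-8" "53-70" becomes "1-8,53-70"
  ranges=$(IFS=,; printf '%s' "$*")
  qpdf "$input" --pages . "$ranges" -- "$output"
}

# Example:
# extract_pages originalDoc.pdf outputDoc.pdf 1-8 53-70
```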



Convert images to pdf (img2pdf)

(There is a way described for ImageMagick but it didn't work for me: https://linuxhint.com/convert-image-to-pdf-command-line/ )

sudo apt-get install img2pdf

Open a Terminal in the folder with the images and do

img2pdf *.png -o output_imgs.pdf

(assuming the images are pngs.)
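One thing to watch with *.png is page order: the shell sorts names as text, so page10.png comes before page2.png. A possible workaround, assuming GNU sort (for -V and -z) and an xargs that supports -0, is sketched below; pngs_to_pdf is a made-up name.

```shell
# Hypothetical helper: pngs_to_pdf output.pdf
# Combines the PNGs in the current folder in natural (version) order.
pngs_to_pdf() {
  output=$1
  # sort -V puts page2.png before page10.png; the NUL separators (-z, -0)
  # keep file names with spaces intact. Inside sh -c, $0 is the output name
  # and "$@" is the sorted list of images.
  printf '%s\0' ./*.png | sort -zV | xargs -0 sh -c 'img2pdf "$@" -o "$0"' "$output"
}

# Example:
# pngs_to_pdf output_imgs.pdf
```

With a very large number of images xargs may split the list into several img2pdf calls, so this is best kept to ordinary-sized folders.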



Convert image to text (tesseract-ocr)

sudo apt install tesseract-ocr

or (although I don't think this is necessary)

sudo apt install tesseract-ocr libtesseract-dev tesseract-ocr-eng

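Try an example. Name your file existingimage.png, open a Terminal in that folder and do

tesseract -l eng existingimage.png output_from_ocr

This writes the recognized text to output_from_ocr.txt, which you can view with

cat output_from_ocr.txt

(where -l specifies a language. To see the installed languages, do tesseract --list-langs; for everything else, do man tesseract)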
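To run the same command over a whole folder of images, a loop along these lines might do. It is a sketch under a few assumptions: the images are pngs, the function name ocr_folder is made up, and each text file is written next to its image (page.png gives page.txt).

```shell
# Hypothetical helper: ocr_folder [language]   (language defaults to eng)
ocr_folder() {
  lang=${1:-eng}
  for image in ./*.png; do
    # If the folder has no pngs, the pattern stays unexpanded; skip it.
    [ -e "$image" ] || continue
    # tesseract appends .txt to the output base name itself.
    tesseract -l "$lang" "$image" "${image%.png}"
  done
}

# Example:
# ocr_folder spa
```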


OCR means Optical Character Recognition

Convert image to pdf (not to txt); this writes output_from_ocr.pdf

tesseract -l eng input_for_ocr.png output_from_ocr pdf


If you get the error 'Tesseract couldn't load any languages!', see: https://github.com/tesseract-ocr/tesseract/issues/1309

Spanish: download from here https://github.com/tesseract-ocr/tessdata/blob/c2b2e0df86272ce11be323f23f96cf656565ed41/spa.traineddata

put it in /usr/share/tesseract-ocr/4.00/tessdata/ next to eng.traineddata (You will have to open that folder as root)

Or maybe you can just use: sudo apt-get install tesseract-ocr-spa (although it might not save it to the location you want)


NOTE: After you install a language (or even if you don't), you might overwrite an existing output file and see an error message, but it works anyway.


Convert PDF to TXT with pdftotext (poppler-utils)

  1. sudo apt install poppler-utils
  2. open Terminal in folder
  3. pdftotext -layout pdfname.pdf documenttocreate.txt (the optional -layout flag tries to preserve the layout of the pdf)
  4. pdftotext -layout -f 1 -l 20 pdfname.pdf documenttocreate.txt (where -f and -l designate the first and last pages; this creates a txt file from pages 1-20 of the pdf)

pdftotext doesn't support batch conversion. To convert a whole folder full of pdfs, use a Bash for loop:

for file in *.pdf; do pdftotext -layout "$file"; done

This converts every pdf in that folder to a txt file with the same base name (I haven't tested this)
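A slightly more defensive version of that loop is sketched below. It names each output file explicitly and can write the txt files to a separate folder; the function name pdfs_to_text and the optional folder argument are my own additions.

```shell
# Hypothetical helper: pdfs_to_text [output_folder]   (defaults to the current folder)
pdfs_to_text() {
  outdir=${1:-.}
  mkdir -p "$outdir" || return 1
  for file in ./*.pdf; do
    # If the folder has no pdfs, the pattern stays unexpanded; skip it.
    [ -e "$file" ] || continue
    name=${file##*/}
    pdftotext -layout "$file" "$outdir/${name%.pdf}.txt"
  done
}

# Example:
# pdfs_to_text txt_output
```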
