Converting File Formats

Converting files from one format to another happens often, but converting images and text from other sources in preparation for PDF production can be a challenge. Files arrive from many sources, and their formats may not suit your apps or your requirements. Linux, Mac, and Windows systems have many options to convert formats.

Don’t pooh-pooh Linux shell tools as only for wonky power users, nuts-and-bolts programmers, and deep-diving system administrators. Bash (GNU’s Bourne Again SHell) tools continue to offer speed and convenience for getting things done that GUI apps may not do quite as well. Consider the Windows PowerShell command line used in my Backup article, which gave both a GUI progress bar and the elapsed time in its text output. Cygwin, offering bash for Windows, has been available for a long time. Microsoft has even made bash available on Windows 10.

But first, one particular issue needs addressing.

File Naming Problems

Many years ago, when Windows expanded on what it borrowed from Unix for MS-DOS, Windows users discovered they could use very long filenames. They started naming their files with descriptive phrases. Unfortunately, they chose to use spaces in their filenames. Apps can use spaces in filenames (Unix, Linux, Windows, and even MS-DOS directories could store spaces in filenames), but those spaces cause trouble when working with certain very convenient command line (shell) tools.

Inside a program, a file’s name is whatever the OS’s directory says it is. Filenames can be very long and can use just about any character, including some characters they shouldn’t. What shouldn’t they use? Question marks and asterisks are typically a problem because those are shell wildcards. Control characters are sometimes a problem because they are hard to type. Spaces separate command line arguments, so spaces in filenames become a problem for the shell interpreting your space-separated commands. The common workaround puts quotation marks around such filenames. Another workaround puts a backslash (\) character before each space, a technique known as escaping the space. The backslash trick makes typing a bit tougher and is prone to the typist forgetting one of them.
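
For example, either of these forms delivers the whole name (a short, purely hypothetical filename used only for illustration) to the command as a single argument:

$ ls "my test file.txt"
$ ls my\ test\ file.txt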

Face it. Filenames with spaces are a pain.

The easiest thing to do is rename the file. Several Windows methods exist for renaming, and an interesting PowerShell (or direct command line) method looks very similar to the Linux method. Here’s a Linux example:

$ ls this*
this is a test file with spaces in the name.txt

This filename was created with quotation marks surrounding the name, which preserved the spaces, and the directory is happy to store it. Notice that the wildcard (*) expanded to the full name, spaces and all, for ls(1) to use. Linux’s mv(1) is different from the Windows move or rename commands. On Linux, mv will move a file from one place to another, rename a file, or move and rename the file simultaneously. On Windows, rename only changes names; it does not move the file. The Windows move command is closer to Linux’s mv: it can put a file elsewhere and rename it at the same time. In fact, the Windows move command was borrowed from Unix’s mv.

Here’s what the Linux bash command to rename it might look like:

$ mv this\ is\ a\ test\ file\ with\ spaces\ in\ the\ name.txt thisisatestfilewithspacesinthename.txt

Typing this is a pain (you just had to pick a long name, didn’t you?), although tab completion helps: hit the TAB key after typing part of the current name following the mv command. You could manually delete the spaces and, along the way, if you don’t like how the words run together in a space-free name, you could capitalize each word. Windows directories store capital letters and Windows GUI programs can use them, but some Windows shell commands refuse to differentiate them. Even so, imagine going through this pain when you need to rename many such files.

On Linux, the tr(1) utility helps. The tr command translates each character in a set of characters into the corresponding character from a second set.

Note: Why does Linux use such heavily abbreviated program names? Linux derives from Unix. Unix was created back in the day of printing terminals, before monitors became cheap and widespread, at a time of slow network transmission speeds. There was quite a delay while the system took a character typed at the terminal’s keyboard, sent it down the slow line to the computer, possibly miles away, and that computer acknowledged it by echoing the character back over the slow line for the printing terminal to strike onto the paper. So, the fewer characters to type, the faster everything went; hence, very short command names.

For example, say you want to replace every instance of any character in the set abcde with the corresponding character in the set 12345:

$ echo "the quick brown fox jumps over the lazy dog" | tr "abcde" "12345"
th5 qui3k 2rown fox jumps ov5r th5 l1zy 4og

Notice that the sentence has three letter e’s and every one got replaced with a 5. Of course, you can also translate just a single character. You could replace every space with an underscore:

$ echo "this is a test" | tr ' ' '_'
this_is_a_test

If you just want to get rid of them:

$ echo "this is a test" | tr -d ' '
thisisatest

The -d option deletes every appearance of the character in the set. No second set is needed. Either of these last two methods will get rid of the spaces, but let’s stick with using the underscore in place of the space.

Note: You can create test files using the touch(1) command:

$ touch "this is a test" "this is another test" "this is the last test"

Touch creates a file for each name you give it, but such files have no data. They only have a directory entry, so ls -l would show a length of zero.

If you have many files with spaces that you want to replace like this, use mv with tr in a loop. Given the following filenames:

$ ls -1 this*
this is another test
this is a test
this is the last test

Here’s the bash loop to do it:

for f in *\ *
do
  mv "$f" "$(echo "$f" | tr ' ' '_')"
done

Just type that in after the $ prompt. Each new line typed will prompt with a > symbol, indicating that bash knows you’re continuing the for command. Pressing ENTER after the done will run the loop. Here’s the result of the loop, using ls -1 (a minus-one argument):

$ ls -1 this*
this_is_another_test
this_is_a_test
this_is_the_last_test

The script works like this:

  1. The for loop fills variable f, one filename at a time, with each filename having at least one space in it. That space is escaped with a backslash between two asterisk (*) wildcards, so any filename containing at least one space will match. Remember that bash wildcards can be used as often as needed. Filenames without spaces will not be given to the f variable.
  2. Within the do...done delimiters, the mv command gets the current contents of variable f in its first argument — the filename to move from. Quotation marks hide the spaces from the shell, but allow expansions using the $ introducer. The second argument — the filename to move to — is generated in a subshell, using the $(...) syntax.
  3. Inside the $(...) subshell, the echo command sends the expanded variable f to the tr command through a pipe, which replaces every space with an underscore. There’s a little trick here. The quotation marks surrounding $f act differently than you might think. They do not pair with the outer quotation marks: a command substitution gets its own quoting context, so these inner quotes protect the expansion of $f inside the subshell and deliver the filename, spaces and all, to echo. To see how this works, touch a file named “this has multiple spaces in it”, but with one space after the first word, then two spaces, then three, and so on.
  4. The outer quotation marks deliver the full name including all spaces to the mv command. The loop repeats this action for every file having spaces in its name.

Here’s what it looks like, including the special file name mentioned in the third step:

$ ls -1 this*
this has  multiple   spaces    in     it
this is another test
this is a test
this is the last test

$ for f in *\ *; do mv "$f" "$(echo "$f" | tr ' ' '_')"; done

$ ls -1 this*
this_has__multiple___spaces____in_____it
this_is_another_test
this_is_a_test
this_is_the_last_test

$ for f in *\_*; do mv "$f" "$(echo $f | tr '_' ' ')"; done

$ ls -1 this*
this has  multiple   spaces    in     it
this is another test
this is a test
this is the last test

The first filename, “this has…”, contains an increasing number of spaces between its words. The others use a single space each. (BTW, the semicolons let the separate parts of the loop appear on a single command line.) The loop replaces the spaces with underscores, and every space was preserved by the subshell and passed through the pipe to tr; the second file listing is the proof. Reversing the process comes next. Just put an underscore where the space was in the for statement and swap the quoted space and the quoted underscore in the tr command. This time, though, the quotation marks around $f were removed from the echo command because space preservation is not needed once underscores replace the spaces in the filenames (the quotes could have been kept). Every underscore was replaced with a space.

To prove that the extra quotation marks are needed to preserve the spaces in the subshell, here’s what would happen if you used the first loop to replace the spaces with underscores, but did not use quotation marks in the echo command:

$ for f in *\ *; do mv "$f" "$(echo $f | tr ' ' '_')"; done

$ ls -1 this*
this_has_multiple_spaces_in_it
this_is_another_test
this_is_a_test
this_is_the_last_test

Oops! Turned all the multi-spaces into single spaces. How did that happen?

Without quotation marks inside the subshell, the expansion of $f undergoes word splitting: the shell breaks the filename apart at each run of spaces and hands the pieces to echo as separate arguments. Echo then joins its arguments with single spaces, so tr only sees one space per instance. To preserve the original spacing inside the subshell, a separate set of quotation marks is needed.

Note: You can remove the test files using:

$ rm this*

No quotation marks are needed for this because the shell expands the wildcard and passes the results to the command, rm, without further evaluation. To prove this, before running rm, run the command:

$ echo this*

You’ll see that the multispace file name expands with all its spaces. Bash performs its expansions in a fixed order, and filenames produced by wildcard (pathname) expansion are not split again at their spaces, so each name reaches the command as a single argument.
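
To see that each expanded name arrives as one argument, spaces intact, printf(1) helps because it reapplies its format string to every argument it receives:

$ printf '<%s>\n' this*

Each filename prints inside its own pair of angle brackets with all of its internal spaces preserved.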

Convert Documentation to PDF

Most modern document generating software can export to PDF. But, if you have a bunch of files needing conversion, delivering them on the command line eliminates the time needed to load them into the word processor, export them, and then do it again for the next one.

LibreOffice has versions for Linux, Windows, and Mac. Its command line interface is soffice(1). This command can start any of the LibreOffice suite of programs, but when used on the command line with the --headless option it does the work without starting the GUI.

Note: On Windows, use -headless instead. In general, where Linux uses a double-hyphen for its options, the Windows version of soffice uses a single hyphen.

To convert docs or presentations to PDF, run the command:

$ soffice --headless --convert-to pdf FILENAMES

This converts any document, spreadsheet, or presentation file into PDF. The filename stays the same except that it gets a .pdf suffix instead. Other output formats are available, too. See the list of *.xcu filter files and look inside the one for the format you want to try. Whatever name is used as the node’s oor:name in that file is the name you can use after the suffix type, such as:

$ soffice --headless --convert-to xls:"MS Excel 4.0" *.xlsx
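
For batch work, the files can be delivered all at once. Here’s a sketch, assuming a directory full of .docx files and an existing pdf subdirectory to receive the results (--outdir tells soffice where to write the converted files):

$ soffice --headless --convert-to pdf --outdir pdf *.docx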

Convert PDF to Image Files

Sometimes, you want simple graphic images from your PDF. Maybe you want to import part of it into another image. Or, you wrote the PDF and others can read it, but you don’t want them changing it, so making a static image could do the trick. Also consider preventing invasive software embedded in a PDF from compromising your system: turn the other guy’s PDFs into graphic images before opening them. They’re still readable but can no longer compromise anything. Qubes does this when it converts untrusted PDFs into trusted PDFs.

The easiest way to convert PDFs into images is with the ImageMagick tools, specifically using convert(1). ImageMagick typically comes with Linux distributions, but it is available for Windows, Mac OS X, and iOS. Windows users will need to install Ghostscript. Consider ImageMagick a lifetime learning experience.

Note: The Poppler utilities provide extra conversion apps (pdfdetach(1), pdffonts(1), pdfimages(1), pdfinfo(1), pdfseparate(1), pdftocairo(1), pdftohtml(1), pdftoppm(1), pdftops(1), pdftotext(1), pdfunite(1)). However, there is no version of Poppler made for 64-bit Windows users. A 32-bit version is available and 64-bit Windows 10 can run 32-bit programs because of WoW64 (Windows on Windows64). One could download and build the Poppler code with MinGW (Minimalist GNU for Windows). MinGW has 32-bit and 64-bit versions, opening up many Linux utilities to Windows users, including bash, the shell, gcc, the C compiler, gawk, the text processing language, grep, the text search program, and many others.
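
As a taste of that toolset, pdftoppm performs a PDF-to-image conversion much like the one shown next with convert. This is a sketch using its standard -png and -r (resolution) options; it writes one PNG per page at 150 dpi with TestFile2 as the filename prefix:

$ pdftoppm -png -r 150 TestFile2.pdf TestFile2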

To turn a PDF into a set of PNG files, use the convert utility:

$ convert INFILE.pdf OUTFILE.png

or, to turn it into a set of JPG files, just change the output suffix:

$ convert INFILE.pdf OUTFILE.jpg

Convert outputs one file for each page, using whatever name you gave as the outfile. Because the output suffix differs from the input suffix, the outfile name can be the same as the infile name. Each output file gets a page number:

$ ls -1sh TestFile1*
44K TestFile1-0.png
40K TestFile1-1.png
96K TestFile1.pdf

Converting TestFile1.pdf produced two image files, one for each page. But, converting a larger file reveals a small problem:

$ ls -1sh TestFile2*
 48K TestFile2-0.png
 40K TestFile2-10.png
 28K TestFile2-11.png
 24K TestFile2-1.png
 32K TestFile2-2.png
 44K TestFile2-3.png
 24K TestFile2-4.png
 68K TestFile2-5.png
 24K TestFile2-6.png
 68K TestFile2-7.png
 64K TestFile2-8.png
 40K TestFile2-9.png
104K TestFile2.pdf

TestFile2.pdf has 12 pages, but convert doesn’t know the total in advance. It doesn’t prescan to count the pages; it just reacts to each page break by generating a new file, incrementing its page number, and moving on. That results in the non-numeric sorting problem shown above.

When sorting numbers as characters instead of as numbers, the comparison works position by position, from left to right. A 1 in a given position sorts after a 0 in the same position of another filename and before a 2. If multiple filenames have a 1 in that position, the next character resolves the order. For 10 and 11, the 0 and 1 subsort correctly, but in the filename with sequence number 1 the next character is a period, not a digit. A character encoding, such as ASCII or Unicode, assigns every character a numeric position, and the collation rules built on that encoding decide which character sorts first.
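
A quick way to see character ordering versus numeric ordering, independent of any filenames, is to compare sort(1) with and without its -n (numeric) option:

$ printf '%s\n' 9 10 11 1 2 | sort
1
10
11
2
9

$ printf '%s\n' 9 10 11 1 2 | sort -n
1
2
9
10
11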

To see what encoding your terminal is using:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

These are my locale(1) codes currently active. My system (LANG setting) uses Unicode, English US UTF-8. There are others. On my system they’re defined in:

$ ls /usr/share/X11/locale/
am_ET.UTF-8       iso8859-1   iso8859-9     locale.dir        vi_VN.tcvn
armscii-8         iso8859-10  iso8859-9e    microsoft-cp1251  vi_VN.viscii
C                 iso8859-11  ja            microsoft-cp1255  zh_CN
compose.dir       iso8859-13  ja.JIS        microsoft-cp1256  zh_CN.gb18030
cs_CZ.UTF-8       iso8859-14  ja_JP.UTF-8   mulelao-1         zh_CN.gbk
el_GR.UTF-8       iso8859-15  ja.SJIS       nokhchi-1         zh_CN.UTF-8
en_US.UTF-8       iso8859-2   km_KH.UTF-8   pt_BR.UTF-8       zh_HK.big5
fi_FI.UTF-8       iso8859-3   ko            ru_RU.UTF-8       zh_HK.big5hkscs
georgian-academy  iso8859-4   koi8-c        sr_CS.UTF-8       zh_HK.UTF-8
georgian-ps       iso8859-5   koi8-r        tatar-cyr         zh_TW
ibm-cp1133        iso8859-6   koi8-u        th_TH             zh_TW.big5
iscii-dev         iso8859-7   ko_KR.UTF-8   th_TH.UTF-8       zh_TW.UTF-8
isiri-3342        iso8859-8   locale.alias  tscii-0

Notice that the locale output shows LC_ALL is empty. That one, when set, overrides every other one. UTF-8 encodes each character as a sequence of one to four bytes, but sorting in a locale such as en_US.UTF-8 does not simply compare those bytes; it follows the locale’s collation rules, which give punctuation such as the period less weight than digits. Under those rules, a 1 followed by a digit comes before a 1 followed by a period. That may violate the law of least astonishment for those accustomed to plain ASCII ordering, which sequences the period before any digit. For comparison, run the same command, but prefix it with a new setting for LC_ALL:

$ LC_ALL=C ls -1sh TestFile2*
 48K TestFile2-0.png
 24K TestFile2-1.png
 40K TestFile2-10.png
 28K TestFile2-11.png
 32K TestFile2-2.png
 44K TestFile2-3.png
 24K TestFile2-4.png
 68K TestFile2-5.png
 24K TestFile2-6.png
 68K TestFile2-7.png
 64K TestFile2-8.png
 40K TestFile2-9.png
104K TestFile2.pdf

Setting the LC_ALL variable, or any variable, in front of a command sets it only for that command. After the command finishes, the variable loses that setting. Notice that with the C setting, a plain byte-order locale used for the C programming language, the output sequence changes so the 1. (a 1 followed by a period) in the filename comes before the 10.
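
A quick check confirms the setting does not persist. Because LC_ALL was empty before the prefixed command, it is still empty afterward:

$ LC_ALL=C ls > /dev/null
$ echo "LC_ALL is now '$LC_ALL'"
LC_ALL is now ''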

Still, this isn’t the sorting sequence we want. Whether the names sort by raw ASCII byte values or by UTF-8 collation rules, they are not sorting numerically. To get a numeric sort, irrespective of the locale definition in use, there must be a uniform quantity of digits in every filename.

Want to know how many digits? Got to know how many pages. If less than 10, use only one digit. If less than 100, need two, and so on. The identify(1) command in ImageMagick shows the total pages:

$ identify TestFile2.pdf | wc -l
12

If you don’t pipe identify’s output through wc(1) to count the lines, you’d see a line of format and size details for each page. That could be interesting info, but just counting the lines gives the page count.

Note: One of the Poppler utilities is pdfinfo(1). To extract the page count from the rest of the facts:

$ pdfinfo TestFile2.pdf | grep Pages
Pages: 12

Armed with the page count, introduce the correct number of digits for sorting with the following command:

$ convert INFILE.pdf OUTFILE-%02d.png

The hyphenated addition to the outfile tells convert to add the page number using the format described. This format follows the printf(1) technique. Introduced with a % symbol, it defines a format using leading zeroes (0) and a two-digit (2) decimal (d) number. Here’s the result using TestFile2:

$ ls -sh TestFile2*
 48K TestFile2-00.png
 24K TestFile2-01.png
 32K TestFile2-02.png
 44K TestFile2-03.png
 24K TestFile2-04.png
 68K TestFile2-05.png
 24K TestFile2-06.png
 68K TestFile2-07.png
 64K TestFile2-08.png
 40K TestFile2-09.png
 40K TestFile2-10.png
 28K TestFile2-11.png
104K TestFile2.pdf

Each PNG file has the same dimensions that identify showed for the PDF’s pages. Of course, identify will tell you the dimensions of the graphic file.
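
If you would rather not count pages and digits by hand, a small sketch like this derives the digit count from the page count (assuming, as with GNU wc, that the piped wc -l output is a bare number, so ${#pages} gives the number of digits in it):

$ pages=$(identify TestFile2.pdf | wc -l)
$ convert TestFile2.pdf "TestFile2-%0${#pages}d.png"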

Note: Converting to PNG files creates a transparent background. If you prefer a white background, add -background and -alpha options as follows:

$ convert INFILE.pdf -background white -alpha remove OUTFILE.png

Leaving out the -background option and its color argument defaults to white. The background can be any color, defined by a name, a hex reference, or a decimal RGB reference. The names come from a database that showrgb(1) lists. Hex references use the format #RRGGBB, where a pound sign (#) introduces three two-digit hexadecimal byte values for red (RR), green (GG), and blue (BB). Decimal RGB uses the format rgb(RRR,GGG,BBB), where each value is a decimal number from 0 to 255. For example, MediumSlateBlue would be #7B68EE in hex form and rgb(123,104,238) in decimal form. Names do not exist for every possible number combination, so the numeric formats can specify colors outside the showrgb list, although not everybody will be able to see the subtle differences.
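
For instance, these three commands should produce the same background color (the quotation marks keep the # and the parentheses away from the shell):

$ convert INFILE.pdf -background MediumSlateBlue -alpha remove OUTFILE.png
$ convert INFILE.pdf -background "#7B68EE" -alpha remove OUTFILE.png
$ convert INFILE.pdf -background "rgb(123,104,238)" -alpha remove OUTFILE.png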

While convert produces a new file from an existing file, if you already have PNG files that need conversion to a colored background, replace the convert command with the mogrify(1) command. Mogrify adjusts an existing file.
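
Here’s a sketch, assuming the PNG files in the current directory are the ones to flatten onto a white background (remember that mogrify changes them in place, so work on copies):

$ mogrify -background white -alpha remove *.png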

Shrink or Grow an Image

Any images you have may be perfect for your needs as they are, but sometimes you need to increase or decrease their sizes. If it’s just not the right size, make it so.

Copy an example image file to a temporary location so you can experiment without danger of destroying your original, or play with ImageMagick’s built-in test image by using logo: as the file name. A large version of this logo: image is online.

Want the image 25% bigger in each dimension?

$ identify logo:
logo:=>LOGO GIF 640x480 640x480+0+0 8-bit sRGB 256c 28.6KB 0.000u 0:00.000

$ convert logo: -resize 125% t.png

$ identify t.png
t.png PNG 800x600 800x600+0+0 8-bit sRGB 143KB 0.000u 0:00.000

How about half the size in each dimension?

$ convert logo: -resize 50% t.jpg

$ identify t.jpg
t.jpg JPEG 320x240 320x240+0+0 8-bit sRGB 20.7KB 0.000u 0:00.000

These examples specify a scaling percentage for the -resize argument. Using a single percentage maintains the aspect ratio: the relation of the width to the height. Remember that shrinking each dimension by 50% makes the image one quarter its original size.

The argument could scale the Width (x-axis) and Height (y-axis) separately using the format:

-resize WxH%

Keep in mind that when scaling the two separately, signified by the % at the end, the W and H numbers are adjusted distinctly, possibly fouling the aspect ratio, but you get what you asked for.
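
For example, this sketch scales the width by 50% and the height by 25%, turning the 640×480 logo: image into a 320×120 one:

$ convert logo: -resize 50x25% t.jpg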

Specify pixel width and height by using WxH without a % symbol. Give only a W pixel number to set a new width, and the height adjusts to preserve the aspect ratio. Use an xH pixel number with a leading x to set a new height, with the width adjusting the same way. If you specify both using WxH, convert still tries to keep the aspect ratio: it treats the numbers as a bounding box and adjusts one of the dimensions to make the image fit. Take the 640×480 (AR=4:3) logo: image. You could change it to 400×300 because that preserves the aspect ratio. But, say you choose to reverse it by using 300×400 instead:

$ convert logo: -resize 300x400 t.jpg

$ identify t.jpg
t.jpg JPEG 300x225 300x225+0+0 8-bit sRGB 18.6KB 0.000u 0:00.000

It tried to keep your 300 width and it forced the height to 225 making a 4:3 ratio. If you purposely insist on violating the aspect ratio, force it by adding an exclamation mark (!):

$ convert logo: -resize 300x400! t.jpg

$ identify t.jpg
t.jpg JPEG 300x400 300x400+0+0 8-bit sRGB 28KB 0.000u 0:00.000

That image might look a bit weird, but you got what you asked for.

Extract Images From PDF

Poppler utilities include pdfimages(1) to extract images from a PDF and pdftotext(1) to extract the text from a PDF.

Use pdfimages to extract images in several formats, such as PNG, JPG, TIFF, and others. Multiple format options can be specified, if needed, but it’s often most useful to get everything in one format. To see what kinds of images exist in a PDF, use the -list option:

$ pdfimages -list TestFile2.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     421    53  rgb     3   8  jpeg   no         4  0    97    98 6402B 9.6%
   1     1 smask     421    53  gray    1   8  image  no         4  0    97    98   47B 0.2%
   2     2 image     421    53  rgb     3   8  jpeg   no         4  0    97    98 6402B 9.6%
   2     3 smask     421    53  gray    1   8  image  no         4  0    97    98   47B 0.2%
   3     4 image     421    53  rgb     3   8  jpeg   no         4  0    97    98 6402B 9.6%
   3     5 smask     421    53  gray    1   8  image  no         4  0    97    98   47B 0.2%
   4     6 image     421    53  rgb     3   8  jpeg   no         4  0    97    98 6402B 9.6%
   4     7 smask     421    53  gray    1   8  image  no         4  0    97    98   47B 0.2%
   5     8 image     421    53  rgb     3   8  jpeg   no         4  0    97    98 6402B 9.6%
   5     9 smask     421    53  gray    1   8  image  no         4  0    97    98   47B 0.2%
   6    10 image     421    53  rgb     3   8  jpeg   no         4  0    97    98 6402B 9.6%
   6    11 smask     421    53  gray    1   8  image  no         4  0    97    98   47B 0.2%
   7    12 image     421    53  rgb     3   8  jpeg   no         4  0    97    98 6402B 9.6%
   7    13 smask     421    53  gray    1   8  image  no         4  0    97    98   47B 0.2%
   8    14 image     421    53  rgb     3   8  jpeg   no         4  0    97    98 6402B 9.6%
   8    15 smask     421    53  gray    1   8  image  no         4  0    97    98   47B 0.2%
   9    16 image     421    53  rgb     3   8  jpeg   no         4  0    97    98 6402B 9.6%
   9    17 smask     421    53  gray    1   8  image  no         4  0    97    98   47B 0.2%
  10    18 image     421    53  rgb     3   8  jpeg   no         4  0    97    98 6402B 9.6%
  10    19 smask     421    53  gray    1   8  image  no         4  0    97    98   47B 0.2%
  11    20 image     421    53  rgb     3   8  jpeg   no         4  0    97    98 6402B 9.6%
  11    21 smask     421    53  gray    1   8  image  no         4  0    97    98   47B 0.2%
  12    22 image     421    53  rgb     3   8  jpeg   no         4  0    97    98 6402B 9.6%
  12    23 smask     421    53  gray    1   8  image  no         4  0    97    98   47B 0.2%

This shows the page number, size, color type, and encoding type along with other info for each image found.

To extract each format according to its encoding type, use the -all option:

$ pdfimages -all TestFile2.pdf TestFile2

$ ls -sh TestFile2-*
8.0K TestFile2-000.jpg  8.0K TestFile2-008.jpg  8.0K TestFile2-016.jpg
4.0K TestFile2-001.png  4.0K TestFile2-009.png  4.0K TestFile2-017.png
8.0K TestFile2-002.jpg  8.0K TestFile2-010.jpg  8.0K TestFile2-018.jpg
4.0K TestFile2-003.png  4.0K TestFile2-011.png  4.0K TestFile2-019.png
8.0K TestFile2-004.jpg  8.0K TestFile2-012.jpg  8.0K TestFile2-020.jpg
4.0K TestFile2-005.png  4.0K TestFile2-013.png  4.0K TestFile2-021.png
8.0K TestFile2-006.jpg  8.0K TestFile2-014.jpg  8.0K TestFile2-022.jpg
4.0K TestFile2-007.png  4.0K TestFile2-015.png  4.0K TestFile2-023.png

The last argument is the basename pdfimages uses as a prefix when creating the extracted files. The filename format PREFIX-###.SUFFIX then uses your prefix text, a hyphen, a three-digit image number starting with 000, and then a dot and a suffix that matches the image encoding type, such as .jpg, .png, .tif, and so on.

In the -all output, images listed as rgb color and jpeg type became JPG files while the soft masks listed as gray color (grayscale) and image (opaque) type became PNG files.

If you want everything to become a specific encoding type, such as PNG, use a specific option for that type, such as -png for PNG files, or -j for JPG files.

$ pdfimages -j TestFile2.pdf TestFile2

$ ls -sh TestFile2-*
8.0K TestFile2-000.jpg  8.0K TestFile2-008.jpg  8.0K TestFile2-016.jpg
 68K TestFile2-001.ppm   68K TestFile2-009.ppm   68K TestFile2-017.ppm
8.0K TestFile2-002.jpg  8.0K TestFile2-010.jpg  8.0K TestFile2-018.jpg
 68K TestFile2-003.ppm   68K TestFile2-011.ppm   68K TestFile2-019.ppm
8.0K TestFile2-004.jpg  8.0K TestFile2-012.jpg  8.0K TestFile2-020.jpg
 68K TestFile2-005.ppm   68K TestFile2-013.ppm   68K TestFile2-021.ppm
8.0K TestFile2-006.jpg  8.0K TestFile2-014.jpg  8.0K TestFile2-022.jpg
 68K TestFile2-007.ppm   68K TestFile2-015.ppm   68K TestFile2-023.ppm

Using the -j option delivered the internal JPG images as the requested JPG type, but soft masks became PPM (Portable PixMap) files.

$ pdfimages -png TestFile2.pdf TestFile2

$ ls -sh TestFile2-*
 20K TestFile2-000.png   20K TestFile2-008.png   20K TestFile2-016.png
4.0K TestFile2-001.png  4.0K TestFile2-009.png  4.0K TestFile2-017.png
 20K TestFile2-002.png   20K TestFile2-010.png   20K TestFile2-018.png
4.0K TestFile2-003.png  4.0K TestFile2-011.png  4.0K TestFile2-019.png
 20K TestFile2-004.png   20K TestFile2-012.png   20K TestFile2-020.png
4.0K TestFile2-005.png  4.0K TestFile2-013.png  4.0K TestFile2-021.png
 20K TestFile2-006.png   20K TestFile2-014.png   20K TestFile2-022.png
4.0K TestFile2-007.png  4.0K TestFile2-015.png  4.0K TestFile2-023.png

Using the -png option forced everything to transform into a PNG file, images and soft masks alike.

Extracting Text From PDF

Another Poppler utility, pdftotext(1), extracts only the text, not the images. There are many options for selecting the text, but focus here on three simple invocations:

$ pdftotext TestFile2.pdf t-nooption.txt

$ pdftotext -layout TestFile2.pdf t-layout.txt

$ pdftotext -layout -htmlmeta TestFile2.pdf t-layout.html

$ ls -1shrt t-*
16K t-nooption.txt
16K t-layout.txt
20K t-layout.html

If no output file is given after naming the PDF to extract from, pdftotext uses the input file’s basename (the name without the suffix) as the output file and appends a .txt suffix. A -listenc option lists the text encodings pdftotext can work with.

Without any options, the output (the file named t-nooption.txt here) shows the text with little formatting. If pdftotext finds any unusual characters, such as bullets, it splits the surrounding text onto separate lines, possibly with double-spacing between them. Paragraph text appears as clearly as possible according to its internal formatting.

Using the -layout option repairs much of the formatting, putting text on the same line as its bullet and maintaining indentation levels.

Adding the -htmlmeta option to the -layout option produces HTML encoding with the text, making the output file a bit bigger. Open that file in your favorite web browser to see a more readable layout.

From one of these formats, the variety of text analysis tools available makes these results easier to manipulate. For example, I wrote a script that took a set of PDF files, used pdftotext to extract the portion of text in the form of tables, then cleaned up the text and reformatted the data with several Linux tools (egrep, tr, sed, and gawk) to ease the import into Excel for further processing. Once a month I run that script with a wildcard to deliver all the PDF files at once to the script and, in a fraction of a second, it converts all their embedded text tables into easily imported files. Better time spent analyzing the quickly imported data than tediously copying part by part from the displayed PDF into the spreadsheet.
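
The exact commands depend on the layout of the PDFs, but a minimal sketch of that kind of pipeline (with an entirely hypothetical filter pattern) might look like this:

for f in *.pdf
do
  # Extract the text with its layout preserved, keep only lines containing
  # digits (a stand-in for a real table-row pattern), and squeeze each run
  # of spaces into a single comma for easy spreadsheet import.
  pdftotext -layout "$f" - | grep -E '[0-9]' | tr -s ' ' ',' > "${f%.pdf}.csv"
done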

Convert and Combine Images to PDF

As already shown, convert and mogrify can turn PDFs into images. Both can go in the opposite direction. Convert writes its result to a separate file:

$ convert IMAGENAME.png DOCNAME.pdf

Mogrify normally overwrites existing files in place, but with the -format option it writes new files carrying the new suffix. For example, to convert multiple PNG files into separate PDFs, one for each PNG:

$ mogrify -format pdf -- *.png

Now you have a bunch of separate single-page PDFs. With sequence numbering in the file names, wildcards put them in the right order. The next step combines them into a multipage PDF.

Convert uses Ghostscript to do some of its PDF work. As mentioned before, versions for Mac and Windows exist. The Linux version of Ghostscript is named gs(1). It is a PDF viewer and processor with lots of options to handle a variety of operations and output device formats. For combining multiple PDFs into one, use the command:

$ gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=OUTFILE INFILES

The -dBATCH option is not documented in the man page; an unfortunate oversight. Instead, use gs -h or gs --help to see it. When using gs interactively, it produces many notices and prompts. Tell it to just get to work with this -dBATCH option. Using the -dNOPAUSE option tells it to continue with the next page instead of pausing when each page is done. A -q tells it to be quiet: do not output a variety of messages as it does the work. These three options remove gs’s attempts to interact with the user.

The -sDEVICE=pdfwrite option selects which of many device definitions to use. Again, use gs -h to see the list of devices. It is impressive, including many printer hardware models. For this conversion, use the pdfwrite device. The -sOutputFile=OUTFILE option selects the output filename. Replace OUTFILE with the actual filename, including the .pdf suffix. Finally, replace INFILES with the space-separated list of PDF files you want it to combine. For example:

$ gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=t-gs.pdf TestFile2-*.pdf

Given the set of PDF files previously converted from PNG files, delivered in their two-digit numeric order by the wildcard, gs combines them into one PDF file. If you submitted the result to pdftotext, it would find no text because every page came from an image.

Now you can take apart or rebuild PDFs from other formats.
