Show Biggest Files

Consider it a corollary of Parkinson’s Law. With large, cheap storage you shouldn’t run low on file storage any more, but as space increased, files got larger.

ISO disk images. Filesystem backups. Video.

Even terabyte capacity drives — thousands of gigabytes each — can run low on space.

If an alert comes from the system that storage is getting low, what’s your first step? Try finding the biggest files and see if you can get rid of them. Delete them if you don’t need them. Move them off the main drive if you must keep them.

Finding them. That’s the problem.

Manual Searching

A server system administrator with multiple users will probably have to start searching at the /home directory to see the biggest users’ files. In that case search with root privileges using the sudo prefix, then notify the user which files might need offloading.

If it’s your own system, you’re the only user on it. You typically store everything in your home directory. Start there.

Here’s the general method, using du(1):

1. du -sh DIR/* | sort -h | tail -5

This shows a summarized total (-s) in human-readable format (-h), with suffixes such as K, M, G (Kilo, Mega, Giga), for each file and subdirectory in the chosen DIR (name the directory). Then sort in ascending order by the human-readable (-h) sizes. Show only the biggest 5 at the end. Change the tail argument number to see more files. Remove the tail pipe to see all. For example:

$ du -sh ~/* | sort -h | tail -5
12G     /home/YOU/Documents
13G     /home/YOU/Pictures
25G     /home/YOU/Downloads
27G     /home/YOU/Music
38G     /home/YOU/work

All these are subdirectories. It makes no sense to move these directories off the system. Moving away some files in them might benefit system space. Run the command again on the largest directory:

2. du -sh DIR1/DIR2/* | sort -h | tail -5

With DIR1 being the one you used previously, such as your home dir (~), look inside DIR2, the biggest one to focus on now, and see which are its biggest 5. For example:

$ du -sh ~/work/* | sort -h | tail -5
588K    /home/YOU/work/a
527M    /home/YOU/work/b
970M    /home/YOU/work/c
5.7G    /home/YOU/work/d
31G     /home/YOU/work/e

Obviously, I don’t use these single-letter names for subdirectories in my work directory. Their names were changed to protect the guilty.

Subdirectory “e” is the biggest. After paring “e”, you might gain something more by checking subdirectory “d”. First, check the biggest:

3. du -sh DIR1/DIR2/DIR3/* | sort -h | tail -5

Add DIR3, the largest found, and see what’s biggest there:

$ du -sh ~/work/e/* | sort -h | tail -5
4.9M    /home/YOU/work/e/aa
81M     /home/YOU/work/e/ab
386M    /home/YOU/work/e/ac
1.3G    /home/YOU/work/e/ad
29G     /home/YOU/work/e/ae

Again, names were changed. Directory “ae” is the biggest.

Keep going like this, but another approach is to stop using the summary (-s) option at this stage and use the all (-a) option. This typically gives a longer list by identifying each file’s disk usage.

4. du -ah DIR1/DIR2/DIR3/DIR4 | sort -h | tail -15

This time, no wildcard needed. The all option gives lots more files because it recurses through the subdirectories.

As usual, the biggest entries appear at the bottom of the sorted list. The largest subdirectory comes last, and as you look upward from it you’ll see the subdirectories and files inside it that account for most of its usage. From there you can find the subdirectory or files in it that take up the most space and decide whether you can afford to move them off the system to reclaim space.

Use the -ah technique to narrow down your search into two steps:

5. du -ah DIR1 | sort -h | tail -15

This finds the biggest throughout DIR1, but you’ll have to eyeball upward on the 15 biggest to see which sub-sub-subdir might need your attention. Then, replace DIR1 with DIR1/DIR2/DIR3/etc to track to the one needing most attention.

After cleaning up the worst offender, run this again for the next major directory needing attention.
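The repeated du, sort, tail step can be wrapped in a small shell function. This is only a sketch: the name top_usage and its two arguments are made up for illustration, and it assumes GNU du and sort, which support the -h option.

```bash
#!/usr/bin/env bash
# top_usage DIR [COUNT]: hypothetical helper wrapping the manual step above.
# Lists the COUNT (default 5) largest entries directly inside DIR, smallest first.
top_usage () {
    local dir="${1:-.}" count="${2:-5}"

    # Quote the directory, but leave the wildcard outside the quotes so it expands.
    du -sh -- "${dir}"/* 2>/dev/null | sort -h | tail -n "${count}"
}
```

Run it as top_usage ~/work, then repeat with the largest subdirectory it reports, such as top_usage ~/work/e 10.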

Automated Searching

Having to repeat this several times to get the largest files, wherever they are, becomes:

Tedious.

Instead, write a program that does all this work. While it runs, you could do something else. Call it bigspace because it shows you the files using the biggest space.

Here are its characteristics:

  1. Search either all mounted filesystems or only directories specified.
    1. Give no names on the command line to see all mounted filesystems.
    2. Name directories on the command line to see those directories.
  2. Default to see the 5 largest files. Use option -n to change it.
  3. Add a help option -h.

Knowing these requirements, define at the top of the program the most important global variables:

#
# Globals
# 

progname="$(basename "$0")"
numfiles=5                  # Qty of big files to show

Set the progname variable for error messages. The bash command line name ($0) holds the script’s name as it was typed, often with a directory path in front. Reduce it to its basename by removing the directory component. Give the numfiles variable a default setting for how many files to display.

Clarify Your Thoughts

When writing a new program, I explain my original plans in the comments at the top and write a usage() function to identify the options and summarize the purpose.

usage () {
    echo "Usage: $0 [-n numfiles] [dirs]"                               1>&2
    echo                                                                1>&2
    echo "Lists local file systems in order of most space usage."       1>&2
    echo "Within each file system, show the ${numfiles} biggest files." 1>&2
    echo "The -n option changes the number of files. Naming one or "    1>&2
    echo "more dirs on the command line will list only those dirs."     1>&2
    echo "Otherwise search the local file system mounts."               1>&2
}

Notice each echo line ends with 1>&2. That writes the usage message to the standard error output.

NOTE: Recall that file descriptor 0=stdin (keyboard), 1=stdout (screen), and 2=stderr (screen), using the common C language abbreviations.

These are redirectable. Regular data input (0) typically comes from stdin, typically the keyboard, and is often redirected from a program or a file. Regular data output (1) goes to stdout, typically the terminal screen, and is often redirected to another program or to a file. Error data output (2), also typically goes to the screen. It is separately identified as stderr to avoid error or warning messages from getting mixed up with data output.

Always send your own errors, warnings, and other advisories to stderr. The “<” redirects stdin and the “>” redirects stdout. Modifications using additional symbols change the way redirection works. Using “>&” allows a file descriptor to redirect to another file descriptor. Notation “1>&2” redirects echo‘s default output from stdout to stderr.
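A short experiment shows the two output streams staying separate. The function names warn and report are invented for this sketch; only the redirection notation comes from the text above.

```bash
#!/usr/bin/env bash
# warn: hypothetical helper that sends its message to stderr, like usage() does.
warn () {
    echo "warning: $*" 1>&2
}

# report: writes one data line to stdout and one advisory to stderr.
report () {
    echo "data line"
    warn "disk almost full"
}

# Capture stdout only; stderr is discarded here to keep the demo output clean.
captured="$(report 2>/dev/null)"
echo "stdout held: ${captured}"

# Capture stderr only: point fd 2 at the capture first, then discard fd 1.
errors="$(report 2>&1 1>/dev/null)"
echo "stderr held: ${errors}"
```

Because the warning never travels on stdout, a pipeline or file receiving the data line is not polluted by it.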

Getting Command Line Options

Using command line options to change the program’s behavior or to run special variations is so common that bash provides a way to parse and deliver them. It’s called getopts. Trying man getopts shows the bash_builtins(1) man page taken directly from the bash(1) man page. You can read more about it either way.

Bigspace doesn’t need all the getopts capabilities. A simple explanation of features used will suffice.

while getopts ":hn:" option
do
    case $option in
        n)
            numfiles="${OPTARG}"             
            ;;

        h)
            usage
            exit 1
            ;;

        ?)
            echo "${progname}: Bad Option -${OPTARG}!" 1>&2
            usage
            exit 1
            ;;
    esac
done
shift $(( ${OPTIND} - 1 ))

With more than one option, put getopts into a loop lasting as long as there are more arguments. Before bash starts a script, it parses the command line and sets some variables.

Each getopts call takes two arguments:

  1. A list of the one-character options to recognize
  2. A variable name to hold the next available option

Each call to getopts puts the next available option in the variable. When out of options, getopts returns a nonzero exit code. The while test treats that as false and quits the loop.

A case statement is a multitest. If the ${option} value matches one of the following patterns (here each pattern is one character, with a closing parenthesis marking its end), skip to the code after the parenthesis and execute everything up to the double-semicolon (;;). Bash does not continue into the next pattern’s code on its own; leaving out the double-semicolon before another pattern is a syntax error. To fall through on purpose, end the code with ;& instead of ;; (bash 4 and later). That can be useful.
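To see how case clauses end, here is a small sketch. The classify function is hypothetical, and the ;& terminator it uses assumes bash 4 or later; bigspace itself only needs the plain ;; form.

```bash
#!/usr/bin/env bash
# classify: hypothetical demo of the case clause terminators.
classify () {
    case "$1" in
        a)
            echo "matched a"
            ;;                  # stop here and leave the case
        b)
            echo "matched b"
            ;&                  # bash 4+: continue into the next clause's code
        c)
            echo "running code for c"
            ;;
        *)
            echo "no match"
            ;;
    esac
}

classify b
```

Calling classify b prints both "matched b" and "running code for c", while classify a prints only "matched a".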

Notice the quoted characters after the getopts command. These characters identify what characters to look for on the command line.

Skip over the first colon (:) for a moment. More about that colon soon. The rest of the quoted option list can come in any order.

Putting a colon (:) after a one-character option, such as “n” (number of files to show), tells getopts that the option must have additional data. Additional data can come in the same argument, such as “-n12”, or in the next argument, such as “-n 12”. Getopts puts that data into the ${OPTARG} variable.

Either way, the case code for “n” gets it from ${OPTARG} and puts it into the numfiles variable. This case code does not quit the program so the while loop continues with the next argument if there is one.

Operations needing no extra data, such as “h” (help), need no colon after it. When it shows up, the case code calls the usage() function and quits the script with an exit code of 1, which shell software can detect as an error or unsuccessful run of bigspace.

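What about that colon in the front of the getopts character list? Finding an error, getopts has its own error messages. Put a colon in front of the first quoted option character and getopts will not output its own error message when a bad option is found. Instead, getopts puts a question mark in the variable ${option} and the offending character in ${OPTARG}. An option that is missing its required data yields a colon (:) instead, which the single-character ? pattern in the case also matches.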

Have a case test for this. When the case detects that “?”, report the error. This error message format identifies which program had the error and what the error was ("Bad Option"), and shows what bad character was used with its leading hyphen. Give a usage() reminder and quit with an error code signalling failure: any nonzero number.

Why show the program name in the error message? Run multiple programs in a pipeline or inside a script. Which gave the error? It’s not always obvious. Make it obvious by identifying the program’s name.
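The silent error mode can be tried out in isolation. The following is a sketch, not bigspace itself: parse_opts and its messages are invented, and it separates the missing-data case (:) from the unknown-option case (\?) where bigspace handles both with one pattern.

```bash
#!/usr/bin/env bash
# parse_opts: hypothetical function showing getopts in silent mode (leading colon).
parse_opts () {
    local OPTIND=1 option count=5   # a local OPTIND lets the function run repeatedly

    while getopts ":hn:" option
    do
        case "${option}" in
            n)  count="${OPTARG}" ;;
            h)  echo "help requested"; return 0 ;;
            :)  echo "option -${OPTARG} needs a value" 1>&2; return 2 ;;
            \?) echo "bad option -${OPTARG}" 1>&2; return 2 ;;
        esac
    done
    echo "count=${count}"
}
```

Try parse_opts -n 12, parse_opts -n12, parse_opts -x, and parse_opts -n to see each branch. In every error case ${OPTARG} holds the option character that caused the trouble.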

Starting with the first call to getopts in the while loop, the total number of command line arguments is known. An option index ${OPTIND} has the index of the next argument on the command line. After the last option and its argument runs, getopts quits. ${OPTIND} shows how many arguments were consumed already. Some arguments may remain on the command line.

The typical command line format is:

programname [hyphenated-options] [arguments]

NOTE: Brackets in command line examples mean that command line component is optional. Don’t type the brackets.

You don’t have to have hyphenated options in all programs, but there could be many. Hyphenated options may themselves have arguments, which should come after the option itself. After the last hyphenated option is the set of remaining arguments. There could be none, one, or many.

When the getopts while loop is done, ${OPTIND} has the next argument’s index number. The program name is in $0, but that’s not part of the total command line argument count. Bash does not consider it an argument — it’s the command.

For example, try running bigspace with this 4-argument command line:

bigspace -n 10 Downloads Videos

When the loop finishes, ${OPTIND} would be 3.

This is because -n is argument number 1 ($1), "10" is argument number 2 ($2), and the name in the next argument would be 3 ($3). When the while getopts loop collects all the option arguments, remaining arguments are the directory names bigspace must look in.

How many arguments in that command line would have to be removed for $1 to become "Downloads" and $2 to become "Videos"?

Two, which is ${OPTIND} minus 1. When the loop is finished, use shift (another bash builtin) to move the command line as many places to the left as the following arithmetic expression says:

$(( ${OPTIND} - 1 ))

That destroys the options already consumed and puts the remaining arguments, the directory or filenames, into $1, $2, etc, ready for the next operation. With no remaining arguments, when shifted, the argument count ends up zero.
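The index arithmetic is easy to check with a throwaway function. This sketch uses a made-up name, remaining_args, and recognizes only -n; it assumes nothing about bigspace beyond the getopts and shift behavior described above.

```bash
#!/usr/bin/env bash
# remaining_args: hypothetical function that consumes -n and reports what is left.
remaining_args () {
    local OPTIND=1 option

    while getopts ":n:" option
    do
        :                       # option values are ignored; only the index matters here
    done
    echo "OPTIND=${OPTIND}"
    shift $(( OPTIND - 1 ))     # drop the arguments getopts already consumed
    echo "left=$# first=${1:-none}"
}

remaining_args -n 10 Downloads Videos
```

For the four-argument example this prints OPTIND=3, then left=2 first=Downloads. With no arguments at all, ${OPTIND} stays at 1, the shift moves nothing, and the count is zero.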

Collecting Filesystems

Assume those remaining command line names are directories or filenames. Search directories for the largest files.

With no names given, bigspace must discover the mounted filesystems, then search them for their largest files. Before starting the filesystem search, start the filesystem discovery.

Linux systems may have lots of mounts. Use the mount(8) command without arguments to see them all. A sample on a Qubes VM running Fedora 26 showed 36 mount points. Not all are needed for this. Removing temporary filesystems (tmpfs) and maybe some others is easy. Why use a big list when a smaller list is available?

Instead, df(1) not only gives the most significant of these, but shows each one’s usage size. That helps narrow the search objective:

$ df
Filesystem          1K-blocks     Used Available Use% Mounted on
/dev/mapper/dmroot   20511356 10337224   9193068  53% /
/dev/xvdd              487652   228463    233589  50% /usr/lib/modules/4.9.35-20.pvops.qubes.x86_64
devtmpfs               149536        0    149536   0% /dev
tmpfs                 1048576        0   1048576   0% /dev/shm
tmpfs                  156464      608    155856   1% /run
tmpfs                  156464        0    156464   0% /sys/fs/cgroup
tmpfs                 1048576       12   1048564   1% /tmp
tmpfs                   31292       12     31280   1% /run/user/1000

A standalone Fedora 26 system similarly narrowed to this:

$ df
Filesystem              1K-blocks      Used Available Use% Mounted on
devtmpfs                  4062408         0   4062408   0% /dev
tmpfs                     4074968         0   4074968   0% /dev/shm
tmpfs                     4074968      1972   4072996   1% /run
tmpfs                     4074968         0   4074968   0% /sys/fs/cgroup
/dev/mapper/fedora-root  51475068  37511280  11325964  77% /
tmpfs                     4074968       308   4074660   1% /tmp
/dev/sda1                  487652    160455    297501  36% /boot
/dev/mapper/fedora-home 420332168 141840564 257116836  36% /home
tmpfs                      814992        16    814976   1% /run/user/42
tmpfs                      814992        68    814924   1% /run/user/1000

If your mountpoints include remote systems, add the -l option to restrict the list to local mountpoints.

Not all these columns help bigspace requirements. Only care about real physical mountpoints, not temporary filesystems. Notice the real ones reside on a /dev filesystem, as shown in the first column. The last two columns list space usage as a percent and the mountpoint path. No need for the columns in between.

Rather than isolating columns with awk(1), use df’s --output option to select them. Its man or info(1) pages show the field names used for --output, which aren’t the same as the output column headings. You can also get the field names from df --help. Fields needed: “source”, “target”, and “pcent”.

Here is a sample from a Fedora 26 system’s filesystems:

$ df --output=source,target,pcent
Filesystem              Mounted on     Use%
devtmpfs                /dev             0% 
tmpfs                   /dev/shm         0%
tmpfs                   /run             1%
tmpfs                   /sys/fs/cgroup   0%
/dev/mapper/fedora-root /               77%
tmpfs                   /tmp             1%
/dev/sda1               /boot           36%
/dev/mapper/fedora-home /home           36%
tmpfs                   /run/user/42     1%
tmpfs                   /run/user/1000   1%

Use the slash ‘/‘ at the beginning of each line to isolate the real mountpoint entries:

$ df --output=source,target,pcent | egrep '^/'
/dev/mapper/fedora-root /               77%
/dev/sda1               /boot           36%
/dev/mapper/fedora-home /home           36%

The extra spacing is there for readable column alignment. For easy parsing, reduce each run of spaces to a single space with tr’s squeeze option (-s):

$ df --output=source,target,pcent | egrep '^/' | tr -s ' '
/dev/mapper/fedora-root / 77%
/dev/sda1 /boot 36%
/dev/mapper/fedora-home /home 36%

Isolate the two columns needed, just the 2nd and 3rd fields:

$ df --output=source,target,pcent | egrep '^/' | tr -s ' ' | cut -d' ' -f2-
/ 77%
/boot 36%
/home 36%

Sort by their fullness percentage, putting the biggest usage first. That requires reverse numeric sorting the second field:

$ df --output=source,target,pcent |
  egrep '^/' |
  tr -s ' ' |
  cut -d' ' -f2- |
  sort -rn -k2,2
/ 77%
/home 36%
/boot 36%

Remember: bash automatically continues command lines broken at the pipe symbol.

This example’s df output was already close to sorted order; not all systems’ output will be. Bigspace should put the largest first.

Finally, keep mountpoint pathnames in the first field. Size no longer matters:

$ df --output=source,target,pcent |
  egrep '^/' |
  tr -s ' ' |
  cut -d' ' -f2- |
  sort -rn -k2,2 |
  cut -d' ' -f1
/
/home
/boot
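The stages after df can be tested without touching a real system by feeding them canned text. This is a sketch under a few assumptions: rank_mounts is an invented name, plain grep stands in for egrep because the pattern needs no extended syntax, and the sample percentages are altered so there are no ties.

```bash
#!/usr/bin/env bash
# rank_mounts: hypothetical filter holding the pipeline stages that follow df.
# Reads "source target pcent" lines on stdin; prints mountpoints, fullest first.
rank_mounts () {
    grep '^/' |                 # keep lines whose source starts with a slash
        tr -s ' ' |             # squeeze runs of spaces down to one
        cut -d' ' -f2- |        # drop the source column
        sort -rn -k2,2 |        # fullest first
        cut -d' ' -f1           # keep only the mountpoint
}

# Canned input modeled on the sample output above, so no real df is needed.
rank_mounts <<'EOF'
Filesystem              Mounted on     Use%
tmpfs                   /run             1%
/dev/sda1               /boot           36%
/dev/mapper/fedora-root /               77%
/dev/mapper/fedora-home /home           41%
EOF
```

On a live system the same filter would sit behind df --output=source,target,pcent. Mountpoints with spaces in their names would confuse the space-delimited cut, which neither this sketch nor the pipeline above guards against.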

Array Filling

Assign this filesystem name output to an array:

  • Run the pipeline in a subshell.
  • Put the subshell output in parentheses to make an indexed array.
  • Assign the indexed array to the big_fs variable.

That same variable takes the subdirectories list from the command line:

if (( $# > 0 ))
then
    big_fs=( "$@" )
else
    big_fs=(
        $(df --output=source,target,pcent |
            egrep '^/' |
            tr -s ' ' |
            cut -d' ' -f2- |
            sort -rn -k2,2 |
            cut -d' ' -f1
        )
    )
fi

Check whether any arguments are on the command line using the bash argument count notation. The special bash variable, $#, holds the number of command line arguments, not including the program’s name.

If there are any command line arguments, the list is expanded using either $* or $@. The difference between them comes if any arguments were quoted to preserve spaces. Use a quoted "$@" to maintain those spaces in their arguments. Otherwise, the $* and $@ are the same. Bash expands each argument into space-separated values.

Without command line arguments, assign the big_fs[] indexed array from the df subshell output.
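The difference between the two expansions shows up as soon as an argument contains a space. In this sketch fill_demo is a hypothetical function that fills one array each way and counts the elements.

```bash
#!/usr/bin/env bash
# fill_demo: hypothetical function comparing "$@" with an unquoted $*.
fill_demo () {
    local quoted=( "$@" )       # one element per original argument
    local unquoted=( $* )       # arguments are split again at every space
    echo "quoted=${#quoted[@]} unquoted=${#unquoted[@]}"
}

fill_demo "My Videos" Downloads
```

The call prints quoted=2 unquoted=3, because the unquoted form breaks "My Videos" into two separate names.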

Big File Hunt

With the array settled, search the filesystem list in it. The search is a separate function, written near the top of the script where bash learns the functions before they’re used.

for fs in "${big_fs[@]}"
do
    if [[ "${fs}" == "/" ]]
    then
        excl=$(echo "${big_fs[@]}" |
          sed -e 's,/ \| / \|/$\|^/$, ,g' -e 's,^/\| /, --exclude=/,g')
    fi
    echo "${fs}:"
    bigdir "${fs}" | sort -rh | head "-${numfiles}"
    echo

    unset excl
done

This loop:

  1. Sets the fs variable one entry at a time from the "${big_fs[@]}" list.
  2. Passes the entry to the search code.
  3. Sorts search results in reverse order by human-readable size.
  4. Displays the results showing only the quantity requested.
  5. Cycles until finished.

But, there’s some special handling for the root (/) mountpoint. Before understanding what that sed(1) mess is, skip to see how the bigdir() function works first. This explanation comes later. Wait for it!

Don’t Curse When You Can Recurse

Here’s the bigdir() function, so named because it searches for big files through whole dir trees.

bigdir () {
    local dir prevIFS

    for dir
    do
        if [[ -f "${dir}" ]]
        then
            if [[ -s "${dir}" ]]
            then
                du -h "${dir}" 2>/dev/null
            fi
            continue 
        fi

        if [[ "${dir}" == "/" ]]
        then
            dir=""                          # Prevent // notation
        elif [[ "${excl}" =~ "${dir}" ]]
        then
            continue                        # Skip dirs to be done later
        fi

        # Isolate largest usage
        prevIFS="${IFS}"
        export IFS=$'\n'                    # Preserve spaces in filenames
        biggest=(
            $(du -s ${excl} ${dir}/* 2>/dev/null |
                sort -rn |
                head "-${numfiles}" |
                cut -f2
            )
        )
        export IFS="${prevIFS}"
        bigdir "${biggest[@]}" # Recurse
    done
}

Bigdir() does all the work to find large files, but that work requires traversing directories from parent to child to grandchild and so on. To handle that, it detects the difference between a file and a directory:

  • Give it a file, it shows the file’s space usage and moves on.
  • Give it a directory, it collects and examines the directory’s contents.

Directories can have subdirectories and files. Bigdir() must collect the files and descend into the subdirectories to find more files. This repetition suggests using a recursive function — a function that calls itself.

Consider the pathname, /a/b/c/d. The initial / is the root directory, which can contain files and directories. The a, b, and c directories may each have files or more directories in them, but only those directories count for now. In the c directory is a file or directory named d. Thus, root (/) is the (ultimate) parent. Its child is a, which has a child named b, which has a child named c, which has a child named d. Whatever it takes to look through /, the same technique applies to a, b, c, and d.

Recursive functions abbreviate code, but that trades off against extra memory to hold the data the parent needs while the child runs separately. If the child has a grandchild, more memory must hold the child’s data while the grandchild runs. This keeps going until there are no more subdirectories. Recursion can be difficult to understand. Sometimes unwinding the recursion into a loop — trade off more code to use less data memory — can be just as confusing.

Going through this recursive code carefully should reduce confusion.

The bigdir() function starts by defining two local variables. All bash variables are global by default, so any part of the program can use them, or clobber them, unless the variable is explicitly defined as local.

Global variables are created when they are first assigned, and any part of the program can change them. If one function stores a value in a global variable, another function can read that variable’s value or write a new one to it. If the second function writes to it, the first function loses its value. That might be ok, but it could be a problem.

Local variables are created when a function declares them and destroyed when that function returns. Two functions that each declare a local variable with the same name have separate variables. If a function creates a local variable with the same name as a global variable, the local inside the function hides access to the global. One bash wrinkle: a function’s locals are also visible to any function it calls, unless the called function declares its own local with that name.

Recursive functions may require some parent variables to wait while a child version of that same variable operates. Because each call of the function declares its locals again, the child’s local variables are fresh and separate from the parent’s local variables with the same name. When the child is done it returns to the parent, its local variables are destroyed, and the parent’s variables are visible to the parent again.
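A tiny recursive function makes the parent and child copies visible. The name nest and its depth limit of 3 are arbitrary choices for this sketch.

```bash
#!/usr/bin/env bash
# nest: hypothetical recursive function; each call keeps its own copy of "level".
nest () {
    local level="$1"

    echo "enter ${level}"
    if (( level < 3 ))
    then
        nest $(( level + 1 ))   # the child declares its own "level"
    fi
    echo "leave ${level}"       # the parent's value is intact after the child returns
}

nest 1
```

The output counts up with enter 1, 2, 3 and back down with leave 3, 2, 1. Remove the word local and every "leave" line reports 3, because the deepest child clobbers the single shared variable.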

In bigdir(), only two such local variables need distinction between parent and child: dir and prevIFS.

  • dir is the current directory it’s examining, though it could be a file. The child’s dir variable must not clobber the parent’s dir variable because when the child quits the parent must know where it left off.
  • prevIFS holds the saved value of a bash variable named IFS. A child will change IFS and so it needs its own prevIFS. The child run must not clobber the way the parent uses them.

All other variables, if any, are presumed global and ok to clobber.

Remember that if nothing else uses a variable that a function uses, you’re not clobbering anything. Your script may use more memory than you’d like by having lots of globals hanging around. Globals disappear when the script stops running. No other script has access to them.

The whole bigdir() function is a for loop. It names the loop variable, dir, but has no in component. The in normally precedes a list of values to assign one at a time to the loop variable. Without in, bash takes values from the function’s argument list: the function’s private copy of $1, $2, and so on.
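The short form of the loop can be seen on its own. In this sketch show_each is an invented function; the loop behaves as if it had been written with in "$@".

```bash
#!/usr/bin/env bash
# show_each: hypothetical function; "for name" with no "in" walks the arguments.
show_each () {
    local name

    for name
    do
        echo "[${name}]"
    done
}

show_each /home "/mnt/usb drive" /boot
```

Each argument arrives whole, so the middle one prints as [/mnt/usb drive] with its space intact.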

When bigspace calls bigdir() it passes one directory — the current mountpoint or directory under examination — to it in the bigdir() $1. That mountpoint or directory starts the dir loop.

But, the dir variable changes when the function recurses. Wait for it!

It’s a File

Inside the loop, the first if...then test starts by using bash‘s test. Bash‘s test named with double-brackets [[ ... ]] has extra features the usual test(1) named with single-brackets [ ... ] doesn’t have. Prefer the double-bracket test in bash scripts.

Double-bracket test expressions, like single-bracket tests, must have a space between open and close brackets and the expression.

A -f test checks whether the argument is a regular file. If it’s not, move on to directory handling.

When it’s a regular file, check whether data is in it. The -s tests its size, where zero length is false and nonzero is true. Without data don’t waste time on it. Continue the loop with the next ${dir} entry.

A file with data in it runs the du command. Any du complaints normally go to stderr (file descriptor 2). Redirect complaints to the bit bucket (/dev/null) instead. This du output collects and becomes the bigdir() function’s stdout.

Whether the regular file has data or not, skip the rest of the ${dir} loop and continue with the next iteration.
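The two file tests can be exercised separately from the rest of the function. This sketch uses a hypothetical helper, kind_of; the -d branch is an addition for illustration, since bigdir() does not test for directories explicitly.

```bash
#!/usr/bin/env bash
# kind_of: hypothetical helper mirroring the tests at the top of the loop.
kind_of () {
    if [[ -f "$1" ]]
    then
        if [[ -s "$1" ]]
        then
            echo "file with data"
        else
            echo "empty file"
        fi
    elif [[ -d "$1" ]]          # not in bigdir(); added here for illustration
    then
        echo "directory"
    else
        echo "something else"
    fi
}
```

Try it on an empty file made with touch, on any file with content, and on a directory to see each answer.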

Call bigdir() with only one pathname and that’s the end of it. But, what if bigdir() is called with multiple names?

That happens in the recursion. Wait for it!

It’s a Directory

The next ${dir} test expects directories. If the user forces it to look at something else, such as a device, bigspace will show it uses no space and quit. No harm is done to let it go.

This next test checks whether the directory is root (/). While pathnames may have many slashes, only the root directory is exactly one slash with nothing else in the name.

When ${dir} is the root directory, empty the variable by assigning it a big nothing in quotes. Earlier examples shown in this article — especially the first example — used the notation du DIR/*.

Notice the du -s command in this function’s “Isolate” section. It uses ${dir}/* for the search. What if ${dir} is the root directory? Would you want to write, //* because ${dir} is a /?

Well, you can.

Bash allows double-slash and knows to treat it as a single-slash. So do many applications, such as du. It will know you meant the root directory, but it may propagate the double-slash in the output. That double-slash will look a bit strange, if not outright ugly. Putting literally nothing "" into the dir variable eliminates the ugly.

However, if "${dir}" is not the root directory, see if its name is contained in the exclusion list variable, "${excl}".

The exclusion list ${excl} is built outside the bigdir() function. All recursions of bigdir() need it. If the "${dir}" value is not the root dir, but it is in the exclusion list, skip the du search. Run straight to the loop’s next dir setting. The reason for it will come clear when examining the next du command. Wait for it.

To understand the prevIFS and IFS variable settings, consider the upcoming du command. Wildcards like * expand on the shell’s command line before the command runs. The filenames a wildcard produces stay intact, but the unquoted ${dir} in front of it and the unquoted output of the $(...) substitution are both word-split, so a space in a name turns one filename into two. You can’t quote the whole ${dir}/* expression because quoting hides wildcards from the shell.

NOTE: Spaces in filenames present a problem for software and command line interfaces (CLIs or shells) that use spaces to separate their arguments. This space naming problem became widespread with increasing popularity of Windows.

MS-DOS’s CLI, back in the day, disallowed spaces in filenames, even though it internally permitted them as did many other operating system shells. When Windows came later and allowed them, MS-DOS programs still in use ran into more frequent problems.

Solutions to the problem in bash and other software include quoting and escape character prefixing. Alternative naming schemes include using underscores “_” instead of spaces or capitalizing the first letter of each word and running all the capitalized words together. Another helpful solution is automatic filename searching and completion by the shell or by a file app.

Software working with existing filenames must handle embedded spaces. Be careful to expand user-supplied variables within quotation marks to preserve spaces. Bash quotation marks hide wildcards but expand variables. Using apostrophes for quoting hides the variables, too.

Some programs do their own wildcard expansion. du isn’t one:

$ echo "content for testing" >testfile

$ cat testfile 
content for testing

$ du testfile
4       testfile

$ du test*
4       testfile

$ du "test*"
du: cannot access 'test*': No such file or directory

$ du "*"
du: cannot access '*': No such file or directory

During the stage of command line parsing called word splitting, bash uses the Internal Field Separator (IFS) shell variable to subdivide arguments. Because it contains invisible characters, the following hex dump shows its default value:

$ echo -n "${IFS}" | od -tx1
0000000 20 09 0a
0000003

Quoting the variable prevents echo from eliminating any characters in favor of its own format. The -n option suppresses echo‘s own newline.

The octal dump program, od(1), shows the addresses in its default octal and the dump format type in one-byte hex (-tx1). The output shows the first address (0000000) has a hex 20 (space), then a hex 09 (tab), and finally a hex 0a (linefeed AKA newline), for a total of three characters (0000003).

Bash uses any of those three characters as parsing separators.

To eliminate parsing by spaces, replace the IFS variable’s contents. That fixes the current shell, and the subshell that runs the $(...) pipeline inherits the change automatically. The script also exports the variable, which is harmless but not strictly needed for that. Bash’s export command both assigns the value to the variable and exports the variable all in one step.

When this shell is finished the original IFS value needs restoring. Save its original value in the prevIFS variable. Change IFS to use only a newline. Export that. To assign the newline, represented by the escaped-n (\n), into IFS use bash‘s dollar-single-quote notation $'...'. That translates the backslash-escaped symbol into its binary character. Thus, bash turns $'\n' into a linefeed character (hex 0a) and stores it into the IFS variable.

With bash word splitting limited to whole lines, create the indexed array variable biggest[] to store the lines from the subshell’s du pipeline.

Recall that du -s gives the file’s or directory’s allocated disk size. If its argument is a directory, it descends throughout that directory tree accumulating allocation sizes, and reports the total. Without a block size option, GNU du reports in 1K units. Filesystems typically allocate space in 4K blocks, so even a tiny file occupies one whole block. The testfile examples showed 4, meaning 4K.

Here’s an example from the root of a Qubes disposable VM (DispVM):

$ du -s /* 2>/dev/null | sort -rn | head -5
8016420 /usr
1614096 /var
347800 /opt
174804 /boot
157460 /mnt

Before applying the cut -f2, this shows the disk usage of root’s contents, sorted in reverse numeric order, showing only the top 5. Directory /usr has the largest allocation, followed by /var and the others.

Why redirect du‘s stderr to the bit bucket?

If the command requests to read or descend into directories with restricted permissions, error messages would abound. Redirect them.

Adding the cut command removes the numbers:

$ du -s /* 2>/dev/null | sort -rn | head -5 | cut -f2
/usr
/var
/opt
/boot
/mnt

Each name is on its own line, ordered from largest to smallest usage. The head command limits the count, which bigspace takes from ${numfiles}.
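One detail worth seeing in isolation: du separates the size from the name with a tab, and cut splits on tabs by default, so a name containing spaces survives cut -f2 intact. A small sketch with made-up sample lines (the sizes and paths are assumptions, not real du output):

```shell
#!/usr/bin/env bash
# Sketch: cut -f2 keeps everything after the first tab, spaces included.
# The sample lines are invented; the second name contains a space.
sample=$(printf '%s\t%s\n' 8016420 /usr 347800 "/opt/My Apps")
names=$(printf '%s\n' "${sample}" | cut -f2)
printf '%s\n' "${names}"
```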

With IFS set to newline only, each output line becomes an array element when the subshell finishes. Any names generated with embedded spaces — or tabs — remain part of the name. After the names fill the array, restore IFS to its default setting.
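As an aside, bash 4 and later offer another way to get one array element per line without touching IFS: the mapfile builtin. This is a swapped-in alternative, not what bigspace does, and the two sample paths are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: mapfile -t stores each input line as one array element.
# No IFS change is needed; -t strips the trailing newline from each line.
mapfile -t biggest < <(printf '%s\n' "/home/YOU/My Music" "/home/YOU/work")

printf '%d elements\n' "${#biggest[@]}"
printf '<%s>\n' "${biggest[@]}"
```

The IFS approach in the article works on older bash versions too, which may be why it was chosen.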

Then, Recurse!

Now bigdir() calls bigdir(), that is, calls itself, passing as arguments all the elements in the biggest[] array. Notice the @ used with the quoted array. Recall that @ and * differ only when the expansion is quoted: "${biggest[@]}" expands each element as a separately quoted word, while "${biggest[*]}" joins all the elements into one word. Using @ preserves embedded spaces in the names.
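Counting arguments makes the difference visible. In this sketch, count_args is a hypothetical helper and the two array entries are sample names:

```shell
#!/usr/bin/env bash
# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

biggest=( "/home/YOU/My Music" "/home/YOU/work" )

at_count=$(count_args "${biggest[@]}")       # one word per element
star_count=$(count_args "${biggest[*]}")     # all elements joined in one word
unquoted_count=$(count_args ${biggest[@]})   # default IFS splits on the space

echo "@: ${at_count}  *: ${star_count}  unquoted: ${unquoted_count}"
```

Only the quoted @ form hands the function exactly one argument per name.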

What does the recursion do?

Each element in biggest[] becomes, one at a time, the value of the for loop’s dir variable in the child run of bigdir(). As the loop iterates, each directory is searched for files that du -h can output. The function’s caller gets all those files. If the current name is a directory, descend into it looking for more files.

When the last subdirectory and file finish examination, outputting their human-readable sizes along the way, the recursed call to bigdir() returns to its calling parent, which may have another to do. Eventually each child reaches the end and returns to the parent until returning to the ultimate parent. When the ultimate bigdir() parent finishes, it returns to the main execution.
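The shape of that recursion can be sketched with a much simpler function. The walk() below is hypothetical and is not the article’s bigdir(): it has no exclusion list and no top-N selection, and it skips hidden files because the * glob does not match them. It only shows a function descending into directories by calling itself and reporting files along the way.

```shell
#!/usr/bin/env bash
# Hypothetical walk(): a bare-bones recursive descent, not bigdir().
walk() {
    local entry                        # each call keeps its own loop variable
    for entry in "$@"
    do
        if [[ -d "${entry}" ]]
        then
            walk "${entry}"/*          # recurse: the function calls itself
        elif [[ -f "${entry}" ]]
        then
            du -h "${entry}"           # a file: report its size and name
        fi
    done
}

# Build a small scratch tree (placeholder names) and walk it.
tmpdir=$(mktemp -d)
mkdir -p "${tmpdir}/a/b"
printf 'x' > "${tmpdir}/top file"
printf 'x' > "${tmpdir}/a/b/deep"

found=$(walk "${tmpdir}" | cut -f2 | sort)
printf '%s\n' "${found}"
rm -rf "${tmpdir}"
```

Each call returns to its caller when its own argument list runs out, which is the unwinding described above.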

Excluding the Mess

Here’s that main execution loop again:

for fs in "${big_fs[@]}"
do
    if [[ "${fs}" == "/" ]]
    then
        excl=$(echo "${big_fs[@]}" |
        sed -e 's,/ \| / \|/$\|^/$, ,g' -e 's,^/\| /, --exclude=/,g')
    fi
    echo "${fs}:"
    bigdir "${fs}" | sort -rh | head "-${numfiles}"
    echo

    unset excl
done

When bigdir() returns it delivers all those files and their sizes collected by bigdir() to the pipeline. While each subdirectory delivered its biggest files, another directory may have bigger files. Sort puts the whole bunch in reverse, human-readable numeric order, from highest to lowest. Head takes the biggest, limiting the count by ${numfiles}. For each filesystem or subdirectory, output its biggest files and cycle the loop.
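The sort and head stage is easy to try on its own. This sketch feeds made-up size and name pairs through the same pipeline; it assumes GNU sort for the -h option, and the numfiles value here is just a sample.

```shell
#!/usr/bin/env bash
# Sketch: reverse human-readable sort, then keep the top ${numfiles} lines.
numfiles=3                            # sample count (an assumption)

top=$(printf '%s\t%s\n' 95M /a 1.2G /b 512K /c 132M /d 4.0K /e |
      sort -rh | head "-${numfiles}")

printf '%s\n' "${top}"
```

A plain numeric sort (-n) would rank 512K above 1.2G, since it ignores the suffixes. That is why the human-readable flag matters once du -h is producing the sizes.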

So, what’s that ${excl} exclusion list mess for?

Remember that filesystems mount somewhere in or under root (/). So do subdirectories. Searching for a subdirectory under root and searching root itself will rediscover that same path. Search it twice?

Requesting a specific directory under root is important, even though root was also requested. Why?

The user asked.

Here’s an example from a Qubes DispVM running Fedora 26:

$ bigspace / /var
/:
132M    /usr/share/brave/resources/app.asar
115M    /opt/google/chrome/chrome
108M    /usr/lib/locale/locale-archive
95M     /usr/share/brave/brave
88M     /usr/lib64/firefox/libxul.so

/var:
165M    /var/lib/rpm/Packages
112M    /var/cache/yum/x86_64/23/fedora/gen/primary_db.sqlite
105M    /var/lib/clamav/main.cvd
55M     /var/cache/yum/x86_64/23/updates/gen/primary_db.sqlite
53M     /var/log/dnf.librepo.log

The root directory’s 5 files are all smaller than the Packages file in the /var directory tree. If /var were not excluded from the / search, that file would appear twice: once in the / list and once in the /var list. That duplication would also push the bottom item off the / list. In fact, the 112M file in /var pushes another file off the / list as well.

$ bigspace /
/:
165M    /var/lib/rpm/Packages
132M    /usr/share/brave/resources/app.asar
115M    /opt/google/chrome/chrome
112M    /var/cache/yum/x86_64/23/fedora/gen/primary_db.sqlite
108M    /usr/lib/locale/locale-archive

The du command has an --exclude option that allows identifying a directory to skip. Only one entry per exclusion is allowed, but multiple --exclude options are allowed. Generate one for each directory specified except the root directory. Build the exclusion list only when the filesystem selected is the root directory.
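Here is a small sketch of multiple --exclude options at work, assuming GNU du. It builds a scratch tree with placeholder directory names, excludes two of the three subdirectories by full path, and lists what du still visits.

```shell
#!/usr/bin/env bash
# Sketch: one --exclude per directory to skip (GNU du).
tmpdir=$(mktemp -d)                   # scratch tree with placeholder names
mkdir "${tmpdir}/keep" "${tmpdir}/skip1" "${tmpdir}/skip2"
printf 'x' > "${tmpdir}/keep/file"
printf 'x' > "${tmpdir}/skip1/file"
printf 'x' > "${tmpdir}/skip2/file"

# -a lists files as well as directories, so the effect is easy to see.
listed=$(du -a --exclude="${tmpdir}/skip1" --exclude="${tmpdir}/skip2" \
            "${tmpdir}" | cut -f2)

printf '%s\n' "${listed}"
rm -rf "${tmpdir}"
```

Note that du treats the --exclude value as a shell pattern, so a path containing characters such as * or ? would match more than the one literal name.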

Take all the entries in "${big_fs[@]}" and send them to sed. Inserting the --exclude= text into the list is relatively easy. Isolating the root directory without affecting the other paths’ slashes is harder. This is where regular expressions (REs) — the mess — come in handy.

The first sed expression (-e) eliminates the root directory (/). It must run before inserting the --exclude= option for du’s use. This first expression is a substitution.

Sed substitution uses the following format:

s/find/replace/flags

The s simply identifies a substitution. Any character can come after the s, although most people use a slash (/).

When a slash is part of either the find portion or the replace portion, as in this case, use another character as a separator. In this case, a comma (,) is the separator.
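A quick comparison shows why the alternate separator helps. Both commands below make the same substitution on a sample path; the first needs every slash in the pattern escaped, the second does not.

```shell
#!/usr/bin/env bash
# Sketch: the same substitution with two different separators.
with_slash=$(echo "/usr/local" | sed 's/\/usr/\/opt/')   # escaped slashes
with_comma=$(echo "/usr/local" | sed 's,/usr,/opt,')     # comma separator

echo "${with_slash}"
echo "${with_comma}"
```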

The find part breaks down into four parts, any one of which triggers the replace part. Each part is separated by a logical OR, represented by an escaped-pipe “\|”. Logical OR means either the left side or the right side may match for the whole thing to match.

To separate four tests, any of which may match, there are three ORs. Here they are, taken separately:

  1. “/ ”: A slash then a space.
  2. “ / ”: A space then a slash then a space.
  3. “/$”: A slash followed by the end-line ($).
  4. “^/$”: Begin-line (^) followed by a slash followed by end-line.

The first expression matches the root directory coming first in the list, or a pathname with a trailing slash, such as “/var/”. The trailing slash often appears when a user types a partial pathname and then hits the tab key on the bash command line, which triggers automatic pathname completion. When bash completes a directory name, it appends a trailing slash so the user can go on typing another directory or filename.

NOTE: Can’t use the word-oriented REs because they must precede or follow word-like characters. Slashes are not considered word-like characters.

The second expression matches a root directory reference in between two other pathnames.

The third expression matches the root directory reference coming last in the list.

The fourth expression matches the root directory being the only pathname in the list. A caret (^) is an RE character that matches the beginning of the line. There is another special definition of the caret in REs, but that’s not used here. A dollar ($) is an RE character that matches the end of the line. Therefore, this fourth expression matches when one slash is the only thing on the line.

Any of these may match because of the OR “\|” RE. After the last OR’d expression comes another comma separator. That separator signals starting the replace part of the substitution. It contains only one space. So, if any of the four styles of root directory appear, the match is replaced with a single space. This could produce a bunch of extra spaces, but those won’t hurt the operations — especially if echo trims them.

After the replace part comes the flags part. Substitution flags are optional and there are only a few. The most commonly used flag is g for global replace. If the g flag is not used, only the first matching appearance of the find part gets replaced. Any other possible matches on the same line remain untouched. Use the g flag to replace all possible matches on the line. So, every appearance of the root directory is removed along with trailing slashes in other directory references.
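To watch the first expression work, run it on a few sample lists. This sketch assumes GNU sed, since the escaped-pipe OR is a GNU extension to basic REs. strip_root is a hypothetical wrapper and the sample lists are made up. The brackets in the output make leading and trailing spaces visible.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the article's first sed expression (GNU sed).
strip_root() {
    echo "$1" | sed -e 's,/ \| / \|/$\|^/$, ,g'
}

first=$(strip_root "/ /var")           # root first in the list
middle=$(strip_root "/var / /usr")     # root between two pathnames
trailing=$(strip_root "/var/ /usr")    # trailing slash on /var/

# The same find and replace without, then with, the g flag.
once=$(echo "/var/ /usr/" | sed -e 's,/ \| / \|/$\|^/$, ,')
global=$(echo "/var/ /usr/" | sed -e 's,/ \| / \|/$\|^/$, ,g')

printf '[%s]\n' "${first}" "${middle}" "${trailing}" "${once}" "${global}"
```

Without g, only the first trailing slash is cleaned up and /usr/ keeps its slash. With g, both are replaced.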

The second sed expression is another substitution, also using a comma separator. There are two expressions in the find part:

  1. “^/”: A caret (begin-line) then a slash.
  2. “ /”: A space then a slash.

Tied together in the RE with an OR “\|”, either one of them may match. The first part describes a slash at the beginning of the line. The second part describes a space followed by a slash. With root directory references gone, all that’s left are pathnames. Either of these parts would match the slash at the beginning of a pathname, but only at the beginning.

The replace part is a space, then du’s exclusion option --exclude=, then the slash put back in place. As a side effect, a pathname at the beginning of the line, which had no space in front of its slash, now has one.

Because of the global replace g flag these --exclude= insertions happen throughout the pathname list.
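The second expression can be tried the same way on a list that has already had its root references removed. The sketch assumes GNU sed, and the three paths are samples.

```shell
#!/usr/bin/env bash
# Sketch: the article's second sed expression on a sample path list.
paths="/var /usr /opt"
excl=$(echo "${paths}" | sed -e 's,^/\| /, --exclude=/,g')

printf '[%s]\n' "${excl}"
```

Every pathname, including the first, ends up with a space and --exclude= in front of it, and slashes inside a pathname are untouched.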

No Messy Repetition

Now, look back at the use of the ${excl} in bigdir(). Not its use in the du command. That’s just a command line argument using the exclude options. Look at its use on the left side of the contains (=~) test.

The elif queries whether the exclusion list contains the active searching directory. If the directory that bigdir() is searching for is one of the directories in the exclusion list, skip it. It gets done separately. Continue with the next directory in the loop.
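A test of roughly this shape would do that job. The sketch below is an assumption about the form of the check, not a copy of bigdir(): the excl value, the loop, and the result variable are all made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: does the exclusion list contain the directory being searched?
excl=" --exclude=/var --exclude=/usr"     # sample exclusion list
result=""

for dir in /var /opt
do
    if [[ "${excl}" =~ ${dir} ]]          # list on the left, name on the right
    then
        result+="skip ${dir};"            # searched separately, so skip here
    else
        result+="search ${dir};"
    fi
done

echo "${result}"
```

Be aware that =~ is a regular-expression match that succeeds anywhere in the string, so a name that is only a prefix of a listed path, such as /va, would also count as contained.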

For example, if both / and /var, which is contained in /, are searched, when recursion of the root directory’s subdirectories causes a search for /var, it will be excluded from the root search. Why?

Because the user asked to search /var separately, the exclusion list contains /var. Thus, /var will get its own list so searching / won’t need to show it.

If you search just / you get the 5 biggest files overall, and files from /var, /usr, and /opt all mix into the one report for /. Each of those directories holds more big files than that single list can show. Ask bigspace to search all four and you’d get the 5 biggest files in / not counting those three directories, while each of the three separately shows its own biggest 5.

Try it!
