Why Egrep?

The grep(1) family of search utilities is very fast. The name stands for Global Regular Expression Print: the program searches globally through files using regular expressions (REs; see regex(7) for more detail) to describe what to find, then prints the matching lines.

Metacharacters are characters used to describe other characters. Filename wildcards are simple examples of metacharacters. Fixed strings are just the characters themselves, with no other special meanings. Originally:

  • grep searched using basic RE metacharacters
  • fgrep searched only for fixed strings — no metacharacters
  • egrep used extended REs

The whole grep family was and is fast, but fgrep was generally considered fastest because it has the least work to do — it doesn’t have to handle REs. Yet egrep, which does the more complex work, was sometimes faster than fgrep. That’s not always the case.

While the names grep, fgrep, and egrep remain for compatibility, the separate programs have been combined into one grep, with command line options distinguishing which behavior to use:

  • Use -G (optional) to get basic RE grep.
  • Use -F to get fixed string grep.
  • Use -E to get extended RE grep.

However, this change to one grep with option arguments, though made years ago, could break existing programs that use the older names. So, the older names remain.

There’s only one grep, and its three options handle all three variations. Linux and other variations on Unix don’t need duplicate programs just to keep duplicate names. Point short scripts or aliases to the kind of grep needed.
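For example, here is a minimal sketch of such aliases in bash — illustrative only; the next section shows what a Fedora system actually defines:

# Hypothetical aliases mapping the traditional names onto the one grep
alias fgrep='grep -F'
alias egrep='grep -E'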

Find Out Which One

The which(1) command shows an executable program’s full pathname. If it’s an alias, which shows what the alias is. Run which with multiple program names and it tells about each:

$ which grep
alias grep='grep --color=auto'
        /usr/bin/grep

This shows that the command name, grep, is both an alias and a program file. For an alias, it gives you the definition. Typing grep without a pathname — typing grep instead of /usr/bin/grep — uses the alias, which automatically adds the --color option to help highlight the match. Otherwise, the program is stored in the /usr/bin directory.

$ file /usr/bin/grep
/usr/bin/grep: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=ca5d34701a9b00ad6b11ab0f501292be2df34f1f, stripped

What kind of file is /usr/bin/grep? The file(1) command tells you: It’s a binary (compiled) executable program. To see all three, you could type,

which egrep fgrep grep

but here’s how to do it with the shell’s brace expansion to simplify your typing:

$ which {e,f,}grep
alias egrep='egrep --color=auto'
        /usr/bin/egrep
alias fgrep='fgrep --color=auto'
        /usr/bin/fgrep
alias grep='grep --color=auto'
        /usr/bin/grep

Note: Braces specify a comma-separated list of items that the shell expands into a set of arguments. Any number of characters can belong to one item, separated from the next item by a comma — no spaces unless you want the space to be part of an item. The shell’s brace expansion takes each member of the list and combines it with whatever else is on the command line:

$ echo {abc,def,ghi}
abc def ghi

$ echo x{abc,def,ghi}z
xabcz xdefz xghiz

The first example shows the three groups expanded. The second shows each group wrapped between the x and z characters. How about all the combinations:

$ echo {a,b,c}{d,e,f}{g,h,i}
adg adh adi aeg aeh aei afg afh afi bdg bdh bdi beg beh bei bfg bfh bfi cdg cdh cdi ceg ceh cei cfg cfh cfi

Aliases are defined for the whole grep family. Use any of these commands without its full path and the alias automatically adds the --color option. Otherwise, all these programs live in /usr/bin.

If the primary program is grep, what are these others?

$ ls -oh /usr/bin/{e,f,}grep
-rwxr-xr-x 1 root   28 Apr 22  2016 /usr/bin/egrep
-rwxr-xr-x 1 root   28 Apr 22  2016 /usr/bin/fgrep
-rwxr-xr-x 1 root 159K Apr 22  2016 /usr/bin/grep

Notice the egrep and fgrep files are significantly smaller than grep. The file(1) command tells more:

$ file /usr/bin/{e,f,}grep
/usr/bin/egrep: POSIX shell script, ASCII text executable
/usr/bin/fgrep: POSIX shell script, ASCII text executable
/usr/bin/grep:  ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=ca5d34701a9b00ad6b11ab0f501292be2df34f1f, stripped

Both egrep and fgrep are shell scripts. What do they say?

$ head /usr/bin/{e,f}grep
==> /usr/bin/egrep <==
#!/bin/sh
exec grep -E "$@"

==> /usr/bin/fgrep <==
#!/bin/sh
exec grep -F "$@"

Head(1) shows the first 10 lines of each file, and gives a heading identifying which file it’s showing. These one-line scripts run under /bin/sh — on Fedora, bash operating in its POSIX sh mode — to launch grep -E or grep -F. Using exec replaces the running sh with grep itself. The quoted "$@" passes the rest of the script’s command line to the launched grep.

Thus, the fgrep and egrep names launch a script that execs — wholly replaces the running program with — the primary grep binary using the correct option. Still have your fingers programmed to use fgrep or egrep? Those names still work. They’re not likely to go away any time soon.

While fgrep should be fastest with fixed strings, it can’t search for REs. Efficient implementation of complex REs and of high-speed searching algorithms lets egrep work quickly despite all its extra facility. GNU versions have blurred basic REs and extended REs to the point where there is hardly a difference.
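As a small illustration of how close they have become, GNU grep accepts alternation in both modes; only a backslash separates the basic and extended spellings (a sketch, assuming GNU grep):

$ printf 'cat\ndog\nbird\n' | grep -E 'cat|dog'
cat
dog

$ printf 'cat\ndog\nbird\n' | grep -G 'cat\|dog'
cat
dog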

Speed and Flexibility

The grepspeed script, developed below, compares the three grep variants. When searching through a bunch of data — the haystack — for a string of characters — the needle — that string could be anywhere. It may appear more than once. Bury that needle in a bunch of hay, then mix several such buryings in a stack of them. How fast will each grep variation find it?

First, define facts about the needle and the haystack with some constants:

wid=80
len=1000000
pos=49
needle="abc"

Each line has the needle variable in it. Make lines 80 characters wide. The haystack file will have a million such lines. Put needle off center. How about up against the 49th character position in the line? Make needle a simple string, but more than one character, so all three greps can search for it.

Build a Line Containing Needle

To bury needle in a line of spaces, printf(1) can left or right justify data within a field of spaces. Use ${pos} to create the field width and put needle in it. Right justification happens with positive field widths so spaces will precede needle. Add appropriate spaces after it.

# Generate 80-characters with needle off to the right
$ printf "%*s%*s" ${pos} ${needle} $((wid - pos)) ""
                                               abc

# Prove that cat -te shows "$" only when a newline is present
$ printf "%s" "" | cat -te

$ printf "%s\n" "" | cat -te
$

The first command generates a line.

  1. The first argument is the format definition. Using %s causes printf to require a string to display. That’s ${needle}.
    1. A number for the field width can go between the % and the s. Using a * instead of a number tells printf to look in the next argument for the width.
    2. This makes ${pos} the field width and ${needle} the string to right justify in that field.
  2. Right after is another %*s telling printf to take two more arguments for a total of four.
    1. The third argument after the format string, $((wid - pos)), sets a shell arithmetic operation. Inside the double parentheses, variable names don’t need dollar signs to extract the value, but you can use them if you want. The remaining field width on the line is the total line width less the position just used for the first field’s width.
    2. For that second field width, put nothing ("") in it. That fills the field with spaces.

The second command shows that there is no actual newline in the data.

  1. No newline is provided by a plain %s field.
  2. Because cat -te (-t shows tabs, -e shows line ends, AKA newlines) would show a $ where the newline appeared, yet none shows, it proves no newline was generated.

The third command shows how to force the newline — by using the \n notation.

  1. Notice this printf field definition prints the string followed by a newline.
  2. The string is empty ("") and then the newline follows. cat -te shows a dollar sign on the otherwise empty line.

Note: See printf(1) for special interpreted sequences known as escape notation. It’s called escape notation because the backslash — the slash that leans to the left — changes the character after it into something else, escaping the normal shell interpretation of that character.
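For example, here is a quick demonstration of two common escapes, \t (tab) and \n (newline):

$ printf 'one\ttwo\nthree\n'
one     two
three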

To see what the notation has in it, try the following command:

# Prove it has exactly 80 characters of text not counting newline
$ printf "%*s%*s" ${pos} ${needle} $(( wid - pos )) "" | wc
       0       1      80

Piping that printf output into wc(1), the 0 shows there are no lines — that is, no newlines appear in the data. A newline is a linefeed character, 0x0A hexadecimal, 10 decimal, or \n in escape-character notation. The 1 is the word count, meaning the line has only one word in it, the contents of needle. The 80 is the number of characters in the whole printf output.
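To see that newline byte directly, dump it with od(1) — the -An option suppresses the offsets and -tx1 prints each byte in hex:

$ printf '\n' | od -An -tx1
 0a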

To prove the character positioning, use a numeric gauge:

# Prove the needle position; NOTE: Add newline to prove trailing spaces
$ printf "%*s%*s\n" "${pos}" "${needle}" $((wid-pos)) "" |
 cat -te ; echo 12345678901234567890123456789012345678901234567890123456789012345678901234567890
                                              abc                               $
12345678901234567890123456789012345678901234567890123456789012345678901234567890

The gauge is 80 characters long. Needle is at positions 47, 48, and 49, so it’s right justified in a field 49 characters wide. The rest makes the line 80 characters long. The dollar sign, representing the position of the newline, is at the 81st character.

At this stage, the line has only spaces before and after needle. Prefer some different character instead of spaces? Pipe it through tr:

tr ' ' 'x'

This tr turns every space into an x, but you can use any character if you don’t want x. No one character is any more difficult than any other character unless it is the same as the first character of needle. It’s easy to search for needle among a bunch of characters that aren’t needle‘s characters.

Instead, make the search more difficult. Put needle‘s first characters into the nonmatching portions of the line where all the spaces are.

Consider how the search can go:

  • It looks for needle‘s first character in the line. If it’s not there, needle isn’t in the line. Done.
  • If it is there, is the second character right after?
    • If not, search the rest of the line for the first character as before.
    • If it is the second character, check for the third, and so on.

Thus, false matches — it’s the first, but not the second; it’s the first and second, but not the third — make the search harder. False matches slow it down. How much it slows depends on the algorithm. So, put all the characters in needle except the last one throughout the line so that the search finds lots of false matches.

# How many characters in needle?
$ echo ${#needle}
3

# Prove isolating all characters in needle less one
$ echo ${needle%?}
ab

# Prove generating spaces the length of needle less one
$ printf '%*s' $(( ${#needle} - 1 )) '' | wc
      0       0       2

Here’s how these command lines work:

  • The first command shows ${needle} has three characters. Putting a # between the open brace and the variable’s name tells bash to count the characters.
  • The second command shows using a % followed by a wildcard — in this case the ? — that matches any single character. The % removes a matching suffix from a variable’s value. That is, because ? matches any one character, whatever is the last character will be removed.
  • The third command proves the proper field width for the removal technique command. Using bash‘s arithmetic expression — double parentheses — subtract 1 from the length of needle. wc shows it is 2 characters wide.

Put all that together into a sed command, along with a gauge to be sure the line length isn’t violated:

# Prove all spaces of length less one of needle were replaced; NOTE: One remains
$ printf "%*s%*s\n" "${pos}" "${needle}" $((wid-pos)) "" |
sed "s/$(printf '%*s' $((${#needle}-1)) '')/${needle%?}/g" |
cat -te ; echo 12345678901234567890123456789012345678901234567890123456789012345678901234567890
ababababababababababababababababababababababababcababababababababababababababab $
12345678901234567890123456789012345678901234567890123456789012345678901234567890

  • Printf generates the line, as before, with needle in it. Pipe it through to sed(1), the stream editor.
  • Sed applies editing commands to a stream of data passing through it. Sed‘s s command substitutes a matching string with another string.
    • Following the s comes a slash (/), introducing the search string. Create this search string by using command substitution to generate the number of spaces represented by the needle less one character.
    • Another slash ends the search string and defines the replace string. This has all the characters in needle except the last one.
    • One more slash terminates the replace string and starts the optional flag section. The g flag tells sed to repeat this search and replace globally — every instance it can find in line, not just the first.
  • Pipe this through cat -te to see where the newline appears in the generated line. A newline was used in the printf to make the gauge go on the next line.

Notice above the gauge that the $ representing the newline is at the 81st position. One space remains unchanged. That’s because an odd number of spaces come after needle. Here’s what it looks like without the gauge:

# Replace all spaces of needle length less one with beginning of needle
$ line=$( printf "%*s%*s" "${pos}" "${needle}" $(( wid - pos )) "" |
sed "s/$(printf '%*s' $(( ${#needle} - 1 )) '')/${needle%?}/g" )

# Prove line holds false starts of needle including the true needle
$ echo "${line}" | cat -te
ababababababababababababababababababababababababcababababababababababababababab $

 # Prove line length has not changed; Note: echo adds a newline
$ echo "${line}" | wc
      1       1      81

Now the printf generating the line doesn’t include the newline for the gauge. While wc shows a line, that’s because echo generates one, hence the 81st character. Kill echo’s newline using its -n option and the true generated length of 80 will appear. Forget to use the quotation marks and the length would be 79, because the shell’s word splitting drops the trailing space before echo ever sees it. The trailing space is part of the 80-character line data, so leaving out the quotes makes echo lie to you.
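A quick check of that claim, assuming line was built as above — wc -c counts the characters, and the unquoted version loses the trailing space to word splitting:

$ echo -n "${line}" | wc -c
80

$ echo -n ${line} | wc -c
79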

Generation and assignment on a Qubes v3.2 Disposable VM (DVM) running Fedora 24 with one effective CPU core:

$ head /proc/cpuinfo | egrep 'model name'
model name      : Intel(R) Core(TM) i7-3537U CPU @ 2.00GHz

$ time line=$(printf "%*s%*s" "${pos}" "${needle}" $(( wid - pos )) "" |
sed "s/$(printf '%*s' $(( ${#needle} - 1 )) '')/${needle%?}/g")

real    0m0.019s
user    0m0.000s
sys     0m0.017s

On an older system running Fedora 25 with 8 effective CPU cores:

$ head /proc/cpuinfo | egrep 'model name'
model name      : Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz

$ time line=$(printf "%*s%*s" "${pos}" "${needle}" $(( wid - pos )) "" |
sed "s/$(printf '%*s' $(( ${#needle} - 1 )) '')/${needle%?}/g")

real    0m0.007s
user    0m0.002s
sys     0m0.006s

Generate a Million Line Haystack

There are several ways to generate the haystack composed of a million lines. Here’s the most obvious:

On a Qubes v3.2 DVM running Fedora 24 with 1 effective CPU:

# Each echo writes to haystack, then loop
$ time for i in {1..1000000}; do echo "${line}" >>haystack; done

real 0m49.255s
user 0m19.763s
sys 0m21.761s

$ wc haystack
 1000000  1000000 81000000 haystack

Remember: each line is 80 characters followed by a newline, therefore one million lines is 81 million characters. This took a long time — most of a minute — to generate the haystack. Appending redirection (>>) to the haystack file was required because each echo separately wrote to the file. Regular output redirection (>) would make each new generated line overwrite the previous. That’s not good.

But this took a long time.

Instead, let the loop accumulate the set of all lines and then write out the entire result as the system sees fit. To do that, move the redirection after done and use regular output redirection because it only opens the file once. That should remove the impact of repeated file handling.

# Loop generates, then writes accumulation to haystack
$ rm haystack

$ time for i in {1..1000000}; do echo "${line}"; done >haystack

real    0m13.339s
user    0m10.963s
sys     0m2.285s

$ wc haystack
 1000000  1000000 81000000 haystack

Much better.

Another way to generate lines uses awk(1) for the loop. Just pass the required variables into awk using its -v option:

# Awk generates, then writes accumulation to haystack
$ rm haystack

$ time awk -vline="${line}" -vlen="${len}" '
BEGIN { for (i=1; i<=len; i++) { print line; } }
' >haystack

real    0m0.354s
user    0m0.168s
sys     0m0.153s

$ wc haystack
 1000000  1000000 81000000 haystack

Significant improvement!

Awk typically reads and processes an existing file or data from its standard input. One way to make it generate data of its own when there is no input: Put code in its BEGIN segment.

No other processing or END segments.

No input data from a file or by redirection.

When the BEGIN segment finishes and there are no other rules to run, awk has no reason to read any input at all. It simply exits.

Done.

Whatever awk generated in the BEGIN segment is its output, which redirects to haystack. Can’t get a lot easier.

Inside braces, the BEGIN segment starts a for() loop using C language-style notation. The same style appears in bash: Three parts to the loop introduction and a fourth part for the code to run.

  • The first part sets an initial value to the loop counter, i for “iterator”.
  • The second part tests whether a terminating condition happened yet. This condition checks whether i is currently below or the same as the ${len} value passed in from the shell variable. Because i starts at 1 and ${len} was set to one million, it has a way to go.
  • The third part increments the loop counter, i, by one. So, the 1 will become a 2, then later, a 3, and so on.
  • The fourth part is the code that runs once each loop cycle. If the second part — the test — is true, that is, 1 is less than or equal to ${len} (1,000,000) the first time around, the code inside the braces runs once. It prints the ${line} value passed in from the shell variable. That’s the same line containing needle generated before. After that code runs, the third part — the increment — changes the counter. Then the second part — the test — runs again to see if the fourth part — the code section — deserves another run.
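For comparison, here is a sketch of the same C-style loop written directly in bash — the same three-part notation, though a bash loop like this performs about like the brace-expansion loop timed earlier, far slower than awk:

for (( i = 1; i <= len; i++ ))
do
  echo "${line}"
done >haystack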

Where the shell generated the haystack with its own loop, taking many seconds by either method, the awk method generated the same haystack in slightly more than a third of a second. Awk is an impressive string-handling program.

Here’s what running the same tests on an older system, but with many more effective cores, does.

On Fedora 25 with 8 effective CPU cores:

# Each echo writes to haystack, then loop; Note: Append redirection
$ time for i in {1..1000000}; do echo "${line}" >>haystack; done

real    0m20.841s
user    0m13.675s
sys     0m7.019s

$ wc haystack
 1000000  1000000 81000000 haystack


# Loop generates, then writes accumulation to haystack; Note: redirection
$ rm haystack

$ time for i in {1..1000000}; do echo "${line}"; done >haystack

real    0m10.834s
user    0m8.736s
sys     0m2.052s

$ wc haystack
 1000000  1000000 81000000 haystack


# Awk generates, then writes accumulation to haystack
$ rm haystack

$ time awk -vline="${line}" -vlen="${len}" '
BEGIN { for (i = 1; i <= len; i++) { print line; } }
' >haystack

real    0m0.226s
user    0m0.156s
sys     0m0.069s

$ wc haystack
 1000000  1000000 81000000 haystack

No contest!

A bash loop generates faster when it doesn’t take time to write each line separately but instead writes the end result all at once. But awk runs the loop 30 to 50 times faster still.

Testing the Family

On to testing grep. One reservation about grep’s typical output: Normally, a grep search produces each matching line as output. Not caring what the match looks like for this test, grep’s -q (quiet) option might seem attractive, but it’s the wrong way to go. When grep finds its first match, the -q option makes it quit searching. That’s good when you only want to know, “Is it there?”

For this test, find every match. Don’t stop early.

Producing matches sends output to the terminal. That slows it down.

Throw it out. Redirect to /dev/null, the bit bucket.

But, one million 81-character lines is a lot of data to produce just to toss. Is there anything else to do the work quietly?

What about the -c (count) option? That just produces the match count, but grep must find every one to get that count. No need to worry about the count. It should be a million. Don’t throw the count away and you’ll see it.

So, which “toss it” method is faster? Here are quick-and-dirty (Q&D) tests of the two throw-away methods:

$ time grep -G "${needle}" haystack >/dev/null

real    0m0.148s
user    0m0.010s
sys     0m0.125s

$ time grep -G -c "${needle}" haystack >/dev/null

real    0m0.097s
user    0m0.007s
sys     0m0.077s

Throwing away the generated count is faster.

Here’s the code to produce test timings:

#!/bin/bash
#
# grepspeed
#
# Q&D test search speed of the three grep variants.
 
if [[ "$1" == "" ]]
then
  linetype="easy"
else
  linetype="hard"               # Any command line arg becomes hard type
fi

TIMEFORMAT=$'real=%R,user=%U,sys=%S,cpu=%P%%'

time {
  echo -n "Build ${linetype} haystack: "

  wid=80
  len=1000000
  pos=49
  needle="abc"

  # Generate line
  line="$(
    if [[ "${linetype}" == "easy" ]]
    then
      printf "%*s%*s" "${pos}" "${needle}" $(( wid - pos )) ""
    else
      printf "%*s%*s" "${pos}" "${needle}" $(( wid - pos )) "" |
      sed "s/$(printf '%*s' $(( ${#needle} - 1 )) '')/${needle%?}/g"
    fi
  )"

  # Generate haystack as set of lines
  awk -vline="${line}" -vlen="${len}" '
    BEGIN {
      for (i = 1; i <= len; i++) {
        print line;
      }
    }
  ' >haystack

  echo -n "$(wc -l <haystack) lines: "
}

echo -n "Grep -G [basic]: "
time for n in {1..1000}
do
  grep -G -c "${needle}" haystack >/dev/null
done

echo -n "Grep -F [fixed]: "
time for n in {1..1000}
do
  grep -F -c "${needle}" haystack >/dev/null
done

echo -n "Grep -E [exten]: "
time for n in {1..1000}
do
  grep -E -c "${needle}" haystack >/dev/null
done

rm haystack

Start the program by allowing the user to select an easy test or a hard test by giving an argument on the command line.

  • Easy haystack line: Nothing but spaces before and after needle seems like an easy test because needle doesn’t start with a space. So, each comparison of needle’s first character would fail and move on to the next character in the haystack line. This will be the default test.
  • Hard haystack line: A partial match happens, which advances to the next character in needle, which is also a partial match, advancing again, then failing on the third character. That must rewind needle‘s first character match against the next character in the haystack line, which was already tested against the wrong character.

If the user types nothing but the program’s name, default to the easy test. Any other argument after the name uses the hard test.

It’s All About Timing

Set the TIMEFORMAT environment variable. Bash’s default time prefix produces real, user, and sys times. Those pretty formats, converted into minutes and seconds with decimal fractions, don’t make for easy comparisons when looping the tests. Cycling each test 1,000 times turns a total measured in seconds into an average in milliseconds per run: a total of 123.456 seconds reads as 123 milliseconds and a fraction per run, or 123,456 microseconds if you prefer. But the default TIMEFORMAT would report that total as minutes, seconds, and fractions, which wouldn’t look right.

Changing bash’s TIMEFORMAT environment variable eliminates the minutes-and-seconds problem. Using a printf(1)-style format — also see printf(3) for more detail — allows flexibility in showing execution times. Real, user, and system times are expressed in seconds and fractions but without the “s” suffix. No conversion to minutes and seconds.

Also, it offers CPU usage percentage. When the sum of user and system times does not match the real time, the system is typically busy doing something else. Be careful about the real time when the CPU percentage is significantly less than 100%.

Systems with multiple CPU threads distribute their operations. CPUs can have multiple cores and each core can have multiple threads, frequently two threads per core. If a CPU has four cores and each has two threads, it looks like eight effective CPUs operating in parallel.

When the report shows 100% with one effective CPU, that CPU was fully occupied with the task. But with two effective CPUs, the report combines them: two CPUs showing 100% worked at an average of 50% each, and they wouldn’t max out until 200%. A CPU with four cores, each with two threads, looks like an effective eight-CPU system, so 800% would be fully busy; its 100% is really only 12.5% busy.

Note: If you ever want to return the TIMEFORMAT to its default inside a program, don’t set it to an empty string, like:

TIMEFORMAT=""

Instead use:

unset TIMEFORMAT

Setting it to an empty string suppresses the time prefix’s output entirely, shutting it up. Unsetting it removes it from bash’s environment, which restores the default output format.

Also, bash‘s own time prefix has different output and settings from the /usr/bin/time program.

The TIMEFORMAT setting here is designed for easy parsing. Commas separate timing fields. Equals signs separate field titles from field values.
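As a sketch of that parsing — using a sample timing line of the kind the runs below produce — awk can split on both separators at once and print just the numbers:

$ echo 'Grep -F [fixed]: real=97.131,user=84.012,sys=12.948,cpu=99.82%' |
awk -F'[=,]' '{ print $2, $4, $6 }'
97.131 84.012 12.948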

With the TIMEFORMAT settled, build the haystack.

Group Effort

To isolate its timing from the grep family tests, separate all the haystack setup code from the rest of the test code. Two ways to do this: put it in a subshell or make a group command.

Surrounding a command list with parentheses starts a subshell — a child shell process running as a distinct execution.

A child inherits from the parent. The parent does not inherit from the child.

That means variables assigned in the parent go to the child, but variables assigned in the child do not come back to the parent:

$ echo $a

$ a=abc
$ echo $a
abc
$ (echo $a)
abc

This example first shows that the variable $a value is empty. Assign it “abc” and echo to prove that it’s got that value. Then, in a subshell, echo the value. The subshell child has the value.

$ echo $b; (b=xyz; echo $b); echo $b

xyz

$

This shows first that variable $b has no value, represented by the blank line. In a subshell give it a value and show that value to prove it. When the child shell finishes, echo the same variable: It has no value in the parent shell, again shown by the blank line.

Instead, surround group commands with braces. Group commands run in the current shell — the parent. There is no child created. The shell considers grouped commands a single operation with one final status. Results happening in the group are available to the running shell:

$ { echo $a; echo $b; b=xyz; echo $b; }; echo $a; echo $b
abc

xyz
abc
xyz
$

In this example, a group is started with an open brace. There must be a space or a newline between the open brace and the group’s first command.

Show the $a setting from before. Then, show the $b value, which wasn’t set in the parent by the child shell, so it’s empty. Then, set the b variable and show it inside the group.

Close the brace to end the group.

Notice that the last command in the group must either be terminated by a semicolon or a newline must come between it and the close brace. Also, if a newline does not follow the close brace, a semicolon must. This is because, to the shell, braces are commands!

After the group, show that $a and $b both still have their assigned values.

By using a group, the time prefix attaches to the whole group. Yet the variables assigned inside the group are available to the rest of the program because a group runs in the parent.
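Incidentally, here is a quick sketch of what happens when that separator before the closing brace is missing — the } is taken as just another argument to echo, and the shell keeps prompting (>) for the real close:

$ { echo hi }
> }
hi }

$ { echo hi; }
hi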

Introduce the Introducer

Inside the group, announce which type of haystack will be built. Set variables for the

  • Width of each line (wid=80)
  • Length of the haystack in lines (len=1000000)
  • Position within the line of the needle (pos=49)
  • Needle characters to find in each haystack line (needle="abc")

Generate the line according to whether the search should be easy or hard. Use command substitution — $( ... ) — to run the code that generates the line, as shown before.

Easy lines only have spaces before and after the needle. To use some other character than spaces, pipe the printf output to tr, such as:

tr ' ' 'x'

where the 'x' can be that or any other character. It doesn’t matter what the character is so long as it’s not the first character of the needle.

Hard search lines use all the characters of ${needle} except the last. Use sed‘s substitution command. Here’s how it works:

Following the general format: s/findstring/replacestring/flags

  • The findstring should contain as many spaces as the needle‘s length less one. The notation ${#var} gives the number of characters in var.
  • Put a subtract one into an arithmetic expression using $((expression)) notation.
  • Hand that arithmetic expression to printf as the field width for a null string (''). The field width generates as many spaces as needed to right justify the null string — one fewer space than the length of needle. For a three-character needle, every two consecutive spaces must be replaced with the first two characters of the needle.
  • In the ${needle%?} notation, the “%” deletes the suffix — the right-side characters in ${needle} matching the pattern after the “%”. The pattern may include wildcards. Recall the “?” wildcard matches any single character, so the result is needle with whatever its last character is deleted.

If ${needle} is "abc" then the result is "ab". Thus, sed replaces two consecutive spaces with "ab", the first two characters in ${needle}. If needle was a longer string, sed would replace more spaces with more characters, but always one less than the length of the needle.

Whichever method ran, the easy or the hard, assign the text it generated to the line variable.

Next, generate the haystack from the line, as demonstrated and timed before. When haystack finishes, count the lines to show that the correct ${len} was used. Suppress echo‘s newline with -n so the time to run the group command will appear on that same line. That ends the group.

Run each grep test: first basic (grep), then fixed (fgrep), finally extended (egrep). As before, their times appear on the same line as the test notice. Run each test 1,000 times to average out differences due to system activity. Throw away the output so it won’t slow down testing. Each test’s whole loop produces a single timing, so the seconds reported should be read as average milliseconds per run for each grep type.

When done, remove the haystack file.

Running Tests

Running the easy haystack on three different systems shows the following results:

On a Qubes v3.2 DVM running Fedora 24 with 1 effective CPU:

$ time ./grepspeed
Build easy haystack: 1000000 lines: real=0.370,user=0.189,sys=0.131,cpu=86.70%
Grep -G [basic]: real=104.788,user=91.525,sys=13.028,cpu=99.78%
Grep -F [fixed]: real=97.131,user=84.012,sys=12.948,cpu=99.82%
Grep -E [exten]: real=105.566,user=91.530,sys=13.136,cpu=99.15%

real    5m7.876s
user    4m27.260s
sys     0m39.260s

On a Qubes v3.2 VM running Fedora 24 with 2 effective CPUs:

$ time ./grepspeed
Build easy haystack: 1000000 lines: real=0.430,user=0.217,sys=0.133,cpu=81.44%
Grep -G [basic]: real=118.274,user=102.046,sys=16.444,cpu=100.18%
Grep -F [fixed]: real=108.251,user=92.178,sys=16.276,cpu=100.18%
Grep -E [exten]: real=117.278,user=101.157,sys=16.370,cpu=100.21%
 
real    5m44.319s
user    4m55.601s
sys     0m49.254s

On Fedora 25 with 8 effective CPU cores:

$ time grepspeed
Build easy haystack: 1000000 lines: real=0.239,user=0.175,sys=0.063,cpu=99.63%
Grep -G [basic]: real=2.183,user=0.651,sys=1.547,cpu=100.68%
Grep -F [fixed]: real=2.172,user=0.669,sys=1.522,cpu=100.87%
Grep -E [exten]: real=2.254,user=0.718,sys=1.576,cpu=101.75%

real    0m6.961s
user    0m2.265s
sys     0m4.769s

Although one more CPU allocated to the VM meant that it was only about 50% busy while the DVM was about 100% busy, their times were similar. Of course, the F-25 system running without VMs but with 8 effective cores was only a blip at about 12.5% busy.

  • DVM: Fixed grep is clearly a winner when searching for fixed strings, running about 7.3% faster than basic grep and 8.0% faster than extended grep. Basic and extended were equivalent in time.
  • VM: Fixed grep wins over the other two in sheer speed, 8.5% faster than basic grep and 7.7% faster than extended grep. Again, basic and extended are time equivalent.
  • F-25: Differences between them are trivial at a little over 2 ms each, but fixed grep did come out slightly faster.

Here are times for the hard haystack:

On a Qubes v3.2 DVM running Fedora 24 with 1 effective CPU:

$ time ./grepspeed h
Build hard haystack: 1000000 lines: real=0.370,user=0.187,sys=0.148,cpu=90.36%
Grep -G [basic]: real=105.752,user=91.991,sys=12.741,cpu=99.03%
Grep -F [fixed]: real=96.482,user=83.714,sys=12.604,cpu=99.82%
Grep -E [exten]: real=103.416,user=90.631,sys=12.619,cpu=99.84%
 
real    5m6.044s
user    4m26.527s
sys     0m38.128s

On a Qubes v3.2 VM running Fedora 24 with 2 effective CPUs:

$ time ./grepspeed h
Build hard haystack: 1000000 lines: real=0.328,user=0.193,sys=0.136,cpu=100.28%
Grep -G [basic]: real=120.759,user=102.778,sys=18.141,cpu=100.13%
Grep -F [fixed]: real=110.523,user=93.032,sys=17.742,cpu=100.22%
Grep -E [exten]: real=119.576,user=101.949,sys=17.896,cpu=100.22%

real    5m51.217s
user    4m57.956s
sys     0m53.941s

On Fedora 25 with 8 effective CPU cores:

$ time grepspeed h
Build hard haystack: 1000000 lines: real=0.240,user=0.158,sys=0.082,cpu=99.82%
Grep -G [basic]: real=2.231,user=0.709,sys=1.532,cpu=100.42%
Grep -F [fixed]: real=1.972,user=0.623,sys=1.378,cpu=101.50%
Grep -E [exten]: real=2.183,user=0.714,sys=1.504,cpu=101.60%

real    0m6.758s
user    0m2.264s
sys     0m4.569s

  • DVM: As before, fixed grep won, running 8.8% faster than basic and 6.7% faster than extended, but extended was slightly faster than basic in this hard haystack.
  • VM: Fixed grep won the speed test beating basic by 8.5% and extended by 7.6%, but extended was again a little faster than basic.
  • F-25: Following the same pattern as the VMs, fixed beat basic by 11.6% and extended by 9.7%. Extended continued to beat basic by 2.2% this time.

It is clear that for fixed strings — no REs allowed — grep -F (fgrep) is faster than the other two. It doesn’t have to carry as much analysis overhead for strings without regular expressions. If speed is essential, favor it, but only if you’re searching for a specific set of characters: no expressions allowed. Grep -E (egrep) and grep -G (grep) are about the same. GNU grep may not differ much between grep and egrep, but non-GNU versions might. Extended grep supports more sophisticated REs and still searches quickly. Need to use complex searching keys? When in doubt, I use egrep.
