CS 3210 - Lab 11 - String Matching

This lab is all about how to match strings, either with globbing or with regular expressions.

The chapter correponding to this lab is Chapter 21.

Aren't we skipping stuff?

The alert student will note that we are skipping chapters 19 and 20. Chapter 19 deals with Virtual Consoles, which are only usable at the console (not from an SSH / telnet session). Most people that need more than one console simply open up a new SSH session or use the X Window system. Chapter 20 talks about the Linux Console and explains some escape sequences that you can print to the screen which make text boldface, change colors, etc. We will cover all of that stuff next time when we talk about the Slang library.

The Simple Way: Globbing

As we have already learned in Chapter 11 (on pgs. 193-199) globbing can be used to perform quick wildcard matches. A list of valid globbing metacharacters is found on pgs. 407-408 in your book.

Example Program: fnmatch.c - This program takes one glob expression as an agrument and applies it to the lines read from stdin. To test this program type something like:

$ ls | ./fnmatch "*.c"

Make sure you enclose the glob expression in quotes (single or double) otherwise the shell will try to expand it.

The Fun Way: Regular Expressions

Regular Expressions were invented by Steven Kleene as a notation for describing "the algebra of regular sets". They have since become the standard way of matching patterns of text. Regulare Expressions (or "regexes" for short) contain much more powerful pattern matching capabilities than simple globbing. Numerous Unix programs use regexes, including grep, sed, awk, perl, and numerous others.

If you are unfamiliar with regular expressions, I highly recommend you read Using Regular Expressions by Stephen Ramsay. Numerous other regex resources are listed below in the Further Study section.

Using Regexes in the Shell

Use the incredibly useful grep program, passing it a regular expression and a file to search in like so:

$ grep "^q.*z$" /usr/share/dict/words 
quartz
quiz

What that means is "find all the words that begin with 'q', end with 'z', and have anything else in the middle".

The following table gives a brief synopsis of regex metacharacters.

Wildcard Matching matches characters
. matches any one character
[adf] matches any one of the letters 'a', 'd', or 'f'
[^adf] matches any letter EXCEPT 'a', 'd', or 'f'
Quantifiers how many times to match a character
? match previous character zero or one times
* match previous character zero to N times
+ match previous character one to N times
\{5\} match previous character 5 times
\{2, 6\} match previous character at least 2 times but no more than 6 times
Anchors where it matches
^ match expression at start of line
$ match expression at end of line
\< match expression at beginning of word
\> match expression at end of word
\b match expression at word boundary (beginning or end)
\B match expression in the middle of a word

Wildcard Matching	matches characters
.	matches any one character
[adf]	matches any one of the letters 'a', 'd', or 'f'
[^adf]	matches any letter EXCEPT 'a', 'd', or 'f'
Quantifiers	how many times to match a character
?	match previous character zero or one times
*	match previous character zero to N times
+	match previous character one to N times
\{5\}	match previous character 5 times
\{2, 6\}	match previous character at least 2 times but no more than 6 times
Anchors	where it matches
^	match expression at start of line
$	match expression at end of line
\<	match expression at beginning of word
\>	match expression at end of word
\b	match expression at word boundary (beginning or end)
\B	match expression in the middle of a word

These can be combined to form any expression you need. Examples.

.* match anything (any character, any number of times)
v match any string that has a 'v'
[^v] match any string that does not have a 'v'
^v match any string that begins with a 'v'
^v.*e$ match any string that begins with a 'v' and ends with an 'e'
v{2,3} match any string with two or three consecutive 'v's

Feel free to experiment by grepping through /usr/share/dict/words.

Using Regexes in a C Program

Four functions are involved in using regular expressions in C. They are listed in your book on p. 409. It basically goes like this:

First call regcomp to compile the regular expression. You cannot just test a string right away like you can with fnmatch, you have to compile it first.
If regcomp returns non-zero, call regerror to get the error message. There are a few hoops you have to jump through to get regerror to actually return the error message. A sample wrapper function is presented on p. 413.
Call regexec as many times as you like with the compiled regular expression to test for a match.
When you are done, call regfree to free the memory used by the compiled regular expression.

The following example shows an application of those four steps.

Example Program: regmatch.c - A program that tests applying a regular expression to a sequence of strings. Experiment with this one by changing the regex, adding new strings, etc.

Confusing Example Program: match.c - From your book around pgs. 410-411. This is a very obtuse example of using regular expressions. Simply put, it tests if a line begins with a # (meaning it's a comment), but NOT a /#. It's not a terrific example.

Further Study

Good On-Line Regular Expression Tutorials

If you are unfamiliar with regular expressions, I recommend you look at the following pages:

Using Regular Expressions by Stephen Ramsay from the Electronic Text Center. A questions-and-answers approach to teaching regular expressions.

So What's A $#!%% Regular Expression, Anyway?! on DevShed.com. This is a fairly gentle introduction.

Regular Expressions on Linux-Learn.org. A straightfoward explanation of how to use regular expressions with application to grep and sed.

Regular Expressions from Zytrax.com. This is a "big picture" overview with a Web-centric approach.

Streams and sed -- The Stream Editor from rute.sourceforge.net. A sed tutorial that briefly explains regular expressions. Also contains some good information on piping and redirection in the shell.

Books

Mastering Regular Expressions, 2nd Edition by O'Reilly and Associates.

sed & awk, 2nd Edition by O'Reilly and Associates.

Your book, Chapter 21

Example Programs

These programs are all on Gautama.

grep - Matches regular expressions in files or the output of other programs.

locate - See which files on the hard drive match a regular expression.

vi - In non-insert mode, you can type "/" followed by a string to search for a regular expression. You can also type ":s/regex/replace/" to search and replace text.

sed - The stream editor. Contains an odd little programming language.

awk - A strange little programming language.

perl - An even stranger big programming language with some nice regex extensions.

Lab 11

This will be a lab on regular expression matching.

Description / Requirements

For this lab, I want you to write a miniature version of grep. Your program will take one argument, a regular expression, and tries to match that regex with lines read from stdin.

Some steps to help you:

First, verify that the user passed one command-line argument to the program.
Compile the string passed as a regular expression using regcomp. Be sure to test if there is an error compiling the regex.
Repeatedly read lines from stdin, testing for a match on each line by calling regexec
Last, cleanup by calling regfree.

Save this in a file called minigrep.c. The whole file should be about 80 or so lines long, but you will be able to re-use a lot of the code from the previous examples. It shouldn't take you too long. Be sure to have a Makefile as well. Put both files in a lab11/ directory.

Sample Output

Calling the program like this from the command line:

$ ./minigrep "^f.*z.*e$" < /usr/share/dict/words

should produce the following output:

familiarize
fantasize
faze
fertilize
finalize
fizzle
formalize
frazzle
freeze
frieze
froze

Another example: calling the program like this:

$ ./minigrep "^pu[zl].*e$" < /usr/share/dict/words

should produce the following output:

pulsate
pulse
puzzle

You can test if your program is working right by comparing the output of your program with the output of "grep".

Files

Here is a list of the files created in this lab.

Makefile
minigrep.c

Please turn in just the minigrep.c file with your name, lab # and class # at the top.

.*	match anything (any character, any number of times)
v	match any string that has a 'v'
[^v]	match any string that does not have a 'v'
^v	match any string that begins with a 'v'
^v.*e$	match any string that begins with a 'v' and ends with an 'e'
v{2,3}	match any string with two or three consecutive 'v's