File I/O, Handy Programs, Reg Exprs

File I/O

The following is a summary of the basic file operations. Some examples of these operations in action are given below.

Important Note: CGI programs (and all the resources used by those programs) are executed by user nobody. This means that all files that are used by a CGI program need to be world-readable or (if you will be writing to a file) world-writeable. If you are having difficulty accessing text files with your CGI program, do a chmod 555 textfile to cure it.

open

This is done with the open function and it is one we have seen before. The most common way to open a file in Perl is as follows:

open( FILEHANDLE, "filename.txt" ) or die "filename.txt: $!";

Note that after opening the file, you will use the FILEHANDLE and not the filename.txt for future operations.

Because the die function will cause the program to immediately exit, you might want to write a little subroutine that prints an error to the browser and then dies, so you (and the user) are not left wondering what happened.

If you pass open just a single argument (the filename) it will look for a scalar variable by the same name (case-sensitive) and open that filename, thusly:

$FILE = "/home/mark/myfile.txt";
open FILE or die "$!";

There are also a number of characters you can place before the filename that indicate you are opening the file for a special purpose. Most of these characters are borrowed from the shell and should look familiar to you.

"<filename.txt" - Open read-only.
">filename.txt" - Truncate (erase) the contents before writing.
">>filename.txt" - Open the file for writing and append to the end of the file.
Putting a + before any of the above tells Perl that you want both read and write access.
If you put a | before or after the filename, it opens it as a pipe for either reading or writing. (More on this later.)

read

As illustrated in the previous week's notes, Perl has a read function which you can use to read from a filehandle, thusly:

read( FILEHANDLE, $input, $num_bytes );

But that is neither the most common nor the most useful way to read from a file. You're more likely to see:

 $input_line = <FILEHANDLE>;

Or even:

 while (<FILEHANDLE>) { ...

Inside the body of the above while loop, each line read in would be stored in the scratch variable.

If you want to just read a line from stdin, do this:

$line = <>

Or in a loop:

while (<>) { ...

This is known as the "diamond operator".

write

Perl also has a write function that is, unfortunately, NOT analogous to the read function above. It is used for writing formatted text via Perl's built-in formatting capabilities (which we are not going to cover in class).

If you want to print something to a file, use the print function that you have already learned about, thusly:

print FILEHANDLE list...

Print can handle being given either a scalar or a list so you can print as many lines as you want to a file, or just one at a time.

Note that there is no comma after the FILEHANDLE. This is a common "gotcha" that people new to Perl encounter.

close

The good news is that you don't have to explicitly close files when you are done with them. Perl will close all open files automatically when the program exits.

If, however, you wish to be real neat n' tidy, you can close files explicitly with the close function, thusly:

close FILEHANDLE;

Other File Ops

There are a number of other file operations, most of them work just like C or just like the shell.

tell FILEHANDLE - This reports the current file position in bytes.
seek FILEHANDLE, OFFSET, WHENCE - This moves the current file position. The value of WHENCE is either 0, 1, or 2 to indicate whether the offset is relative to the beginning, current position, or end of file respectively. If you want to rewind to the beginning of a file, do: seek FILEHANDLE, 0, 0;
select FILEHANDLE - This works just like select in C. Any filehandle that you select in this way will be the destination for all future calls to print. This function will return the previous handle so you can restore it later.
unlink "filename" - Deletes a file. Note that this uses a filename and not a filehandle.
stat FILEHANDLE or "filename" - Returns a 13-element list of information describing the file. Stuff like creation time, last modification time, etc.

Example: Reading tabular text

Let's say you were doing a "voting booth" CGI program that allows people to select their favorite ice cream. You wish to record their selection and add it to a running total. To make that information persistent, you are using a file that is formatted as follows:

Starlight Mint: 4
Tin Roof: 7
Rocky Road: 3
Raspberry Swirl: 8

Since this is a small file, it can easily be read into memory, modified as needed, and then written back out to file. An appropriate data structure to use would be a hash for holding the information as key-value pairs.

Let's say that the front-end form contains a set of radio buttons named "flavor" that contains the flavor selected. We will assume for this example that the names of the form input fields have been read into a %FORM hash as per the previous week's notes. The following is the code needed to read the file, increment the choice, and write the results back to disk.

open (STATS, "flavors.txt") or die "$!";
while (<STATS>) {
	chomp;	# remove trailing newline
	($flavor, $votes) = split /:\s*/; # split line around : + whitespace
	$stats{$flavor} = $votes;
}
close STATS;
$stats{$FORM{flavor}}++;
open (STATS, ">flavors.txt") or die "$!";
foreach (keys %stats) {
	print STATS "$_: $stats{$_}\n";
}

If the data file you are using has more than two entries per line, a hash is typically not the best approach. For such situations, it is best to read each line into an array and use split to get the individual fields when needed.

Pipes as Files, and Some Handy Programs

In C there is a popen call that allows you to open a series of pipes. Perl allows you to do that using just the open call.

Two situations where you might want to use this in your homework assignments are presented.

Sending Email

There is a handly little command-line program called mail that takes an address and an optional subject as arguments, then reads a message from stdin and sends the email when it hits an end-of-file. Example:

mail -s "Subject" someone@somewhere.com <message.txt

Note the use of redirection to read from a file rather than stdin.

If you want to make an email gateway, the best way to send someone mail is like this:

# Use the 'mail' command-line program to send the email
open (MAIL, "| mail -s \"$subject\" $emailaddr") or die "$!";
print MAIL $message;
close MAIL; # sends it

You can put more than one command in a pipeline. Building on the previous example, if I wanted to make sure that the text sent in the email message was formatted as 75 characters per column, paragraphs indented, and uniform spacing, I could include the fmt command in the pipeline, thusly:

 open (MAIL, "| fmt -tu | mail -s \"$subject\" $emailaddr") or die "$!";

Grabbing Web Pages

I have mentioned before the little text-based browser called 'lynx'. It has some command-line switches that can be used to simply grab an HTML page from a URL and dump the contents to stdout.

To grab an HTML page with all the tags intact, do this:

lynx -source http://somewhere.com/page.html

To grab an HTML page WITHOUT the tags (i.e. just the text), do this:

lynx -dump -nolist http://somewhere.com/page.html

The -nolist switch tells lynx not to list the URLs enountered on the page at the bottom (which is the default).

If you're writing a 'bot that's going to grab Web pages from another site and read them for info, you could use something like this:

open (PAGE, "lynx -dump -nolist $url |") or die "$!";
while (<PAGE>) {
	if ($_ =~ /$info/) {
		# Do something with the info you found...
	}
}

There is also a wget command that grabs a Web page and saves it to a file. Use it if you find it helpful.

Regular Expressions

Fairly Important Paragraph: We will probably go over regular expressions in the lab next Tuesday, rather than tonight in class. I am not going to write up a lot of notes on this, so I urge you in the strongest possible terms to read the perlre man page. I promise that you will see questions on the test that come from this man page.

Here is the most basic info about regular epressions in Perl. Perl borrowed nearly all of these constructs from the sed program (Stream EDitor).

Matching Operator (=~) - Allows you to apply a regular expression to a scalar variable, thusly:
```
 if ($var =~ /match/) { ... 
```
If the matching variable is not used, all regular expressions are applied to the scratch variable ($_).
Matching (/match/) - Simply attempts to match a pattern to a variable. Can also be written as m/match/. (The little 'm' at the beginning stands for 'match'.)
Translation (tr/from/to/) - This translates from one character (or character class) to another character (or character class). Can also be written as y/from/to/.
Example: If you wanted to translate from upper-case to lower-case characters, you would type:
```
 tr /[A-Z]/[a-z]/; 
```
You typically don't see translation as much as you see matching or substitution.
Substitution (s/search/replace/) - Attempts to match the pattern between the first and second '/' and replace that pattern with the text between the second and third '/'.
There are wild and wooly things you can do with substitution. The following example will strip HTML tags from a file:
```
 cat file.html | perl -pe 's/<.*?>//g' > file.txt 
```
Delimiters - If you don't want to use slashes as delimeters, Perl allows you to use whatever other characters you would like.
The following example removes unnecessary </P> and </LI> tags from an html file:
```
 cat old.html | perl -pe 's#</(p|li)>##ig' > new.html 
```

Changelog:

4/14/99 - Initial Revision
4/20/99 - Fixed bugs in reading file contents into a hash