

Awk
Last updated - Tue Dec 28 20:29:22 EST 2010 Part of the Unix tutorials And then there's My blog

Table of Contents

Why learn AWK?
Basic Structure
Executing an AWK script
Which shell to use with AWK?
Dynamic Variables
The Essential Syntax of AWK
Arithmetic Expressions
Unary arithmetic operators
The Autoincrement and Autodecrement Operators
Assignment Operators
Conditional expressions
Regular Expressions
And/Or/Not
Commands
AWK Built-in Variables
FS - The Input Field Separator Variable
OFS - The Output Field Separator Variable
NF - The Number of Fields Variable
NR - The Number of Records Variable
RS - The Record Separator Variable
ORS - The Output Record Separator Variable
FILENAME - The Current Filename Variable
Associative Arrays
Multi-dimensional Arrays
Example of using AWK's Associative Arrays
Output of the script
Picture Perfect PRINTF Output
PRINTF - formatting output
Escape Sequences
Format Specifiers
Width - specifying minimum field size
Left Justification
The Field Precision Value
Explicit File output
AWK Numerical Functions
Trigonometric Functions
Exponents, logs and square roots
Truncating Integers
Random Numbers
The Lotto script
String Functions
The Length function
The Index Function
The Substr function
GAWK's Tolower and Toupper function
The Split function
NAWK's string functions
The Match function
The System function
The Getline function
The systime function
The Strftime function
User Defined Functions
AWK patterns
Formatting AWK programs
Environment Variables
ARGC - Number of arguments (NAWK/GAWK)
ARGV - Array of arguments (NAWK/GAWK)
ARGIND - Argument Index (GAWK only)
FNR (NAWK/GAWK)
OFMT (NAWK/GAWK)
RSTART, RLENGTH and match (NAWK/GAWK)
SUBSEP - Multi-dimensional array separator (NAWK/GAWK)
ENVIRON - environment variables (GAWK only)
IGNORECASE (GAWK only)
CONVFMT - conversion format (GAWK only)
ERRNO - system errors (GAWK only)
FIELDWIDTHS - fixed width fields (GAWK only)
AWK, NAWK, GAWK, or PERL

Copyright 2001,2004 Bruce Barnett and General Electric Company All rights reserved You are allowed to print copies of this tutorial for your personal use, and link to this page, but you are not allowed to make electronic copies, or redistribute this tutorial in any form without permission. Original version written in 1994 and published in the Sun Observer Updated: Tue Mar 9 11:07:08 EST 2004 Updated: Fri Feb 16 05:32:38 EST 2007 Updated: Wed Apr 16 20:55:07 EDT 2008 Awk is an extremely versatile programming language for working on files. We'll teach you just enough to understand the examples in this page, plus a smidgen. The examples given below have the extensions of the executing script as part of the filename. Once you download it, and make it executable, you can rename it anything you want.

Why learn AWK?


In the past I have covered grep and sed. This section discusses AWK, another cornerstone of UNIX shell programming. There are three variations of AWK:

AWK - the original from AT&T
NAWK - a newer, improved version from AT&T
GAWK - the Free Software Foundation's version

Originally, I didn't plan to discuss NAWK, but several UNIX vendors have replaced AWK with NAWK, and there are several incompatibilities between the two. It would be cruel of me not to warn you about the differences, so I will highlight them when I come to them. It is important to know that all of AWK's features are in NAWK and GAWK. Most, if not all, of NAWK's features are in GAWK. NAWK ships as part of Solaris. GAWK does not. However, many sites on the Internet have the sources freely available. If you use Linux, you have GAWK. But in general, assume that I am talking about the classic AWK unless otherwise noted.

Why is AWK so important? It is an excellent filter and report writer. Many UNIX utilities generate rows and columns of information. AWK is an excellent tool for processing these rows and columns, and it is easier to use AWK than most conventional programming languages. It can be considered a pseudo-C interpreter, as it understands the same arithmetic operators as C.

AWK also has string manipulation functions, so it can search for particular strings and modify the output. AWK also has associative arrays, which are incredibly useful, and are a feature most computing languages lack. Associative arrays can make a complex problem a trivial exercise.

I won't exhaustively cover AWK. That is, I will cover the essential parts, and avoid the many variants of AWK. It might be too confusing to discuss three different versions of AWK. I won't cover the GNU version of AWK called "gawk." Similarly, I will not discuss the new AT&T AWK called "nawk." The new AWK comes on the Sun system, and you may find it superior to the old AWK in many ways. In particular, it has better diagnostics, and won't print out the infamous "bailing out near line ..." message the original AWK is prone to do. Instead, "nawk" prints out the line it didn't understand, and highlights the bad parts with arrows. GAWK does this as well, and this really helps a lot. If you find yourself needing a feature that is very difficult or impossible to do in AWK, I suggest you either use NAWK or GAWK, or convert your AWK script into PERL using the "a2p" conversion program which comes with PERL. PERL is a marvelous language, and I use it all the time, but I do not plan to cover PERL in these tutorials. Having made my intention clear, I can continue with a clear conscience.

Many UNIX utilities have strange names. AWK is one of those utilities. It is not an abbreviation for awkward. In fact, it is an elegant and simple language. The word "AWK" is derived from the initials of the language's three developers: A. Aho, B. W. Kernighan and P. Weinberger.

Basic Structure
The essential organization of an AWK program follows the form: pattern { action } The pattern specifies when the action is performed. Like most UNIX utilities, AWK is line oriented. That is, the pattern specifies a test that is performed with each line read as input. If the condition is true, then the action is taken. The default pattern is something that matches every line. This is the blank or null pattern. Two other important patterns are specified by the keywords "BEGIN" and "END." As you might expect, these two words specify actions to be taken before any lines are read, and after the last line is read. The AWK program below:
BEGIN { print "START" }
      { print         }
END   { print "STOP"  }

adds one line before and one line after the input file. This isn't very useful, but with a simple change, we can make this into a typical AWK program:

BEGIN { print "File\tOwner" }
{ print $8, "\t", $3}
END { print " - DONE -" }

I'll improve the script in the next sections, but we'll call it "FileOwner." But let's not put it into a script or file yet. I will cover that part in a bit. Hang on and follow with me so you get the flavor of AWK.

The characters "\t" indicate a tab character, so the output lines up on even boundaries. The "$8" and "$3" have a meaning similar to a shell script. Instead of the eighth and third argument, they mean the eighth and third field of the input line. You can think of a field as a column, and the action you specify operates on each line or row read in.

There are two differences between AWK and a shell processing the characters within double quotes. AWK understands special characters following the "\" character, like "t". The Bourne and C UNIX shells do not. Also, unlike the shell (and PERL), AWK does not evaluate variables within strings. To explain, the second line could not be written like this: {print "$8\t$3" } That example would print "$8 $3." Inside the quotes, the dollar sign is not a special character. Outside, it corresponds to a field.

What do I mean by the third and eighth field? Consider the Solaris "/usr/bin/ls -l" command, which has eight columns of information. The System V version (similar to the Linux version), "/usr/5bin/ls -l," has 9 columns. The third column is the owner, and the eighth (or ninth) column is the name of the file. This AWK program can be used to process the output of the "ls -l" command, printing out the filename, then the owner, for each file. I'll show you how. Update: On a Linux system, change "$8" to "$9".

One more point about the use of a dollar sign. In scripting languages like Perl and the various shells, a dollar sign means the word following is the name of the variable. Awk is different. The dollar sign means that we are referring to a field or column in the current line. When switching between Perl and AWK you must remember that "$" has a different meaning.
So the following piece of code prints two "fields" to standard out. The first field printed is the number "5", the second is the fifth field (or column) on the input line.

BEGIN { x=5 }
{ print x, $x }

Executing an AWK script


So let's start writing our first AWK script. There are a couple of ways to do this. Assuming the first script is called "FileOwner," the invocation would be

ls -l | FileOwner

This might generate the following if there were only two files in the current directory:

File Owner

a.file barnett
another.file barnett
 - DONE -

There are two problems with this script. Both problems are easy to fix, but I'll hold off on this until I cover the basics. The script itself can be written in many ways. The C shell version would look like this:
#!/bin/csh -f
# Linux users have to change $8 to $9
awk ' \
BEGIN { print "File\tOwner" } \
{ print $8, "\t", $3} \
END { print " - DONE -" } \
'

Click here to get file: awk_example1.csh

As you can see in the above script, each line of the AWK script must have a backslash if it is not the last line of the script. This is necessary as the C shell doesn't, by default, allow strings to span multiple lines. I have a long list of complaints about using the C shell. See Top Ten reasons not to use the C shell.

The Bourne shell (as do most shells) allows quoted strings to span several lines:

#!/bin/sh
# Linux users have to change $8 to $9
awk '
BEGIN { print "File\tOwner" }
{ print $8, "\t", $3}
END { print " - DONE -" }
'

Click here to get file: awk_example1.sh

The third form is to store the commands in a file, and execute

awk -f filename

Since AWK is also an interpreter, you can save yourself a step and make the file executable by adding one line at the beginning of the file:

#!/bin/awk -f
BEGIN { print "File\tOwner" }
{ print $8, "\t", $3}
END { print " - DONE -" }

Click here to get file: awk_example1.awk

Change the permission with the chmod command, (i.e. "chmod +x awk_example1.awk"), and the script becomes a new command. Notice the "-f" option following "#!/bin/awk" above, which is also used in the third format, where you use AWK to execute the file directly, i.e. "awk -f filename". The "-f" option specifies the AWK file containing the instructions. As you can see, AWK considers lines that start with a "#" to be a comment, just like the shell. To be precise, anything from the "#" to the end of the line is a comment (unless it's inside an AWK string). However, I always comment my AWK scripts with the "#" at the start of the line, for reasons I'll discuss later.

Which format should you use? I prefer the last format when possible. It's shorter and simpler. It's also easier to debug problems. If you need to use a shell, and want to avoid using too many files, you can combine them as we did in the first and second example.

Which shell to use with AWK?


The format of AWK is not free-form. You cannot put new line breaks just anywhere. They must go in particular locations. To be precise, in the original AWK you can insert a new line character after the curly braces, and at the end of a command, but not elsewhere. If you wanted to break a long line into two lines at any other place, you had to use a backslash:

#!/bin/awk -f
BEGIN { print "File\tOwner" }
{ print $8, "\t", \
$3}
END { print " - DONE -" }

Click here to get file: awk_example2.awk

The Bourne shell version would be

#!/bin/sh
awk '
BEGIN { print "File\tOwner" }
{ print $8, "\t", \
$3}
END { print "done"}
'

Click here to get file: awk_example2.sh

while the C shell would be

#!/bin/csh -f
awk '
BEGIN { print "File\tOwner" }\
{ print $8, "\t", \\
$3}\
END { print "done"}\
'

Click here to get file: awk_example2.csh

As you can see, this demonstrates how awkward the C shell is when enclosing an AWK script. Not only are backslashes needed for every line, some lines need two. (Note - this is true when using old awk (e.g. on Solaris) because the print statement had to be on one line. Newer AWKs are more flexible, and newlines can be added.) Many people will warn you about the C shell. Some of the problems are subtle, and you may never see them. Try to include an AWK or sed script within a C shell script, and the backslashes will drive you crazy. This is what convinced me to learn the Bourne shell years ago, when I was starting out. I strongly recommend you use the Bourne shell for any AWK or sed script. If you don't use the Bourne shell, then you should learn it. As a minimum, learn how to set variables, which by some strange coincidence is the subject of the next section.

Dynamic Variables
Since you can make a script an AWK executable by mentioning "#!/bin/awk -f" on the first line, including an AWK script inside a shell script isn't needed unless you want to either eliminate the need for an extra file, or pass a variable to the insides of an AWK script. Since this is a common problem, now is as good a time as any to explain the technique. I'll do this by showing a simple AWK program that will only print one column. NOTE: there will be a bug in the first version. The number of the column will be specified by the first argument. The first version of the program, which we will call "Column," looks like this:

#!/bin/sh
#NOTE - this script does not work!
column=$1
awk '{print $column}'

Click here to get file (but be aware that it doesn't work): Column1.sh

A suggested use is:

ls -l | Column 3

This would print the third column from the ls command, which would be the owner of the file. You can change this into a utility that counts how many files are owned by each user by adding

ls -l | Column 3 | uniq -c | sort -nr

Only one problem: the script doesn't work. The value of the "column" variable is not seen by AWK. Change "awk" to "echo" to check. You need to turn off the quoting when the variable is seen. This can be done by ending the quoting, and restarting it after the variable:

#!/bin/sh
column=$1
awk '{print $'$column'}'

Click here to get file: Column2.sh

This is a very important concept, and throws experienced programmers a curve ball. In many computer languages, a string has a start quote, an end quote, and the contents in between. If you want to include a special character inside the quote, you must prevent the character from having the typical meaning. In the C language, this is done by putting a backslash before the character. In other languages, there is a special combination of characters to do this. In the C and Bourne shell, the quote is just a switch. It turns the interpretation mode on or off. There is really no such concept as "start of string" and "end of string." The quotes toggle a switch inside the interpreter. The quote character is not passed on to the application. This is why there are two pairs of quotes above. Notice there are two dollar signs. The first one is quoted, and is seen by AWK. The second one is not quoted, so the shell evaluates the variable, and replaces "$column" by the value. If you don't understand, either change "awk" to "echo," or change the first line to read "#!/bin/sh -x."

Some improvements are needed, however. The Bourne shell has a mechanism to provide a value for a variable if the value isn't set, or is set and the value is an empty string.
This is done by using the format:

${variable:-defaultvalue}

This is shown below, where the default column will be one:

#!/bin/sh
column=${1:-1}
awk '{print $'$column'}'

Click here to get file: Column3.sh

We can save a line by combining these two steps:

#!/bin/sh
awk '{print $'${1:-1}'}'

Click here to get file: Column4.sh

It is hard to read, but it is compact. There is one other method that can be used. If you execute an AWK command and include variable=value on the command line, this variable will be set when the AWK script starts. An example of this use would be:

#!/bin/sh
awk '{print $c}' c=${1:-1}

Click here to get file: Column5.sh

This last variation does not have the problems with quoting the previous example had. You should master the earlier example, however, because you can use it with any script or command. The second method is special to AWK. Modern AWKs have other options as well. See the comp.unix.shell FAQ.
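Modern AWK implementations (NAWK, GAWK, and any POSIX awk) also accept a -v option, which assigns a variable before any input is read. This is a sketch of the same Column utility using it; the filename Column6.sh is just an illustrative name, not one of the files above:

```shell
#!/bin/sh
# Column6.sh (hypothetical name) - print one column, default 1.
# -v assigns the awk variable before any input is read,
# so it is visible even inside a BEGIN block.
column=${1:-1}
awk -v c="$column" '{print $c}'
```

Unlike the variable=value form, a variable set with -v is already available when the BEGIN block runs.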

The Essential Syntax of AWK


Earlier I discussed ways to start an AWK script. This section will discuss the various grammatical elements of AWK.

Arithmetic Expressions
There are several arithmetic operators, similar to C. These are the binary operators, which operate on two variables:
+--------------------------------------------+
|               AWK Table 1                  |
|             Binary Operators               |
|Operator   Type         Meaning             |
+--------------------------------------------+
|+          Arithmetic   Addition            |
|-          Arithmetic   Subtraction         |
|*          Arithmetic   Multiplication      |
|/          Arithmetic   Division            |
|%          Arithmetic   Modulo              |
|<space>    String       Concatenation       |
+--------------------------------------------+

Using variables with the value of "7" and "3," AWK returns the following results for each operator when using the print command:
+---------------------+
|Expression   Result  |
+---------------------+
|7+3          10      |
|7-3          4       |
|7*3          21      |
|7/3          2.33333 |
|7%3          1      |
|7 3          73      |
+---------------------+

There are a few points to make. The modulo operator finds the remainder after an integer divide. The print command outputs a floating point number on the divide, but an integer for the rest. The string concatenation operator is confusing, since it isn't even visible. Place a space between two variables and the strings are concatenated together. This also shows that numbers are converted automatically into strings when needed. Unlike C, AWK doesn't have "types" of variables. There is one type only, which can be a string or number. The conversion rules are simple. A number can easily be converted into a string. When a string is converted into a number, AWK will do so. The string "123" will be converted into the number 123. However, the string "123X" will be converted into the number 0. (NAWK will behave differently, and converts the string into the integer 123, which is found at the beginning of the string.)
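The table above can be verified directly. This small sketch evaluates each operator with the values 7 and 3 in a BEGIN block:

```shell
awk 'BEGIN {
    x = 7; y = 3
    print x + y      # addition: 10
    print x - y      # subtraction: 4
    print x * y      # multiplication: 21
    print x / y      # division: 2.33333 (floating point)
    print x % y      # modulo: 1
    print x y        # concatenation (a space between variables): 73
}'
```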

Unary arithmetic operators


The "+" and "-" operators can be used before variables and numbers. If x equals 4, then the statement: print -x; will print "-4."

The Autoincrement and Autodecrement Operators


AWK also supports the "++" and "--" operators of C. Both increment or decrement a variable by one. The operator can only be used with a single variable, and can come before or after the variable. The prefix form modifies the value, and then uses the result, while the postfix form gets the result of the variable, and afterwards modifies the variable. As an example, if x has the value of 3, then the AWK statement print x++, " ", ++x; would print the numbers 3 and 5. These operators are also assignment operators, and can be used by themselves on a line:

x++; --y;
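A quick way to see the prefix/postfix difference, with x starting at 3 as in the example above:

```shell
awk 'BEGIN {
    x = 3
    print x++    # postfix: uses 3, then x becomes 4
    print ++x    # prefix: x becomes 5 first, then prints 5
    print x      # x is now 5
}'
```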

Assignment Operators
Variables can be assigned new values with the assignment operators. You know about "++" and "--." The other assignment statement is simply: variable = arithmetic_expression Certain operators have precedence over others; parentheses can be used to control grouping. The statement x=1+2*3 4; is the same as x = (1 + (2 * 3)) "4"; Both assign "74" to x. Notice spaces can be added for readability. AWK, like C, has special assignment operators, which combine a calculation with an assignment. Instead of saying x=x+2; you can more concisely say: x+=2; The complete list follows:
+-----------------------------------------+
|              AWK Table 2                |
|          Assignment Operators           |
|Operator   Meaning                       |
+-----------------------------------------+
|+=         Add result to variable        |
|-=         Subtract result from variable |
|*=         Multiply variable by result   |
|/=         Divide variable by result     |
|%=         Apply modulo to variable      |
+-----------------------------------------+
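Each operator in the table can be tried in a BEGIN block. This sketch starts with x=10 and applies several in sequence:

```shell
awk 'BEGIN {
    x = 10
    x += 2;  print x   # 12
    x -= 4;  print x   # 8
    x *= 3;  print x   # 24
    x /= 6;  print x   # 4
    x %= 3;  print x   # 1
}'
```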

Conditional expressions

The second type of expression in AWK is the conditional expression. This is used for certain tests, like the if or while. Boolean conditions evaluate to true or false. In AWK, there is a definite difference between a boolean condition and an arithmetic expression. You cannot convert a boolean condition to an integer or string. You can, however, use an arithmetic expression as a conditional expression. A value of 0 is false, while anything else is true. Undefined variables have the value of 0. Unlike AWK, NAWK lets you use booleans as integers. Arithmetic values can also be converted into boolean conditions by using relational operators:
+---------------------------------------+
|             AWK Table 3               |
|          Relational Operators         |
|Operator   Meaning                     |
+---------------------------------------+
|==         Is equal                    |
|!=         Is not equal to             |
|>          Is greater than             |
|>=         Is greater than or equal to |
|<          Is less than                |
|<=         Is less than or equal to    |
+---------------------------------------+

These operators are the same as the C operators. They can be used to compare numbers or strings. With respect to strings, lower case letters are greater than upper case letters.
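The string comparison rule follows from ASCII ordering, where every upper case letter sorts before any lower case letter. A small demonstration:

```shell
awk 'BEGIN {
    if (7 >= 3)          print "7 >= 3 is true"
    if ("abc" < "abd")   print "abc sorts before abd"
    if ("Z" < "a")       print "upper case sorts before lower case"
}'
```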

Regular Expressions
Two operators are used to compare strings to regular expressions:
+-----------------------------+
|         AWK Table 4         |
|Regular Expression Operators |
|Operator   Meaning           |
+-----------------------------+
|~          Matches           |
|!~         Doesn't match     |
+-----------------------------+

The order in this case is particular. The regular expression must be enclosed by slashes, and comes after the operator. AWK supports extended regular expressions, so the following are examples of valid tests:

word !~ /START/
lawrence_welk ~ /(one|two|three)/
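These operators are typically used in a pattern, selecting which lines the action runs on. A sketch that keeps only the lines whose first field matches one of three words:

```shell
# print the lines whose first field is "one", "two", or "three"
printf 'one fish\nfour fowl\ntwo fish\n' | \
awk '$1 ~ /^(one|two|three)$/ { print $0 }'
```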

And/Or/Not

There are two boolean operators that can be used with conditional expressions. That is, you can combine two conditional expressions with the "and" and "or" operators: "&&" and "||." There is also the unary not operator: "!."

Commands
There are only a few commands in AWK. The list and syntax follows:

if ( conditional ) statement [ else statement ]
while ( conditional ) statement
for ( expression ; conditional ; expression ) statement
for ( variable in array ) statement
break
continue
{ [ statement ] ...}
variable=expression
print [ expression-list ] [ > expression ]
printf format [ , expression-list ] [ > expression ]
next
exit

At this point, you can use AWK as a language for simple calculations. If you wanted to calculate something, and not read any lines for input, you could use the BEGIN keyword discussed earlier, combined with an exit command:
#!/bin/awk -f
BEGIN {
    # Print the squares from 1 to 10 the first way
    i=1;
    while (i <= 10) {
        printf "The square of %d is %d\n", i, i*i;
        i = i+1;
    }
    # do it again, using more concise code
    for (i=1; i <= 10; i++) {
        printf "The square of %d is %d\n", i, i*i;
    }
    # now end
    exit;
}

Click here to get file: awk_print_squares.awk The following asks for a number, and then squares it:

#!/bin/awk -f
BEGIN {
    print "type a number";
}
{
    print "The square of ", $1, " is ", $1*$1;
    print "type another number";
}
END {
    print "Done"
}

Click here to get file: awk_ask_for_square.awk

The above isn't a good filter, because it asks for input each time. If you pipe the output of another program into it, you would generate a lot of meaningless prompts. Here is a filter that you should find useful. It counts lines, totals up the numbers in the first column, and calculates the average. Pipe "wc -c *" into it, and it will count files, and tell you the average number of characters per file, as well as the total characters and the number of files.
#!/bin/awk -f
BEGIN {
    # How many lines
    lines=0;
    total=0;
}
{
    # this code is executed once for each line
    # increase the number of files
    lines++;
    # increase the total size, which is field #1
    total+=$1;
}
END {
    # end, now output the total
    print lines " lines read";
    print "total is", total;
    if (lines > 0 ) {
        print "average is", total/lines;
    } else {
        print "average is 0";
    }
}

Click here to get file: average.awk

You can pipe the output of "ls -s" into this filter to count the number of files, the total size, and the average size. There is a slight problem with this script, as it includes the line of "ls" output that reports the total. This causes the number of files to be off by one. Changing

lines++;

to

if ($1 != "total" ) lines++;

will fix this problem. Note the code which prevents a divide by zero. This is common in well-written scripts. I also initialize the variables to zero. This is not necessary, but it is a good habit.

AWK Built-in Variables


I have mentioned two kinds of variables: positional and user defined. A user defined variable is one you create. A positional variable is not a special variable, but a function triggered by the dollar sign. Therefore

print $1;

and

X=1;
print $X;

do the same thing: print the first field on the line. There are two more points about positional variables that are very useful. The variable "$0" refers to the entire line that AWK reads in. That is, if you had eight fields in a line,

print $0;

is similar to

print $1, $2, $3, $4, $5, $6, $7, $8

This will change the spacing between the fields; otherwise, they behave the same. You can modify positional variables. The following commands

$2="";
print;

delete the second field. If you had four fields, and wanted to print out the second and fourth field, there are two ways. This is the first:

#!/bin/awk -f { $1=""; $3=""; print; }

and the second


#!/bin/awk -f { print $2, $4; }

These perform similarly, but not identically. The number of spaces between the values varies. There are two reasons for this. The actual number of fields does not change. Setting a positional variable to an empty string does not delete the variable. It's still there, but the contents have been deleted. The other reason is the way AWK outputs the entire line. There is a field separator that specifies what character to put between the fields on output. The first example outputs four fields, while the second outputs two. In-between each field is a space. This is easier to explain if the characters between fields could be modified to be made more visible. Well, they can. AWK provides special variables for just that purpose.
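The spacing difference is easy to see on a sample line. Clearing a field keeps its separators, while printing selected fields puts exactly one separator between each:

```shell
# the field count stays at four; the empty first and third
# fields leave extra separators behind
echo 'alpha beta gamma delta' | awk '{ $1 = ""; $3 = ""; print }'
# only two fields are printed, with one space between them
echo 'alpha beta gamma delta' | awk '{ print $2, $4 }'
```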

FS - The Input Field Separator Variable


AWK can be used to parse many system administration files. However, many of these files do not use whitespace as a separator. As an example, the password file uses colons. You can easily change the field separator character to be a colon using the "-F" command line option. The following command will print out accounts that don't have passwords:

awk -F: '{if ($2 == "") print $1 ": no password!"}' </etc/passwd

There is a way to do this without the command line option. The variable "FS" can be set like any variable, and has the same function as the "-F" command line option. The following is a script that has the same function as the one above.
#!/bin/awk -f
BEGIN {
    FS=":";
}
{
    if ( $2 == "" ) {
        print $1 ": no password!";
    }
}

Click here to get file: awk_nopasswd.awk

The second form can be used to create a UNIX utility, which I will name "chkpasswd," and execute like this:

chkpasswd </etc/passwd

The command "chkpasswd -F:" cannot be used, because AWK will never see this argument. All interpreter scripts accept one and only one argument, which is immediately after the "#!/bin/awk" string. In this case, the single argument is "-f." Another difference between the command line option and the internal variable is the ability to set the input field separator to be more than one character. If you specify

FS=": ";

then AWK will split a line into fields wherever it sees those two characters, in that exact order. You cannot do this on the command line. There is a third advantage the internal variable has over the command line option: you can change the field separator character as many times as you want while reading a file. Well, at most once for each line. You can even change it depending on the line you read. Suppose you had the following file which contains the numbers 1 through 7 in three different formats. Lines 4 through 6 have colon separated fields, while the others are separated by spaces.

ONE 1 I
TWO 2 II
#START
THREE:3:III
FOUR:4:IV
FIVE:5:V
#STOP
SIX 6 VI
SEVEN 7 VII

The AWK program can easily switch between these formats:
#!/bin/awk -f
{
    if ($1 == "#START") {
        FS=":";
    } else if ($1 == "#STOP") {
        FS=" ";
    } else {
        #print the Roman number in column 3
        print $3
    }
}

Click here to get file: awk_example3.awk

Note the field separator variable retains its value until it is explicitly changed. You don't have to reset it for each line. Sounds simple, right? However, I have a trick question for you. What happens if you change the field separator while reading a line? That is, suppose you had the following line

One Two:Three:4 Five

and you executed the following script:
#!/bin/awk -f
{
    print $2
    FS=":"
    print $2
}

What would be printed? "Three" or "Two:Three:4?" Well, the script would print out "Two:Three:4" twice. However, if you deleted the first print statement, it would print out "Three" once! I thought this was very strange at first, but after pulling out some hair, kicking the deck, and yelling at myself and everyone who had anything to do with the development of UNIX, it is intuitively obvious. You just have to be thinking like a professional programmer to realize it is intuitive. I shall explain, and prevent you from causing yourself physical harm.

If you change the field separator before you read the line, the change affects what you read. If you change it after you read the line, it will not redefine the variables. You wouldn't want a variable to change on you as a side-effect of another action. A programming language with hidden side effects is broken, and should not be trusted. AWK allows you to redefine the field separator either before or after you read the line, and does the right thing each time. Once you read the variables in, they will not change unless you change them. Bravo!

To illustrate this further, here is another version of the previous code that changes the field separator dynamically. In this case, AWK does it by examining field "$0," which is the entire line. When the line contains a colon, the field separator is a colon, otherwise, it is a space:
#!/bin/awk -f
{
    if ( $0 ~ /:/ ) {
        FS=":";
    } else {
        FS=" ";
    }
    #print the third field, whatever format
    print $3
}

Click here to get file: awk_example4.awk This example eliminates the need to have the special "#START" and "#STOP" lines in the input.
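One caveat worth knowing: in POSIX awk and GAWK, a new FS only takes effect when the next line is read, so the script above can miss the first line after a format change. Re-assigning $0 to itself forces the current line to be re-split with the new FS. A sketch, with some of the sample data fed on standard input:

```shell
printf 'ONE 1 I\nTHREE:3:III\nFOUR:4:IV\nFIVE 5 V\n' | \
awk '{
    if ($0 ~ /:/) { FS = ":" } else { FS = " " }
    $0 = $0    # re-split the current line using the new FS
    print $3
}'
```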

OFS - The Output Field Separator Variable


There is an important difference between print $2 $3 and print $2, $3 The first example prints out one field, and the second prints out two fields. In the first case, the two positional parameters are concatenated together and output without a space. In the second case, AWK prints two fields, and places the output field separator between them. Normally this is a space, but you can change this by modifying the variable "OFS." If you wanted to copy the password file, but delete the encrypted password, you could use AWK:
#!/bin/awk -f
BEGIN {
    FS=":";
    OFS=":";
}
{
    $2="";
    print
}

Click here to get file: delete_passwd.awk Give this script the password file, and it will delete the password, but leave everything else the same. You can make the output field separator any number of characters. You are not limited to a single character.
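To see that OFS may be longer than one character, consider this sketch (the " | " separator is just an illustrative choice). Assigning a field, even trivially, forces AWK to rebuild the line using the new OFS:

```shell
# Rebuild each line with a multi-character output field separator.
# The assignment $1=$1 forces AWK to reconstruct $0 using OFS.
out=$(printf 'a b c\n' | awk 'BEGIN { OFS=" | " } { $1 = $1; print }')
echo "$out"
```

The output is "a | b | c". Without the $1=$1 assignment, AWK would print the original line untouched, because $0 is only rebuilt when a field is modified.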

NF - The Number of Fields Variable


It is useful to know how many fields are on a line. You may want to have your script change its operation based on the number of fields. As an example, the command "ls -l" may generate eight or nine fields, depending on which version you are executing. The System V version, "/usr/bin/ls -l", generates nine fields, which is equivalent to the Berkeley "/usr/ucb/ls -lg" command. If you wanted to print the owner and filename, then the following AWK script would work with either version of "ls":
#!/bin/awk -f
# parse the output of "ls -l"
# print owner and filename
# remember - Berkeley ls -l has 8 fields, System V has 9
{
    if (NF == 8) {
        print $3, $8;
    } else if (NF == 9) {
        print $3, $9;
    }
}

Click here to get file: owner_group.awk

Don't forget the variable can be prepended with a "$". This allows you to print the last field of any line:

    #!/bin/awk -f
    { print $NF; }

Click here to get file: print_last_field.awk

One warning about AWK: there is a limit of 99 fields in a single line. PERL does not have any such limitation.

NR - The Number of Records Variable


Another useful variable is "NR." This tells you the number of records, or the line number. You can use AWK to examine only certain lines. This example prints line 100 and every line after it, putting the line number before each line:

    #!/bin/awk -f
    {
        if (NR >= 100) {
            print NR, $0;
        }
    }

Click here to get file: awk_example5.awk
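NR can also select a range of lines. This sketch (the sample input is arbitrary) prints only lines 4 through 6 of its input, each preceded by its line number:

```shell
# Use NR to print only a range of lines, with line numbers.
out=$(printf '%s\n' one two three four five six seven |
      awk 'NR >= 4 && NR <= 6 { print NR, $0 }')
echo "$out"
```

The output is the three lines "4 four", "5 five", and "6 six".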

RS - The Record Separator Variable


Normally, AWK reads one line at a time, and breaks up the line into fields. You can set the "RS" variable to change AWK's definition of a "line." If you set it to an empty string, then AWK will read the entire file into memory. You can combine this with changing the "FS" variable. This example treats each line as a field, and prints out the second and third line:
#!/bin/awk -f
BEGIN {
    # change the record separator from newline to nothing
    RS=""
    # change the field separator from whitespace to newline
    FS="\n"
}
{
    # print the second and third line of the file
    print $2, $3;
}

Click here to get file: awk_example6.awk

The two lines are printed with a space between them. Also, because of the 99-field limit, this will only work if the input file has fewer than 100 lines, so this technique is limited. You can use it to break words up, one word per line, using this:
#!/bin/awk -f
BEGIN {
    RS=" ";
}
{
    print;
}

Click here to get file: oneword_per_line.awk

but this only works if all of the words are separated by a single space. If there is a tab or punctuation between words, it would not work.
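A more robust way to get one word per line, which handles tabs as well as spaces, is to leave RS alone and loop over the fields of each record, since the default field separator treats any run of whitespace as one break:

```shell
# Print one word per line by looping over the fields of each record.
# The default FS treats any run of spaces or tabs as a single separator.
out=$(printf 'one two\tthree\nfour\n' |
      awk '{ for (i = 1; i <= NF; i++) print $i }')
echo "$out"
```

The four words come out one per line, even though the input mixes spaces, a tab, and a newline.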

ORS - The Output Record Separator Variable


The default output record separator is a newline, like the input. This can be set to a carriage return followed by a newline, if you need to generate a text file for a non-UNIX system.
#!/bin/awk -f
# this filter adds a carriage return to all lines
# before the newline character
BEGIN {
    ORS="\r\n"
}
{ print }

Click here to get file: add_cr.awk

FILENAME - The Current Filename Variable


The last variable known to regular AWK is "FILENAME," which tells you the name of the file being read.
#!/bin/awk -f
# reports which file is being read
BEGIN {
    f="";
}
{
    if (f != FILENAME) {
        print "reading", FILENAME;
        f=FILENAME;
    }
    print;
}

Click here to get file: awk_example6a.awk

This can be used if several files need to be parsed by AWK. Normally you use standard input to provide AWK with information. You can also specify the filenames on the command line. If the above script was called "testfilter," and if you executed it with

    testfilter file1 file2 file3

it would print out the filename before each change. An alternate way to specify this on the command line is

    testfilter file1 - file3 <file2

In this case, the second file will be called "-," which is the conventional name for standard input. I have used this when I want to put some information before and after a filter operation. The prefix and postfix files place special data before and after the real data. By checking the filename, you can parse the information differently. This is also useful to report syntax errors in particular files:
#!/bin/awk -f
{
    if (NF == 6) {
        # do the right thing
    } else {
        if (FILENAME == "-" ) {
            print "SYNTAX ERROR, Wrong number of fields,",
                "in STDIN, line #:", NR, "line: ", $0;
        } else {
            print "SYNTAX ERROR, Wrong number of fields,",
                "Filename: ", FILENAME, "line # ", NR, "line: ", $0;
        }
    }
}

Click here to get file: awk_example7.awk

Associative Arrays
I have used dozens of different programming languages over the last 20 years, and AWK is the first language I found that has associative arrays. This term may be meaningless to you, but believe me, these arrays are invaluable, and simplify programming enormously. Let me describe a problem, and show you how associative arrays can be used to reduce coding time, giving you more time to explore another stupid problem you don't want to deal with in the first place.

Let's suppose you have a directory overflowing with files, and you want to find out how many files are owned by each user, and perhaps how much disk space each user owns. You really want someone to blame; it's hard to tell who owns what file. A filter that processes the output of ls would work:

    ls -l | filter

But this doesn't tell you how much space each user is using. It also doesn't work for a large directory tree. This requires find and xargs:

    find . -type f -print | xargs ls -l | filter

The third column of "ls" is the username. The filter has to count how many times it sees each user. The typical program would have an array of usernames and another array that counts how many times each username has been seen. The index to both arrays is the same; you use one array to find the index, and the second to keep track of the count. I'll show you one way to do it in AWK--the wrong way:
#!/bin/awk -f
# bad example of AWK programming
# this counts how many files each user owns.
BEGIN {
    number_of_users=0;
}
{
    # must make sure you only examine lines with 8 or more fields
    if (NF>7) {
        user=0;
        # look for the user in our list of users
        for (i=1; i<=number_of_users; i++) {
            # is the user known?
            if (username[i] == $3) {
                # found it - remember where the user is
                user=i;
            }
        }
        if (user == 0) {
            # found a new user
            username[++number_of_users]=$3;
            user=number_of_users;
        }
        # increase number of counts
        count[user]++;
    }
}
END {
    for (i=1; i<=number_of_users; i++) {
        print count[i], username[i]
    }
}

Click here to get file: awk_example8.awk

I don't want you to read this script. I told you it's the wrong way to do it. If you were a C programmer, and didn't know AWK, you would probably use a technique like the one above. Here is the same program, rewritten to use AWK's associative arrays. The important point is to notice the difference in size between these two versions:
#!/bin/awk -f
{
    username[$3]++;
}
END {
    for (i in username) {
        print username[i], i;
    }
}

Click here to get file: count_users0.awk

This is shorter, simpler, and much easier to understand--once you understand exactly what an associative array is. The concept is simple. Instead of using a number to find an entry in an array, use anything you want. An associative array is an array whose index is a string. All arrays in AWK are associative. In this case, the index into the array is the third field of the "ls" command, which is the username. If the user is "bin," the main loop increments the count per user by effectively executing

    username["bin"]++;

UNIX gurus may gleefully report that the 8 line AWK script can be replaced by:

    awk '{print $3}' | sort | uniq -c | sort -nr

True. However, this can't count the total disk space for each user. We need to add some more intelligence to the AWK script, and need the right foundation to proceed. There is also a slight bug in the AWK program. If you wanted a "quick and dirty" solution, the above would be fine. If you wanted to make it more robust, you have to handle unusual conditions. If you gave this program an empty file for input, you would get the error:

    awk: username is not an array

Also, if you piped the output of "ls -l" to it, the line that specifies the total would increment a non-existent user. There are two techniques used to eliminate this error. The first one only counts valid input:
#!/bin/awk -f
{
    if (NF>7) {
        username[$3]++;
    }
}
END {
    for (i in username) {
        print username[i], i;
    }
}

Click here to get file: count_users1.awk This fixes the problem of counting the line with the total. However, it still generates an error when an empty file is read as input. To fix this problem, a common technique is to make sure the array always exists, and has a special marker value which specifies that the entry is invalid. Then when reporting the results, ignore the invalid entry.
#!/bin/awk -f
BEGIN {
    username[""]=0;
}
{
    username[$3]++;
}
END {
    for (i in username) {
        if (i != "") {
            print username[i], i;
        }
    }
}

Click here to get file: count_users2.awk This happens to fix the other problem. Apply this technique and you will make your AWK programs more robust and easier for others to use.

Multi-dimensional Arrays
Some people ask if AWK can handle multi-dimensional arrays. It can. However, you don't use conventional two-dimensional arrays. Instead you use associative arrays. (Did I mention how useful associative arrays are?) Remember, you can put anything in the index of an associative array. It requires a different way to think about problems, but once you understand, you won't be able to live without it. All you have to do is to create an index that combines two other indices. Suppose you wanted to effectively execute

    a[1,2] = y;

This is invalid in AWK. However, the following is perfectly fine:

    a[1 "," 2] = y;

Remember: the AWK string concatenation operator is the space. It combines the three strings into the single string "1,2". Then it uses this as an index into the array. That's all there is to it.

There is one minor problem with associative arrays, especially if you use the for command to output each element: you have no control over the order of output. You can create an algorithm to generate the indices to an associative array, and control the order this way. However, this is difficult to do. Since UNIX provides an excellent sort utility, more programmers separate the information processing from the sorting. I'll show you what I mean.
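To make the combined-index technique concrete, here is a small sketch; the array name and values are arbitrary, and the split function (covered later) takes the index apart again. The output is piped through sort because a for-in loop visits array elements in no particular order:

```shell
# Simulate a two-dimensional array with a combined string index,
# then take the index apart again with split().
out=$(awk 'BEGIN {
    a[1 "," 2] = "x"
    a[3 "," 4] = "y"
    for (i in a) {
        split(i, part, ",")
        print part[1], part[2], a[i]
    }
}' < /dev/null | sort)
echo "$out"
```

The sorted output is the two lines "1 2 x" and "3 4 y", showing that both halves of each index can be recovered.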

Example of using AWK's Associative Arrays


I often find myself using certain techniques repeatedly in AWK. This example will demonstrate these techniques, and illustrate the power and elegance of AWK. The program is simple and common. The disk is full. Who's gonna be blamed? I just hope you use this power wisely. Remember, you may be the one who filled up the disk.

Having resolved my moral dilemma, by placing the burden squarely on your shoulders, I will describe the program in detail. I will also discuss several tips you will find useful in large AWK programs. First, initialize all arrays used in a for loop. There will be four arrays for this purpose. Initialization is easy:

    u_count[""]=0;
    g_count[""]=0;
    ug_count[""]=0;
    all_count[""]=0;

The second tip is to pick a convention for arrays. Selecting the names of the arrays, and the indices for each array, is very important. In a complex program, it can become confusing to remember which array contains what. I suggest you clearly identify the indices and contents of each array. To demonstrate, I will use a "_count" to indicate the number of files, and "_size" to indicate the sum of the file sizes. In addition, the part before the "_" specifies the index used for the array, which will be either "u" for user, "g" for group, "ug" for the user and group combination, and "all" for the total for all files. In other programs, I have used names like

    username_to_directory[username]=directory;

Follow a convention like this, and it will be hard for you to forget the purpose of each associative array. Even when a quick hack comes back to haunt you three years later. I've been there.

The third suggestion is to make sure your input is in the correct form. It's generally a good idea to be pessimistic, but I will add a simple but sufficient test in this example.
    if (NF != 10) {
        # ignore
    } else {
        # etc.
    }

I placed the test and error clause up front, so the rest of the code won't be cluttered. AWK doesn't have user defined functions. NAWK, GAWK and PERL do.

The next piece of advice for complex AWK scripts is to define a name for each field used. In this case, we want the user, group and size in disk blocks. We could use the file size in bytes, but the block size corresponds to the blocks on the disk, a more accurate measurement of space. Disk blocks can be found by using "ls -s." This adds a column, so the username becomes the fourth column, etc. Therefore the script will contain:

    size=$1;
    user=$4;
    group=$5;

This will allow us to easily adapt to changes in input. We could use "$1" throughout the script, but if we changed the number of fields, which the "-s" option does, we'd have to change each field reference. You don't want to go through an AWK script, and change all the "$1" to "$2," and also change the "$2" to "$3" because those are really the "$1" that you just changed to "$2." Of course this is confusing. That's why it's a good idea to assign names to the fields. I've been there too.

Next the AWK script will count how many times each combination of users and groups occurs. That is, I am going to construct a two-part index that contains the username and groupname. This will let me count up the number of times each user/group combination occurs, and how much disk space is used.

Consider this: how would you calculate the total for just a user, or for just a group? You could rewrite the script. Or you could take the user/group totals, and total them with a second script. You could do it, but it's not the AWK way to do it. If you had to examine a bazillion files, and it takes a long time to run that script, it would be a waste to repeat this task. It's also inefficient to require two scripts when one can do everything. The proper way to solve this problem is to extract as much information as possible in one pass through the files. Therefore this script will find the number and size for each category:

    Each user
    Each group
    Each user/group combination
    All users and groups

This is why I have 4 arrays to count up the number of files. I don't really need 4 arrays, as I can use the format of the index to determine which array is which. But this does make the program easier to understand for now.

The next tip is subtle, but you will see how useful it is. I mentioned the indices into the array can be anything. If possible, select a format that allows you to merge information from several arrays. I realize this makes no sense right now, but hang in there. All will become clear soon. I will do this by constructing a universal index of the form

    <user> <group>

This index will be used for all arrays. There is a space between the two values. This covers the total for the user/group combination. What about the other three arrays? I will use a "*" to indicate the total for all users or groups. Therefore the index for all files would be "* *" while the index for all of the files owned by user daemon would be "daemon *." The heart of the script totals up the number and size of each file, putting the information into the right category.
I will use 8 arrays; 4 for file sizes, and 4 for counts:

    u_count[user " *"]++;
    g_count["* " group]++;
    ug_count[user " " group]++;
    all_count["* *"]++;
    u_size[user " *"]+=size;
    g_size["* " group]+=size;
    ug_size[user " " group]+=size;
    all_size["* *"]+=size;

This particular universal index will make sorting easier, as you will see. Also important is to sort the information in an order that is useful. You can try to force a particular output order in AWK,

but why work at this, when it's a one line command for sort? The difficult part is finding the right way to sort the information. This script will sort information using the size of the category as the first sort field. The largest total will be the one for all files, so this will be one of the first lines output. However, there may be several ties for the largest number, and care must be used. The second field will be the number of files. This will help break a tie. Still, I want the totals and sub-totals to be listed before the individual user/group combinations. The third and fourth fields will be generated by the index of the array. This is the tricky part I warned you about. The script will output one string, but the sort utility will not know this. Instead, it will treat it as two fields. This will unify the results, and information from all 4 arrays will look like one array. The sort of the third and fourth fields will be dictionary order, and not numeric, unlike the first two fields. The "*" was used so these sub-total fields will be listed before the individual user/group combination. The arrays will be printed using the following format:
    for (i in u_count) {
        if (i != "") {
            print u_size[i], u_count[i], i;
        }
    }

I only showed you one array, but all four are printed the same way. That's the essence of the script. The results are sorted, and I converted the space into a tab for cosmetic reasons.

Output of the script


I changed my directory to /usr/ucb and used the script in that directory. The following is the output:
    size   count  user   group
    3173   81     *      *
    3173   81     root   *
    2973   75     *      staff
    2973   75     root   staff
    88     3      *      daemon
    88     3      root   daemon
    64     2      *      kmem
    64     2      root   kmem
    48     1      *      tty
    48     1      root   tty

This says there are 81 files in this directory, which take up 3173 disk blocks. All of the files are owned by root. 2973 disk blocks belong to group staff. There are 3 files with group daemon, which take up 88 disk blocks.

As you can see, the first line of information is the total for all users and groups. The second line is the sub-total for the user "root." The third line is the sub-total for the group "staff." Therefore the order of the sort is useful, with the sub-totals before the individual entries. You could write a simple AWK or grep script to obtain information from just one user or one group, and the information will be easy to sort.

There is only one problem. The /usr/ucb directory on my system only uses 1849 blocks; at least that's what du reports. Where's the discrepancy? The script does not understand hard links. This may not be a problem on most disks, because many users do not use hard links. Still, it does generate inaccurate results. In this case, the program vi is also e, ex, edit, view, and 2 other names. The program only exists once, but has 7 names. You can tell because the link count (field 2) reports 7. This causes the file to be counted 7 times, which causes an inaccurate total. The fix is to only count multiple links once. Examining the link count will determine if a file has multiple links. However, how can you prevent counting a link twice? There is an easy solution: all of these files have the same inode number. You can find this number with the -i option to ls. To save memory, we only have to remember the inodes of files that have multiple links. This means we have to add another column to the input, and have to renumber all of the field references. It's a good thing there are only three. Adding a new field will be easy, because I followed my own advice.

The final script should be easy to follow. I have used variations of this hundreds of times and find it demonstrates the power of AWK as well as provides insight into a powerful programming paradigm. AWK solves these types of problems more easily than most languages. But you have to use AWK the right way.

Note - this version was written for a Solaris box. You have to verify that ls is generating the right number of fields. The -g argument may need to be deleted, and the check for the number of fields may have to be modified. Updated: I added a Linux version below, which can be downloaded.

A fully working version of the program, which accurately counts disk space, appears below:
#!/bin/sh
find . -type f -print | xargs /usr/bin/ls -islg | awk '
BEGIN {
    # initialize all arrays used in for loop
    u_count[""]=0;
    g_count[""]=0;
    ug_count[""]=0;
    all_count[""]=0;
}
{
    # validate your input
    if (NF != 11) {
        # ignore
    } else {
        # assign field names
        inode=$1;
        size=$2;
        linkcount=$4;
        user=$5;
        group=$6;
        # should I count this file?
        doit=0;
        if (linkcount == 1) {
            # only one copy - count it
            doit++;
        } else {
            # a hard link - only count first one
            seen[inode]++;
            if (seen[inode] == 1) {
                doit++;
            }
        }
        # if doit is true, then count the file
        if (doit) {
            # total up counts in one pass
            # use descriptive array names
            # use array index that unifies the arrays
            # first the counts for the number of files
            u_count[user " *"]++;
            g_count["* " group]++;
            ug_count[user " " group]++;
            all_count["* *"]++;
            # then the total disk space used
            u_size[user " *"]+=size;
            g_size["* " group]+=size;
            ug_size[user " " group]+=size;
            all_size["* *"]+=size;
        }
    }
}
END {
    # output in a form that can be sorted
    for (i in u_count) {
        if (i != "") {
            print u_size[i], u_count[i], i;
        }
    }
    for (i in g_count) {
        if (i != "") {
            print g_size[i], g_count[i], i;
        }
    }
    for (i in ug_count) {
        if (i != "") {
            print ug_size[i], ug_count[i], i;
        }
    }
    for (i in all_count) {
        if (i != "") {
            print all_size[i], all_count[i], i;
        }
    }
}
' |
# numeric sort - biggest numbers first
# sort fields 0 and 1 first (sort starts with 0)
# followed by dictionary sort on fields 2 + 3
sort +0nr -2 +2d |
# add header
(echo "size count user group"; cat -) |
# convert space to tab - makes nice output
# the second set of quotes contains a single tab character
tr ' ' '	'
# done - I hope you like it

Click here to get file: count_users3.awk

Remember when I said I didn't need to use 4 different arrays? I can use just one array for the counts and one for the sizes. This is more confusing, but more concise:
#!/bin/sh
find . -type f -print | xargs /usr/bin/ls -islg | awk '
BEGIN {
    # initialize all arrays used in for loop
    count[""]=0;
}
{
    # validate your input
    if (NF != 11) {
        # ignore
    } else {
        # assign field names
        # note: "fsize" is used for the scalar, because an AWK
        # variable cannot be both a scalar and an array name
        inode=$1;
        fsize=$2;
        linkcount=$4;
        user=$5;
        group=$6;
        # should I count this file?
        doit=0;
        if (linkcount == 1) {
            # only one copy - count it
            doit++;
        } else {
            # a hard link - only count first one
            seen[inode]++;
            if (seen[inode] == 1) {
                doit++;
            }
        }
        # if doit is true, then count the file
        if (doit) {
            # total up counts in one pass
            # use array index that unifies the arrays
            # first the counts for the number of files
            count[user " *"]++;
            count["* " group]++;
            count[user " " group]++;
            count["* *"]++;
            # then the total disk space used
            size[user " *"]+=fsize;
            size["* " group]+=fsize;
            size[user " " group]+=fsize;
            size["* *"]+=fsize;
        }
    }
}
END {
    # output in a form that can be sorted
    for (i in count) {
        if (i != "") {
            print size[i], count[i], i;
        }
    }
}
' |
# numeric sort - biggest numbers first
# sort fields 0 and 1 first (sort starts with 0)
# followed by dictionary sort on fields 2 + 3
sort +0nr -2 +2d |
# add header
(echo "size count user group"; cat -) |
# convert space to tab - makes nice output
# the second set of quotes contains a single tab character
tr ' ' '	'
# done - I hope you like it

Click here to get file: count_users.awk

Here is a version that works with modern Linux systems, but assumes you have well-behaved filenames (without spaces, etc.): count_users_new.awk

Picture Perfect PRINTF Output


So far, I have described several simple scripts that provide useful information, in a somewhat ugly output format. Columns might not line up properly, and it is often hard to find patterns or trends in ragged output. As you use AWK more, you will be desirous of crisp, clean formatting. To achieve this, you must master the printf function.

PRINTF - formatting output


The printf function is very similar to the C function with the same name. C programmers should have no problem using the printf function. Printf has one of these syntactical forms:

    printf(format);
    printf(format, arguments...);
    printf(format) > expression;
    printf(format, arguments...) > expression;

The parentheses and semicolon are optional. I only use the first form to be consistent with other nearby printf statements. A print statement would do the same thing. Printf reveals its real power when formatting commands are used.

The first argument to the printf function is the format. This is a string, or a variable whose value is a string. This string, like all strings, can contain special escape sequences to print control characters.

Escape Sequences
The character "\" is used to "escape" or mark special characters. The list of these characters is in table below:
+-------------------------------------------------------+
|                     AWK Table 5                       |
|                   Escape Sequences                    |
|Sequence  Description                                  |
+-------------------------------------------------------+
|\a        ASCII bell (NAWK only)                       |
|\b        Backspace                                    |
|\f        Formfeed                                     |
|\n        Newline                                      |
|\r        Carriage Return                              |
|\t        Horizontal tab                               |
|\v        Vertical tab (NAWK only)                     |
|\ddd      Character (1 to 3 octal digits) (NAWK only)  |
|\xdd      Character (hexadecimal) (NAWK only)          |
|\c        Any character c                              |
+-------------------------------------------------------+

It's difficult to explain the differences without being wordy. Hopefully I'll provide enough examples to demonstrate the differences. With NAWK, you can print three tab characters using these three different representations:

    printf("\t\11\x9\n");

A tab character is decimal 9, octal 11, or hexadecimal 09. See the man page ascii(7) for more information. Similarly, you can print three double-quote characters (decimal 34, hexadecimal 22, or octal 42) using

    printf("\"\x22\42\n");

You should notice a difference between the printf function and the print function. Print terminates the line with the ORS character, and divides each field with the OFS separator. Printf does nothing unless you specify the action. Therefore you will frequently end each line with the newline character "\n", and you must specify the separating characters explicitly.
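This difference between print and printf is easy to demonstrate. Nothing appears on a new line until a "\n" is printed, while print supplies the record separator by itself:

```shell
# printf emits only what the format string specifies;
# print appends the output record separator (a newline) on its own.
out=$(awk 'BEGIN { printf("a"); printf("b\n"); print "c" }' < /dev/null)
echo "$out"
```

The two printf calls run together on one line ("ab"), and print puts "c" on a line of its own.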

Format Specifiers
The power of the printf statement lies in the format specifiers, which always start with the character "%." The format specifiers are described in table 6:
+----------------------------------------+
|              AWK Table 6               |
|           Format Specifiers            |
|Specifier  Meaning                      |
+----------------------------------------+
|c          ASCII Character              |
|d          Decimal integer              |
|e          Floating Point number        |
|           (scientific notation)        |
|f          Floating Point number        |
|           (fixed point format)         |
|g          The shorter of e or f,       |
|           with trailing zeros removed  |
|o          Octal                        |
|s          String                       |
|x          Hexadecimal                  |
|%          Literal %                    |
+----------------------------------------+

Again, I'll cover the differences quickly. Table 7 illustrates the differences. The first line states that printf("%c\n", 100.0) prints a "d."
+--------------------------------+
|          AWK Table 7           |
| Example of format conversions  |
|Format  Value      Results      |
+--------------------------------+
|%c      100.0      d            |
|%c      "100.0"    1 (NAWK?)    |
|%c      42         "            |
|%d      100.0      100          |
|%e      100.0      1.000000e+02 |
|%f      100.0      100.000000   |
|%g      100.0      100          |
|%o      100.0      144          |
|%s      100.0      100.0        |
|%s      "13f"      13f          |
|%d      "13f"      0 (AWK)      |
|%d      "13f"      13 (NAWK)    |
|%x      100.0      64           |
+--------------------------------+

This table reveals some differences between AWK and NAWK. When a string with numbers and letters is converted into an integer, AWK will return a zero, while NAWK will convert as much as possible. The second example, marked with "NAWK?", will return "d" on some earlier versions of NAWK, while later versions will return "1."

Using format specifiers, there is another way to print a double quote with NAWK. This demonstrates octal, decimal and hexadecimal conversion. As you can see, it isn't symmetrical. Decimal conversions are done differently.

    printf("%s%s%s%c\n", "\"", "\x22", "\42", 34);

Between the "%" and the format character can be four optional pieces of information. It helps to visualize these fields as:

    %<sign><zero><width>.<precision>format

I'll discuss each one separately.

Width - specifying minimum field size


If there is a number after the "%," this specifies the minimum number of characters to print. This is the width field. Spaces are added so the number of printed characters equals this number. Note that this is the minimum field size. If the field becomes too large, it will grow, so information will not be lost. Spaces are added to the left. This format allows you to line up columns perfectly. Consider the following format:

    printf("%s\t%d\n", s, d);

If the string "s" is longer than 8 characters, the columns won't line up. Instead, use

    printf("%20s%d\n", s, d);

As long as the string is less than 20 characters, the number will start on the 21st column. If the string is too long, then the two fields will run together, making it hard to read. You may want to consider placing a single space between the fields, to make sure you will always have one space between them. This is very important if you want to pipe the output to another program.
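The behavior of the width field can be verified with a short sketch; the sample strings are arbitrary. A short string is padded to the minimum width, while a long one simply grows past it:

```shell
# A width of 10 pads short strings with spaces on the left;
# longer strings grow past the minimum rather than being truncated.
out=$(awk 'BEGIN {
    printf("%10s|\n", "abc")
    printf("%10s|\n", "abcdefghijkl")
}' < /dev/null)
echo "$out"
```

The "|" marks the end of each field: "abc" gains seven leading spaces, and the 12-character string is printed in full.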

Adding informational headers makes the output more readable. Be aware that changing the format of the data may make it difficult to get the columns aligned perfectly. Consider the following script:
#!/usr/bin/awk -f
BEGIN {
    printf("String Number\n");
}
{
    printf("%10s %6d\n", $1, $2);
}

Click here to get file: awk_example9.awk It would be awkward (forgive the choice of words) to add a new column and retain the same alignment. More complicated formats would require a lot of trial and error. You have to adjust the first printf to agree with the second printf statement. I suggest
#!/usr/bin/awk -f
BEGIN {
    printf("%10s %6s\n", "String", "Number");
}
{
    printf("%10s %6d\n", $1, $2);
}

Click here to get file: awk_example10.awk or even better


#!/usr/bin/awk -f
BEGIN {
    format1 = "%10s %6s\n";
    format2 = "%10s %6d\n";
    printf(format1, "String", "Number");
}
{
    printf(format2, $1, $2);
}

Click here to get file: awk_example11.awk The last example, by using string variables for formatting, allows you to keep all of the formats together. This may not seem like it's very useful, but when you have multiple formats and multiple columns, it's very useful to have a set of templates like the above. If you have to add an extra space to make things line up, it's much easier to find and correct the problem with a set of

format strings that are together, and the exact same width. Changing the first column from 10 characters to 11 is easy.

Left Justification
The last example places spaces before each field to make sure the minimum field width is met. What do you do if you want the spaces on the right? Add a negative sign before the width:

    printf("%-10s %-6d\n", $1, $2);

This will move the printing characters to the left, with spaces added to the right.
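A short sketch makes the difference visible; the sample words are arbitrary, and the "|" marks the end of each field so the padding can be seen:

```shell
# Compare right-justified (%8s) and left-justified (%-8s) columns.
# A "|" after each field makes the padding visible.
out=$(printf 'cat\nkitten\n' |
      awk '{ printf("%8s|%-8s|\n", $1, $1) }')
echo "$out"
```

With %8s the spaces come before the word; with %-8s they come after, so left-justified columns line up on their first character.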

The Field Precision Value


The precision field, which is the number between the decimal point and the format character, is more complex. Most people use it with the floating point format (%f), but surprisingly, it can be used with any format character. With the octal, decimal or hexadecimal format, it specifies the minimum number of digits. Zeros are added to meet this requirement. With the %e and %f formats, it specifies the number of digits after the decimal point. The %e "e+00" is not included in the precision. The %g format combines the characteristics of the %e and %f formats. The precision specifies the number of digits displayed, before and after the decimal point. The precision field has no effect on the %c field. The %s format has an unusual, but useful effect: it specifies the maximum number of significant characters to print.

If the first number after the "%," or after the "%-," is a zero, then the system adds zeros when padding. This includes all format types, including strings and the %c character format. This means "%010d" and "%.10d" both add leading zeros, giving a minimum of 10 digits. The format "%10.10d" is therefore redundant. Table 8 gives some examples:
+--------------------------------------------+
|                AWK Table 8                 |
|       Examples of complex formatting       |
|Format   Variable         Results           |
+--------------------------------------------+
|%c       100              "d"               |
|%10c     100              "         d"      |
|%010c    100              "000000000d"      |
+--------------------------------------------+
|%d       10               "10"              |
|%10d     10               "        10"      |
|%10.4d   10.123456789     "      0010"      |
|%10.8d   10.123456789     "  00000010"      |
|%.8d     10.123456789     "00000010"        |
|%010d    10.123456789     "0000000010"      |
+--------------------------------------------+
|%e       987.1234567890   "9.871235e+02"    |
|%10.4e   987.1234567890   "9.8712e+02"      |
|%10.8e   987.1234567890   "9.87123457e+02"  |
+--------------------------------------------+
|%f       987.1234567890   "987.123457"      |
|%10.4f   987.1234567890   "  987.1235"      |
|%010.4f  987.1234567890   "00987.1235"      |
|%10.8f   987.1234567890   "987.12345679"    |
+--------------------------------------------+
|%g       987.1234567890   "987.123"         |
|%10g     987.1234567890   "   987.123"      |
|%10.4g   987.1234567890   "     987.1"      |
|%010.4g  987.1234567890   "00000987.1"      |
|%.8g     987.1234567890   "987.12346"       |
+--------------------------------------------+
|%o       987.1234567890   "1733"            |
|%10o     987.1234567890   "      1733"      |
|%010o    987.1234567890   "0000001733"      |
|%.8o     987.1234567890   "00001733"        |
+--------------------------------------------+
|%s       987.123          "987.123"         |
|%10s     987.123          "   987.123"      |
|%10.4s   987.123          "      987."      |
|%010.8s  987.123          "000987.123"      |
+--------------------------------------------+
|%x       987.1234567890   "3db"             |
|%10x     987.1234567890   "       3db"      |
|%010x    987.1234567890   "00000003db"      |
|%.8x     987.1234567890   "000003db"        |
+--------------------------------------------+
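The precision rules can be checked from the command line. A quick sketch, assuming a POSIX awk:

```shell
# %.3s  truncates a string to at most 3 characters
# %.5d  zero-pads an integer to a minimum of 5 digits
# %8.2f prints 2 digits after the decimal, right-justified in 8 columns
awk 'BEGIN { printf("%.3s %.5d %8.2f\n", "truncate", 42, 3.14159) }'
```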

There is one more topic needed to complete this lesson on printf.

Explicit File output


Instead of sending output to standard output, you can send output to a named file. The format is
    printf("string\n") > "/tmp/file";
You can append to an existing file by using ">>":
    printf("string\n") >> "/tmp/file";
Like the shell, the double angle brackets indicate output is appended to the file, instead of written to an empty file. Appending to the file does not delete the old contents. However, there is a subtle difference between AWK and the shell. Consider the shell program:
#!/bin/sh
while x=`line`
do
    echo got $x >>/tmp/a
    echo got $x >/tmp/b
done

This will read standard input, and copy the standard input to files "/tmp/a" and "/tmp/b". File "/tmp/a" will grow larger, as information is always appended to the file. File "/tmp/b", however, will only contain one line. This happens because each time the shell sees the ">" or ">>" characters, it opens the file for writing, choosing the truncate/create or append option at that time. Now consider the equivalent AWK program:
#!/usr/bin/awk -f
{
    print $0 >>"/tmp/a"
    print $0 >"/tmp/b"
}

This behaves differently. AWK chooses the create/append option the first time a file is opened for writing. Afterwards, the use of ">" or ">>" is ignored. Unlike the shell, AWK copies all of standard input to file "/tmp/b". Instead of a string, some versions of AWK allow you to specify an expression:
    # [note to self] check this one - it might not work
    printf("string\n") > FILENAME ".out";
The following uses a string concatenation expression to illustrate this:
#!/usr/bin/awk -f
END {
    for (i=0;i<30;i++) {
        printf("i=%d\n", i) > "/tmp/a" i;
    }
}

Click here to get file: awk_example12.awk This script never finishes, because AWK can have 10 additional files open, and NAWK can have 20. If you find this to be a problem, look into PERL. I hope this gives you the skill to make your AWK output picture perfect.
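The create-then-append behavior, combined with string concatenation in the file name, is what makes AWK convenient for splitting input into per-key files. A minimal sketch (the /tmp paths and the input are just illustrations):

```shell
# Split input into one file per value of the first field.
# With ">", AWK truncates each file once per run, then appends
# on later writes to the same name.
rm -f /tmp/split_a /tmp/split_b
printf 'a one\nb two\na three\n' |
awk '{ print $2 > ("/tmp/split_" $1) }'
# the parentheses around the concatenation avoid an ambiguous parse
cat /tmp/split_a    # holds both lines for key "a"
```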

AWK Numerical Functions


In previous tutorials, I have shown how useful AWK is in manipulating information, and generating reports. When you add a few functions, AWK becomes even more, mmm, functional.

There are three types of functions: numeric, string and whatever's left. Table 9 lists all of the numeric functions:
+----------------------------------+
|           AWK Table 9            |
|        Numeric Functions         |
|Name     Function        Variant  |
+----------------------------------+
|cos      cosine          AWK      |
|exp      Exponent        AWK      |
|int      Integer         AWK      |
|log      Logarithm       AWK      |
|sin      Sine            AWK      |
|sqrt     Square Root     AWK      |
|atan2    Arctangent      NAWK     |
|rand     Random          NAWK     |
|srand    Seed Random     NAWK     |
+----------------------------------+


Trigonometric Functions
Oh joy. I bet millions, if not dozens, of my readers have been waiting for me to discuss trigonometry. Personally, I don't use trigonometry much at work, except when I go off on a tangent. Sorry about that. I don't know what came over me. I don't usually resort to puns. I'll write a note to myself, and after I sine the note, I'll have my boss cosine it. Now stop that! I hate arguing with myself. I always lose.

Thinking about math I learned in the year 2 B.C. (Before Computers) seems to cause flashbacks of high school, pimples, and (shudder) times best left forgotten. The stress of remembering those days must have made me forget the standards I normally set for myself. Besides, no-one appreciates obtuse humor anyway, even if I find acute way to say it. I better change the subject fast. Combining humor and computers is a very serious matter.

Here is a NAWK script that calculates the trigonometric functions for all degrees between 0 and 360. It also shows why there is no tangent, secant or cosecant function. (They aren't necessary.) If you read the script, you will learn of some subtle differences between AWK and NAWK. All this in a thin veneer of demonstrating why we learned trigonometry in the first place. What more can you ask for? Oh, in case you are wondering, I wrote this in the month of December.
#!/usr/bin/nawk -f
#
# A smattering of trigonometry...
#
# This AWK script plots the values from 0 to 360
# for the basic trigonometry functions
# but first - a review:
#
# (Note to the editor - the following diagram assumes a fixed
# width font, like Courier. Otherwise, the diagram looks very
# stupid, instead of slightly stupid)
#
# Assume the following right triangle
#
#       Angle Y
#
#       |
#       |
#       |
#     a |
#       | c
#       |
#       +------ b
#               Angle X
#
# since the triangle is a right angle, then X+Y=90
#
# Basic Trigonometric Functions. If you know the length of 2 sides,
# and the angles, you can find the length of the third side.
# Also - if you know the length of the sides, you can calculate
# the angles.
#
# The formulas are
#
#       sine(X) = a/c
#       cosine(X) = b/c
#       tangent(X) = a/b
#
# reciprocal functions
#
#       cotangent(X) = b/a
#       secant(X) = c/b
#       cosecant(X) = c/a
#
# Example 1) if an angle is 30, and the hypotenuse (c) is 10, then
#       a = sine(30) * 10 = 5
#       b = cosine(30) * 10 = 8.66
#
# The second example will be more realistic:
#
# Suppose you are looking for a Christmas tree, and while talking
# to your family, you smack into a tree because your head was turned,
# and your kids were arguing over who was going to put the first
# ornament on the tree. As you come to, you realize your feet are
# touching the trunk of the tree, and your eyes are 6 feet from the
# bottom of your frostbitten toes. While counting the stars that
# spin around your head, you also realize the top of the tree is
# located at a 65 degree angle, relative to your eyes.
#
# You suddenly realize the tree is 12.84 feet high! After all,
#       tangent(65 degrees) * 6 feet = 12.84 feet
#
# All right, it isn't realistic. Not many people memorize the
# tangent table, or can estimate angles that accurately.
# I was telling the truth about the stars spinning around the head,
# however.
#
BEGIN {
# assign a value for pi.
    PI=3.14159;
# select an "Ed Sullivan" number - really really big
    BIG=999999;
# pick two formats
# Keep them close together, so when one column is made larger
# the other column can be adjusted to be the same width
    fmt1="%7s %8s %8s %8s %10s %10s %10s %10s\n";
    fmt2="%7d %8.2f %8.2f %8.2f %10.2f %10.2f %10.2f %10.2f\n";
# print out the title of each column
# old AWK wants a backslash at the end of the next line
# to continue the print statement
# new AWK allows you to break the line into two, after a comma
    printf(fmt1,"Degrees","Radians","Cosine","Sine",
        "Tangent","Cotangent","Secant", "Cosecant");
    for (i=0;i<=360;i++) {
# convert degrees to radians
        r = i * (PI / 180 );
# in new AWK, the backslashes are optional
# in OLD AWK, they are required
        printf(fmt2, i, r,
# cosine of r
        cos(r),
# sine of r
        sin(r),
#
# I ran into a problem when dividing by zero.
# So I had to test for this case.
#
# old AWK finds the next line too complicated
# I don't mind adding a backslash, but rewriting the
# next three lines seems pointless for a simple lesson.
# This script will only work with new AWK, now - sigh...
# On the plus side,
# I don't need to add those back slashes anymore
#
# tangent of r
        (cos(r) == 0) ? BIG : sin(r)/cos(r),
# cotangent of r
        (sin(r) == 0) ? BIG : cos(r)/sin(r),
# secant of r
        (cos(r) == 0) ? BIG : 1/cos(r),
# cosecant of r
        (sin(r) == 0) ? BIG : 1/sin(r));
    }
# put an exit here, so that standard input isn't needed.
    exit;
}

Click here to get file: trigonometry.awk NAWK also has the arctangent function. This is useful for some graphics work, as arc tangent(a/b) = angle (in radians) Therefore if you have the X and Y locations, the arctangent of the ratio will tell you the angle. The atan2() function returns a value from negative pi to positive pi.

Exponents, logs and square roots


The following script uses three other arithmetic functions: log, exp, and sqrt. I wanted to show how these can be used together, so I divided the log of a number by two, which is another way to find a square root. I then compared the value of the exponent of that new log to the built-in square root function. I then calculated the difference between the two, and converted the difference into a positive number.
#!/bin/awk -f
# demonstrate use of exp(), log() and sqrt in AWK
# e.g. what is the difference between using logarithms
# and regular arithmetic
# note - exp and log are natural log functions - not base 10
#
BEGIN {
# what is the amount of error that will be reported?
    ERROR=0.000000000001;
# loop a long while
    for (i=1;i<=2147483647;i++) {
# find log of i
        logi=log(i);
# what is square root of i?
# divide the log by 2
        logsquareroot=logi/2;
# convert log of i back
        squareroot=exp(logsquareroot);
# find the difference between the logarithmic calculation
# and the built in calculation
        diff=sqrt(i)-squareroot;
# make difference positive
        if (diff < 0) {
            diff*=-1;
        }
        if (diff > ERROR) {
            printf("%10d, squareroot: %16.8f, error: %16.14f\n",
                i, squareroot, diff);
        }
    }
    exit;
}

Click here to get file: awk_example13.awk Yawn. This example isn't too exciting, except to those who enjoy nitpicking. Expect the program to reach 3 million before you see any errors. I'll give you a more exciting sample soon.
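The identity the script relies on, that exp(log(x)/2) equals sqrt(x), can be spot-checked directly from the command line:

```shell
# Square root via logarithms: exp(log(x)/2) should match sqrt(x),
# apart from tiny floating point error.
awk 'BEGIN { x=81; printf("%.4f %.4f\n", exp(log(x)/2), sqrt(x)) }'
```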

Truncating Integers
All versions of AWK contain the int function. This truncates a number, making it an integer. It can be used to round numbers by adding 0.5:
    printf("rounding %8.4f gives %8d\n", x, int(x+0.5));
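A quick sketch of the difference between truncating and rounding. Note that int() truncates toward zero, so the add-0.5 trick is only correct for positive numbers:

```shell
# int() truncates; adding 0.5 first rounds (for positive values).
awk 'BEGIN {
    x = 17.7;
    printf("truncated: %d  rounded: %d\n", int(x), int(x+0.5));
    # for negatives, int() truncates toward zero: int(-3.9) is -3
    printf("int(-3.9): %d\n", int(-3.9));
}'
```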

Random Numbers
NAWK has functions that can generate random numbers. The function rand returns a random number between 0 and 1. Here is an example that calculates a million random numbers between 0 and 100, and counts how often each number was used:
NAWK has functions that can generate random numbers. The function rand returns a random number between 0 and 1. Here is an example that calculates a million random numbers between 0 and 100, and counts how often each number was used:
#!/usr/bin/nawk -f
# old AWK doesn't have rand() and srand()
# only new AWK has them
# how random is the random function?
BEGIN {
    # srand();
    i=0;
    while (i++<1000000) {
        x=int(rand()*100 + 0.5);
        y[x]++;
    }
    for (i=0;i<=100;i++) {
        printf("%d\t%d\n",y[i],i);
    }
    exit;
}

Click here to get file: random.awk
If you execute this script several times, you will get the exact same results. Experienced programmers know random number generators aren't really random, unless they use special hardware. These numbers are pseudo-random, and calculated using some algorithm. Since the algorithm is fixed, the numbers are repeatable unless the numbers are seeded with a unique value. This is done using the srand function above, which is commented out. Typically the random number generator is not given a special seed until the bugs have been worked out of the program. There's nothing more frustrating than a bug that occurs randomly. The srand function may be given an argument. If not, it uses the current time and day to generate a seed for the random number generator.
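The repeatability is easy to demonstrate: two runs seeded with the same value produce identical sequences. A sketch using nawk-style rand()/srand():

```shell
# Two runs with the same seed produce the same "random" numbers.
a=$(awk 'BEGIN { srand(42); for (i=0;i<3;i++) printf("%.4f ", rand()) }')
b=$(awk 'BEGIN { srand(42); for (i=0;i<3;i++) printf("%.4f ", rand()) }')
[ "$a" = "$b" ] && echo "same sequence"
```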

The Lotto script


I promised a more useful script. This may be what you are waiting for. It reads two numbers, and generates a list of random numbers. I call the script "lotto.awk."
#!/usr/bin/nawk -f
BEGIN {
# Assume we want 6 random numbers between 1 and 36
# We could get this information by reading standard input,
# but this example will use a fixed set of parameters.
#
# First, initialize the seed
    srand();
# How many numbers are needed?
    NUM=6;
# what is the minimum number
    MIN=1;
# and the maximum?
    MAX=36;
# How many numbers will we find? start with 0
    Number=0;
    while (Number < NUM) {
        r=int(((rand() *(1+MAX-MIN))+MIN));
# have I seen this number before?
        if (array[r] == 0) {
# no, I have not
            Number++;
            array[r]++;
        }
    }
# now output all numbers, in order
    for (i=MIN;i<=MAX;i++) {
# is it marked in the array?
        if (array[i]) {
# yes
            printf("%d ",i);
        }
    }
    printf("\n");
    exit;
}

Click here to get file: lotto.awk If you do win a lottery, send me a postcard.

String Functions
Besides numeric functions, there are two other types of function: strings and the whatchamacallits. First, a list of the string functions:
+-------------------------------------------------+
|                  AWK Table 10                   |
|                String Functions                 |
|Name                            Variant          |
+-------------------------------------------------+
|index(string,search)            AWK, NAWK, GAWK  |
|length(string)                  AWK, NAWK, GAWK  |
|split(string,array,separator)   AWK, NAWK, GAWK  |
|substr(string,position)         AWK, NAWK, GAWK  |
|substr(string,position,max)     AWK, NAWK, GAWK  |
|sub(regex,replacement)          NAWK, GAWK       |
|sub(regex,replacement,string)   NAWK, GAWK       |
|gsub(regex,replacement)         NAWK, GAWK       |
|gsub(regex,replacement,string)  NAWK, GAWK       |
|match(string,regex)             NAWK, GAWK       |
|tolower(string)                 GAWK             |
|toupper(string)                 GAWK             |
+-------------------------------------------------+

Most people first use AWK to perform simple calculations. Associative arrays and trigonometric functions are somewhat esoteric features that new users embrace with the eagerness of a chain smoker in a fireworks factory. I suspect most users add some simple string functions to their repertoire once they want to add a little more sophistication to their AWK scripts. I hope this column gives you enough information to inspire your next effort. There are four string functions in the original AWK: index(), length(), split(), and substr(). These functions are quite versatile.

The Length function


What can I say? The length() function calculates the length of a string. I often use it to make sure my input is correct. If you wanted to ignore empty lines, check the length of each line before processing it with
    if (length($0) > 1) { ... }
You can easily use it to print all lines longer than a certain length, etc. The following command centers all lines shorter than 80 characters:
#!/bin/awk -f
{
    if (length($0) < 80) {
        prefix = "";
        for (i = 1;i<(80-length($0))/2;i++)
            prefix = prefix " ";
        print prefix $0;
    } else {
        print;
    }
}

Click here to get file: center.awk

The Index Function


If you want to search for a special character, the index() function will search for specific characters inside a string. To find a comma, the code might look like this:
    sentence="This is a short, meaningless sentence.";
    if (index(sentence, ",") > 0) {
        printf("Found a comma in position %d\n", index(sentence,","));
    }
The function returns a positive value when the substring is found. The number specifies the location of the substring. If the substring consists of 2 or more characters, all of these characters must be found, in the same order, for a non-zero return value. Like the length() function, this is useful for checking for proper input conditions.
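A short sketch of the return values. Remember that string positions in AWK start at 1, and 0 means "not found":

```shell
# index() returns the 1-based position of the substring, or 0.
awk 'BEGIN {
    s = "hello,world";
    print index(s, ",");      # the comma is the 6th character
    print index(s, "xyz");    # not found
}'
```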

The Substr function


The substr() function can extract a portion of a string. One common use is to split a string into two parts based on a special character. If you wanted to process some mail addresses, the following code fragment might do the job:
#!/bin/awk -f
{
# field 1 is the e-mail address - perhaps
    if ((x=index($1,"@")) > 0) {
        username = substr($1,1,x-1);
        hostname = substr($1,x+1,length($1));
# the above is the same as
#       hostname = substr($1,x+1);
        printf("username = %s, hostname = %s\n", username, hostname);
    }
}

Click here to get file: email.awk The substr() function takes two or three arguments. The first is the string, the second is the position. The optional third argument is the length of the string to extract. If the third argument is missing, the rest of the string is used. The substr function can be used in many non-obvious ways. As an example, it can be used to convert upper case letters to lower case.
#!/usr/bin/awk -f
# convert upper case letters to lower case
BEGIN {
    LC="abcdefghijklmnopqrstuvwxyz";
    UC="ABCDEFGHIJKLMNOPQRSTUVWXYZ";
}
{
    out="";
# look at each character
    for(i=1;i<=length($0);i++) {
# get the character to be checked
        char=substr($0,i,1);
# is it an upper case letter?
        j=index(UC,char);
        if (j > 0 ) {
# found it
            out = out substr(LC,j,1);
        } else {
            out = out char;
        }
    }
    printf("%s\n", out);
}

Click here to get file: upper_to_lower.awk

GAWK's Tolower and Toupper function


GAWK has the toupper() and tolower() functions, for convenient conversions of case. These functions take strings, so you can reduce the above script to a single line:
#!/usr/local/bin/gawk -f
{ print tolower($0); }

Click here to get file: upper_to_lower.gawk

The Split function


Another way to split up a string is to use the split() function. It takes three arguments: the string, an array, and the separator. The function returns the number of pieces found. Here is an example:
#!/usr/bin/awk -f
BEGIN {
# this script breaks up the sentence into words, using
# a space as the character separating the words
    string="This is a string, is it not?";
    search=" ";
    n=split(string,array,search);
    for (i=1;i<=n;i++) {
        printf("Word[%d]=%s\n",i,array[i]);
    }
    exit;
}

Click here to get file: awk_example14.sh The third argument is typically a single character. If a longer string is used, only the first letter is used as a separator.
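A compact illustration of split()'s return value and the resulting array:

```shell
# split() fills the array starting at index 1 and
# returns the number of pieces found.
awk 'BEGIN {
    n = split("usr:local:bin", parts, ":");
    print n;           # number of pieces
    print parts[2];    # the second piece
}'
```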

NAWK's string functions


NAWK (and GAWK) have additional string functions, which add a primitive SED-like functionality: sub(), match(), and gsub(). Sub() performs a string substitution, like sed. To replace "old" with "new" in a string, use
    sub(/old/, "new", string)
If the third argument is missing, $0 is assumed to be the string searched. The function returns 1 if a substitution occurs, and 0 if not. If no slashes are given in the first argument, the first argument is assumed to be a variable containing a regular expression. The sub() function only changes the first occurrence. The gsub() function is similar to the g option in sed: all occurrences are converted, and not just the first. That is, if the pattern occurs more than once per line (or string), the substitution will be performed once for each found pattern. The following script:
#!/usr/bin/nawk -f
BEGIN {
    string = "Another sample of an example sentence";
    pattern="[Aa]n";
    if (gsub(pattern,"AN",string)) {
        printf("Substitution occurred: %s\n", string);
    }
    exit;
}

Click here to get file: awk_example15.awk
prints the following when executed:
    Substitution occurred: ANother sample of AN example sentence
As you can see, the pattern can be a regular expression.
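The difference between sub() and gsub() fits in one line (works in nawk and gawk):

```shell
# sub() replaces only the first match; gsub() replaces them all.
awk 'BEGIN {
    s = t = "aaa";
    sub(/a/, "b", s);
    gsub(/a/, "b", t);
    print s, t;
}'
```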

The Match function


As the above demonstrates, the sub() and gsub() functions return a positive value if a match is found. However, they have a side effect of changing the string tested. If you don't wish this, you can copy the string to another variable, and test the spare variable. NAWK also provides the match() function. If match() finds the regular expression, it sets two special variables that indicate where the regular expression begins and ends. Here is an example that does this:
#!/usr/bin/nawk -f
# demonstrate the match function
BEGIN {
    regex="[a-zA-Z0-9]+";
}
{
    if (match($0,regex)) {
# RSTART is where the pattern starts
# RLENGTH is the length of the pattern
        before = substr($0,1,RSTART-1);
        pattern = substr($0,RSTART,RLENGTH);
        after = substr($0,RSTART+RLENGTH);
        printf("%s<%s>%s\n", before, pattern, after);
    }
}

Click here to get file: awk_example16.awk Lastly, there are the whatchamacallit functions. I could use the word "miscellaneous," but it's too hard to spell. Darn it, I had to look it up anyway.
+-----------------------------------------------+
|                 AWK Table 11                  |
|            Miscellaneous Functions            |
|Name                           Variant         |
+-----------------------------------------------+
|getline                        AWK, NAWK, GAWK |
|getline <file                  NAWK, GAWK      |
|getline variable               NAWK, GAWK      |
|getline variable <file         NAWK, GAWK      |
|"command" | getline            NAWK, GAWK      |
|"command" | getline variable   NAWK, GAWK      |
|system(command)                NAWK, GAWK      |
|close(command)                 NAWK, GAWK      |
|systime()                      GAWK            |
|strftime(string)               GAWK            |
|strftime(string, timestamp)    GAWK            |
+-----------------------------------------------+

The System function


NAWK has a function system() that can execute any program. It returns the exit status of the program.
    if (system("/bin/rm junk") != 0) print "command didn't work";
The command can be a string, so you can dynamically create commands based on input. Note that the output isn't sent to the NAWK program. You could send it to a file, and open that file for reading. There is another solution, however.
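The return value can be checked with the standard true/false utilities, a quick nawk/gawk sketch:

```shell
# system() returns the exit status of the command it runs.
awk 'BEGIN {
    a = system("true");     # exits with status 0
    b = system("false");    # exits with status 1
    print a, b;
}'
```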

The Getline function


AWK has a command that allows you to force a new line to be read. It doesn't take any arguments. It returns 1 if successful, 0 if end-of-file is reached, and -1 if an error occurs. As a side effect, the line containing the input changes. This next script filters the input, and if a backslash occurs at the end of the line, it reads the next line in, eliminating the backslash as well as the need for it.
#!/usr/bin/awk -f
# look for a "\" as the last character.
# if found, read the next line and append
{
    line = $0;
    while (substr(line,length(line),1) == "\\") {
        # chop off the last character
        line = substr(line,1,length(line)-1);
        i=getline;
        if (i > 0) {
            line = line $0;
        } else {
            printf("missing continuation on line %d\n", NR);
        }
    }
    print line;
}

Click here to get file: awk_example17.awk
Instead of reading into the standard variables, you can specify the variable to set:
    getline a_line
    print a_line;
NAWK and GAWK allow the getline function to be given an optional filename or string containing a filename. An example of a primitive file preprocessor, that looks for lines of the format
    #include filename
and substitutes that line for the contents of the file:
#!/usr/bin/nawk -f
{
# a primitive include preprocessor
    if (($1 == "#include") && (NF == 2)) {
# found the name of the file
        filename = $2;
        while (i = getline < filename ) {
            print;
        }
    } else {
        print;
    }
}

Click here to get file: include.nawk
NAWK's getline can also read from a pipe. If you have a program that generates a single line, you can use
    "command" | getline;
    print $0;
or
    "command" | getline abc;
    print abc;
If you have more than one line, you can loop through the results:
while ("command" | getline) {
    cmd[i++] = $0;
}
for (i in cmd) {
    printf("%s=%s\n", i, cmd[i]);
}

Only one pipe can be open at a time. If you want to open another pipe, you must execute
    close("command");
This is necessary even if the end of file is reached.
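A minimal sketch of reading a command's output into a variable, with the close() call described above:

```shell
# Read one line of a command's output into a variable,
# then close the pipe so it can be reused.
awk 'BEGIN {
    cmd = "echo hello world";
    cmd | getline result;
    close(cmd);
    print result;
}'
```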

The systime function


The systime() function returns the current time of day as the number of seconds since Midnight, January 1, 1970. It is useful for measuring how long portions of your GAWK code take to execute.
#!/usr/local/bin/gawk -f
# how long does it take to do a few loops?
BEGIN {
    LOOPS=100;
# do the test twice
    start=systime();
    for (i=0;i<LOOPS;i++) {
    }
    end = systime();
# calculate how long it takes to do a dummy test
    do_nothing = end-start;
# now do the test again with the *IMPORTANT* code inside
    start=systime();
    for (i=0;i<LOOPS;i++) {
# How long does this take?
        while ("date" | getline) {
            date = $0;
        }
        close("date");
    }
    end = systime();
    newtime = (end - start) - do_nothing;
    if (newtime <= 0) {
        printf("%d loops were not enough to test, increase it\n",
            LOOPS);
        exit;
    } else {
        printf("%d loops took %6.4f seconds to execute\n",
            LOOPS, newtime);
        printf("That's %10.8f seconds per loop\n",
            (newtime)/LOOPS);
# since the clock has an accuracy of +/- one second,
# what is the error?
        printf("accuracy of this measurement = %6.2f%%\n",
            (1/(newtime))*100);
    }
    exit;
}

Click here to get file: awk_example17.gawk

The Strftime function


GAWK has a special function for creating strings based on the current time. It's based on the strftime(3c) function. If you are familiar with the "+" formats of the date(1) command, you have a good head-start on understanding what the strftime command is used for. The systime() function returns the current date in seconds. That's not very useful if you want to create a string based on the time. While you could convert the seconds into days, months, years, etc., it would be easier to execute "date" and pipe the results into a string. (See the previous script for an example.) GAWK has another solution that eliminates the need for an external program. The function takes one or two arguments. The first argument is a string that specifies the format. This string contains regular characters and special characters. Special characters start with a backslash or the percent character. The characters with the backslash prefix are the same ones I covered earlier. In addition, the strftime() function defines dozens of combinations, all of which start with "%." The following table lists these special sequences:
+---------------------------------------------------------------+
|                         AWK Table 12                          |
|                    GAWK's strftime formats                    |
+---------------------------------------------------------------+
|%a  The locale's abbreviated weekday name                      |
|%A  The locale's full weekday name                             |
|%b  The locale's abbreviated month name                        |
|%B  The locale's full month name                               |
|%c  The locale's "appropriate" date and time representation    |
|%d  The day of the month as a decimal number (01--31)          |
|%H  The hour (24-hour clock) as a decimal number (00--23)      |
|%I  The hour (12-hour clock) as a decimal number (01--12)      |
|%j  The day of the year as a decimal number (001--366)         |
|%m  The month as a decimal number (01--12)                     |
|%M  The minute as a decimal number (00--59)                    |
|%p  The locale's equivalent of the AM/PM                       |
|%S  The second as a decimal number (00--61)                    |
|%U  The week number of the year (Sunday is first day of week)  |
|%w  The weekday as a decimal number (0--6). Sunday is day 0    |
|%W  The week number of the year (Monday is first day of week)  |
|%x  The locale's "appropriate" date representation             |
|%X  The locale's "appropriate" time representation             |
|%y  The year without century as a decimal number (00--99)      |
|%Y  The year with century as a decimal number                  |
|%Z  The time zone name or abbreviation                         |
|%%  A literal %                                                |
+---------------------------------------------------------------+


Depending on your operating system, and installation, you may also have the following formats:

+-----------------------------------------------------------------------+
|                             AWK Table 13                              |
|                    Optional GAWK strftime formats                     |
+-----------------------------------------------------------------------+
|%D  Equivalent to specifying %m/%d/%y                                  |
|%e  The day of the month, padded with a blank if it is only one digit  |
|%h  Equivalent to %b, above                                            |
|%n  A newline character (ASCII LF)                                     |
|%r  Equivalent to specifying %I:%M:%S %p                               |
|%R  Equivalent to specifying %H:%M                                     |
|%T  Equivalent to specifying %H:%M:%S                                  |
|%t  A TAB character                                                    |
|%k  The hour as a decimal number (0-23)                                |
|%l  The hour (12-hour clock) as a decimal number (1-12)                |
|%C  The century, as a number between 00 and 99                         |
|%u  is replaced by the weekday as a decimal number [Monday == 1]       |
|%V  is replaced by the week number of the year (using ISO 8601)        |
|%v  The date in VMS format (e.g. 20-JUN-1991)                          |
+-----------------------------------------------------------------------+

One useful format is
    strftime("%y_%m_%d_%H_%M_%S")
This constructs a string that contains the year, month, day, hour, minute and second in a format that allows convenient sorting. If you ran this at noon on Christmas, 1994, it would generate the string
    94_12_25_12_00_00
Here is the GAWK equivalent of the date command:
#! /usr/local/bin/gawk -f
#
BEGIN {
    format = "%a %b %e %H:%M:%S %Z %Y";
    print strftime(format);
}

Click here to get file: date.gawk
You will note that there is no exit command in the BEGIN statement. If I were using AWK, an exit statement would be necessary. Otherwise, it would never terminate. If there is no action defined for each line read, NAWK and GAWK do not need an exit statement. If you provide a second argument to the strftime() function, it uses that argument as the timestamp, instead of the current system's time. This is useful for calculating future times. The following script calculates the time one week after the current time:

#!/usr/local/bin/gawk -f
BEGIN {
# get current time
    ts = systime();
# the time is in seconds, so
    one_day = 24 * 60 * 60;
    next_week = ts + (7 * one_day);
    format = "%a %b %e %H:%M:%S %Z %Y";
    print strftime(format, next_week);
    exit;
}

Click here to get file: one_week_later.gawk

User Defined Functions


Finally, NAWK and GAWK support user defined functions. This function demonstrates a way to print error messages, including the filename and line number, if appropriate:
#!/usr/bin/nawk -f
{
    if (NF != 4) {
        error("Expected 4 fields");
    } else {
        print;
    }
}
function error ( message ) {
    if (FILENAME != "-") {
        printf("%s: ", FILENAME) > "/dev/tty";
    }
    printf("line # %d, %s, line: %s\n", NR, message, $0) >> "/dev/tty";
}

Click here to get file: awk_example18.nawk
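User-defined functions can also return values. A minimal nawk/gawk sketch; note the extra parameters ("i" and "r") act as local variables, a common AWK idiom:

```shell
# A user-defined function with a return value.
# Parameters beyond those passed in behave as locals.
awk '
function power(base, n,    i, r) {
    r = 1;
    for (i = 0; i < n; i++) {
        r *= base;
    }
    return r;
}
BEGIN { print power(2, 10) }'
```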

AWK patterns
In my first tutorial on AWK, I described the AWK statement as having the form
    pattern {commands}
I have only used two patterns so far: the special words BEGIN and END. Other patterns are possible, yet I haven't used any. There are several reasons for this. The first is that these patterns aren't necessary. You can duplicate them using an if statement. Therefore this is an "advanced feature." Patterns, or perhaps the better word is conditions, tend to make an AWK program obscure to a beginner. You can think of them as an advanced topic, one that should be attempted after becoming familiar with the basics.

A pattern or condition is simply an abbreviated test. If the condition is true, the action is performed. All relational tests can be used as a pattern. The "head -10" command, which prints the first 10 lines and stops, can be duplicated with
    {if (NR <= 10 ) {print}}
Changing the if statement to a condition shortens the code:
    NR <= 10 {print}
Besides relational tests, you can also use containment tests, i.e. do strings contain regular expressions? Printing all lines that contain the word "special" can be written as
    {if ($0 ~ /special/) {print}}
or more briefly
    $0 ~ /special/ {print}
This type of test is so common, the authors of AWK allow a third, shorter format:
    /special/ {print}
These tests can be combined with the AND (&&) and OR (||) operators, as well as the NOT (!) operator. Parentheses can also be added if you are in doubt, or to make your intention clear. The following condition prints the line if it contains the word "whole" or columns 1 and 2 contain "part1" and "part2" respectively.
    ($0 ~ /whole/) || (($1 ~ /part1/) && ($2 ~ /part2/)) {print}
This can be shortened to
    /whole/ || $1 ~ /part1/ && $2 ~ /part2/ {print}
There is one case where adding parentheses hurts. The condition
    /whole/ {print}
works, but
    (/whole/) {print}

does not. If parentheses are used, it is necessary to explicitly specify the test:
    ($0 ~ /whole/) {print}
A murky situation arises when a simple variable is used as a condition. Since the variable NF specifies the number of fields on a line, one might think the statement
    NF {print}
would print all lines with one or more fields. This is an illegal command for AWK, because AWK does not accept variables as conditions. To prevent a syntax error, I had to change it to
    NF != 0 {print}
I expected NAWK to work, but on some SunOS systems it refused to print any lines at all. On newer Solaris systems it did behave properly. Again, changing it to the longer form worked for all variations. GAWK, like the newer versions of NAWK, worked properly. After this experience, I decided to leave other, exotic variations alone. Clearly this is unexplored territory. I could write a script that prints the first 20 lines, except if there were exactly three fields, unless it was line 10, by using
    NF == 3 ? NR == 10 : NR < 20 { print }
But I won't. Obscurity, like puns, is often unappreciated.

There is one more common and useful pattern I have not yet described. It is the comma separated pattern. A common example has the form:
    /start/,/stop/ {print}
This form defines, in one line, the condition to turn the action on, and the condition to turn the action off. That is, when a line containing "start" is seen, it is printed. Every line afterwards is also printed, until a line containing "stop" is seen. This one is also printed, but the line after, and all following lines, are not printed. This triggering on and off can be repeated many times. The equivalent code, using the if command, is:
{
    if ($0 ~ /start/) {
        triggered=1;
    }
    if (triggered) {
        print;
        if ($0 ~ /stop/) {
            triggered=0;
        }
    }
}
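To check the equivalence, here is a quick shell demonstration (the five input lines are invented for illustration):

```shell
# The one-line range pattern and the explicit if-version print the
# same lines: "start" through "stop" inclusive.
in='a
start
b
stop
c'
echo "$in" | awk '/start/,/stop/ {print}'
echo "$in" | awk '{ if ($0 ~ /start/) triggered=1
                    if (triggered) { print; if ($0 ~ /stop/) triggered=0 } }'
```

Both commands print "start", "b", and "stop".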

The conditions do not have to be regular expressions. Relational tests can also be used. The following prints all lines between 20 and 40:

(NR==20),(NR==40) {print}

You can mix relational and containment tests. The following prints every line until a "stop" is seen:

(NR==1),/stop/ {print}

There is one more area of confusion about patterns: each one is independent of the others. You can have several patterns in a script; none influence the other patterns. If the following script is executed:

NR==10 {print}
(NR==5),(NR==15) {print}
/xxx/ {print}
(NR==1),/NeVerMatchThiS/ {print}

and the input file's line 10 contains "xxx," it would be printed 4 times, as each condition is true. You can think of each condition as cumulative. The exception is the special BEGIN and END conditions. In the original AWK, you can only have one of each. In NAWK and GAWK, you can have several BEGIN or END actions.
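A quick demonstration of a relational range (input lines invented):

```shell
# Lines 2 through 4 of the input are printed; everything else is skipped.
printf 'l1\nl2\nl3\nl4\nl5\n' | awk '(NR==2),(NR==4) {print}'
```

This prints l2, l3, and l4.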

Formatting AWK programs


Many readers have questioned my style of AWK programming. In particular, they ask me why I include code like this:

# Print column 3
print $3;

when I could use

print $3 # print column 3

After all, they reason, the semicolon is unnecessary, and comments do not have to start on the first column. This is true. Still I avoid this. Years ago, when I started writing AWK programs, I would find myself confused when the nesting of conditions was too deep. If I moved a complex if statement inside another if statement, my alignment of braces became incorrect. It can be very difficult to repair this condition, especially with large scripts. Nowadays I use emacs to do the formatting for me, but 10 years ago I didn't have this option. My solution was to use the program cb, which is a "C beautifier." By including optional semicolons, and starting comments on the first column of the line, I could send my AWK script through this filter and properly align all of the code.

Environment Variables
I've described the 7 special variables in AWK, and briefly mentioned some others available in NAWK and GAWK. The complete list follows:
+-----------------------------------+
|            AWK Table 14           |
| Variable      AWK   NAWK   GAWK   |
+-----------------------------------+
| FS            Yes   Yes    Yes    |
| NF            Yes   Yes    Yes    |
| RS            Yes   Yes    Yes    |
| NR            Yes   Yes    Yes    |
| FILENAME      Yes   Yes    Yes    |
| OFS           Yes   Yes    Yes    |
| ORS           Yes   Yes    Yes    |
+-----------------------------------+
| ARGC                Yes    Yes    |
| ARGV                Yes    Yes    |
| ARGIND                     Yes    |
| FNR                 Yes    Yes    |
| OFMT                Yes    Yes    |
| RSTART              Yes    Yes    |
| RLENGTH             Yes    Yes    |
| SUBSEP              Yes    Yes    |
| ENVIRON                    Yes    |
| IGNORECASE                 Yes    |
| CONVFMT                    Yes    |
| ERRNO                      Yes    |
| FIELDWIDTHS                Yes    |
+-----------------------------------+

Since I've already discussed many of these, I'll only cover those that I missed earlier.

ARGC - Number of arguments (NAWK/GAWK)


The variable ARGC specifies the number of arguments on the command line. It always has the value of one or more, as it counts its own program name as the first argument. If you specify an AWK filename using the "-f" option, it is not counted as an argument. If you include a variable assignment on the command line using the form below, NAWK does not count it as an argument:

nawk -f file.awk x=17

GAWK does, but that is because GAWK requires the "-v" option before each assignment:

gawk -f file.awk -v x=17

GAWK variables initialized this way do not affect the ARGC variable.

ARGV - Array of arguments (NAWK/GAWK)


The ARGV array is the list of arguments (or files) passed as command line arguments.
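A quick way to inspect both variables; the arguments x and y are placeholder names, and need not be real files, since a program with only a BEGIN block never opens its input:

```shell
# ARGC counts the program name plus the two arguments; ARGV holds them.
awk 'BEGIN { print ARGC; print ARGV[1], ARGV[2] }' x y
```

This prints 3, then "x y".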

ARGIND - Argument Index (GAWK only)


The ARGIND variable specifies the current index into the ARGV array; therefore ARGV[ARGIND] is always the current filename, and FILENAME == ARGV[ARGIND] is always true. It can be modified, allowing you to skip over files, etc.

FNR (NAWK/GAWK)
The FNR variable contains the number of lines read, but is reset for each file read. The NR variable accumulates for all files read. Therefore if you execute an awk script with two files as arguments, each containing 10 lines:

nawk '{print NR}' file file2
nawk '{print FNR}' file file2

the first program would print the numbers 1 through 20, while the second would print the numbers 1 through 10 twice, once for each file.
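A self-contained demonstration using two temporary files of two lines each:

```shell
# FNR resets to 1 at the start of each file; NR keeps counting.
f1=$(mktemp); f2=$(mktemp)
printf 'a\nb\n' > "$f1"
printf 'c\nd\n' > "$f2"
awk '{print FNR, NR}' "$f1" "$f2"
rm -f "$f1" "$f2"
```

The output is "1 1", "2 2", "1 3", "2 4".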

OFMT (NAWK/GAWK)
The OFMT variable specifies the default output format for numbers, as used by the print statement. The default value is "%.6g".

RSTART, RLENGTH and match (NAWK/GAWK)


I've already mentioned the RSTART and RLENGTH variables. After the match() function is called, RSTART contains the starting position of the match within the string, and RLENGTH contains the length of the match.

SUBSEP - Multi-dimensional array separator (NAWK/GAWK)


Earlier I described how you can construct multi-dimensional arrays in AWK. These are constructed by concatenating two indexes together with a special character between them. If I use an ampersand as the special character, I can access the value at location X, Y by the reference

array[ X "&" Y ]

NAWK (and GAWK) has this feature built in. That is, you can specify the array element

array[X,Y]

It automatically constructs the string, placing a special character between the indexes. This character is the non-printing character with octal value 034. You can control the value of this character, through the SUBSEP variable, to make sure your strings do not contain the same character.
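A short sketch showing the combined key and how to split it back apart:

```shell
# Store a value under a two-part key, then recover the indexes
# by splitting the combined key string on SUBSEP.
awk 'BEGIN {
    a[1,2] = "x"
    if ((1,2) in a) print "present"
    for (k in a) { split(k, part, SUBSEP); print part[1], part[2] }
}'
```

This prints "present" and then "1 2".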

ENVIRON - environment variables (GAWK only)


The ENVIRON array contains the environment variables of the current process. You can print your current search path using

print ENVIRON["PATH"]

IGNORECASE (GAWK only)


The IGNORECASE variable is normally zero. If you set it to non-zero, then all pattern matches ignore case. Therefore the following is equivalent to "grep -i match":

BEGIN {IGNORECASE=1;}
/match/ {print}
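Since IGNORECASE is GAWK-specific, a portable near-equivalent on other awks is to fold the line to lower case before matching (a sketch):

```shell
# Case-insensitive match without IGNORECASE: lowercase the line first.
printf 'Match here\nno hit\nMATCH again\n' | awk 'tolower($0) ~ /match/ {print}'
```

This prints "Match here" and "MATCH again".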

CONVFMT - conversion format (GAWK only)

The CONVFMT variable specifies the format used when converting a number to a string. The default value is "%.6g". You can change the conversion by modifying CONVFMT between conversions:

a = 12;
b = a "";
CONVFMT = "%2.2f";
c = a "";

Variables b and c are both strings; the first will have the value "12" (converted with the default format) while the second will have the value "12.00". Note that modern awks apply CONVFMT only to values with a fractional part, so on those implementations an integral value like 12 converts to "12" in both cases.
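The effect is easiest to see with a fractional value, since many awks convert integral values with "%d" regardless of CONVFMT (a sketch):

```shell
# Number-to-string conversion honors CONVFMT for fractional values.
awk 'BEGIN { x = 3.14159
             a = x ""              # default CONVFMT "%.6g"
             CONVFMT = "%.2f"
             b = x ""
             print a; print b }'
```

This prints "3.14159" and then "3.14".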

ERRNO - system errors (GAWK only)


The ERRNO variable describes the error, as a string, after a call to the getline command fails.

FIELDWIDTHS - fixed width fields (GAWK only)


The FIELDWIDTHS variable is used when processing fixed-width input. If you wanted to read a file that had 3 columns of data, where the first is 5 characters wide, the second 4, and the third 7, you could use substr to split the line apart. The technique, using FIELDWIDTHS, would be:

BEGIN {FIELDWIDTHS="5 4 7";}
{ printf("The three fields are %s %s %s\n", $1, $2, $3);}
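Since FIELDWIDTHS is GAWK-only, the substr technique mentioned above is the portable fallback. A sketch, using an invented 16-character record with 5/4/7-wide fields:

```shell
# Split a fixed-width record (5 + 4 + 7 characters) with substr.
echo 'AAAAABBBBCCCCCCC' | awk '{
    f1 = substr($0, 1, 5)
    f2 = substr($0, 6, 4)
    f3 = substr($0, 10, 7)
    printf("The three fields are %s %s %s\n", f1, f2, f3)
}'
```

This prints "The three fields are AAAAA BBBB CCCCCCC".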

AWK, NAWK, GAWK, or PERL


This concludes my series of tutorials on AWK and the NAWK and GAWK variants. I originally intended to avoid extensive coverage of NAWK and GAWK. I am a PERL user myself, and still believe PERL is worth learning, yet I don't intend to cover this very complex language. Understanding AWK is very useful when you start to learn PERL, and a part of me felt that teaching you extensive NAWK and GAWK features might encourage you to bypass PERL. I found myself going into more depth than I planned, and I hope you found this useful. I found out a lot myself, especially when I discussed topics not covered in other books. Which reminds me of some closing advice: if you don't understand how something works, experiment and see what you can discover. Good luck, and happy AWKing... More of my Unix shell tutorials can be found here. Other shell tutorials can be found at Heiner's SHELLdorado and Chris F. A. Johnson's Unix Shell Page.

Sed - An Introduction and Tutorial by Bruce Barnett


Table of Contents

The Awful Truth about sed The essential command: s for substitution The slash as a delimiter Using & as the matched string Using \1 to keep part of the pattern Substitute Flags /g - Global replacement Is sed recursive? /1, /2, etc. Specifying which occurrence /p - print Write to a file with /w filename Combining substitution flags Arguments and invocation of sed Multiple commands with -e command Filenames on the command line sed -n: no printing sed -f scriptname sed in shell script Quoting multiple sed lines in the C shell Quoting multiple sed lines in the Bourne shell A sed interpreter script Sed Comments Passing arguments into a sed script Using sed in a shell here-is document Multiple commands and order of execution Addresses and Ranges of Text Restricting to a line number Patterns Ranges by line number Ranges by patterns Delete with d Printing with p Reversing the restriction with ! Relationships between d, p, and ! The q or quit command Grouping with { and } Writing a file with the 'w' command Reading in a file with the 'r' command SunOS and the # Comment Command

Adding, Changing, Inserting new lines Append a line with 'a' Insert a line with 'i' Change a line with 'c' Leading tabs and spaces in a sed script Adding more than one line Adding lines and the pattern space Address ranges and the above commands Multi-Line Patterns Print line number with = Transform with y Displaying control characters with a l Working with Multiple Lines Using newlines in sed scripts The Hold Buffer Exchange with x Example of Context Grep Hold with h or H Keeping more than one line in the hold buffer Get with g or G Flow Control Testing with t An alternate way of adding comments The poorly documented ; Passing regular expressions as arguments Command Summary In Conclusion More References

Copyright 2001,2005,2007,2011 Bruce Barnett and General Electric Company All rights reserved You are allowed to print copies of this tutorial for your personal use, and link to this page, but you are not allowed to make electronic copies, or redistribute this tutorial in any form without permission.

Introduction to Sed
How to use sed, a special editor for modifying files automatically. If you want to write a program to make changes in a file, sed is the tool to use. There are a few programs that are the real workhorses in the Unix toolbox. These programs are simple to use for simple applications, yet have a rich set of commands for performing complex actions. Don't let the complex potential of a program keep you from making use of the simpler aspects. This chapter, like all of the rest, starts with the simple concepts and introduces the advanced topics later on.

A note on comments. When I first wrote this, most versions of sed did not allow you to place comments inside the script. Lines starting with the '#' character are comments. Newer versions of sed may support comments at the end of the line as well.

The Awful Truth about sed


Sed is the ultimate stream editor. If that sounds strange, picture a stream flowing through a pipe. Okay, you can't see a stream if it's inside a pipe. That's what I get for attempting a flowing analogy. You want literature, read James Joyce. Anyhow, sed is a marvelous utility. Unfortunately, most people never learn its real power. The language is very simple, but the documentation is terrible. The Solaris on-line manual pages for sed are five pages long, and two of those pages describe the 34 different errors you can get. A program that spends as much space documenting the errors as it does documenting the language has a serious learning curve. Do not fret! It is not your fault you don't understand sed. I will cover sed completely. But I will describe the features in the order that I learned them. I didn't learn everything at once. You don't need to either.

The essential command: s for substitution


Sed has several commands, but most people only learn the substitute command: s. The substitute command changes all occurrences of the regular expression into a new value. A simple example is changing "day" in the "old" file to "night" in the "new" file:
sed s/day/night/ <old >new

Or another way (for Unix beginners),


sed s/day/night/ old >new

and for those who want to test this:


echo day | sed s/day/night/

This will output "night". I didn't put quotes around the argument because this example didn't need them. If you read my earlier tutorial on quotes, you would understand why it doesn't need quotes. However, I recommend you do use quotes. If you have meta-characters in the command, quotes are necessary. And if you aren't sure, it's a good habit, and I will henceforth quote future examples to emphasize the "best practice." Using the strong (single quote) character, that would be:

sed 's/day/night/' <old >new

I must emphasize that the sed editor changes exactly what you tell it to. So if you executed
echo Sunday | sed 's/day/night/'

This would output the word "Sunnight" because sed found the string "day" in the input. There are four parts to this substitute command:
s        Substitute command
/../../  Delimiters
day      Search pattern (regular expression)
night    Replacement string

The search pattern is on the left hand side and the replacement string is on the right hand side. We've covered quoting and regular expressions. That's 90% of the effort needed to learn the substitute command. To put it another way, you already know how to handle 90% of the most frequent uses of sed. There are a ... few fine points that a future sed expert should know about. (You just finished section 1. There are only 63 more sections to cover. :-) Oh. And you may want to bookmark this page, .... just in case you don't finish.

The slash as a delimiter


The character after the s is the delimiter. It is conventionally a slash, because this is what ed, more, and vi use. It can be anything you want, however. If you want to change a pathname that contains a slash - say /usr/local/bin to /common/bin - you could use the backslash to quote the slash:
sed 's/\/usr\/local\/bin/\/common\/bin/' <old >new

Gulp. Some call this a 'Picket Fence' and it's ugly. It is easier to read if you use an underline instead of a slash as a delimiter:
sed 's_/usr/local/bin_/common/bin_' <old >new

Some people use colons:


sed 's:/usr/local/bin:/common/bin:' <old >new

Others use the "|" character.


sed 's|/usr/local/bin|/common/bin|' <old >new

Pick one you like. As long as it's not in the string you are looking for, anything goes. And remember that you need three delimiters. If you get an "Unterminated `s' command" error, it's because you are missing one of them.

Using & as the matched string


Sometimes you want to search for a pattern and add some characters, like parentheses, around or near the pattern you found. It is easy to do this if you are looking for a particular string:
sed 's/abc/(abc)/' <old >new

This won't work if you don't know exactly what you will find. How can you put the string you found in the replacement string if you don't know what it is? The solution requires the special character "&." It corresponds to the pattern found.
sed 's/[a-z]*/(&)/' <old >new

You can have any number of "&" in the replacement string. You could also double a pattern, e.g. the first number of a line:
% echo "123 abc" | sed 's/[0-9]*/& &/'
123 123 abc

Let me slightly amend this example. Sed will match the first string, and make it as greedy as possible. The first match for '[0-9]*' is at the very start of the line, since this pattern matches zero or more numbers. So if the input was "abc 123" the output would be unchanged (well, except for a space before the letters). A better way to duplicate the number is to make sure it matches a number:
% echo "123 abc" | sed 's/[0-9][0-9]*/& &/'
123 123 abc

The string "abc" is unchanged, because it was not matched by the regular expression. If you wanted to eliminate "abc" from the output, you must expand the regular expression to match the rest of the line and explicitly exclude part of the expression using "\(", "\)" and "\1", which is the next topic. A quick comment. The original sed did not support the "+" metacharacter. GNU sed does: it means "one or more matches", and in the default (basic) regular expression syntax it is written "\+". So the above could also be written using
% echo "123 abc" | sed 's/[0-9]\+/& &/'
123 123 abc

Using \1 to keep part of the pattern

I have already described the use of "\(", "\)" and "\1" in my tutorial on regular expressions. To review, the escaped parentheses (that is, parentheses with backslashes before them) remember portions of the regular expression. You can use this to exclude part of the regular expression. The "\1" is the first remembered pattern, and the "\2" is the second remembered pattern. Sed has up to nine remembered patterns. If you wanted to keep the first word of a line, and delete the rest of the line, mark the important part with the parentheses:
sed 's/\([a-z]*\).*/\1/'

I should elaborate on this. Regular expressions are greedy, and try to match as much as possible. "[a-z]*" matches zero or more lower case letters, and tries to be as big as possible. The ".*" matches zero or more characters after the first match. Since the first one grabs all of the lower case letters, the second matches anything else. Therefore if you type
echo abcd123 | sed 's/\([a-z]*\).*/\1/'

This will output "abcd" and delete the numbers. If you want to switch two words around, you can remember two patterns and change the order around:
sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/'

Note the space between the two remembered patterns. This is used to make sure two words are found. However, this will do nothing if a single word is found, or any lines with no letters. You may want to insist that words have at least one letter by using
sed 's/\([a-z][a-z]*\) \([a-z][a-z]*\)/\2 \1/'

The "\1" doesn't have to be in the replacement string (in the right hand side). It can be in the pattern you are searching for (in the left hand side). If you want to eliminate duplicated words, you can try:
sed 's/\([a-z]*\) \1/\1/'

If you want to detect duplicated words, you can use


sed -n '/\([a-z][a-z]*\) \1/p'

This, when used as a filter, will print lines with duplicated words. You can use up to nine remembered patterns: "\1" through "\9". If you wanted to reverse the first three characters on a line, you can use
sed 's/^\(.\)\(.\)\(.\)/\3\2\1/'
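For example, using an invented input string:

```shell
# Groups \1 \2 \3 capture the first three characters; the
# replacement emits them in reverse order.
echo abcdef | sed 's/^\(.\)\(.\)\(.\)/\3\2\1/'
```

This prints "cbadef".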

Substitute Flags
You can add additional flags after the last delimiter. These flags can specify what happens when there is more than one occurrence of a pattern on a single line, and what to do if a substitution is found. Let me describe them.

/g - Global replacement
Most Unix utilities work on files, reading a line at a time. Sed, by default, is the same way. If you tell it to change a word, it will only change the first occurrence of the word on a line. You may want to make the change on every word on the line instead of the first. For an example, let's place parentheses around words on a line. Instead of using a pattern like "[A-Za-z]*" which won't match words like "won't," we will use a pattern, "[^ ]*," that matches everything except a space. Well, this will also match anything because "*" means zero or more. The current version of sed can get unhappy with patterns like this, and generate errors like "Output line too long" or even run forever. I consider this a bug, and have reported this to Sun. As a work-around, you must avoid matching the null string when using the "g" flag to sed. A work-around example is: "[^ ][^ ]*." The following will put parentheses around the first word:
sed 's/[^ ]*/(&)/' <old >new

If you want it to make changes for every word, add a "g" after the last delimiter and use the work-around:
sed 's/[^ ][^ ]*/(&)/g' <old >new

Is sed recursive?
Sed only operates on patterns found in the in-coming data. That is, the input line is read, and when a pattern is matched, the modified output is generated, and the rest of the input line is scanned. The "s" command will not scan the newly created output. That is, you don't have to worry about expressions like:
sed 's/loop/loop the loop/g' <old >new

This will not cause an infinite loop. If a second "s" command is executed, it could modify the results of a previous command. I will show you how to execute multiple commands later.
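You can verify this safely:

```shell
# The replacement text is never rescanned, so this terminates.
echo 'loop' | sed 's/loop/loop the loop/g'
```

This prints "loop the loop" and exits normally.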

/1, /2, etc. Specifying which occurrence


With no flags, the first pattern is changed. With the "g" option, all patterns are changed. If you want to modify a particular pattern that is not the first one on the line, you could use "\(" and "\)"

to mark each pattern, and use "\1" to put the first pattern back unchanged. This next example keeps the first word on the line but deletes the second:
sed 's/\([a-zA-Z]*\) \([a-zA-Z]*\) /\1 /' <old >new

Yuck. There is an easier way to do this. You can add a number after the substitution command to indicate you only want to match that particular pattern. Example:
sed 's/[a-zA-Z]* //2' <old >new

You can combine a number with the g (global) flag. For instance, if you want to leave the first word alone, but change the second, third, etc. to DELETED, use /2g:
sed 's/[a-zA-Z]* /DELETED /2g' <old >new
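A quick demonstration of the number flag by itself (input invented):

```shell
# Only the second occurrence of the pattern (the second word) is replaced.
echo 'one two three' | sed 's/[a-z]* /X /2'
```

This prints "one X three".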

Don't get /2 and \2 confused. The /2 is used at the end. \2 is used inside the replacement field. Note the space after the "*" character. Without the space, sed will run a long, long time. (Note: this bug is probably fixed by now.) This is because the number flag and the "g" flag have the same bug. You should also be able to use the pattern
sed 's/[^ ]*//2' <old >new

but this also eats CPU. If this works on your computer, and it does on some Unix systems, you could remove the encrypted password from the password file:
sed 's/[^:]*//2' </etc/passwd >/etc/password.new

But this didn't work for me at the time I wrote this. Using "[^:][^:]*" as a work-around doesn't help, because it won't match a non-existent password, and would instead delete the third field, which is the user ID! Instead you have to use the ugly parentheses:
sed 's/^\([^:]*\):[^:]:/\1::/' </etc/passwd >/etc/password.new

You could also add a character to the first pattern so that it no longer matches the null pattern:
sed 's/[^:]*:/:/2' </etc/passwd >/etc/password.new

The number flag is not restricted to a single digit. It can be any number from 1 to 512. If you wanted to add a colon after the 80th character in each line, you could type:
sed 's/./&:/80' <file >new

You can also do it the hard way by using 80 dots:


sed 's/^................................................................................/&:/' <file >new

/p - print
By default, sed prints every line. If it makes a substitution, the new text is printed instead of the old one. If you use an optional argument to sed, "sed -n," it will not, by default, print any lines. I'll cover this and other options later. When the "-n" option is used, the "p" flag will cause the modified line to be printed. Here is one way to duplicate the function of grep with sed:
sed -n 's/pattern/&/p' <file

Write to a file with /w filename


There is one more flag that can follow the third delimiter. With it, you can specify a file that will receive the modified data. An example is the following, which will write all lines that start with an even number, followed by a space, to the file even:
sed -n 's/^[0-9]*[02468] /&/w even' <file

In this example, the output file isn't needed, as the input was not modified. You must have exactly one space between the w and the filename. You can also have ten files open with one instance of sed. This allows you to split up a stream of data into separate files. Using the previous example combined with multiple substitution commands described later, you could split a file into ten pieces depending on the last digit of the first number. You could also use this method to log error or debugging information to a special file.
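A self-contained sketch, using a temporary file as the target of the w flag:

```shell
# Lines whose first number is even are copied to a side file.
out=$(mktemp)
printf '2 even\n3 odd\n4 even\n' | sed -n "s/^[0-9]*[02468] /&/w $out"
cat "$out"
rm -f "$out"
```

The cat shows "2 even" and "4 even"; the odd line was not written.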

Combining substitution flags


You can combine flags when it makes sense. Also "w" has to be the last flag. For example the following command works:
sed -n 's/a/A/2pw /tmp/file' <old >new

Next I will discuss the options to sed, and different ways to invoke sed.

Arguments and invocation of sed


Previously, I have only used one substitute command. If you need to make two changes, and you didn't want to read the manual, you could pipe together multiple sed commands:
sed 's/BEGIN/begin/' <old | sed 's/END/end/' >new

This used two processes instead of one. A sed guru never uses two processes when one can do.

Multiple commands with -e command


One method of combining multiple commands is to use a -e before each command:
sed -e 's/a/A/' -e 's/b/B/' <old >new

A "-e" isn't needed in the earlier examples because sed knows that there must always be one command. If you give sed one argument, it must be a command, and sed will edit the data read from standard input. Also see Quoting multiple sed lines in the Bourne shell

Filenames on the command line


You can specify files on the command line if you wish. If there is more than one argument to sed that does not start with an option, it must be a filename. This next example will count the number of lines in three files that don't begin with a "#:"
sed 's/^#.*//' f1 f2 f3 | grep -v '^$' | wc -l

The sed substitute command changes every line that starts with a "#" into a blank line. Grep was used to filter out empty lines. Wc counts the number of lines left. Sed has more commands that make grep unnecessary. But I will cover that later. Of course you could write the last example using the "-e" option:
sed -e 's/^#.*//' f1 f2 f3 | grep -v '^$' | wc -l

There are two other options to sed.

sed -n: no printing


The "-n" option will not print anything unless an explicit request to print is found. I mentioned the "/p" flag to the substitute command as one way to turn printing back on. Let me clarify this. The command
sed 's/PATTERN/&/p' file

acts like the cat program if PATTERN is not in the file: e.g. nothing is changed. If PATTERN is in the file, then each line that has this is printed twice. Add the "-n" option and the example acts like grep:
sed -n 's/PATTERN/&/p' file

Nothing is printed, except those lines with PATTERN included.
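For example (input lines invented):

```shell
# Only the line containing the pattern is printed, like grep.
printf 'alpha\nbeta\ngamma\n' | sed -n 's/eta/&/p'
```

This prints only "beta".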

sed -f scriptname
If you have a large number of sed commands, you can put them into a file and use
sed -f sedscript <old >new

where sedscript could look like this:


# sed comment - This script changes lower case vowels to upper case
s/a/A/g
s/e/E/g
s/i/I/g
s/o/O/g
s/u/U/g

When there are several commands in one file, each command must be on a separate line. Also see here

sed in shell script


If you have many commands and they won't fit neatly on one line, you can break up the line using a backslash:
sed -e 's/a/A/g' \
    -e 's/e/E/g' \
    -e 's/i/I/g' \
    -e 's/o/O/g' \
    -e 's/u/U/g' <old >new

Quoting multiple sed lines in the C shell


You can have a large, multi-line sed script in the C shell, but you must tell the C shell that the quote is continued across several lines. This is done by placing a backslash at the end of each line:
#!/bin/csh -f
sed 's/a/A/g \
s/e/E/g \
s/i/I/g \
s/o/O/g \
s/u/U/g' <old >new

Quoting multiple sed lines in the Bourne shell


The Bourne shell makes this easier as a quote can cover several lines:
#!/bin/sh
sed '
s/a/A/g
s/e/E/g
s/i/I/g
s/o/O/g
s/u/U/g' <old >new

A sed interpreter script


Another way of executing sed is to use an interpreter script. Create a file that contains:

#!/bin/sed -f
s/a/A/g
s/e/E/g
s/i/I/g
s/o/O/g
s/u/U/g

Click here to get file: CapVowel.sed If this script was stored in a file with the name "CapVowel" and was executable, you could use it with the simple command:
CapVowel <old >new

Comments
Sed comments are lines where the first non-white character is a "#." On many systems, sed can have only one comment, and it must be the first line of the script. On the Sun (1988, when I wrote this), you can have several comment lines anywhere in the script. Modern versions of sed support this. If the first line contains exactly "#n" then this does the same thing as the "-n" option: turning off printing by default. This could not be done with a sed interpreter script, because the first line must start with "#!/bin/sed -f"; I thought "#!/bin/sed -nf" generated an error, but it works as I write this update in 2008. Note that "#!/bin/sed -fn" does not work, because sed thinks the filename of the script is "n". However,
"#!/bin/sed -nf"

does work.

Passing arguments into a sed script


Passing a word into a shell script that calls sed is easy if you remembered my tutorial on the Unix quoting mechanism. To review, you use the single quotes to turn quoting on and off. A simple shell script that uses sed to emulate grep is:

#!/bin/sh
sed -n 's/'$1'/&/p'

However, there is a problem with this script. If you have a space in the argument, the script would cause a syntax error. A better version would protect from this happening:

#!/bin/sh
sed -n 's/'"$1"'/&/p'

Click here to get file: sedgrep.sed If this was stored in a file called sedgrep, you could type
sedgrep '[A-Z][A-Z]' <file

This would allow sed to act as the grep command.
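The same idea works inline; here the pattern arrives through a shell variable (pat is a placeholder name):

```shell
# Double quotes around $pat keep the substitution intact even if the
# pattern were to contain spaces.
pat='[0-9][0-9]*'
printf 'abc\n123\n' | sed -n 's/'"$pat"'/&/p'
```

This prints only "123", the line that matches the pattern.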

Using sed in a shell here-is document


You can use sed to prompt the user for some parameters and then create a file with those parameters filled in. You could create a file with dummy values placed inside it, and use sed to change those dummy values. A simpler way is to use the "here is" document, which uses part of the shell script as if it were standard input:
#!/bin/sh
echo -n 'what is the value? '
read value
sed 's/XXX/'$value'/' <<EOF
The value is XXX
EOF

Click here to get file: sed_hereis.sed When executed, the script says:
what is the value?

If you type in "123," the next line will be:


The value is 123

I admit this is a contrived example. "Here is" documents can have values evaluated without the use of sed. This example does the same thing:
#!/bin/sh
echo -n 'what is the value? '
read value
cat <<EOF
The value is $value
EOF

However, combining "here is" documents with sed can be useful for some complex cases. Note that

sed 's/XXX/'$value'/' <<EOF

will give a syntax error if the user types a space. Better form would be to use

sed 's/XXX/'"$value"'/' <<EOF

Multiple commands and order of execution


As we explore more of the commands of sed, the commands will become complex, and the actual sequence can be confusing. It's really quite simple. Each line is read in. Each command, in order specified by the user, has a chance to operate on the input line. After the substitutions are made, the next command has a chance to operate on the same line, which may have been modified by earlier commands. If you ever have a question, the best way to learn what will happen is to create a small example. If a complex command doesn't work, make it simpler. If you are having problems getting a complex script working, break it up into two smaller scripts and pipe the two scripts together.

Addresses and Ranges of Text


You have only learned one command, and you can see how powerful sed is. However, all it is doing is a grep and substitute. That is, the substitute command is treating each line by itself, without caring about nearby lines. What would be useful is the ability to restrict the operation to certain lines. Some useful restrictions might be:

Specifying a line by its number.
Specifying a range of lines by number.
All lines containing a pattern.
All lines from the beginning of a file to a regular expression.
All lines from a regular expression to the end of the file.
All lines between two regular expressions.

Sed can do all that and more. Every command in sed can be preceded by an address, range or restriction like the above examples. The restriction or address immediately precedes the command:

restriction command

Restricting to a line number


The simplest restriction is a line number. If you wanted to delete the first number on line 3, just add a "3" before the command:
sed '3 s/[0-9][0-9]*//' <file >new
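For example, deleting the first number on line 2 only (input invented):

```shell
# Only line 2 has its first number deleted; lines 1 and 3 pass through.
printf '10 a\n20 b\n30 c\n' | sed '2 s/[0-9][0-9]*//'
```

Line 2 becomes " b" (the number is gone, the space remains).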

Patterns
Many Unix utilities like vi and more use a slash to search for a regular expression. Sed uses the same convention, provided you terminate the expression with a slash. To delete the first number on all lines that start with a "#," use:
sed '/^#/ s/[0-9][0-9]*//'

I placed a space after the "/expression/" so it is easier to read. It isn't necessary, but without it the command is harder to fathom. Sed does provide a few extra options when specifying regular expressions. But I'll discuss those later. If the expression starts with a backslash, the next character is the delimiter. To use a comma instead of a slash, use:
sed '\,^#, s/[0-9][0-9]*//'

The main advantage of this feature is searching for slashes. Suppose you wanted to search for the string "/usr/local/bin" and you wanted to change it to "/common/all/bin." You could use the backslash to escape the slash:
sed '/\/usr\/local\/bin/ s/\/usr\/local/\/common\/all/'

It would be easier to follow if you used an underline instead of a slash as a search. This example uses the underline in both the search command and the substitute command:
sed '\_/usr/local/bin_ s_/usr/local_/common/all_'

This illustrates why sed scripts get the reputation for obscurity. I could be perverse and show you the example that will search for all lines that start with a "g," and change each "g" on that line to an "s:"
sed '/^g/s/g/s/g'

Adding a space and using an underscore after the substitute command makes this much easier to read:
sed '/^g/ s_g_s_g'

Er, I take that back. It's hopeless. There is a lesson here: Use comments liberally in a sed script under SunOS. You may have to remove the comments to run the script under a different operating system, but you now know how to write a sed script to do that very easily! Comments are a Good Thing. You may have understood the script perfectly when you wrote it. But six months from now it could look like modem noise.

Ranges by line number


You can specify a range on line numbers by inserting a comma between the numbers. To restrict a substitution to the first 100 lines, you can use:
sed '1,100 s/A/a/'

If you know exactly how many lines are in a file, you can explicitly state that number to perform the substitution on the rest of the file. In this case, assume you used wc to find out there are 532 lines in the file:
sed '101,532 s/A/a/'

An easier way is to use the special character "$," which means the last line in the file.
sed '101,$ s/A/a/'

The "$" is one of those conventions that mean "last" in utilities like cat -e, vi, and ed. Line numbers are cumulative if several files are edited. That is,
sed '200,300 s/A/a/' f1 f2 f3 >new

is the same as
cat f1 f2 f3 | sed '200,300 s/A/a/' >new

Ranges by patterns
You can specify two regular expressions as the range. Assuming a "#" starts a comment, you can search for a keyword, remove all comments until you see the second keyword. In this case the two keywords are "start" and "stop:"
sed '/start/,/stop/ s/#.*//'
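A small invented example shows the range in action; comments outside the start/stop region survive:

```shell
# Comments are stripped only on lines from "start" through "stop":
printf 'x # keep\nstart\ny # gone\nstop\nz # keep\n' |
  sed '/start/,/stop/ s/#.*//'
# note "y # gone" becomes "y " (the trailing space is left behind)
```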

The first pattern turns on a flag that tells sed to perform the substitute command on every line. The second pattern turns off the flag. If the "start" and "stop" pattern occurs twice, the substitution is done both times. If the "stop" pattern is missing, the flag is never turned off, and the substitution will be performed on every line until the end of the file. You should know that if the "start" pattern is found, the substitution occurs on the same line that contains "start." This turns on a switch, which is line oriented. That is, the next line is read and the substitute command is checked. If it contains "stop" the switch is turned off. Switches are line oriented, and not word oriented. You can combine line numbers and regular expressions. This example will remove comments from the beginning of the file until it finds the keyword "start:"
sed -e '1,/start/ s/#.*//'

This example will remove comments everywhere except the lines between the two keywords:
sed -e '1,/start/ s/#.*//' -e '/stop/,$ s/#.*//'

The last example has a range that overlaps the "/start/,/stop/" range, as both ranges operate on the lines that contain the keywords. I will show you later how to restrict a command up to, but not including the line containing the specified pattern. Before I start discussing the various commands, I should explain that some commands cannot operate on a range of lines. I will let you know when I mention the commands. In this next section I will describe three commands, one of which cannot operate on a range.

Delete with d
Using ranges can be confusing, so you should expect to do some experimentation when you are trying out a new script. A useful command deletes every line that matches the restriction: "d." If you want to look at the first 10 lines of a file, you can use:
sed '11,$ d' <file

which is similar in function to the head command. If you want to chop off the header of a mail message, which is everything up to the first blank line, use:
sed '1,/^$/ d' <file
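A concrete run of the delete command on a small numbered sample (made up for illustration) shows the head-like behavior described above:

```shell
# Delete everything from line 4 to the end -- behaves like "head -3":
printf '1\n2\n3\n4\n5\n' | sed '4,$ d'
```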

You can duplicate the function of the tail command, assuming you know the length of a file. Wc can count the lines, and expr can subtract 10 from the number of lines. A Bourne shell script to look at the last 10 lines of a file might look like this:

#!/bin/sh
# print last 10 lines of file
# First argument is the filename
lines=`wc -l $1 | awk '{print $1}' `
start=`expr $lines - 10`
sed "1,$start d" $1

Click here to get file: sed_tail.sh

The range for deletions can be regular expression pairs to mark the begin and end of the operation. Or it can be a single regular expression. Deleting all lines that start with a "#" is easy:
sed '/^#/ d'

Removing comments and blank lines takes two commands. The first removes every character from the "#" to the end of the line, and the second deletes all blank lines:
sed -e 's/#.*//' -e '/^$/ d'

A third one should be added to remove all blanks and tabs immediately before the end of line:
sed -e 's/#.*//' -e 's/[ ^I]*$//' -e '/^$/ d'
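Since the "^I" stands for a real tab, one way to write the same three-step pipeline without typing a literal tab is to let printf supply it (the sample input here is made up):

```shell
# Remove comments, then trailing spaces/tabs, then blank lines.
tab=$(printf '\t')
printf 'code # note\n   # just a comment\nplain\n' |
  sed -e 's/#.*//' -e "s/[ $tab]*\$//" -e '/^$/ d'
# the comment-only line vanishes entirely; "code" keeps no trailing blank
```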

The character "^I" is a CTRL-I or tab character. You would have to explicitly type in the tab. Note the order of operations above, which is in that order for a very good reason. Comments might start in the middle of a line, with white space characters before them. Therefore comments are first removed from a line, potentially leaving white space characters that were before the comment. The second command removes all trailing blanks, so that lines that are now blank are converted to empty lines. The last command deletes empty lines. Together, the three commands remove all lines containing only comments, tabs or spaces. This demonstrates the pattern space sed uses to operate on a line. The actual operation sed uses is:

Copy the input line into the pattern space.
Apply the first sed command on the pattern space, if the address restriction is true.
Repeat with the next sed expression, again operating on the pattern space.
When the last operation is performed, write out the pattern space and read in the next line from the input file.

Printing with p
Another useful command is the print command: "p." If sed wasn't started with an "-n" option, the "p" command will duplicate the input. The command

sed 'p'

will duplicate every line. If you wanted to double every empty line, use:
sed '/^$/ p'

Adding the "-n" option turns off printing unless you request it. Another way of duplicating head's functionality is to print only the lines you want. This example prints the first 10 lines:
sed -n '1,10 p' <file

Sed can act like grep by combining the print operator to function on all lines that match a regular expression:
sed -n '/match/ p'

which is the same as:


grep match

Reversing the restriction with !


Sometimes you need to perform an action on every line except those that match a regular expression, or those outside of a range of addresses. The "!" character, which often means not in Unix utilities, inverts the address restriction. You remember that
sed -n '/match/ p'

acts like the grep command. The "-v" option to grep prints all lines that don't contain the pattern. Sed can do this with
sed -n '/match/ !p' </tmp/b
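On a made-up three-line input, the inverted restriction prints only the non-matching lines:

```shell
# Print only lines that do NOT contain "match" (like grep -v):
printf 'keep\nmatch me\nkeep too\n' | sed -n '/match/ !p'
```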

Relationships between d, p, and !


As you may have noticed, there are often several ways to solve the same problem with sed. This is because print and delete are opposite functions, and it appears that "!p" is similar to "d," while "!d" is similar to "p." I wanted to test this, so I created a 20 line file, and tried every different combination. The following table, which shows the results, demonstrates the difference:
Relations between d, p, and !

Sed     Range   Command   Results
---------------------------------------------------------
sed -n  1,10    p         Print first 10 lines
sed -n  11,$    !p        Print first 10 lines
sed     1,10    !d        Print first 10 lines
sed     11,$    d         Print first 10 lines
---------------------------------------------------------
sed -n  1,10    !p        Print last 10 lines
sed -n  11,$    p         Print last 10 lines
sed     1,10    d         Print last 10 lines
sed     11,$    !d        Print last 10 lines
---------------------------------------------------------
sed -n  1,10    d         Nothing printed
sed -n  1,10    !d        Nothing printed
sed -n  11,$    d         Nothing printed
sed -n  11,$    !d        Nothing printed
---------------------------------------------------------
sed     1,10    p         Print first 10 lines twice, then next 10 lines once
sed     11,$    !p        Print first 10 lines twice, then last 10 lines once
---------------------------------------------------------
sed     1,10    !p        Print first 10 lines once, then last 10 lines twice
sed     11,$    p         Print first 10 lines once, then last 10 lines twice

This table shows that the following commands are identical:


sed -n '1,10 p'
sed -n '11,$ !p'
sed '1,10 !d'
sed '11,$ d'

It also shows that the "!" command "inverts" the address range, operating on the other lines.

The q or quit command


There is one more simple command that can restrict the changes to a set of lines. It is the "q" command: quit. The third way to duplicate the head command is:
sed '11 q'

which quits when the eleventh line is reached. This command is most useful when you wish to abort the editing after some condition is reached. The "q" command is the one command that does not take a range of addresses. Obviously the command
sed '1,10 q'

cannot quit 10 times. Instead


sed '1 q'

or

sed '10 q'

is correct.
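A quick check of the quit command on invented sample input:

```shell
# "3 q" prints through line 3, then stops reading input -- like head -3:
printf 'a\nb\nc\nd\ne\n' | sed '3 q'
```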

Grouping with { and }


The curly braces, "{" and "}," are used to group the commands. Hardly worth the build up. All that prose and the solution is just matching squiggles. Well, there is one complication. Since each sed command must start on its own line, the curly braces and the nested sed commands must be on separate lines. Previously, I showed you how to remove comments starting with a "#." If you wanted to restrict the removal to lines between special "begin" and "end" key words, you could use:
#!/bin/sh
# This is a Bourne shell script that removes #-type comments
# between 'begin' and 'end' words.
sed -n '
/begin/,/end/ {
    s/#.*//
    s/[ ^I]*$//
    /^$/ d
    p
}
'
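On a small fabricated input the effect looks like this (a space-only bracket expression is used here since the sample contains no tabs):

```shell
# Strip comments and blank lines, printing only the begin..end region:
printf 'a # keep\nbegin\nb # gone\nend\nc # keep\n' | sed -n '
/begin/,/end/ {
  s/#.*//
  s/[ ]*$//
  /^$/ d
  p
}
'
```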

Click here to get file: sed_begin_end.sh

These braces can be nested, which allows you to combine address ranges. You could perform the same action as before, but limit the change to the first 100 lines:

#!/bin/sh
# This is a Bourne shell script that removes #-type comments
# between 'begin' and 'end' words.
sed -n '
1,100 {
    /begin/,/end/ {
        s/#.*//
        s/[ ^I]*$//
        /^$/ d
        p
    }
}
'

Click here to get file: sed_begin_end1.sh

You can place a "!" before a set of curly braces. This inverts the address, which removes comments from all lines except those between the two reserved words:

#!/bin/sh
sed '
/begin/,/end/ !{
    s/#.*//
    s/[ ^I]*$//
    /^$/ d
    p
}
'

Click here to get file: sed_begin_end2.sh

Writing a file with the 'w' command


You may remember that the substitute command can write to a file. Here again is the example that will only write lines that start with an even number (and followed by a space):
sed -n 's/^[0-9]*[02468] /&/w even' <file

I used the "&" in the replacement part of the substitution command so that the line would not be changed. A simpler example is to use the "w" command, which has the same syntax as the "w" flag in the substitute command:
sed -n '/^[0-9]*[02468]/ w even' <file

Remember - only one space must follow the command. Anything else will be considered part of the file name. The "w" command also has the same limitation as the "w" flag: only 10 files can be opened in sed.
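To see the "w" command work, write the matching lines to a scratch file (/tmp/even is just an illustrative name, and the input is made up):

```shell
# Copy lines starting with an even number (followed by a space)
# into the file /tmp/even:
printf '1 odd\n2 even\n3 odd\n4 even\n' |
  sed -n '/^[0-9]*[02468] / w /tmp/even'
cat /tmp/even
```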

Reading in a file with the 'r' command


There is also a command for reading files. The command
sed '$r end' <in >out

will append the file "end" at the end of the file (address "$"). The following will insert a file after the line with the word "INCLUDE:"
sed '/INCLUDE/ r file' <in >out

You can use the curly braces to delete the line having the "INCLUDE" command on it:
#!/bin/sh
sed '/INCLUDE/ {
    r file
    d
}'

Click here to get file: sed_include.sh

The order of the delete command "d" and the read file command "r" is important. Change the order and it will not work. There are two subtle actions that prevent this from working. The first is the "r" command writes the file to the output stream. The file is not inserted into the pattern space, and therefore cannot be modified by any command. Therefore the delete command does not affect the data read from the file. The other subtlety is the "d" command deletes the current data in the pattern space. Once all of the data is deleted, it makes sense that no other action will be attempted. Therefore a "d" command executed in a curly brace also aborts all further actions. As an example, the substitute command below is never executed:
#!/bin/sh
# this example is WRONG
sed -e '1 {
    d
    s/.*//
}'

Click here to get file: sed_bad_example.sh

The earlier example is a crude version of the C preprocessor program. The file that is included has a predetermined name. It would be nice if sed allowed a variable (e.g. "\1") instead of a fixed file name. Alas, sed doesn't have this ability. You could work around this limitation by creating sed commands on the fly, or by using shell quotes to pass variables into the sed script. Suppose you wanted to create a command that would include a file like cpp, but the filename is an argument to the script. An example of this script is:
% include 'sys/param.h' <file.c >file.c.new

A shell script to do this would be:


#!/bin/sh
# watch out for a '/' in the parameter
# use alternate search delimiter
sed -e '\_#INCLUDE <'"$1"'>_{
    r '"$1"'
    d
}'

Let me elaborate. If you had a file that contains


Test first file
#INCLUDE <file1>
Test second file
#INCLUDE <file2>

you could use the command


sed_include1.sh file1 <input | sed_include1.sh file2

to include the specified files.

Click here to get file: sed_include1.sh

SunOS and the # Comment Command


As we dig deeper into sed, comments will make the commands easier to follow. Most versions of sed only allow one line as a comment, and it must be the first line. SunOS (and GNU's sed) allows more than one comment, and these comments don't have to be first. The last example could be:
#!/bin/sh
# watch out for a '/' in the parameter
# use alternate search delimiter
sed -e '\_#INCLUDE <'"$1"'>_{
# read the file
    r '"$1"'
# delete any characters in the pattern space
# and read the next line in
    d
}'

Click here to get file: sed_include2.sh

Adding, Changing, Inserting new lines


Sed has three commands used to add new lines to the output stream. Because an entire line is added, the new text is placed on a line by itself to emphasize this. There is no option for adding a partial line; an entire line is used, and it must be on its own line. If you are familiar with many unix utilities, you would expect sed to use a similar convention: lines are continued by ending the previous line with a "\". The syntax of these commands is finicky, like the "r" and "w" commands.

Append a line with 'a'


The "a" command appends a line after the range or pattern. This example will add a line after every line with "WORD:"
#!/bin/sh
sed '
/WORD/ a\
Add this line after every line with WORD
'

Click here to get file: sed_add_line_after_word.sh

You could eliminate two lines in the shell script if you wish:

#!/bin/sh
sed '/WORD/ a\
Add this line after every line with WORD'

Click here to get file: sed_add_line_after_word1.sh

I prefer the first form because it's easier to add a new command by adding a new line and because the intent is clearer. There must not be a space after the "\".
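Fed a two-line invented sample, the append command places the new text only after the matching line:

```shell
printf 'has WORD here\nplain\n' | sed '/WORD/ a\
appended after the match'
# output: the WORD line, then the appended text, then "plain"
```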

Insert a line with 'i'


You can insert a new line before the pattern with the "i" command:
#!/bin/sh
sed '
/WORD/ i\
Add this line before every line with WORD
'

Click here to get file: sed_add_line_before_word.sh

Change a line with 'c'


You can change the current line with a new line.
#!/bin/sh
sed '
/WORD/ c\
Replace the current line with the line
'

Click here to get file: sed_change_line.sh

A "d" command followed by an "a" command won't work, as I discussed earlier. The "d" command would terminate the current actions. You can combine all three actions using curly braces:
#!/bin/sh
sed '
/WORD/ {
i\
Add this line before
a\
Add this line after
c\
Change the line to this one
}'

Click here to get file: sed_insert_append_change.sh

Leading tabs and spaces in a sed script


Sed ignores leading tabs and spaces in all commands. However these white space characters may or may not be ignored if they start the text following a "a," "c" or "i" command. In SunOS, both "features" are available. The Berkeley (and Linux) style sed is in /usr/bin, and the AT&T version (System V) is in /usr/5bin/. To elaborate, the /usr/bin/sed command retains white space, while the /usr/5bin/sed strips off leading spaces. If you want to keep leading spaces, and not care about which version of sed you are using, put a "\" as the first character of the line:
#!/bin/sh
sed '
a\
\	This line starts with a tab
'

Adding more than one line


All three commands will allow you to add more than one line. Just end each line with a "\:"
#!/bin/sh
sed '
/WORD/ a\
Add this line\
This line\
And this line
'

Adding lines and the pattern space


I have mentioned the pattern space before. Most commands operate on the pattern space, and subsequent commands may act on the results of the last modification. The three previous commands, like the read file command, add the new lines to the output stream, bypassing the pattern space.

Address ranges and the above commands


You may remember in my last tutorial I warned you that some commands can take a range of lines, and others cannot. To be precise, the commands "a," "i," "r," and "q" will not take a range like "1,100" or "/begin/,/end/." The documentation states that the read command can take a range, but I get an error when I try this. The "c" or change command allows this, and it will let you change several lines into one:
#!/bin/sh
sed '
/begin/,/end/ c\
***DELETED***
'

If you need to do this, you can use the curly braces, as that will let you perform the operation on every line:
#!/bin/sh
# add a blank line after every line
sed '1,$ {
    a\

}'

Multi-Line Patterns
Most UNIX utilities are line oriented. Regular expressions are line oriented. Searching for patterns that cover more than one line is not an easy task. (Hint: It will be very shortly.) Sed reads in a line of text, performs commands which may modify the line, and outputs the modified line if desired. The main loop of a sed script looks like this:

1. The next line is read from the input file and placed in the pattern space. If the end of file is found, and if there are additional files to read, the current file is closed, the next file is opened, and the first line of the new file is placed into the pattern space.
2. The line count is incremented by one. Opening a new file does not reset this number.
3. Each sed command is examined. If there is a restriction placed on the command, and the current line in the pattern space meets that restriction, the command is executed. Some commands, like "n" or "d," cause sed to go to the top of the loop. The "q" command causes sed to stop. Otherwise the next command is examined.
4. After all of the commands are examined, the pattern space is output unless sed has the optional "-n" argument.

The restriction before the command determines if the command is executed. If the restriction is a pattern, and the operation is the delete command, then the following will delete all lines that have the pattern:
/PATTERN/ d

If the restriction is a pair of numbers, then the deletion will happen if the line number is equal to the first number or greater than the first number and less than or equal to the last number:
10,20 d

If the restriction is a pair of patterns, there is a variable that is kept for each of these pairs. If the variable is false and the first pattern is found, the variable is made true. If the variable is true, the command is executed. If the variable is true, and the last pattern is on the line, after the command is executed the variable is turned off:
/begin/,/end/ d

Whew! That was a mouthful. If you have read carefully up to here, you should have breezed through this. You may want to refer back, because I covered several subtle points. My choice of words was deliberate. It covers some unusual cases, like:
# what happens if the second number
# is less than the first number?
sed -n '20,1 p' file

and
# generate a 10 line file with line numbers
# and see what happens when two patterns overlap
yes | head -10 | cat -n | \
sed -n -e '/1/,/7/ p' -e '/5/,/9/ p'

Enough mental punishment. Here is another review, this time in a table format. Assume the input file contains the following lines:
AB
CD
EF
GH
IJ

When sed starts up, the first line is placed in the pattern space. The next line is "CD." The operations of the "n," "d," and "p" commands can be summarized as:
+---------+-------+---------+-----------+-------------+----------+
| Pattern | Next  | Command | Output    | New Pattern | New Next |
| Space   | Input |         |           | Space       | Input    |
+---------+-------+---------+-----------+-------------+----------+
| AB      | CD    | n       | <default> | CD          | EF       |
| AB      | CD    | d       |           | CD          | EF       |
| AB      | CD    | p       | AB        | CD          | EF       |
+---------+-------+---------+-----------+-------------+----------+

The "n" command may or may not generate output depending upon the existence of the "-n" flag. That review is a little easier to follow, isn't it? Before I jump into multi-line patterns, I wanted to cover three more commands:

Print line number with =


The "=" command prints the current line number to standard output. One way to find out the line numbers that contain a pattern is to use:
# add line numbers first,
# then use grep,
# then just print the number
cat -n file | grep 'PATTERN' | awk '{print $1}'

The sed solution is:


sed -n '/PATTERN/ =' file

Earlier I used the following to find the number of lines in a file


#!/bin/sh
lines=`wc -l file | awk '{print $1}' `

Using the "=" command can simplify this:


#!/bin/sh
lines=`sed -n '$=' file `

The "=" command only accepts one address, so if you want to print the number for a range of lines, you must use the curly braces:
#!/bin/sh
# Just print the line numbers
sed -n '/begin/,/end/ {
    =
    d
}' file

Since the "=" command only prints to standard output, you cannot print the line number on the same line as the pattern. You need to edit multi-line patterns to do this.

Transform with y
If you wanted to change a word from lower case to upper case, you could write 26 character substitutions, converting "a" to "A," etc. Sed has a command that operates like the tr program. It is called the "y" command. For instance, to change the letters "a" through "f" into their upper case form, use:
sed 'y/abcdef/ABCDEF/' file
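On an invented sample, the transliteration happens character by character, leaving everything outside the list alone:

```shell
# Map each of a-f to its upper case form:
printf 'deadbeef 123\n' | sed 'y/abcdef/ABCDEF/'
# -> "DEADBEEF 123"
```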

I could have used an example that converted all 26 letters into upper case, and while this column covers a broad range of topics, the "column" prefers a narrower format. If you wanted to convert a line that contained a hexadecimal number (e.g. 0x1aff) to upper case (0x1AFF), you could use:
sed '/0x[0-9a-zA-Z]*/ y/abcdef/ABCDEF/' file

This works fine if there are only numbers in the file. If you wanted to change the second word in a line to upper case, you are out of luck - unless you use multi-line editing. (Hey - I think there is some sort of theme here!)

Displaying control characters with a l


The "l" command prints the current pattern space. It is therefore useful in debugging sed scripts. It also converts unprintable characters into printing characters by outputting the value in octal preceded by a "\" character. I found it useful to print out the current pattern space, while probing the subtleties of sed.

Working with Multiple Lines


There are three new commands used in multiple-line patterns: "N," "D," and "P." I will explain their relation to the matching "n," "d," and "p" single-line commands. The "n" command will print out the current pattern space (unless the "-n" flag is used), empty the current pattern space, and read in the next line of input. The "N" command does not print out the current pattern space and does not empty the pattern space. It reads in the next line, but appends a new line character along with the input line itself to the pattern space.

The "d" command deletes the current pattern space, reads in the next line, puts the new line into the pattern space, aborts the current command, and starts execution at the first sed command. This is called starting a new "cycle." The "D" command deletes the first portion of the pattern space, up to the new line character, leaving the rest of the pattern alone. Like "d," it stops the current command and starts the command cycle over again. However, it will not print the current pattern space. You must print it yourself, a step earlier. If the "D" command is executed with a group of other commands in a curly brace, commands after the "D" command are ignored. The next group of sed commands is executed, unless the pattern space is emptied. If this happens, the cycle is started from the top and a new line is read. The "p" command prints the entire pattern space. The "P" command only prints the first part of the pattern space, up to the NEWLINE character. Some examples might demonstrate that "N" by itself isn't very useful. The filter
sed -e 'N'

doesn't modify the input stream. Instead, it combines the first and second line, then prints them, combines the third and fourth line, and prints them, etc. It does allow you to use a new "anchor" character: "\n." This matches the new line character that separates multiple lines in the pattern space. If you wanted to search for a line that ended with the character "#," and append the next line to it, you could use
#!/bin/sh
sed '
# look for a "#" at the end of the line
/#$/ {
# Found one - now read in the next line
    N
# delete the "#" and the new line character
    s/#\n//
}' file
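Running the same idea on a tiny fabricated input joins the flagged line to its successor:

```shell
# Join any line ending in "#" with the line after it:
printf 'one#\ntwo\nthree\n' | sed '
/#$/ {
N
s/#\n//
}'
# -> "onetwo" then "three"
```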

You could search for two lines containing "ONE" and "TWO" and only print out the two consecutive lines:
#!/bin/sh
sed -n '
/ONE/ {
# found "ONE" - read in next line
    N
# look for "TWO" on the second line
# and print if there.
    /\n.*TWO/ p
}' file

The next example would delete everything between "ONE" and "TWO:"
#!/bin/sh
sed '
/ONE/ {
# append a line
    N
# search for TWO on the second line
    /\n.*TWO/ {
# found it - now edit making one line
        s/ONE.*\n.*TWO/ONE TWO/
    }
}' file

You can either search for a particular pattern on two consecutive lines, or you can search for two consecutive words that may be split on a line boundary. The next example will look for two words which are either on the same line or one is on the end of a line and the second is on the beginning of the next line. If found, the first word is deleted:
#!/bin/sh
sed '
/ONE/ {
# append a line
    N
# "ONE TWO" on same line
    s/ONE TWO/TWO/
# "ONE
# TWO" on two consecutive lines
    s/ONE\nTWO/TWO/
}' file

Let's use the "D" command, and if we find a line containing "TWO" immediately after a line containing "ONE," then delete the first line:
#!/bin/sh
sed '
/ONE/ {
# append a line
    N
# if TWO found, delete the first line
    /\n.*TWO/ D
}' file

Click here to get file: sed_delete_line_after_word.sh

If we wanted to print the first line instead of deleting it, and not print every other line, change the "D" to a "P" and add a "-n" as an argument to sed:
#!/bin/sh
sed -n '
# by default - do not print anything
/ONE/ {
# append a line
    N
# if TWO found, print the first line
    /\n.*TWO/ P
}' file

Click here to get file: sed_print_line_after_word.sh

It is very common to combine all three multi-line commands. The typical order is "N," "P" and lastly "D." This one will delete everything between "ONE" and "TWO" if they are on one or two consecutive lines:
#!/bin/sh
sed '
/ONE/ {
# append the next line
    N
# look for "ONE" followed by "TWO"
    /ONE.*TWO/ {
# delete everything between
        s/ONE.*TWO/ONE TWO/
# print
        P
# then delete the first line
        D
    }
}' file

Click here to get file: sed_delete_between_two_words.sh

Earlier I talked about the "=" command, and using it to add line numbers to a file. You can use two invocations of sed to do this (although it is possible to do it with one, but that must wait until next section). The first sed command will output a line number on one line, and then print the line on the next line. The second invocation of sed will merge the two lines together:
#!/bin/sh
sed '=' file | \
sed '{
    N
    s/\n/ /
}'
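With a two-line fabricated sample, the pair of invocations produces numbered lines:

```shell
# First sed emits the line number on its own line; the second
# merges each number with the following text line:
printf 'alpha\nbeta\n' | sed '=' | sed '
N
s/\n/ /'
# -> "1 alpha" and "2 beta"
```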

Click here to get file: sed_merge_two_lines.sh

If you find it necessary, you can break one line into two lines, edit them, and merge them together again. As an example, if you had a file that had a hexadecimal number followed by a word, and you wanted to convert the first word to all upper case, you can use the "y" command, but you must first split the line into two lines, change one of the two, and merge them together. That is, a line containing

0x1fff table2

will be changed into two lines:


0x1fff
table2

and the first line will be converted into upper case. I will use tr to convert the space into a new line:
#!/bin/sh
tr ' ' '\012' <file | sed '
{
    y/abcdef/ABCDEF/
    N
    s/\n/ /
}'

Click here to get file: sed_split.sh

It isn't obvious, but sed could be used instead of tr. You can embed a new line in a substitute command, but you must escape it with a backslash. It is unfortunate that you must use "\n" in the left side of a substitute command, and an embedded new line in the right hand side. Heavy sigh. Here is the example:
#!/bin/sh
sed '
s/ /\
/' | \
sed '
{
    y/abcdef/ABCDEF/
    N
    s/\n/ /
}'

Click here to get file: sed_split_merge.sh

Sometimes I add a special character as a marker, and look for that character in the input stream. When found, it indicates the place a blank used to be. A backslash is a good character, except it must be escaped with a backslash, and makes the sed script obscure. Save it for that guy who keeps asking dumb questions. The sed script to change a blank into a "\" followed by a new line would be:

#!/bin/sh
sed 's/ /\\\
/' file

Click here to get file: sed_addslash_before_blank.sh

Yeah. That's the ticket. Or use the C shell and really confuse him!

#!/bin/csh -f
sed '\
s/ /\\\\
/' file

Click here to get file: sed_addslash_before_blank.csh

A few more examples of that, and he'll never ask you a question again! I think I'm getting carried away. I'll summarize with a chart that covers the features we've talked about:
+---------+-------+---------+-----------+-------------+----------+
| Pattern | Next  | Command | Output    | New Pattern | New Next |
| Space   | Input |         |           | Space       | Input    |
+---------+-------+---------+-----------+-------------+----------+
| AB      | CD    | n       | <default> | CD          | EF       |
| AB      | CD    | N       |           | AB\nCD      | EF       |
| AB      | CD    | d       |           | CD          | EF       |
| AB      | CD    | D       |           | CD          | EF       |
| AB      | CD    | p       | AB        | CD          | EF       |
| AB      | CD    | P       | AB        | CD          | EF       |
+---------+-------+---------+-----------+-------------+----------+
| AB\nCD  | EF    | n       | <default> | EF          | GH       |
| AB\nCD  | EF    | N       |           | AB\nCD\nEF  | GH       |
| AB\nCD  | EF    | d       |           | EF          | GH       |
| AB\nCD  | EF    | D       |           | CD          | EF       |
| AB\nCD  | EF    | p       | AB\nCD    | AB\nCD      | EF       |
| AB\nCD  | EF    | P       | AB        | AB\nCD      | EF       |
+---------+-------+---------+-----------+-------------+----------+

Using newlines in sed scripts


Occasionally one wishes to use a new line character in a sed script. There are some subtle issues here. If one wants to search for a new line, one has to use "\n." Here is an example where you search for a phrase, and delete the new line character after that phrase - joining two lines together.
(echo a;echo x;echo y) | sed '/x$/ {
    N
    s:x\n:x:
}'

which generates

a
xy

However, if you are inserting a new line, don't use "\n" - instead insert a literal new line character:
(echo a;echo x;echo y) | sed 's:x:X\
:'

generates
a
X

y

The Hold Buffer


So far we have talked about three concepts of sed: (1) The input stream or data before it is modified, (2) the output stream or data after it has been modified, and (3) the pattern space, or buffer containing characters that can be modified and sent to the output stream. There is one more "location" to be covered: the hold buffer or hold space. Think of it as a spare pattern buffer. It can be used to "copy" or "remember" the data in the pattern space for later. There are five commands that use the hold buffer.

Exchange with x
The "x" command eXchanges the pattern space with the hold buffer. By itself, the command isn't useful. Executing the sed command
sed 'x'

as a filter adds a blank line in the front, and deletes the last line. It looks like it didn't change the input stream significantly, but the sed command is modifying every line. The hold buffer starts out containing a blank line. When the "x" command modifies the first line, line 1 is saved in the hold buffer, and the blank line takes the place of the first line. The second "x" command exchanges the second line with the hold buffer, which contains the first line. Each subsequent line is exchanged with the preceding line. The last line is placed in the hold buffer, and is not exchanged a second time, so it remains in the hold buffer when the program terminates, and never gets printed. This illustrates that care must be taken when storing data in the hold buffer, because it won't be output unless you explicitly request it.
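To see this behavior concretely, here is a small demonstration (my example, GNU sed assumed): three input lines become a blank line plus the first two lines, and the last line stays behind in the hold buffer.

```shell
# Feed three lines through 'x'. The hold buffer starts empty, so the
# first exchange emits a blank line; the final line ("three") is left
# in the hold buffer and never printed.
printf 'one\ntwo\nthree\n' | sed 'x'
```

This prints an empty line, then "one", then "two" - "three" is lost, exactly as described above.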

Example of Context Grep

One use of the hold buffer is to remember previous lines. An example of this is a utility that acts like grep as it shows you the lines that match a pattern. In addition, it shows you the line before and after the pattern. That is, if line 8 contains the pattern, this utility would print lines 7, 8 and 9. One way to do this is to see if the line has the pattern. If it does not have the pattern, put the current line in the hold buffer. If it does, print the line in the hold buffer, then the current line, and then the next line. After each set, three dashes are printed. The script checks for the existence of an argument, and if missing, prints an error. Passing the argument into the sed script is done by turning off the single quote mechanism, inserting the "$1" into the script, and starting up the single quote again:

#!/bin/sh
# grep3 - prints out three lines around pattern
# if there is only one argument, exit
case $# in
1);;
*) echo "Usage: $0 pattern";exit;;
esac;
# I hope the argument doesn't contain a /
# if it does, sed will complain
# use sed -n to disable printing
# unless we ask for it
sed -n '
'/$1/' !{
	#no match - put the current line in the hold buffer
	x
	# delete the old one, which is
	# now in the pattern buffer
	d
}
'/$1/' {
	# a match - get last line
	x
	# print it
	p
	# get the original line back
	x
	# print it
	p
	# get the next line
	n
	# print it
	p
	# now add three dashes as a marker
	a\
---
	# now put this line into the hold buffer
	x
}'

Click here to get file: grep3.sh

You could use this to show the three lines around a keyword, i.e.:
grep3 vt100 </etc/termcap

Hold with h or H
The "x" command exchanges the hold buffer and the pattern buffer. Both are changed. The "h" command copies the pattern buffer into the hold buffer. The pattern buffer is unchanged. An identical script to the above uses the hold commands:

#!/bin/sh
# grep3 version b - another version using the hold commands
# if there is only one argument, exit
case $# in
1);;
*) echo "Usage: $0 pattern";exit;;
esac;
# again - I hope the argument doesn't contain a /
# use sed -n to disable printing
sed -n '
'/$1/' !{
	# put the non-matching line in the hold buffer
	h
}
'/$1/' {
	# found a line that matches
	# append it to the hold buffer
	H
	# the hold buffer contains 2 lines
	# get the next line
	n
	# and add it to the hold buffer
	H
	# now print it back to the pattern space
	x
	# and print it.
	p
	# add the three hyphens as a marker
	a\
---
}'

Click here to get file: grep3a.sh

Keeping more than one line in the hold buffer


The "H" command allows you to combine several lines in the hold buffer. It acts like the "N" command in that lines are appended to the buffer, with a "\n" between the lines. You can save several lines in the hold buffer, and print them only if a particular pattern is found later. As an example, take a file that uses a space as the first character of a line as a continuation character. The files /etc/termcap, /etc/printcap, makefile and mail messages use spaces or tabs to indicate a continuation of an entry. If you wanted to print the entry before a word, you could use this script. I use a "^I" to indicate an actual tab character:

#!/bin/sh
# print previous entry
sed -n '
/^[ ^I]/!{
	# line does not start with a space or tab,
	# does it have the pattern we are interested in?
	'/$1/' {
		# yes it does. print three dashes
		i\
---
		# get hold buffer, save current line
		x
		# now print what was in the hold buffer
		p
		# get the original line back
		x
	}
	# store it in the hold buffer
	h
}
# what about lines that start
# with a space or tab?
/^[ ^I]/ {
	# append it to the hold buffer
	H
}'

Click here to get file: grep_previous.sh

You can also use the "H" command to extend the context grep. In this example, the program prints out the two lines before the pattern, instead of a single line. The method used to limit this to two lines is to use the "s" command to keep one newline, deleting any extra lines. I call it grep4:

#!/bin/sh
# grep4: prints out 4 lines around pattern
# if there is only one argument, exit
case $# in
1);;
*) echo "Usage: $0 pattern";exit;;
esac;
sed -n '
'/$1/' !{
	# does not match - add this line to the hold space
	H
	# bring it back into the pattern space
	x
	# Two lines would look like .*\n.*
	# Three lines look like .*\n.*\n.*
	# Delete extra lines - keep two
	s/^.*\n\(.*\n.*\)$/\1/
	# now put the two lines (at most) into
	# the hold buffer again
	x
}
'/$1/' {
	# matches - append the current line
	H
	# get the next line
	n
	# append that one also
	H
	# bring it back, but keep the current line in
	# the hold buffer. This is the line after the pattern,
	# and we want to place it in hold in case the next line
	# has the desired pattern
	x
	# print the 4 lines
	p
	# add the mark
	a\
---
}'

Click here to get file: grep4.sh

You can modify this to print any number of lines around a pattern. As you can see, you must remember what is in the hold space, and what is in the pattern space. There are other ways to write the same routine.
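As a minimal illustration of how "H" accumulates lines with embedded newlines (my own sketch, not one of the scripts above): collect every line in the hold buffer and, on the last line, swap it in and join the lines with commas. The leading comma that gets stripped comes from the hold buffer starting out empty.

```shell
# H appends each line to the hold buffer, separated by \n.
# On the last line ($), x swaps the accumulated lines into the
# pattern space, the newlines become commas, and the leading
# comma (from the initially empty hold buffer) is stripped.
printf '1\n2\n3\n' | sed -n 'H;${x;s/\n/,/g;s/^,//;p}'
```

This prints "1,2,3", showing all three lines stored in the single hold buffer.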

Get with g or G

Instead of exchanging the hold space with the pattern space, you can copy the hold space to the pattern space with the "g" command. This overwrites the current contents of the pattern space. If you want to append to the pattern space, use the "G" command. This appends a newline to the pattern space, and copies the hold space after the newline. Here is another version of the "grep3" command. It works just like the previous one, but is implemented differently. This illustrates that sed has more than one way to solve many problems. What is important is that you understand your problem, and document your solution:

#!/bin/sh
# grep3 version c: use 'G' instead of H
# if there is only one argument, exit
case $# in
1);;
*) echo "Usage: $0 pattern";exit;;
esac;
# again - I hope the argument doesn't contain a /
sed -n '
'/$1/' !{
	# put the non-matching line in the hold buffer
	h
}
'/$1/' {
	# found a line that matches
	# add the next line to the pattern space
	N
	# exchange the previous line with the
	# 2 in pattern space
	x
	# now add the two lines back
	G
	# and print it.
	p
	# add the three hyphens as a marker
	a\
---
	# remove first 2 lines
	s/.*\n.*\n\(.*\)$/\1/
	# and place in the hold buffer for next time
	h
}'

Click here to get file: grep3c.sh
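A minimal sketch of "h" and "G" working together (my example, not one of the tutorial's scripts): save each line with "h", then append the saved copy back with "G" to double every line.

```shell
# h copies the pattern space into the hold buffer;
# G appends a newline plus the hold buffer to the pattern space.
# The net effect is that every input line is printed twice.
printf 'alpha\nbeta\n' | sed 'h;G'
```

This prints "alpha" twice, then "beta" twice - the two-copies trick the next section exploits.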

The "G" command makes it easy to have two copies of a line. Suppose you wanted to convert the first hexadecimal number to uppercase, and didn't want to use the script I described in an earlier column:
#!/bin/sh
# change the first hex number to upper case format
# uses sed twice
# used as a filter
# convert2uc <in >out
sed '
s/ /\
/' | \
sed '
{
	y/abcdef/ABCDEF/
	N
	s/\n/ /
}'

Click here to get file: convert2uc.sh

Here is a solution that does not require two invocations of sed:
#!/bin/sh
# convert2uc version b
# change the first hex number to upper case format
# uses sed once
# used as a filter
# convert2uc <in >out
sed '
{
	# remember the line
	h
	#change the current line to upper case
	y/abcdef/ABCDEF/
	# add the old line back
	G
	# Keep the first word of the first line,
	# and second word of the second line
	# with one humongous regular expression
	s/^\([^ ]*\) .*\n[^ ]* \(.*\)/\1 \2/
}'

Click here to get file: convert2uc1.sh

Carl Henrik Lunde suggested a way to make this simpler. I was working too hard.
#!/bin/sh
# convert2uc version b
# change the first hex number to upper case format
# uses sed once
# used as a filter
# convert2uc <in >out
sed '
{
	# remember the line
	h
	#change the current line to upper case
	y/abcdef/ABCDEF/
	# add the old line back
	G
	# Keep the first word of the first line,
	# and second word of the second line
	# delete all but the first and last word
	s/ .* / /
}'

Click here to get file: convert2uc2.sh

This example only converts the letters "a" through "f" to upper case. This was chosen to make the script easier to print in these narrow columns. You can easily modify the script to convert all letters to uppercase, or to change the first letter, second word, etc.
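Here is Lunde's simplification condensed into a one-liner for testing (my sketch; GNU sed assumed, since ".*" must match across the newline embedded by "G"):

```shell
# Duplicate the line into the hold buffer, uppercase the copy in the
# pattern space, append the original back with G, then keep only the
# first word (uppercased) and the last word (original case).
echo 'deadbeef rest' | sed 'h;y/abcdef/ABCDEF/;G;s/ .* / /'
```

This prints "DEADBEEF rest" - the first word is converted, the rest is untouched.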

Flow Control
As you learn about sed you realize that it has its own programming language. It is true that it's a very specialized and simple language. What language would be complete without a method of changing the flow of control? There are three commands sed uses for this. You can specify a label with a text string preceded by a colon. The "b" command branches to the label that follows it; if no label is given, it branches to the end of the script. The "t" command is used to test conditions. Before I discuss the "t" command, I will show you an example using the "b" command. This example remembers paragraphs, and if one contains the pattern (specified by an argument), the script prints out the entire paragraph.
#!/bin/sh
sed -n '
# if an empty line, check the paragraph
/^$/ b para
# else add it to the hold buffer
H
# at end of file, check paragraph
$ b para
# now branch to end of script
b
# this is where a paragraph is checked for the pattern
:para
# return the entire paragraph
# into the pattern space
x
# look for the pattern, if there - print
/'$1'/ p
'

Click here to get file: grep_paragraph.sh
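To see the branching in action without the shell wrapper, here is the same sed program with the pattern hard-coded as "foo", run on a fabricated two-paragraph input (my example):

```shell
# Lines accumulate in the hold buffer with H; an empty line (or end of
# file, $) branches to :para, which swaps the paragraph into the
# pattern space and prints it only if it contains "foo". The bare "b"
# ends the cycle early on ordinary lines.
printf 'one\ntwo\n\nthree foo\nfour\n' | sed -n '
/^$/ b para
H
$ b para
b
:para
x
/foo/ p
'
```

Only the second paragraph is printed (preceded by a blank line, an artifact of the hold buffer starting out empty).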

Testing with t
You can execute a branch if a pattern is found. You may want to execute a branch only if a substitution is made. The command "t label" will branch to the label if the last substitute command modified the pattern space. One use for this is recursive patterns. Suppose you wanted to remove white space inside parentheses. These parentheses might be nested. That is, you would want to delete a string that looked like "( ( ( ())) )". The sed expression
sed 's/([ ^I]*)//g'

would only remove the innermost set. You would have to pipe the data through the script four times to remove each set of parentheses. You could use the regular expression
sed 's/([ ^I()]*)//g'

but that would delete non-matching sets of parentheses. The "t" command solves this:
#!/bin/sh
sed '
:again
	s/([ ^I]*)//g
	t again
'

Click here to get file: delete_nested_parens.sh
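A quick check of the looping behavior (my example, using spaces only; the original also handles tabs via ^I):

```shell
# Each pass deletes the innermost parentheses containing only spaces;
# the t command loops back to :again as long as a substitution occurred,
# so the nesting unwinds one level per pass.
echo 'x( ( ( ())) )y' | sed ':again
s/([ ]*)//g
t again'
```

This prints "xy" - all four nested levels are removed in successive passes.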

An alternate way of adding comments


There is one way to add comments in a sed script if you don't have a version that supports them: use the insert ("i") command with a line number of zero:
#!/bin/sh
sed '
/begin/ {
0i\
This is a comment\
It can cover several lines\
It will work with any version of sed
}'

Click here to get file: sed_add_comments.sh

The poorly documented ;


There is one more sed command that isn't well documented. It is the ";" command. This can be used to combine several sed commands on one line. Here is the grep4 script I described earlier, but without the comments or error checking and with semicolons between commands:

#!/bin/sh
sed -n '
'/$1/' !{;H;x;s/^.*\n\(.*\n.*\)$/\1/;x;}
'/$1/' {;H;n;H;x;p;a\
---
}'

Click here to get file: grep4a.sh

Yessireebob! Definitely character building. I think I have made my point. As far as I am concerned, the only time the semicolon is useful is when you want to type the sed script on the command line. If you are going to place it in a script, format it so it is readable. I have mentioned earlier that many versions of sed do not support comments except on the first line. You may want to write your scripts with comments in them, and install them in "binary" form without comments. This should not be difficult. After all, you have become a sed guru by now. I won't even tell you how to write a script to strip out comments. That would be insulting your intelligence. Also - some operating systems do NOT let you use semicolons. So if you see a script with semicolons, and it does not work on a non-Linux system, replace each semicolon with a newline character. (As long as you are not using csh/tcsh, but that's another topic.)

Passing regular expressions as arguments


In the earlier scripts, I mentioned that you would have problems if you passed an argument to the script that had a slash in it. In fact, any regular expression metacharacter might cause you problems. A script like the following is asking to be broken some day:
#!/bin/sh sed 's/'"$1"'//g'

If the argument contains any of these characters, you may get a broken script: "/\.*[]^$". For instance, if someone types a "/", then the substitute command will see four delimiters instead of three. You will also get syntax errors if you provide a "[" without a "]". One solution is to have the user put a backslash before any of these characters when they pass it as an argument. However, the user has to know which characters are special. Another solution is to add a backslash before each of those characters in the script:

#!/bin/sh
arg=`echo "$1" | sed 's:[]\[\^\$\.\*\/]:\\\\&:g'`
sed 's/'"$arg"'//g'

Click here to get file: sed_with_regular_expressions1.sh

If you were searching for the pattern "^../", the script would convert this into "\^\.\.\/" before passing it to sed.
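To check the quoting step by itself, here is the inner sed expression applied directly (outside the backquotes, so only two backslashes are needed in the replacement):

```shell
# The bracket expression matches ] \ [ ^ $ . * and /; each match is
# replaced by a backslash followed by itself (& is the matched text).
echo '^../' | sed 's:[]\[\^\$\.\*\/]:\\&:g'
```

This prints "\^\.\.\/", the escaped form the section describes.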

Command Summary
As I promised earlier, here is a table that summarizes the different commands. The second column specifies whether the command can take a range or pair of addresses (marked with a 2) or only a single address or pattern (marked with a 1). The next four columns specify which of the four buffers or streams are modified by the command. Some commands only affect the output stream, others only affect the hold buffer. If you remember that the pattern space is output (unless a "-n" was given to sed), this table should help you keep track of the various commands.
+--------------------------------------------------------+
|Command  Address   Modifications to                     |
|         or Range  Input    Output   Pattern   Hold     |
|                   Stream   Stream   Space     Buffer   |
+--------------------------------------------------------+
|=                           Y                           |
|a        1                  Y                           |
|b        2                                              |
|c        2                  Y                           |
|d        2         Y                 Y                  |
|D        2         Y                 Y                  |
|g        2                           Y                  |
|G        2                           Y                  |
|h        2                                     Y        |
|H        2                                     Y        |
|i        1                  Y                           |
|l        1                  Y                           |
|n        2         Y        *                           |
|N        2         Y                 Y                  |
|p        2                  Y                           |
|P        2                  Y                           |
|q        1                                              |
|r        1                  Y                           |
|s        2                           Y                  |
|t        2                                              |
|w        2                  Y                           |
|x        2                           Y         Y        |
|y        2                           Y                  |
+--------------------------------------------------------+

The "n" command may or may not generate output, depending on the "-n" option. The "r" command can only have one address, despite the documentation.

Check out my new Sed Reference Chart

In Conclusion
This concludes my tutorial on sed. It is possible to find shorter forms of some of my scripts. However, I chose these examples to illustrate some basic constructs. I wanted clarity, not obscurity. I hope you enjoyed it.

More References
My other Unix shell tutorials can be found here. Other shell tutorials can be found at Heiner's SHELLdorado and Chris F. A. Johnson's Unix Shell Page

FILE SPACING:

# double space a file
sed G

# triple space a file
sed 'G;G'

# undo double-spacing (assumes even-numbered lines are always blank)
sed 'n;d'

NUMBERING:

# number each line of a file (simple left alignment). Using a tab (see
# note on '\t' at end of file) instead of space will preserve margins.
sed = filename | sed 'N;s/\n/\t/'

# number each line of a file (number on left, right-aligned)
sed = filename | sed 'N; s/^/     /; s/ *\(.\{6,\}\)\n/\1  /'

# number each line of file, but only print numbers if line is not blank
sed '/./=' filename | sed '/./N; s/\n/ /'

# count lines (emulates "wc -l")
sed -n '$='

TEXT CONVERSION AND SUBSTITUTION:
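A quick sanity check of two of the numbering one-liners (using a space instead of a tab, since '\t' support varies; see the note at the end of this list):

```shell
# "sed =" emits the line number on its own line before each input line;
# the second sed joins each number/line pair with N and replaces the
# embedded newline with a space.
printf 'apple\nbanana\n' | sed = | sed 'N;s/\n/ /'

# count lines: "=" on the last line ($) prints the final line number.
printf 'apple\nbanana\n' | sed -n '$='
```

The first prints "1 apple" and "2 banana"; the second prints "2".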

# IN UNIX ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
sed 's/.$//'

# IN DOS ENVIRONMENT: convert Unix newlines (LF) to DOS format
sed 's/$//'                  # method 1
sed -n p                     # method 2

# delete leading whitespace (spaces, tabs) from front of each line
# aligns all text flush left
sed 's/^[ \t]*//'            # see note on '\t' at end of file

# delete trailing whitespace (spaces, tabs) from end of each line
sed 's/[ \t]*$//'            # see note on '\t' at end of file

# delete BOTH leading and trailing whitespace from each line
sed 's/^[ \t]*//;s/[ \t]*$//'

# insert 5 blank spaces at beginning of each line (make page offset)
sed 's/^/     /'

# align all text flush right on a 79-column width
sed -e :a -e 's/^.\{1,78\}$/ &/;ta'  # set at 78 plus 1 space

# center all text in the middle of 79-column width. In method 1,
# spaces at the beginning of the line are significant, and trailing
# spaces are appended at the end of the line. In method 2, spaces at
# the beginning of the line are discarded in centering the line, and
# no trailing spaces appear at the end of lines.
sed -e :a -e 's/^.\{1,77\}$/ & /;ta'                     # method 1
sed -e :a -e 's/^.\{1,77\}$/ &/;ta' -e 's/\( *\)\1/\1/'  # method 2

# substitute (find & replace) "foo" with "bar" on each line
sed 's/foo/bar/'             # replaces only 1st instance in a line
sed 's/foo/bar/4'            # replaces only 4th instance in a line
sed 's/foo/bar/g'            # replaces ALL instances in a line

# substitute "foo" with "bar" ONLY for lines which contain "baz"
sed '/baz/s/foo/bar/g'

# substitute "foo" with "bar" EXCEPT for lines which contain "baz"
sed '/baz/!s/foo/bar/g'

# reverse order of lines (emulates "tac")
sed '1!G;h;$!d'
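The "tac" emulation is worth tracing once; here is a small test of it (my example input):

```shell
# 1!G appends the hold buffer (all previous lines, already reversed)
# to every line but the first; h saves the growing result; $!d
# suppresses output until the last line, which holds the whole file
# in reverse order.
printf 'first\nsecond\nthird\n' | sed '1!G;h;$!d'
```

This prints "third", "second", "first".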

# reverse each character on the line (emulates "rev")
sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'

# join pairs of lines side-by-side (like "paste")
sed 'N;s/\n/ /'

SELECTIVE PRINTING OF CERTAIN LINES:

# print first 10 lines of file (emulates behavior of "head")
sed 10q

# print first line of file (emulates "head -1")
sed q

# print last 10 lines of file (emulates "tail")
sed -e :a -e '$q;N;11,$D;ba'

# print last line of file (emulates "tail -1")
sed '$!d'

# print only lines which match regular expression (emulates "grep")
sed -n '/regexp/p'           # method 1
sed '/regexp/!d'             # method 2

# print only lines which do NOT match regexp (emulates "grep -v")
sed -n '/regexp/!p'          # method 1, corresponds to above
sed '/regexp/d'              # method 2, simpler syntax

# print 1 line of context before and after regexp, with line number
# indicating where the regexp occurred (similar to "grep -A1 -B1")
sed -n -e '/regexp/{=;x;1!p;g;$!N;p;D;}' -e h

# grep for AAA and BBB and CCC (in any order)
sed '/AAA/!d; /BBB/!d; /CCC/!d'

# grep for AAA or BBB or CCC (emulates "egrep")
sed -e '/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d

# print only lines of 65 characters or longer
sed -n '/^.\{65\}/p'

# print only lines of less than 65 characters
sed -n '/^.\{65\}/!p'        # method 1, corresponds to above
sed '/^.\{65\}/d'            # method 2, simpler syntax
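Two of the selective-printing one-liners, verified on generated input (a sketch; the seq utility is assumed available):

```shell
# head emulation: quit after printing line 10.
seq 20 | sed 10q

# tail -1 emulation: delete every line except the last.
seq 20 | sed '$!d'
```

The first prints 1 through 10; the second prints only "20".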

# print section of file from regular expression to end of file
sed -n '/regexp/,$p'

# print section of file based on line numbers (lines 8-12, inclusive)
sed -n '8,12p'               # method 1
sed '8,12!d'                 # method 2

# print line number 52
sed -n '52p'                 # method 1
sed '52!d'                   # method 2
sed '52q;d'                  # method 3, efficient on large files

# print section of file between two regular expressions (inclusive)
sed -n '/Iowa/,/Montana/p'   # case sensitive

SELECTIVE DELETION OF CERTAIN LINES:

# print all of file EXCEPT section between 2 regular expressions
sed '/Iowa/,/Montana/d'

# delete duplicate lines from a sorted file (emulates "uniq"). First
# line in a set of duplicate lines is kept, the rest are deleted
sed '$!N; /^\(.*\)\n\1$/!P; D'

# delete ALL blank lines from a file (same as "grep '.' ")
sed '/^$/d'

# delete all CONSECUTIVE blank lines from file except the first; also
# deletes all blank lines from top and end of file (emulates "cat -s")
sed '/./,/^$/!d'             # method 1, allows 0 blanks at top, 1 at EOF
sed '/^$/N;/\n$/D'           # method 2, allows 1 blank at top, 0 at EOF

# delete all CONSECUTIVE blank lines from file except the first 2:
sed '/^$/N;/\n$/N;//D'

# delete all leading blank lines at top of file
sed '/./,$!d'

# delete all trailing blank lines at end of file
sed -e :a -e '/^\n*$/N;/\n$/ba'

SPECIAL APPLICATIONS:

# remove nroff overstrikes (char, backspace) from man pages
sed "s/.`echo \\\b`//g"      # double quotes required for Unix environment
sed 's/.\x08//g'             # hex expression for GNU sed (octal is "\010")
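The line-number selections are easy to verify (my example; seq assumed available):

```shell
# print lines 8-12 inclusive (method 1)
seq 20 | sed -n '8,12p'

# print only line 15, quitting early so large files are not fully read
seq 20 | sed '15q;d'
```

The first prints 8 through 12; the second prints only "15" (q prints the line before quitting, d deletes everything earlier).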

# get Usenet/e-mail message header
sed '/^$/q'                  # deletes everything after first blank line

# get Usenet/e-mail message body
sed '1,/^$/d'                # deletes everything up to first blank line

# get Subject header, but remove initial "Subject: " portion
sed '/^Subject: */!d; s///;q'

# get return address header
sed '/^Reply-To:/q; /^From:/h; /./d;g;q'

# parse out the address proper. Pulls out the e-mail address by itself
# from the 1-line return address header (see preceding script)
sed 's/ *(.*)//; s/>.*//; s/.*[:<] *//'

# add a leading angle bracket and space to each line (quote a message)
sed 's/^/> /'

# delete leading angle bracket & space from each line (unquote a message)
sed 's/^> //'

# remove most HTML tags (accommodates multiple-line tags)
sed -e :a -e 's/<[^<]*>/ /g;/</{N;s/\n/ /;ba;}'

# extract multi-part uuencoded binaries, removing extraneous header
# info, so that only the uuencoded portion remains. Files passed to
# sed must be passed in the proper order. Version 1 can be entered
# from the command line; version 2 can be made into an executable
# Unix shell script. (Modified from a script by Rahul Dhesi.)
sed '/^end/,/^begin/d' file1 file2 ... fileX | uudecode   # vers. 1
sed '/^end/,/^begin/d' $* | uudecode                      # vers. 2

# zip up each .TXT file individually, deleting the source file and
# setting the name of each .ZIP file to the basename of the .TXT file
# (under DOS: the "dir /b" switch returns bare filenames in all caps).
echo @echo off >zipup.bat
dir /b *.txt | sed "s/^\(.*\)\.TXT/pkzip -mo \1 \1.TXT/" >>zipup.bat

TYPICAL USE:

Sed takes one or more editing commands and applies all of them, in
sequence, to each line of input. After all the commands have been
applied to the first input line, that line is output and a second
input line is taken for processing, and the cycle repeats.

The preceding examples assume that input comes from the standard
input device (i.e., the console, normally this will be piped input). One or
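The header/body one-liners can be sanity-checked on a fabricated message (my example input):

```shell
msg='From: alice
Subject: hello

body line 1
body line 2'

# header: quit at the first blank line
printf '%s\n' "$msg" | sed '/^$/q'

# body: delete everything from line 1 through the first blank line
printf '%s\n' "$msg" | sed '1,/^$/d'
```

The first command prints the two header lines (and the blank line itself, since q prints before quitting); the second prints only the two body lines.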

more filenames can be appended to the command line if the input does
not come from stdin. Output is sent to stdout (the screen). Thus:

cat filename | sed '10q'        # uses piped input
sed '10q' filename              # same effect, avoids a useless "cat"
sed '10q' filename > newfile    # redirects output to disk

For additional syntax instructions, including the way to apply editing
commands from a disk file instead of the command line, consult "sed &
awk, 2nd Edition," by Dale Dougherty and Arnold Robbins (O'Reilly,
1997; http://www.ora.com), "UNIX Text Processing," by Dale Dougherty
and Tim O'Reilly (Hayden Books, 1987) or the tutorials by Mike Arst
distributed in U-SEDIT2.ZIP (many sites). To fully exploit the power
of sed, one must understand "regular expressions." For this, see
"Mastering Regular Expressions" by Jeffrey Friedl (O'Reilly, 1997).
The manual ("man") pages on Unix systems may be helpful (try "man
sed", "man regexp", or the subsection on regular expressions in "man
ed"), but man pages are notoriously difficult. They are not written to
teach sed use or regexps to first-time users, but as a reference text
for those already acquainted with these tools.

QUOTING SYNTAX:

The preceding examples use single quotes ('...') instead of double
quotes ("...") to enclose editing commands, since sed is typically
used on a Unix platform. Single quotes prevent the Unix shell from
interpreting the dollar sign ($) and backquotes (`...`), which are
expanded by the shell if they are enclosed in double quotes. Users of
the "csh" shell and derivatives will also need to quote the
exclamation mark (!) with the backslash (i.e., \!) to properly run
the examples listed above, even within single quotes. Versions of sed
written for DOS invariably require double quotes ("...") instead of
single quotes to enclose editing commands.

USE OF '\t' IN SED SCRIPTS:

For clarity in documentation, we have used the expression '\t' to
indicate a tab character (0x09) in the scripts.
However, most versions of sed do not recognize the '\t' abbreviation,
so when typing these scripts from the command line, you should press
the TAB key instead. '\t' is supported as a regular expression
metacharacter in awk, perl, and in a few implementations of sed.

VERSIONS OF SED:

Versions of sed do differ, and some slight syntax variation is to be
expected. In particular, most do not support the use of labels (:name)
or branch instructions (b,t) within editing commands, except at the
end of those commands. We have used the syntax which will be portable
to most users of sed, even though the popular GNU versions of sed
allow a more succinct syntax. When the reader sees a fairly long
command such as this:

sed -e '/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d

it is heartening to know that GNU sed will let you reduce it to:

sed '/AAA/b;/BBB/b;/CCC/b;d'

In addition, remember that while many versions of sed accept a command
like "/one/ s/RE1/RE2/", some do NOT allow "/one/! s/RE1/RE2/", which
contains space before the 's'. Omit the space when typing the command.

OPTIMIZING FOR SPEED:

If execution speed needs to be increased (due to large input files or
slow processors or hard disks), substitution will be executed more
quickly if the "find" expression is specified before giving the
"s/.../.../" instruction. Thus:

sed 's/foo/bar/g' filename         # standard replace command
sed '/foo/ s/foo/bar/g' filename   # executes more quickly
sed '/foo/ s//bar/g' filename      # shorthand sed syntax

On line selection or deletion in which you only need to output lines
from the first part of the file, a "quit" command (q) in the script
will drastically reduce processing time for large files. Thus:

sed -n '45,50p' filename           # print line nos. 45-50 of a file
sed -n '51q;45,50p' filename       # same, but executes much faster
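The early-quit optimization can be verified to produce identical output (my example; seq assumed available - the speed difference only shows on large inputs):

```shell
# both print lines 45-50; the second quits at line 51 instead of
# reading the rest of the input
seq 100 | sed -n '45,50p'
seq 100 | sed -n '51q;45,50p'
```

With "-n" in effect, the "q" at line 51 does not print anything itself, so the two commands produce the same six lines.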
