(gawk) Uniq Program
Printing Non-duplicated Lines of Text
-------------------------------------
The `uniq' utility reads sorted lines of data on its standard input,
and (by default) removes duplicate lines. In other words, only unique
lines are printed, hence the name. `uniq' has a number of options. The
usage is:
uniq [-udc [-N]] [+N] [ INPUT FILE [ OUTPUT FILE ]]
The option meanings are:
`-d'
     Only print repeated lines.
`-u'
     Only print non-repeated lines.
`-c'
     Count lines. This option overrides `-d' and `-u'. Both repeated
     and non-repeated lines are counted.
`-N'
     Skip N fields before comparing lines. The definition of fields is
     similar to `awk''s default: non-whitespace characters separated by
     runs of spaces and/or tabs.
`+N'
     Skip N characters before comparing lines. Any fields specified
     with `-N' are skipped first.
`INPUT FILE'
     Data is read from the input file named on the command line,
     instead of from the standard input.
`OUTPUT FILE'
     The generated output is sent to the named output file, instead of
     to the standard output.
Normally `uniq' behaves as if both the `-d' and `-u' options had
been provided.
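The effect of the three main options can be seen on a small sample. The following is only an illustrative sketch: it assumes a POSIX `uniq' is installed and uses it in place of the `awk' version, and the `sample' helper and its data are made up for the example.

```shell
# Hypothetical sample data: already sorted, so duplicates are adjacent.
sample() { printf 'apple\napple\nbanana\ncherry\ncherry\n'; }

sample | uniq        # default: one copy of every line
sample | uniq -d     # only lines that were repeated
sample | uniq -u     # only lines that were not repeated
```

The first command prints `apple', `banana' and `cherry'; the second prints `apple' and `cherry'; the third prints only `banana'.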
Here is an `awk' implementation of `uniq'. It uses the `getopt'
library function (see Processing Command Line Options: Getopt
Function), and the `join' library function (see Merging an Array
Into a String: Join Function).
The program begins with a `usage' function and then a brief outline
of the options and their meanings in a comment.
The `BEGIN' rule deals with the command line arguments and options.
It uses a trick to get `getopt' to handle options of the form `-25',
treating such an option as the option letter `2' with an argument of
`5'. If indeed two or more digits were supplied (`Optarg' looks like a
number), `Optarg' is concatenated with the option digit, and the
result is added to zero to make it into a number. If there is only one
digit in the option, then `Optarg' is not needed, and `Optind' must be
decremented so that `getopt' will process it next time. This code is
admittedly a bit tricky.
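The digit trick can be tried in isolation. The sketch below is not part of `uniq.awk': `digit_option' is a hypothetical shell wrapper that takes the option letter and the value `getopt' would have placed in `Optarg', and applies the same test as the `BEGIN' rule.

```shell
# digit_option C OPTARG: hypothetical helper, not part of uniq.awk.
# C is the option digit, OPTARG is what getopt stored in Optarg.
digit_option() {
    awk -v c="$1" -v optarg="$2" 'BEGIN {
        if (optarg ~ /^[0-9]+$/)
            fcount = (c optarg) + 0   # "-25": digit "2" joined with argument "5"
        else
            fcount = c + 0            # "-5": Optarg was really the next word;
                                      # uniq.awk also decrements Optind here
        print fcount
    }'
}

digit_option 2 5          # as for -25
digit_option 5 data.txt   # as for -5 followed by a file name
```

The two calls print `25' and `5' respectively. In the second case the real program must also give the swallowed argument back by decrementing `Optind', which this stand-alone sketch cannot show.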
If no options were supplied, then the default is taken, to print both
repeated and non-repeated lines. The output file, if provided, is
assigned to `outputfile'. Earlier, `outputfile' was initialized to the
standard output, `/dev/stdout'.
# uniq.awk --- do uniq in awk
# Arnold Robbins, arnold@gnu.org, Public Domain
# May 1993

function usage(    e)
{
    e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
    print e > "/dev/stderr"
    exit 1
}

# -c    count lines. overrides -d and -u
# -d    only repeated lines
# -u    only non-repeated lines
# -n    skip n fields
# +n    skip n characters, skip fields first

BEGIN   \
{
    count = 1
    outputfile = "/dev/stdout"
    opts = "udc0:1:2:3:4:5:6:7:8:9:"
    while ((c = getopt(ARGC, ARGV, opts)) != -1) {
        if (c == "u")
            non_repeated_only++
        else if (c == "d")
            repeated_only++
        else if (c == "c")
            do_count++
        else if (index("0123456789", c) != 0) {
            # getopt requires args to options
            # this messes us up for things like -5
            if (Optarg ~ /^[0-9]+$/)
                fcount = (c Optarg) + 0
            else {
                fcount = c + 0
                Optind--
            }
        } else
            usage()
    }

    if (ARGV[Optind] ~ /^\+[0-9]+$/) {
        charcount = substr(ARGV[Optind], 2) + 0
        Optind++
    }

    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    if (repeated_only == 0 && non_repeated_only == 0)
        repeated_only = non_repeated_only = 1

    if (ARGC - Optind == 2) {
        outputfile = ARGV[ARGC - 1]
        ARGV[ARGC - 1] = ""
    }
}
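The `BEGIN' rule also recognizes a `+N' argument and then blanks out the `ARGV' entries it has consumed, so that `awk' does not later try to open them as data files. A minimal sketch of that step follows; the `skip_chars' wrapper is hypothetical, and for simplicity it inspects only the first argument rather than `ARGV[Optind]'.

```shell
# skip_chars [ARG...]: hypothetical helper, not part of uniq.awk.
# Reports the character count taken from a leading +N argument.
skip_chars() {
    awk 'BEGIN {
        if (ARGV[1] ~ /^\+[0-9]+$/) {
            charcount = substr(ARGV[1], 2) + 0   # drop the "+", force a number
            ARGV[1] = ""       # empty entries are not opened as input files
        }
        print "charcount=" (charcount + 0)
    }' "$@"
}

skip_chars +3
skip_chars notes.txt
```

The first call prints `charcount=3'. The second prints `charcount=0', since `notes.txt' does not match the pattern and is left in place as a file name.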
The following function, `are_equal', compares the current line,
`$0', to the previous line, `last'. It handles skipping fields and
characters.
If no field count and no character count were specified, `are_equal'
simply returns one or zero depending upon the result of a simple string
comparison of `last' and `$0'. Otherwise, things get more complicated.
If fields have to be skipped, each line is broken into an array using
`split' (see Built-in Functions for String Manipulation: String
Functions), and then the desired fields are joined back into a line
using `join'. The joined lines are stored in `clast' and `cline'. If
no fields are skipped, `clast' and `cline' are set to `last' and `$0'
respectively.
Finally, if characters are skipped, `substr' is used to strip off the
leading `charcount' characters in `clast' and `cline'. The two strings
are then compared, and `are_equal' returns the result.
function are_equal(    n, m, clast, cline, alast, aline)
{
    if (fcount == 0 && charcount == 0)
        return (last == $0)

    if (fcount > 0) {
        n = split(last, alast)
        m = split($0, aline)
        clast = join(alast, fcount+1, n)
        cline = join(aline, fcount+1, m)
    } else {
        clast = last
        cline = $0
    }
    if (charcount) {
        clast = substr(clast, charcount + 1)
        cline = substr(cline, charcount + 1)
    }

    return (clast == cline)
}
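To see the skipping logic at work without the rest of the program, the comparison can be wrapped in a small test harness. This is a sketch under stated assumptions: `cmp_lines' is a hypothetical shell function, and the `join' defined inside it is a minimal stand-in for the library function that always uses a single space as the separator.

```shell
# cmp_lines FCOUNT CHARCOUNT LINE1 LINE2: hypothetical helper.
# Prints "equal" or "different" after skipping fields, then characters.
cmp_lines() {
    awk -v fcount="$1" -v charcount="$2" -v a="$3" -v b="$4" '
    # Minimal stand-in for the library join function (space separator only).
    function join(arr, start, end,    i, s) {
        s = arr[start]
        for (i = start + 1; i <= end; i++)
            s = s " " arr[i]
        return s
    }
    BEGIN {
        if (fcount > 0) {               # drop the first fcount fields
            n = split(a, aa)
            m = split(b, bb)
            a = join(aa, fcount + 1, n)
            b = join(bb, fcount + 1, m)
        }
        if (charcount > 0) {            # then drop leading characters
            a = substr(a, charcount + 1)
            b = substr(b, charcount + 1)
        }
        print ((a == b) ? "equal" : "different")
    }'
}

cmp_lines 1 0 "1 apple pie" "2 apple pie"
cmp_lines 0 0 "1 apple pie" "2 apple pie"
cmp_lines 0 2 "xxabc" "yyabc"
```

The three calls print `equal', `different' and `equal': the lines differ only in their first field, or only in their first two characters, so they match once that prefix is skipped.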
The following two rules are the body of the program. The first one
is executed only for the very first line of data. It sets `last' equal
to `$0', so that subsequent lines of text have something to be compared
to.
The second rule does the work. The variable `equal' will be one or
zero depending upon the results of `are_equal''s comparison. If `uniq'
is counting lines (`-c'), then the `count' variable is incremented if
the lines are equal. Otherwise the previous line is printed together
with its count, and `count' is reset, since the two lines are not equal.
If `uniq' is not counting, `count' is incremented if the lines are
equal. Otherwise, if `uniq' is printing repeated lines, and more than
one copy of the line has been seen, or if `uniq' is printing
non-repeated lines, and only one copy has been seen, then the previous
line is printed, and `count' is reset.
Finally, similar logic is used in the `END' rule to print the final
line of input data.
NR == 1 {
    last = $0
    next
}

{
    equal = are_equal()

    if (do_count) {    # overrides -d and -u
        if (equal)
            count++
        else {
            printf("%4d %s\n", count, last) > outputfile
            last = $0
            count = 1    # reset
        }
        next
    }

    if (equal)
        count++
    else {
        if ((repeated_only && count > 1) ||
            (non_repeated_only && count == 1))
                print last > outputfile
        last = $0
        count = 1
    }
}

END {
    if (do_count)
        printf("%4d %s\n", count, last) > outputfile
    else if ((repeated_only && count > 1) ||
            (non_repeated_only && count == 1))
        print last > outputfile
}
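The counting path of these rules can be reduced to a few lines when no fields or characters are skipped. The following is a simplified sketch, not a replacement for the program above: `count_runs' is a hypothetical name, it compares whole lines only, and the `NR > 0' guard is an addition of this sketch so that empty input produces no output.

```shell
# count_runs: hypothetical helper that mimics the -c path of uniq.awk
# for whole-line comparison. Reads standard input.
count_runs() {
    awk '
    NR == 1    { last = $0; count = 1; next }   # first line: nothing to compare
    $0 == last { count++; next }                # same as previous: extend the run
    {                                           # run ended: report it, start anew
        printf("%4d %s\n", count, last)
        last = $0
        count = 1
    }
    END { if (NR > 0) printf("%4d %s\n", count, last) }   # flush the final run
    '
}

printf 'a\na\nb\nc\nc\nc\n' | count_runs
```

For the sample input this prints three lines, `   2 a', `   1 b' and `   3 c'. As in the full program, the last run can only be reported from the `END' rule, because no later line arrives to trigger the comparison.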