(gawk) History Sorting
Removing Duplicates from Unsorted Text
--------------------------------------
The `uniq' program (see "Printing Non-duplicated Lines of Text",
Uniq Program) removes duplicate lines from _sorted_ data.

Suppose, however, you need to remove duplicate lines from a data
file, but you wish to preserve the order the lines are in. A good
example of this might be a shell history file. The history file keeps
a copy of all the commands you have entered, and it is not unusual to
repeat a command several times in a row. Occasionally you might wish
to compact the history by removing duplicate entries. Yet it is
desirable to maintain the order of the original commands.

This simple program does the job. It uses two arrays. The `data'
array is indexed by the text of each line. For each line, `data[$0]'
is incremented.

If a particular line has not been seen before, then `data[$0]' is
zero before the increment. In that case, `count' is incremented and
the text of the line is stored in `lines[count]'. Each element of
`lines' is a unique command, and the indices of `lines' indicate the
order in which those lines were encountered. The `END' rule simply
prints out the lines, in order.

     # histsort.awk --- compact a shell history file
     # Arnold Robbins, arnold@gnu.org, Public Domain
     # May 1993

     # Thanks to Byron Rakitzis for the general idea
     {
         if (data[$0]++ == 0)
             lines[++count] = $0
     }

     END {
         for (i = 1; i <= count; i++)
             print lines[i]
     }

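
As a quick check of this behavior, the following sketch wraps the same
rule and `END' rule in a shell function and feeds it a few history
lines. The function name `histsort' and the sample commands are
invented for illustration, and a POSIX shell with an `awk' in the
search path is assumed.

```shell
# histsort: hypothetical wrapper around the rule and END rule shown above.
# File name arguments, if any, are passed through to awk.
histsort() {
    awk '
    {
        if (data[$0]++ == 0)        # true only the first time a line is seen
            lines[++count] = $0     # remember it in order of arrival
    }

    END {
        for (i = 1; i <= count; i++)
            print lines[i]
    }' "$@"
}

# Sample input, made up for this example.
printf '%s\n' 'ls -l' 'cd src' 'ls -l' 'make' 'cd src' | histsort
```

This prints `ls -l', `cd src', and `make', each once, in the order in
which they first appeared.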
This program also provides a foundation for generating other useful
information. For example, using the following `print' statement in the
`END' rule would indicate how often a particular command was used.

     print data[lines[i]], lines[i]

This works because `data[$0]' was incremented each time a line was
seen.
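
A minimal sketch of the complete counting variant follows. The function
name `histcount' and the sample input are hypothetical, and a POSIX
shell with an `awk' in the search path is assumed. Only the `print'
statement differs from the original program.

```shell
# histcount: hypothetical wrapper that prints each unique line
# preceded by the number of times it occurred.
histcount() {
    awk '
    {
        if (data[$0]++ == 0)        # data[$0] ends up holding the total count
            lines[++count] = $0
    }

    END {
        for (i = 1; i <= count; i++)
            print data[lines[i]], lines[i]
    }' "$@"
}

# Sample input, made up for this example.
printf '%s\n' 'ls -l' 'cd src' 'ls -l' 'make' 'ls -l' | histcount
```

For this input the output is `3 ls -l', `1 cd src', and `1 make', still
in order of first appearance.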