
(gawk.info) String Functions

Info Catalog (gawk.info) Numeric Functions (gawk.info) Built-in (gawk.info) I/O Functions
 
 Built-in Functions for String Manipulation
 ==========================================
 
    The functions in this section look at or change the text of one or
 more strings.  Optional parameters are enclosed in square brackets ("["
 and "]").
 
 `index(IN, FIND)'
      This searches the string IN for the first occurrence of the string
      FIND, and returns the position in characters where that occurrence
      begins in the string IN.  For example:
 
           $ awk 'BEGIN { print index("peanut", "an") }'
           -| 3
 
      If FIND is not found, `index' returns zero.  (Remember that string
      indices in `awk' start at one.)
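
      As a minimal sketch (assuming a POSIX `awk' is on the search path;
      the shell variable names are only illustrative), the following
      shows that FIND is taken as a literal string, not a regexp, and
      that a failed search yields zero:

```shell
# "." is matched literally, so the first "." in "a.b.c" is at position 2
found=$(awk 'BEGIN { print index("a.b.c", ".") }')
# "x" does not occur in "peanut", so index returns 0
missing=$(awk 'BEGIN { print index("peanut", "x") }')
echo "found=$found missing=$missing"
# prints: found=2 missing=0
```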
 
 `length([STRING])'
      This gives you the number of characters in STRING.  If STRING is a
      number, the length of the digit string representing that number is
      returned.  For example, `length("abcde")' is five.  By contrast,
      `length(15 * 35)' works out to three.  How?  Well, 15 * 35 = 525,
      and 525 is then converted to the string `"525"', which has three
      characters.
 
      If no argument is supplied, `length' returns the length of `$0'.
 
      In older versions of `awk', you could call the `length' function
      without any parentheses.  Doing so is marked as "deprecated" in the
      POSIX standard.  This means that while you can do this in your
      programs, it is a feature that can eventually be removed from a
      future version of the standard.  Therefore, for maximal
      portability of your `awk' programs, you should always supply the
      parentheses.
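
      The three cases above can be tried together.  This is only a
      sketch, assuming a POSIX `awk'; the input line and the variable
      name `lens' are made up for illustration:

```shell
# length() with no argument measures $0, length($1) the first field,
# and length(15 * 35) the string "525"
lens=$(echo "hello world" | awk '{ print length(), length($1), length(15 * 35) }')
echo "$lens"
# prints: 11 5 3
```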
 
 `match(STRING, REGEXP)'
      The `match' function searches the string, STRING, for the longest,
      leftmost substring matched by the regular expression, REGEXP.  It
      returns the character position, or "index", of where that
      substring begins (one, if it starts at the beginning of STRING).
      If no match is found, it returns zero.
 
      The `match' function sets the built-in variable `RSTART' to the
      index.  It also sets the built-in variable `RLENGTH' to the length
      in characters of the matched substring.  If no match is found,
      `RSTART' is set to zero, and `RLENGTH' to -1.
 
      For example:
 
           awk '{
                  if ($1 == "FIND")
                    regex = $2
                  else {
                    where = match($0, regex)
                    if (where != 0)
                      print "Match of", regex, "found at", \
                                where, "in", $0
                  }
           }'
 
      This program looks for lines that match the regular expression
      stored in the variable `regex'.  This regular expression can be
      changed.  If the first word on a line is `FIND', `regex' is
      changed to be the second word on that line.  Therefore, given:
 
           FIND ru+n
           My program runs
           but not very quickly
           FIND Melvin
           JF+KM
           This line is property of Reality Engineering Co.
           Melvin was here.
 
      `awk' prints:
 
           Match of ru+n found at 12 in My program runs
           Match of Melvin found at 1 in Melvin was here.
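
      `RSTART' and `RLENGTH' are often passed to `substr' to pull out
      the matched text itself.  A small sketch, assuming a POSIX `awk'
      (the sample string and the variable `out' are illustrative):

```shell
out=$(awk 'BEGIN {
    s = "My program runs"
    # On success, RSTART and RLENGTH describe the matched substring
    if (match(s, /ru+n/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    # On failure, RSTART is 0 and RLENGTH is -1
    match(s, /xyz/)
    print RSTART, RLENGTH
}')
echo "$out"
# prints:
# 12 3 run
# 0 -1
```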
 
 `split(STRING, ARRAY [, FIELDSEP])'
      This divides STRING into pieces separated by FIELDSEP, and stores
      the pieces in ARRAY.  The first piece is stored in `ARRAY[1]', the
      second piece in `ARRAY[2]', and so forth.  The string value of the
      third argument, FIELDSEP, is a regexp describing where to split
      STRING (much as `FS' can be a regexp describing where to split
      input records).  If the FIELDSEP is omitted, the value of `FS' is
      used.  `split' returns the number of elements created.
 
      The `split' function splits strings into pieces in a manner
      similar to the way input lines are split into fields.  For example:
 
           split("cul-de-sac", a, "-")
 
      splits the string `cul-de-sac' into three fields using `-' as the
      separator.  It sets the contents of the array `a' as follows:
 
           a[1] = "cul"
           a[2] = "de"
           a[3] = "sac"
 
      The value returned by this call to `split' is three.
 
      As with input field-splitting, when the value of FIELDSEP is
      `" "', leading and trailing whitespace is ignored, and the elements
      are separated by runs of whitespace.
 
      Also as with input field-splitting, if FIELDSEP is the null
      string, each individual character in the string is split into its
      own array element.  (This is a `gawk'-specific extension.)
 
      Recent implementations of `awk', including `gawk', allow the third
      argument to be a regexp constant (`/abc/'), as well as a string
      (d.c.).  The POSIX standard allows this as well.
 
      Before splitting the string, `split' deletes any previously
      existing elements in the array ARRAY (d.c.).
 
      If STRING does not match FIELDSEP at all, ARRAY will have one
      element. The value of that element will be the original STRING.
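
      The following sketch (assuming a POSIX `awk'; the array and
      variable names are illustrative) exercises three of the rules
      above: whitespace splitting with `" "', deletion of old array
      elements, and the single-element result when FIELDSEP never
      matches:

```shell
out=$(awk 'BEGIN {
    parts[5] = "stale"                    # deleted by the split below
    n = split("  a  b c ", parts, " ")    # leading/trailing blanks ignored
    print n, parts[1], parts[2], parts[3], (5 in parts)
    m = split("abc", whole, ",")          # no comma: one element, "abc"
    print m, whole[1]
}')
echo "$out"
# prints:
# 3 a b c 0
# 1 abc
```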
 
 `sprintf(FORMAT, EXPRESSION1,...)'
      This returns (without printing) the string that `printf' would
      have printed out with the same arguments (see Using `printf'
      Statements for Fancier Printing: Printf).  For example:
 
           sprintf("pi = %.2f (approx.)", 22/7)
 
      returns the string `"pi = 3.14 (approx.)"'.
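
      Because the result is an ordinary string, it can be stored and
      reused.  A brief sketch, assuming a POSIX `awk' (the format and
      the variable names are chosen only for illustration):

```shell
line=$(awk 'BEGIN {
    # zero-padded integer, left-justified string, three decimal places
    s = sprintf("%05d|%-4s|%.3f", 42, "ab", 2/3)
    print s
}')
echo "$line"
# prints: 00042|ab  |0.667
```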
 
 `sub(REGEXP, REPLACEMENT [, TARGET])'
      The `sub' function alters the value of TARGET.  It searches this
      value, which is treated as a string, for the leftmost longest
      substring matched by the regular expression, REGEXP, extending
      this match as far as possible.  Then the entire string is changed
      by replacing the matched text with REPLACEMENT.  The modified
      string becomes the new value of TARGET.
 
      This function is peculiar because TARGET is not simply used to
      compute a value, and not just any expression will do: it must be a
      variable, field or array element, so that `sub' can store a
      modified value there.  If this argument is omitted, then the
      default is to use and alter `$0'.
 
      For example:
 
           str = "water, water, everywhere"
           sub(/at/, "ith", str)
 
      sets `str' to `"wither, water, everywhere"', by replacing the
      leftmost, longest occurrence of `at' with `ith'.
 
      The `sub' function returns the number of substitutions made (either
      one or zero).
 
      If the special character `&' appears in REPLACEMENT, it stands for
      the precise substring that was matched by REGEXP.  (If the regexp
      can match more than one string, then this precise substring may
      vary.)  For example:
 
           awk '{ sub(/candidate/, "& and his wife"); print }'
 
      changes the first occurrence of `candidate' to `candidate and his
      wife' on each input line.
 
      Here is another example:
 
           awk 'BEGIN {
                   str = "daabaaa"
                   sub(/a+/, "C&C", str)
                   print str
           }'
           -| dCaaCbaaa
 
      This shows how `&' can represent a non-constant string, and also
      illustrates the "leftmost, longest" rule in regexp matching (see
      How Much Text Matches?: Leftmost Longest).
 
      The effect of this special character (`&') can be turned off by
      putting a backslash before it in the string.  As usual, to insert
      one backslash in the string, you must write two backslashes.
      Therefore, write `\\&' in a string constant to include a literal
      `&' in the replacement.  For example, here is how to replace the
      first `|' on each line with an `&':
 
           awk '{ sub(/\|/, "\\&"); print }'
 
      Note: as mentioned above, the third argument to `sub' must be a
      variable, field or array reference.  Some versions of `awk' allow
      the third argument to be an expression which is not an lvalue.  In
      such a case, `sub' would still search for the pattern and return
      zero or one, but the result of the substitution (if any) would be
      thrown away because there is no place to put it.  Such versions of
      `awk' accept expressions like this:
 
           sub(/USA/, "United States", "the USA and Canada")
 
      For historical compatibility, `gawk' will accept erroneous code,
      such as in the above example. However, using any other
      non-changeable object as the third parameter will cause a fatal
      error, and your program will not run.
 
      Finally, if the REGEXP is not a regexp constant, it is converted
      into a string and then the value of that string is treated as the
      regexp to match.
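
      A short sketch of a regexp held in a string, together with the
      return value of `sub' (this assumes a POSIX `awk'; the variable
      names `re' and `out' are illustrative):

```shell
out=$(awk 'BEGIN {
    str = "water, water, everywhere"
    re = "^w[a-z]+"             # a string, used as the regexp
    n = sub(re, "wine", str)
    print n, str
    n = sub("xyz", "!", str)    # no match: returns 0, str is unchanged
    print n, str
}')
echo "$out"
# prints:
# 1 wine, water, everywhere
# 0 wine, water, everywhere
```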
 
 `gsub(REGEXP, REPLACEMENT [, TARGET])'
      This is similar to the `sub' function, except `gsub' replaces
      _all_ of the longest, leftmost, _non-overlapping_ matching
      substrings it can find.  The `g' in `gsub' stands for "global,"
      which means replace everywhere.  For example:
 
           awk '{ gsub(/Britain/, "United Kingdom"); print }'
 
      replaces all occurrences of the string `Britain' with `United
      Kingdom' for all input records.
 
      The `gsub' function returns the number of substitutions made.  If
      the variable to be searched and altered, TARGET, is omitted, then
      the entire input record, `$0', is used.
 
      As in `sub', the characters `&' and `\' are special, and the third
      argument must be an lvalue.
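
      For instance, the return value can be used to count matches while
      `&' wraps each one.  This is a sketch assuming a POSIX `awk'; the
      sample string is made up:

```shell
out=$(awk 'BEGIN {
    s = "x1 y22 z333"
    # each run of digits is one non-overlapping, longest match
    n = gsub(/[0-9]+/, "<&>", s)
    print n, s
}')
echo "$out"
# prints: 3 x<1> y<22> z<333>
```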
 
 `gensub(REGEXP, REPLACEMENT, HOW [, TARGET])'
      `gensub' is a general substitution function.  Like `sub' and
      `gsub', it searches the target string TARGET for matches of the
      regular expression REGEXP.  Unlike `sub' and `gsub', the modified
      string is returned as the result of the function, and the original
      target string is _not_ changed.  If HOW is a string beginning with
      `g' or `G', then it replaces all matches of REGEXP with
      REPLACEMENT.  Otherwise, HOW is a number indicating which match of
      REGEXP to replace. If no TARGET is supplied, `$0' is used instead.
 
      `gensub' provides an additional feature that is not available in
      `sub' or `gsub': the ability to specify components of a regexp in
      the replacement text.  This is done by using parentheses in the
      regexp to mark the components, and then specifying `\N' in the
      replacement text, where N is a digit from one to nine.  For
      example:
 
           $ gawk '
           > BEGIN {
           >      a = "abc def"
           >      b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
           >      print b
           > }'
           -| def abc
 
      As described above for `sub', you must type two backslashes in
      order to get one into the string.
 
      In the replacement text, the sequence `\0' represents the entire
      matched text, as does the character `&'.
 
      This example shows how you can use the third argument to control
      which match of the regexp should be changed.
 
           $ echo a b c a b c |
           > gawk '{ print gensub(/a/, "AA", 2) }'
           -| a b c AA b c
 
      In this case, `$0' is used as the default target string.  `gensub'
      returns the new string as its result, which is passed directly to
      `print' for printing.
 
      If the HOW argument is a string that does not begin with `g' or
      `G', or if it is a number that is less than zero, only one
      substitution is performed.
 
      If REGEXP does not match TARGET, `gensub''s return value is the
      original, unchanged value of TARGET.
 
      `gensub' is a `gawk' extension; it is not available in
      compatibility mode (see Command Line Options: Options).
 
 `substr(STRING, START [, LENGTH])'
      This returns a LENGTH-character-long substring of STRING, starting
      at character number START.  The first character of a string is
      character number one.  For example, `substr("washington", 5, 3)'
      returns `"ing"'.
 
      If LENGTH is not present, this function returns the whole suffix of
      STRING that begins at character number START.  For example,
      `substr("washington", 5)' returns `"ington"'.  The whole suffix is
      also returned if LENGTH is greater than the number of characters
      remaining in the string, counting from character number START.
 
        
      Note: the string returned by `substr' _cannot_ be assigned to.
      Thus, it is a mistake to attempt to change a portion of a string,
      like this:
 
           string = "abcdef"
           # try to get "abCDEf", won't work
           substr(string, 3, 3) = "CDE"
 
      or to use `substr' as the third argument of `sub' or `gsub':
 
           gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG
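
      One possible way to get the intended effect is to rebuild the
      string from pieces, or to copy the substring into a variable
      first.  The sketch below assumes a POSIX `awk'; the names `rec',
      `head' and `tail' are illustrative only:

```shell
out=$(awk 'BEGIN {
    string = "abcdef"
    # replace characters 3-5 by concatenating prefix, new text, suffix
    string = substr(string, 1, 2) "CDE" substr(string, 6)
    print string

    rec = "xyz xyz xyz"
    head = substr(rec, 1, 4)     # left untouched
    tail = substr(rec, 5)        # a real variable, so gsub can alter it
    gsub(/xyz/, "pdq", tail)
    print head tail
}')
echo "$out"
# prints:
# abCDEf
# xyz pdq pdq
```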
 
 `tolower(STRING)'
      This returns a copy of STRING, with each upper-case character in
      the string replaced with its corresponding lower-case character.
      Non-alphabetic characters are left unchanged.  For example,
      `tolower("MiXeD cAsE 123")' returns `"mixed case 123"'.
 
 `toupper(STRING)'
      This returns a copy of STRING, with each lower-case character in
      the string replaced with its corresponding upper-case character.
      Non-alphabetic characters are left unchanged.  For example,
      `toupper("MiXeD cAsE 123")' returns `"MIXED CASE 123"'.
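
      A common use of these two functions is comparing text without
      regard to case.  A minimal sketch, assuming a POSIX `awk' and a
      shell with `printf' (the input lines are invented):

```shell
# count the input lines that say "yes" in any mixture of cases
count=$(printf 'Yes\nNO\nyes\nmaybe\n' |
        awk 'tolower($0) == "yes" { n++ } END { print n }')
echo "$count"
# prints: 2
```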
 
 More About `\' and `&' with `sub', `gsub' and `gensub'
 ------------------------------------------------------
 
    When using `sub', `gsub' or `gensub', and trying to get literal
 backslashes and ampersands into the replacement text, you need to
 remember that there are several levels of "escape processing" going on.
 
    First, there is the "lexical" level, which is when `awk' reads your
 program, and builds an internal copy of your program that can be
 executed.
 
    Then there is the run-time level, when `awk' actually scans the
 replacement string to determine what to generate.
 
    At both levels, `awk' looks for a defined set of characters that can
 come after a backslash.  At the lexical level, it looks for the escape
 sequences listed in the section Escape Sequences.  Thus, for every `\' that
 `awk' will process at the run-time level, you type two `\'s at the
 lexical level.  When a character that is not valid for an escape
 sequence follows the `\', Unix `awk' and `gawk' both simply remove the
 initial `\', and put the following character into the string. Thus, for
 example, `"a\qb"' is treated as `"aqb"'.
 
    At the run-time level, the various functions handle sequences of `\'
 and `&' differently.  The situation is (sadly) somewhat complex.
 
    Historically, the `sub' and `gsub' functions treated the two
 character sequence `\&' specially; this sequence was replaced in the
 generated text with a single `&'.  Any other `\' within the REPLACEMENT
 string that did not precede an `&' was passed through unchanged.  To
 illustrate with a table:
 
       You type         `sub' sees          `sub' generates
       --------         ----------          ---------------
           `\&'              `&'            the matched text
          `\\&'             `\&'            a literal `&'
         `\\\&'             `\&'            a literal `&'
        `\\\\&'            `\\&'            a literal `\&'
       `\\\\\&'            `\\&'            a literal `\&'
      `\\\\\\&'           `\\\&'            a literal `\\&'
          `\\q'             `\q'            a literal `\q'
 
 This table shows both the lexical level processing, where an odd number
 of backslashes becomes an even number at the run-time level, and the
 run-time processing done by `sub'.  (For the sake of simplicity, the
 rest of the tables below only show the case of even numbers of `\'s
 entered at the lexical level.)
 
    The problem with the historical approach is that there is no way to
 get a literal `\' followed by the matched text.
 
    The 1992 POSIX standard attempted to fix this problem. The standard
 says that `sub' and `gsub' look for either a `\' or an `&' after the
 `\'. If either one follows a `\', that character is output literally.
 The interpretation of `\' and `&' then becomes like this:
 
       You type         `sub' sees          `sub' generates
       --------         ----------          ---------------
            `&'              `&'            the matched text
          `\\&'             `\&'            a literal `&'
        `\\\\&'            `\\&'            a literal `\', then the matched text
      `\\\\\\&'           `\\\&'            a literal `\&'
 
 This would appear to solve the problem.  Unfortunately, the phrasing of
 the standard is unusual. It says, in effect, that `\' turns off the
 special meaning of any following character, but that for anything other
 than `\' and `&', such special meaning is undefined.  This wording
 leads to two problems.
 
   1. Backslashes must now be doubled in the REPLACEMENT string, breaking
      historical `awk' programs.
 
   2. To make sure that an `awk' program is portable, _every_ character
      in the REPLACEMENT string must be preceded with a backslash.(1)
 
    The POSIX standard is under revision.(2) Because of the above
 problems, proposed text for the revised standard reverts to rules that
 correspond more closely to the original existing practice. The proposed
 rules have special cases that make it possible to produce a `\'
 preceding the matched text.
 
       You type         `sub' sees         `sub' generates
       --------         ----------         ---------------
      `\\\\\\&'           `\\\&'            a literal `\&'
        `\\\\&'            `\\&'            a literal `\', followed by the matched text
          `\\&'             `\&'            a literal `&'
          `\\q'             `\q'            a literal `\q'
 
    In a nutshell, at the run-time level, there are now three special
 sequences of characters, `\\\&', `\\&' and `\&', whereas historically,
 there was only one.  However, as in the historical case, any `\' that
 is not part of one of these three sequences is not special, and appears
 in the output literally.
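
    Two cases read the same way in every table above: a bare `&' gives
 the matched text, and `\\&' typed in a string constant gives a literal
 `&'.  The sketch below assumes a POSIX `awk'; the shell variable names
 are illustrative.  The longer backslash sequences are left out
 deliberately, since their results depend on which set of rules a given
 `awk' follows:

```shell
# bare & in the replacement stands for the matched text
matched=$(awk 'BEGIN { s = "tea"; sub(/e/, "[&]", s); print s }')
# \\& in the string constant reaches sub as \& and yields a literal &
literal=$(awk 'BEGIN { s = "tea"; sub(/e/, "[\\&]", s); print s }')
echo "$matched $literal"
# prints: t[e]a t[&]a
```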
 
    `gawk' 3.0 follows these proposed POSIX rules for `sub' and `gsub'.
 Whether these proposed rules will actually become codified into the
 standard is unknown at this point. Subsequent `gawk' releases will
 track the standard and implement whatever the final version specifies;
 this Info file will be updated as well.
 
    The rules for `gensub' are considerably simpler. At the run-time
 level, whenever `gawk' sees a `\', if the following character is a
 digit, then the text that matched the corresponding parenthesized
 subexpression is placed in the generated output.  Otherwise, no matter
 what the character after the `\' is, that character will appear in the
 generated text, and the `\' will not.
 
        You type          `gensub' sees         `gensub' generates
        --------          -------------         ------------------
            `&'                    `&'            the matched text
          `\\&'                   `\&'            a literal `&'
         `\\\\'                   `\\'            a literal `\'
        `\\\\&'                  `\\&'            a literal `\', then the matched text
      `\\\\\\&'                 `\\\&'            a literal `\&'
          `\\q'                   `\q'            a literal `q'
 
    Because of the complexity of the lexical and run-time level
 processing, and the special cases for `sub' and `gsub', we recommend
 using `gawk' and `gensub' whenever you have to do substitutions.
 
    ---------- Footnotes ----------
 
    (1) This consequence was certainly unintended.
 
    (2) This was the status as of July 2000; final approval and
 publication as part of the Austin Group Standards was hoped for
 sometime in 2001.
 