AWK is a language specialized for manipulating textual data and is a standard part of the text processing utility suite of all Unix-like systems. The language first appeared in the 1970s and evolved significantly over the following decade.

Various implementations of AWK have been created, differing in language capabilities as well as in program size, speed, and method of program translation (interpretation, compilation, translation to C or to other languages). Besides Unix systems, AWK is also available for most other operating systems.

The most advanced dialect, commonly present by default in GNU/Linux distributions, is Gawk (GNU awk), which supports many features (including TCP/IP networking) not found in the other variants. Mawk, on the other hand, is known for the high performance of its interpreter. Most AWK implementations are open source, now including the original one.

The creation of AWK was probably inspired by the earlier text processing languages Snobol and sed, especially the latter. Sed (another classic Unix tool: not a complete programming language but rather an advanced text substitution processor) makes extensive use of regular expressions and is line-oriented. The same is characteristic of AWK.

AWK exemplifies a matching model of computation. It repeatedly reads records from an input text file and carries out actions on each record, depending on which of a set of patterns the record's contents match. A program is a series of rules: pairs of the form

    pattern { action }

and definitions of (user) functions if such are needed. A pattern is an expression or a pair of expressions, and an action is one or more statements, such as assignments, function calls, and built-in commands.

Patterns are applied in turn to the current record, and for each one that matches, the corresponding action is taken. In the case of a single expression, matching means that the expression evaluates to true. Pairs of expressions specify ranges of applicability (each starting from a record for which the first expression is true and ending at one for which the second expression is true). One particularly useful kind of pattern expression involves regular expressions matched against the whole current record or parts of it.
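The three kinds of pattern just described can be sketched as follows. This is only a minimal illustration, assuming a POSIX shell and a POSIX-conforming awk; the sample input and the shell variable `rules_out` are invented for the example.

```shell
# Sample input and the variable name rules_out are hypothetical.
rules_out=$(printf '%s\n' 'alpha 1' 'START' 'beta 20' 'STOP' 'gamma 300' |
awk '
    $2 > 10         { print "big", $0 }    # expression pattern: true when field 2 exceeds 10
    /^[a-z]+ /      { print "word", $1 }   # regular expression matched against the whole record
    /START/, /STOP/ { print "range", $0 }  # range pattern: from a START record to a STOP record
')
printf '%s\n' "$rules_out"
```

Every rule is tried on every record, so one record can trigger several actions: here the record `beta 20` satisfies all three patterns.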

Two special ‘patterns’, named BEGIN and END, designate actions that should be executed before and after the actual processing of the input file, respectively.
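A small sketch of BEGIN and END at work, assuming a POSIX shell and a POSIX-conforming awk; the numbers and the names `sum_out` and `total` are made up for illustration.

```shell
# Sum the first field of every record; the input values are hypothetical.
sum_out=$(printf '%s\n' 3 4 5 | awk '
    BEGIN { print "summing" }         # runs once, before any input is read
          { total += $1 }             # no pattern: runs for every record
    END   { print "total:", total }   # runs once, after the last record
')
printf '%s\n' "$sum_out"
```

END is the natural place for reporting anything accumulated while the records were being read.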

Often, it is useful to view an input record as a sequence of substrings. To support this mode of operation, AWK automatically breaks each input record down into fields, which can then be referenced and processed individually.
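As a brief sketch of field access (assuming a POSIX shell and a POSIX-conforming awk, with invented sample records and an invented variable name `fields_out`), the fields of each record can be picked out and rearranged:

```shell
# Each record has three whitespace-separated fields; the data is hypothetical.
fields_out=$(printf '%s\n' 'Ada Lovelace 1815' 'Alan Turing 1912' |
awk '{ print $2 ", " $1 " (" $3 ")" }')   # reorder fields 1 and 2, append field 3
printf '%s\n' "$fields_out"
```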

AWK is dynamically typed, with dynamic data structures and automatic allocation and reclamation of storage for them. In fact, all simple values are text strings that are interpreted as numbers when the context, such as arithmetic, calls for it (“context typing”). AWK's set of operators and control statements is very much like that of C. In addition, AWK features associative arrays (ones indexed by strings), a for (… in …) statement specialized for iterating over such arrays, and a number of built-in functions for string and other processing.
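Both context typing and associative arrays can be illustrated with a short sketch, assuming a POSIX shell and a POSIX-conforming awk. The variable names and sample words are invented; the output of the counting program is piped through `sort` because the order in which for (… in …) visits the indices is not specified.

```shell
# The same value "3" acts as a number under + and as a string under concatenation.
typing_out=$(awk 'BEGIN { x = "3"; print x + 4; print x 4 }')

# Count word occurrences in an array indexed by the words themselves (hypothetical data).
count_out=$(printf '%s\n' 'a b a' 'c a b' | awk '
    { for (i = 1; i <= NF; i++) count[$i]++ }    # array elements come into being on first use
    END { for (w in count) print w, count[w] }   # iterate over the indices
' | sort)

printf '%s\n' "$typing_out" "$count_out"
```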

AWK programs make use of a number of special variables. Some of them carry values reflecting the current reading context, e.g. FILENAME (current file), NR (number of records read so far, counted across all input files), FNR (number of records read so far from the current file), NF (number of fields in the current record), $0 (current record), and $1, $2, … (fields within a record).
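The following sketch prints several of these variables for each record of a single small file. It assumes a POSIX shell with `mktemp` and a POSIX-conforming awk; the file name `sample.txt`, its contents, and the shell variables are invented for the example.

```shell
# Create a throwaway input file (name and contents are hypothetical).
dir=$(mktemp -d)
printf '%s\n' 'one two three' 'four five' > "$dir/sample.txt"

# $NF is the field whose number is NF, that is, the last field of the record.
vars_out=$(cd "$dir" && awk '
    { print FILENAME ": record " NR " has " NF " fields, the last being " $NF }
' sample.txt)

rm -rf "$dir"
printf '%s\n' "$vars_out"
```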

Other variables are given predefined values but can also be assigned explicitly. These include RS and FS (input record and field separators), which tell AWK how to break down a file into records and a record into fields, and ORS and OFS (output record and field separators), which determine how printed records and fields are delimited. Together they define the framework for the data processing carried out by AWK. By default, records are separated by newline characters, and fields by whitespace.
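A minimal sketch of reassigning all four separators, assuming a POSIX shell and a POSIX-conforming awk; the input string and the chosen separator values are arbitrary examples.

```shell
# Input uses ";" between records and "=" between fields (hypothetical data).
sep_out=$(printf 'a=1;b=2;c=3' | awk '
    BEGIN { RS = ";"; FS = "="; OFS = " -> "; ORS = ";\n" }   # set before any input is read
    { print $1, $2 }   # the comma in print inserts OFS; ORS ends each printed record
')
printf '%s\n' "$sep_out"
```

The separators are set in a BEGIN action so that they are already in force when the first record is read.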

Defaults in AWK are not limited to these system variables. Arguments to commands (whether data values or reference points within the input text) and even actions can be omitted, with natural assumptions taking their place: a rule without an action, for instance, prints the matching record.
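A few such defaults are sketched below, assuming a POSIX shell and a POSIX-conforming awk; the sample lines and shell variable names are invented for illustration.

```shell
# Hypothetical sample input.
input=$(printf '%s\n' 'ok: start' 'error: disk full' 'ok: done')

# Omitted action: the matching record is printed.
no_action=$(printf '%s\n' "$input" | awk '/^ok/')

# Omitted argument: length with no argument is the length of the current record.
no_argument=$(printf '%s\n' "$input" | awk 'length > 9')

# Omitted target: sub() edits the current record, and a bare print outputs it.
default_target=$(printf '%s\n' "$input" | awk '{ sub(/^ok/, "fine"); print }')

printf '%s\n' "$no_action" "$no_argument" "$default_target"
```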

Solving problems with AWK reveals several shortcomings or annoyances in the language. For example, its built-in functions for text matching and replacement make use of regular expressions but lack support for captures: individually identifiable substrings of a matched string that correspond to parts of the regular expression. Gawk remedies this (its match() function accepts an optional array that receives the captured substrings, and its gensub() function supports backreferences in the replacement), and it is fortunately the most widely used version of the language.
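Without captures, a part of a match has to be cut out by hand. The sketch below shows one common workaround using only standard features, namely match() together with the RSTART and RLENGTH variables it sets; it assumes a POSIX shell and a POSIX-conforming awk, and the `id=` record layout is an invented example. Gawk's optional third argument to match() would deliver such a substring directly, but that extension is deliberately not used here.

```shell
# Extract the digits following "id=" (hypothetical record format).
capture_out=$(printf '%s\n' 'user=alice id=42 shell=sh' 'user=bob shell=zsh' |
awk '
    match($0, /id=[0-9]+/) {
        # RSTART and RLENGTH describe the whole match; skip the 3 characters of "id=" by hand
        print substr($0, RSTART + 3, RLENGTH - 3)
    }
')
printf '%s\n' "$capture_out"
```

The offsets have to be kept in step with the regular expression manually, which is exactly the bookkeeping that captures would remove.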

There is no way to apply the same rule (or the set of rules in a program) for matching and transforming an input line more than once. If repeated processing is desired, one has to program it explicitly, by defining a function for it and calling that function in the action part of a rule. Being able to apply arbitrarily many rules to a record (thus providing a sort of repetition already), but no rule more than once, leaves an impression of inconsistency: if AWK is viewed as a rule-based transformation language, then its matching/transformation model is half-baked.
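The explicit repetition mentioned above might look like the following sketch, in which a user-defined function keeps applying one substitution until the text stops changing. It assumes a POSIX shell and a POSIX-conforming awk; the function name `strip_pairs` and the task (deleting matched pairs of parentheses, including nested ones) are invented for illustration.

```shell
# Remove "()" pairs repeatedly, so that nested pairs disappear as well (hypothetical task).
repeat_out=$(printf '%s\n' 'a((()))b' 'x(()y' | awk '
    # strip_pairs is a hypothetical helper: it repeats one substitution until nothing changes
    function strip_pairs(s) {
        while (s ~ /\(\)/) {
            gsub(/\(\)/, "", s)   # a single pass removes only the innermost pairs
        }
        return s
    }
    { print strip_pairs($0) }
')
printf '%s\n' "$repeat_out"
```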

It is curious to observe that sed, although in general a much more limited language than AWK, does not have this particular limitation of being non-repetitive.

Another option that would have been useful but is missing in AWK is evaluating strings as portions of the program: something that is present in one form or another in virtually all dynamic, text-representation based languages.

As a minor annoyance, one could note the unusual and potentially confusing method of specifying which variables are local to a user-defined function: such variables are tail-listed with the formal parameters of the function.
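The convention can be seen in the sketch below, which contrasts a function that lists its working variables after the real parameter with one that does not. It assumes a POSIX shell and a POSIX-conforming awk; the function names `sum_to` and `sum_leaky` are invented for the example, and the extra spaces in the parameter list are a customary visual hint rather than syntax.

```shell
# i and total are listed as extra parameters, so they are local to sum_to.
local_out=$(awk '
    function sum_to(n,    i, total) {
        for (i = 1; i <= n; i++) total += i
        return total
    }
    BEGIN { i = 100; s = sum_to(4); print s, i }      # the global i is untouched
')

# Same body, but i and total are not listed, so they are global variables.
global_out=$(awk '
    function sum_leaky(n) {
        for (i = 1; i <= n; i++) total += i
        return total
    }
    BEGIN { i = 100; s = sum_leaky(4); print s, i }   # the global i is overwritten
')

printf '%s\n' "$local_out" "$global_out"
```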

AWK, along with sed, sh (see also bash), and C, was an inspiration for creating Perl, the text processing facilities of which then, in turn, heavily influenced a number of other programming languages. In this sense, AWK is a major progenitor of most of today's languages with support for text processing.

AWK is simple yet sufficiently capable and convenient for solving many problems that deal with textual and symbolic information, up to using it to teach AI. Despite the abundance of languages with text processing facilities nowadays, AWK is still useful and competitive in its area of application. Its small implementation size and immediate usability (no installation needed) are an advantage, making it suitable, for example, for embedded systems, unlike most more modern languages.

Links of Relevance

The Awk Community Portal

Gawk: the GNU awk

The Gawk manual. Also published as a book

POSIX.1-2008, 2013 Edition: awk – pattern scanning and processing language

“The one true AWK” page: implementation of the original AWK

Mawk: a very fast interpreter for AWK

TAWK: a compiler for a dialect of AWK, by Ken Thompson (yes, himself). No longer actively developed