Prig: like AWK, but uses Go for “scripting”

February 2022

Summary: The article describes Prig, my AWK-like tool that uses Go as the scripting language. I compare Prig with AWK, then dive into how Prig works, and finally look briefly at Prig’s Sort and SortMap builtins (which use Go’s new generics if Go 1.18 is available).

In a recent Hacker News comment I learned about rp, a little text processing tool by Charles Blake that is kind of like AWK, but uses Nim as the scripting language. This works because the Nim compiler is fast and the Nim language is terse, so you can use it for one-off scripts.

Go has one of those two things going for it: fast build times. It’s not exactly terse, which makes it less than ideal for one-liner scripts. On the other hand, it’s not terrible, either: Prig scripts are about twice as many characters as their AWK versions.

Charles suggested that languages with fast compile speeds, like Nim and Go, are ideal for this kind of tool. The code to do it is almost trivial: Prig is about 200 lines of straightforward Go code that inserts the user’s command-line “script” into a Go source code template, compiles that, and then runs the resulting executable.

On my Linux machine, go build can build a program that only uses the standard library in about 200 milliseconds, so the startup time is very reasonable – Go can compile and run the program almost before you’ve released the Enter key. For comparison, Nim gives rp a startup time of about 1.4 seconds on my system (0.8 seconds if using the tcc backend).

So I decided to build an equivalent of rp in Go, and ended up with Prig, which is of course for Processing Records In Go. You could say that Prig is like AWK, but snobbish – it looks down its nose at dynamic typing.

Prig compared to AWK

First, if you haven’t used AWK before, here it is in one paragraph. AWK is a language interpreter for processing input line-by-line. First it runs an optional BEGIN block. Then it runs a pattern { action } block for each line of input: if the line matches the pattern, AWK runs the action. If you don’t specify a pattern, every line matches; if you don’t specify an action, the default action is to print the line. After processing input it runs an optional END block. We’ll see some examples soon.
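
To make that control flow concrete, here is a rough Go sketch of the BEGIN / pattern-action / END model. It only illustrates the order in which things run; the rule type and the run and demo functions are names invented for this sketch, not part of AWK or Prig.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// rule is a hypothetical stand-in for one AWK "pattern { action }" pair.
// A nil pattern matches every line; a nil action prints the line.
type rule struct {
	pattern func(line string) bool
	action  func(line string, out io.Writer)
}

// run mimics AWK's control flow: begin, then every rule for each input
// line, then end. It is a sketch of the model, not of a real interpreter.
func run(in io.Reader, out io.Writer, begin func(io.Writer), rules []rule, end func(io.Writer)) error {
	if begin != nil {
		begin(out)
	}
	scanner := bufio.NewScanner(in)
	for scanner.Scan() {
		line := scanner.Text()
		for _, r := range rules {
			if r.pattern != nil && !r.pattern(line) {
				continue // pattern did not match, skip the action
			}
			if r.action == nil {
				fmt.Fprintln(out, line) // default action: print the line
				continue
			}
			r.action(line, out)
		}
	}
	if end != nil {
		end(out)
	}
	return scanner.Err()
}

// demo prints lines containing "GET", then "done" once input is exhausted.
func demo(input string) string {
	var out strings.Builder
	rules := []rule{{pattern: func(l string) bool { return strings.Contains(l, "GET") }}}
	end := func(w io.Writer) { fmt.Fprintln(w, "done") }
	if err := run(strings.NewReader(input), &out, nil, rules, end); err != nil {
		return "error: " + err.Error()
	}
	return out.String()
}

func main() {
	fmt.Print(demo("GET /robots.txt HTTP/1.1\nHEAD /README.md HTTP/1.1\n"))
}
```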

So what does Prig look like, and how does that compare to AWK? Let’s look at a few example scripts. Say you have a log file containing HTTP request lines, like so:

$ cat logs.txt
GET /robots.txt HTTP/1.1
HEAD /README.md HTTP/1.1
GET /wp-admin/ HTTP/1.0

You want to pull out the second field (the relative URL) and for each request, print the full URL for your site. Here’s how to do it with Prig:

$ prig 'Println("https://example.com" + S(2))' <logs.txt
https://example.com/robots.txt
https://example.com/README.md
https://example.com/wp-admin/

The Println function is just Go’s fmt.Println, but using a buffered writer for efficiency. It’s equivalent to AWK’s print statement. The S(i) function returns field i as a string, so S(2) returns the second field, like $2 in AWK. The rest of the semantics are just regular Go semantics.

In AWK, the same script would look like this:

$ awk '{ print "https://example.com" $2 }' <logs.txt
https://example.com/robots.txt
...

Just 3 characters shorter – not bad so far.

Here’s where things start to get worse for Go. Below is a script, shown in both Prig and AWK variants, that prints the average value of the last field, by summing the field and then dividing by the number of records at the end:

$ cat average.txt 
a b 400
c d 200
e f 200
g h 200

$ prig -b 's := 0.0' 's += F(NF())' -e 'Println(s / float64(NR()))' \
  <average.txt
250

$ awk '{ s += $NF } END { print s / NR }' <average.txt
250

The script is 60 characters for Prig, 35 for AWK – almost twice the length. Go, like many statically typed languages, is at a disadvantage here. First we have to initialize our sum variable to 0; in AWK that’s implicit.

Then we have the extra parentheses in F(NF()) compared to AWK’s cleaner $NF. I made a design decision early on to make all Prig builtins functions – initially I had NF and NR as variables, but making them all functions means the code can split into fields lazily, only as needed (some simple scripts never use the fields at all).

Then there’s the float64() conversion, which, along with the parentheses for NR() and Println(), means Prig ends up looking a bit like Lisp in some cases. AWK’s print s / NR is definitely easier on the eye!

Our third example prints the third field of each line multiplied by 1000 (that is, in milliseconds) if the input line contains either of the strings GET or HEAD. Here’s that Prig script compared to its AWK equivalent:

$ cat millis.txt 
1 GET 3.14159
2 HEAD 4.0
3 GET 1.0

$ prig 'if Match(`GET|HEAD`, S(0)) { Printf("%.0fms\n", F(3)*1000) }' \
  <millis.txt
3142ms
4000ms
1000ms

$ awk '/GET|HEAD/ { printf "%.0fms\n", $3*1000 }' <millis.txt
3142ms
4000ms
1000ms

That’s 62 characters in Prig, 43 in AWK – not bad. The main difference here is the AWK /regex/ shortcut. I thought about adding a special case for this in Prig, but I decided on simple, consistent Go over shortcuts – so in Prig you have to write the if and Match explicitly.
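
The article doesn’t show how Match is implemented. One plausible sketch, assuming it compiles the pattern on first use and caches it so the regex isn’t recompiled for every record, might look like this (the regexCache variable and the function body are assumptions; only the call shape Match(pattern, string) comes from the example above):

```go
package main

import (
	"fmt"
	"regexp"
)

// regexCache is an assumption of this sketch: compiled patterns are kept
// in a map so each pattern is compiled once, not once per input record.
var regexCache = map[string]*regexp.Regexp{}

// Match reports whether s contains a match of the regular expression
// pattern (Go's RE2 syntax). The argument order follows the example in
// the article; the body is a guess. An invalid pattern panics, because
// regexp.MustCompile panics on a bad expression.
func Match(pattern, s string) bool {
	re, ok := regexCache[pattern]
	if !ok {
		re = regexp.MustCompile(pattern)
		regexCache[pattern] = re
	}
	return re.MatchString(s)
}

func main() {
	for _, line := range []string{"1 GET 3.14159", "4 POST 2.0"} {
		fmt.Println(line, Match(`GET|HEAD`, line))
	}
}
```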

Now a longer example. This is a script that counts the frequencies of unique words in the input and then prints the words and their counts, most frequent first.

$ cat words.txt 
The foo barfs
foo the the the

$ prig -b 'freqs := map[string]int{}' \
       'for i := 1; i <= NF(); i++ { freqs[strings.ToLower(S(i))]++ }' \
       -e 'for _, f := range SortMap(freqs, ByValue, Reverse) { ' \
       -e 'Println(f.K, f.V) }' \
       <words.txt 
the 4
foo 2
barfs 1

$ awk '{ for (i = 1; i <= NF; i++) freqs[tolower($i)]++ }
      END { for (k in freqs) print k, freqs[k] | "sort -nr -k2,1" }' \
      <words.txt

That’s quite a mouthful, particularly in Prig. First we initialize a map of frequencies, keyed by word (again, that’s implicit in AWK). The per-record code is very similar, albeit a bit more verbose in Go with the strings package prefix.

The sorting is done quite differently between the two: in Prig, I’ve defined two sorting functions, Sort, which takes a slice of ints, floats, or strings and returns a new sorted slice, and SortMap, which returns a sorted slice of key-value pairs in the map (optionally sorted by value, and optionally sorted in reverse order).

POSIX AWK doesn’t have built-in sorting (only Gawk does), so we use AWK’s pipe redirect syntax to send it through the sort utility. We could have used the same technique with Prig using a shell pipeline, but this shows how to use the SortMap function.

For most examples, AWK is definitely clearer and less verbose – there was a reason Aho, Weinberger, and Kernighan designed a new language for AWK instead of using C (or similar) as the base language.

On the other hand, if you know Go well and don’t know AWK, Prig might be useful for you. It’s also significantly faster, because Go compiles to optimized machine code, whereas AWK is interpreted.

Some brief performance numbers: for the “count word frequencies” example shown above, Prig is about three times as fast as AWK (using Gawk): Prig counts a 43MB file in 1.1 seconds, Gawk in 3.1 seconds. Of course, at this point we’re really comparing Go with Gawk (see much more detail in this performance comparison).

For a CPU-bound task like adding numbers together, Go is of course much faster, about 20 times faster in this example (and remember that Go takes 200 of those 274 milliseconds to compile):

$ time gawk 'BEGIN { for (i=0; i<100000000; i++) s+=i; print s }'
4999999950000000

real    0m5.698s
...
$ time ./prig -b 's:=0; for i:=0; i<100000000; i++ { s+=i }; Println(s)'
4999999950000000

real    0m0.274s
...

Resulting Go program

The prig.go code itself is trivial: about 200 lines of Go code, about a third of which is to parse command line arguments. The rest just puts your script in a Go source template, runs go build to compile it, and then executes the result.

The basic structure of the resulting Go program is just what you’d expect: some setup code, the “begin” code, a bufio.Scanner loop over the lines with the “per-record” code, and then the “end” code. There are also the Prig built-in functions.

You can view the resulting Go source code with prig -s. Below is the “average value of last field” example from above. It’s not quite verbatim; I’ve elided unused parts for brevity:

$ prig -s -b 's := 0.0' 's += F(NF())' -e 'Println(s / float64(NR()))'
// ... package and import ...
var (
    _output *bufio.Writer
    _record string
    _nr     int
    _fields []string
)

func main() {
    _output = bufio.NewWriter(os.Stdout)
    defer _output.Flush()

    // begin
    s := 0.0

    _scanner := bufio.NewScanner(os.Stdin)
    for _scanner.Scan() {
        _record = _scanner.Text()
        _nr++
        _fields = nil

        // per-record
        s += F(NF())
    }
    if _scanner.Err() != nil {
        _errorf("error reading stdin: %v", _scanner.Err())
    }

    // end
    Println(s / float64(NR()))
}

func Println(args ...interface{}) {
    _, err := fmt.Fprintln(_output, args...)
    if err != nil {
        _errorf("error writing output: %v", err)
    }
}

func NR() int {
    return _nr
}

func S(i int) string {
    if i == 0 {
        return _record
    }
    _ensureFields()
    if i < 1 || i > len(_fields) {
        return ""
    }
    return _fields[i-1]
}

func F(i int) float64 {
    s := S(i)
    f, _ := strconv.ParseFloat(s, 64)
    return f
}

func _ensureFields() {
    if _fields != nil {
        return
    }
    _fields = strings.Fields(_record)
}

func NF() int {
    _ensureFields()
    return len(_fields)
}
// ... other Prig builtin functions ...

Note how I’ve prefixed Prig internal names with underscore to avoid name clashes with variables the Prig user defines. Far from foolproof, but good enough for this use case.

The main loop is basically how you’d write the code manually in Go (though you’d probably use local variables instead of globals). However, in typical Go you’d likely write the F(NF()) inline along with bounds checks, something like this inside the main loop:

if len(fields) > 0 {
    last := fields[len(fields)-1]
    f, err := strconv.ParseFloat(last, 64)
    if err == nil {
        s += f
    }
}

In this context it’s nice to have Prig’s F() do the bounds checking for you: s += F(NF()) is a lot simpler than that 7-line chunk of verbosity. Go is verbose, but Go with a few well-placed helper functions can be very succinct!

Fun with testing

Prig’s tests (in prig_test.go) are a bit unconventional in that they just run the prig binary. Some developers would balk at this, but it keeps Prig a bit simpler. The main tests are “table-driven tests”, a staple of Go testing that you can read about elsewhere.

Due to the go build cycle, each test is relatively slow (on the order of 200 milliseconds), but the test suite still runs in 7-8 seconds on my system. It’s a lot slower on Windows, where starting a new process is much heavier.

However, one of the neat things I did was to test the examples shown in prig --help. In writing the Prig usage message, I kept making small typos in the examples, and had to keep copying and pasting them into my terminal to test them manually.

At some point I thought, why don’t I test these examples automatically using go test? So I extracted the command line examples to separate strings that are tested in TestExamples. I use an ad-hoc little parser to turn each example command line into an argument list, and then call prig on the result.
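
A parser like that can be very small. The sketch below is not the one from prig_test.go; splitArgs is a hypothetical helper that assumes the examples only ever use whitespace and single quotes.

```go
package main

import (
	"fmt"
	"strings"
)

// splitArgs is a hypothetical helper: it splits a command line into
// arguments on whitespace, treating text inside single quotes as literal.
// It ignores everything else a real shell does (backslash escapes, double
// quotes, variables), which is enough for simple quoted one-liners.
func splitArgs(line string) []string {
	var args []string
	var cur strings.Builder
	inQuote := false // currently inside '...'
	inArg := false   // cur holds an argument (possibly empty, like '')
	for _, r := range line {
		switch {
		case r == '\'':
			inQuote = !inQuote
			inArg = true
		case !inQuote && (r == ' ' || r == '\t' || r == '\n'):
			if inArg {
				args = append(args, cur.String())
				cur.Reset()
				inArg = false
			}
		default:
			cur.WriteRune(r)
			inArg = true
		}
	}
	if inArg {
		args = append(args, cur.String())
	}
	return args
}

func main() {
	example := `prig -b 's := 0.0' 's += F(NF())' -e 'Println(s / float64(NR()))'`
	for i, arg := range splitArgs(example) {
		fmt.Printf("%d: %q\n", i, arg)
	}
}
```

The resulting slice (minus the leading program name) can then be handed to exec.Command, so the test runs exactly what the usage message tells users to type.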

This is similar to Go’s excellent testable examples, but for command-line examples instead of Go code examples.

Experimenting with generics

One of the more difficult parts of Prig to design was the sorting helpers, and I’m still not at all sure I got them right. API design is an area where programming seems more like art than science.

In any case, I ended up with two functions that are useful on the data types I think you’d use with Prig. Here’s what the rather terse usage message says:

Sort[T int|float64|string](s []T) []T
  // return new sorted slice; also Sort(s, Reverse) to sort descending
SortMap[T int|float64|string](m map[string]T) []KV[T]
  // return sorted slice of key-value pairs
  // also SortMap(m[, Reverse][, ByValue]) to sort descending or by value

On Go 1.18 (which should be released very soon), these make use of the new generics feature, so they’re type-checked and return a concrete slice type. Because of the optional parameters, the actual Go signatures (and the KV type) are defined as follows:

type _sortOption int

const (
    Reverse _sortOption = iota
    ByValue
)

func Sort[T int|float64|string](s []T, options ..._sortOption) []T {
    // ... implementation ...
}

type KV[T int|float64|string] struct {
    K string
    V T
}

func SortMap[T int|float64|string](m map[string]T,
        options ..._sortOption) []KV[T] {
    // ... implementation ...
}

Sort is simple enough: it takes a slice and returns a new sorted slice. It’s sorted from low to high by default, or from high to low if you pass the Reverse option. I could have used a broader type set than just int, float64, and string, but this keeps it simple for Prig (and for the non-generic version that we’ll look at below).
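
The implementation is elided above, so here is one way the body could be filled in (requires Go 1.18 or later). The signature and option constants are the ones shown in the article; the body, built on sort.Slice, is only a guess at a reasonable implementation.

```go
package main

import (
	"fmt"
	"sort"
)

type _sortOption int

const (
	Reverse _sortOption = iota
	ByValue
)

// Sort returns a sorted copy of s, leaving s itself untouched. The
// signature is from the article; this body is an assumption.
func Sort[T int | float64 | string](s []T, options ..._sortOption) []T {
	reverse := false
	for _, o := range options {
		if o == Reverse {
			reverse = true
		}
	}
	sorted := make([]T, len(s))
	copy(sorted, s)
	// All three types in the constraint support < and >, so the
	// comparison compiles for any permitted T.
	sort.Slice(sorted, func(i, j int) bool {
		if reverse {
			return sorted[i] > sorted[j]
		}
		return sorted[i] < sorted[j]
	})
	return sorted
}

func main() {
	words := []string{"pear", "apple", "fig"}
	fmt.Println(Sort(words))                   // [apple fig pear]
	fmt.Println(Sort([]int{3, 1, 2}, Reverse)) // [3 2 1]
	fmt.Println(words)                         // [pear apple fig]
}
```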

SortMap was a bit trickier to design an API for. You can’t sort a Go map directly, so you need to convert it to a slice of key-value pairs: that’s the KV type. You can sort by key (the default), or by value if you pass the ByValue option.
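
A possible body for SortMap, again for Go 1.18 or later. The types and signature are from the article; everything inside the function is an assumption, including the choice to break ties between equal values by key so that the output order is deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

type _sortOption int

const (
	Reverse _sortOption = iota
	ByValue
)

type KV[T int | float64 | string] struct {
	K string
	V T
}

// SortMap copies m into a slice of key-value pairs and sorts it by key,
// or by value if ByValue is given. This body is a sketch, not Prig's code.
func SortMap[T int | float64 | string](m map[string]T, options ..._sortOption) []KV[T] {
	reverse, byValue := false, false
	for _, o := range options {
		switch o {
		case Reverse:
			reverse = true
		case ByValue:
			byValue = true
		}
	}
	kvs := make([]KV[T], 0, len(m))
	for k, v := range m {
		kvs = append(kvs, KV[T]{K: k, V: v})
	}
	sort.Slice(kvs, func(i, j int) bool {
		a, b := kvs[i], kvs[j]
		if reverse {
			a, b = b, a // swapping the operands reverses the order
		}
		if byValue && a.V != b.V {
			return a.V < b.V
		}
		return a.K < b.K // sort by key, or tie-break equal values by key
	})
	return kvs
}

func main() {
	freqs := map[string]int{"the": 4, "foo": 2, "barfs": 1}
	for _, f := range SortMap(freqs, ByValue, Reverse) {
		fmt.Println(f.K, f.V)
	}
}
```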

All this works okay, and my very limited experience with Go 1.18’s generics was a success.

But what about most of us, who are still using pre-1.18 versions of Go without support for generics? Well, I made the same API work without generics … kind of. The non-generic version uses interface{}, so it’s not type safe, of course. And it only works at all without type conversions because you’re often just printing the results; the Print family of functions already takes any type of argument (via interface{}).

So the word-count example code works just as well on Go 1.18 (with generics) and Go 1.17 (without them):

for _, f := range SortMap(freqs, ByValue, Reverse) {
    Println(f.K, f.V)
}

Prig detects the Go version you have installed by running go version, and uses the non-generic version if it’s 1.17 or below. Here’s how the non-generic versions of Sort and SortMap are defined:

func Sort(s interface{}, options ..._sortOption) []interface{} {
    // ... implementation ...
}

type KV struct {
    K string
    V interface{}
}

func SortMap(m interface{}, options ..._sortOption) []KV {
    // ... implementation ...
}
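
To show what the interface{} version gives up, here is a guess at how the non-generic Sort body might look. The signature matches the one above; the type switch and the panic on unsupported types are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

type _sortOption int

const (
	Reverse _sortOption = iota
	ByValue
)

// Sort accepts []int, []float64, or []string and panics at run time on
// anything else, since the compiler can no longer check the element type.
func Sort(s interface{}, options ..._sortOption) []interface{} {
	reverse := false
	for _, o := range options {
		if o == Reverse {
			reverse = true
		}
	}
	var result []interface{}
	var less func(a, b interface{}) bool
	switch s := s.(type) {
	case []int:
		for _, v := range s {
			result = append(result, v)
		}
		less = func(a, b interface{}) bool { return a.(int) < b.(int) }
	case []float64:
		for _, v := range s {
			result = append(result, v)
		}
		less = func(a, b interface{}) bool { return a.(float64) < b.(float64) }
	case []string:
		for _, v := range s {
			result = append(result, v)
		}
		less = func(a, b interface{}) bool { return a.(string) < b.(string) }
	default:
		panic(fmt.Sprintf("Sort: unsupported type %T", s))
	}
	sort.Slice(result, func(i, j int) bool {
		if reverse {
			return less(result[j], result[i])
		}
		return less(result[i], result[j])
	})
	return result
}

func main() {
	// Printing works without conversions because Println takes interface{}.
	for _, v := range Sort([]string{"pear", "apple", "fig"}, Reverse) {
		fmt.Println(v)
	}
}
```

Anything beyond printing, such as adding up the sorted values, would need a type assertion like v.(int) on each element.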

Crazy? Probably. Most libraries could never get away with this kind of switcheroo, because the APIs just aren’t compatible for a lot of tasks. But for an experiment in Prig, it seems to work pretty well.
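
The version check that drives the switch could be sketched like this. The article only says that Prig runs go version; the regular expression, the supportsGenerics name, and the error handling below are assumptions.

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
	"strconv"
)

// versionRegex pulls the major and minor numbers out of output such as
// "go version go1.17.6 linux/amd64".
var versionRegex = regexp.MustCompile(`\bgo(\d+)\.(\d+)`)

// supportsGenerics reports whether "go version" output describes Go 1.18
// or later. Output without a recognizable release number is an error.
func supportsGenerics(versionOutput string) (bool, error) {
	m := versionRegex.FindStringSubmatch(versionOutput)
	if m == nil {
		return false, fmt.Errorf("unrecognized version output: %q", versionOutput)
	}
	major, _ := strconv.Atoi(m[1]) // the regex guarantees digits
	minor, _ := strconv.Atoi(m[2])
	return major > 1 || (major == 1 && minor >= 18), nil
}

func main() {
	out, err := exec.Command("go", "version").Output()
	if err != nil {
		fmt.Println("error running go version:", err)
		return
	}
	ok, err := supportsGenerics(string(out))
	fmt.Println(ok, err)
}
```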

Conclusion: was it worth it?

I’m unashamedly a nerd at heart, so yes, I had fun building Prig (mostly on a flight from Christchurch to Frankfurt). I like how simple the code is: about 200 lines of Go code, 300 lines of template code … and 400 lines of tests. Go and its standard library are doing all the hard work!

Would I use Prig for real? Possibly, if I’m processing large files and need a bit more performance than AWK can give me. I might also use it just for testing tiny snippets of Go code – for example, “How do Printf widths work again? Ah yes, let’s try it with prig”:

$ prig -b 'Printf("%3.5s\n", "hi")'
 hi
$ prig -b 'Printf("%3.5s\n", "hello world")'
hello

Should you use Prig? I’m not going to stop you! But to be honest, you’re probably better off learning the ubiquitous (and significantly terser) AWK language. It’s a brilliant, 45-year-old tool that’s still quite widely used for text and data processing in 2022. The original book by A, W, and K called The AWK Programming Language is really good.

You might also use Prig if you need an executable for some data processing, for example in a lightweight container that doesn’t have awk installed. For cases like this, you can use prig -s to print the source, go build the result, and copy the executable to the target – no other dependencies needed.

If you want to integrate AWK into your Go programs, or just want to learn how an AWK interpreter works, check out my GoAWK project.

I’d love to hear your feedback about Prig: if you have any ideas for improvement, or if you make an rp or Prig variant in another language, do say hello!
