<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[AwkLab]]></title><description><![CDATA[Mastering AWK – scripting, data processing, Unix philosophy]]></description><link>https://awklab.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1769778947818/3990eb17-569d-4bfd-979d-724444faff01.png</url><title>AwkLab</title><link>https://awklab.com</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 22:53:42 GMT</lastBuildDate><atom:link href="https://awklab.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Unix Pipes Under Load: Streaming, Barriers, Backpressure, and Bottlenecks ]]></title><description><![CDATA[1. The Unix pipeline
Every Unix user knows the pipe operator. Typing ls | wc -l, and two independent programs exchange data as if they were designed together. That simplicity reflects a deliberate des]]></description><link>https://awklab.com/unix-pipes-under-load</link><guid isPermaLink="true">https://awklab.com/unix-pipes-under-load</guid><category><![CDATA[unix]]></category><category><![CDATA[Pipeline]]></category><category><![CDATA[Linux]]></category><category><![CDATA[awk]]></category><category><![CDATA[performance]]></category><category><![CDATA[shell]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Thu, 09 Apr 2026 17:30:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/f9e4cacc-03fa-4073-ac8b-379ecff1e73c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>1. The Unix pipeline</h2>
<p>Every Unix user knows the pipe operator. Typing <code>ls | wc -l</code>, and two independent programs exchange data as if they were designed together. That simplicity reflects a deliberate design philosophy that Doug McIlroy articulated at Bell Labs in the 1970s: write programs that do one thing well, and let them communicate through a universal interface to combine them into complex workflows.</p>
<p>Essentially, a Unix pipe allows one program to send its output directly into another program’s input without saving a temporary file to disk. It turns two independent tools into a single data pipeline.</p>
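<p>The difference is easy to see with a trivial pair of commands (the temporary file path below is illustrative):</p>
<pre><code class="language-plaintext"># Two ways to count the entries in /usr: the first materializes the
# listing on disk, the second streams it through a kernel pipe buffer.
ls /usr &gt; /tmp/listing.txt
wc -l &lt; /tmp/listing.txt
rm /tmp/listing.txt

ls /usr | wc -l    # same count, no intermediate file
</code></pre>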
<p>What is less obvious is why pipes work so well for large data. A pipeline processing a multi-gigabyte log file uses roughly the same memory as one processing a kilobyte. The data flows through, it does not accumulate. This streaming behavior is the pipe's defining characteristic — and also its most misunderstood one.</p>
<p>This article examines pipes from that angle: not as a convenience feature, but as a memory-efficient streaming primitive. We will look at how they work, where they excel, and where they quietly fail.</p>
<h2>2. The McIlroy story</h2>
<p>In 1986, Donald Knuth was asked to write a program solving a simple problem: read a file, find the most frequently used words, and print the top results. He produced a literate programming masterpiece — a carefully crafted Pascal program, several pages long, with a custom data structure optimized for the task.</p>
<p>Doug McIlroy reviewed it and responded with six lines of shell:</p>
<pre><code class="language-plaintext">tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | head
</code></pre>
<p>The pipeline reads a file, splits it into words, lowercases them, sorts, counts occurrences, sorts by frequency, and prints the top results. It requires no custom data structures, no memory management, no compilation. It also processes files larger than available RAM without modification — each stage handles only what fits in the pipe buffer at any moment.</p>
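<p>Run on a toy input, the whole pipeline is easy to trace (the sample text is illustrative):</p>
<pre><code class="language-plaintext">printf 'the cat and the dog and the bird\n' |
  tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | head -2
# top of the list: "3 the" followed by "2 and"
</code></pre>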
<p>McIlroy's point was not that shell pipelines are always better than carefully written programs. It was that composition of simple tools can match or exceed purpose-built solutions, at a fraction of the complexity. The memory efficiency was not a design goal — it was a consequence of how pipes work.</p>
<h2>3. How pipes work</h2>
<p>When you type <code>ls | wc -l</code>, the shell creates a pipe before either program starts. A pipe is a kernel-managed buffer — on Linux, 64KB by default — with two ends: a write end and a read end. The shell connects <code>ls</code> stdout to the write end, and <code>wc</code> stdin to the read end, using a system call called <code>dup2</code>. Neither program knows or cares about the pipe — <code>ls</code> writes to what it thinks is standard output, <code>wc</code> reads from what it thinks is standard input.</p>
<p>Behind the scenes, the shell uses three system calls to wire this together. First, <code>pipe()</code> creates the buffer and returns two file descriptors — one for reading, one for writing. Then <code>fork()</code> creates two child processes, both inheriting those file descriptors. Finally, <code>dup2()</code> redirects stdout in the first child to the pipe's write end, and stdin in the second child to the pipe's read end. The original pipe file descriptors are then closed, and each child calls <code>exec()</code> to become <code>ls</code> and <code>wc</code> respectively. From that point, the two programs run independently, connected only through the kernel buffer.</p>
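<p>The same rewiring is visible at the shell level. An <code>exec</code>-style redirection in a subshell re-points file descriptor 1 before the program runs, which is what <code>dup2()</code> does when the shell attaches stdout to a pipe's write end (the file path here is illustrative):</p>
<pre><code class="language-plaintext"># Rewire the subshell's stdout to a file before ls runs; ls itself
# neither knows nor cares where fd 1 points.
( exec &gt; /tmp/dup2-demo.txt; ls /usr )
wc -l &lt; /tmp/dup2-demo.txt    # same count as: ls /usr | wc -l
rm /tmp/dup2-demo.txt
</code></pre>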
<p>Both processes start concurrently. When <code>ls</code> fills the 64KB buffer, the kernel blocks it until <code>wc</code> reads some data and makes room. When the buffer is empty and <code>ls</code> hasn't finished writing, <code>wc</code> blocks and waits. This <strong>backpressure mechanism</strong> is why pipelines are memory-efficient: at most 64KB of data exists in the pipe at any moment, regardless of how large the input is. This holds true for <strong>streaming stages</strong> — programs like <code>grep</code>, <code>awk</code>, or <code>sed</code> that process input line by line. <strong>Barrier stages</strong> like <code>sort</code> are a different matter: they must read all input into their own memory before producing any output, making their memory usage proportional to input size regardless of the pipe buffer.</p>
<p>This is also why pipelines are not parallel in any meaningful sense. The processes take turns. A fast producer is throttled by a slow consumer, and a slow producer starves a fast consumer. The 64KB buffer smooths out brief mismatches, but it does not change the fundamental constraint: the slowest stage becomes a <strong>bottleneck</strong> that limits total throughput.</p>
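<p>A related effect is easy to demonstrate. <code>yes</code> can produce output far faster than any consumer, yet the pipeline below finishes instantly and uses negligible memory: the kernel blocks <code>yes</code> whenever the buffer is full, and terminates it with SIGPIPE once <code>head</code> exits and closes the read end.</p>
<pre><code class="language-plaintext">yes | head -n 3    # prints three lines of "y", then both processes exit
</code></pre>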
<h2>4. Streaming vs barrier stages</h2>
<p>The <strong>backpressure mechanism</strong> described in section 3 creates an important distinction between two fundamentally different kinds of pipeline stages. Understanding this distinction is the key to reasoning about pipeline performance and memory usage.</p>
<p>A <strong>streaming stage</strong> processes input incrementally and produces output as it goes. <code>grep</code> reads a line, tests it, emits it or discards it, moves to the next. <code>tr</code>, <code>cut</code>, <code>sed</code> are all streamers. Their memory footprint is constant regardless of input size, and they never block the pipeline longer than it takes to process one record.</p>
<p>A <strong>barrier stage</strong> must consume its entire input before producing any output. <code>sort</code> is the canonical example — you cannot emit the smallest element until you have seen all elements. <code>tac</code> reverses a file, so it must read everything before writing anything. <code>uniq -c</code> itself streams, but it requires sorted input, so in typical use a <code>sort</code> barrier has already been paid by the time it runs. Barrier stages break the streaming contract: memory grows with input size, and everything downstream waits.</p>
<p>Consider the classic word frequency pipeline from section 2: <code>tr</code> streams instantly. The first <code>sort</code> then consumes everything — gigabytes if necessary — before <code>uniq -c</code> sees a single line. <code>uniq -c</code> streams quickly over the sorted output, then the second <code>sort</code> again consumes everything before <code>head</code> gets its ten lines. Two full barriers, three sequential phases. The pipeline is not six concurrent processes — it is three sequential batches connected by two streaming bridges.</p>
<p>You can observe this directly. Run <code>/usr/bin/time -v</code> on the full pipeline versus the first <code>sort</code> alone on the same input:</p>
<pre><code class="language-plaintext">/usr/bin/time -v sh -c 'tr -cs A-Za-z "\n" &lt; example.txt | tr A-Z a-z | sort | uniq -c | sort -rn | head'
/usr/bin/time -v sh -c 'tr -cs A-Za-z "\n" &lt; example.txt | tr A-Z a-z | sort &gt; /dev/null'
</code></pre>
<p>For example, using <a href="https://www.gutenberg.org/ebooks/100">The Complete Works of William Shakespeare</a> as input, I measured 0.32 s vs 0.29 s, with practically the same 12.1 MB peak memory in both cases.</p>
<p>This indicates that the first <code>sort</code> dominates memory consumption for the entire pipeline. Everything before it is essentially free, and everything after operates on a fraction of the original data.</p>
<p>This has a practical implication: <strong>when optimizing a pipeline, identifying and addressing the first barrier is almost always the highest-leverage intervention</strong>. Stages before the barrier run for free in terms of memory. Stages after it operate on reduced data. The barrier itself is where the cost lives.</p>
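<p>A minimal sketch of the principle, on a stand-in log (path and contents illustrative): both orderings produce identical output, but only the first keeps the barrier small.</p>
<pre><code class="language-plaintext">printf 'b ERROR\na INFO\na ERROR\nc INFO\n' &gt; /tmp/demo.log
grep ERROR /tmp/demo.log | sort    # barrier holds 2 lines
sort /tmp/demo.log | grep ERROR    # barrier holds all 4 lines
rm /tmp/demo.log
</code></pre>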
<h2>5. Where pipelines work well</h2>
<p>The most common real-world pipeline is probably log analysis.</p>
<p>a. Finding the most frequent error messages in a log file looks like this:</p>
<pre><code class="language-plaintext">grep "ERROR" app.log | sort | uniq -c | sort -rn
</code></pre>
<p><code>grep</code> streams through the file, emitting only matching lines — constant memory, no barrier. Then <code>sort</code> accumulates everything into memory before emitting a single line. <code>uniq -c</code> streams over the sorted output counting consecutive duplicates. The second <code>sort -rn</code> accumulates again to rank by frequency.</p>
<p>Two barriers in four stages. On a large log file, peak memory is determined entirely by how many ERROR lines exist — not by the file size, but not constant either.</p>
<p>b. Another common pattern is searching for a string across many files:</p>
<pre><code class="language-plaintext">find . -name "*.log" | xargs grep "ERROR"
</code></pre>
<p><code>find</code> streams filenames one by one into <code>xargs</code>, which batches them into <code>grep</code> invocations. No stage accumulates data — memory stays constant regardless of how many files exist or how large they are.</p>
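<p>One caveat: the whitespace-delimited handoff breaks on filenames containing spaces or newlines. GNU and BSD versions of both tools support a null-delimited variant (not strictly portable everywhere):</p>
<pre><code class="language-plaintext"># Null-delimited handoff survives spaces and newlines in filenames.
find . -name "*.log" -print0 | xargs -0 grep "ERROR"
</code></pre>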
<p>This is pipelines doing what they do best: composing simple tools into a workflow that scales to arbitrary input size without modification.</p>
<h2>6. Where pipes fall short</h2>
<p>Pipelines excel at linear transformations of a single data stream — filtering, counting, reformatting. But not every problem fits that shape. Three categories of problems expose their limits clearly.</p>
<p><strong>Complex state</strong>. Pipelines are stateless between stages. Each stage sees only its own input stream, with no knowledge of what other stages have seen or produced. If your processing requires correlating events across records, tracking sequences, or maintaining context that spans multiple passes over the data, you are working against the model. At that point you are no longer composing a pipeline — you are writing a program, and a proper scripting language or tool will serve you better than forcing the logic into shell.</p>
<p><strong>Heavy parallelism</strong>. As established in section 3, pipeline stages take turns rather than run truly concurrently. If your workload is CPU-bound and the data can be partitioned, a pipeline will leave cores idle. This is a consequence of the sequential streaming model. <strong>The pipe was never designed for parallel computation</strong>.</p>
<p><strong>Joins</strong>. Combining two datasets by a common key — the most basic operation in data processing — has no clean pipeline expression. You can approximate it with <code>sort</code> and <code>join</code>, but this requires both inputs to be sorted first, adding two barriers before the actual work begins. For anything beyond trivial cases, a pipeline join is awkward, brittle, and slow compared to a proper tool.</p>
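<p>The shape of the sort-and-join approximation, on illustrative data, shows where the cost lands: two full barriers run before <code>join</code> emits its first line.</p>
<pre><code class="language-plaintext">printf '2 bob\n1 alice\n'  &gt; /tmp/users.txt
printf '1 book\n2 pen\n'   &gt; /tmp/orders.txt
sort /tmp/users.txt  &gt; /tmp/users.sorted     # barrier 1
sort /tmp/orders.txt &gt; /tmp/orders.sorted    # barrier 2
join /tmp/users.sorted /tmp/orders.sorted    # streams once inputs are sorted
rm /tmp/users.txt /tmp/orders.txt /tmp/users.sorted /tmp/orders.sorted
</code></pre>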
<h2>7. Beyond pipelines</h2>
<p>When a pipeline reaches its limits, two directions are worth considering depending on the problem. If the bottleneck is a barrier stage, the pipeline itself can often be restructured. If the bottleneck is parallelism, the pipeline needs external orchestration.</p>
<h3>7.1. AWK</h3>
<p><strong>AWK</strong> can replace entire pipelines in a single pass. It operates in two modes depending on the problem: purely streaming, processing each record and emitting output immediately, or stateful, accumulating data across records using associative arrays and producing output at the end. Most pipeline tools are one or the other — <code>grep</code> and <code>sed</code> stream, <code>sort</code> accumulates. AWK can do both, which makes it effective at eliminating the sort-based barriers that dominate most text processing pipelines.</p>
<p>The most common barrier in a pipeline is <code>sort</code> used purely to prepare input for <code>uniq</code> or <code>uniq -c</code>. AWK's associative arrays make both redundant.</p>
<p>Deduplication without sort:</p>
<pre><code class="language-plaintext">sort file | uniq
</code></pre>
<p>Using <strong>AWK</strong>:</p>
<pre><code class="language-plaintext">awk '!x[$0]++' file
</code></pre>
<p>A single streaming pass. Each line is checked against an associative array — seen lines are discarded, unseen lines pass through. The tradeoff is explicit: output order is insertion order, not sorted order. When sorted output is not required, the barrier is gone entirely.</p>
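<p>Both behaviors — the deduplication order and the analogous replacement for <code>sort | uniq -c</code> — are visible on a toy input:</p>
<pre><code class="language-plaintext">printf 'b\na\nb\nc\na\n' | awk '!x[$0]++'
# first-seen order: b, a, c

# Counting without a sort barrier; for-in traversal order is
# implementation-defined, so the output order here is unspecified.
printf 'b\na\nb\nc\na\n' | awk '{ c[$0]++ } END { for (k in c) print c[k], k }'
</code></pre>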
<p>As efficient as it looks, the resource footprint is a different question. Running a quick benchmark using the same benchmark runner and CSV file as in my <a href="https://awklab.com/practical-awk-benchmarking">Practical AWK Benchmarking</a> article, the results are as follows:</p>
<pre><code class="language-plaintext">--- Statistical Summary ---
cmd   Runtime [s]                                   Peak Memory [MB]
      mean ± sdev       min    median  max    Jtr%  mean ± sdev       min    median  max    Jtr%
gawk  2.5438 ± 0.0251   2.5167 2.5488  2.5661 0.2   594.21 ± 0.25    593.96 594.20  594.46 0.0
mawk  2.5876 ± 0.0035   2.5842 2.5877  2.5911 0.0   291.24 ± 0.57    290.89 290.93  291.89 0.1
nawk  2.5799 ± 0.0124   2.5726 2.5730  2.5942 0.3   322.17 ± 0.17    322.03 322.12  322.35 0.0
s|u   2.4596 ± 0.0048   2.4556 2.4585  2.4649 0.0   368.64 ± 0.07    368.57 368.62  368.71 0.0
</code></pre>
<pre><code class="language-plaintext">--- Normalized Benchmarks ---
cmd   RT    PM    d     F
gawk  1.04  2.04  1.04  2.12
mawk  1.05  1.00  0.05  1.05
nawk  1.05  1.11  0.12  1.16
s|u   1.00  1.27  0.27  1.27
</code></pre>
<p><strong>Definitions</strong> - RT: Normalized Runtime; PM: Normalized Peak Memory; d: Euclidean Distance; F: Resource Footprint. 
(Calculations are based on 1 warmup and 3 test runs.)</p>
<p>Runtime is nearly identical across all variants — <strong>AWK</strong> offers no speed advantage over <code>sort | uniq</code> here. The real difference is memory. <strong>mawk</strong> uses 21% less memory than <code>sort | uniq</code> and has the best overall resource footprint (F=1.05), making it the clear winner. <strong>nawk</strong> is a reasonable middle ground at F=1.16. <strong>gawk</strong>, despite being the most widely used <strong>AWK</strong> variant, performs worst — consuming double the memory of <strong>mawk</strong> and 61% more than <code>sort | uniq</code>, reflected in a footprint of F=2.12. The choice of <strong>AWK</strong> implementation matters as much as the algorithmic change — the wrong one erases any benefit.</p>
<h3>7.2 Real parallelism with xargs and GNU parallel</h3>
<p>When the bottleneck is not a barrier but throughput — processing many independent inputs simultaneously — pipelines need external orchestration. <code>xargs -P</code> and <strong>GNU parallel</strong> both achieve this by partitioning work across multiple processes.</p>
<p>The <code>find | xargs grep</code> example from section 5 becomes parallel with one addition:</p>
<pre><code class="language-plaintext">find . -name "*.log" | xargs -P $(nproc) grep "ERROR"
</code></pre>
<p><code>-P $(nproc)</code> runs one <code>grep</code> process per available core. The pipeline structure remains intact — <code>find</code> still streams filenames, <code>xargs</code> still batches them — but the processing stage now uses all cores.</p>
<p><strong>GNU parallel</strong> offers finer control:</p>
<pre><code class="language-plaintext">find . -name "*.log" | parallel grep "ERROR"
</code></pre>
<p>By default it spawns one job per core. With <code>--keep-order</code> it also preserves output order to match the input, and it handles errors per job and supports more complex partitioning strategies than <code>xargs</code>.</p>
<p>The key distinction from pipeline concurrency is explicit: you are not getting parallelism from the pipe itself. You are orchestrating multiple independent processes from outside the pipeline. The pipe remains a sequential channel — what changes is how many workers consume from it simultaneously.</p>
<h2>8. Conclusion</h2>
<p>Pipes are not primarily a performance tool. They are a composition tool — a way to connect programs that were never designed to work together, avoiding intermediate files and keeping memory usage flat regardless of input size. That property is powerful, and it is why pipelines designed in the 1970s can still process gigabytes of data efficiently today.</p>
<p>The limits are just as real. A pipeline is a single stream, moving in one direction, through stages that take turns. When your problem needs multiple streams, parallel execution, or global state, the model stops helping and starts getting in the way. Knowing where that boundary lies — and reaching for the right tool when you cross it — is what makes pipelines useful rather than limiting.</p>
]]></content:encoded></item><item><title><![CDATA[When RAM Matters: Memory Efficiency of AWK Variants]]></title><description><![CDATA[The AWK scripting language emerged from Bell Labs in 1977, named for its creators Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is still widely used today, as a core tool it is available on a]]></description><link>https://awklab.com/memory-efficiency-awk-mawk-gawk</link><guid isPermaLink="true">https://awklab.com/memory-efficiency-awk-mawk-gawk</guid><category><![CDATA[awk]]></category><category><![CDATA[unix]]></category><category><![CDATA[Linux]]></category><category><![CDATA[performance]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Scripting]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Sat, 14 Mar 2026 17:24:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/0292b8b5-bbc6-48ae-8a76-d18d686a198c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The AWK scripting language emerged from Bell Labs in 1977, named for its creators Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is still widely used today; as a core tool, it is available on any Unix or Unix-like system (Linux, BSDs, macOS, etc.). It operates as a compact, domain-specific language for text processing. AWK reads input line by line, splits each line into fields, and executes code when patterns match. No explicit loops are needed for reading data; the program focuses on what to do with each record, not how to traverse the file. This makes it exceptionally effective for rapid ad-hoc data analysis and transformation, as well as for filtering and more complex operations within pipelines. AWK is Turing-complete and can handle logic beyond simple pattern matching.</p>
<p>While the AWK language is POSIX standard, it exists in several distinct implementations, most notably:</p>
<ul>
<li><p><strong>gawk</strong> (GNU Awk): The feature-rich version with extensions beyond POSIX, maintained by Arnold Robbins. Default in Arch Linux, RHEL, Fedora.</p>
</li>
<li><p><strong>mawk</strong> (Mike Brennan’s Awk): An efficiency-oriented implementation using a bytecode interpreter, currently maintained by Thomas Dickey. Default in Debian and many of its derivatives.</p>
</li>
<li><p><strong>nawk</strong> (The "One True Awk"): The original implementation from the language’s creators, maintained by Brian Kernighan. Default in BSDs and macOS.</p>
</li>
</ul>
<p>In most Linux distributions, the <code>awk</code> command is a symbolic link to a specific implementation. You can verify which variant is being used with:</p>
<p><code>ls -l $(which awk)</code></p>
<p><strong>AWK has a place in modern data pipelines as an effective Phase 2 pre-filter</strong>: it is schema-agnostic, low footprint, zero-setup, and readily available (see <a href="https://awklab.com/awk-the-zero-setup-pre-processor">article</a>). It is suitable for the earliest stage of validation, as a first-pass filter, before any format-specific interpretation.</p>
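<p>A sketch of such a first-pass filter, assuming simple CSV without quoted commas (the file name is illustrative): pass through rows whose field count matches the header's, and report the rest.</p>
<pre><code class="language-plaintext"># Structural validation before any format-specific parsing.
awk -F, 'NR == 1 { n = NF }
         NF != n { print "row " NR ": expected " n " fields, got " NF &gt; "/dev/stderr"; next }
         { print }' data.csv
</code></pre>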
<p><strong>AWK can operate in two fundamentally different modes with respect to memory usage:</strong></p>
<ol>
<li><p><strong>Streaming operations</strong> maintain constant memory usage regardless of file size. A multi-hundred-gigabyte file can be inspected using the same resources as a kilobyte sample. This makes AWK effective for null rate checks, schema validation, and range or boundary verification on datasets that exceed available memory.</p>
</li>
<li><p><strong>Stateful operations</strong>, however, require accumulating data in memory. This can take several forms: populating associative arrays for deduplication (<code>!x[$0]++</code>) or field distribution analysis (<code>x[NF]++</code>), loading records into indexed arrays for multi-pass processing, or concatenating strings to build aggregate outputs. For these operations, memory efficiency matters, and implementation differences between AWK variants become significant.</p>
</li>
</ol>
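<p>The two modes side by side, on an illustrative CSV (the column choice and file name are arbitrary):</p>
<pre><code class="language-plaintext"># 1. Streaming: null-rate check on field 3 -- a single counter,
#    constant memory at any file size.
awk -F, '$3 == "" { n++ } END { print n+0, "empty values in field 3" }' data.csv

# 2. Stateful: field-count distribution -- x[NF]++ grows with the
#    number of distinct field counts seen.
awk -F, '{ x[NF]++ } END { for (k in x) print k " fields: " x[k] " rows" }' data.csv
</code></pre>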
<p><strong>This article evaluates the memory efficiency of gawk, mawk, and nawk in stateful operations, as a function of input file size.</strong></p>
<h2>Benchmarking Approach</h2>
<p>The benchmarking evaluates memory consumption patterns for four different stateful operation scenarios in AWK when processing CSV data. The focus is on memory usage comparison during data population, with no additional processing or operations performed. This allows for direct measurement of how different data storage strategies impact memory footprint. In addition to memory usage, execution time was also measured. Resource tracking was performed using <a href="https://github.com/gsauthof/cgmemtime">cgmemtime</a>, an ideal tool for this purpose as it captures peak memory consumption for the whole process group. The benchmarking process was automated via my custom runner, which handles warmups and multiple test runs and calculates statistical metrics as well as normalized parameters for comparative analysis. For details, see my <a href="https://awklab.com/behilos-benchmark">BEHILOS Benchmark article</a>.</p>
<h3>Test Dataset</h3>
<p>The benchmarking uses CSV files with a consistent structure of 14 fields per row. To observe memory scaling behavior, 7 different file sizes were tested, ranging from 1,000 rows to 10 million rows (120 KB to 1.2 GB). The CSV test files are available <a href="https://excelbianalytics.com/downloads-18-sample-csv-files-data-sets-for-testing-sales/">here</a>. The 10M-row file was generated by concatenating two copies of the 5M file.</p>
<table>
<thead>
<tr>
<th>File name</th>
<th>Rows</th>
<th>File size [MB]</th>
</tr>
</thead>
<tbody><tr>
<td>sales1K.csv</td>
<td>1K</td>
<td>0.12</td>
</tr>
<tr>
<td>sales10K.csv</td>
<td>10K</td>
<td>1.2</td>
</tr>
<tr>
<td>sales100K.csv</td>
<td>100K</td>
<td>12</td>
</tr>
<tr>
<td>sales500K.csv</td>
<td>500K</td>
<td>60</td>
</tr>
<tr>
<td>sales1.5M.csv</td>
<td>1.5M</td>
<td>178</td>
</tr>
<tr>
<td>sales5M.csv</td>
<td>5M</td>
<td>595</td>
</tr>
<tr>
<td>sales10M.csv</td>
<td>10M</td>
<td>1190</td>
</tr>
</tbody></table>
<h3>Test Environment</h3>
<p>Tests were conducted on an Arch Linux workstation powered by a Ryzen 5900x CPU with 64GB of RAM, using the Alacritty terminal within a dwm session.</p>
<p>The following table provides a summary of the specific versions and main characteristics of the three AWK implementations tested:</p>
<table>
<thead>
<tr>
<th>Name</th>
<th>Version</th>
<th>Binary Size</th>
<th>Installed Size</th>
<th>--csv</th>
<th>UTF-8</th>
<th>Extensions</th>
</tr>
</thead>
<tbody><tr>
<td>gawk</td>
<td>5.3.2</td>
<td>853 kB</td>
<td>3.60 MB</td>
<td>yes</td>
<td>yes</td>
<td>yes</td>
</tr>
<tr>
<td>mawk</td>
<td>1.3.4 20260129</td>
<td>179 kB</td>
<td>206 kB</td>
<td>no</td>
<td>no</td>
<td>no</td>
</tr>
<tr>
<td>nawk</td>
<td>20251225</td>
<td>139 kB</td>
<td>145 kB</td>
<td>yes</td>
<td>yes</td>
<td>no</td>
</tr>
</tbody></table>
<h2>The Benchmarks</h2>
<p>Four benchmarks were applied. They represent common patterns for storing CSV data in AWK, each with different memory characteristics and use cases.</p>
<p>Each benchmark sequence included one initial warmup run followed by three recorded runs. Normalized parameters are based on the median, with 1.0 being the baseline (e.g. the lowest peak memory or runtime).</p>
<h3>Benchmark #1: Store entire lines in array</h3>
<pre><code class="language-plaintext">x[NR]=$0
</code></pre>
<p>This is the simplest storage method and keeps the original line intact without parsing individual fields. The memory footprint includes the full text of each line including all field separators. This method is commonly used when you need to preserve the exact input for later processing or output, or when you need random access to complete lines.</p>
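<p>A single run of this pattern can be reproduced without the full harness; GNU time's <code>-v</code> output includes peak resident memory (cgmemtime, used for the published numbers, measures the whole process group instead). The file name is illustrative, and <code>/usr/bin/time</code> is assumed to be GNU time:</p>
<pre><code class="language-plaintext">/usr/bin/time -v awk '{ x[NR] = $0 }' sales1K.csv 2&gt;&amp;1 |
  grep -i 'maximum resident'
</code></pre>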
<p>Results of Benchmark #1</p>
<pre><code class="language-plaintext">Benchmark #1 Result Table

File / variant                 Runtime [s]                                                        Peak Memory [MB]                                             
                               mean ± sdev            min      median    max      Jtr%             mean ± sdev            min      median    max      Jtr%        
sales1K.csv                                                                                                                                                      
  gawk                         0.0016 ± 0.0001        0.0016   0.0016    0.0017   1.1              0.75 ± 0.02            0.73     0.74      0.77     1.4        
  mawk                         0.0010 ± 0.0001        0.0009   0.0010    0.0011   1.3              0.82 ± 0.15            0.73     0.74      1.00     11.6       
  nawk                         0.0015 ± 0.0001        0.0015   0.0015    0.0016   2.3              0.57 ± 0.15            0.48     0.49      0.74     16.8       

sales10K.csv                                                                                                                                                     
  gawk                         0.0048 ± 0.0001        0.0047   0.0049    0.0049   0.9              3.07 ± 0.15            2.98     2.99      3.25     2.8        
  mawk                         0.0027 ± 0.0002        0.0026   0.0028    0.0028   1.2              2.74 ± 0.15            2.73     2.74      2.74     0.0        
  nawk                         0.0089 ± 0.0003        0.0087   0.0087    0.0091   1.5              2.85 ± 0.19            2.73     2.83      2.98     0.6        

sales100K.csv                                                                                                                                                    
  gawk                         0.0330 ± 0.0008        0.0324   0.0328    0.0339   0.7              25.74 ± 0.15           25.73    25.74     25.74    0.0        
  mawk                         0.0174 ± 0.0014        0.0163   0.0169    0.0189   3.0              21.49 ± 0.15           21.48    21.49     21.49    0.0        
  nawk                         0.0766 ± 0.0020        0.0750   0.0760    0.0789   0.9              23.57 ± 0.35           23.24    23.74     23.74    0.7        

sales500K.csv                                                                                                                                                    
  gawk                         0.1488 ± 0.0022        0.1473   0.1480    0.1511   0.5              125.49 ± 0.29          125.23   125.48    125.74   0.0        
  mawk                         0.1025 ± 0.0048        0.0987   0.1012    0.1076   1.3              105.24 ± 0.29          104.99   105.25    105.48   0.0        
  nawk                         0.3974 ± 0.0063        0.3918   0.3966    0.4038   0.2              120.19 ± 0.38          120.02   120.27    120.27   0.1        

sales1.5M.csv                                                                                                                                                    
  gawk                         0.4399 ± 0.0036        0.4367   0.4406    0.4422   0.2              374.99 ± 0.58          374.49   374.98    375.49   0.0        
  mawk                         0.3360 ± 0.0159        0.3192   0.3403    0.3486   1.3              313.24 ± 0.52          312.98   313.00    313.74   0.1        
  nawk                         1.1826 ± 0.0132        1.1753   1.1766    1.1959   0.5              346.26 ± 0.45          346.02   346.26    346.51   0.0        

sales5M.csv                                                                                                                                                      
  gawk                         1.4856 ± 0.0037        1.4848   1.4853    1.4868   0.0              1252.07 ± 0.65         1251.73  1252.23   1252.23  0.0        
  mawk                         1.1790 ± 0.0185        1.1696   1.1788    1.1885   0.0              1042.48 ± 0.58         1042.23  1042.48   1042.73  0.0        
  nawk                         3.9644 ± 0.0155        3.9555   3.9660    3.9717   0.0              1156.11 ± 0.47         1156.02  1156.03   1156.27  0.0        

sales10M.csv                                                                                                                                                     
  gawk                         2.9759 ± 0.0070        2.9691   2.9783    2.9802   0.1              2507.74 ± 0.69         2507.49  2507.74   2507.99  0.0        
  mawk                         2.4046 ± 0.0220        2.3926   2.4044    2.4166   0.0              2083.90 ± 0.60         2083.73  2083.98   2083.98  0.0        
  nawk                         8.0351 ± 0.0158        8.0334   8.0337    8.0383   0.0              2361.94 ± 0.70         2361.53  2361.78   2362.52  0.0                 
</code></pre>
<p>Summary Table</p>
<pre><code class="language-plaintext">Benchmark #1 Summary Table

File size        rt [s]                  pm [MB]        
    [MB]  gawk    mawk    nawk        gawk    mawk    nawk
----------------------------------------------------------
    0.12  0.0016  0.0010  0.0015      0.74    0.74    0.49
     1.2  0.0049  0.0028  0.0087      2.99    2.74    2.83
      12  0.0328  0.0169  0.0760     25.74   21.49   23.74
      60  0.1480  0.1012  0.3966    125.48  105.25  120.27
     178  0.4406  0.3403  1.1766    374.98  313.00  346.26
     595  1.4853  1.1788  3.9660   1252.23 1042.48 1156.03
    1190  2.9783  2.4044  8.0337   2507.74 2083.98 2361.78     
</code></pre>
<p>Normalized results: RT (normalized runtime) and MO (memory overhead)</p>
<pre><code class="language-plaintext">Benchmark #1 Normalized Results

File size   RT                    MO          
    [MB]    gawk   mawk   nawk    gawk   mawk   nawk
----------------------------------------------------
    0.12    1.6    1.0    1.5     6.2    6.2    4.1
     1.2    1.8    1.0    3.1     2.5    2.3    2.4
      12    1.9    1.0    4.5     2.1    1.8    2.0
      60    1.5    1.0    3.9     2.1    1.8    2.0
     178    1.3    1.0    3.5     2.1    1.8    1.9
     595    1.3    1.0    3.4     2.1    1.8    1.9
    1190    1.2    1.0    3.3     2.1    1.8    2.0
</code></pre>
<h3>Benchmark #2: Populate 2D matrix</h3>
<pre><code class="language-plaintext">for (i=1; i&lt;=NF; i++) x[NR,i] = $i
</code></pre>
<p>AWK simulates 2D arrays by concatenating keys with a built-in separator (SUBSEP), so <code>x[row, col]</code> is stored internally as <code>x[row SUBSEP col]</code>. This approach provides indexed access to individual fields and is useful when you need to perform operations on specific columns across all rows. The memory overhead includes both the field data and the composite key structures.</p>
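<p>A quick way to see this equivalence in action (a hypothetical demo, not part of the benchmark harness) is to populate a tiny matrix and read it back through both key forms:</p>
<pre><code class="language-plaintext">echo 'a b c' | awk '{
    for (i = 1; i &lt;= NF; i++) x[NR, i] = $i   # same keys as x[NR SUBSEP i]
    if ((1, 2) in x)           # parenthesized test avoids creating the key
        print x[1 SUBSEP 2]    # prints "b": the keys are identical
}'
</code></pre>
<p>Note that the membership test must use the parenthesized form <code>(row, col) in x</code>; merely referencing <code>x[row, col]</code> would silently create the element.</p>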
<p>Results of Benchmark #2</p>
<pre><code class="language-plaintext">Benchmark #2 Result Table

File / variant                 Runtime [s]                                                        Peak Memory [MB]                                             
                               mean ± sdev            min      median    max      Jtr%             mean ± sdev            min      median    max      Jtr%        
sales1K.csv                                                                                                                                                      
  gawk                         0.0087 ± 0.0003        0.0084   0.0086    0.0090   0.7              5.24 ± 0.00            5.24     5.24      5.24     0.0        
  mawk                         0.0066 ± 0.0011        0.0058   0.0061    0.0078   7.8              2.15 ± 0.29            1.98     1.98      2.49     8.5        
  nawk                         0.0090 ± 0.0007        0.0081   0.0094    0.0094   4.5              2.33 ± 0.13            2.23     2.28      2.48     2.2        

sales10K.csv                                                                                                                                                     
  gawk                         0.0662 ± 0.0006        0.0657   0.0661    0.0668   0.2              46.07 ± 0.14           45.99    45.99     46.24    0.2        
  mawk                         0.0511 ± 0.0027        0.0491   0.0503    0.0539   1.6              16.32 ± 0.32           16.24    16.24     16.48    0.5        
  nawk                         0.0820 ± 0.0010        0.0814   0.0821    0.0827   0.0              19.67 ± 0.20           19.59    19.59     19.84    0.4        

sales100K.csv                                                                                                                                                    
  gawk                         0.7788 ± 0.0262        0.7500   0.7852    0.8011   0.8              452.65 ± 0.20          452.48   452.73    452.74   0.0        
  mawk                         0.9017 ± 0.0144        0.8931   0.8941    0.9180   0.9              156.74 ± 0.32          156.73   156.73    156.74   0.0        
  nawk                         0.7998 ± 0.0117        0.7911   0.7953    0.8131   0.6              178.69 ± 0.25          178.52   178.77    178.78   0.0        

sales500K.csv                                                                                                                                                    
  gawk                         4.5791 ± 0.0277        4.5695   4.5800    4.5878   0.0              2261.98 ± 0.48         2261.72  2261.73   2262.48  0.0        
  mawk                         6.3488 ± 0.0417        6.3037   6.3686    6.3742   0.3              785.24 ± 0.32          785.23   785.23    785.24   0.0        
  nawk                         4.3709 ± 0.0170        4.3583   4.3714    4.3830   0.0              959.94 ± 0.46          959.52   960.04    960.26   0.0        

sales1.5M.csv                                                                                                                                                    
  gawk                         14.8738 ± 0.1380       14.7941  14.7974   15.0298  0.5              6775.31 ± 0.70         6774.75  6775.47   6775.72  0.0        
  mawk                         19.7716 ± 0.0491       19.7417  19.7862   19.7870  0.1              2356.07 ± 0.50         2355.73  2355.98   2356.48  0.0        
  nawk                         12.1783 ± 0.0873       12.1189  12.1395   12.2765  0.3              2677.02 ± 0.52         2676.77  2677.02   2677.27  0.0        

sales5M.csv                                                                                                                                                      
  gawk                         50.6685 ± 0.1408       50.6431  50.6636   50.6989  0.0              22592.04 ± 0.71        22591.95 22591.96  22592.20 0.0        
  mawk                         72.4570 ± 0.1123       72.3429  72.4932   72.5348  0.0              7963.89 ± 0.52         7963.73  7963.96   7963.99  0.0        
  nawk                         40.7116 ± 0.2032       40.5819  40.6313   40.9215  0.2              8988.45 ± 0.65         8988.01  8988.57   8988.76  0.0        

sales10M.csv                                                                                                                                                     
  gawk                         101.8983 ± 0.2029      101.7380 101.9330  102.0240 0.0              45182.70 ± 0.71        45182.70 45182.71  45182.71 0.0        
  mawk                         150.8563 ± 0.2839      150.6080 150.8330  151.1280 0.0              15965.48 ± 0.84        15964.98 15965.23  15966.23 0.0        
  nawk                         84.8894 ± 0.4522       84.5106  84.8431   85.3145  0.1              18777.15 ± 0.68        18776.93 18777.25  18777.26 0.0        
</code></pre>
<p>Summary Table</p>
<pre><code class="language-plaintext">Benchmark #2 Summary Table

File size          rt [s]                      pm [MB]                
    [MB]   gawk     mawk     nawk         gawk     mawk     nawk
----------------------------------------------------------------
    0.12   0.0086   0.0061   0.0094       5.24     1.98     2.28
     1.2   0.0661   0.0503   0.0821      45.99    16.24    19.59
      12   0.7852   0.8941   0.7953     452.73   156.73   178.77
      60   4.5800   6.3686   4.3714    2261.73   785.23   960.04
     178  14.7974  19.7862  12.1395    6775.47  2355.98  2677.02
     595  50.6636  72.4932  40.6313   22591.96  7963.96  8988.57
    1190 101.9330 150.8330  84.8431   45182.71 15965.23 18777.25
</code></pre>
<p>Normalized results: RT (normalized runtime) and MO (memory overhead)</p>
<pre><code class="language-plaintext">Benchmark #2 Normalized Results

File size   RT                   MO          
    [MB]    gawk   mawk   nawk   gawk   mawk   nawk
---------------------------------------------------
    0.12    1.4    1.0    1.5    43.7   16.5   19.0
     1.2    1.3    1.0    1.6    38.3   13.5   16.3
      12    1.0    1.1    1.0    37.7   13.1   14.9
      60    1.0    1.5    1.0    37.7   13.1   16.0
     178    1.2    1.6    1.0    38.1   13.2   15.0
     595    1.2    1.8    1.0    38.0   13.4   15.1
    1190    1.2    1.8    1.0    38.0   13.4   15.8
</code></pre>
<h3>Benchmark #3: Populate 1D array for each field</h3>
<pre><code class="language-plaintext">x1[NR]=$1; x2[NR]=$2; x3[NR]=$3; ... x14[NR]=$14
</code></pre>
<p>This creates 14 independent hash table structures in memory, avoiding the composite key overhead of the 2D approach. This method is efficient when you frequently access all values of a particular field, as each field's data is stored contiguously in its own array structure. The tradeoff is managing multiple array variables instead of a single unified structure.</p>
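<p>As a sketch of why this layout is convenient (illustrative data only, not from the benchmark), summing one column touches a single flat array:</p>
<pre><code class="language-plaintext">printf '10,north\n20,south\n30,east\n' | awk -F',' '
    { x1[NR] = $1; x2[NR] = $2 }          # one flat array per field
    END {
        for (r = 1; r &lt;= NR; r++) sum += x1[r]
        print "total:", sum               # 10+20+30 = 60
    }'
</code></pre>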
<p>In Benchmark #3 <strong>gawk</strong>'s native array of arrays feature was also tested:</p>
<pre><code class="language-plaintext">for (i=1; i&lt;=NF; i++) x[NR][i]=$i
</code></pre>
<p>This creates a true nested structure where each row is a parent array containing 14 child elements.</p>
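<p>A minimal sketch of iterating such a nested structure (requires <strong>gawk</strong>; the data is illustrative):</p>
<pre><code class="language-plaintext">printf '1 2\n3 4\n' | gawk '
    { for (i = 1; i &lt;= NF; i++) x[NR][i] = $i }
    END {
        for (r in x)                      # each x[r] is itself an array
            for (c in x[r]) total += x[r][c]
        print total                       # 1+2+3+4 = 10
    }'
</code></pre>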
<p>Results</p>
<pre><code class="language-plaintext">Benchmark #3 Result Table

File / variant                  Runtime [s]                                                        Peak Memory [MB]                                             
                               mean ± sdev            min      median    max      Jtr%             mean ± sdev            min      median    max      Jtr%        
sales1K.csv                                                                                                                                                      
  gawk                         0.0054 ± 0.0004        0.0051   0.0052    0.0059   4.0              2.83 ± 0.14            2.74     2.77      2.99     2.2        
  mawk                         0.0027 ± 0.0000        0.0026   0.0027    0.0027   0.1              1.82 ± 0.14            1.74     1.74      1.99     4.8        
  nawk                         0.0053 ± 0.0001        0.0052   0.0053    0.0053   0.8              2.23 ± 0.00            2.23     2.23      2.24     0.0        
  gawk*                        0.0066 ± 0.0004        0.0062   0.0068    0.0070   1.9              3.76 ± 0.07            3.70     3.74      3.84     0.6        

sales10K.csv                                                                                                                                                     
  gawk                         0.0362 ± 0.0008        0.0355   0.0362    0.0369   0.0              21.40 ± 0.20           21.23    21.49     21.49    0.4        
  mawk                         0.0176 ± 0.0011        0.0164   0.0179    0.0184   1.8              13.07 ± 0.20           12.98    12.99     13.23    0.6        
  nawk                         0.0413 ± 0.0006        0.0408   0.0413    0.0419   0.0              19.07 ± 0.15           18.98    18.99     19.24    0.4        
  gawk*                        0.0489 ± 0.0016        0.0471   0.0497    0.0498   1.7              31.49 ± 0.45           31.24    31.24     32.00    0.8        

sales100K.csv                                                                                                                                                    
  gawk                         0.3390 ± 0.0047        0.3337   0.3406    0.3426   0.5              206.16 ± 0.43          205.74   206.25    206.48   0.0        
  mawk                         0.2401 ± 0.0100        0.2288   0.2444    0.2472   1.7              125.74 ± 0.32          125.49   125.73    125.99   0.0        
  nawk                         0.4159 ± 0.0036        0.4119   0.4177    0.4183   0.4              178.06 ± 0.40          177.73   177.96    178.47   0.1        
  gawk*                        0.4581 ± 0.0026        0.4563   0.4578    0.4602   0.1              308.23 ± 0.51          307.98   308.24    308.48   0.0        

sales500K.csv                                                                                                                                                    
  gawk                         1.6668 ± 0.0108        1.6607   1.6618    1.6780   0.3              1023.65 ± 0.57         1023.24  1023.74   1023.98  0.0        
  mawk                         1.2514 ± 0.0122        1.2445   1.2510    1.2586   0.0              621.15 ± 0.35          620.99   621.23    621.24   0.0        
  nawk                         2.4466 ± 0.0109        2.4348   2.4523    2.4528   0.2              946.89 ± 0.49          946.56   947.04    947.07   0.0        
  gawk*                        2.2579 ± 0.0027        2.2572   2.2582    2.2583   0.0              1537.32 ± 0.53         1537.24  1537.24   1537.49  0.0        

sales1.5M.csv                                                                                                                                                    
  gawk                         5.0319 ± 0.0295        5.0004   5.0445    5.0509   0.2              3064.90 ± 0.69         3064.48  3064.98   3065.24  0.0        
  mawk                         3.5495 ± 0.0221        3.5303   3.5515    3.5669   0.1              1847.24 ± 0.56         1846.73  1847.48   1847.49  0.0        
  nawk                         6.3204 ± 0.0325        6.2918   6.3166    6.3527   0.1              2664.61 ± 0.53         2664.46  2664.56   2664.82  0.0        
  gawk*                        6.7958 ± 0.0173        6.7846   6.7873    6.8155   0.1              4609.57 ± 0.93         4608.73  4609.73   4610.24  0.0        

sales5M.csv                                                                                                                                                      
  gawk                         17.2646 ± 0.0952       17.1816  17.2512   17.3611  0.1              10289.49 ± 0.85        10288.99 10289.49  10289.98 0.0        
  mawk                         12.3195 ± 0.0415       12.2834  12.3215   12.3537  0.0              6174.32 ± 0.76         6173.74  6174.48   6174.74  0.0        
  nawk                         22.0794 ± 0.1297       21.9774  22.0413   22.2196  0.2              8938.40 ± 0.55         8938.31  8938.32   8938.57  0.0        
  gawk*                        22.7869 ± 0.1159       22.6546  22.8511   22.8549  0.3              15367.34 ± 0.94        15367.23 15367.27  15367.50 0.0        

sales10M.csv                                                                                                                                                     
  gawk                         34.8630 ± 0.2366       34.6662  34.8277   35.0951  0.1              20633.40 ± 0.90        20633.23 20633.24  20633.74 0.0        
  mawk                         24.6822 ± 0.0494       24.6660  24.6677   24.7130  0.1              12346.65 ± 0.82        12346.48 12346.49  12346.98 0.0        
  nawk                         48.6894 ± 0.1791       48.5470  48.7520   48.7691  0.1              18576.88 ± 0.57        18576.79 18576.80  18577.05 0.0        
  gawk*                        45.8100 ± 0.1638       45.7327  45.7543   45.9431  0.1              30737.79 ± 1.19        30736.99 30737.99  30738.39 0.0        
</code></pre>
<p>Summary Table</p>
<pre><code class="language-plaintext">Benchmark #3 Summary Table

File size            rt [s]                            pm [MB]                      
    [MB]   gawk     mawk     nawk     gawk*        gawk     mawk     nawk    gawk*
----------------------------------------------------------------------------------
    0.12   0.0052   0.0027   0.0053   0.0068       2.77     1.74     2.23     3.74
     1.2   0.0362   0.0179   0.0413   0.0497      21.49    12.99    18.99    31.24
      12   0.3406   0.2444   0.4177   0.4578     206.25   125.73   177.96   308.24
      60   1.6618   1.2510   2.4523   2.2582    1023.74   621.23   947.04  1537.24
     178   5.0445   3.5515   6.3166   6.7873    3064.98  1847.48  2664.56  4609.73
     595  17.2512  12.3215  22.0413  22.8511   10289.49  6174.48  8938.32 15367.27
    1190  34.8277  24.6677  48.7520  45.7543   20633.24 12346.49 18576.80 30737.99
</code></pre>
<p>Normalized results: RT (normalized runtime) and MO (memory overhead)</p>
<pre><code class="language-plaintext">Benchmark #3 Normalized Results

File size   RT                          MO                  
    [MB]    gawk   mawk   nawk   gawk*  gawk   mawk   nawk   gawk*
-----------------------------------------------------------------
    0.12    1.9    1.0    2.5    2.0    23.1   14.5   18.6   31.2
     1.2    2.0    1.0    2.8    2.3    17.9   10.8   15.8   26.0
      12    1.4    1.0    1.9    1.7    17.2   10.5   14.8   25.7
      60    1.3    1.0    1.8    2.0    17.1   10.4   15.8   25.6
     178    1.4    1.0    1.9    1.8    17.2   10.4   15.0   25.9
     595    1.4    1.0    1.9    1.8    17.3   10.4   15.0   25.8
    1190    1.4    1.0    1.9    2.0    17.3   10.4   15.6   25.8
</code></pre>
<h3>Benchmark #4: Concatenate the entire dataset into one string</h3>
<pre><code class="language-plaintext">x = x $0
</code></pre>
<p>Each line is appended to an ever-growing string, so the accumulated value grows with every record. This pattern can be useful for building complete records for batch output, for log aggregation, or for producing hash/checksum input.</p>
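<p>The pattern itself is a one-liner; the catch is that each append may copy the entire accumulated string, so total work can grow quadratically with input size in engines without an append optimization. A toy run (illustrative data):</p>
<pre><code class="language-plaintext"># Two 8-character lines concatenated with no separator: length 16
printf 'north,10\nsouth,20\n' | awk '{ x = x $0 } END { print length(x) }'
</code></pre>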
<p>Results</p>
<pre><code class="language-plaintext">Benchmark #4 Result Table

File / variant                 Runtime [s]                                                        Peak Memory [MB]                                             
                               mean ± sdev            min      median    max      Jtr%             mean ± sdev            min      median    max      Jtr%        
sales1K.csv 0.12                                                                                                                                                 
  gawk                         0.0016 ± 0.0001        0.0016   0.0017    0.0017   1.8              0.75 ± 0.02            0.74     0.74      0.78     1.8        
  mawk                         0.0058 ± 0.0003        0.0054   0.0059    0.0061   0.9              0.98 ± 0.07            0.94     0.95      1.07     3.7        
  nawk                         0.0084 ± 0.0003        0.0081   0.0084    0.0086   0.1              1.06 ± 0.11            0.94     1.08      1.15     2.1        

sales10K.csv 1.2                                                                                                                                                 
  gawk                         0.0041 ± 0.0001        0.0040   0.0040    0.0042   0.8              1.98 ± 0.02            1.98     1.98      1.98     0.1        
  mawk                         1.2520 ± 0.0030        1.2485   1.2533    1.2541   0.1              5.18 ± 0.16            5.02     5.25      5.29     1.2        
  nawk                         1.4993 ± 0.0047        1.4954   1.4979    1.5046   0.1              6.30 ± 0.27            6.03     6.36      6.51     0.9        

sales100K.csv 12                                                                                                                                                 
  gawk                         0.0239 ± 0.0002        0.0238   0.0238    0.0241   0.3              12.73 ± 0.02           12.72    12.73     12.73    0.0        
  mawk                         57.3232 ± 0.4294       56.8661  57.3852   57.7182  0.1              37.42 ± 0.95           36.46    37.46     38.33    0.1        
  nawk                         92.1290 ± 0.9580       91.0598  92.4178   92.9094  0.3              49.94 ± 1.15           49.00    49.64     51.17    0.6        

sales500K.csv 60                                                                                                                                                 
  gawk                         0.1075 ± 0.0015        0.1059   0.1080    0.1087   0.4              60.05 ± 0.13           59.97    59.98     60.21    0.1        
  mawk                         1479.21                                                             152.68 
  nawk                         3854.76                                                             180.05 

sales1.5M.csv 178                                                                                                                                                
  gawk                         0.3163 ± 0.0018        0.3155   0.3160    0.3174   0.1              178.63 ± 0.32          178.46   178.47    178.97   0.1        

sales5M.csv 595                                                                                                                                                  
  gawk                         1.0447 ± 0.0041        1.0408   1.0454    1.0480   0.1              592.62 ± 0.43          592.45   592.45    592.96   0.0        

sales10M.csv 1190                                                                                                                                                
  gawk                         2.0706 ± 0.0041        2.0704   2.0706    2.0709   0.0              1184.01 ± 0.46         1183.92  1183.93   1184.19  0.0        
</code></pre>
<p>Summary Table</p>
<pre><code class="language-plaintext">Benchmark #4 Summary Table

File size        rt [s]                          pm [MB]                 
    [MB]     gawk       mawk       nawk       gawk       mawk       nawk
------------------------------------------------------------------------
    0.12     0.0017     0.0062     0.0081      0.7        1.1        1.0
     1.2     0.0040     1.2365     1.4840      2.0        5.1        6.4
      12     0.0238    54.5716    88.8361     12.7       36.9       49.4
      60     0.1080  1479.2100  3854.7600     60.0      152.7      180.1
     178     0.3160 ---------- ----------    178.5 ---------- ----------
     595     1.0454 ---------- ----------    592.5 ---------- ----------
    1190     2.0706 ---------- ----------   1183.9 ---------- ----------
</code></pre>
<p>Normalized results: RT (normalized runtime) and MO (memory overhead)</p>
<pre><code class="language-plaintext">Benchmark #4 Normalized Results

File size   RT                       MO          
    [MB]    gawk   mawk     nawk     gawk   mawk   nawk
-------------------------------------------------------
    0.12    1.0     3.6      4.8     6.2    9.6    7.9
     1.2    1.0   309.1    371.0     1.7    4.3    5.4
      12    1.0  2292.9   3732.6     1.1    3.1    4.1
      60    1.0 13696.4  35692.2     1.0    2.5    3.0
</code></pre>
<h2>Discussion</h2>
<p>For the comparative analysis normalized metrics were used:</p>
<ul>
<li><p><strong>MO (Memory Overhead):</strong> This is the ratio of peak memory usage to raw file size. For example, an <strong>MO of 2.0</strong> means the process used twice as much RAM as the data occupies on disk. This allows a direct comparison of memory efficiency regardless of input size.</p>
</li>
<li><p><strong>RT (Normalized Runtime):</strong> This is the execution time divided by that of the fastest variant for the given file size, so the fastest engine scores 1.0. It shows how many times longer each engine takes on the same data, giving a clear picture of relative speed across the AWK variants.</p>
</li>
</ul>
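<p>As a worked example (numbers taken from the Benchmark #1 tables for the 595MB file, where <strong>mawk</strong> was fastest), the normalized values reduce to a few divisions:</p>
<pre><code class="language-plaintext">awk 'BEGIN {
    size = 595                                  # file size [MB]
    # gawk: 1.4853 s, 1252.23 MB; mawk (baseline): 1.1788 s, 1042.48 MB
    printf "gawk  RT=%.1f MO=%.1f\n", 1.4853/1.1788, 1252.23/size
    printf "mawk  RT=%.1f MO=%.1f\n", 1.1788/1.1788, 1042.48/size
}'
</code></pre>
<p>The output matches the 595MB row of the Benchmark #1 normalized table.</p>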
<h3>Benchmark #1</h3>
<p>The data confirms that for the simple line-storage pattern (<code>x[NR]=$0</code>), memory consumption is a strictly linear function of the input file size across all three variants. As the data scales from 120KB to 1.2GB, the normalized memory overhead (MO) approaches an asymptote: the initial variance caused by interpreter startup costs (which peaked at 6.2x for the smallest file) stabilizes at higher volumes. By the 1.2GB mark, <strong>gawk</strong> and <strong>nawk</strong> settle at roughly 2.1x and 2.0x overhead relative to the raw file size, while <strong>mawk</strong> maintains a leaner 1.8x, making it the most memory-efficient engine for large-scale string retention.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/fd36e335-63f9-400b-9610-8e37f400fa4d.png" alt="Benchmark #1: Peak Memory vs File Size" style="display:block;margin:0 auto" />

<p>In terms of runtime performance, <strong>mawk</strong> was consistently the fastest variant, serving as the baseline (1.0) for the normalized runtime (RT) at every file size. <strong>gawk</strong> grew more efficient as the workload increased, dropping from 1.9x to 1.2x the runtime of <strong>mawk</strong>, while <strong>nawk</strong> struggled with this storage pattern, finishing 3.3x slower than <strong>mawk</strong> at the 10-million-row mark. These results show that for pure data-population tasks where preserving line integrity matters, <strong>mawk</strong> offers the best balance of speed and memory footprint.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/72ec4294-3223-4e54-9c98-26c5191d34a9.png" alt="Benchmark #1: Normalized Results" style="display:block;margin:0 auto" />

<h3>Benchmark #2</h3>
<p>The data for the 2D matrix population <code>x[NR, i] = $i</code> shows a massive increase in resource requirements compared to simple line storage, though the peak memory remains a strictly linear function of the file size. As the dataset scales toward 1.2GB, the normalized memory overhead reaches an asymptotic state where the initial interpreter costs become negligible. In this scenario, <strong>mawk</strong> proves to be the most memory-efficient by far, stabilizing at a memory overhead of 13.4. In contrast, <strong>gawk</strong> is exceptionally heavy for this storage pattern, requiring <strong>38 times the raw file size in RAM</strong>, which is nearly triple the footprint of <strong>mawk</strong>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/606f1588-86e8-4965-ab07-fe24b97a1c81.png" alt="Benchmark #2: Peak Memory vs File Size" style="display:block;margin:0 auto" />

<p>The runtime performance results reveal a significant shift in execution efficiency as the number of array elements grows. While <strong>mawk</strong> is the fastest for small files, its performance degrades significantly at scale, eventually becoming the slowest variant with a normalized runtime of 1.8. Conversely, <strong>nawk</strong> emerges as the performance leader for large-scale matrix population, maintaining the baseline speed of 1.0 at high volumes. These results illustrate a clear trade-off: <strong>mawk</strong> is the optimal choice for minimizing the memory footprint in massive stateful operations, but <strong>nawk</strong> offers superior throughput when processing tens of millions of discrete fields.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/c3f4a214-39e1-46a5-bd21-aa443c2a2fc9.png" alt="Benchmark #2: Normalized Results" style="display:block;margin:0 auto" />

<h3>Benchmark #3</h3>
<p>In Benchmark #3, using 14 independent 1D arrays proves significantly more memory-efficient than the 2D composite key approach across all variants. The peak memory usage remains a linear function of file size, with normalized memory overhead (MO) reaching a steady state quickly. <strong>mawk</strong> again demonstrates superior memory management, stabilizing at an MO of 10.4, which is about 40% more efficient than <strong>gawk</strong>’s 17.3 and <strong>nawk</strong>’s 15.6. Interestingly, <strong>gawk</strong>'s native array-of-arrays feature (gawk*) proved to be the most resource-intensive strategy in this test, with a stabilized MO of 25.8. This suggests that the internal overhead of managing nested objects in <strong>gawk</strong> is substantially higher than managing multiple flat hash tables.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/1033e3cc-927d-417d-8b8a-9946997039bd.png" alt="Benchmark #3: Peak Memory vs File Size" style="display:block;margin:0 auto" />

<p>Runtime-wise, <strong>mawk</strong> maintained its lead as the fastest variant, serving as the 1.0 baseline for all file sizes. <strong>gawk</strong> and <strong>nawk</strong> performed similarly at scale, with <strong>gawk</strong> finishing about 1.4 times slower than <strong>mawk</strong>, while <strong>nawk</strong> lagged at 1.9 times slower. Despite the structural elegance of <strong>gawk</strong>'s nested arrays, the gawk* results showed no performance benefit over the 1D array method, consistently running about 2.0 times slower than <strong>mawk</strong>. For users requiring field-level access at scale, the strategy of multiple 1D arrays in <strong>mawk</strong> provides the best optimization of both execution speed and memory footprint.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/acd5bf6b-44bd-4ef7-8c23-36c4fe43b8f6.png" alt="Benchmark #3: Normalized Results" style="display:block;margin:0 auto" />

<h3>Benchmark #4</h3>
<p>Benchmark #4 reveals a dramatic divergence in performance, highlighting how differently the engines handle repeated string concatenation (<code>x = x $0</code>). Here <strong>gawk</strong> performs exceptionally, maintaining near-linear runtime as the file size increases thanks to a smarter string-reallocation strategy than its counterparts. At 60MB, <strong>gawk</strong> completes the task in just 0.1 seconds, whereas <strong>mawk</strong> and <strong>nawk</strong> suffer a quadratic collapse, each append copying the entire accumulated string: they take approximately 24 and 64 minutes respectively, and the 5x jump from 12MB to 60MB costs <strong>mawk</strong> roughly 26x more time, matching the expected O(n&#178;) growth. Due to these extreme runtimes, <strong>mawk</strong> and <strong>nawk</strong> were not tested beyond 60MB. For any workflow involving large-scale string building, <strong>gawk</strong> is the only viable option among the three.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/3a39d7e1-2688-4c0a-b426-7b76c792cef3.png" alt="Benchmark #4: Peak Memory vs File Size" style="display:block;margin:0 auto" />

<p>The memory overhead data also shows an interesting reversal of the previous benchmarks' trends. While <strong>mawk</strong> and <strong>nawk</strong> struggle with time, they initially show higher memory overhead relative to the file size during the transition phases. However, <strong>gawk</strong>’s memory usage remains extremely tight, approaching a 1.0 overhead ratio at the 60MB mark and beyond, effectively matching the raw file size. The massive RT (normalized runtime) values for <strong>mawk</strong> and <strong>nawk</strong>, reaching over 13,000x and 35,000x the duration of <strong>gawk</strong>, underscore a fundamental architectural difference: <strong>gawk</strong> is specifically optimized for efficient string appending, while the others suffer from costly repeated memory copying and reallocations.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/b34cf3d7-3fb6-4bcf-b1af-0c042c74af0b.png" alt="Benchmark #4: Normalized Results" style="display:block;margin:0 auto" />

<h2>Conclusion</h2>
<p>This table summarizes the Memory Overhead (MO, the ratio of peak memory usage relative to the raw file size) of the four benchmarks. These values represent the stable multiplier of peak memory relative to file size once the dataset is large enough to make interpreter startup costs negligible.</p>
<p><strong>Memory Overhead (MO) Summary Table</strong></p>
<table>
<thead>
<tr>
<th>Benchmark Scenario</th>
<th>gawk</th>
<th>mawk</th>
<th>nawk</th>
<th>Best Efficiency</th>
</tr>
</thead>
<tbody><tr>
<td>#1: Store entire lines</td>
<td>2.1</td>
<td>1.8</td>
<td>2.0</td>
<td><strong>mawk</strong></td>
</tr>
<tr>
<td>#2: Populate 2D matrix</td>
<td>38.0</td>
<td>13.4</td>
<td>15.8</td>
<td><strong>mawk</strong></td>
</tr>
<tr>
<td>#3: 1D array per field</td>
<td>17.3</td>
<td>10.4</td>
<td>15.6</td>
<td><strong>mawk</strong></td>
</tr>
<tr>
<td>#4: String concatenation*</td>
<td>1.0</td>
<td>2.5</td>
<td>3.0</td>
<td><strong>gawk</strong></td>
</tr>
</tbody></table>
<p>*Note: Benchmark #4 values are taken from the 60MB file due to the runtime constraints of <strong>mawk</strong> and <strong>nawk</strong>.</p>
<h3>Key Findings for the Article</h3>
<ul>
<li><p>The array efficiency gap: For stateful data population, <strong>mawk was consistently the most memory-efficient</strong>. In the 2D matrix test, it used roughly a third of the memory <strong>gawk</strong> required, highlighting its leaner internal representation of hash tables and strings.</p>
</li>
<li><p>Structure penalty: Breaking a CSV line into 14 discrete fields (Benchmark #3) increases memory overhead by approximately 5x to 8x compared to storing the line as a single string (Benchmark #1).</p>
</li>
<li><p><strong>gawk</strong>’s specialization: While <strong>gawk</strong> is the heaviest variant for array-based storage, it is uniquely optimized for string management. It was the only variant where memory overhead effectively equaled the file size (1.0) during massive string concatenation, coupled with extremely fast execution.</p>
</li>
<li><p>The cost of "Array of Arrays": Though not in the summary table, the results for <strong>gawk</strong> (25.8 MO) show that native nested structures are significantly more expensive than multiple 1D arrays (17.3 MO), likely due to the overhead of managing multiple internal hash table objects.</p>
</li>
</ul>
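<p>To make the numbers concrete, the population styles behind Benchmarks #1, #3, and #4 can be sketched as follows. This is a minimal illustration on a two-record, three-field inline sample; the actual benchmarks used a 14-field CSV and far larger files.</p>
<pre><code class="lang-bash"># Benchmark #1 style: store each whole line in one array
printf 'a,b,c\nd,e,f\n' | awk '{ line[NR] = $0 } END { print NR, "lines stored" }'

# Benchmark #3 style: one 1D array per field (roughly 5x-8x the overhead of #1)
printf 'a,b,c\nd,e,f\n' | awk -F, '{ f1[NR] = $1; f2[NR] = $2; f3[NR] = $3 } END { print NR, "records split" }'

# Benchmark #4 style: build one large string by concatenation
printf 'a,b,c\nd,e,f\n' | awk '{ s = s $0 "\n" } END { print length(s), "bytes concatenated" }'
</code></pre>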
<p>In conclusion, these results demonstrate that using AWK in stateful mode requires careful consideration. While these benchmarks were conducted by populating the entire dataset to test engine limits, significant memory can be saved in practice by populating only the specific fields or records needed for the task. <strong>If RAM matters, mawk is the clear leader for population methods involving arrays or matrix simulations.</strong> However, <strong>for methods requiring large-scale string building, gawk remains the only viable alternative</strong>. Ultimately, selecting the right population method and the appropriate AWK variant is essential for maintaining stability and performance when processing large datasets.</p>
]]></content:encoded></item><item><title><![CDATA[FreeBSD and dwl on a 2010 ThinkPad 
]]></title><description><![CDATA[FreeBSD is a Unix operating system with a long history of stability, clean design, and excellent documentation. Older hardware tends to run it particularly well: mature driver support and a lean base ]]></description><link>https://awklab.com/freebsd-dwl</link><guid isPermaLink="true">https://awklab.com/freebsd-dwl</guid><category><![CDATA[FreeBSD]]></category><category><![CDATA[dwl]]></category><category><![CDATA[unix]]></category><category><![CDATA[wayland]]></category><category><![CDATA[thinkpad]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Thu, 05 Mar 2026 19:06:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/199072e1-72f5-4665-a1c6-150c011d4eaf.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>FreeBSD is a Unix operating system with a long history of stability, clean design, and excellent documentation. Older hardware tends to run it particularly well: mature driver support and a lean base system make aging machines surprisingly capable. Old does not necessarily mean obsolete, provided the hardware is paired with the right OS and the software stack remains minimal.</p>
<p>This write-up covers a 2010 ThinkPad L412 with an Intel Core i3-350M and 8 GB RAM running FreeBSD 15.0 with <strong>dwl</strong> as the Wayland compositor. <strong>dwl</strong> is the Wayland equivalent of <strong>dwm</strong>, the well-known X11 window manager from the suckless project. It is minimal, efficient, and follows the Unix philosophy — you start with the bare minimum and add only what you need. The result is a modern Wayland desktop running on sixteen-year-old hardware that remains usable for everyday tasks.</p>
<p>For installation, I followed the <a href="https://docs.freebsd.org/en/books/handbook/">FreeBSD Handbook</a> and installed <strong>FreeBSD 15.0</strong> with only the <strong>base system</strong>, the <strong>sh</strong> shell, and <strong>lib32</strong> on a <strong>UFS</strong> filesystem. The installation itself was straightforward. The only notable limitation was Wi-Fi: on this hardware the driver works reliably only on the 2.4 GHz band. The 5 GHz band proved highly unstable and was practically unusable.</p>
<p>The following section walks through the <strong>dwl</strong> installation process in enough detail to reproduce this setup.</p>
<h2>Required packages for dwl</h2>
<ul>
<li><p>wayland, wayland-protocols</p>
</li>
<li><p>drm-kmod (GPU driver)</p>
</li>
<li><p>wlroots019 (for dwl-0.8)</p>
</li>
<li><p>foot (terminal)</p>
</li>
<li><p>dejavu (my preferred font)</p>
</li>
<li><p>wmenu (dmenu equivalent)</p>
</li>
<li><p>wl-clipboard, grim, slurp (clipboard and screenshots)</p>
</li>
<li><p>swaybg (for background image)</p>
</li>
<li><p>mako (notification daemon)</p>
</li>
<li><p>neovim (editor)</p>
</li>
<li><p>gmake, gcc, pkgconf, evdev-proto (for compilation)</p>
</li>
<li><p>fcft, tllist</p>
</li>
<li><p>wget, firefox</p>
</li>
</ul>
<pre><code class="language-plaintext">sudo pkg install wayland wayland-protocols drm-kmod wlroots019 foot dejavu
sudo pkg install wmenu wl-clipboard grim slurp swaybg mako neovim
sudo pkg install gmake gcc pkgconf evdev-proto fcft tllist wget firefox 
</code></pre>
<p>Note: <code>sudo</code> is not included in the base system on FreeBSD, so install it first (<code>pkg install sudo</code>) or run commands as root.</p>
<h2>System configuration</h2>
<p>Enable seatd</p>
<pre><code class="language-plaintext">sudo sysrc seatd_enable="YES"
sudo service seatd start

# Add to .profile
export LIBSEAT_BACKEND="seatd"
</code></pre>
<p>Add your user to the video/input groups</p>
<pre><code class="language-plaintext">sudo pw groupmod video -m [your_username]
sudo pw groupmod input -m [your_username]
</code></pre>
<p>Enable audio server</p>
<pre><code class="language-plaintext">sudo sysrc sndiod_enable="YES"
sudo service sndiod start
</code></pre>
<p>Load GPU driver</p>
<pre><code class="language-plaintext">sudo kldload /boot/modules/i915kms.ko

# Make it permanent by adding it to /etc/rc.conf
sudo sysrc kld_list+="/boot/modules/i915kms.ko"
</code></pre>
<p>Update evdev-proto header path</p>
<pre><code class="language-plaintext">sudo ln -s /usr/local/include/linux /usr/include/linux
</code></pre>
<p>Enable UTF-8 for foot</p>
<pre><code class="language-plaintext"># Add to .profile
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
</code></pre>
<p>Note: log out and back in for the changes to take effect.</p>
<h2>Building dwl</h2>
<p>Download <strong>dwl 0.8</strong> and <strong>slstatus 1.1</strong> from their respective repositories and extract them.</p>
<pre><code class="language-plaintext">wget https://codeberg.org/dwl/dwl/archive/v0.8.tar.gz
wget https://dl.suckless.org/tools/slstatus-1.1.tar.gz
tar xvf v0.8.tar.gz
tar xvf slstatus-1.1.tar.gz
</code></pre>
<p>To customize <strong>dwl</strong>, edit <code>config.def.h</code> and apply <a href="https://codeberg.org/dwl/dwl-patches">patches</a> as needed. For reference, here is the list of patches I applied in this setup:</p>
<ul>
<li><p><code>attachbottom</code>, <code>movestack</code>, <code>pertag</code> for my preferred <strong>dwl</strong> behavior.</p>
</li>
<li><p><code>bar</code> - necessary patch for a status-bar</p>
</li>
</ul>
<p>The command to patch:</p>
<pre><code class="language-plaintext">patch -i bar.patch
</code></pre>
<p>Note: If a patch fails to apply cleanly, the output of <code>patch</code> will indicate which hunks were rejected, and those changes must be applied manually.</p>
<p>Unlike <strong>dwm</strong>, the <strong>dwl</strong> patch collection does not include a <code>focusadjacenttag</code> patch, so I implemented the modification myself. This mod adds two functions: one to focus the tag immediately to the left or right of the currently active tag, and another to move the focused window to the adjacent tag in either direction.</p>
<pre><code class="language-c">// Equivalent to focusadjacenttag dwm patch
// Add to dwl.c

static void viewtoadjacent(const Arg *arg);
static void tagtoadjacent(const Arg *arg);

void
viewtoadjacent(const Arg *arg)
{
	unsigned int newtag;
	unsigned int curtag = selmon-&gt;tagset[selmon-&gt;seltags];
	if (arg-&gt;i &gt; 0) // Cycle Right
		newtag = (curtag &lt;&lt; 1);
	else // Cycle Left
		newtag = (curtag &gt;&gt; 1);
	// Wrap around logic for standard 9 tags
	if (newtag &gt;= (1 &lt;&lt; TAGCOUNT)) newtag = 1;
	if (newtag &lt;= 0) newtag = (1 &lt;&lt; (TAGCOUNT - 1));
	view(&amp;(Arg){.ui = newtag});
}

void
tagtoadjacent(const Arg *arg)
{
	Client *c = focustop(selmon);
	unsigned int newtag;
	unsigned int curtag;
	if (!c)
		return;
	curtag = c-&gt;tags;
	if (arg-&gt;i &gt; 0) // Shift Right
		newtag = (curtag &lt;&lt; 1);
	else // Shift Left
		newtag = (curtag &gt;&gt; 1);
	// Wrap around logic for standard 9 tags
	if (newtag &gt;= (1 &lt;&lt; TAGCOUNT)) newtag = 1;
	if (newtag &lt;= 0) newtag = (1 &lt;&lt; (TAGCOUNT - 1));
	tag(&amp;(Arg){.ui = newtag});
}

// Add to config.def.h

/* tagging */
#define TAGCOUNT 9

/* modifier                  key             function         argument */
{ MODKEY,                    XKB_KEY_Left,   viewtoadjacent,  {.i = -1} },
{ MODKEY,                    XKB_KEY_Right,  viewtoadjacent,  {.i = +1} },
{ MODKEY|WLR_MODIFIER_SHIFT, XKB_KEY_Left,   tagtoadjacent,   {.i = -1} },
{ MODKEY|WLR_MODIFIER_SHIFT, XKB_KEY_Right,  tagtoadjacent,   {.i = +1} },
</code></pre>
<p>For screenshots, use <code>grim</code> and <code>slurp</code> with the following settings in <code>config.def.h</code>. Screenshots are copied to the clipboard via <code>wl-copy</code> and also saved in the <code>~/Pictures/Screenshots</code> directory.</p>
<pre><code class="language-c">/* Region Screenshot with Notification */
static const char *scrregion[] = { "sh", "-c", "grim -g \"$(slurp)\" - | tee ~/Pictures/Screenshots/$(date +%Y-%m-%d_%H-%M-%S).png | wl-copy &amp;&amp; notify-send 'Region Saved'", NULL };
/* Full Screen Screenshot with Notification */
static const char *scrfull[]   = { "sh", "-c", "grim - | tee ~/Pictures/Screenshots/$(date +%Y-%m-%d_%H-%M-%S).png | wl-copy &amp;&amp; notify-send 'Full Screen Saved'", NULL };

/* modifier                  key          function     argument */
{ MODKEY,                    XKB_KEY_p,   spawn,       {.v = scrfull } },
{ MODKEY|WLR_MODIFIER_SHIFT, XKB_KEY_p,   spawn,       {.v = scrregion } },
</code></pre>
<p>To customize keybindings and appearance, edit <code>config.def.h</code>, copy it to <code>config.h</code>, then build and install.</p>
<pre><code class="language-plaintext">cp config.def.h config.h
sudo gmake clean install
</code></pre>
<p><strong>slstatus</strong> can be customized the same way. On FreeBSD, <code>config.mk</code> has to be modified:</p>
<pre><code class="language-plaintext"># Add to config.mk
LDLIBS   = -lX11 -lkvm -lsndio
</code></pre>
<h3>Startup script</h3>
<p>In <code>$HOME/bin</code>, create an executable <code>sdwl</code> script (<code>chmod +x</code>) to launch <strong>dwl</strong>. You can set your wallpaper there, displayed using <code>swaybg</code>.</p>
<pre><code class="language-plaintext">#!/bin/sh
export $(dbus-launch)
mako &amp;
slstatus -s | dwl -s "sh -c 'swaybg -i ~/Pictures/BSDviolet.png &amp;'"
</code></pre>
<h2>Configuration</h2>
<p>I use the <strong>Dracula</strong> color scheme for <strong>foot</strong> and <strong>Neovim</strong>, with the <strong>vim-startify</strong> and <strong>vim-airline</strong> plugins. <strong>mako</strong> is also configured for Wayland notifications. The setup uses the <strong>DejaVu</strong> font. No gaps, no Nerd Fonts, no icons on the status bar. It’s purely functional and minimal, keeping the desktop clean and efficient.</p>
<p>All configuration files, including the wallpaper, are available in my <a href="https://github.com/awklab/FreeBSD-dwl">GitHub</a> repository. The wallpaper is AI-generated. Screenshots are shown below.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/3bcb562b-e3c0-4a3f-a8db-2dc14cfe865c.png" alt="FreeBSD with dwl, neovim, violet theme" style="display:block;margin:0 auto" />

<br />

<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/1cc220b6-f6c3-4d8a-a280-99bb435105e1.png" alt="FreeBsd with dwl, tiling setup, foot, htop, neofetch, neovim" style="display:block;margin:0 auto" />

<h2>Conclusion</h2>
<p><strong>FreeBSD</strong> with <strong>dwl</strong> results in a usable and surprisingly snappy system on this old, modest hardware. On a fresh start, total memory usage (Active + Wired + Laundry) stays under 700 MB, and disk usage is 8 GB. Hopefully this guide proves useful for anyone looking to revive older machines with a minimal Wayland setup.</p>
]]></content:encoded></item><item><title><![CDATA[The BEHILOS Benchmark]]></title><description><![CDATA[In his book Unix: A History and a Memoir, Brian Kernighan recounts his favorite grep story from the early days of Unix. Someone at Bell Labs asked whether it was possible to find English words composed only of the letters formed by an upside-down cal...]]></description><link>https://awklab.com/behilos-benchmark</link><guid isPermaLink="true">https://awklab.com/behilos-benchmark</guid><category><![CDATA[grep]]></category><category><![CDATA[cli]]></category><category><![CDATA[Search Engines]]></category><category><![CDATA[Linux]]></category><category><![CDATA[unix]]></category><category><![CDATA[benchmarking]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Tue, 10 Feb 2026 22:55:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770756490806/28ff030b-0d40-43bb-a70b-c54e8fd74086.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In his book <em>Unix: A History and a Memoir</em>, Brian Kernighan recounts his favorite grep story from the early days of Unix. Someone at Bell Labs asked whether it was possible to find English words composed only of the letters formed by an upside-down calculator. The digits on a turned calculator display, 5071438, map to the letter set BEHILOS.</p>
<p>Kernighan grepped the regular expression <code>^[behilos]*$</code> against Webster’s Second International Dictionary, which contained 234,936 words, and found 263 matches, including words he had never seen before.</p>
<p>The current <a target="_blank" href="https://web.mit.edu/freebsd/head/share/dict/">Webster’s Second International Dictionary</a> contains 236,007 words. To reproduce the results, run:</p>
<pre><code class="lang-bash">grep <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
</code></pre>
<p>This results in 264 matches. The longest words are nine characters long: <em>blissless</em> and <em>booboisie</em>.</p>
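<p>The length check can be reproduced without the full dictionary; the snippet below runs the same filter over a tiny inline word list for illustration:</p>
<pre><code class="lang-bash"># Keep BEHILOS-only words, then print the nine-letter ones
# (inline sample; point grep at /usr/share/dict/web2 for the real run)
printf 'bliss\nblissless\nbooboisie\nshoe\n' |
    grep '^[behilos]*$' |
    awk 'length($0) == 9'
</code></pre>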
<p>What started as a historical footnote quickly made me wonder: <strong>how would the old BEHILOS grep perform as a quick-and-dirty benchmark for today’s search tools?</strong></p>
<h2 id="heading-selected-text-search-tools-for-the-behilos-benchmark">Selected Text Search Tools for the BEHILOS Benchmark</h2>
<p>For the BEHILOS benchmark, I selected a mix of classic and modern text search tools, covering traditional Unix utilities, AWK variants, and fast recursive searchers widely used by developers today.</p>
<ul>
<li><p><strong>grep</strong> – the classic Unix tool and historical baseline for text searching.</p>
</li>
<li><p><strong>rg (ripgrep)</strong> – a modern, extremely fast recursive searcher optimized for large codebases.</p>
</li>
<li><p><strong>gawk</strong> – the GNU AWK implementation, feature-rich and widely used for text processing.</p>
</li>
<li><p><strong>mawk</strong> – a lightweight, efficient AWK variant with minimal memory footprint.</p>
</li>
<li><p><strong>nawk</strong> – the traditional New AWK, preserving historical behavior for legacy scripts.</p>
</li>
<li><p><strong>ag (The Silver Searcher)</strong> – a fast recursive searcher often replacing ack.</p>
</li>
<li><p><strong>pt (The Platinum Searcher)</strong> – a newer, recursive grep alternative with multithreading support.</p>
</li>
<li><p><strong>ack</strong> – a Perl-based source tree searcher, maintained and optimized for code patterns.</p>
</li>
<li><p><strong>ugrep</strong> – a feature-rich, modern grep clone with extended regex support and performance tuning.</p>
</li>
<li><p><strong>sift</strong> – a recursive search tool for large directories, optimized for developer workflows.</p>
</li>
</ul>
<h2 id="heading-benchmarking-methodology">Benchmarking Methodology</h2>
<p>To evaluate the performance of the search engines, the benchmarking focused on two critical metrics, <strong>runtime</strong> and <strong>peak memory usage</strong>, which together represent the <strong>total resource footprint</strong>. Resource tracking was performed using <a target="_blank" href="https://github.com/gsauthof/cgmemtime">cgmemtime</a>, an ideal tool for this purpose as it captures peak memory consumption for the process group.</p>
<p>The benchmarking process was automated via <a target="_blank" href="https://github.com/awklab/benchgab.awk"><code>benchgab.awk</code></a> (version 2026.02.10.), my custom runner that handles warmups, multiple test runs, and calculates statistical metrics as well as normalized parameters for comparative analysis. Each benchmark sequence included one initial warmup run followed by 100 recorded runs.</p>
<p>The following table summarizes the evaluated search engines, their versions, and the exact commands used for the BEHILOS grep.</p>
<pre><code class="lang-bash">Name      Version      Command
----      -------      -------
grep      3.12         grep <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
ripgrep   15.1.0       rg <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
gawk      5.3.2        gawk <span class="hljs-string">'/^[behilos]*$/'</span> /usr/share/dict/web2
mawk      1.3.4        mawk <span class="hljs-string">'/^[behilos]*$/'</span> /usr/share/dict/web2
nawk      20251225     nawk <span class="hljs-string">'/^[behilos]*$/'</span> /usr/share/dict/web2
ag        2.2.0        ag -s <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
pt        2.2.0        pt -e <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
ack       3.9.0        ack <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
ugrep     7.5.0        ugrep <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
sift      0.9.1        sift <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
</code></pre>
<p>Tests were conducted on an Arch Linux workstation powered by a Ryzen 9 5900X CPU, using the Alacritty terminal within a dwm session.</p>
<h2 id="heading-results">Results</h2>
<p>The statistical summary was computed by the script, deriving the mean, standard deviation, median, minimum, and maximum from this 100-run sample for both runtime and peak memory usage.</p>
<p>Jitter (Jtr%) was calculated for both runtime and peak memory usage as <code>abs((mean − median) / median) × 100%</code>, quantifying run-to-run variability. For low-footprint commands, even minor scheduling effects or transient memory spikes can noticeably influence averages, making jitter a useful indicator of measurement stability.</p>
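<p>As a sanity check, the jitter formula can be evaluated directly; the sketch below plugs in the gawk runtime mean and median from the summary:</p>
<pre><code class="lang-bash"># Jtr% = abs((mean - median) / median) * 100
awk -v mean=0.0220 -v median=0.0219 'BEGIN {
    printf "Jtr%% = %.1f\n", sqrt(((mean - median) / median)^2) * 100
}'
</code></pre>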
<pre><code class="lang-bash">--- Statistical Summary of BEHILOS Benchmarks ---
cmd     Runtime [s]                                      Peak Memory [MB]
        mean ± sdev      min     median  max     Jtr%    mean ± sdev      min     median  max     Jtr%
gawk    0.0220 ± 0.0008  0.0204  0.0219  0.0251  0.5     0.75 ± 0.16      0.48    0.74    1.00    1.0
ugrep   0.0087 ± 0.0005  0.0076  0.0086  0.0102  1.1     1.64 ± 0.14      1.44    1.60    2.03    2.3
ack     0.0880 ± 0.0027  0.0838  0.0874  0.1003  0.7     6.49 ± 0.26      6.12    6.39    7.19    1.6
sift    0.0402 ± 0.0021  0.0370  0.0398  0.0489  1.1     13.32 ± 2.12     9.98    13.11   20.46   1.7
mawk    0.0089 ± 0.0005  0.0078  0.0088  0.0109  1.0     0.59 ± 0.14      0.48    0.54    1.25    8.9
ag      0.0139 ± 0.0008  0.0119  0.0138  0.0167  0.5     2.22 ± 0.45      1.31    2.21    3.23    0.2
pt      0.0245 ± 0.0010  0.0227  0.0244  0.0286  0.5     8.03 ± 0.32      7.49    7.98    10.48   0.6
grep    0.0048 ± 0.0003  0.0041  0.0048  0.0068  0.3     0.71 ± 0.11      0.60    0.64    1.12    10.9
rg      0.0056 ± 0.0003  0.0048  0.0056  0.0064  1.1     1.13 ± 0.17      0.98    1.01    1.75    11.4
nawk    0.0389 ± 0.0012  0.0372  0.0387  0.0429  0.7     0.64 ± 0.16      0.48    0.73    1.05    12.4
</code></pre>
<p>The evaluation is based on <strong>normalized metrics</strong>:</p>
<ul>
<li><p><strong>RT</strong>: Normalized median runtime. The execution time relative to the fastest implementation (1.0 is the baseline).</p>
</li>
<li><p><strong>PM</strong>: Normalized median group peak memory. The peak memory relative to the implementation with the lowest memory footprint (1.0 is the baseline).</p>
</li>
<li><p><strong>d</strong>: Euclidean Distance. Measures the geometric distance from the "Ideal Point" (1,1). Lower values denote higher implementation efficiency.</p>
</li>
<li><p><strong>F</strong>: Resource Footprint. Calculated as RT×PM. This represents the total resource footprint; lower values indicate a more efficient use of system resources to complete the same task.</p>
</li>
</ul>
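<p>Both derived metrics are easy to recompute; the sketch below uses mawk's normalized values as input:</p>
<pre><code class="lang-bash"># d = sqrt((RT-1)^2 + (PM-1)^2), F = RT * PM
awk -v RT=1.83 -v PM=1.00 'BEGIN {
    printf "d = %.2f  F = %.2f\n", sqrt((RT - 1)^2 + (PM - 1)^2), RT * PM
}'
</code></pre>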
<p>The following table summarizes the overall performance of the 10 tested search engines according to the normalized BEHILOS benchmarks.</p>
<pre><code class="lang-bash">-- Normalized BEHILOS Benchmarks ---
cmd     RT      PM      d       F
gawk    4.52    1.37    3.54    6.20
ugrep   1.78    2.97    2.12    5.28
ack     18.07   11.86   20.23   214.26
sift    8.23    24.31   24.41   200.00
mawk    1.83    1.00    0.83    1.83
ag      2.85    4.11    3.62    11.73
pt      5.05    14.81   14.39   74.79
grep    1.00    1.18    0.18    1.18
rg      1.17    1.88    0.89    2.19
nawk    8.00    1.36    7.01    10.90
</code></pre>
<h2 id="heading-discussion">Discussion</h2>
<p>The normalized BEHILOS benchmarks were evaluated using <strong>Pareto frontier analysis</strong> (as previously applied in my <a target="_blank" href="https://awklab.com/practical-awk-benchmarking">AWK benchmarking study</a>). To visualize search engine performance, the normalized values were plotted in a two-dimensional coordinate system, where the x-axis represents normalized runtime (RT) and the y-axis represents normalized peak memory usage (PM).</p>
<p>The ideal point is located at (1, 1), representing an implementation that is simultaneously the fastest and the most memory-efficient. To improve the visibility of implementations clustered near the ideal point, a logarithmic scale was applied.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770762029172/22a09a93-3452-47ed-8be3-a4892655b5e7.png" alt="BEHILOS Benchmark: Pareto Efficiency Plot" class="image--center mx-auto" /></p>
<p><strong>Graph</strong>: <em>The Pareto frontier of search engines tested in the BEHILOS Benchmark, visualizing the optimal trade-off between execution speed and memory footprint.</em></p>
<p>The normalized BEHILOS results clearly separate the tested tools into distinct performance tiers when runtime and peak memory usage are considered jointly.</p>
<h3 id="heading-grep-the-overall-winner-of-the-behilos-benchmark"><strong>grep</strong> the overall winner of the BEHILOS Benchmark.</h3>
<p>Taken together, the normalized metrics and the Pareto frontier analysis identify <strong>grep</strong> as the overall winner of the BEHILOS Benchmark. It is not only the fastest implementation (RT = 1.00) but also achieves the lowest total resource footprint, as reflected by the minimum F value (F = 1.18).</p>
<h3 id="heading-grep-and-mawk-define-the-pareto-frontier"><strong>grep</strong> and <strong>mawk</strong> define the Pareto frontier.</h3>
<p><strong>grep</strong> is the fastest implementation (RT = 1.00) while remaining very close to the minimum memory baseline (PM = 1.18). <strong>mawk</strong>, in contrast, achieves the lowest peak memory usage (PM = 1.00) with only a modest runtime penalty (RT = 1.83). Neither tool can be improved in one dimension without degrading the other, placing both on the Pareto frontier and representing the optimal trade-off envelope for this workload.</p>
<h3 id="heading-near-frontier-but-dominated-tools">Near-frontier but dominated tools.</h3>
<p><strong>rg</strong>, <strong>ugrep</strong>, <strong>ag</strong>, and <strong>gawk</strong> are dominated by the frontier but remain reasonably close to it. Their normalized distance and F values indicate that they are not optimal for this specific task, yet their performance characteristics are still competitive. This reflects design choices favoring richer feature sets, broader file handling, or more general workloads rather than minimal footprint.</p>
<h3 id="heading-clearly-dominated-implementations">Clearly dominated implementations.</h3>
<p><strong>pt</strong>, and especially <strong>sift</strong> and <strong>ack</strong>, lie far from the Pareto frontier. Their high normalized runtime and peak memory usage result in very large F values, indicating poor efficiency for this narrowly defined benchmark. These tools incur significant overhead relative to the simplicity of the BEHILOS search.</p>
<h3 id="heading-the-case-of-nawk">The case of <strong>nawk</strong>.</h3>
<p>Although <strong>nawk</strong> exhibits a low peak memory footprint (PM = 1.36), its slow runtime (RT = 8.00) places it well outside the efficient region. In addition, it showed the highest peak memory jitter among all tested tools, which negatively affected its stability metrics and overall performance profile.</p>
<p>Overall, the Pareto analysis highlights that tools optimized for minimalism and predictability dominate this benchmark, while more feature-heavy searchers pay a measurable cost in both runtime and memory.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The BEHILOS grep, despite originating as a historical anecdote from the early days of Unix, turns out to be an effective micro-benchmark. Its simplicity isolates the core costs of regex matching, process startup, and memory allocation without confounding factors such as filesystem traversal or complex I/O patterns.</p>
<p>This benchmark shows that, for low-footprint text searches, decades-old design principles still matter. Classic Unix tools like <strong>grep</strong>, along with lean implementations such as <strong>mawk</strong>, remain hard to beat when efficiency is the primary goal. Modern search engines deliver powerful features and excellent performance for real-world workloads, but those advantages are not free.</p>
<p>The <strong>BEHILOS Benchmark</strong> does not aim to crown a universal “best” search tool. Instead, it demonstrates how a minimal, well-chosen workload can expose fundamental trade-offs between speed, memory usage, and stability—and why even a small historical footnote can still teach us something meaningful about performance today.</p>
]]></content:encoded></item><item><title><![CDATA[AWK: the Zero-Setup Pre-Processor]]></title><description><![CDATA[Modern data pipelines most often fail at their beginning, not their end. A malformed record, an unexpected delimiter, or an encoding anomaly can cause otherwise robust processing engines to abort after consuming significant computational resources. T...]]></description><link>https://awklab.com/awk-the-zero-setup-pre-processor</link><guid isPermaLink="true">https://awklab.com/awk-the-zero-setup-pre-processor</guid><category><![CDATA[awk]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[data pipeline]]></category><category><![CDATA[Linux]]></category><category><![CDATA[unix]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Sun, 01 Feb 2026 20:39:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769876288331/3c7c5cb2-2ebe-4ea4-a31d-c407862ce952.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Modern data pipelines most often fail at their beginning, not their end. A malformed record, an unexpected delimiter, or an encoding anomaly can cause otherwise robust processing engines to abort after consuming significant computational resources. These failures are not rare edge cases; they are a predictable consequence of feeding untrusted, heterogeneous input into systems that implicitly assume structural coherence.</p>
<p>Preventing such aborts requires a tool that operates before schema, before types, and before semantic assumptions are applied: schema-agnostic, streaming, low in resource footprint, and available everywhere data flows. It must tolerate mixed record structures, inconsistent delimiters, and partial corruption without imposing premature interpretation. Such a tool already exists—and has existed since the 1970s. Its name is <strong>AWK</strong>.</p>
<h2 id="heading-the-contemporary-data-pipeline-model">The Contemporary Data Pipeline Model</h2>
<p>Modern data systems are commonly described using a five-phase pipeline:</p>
<p><strong>Phase 1. Ingest</strong> – raw data arrival from external sources</p>
<p><strong>Phase 2. Validate</strong> – quality checks and correctness guarantees</p>
<p><strong>Phase 3. Transform</strong> – schema enforcement, normalization, columnar operations</p>
<p><strong>Phase 4. Analyze</strong> – analytics, feature engineering, ML preparation</p>
<p><strong>Phase 5. Consume</strong> – BI, reporting, downstream products</p>
<p>This model is widely recognized across data engineering practice, even if the terminology varies slightly between platforms and tools. Crucially, validation is now treated as a first-class concern: data contracts, expectations, and quality gates are standard components of modern stacks.</p>
<p>In practice, however, validation is often implemented primarily as a semantic operation—type checks, nullability constraints, and value ranges—implicitly presuming that incoming data already satisfies basic structural requirements.</p>
<h2 id="heading-structural-validation-comes-before-semantic-validation">Structural Validation Comes Before Semantic Validation</h2>
<p>Validation can be divided into two fundamentally different layers:</p>
<p><strong>Structural (geometric) validation</strong>: Concerned with physical integrity: record boundaries, delimiter consistency, field counts, encoding correctness, and basic layout.</p>
<p><strong>Semantic validation</strong>: Concerned with meaning: data types, ranges, domain rules, and business logic.</p>
<p>Semantic validation <em>depends</em> on structural integrity. A columnar engine cannot validate a date column if unescaped delimiters have shifted field boundaries. A schema-on-read system cannot enforce types if records are misaligned or partially corrupted. The pipeline fails before semantics can even be evaluated.</p>
<p>A related failure mode appears when pandas is used in <strong>Phase 3</strong>, where schema enforcement and large-scale transformation are expected. Pandas is architecturally aligned with <strong>Phase 4</strong> workloads, and applying it earlier on large datasets can lead to memory exhaustion, just as applying Polars or DuckDB in <strong>Phase 2</strong>—before structural validation—leads to structural parse failures.</p>
<p>This highlights the absence of an explicit <strong>Phase-2</strong> structural validation layer in many modern data pipelines.</p>
<h2 id="heading-phase-2-explicitly-structural-validation">Phase 2, Explicitly: Structural Validation</h2>
<p>Within the standard ingest → validate → transform model, <strong>structural validation is the earliest and most failure-prone part of Phase 2</strong>. Its purpose is not to interpret data, but to determine whether the data is fit to be interpreted at all.</p>
<p>A tool operating at this layer must satisfy specific constraints:</p>
<ul>
<li><p>operate on untrusted, possibly malformed input</p>
</li>
<li><p>process data as a stream, with constant memory usage</p>
</li>
<li><p>make minimal assumptions about structure</p>
</li>
<li><p>integrate cleanly into automated pipelines</p>
</li>
<li><p>fail fast and produce actionable diagnostics</p>
</li>
</ul>
<p>This is where <strong>AWK</strong> belongs.</p>
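<p>A minimal sketch of such a Phase-2 gate, assuming a comma-delimited feed in which every record should carry three fields (both the delimiter and the expected field count are illustrative):</p>
<pre><code class="lang-bash"># Fail fast on structural deviation: report offending records, exit nonzero
printf 'a,b,c\nd,e\nf,g,h\n' | awk -F, '
    NF != 3 { printf "record %d: %d fields, expected 3\n", NR, NF; fail = 1 }
    END { exit fail }'
</code></pre>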
<h2 id="heading-awks-role-in-phase-2">AWK’s Role in Phase 2</h2>
<p>AWK is not an analytics tool, a transformation engine, or a schema system. Its strength lies earlier — <strong>inside Phase 2, before schema and semantics are applied</strong>.</p>
<p>Architecturally, AWK functions as a <strong>pre-schema validation sentinel</strong>.</p>
<p>It processes text streams sequentially, with memory usage independent of input size. A multi-hundred-gigabyte file can be inspected using the same resources as a kilobyte-scale sample. This property alone makes AWK suitable for structural inspection of datasets that exceed available memory.</p>
<p>Its footprint is minimal—on the order of hundreds of kilobytes—and its availability is effectively universal. Every Unix-derived system, including Linux distributions, macOS, BSD variants, container base images, and CI runners, provides AWK by default. No environment setup, dependency resolution, or runtime configuration is required.</p>
<p>For a tool whose purpose is to guard the entrance to a pipeline, this matters. <strong>Phase-2</strong> components should be reliable, predictable, and easy to deploy everywhere.</p>
<h2 id="heading-why-textual-validation-matters">Why Textual Validation Matters</h2>
<p>Real-world data is rarely homogeneous. Files often contain:</p>
<ul>
<li><p>headers and footers with different formats</p>
</li>
<li><p>multiple delimiter conventions within a single stream</p>
</li>
<li><p>varying field counts by record type</p>
</li>
<li><p>embedded structured blocks inside free-form text</p>
</li>
<li><p>multi-line records such as logs or stack traces</p>
</li>
</ul>
<p>Specialized validators and parsers typically assume consistency. When that assumption fails, they abort. AWK does not impose such constraints. It operates on patterns, not schemas, allowing validation logic to adapt dynamically to what the data actually contains.</p>
<p>This does not mean that AWK magically “fixes” broken data. It means that AWK can observe, classify, and assert structural properties before downstream tools are engaged. It can count anomalies, flag record classes, detect shifts in layout, and isolate segments that would cause rigid parsers to fail.</p>
<p>This textual, pattern-first perspective is precisely what is required at the earliest stage of validation.</p>
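<p>For example, a one-line profile of a stream's structural shape (delimiter and sample data are illustrative):</p>

```bash
# Observe and classify before parsing: histogram of field counts per record.
# (for-in traversal order is unspecified, hence the trailing sort)
printf 'a,b,c\nd,e\nf,g,h\ni\n' |
awk -F',' '{ shape[NF]++ }
END { for (n in shape) print n, "fields:", shape[n], "record(s)" }' | sort -n
```

A downstream parser expecting a fixed column count can be pointed directly at the anomalous record classes this reveals.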
<h2 id="heading-comparison-with-specialized-streaming-tools">Comparison with Specialized Streaming Tools</h2>
<p>A number of contemporary command-line tools address aspects of streaming data inspection and manipulation. Utilities such as Miller, csvkit, xsv, qsv, xan, and related programs are widely used for high-performance processing of delimited data, particularly CSV. They excel when input conforms to a recognizable tabular structure and when field boundaries, quoting rules, and record layouts are already well defined.</p>
<p>These tools are optimized for structured streams: they provide fast parsing, expressive transformations, and strong guarantees once basic format assumptions are met. In that role, they are highly effective components of modern data pipelines.</p>
<p>Their limitation, in the context of early validation, is not capability but scope. They presuppose that structural coherence already exists. When confronted with inconsistent field counts, shifting delimiters, malformed records, or mixed-format sections, they typically fail early or require pre-cleaned input.</p>
<p>AWK occupies a different position. It does not assume a stable schema or even a stable record shape. By operating on text and patterns rather than fixed structures, it can observe, classify, and assert properties of a stream before any format-specific interpretation is imposed. This makes it suitable for the earliest stage of validation, where the primary question is not how to transform the data, but whether the data can be safely interpreted at all.</p>
<h2 id="heading-the-data-assertion-pattern">The Data Assertion Pattern</h2>
<p>Robust pipelines treat validation as a gate, not a side effect. A practical way to implement this is through data assertions: small, focused validation programs that return explicit success or failure signals.</p>
<p>These assertions execute immediately after ingestion and before any expensive processing begins. If a structural invariant is violated—unexpected field counts, malformed records, encoding issues—the pipeline fails fast, with diagnostics that point directly to the source of the problem.</p>
<p>AWK is well-suited to this pattern. Its exit codes integrate naturally with shell pipelines and workflow orchestrators. Its diagnostics can include precise line numbers, pattern matches, and anomaly counts. And its simplicity reduces the operational risk of the validation layer itself.</p>
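<p>A minimal assertion of this kind might look as follows (the file name and the invariant being checked are illustrative):</p>

```bash
# Demo input with one malformed record.
printf 'id,amount\n1,10\n2,twenty\n' > orders.csv

# Data assertion: every data row must carry a numeric amount in field 2.
# The exit status gates the downstream stage via the shell's && operator.
awk -F',' 'NR > 1 && $2 !~ /^[0-9]+(\.[0-9]+)?$/ {
    printf "line %d: non-numeric amount: %s\n", NR, $2
    bad = 1
} END { exit bad }' orders.csv && echo "proceeding to transform" || echo "pipeline halted"
```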
<h2 id="heading-unix-composition-and-streaming-architecture">Unix Composition and Streaming Architecture</h2>
<p>AWK operates within the Unix toolchain and composes naturally with core utilities such as grep, sed, sort, uniq, and cut through pipes and standard streams. At the same time, it is fundamentally different from these tools. AWK is a complete programming language in its own right—Turing complete, stateful, and capable of expressing non-trivial control flow, aggregation, and structural analysis logic.</p>
<p>This dual nature is central to its role in data pipelines. AWK participates in Unix composition like a classic streaming filter, yet it can encapsulate validation logic that would otherwise require custom programs or heavier runtimes. Pattern matching, conditional execution, state carried across records, and multi-line context can all be handled within a single streaming pass, without abandoning the simplicity of standard input and output.</p>
<p>Composition remains a strength rather than a constraint. AWK can act as a thin structural probe between other tools, or as a self-contained validation stage that replaces entire chains of simpler utilities. In both cases, execution remains streaming, memory usage remains bounded, and behavior remains transparent and inspectable.</p>
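<p>A sketch of the second case: a single streaming pass standing in for a chain of simpler utilities (the sample log lines are illustrative):</p>

```bash
# One streaming AWK pass replacing a grep | cut | sort | uniq -c chain:
# count occurrences of the second field on ERROR lines only.
printf 'ERROR disk\nINFO ok\nERROR net\nERROR disk\n' |
awk '/^ERROR/ { n[$2]++ } END { for (k in n) print n[k], k }' | sort -rn
```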
<p>AWK does not compete with modern data tools. It complements them by ensuring that the assumptions they rely on are actually satisfied.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Modern data pipelines increasingly recognize validation as essential, yet often conflate semantic correctness with structural integrity. In practice, structure must be established before meaning can be enforced. Treating malformed or heterogeneous input as if it were already schema-ready remains a common source of avoidable failure.</p>
<p>AWK occupies a precise and early position inside <strong>Phase 2</strong> of the pipeline: structural, pre-schema validation. Its streaming execution model, minimal assumptions, constant memory usage, and universal availability make it well suited to this role. These properties are not historical artifacts but practical advantages when dealing with untrusted data at scale.</p>
<p>AWK’s continued relevance is not a matter of nostalgia, but of architectural fit. It does not replace modern data tools, nor does it compete with them. Instead, it operates where many pipelines remain weakest—at the point where data is first examined, before interpretation begins.</p>
<p>Further articles will explore concrete applications of AWK in modern data workflows and validation scenarios. These discussions continue at AwkLab, where AWK’s role in contemporary data engineering is examined in depth.</p>
]]></content:encoded></item><item><title><![CDATA[AWK Syntax Essentials]]></title><description><![CDATA[Syntax is based on The AWK Programming Language, 2nd Edition by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger.
Pattern
A pattern determines when an action is executed. When a pattern matches an input line, its associated action is execut...]]></description><link>https://awklab.com/awk-syntax-essentials</link><guid isPermaLink="true">https://awklab.com/awk-syntax-essentials</guid><category><![CDATA[awk]]></category><category><![CDATA[Scripting]]></category><category><![CDATA[cli]]></category><category><![CDATA[Linux]]></category><category><![CDATA[unix]]></category><category><![CDATA[programming languages]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Sat, 31 Jan 2026 13:58:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769867577548/617b23b1-2b4e-4cc4-94c6-cbdc337e5624.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Syntax is based on <a target="_blank" href="https://awk.dev/"><em>The AWK Programming Language</em>, 2nd Edition</a> by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger.</p>
<h2 id="heading-pattern"><strong>Pattern</strong></h2>
<p>A pattern determines when an action runs: whenever a pattern matches an input line, its associated action is executed.</p>
<p>If no action is specified, the default is <code>print $0</code>.</p>
<p>Syntax: <code>pattern { action }</code></p>
<p><strong>Examples</strong></p>
<ol>
<li><code>BEGIN</code> — runs once before any input is read</li>
</ol>
<pre><code class="lang-bash">BEGIN { FS=<span class="hljs-string">":"</span> }
</code></pre>
<ol start="2">
<li><code>END</code> — runs once after all input is processed</li>
</ol>
<pre><code class="lang-bash">END { <span class="hljs-built_in">print</span> <span class="hljs-string">"total:"</span>, total }
</code></pre>
<ol start="3">
<li><p>Expression — executes on every input line where the condition is true</p>
<p> Skip the header line</p>
</li>
</ol>
<pre><code class="lang-bash">NR &gt; 1 { <span class="hljs-built_in">print</span> <span class="hljs-variable">$0</span> }
</code></pre>
<ol start="4">
<li>Regex — executes on every line matching the pattern</li>
</ol>
<pre><code class="lang-bash">/error/ { <span class="hljs-built_in">print</span> <span class="hljs-variable">$0</span> }
</code></pre>
<ol start="5">
<li>Range — matches all lines from <code>pattern1</code> through <code>pattern2</code>, inclusive</li>
</ol>
<pre><code class="lang-bash">/start/,/end/ { <span class="hljs-built_in">print</span> <span class="hljs-variable">$0</span> }
</code></pre>
<h2 id="heading-conditionals"><strong>Conditionals</strong></h2>
<p>AWK supports standard conditionals for branching logic.</p>
<p><strong>Examples</strong></p>
<ol>
<li><p><code>if</code></p>
<p> Skip empty lines</p>
</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">if</span> (NF &gt; 0) <span class="hljs-built_in">print</span> <span class="hljs-variable">$0</span>
</code></pre>
<ol start="2">
<li><code>if-else</code></li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">if</span> (<span class="hljs-variable">$1</span> &gt; 0) <span class="hljs-built_in">print</span> <span class="hljs-string">"positive"</span>; <span class="hljs-keyword">else</span> <span class="hljs-built_in">print</span> <span class="hljs-string">"negative"</span>
</code></pre>
<ol start="3">
<li><code>if-else if</code></li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">if</span> (<span class="hljs-variable">$1</span> &gt; 0) <span class="hljs-built_in">print</span> <span class="hljs-string">"positive"</span>
<span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (<span class="hljs-variable">$1</span> &lt; 0) <span class="hljs-built_in">print</span> <span class="hljs-string">"negative"</span>
<span class="hljs-keyword">else</span> <span class="hljs-built_in">print</span> <span class="hljs-string">"zero"</span>
</code></pre>
<h2 id="heading-ternary-expression">Ternary Expression</h2>
<p>The <strong>ternary operator</strong> (<code>?:</code>) is a compact, C-style alternative to the <code>if-else</code> statement. By using this operator, a concise one-line <strong>ternary expression</strong> can be constructed, which, unlike a statement, returns a value that can be used directly within calculations or commands.</p>
<p><strong>Syntax</strong>: <em>condition</em> <code>?</code> <em>value_if_true</em> <code>:</code> <em>value_if_false</em></p>
<p><strong>Examples</strong></p>
<ol>
<li>Absolute Value</li>
</ol>
<p>AWK lacks a built-in <code>abs()</code> function. The ternary operator expresses it concisely:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$1</span> = (<span class="hljs-variable">$1</span> &lt; 0) ? -<span class="hljs-variable">$1</span> : <span class="hljs-variable">$1</span>
</code></pre>
<ol start="2">
<li>Truthiness &amp; Success Labels</li>
</ol>
<p>AWK treats <code>0</code> and <code>""</code> as false, and everything else as true. Use this to label status codes:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">print</span> <span class="hljs-variable">$1</span>, (<span class="hljs-variable">$1</span> ? <span class="hljs-string">"SUCCESS"</span> : <span class="hljs-string">"FAILURE"</span>)
</code></pre>
<p>or</p>
<pre><code class="lang-bash"><span class="hljs-built_in">printf</span>(<span class="hljs-string">"%s\t%s\n"</span>, <span class="hljs-variable">$1</span>, (<span class="hljs-variable">$1</span> ? <span class="hljs-string">"SUCCESS"</span> : <span class="hljs-string">"FAILURE"</span>))
</code></pre>
<h2 id="heading-associative-arrays"><strong>Associative Arrays</strong></h2>
<p>AWK arrays are associative — keys can be strings or numbers, making them ideal for counting, grouping, and lookups without any pre-declaration.</p>
<p>Syntax: <code>array[key] = value</code></p>
<p><strong>Examples</strong></p>
<ol>
<li>Populate an array with lines</li>
</ol>
<pre><code class="lang-bash">x[NR] = <span class="hljs-variable">$0</span>
</code></pre>
<ol start="2">
<li>Deduplicate lines</li>
</ol>
<pre><code class="lang-bash">!x[<span class="hljs-variable">$0</span>]++
</code></pre>
<ol start="3">
<li>Populate a 2D matrix with all fields</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">for</span> (i=1; i&lt;=NF; i++) x[NR, i] = <span class="hljs-variable">$i</span>
</code></pre>
<p>AWK simulates 2D arrays by concatenating keys with a built-in separator (<code>SUBSEP</code>), so <code>x[row, col]</code> is stored internally as <code>x[row SUBSEP col]</code>.</p>
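<p>The indices can be recovered from the combined key by splitting on <code>SUBSEP</code>; the trailing <code>sort</code> only compensates for the unspecified traversal order of <code>for (k in x)</code>:</p>

```bash
# Populate a 2D matrix, then read it back: split() on SUBSEP recovers the
# row and column indices from each combined key.
printf 'a b\nc d\n' |
awk '{ for (i = 1; i <= NF; i++) x[NR, i] = $i }
END { for (k in x) { split(k, idx, SUBSEP); print idx[1], idx[2], x[k] } }' | sort
```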
<h2 id="heading-loops"><strong>Loops</strong></h2>
<p>AWK supports standard C-style loops as well as a dedicated form for traversing associative arrays.</p>
<p><strong>Examples</strong></p>
<ol>
<li><code>for</code> loop</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">for</span> (i=1; i&lt;=NF; i++) <span class="hljs-built_in">print</span> <span class="hljs-variable">$i</span>
</code></pre>
<ol start="2">
<li><code>while</code> loop</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">while</span> (i &lt;= NF) { <span class="hljs-built_in">print</span> <span class="hljs-variable">$i</span>; i++ }
</code></pre>
<ol start="3">
<li><code>do-while</code> loop</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">do</span> { <span class="hljs-built_in">print</span> <span class="hljs-variable">$i</span>; i++ } <span class="hljs-keyword">while</span> (i &lt;= NF)
</code></pre>
<ol start="4">
<li>Iterate over an associative array</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">for</span> (k <span class="hljs-keyword">in</span> array) <span class="hljs-built_in">print</span> k, array[k]
</code></pre>
<h2 id="heading-pipes"><strong>Pipes</strong></h2>
<p>AWK can send output to, or receive input from, external shell commands using pipes, enabling seamless integration with standard Unix tools.</p>
<p>Syntax: <code>command | getline [var]</code> / <code>print | "command"</code></p>
<p><strong>Examples</strong></p>
<ol>
<li>Send output to a shell command</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-built_in">print</span> <span class="hljs-variable">$1</span> | <span class="hljs-string">"sort"</span>
</code></pre>
<ol start="2">
<li>Read a shell command's output into a variable</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-string">"date"</span> | getline today
</code></pre>
<ol start="3">
<li>Pipe output through a chain of commands</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-built_in">print</span> <span class="hljs-variable">$1</span> | <span class="hljs-string">"sort | uniq -c"</span>
</code></pre>
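<p>Each distinct command string designates its own stream, and <code>close()</code> terminates that command and flushes its output; this is necessary before reusing the same command string or reading its results:</p>

```bash
# close() terminates the piped command, so its output is flushed before the
# program continues; here "done" is printed only after sort has finished.
printf '3\n1\n2\n' |
awk '{ print $1 | "sort" } END { close("sort"); print "done" }'
```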
<h2 id="heading-comparison-operators">Comparison Operators</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Operator</td><td>Meaning</td></tr>
</thead>
<tbody>
<tr>
<td><code>&lt;</code></td><td>less than</td></tr>
<tr>
<td><code>&lt;=</code></td><td>less than or equal to</td></tr>
<tr>
<td><code>==</code></td><td>equal to</td></tr>
<tr>
<td><code>!=</code></td><td>not equal to</td></tr>
<tr>
<td><code>&gt;=</code></td><td>greater than or equal to</td></tr>
<tr>
<td><code>&gt;</code></td><td>greater than</td></tr>
<tr>
<td><code>~</code></td><td>matched by</td></tr>
<tr>
<td><code>!~</code></td><td>not matched by</td></tr>
</tbody>
</table>
</div><h2 id="heading-logical-operators">Logical Operators</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Operator</td><td>Meaning</td></tr>
</thead>
<tbody>
<tr>
<td><code>&amp;&amp;</code></td><td>AND</td></tr>
<tr>
<td><code>||</code></td><td>OR</td></tr>
<tr>
<td><code>!</code></td><td>NOT</td></tr>
</tbody>
</table>
</div><p><strong>Syntax</strong>: <em>condition1 operator condition2</em></p>
<h2 id="heading-built-in-variables">Built-in variables</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Variable</td><td>Description</td><td>Default</td></tr>
</thead>
<tbody>
<tr>
<td><code>ARGC</code></td><td>Number of command-line arguments, including command name</td><td>-</td></tr>
<tr>
<td><code>ARGV</code></td><td>Array of command-line arguments, numbered 0..ARGC-1</td><td>-</td></tr>
<tr>
<td><code>CONVFMT</code></td><td>Conversion format for numbers</td><td><code>"%.6g"</code></td></tr>
<tr>
<td><code>ENVIRON</code></td><td>Array of shell environment variables</td><td>-</td></tr>
<tr>
<td><code>FILENAME</code></td><td>Name of current input file</td><td>-</td></tr>
<tr>
<td><code>FNR</code></td><td>Record number in current file</td><td>-</td></tr>
<tr>
<td><code>FS</code></td><td>Input field separator</td><td><code>" "</code></td></tr>
<tr>
<td><code>NF</code></td><td>Number of fields in current record</td><td>-</td></tr>
<tr>
<td><code>NR</code></td><td>Number of records read so far</td><td>-</td></tr>
<tr>
<td><code>OFMT</code></td><td>Output format for numbers</td><td><code>"%.6g"</code></td></tr>
<tr>
<td><code>OFS</code></td><td>Output field separator for print</td><td><code>" "</code></td></tr>
<tr>
<td><code>ORS</code></td><td>Output record separator for print</td><td><code>"\n"</code></td></tr>
<tr>
<td><code>RLENGTH</code></td><td>Length of string matched by match function</td><td>-</td></tr>
<tr>
<td><code>RS</code></td><td>Input record separator</td><td><code>"\n"</code></td></tr>
<tr>
<td><code>RSTART</code></td><td>Start of string matched by match function</td><td>-</td></tr>
<tr>
<td><code>SUBSEP</code></td><td>Subscript separator</td><td><code>"\034"</code></td></tr>
</tbody>
</table>
</div><h2 id="heading-built-in-arithmetic-functions">Built-in Arithmetic Functions</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Function</td><td>Value Returned</td></tr>
</thead>
<tbody>
<tr>
<td><code>atan2(y, x)</code></td><td>arctangent of y/x in the range −π to π</td></tr>
<tr>
<td><code>cos(x)</code></td><td>cosine of x, with x in radians</td></tr>
<tr>
<td><code>exp(x)</code></td><td>exponential function of x, e^x</td></tr>
<tr>
<td><code>int(x)</code></td><td>integer part of x; truncated towards 0</td></tr>
<tr>
<td><code>log(x)</code></td><td>natural (base e) logarithm of x</td></tr>
<tr>
<td><code>rand()</code></td><td>random number r, where 0 ≤ r &lt; 1</td></tr>
<tr>
<td><code>sin(x)</code></td><td>sine of x, with x in radians</td></tr>
<tr>
<td><code>sqrt(x)</code></td><td>square root of x</td></tr>
<tr>
<td><code>srand(x)</code></td><td>x is new seed for rand(); use time of day if x is omitted; return previous seed</td></tr>
</tbody>
</table>
</div><h2 id="heading-built-in-string-functions">Built-in String Functions</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Function</td><td>Description</td></tr>
</thead>
<tbody>
<tr>
<td><code>gsub(r,s)</code></td><td>substitute s for r globally in $0, return number of substitutions made</td></tr>
<tr>
<td><code>gsub(r,s,t)</code></td><td>substitute s for r globally in string t, return number of substitutions made</td></tr>
<tr>
<td><code>index(s,t)</code></td><td>return first position of string t in s, or 0 if t is not present</td></tr>
<tr>
<td><code>length(s)</code></td><td>return number of Unicode characters in s; return number of elements if s is an array</td></tr>
<tr>
<td><code>match(s,r)</code></td><td>test whether s contains a substring matched by r; return index or 0; sets RSTART and RLENGTH</td></tr>
<tr>
<td><code>split(s,a)</code></td><td>split s into array a on FS or as CSV if --csv is set, return number of elements in a</td></tr>
<tr>
<td><code>split(s,a,fs)</code></td><td>split s into array a on field separator fs, return number of elements in a</td></tr>
<tr>
<td><code>sprintf(fmt,expr-list)</code></td><td>return expr-list formatted according to format string fmt</td></tr>
<tr>
<td><code>sub(r,s)</code></td><td>substitute s for the leftmost longest substring of $0 matched by r; return number of substitutions made</td></tr>
<tr>
<td><code>sub(r,s,t)</code></td><td>substitute s for the leftmost longest substring of t matched by r; return number of substitutions made</td></tr>
<tr>
<td><code>substr(s,p)</code></td><td>return suffix of s starting at position p</td></tr>
<tr>
<td><code>substr(s,p,n)</code></td><td>return substring of s of length at most n starting at position p</td></tr>
<tr>
<td><code>tolower(s)</code></td><td>return s with upper case ASCII letters mapped to lower case</td></tr>
<tr>
<td><code>toupper(s)</code></td><td>return s with lower case ASCII letters mapped to upper case</td></tr>
</tbody>
</table>
</div><h2 id="heading-expression-operators">Expression Operators</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Operation</td><td>Operators</td><td>Example</td><td>Meaning of Example</td></tr>
</thead>
<tbody>
<tr>
<td>assignment</td><td><code>= += -= *= /= %= ^=</code></td><td><code>x *= 2</code></td><td><code>x = x * 2</code></td></tr>
<tr>
<td>conditional</td><td><code>?:</code></td><td><code>x ? y : z</code></td><td>if <code>x</code> is true then <code>y</code> else <code>z</code></td></tr>
<tr>
<td>logical OR</td><td><code>||</code></td><td><code>x || y</code></td><td>1 if <code>x</code> or <code>y</code> is true, 0 otherwise</td></tr>
<tr>
<td>logical AND</td><td><code>&amp;&amp;</code></td><td><code>x &amp;&amp; y</code></td><td>1 if <code>x</code> and <code>y</code> are true, 0 otherwise</td></tr>
<tr>
<td>array membership</td><td><code>in</code></td><td><code>i in a</code></td><td>1 if <code>a[i]</code> exists, 0 otherwise</td></tr>
<tr>
<td>matching</td><td><code>~ !~</code></td><td><code>$1 ~ /x/</code></td><td>1 if the first field contains an <code>x</code>, 0 otherwise</td></tr>
<tr>
<td>relational</td><td><code>&lt; &lt;= == != &gt;= &gt;</code></td><td><code>x == y</code></td><td>1 if <code>x</code> is equal to <code>y</code>, 0 otherwise</td></tr>
<tr>
<td>concatenation</td><td>(none)</td><td><code>"a" "bc"</code></td><td><code>"abc"</code>; there is no explicit concatenation operator</td></tr>
<tr>
<td>add, subtract</td><td><code>+ -</code></td><td><code>x + y</code></td><td>sum of <code>x</code> and <code>y</code></td></tr>
<tr>
<td>multiply, divide, mod</td><td><code>* / %</code></td><td><code>x % y</code></td><td>remainder of <code>x</code> divided by <code>y</code></td></tr>
<tr>
<td>unary plus and minus</td><td><code>+ -</code></td><td><code>-x</code></td><td>negated value of <code>x</code></td></tr>
<tr>
<td>logical NOT</td><td><code>!</code></td><td><code>!$1</code></td><td>1 if <code>$1</code> is zero or null, 0 otherwise</td></tr>
<tr>
<td>exponentiation</td><td><code>^</code></td><td><code>x ^ y</code></td><td><code>x</code> to the power <code>y</code></td></tr>
<tr>
<td>increment, decrement</td><td><code>++ --</code></td><td><code>++x, x++</code></td><td>add 1 to <code>x</code></td></tr>
<tr>
<td>field</td><td><code>$</code></td><td><code>$i + 1</code></td><td>value of <code>i</code>-th field, plus 1</td></tr>
<tr>
<td>grouping</td><td><code>()</code></td><td><code>$(i++)</code></td><td>return <code>i</code>-th field, then increment <code>i</code></td></tr>
</tbody>
</table>
</div><h2 id="heading-printf">printf</h2>
<p><strong>printf format-control characters</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Character</td><td>Print Expression As</td></tr>
</thead>
<tbody>
<tr>
<td><code>c</code></td><td>single UTF-8 character (code point)</td></tr>
<tr>
<td><code>d</code> or <code>i</code></td><td>decimal integer</td></tr>
<tr>
<td><code>e</code> or <code>E</code></td><td>[-]d.dddddde[+-]dd or [-]d.ddddddE[+-]dd</td></tr>
<tr>
<td><code>f</code></td><td>[-]ddd.dddddd</td></tr>
<tr>
<td><code>g</code> or <code>G</code></td><td>e or f conversion, whichever is shorter, with nonsignificant zeros suppressed</td></tr>
<tr>
<td><code>o</code></td><td>unsigned octal number</td></tr>
<tr>
<td><code>u</code></td><td>unsigned integer</td></tr>
<tr>
<td><code>s</code></td><td>string</td></tr>
<tr>
<td><code>x</code> or <code>X</code></td><td>unsigned hexadecimal number</td></tr>
<tr>
<td><code>%</code></td><td>print a %; no argument is consumed</td></tr>
</tbody>
</table>
</div><hr />
]]></content:encoded></item><item><title><![CDATA[Why AWK in 2026?]]></title><description><![CDATA[Because small is beautiful.

Because AWK gives unmatched bang for the buck.

Because AWK is the antidote to AI slop.

Because AWK is always there when you need it.

Because AWK assumes your data fits in a pipe, not a cluster.

Because AWK makes you t...]]></description><link>https://awklab.com/why-awk-in-2026</link><guid isPermaLink="true">https://awklab.com/why-awk-in-2026</guid><category><![CDATA[awk]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Linux]]></category><category><![CDATA[unix]]></category><category><![CDATA[Scripting]]></category><category><![CDATA[programming]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Thu, 29 Jan 2026 08:12:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769868484560/67dc23f4-5018-4e84-8c2c-9f3dbaeda522.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<ul>
<li><p>Because small is beautiful.</p>
</li>
<li><p>Because AWK gives unmatched bang for the buck.</p>
</li>
<li><p>Because AWK is the antidote to AI slop.</p>
</li>
<li><p>Because AWK is always there when you need it.</p>
</li>
<li><p>Because AWK assumes your data fits in a pipe, not a cluster.</p>
</li>
<li><p>Because AWK makes you think, not just prompt.</p>
</li>
<li><p>Because AWK’s syntax stays out of your way.</p>
</li>
<li><p>Because AWK works across formats, not just inside one.</p>
</li>
<li><p>Because AWK is zero-setup — no need to import the world.</p>
</li>
<li><p>Because AWK lets you think in records, not lines.</p>
</li>
<li><p>Because AWK treats text as the universal interface.</p>
</li>
<li><p>Because AWK already <em>is</em> the loop.</p>
</li>
<li><p>Because AWK pre-processes gigabytes of text on a potato.</p>
</li>
<li><p>Because AWK works before schemas exist.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Practical AWK Benchmarking]]></title><description><![CDATA["I think in terms of programming languages you get the most bang for your buck by learning AWK",

said Brian Kernighan, the K in AWK (Lex Fridman podcast #109). AWK was created in Bell Labs in 1977, and its name is derived from the surnames of its au...]]></description><link>https://awklab.com/practical-awk-benchmarking</link><guid isPermaLink="true">https://awklab.com/practical-awk-benchmarking</guid><category><![CDATA[awk]]></category><category><![CDATA[Linux]]></category><category><![CDATA[benchmarking]]></category><category><![CDATA[Scripting]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data processing]]></category><category><![CDATA[unix]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Mon, 26 Jan 2026 22:23:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769614522598/61617995-2073-4abc-a38e-49da9d6c04bf.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>"I think in terms of programming languages you get the most bang for your buck by learning AWK",</p>
</blockquote>
<p>said Brian Kernighan, the K in AWK (<a target="_blank" href="https://youtu.be/O9upVbGSBFo?si=-dY-2fzqteL1aUU-&amp;t=2196">Lex Fridman podcast #109</a>). AWK was created at Bell Labs in 1977, and its name is derived from the surnames of its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is still widely used today; as a core tool, it is available on any Unix or Unix-like system (Linux, the BSDs, macOS, etc.). Its relevance extends to modern data pipelines, where AWK can be applied as an effective, <a target="_blank" href="https://awklab.com/awk-the-zero-setup-pre-processor">schema-agnostic pre-processor</a>.</p>
<h1 id="heading-why-benchmark-awk">Why Benchmark AWK</h1>
<p>AWK is a concise, domain-specific programming language designed for efficient text processing via its pattern–action execution model. Its core strength lies in its implicit record- and field-level iteration: input is processed sequentially, one record at a time, with automatic field decomposition applied to each record. This design removes the need for explicit control flow for input traversal, allowing programs to express <em>what</em> transformation should occur rather than <em>how</em> to iterate. By abstracting record traversal, field iteration, and memory management, AWK allows concise expressions of C-like logic that are executed immediately when a pattern matches, making it exceptionally effective for rapid, ad-hoc data analysis and transformation.</p>
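<p>A minimal illustration of that model (the sample data is illustrative):</p>

```bash
# Pattern-action in miniature: the pattern selects records, the action fires
# once per match; record iteration and field splitting are implicit.
printf 'alice 30\nbob 25\ncarol 41\n' | awk '$2 > 26 { print $1 }'
```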
<p>While the AWK language is a standard, it exists in several distinct implementations, most notably:</p>
<ul>
<li><p><strong>gawk</strong> (GNU Awk): The feature-rich version maintained by Arnold Robbins. Default in Arch Linux, RHEL, Fedora.</p>
</li>
<li><p><strong>mawk</strong> (Mike Brennan’s Awk): A speed-oriented implementation using a bytecode interpreter, currently maintained by Thomas Dickey. Default in Debian and many of its derivatives.</p>
</li>
<li><p><strong>nawk</strong> (The "One True Awk"): The original implementation from the language’s creators, maintained by Brian Kernighan. Default in BSDs and macOS.</p>
</li>
</ul>
<p>In most Linux distributions, the <code>awk</code> command is a symbolic link to a specific implementation. You can verify which variant is being used with: <code>ls -l $(which awk)</code>.</p>
<p>This article benchmarks <strong>gawk</strong>, <strong>mawk</strong>, and <strong>nawk</strong> by evaluating both execution time and memory footprint through a <strong>Pareto Frontier</strong> analysis to determine their true resource efficiency. <strong>The primary catalyst for this comparison is Brian Kernighan’s 2025 update to nawk, which introduced CSV and UTF-8 support.</strong></p>
<h1 id="heading-benchmarking-approach"><strong>Benchmarking Approach</strong></h1>
<p>The benchmarks utilize functional one-liners that perform logical data analysis tasks relevant to the dataset. Rather than relying on synthetic loops or isolated instructions, these benchmarks are designed to reflect idiomatic AWK usage. This approach evaluates engine performance across various internal operations, including:</p>
<ul>
<li><p>Data aggregation: Extensive use of associative arrays.</p>
</li>
<li><p>Control flow: Implementation of conditional logic and loops.</p>
</li>
<li><p>Text processing: Pattern matching and string manipulation through regex and built-in functions.</p>
</li>
<li><p>Arithmetic: Processing numeric fields for financial calculations.</p>
</li>
</ul>
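<p>A sketch of this style of one-liner (not necessarily one of the actual benchmark programs; the inline records stand in for the real dataset):</p>

```bash
# Illustrative workload: sum field 14 (profit) per field 1 (region) with an
# associative array; the intermediate fields are left empty for brevity.
printf 'Europe,,,,,,,,,,,,,100\nAsia,,,,,,,,,,,,,50\nEurope,,,,,,,,,,,,,25\n' |
awk -F',' '{ profit[$1] += $14 }
END { for (r in profit) printf "%s %.0f\n", r, profit[r] }' | sort
```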
<h1 id="heading-methodology"><strong>Methodology</strong></h1>
<p>To evaluate the performance of the three AWK implementations, the benchmarking focused on two critical metrics: runtime and peak memory usage. Resource tracking was performed using <a target="_blank" href="https://github.com/gsauthof/cgmemtime">cgmemtime</a>, an ideal tool for this purpose as it captures peak group memory consumption, including any spawned child processes. The benchmarking process was automated via <a target="_blank" href="https://github.com/awklab/benchgab.awk"><code>benchgab.awk</code></a>, my custom benchmark runner that handles warmups and multiple test runs. The script is built on top of <strong>cgmemtime</strong> and implemented in AWK —a deliberately fitting and self-referential choice for this study.</p>
<p>The workload consisted of a <a target="_blank" href="https://excelbianalytics.com/wp/wp-content/uploads/2017/07/1500000%20Sales%20Records.zip">179 MB CSV dataset</a> containing 1.5 million lines and 14 fields. The chosen dataset ensures that commas appear only as field delimiters, allowing for a comparison across all three engines using the standard <code>-F,</code> flag, as mawk lacks <code>--csv</code> support. The fields are structured as follows:</p>
<ol>
<li>Region</li>
<li>Country</li>
<li>Item Type</li>
<li>Sales Channel</li>
<li>Order Priority</li>
<li>Order Date</li>
<li>Order ID</li>
<li>Ship Date</li>
<li>Units Sold</li>
<li>Unit Price</li>
<li>Unit Cost</li>
<li>Total Revenue</li>
<li>Total Cost</li>
<li>Total Profit</li>
</ol>
<p>Each benchmark sequence included one initial warmup followed by ten recorded runs, with the mean and standard deviation derived from this ten-run sample.</p>
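<p>The aggregation step over those ten recorded runs can be sketched in AWK itself. This is an illustrative reimplementation, not the actual <code>benchgab.awk</code> runner, and it assumes the sample (n-1) standard deviation; the input runtimes are made up:</p>

```shell
# Mean and sample standard deviation (n-1) over one runtime per line.
printf '%s\n' 1.24 1.26 1.22 1.28 |
awk '{ n++; sum += $1; sq += $1 * $1 }
     END { mean = sum / n
           sd = sqrt((sq - n * mean * mean) / (n - 1))
           printf "%.3f ± %.3f\n", mean, sd }'
# prints: 1.250 ± 0.026
```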
<p>The following table provides <strong>a summary of the specific versions</strong> and main characteristics of the three AWK implementations tested:</p>
<pre><code class="lang-plaintext">|  Name  |        Version | Binary Size | Installed Size |  CSV  | UTF-8 | Extensions |
|--------|----------------|-------------|----------------|-------|-------|------------|
|  gawk  |          5.3.2 |    853 kB   |    3.60 MB     |  yes  |  yes  |     yes    |
|  mawk  | 1.3.4 20250131 |    179 kB   |     206 kB     |  no   |  no   |     no     |
|  nawk  |       20251225 |    139 kB   |     145 kB     |  yes  |  yes  |     no     |
</code></pre>
<p>Benchmarks were conducted on an Arch Linux workstation powered by a Ryzen 9 5900X CPU, using the Alacritty terminal within a dwm session.</p>
<h1 id="heading-benchmarks">Benchmarks</h1>
<h2 id="heading-understanding-the-results">Understanding the Results</h2>
<p>Each benchmark includes a result table. The metrics are defined as follows:</p>
<ul>
<li><p><strong>Runtime</strong>: The average execution time [s] followed by the standard deviation (±σ).</p>
</li>
<li><p><strong>Peak Mem</strong>: The average peak group memory [MB] followed by the standard deviation (±σ).</p>
</li>
<li><p><strong>RT</strong>: Normalized average runtime. The execution time relative to the fastest implementation (1.0 is the baseline).</p>
</li>
<li><p><strong>PM</strong>: Normalized average group peak memory. The peak memory relative to the implementation with the lowest memory footprint (1.0 is the baseline).</p>
</li>
</ul>
<h2 id="heading-1-benchmark-duplicate-lines">#1 Benchmark: duplicate lines</h2>
<p><strong>Objective</strong>: Identify and print the total number of duplicate lines within the dataset.</p>
<p><strong>Targeted operations:</strong> Associative arrays.</p>
<pre><code class="lang-plaintext">awk -F, 'x[$0]++ { i++ } END { print i }'
</code></pre>
<p><strong>Output</strong>: 108603</p>
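<p>The one-liner works because <code>x[$0]++</code> evaluates to the key's count <em>before</em> incrementing: it is 0 (false) the first time a line appears and non-zero afterwards, so <code>i</code> is incremented once per repeated occurrence. A toy input (made up) makes this visible:</p>

```shell
# 'a' repeats twice and 'b' once beyond their first occurrences: 3 duplicates.
printf '%s\n' a b a a c b |
awk 'x[$0]++ { i++ } END { print i }'
# prints 3
```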
<pre><code class="lang-plaintext">|  #1  |  Runtime [s]  | Peak Mem [MB] |  RT  |  PM  |
|------|---------------|---------------|------|------|
| gawk | 1.395 ± 0.044 | 551.16 ± 0.47 | 1.12 | 1.98 |
| mawk | 1.241 ± 0.030 | 290.59 ± 0.17 | 1.00 | 1.04 |
| nawk | 1.267 ± 0.007 | 278.90 ± 0.21 | 1.02 | 1.00 |
</code></pre>
<h2 id="heading-2-benchmark-most-units-sold-by-country">#2 Benchmark: most units sold by country</h2>
<p><strong>Objective</strong>: Find the country with the highest total units sold, excluding duplicate entries.</p>
<p><strong>Targeted Operations</strong>: multi-array processing and max-value search</p>
<pre><code class="lang-plaintext">awk -F, 'NR &gt; 1 &amp;&amp; !x[$0]++ { u[$2] += $9 } END { for (i in u) if (u[i] &gt; u_max) { u_max = u[i]; c = i }  print c, u_max }'
</code></pre>
<pre><code class="lang-plaintext">|  #2  |  Runtime [s]  | Peak Mem [MB] |  RT  |  PM  |
|------|---------------|---------------|------|------|
| gawk | 2.717 ± 0.018 | 551.12 ± 0.31 | 1.62 | 1.98 |
| mawk | 1.678 ± 0.037 | 290.71 ± 0.28 | 1.00 | 1.04 |
| nawk | 2.175 ± 0.007 | 278.84 ± 0.22 | 1.30 | 1.00 |
</code></pre>
<h2 id="heading-3-benchmark-highest-profit-margin">#3 Benchmark: highest profit margin</h2>
<p><strong>Objective</strong>: Identify the order ID with the greatest ratio of profit to unit price.</p>
<p><strong>Targeted operations</strong>: Floating-point arithmetic and conditional max-value tracking.</p>
<pre><code class="lang-plaintext">awk -F, 'NR &gt; 1 { pm = ($10 - $11) / $10; if (pm &gt; pm_max) { pm_max = pm; id = $7 }} END { print id }'
</code></pre>
<p><strong>Output</strong>: 667593514</p>
<pre><code class="lang-plaintext">|  #3  |  Runtime [s]  | Peak Mem [MB] |  RT  |  PM  |
|------|---------------|---------------|------|------|
| gawk | 1.783 ± 0.006 |  0.74 ± 0.01  | 3.02 | 2.69 |
| mawk | 0.591 ± 0.005 |  0.62 ± 0.13  | 1.00 | 2.25 |
| nawk | 1.340 ± 0.005 |  0.28 ± 0.08  | 2.27 | 1.00 |
</code></pre>
<h2 id="heading-4-benchmark-count-european-countries">#4 Benchmark: count European countries</h2>
<p><strong>Objective</strong>: Count unique country names within the Europe region using exact <strong>string</strong> matching.</p>
<p><strong>Targeted operations</strong>: Exact string matching and associative array lookups.</p>
<pre><code class="lang-plaintext">awk -F, '$1 == "Europe" { eu[$2]++ } END { for (country in eu) n++; print n }'
</code></pre>
<p><strong>Output</strong>: 48</p>
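<p>The pattern filters rows before the array is touched, so only European rows are stored; the <code>END</code> loop then counts distinct keys, which keeps the one-liner portable across all three engines. A sketch on made-up rows:</p>

```shell
# Two distinct European countries among four made-up rows.
printf '%s\n' Europe,France Europe,Spain Asia,Japan Europe,France |
awk -F, '$1 == "Europe" { eu[$2]++ } END { for (c in eu) n++; print n }'
# prints 2
```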
<pre><code class="lang-plaintext">|  #4  |  Runtime [s]  | Peak Mem [MB] |  RT  |  PM  |
|------|---------------|---------------|------|------|
| gawk | 0.513 ± 0.005 |  0.74 ± 0.00  | 1.46 | 1.89 |
| mawk | 0.351 ± 0.004 |  0.63 ± 0.12  | 1.00 | 1.62 |
| nawk | 1.284 ± 0.006 |  0.39 ± 0.12  | 3.66 | 1.00 |
</code></pre>
<h2 id="heading-5-benchmark-count-european-countries-regex">#5 Benchmark: count European countries (regex)</h2>
<p><strong>Objective</strong>: Count unique country names within the Europe region using <strong>regex</strong> matching.</p>
<p><strong>Targeted operations</strong>: Regex matching and associative array lookups.</p>
<pre><code class="lang-plaintext">awk -F, '$1 ~ /Europe/ { eu[$2]++ } END { for (country in eu) n++; print n }'
</code></pre>
<p><strong>Output</strong>: 48</p>
<pre><code class="lang-plaintext">|  #5  |  Runtime [s]   | Peak Mem [MB] |  RT  |  PM  |
|------|----------------|---------------|------|------|
| gawk | 0.524 ± 0.017  |  0.76 ± 0.14  | 1.49 | 1.42 |
| mawk | 0.351 ± 0.006  |  0.67 ± 0.11  | 1.00 | 1.24 |
| nawk | 1.420 ± 0.007  |  0.54 ± 0.11  | 4.04 | 1.00 |
</code></pre>
<h2 id="heading-6-benchmark-number-of-orders-in-date-range">#6 Benchmark: number of orders in date range</h2>
<p><strong>Objective</strong>: Count the number of orders (excluding duplicates) placed between 3/1/2014 and 3/31/2015.</p>
<p><strong>Targeted operations</strong>: String manipulation functions, relational string comparisons, and associative array deduplication.</p>
<pre><code class="lang-plaintext">awk -F, 'NR &gt; 1 &amp;&amp; !x[$0]++ { split($6, a, "/"); d = sprintf("%d%02d%02d", a[3], a[1], a[2]); if (d &gt;= "20140301" &amp;&amp; d &lt;= "20150331") n++ } END { print n }'
</code></pre>
<p><strong>Output</strong>: 203060</p>
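<p>The date filter works because <code>sprintf</code> rewrites the M/D/YYYY field into a zero-padded YYYYMMDD key, on which lexicographic string order coincides with chronological order. The conversion in isolation, on a made-up date (the benchmark applies it to field <code>$6</code>):</p>

```shell
# M/D/YYYY in, zero-padded YYYYMMDD out.
echo '3/7/2014' |
awk '{ split($0, a, "/"); printf "%d%02d%02d\n", a[3], a[1], a[2] }'
# prints 20140307
```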
<pre><code class="lang-plaintext">|  #6  |  Runtime [s]  | Peak Mem [MB] |  RT  |  PM  |
|------|---------------|---------------|------|------|
| gawk | 4.106 ± 0.043 | 551.14 ± 0.21 | 1.95 | 1.98 |
| mawk | 2.107 ± 0.016 | 290.76 ± 0.27 | 1.00 | 1.04 |
| nawk | 3.434 ± 0.019 | 278.87 ± 0.26 | 1.63 | 1.00 |
</code></pre>
<h1 id="heading-results"><strong>Results</strong></h1>
<h2 id="heading-geometric-mean-and-normalization"><strong>Geometric Mean and Normalization</strong></h2>
<p>To provide a representative comparison across multiple benchmarks, the <a target="_blank" href="https://en.wikipedia.org/wiki/Geometric_mean">Geometric Mean</a> for the normalized RT and PM values was calculated. The geometric mean is the mathematically appropriate choice for averaging ratios or normalized values, as it ensures that relative improvements are weighted consistently across all tests. Unlike the arithmetic mean, this approach prevents outliers in absolute execution time from disproportionately skewing the aggregate performance profile.</p>
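<p>The geometric mean of n values is the n-th root of their product, conveniently computed as the exponential of the mean of logarithms. A minimal sketch with made-up ratios (2 and 8 average to 4 geometrically, versus 5 arithmetically):</p>

```shell
# Geometric mean = exp(mean of ln(x)); the appropriate average for ratios.
printf '%s\n' 2 8 |
awk '{ n++; s += log($1) } END { printf "%.2f\n", exp(s / n) }'
# prints 4.00
```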
<h2 id="heading-evaluation-metrics"><strong>Evaluation Metrics</strong></h2>
<p>To synthesize these normalized results into a single actionable score, I have applied two evaluation metrics:</p>
<ul>
<li><p><strong>Euclidean Distance (d)</strong>: Measures the geometric distance from the "Ideal Point" (1,1). A lower d indicates a more balanced implementation that is close to being the best in both speed and memory simultaneously.</p>
</li>
<li><p><strong>Resource Footprint (F)</strong>: Calculated as RT×PM. This represents the total resource footprint; lower values indicate a more efficient use of system resources to complete the same task.</p>
</li>
</ul>
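<p>Both scores follow directly from the normalized (RT, PM) pair. Recomputing them from gawk's rounded values as a sanity check (small differences from the published figures, e.g. F = 3.53 versus 3.51, come from rounding the inputs first):</p>

```shell
# d = distance from the ideal point (1,1); F = RT * PM.
echo '1.80 1.96' |
awk '{ rt = $1; pm = $2
       d = sqrt((rt - 1)^2 + (pm - 1)^2)
       printf "d = %.2f, F = %.2f\n", d, rt * pm }'
# prints: d = 1.25, F = 3.53
```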
<h2 id="heading-summary-table">Summary Table</h2>
<p>The following table summarizes the overall performance of the three AWK engines based on the geometric mean of all normalized benchmarks:</p>
<pre><code class="lang-plaintext">| Summary |  RT  |  PM  |  d   |  F   |
|---------|------|------|------|------|
| gawk    | 1.80 | 1.96 | 1.25 | 3.51 |
| mawk    | 1.00 | 1.31 | 0.31 | 1.31 |
| nawk    | 2.13 | 1.00 | 1.13 | 2.13 |
</code></pre>
<p><strong>Definitions - RT</strong>: Normalized Runtime; <strong>PM</strong>: Normalized Peak Memory; <strong>d</strong>: Euclidean Distance; <strong>F</strong>: Resource Footprint</p>
<h1 id="heading-discussion"><strong>Discussion</strong></h1>
<p>The benchmarking results across six diverse objectives show a clear and consistent performance profile for each implementation. <strong>mawk</strong> was the fastest in every benchmark, while <strong>nawk</strong> maintained the lowest memory footprint; <strong>gawk</strong>, by contrast, exhibited the highest memory usage throughout. gawk does, however, demonstrate more consistent relative speed than <strong>nawk</strong>: even when finishing second or third, it avoids the sharp slowdowns that <strong>nawk</strong> shows in some tests. While <strong>nawk</strong> is fast at arithmetic and simple field processing, it is significantly slower at regex and string operations and at complex array management.</p>
<p>These individual performance patterns serve as the foundation for my aggregate metrics, where the trade-off between speed and memory is formally quantified.</p>
<p>While the Euclidean distance (d) provides a useful preliminary indication of effectiveness, relying on it alone can be misleading. For instance, the Euclidean Distances for <strong>gawk</strong> (1.25) and <strong>nawk</strong> (1.13) are relatively close, yet their Resource Footprints (F) reveal a significant disparity: gawk consumes nearly 65% more total resources.</p>
<p>This limitation necessitates a more robust analysis via the <a target="_blank" href="https://en.wikipedia.org/wiki/Pareto_front">Pareto frontier</a>.</p>
<p>To visualize the trade-offs, I plotted the normalized values on a 2D coordinate system where the x-axis represents the normalized runtime (RT) and the y-axis represents normalized peak memory (PM). The "Ideal Point" is located at (1,1), representing an implementation that is simultaneously the fastest and the most memory-efficient.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769614439027/e65bad4b-7119-4e31-b493-504065b11e93.png" alt class="image--center mx-auto" /></p>
<p><strong>Graph</strong>: <em>The Pareto Frontier of AWK implementations: Visualizing the optimal equilibrium between execution speed and memory footprint.</em></p>
<p>The Pareto frontier represents the boundary of "non-dominated" solutions—implementations where you cannot improve one metric (like speed) without degrading another (like memory). In this study, <strong>mawk</strong> and <strong>nawk</strong> define the frontier: <strong>mawk</strong> is the choice for raw speed, while <strong>nawk</strong> is the choice for minimal footprint. gawk, however, is positioned away from this boundary; because it is slower than <strong>mawk</strong> and uses more memory than <strong>nawk</strong>, it is considered "dominated" and sub-optimal in terms of raw resource efficiency.</p>
<h1 id="heading-conclusion"><strong>Conclusion</strong></h1>
<p>The data confirms that the "best" AWK implementation is a calculated trade-off between throughput and resource overhead. Within the Unix philosophy of choosing the right tool for the job, each engine serves a distinct operational profile.</p>
<ul>
<li><p><strong>mawk</strong> is the powerhouse for high-volume data. If your primary bottleneck is execution speed, its bytecode engine is unrivaled. It consistently defines the leading edge of the Pareto frontier, delivering the highest performance-to-resource ratio.</p>
</li>
<li><p><strong>nawk</strong> is the go-to for minimalist environments. While it prioritizes simplicity over the heavy lifting of complex regex or string manipulation, its memory footprint is remarkably small and predictable. It is the definitive choice for systems where memory overhead is a strictly limited resource.</p>
</li>
<li><p><strong>gawk</strong> offers a more nuanced value proposition. While it is mathematically dominated by its rivals, that overhead pays for a much broader feature set which can outweigh its increased resource consumption.</p>
</li>
</ul>
<p>Across various workflows—from data science pipelines to system automation—<strong>mawk</strong> provides the highest performance return for most standard tasks. Ultimately, these results show that the choice of engine should be a deliberate decision: use <strong>mawk for speed</strong>, <strong>nawk for a light footprint</strong>, and <strong>gawk when you need its extended toolkit.</strong></p>
]]></content:encoded></item></channel></rss>