<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[AwkLab]]></title><description><![CDATA[Mastering AWK – scripting, data processing, Unix philosophy]]></description><link>https://awklab.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1769778947818/3990eb17-569d-4bfd-979d-724444faff01.png</url><title>AwkLab</title><link>https://awklab.com</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 22:53:42 GMT</lastBuildDate><atom:link href="https://awklab.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Unix Pipes Under Load: Streaming, Barriers, Backpressure, and Bottlenecks ]]></title><description><![CDATA[1. The Unix pipeline
Every Unix user knows the pipe operator. Typing ls | wc -l, and two independent programs exchange data as if they were designed together. That simplicity reflects a deliberate des]]></description><link>https://awklab.com/unix-pipes-under-load</link><guid isPermaLink="true">https://awklab.com/unix-pipes-under-load</guid><category><![CDATA[unix]]></category><category><![CDATA[Pipeline]]></category><category><![CDATA[Linux]]></category><category><![CDATA[awk]]></category><category><![CDATA[performance]]></category><category><![CDATA[shell]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Thu, 09 Apr 2026 17:30:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/f9e4cacc-03fa-4073-ac8b-379ecff1e73c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>1. The Unix pipeline</h2>
<p>Every Unix user knows the pipe operator. Typing <code>ls | wc -l</code>, and two independent programs exchange data as if they were designed together. That simplicity reflects a deliberate design philosophy that Doug McIlroy articulated at Bell Labs in the 1970s: write programs that do one thing well, and let them communicate through a universal interface to combine them into complex workflows.</p>
<p>Essentially, a Unix pipe allows one program to send its output directly into another program’s input without saving a temporary file to disk. It turns two independent tools into a single data pipeline.</p>
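<p>The difference is easy to see with a trivial pair of commands (the temporary file path below is illustrative):</p>
<pre><code class="language-plaintext"># Two ways to count the entries in /usr: the first materializes the
# listing on disk, the second streams it through a kernel pipe buffer.
ls /usr &gt; /tmp/listing.txt
wc -l &lt; /tmp/listing.txt
rm /tmp/listing.txt

ls /usr | wc -l    # same count, no intermediate file
</code></pre>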
<p>What is less obvious is why pipes work so well for large data. A pipeline processing a multi-gigabyte log file uses roughly the same memory as one processing a kilobyte. The data flows through, it does not accumulate. This streaming behavior is the pipe's defining characteristic — and also its most misunderstood one.</p>
<p>This article examines pipes from that angle: not as a convenience feature, but as a memory-efficient streaming primitive. We will look at how they work, where they excel, and where they quietly fail.</p>
<h2>2. The McIlroy story</h2>
<p>In 1986, Donald Knuth was asked to write a program solving a simple problem: read a file, find the most frequently used words, and print the top results. He produced a literate programming masterpiece — a carefully crafted Pascal program, several pages long, with a custom data structure optimized for the task.</p>
<p>Doug McIlroy reviewed it and responded with six lines of shell:</p>
<pre><code class="language-plaintext">tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | head
</code></pre>
<p>The pipeline reads a file, splits it into words, lowercases them, sorts, counts occurrences, sorts by frequency, and prints the top results. It requires no custom data structures, no memory management, no compilation. It also processes files larger than available RAM without modification — each stage handles only what fits in the pipe buffer at any moment.</p>
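<p>Run on a toy input, the whole pipeline is easy to trace (the sample text is illustrative):</p>
<pre><code class="language-plaintext">printf 'the cat and the dog and the bird\n' |
  tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | head -2
# top of the list: "3 the" followed by "2 and"
</code></pre>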
<p>McIlroy's point was not that shell pipelines are always better than carefully written programs. It was that composition of simple tools can match or exceed purpose-built solutions, at a fraction of the complexity. The memory efficiency was not a design goal — it was a consequence of how pipes work.</p>
<h2>3. How pipes work</h2>
<p>When you type <code>ls | wc -l</code>, the shell creates a pipe before either program starts. A pipe is a kernel-managed buffer — on Linux, 64KB by default — with two ends: a write end and a read end. The shell connects <code>ls</code> stdout to the write end, and <code>wc</code> stdin to the read end, using a system call called <code>dup2</code>. Neither program knows or cares about the pipe — <code>ls</code> writes to what it thinks is standard output, <code>wc</code> reads from what it thinks is standard input.</p>
<p>Behind the scenes, the shell uses three system calls to wire this together. First, <code>pipe()</code> creates the buffer and returns two file descriptors — one for reading, one for writing. Then <code>fork()</code> creates two child processes, both inheriting those file descriptors. Finally, <code>dup2()</code> redirects stdout in the first child to the pipe's write end, and stdin in the second child to the pipe's read end. The original pipe file descriptors are then closed, and each child calls <code>exec()</code> to become <code>ls</code> and <code>wc</code> respectively. From that point, the two programs run independently, connected only through the kernel buffer.</p>
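<p>The same rewiring is visible at the shell level. An <code>exec</code>-style redirection in a subshell re-points file descriptor 1 before the program runs, which is what <code>dup2()</code> does when the shell attaches stdout to a pipe's write end (the file path here is illustrative):</p>
<pre><code class="language-plaintext"># Rewire the subshell's stdout to a file before ls runs; ls itself
# neither knows nor cares where fd 1 points.
( exec &gt; /tmp/dup2-demo.txt; ls /usr )
wc -l &lt; /tmp/dup2-demo.txt    # same count as: ls /usr | wc -l
rm /tmp/dup2-demo.txt
</code></pre>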
<p>Both processes start concurrently. When <code>ls</code> fills the 64KB buffer, the kernel blocks it until <code>wc</code> reads some data and makes room. When the buffer is empty and <code>ls</code> hasn't finished writing, <code>wc</code> blocks and waits. This <strong>backpressure mechanism</strong> is why pipelines are memory-efficient: at most 64KB of data exists in the pipe at any moment, regardless of how large the input is. This holds true for <strong>streaming stages</strong> — programs like <code>grep</code>, <code>awk</code>, or <code>sed</code> that process input line by line. <strong>Barrier stages</strong> like <code>sort</code> are a different matter: they must read all input into their own memory before producing any output, making their memory usage proportional to input size regardless of the pipe buffer.</p>
<p>This is also why pipelines are not parallel in any meaningful sense. The processes take turns. A fast producer is throttled by a slow consumer, and a slow producer starves a fast consumer. The 64KB buffer smooths out brief mismatches, but it does not change the fundamental constraint: the slowest stage becomes a <strong>bottleneck</strong> that limits total throughput.</p>
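<p>A related effect is easy to demonstrate. <code>yes</code> can produce output far faster than any consumer, yet the pipeline below finishes instantly and uses negligible memory: the kernel blocks <code>yes</code> whenever the buffer is full, and terminates it with SIGPIPE once <code>head</code> exits and closes the read end.</p>
<pre><code class="language-plaintext">yes | head -n 3    # prints three lines of "y", then both processes exit
</code></pre>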
<h2>4. Streaming vs barrier stages</h2>
<p>The <strong>backpressure mechanism</strong> described in section 3 creates an important distinction between two fundamentally different kinds of pipeline stages. Understanding this distinction is the key to reasoning about pipeline performance and memory usage.</p>
<p>A <strong>streaming stage</strong> processes input incrementally and produces output as it goes. <code>grep</code> reads a line, tests it, emits it or discards it, moves to the next. <code>tr</code>, <code>cut</code>, <code>sed</code> are all streamers. Their memory footprint is constant regardless of input size, and they never block the pipeline longer than it takes to process one record.</p>
<p>A <strong>barrier stage</strong> must consume its entire input before producing any output. <code>sort</code> is the canonical example — you cannot emit the smallest element until you have seen all elements. <code>tac</code> reverses a file, so it must read everything before writing anything. <code>uniq -c</code> itself streams, but it requires sorted input, so in typical use a <code>sort</code> barrier has already been paid by the time it runs. Barrier stages break the streaming contract: memory grows with input size, and everything downstream waits.</p>
<p>Consider the classic word frequency pipeline from section 2: <code>tr</code> streams instantly. The first <code>sort</code> then consumes everything — gigabytes if necessary — before <code>uniq -c</code> sees a single line. <code>uniq -c</code> streams quickly over the sorted output, then the second <code>sort</code> again consumes everything before <code>head</code> gets its ten lines. Two full barriers, three sequential phases. The pipeline is not six concurrent processes — it is three sequential batches connected by two streaming bridges.</p>
<p>You can observe this directly. Run <code>/usr/bin/time -v</code> on the full pipeline versus the first <code>sort</code> alone on the same input:</p>
<pre><code class="language-plaintext">/usr/bin/time -v sh -c 'tr -cs A-Za-z "\n" &lt; example.txt | tr A-Z a-z | sort | uniq -c | sort -rn | head'
/usr/bin/time -v sh -c 'tr -cs A-Za-z "\n" &lt; example.txt | tr A-Z a-z | sort &gt; /dev/null'
</code></pre>
<p>For example, using <a href="https://www.gutenberg.org/ebooks/100">The Complete Works of William Shakespeare</a> as input, I measured 0.32 s vs 0.29 s, with practically the same 12.1 MB peak memory in both cases.</p>
<p>This indicates that the first <code>sort</code> dominates memory consumption for the entire pipeline. Everything before it is essentially free, and everything after operates on a fraction of the original data.</p>
<p>This has a practical implication: <strong>when optimizing a pipeline, identifying and addressing the first barrier is almost always the highest-leverage intervention</strong>. Stages before the barrier run for free in terms of memory. Stages after it operate on reduced data. The barrier itself is where the cost lives.</p>
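<p>A minimal sketch of the principle, on a stand-in log (path and contents illustrative): both orderings produce identical output, but only the first keeps the barrier small.</p>
<pre><code class="language-plaintext">printf 'b ERROR\na INFO\na ERROR\nc INFO\n' &gt; /tmp/demo.log
grep ERROR /tmp/demo.log | sort    # barrier holds 2 lines
sort /tmp/demo.log | grep ERROR    # barrier holds all 4 lines
rm /tmp/demo.log
</code></pre>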
<h2>5. Where pipelines work well</h2>
<p>The most common real-world pipeline is probably log analysis.</p>
<p>a. Finding the most frequent error messages in a log file looks like this:</p>
<pre><code class="language-plaintext">grep "ERROR" app.log | sort | uniq -c | sort -rn
</code></pre>
<p><code>grep</code> streams through the file, emitting only matching lines — constant memory, no barrier. Then <code>sort</code> accumulates everything into memory before emitting a single line. <code>uniq -c</code> streams over the sorted output counting consecutive duplicates. The second <code>sort -rn</code> accumulates again to rank by frequency.</p>
<p>Two barriers in four stages. On a large log file, peak memory is determined entirely by how many ERROR lines exist — not by the file size, but not constant either.</p>
<p>b. Another common pattern is searching for a string across many files:</p>
<pre><code class="language-plaintext">find . -name "*.log" | xargs grep "ERROR"
</code></pre>
<p><code>find</code> streams filenames one by one into <code>xargs</code>, which batches them into <code>grep</code> invocations. No stage accumulates data — memory stays constant regardless of how many files exist or how large they are.</p>
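<p>One caveat: the whitespace-delimited handoff breaks on filenames containing spaces or newlines. GNU and BSD versions of both tools support a null-delimited variant (not strictly portable everywhere):</p>
<pre><code class="language-plaintext"># Null-delimited handoff survives spaces and newlines in filenames.
find . -name "*.log" -print0 | xargs -0 grep "ERROR"
</code></pre>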
<p>This is pipelines doing what they do best: composing simple tools into a workflow that scales to arbitrary input size without modification.</p>
<h2>6. Where pipes fall short</h2>
<p>Pipelines excel at linear transformations of a single data stream — filtering, counting, reformatting. But not every problem fits that shape. Three categories of problems expose their limits clearly.</p>
<p><strong>Complex state</strong>. Pipelines are stateless between stages. Each stage sees only its own input stream, with no knowledge of what other stages have seen or produced. If your processing requires correlating events across records, tracking sequences, or maintaining context that spans multiple passes over the data, you are working against the model. At that point you are no longer composing a pipeline — you are writing a program, and a proper scripting language or tool will serve you better than forcing the logic into shell.</p>
<p><strong>Heavy parallelism</strong>. As established in section 3, pipeline stages take turns rather than run truly concurrently. If your workload is CPU-bound and the data can be partitioned, a pipeline will leave cores idle. This is a consequence of the sequential streaming model. <strong>The pipe was never designed for parallel computation</strong>.</p>
<p><strong>Joins</strong>. Combining two datasets by a common key — the most basic operation in data processing — has no clean pipeline expression. You can approximate it with <code>sort</code> and <code>join</code>, but this requires both inputs to be sorted first, adding two barriers before the actual work begins. For anything beyond trivial cases, a pipeline join is awkward, brittle, and slow compared to a proper tool.</p>
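<p>The shape of the sort-and-join approximation, on illustrative data, shows where the cost lands: two full barriers run before <code>join</code> emits its first line.</p>
<pre><code class="language-plaintext">printf '2 bob\n1 alice\n'  &gt; /tmp/users.txt
printf '1 book\n2 pen\n'   &gt; /tmp/orders.txt
sort /tmp/users.txt  &gt; /tmp/users.sorted     # barrier 1
sort /tmp/orders.txt &gt; /tmp/orders.sorted    # barrier 2
join /tmp/users.sorted /tmp/orders.sorted    # streams once inputs are sorted
rm /tmp/users.txt /tmp/orders.txt /tmp/users.sorted /tmp/orders.sorted
</code></pre>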
<h2>7. Beyond pipelines</h2>
<p>When a pipeline reaches its limits, two directions are worth considering depending on the problem. If the bottleneck is a barrier stage, the pipeline itself can often be restructured. If the bottleneck is parallelism, the pipeline needs external orchestration.</p>
<h3>7.1. AWK</h3>
<p><strong>AWK</strong> can replace entire pipelines in a single pass. It operates in two modes depending on the problem: purely streaming, processing each record and emitting output immediately, or stateful, accumulating data across records using associative arrays and producing output at the end. Most pipeline tools are one or the other — <code>grep</code> and <code>sed</code> stream, <code>sort</code> accumulates. AWK can do both, which makes it effective at eliminating the sort-based barriers that dominate most text processing pipelines.</p>
<p>The most common barrier in a pipeline is <code>sort</code> used purely to prepare input for <code>uniq</code> or <code>uniq -c</code>. AWK's associative arrays make both redundant.</p>
<p>Deduplication without sort:</p>
<pre><code class="language-plaintext">sort file | uniq
</code></pre>
<p>Using <strong>AWK</strong>:</p>
<pre><code class="language-plaintext">awk '!x[$0]++' file
</code></pre>
<p>A single streaming pass. Each line is checked against an associative array — seen lines are discarded, unseen lines pass through. The tradeoff is explicit: output order is insertion order, not sorted order. When sorted output is not required, the barrier is gone entirely.</p>
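<p>Both behaviors — the deduplication order and the analogous replacement for <code>sort | uniq -c</code> — are visible on a toy input:</p>
<pre><code class="language-plaintext">printf 'b\na\nb\nc\na\n' | awk '!x[$0]++'
# first-seen order: b, a, c

# Counting without a sort barrier; for-in traversal order is
# implementation-defined, so the output order here is unspecified.
printf 'b\na\nb\nc\na\n' | awk '{ c[$0]++ } END { for (k in c) print c[k], k }'
</code></pre>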
<p>As efficient as it looks, the resource footprint is a different question. Running a quick benchmark using the same benchmark runner and CSV file as in my <a href="https://awklab.com/practical-awk-benchmarking">Practical AWK Benchmarking</a> article, the results are as follows:</p>
<pre><code class="language-plaintext">--- Statistical Summary ---
cmd   Runtime [s]                                   Peak Memory [MB]
      mean ± sdev       min    median  max    Jtr%  mean ± sdev       min    median  max    Jtr%
gawk  2.5438 ± 0.0251   2.5167 2.5488  2.5661 0.2   594.21 ± 0.25    593.96 594.20  594.46 0.0
mawk  2.5876 ± 0.0035   2.5842 2.5877  2.5911 0.0   291.24 ± 0.57    290.89 290.93  291.89 0.1
nawk  2.5799 ± 0.0124   2.5726 2.5730  2.5942 0.3   322.17 ± 0.17    322.03 322.12  322.35 0.0
s|u   2.4596 ± 0.0048   2.4556 2.4585  2.4649 0.0   368.64 ± 0.07    368.57 368.62  368.71 0.0
</code></pre>
<pre><code class="language-plaintext">--- Normalized Benchmarks ---
cmd   RT    PM    d     F
gawk  1.04  2.04  1.04  2.12
mawk  1.05  1.00  0.05  1.05
nawk  1.05  1.11  0.12  1.16
s|u   1.00  1.27  0.27  1.27
</code></pre>
<p><strong>Definitions</strong> - RT: Normalized Runtime; PM: Normalized Peak Memory; d: Euclidean Distance; F: Resource Footprint. 
(Calculations are based on 1 warmup and 3 test runs.)</p>
<p>Runtime is nearly identical across all variants — <strong>AWK</strong> offers no speed advantage over <code>sort | uniq</code> here. The real difference is memory. <strong>mawk</strong> uses 21% less memory than <code>sort | uniq</code> and has the best overall resource footprint (F=1.05), making it the clear winner. <strong>nawk</strong> is a reasonable middle ground at F=1.16. <strong>gawk</strong>, despite being the most widely used <strong>AWK</strong> variant, performs worst — consuming double the memory of <strong>mawk</strong> and 61% more than <code>sort | uniq</code>, reflected in a footprint of F=2.12. The choice of <strong>AWK</strong> implementation matters as much as the algorithmic change — the wrong one erases any benefit.</p>
<h3>7.2 Real parallelism with xargs and GNU parallel</h3>
<p>When the bottleneck is not a barrier but throughput — processing many independent inputs simultaneously — pipelines need external orchestration. <code>xargs -P</code> and <strong>GNU parallel</strong> both achieve this by partitioning work across multiple processes.</p>
<p>The <code>find | xargs grep</code> example from section 5 becomes parallel with one addition:</p>
<pre><code class="language-plaintext">find . -name "*.log" | xargs -P $(nproc) grep "ERROR"
</code></pre>
<p><code>-P $(nproc)</code> runs one <code>grep</code> process per available core. The pipeline structure remains intact — <code>find</code> still streams filenames, <code>xargs</code> still batches them — but the processing stage now uses all cores.</p>
<p><strong>GNU parallel</strong> offers finer control:</p>
<pre><code class="language-plaintext">find . -name "*.log" | parallel grep "ERROR"
</code></pre>
<p>By default it spawns one job per core. With <code>--keep-order</code> it also preserves output order to match the input, and it handles errors per job and supports more complex partitioning strategies than <code>xargs</code>.</p>
<p>The key distinction from pipeline concurrency is explicit: you are not getting parallelism from the pipe itself. You are orchestrating multiple independent processes from outside the pipeline. The pipe remains a sequential channel — what changes is how many workers consume from it simultaneously.</p>
<h2>8. Conclusion</h2>
<p>Pipes are not primarily a performance tool. They are a composition tool — a way to connect programs that were never designed to work together, avoiding intermediate files and keeping memory usage flat regardless of input size. That property is powerful, and it is why pipelines designed in the 1970s can still process gigabytes of data efficiently today.</p>
<p>The limits are just as real. A pipeline is a single stream, moving in one direction, through stages that take turns. When your problem needs multiple streams, parallel execution, or global state, the model stops helping and starts getting in the way. Knowing where that boundary lies — and reaching for the right tool when you cross it — is what makes pipelines useful rather than limiting.</p>
]]></content:encoded></item><item><title><![CDATA[When RAM Matters: Memory Efficiency of AWK Variants]]></title><description><![CDATA[The AWK scripting language emerged from Bell Labs in 1977, named for its creators Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is still widely used today, as a core tool it is available on a]]></description><link>https://awklab.com/memory-efficiency-awk-mawk-gawk</link><guid isPermaLink="true">https://awklab.com/memory-efficiency-awk-mawk-gawk</guid><category><![CDATA[awk]]></category><category><![CDATA[unix]]></category><category><![CDATA[Linux]]></category><category><![CDATA[performance]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Scripting]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Sat, 14 Mar 2026 17:24:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/0292b8b5-bbc6-48ae-8a76-d18d686a198c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The AWK scripting language emerged from Bell Labs in 1977, named for its creators Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is still widely used today; as a core tool, it is available on any Unix or Unix-like system (Linux, BSDs, macOS, etc.). It operates as a compact, domain-specific language for text processing. AWK reads input line by line, splits each line into fields, and executes code when patterns match. No explicit loops are needed for reading data; the program focuses on what to do with each record, not how to traverse the file. This makes it exceptionally effective for rapid ad-hoc data analysis and transformation, as well as for filtering and more complex operations within pipelines. AWK is Turing-complete and can handle logic beyond simple pattern matching.</p>
<p>While the AWK language is POSIX standard, it exists in several distinct implementations, most notably:</p>
<ul>
<li><p><strong>gawk</strong> (GNU Awk): The feature-rich version with extensions beyond POSIX, maintained by Arnold Robbins. Default in Arch Linux, RHEL, Fedora.</p>
</li>
<li><p><strong>mawk</strong> (Mike Brennan’s Awk): An efficiency-oriented implementation using a bytecode interpreter, currently maintained by Thomas Dickey. Default in Debian and many of its derivatives.</p>
</li>
<li><p><strong>nawk</strong> (The "One True Awk"): The original implementation from the language’s creators, maintained by Brian Kernighan. Default in BSDs and macOS.</p>
</li>
</ul>
<p>In most Linux distributions, the <code>awk</code> command is a symbolic link to a specific implementation. You can verify which variant is being used with:</p>
<p><code>ls -l $(which awk)</code></p>
<p><strong>AWK has a place in modern data pipelines as an effective Phase 2 pre-filter</strong>: it is schema-agnostic, low footprint, zero-setup, and readily available (see <a href="https://awklab.com/awk-the-zero-setup-pre-processor">article</a>). It is suitable for the earliest stage of validation, as a first-pass filter, before any format-specific interpretation.</p>
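<p>A sketch of such a first-pass filter, assuming simple CSV without quoted commas (the file name is illustrative): pass through rows whose field count matches the header's, and report the rest.</p>
<pre><code class="language-plaintext"># Structural validation before any format-specific parsing.
awk -F, 'NR == 1 { n = NF }
         NF != n { print "row " NR ": expected " n " fields, got " NF &gt; "/dev/stderr"; next }
         { print }' data.csv
</code></pre>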
<p><strong>AWK can operate in two fundamentally different modes with respect to memory usage:</strong></p>
<ol>
<li><p><strong>Streaming operations</strong> maintain constant memory usage regardless of file size. A multi-hundred-gigabyte file can be inspected using the same resources as a kilobyte sample. This makes AWK effective for null rate checks, schema validation, and range or boundary verification on datasets that exceed available memory.</p>
</li>
<li><p><strong>Stateful operations</strong>, however, require accumulating data in memory. This can take several forms: populating associative arrays for deduplication (<code>!x[$0]++</code>) or field distribution analysis (<code>x[NF]++</code>), loading records into indexed arrays for multi-pass processing, or concatenating strings to build aggregate outputs. For these operations, memory efficiency matters, and implementation differences between AWK variants become significant.</p>
</li>
</ol>
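<p>The two modes side by side, on an illustrative CSV (the column choice and file name are arbitrary):</p>
<pre><code class="language-plaintext"># 1. Streaming: null-rate check on field 3 -- a single counter,
#    constant memory at any file size.
awk -F, '$3 == "" { n++ } END { print n+0, "empty values in field 3" }' data.csv

# 2. Stateful: field-count distribution -- x[NF]++ grows with the
#    number of distinct field counts seen.
awk -F, '{ x[NF]++ } END { for (k in x) print k " fields: " x[k] " rows" }' data.csv
</code></pre>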
<p><strong>This article evaluates the memory efficiency of gawk, mawk, and nawk in stateful operations, as a function of input file size.</strong></p>
<h2>Benchmarking Approach</h2>
<p>The benchmarking evaluates memory consumption patterns for four different stateful operation scenarios in AWK when processing CSV data. The focus is on memory usage comparison during data population, with no additional processing or operations performed. This allows for direct measurement of how different data storage strategies impact memory footprint. In addition to memory usage, execution time was also measured. Resource tracking was performed using <a href="https://github.com/gsauthof/cgmemtime">cgmemtime</a>, an ideal tool for this purpose as it captures peak memory consumption for the whole process group. The benchmarking process was automated via my custom runner, which handles warmups and multiple test runs and calculates statistical metrics as well as normalized parameters for comparative analysis. For details, see my <a href="https://awklab.com/behilos-benchmark">BEHILOS Benchmark article</a>.</p>
<h3>Test Dataset</h3>
<p>The benchmarking uses CSV files with a consistent structure of 14 fields per row. To observe memory scaling behavior, 7 different file sizes were tested, ranging from 1,000 rows to 10 million rows (120 KB to 1.2 GB). The CSV test files are available <a href="https://excelbianalytics.com/downloads-18-sample-csv-files-data-sets-for-testing-sales/">here</a>. The 10M-row file was generated by concatenating two copies of the 5M file.</p>
<table>
<thead>
<tr>
<th>File name</th>
<th>Rows</th>
<th>File size [MB]</th>
</tr>
</thead>
<tbody><tr>
<td>sales1K.csv</td>
<td>1K</td>
<td>0.12</td>
</tr>
<tr>
<td>sales10K.csv</td>
<td>10K</td>
<td>1.2</td>
</tr>
<tr>
<td>sales100K.csv</td>
<td>100K</td>
<td>12</td>
</tr>
<tr>
<td>sales500K.csv</td>
<td>500K</td>
<td>60</td>
</tr>
<tr>
<td>sales1.5M.csv</td>
<td>1.5M</td>
<td>178</td>
</tr>
<tr>
<td>sales5M.csv</td>
<td>5M</td>
<td>595</td>
</tr>
<tr>
<td>sales10M.csv</td>
<td>10M</td>
<td>1190</td>
</tr>
</tbody></table>
<h3>Test Environment</h3>
<p>Tests were conducted on an Arch Linux workstation powered by a Ryzen 5900x CPU with 64GB of RAM, using the Alacritty terminal within a dwm session.</p>
<p>The following table provides a summary of the specific versions and main characteristics of the three AWK implementations tested:</p>
<table>
<thead>
<tr>
<th>Name</th>
<th>Version</th>
<th>Binary Size</th>
<th>Installed Size</th>
<th>--csv</th>
<th>UTF-8</th>
<th>Extensions</th>
</tr>
</thead>
<tbody><tr>
<td>gawk</td>
<td>5.3.2</td>
<td>853 kB</td>
<td>3.60 MB</td>
<td>yes</td>
<td>yes</td>
<td>yes</td>
</tr>
<tr>
<td>mawk</td>
<td>1.3.4 20260129</td>
<td>179 kB</td>
<td>206 kB</td>
<td>no</td>
<td>no</td>
<td>no</td>
</tr>
<tr>
<td>nawk</td>
<td>20251225</td>
<td>139 kB</td>
<td>145 kB</td>
<td>yes</td>
<td>yes</td>
<td>no</td>
</tr>
</tbody></table>
<h2>The Benchmarks</h2>
<p>Four benchmarks were applied. They represent common patterns for storing CSV data in AWK, each with different memory characteristics and use cases.</p>
<p>Each benchmark sequence included one initial warmup run followed by three recorded runs. Normalized parameters are based on the median, with 1.0 being the baseline (e.g. the lowest peak memory or runtime).</p>
<h3>Benchmark #1: Store entire lines in array</h3>
<pre><code class="language-plaintext">x[NR]=$0
</code></pre>
<p>This is the simplest storage method and keeps the original line intact without parsing individual fields. The memory footprint includes the full text of each line including all field separators. This method is commonly used when you need to preserve the exact input for later processing or output, or when you need random access to complete lines.</p>
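<p>A single run of this pattern can be reproduced without the full harness; GNU time's <code>-v</code> output includes peak resident memory (cgmemtime, used for the published numbers, measures the whole process group instead). The file name is illustrative, and <code>/usr/bin/time</code> is assumed to be GNU time:</p>
<pre><code class="language-plaintext">/usr/bin/time -v awk '{ x[NR] = $0 }' sales1K.csv 2&gt;&amp;1 |
  grep -i 'maximum resident'
</code></pre>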
<p>Results of Benchmark #1</p>
<pre><code class="language-plaintext">Benchmark #1 Result Table

File / variant                 Runtime [s]                                                        Peak Memory [MB]                                             
                               mean ± sdev            min      median    max      Jtr%             mean ± sdev            min      median    max      Jtr%        
sales1K.csv                                                                                                                                                      
  gawk                         0.0016 ± 0.0001        0.0016   0.0016    0.0017   1.1              0.75 ± 0.02            0.73     0.74      0.77     1.4        
  mawk                         0.0010 ± 0.0001        0.0009   0.0010    0.0011   1.3              0.82 ± 0.15            0.73     0.74      1.00     11.6       
  nawk                         0.0015 ± 0.0001        0.0015   0.0015    0.0016   2.3              0.57 ± 0.15            0.48     0.49      0.74     16.8       

sales10K.csv                                                                                                                                                     
  gawk                         0.0048 ± 0.0001        0.0047   0.0049    0.0049   0.9              3.07 ± 0.15            2.98     2.99      3.25     2.8        
  mawk                         0.0027 ± 0.0002        0.0026   0.0028    0.0028   1.2              2.74 ± 0.15            2.73     2.74      2.74     0.0        
  nawk                         0.0089 ± 0.0003        0.0087   0.0087    0.0091   1.5              2.85 ± 0.19            2.73     2.83      2.98     0.6        

sales100K.csv                                                                                                                                                    
  gawk                         0.0330 ± 0.0008        0.0324   0.0328    0.0339   0.7              25.74 ± 0.15           25.73    25.74     25.74    0.0        
  mawk                         0.0174 ± 0.0014        0.0163   0.0169    0.0189   3.0              21.49 ± 0.15           21.48    21.49     21.49    0.0        
  nawk                         0.0766 ± 0.0020        0.0750   0.0760    0.0789   0.9              23.57 ± 0.35           23.24    23.74     23.74    0.7        

sales500K.csv                                                                                                                                                    
  gawk                         0.1488 ± 0.0022        0.1473   0.1480    0.1511   0.5              125.49 ± 0.29          125.23   125.48    125.74   0.0        
  mawk                         0.1025 ± 0.0048        0.0987   0.1012    0.1076   1.3              105.24 ± 0.29          104.99   105.25    105.48   0.0        
  nawk                         0.3974 ± 0.0063        0.3918   0.3966    0.4038   0.2              120.19 ± 0.38          120.02   120.27    120.27   0.1        

sales1.5M.csv                                                                                                                                                    
  gawk                         0.4399 ± 0.0036        0.4367   0.4406    0.4422   0.2              374.99 ± 0.58          374.49   374.98    375.49   0.0        
  mawk                         0.3360 ± 0.0159        0.3192   0.3403    0.3486   1.3              313.24 ± 0.52          312.98   313.00    313.74   0.1        
  nawk                         1.1826 ± 0.0132        1.1753   1.1766    1.1959   0.5              346.26 ± 0.45          346.02   346.26    346.51   0.0        

sales5M.csv                                                                                                                                                      
  gawk                         1.4856 ± 0.0037        1.4848   1.4853    1.4868   0.0              1252.07 ± 0.65         1251.73  1252.23   1252.23  0.0        
  mawk                         1.1790 ± 0.0185        1.1696   1.1788    1.1885   0.0              1042.48 ± 0.58         1042.23  1042.48   1042.73  0.0        
  nawk                         3.9644 ± 0.0155        3.9555   3.9660    3.9717   0.0              1156.11 ± 0.47         1156.02  1156.03   1156.27  0.0        

sales10M.csv                                                                                                                                                     
  gawk                         2.9759 ± 0.0070        2.9691   2.9783    2.9802   0.1              2507.74 ± 0.69         2507.49  2507.74   2507.99  0.0        
  mawk                         2.4046 ± 0.0220        2.3926   2.4044    2.4166   0.0              2083.90 ± 0.60         2083.73  2083.98   2083.98  0.0        
  nawk                         8.0351 ± 0.0158        8.0334   8.0337    8.0383   0.0              2361.94 ± 0.70         2361.53  2361.78   2362.52  0.0                 
</code></pre>
<p>Summary Table</p>
<pre><code class="language-plaintext">Benchmark #1 Summary Table

File size        rt [s]                  pm [MB]        
    [MB]  gawk    mawk    nawk        gawk    mawk    nawk
----------------------------------------------------------
    0.12  0.0016  0.0010  0.0015      0.74    0.74    0.49
     1.2  0.0049  0.0028  0.0087      2.99    2.74    2.83
      12  0.0328  0.0169  0.0760     25.74   21.49   23.74
      60  0.1480  0.1012  0.3966    125.48  105.25  120.27
     178  0.4406  0.3403  1.1766    374.98  313.00  346.26
     595  1.4853  1.1788  3.9660   1252.23 1042.48 1156.03
    1190  2.9783  2.4044  8.0337   2507.74 2083.98 2361.78     
</code></pre>
<p>Normalized results: RT (normalized runtime) and MO (memory overhead)</p>
<pre><code class="language-plaintext">Benchmark #1 Normalized Results

File size   RT                    MO          
    [MB]    gawk   mawk   nawk    gawk   mawk   nawk
----------------------------------------------------
    0.12    1.6    1.0    1.5     6.2    6.2    4.1
     1.2    1.8    1.0    3.1     2.5    2.3    2.4
      12    1.9    1.0    4.5     2.1    1.8    2.0
      60    1.5    1.0    3.9     2.1    1.8    2.0
     178    1.3    1.0    3.5     2.1    1.8    1.9
     595    1.3    1.0    3.4     2.1    1.8    1.9
    1190    1.2    1.0    3.3     2.1    1.8    2.0
</code></pre>
<h3>Benchmark #2: Populate 2D matrix</h3>
<pre><code class="language-plaintext">for (i=1; i&lt;=NF; i++) x[NR,i] = $i
</code></pre>
<p>AWK simulates 2D arrays by concatenating keys with a built-in separator (SUBSEP), so <code>x[row, col]</code> is stored internally as <code>x[row SUBSEP col]</code>. This approach provides indexed access to individual fields and is useful when you need to perform operations on specific columns across all rows. The memory overhead includes both the field data and the composite key structures.</p>
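<p>A quick way to see this equivalence in action (a hypothetical demo, not part of the benchmark harness) is to populate a tiny matrix and read it back through both key forms:</p>
<pre><code class="language-plaintext">echo 'a b c' | awk '{
    for (i = 1; i &lt;= NF; i++) x[NR, i] = $i   # same keys as x[NR SUBSEP i]
    if ((1, 2) in x)           # parenthesized test avoids creating the key
        print x[1 SUBSEP 2]    # prints "b": the keys are identical
}'
</code></pre>
<p>Note that the membership test must use the parenthesized form <code>(row, col) in x</code>; merely referencing <code>x[row, col]</code> would silently create the element.</p>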
<p>Results of Benchmark #2</p>
<pre><code class="language-plaintext">Benchmark #2 Result Table

File / variant                 Runtime [s]                                                        Peak Memory [MB]                                             
                               mean ± sdev            min      median    max      Jtr%             mean ± sdev            min      median    max      Jtr%        
sales1K.csv                                                                                                                                                      
  gawk                         0.0087 ± 0.0003        0.0084   0.0086    0.0090   0.7              5.24 ± 0.00            5.24     5.24      5.24     0.0        
  mawk                         0.0066 ± 0.0011        0.0058   0.0061    0.0078   7.8              2.15 ± 0.29            1.98     1.98      2.49     8.5        
  nawk                         0.0090 ± 0.0007        0.0081   0.0094    0.0094   4.5              2.33 ± 0.13            2.23     2.28      2.48     2.2        

sales10K.csv                                                                                                                                                     
  gawk                         0.0662 ± 0.0006        0.0657   0.0661    0.0668   0.2              46.07 ± 0.14           45.99    45.99     46.24    0.2        
  mawk                         0.0511 ± 0.0027        0.0491   0.0503    0.0539   1.6              16.32 ± 0.32           16.24    16.24     16.48    0.5        
  nawk                         0.0820 ± 0.0010        0.0814   0.0821    0.0827   0.0              19.67 ± 0.20           19.59    19.59     19.84    0.4        

sales100K.csv                                                                                                                                                    
  gawk                         0.7788 ± 0.0262        0.7500   0.7852    0.8011   0.8              452.65 ± 0.20          452.48   452.73    452.74   0.0        
  mawk                         0.9017 ± 0.0144        0.8931   0.8941    0.9180   0.9              156.74 ± 0.32          156.73   156.73    156.74   0.0        
  nawk                         0.7998 ± 0.0117        0.7911   0.7953    0.8131   0.6              178.69 ± 0.25          178.52   178.77    178.78   0.0        

sales500K.csv                                                                                                                                                    
  gawk                         4.5791 ± 0.0277        4.5695   4.5800    4.5878   0.0              2261.98 ± 0.48         2261.72  2261.73   2262.48  0.0        
  mawk                         6.3488 ± 0.0417        6.3037   6.3686    6.3742   0.3              785.24 ± 0.32          785.23   785.23    785.24   0.0        
  nawk                         4.3709 ± 0.0170        4.3583   4.3714    4.3830   0.0              959.94 ± 0.46          959.52   960.04    960.26   0.0        

sales1.5M.csv                                                                                                                                                    
  gawk                         14.8738 ± 0.1380       14.7941  14.7974   15.0298  0.5              6775.31 ± 0.70         6774.75  6775.47   6775.72  0.0        
  mawk                         19.7716 ± 0.0491       19.7417  19.7862   19.7870  0.1              2356.07 ± 0.50         2355.73  2355.98   2356.48  0.0        
  nawk                         12.1783 ± 0.0873       12.1189  12.1395   12.2765  0.3              2677.02 ± 0.52         2676.77  2677.02   2677.27  0.0        

sales5M.csv                                                                                                                                                      
  gawk                         50.6685 ± 0.1408       50.6431  50.6636   50.6989  0.0              22592.04 ± 0.71        22591.95 22591.96  22592.20 0.0        
  mawk                         72.4570 ± 0.1123       72.3429  72.4932   72.5348  0.0              7963.89 ± 0.52         7963.73  7963.96   7963.99  0.0        
  nawk                         40.7116 ± 0.2032       40.5819  40.6313   40.9215  0.2              8988.45 ± 0.65         8988.01  8988.57   8988.76  0.0        

sales10M.csv                                                                                                                                                     
  gawk                         101.8983 ± 0.2029      101.7380 101.9330  102.0240 0.0              45182.70 ± 0.71        45182.70 45182.71  45182.71 0.0        
  mawk                         150.8563 ± 0.2839      150.6080 150.8330  151.1280 0.0              15965.48 ± 0.84        15964.98 15965.23  15966.23 0.0        
  nawk                         84.8894 ± 0.4522       84.5106  84.8431   85.3145  0.1              18777.15 ± 0.68        18776.93 18777.25  18777.26 0.0        
</code></pre>
<p>Summary Table</p>
<pre><code class="language-plaintext">Benchmark #2 Summary Table

File size          rt [s]                      pm [MB]                
    [MB]   gawk     mawk     nawk         gawk     mawk     nawk
----------------------------------------------------------------
    0.12   0.0086   0.0061   0.0094       5.24     1.98     2.28
     1.2   0.0661   0.0503   0.0821      45.99    16.24    19.59
      12   0.7852   0.8941   0.7953     452.73   156.73   178.77
      60   4.5800   6.3686   4.3714    2261.73   785.23   960.04
     178  14.7974  19.7862  12.1395    6775.47  2355.98  2677.02
     595  50.6636  72.4932  40.6313   22591.96  7963.96  8988.57
    1190 101.9330 150.8330  84.8431   45182.71 15965.23 18777.25
</code></pre>
<p>Normalized results: RT (normalized runtime) and MO (memory overhead)</p>
<pre><code class="language-plaintext">Benchmark #2 Normalized Results

File size   RT                   MO          
    [MB]    gawk   mawk   nawk   gawk   mawk   nawk
---------------------------------------------------
    0.12    1.4    1.0    1.5    43.7   16.5   19.0
     1.2    1.3    1.0    1.6    38.3   13.5   16.3
      12    1.0    1.1    1.0    37.7   13.1   14.9
      60    1.0    1.5    1.0    37.7   13.1   16.0
     178    1.2    1.6    1.0    38.1   13.2   15.0
     595    1.2    1.8    1.0    38.0   13.4   15.1
    1190    1.2    1.8    1.0    38.0   13.4   15.8
</code></pre>
<h3>Benchmark #3: Populate 1D array for each field</h3>
<pre><code class="language-plaintext">x1[NR]=$1; x2[NR]=$2; x3[NR]=$3; ... x14[NR]=$14
</code></pre>
<p>This creates 14 independent hash table structures in memory, avoiding the composite key overhead of the 2D approach. This method is efficient when you frequently access all values of a particular field, as each field's data is stored contiguously in its own array structure. The tradeoff is managing multiple array variables instead of a single unified structure.</p>
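<p>As a sketch of why this layout is convenient (illustrative data only, not from the benchmark), summing one column touches a single flat array:</p>
<pre><code class="language-plaintext">printf '10,north\n20,south\n30,east\n' | awk -F',' '
    { x1[NR] = $1; x2[NR] = $2 }          # one flat array per field
    END {
        for (r = 1; r &lt;= NR; r++) sum += x1[r]
        print "total:", sum               # 10+20+30 = 60
    }'
</code></pre>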
<p>In Benchmark #3 <strong>gawk</strong>'s native array of arrays feature was also tested:</p>
<pre><code class="language-plaintext">for (i=1; i&lt;=NF; i++) x[NR][i]=$i
</code></pre>
<p>This creates a true nested structure where each row is a parent array containing 14 child elements.</p>
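<p>A minimal sketch of iterating such a nested structure (requires <strong>gawk</strong>; the data is illustrative):</p>
<pre><code class="language-plaintext">printf '1 2\n3 4\n' | gawk '
    { for (i = 1; i &lt;= NF; i++) x[NR][i] = $i }
    END {
        for (r in x)                      # each x[r] is itself an array
            for (c in x[r]) total += x[r][c]
        print total                       # 1+2+3+4 = 10
    }'
</code></pre>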
<p>Results</p>
<pre><code class="language-plaintext">Benchmark #3 Result Table

File / variant                  Runtime [s]                                                        Peak Memory [MB]                                             
                               mean ± sdev            min      median    max      Jtr%             mean ± sdev            min      median    max      Jtr%        
sales1K.csv                                                                                                                                                      
  gawk                         0.0054 ± 0.0004        0.0051   0.0052    0.0059   4.0              2.83 ± 0.14            2.74     2.77      2.99     2.2        
  mawk                         0.0027 ± 0.0000        0.0026   0.0027    0.0027   0.1              1.82 ± 0.14            1.74     1.74      1.99     4.8        
  nawk                         0.0053 ± 0.0001        0.0052   0.0053    0.0053   0.8              2.23 ± 0.00            2.23     2.23      2.24     0.0        
  gawk*                        0.0066 ± 0.0004        0.0062   0.0068    0.0070   1.9              3.76 ± 0.07            3.70     3.74      3.84     0.6        

sales10K.csv                                                                                                                                                     
  gawk                         0.0362 ± 0.0008        0.0355   0.0362    0.0369   0.0              21.40 ± 0.20           21.23    21.49     21.49    0.4        
  mawk                         0.0176 ± 0.0011        0.0164   0.0179    0.0184   1.8              13.07 ± 0.20           12.98    12.99     13.23    0.6        
  nawk                         0.0413 ± 0.0006        0.0408   0.0413    0.0419   0.0              19.07 ± 0.15           18.98    18.99     19.24    0.4        
  gawk*                        0.0489 ± 0.0016        0.0471   0.0497    0.0498   1.7              31.49 ± 0.45           31.24    31.24     32.00    0.8        

sales100K.csv                                                                                                                                                    
  gawk                         0.3390 ± 0.0047        0.3337   0.3406    0.3426   0.5              206.16 ± 0.43          205.74   206.25    206.48   0.0        
  mawk                         0.2401 ± 0.0100        0.2288   0.2444    0.2472   1.7              125.74 ± 0.32          125.49   125.73    125.99   0.0        
  nawk                         0.4159 ± 0.0036        0.4119   0.4177    0.4183   0.4              178.06 ± 0.40          177.73   177.96    178.47   0.1        
  gawk*                        0.4581 ± 0.0026        0.4563   0.4578    0.4602   0.1              308.23 ± 0.51          307.98   308.24    308.48   0.0        

sales500K.csv                                                                                                                                                    
  gawk                         1.6668 ± 0.0108        1.6607   1.6618    1.6780   0.3              1023.65 ± 0.57         1023.24  1023.74   1023.98  0.0        
  mawk                         1.2514 ± 0.0122        1.2445   1.2510    1.2586   0.0              621.15 ± 0.35          620.99   621.23    621.24   0.0        
  nawk                         2.4466 ± 0.0109        2.4348   2.4523    2.4528   0.2              946.89 ± 0.49          946.56   947.04    947.07   0.0        
  gawk*                        2.2579 ± 0.0027        2.2572   2.2582    2.2583   0.0              1537.32 ± 0.53         1537.24  1537.24   1537.49  0.0        

sales1.5M.csv                                                                                                                                                    
  gawk                         5.0319 ± 0.0295        5.0004   5.0445    5.0509   0.2              3064.90 ± 0.69         3064.48  3064.98   3065.24  0.0        
  mawk                         3.5495 ± 0.0221        3.5303   3.5515    3.5669   0.1              1847.24 ± 0.56         1846.73  1847.48   1847.49  0.0        
  nawk                         6.3204 ± 0.0325        6.2918   6.3166    6.3527   0.1              2664.61 ± 0.53         2664.46  2664.56   2664.82  0.0        
  gawk*                        6.7958 ± 0.0173        6.7846   6.7873    6.8155   0.1              4609.57 ± 0.93         4608.73  4609.73   4610.24  0.0        

sales5M.csv                                                                                                                                                      
  gawk                         17.2646 ± 0.0952       17.1816  17.2512   17.3611  0.1              10289.49 ± 0.85        10288.99 10289.49  10289.98 0.0        
  mawk                         12.3195 ± 0.0415       12.2834  12.3215   12.3537  0.0              6174.32 ± 0.76         6173.74  6174.48   6174.74  0.0        
  nawk                         22.0794 ± 0.1297       21.9774  22.0413   22.2196  0.2              8938.40 ± 0.55         8938.31  8938.32   8938.57  0.0        
  gawk*                        22.7869 ± 0.1159       22.6546  22.8511   22.8549  0.3              15367.34 ± 0.94        15367.23 15367.27  15367.50 0.0        

sales10M.csv                                                                                                                                                     
  gawk                         34.8630 ± 0.2366       34.6662  34.8277   35.0951  0.1              20633.40 ± 0.90        20633.23 20633.24  20633.74 0.0        
  mawk                         24.6822 ± 0.0494       24.6660  24.6677   24.7130  0.1              12346.65 ± 0.82        12346.48 12346.49  12346.98 0.0        
  nawk                         48.6894 ± 0.1791       48.5470  48.7520   48.7691  0.1              18576.88 ± 0.57        18576.79 18576.80  18577.05 0.0        
  gawk*                        45.8100 ± 0.1638       45.7327  45.7543   45.9431  0.1              30737.79 ± 1.19        30736.99 30737.99  30738.39 0.0        
</code></pre>
<p>Summary Table</p>
<pre><code class="language-plaintext">Benchmark #3 Summary Table

File size            rt [s]                            pm [MB]                      
    [MB]   gawk     mawk     nawk     gawk*        gawk     mawk     nawk    gawk*
----------------------------------------------------------------------------------
    0.12   0.0052   0.0027   0.0053   0.0068       2.77     1.74     2.23     3.74
     1.2   0.0362   0.0179   0.0413   0.0497      21.49    12.99    18.99    31.24
      12   0.3406   0.2444   0.4177   0.4578     206.25   125.73   177.96   308.24
      60   1.6618   1.2510   2.4523   2.2582    1023.74   621.23   947.04  1537.24
     178   5.0445   3.5515   6.3166   6.7873    3064.98  1847.48  2664.56  4609.73
     595  17.2512  12.3215  22.0413  22.8511   10289.49  6174.48  8938.32 15367.27
    1190  34.8277  24.6677  48.7520  45.7543   20633.24 12346.49 18576.80 30737.99
</code></pre>
<p>Normalized results: RT (normalized runtime) and MO (memory overhead)</p>
<pre><code class="language-plaintext">Benchmark #3 Normalized Results

File size   RT                          MO                  
    [MB]    gawk   mawk   nawk   gawk*  gawk   mawk   nawk   gawk*
-----------------------------------------------------------------
    0.12    1.9    1.0    2.5    2.0    23.1   14.5   18.6   31.2
     1.2    2.0    1.0    2.8    2.3    17.9   10.8   15.8   26.0
      12    1.4    1.0    1.9    1.7    17.2   10.5   14.8   25.7
      60    1.3    1.0    1.8    2.0    17.1   10.4   15.8   25.6
     178    1.4    1.0    1.9    1.8    17.2   10.4   15.0   25.9
     595    1.4    1.0    1.9    1.8    17.3   10.4   15.0   25.8
    1190    1.4    1.0    1.9    2.0    17.3   10.4   15.6   25.8
</code></pre>
<h3>Benchmark #4: Concatenate the entire dataset into one string</h3>
<pre><code class="language-plaintext">x = x $0
</code></pre>
<p>Each line is appended to an ever-growing string, so the accumulated value grows with every record. This pattern can be useful for building complete records for batch output, for log aggregation, or for producing hash/checksum input.</p>
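<p>The pattern itself is a one-liner; the catch is that each append may copy the entire accumulated string, so total work can grow quadratically with input size in engines without an append optimization. A toy run (illustrative data):</p>
<pre><code class="language-plaintext"># Two 8-character lines concatenated with no separator: length 16
printf 'north,10\nsouth,20\n' | awk '{ x = x $0 } END { print length(x) }'
</code></pre>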
<p>Results</p>
<pre><code class="language-plaintext">Benchmark #4 Result Table

File / variant                 Runtime [s]                                                        Peak Memory [MB]                                             
                               mean ± sdev            min      median    max      Jtr%             mean ± sdev            min      median    max      Jtr%        
sales1K.csv 0.12                                                                                                                                                 
  gawk                         0.0016 ± 0.0001        0.0016   0.0017    0.0017   1.8              0.75 ± 0.02            0.74     0.74      0.78     1.8        
  mawk                         0.0058 ± 0.0003        0.0054   0.0059    0.0061   0.9              0.98 ± 0.07            0.94     0.95      1.07     3.7        
  nawk                         0.0084 ± 0.0003        0.0081   0.0084    0.0086   0.1              1.06 ± 0.11            0.94     1.08      1.15     2.1        

sales10K.csv 1.2                                                                                                                                                 
  gawk                         0.0041 ± 0.0001        0.0040   0.0040    0.0042   0.8              1.98 ± 0.02            1.98     1.98      1.98     0.1        
  mawk                         1.2520 ± 0.0030        1.2485   1.2533    1.2541   0.1              5.18 ± 0.16            5.02     5.25      5.29     1.2        
  nawk                         1.4993 ± 0.0047        1.4954   1.4979    1.5046   0.1              6.30 ± 0.27            6.03     6.36      6.51     0.9        

sales100K.csv 12                                                                                                                                                 
  gawk                         0.0239 ± 0.0002        0.0238   0.0238    0.0241   0.3              12.73 ± 0.02           12.72    12.73     12.73    0.0        
  mawk                         57.3232 ± 0.4294       56.8661  57.3852   57.7182  0.1              37.42 ± 0.95           36.46    37.46     38.33    0.1        
  nawk                         92.1290 ± 0.9580       91.0598  92.4178   92.9094  0.3              49.94 ± 1.15           49.00    49.64     51.17    0.6        

sales500K.csv 60                                                                                                                                                 
  gawk                         0.1075 ± 0.0015        0.1059   0.1080    0.1087   0.4              60.05 ± 0.13           59.97    59.98     60.21    0.1        
  mawk                         1479.21                                                             152.68 
  nawk                         3854.76                                                             180.05 

sales1.5M.csv 178                                                                                                                                                
  gawk                         0.3163 ± 0.0018        0.3155   0.3160    0.3174   0.1              178.63 ± 0.32          178.46   178.47    178.97   0.1        

sales5M.csv 595                                                                                                                                                  
  gawk                         1.0447 ± 0.0041        1.0408   1.0454    1.0480   0.1              592.62 ± 0.43          592.45   592.45    592.96   0.0        

sales10M.csv 1190                                                                                                                                                
  gawk                         2.0706 ± 0.0041        2.0704   2.0706    2.0709   0.0              1184.01 ± 0.46         1183.92  1183.93   1184.19  0.0        
</code></pre>
<p>Summary Table</p>
<pre><code class="language-plaintext">Benchmark #4 Summary Table

File size        rt [s]                          pm [MB]                 
    [MB]     gawk       mawk       nawk       gawk       mawk       nawk
------------------------------------------------------------------------
    0.12     0.0017     0.0062     0.0081      0.7        1.1        1.0
     1.2     0.0040     1.2365     1.4840      2.0        5.1        6.4
      12     0.0238    54.5716    88.8361     12.7       36.9       49.4
      60     0.1080  1479.2100  3854.7600     60.0      152.7      180.1
     178     0.3160 ---------- ----------    178.5 ---------- ----------
     595     1.0454 ---------- ----------    592.5 ---------- ----------
    1190     2.0706 ---------- ----------   1183.9 ---------- ----------
</code></pre>
<p>Normalized results: RT (normalized runtime) and MO (memory overhead)</p>
<pre><code class="language-plaintext">Benchmark #4 Normalized Results

File size   RT                       MO          
    [MB]    gawk   mawk     nawk     gawk   mawk   nawk
-------------------------------------------------------
    0.12    1.0     3.6      4.8     6.2    9.6    7.9
     1.2    1.0   309.1    371.0     1.7    4.3    5.4
      12    1.0  2292.9   3732.6     1.1    3.1    4.1
      60    1.0 13696.4  35692.2     1.0    2.5    3.0
</code></pre>
<h2>Discussion</h2>
<p>For the comparative analysis normalized metrics were used:</p>
<ul>
<li><p><strong>MO (Memory Overhead):</strong> This is the ratio of peak memory usage to raw file size. For example, an <strong>MO of 2.0</strong> means the process used twice as much RAM as the data occupies on disk. This allows a direct comparison of memory efficiency regardless of input size.</p>
</li>
<li><p><strong>RT (Normalized Runtime):</strong> This is the execution time divided by that of the fastest variant for the given file size, so the fastest engine scores 1.0. It shows how many times longer each engine takes on the same data, giving a clear picture of relative speed across the AWK variants.</p>
</li>
</ul>
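<p>As a worked example (numbers taken from the Benchmark #1 tables for the 595MB file, where <strong>mawk</strong> was fastest), the normalized values reduce to a few divisions:</p>
<pre><code class="language-plaintext">awk 'BEGIN {
    size = 595                                  # file size [MB]
    # gawk: 1.4853 s, 1252.23 MB; mawk (baseline): 1.1788 s, 1042.48 MB
    printf "gawk  RT=%.1f MO=%.1f\n", 1.4853/1.1788, 1252.23/size
    printf "mawk  RT=%.1f MO=%.1f\n", 1.1788/1.1788, 1042.48/size
}'
</code></pre>
<p>The output matches the 595MB row of the Benchmark #1 normalized table.</p>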
<h3>Benchmark #1</h3>
<p>The data confirms that for the simple line-storage pattern (<code>x[NR]=$0</code>), memory consumption is a strictly linear function of the input file size across all three variants. As the data scales from 120KB to 1.2GB, the normalized memory overhead (MO) approaches an asymptote: the initial variance caused by interpreter startup costs (which peaked at 6.2x for the smallest file) stabilizes at higher volumes. By the 1.2GB mark, <strong>gawk</strong> and <strong>nawk</strong> settle at roughly 2.1x and 2.0x overhead relative to the raw file size, while <strong>mawk</strong> maintains a leaner 1.8x, making it the most memory-efficient engine for large-scale string retention.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/fd36e335-63f9-400b-9610-8e37f400fa4d.png" alt="Benchmark #1: Peak Memory vs File Size" style="display:block;margin:0 auto" />

<p>In terms of runtime performance, <strong>mawk</strong> was consistently the fastest variant, serving as the baseline (1.0) for the normalized runtime (RT) at every file size. <strong>gawk</strong> grew more efficient as the workload increased, dropping from 1.9x to 1.2x the runtime of <strong>mawk</strong>, while <strong>nawk</strong> struggled with this storage pattern, finishing 3.3x slower than <strong>mawk</strong> at the 10-million-row mark. These results show that for pure data-population tasks where preserving line integrity matters, <strong>mawk</strong> offers the best balance of speed and memory footprint.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/72ec4294-3223-4e54-9c98-26c5191d34a9.png" alt="Benchmark #1: Normalized Results" style="display:block;margin:0 auto" />

<h3>Benchmark #2</h3>
<p>The data for the 2D matrix population <code>x[NR, i] = $i</code> shows a massive increase in resource requirements compared to simple line storage, though the peak memory remains a strictly linear function of the file size. As the dataset scales toward 1.2GB, the normalized memory overhead reaches an asymptotic state where the initial interpreter costs become negligible. In this scenario, <strong>mawk</strong> proves to be the most memory-efficient by far, stabilizing at a memory overhead of 13.4. In contrast, <strong>gawk</strong> is exceptionally heavy for this storage pattern, requiring <strong>38 times the raw file size in RAM</strong>, which is nearly triple the footprint of <strong>mawk</strong>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/606f1588-86e8-4965-ab07-fe24b97a1c81.png" alt="Benchmark #2: Peak Memory vs File Size" style="display:block;margin:0 auto" />

<p>The runtime performance results reveal a significant shift in execution efficiency as the number of array elements grows. While <strong>mawk</strong> is the fastest for small files, its performance degrades significantly at scale, eventually becoming the slowest variant with a normalized runtime of 1.8. Conversely, <strong>nawk</strong> emerges as the performance leader for large-scale matrix population, maintaining the baseline speed of 1.0 at high volumes. These results illustrate a clear trade-off: <strong>mawk</strong> is the optimal choice for minimizing the memory footprint in massive stateful operations, but <strong>nawk</strong> offers superior throughput when processing tens of millions of discrete fields.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/c3f4a214-39e1-46a5-bd21-aa443c2a2fc9.png" alt="Benchmark #2: Normalized Results" style="display:block;margin:0 auto" />

<h3>Benchmark #3</h3>
<p>In Benchmark #3, using 14 independent 1D arrays proves significantly more memory-efficient than the 2D composite key approach across all variants. The peak memory usage remains a linear function of file size, with normalized memory overhead (MO) reaching a steady state quickly. <strong>mawk</strong> again demonstrates superior memory management, stabilizing at an MO of 10.4, which is about 40% more efficient than <strong>gawk</strong>’s 17.3 and <strong>nawk</strong>’s 15.6. Interestingly, <strong>gawk</strong>'s native array-of-arrays feature (gawk*) proved to be the most resource-intensive strategy in this test, with a stabilized MO of 25.8. This suggests that the internal overhead of managing nested objects in <strong>gawk</strong> is substantially higher than managing multiple flat hash tables.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/1033e3cc-927d-417d-8b8a-9946997039bd.png" alt="Benchmark #3: Peak Memory vs File Size" style="display:block;margin:0 auto" />

<p>Runtime-wise, <strong>mawk</strong> maintained its lead as the fastest variant, serving as the 1.0 baseline for all file sizes. <strong>gawk</strong> and <strong>nawk</strong> performed similarly at scale, with <strong>gawk</strong> finishing about 1.4 times slower than <strong>mawk</strong>, while <strong>nawk</strong> lagged at 1.9 times slower. Despite the structural elegance of <strong>gawk</strong>'s nested arrays, the gawk* results showed no performance benefit over the 1D array method, consistently running about 2.0 times slower than <strong>mawk</strong>. For users requiring field-level access at scale, the strategy of multiple 1D arrays in <strong>mawk</strong> provides the best optimization of both execution speed and memory footprint.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/acd5bf6b-44bd-4ef7-8c23-36c4fe43b8f6.png" alt="Benchmark #3: Normalized Results" style="display:block;margin:0 auto" />

<h3>Benchmark #4</h3>
<p>Benchmark #4 reveals a dramatic divergence in performance, highlighting how differently the engines handle repeated string concatenation (<code>x = x $0</code>). Here <strong>gawk</strong> performs exceptionally, maintaining near-linear runtime as the file size increases thanks to a smarter string-reallocation strategy than its counterparts. At 60MB, <strong>gawk</strong> completes the task in just 0.1 seconds, whereas <strong>mawk</strong> and <strong>nawk</strong> suffer a quadratic collapse, each append copying the entire accumulated string: they take approximately 24 and 64 minutes respectively, and the 5x jump from 12MB to 60MB costs <strong>mawk</strong> roughly 26x more time, matching the expected O(n&#178;) growth. Due to these extreme runtimes, <strong>mawk</strong> and <strong>nawk</strong> were not tested beyond 60MB. For any workflow involving large-scale string building, <strong>gawk</strong> is the only viable option among the three.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/3a39d7e1-2688-4c0a-b426-7b76c792cef3.png" alt="Benchmark #4: Peak Memory vs File Size" style="display:block;margin:0 auto" />

<p>The memory overhead data also shows an interesting reversal of the previous benchmarks' trends. While <strong>mawk</strong> and <strong>nawk</strong> struggle with time, they initially show higher memory overhead relative to the file size during the transition phases. However, <strong>gawk</strong>’s memory usage remains extremely tight, approaching a 1.0 overhead ratio at the 60MB mark and beyond, effectively matching the raw file size. The massive RT (normalized runtime) values for <strong>mawk</strong> and <strong>nawk</strong>, reaching over 13,000x and 35,000x the duration of <strong>gawk</strong>, underscore a fundamental architectural difference: <strong>gawk</strong> is specifically optimized for efficient string appending, while the others suffer from costly repeated memory copying and reallocations.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/b34cf3d7-3fb6-4bcf-b1af-0c042c74af0b.png" alt="Benchmark #4: Normalized Results" style="display:block;margin:0 auto" />

<h2>Conclusion</h2>
<p>This table summarizes the Memory Overhead (MO, the ratio of peak memory usage relative to the raw file size) of the four benchmarks. These values represent the stable multiplier of peak memory relative to file size once the dataset is large enough to make interpreter startup costs negligible.</p>
<p><strong>Memory Overhead (MO) Summary Table</strong></p>
<table>
<thead>
<tr>
<th>Benchmark Scenario</th>
<th>gawk</th>
<th>mawk</th>
<th>nawk</th>
<th>Best Efficiency</th>
</tr>
</thead>
<tbody><tr>
<td>#1: Store entire lines</td>
<td>2.1</td>
<td>1.8</td>
<td>2.0</td>
<td><strong>mawk</strong></td>
</tr>
<tr>
<td>#2: Populate 2D matrix</td>
<td>38.0</td>
<td>13.4</td>
<td>15.8</td>
<td><strong>mawk</strong></td>
</tr>
<tr>
<td>#3: 1D array per field</td>
<td>17.3</td>
<td>10.4</td>
<td>15.6</td>
<td><strong>mawk</strong></td>
</tr>
<tr>
<td>#4: String concatenation*</td>
<td>1.0</td>
<td>2.5</td>
<td>3.0</td>
<td><strong>gawk</strong></td>
</tr>
</tbody></table>
<p>*Note: Benchmark #4 values are taken from the 60MB file due to the runtime constraints of <strong>mawk</strong> and <strong>nawk</strong>.</p>
<h3>Key Findings for the Article</h3>
<ul>
<li><p>The array efficiency gap: For stateful data population, <strong>mawk was consistently the most memory-efficient</strong>. In the 2D matrix test, it used roughly a third of the memory <strong>gawk</strong> required, highlighting its leaner internal representation of hash tables and strings.</p>
</li>
<li><p>Structure penalty: Breaking a CSV line into 14 discrete fields (Benchmark #3) increases memory overhead by approximately 5x to 8x compared to storing the line as a single string (Benchmark #1).</p>
</li>
<li><p><strong>gawk</strong>’s specialization: While <strong>gawk</strong> is the heaviest variant for array-based storage, it is uniquely optimized for string management. It was the only variant where memory overhead effectively equaled the file size (1.0) during massive string concatenation, coupled with extremely fast execution.</p>
</li>
<li><p>The cost of "Array of Arrays": Though not in the summary table, the results for <strong>gawk</strong> (25.8 MO) show that native nested structures are significantly more expensive than multiple 1D arrays (17.3 MO), likely due to the overhead of managing multiple internal hash table objects.</p>
</li>
</ul>
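<p>To make the numbers concrete, the population styles behind Benchmarks #1, #3, and #4 can be sketched as follows. This is a minimal illustration on a two-record, three-field inline sample; the actual benchmarks used a 14-field CSV and far larger files.</p>
<pre><code class="lang-bash"># Benchmark #1 style: store each whole line in one array
printf 'a,b,c\nd,e,f\n' | awk '{ line[NR] = $0 } END { print NR, "lines stored" }'

# Benchmark #3 style: one 1D array per field (roughly 5x-8x the overhead of #1)
printf 'a,b,c\nd,e,f\n' | awk -F, '{ f1[NR] = $1; f2[NR] = $2; f3[NR] = $3 } END { print NR, "records split" }'

# Benchmark #4 style: build one large string by concatenation
printf 'a,b,c\nd,e,f\n' | awk '{ s = s $0 "\n" } END { print length(s), "bytes concatenated" }'
</code></pre>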
<p>In conclusion, these results demonstrate that using AWK in stateful mode requires careful consideration. While these benchmarks were conducted by populating the entire dataset to test engine limits, significant memory can be saved in practice by populating only the specific fields or records needed for the task. <strong>If RAM matters, mawk is the clear leader for population methods involving arrays or matrix simulations.</strong> However, <strong>for methods requiring large-scale string building, gawk remains the only viable alternative</strong>. Ultimately, selecting the right population method and the appropriate AWK variant is essential for maintaining stability and performance when processing large datasets.</p>
]]></content:encoded></item><item><title><![CDATA[FreeBSD and dwl on a 2010 ThinkPad 
]]></title><description><![CDATA[FreeBSD is a Unix operating system with a long history of stability, clean design, and excellent documentation. Older hardware tends to run it particularly well: mature driver support and a lean base ]]></description><link>https://awklab.com/freebsd-dwl</link><guid isPermaLink="true">https://awklab.com/freebsd-dwl</guid><category><![CDATA[FreeBSD]]></category><category><![CDATA[dwl]]></category><category><![CDATA[unix]]></category><category><![CDATA[wayland]]></category><category><![CDATA[thinkpad]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Thu, 05 Mar 2026 19:06:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/199072e1-72f5-4665-a1c6-150c011d4eaf.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>FreeBSD is a Unix operating system with a long history of stability, clean design, and excellent documentation. Older hardware tends to run it particularly well: mature driver support and a lean base system make aging machines surprisingly capable. Old does not necessarily mean obsolete, provided the hardware is paired with the right OS and the software stack remains minimal.</p>
<p>This write-up covers a 2010 ThinkPad L412 with an Intel Core i3-350M and 8 GB RAM running FreeBSD 15.0 with <strong>dwl</strong> as the Wayland compositor. <strong>dwl</strong> is the Wayland equivalent of <strong>dwm</strong>, the well-known X11 window manager from the suckless project. It is minimal, efficient, and follows the Unix philosophy — you start with the bare minimum and add only what you need. The result is a modern Wayland desktop running on sixteen-year-old hardware that remains usable for everyday tasks.</p>
<p>For installation, I followed the <a href="https://docs.freebsd.org/en/books/handbook/">FreeBSD Handbook</a> and installed <strong>FreeBSD 15.0</strong> with only the <strong>base system</strong>, the <strong>sh</strong> shell, and <strong>lib32</strong> on a <strong>UFS</strong> filesystem. The installation itself was straightforward. The only notable limitation was Wi-Fi: on this hardware the driver works reliably only on the 2.4 GHz band. The 5 GHz band proved highly unstable and was practically unusable.</p>
<p>The following section walks through the <strong>dwl</strong> installation process in enough detail to reproduce this setup.</p>
<h2>Required packages for dwl</h2>
<ul>
<li><p>wayland, wayland-protocols</p>
</li>
<li><p>drm-kmod (GPU driver)</p>
</li>
<li><p>wlroots019 (for dwl-0.8)</p>
</li>
<li><p>foot (terminal)</p>
</li>
<li><p>dejavu (my preferred font)</p>
</li>
<li><p>wmenu (dmenu equivalent)</p>
</li>
<li><p>wl-clipboard, grim, slurp (clipboard and screenshots)</p>
</li>
<li><p>swaybg (for background image)</p>
</li>
<li><p>mako (notification daemon)</p>
</li>
<li><p>neovim (editor)</p>
</li>
<li><p>gmake, gcc, pkgconf, evdev-proto (for compilation)</p>
</li>
<li><p>fcft, tllist</p>
</li>
<li><p>wget, firefox</p>
</li>
</ul>
<pre><code class="language-plaintext">sudo pkg install wayland wayland-protocols drm-kmod wlroots019 foot dejavu
sudo pkg install wmenu wl-clipboard grim slurp swaybg mako neovim
sudo pkg install gmake gcc pkgconf evdev-proto fcft tllist wget firefox 
</code></pre>
<p>Note: <code>sudo</code> is not included in the base system on FreeBSD, so install it first (<code>pkg install sudo</code>) or run commands as root.</p>
<h2>System configuration</h2>
<p>Enable seatd</p>
<pre><code class="language-plaintext">sudo sysrc seatd_enable="YES"
sudo service seatd start

# Add to .profile
export LIBSEAT_BACKEND="seatd"
</code></pre>
<p>Add your user to the video/input groups</p>
<pre><code class="language-plaintext">sudo pw groupmod video -m [your_username]
sudo pw groupmod input -m [your_username]
</code></pre>
<p>Enable audio server</p>
<pre><code class="language-plaintext">sudo sysrc sndiod_enable="YES"
sudo service sndiod start
</code></pre>
<p>Load GPU driver</p>
<pre><code class="language-plaintext">sudo kldload /boot/modules/i915kms.ko

# Make it permanent by adding it to /etc/rc.conf
sudo sysrc kld_list+="/boot/modules/i915kms.ko"
</code></pre>
<p>Update evdev-proto header path</p>
<pre><code class="language-plaintext">sudo ln -s /usr/local/include/linux /usr/include/linux
</code></pre>
<p>Enable UTF-8 for foot</p>
<pre><code class="language-plaintext"># Add to .profile
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
</code></pre>
<p>Note: log out and back in for the changes to take effect.</p>
<h2>Building dwl</h2>
<p>Download <strong>dwl 0.8</strong> and <strong>slstatus 1.1</strong> from their respective repositories and extract them.</p>
<pre><code class="language-plaintext">wget https://codeberg.org/dwl/dwl/archive/v0.8.tar.gz
wget https://dl.suckless.org/tools/slstatus-1.1.tar.gz
tar xvf v0.8.tar.gz
tar xvf slstatus-1.1.tar.gz
</code></pre>
<p>To customize <strong>dwl</strong>, edit <code>config.def.h</code> and apply <a href="https://codeberg.org/dwl/dwl-patches">patches</a> as needed. For reference, here is the list of patches I applied in this setup:</p>
<ul>
<li><p><code>attachbottom</code>, <code>movestack</code>, <code>pertag</code> for my preferred <strong>dwl</strong> behavior.</p>
</li>
<li><p><code>bar</code> - necessary patch for a status-bar</p>
</li>
</ul>
<p>The command to patch:</p>
<pre><code class="language-plaintext">patch -i bar.patch
</code></pre>
<p>Note: If a patch fails to apply cleanly, the output of <code>patch</code> will indicate which hunks were rejected, and those changes must be applied manually.</p>
<p>Unlike <strong>dwm</strong>, the <strong>dwl</strong> patch collection does not include a <code>focusadjacenttag</code> patch, so I implemented the modification myself. This mod adds two functions: one to focus the tag immediately to the left or right of the currently active tag, and another to move the focused window to the adjacent tag in either direction.</p>
<pre><code class="language-c">// Equivalent to focusadjacenttag dwm patch
// Add to dwl.c

static void viewtoadjacent(const Arg *arg);
static void tagtoadjacent(const Arg *arg);

void
viewtoadjacent(const Arg *arg)
{
	unsigned int newtag;
	unsigned int curtag = selmon-&gt;tagset[selmon-&gt;seltags];
	if (arg-&gt;i &gt; 0) // Cycle Right
		newtag = (curtag &lt;&lt; 1);
	else // Cycle Left
		newtag = (curtag &gt;&gt; 1);
	// Wrap around logic for standard 9 tags
	if (newtag &gt;= (1 &lt;&lt; TAGCOUNT)) newtag = 1;
	if (newtag &lt;= 0) newtag = (1 &lt;&lt; (TAGCOUNT - 1));
	view(&amp;(Arg){.ui = newtag});
}

void
tagtoadjacent(const Arg *arg)
{
	Client *c = focustop(selmon);
	unsigned int newtag;
	unsigned int curtag;
	if (!c)
		return;
	curtag = c-&gt;tags;
	if (arg-&gt;i &gt; 0) // Shift Right
		newtag = (curtag &lt;&lt; 1);
	else // Shift Left
		newtag = (curtag &gt;&gt; 1);
	// Wrap around logic for standard 9 tags
	if (newtag &gt;= (1 &lt;&lt; TAGCOUNT)) newtag = 1;
	if (newtag &lt;= 0) newtag = (1 &lt;&lt; (TAGCOUNT - 1));
	tag(&amp;(Arg){.ui = newtag});
}

// Add to config.def.h

/* tagging */
#define TAGCOUNT 9

/* modifier                  key             function         argument */
{ MODKEY,                    XKB_KEY_Left,   viewtoadjacent,  {.i = -1} },
{ MODKEY,                    XKB_KEY_Right,  viewtoadjacent,  {.i = +1} },
{ MODKEY|WLR_MODIFIER_SHIFT, XKB_KEY_Left,   tagtoadjacent,   {.i = -1} },
{ MODKEY|WLR_MODIFIER_SHIFT, XKB_KEY_Right,  tagtoadjacent,   {.i = +1} },
</code></pre>
<p>For screenshots, use <code>grim</code> and <code>slurp</code> with the following settings in <code>config.def.h</code>. Screenshots are copied to the clipboard via <code>wl-copy</code> and also saved in the <code>~/Pictures/Screenshots</code> directory.</p>
<pre><code class="language-c">/* Region Screenshot with Notification */
static const char *scrregion[] = { "sh", "-c", "grim -g \"$(slurp)\" - | tee ~/Pictures/Screenshots/$(date +%Y-%m-%d_%H-%M-%S).png | wl-copy &amp;&amp; notify-send 'Region Saved'", NULL };
/* Full Screen Screenshot with Notification */
static const char *scrfull[]   = { "sh", "-c", "grim - | tee ~/Pictures/Screenshots/$(date +%Y-%m-%d_%H-%M-%S).png | wl-copy &amp;&amp; notify-send 'Full Screen Saved'", NULL };

/* modifier                  key          function     argument */
{ MODKEY,                    XKB_KEY_p,   spawn,       {.v = scrfull } },
{ MODKEY|WLR_MODIFIER_SHIFT, XKB_KEY_p,   spawn,       {.v = scrregion } },
</code></pre>
<p>To customize keybindings and appearance, edit <code>config.def.h</code>, copy it to <code>config.h</code>, then build and install.</p>
<pre><code class="language-plaintext">cp config.def.h config.h
sudo gmake clean install
</code></pre>
<p><strong>slstatus</strong> can be customized the same way. On FreeBSD, <code>config.mk</code> has to be modified:</p>
<pre><code class="language-plaintext"># Add to config.mk
LDLIBS   = -lX11 -lkvm -lsndio
</code></pre>
<h3>Startup script</h3>
<p>In <code>$HOME/bin</code>, create an executable <code>sdwl</code> script (<code>chmod +x</code>) to launch <strong>dwl</strong>. You can set your wallpaper there, displayed using <code>swaybg</code>.</p>
<pre><code class="language-plaintext">#!/bin/sh
export $(dbus-launch)
mako &amp;
slstatus -s | dwl -s "sh -c 'swaybg -i ~/Pictures/BSDviolet.png &amp;'"
</code></pre>
<h2>Configuration</h2>
<p>I use the <strong>Dracula</strong> color scheme for <strong>foot</strong> and <strong>Neovim</strong>, with the <strong>vim-startify</strong> and <strong>vim-airline</strong> plugins. <strong>mako</strong> is also configured for Wayland notifications. The setup uses the <strong>DejaVu</strong> font. No gaps, no Nerd Fonts, no icons on the status bar. It’s purely functional and minimal, keeping the desktop clean and efficient.</p>
<p>All configuration files, including the wallpaper, are available in my <a href="https://github.com/awklab/FreeBSD-dwl">GitHub</a> repository. The wallpaper is AI-generated. Screenshots are shown below.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/3bcb562b-e3c0-4a3f-a8db-2dc14cfe865c.png" alt="FreeBSD with dwl, neovim, violet theme" style="display:block;margin:0 auto" />

<br />

<img src="https://cdn.hashnode.com/uploads/covers/6977b73612a913b189a167d8/1cc220b6-f6c3-4d8a-a280-99bb435105e1.png" alt="FreeBsd with dwl, tiling setup, foot, htop, neofetch, neovim" style="display:block;margin:0 auto" />

<h2>Conclusion</h2>
<p><strong>FreeBSD</strong> with <strong>dwl</strong> results in a usable and surprisingly snappy system on this old, modest hardware. On a fresh start, total memory usage (Active + Wired + Laundry) stays under 700 MB, and disk usage is 8 GB. Hopefully this guide proves useful for anyone looking to revive older machines with a minimal Wayland setup.</p>
]]></content:encoded></item><item><title><![CDATA[The BEHILOS Benchmark]]></title><description><![CDATA[In his book Unix: A History and a Memoir, Brian Kernighan recounts his favorite grep story from the early days of Unix. Someone at Bell Labs asked whether it was possible to find English words composed only of the letters formed by an upside-down cal...]]></description><link>https://awklab.com/behilos-benchmark</link><guid isPermaLink="true">https://awklab.com/behilos-benchmark</guid><category><![CDATA[grep]]></category><category><![CDATA[cli]]></category><category><![CDATA[Search Engines]]></category><category><![CDATA[Linux]]></category><category><![CDATA[unix]]></category><category><![CDATA[benchmarking]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Tue, 10 Feb 2026 22:55:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770756490806/28ff030b-0d40-43bb-a70b-c54e8fd74086.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In his book <em>Unix: A History and a Memoir</em>, Brian Kernighan recounts his favorite grep story from the early days of Unix. Someone at Bell Labs asked whether it was possible to find English words composed only of the letters formed by an upside-down calculator. The digits on a turned calculator display, 5071438, map to the letter set BEHILOS.</p>
<p>Kernighan grepped the regular expression <code>^[behilos]*$</code> against Webster’s Second International Dictionary, which contained 234,936 words, and found 263 matches, including words he had never seen before.</p>
<p>The current <a target="_blank" href="https://web.mit.edu/freebsd/head/share/dict/">Webster’s Second International Dictionary</a> contains 236,007 words. To reproduce the results, run:</p>
<pre><code class="lang-bash">grep <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
</code></pre>
<p>This results in 264 matches. The longest words are nine characters long: <em>blissless</em> and <em>booboisie</em>.</p>
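<p>The length check can be reproduced without the full dictionary; the snippet below runs the same filter over a tiny inline word list for illustration:</p>
<pre><code class="lang-bash"># Keep BEHILOS-only words, then print the nine-letter ones
# (inline sample; point grep at /usr/share/dict/web2 for the real run)
printf 'bliss\nblissless\nbooboisie\nshoe\n' |
    grep '^[behilos]*$' |
    awk 'length($0) == 9'
</code></pre>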
<p>What started as a historical footnote quickly made me wonder: <strong>how would the old BEHILOS grep perform as a quick-and-dirty benchmark for today’s search tools?</strong></p>
<h2 id="heading-selected-text-search-tools-for-the-behilos-benchmark">Selected Text Search Tools for the BEHILOS Benchmark</h2>
<p>For the BEHILOS benchmark, I selected a mix of classic and modern text search tools, covering traditional Unix utilities, AWK variants, and fast recursive searchers widely used by developers today.</p>
<ul>
<li><p><strong>grep</strong> – the classic Unix tool and historical baseline for text searching.</p>
</li>
<li><p><strong>rg (ripgrep)</strong> – a modern, extremely fast recursive searcher optimized for large codebases.</p>
</li>
<li><p><strong>gawk</strong> – the GNU AWK implementation, feature-rich and widely used for text processing.</p>
</li>
<li><p><strong>mawk</strong> – a lightweight, efficient AWK variant with minimal memory footprint.</p>
</li>
<li><p><strong>nawk</strong> – the traditional New AWK, preserving historical behavior for legacy scripts.</p>
</li>
<li><p><strong>ag (The Silver Searcher)</strong> – a fast recursive searcher often replacing ack.</p>
</li>
<li><p><strong>pt (The Platinum Searcher)</strong> – a newer, recursive grep alternative with multithreading support.</p>
</li>
<li><p><strong>ack</strong> – a Perl-based source tree searcher, maintained and optimized for code patterns.</p>
</li>
<li><p><strong>ugrep</strong> – a feature-rich, modern grep clone with extended regex support and performance tuning.</p>
</li>
<li><p><strong>sift</strong> – a recursive search tool for large directories, optimized for developer workflows.</p>
</li>
</ul>
<h2 id="heading-benchmarking-methodology">Benchmarking Methodology</h2>
<p>To evaluate the performance of the search engines, the benchmarking focused on two critical metrics, <strong>runtime</strong> and <strong>peak memory usage</strong>, which together represent the <strong>total resource footprint</strong>. Resource tracking was performed using <a target="_blank" href="https://github.com/gsauthof/cgmemtime">cgmemtime</a>, an ideal tool for this purpose as it captures peak memory consumption for the process group.</p>
<p>The benchmarking process was automated via <a target="_blank" href="https://github.com/awklab/benchgab.awk"><code>benchgab.awk</code></a> (version 2026.02.10.), my custom runner that handles warmups, multiple test runs, and calculates statistical metrics as well as normalized parameters for comparative analysis. Each benchmark sequence included one initial warmup run followed by 100 recorded runs.</p>
<p>The following table summarizes the evaluated search engines, their versions, and the exact commands used for the BEHILOS grep.</p>
<pre><code class="lang-bash">Name      Version      Command
----      -------      -------
grep      3.12         grep <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
ripgrep   15.1.0       rg <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
gawk      5.3.2        gawk <span class="hljs-string">'/^[behilos]*$/'</span> /usr/share/dict/web2
mawk      1.3.4        mawk <span class="hljs-string">'/^[behilos]*$/'</span> /usr/share/dict/web2
nawk      20251225     nawk <span class="hljs-string">'/^[behilos]*$/'</span> /usr/share/dict/web2
ag        2.2.0        ag -s <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
pt        2.2.0        pt -e <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
ack       3.9.0        ack <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
ugrep     7.5.0        ugrep <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
sift      0.9.1        sift <span class="hljs-string">'^[behilos]*$'</span> /usr/share/dict/web2
</code></pre>
<p>Tests were conducted on an Arch Linux workstation powered by a Ryzen 9 5900X CPU, using the Alacritty terminal within a dwm session.</p>
<h2 id="heading-results">Results</h2>
<p>The statistical summary was computed by the script, deriving the mean, standard deviation, median, minimum, and maximum from this 100-run sample for both runtime and peak memory usage.</p>
<p>Jitter (Jtr%) was calculated for both runtime and peak memory usage as <code>abs((mean − median) / median) × 100%</code>, quantifying run-to-run variability. For low-footprint commands, even minor scheduling effects or transient memory spikes can noticeably influence averages, making jitter a useful indicator of measurement stability.</p>
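<p>As a sanity check, the jitter formula can be evaluated directly; the sketch below plugs in the gawk runtime mean and median from the summary:</p>
<pre><code class="lang-bash"># Jtr% = abs((mean - median) / median) * 100
awk -v mean=0.0220 -v median=0.0219 'BEGIN {
    printf "Jtr%% = %.1f\n", sqrt(((mean - median) / median)^2) * 100
}'
</code></pre>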
<pre><code class="lang-bash">--- Statistical Summary of BEHILOS Benchmarks ---
cmd     Runtime [s]                                      Peak Memory [MB]
        mean ± sdev      min     median  max     Jtr%    mean ± sdev      min     median  max     Jtr%
gawk    0.0220 ± 0.0008  0.0204  0.0219  0.0251  0.5     0.75 ± 0.16      0.48    0.74    1.00    1.0
ugrep   0.0087 ± 0.0005  0.0076  0.0086  0.0102  1.1     1.64 ± 0.14      1.44    1.60    2.03    2.3
ack     0.0880 ± 0.0027  0.0838  0.0874  0.1003  0.7     6.49 ± 0.26      6.12    6.39    7.19    1.6
sift    0.0402 ± 0.0021  0.0370  0.0398  0.0489  1.1     13.32 ± 2.12     9.98    13.11   20.46   1.7
mawk    0.0089 ± 0.0005  0.0078  0.0088  0.0109  1.0     0.59 ± 0.14      0.48    0.54    1.25    8.9
ag      0.0139 ± 0.0008  0.0119  0.0138  0.0167  0.5     2.22 ± 0.45      1.31    2.21    3.23    0.2
pt      0.0245 ± 0.0010  0.0227  0.0244  0.0286  0.5     8.03 ± 0.32      7.49    7.98    10.48   0.6
grep    0.0048 ± 0.0003  0.0041  0.0048  0.0068  0.3     0.71 ± 0.11      0.60    0.64    1.12    10.9
rg      0.0056 ± 0.0003  0.0048  0.0056  0.0064  1.1     1.13 ± 0.17      0.98    1.01    1.75    11.4
nawk    0.0389 ± 0.0012  0.0372  0.0387  0.0429  0.7     0.64 ± 0.16      0.48    0.73    1.05    12.4
</code></pre>
<p>The evaluation is based on <strong>normalized metrics</strong>:</p>
<ul>
<li><p><strong>RT</strong>: Normalized median runtime. The execution time relative to the fastest implementation (1.0 is the baseline).</p>
</li>
<li><p><strong>PM</strong>: Normalized median group peak memory. The peak memory relative to the implementation with the lowest memory footprint (1.0 is the baseline).</p>
</li>
<li><p><strong>d</strong>: Euclidean Distance. Measures the geometric distance from the "Ideal Point" (1,1). Lower values denote higher implementation efficiency.</p>
</li>
<li><p><strong>F</strong>: Resource Footprint. Calculated as RT×PM. This represents the total resource footprint; lower values indicate a more efficient use of system resources to complete the same task.</p>
</li>
</ul>
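<p>Both derived metrics are easy to recompute; the sketch below uses mawk's normalized values as input:</p>
<pre><code class="lang-bash"># d = sqrt((RT-1)^2 + (PM-1)^2), F = RT * PM
awk -v RT=1.83 -v PM=1.00 'BEGIN {
    printf "d = %.2f  F = %.2f\n", sqrt((RT - 1)^2 + (PM - 1)^2), RT * PM
}'
</code></pre>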
<p>The following table summarizes the overall performance of the 10 tested search engines according to the normalized BEHILOS benchmarks.</p>
<pre><code class="lang-bash">-- Normalized BEHILOS Benchmarks ---
cmd     RT      PM      d       F
gawk    4.52    1.37    3.54    6.20
ugrep   1.78    2.97    2.12    5.28
ack     18.07   11.86   20.23   214.26
sift    8.23    24.31   24.41   200.00
mawk    1.83    1.00    0.83    1.83
ag      2.85    4.11    3.62    11.73
pt      5.05    14.81   14.39   74.79
grep    1.00    1.18    0.18    1.18
rg      1.17    1.88    0.89    2.19
nawk    8.00    1.36    7.01    10.90
</code></pre>
<h2 id="heading-discussion">Discussion</h2>
<p>The normalized BEHILOS benchmarks were evaluated using <strong>Pareto frontier analysis</strong> (as previously applied in my <a target="_blank" href="https://awklab.com/practical-awk-benchmarking">AWK benchmarking study</a>). To visualize search engine performance, the normalized values were plotted in a two-dimensional coordinate system, where the x-axis represents normalized runtime (RT) and the y-axis represents normalized peak memory usage (PM).</p>
<p>The ideal point is located at (1, 1), representing an implementation that is simultaneously the fastest and the most memory-efficient. To improve the visibility of implementations clustered near the ideal point, a logarithmic scale was applied.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770762029172/22a09a93-3452-47ed-8be3-a4892655b5e7.png" alt="BEHILOS Benchmark: Pareto Efficiency Plot" class="image--center mx-auto" /></p>
<p><strong>Graph</strong>: <em>The Pareto frontier of search engines tested in the BEHILOS Benchmark, visualizing the optimal trade-off between execution speed and memory footprint.</em></p>
<p>The normalized BEHILOS results clearly separate the tested tools into distinct performance tiers when runtime and peak memory usage are considered jointly.</p>
<h3 id="heading-grep-the-overall-winner-of-the-behilos-benchmark"><strong>grep</strong> the overall winner of the BEHILOS Benchmark.</h3>
<p>Taken together, the normalized metrics and the Pareto frontier analysis identify <strong>grep</strong> as the overall winner of the BEHILOS Benchmark. It is not only the fastest implementation (RT = 1.00) but also achieves the lowest total resource footprint, as reflected by the minimum F value (F = 1.18).</p>
<h3 id="heading-grep-and-mawk-define-the-pareto-frontier"><strong>grep</strong> and <strong>mawk</strong> define the Pareto frontier.</h3>
<p><strong>grep</strong> is the fastest implementation (RT = 1.00) while remaining very close to the minimum memory baseline (PM = 1.18). <strong>mawk</strong>, in contrast, achieves the lowest peak memory usage (PM = 1.00) with only a modest runtime penalty (RT = 1.83). Neither tool can be improved in one dimension without degrading the other, placing both on the Pareto frontier and representing the optimal trade-off envelope for this workload.</p>
<h3 id="heading-near-frontier-but-dominated-tools">Near-frontier but dominated tools.</h3>
<p><strong>rg</strong>, <strong>ugrep</strong>, <strong>ag</strong>, and <strong>gawk</strong> are dominated by the frontier but remain reasonably close to it. Their normalized distance and F values indicate that they are not optimal for this specific task, yet their performance characteristics are still competitive. This reflects design choices favoring richer feature sets, broader file handling, or more general workloads rather than minimal footprint.</p>
<h3 id="heading-clearly-dominated-implementations">Clearly dominated implementations.</h3>
<p><strong>pt</strong>, and especially <strong>sift</strong> and <strong>ack</strong>, lie far from the Pareto frontier. Their high normalized runtime and peak memory usage result in very large F values, indicating poor efficiency for this narrowly defined benchmark. These tools incur significant overhead relative to the simplicity of the BEHILOS search.</p>
<h3 id="heading-the-case-of-nawk">The case of <strong>nawk</strong>.</h3>
<p>Although <strong>nawk</strong> exhibits a low peak memory footprint (PM = 1.36), its slow runtime (RT = 8.00) places it well outside the efficient region. In addition, it showed the highest peak memory jitter among all tested tools, which negatively affected its stability metrics and overall performance profile.</p>
<p>Overall, the Pareto analysis highlights that tools optimized for minimalism and predictability dominate this benchmark, while more feature-heavy searchers pay a measurable cost in both runtime and memory.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The BEHILOS grep, despite originating as a historical anecdote from the early days of Unix, turns out to be an effective micro-benchmark. Its simplicity isolates the core costs of regex matching, process startup, and memory allocation without confounding factors such as filesystem traversal or complex I/O patterns.</p>
<p>This benchmark shows that, for low-footprint text searches, decades-old design principles still matter. Classic Unix tools like <strong>grep</strong>, along with lean implementations such as <strong>mawk</strong>, remain hard to beat when efficiency is the primary goal. Modern search engines deliver powerful features and excellent performance for real-world workloads, but those advantages are not free.</p>
<p>The <strong>BEHILOS Benchmark</strong> does not aim to crown a universal “best” search tool. Instead, it demonstrates how a minimal, well-chosen workload can expose fundamental trade-offs between speed, memory usage, and stability—and why even a small historical footnote can still teach us something meaningful about performance today.</p>
]]></content:encoded></item><item><title><![CDATA[AWK: the Zero-Setup Pre-Processor]]></title><description><![CDATA[Modern data pipelines most often fail at their beginning, not their end. A malformed record, an unexpected delimiter, or an encoding anomaly can cause otherwise robust processing engines to abort after consuming significant computational resources. T...]]></description><link>https://awklab.com/awk-the-zero-setup-pre-processor</link><guid isPermaLink="true">https://awklab.com/awk-the-zero-setup-pre-processor</guid><category><![CDATA[awk]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[data pipeline]]></category><category><![CDATA[Linux]]></category><category><![CDATA[unix]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Sun, 01 Feb 2026 20:39:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769876288331/3c7c5cb2-2ebe-4ea4-a31d-c407862ce952.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Modern data pipelines most often fail at their beginning, not their end. A malformed record, an unexpected delimiter, or an encoding anomaly can cause otherwise robust processing engines to abort after consuming significant computational resources. These failures are not rare edge cases; they are a predictable consequence of feeding untrusted, heterogeneous input into systems that implicitly assume structural coherence.</p>
<p>Preventing such aborts requires a tool that operates before schema, before types, and before semantic assumptions are applied: schema-agnostic, streaming, low in resource footprint, and available everywhere data flows. It must tolerate mixed record structures, inconsistent delimiters, and partial corruption without imposing premature interpretation. Such a tool already exists—and has existed since the 1970s. Its name is <strong>AWK</strong>.</p>
<h2 id="heading-the-contemporary-data-pipeline-model">The Contemporary Data Pipeline Model</h2>
<p>Modern data systems are commonly described using a five-phase pipeline:</p>
<p><strong>Phase 1. Ingest</strong> – raw data arrival from external sources</p>
<p><strong>Phase 2. Validate</strong> – quality checks and correctness guarantees</p>
<p><strong>Phase 3. Transform</strong> – schema enforcement, normalization, columnar operations</p>
<p><strong>Phase 4. Analyze</strong> – analytics, feature engineering, ML preparation</p>
<p><strong>Phase 5. Consume</strong> – BI, reporting, downstream products</p>
<p>This model is widely recognized across data engineering practice, even if the terminology varies slightly between platforms and tools. Crucially, validation is now treated as a first-class concern: data contracts, expectations, and quality gates are standard components of modern stacks.</p>
<p>In practice, however, validation is often implemented primarily as a semantic operation—type checks, nullability constraints, and value ranges—implicitly presuming that incoming data already satisfies basic structural requirements.</p>
<h2 id="heading-structural-validation-comes-before-semantic-validation">Structural Validation Comes Before Semantic Validation</h2>
<p>Validation can be divided into two fundamentally different layers:</p>
<p><strong>Structural (geometric) validation</strong>: Concerned with physical integrity: record boundaries, delimiter consistency, field counts, encoding correctness, and basic layout.</p>
<p><strong>Semantic validation</strong>: Concerned with meaning: data types, ranges, domain rules, and business logic.</p>
<p>Semantic validation <em>depends</em> on structural integrity. A columnar engine cannot validate a date column if unescaped delimiters have shifted field boundaries. A schema-on-read system cannot enforce types if records are misaligned or partially corrupted. The pipeline fails before semantics can even be evaluated.</p>
<p>A related failure mode appears when pandas is used in <strong>Phase 3</strong>, where schema enforcement and large-scale transformation are expected. Pandas is architecturally aligned with <strong>Phase 4</strong> workloads, and applying it earlier on large datasets can lead to memory exhaustion, just as applying Polars or DuckDB in <strong>Phase 2</strong>—before structural validation—leads to structural parse failures.</p>
<p>This highlights the absence of an explicit <strong>Phase-2</strong> structural validation layer in many modern data pipelines.</p>
<h2 id="heading-phase-2-explicitly-structural-validation">Phase 2, Explicitly: Structural Validation</h2>
<p>Within the standard ingest → validate → transform model, <strong>structural validation is the earliest and most failure-prone part of Phase 2</strong>. Its purpose is not to interpret data, but to determine whether the data is fit to be interpreted at all.</p>
<p>A tool operating at this layer must satisfy specific constraints:</p>
<ul>
<li><p>operate on untrusted, possibly malformed input</p>
</li>
<li><p>process data as a stream, with constant memory usage</p>
</li>
<li><p>make minimal assumptions about structure</p>
</li>
<li><p>integrate cleanly into automated pipelines</p>
</li>
<li><p>fail fast and produce actionable diagnostics</p>
</li>
</ul>
<p>This is where <strong>AWK</strong> belongs.</p>
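<p>A minimal sketch of such a Phase-2 gate, assuming a comma-delimited feed in which every record should carry three fields (both the delimiter and the expected field count are illustrative):</p>
<pre><code class="lang-bash"># Fail fast on structural deviation: report offending records, exit nonzero
printf 'a,b,c\nd,e\nf,g,h\n' | awk -F, '
    NF != 3 { printf "record %d: %d fields, expected 3\n", NR, NF; fail = 1 }
    END { exit fail }'
</code></pre>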
<h2 id="heading-awks-role-in-phase-2">AWK’s Role in Phase 2</h2>
<p>AWK is not an analytics tool, a transformation engine, or a schema system. Its strength lies earlier — <strong>inside Phase 2, before schema and semantics are applied</strong>.</p>
<p>Architecturally, AWK functions as a <strong>pre-schema validation sentinel</strong>.</p>
<p>It processes text streams sequentially, with memory usage independent of input size. A multi-hundred-gigabyte file can be inspected using the same resources as a kilobyte-scale sample. This property alone makes AWK suitable for structural inspection of datasets that exceed available memory.</p>
<p>Its footprint is minimal—on the order of hundreds of kilobytes—and its availability is effectively universal. Every Unix-derived system, including Linux distributions, macOS, BSD variants, container base images, and CI runners, provides AWK by default. No environment setup, dependency resolution, or runtime configuration is required.</p>
<p>For a tool whose purpose is to guard the entrance to a pipeline, this matters. <strong>Phase-2</strong> components should be reliable, predictable, and easy to deploy everywhere.</p>
<h2 id="heading-why-textual-validation-matters">Why Textual Validation Matters</h2>
<p>Real-world data is rarely homogeneous. Files often contain:</p>
<ul>
<li><p>headers and footers with different formats</p>
</li>
<li><p>multiple delimiter conventions within a single stream</p>
</li>
<li><p>varying field counts by record type</p>
</li>
<li><p>embedded structured blocks inside free-form text</p>
</li>
<li><p>multi-line records such as logs or stack traces</p>
</li>
</ul>
<p>Specialized validators and parsers typically assume consistency. When that assumption fails, they abort. AWK does not impose such constraints. It operates on patterns, not schemas, allowing validation logic to adapt dynamically to what the data actually contains.</p>
<p>This does not mean that AWK magically “fixes” broken data. It means that AWK can observe, classify, and assert structural properties before downstream tools are engaged. It can count anomalies, flag record classes, detect shifts in layout, and isolate segments that would cause rigid parsers to fail.</p>
<p>This textual, pattern-first perspective is precisely what is required at the earliest stage of validation.</p>
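<p>For example, a one-line profile of a stream's structural shape (delimiter and sample data are illustrative):</p>

```bash
# Observe and classify before parsing: histogram of field counts per record.
# (for-in traversal order is unspecified, hence the trailing sort)
printf 'a,b,c\nd,e\nf,g,h\ni\n' |
awk -F',' '{ shape[NF]++ }
END { for (n in shape) print n, "fields:", shape[n], "record(s)" }' | sort -n
```

A downstream parser expecting a fixed column count can be pointed directly at the anomalous record classes this reveals.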
<h2 id="heading-comparison-with-specialized-streaming-tools">Comparison with Specialized Streaming Tools</h2>
<p>A number of contemporary command-line tools address aspects of streaming data inspection and manipulation. Utilities such as Miller, csvkit, xsv, qsv, xan, and related programs are widely used for high-performance processing of delimited data, particularly CSV. They excel when input conforms to a recognizable tabular structure and when field boundaries, quoting rules, and record layouts are already well defined.</p>
<p>These tools are optimized for structured streams: they provide fast parsing, expressive transformations, and strong guarantees once basic format assumptions are met. In that role, they are highly effective components of modern data pipelines.</p>
<p>Their limitation, in the context of early validation, is not capability but scope. They presuppose that structural coherence already exists. When confronted with inconsistent field counts, shifting delimiters, malformed records, or mixed-format sections, they typically fail early or require pre-cleaned input.</p>
<p>AWK occupies a different position. It does not assume a stable schema or even a stable record shape. By operating on text and patterns rather than fixed structures, it can observe, classify, and assert properties of a stream before any format-specific interpretation is imposed. This makes it suitable for the earliest stage of validation, where the primary question is not how to transform the data, but whether the data can be safely interpreted at all.</p>
<h2 id="heading-the-data-assertion-pattern">The Data Assertion Pattern</h2>
<p>Robust pipelines treat validation as a gate, not a side effect. A practical way to implement this is through data assertions: small, focused validation programs that return explicit success or failure signals.</p>
<p>These assertions execute immediately after ingestion and before any expensive processing begins. If a structural invariant is violated—unexpected field counts, malformed records, encoding issues—the pipeline fails fast, with diagnostics that point directly to the source of the problem.</p>
<p>AWK is well-suited to this pattern. Its exit codes integrate naturally with shell pipelines and workflow orchestrators. Its diagnostics can include precise line numbers, pattern matches, and anomaly counts. And its simplicity reduces the operational risk of the validation layer itself.</p>
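<p>A minimal assertion of this kind might look as follows (the file name and the invariant being checked are illustrative):</p>

```bash
# Demo input with one malformed record.
printf 'id,amount\n1,10\n2,twenty\n' > orders.csv

# Data assertion: every data row must carry a numeric amount in field 2.
# The exit status gates the downstream stage via the shell's && operator.
awk -F',' 'NR > 1 && $2 !~ /^[0-9]+(\.[0-9]+)?$/ {
    printf "line %d: non-numeric amount: %s\n", NR, $2
    bad = 1
} END { exit bad }' orders.csv && echo "proceeding to transform" || echo "pipeline halted"
```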
<h2 id="heading-unix-composition-and-streaming-architecture">Unix Composition and Streaming Architecture</h2>
<p>AWK operates within the Unix toolchain and composes naturally with core utilities such as grep, sed, sort, uniq, and cut through pipes and standard streams. At the same time, it is fundamentally different from these tools. AWK is a complete programming language in its own right—Turing complete, stateful, and capable of expressing non-trivial control flow, aggregation, and structural analysis logic.</p>
<p>This dual nature is central to its role in data pipelines. AWK participates in Unix composition like a classic streaming filter, yet it can encapsulate validation logic that would otherwise require custom programs or heavier runtimes. Pattern matching, conditional execution, state carried across records, and multi-line context can all be handled within a single streaming pass, without abandoning the simplicity of standard input and output.</p>
<p>Composition remains a strength rather than a constraint. AWK can act as a thin structural probe between other tools, or as a self-contained validation stage that replaces entire chains of simpler utilities. In both cases, execution remains streaming, memory usage remains bounded, and behavior remains transparent and inspectable.</p>
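<p>A sketch of the second case: a single streaming pass standing in for a chain of simpler utilities (the sample log lines are illustrative):</p>

```bash
# One streaming AWK pass replacing a grep | cut | sort | uniq -c chain:
# count occurrences of the second field on ERROR lines only.
printf 'ERROR disk\nINFO ok\nERROR net\nERROR disk\n' |
awk '/^ERROR/ { n[$2]++ } END { for (k in n) print n[k], k }' | sort -rn
```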
<p>AWK does not compete with modern data tools. It complements them by ensuring that the assumptions they rely on are actually satisfied.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Modern data pipelines increasingly recognize validation as essential, yet often conflate semantic correctness with structural integrity. In practice, structure must be established before meaning can be enforced. Treating malformed or heterogeneous input as if it were already schema-ready remains a common source of avoidable failure.</p>
<p>AWK occupies a precise and early position inside <strong>Phase 2</strong> of the pipeline: structural, pre-schema validation. Its streaming execution model, minimal assumptions, constant memory usage, and universal availability make it well suited to this role. These properties are not historical artifacts but practical advantages when dealing with untrusted data at scale.</p>
<p>AWK’s continued relevance is not a matter of nostalgia, but of architectural fit. It does not replace modern data tools, nor does it compete with them. Instead, it operates where many pipelines remain weakest—at the point where data is first examined, before interpretation begins.</p>
<p>Further articles will explore concrete applications of AWK in modern data workflows and validation scenarios. These discussions continue at AwkLab, where AWK’s role in contemporary data engineering is examined in depth.</p>
]]></content:encoded></item><item><title><![CDATA[AWK Syntax Essentials]]></title><description><![CDATA[Syntax is based on The AWK Programming Language, 2nd Edition by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger.
Pattern
A pattern determines when an action is executed. When a pattern matches an input line, its associated action is execut...]]></description><link>https://awklab.com/awk-syntax-essentials</link><guid isPermaLink="true">https://awklab.com/awk-syntax-essentials</guid><category><![CDATA[awk]]></category><category><![CDATA[Scripting]]></category><category><![CDATA[cli]]></category><category><![CDATA[Linux]]></category><category><![CDATA[unix]]></category><category><![CDATA[programming languages]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Sat, 31 Jan 2026 13:58:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769867577548/617b23b1-2b4e-4cc4-94c6-cbdc337e5624.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Syntax is based on <a target="_blank" href="https://awk.dev/"><em>The AWK Programming Language</em>, 2nd Edition</a> by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger.</p>
<h2 id="heading-pattern"><strong>Pattern</strong></h2>
<p>A pattern determines when an action runs: whenever a pattern matches an input line, its associated action is executed.</p>
<p>If no action is specified, the default is <code>print $0</code>.</p>
<p>Syntax: <code>pattern { action }</code></p>
<p><strong>Examples</strong></p>
<ol>
<li><code>BEGIN</code> — runs once before any input is read</li>
</ol>
<pre><code class="lang-bash">BEGIN { FS=<span class="hljs-string">":"</span> }
</code></pre>
<ol start="2">
<li><code>END</code> — runs once after all input is processed</li>
</ol>
<pre><code class="lang-bash">END { <span class="hljs-built_in">print</span> <span class="hljs-string">"total:"</span>, total }
</code></pre>
<ol start="3">
<li><p>Expression — executes on every input line where the condition is true</p>
<p> Skip the header line</p>
</li>
</ol>
<pre><code class="lang-bash">NR &gt; 1 { <span class="hljs-built_in">print</span> <span class="hljs-variable">$0</span> }
</code></pre>
<ol start="4">
<li>Regex — executes on every line matching the pattern</li>
</ol>
<pre><code class="lang-bash">/error/ { <span class="hljs-built_in">print</span> <span class="hljs-variable">$0</span> }
</code></pre>
<ol start="5">
<li>Range — matches all lines from <code>pattern1</code> through <code>pattern2</code>, inclusive</li>
</ol>
<pre><code class="lang-bash">/start/,/end/ { <span class="hljs-built_in">print</span> <span class="hljs-variable">$0</span> }
</code></pre>
<h2 id="heading-conditionals"><strong>Conditionals</strong></h2>
<p>AWK supports standard conditionals for branching logic.</p>
<p><strong>Examples</strong></p>
<ol>
<li><p><code>if</code></p>
<p> Skip empty lines</p>
</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">if</span> (NF &gt; 0) <span class="hljs-built_in">print</span> <span class="hljs-variable">$0</span>
</code></pre>
<ol start="2">
<li><code>if-else</code></li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">if</span> (<span class="hljs-variable">$1</span> &gt; 0) <span class="hljs-built_in">print</span> <span class="hljs-string">"positive"</span>; <span class="hljs-keyword">else</span> <span class="hljs-built_in">print</span> <span class="hljs-string">"negative"</span>
</code></pre>
<ol start="3">
<li><code>if-else if</code></li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">if</span> (<span class="hljs-variable">$1</span> &gt; 0) <span class="hljs-built_in">print</span> <span class="hljs-string">"positive"</span>
<span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (<span class="hljs-variable">$1</span> &lt; 0) <span class="hljs-built_in">print</span> <span class="hljs-string">"negative"</span>
<span class="hljs-keyword">else</span> <span class="hljs-built_in">print</span> <span class="hljs-string">"zero"</span>
</code></pre>
<h2 id="heading-ternary-expression">Ternary Expression</h2>
<p>The <strong>ternary operator</strong> (<code>?:</code>) is a compact, C-style alternative to the <code>if-else</code> statement. By using this operator, a concise one-line <strong>ternary expression</strong> can be constructed, which, unlike a statement, returns a value that can be used directly within calculations or commands.</p>
<p><strong>Syntax</strong>: <em>condition</em> <code>?</code> <em>value_if_true</em> <code>:</code> <em>value_if_false</em></p>
<p><strong>Examples</strong></p>
<ol>
<li>Absolute Value</li>
</ol>
<p>AWK lacks a built-in <code>abs()</code> function. The ternary operator expresses it concisely:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$1</span> = (<span class="hljs-variable">$1</span> &lt; 0) ? -<span class="hljs-variable">$1</span> : <span class="hljs-variable">$1</span>
</code></pre>
<ol start="2">
<li>Truthiness &amp; Success Labels</li>
</ol>
<p>AWK treats <code>0</code> and <code>""</code> as false, and everything else as true. Use this to label status codes:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">print</span> <span class="hljs-variable">$1</span>, (<span class="hljs-variable">$1</span> ? <span class="hljs-string">"SUCCESS"</span> : <span class="hljs-string">"FAILURE"</span>)
</code></pre>
<p>or</p>
<pre><code class="lang-bash"><span class="hljs-built_in">printf</span>(<span class="hljs-string">"%s\t%s\n"</span>, <span class="hljs-variable">$1</span>, (<span class="hljs-variable">$1</span> ? <span class="hljs-string">"SUCCESS"</span> : <span class="hljs-string">"FAILURE"</span>))
</code></pre>
<h2 id="heading-associative-arrays"><strong>Associative Arrays</strong></h2>
<p>AWK arrays are associative — keys can be strings or numbers, making them ideal for counting, grouping, and lookups without any pre-declaration.</p>
<p>Syntax: <code>array[key] = value</code></p>
<p><strong>Examples</strong></p>
<ol>
<li>Populate an array with lines</li>
</ol>
<pre><code class="lang-bash">x[NR] = <span class="hljs-variable">$0</span>
</code></pre>
<ol start="2">
<li>Deduplicate lines</li>
</ol>
<pre><code class="lang-bash">!x[<span class="hljs-variable">$0</span>]++
</code></pre>
<ol start="3">
<li>Populate a 2D matrix with all fields</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">for</span> (i=1; i&lt;=NF; i++) x[NR, i] = <span class="hljs-variable">$i</span>
</code></pre>
<p>AWK simulates 2D arrays by concatenating keys with a built-in separator (<code>SUBSEP</code>), so <code>x[row, col]</code> is stored internally as <code>x[row SUBSEP col]</code>.</p>
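<p>The indices can be recovered from the combined key by splitting on <code>SUBSEP</code>; the trailing <code>sort</code> only compensates for the unspecified traversal order of <code>for (k in x)</code>:</p>

```bash
# Populate a 2D matrix, then read it back: split() on SUBSEP recovers the
# row and column indices from each combined key.
printf 'a b\nc d\n' |
awk '{ for (i = 1; i <= NF; i++) x[NR, i] = $i }
END { for (k in x) { split(k, idx, SUBSEP); print idx[1], idx[2], x[k] } }' | sort
```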
<h2 id="heading-loops"><strong>Loops</strong></h2>
<p>AWK supports standard C-style loops as well as a dedicated form for traversing associative arrays.</p>
<p><strong>Examples</strong></p>
<ol>
<li><code>for</code> loop</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">for</span> (i=1; i&lt;=NF; i++) <span class="hljs-built_in">print</span> <span class="hljs-variable">$i</span>
</code></pre>
<ol start="2">
<li><code>while</code> loop</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">while</span> (i &lt;= NF) { <span class="hljs-built_in">print</span> <span class="hljs-variable">$i</span>; i++ }
</code></pre>
<ol start="3">
<li><code>do-while</code> loop</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">do</span> { <span class="hljs-built_in">print</span> <span class="hljs-variable">$i</span>; i++ } <span class="hljs-keyword">while</span> (i &lt;= NF)
</code></pre>
<ol start="4">
<li>Iterate over an associative array</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-keyword">for</span> (k <span class="hljs-keyword">in</span> array) <span class="hljs-built_in">print</span> k, array[k]
</code></pre>
<h2 id="heading-pipes"><strong>Pipes</strong></h2>
<p>AWK can send output to, or receive input from, external shell commands using pipes, enabling seamless integration with standard Unix tools.</p>
<p>Syntax: <code>command | getline [var]</code> / <code>print | "command"</code></p>
<p><strong>Examples</strong></p>
<ol>
<li>Send output to a shell command</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-built_in">print</span> <span class="hljs-variable">$1</span> | <span class="hljs-string">"sort"</span>
</code></pre>
<ol start="2">
<li>Read a shell command's output into a variable</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-string">"date"</span> | getline today
</code></pre>
<ol start="3">
<li>Pipe output through a chain of commands</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-built_in">print</span> <span class="hljs-variable">$1</span> | <span class="hljs-string">"sort | uniq -c"</span>
</code></pre>
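<p>Each distinct command string designates its own stream, and <code>close()</code> terminates that command and flushes its output; this is necessary before reusing the same command string or reading its results:</p>

```bash
# close() terminates the piped command, so its output is flushed before the
# program continues; here "done" is printed only after sort has finished.
printf '3\n1\n2\n' |
awk '{ print $1 | "sort" } END { close("sort"); print "done" }'
```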
<h2 id="heading-comparison-operators">Comparison Operators</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Operator</td><td>Meaning</td></tr>
</thead>
<tbody>
<tr>
<td><code>&lt;</code></td><td>less than</td></tr>
<tr>
<td><code>&lt;=</code></td><td>less than or equal to</td></tr>
<tr>
<td><code>==</code></td><td>equal to</td></tr>
<tr>
<td><code>!=</code></td><td>not equal to</td></tr>
<tr>
<td><code>&gt;=</code></td><td>greater than or equal to</td></tr>
<tr>
<td><code>&gt;</code></td><td>greater than</td></tr>
<tr>
<td><code>~</code></td><td>matched by</td></tr>
<tr>
<td><code>!~</code></td><td>not matched by</td></tr>
</tbody>
</table>
</div><h2 id="heading-logical-operators">Logical Operators</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Operator</td><td>Meaning</td></tr>
</thead>
<tbody>
<tr>
<td><code>&amp;&amp;</code></td><td>AND</td></tr>
<tr>
<td><code>||</code></td><td>OR</td></tr>
<tr>
<td><code>!</code></td><td>NOT</td></tr>
</tbody>
</table>
</div><p><strong>Syntax</strong>: <em>condition1 operator condition2</em></p>
<h2 id="heading-built-in-variables">Built-in variables</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Variable</td><td>Description</td><td>Default</td></tr>
</thead>
<tbody>
<tr>
<td><code>ARGC</code></td><td>Number of command-line arguments, including command name</td><td>-</td></tr>
<tr>
<td><code>ARGV</code></td><td>Array of command-line arguments, numbered 0..ARGC-1</td><td>-</td></tr>
<tr>
<td><code>CONVFMT</code></td><td>Conversion format for numbers</td><td><code>"%.6g"</code></td></tr>
<tr>
<td><code>ENVIRON</code></td><td>Array of shell environment variables</td><td>-</td></tr>
<tr>
<td><code>FILENAME</code></td><td>Name of current input file</td><td>-</td></tr>
<tr>
<td><code>FNR</code></td><td>Record number in current file</td><td>-</td></tr>
<tr>
<td><code>FS</code></td><td>Input field separator</td><td><code>" "</code></td></tr>
<tr>
<td><code>NF</code></td><td>Number of fields in current record</td><td>-</td></tr>
<tr>
<td><code>NR</code></td><td>Number of records read so far</td><td>-</td></tr>
<tr>
<td><code>OFMT</code></td><td>Output format for numbers</td><td><code>"%.6g"</code></td></tr>
<tr>
<td><code>OFS</code></td><td>Output field separator for print</td><td><code>" "</code></td></tr>
<tr>
<td><code>ORS</code></td><td>Output record separator for print</td><td><code>"\n"</code></td></tr>
<tr>
<td><code>RLENGTH</code></td><td>Length of string matched by match function</td><td>-</td></tr>
<tr>
<td><code>RS</code></td><td>Input record separator</td><td><code>"\n"</code></td></tr>
<tr>
<td><code>RSTART</code></td><td>Start of string matched by match function</td><td>-</td></tr>
<tr>
<td><code>SUBSEP</code></td><td>Subscript separator</td><td><code>"\034"</code></td></tr>
</tbody>
</table>
</div><h2 id="heading-built-in-arithmetic-functions">Built-in Arithmetic Functions</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Function</td><td>Value Returned</td></tr>
</thead>
<tbody>
<tr>
<td><code>atan2(y, x)</code></td><td>arctangent of y/x in the range −π to π</td></tr>
<tr>
<td><code>cos(x)</code></td><td>cosine of x, with x in radians</td></tr>
<tr>
<td><code>exp(x)</code></td><td>exponential function of x, e^x</td></tr>
<tr>
<td><code>int(x)</code></td><td>integer part of x; truncated towards 0</td></tr>
<tr>
<td><code>log(x)</code></td><td>natural (base e) logarithm of x</td></tr>
<tr>
<td><code>rand()</code></td><td>random number r, where 0 ≤ r &lt; 1</td></tr>
<tr>
<td><code>sin(x)</code></td><td>sine of x, with x in radians</td></tr>
<tr>
<td><code>sqrt(x)</code></td><td>square root of x</td></tr>
<tr>
<td><code>srand(x)</code></td><td>x is new seed for rand(); use time of day if x is omitted; return previous seed</td></tr>
</tbody>
</table>
</div><h2 id="heading-built-in-string-functions">Built-in String Functions</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Function</td><td>Description</td></tr>
</thead>
<tbody>
<tr>
<td><code>gsub(r,s)</code></td><td>substitute s for r globally in $0, return number of substitutions made</td></tr>
<tr>
<td><code>gsub(r,s,t)</code></td><td>substitute s for r globally in string t, return number of substitutions made</td></tr>
<tr>
<td><code>index(s,t)</code></td><td>return first position of string t in s, or 0 if t is not present</td></tr>
<tr>
<td><code>length(s)</code></td><td>return number of Unicode characters in s; return number of elements if s is an array</td></tr>
<tr>
<td><code>match(s,r)</code></td><td>test whether s contains a substring matched by r; return index or 0; sets RSTART and RLENGTH</td></tr>
<tr>
<td><code>split(s,a)</code></td><td>split s into array a on FS or as CSV if --csv is set, return number of elements in a</td></tr>
<tr>
<td><code>split(s,a,fs)</code></td><td>split s into array a on field separator fs, return number of elements in a</td></tr>
<tr>
<td><code>sprintf(fmt,expr-list)</code></td><td>return expr-list formatted according to format string fmt</td></tr>
<tr>
<td><code>sub(r,s)</code></td><td>substitute s for the leftmost longest substring of $0 matched by r; return number of substitutions made</td></tr>
<tr>
<td><code>sub(r,s,t)</code></td><td>substitute s for the leftmost longest substring of t matched by r; return number of substitutions made</td></tr>
<tr>
<td><code>substr(s,p)</code></td><td>return suffix of s starting at position p</td></tr>
<tr>
<td><code>substr(s,p,n)</code></td><td>return substring of s of length at most n starting at position p</td></tr>
<tr>
<td><code>tolower(s)</code></td><td>return s with upper case ASCII letters mapped to lower case</td></tr>
<tr>
<td><code>toupper(s)</code></td><td>return s with lower case ASCII letters mapped to upper case</td></tr>
</tbody>
</table>
</div><h2 id="heading-expression-operators">Expression Operators</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Operation</td><td>Operators</td><td>Example</td><td>Meaning of Example</td></tr>
</thead>
<tbody>
<tr>
<td>assignment</td><td><code>= += -= *= /= %= ^=</code></td><td><code>x *= 2</code></td><td><code>x = x * 2</code></td></tr>
<tr>
<td>conditional</td><td><code>?:</code></td><td><code>x ? y : z</code></td><td>if <code>x</code> is true then <code>y</code> else <code>z</code></td></tr>
<tr>
<td>logical OR</td><td><code>||</code></td><td><code>x || y</code></td><td>1 if <code>x</code> or <code>y</code> is true, 0 otherwise</td></tr>
<tr>
<td>logical AND</td><td><code>&amp;&amp;</code></td><td><code>x &amp;&amp; y</code></td><td>1 if <code>x</code> and <code>y</code> are true, 0 otherwise</td></tr>
<tr>
<td>array membership</td><td><code>in</code></td><td><code>i in a</code></td><td>1 if <code>a[i]</code> exists, 0 otherwise</td></tr>
<tr>
<td>matching</td><td><code>~ !~</code></td><td><code>$1 ~ /x/</code></td><td>1 if the first field contains an <code>x</code>, 0 otherwise</td></tr>
<tr>
<td>relational</td><td><code>&lt; &lt;= == != &gt;= &gt;</code></td><td><code>x == y</code></td><td>1 if <code>x</code> is equal to <code>y</code>, 0 otherwise</td></tr>
<tr>
<td>concatenation</td><td>(none)</td><td><code>"a" "bc"</code></td><td><code>"abc"</code>; there is no explicit concatenation operator</td></tr>
<tr>
<td>add, subtract</td><td><code>+ -</code></td><td><code>x + y</code></td><td>sum of <code>x</code> and <code>y</code></td></tr>
<tr>
<td>multiply, divide, mod</td><td><code>* / %</code></td><td><code>x % y</code></td><td>remainder of <code>x</code> divided by <code>y</code></td></tr>
<tr>
<td>unary plus and minus</td><td><code>+ -</code></td><td><code>-x</code></td><td>negated value of <code>x</code></td></tr>
<tr>
<td>logical NOT</td><td><code>!</code></td><td><code>!$1</code></td><td>1 if <code>$1</code> is zero or null, 0 otherwise</td></tr>
<tr>
<td>exponentiation</td><td><code>^</code></td><td><code>x ^ y</code></td><td><code>x</code> to the power <code>y</code></td></tr>
<tr>
<td>increment, decrement</td><td><code>++ --</code></td><td><code>++x, x++</code></td><td>add 1 to <code>x</code></td></tr>
<tr>
<td>field</td><td><code>$</code></td><td><code>$i + 1</code></td><td>value of <code>i</code>-th field, plus 1</td></tr>
<tr>
<td>grouping</td><td><code>()</code></td><td><code>$(i++)</code></td><td>return <code>i</code>-th field, then increment <code>i</code></td></tr>
</tbody>
</table>
</div><h2 id="heading-printf">printf</h2>
<p><strong>printf format-control characters</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Character</td><td>Print Expression As</td></tr>
</thead>
<tbody>
<tr>
<td><code>c</code></td><td>single UTF-8 character (code point)</td></tr>
<tr>
<td><code>d</code> or <code>i</code></td><td>decimal integer</td></tr>
<tr>
<td><code>e</code> or <code>E</code></td><td>[-]d.dddddde[+-]dd or [-]d.ddddddE[+-]dd</td></tr>
<tr>
<td><code>f</code></td><td>[-]ddd.dddddd</td></tr>
<tr>
<td><code>g</code> or <code>G</code></td><td>e or f conversion, whichever is shorter, with nonsignificant zeros suppressed</td></tr>
<tr>
<td><code>o</code></td><td>unsigned octal number</td></tr>
<tr>
<td><code>u</code></td><td>unsigned integer</td></tr>
<tr>
<td><code>s</code></td><td>string</td></tr>
<tr>
<td><code>x</code> or <code>X</code></td><td>unsigned hexadecimal number</td></tr>
<tr>
<td><code>%</code></td><td>print a %; no argument is consumed</td></tr>
</tbody>
</table>
</div><hr />
]]></content:encoded></item><item><title><![CDATA[Why AWK in 2026?]]></title><description><![CDATA[Because small is beautiful.

Because AWK gives unmatched bang for the buck.

Because AWK is the antidote to AI slop.

Because AWK is always there when you need it.

Because AWK assumes your data fits in a pipe, not a cluster.

Because AWK makes you t...]]></description><link>https://awklab.com/why-awk-in-2026</link><guid isPermaLink="true">https://awklab.com/why-awk-in-2026</guid><category><![CDATA[awk]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Linux]]></category><category><![CDATA[unix]]></category><category><![CDATA[Scripting]]></category><category><![CDATA[programming]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Thu, 29 Jan 2026 08:12:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769868484560/67dc23f4-5018-4e84-8c2c-9f3dbaeda522.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<ul>
<li><p>Because small is beautiful.</p>
</li>
<li><p>Because AWK gives unmatched bang for the buck.</p>
</li>
<li><p>Because AWK is the antidote to AI slop.</p>
</li>
<li><p>Because AWK is always there when you need it.</p>
</li>
<li><p>Because AWK assumes your data fits in a pipe, not a cluster.</p>
</li>
<li><p>Because AWK makes you think, not just prompt.</p>
</li>
<li><p>Because AWK’s syntax stays out of your way.</p>
</li>
<li><p>Because AWK works across formats, not just inside one.</p>
</li>
<li><p>Because AWK is zero-setup — no need to import the world.</p>
</li>
<li><p>Because AWK lets you think in records, not lines.</p>
</li>
<li><p>Because AWK treats text as the universal interface.</p>
</li>
<li><p>Because AWK already <em>is</em> the loop.</p>
</li>
<li><p>Because AWK pre-processes gigabytes of text on a potato.</p>
</li>
<li><p>Because AWK works before schemas exist.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Practical AWK Benchmarking]]></title><description><![CDATA["I think in terms of programming languages you get the most bang for your buck by learning AWK",

said Brian Kernighan, the K in AWK (Lex Fridman podcast #109). AWK was created in Bell Labs in 1977, and its name is derived from the surnames of its au...]]></description><link>https://awklab.com/practical-awk-benchmarking</link><guid isPermaLink="true">https://awklab.com/practical-awk-benchmarking</guid><category><![CDATA[awk]]></category><category><![CDATA[Linux]]></category><category><![CDATA[benchmarking]]></category><category><![CDATA[Scripting]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data processing]]></category><category><![CDATA[unix]]></category><dc:creator><![CDATA[Gábor Dombay]]></dc:creator><pubDate>Mon, 26 Jan 2026 22:23:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769614522598/61617995-2073-4abc-a38e-49da9d6c04bf.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>"I think in terms of programming languages you get the most bang for your buck by learning AWK",</p>
</blockquote>
<p>said Brian Kernighan, the K in AWK (<a target="_blank" href="https://youtu.be/O9upVbGSBFo?si=-dY-2fzqteL1aUU-&amp;t=2196">Lex Fridman podcast #109</a>). AWK was created at Bell Labs in 1977, and its name is derived from the surnames of its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is still widely used today; as a core tool, it is available on any Unix or Unix-like system (Linux, the BSDs, macOS, etc.). Its relevance extends to modern data pipelines, where AWK can be applied as an effective, <a target="_blank" href="https://awklab.com/awk-the-zero-setup-pre-processor">schema-agnostic pre-processor</a>.</p>
<h1 id="heading-why-benchmark-awk">Why Benchmark AWK</h1>
<p>AWK is a concise, domain-specific programming language designed for efficient text processing via its pattern–action execution model. Its core strength lies in its implicit record- and field-level iteration: input is processed sequentially, one record at a time, with automatic field decomposition applied to each record. This design removes the need for explicit control flow for input traversal, allowing programs to express <em>what</em> transformation should occur rather than <em>how</em> to iterate. By abstracting record traversal, field iteration, and memory management, AWK allows concise expressions of C-like logic that are executed immediately when a pattern matches, making it exceptionally effective for rapid, ad-hoc data analysis and transformation.</p>
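<p>A minimal illustration of that model (the sample data is illustrative):</p>

```bash
# Pattern-action in miniature: the pattern selects records, the action fires
# once per match; record iteration and field splitting are implicit.
printf 'alice 30\nbob 25\ncarol 41\n' | awk '$2 > 26 { print $1 }'
```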
<p>While the AWK language is a standard, it exists in several distinct implementations, most notably:</p>
<ul>
<li><p><strong>gawk</strong> (GNU Awk): The feature-rich version maintained by Arnold Robbins. Default in Arch Linux, RHEL, Fedora.</p>
</li>
<li><p><strong>mawk</strong> (Mike Brennan’s Awk): A speed-oriented implementation using a bytecode interpreter, currently maintained by Thomas Dickey. Default in Debian and many of its derivatives.</p>
</li>
<li><p><strong>nawk</strong> (The "One True Awk"): The original implementation from the language’s creators, maintained by Brian Kernighan. Default in BSDs and macOS.</p>
</li>
</ul>
<p>In most Linux distributions, the <code>awk</code> command is a symbolic link to a specific implementation. You can verify which variant is being used with: <code>ls -l $(which awk)</code>.</p>
<p>This article benchmarks <strong>gawk</strong>, <strong>mawk</strong>, and <strong>nawk</strong> by evaluating both execution time and memory footprint through a <strong>Pareto Frontier</strong> analysis to determine their true resource efficiency. <strong>The primary catalyst for this comparison is Brian Kernighan’s 2025 update to nawk, which introduced CSV and UTF-8 support.</strong></p>
<h1 id="heading-benchmarking-approach"><strong>Benchmarking Approach</strong></h1>
<p>The benchmarks utilize functional one-liners that perform logical data analysis tasks relevant to the dataset. Rather than relying on synthetic loops or isolated instructions, these benchmarks are designed to reflect idiomatic AWK usage. This approach evaluates engine performance across various internal operations, including:</p>
<ul>
<li><p>Data aggregation: Extensive use of associative arrays.</p>
</li>
<li><p>Control flow: Implementation of conditional logic and loops.</p>
</li>
<li><p>Text processing: Pattern matching and string manipulation through regex and built-in functions.</p>
</li>
<li><p>Arithmetic: Processing numeric fields for financial calculations.</p>
</li>
</ul>
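<p>A sketch of this style of one-liner (not necessarily one of the actual benchmark programs; the inline records stand in for the real dataset):</p>

```bash
# Illustrative workload: sum field 14 (profit) per field 1 (region) with an
# associative array; the intermediate fields are left empty for brevity.
printf 'Europe,,,,,,,,,,,,,100\nAsia,,,,,,,,,,,,,50\nEurope,,,,,,,,,,,,,25\n' |
awk -F',' '{ profit[$1] += $14 }
END { for (r in profit) printf "%s %.0f\n", r, profit[r] }' | sort
```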
<h1 id="heading-methodology"><strong>Methodology</strong></h1>
<p>To evaluate the performance of the three AWK implementations, the benchmarking focused on two critical metrics: runtime and peak memory usage. Resource tracking was performed using <a target="_blank" href="https://github.com/gsauthof/cgmemtime">cgmemtime</a>, an ideal tool for this purpose as it captures peak group memory consumption, including any spawned child processes. The benchmarking process was automated via <a target="_blank" href="https://github.com/awklab/benchgab.awk"><code>benchgab.awk</code></a>, my custom benchmark runner that handles warmups and multiple test runs. The script is built on top of <strong>cgmemtime</strong> and implemented in AWK —a deliberately fitting and self-referential choice for this study.</p>
<p>The workload consisted of a <a target="_blank" href="https://excelbianalytics.com/wp/wp-content/uploads/2017/07/1500000%20Sales%20Records.zip">179 MB CSV dataset</a> containing 1.5 million lines and 14 fields. The chosen dataset ensures that commas appear only as field delimiters, allowing for a comparison across all three engines using the standard <code>-F,</code> flag, as mawk lacks <code>--csv</code> support. The fields are structured as follows:</p>
<ol>
<li>Region</li>
<li>Country</li>
<li>Item Type</li>
<li>Sales Channel</li>
<li>Order Priority</li>
<li>Order Date</li>
<li>Order ID</li>
<li>Ship Date</li>
<li>Units Sold</li>
<li>Unit Price</li>
<li>Unit Cost</li>
<li>Total Revenue</li>
<li>Total Cost</li>
<li>Total Profit</li>
</ol>
<p>Each benchmark sequence included one initial warmup followed by ten recorded runs, with the mean and standard deviation derived from this ten-run sample.</p>
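<p>The aggregation step over those ten recorded runs can be sketched in AWK itself. This is an illustrative reimplementation, not the actual <code>benchgab.awk</code> runner, and it assumes the sample (n-1) standard deviation; the input runtimes are made up:</p>

```shell
# Mean and sample standard deviation (n-1) over one runtime per line.
printf '%s\n' 1.24 1.26 1.22 1.28 |
awk '{ n++; sum += $1; sq += $1 * $1 }
     END { mean = sum / n
           sd = sqrt((sq - n * mean * mean) / (n - 1))
           printf "%.3f ± %.3f\n", mean, sd }'
# prints: 1.250 ± 0.026
```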
<p>The following table provides <strong>a summary of the specific versions</strong> and main characteristics of the three AWK implementations tested:</p>
<pre><code class="lang-plaintext">|  Name  |        Version | Binary Size | Installed Size |  CSV  | UTF-8 | Extensions |
|--------|----------------|-------------|----------------|-------|-------|------------|
|  gawk  |          5.3.2 |    853 kB   |    3.60 MB     |  yes  |  yes  |     yes    |
|  mawk  | 1.3.4 20250131 |    179 kB   |     206 kB     |  no   |  no   |     no     |
|  nawk  |       20251225 |    139 kB   |     145 kB     |  yes  |  yes  |     no     |
</code></pre>
<p>Benchmarks were conducted on an Arch Linux workstation powered by a Ryzen 9 5900X CPU, using the Alacritty terminal within a dwm session.</p>
<h1 id="heading-benchmarks">Benchmarks</h1>
<h2 id="heading-understanding-the-results">Understanding the Results</h2>
<p>Each benchmark includes a result table. The metrics are defined as follows:</p>
<ul>
<li><p><strong>Runtime</strong>: The average execution time [s] followed by the standard deviation (±σ).</p>
</li>
<li><p><strong>Peak Mem</strong>: The average peak group memory [MB] followed by the standard deviation (±σ).</p>
</li>
<li><p><strong>RT</strong>: Normalized average runtime. The execution time relative to the fastest implementation (1.0 is the baseline).</p>
</li>
<li><p><strong>PM</strong>: Normalized average group peak memory. The peak memory relative to the implementation with the lowest memory footprint (1.0 is the baseline).</p>
</li>
</ul>
<h2 id="heading-1-benchmark-duplicate-lines">#1 Benchmark: duplicate lines</h2>
<p><strong>Objective</strong>: Identify and print the total number of duplicate lines within the dataset.</p>
<p><strong>Targeted operations:</strong> Associative arrays.</p>
<pre><code class="lang-plaintext">awk -F, 'x[$0]++ { i++ } END { print i }'
</code></pre>
<p><strong>Output</strong>: 108603</p>
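<p>The one-liner works because <code>x[$0]++</code> evaluates to the key's count <em>before</em> incrementing: it is 0 (false) the first time a line appears and non-zero afterwards, so <code>i</code> is incremented once per repeated occurrence. A toy input (made up) makes this visible:</p>

```shell
# 'a' repeats twice and 'b' once beyond their first occurrences: 3 duplicates.
printf '%s\n' a b a a c b |
awk 'x[$0]++ { i++ } END { print i }'
# prints 3
```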
<pre><code class="lang-plaintext">|  #1  |  Runtime [s]  | Peak Mem [MB] |  RT  |  PM  |
|------|---------------|---------------|------|------|
| gawk | 1.395 ± 0.044 | 551.16 ± 0.47 | 1.12 | 1.98 |
| mawk | 1.241 ± 0.030 | 290.59 ± 0.17 | 1.00 | 1.04 |
| nawk | 1.267 ± 0.007 | 278.90 ± 0.21 | 1.02 | 1.00 |
</code></pre>
<h2 id="heading-2-benchmark-most-units-sold-by-country">#2 Benchmark: most units sold by country</h2>
<p><strong>Objective</strong>: Find the country with the highest total units sold, excluding duplicate entries.</p>
<p><strong>Targeted Operations</strong>: multi-array processing and max-value search</p>
<pre><code class="lang-plaintext">awk -F, 'NR &gt; 1 &amp;&amp; !x[$0]++ { u[$2] += $9 } END { for (i in u) if (u[i] &gt; u_max) { u_max = u[i]; c = i }  print c, u_max }'
</code></pre>
<pre><code class="lang-plaintext">|  #2  |  Runtime [s]  | Peak Mem [MB] |  RT  |  PM  |
|------|---------------|---------------|------|------|
| gawk | 2.717 ± 0.018 | 551.12 ± 0.31 | 1.62 | 1.98 |
| mawk | 1.678 ± 0.037 | 290.71 ± 0.28 | 1.00 | 1.04 |
| nawk | 2.175 ± 0.007 | 278.84 ± 0.22 | 1.30 | 1.00 |
</code></pre>
<h2 id="heading-3-benchmark-highest-profit-margin">#3 Benchmark: highest profit margin</h2>
<p><strong>Objective</strong>: Identify the order ID with the greatest ratio of profit to unit price.</p>
<p><strong>Targeted operations</strong>: Floating-point arithmetic and conditional max-value tracking.</p>
<pre><code class="lang-plaintext">awk -F, 'NR &gt; 1 { pm = ($10 - $11) / $10; if (pm &gt; pm_max) { pm_max = pm; id = $7 }} END { print id }'
</code></pre>
<p><strong>Output</strong>: 667593514</p>
<pre><code class="lang-plaintext">|  #3  |  Runtime [s]  | Peak Mem [MB] |  RT  |  PM  |
|------|---------------|---------------|------|------|
| gawk | 1.783 ± 0.006 |  0.74 ± 0.01  | 3.02 | 2.69 |
| mawk | 0.591 ± 0.005 |  0.62 ± 0.13  | 1.00 | 2.25 |
| nawk | 1.340 ± 0.005 |  0.28 ± 0.08  | 2.27 | 1.00 |
</code></pre>
<h2 id="heading-4-benchmark-count-european-countries">#4 Benchmark: count European countries</h2>
<p><strong>Objective</strong>: Count unique country names within the Europe region using exact <strong>string</strong> matching.</p>
<p><strong>Targeted operations</strong>: Exact string matching and associative array lookups.</p>
<pre><code class="lang-plaintext">awk -F, '$1 == "Europe" { eu[$2]++ } END { for (country in eu) n++; print n }'
</code></pre>
<p><strong>Output</strong>: 48</p>
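<p>The pattern filters rows before the array is touched, so only European rows are stored; the <code>END</code> loop then counts distinct keys, which keeps the one-liner portable across all three engines. A sketch on made-up rows:</p>

```shell
# Two distinct European countries among four made-up rows.
printf '%s\n' Europe,France Europe,Spain Asia,Japan Europe,France |
awk -F, '$1 == "Europe" { eu[$2]++ } END { for (c in eu) n++; print n }'
# prints 2
```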
<pre><code class="lang-plaintext">|  #4  |  Runtime [s]  | Peak Mem [MB] |  RT  |  PM  |
|------|---------------|---------------|------|------|
| gawk | 0.513 ± 0.005 |  0.74 ± 0.00  | 1.46 | 1.89 |
| mawk | 0.351 ± 0.004 |  0.63 ± 0.12  | 1.00 | 1.62 |
| nawk | 1.284 ± 0.006 |  0.39 ± 0.12  | 3.66 | 1.00 |
</code></pre>
<h2 id="heading-5-benchmark-count-european-countries-regex">#5 Benchmark: count European countries (regex)</h2>
<p><strong>Objective</strong>: Count unique country names within the Europe region using <strong>regex</strong> matching.</p>
<p><strong>Targeted operations</strong>: Regex matching and associative array lookups.</p>
<pre><code class="lang-plaintext">awk -F, '$1 ~ /Europe/ { eu[$2]++ } END { for (country in eu) n++; print n }'
</code></pre>
<p><strong>Output</strong>: 48</p>
<pre><code class="lang-plaintext">|  #5  |  Runtime [s]   | Peak Mem [MB] |  RT  |  PM  |
|------|----------------|---------------|------|------|
| gawk | 0.524 ± 0.017  |  0.76 ± 0.14  | 1.49 | 1.42 |
| mawk | 0.351 ± 0.006  |  0.67 ± 0.11  | 1.00 | 1.24 |
| nawk | 1.420 ± 0.007  |  0.54 ± 0.11  | 4.04 | 1.00 |
</code></pre>
<h2 id="heading-6-benchmark-number-of-orders-in-date-range">#6 Benchmark: number of orders in date range</h2>
<p><strong>Objective</strong>: Count the number of orders (excluding duplicates) placed between 3/1/2014 and 3/31/2015.</p>
<p><strong>Targeted operations</strong>: String manipulation functions, relational string comparisons, and associative array deduplication.</p>
<pre><code class="lang-plaintext">awk -F, 'NR &gt; 1 &amp;&amp; !x[$0]++ { split($6, a, "/"); d = sprintf("%d%02d%02d", a[3], a[1], a[2]); if (d &gt;= "20140301" &amp;&amp; d &lt;= "20150331") n++ } END { print n }'
</code></pre>
<p><strong>Output</strong>: 203060</p>
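<p>The date filter works because <code>sprintf</code> rewrites the M/D/YYYY field into a zero-padded YYYYMMDD key, on which lexicographic string order coincides with chronological order. The conversion in isolation, on a made-up date (the benchmark applies it to field <code>$6</code>):</p>

```shell
# M/D/YYYY in, zero-padded YYYYMMDD out.
echo '3/7/2014' |
awk '{ split($0, a, "/"); printf "%d%02d%02d\n", a[3], a[1], a[2] }'
# prints 20140307
```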
<pre><code class="lang-plaintext">|  #6  |  Runtime [s]  | Peak Mem [MB] |  RT  |  PM  |
|------|---------------|---------------|------|------|
| gawk | 4.106 ± 0.043 | 551.14 ± 0.21 | 1.95 | 1.98 |
| mawk | 2.107 ± 0.016 | 290.76 ± 0.27 | 1.00 | 1.04 |
| nawk | 3.434 ± 0.019 | 278.87 ± 0.26 | 1.63 | 1.00 |
</code></pre>
<h1 id="heading-results"><strong>Results</strong></h1>
<h2 id="heading-geometric-mean-and-normalization"><strong>Geometric Mean and Normalization</strong></h2>
<p>To provide a representative comparison across multiple benchmarks, the <a target="_blank" href="https://en.wikipedia.org/wiki/Geometric_mean">Geometric Mean</a> for the normalized RT and PM values was calculated. The geometric mean is the mathematically appropriate choice for averaging ratios or normalized values, as it ensures that relative improvements are weighted consistently across all tests. Unlike the arithmetic mean, this approach prevents outliers in absolute execution time from disproportionately skewing the aggregate performance profile.</p>
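<p>The geometric mean of n values is the n-th root of their product, conveniently computed as the exponential of the mean of logarithms. A minimal sketch with made-up ratios (2 and 8 average to 4 geometrically, versus 5 arithmetically):</p>

```shell
# Geometric mean = exp(mean of ln(x)); the appropriate average for ratios.
printf '%s\n' 2 8 |
awk '{ n++; s += log($1) } END { printf "%.2f\n", exp(s / n) }'
# prints 4.00
```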
<h2 id="heading-evaluation-metrics"><strong>Evaluation Metrics</strong></h2>
<p>To synthesize these normalized results into a single actionable score, I have applied two evaluation metrics:</p>
<ul>
<li><p><strong>Euclidean Distance (d)</strong>: Measures the geometric distance from the "Ideal Point" (1,1). A lower d indicates a more balanced implementation that is close to being the best in both speed and memory simultaneously.</p>
</li>
<li><p><strong>Resource Footprint (F)</strong>: Calculated as RT×PM. This represents the total resource footprint; lower values indicate a more efficient use of system resources to complete the same task.</p>
</li>
</ul>
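<p>Both scores follow directly from the normalized (RT, PM) pair. Recomputing them from gawk's rounded values as a sanity check (small differences from the published figures, e.g. F = 3.53 versus 3.51, come from rounding the inputs first):</p>

```shell
# d = distance from the ideal point (1,1); F = RT * PM.
echo '1.80 1.96' |
awk '{ rt = $1; pm = $2
       d = sqrt((rt - 1)^2 + (pm - 1)^2)
       printf "d = %.2f, F = %.2f\n", d, rt * pm }'
# prints: d = 1.25, F = 3.53
```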
<h2 id="heading-summary-table">Summary Table</h2>
<p>The following table summarizes the overall performance of the three AWK engines based on the geometric mean of all normalized benchmarks:</p>
<pre><code class="lang-plaintext">| Summary |  RT  |  PM  |  d   |  F   |
|---------|------|------|------|------|
| gawk    | 1.80 | 1.96 | 1.25 | 3.51 |
| mawk    | 1.00 | 1.31 | 0.31 | 1.31 |
| nawk    | 2.13 | 1.00 | 1.13 | 2.13 |
</code></pre>
<p><strong>Definitions - RT</strong>: Normalized Runtime; <strong>PM</strong>: Normalized Peak Memory; <strong>d</strong>: Euclidean Distance; <strong>F</strong>: Resource Footprint</p>
<h1 id="heading-discussion"><strong>Discussion</strong></h1>
<p>The benchmarking results across six diverse objectives show a clear and consistent performance profile for each implementation. <strong>mawk</strong> was the fastest in every benchmark, while <strong>nawk</strong> maintained the lowest memory footprint; <strong>gawk</strong>, by contrast, exhibited the highest memory usage throughout. gawk does, however, demonstrate more consistent relative speed than <strong>nawk</strong>: even when finishing second or third, it avoids the sharp slowdowns that <strong>nawk</strong> shows in some tests. While <strong>nawk</strong> is fast at arithmetic and simple field processing, it is significantly slower at regex and string operations and at complex array management.</p>
<p>These individual performance patterns serve as the foundation for my aggregate metrics, where the trade-off between speed and memory is formally quantified.</p>
<p>While the Euclidean distance (d) provides a useful preliminary indication of effectiveness, relying on it alone can be misleading. For instance, the Euclidean Distances for <strong>gawk</strong> (1.25) and <strong>nawk</strong> (1.13) are relatively close, yet their Resource Footprints (F) reveal a significant disparity: gawk consumes nearly 65% more total resources.</p>
<p>This limitation necessitates a more robust analysis via the <a target="_blank" href="https://en.wikipedia.org/wiki/Pareto_front">Pareto frontier</a>.</p>
<p>To visualize the trade-offs, I plotted the normalized values on a 2D coordinate system where the x-axis represents the normalized runtime (RT) and the y-axis represents normalized peak memory (PM). The "Ideal Point" is located at (1,1), representing an implementation that is simultaneously the fastest and the most memory-efficient.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769614439027/e65bad4b-7119-4e31-b493-504065b11e93.png" alt class="image--center mx-auto" /></p>
<p><strong>Graph</strong>: <em>The Pareto Frontier of AWK implementations: Visualizing the optimal equilibrium between execution speed and memory footprint.</em></p>
<p>The Pareto frontier represents the boundary of "non-dominated" solutions—implementations where you cannot improve one metric (like speed) without degrading another (like memory). In this study, <strong>mawk</strong> and <strong>nawk</strong> define the frontier: <strong>mawk</strong> is the choice for raw speed, while <strong>nawk</strong> is the choice for minimal footprint. gawk, however, is positioned away from this boundary; because it is slower than <strong>mawk</strong> and uses more memory than <strong>nawk</strong>, it is considered "dominated" and sub-optimal in terms of raw resource efficiency.</p>
<h1 id="heading-conclusion"><strong>Conclusion</strong></h1>
<p>The data confirms that the "best" AWK implementation is a calculated trade-off between throughput and resource overhead. Within the Unix philosophy of choosing the right tool for the job, each engine serves a distinct operational profile.</p>
<ul>
<li><p><strong>mawk</strong> is the powerhouse for high-volume data. If your primary bottleneck is execution speed, its bytecode engine is unrivaled. It consistently defines the leading edge of the Pareto frontier, delivering the highest performance-to-resource ratio.</p>
</li>
<li><p><strong>nawk</strong> is the go-to for minimalist environments. While it prioritizes simplicity over the heavy lifting of complex regex or string manipulation, its memory footprint is remarkably small and predictable. It is the definitive choice for systems where memory overhead is a strictly limited resource.</p>
</li>
<li><p><strong>gawk</strong> offers a more nuanced value proposition. While it is mathematically dominated by its rivals, that overhead pays for a much broader feature set which can outweigh its increased resource consumption.</p>
</li>
</ul>
<p>Across various workflows—from data science pipelines to system automation—<strong>mawk</strong> provides the highest performance return for most standard tasks. Ultimately, these results show that the choice of engine should be a deliberate decision: use <strong>mawk for speed</strong>, <strong>nawk for a light footprint</strong>, and <strong>gawk when you need its extended toolkit.</strong></p>
]]></content:encoded></item></channel></rss>