When RAM Matters: Memory Efficiency of AWK Variants

The AWK scripting language emerged from Bell Labs in 1977, named for its creators Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is still widely used today; as a core tool, it is available on any Unix or Unix-like system (Linux, the BSDs, macOS, etc.). It is a compact, domain-specific language for text processing: AWK reads input line by line, splits each line into fields, and executes code when patterns match. No explicit loops are needed for reading data; the program focuses on what to do with each record, not how to traverse the file. This makes it exceptionally effective for rapid ad-hoc data analysis and transformation, as well as for filtering and more complex operations within pipelines. AWK is Turing-complete and can handle logic well beyond simple pattern matching.
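The pattern-action model can be seen in a minimal sketch (the log-like input here is hypothetical; any whitespace-separated data works the same way):

```shell
# Print field 2 of every line whose field 3 exceeds 500. There is no
# explicit read loop: awk iterates over records and fields implicitly.
printf 'GET /index.html 200\nGET /big.iso 503\n' |
  awk '$3 > 500 { print $2 }'
# -> /big.iso
```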
While the AWK language is POSIX standard, it exists in several distinct implementations, most notably:
- gawk (GNU Awk): The feature-rich version with extensions beyond POSIX, maintained by Arnold Robbins. Default in Arch Linux, RHEL, Fedora.
- mawk (Mike Brennan’s Awk): An efficiency-oriented implementation using a bytecode interpreter, currently maintained by Thomas Dickey. Default in Debian and many of its derivatives.
- nawk (The "One True Awk"): The original implementation from the language’s creators, maintained by Brian Kernighan. Default in the BSDs and macOS.
In most Linux distributions, the awk command is a symbolic link to a specific implementation. You can verify which variant is being used with:
ls -l $(which awk)
AWK has a place in modern data pipelines as an effective Phase 2 pre-filter: it is schema-agnostic, low footprint, zero-setup, and readily available (see article). It is suitable for the earliest stage of validation, as a first-pass filter, before any format-specific interpretation.
AWK can operate in two fundamentally different modes with respect to memory usage:
Streaming operations maintain constant memory usage regardless of file size. A multi-hundred-gigabyte file can be inspected using the same resources as a kilobyte sample. This makes AWK effective for null rate checks, schema validation, and range or boundary verification on datasets that exceed available memory.
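A streaming check of this kind can be sketched as a null-rate probe on one CSV column (the column index and the empty-field convention are assumptions for illustration); only per-record work is done, so memory stays constant:

```shell
# Count empty values in column 3 of a CSV with a header row.
# No state grows with input size, so this streams any file.
printf 'a,b,c\n1,,3\n4,5,\n7,,9\n' |
  awk -F, 'NR > 1 && $3 == "" { n++ } END { printf "null rate: %d/%d\n", n, NR-1 }'
# -> null rate: 1/3
```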
Stateful operations, however, require accumulating data in memory. This can take several forms: populating associative arrays for deduplication (!x[$0]++) or field distribution analysis (x[NF]++), loading records into indexed arrays for multi-pass processing, or concatenating strings to build aggregate outputs. For these operations, memory efficiency matters, and implementation differences between AWK variants become significant.
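The two idioms mentioned above, in runnable form on toy input:

```shell
# Deduplication: !x[$0]++ is true only the first time a line appears.
printf 'a\nb\na\nc\nb\n' | awk '!x[$0]++'
# -> a, b, c (one per line)

# Field-distribution analysis: count records per field count
# (sorted, since awk's for-in iteration order is unspecified).
printf '1,2,3\n1,2\n1,2,3\n' | awk -F, '{ x[NF]++ } END { for (n in x) print n, x[n] }' | sort
# -> "2 1" and "3 2": one 2-field record, two 3-field records
```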
This article evaluates the memory efficiency of gawk, mawk, and nawk in stateful operations, as a function of input file size.
Benchmarking Approach
The benchmarking evaluates memory consumption patterns for four different stateful operation scenarios in AWK when processing CSV data. The focus is on memory usage during data population, with no additional processing or operations performed; this allows direct measurement of how different data storage strategies impact memory footprint. In addition to memory usage, execution time was also measured. Resource tracking was performed using cgmemtime, which is well suited to this purpose because it captures peak memory consumption for the entire process group. The benchmarking process was automated via my custom runner, which handles warmups and multiple test runs, and calculates statistical metrics as well as normalized parameters for comparative analysis. For details see my BEHILOS Benchmark article.
Test Dataset
The benchmarking uses CSV files with a consistent structure of 14 fields per row. To observe memory scaling behavior, 7 file sizes were tested, ranging from 1,000 rows to 10 million rows (120 KB to 1.2 GB). The CSV test files are available here. The 10M row file was generated by concatenating two copies of the 5M file.
| File name | Rows | File size [MB] |
|---|---|---|
| sales1K.csv | 1K | 0.12 |
| sales10K.csv | 10K | 1.2 |
| sales100K.csv | 100K | 12 |
| sales500K.csv | 500K | 60 |
| sales1.5M.csv | 1.5M | 178 |
| sales5M.csv | 5M | 595 |
| sales10M.csv | 10M | 1190 |
Test Environment
Tests were conducted on an Arch Linux workstation powered by a Ryzen 9 5900X CPU with 64GB of RAM, using the Alacritty terminal within a dwm session.
The following table provides a summary of the specific versions and main characteristics of the three AWK implementations tested:
| Name | Version | Binary Size | Installed Size | --csv | UTF-8 | Extensions |
|---|---|---|---|---|---|---|
| gawk | 5.3.2 | 853 kB | 3.60 MB | yes | yes | yes |
| mawk | 1.3.4 20260129 | 179 kB | 206 kB | no | no | no |
| nawk | 20251225 | 139 kB | 145 kB | yes | yes | no |
The Benchmarks
Four benchmarks were applied. They represent common patterns for storing CSV data in AWK, each with different memory characteristics and use cases.
Each benchmark sequence included one initial warmup run followed by three recorded runs. Normalized parameters are based on the median, with 1.0 as the baseline (e.g. the lowest peak memory or runtime).
Benchmark #1: Store entire lines in array
x[NR]=$0
This is the simplest storage method and keeps the original line intact without parsing individual fields. The memory footprint includes the full text of each line including all field separators. This method is commonly used when you need to preserve the exact input for later processing or output, or when you need random access to complete lines.
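A typical use of this pattern, shown on toy input, is random or reverse access in the END block:

```shell
# Store every line, then emit them in reverse order. This only works
# because the entire input is held in memory (Benchmark #1's pattern).
printf '1\n2\n3\n' | awk '{ x[NR] = $0 } END { for (i = NR; i >= 1; i--) print x[i] }'
# -> 3, 2, 1 (one per line)
```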
Results of Benchmark #1
Benchmark #1 Result Table
File / variant Runtime [s] Peak Memory [MB]
mean ± sdev min median max Jtr% mean ± sdev min median max Jtr%
sales1K.csv
gawk 0.0016 ± 0.0001 0.0016 0.0016 0.0017 1.1 0.75 ± 0.02 0.73 0.74 0.77 1.4
mawk 0.0010 ± 0.0001 0.0009 0.0010 0.0011 1.3 0.82 ± 0.15 0.73 0.74 1.00 11.6
nawk 0.0015 ± 0.0001 0.0015 0.0015 0.0016 2.3 0.57 ± 0.15 0.48 0.49 0.74 16.8
sales10K.csv
gawk 0.0048 ± 0.0001 0.0047 0.0049 0.0049 0.9 3.07 ± 0.15 2.98 2.99 3.25 2.8
mawk 0.0027 ± 0.0002 0.0026 0.0028 0.0028 1.2 2.74 ± 0.15 2.73 2.74 2.74 0.0
nawk 0.0089 ± 0.0003 0.0087 0.0087 0.0091 1.5 2.85 ± 0.19 2.73 2.83 2.98 0.6
sales100K.csv
gawk 0.0330 ± 0.0008 0.0324 0.0328 0.0339 0.7 25.74 ± 0.15 25.73 25.74 25.74 0.0
mawk 0.0174 ± 0.0014 0.0163 0.0169 0.0189 3.0 21.49 ± 0.15 21.48 21.49 21.49 0.0
nawk 0.0766 ± 0.0020 0.0750 0.0760 0.0789 0.9 23.57 ± 0.35 23.24 23.74 23.74 0.7
sales500K.csv
gawk 0.1488 ± 0.0022 0.1473 0.1480 0.1511 0.5 125.49 ± 0.29 125.23 125.48 125.74 0.0
mawk 0.1025 ± 0.0048 0.0987 0.1012 0.1076 1.3 105.24 ± 0.29 104.99 105.25 105.48 0.0
nawk 0.3974 ± 0.0063 0.3918 0.3966 0.4038 0.2 120.19 ± 0.38 120.02 120.27 120.27 0.1
sales1.5M.csv
gawk 0.4399 ± 0.0036 0.4367 0.4406 0.4422 0.2 374.99 ± 0.58 374.49 374.98 375.49 0.0
mawk 0.3360 ± 0.0159 0.3192 0.3403 0.3486 1.3 313.24 ± 0.52 312.98 313.00 313.74 0.1
nawk 1.1826 ± 0.0132 1.1753 1.1766 1.1959 0.5 346.26 ± 0.45 346.02 346.26 346.51 0.0
sales5M.csv
gawk 1.4856 ± 0.0037 1.4848 1.4853 1.4868 0.0 1252.07 ± 0.65 1251.73 1252.23 1252.23 0.0
mawk 1.1790 ± 0.0185 1.1696 1.1788 1.1885 0.0 1042.48 ± 0.58 1042.23 1042.48 1042.73 0.0
nawk 3.9644 ± 0.0155 3.9555 3.9660 3.9717 0.0 1156.11 ± 0.47 1156.02 1156.03 1156.27 0.0
sales10M.csv
gawk 2.9759 ± 0.0070 2.9691 2.9783 2.9802 0.1 2507.74 ± 0.69 2507.49 2507.74 2507.99 0.0
mawk 2.4046 ± 0.0220 2.3926 2.4044 2.4166 0.0 2083.90 ± 0.60 2083.73 2083.98 2083.98 0.0
nawk 8.0351 ± 0.0158 8.0334 8.0337 8.0383 0.0 2361.94 ± 0.70 2361.53 2361.78 2362.52 0.0
Summary Table
Benchmark #1 Summary Table
File size rt [s] pm [MB]
[MB] gawk mawk nawk gawk mawk nawk
----------------------------------------------------------
0.12 0.0016 0.0010 0.0015 0.74 0.74 0.49
1.2 0.0049 0.0028 0.0087 2.99 2.74 2.83
12 0.0328 0.0169 0.0760 25.74 21.49 23.74
60 0.1480 0.1012 0.3966 125.48 105.25 120.27
178 0.4406 0.3403 1.1766 374.98 313.00 346.26
595 1.4853 1.1788 3.9660 1252.23 1042.48 1156.03
1190 2.9783 2.4044 8.0337 2507.74 2083.98 2361.78
Normalized results: RT (normalized runtime) and MO (memory overhead)
Benchmark #1 Normalized Results
File size RT MO
[MB] gawk mawk nawk gawk mawk nawk
----------------------------------------------------
0.12 1.6 1.0 1.5 6.2 6.2 4.1
1.2 1.8 1.0 3.1 2.5 2.3 2.4
12 1.9 1.0 4.5 2.1 1.8 2.0
60 1.5 1.0 3.9 2.1 1.8 2.0
178 1.3 1.0 3.5 2.1 1.8 1.9
595 1.3 1.0 3.4 2.1 1.8 1.9
1190 1.2 1.0 3.3 2.1 1.8 2.0
Benchmark #2: Populate 2D matrix
for (i=1; i<=NF; i++) x[NR,i] = $i
AWK simulates 2D arrays by concatenating keys with a built-in separator (SUBSEP), so x[row, col] is stored internally as x[row SUBSEP col]. This approach provides indexed access to individual fields and is useful when you need to perform operations on specific columns across all rows. The memory overhead includes both the field data and the composite key structures.
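The composite key can be made visible by iterating the array and splitting each key on SUBSEP (which defaults to "\034"), as in this small sketch:

```shell
# Show that x[row, col] keys are single strings whose parts are
# joined by SUBSEP; output is sorted since for-in order is unspecified.
echo 'a,b' | awk -F, '{ for (i = 1; i <= NF; i++) x[NR, i] = $i }
  END { for (k in x) { split(k, p, SUBSEP); print p[1], p[2], x[k] } }' | sort
# -> "1 1 a" and "1 2 b"
```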
Results of Benchmark #2
Benchmark #2 Result Table
File / variant Runtime [s] Peak Memory [MB]
mean ± sdev min median max Jtr% mean ± sdev min median max Jtr%
sales1K.csv
gawk 0.0087 ± 0.0003 0.0084 0.0086 0.0090 0.7 5.24 ± 0.00 5.24 5.24 5.24 0.0
mawk 0.0066 ± 0.0011 0.0058 0.0061 0.0078 7.8 2.15 ± 0.29 1.98 1.98 2.49 8.5
nawk 0.0090 ± 0.0007 0.0081 0.0094 0.0094 4.5 2.33 ± 0.13 2.23 2.28 2.48 2.2
sales10K.csv
gawk 0.0662 ± 0.0006 0.0657 0.0661 0.0668 0.2 46.07 ± 0.14 45.99 45.99 46.24 0.2
mawk 0.0511 ± 0.0027 0.0491 0.0503 0.0539 1.6 16.32 ± 0.32 16.24 16.24 16.48 0.5
nawk 0.0820 ± 0.0010 0.0814 0.0821 0.0827 0.0 19.67 ± 0.20 19.59 19.59 19.84 0.4
sales100K.csv
gawk 0.7788 ± 0.0262 0.7500 0.7852 0.8011 0.8 452.65 ± 0.20 452.48 452.73 452.74 0.0
mawk 0.9017 ± 0.0144 0.8931 0.8941 0.9180 0.9 156.74 ± 0.32 156.73 156.73 156.74 0.0
nawk 0.7998 ± 0.0117 0.7911 0.7953 0.8131 0.6 178.69 ± 0.25 178.52 178.77 178.78 0.0
sales500K.csv
gawk 4.5791 ± 0.0277 4.5695 4.5800 4.5878 0.0 2261.98 ± 0.48 2261.72 2261.73 2262.48 0.0
mawk 6.3488 ± 0.0417 6.3037 6.3686 6.3742 0.3 785.24 ± 0.32 785.23 785.23 785.24 0.0
nawk 4.3709 ± 0.0170 4.3583 4.3714 4.3830 0.0 959.94 ± 0.46 959.52 960.04 960.26 0.0
sales1.5M.csv
gawk 14.8738 ± 0.1380 14.7941 14.7974 15.0298 0.5 6775.31 ± 0.70 6774.75 6775.47 6775.72 0.0
mawk 19.7716 ± 0.0491 19.7417 19.7862 19.7870 0.1 2356.07 ± 0.50 2355.73 2355.98 2356.48 0.0
nawk 12.1783 ± 0.0873 12.1189 12.1395 12.2765 0.3 2677.02 ± 0.52 2676.77 2677.02 2677.27 0.0
sales5M.csv
gawk 50.6685 ± 0.1408 50.6431 50.6636 50.6989 0.0 22592.04 ± 0.71 22591.95 22591.96 22592.20 0.0
mawk 72.4570 ± 0.1123 72.3429 72.4932 72.5348 0.0 7963.89 ± 0.52 7963.73 7963.96 7963.99 0.0
nawk 40.7116 ± 0.2032 40.5819 40.6313 40.9215 0.2 8988.45 ± 0.65 8988.01 8988.57 8988.76 0.0
sales10M.csv
gawk 101.8983 ± 0.2029 101.7380 101.9330 102.0240 0.0 45182.70 ± 0.71 45182.70 45182.71 45182.71 0.0
mawk 150.8563 ± 0.2839 150.6080 150.8330 151.1280 0.0 15965.48 ± 0.84 15964.98 15965.23 15966.23 0.0
nawk 84.8894 ± 0.4522 84.5106 84.8431 85.3145 0.1 18777.15 ± 0.68 18776.93 18777.25 18777.26 0.0
Summary Table
Benchmark #2 Summary Table
File size rt [s] pm [MB]
[MB] gawk mawk nawk gawk mawk nawk
----------------------------------------------------------------
0.12 0.0086 0.0061 0.0094 5.24 1.98 2.28
1.2 0.0661 0.0503 0.0821 45.99 16.24 19.59
12 0.7852 0.8941 0.7953 452.73 156.73 178.77
60 4.5800 6.3686 4.3714 2261.73 785.23 960.04
178 14.7974 19.7862 12.1395 6775.47 2355.98 2677.02
595 50.6636 72.4932 40.6313 22591.96 7963.96 8988.57
1190 101.9330 150.8330 84.8431 45182.71 15965.23 18777.25
Normalized results: RT (normalized runtime) and MO (memory overhead)
Benchmark #2 Normalized Results
File size RT MO
[MB] gawk mawk nawk gawk mawk nawk
---------------------------------------------------
0.12 1.4 1.0 1.5 43.7 16.5 19.0
1.2 1.3 1.0 1.6 38.3 13.5 16.3
12 1.0 1.1 1.0 37.7 13.1 14.9
60 1.0 1.5 1.0 37.7 13.1 16.0
178 1.2 1.6 1.0 38.1 13.2 15.0
595 1.2 1.8 1.0 38.0 13.4 15.1
1190 1.2 1.8 1.0 38.0 13.4 15.8
Benchmark #3: Populate 1D array for each field
x1[NR]=$1; x2[NR]=$2; x3[NR]=$3; ... x14[NR]=$14
This creates 14 independent hash table structures in memory, avoiding the composite key overhead of the 2D approach. This method is efficient when you frequently access all values of a particular field, as each field's data is stored contiguously in its own array structure. The tradeoff is managing multiple array variables instead of a single unified structure.
In Benchmark #3 gawk's native array of arrays feature was also tested:
for (i=1; i<=NF; i++) x[NR][i]=$i
This creates a true nested structure where each row is a parent array containing 14 child elements.
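The nested rows can be inspected directly; this sketch requires gawk (the x[NR][i] syntax is an error in mawk and nawk):

```shell
# gawk-only: true nested arrays, no SUBSEP key packing.
# Each row index holds a genuine subarray of its fields.
echo 'a,b' | gawk -F, '{ for (i = 1; i <= NF; i++) x[NR][i] = $i }
  END { print length(x[1]), x[1][2] }'
# -> "2 b": row 1 is itself an array holding 2 elements
```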
Results
Benchmark #3 Result Table
File / variant Runtime [s] Peak Memory [MB]
mean ± sdev min median max Jtr% mean ± sdev min median max Jtr%
sales1K.csv
gawk 0.0054 ± 0.0004 0.0051 0.0052 0.0059 4.0 2.83 ± 0.14 2.74 2.77 2.99 2.2
mawk 0.0027 ± 0.0000 0.0026 0.0027 0.0027 0.1 1.82 ± 0.14 1.74 1.74 1.99 4.8
nawk 0.0053 ± 0.0001 0.0052 0.0053 0.0053 0.8 2.23 ± 0.00 2.23 2.23 2.24 0.0
gawk* 0.0066 ± 0.0004 0.0062 0.0068 0.0070 1.9 3.76 ± 0.07 3.70 3.74 3.84 0.6
sales10K.csv
gawk 0.0362 ± 0.0008 0.0355 0.0362 0.0369 0.0 21.40 ± 0.20 21.23 21.49 21.49 0.4
mawk 0.0176 ± 0.0011 0.0164 0.0179 0.0184 1.8 13.07 ± 0.20 12.98 12.99 13.23 0.6
nawk 0.0413 ± 0.0006 0.0408 0.0413 0.0419 0.0 19.07 ± 0.15 18.98 18.99 19.24 0.4
gawk* 0.0489 ± 0.0016 0.0471 0.0497 0.0498 1.7 31.49 ± 0.45 31.24 31.24 32.00 0.8
sales100K.csv
gawk 0.3390 ± 0.0047 0.3337 0.3406 0.3426 0.5 206.16 ± 0.43 205.74 206.25 206.48 0.0
mawk 0.2401 ± 0.0100 0.2288 0.2444 0.2472 1.7 125.74 ± 0.32 125.49 125.73 125.99 0.0
nawk 0.4159 ± 0.0036 0.4119 0.4177 0.4183 0.4 178.06 ± 0.40 177.73 177.96 178.47 0.1
gawk* 0.4581 ± 0.0026 0.4563 0.4578 0.4602 0.1 308.23 ± 0.51 307.98 308.24 308.48 0.0
sales500K.csv
gawk 1.6668 ± 0.0108 1.6607 1.6618 1.6780 0.3 1023.65 ± 0.57 1023.24 1023.74 1023.98 0.0
mawk 1.2514 ± 0.0122 1.2445 1.2510 1.2586 0.0 621.15 ± 0.35 620.99 621.23 621.24 0.0
nawk 2.4466 ± 0.0109 2.4348 2.4523 2.4528 0.2 946.89 ± 0.49 946.56 947.04 947.07 0.0
gawk* 2.2579 ± 0.0027 2.2572 2.2582 2.2583 0.0 1537.32 ± 0.53 1537.24 1537.24 1537.49 0.0
sales1.5M.csv
gawk 5.0319 ± 0.0295 5.0004 5.0445 5.0509 0.2 3064.90 ± 0.69 3064.48 3064.98 3065.24 0.0
mawk 3.5495 ± 0.0221 3.5303 3.5515 3.5669 0.1 1847.24 ± 0.56 1846.73 1847.48 1847.49 0.0
nawk 6.3204 ± 0.0325 6.2918 6.3166 6.3527 0.1 2664.61 ± 0.53 2664.46 2664.56 2664.82 0.0
gawk* 6.7958 ± 0.0173 6.7846 6.7873 6.8155 0.1 4609.57 ± 0.93 4608.73 4609.73 4610.24 0.0
sales5M.csv
gawk 17.2646 ± 0.0952 17.1816 17.2512 17.3611 0.1 10289.49 ± 0.85 10288.99 10289.49 10289.98 0.0
mawk 12.3195 ± 0.0415 12.2834 12.3215 12.3537 0.0 6174.32 ± 0.76 6173.74 6174.48 6174.74 0.0
nawk 22.0794 ± 0.1297 21.9774 22.0413 22.2196 0.2 8938.40 ± 0.55 8938.31 8938.32 8938.57 0.0
gawk* 22.7869 ± 0.1159 22.6546 22.8511 22.8549 0.3 15367.34 ± 0.94 15367.23 15367.27 15367.50 0.0
sales10M.csv
gawk 34.8630 ± 0.2366 34.6662 34.8277 35.0951 0.1 20633.40 ± 0.90 20633.23 20633.24 20633.74 0.0
mawk 24.6822 ± 0.0494 24.6660 24.6677 24.7130 0.1 12346.65 ± 0.82 12346.48 12346.49 12346.98 0.0
nawk 48.6894 ± 0.1791 48.5470 48.7520 48.7691 0.1 18576.88 ± 0.57 18576.79 18576.80 18577.05 0.0
gawk* 45.8100 ± 0.1638 45.7327 45.7543 45.9431 0.1 30737.79 ± 1.19 30736.99 30737.99 30738.39 0.0
Summary Table
Benchmark #3 Summary Table
File size rt [s] pm [MB]
[MB] gawk mawk nawk gawk* gawk mawk nawk gawk*
----------------------------------------------------------------------------------
0.12 0.0052 0.0027 0.0053 0.0068 2.77 1.74 2.23 3.74
1.2 0.0362 0.0179 0.0413 0.0497 21.49 12.99 18.99 31.24
12 0.3406 0.2444 0.4177 0.4578 206.25 125.73 177.96 308.24
60 1.6618 1.2510 2.4523 2.2582 1023.74 621.23 947.04 1537.24
178 5.0445 3.5515 6.3166 6.7873 3064.98 1847.48 2664.56 4609.73
595 17.2512 12.3215 22.0413 22.8511 10289.49 6174.48 8938.32 15367.27
1190 34.8277 24.6677 48.7520 45.7543 20633.24 12346.49 18576.80 30737.99
Normalized results: RT (normalized runtime) and MO (memory overhead)
Benchmark #3 Normalized Results
File size RT MO
[MB] gawk mawk nawk gawk* gawk mawk nawk gawk*
-----------------------------------------------------------------
0.12 1.9 1.0 2.5 2.0 23.1 14.5 18.6 31.2
1.2 2.0 1.0 2.8 2.3 17.9 10.8 15.8 26.0
12 1.4 1.0 1.9 1.7 17.2 10.5 14.8 25.7
60 1.3 1.0 1.8 2.0 17.1 10.4 15.8 25.6
178 1.4 1.0 1.9 1.8 17.2 10.4 15.0 25.9
595 1.4 1.0 1.9 1.8 17.3 10.4 15.0 25.8
1190 1.4 1.0 1.9 2.0 17.3 10.4 15.6 25.8
Benchmark #4: Concatenate entire data in one string
x = x $0
Each line is appended to the existing string, creating progressively larger string values. This pattern can be useful when building complete records for batch output, log aggregation or creating hash/checksum input.
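When the target variant lacks an optimized append path, a common mitigation (a sketch, not part of the benchmark itself) is to grow a small buffer and splice it into the big string only periodically, cutting the number of large reallocations:

```shell
# Chunked append: flush a small buffer into the big string every
# 1000 lines instead of reallocating the full string per line.
printf 'a\nb\nc\n' | awk '
  { buf = buf $0 "\n"; if (NR % 1000 == 0) { x = x buf; buf = "" } }
  END { x = x buf; print length(x) }'
# -> 6 (the length of "a\nb\nc\n")
```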
Results
Benchmark #4 Result Table
File / variant Runtime [s] Peak Memory [MB]
mean ± sdev min median max Jtr% mean ± sdev min median max Jtr%
sales1K.csv 0.12
gawk 0.0016 ± 0.0001 0.0016 0.0017 0.0017 1.8 0.75 ± 0.02 0.74 0.74 0.78 1.8
mawk 0.0058 ± 0.0003 0.0054 0.0059 0.0061 0.9 0.98 ± 0.07 0.94 0.95 1.07 3.7
nawk 0.0084 ± 0.0003 0.0081 0.0084 0.0086 0.1 1.06 ± 0.11 0.94 1.08 1.15 2.1
sales10K.csv 1.2
gawk 0.0041 ± 0.0001 0.0040 0.0040 0.0042 0.8 1.98 ± 0.02 1.98 1.98 1.98 0.1
mawk 1.2520 ± 0.0030 1.2485 1.2533 1.2541 0.1 5.18 ± 0.16 5.02 5.25 5.29 1.2
nawk 1.4993 ± 0.0047 1.4954 1.4979 1.5046 0.1 6.30 ± 0.27 6.03 6.36 6.51 0.9
sales100K.csv 12
gawk 0.0239 ± 0.0002 0.0238 0.0238 0.0241 0.3 12.73 ± 0.02 12.72 12.73 12.73 0.0
mawk 57.3232 ± 0.4294 56.8661 57.3852 57.7182 0.1 37.42 ± 0.95 36.46 37.46 38.33 0.1
nawk 92.1290 ± 0.9580 91.0598 92.4178 92.9094 0.3 49.94 ± 1.15 49.00 49.64 51.17 0.6
sales500K.csv 60
gawk 0.1075 ± 0.0015 0.1059 0.1080 0.1087 0.4 60.05 ± 0.13 59.97 59.98 60.21 0.1
mawk 1479.21 152.68
nawk 3854.76 180.05
sales1.5M.csv 178
gawk 0.3163 ± 0.0018 0.3155 0.3160 0.3174 0.1 178.63 ± 0.32 178.46 178.47 178.97 0.1
sales5M.csv 595
gawk 1.0447 ± 0.0041 1.0408 1.0454 1.0480 0.1 592.62 ± 0.43 592.45 592.45 592.96 0.0
sales10M.csv 1190
gawk 2.0706 ± 0.0041 2.0704 2.0706 2.0709 0.0 1184.01 ± 0.46 1183.92 1183.93 1184.19 0.0
Summary Table
Benchmark #4 Summary Table
File size rt [s] pm [MB]
[MB] gawk mawk nawk gawk mawk nawk
------------------------------------------------------------------------
0.12 0.0017 0.0062 0.0081 0.7 1.1 1.0
1.2 0.0040 1.2365 1.4840 2.0 5.1 6.4
12 0.0238 54.5716 88.8361 12.7 36.9 49.4
60 0.1080 1479.2100 3854.7600 60.0 152.7 180.1
178 0.3160 ---------- ---------- 178.5 ---------- ----------
595 1.0454 ---------- ---------- 592.5 ---------- ----------
1190 2.0706 ---------- ---------- 1183.9 ---------- ----------
Normalized results: RT (normalized runtime) and MO (memory overhead)
Benchmark #4 Normalized Results
File size RT MO
[MB] gawk mawk nawk gawk mawk nawk
-------------------------------------------------------
0.12 1.0 3.6 4.8 6.2 9.6 7.9
1.2 1.0 309.1 371.0 1.7 4.3 5.4
12 1.0 2292.9 3732.6 1.1 3.1 4.1
60 1.0 13696.4 35692.2 1.0 2.5 3.0
Discussion
For the comparative analysis normalized metrics were used:
MO (Memory Overhead): This represents the ratio of peak memory usage relative to the raw file size. For example, an MO of 2.0 means the process used exactly twice the RAM as the size of the data on disk. It allows for a direct comparison of memory efficiency regardless of the input file size.
RT (Normalized Runtime): This is the execution time scaled against the fastest variant's median runtime for the same file (baseline 1.0). It measures how much longer each engine takes relative to the fastest, providing a clear picture of speed across the AWK variants.
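As a worked example using the article's own Benchmark #1 figures for the 1190 MB file (mawk peak memory 2083.98 MB; gawk runtime 2.9783 s versus mawk's baseline 2.4044 s):

```shell
# MO = peak memory / file size; RT = runtime / fastest runtime.
awk 'BEGIN { printf "MO=%.1f RT=%.1f\n", 2083.98 / 1190, 2.9783 / 2.4044 }'
# -> MO=1.8 RT=1.2
```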
Benchmark #1
The data confirms that for the simple line-storage pattern (x[NR]=$0), memory consumption is a strictly linear function of the input file size across all three variants. As the data scales from 120KB to 1.2GB, the normalized memory overhead (MO) exhibits clear asymptotic behavior; the initial variance caused by interpreter startup costs (which peaked at 6.2x for the smallest file) stabilizes at higher volumes. By the 1.2GB mark, gawk and nawk settle at roughly 2.1x and 2.0x overhead relative to the raw file size, while mawk maintains a leaner 1.8x, proving to be the most memory-efficient engine for large-scale string retention.
In terms of runtime performance, mawk consistently dominated as the fastest variant, serving as the baseline (1.0) for all normalized runtime (RT) measurements. While gawk showed improving efficiency as the workload increased, dropping from 1.9x to 1.2x the runtime of mawk, nawk struggled significantly with this storage pattern, ending up 3.3x slower than mawk at the 10 million row limit. These results highlight that for pure data population tasks where preserving line integrity is key, mawk offers the best balance of speed and a minimized memory footprint.
Benchmark #2
The data for the 2D matrix population x[NR, i] = $i shows a massive increase in resource requirements compared to simple line storage, though the peak memory remains a strictly linear function of the file size. As the dataset scales toward 1.2GB, the normalized memory overhead reaches an asymptotic state where the initial interpreter costs become negligible. In this scenario, mawk proves to be the most memory-efficient by far, stabilizing at a memory overhead of 13.4. In contrast, gawk is exceptionally heavy for this storage pattern, requiring 38 times the raw file size in RAM, which is nearly triple the footprint of mawk.
The runtime performance results reveal a significant shift in execution efficiency as the number of array elements grows. While mawk is the fastest for small files, its performance degrades significantly at scale, eventually becoming the slowest variant with a normalized runtime of 1.8. Conversely, nawk emerges as the performance leader for large-scale matrix population, maintaining the baseline speed of 1.0 at high volumes. These results illustrate a clear trade-off: mawk is the optimal choice for minimizing the memory footprint in massive stateful operations, but nawk offers superior throughput when processing tens of millions of discrete fields.
Benchmark #3
In Benchmark #3, using 14 independent 1D arrays proves significantly more memory-efficient than the 2D composite key approach across all variants. The peak memory usage remains a linear function of file size, with normalized memory overhead (MO) reaching a steady state quickly. mawk again demonstrates superior memory management, stabilizing at an MO of 10.4, roughly 40% below gawk’s 17.3 and a third below nawk’s 15.6. Interestingly, gawk's native array-of-arrays feature (gawk*) proved to be the most resource-intensive strategy in this test, with a stabilized MO of 25.8. This suggests that the internal overhead of managing nested objects in gawk is substantially higher than managing multiple flat hash tables.
Runtime-wise, mawk maintained its lead as the fastest variant, serving as the 1.0 baseline for all file sizes. gawk and nawk performed similarly at scale, with gawk finishing about 1.4 times slower than mawk, while nawk lagged at 1.9 times slower. Despite the structural elegance of gawk's nested arrays, the gawk* results showed no performance benefit over the 1D array method, consistently running about 2.0 times slower than mawk. For users requiring field-level access at scale, the strategy of multiple 1D arrays in mawk provides the best optimization of both execution speed and memory footprint.
Benchmark #4
Benchmark #4 reveals a dramatic divergence in performance, highlighting how differently the engines handle repeated string concatenation (x = x $0). In this scenario, gawk performs exceptionally, maintaining near-linear time complexity as the file size increases. This efficiency is due to gawk's optimized string management, which applies a smarter reallocation strategy than its counterparts. As the dataset scales to 60MB, gawk completes the task in about 0.1 seconds, whereas mawk and nawk suffer a quadratic performance collapse, taking approximately 24 minutes and 64 minutes respectively. Due to these extreme runtime requirements, mawk and nawk were not tested for file sizes larger than 60MB. For any workflow involving large-scale string building, gawk is the only viable option among the three.
The memory overhead data also shows an interesting reversal of the previous benchmarks' trends. While mawk and nawk struggle with time, they also carry higher memory overhead, at 2.5x and 3.0x the file size respectively at the 60MB mark. gawk’s memory usage, in contrast, remains extremely tight, approaching a 1.0 overhead ratio at the 60MB mark and beyond, effectively matching the raw file size. The massive RT (normalized runtime) values for mawk and nawk, reaching over 13,000x and 35,000x the duration of gawk, underscore a fundamental architectural difference: gawk is specifically optimized for efficient string appending, while the others suffer from costly repeated memory copying and reallocation.
Conclusion
This table summarizes the Memory Overhead (MO, the ratio of peak memory usage relative to the raw file size) of the four benchmarks. These values represent the stable multiplier of peak memory relative to file size once the dataset is large enough to make interpreter startup costs negligible.
Memory Overhead (MO) Summary Table
| Benchmark Scenario | gawk | mawk | nawk | Best Efficiency |
|---|---|---|---|---|
| #1: Store entire lines | 2.1 | 1.8 | 2.0 | mawk |
| #2: Populate 2D matrix | 38.0 | 13.4 | 15.8 | mawk |
| #3: 1D array per field | 17.3 | 10.4 | 15.6 | mawk |
| #4: String concatenation* | 1.0 | 2.5 | 3.0 | gawk |
*Note: Benchmark #4 values are taken from the 60MB file due to the runtime constraints of mawk and nawk.
Key Findings for the Article
The array efficiency gap: For stateful data population, mawk was consistently the most memory-efficient. In the 2D matrix test, it used nearly 3x less memory than gawk, highlighting its leaner internal representation of hash tables and strings.
Structure penalty: Breaking a CSV line into 14 discrete fields (Benchmark #3) increases memory overhead by approximately 5x to 8x compared to storing the line as a single string (Benchmark #1).
gawk’s specialization: While gawk is the heaviest variant for array-based storage, it is uniquely optimized for string management. It was the only variant where memory overhead effectively equaled the file size (1.0) during massive string concatenation, coupled with extremely fast execution.
The cost of "Array of Arrays": Though not in the summary table, the results for gawk (25.8 MO) show that native nested structures are significantly more expensive than multiple 1D arrays (17.3 MO), likely due to the overhead of allocating one nested array object per row instead of maintaining 14 flat hash tables.
In conclusion, these results demonstrate that using AWK in stateful mode requires careful consideration. While these benchmarks were conducted by populating the entire dataset to test engine limits, significant memory can be saved in practice by populating only the specific fields or records needed for the task. If RAM matters, mawk is the clear leader for population methods involving arrays or matrix simulations. However, for methods requiring large-scale string building, gawk remains the only viable alternative. Ultimately, selecting the right population method and the appropriate AWK variant is essential for maintaining stability and performance when processing large datasets.





