When RAM Matters: Memory Efficiency of AWK Variants


The AWK scripting language emerged from Bell Labs in 1977, named for its creators Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is still widely used today: as a core tool, it is available on virtually every Unix or Unix-like system (Linux, BSDs, macOS, etc.). It is a compact, domain-specific language for text processing. AWK reads input line by line, splits each line into fields, and executes code when patterns match. No explicit loops are needed for reading data; the program focuses on what to do with each record, not how to traverse the file. This makes it exceptionally effective for rapid ad-hoc data analysis and transformation, as well as for filtering and more complex operations within pipelines. AWK is Turing-complete and can handle logic well beyond simple pattern matching.

While the AWK language is POSIX standard, it exists in several distinct implementations, most notably:

  • gawk (GNU Awk): The feature-rich version with extensions beyond POSIX, maintained by Arnold Robbins. Default in Arch Linux, RHEL, Fedora.

  • mawk (Mike Brennan’s Awk): An efficiency-oriented implementation using a bytecode interpreter, currently maintained by Thomas Dickey. Default in Debian and many of its derivatives.

  • nawk (The "One True Awk"): The original implementation from the language’s creators, maintained by Brian Kernighan. Default in BSDs and macOS.

In most Linux distributions, the awk command is a symbolic link to a specific implementation. You can verify which variant is being used with:

ls -l $(which awk)
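On systems that route the command through an alternatives mechanism (Debian, for example, points /usr/bin/awk at /etc/alternatives/awk), a single ls -l shows only the first hop. A minimal sketch that follows the whole chain, assuming a readlink that supports -f; the variable names are illustrative:

```shell
# Resolve the full symlink chain behind the awk command.
# Assumes readlink supports -f (GNU coreutils and recent BSD/macOS releases).
awk_path=$(command -v awk)
awk_real=$(readlink -f "$awk_path")
echo "awk resolves to: $awk_real"

# Ask the binary to identify itself. Try --version first and fall back to
# -W version for builds that reject the long option.
{ awk --version 2>/dev/null || awk -W version 2>/dev/null; } | head -n 1
```

The version line is only a convenience; the resolved path is usually enough to tell gawk, mawk, and nawk apart.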

AWK has a place in modern data pipelines as an effective Phase 2 pre-filter: it is schema-agnostic, low footprint, zero-setup, and readily available (see article). It is suitable for the earliest stage of validation, as a first-pass filter, before any format-specific interpretation.

AWK can operate in two fundamentally different modes with respect to memory usage:

  1. Streaming operations maintain constant memory usage regardless of file size. A multi-hundred-gigabyte file can be inspected using the same resources as a kilobyte sample. This makes AWK effective for null rate checks, schema validation, and range or boundary verification on datasets that exceed available memory.

  2. Stateful operations, however, require accumulating data in memory. This can take several forms: populating associative arrays for deduplication (!x[$0]++) or field distribution analysis (x[NF]++), loading records into indexed arrays for multi-pass processing, or concatenating strings to build aggregate outputs. For these operations, memory efficiency matters, and implementation differences between AWK variants become significant.
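The two modes can be contrasted on a small, made-up sample. The sketch below is illustrative only: the sample rows and variable names are assumptions and are not part of the benchmark.

```shell
# Hypothetical sample: 5 rows, 3 fields, one empty field, one duplicate row.
sample=$(mktemp)
printf '%s\n' 'a,1,x' 'b,,y' 'a,1,x' 'c,3,z' 'd,4,w' > "$sample"

# Streaming: count rows whose 2nd field is empty. Only a counter is kept,
# so memory does not grow with the input.
null_rate=$(awk -F, '$2 == "" { empty++ } END { printf "%d/%d", empty, NR }' "$sample")

# Stateful: deduplicate. Every distinct line becomes a key in seen[],
# so memory grows with the number of unique lines.
unique=$(awk '!seen[$0]++ { n++ } END { print n }' "$sample")

echo "empty field 2: $null_rate, unique rows: $unique"
rm -f "$sample"
```

The first command would behave the same on a file of any size, while the second has to remember every distinct line it has seen.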

This article evaluates the memory efficiency of gawk, mawk, and nawk in stateful operations, as a function of input file size.

Benchmarking Approach

The benchmarks evaluate memory consumption for four stateful operation scenarios in AWK when processing CSV data. The focus is on memory usage during data population only, with no further processing performed, so the measurements show directly how each storage strategy affects the memory footprint. Execution time was measured as well. Resource tracking was performed with cgmemtime, which is well suited to this purpose because it captures peak memory consumption for the whole process group. The benchmarking process was automated via my custom runner, which handles warmups and multiple test runs and calculates statistical metrics as well as normalized parameters for comparative analysis. For details, see my BEHILOS Benchmark article.

Test Dataset

The benchmarks use CSV files with a consistent structure of 14 fields per row. To observe memory scaling behavior, seven file sizes were tested, ranging from 1,000 rows to 10 million rows (120 KB to 1.2 GB). The CSV test files are available here. The 10M row file was generated by concatenating the 5M file with itself.
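The doubling step and a quick structural check can be sketched as follows. This is a toy stand-in under stated assumptions: the three-row file, the temporary paths, and the variable names are hypothetical, and plain -F, splitting assumes no quoted commas inside fields.

```shell
# Tiny stand-in for the 5M file (hypothetical contents: 3 rows, 14 fields each).
workdir=$(mktemp -d)
row='1,2,3,4,5,6,7,8,9,10,11,12,13,14'
printf '%s\n' "$row" "$row" "$row" > "$workdir/small.csv"

# Double the file by concatenating it with itself.
cat "$workdir/small.csv" "$workdir/small.csv" > "$workdir/double.csv"

# Field-count distribution: one output line per distinct NF value.
dist=$(awk -F, '{ c[NF]++ } END { for (n in c) print n, c[n] }' "$workdir/double.csv" | sort -n)
echo "$dist"
rm -rf "$workdir"
```

A single output line (field count, then row count) indicates that every row has the same number of fields.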

File name       Rows    File size [MB]
sales1K.csv     1K      0.12
sales10K.csv    10K     1.2
sales100K.csv   100K    12
sales500K.csv   500K    60
sales1.5M.csv   1.5M    178
sales5M.csv     5M      595
sales10M.csv    10M     1190

Test Environment

Tests were conducted on an Arch Linux workstation powered by a Ryzen 5900x CPU with 64GB of RAM, using the Alacritty terminal within a dwm session.

The following table provides a summary of the specific versions and main characteristics of the three AWK implementations tested:

Name   Version          Binary Size   Installed Size   --csv   UTF-8   Extensions
gawk   5.3.2            853 kB        3.60 MB          yes     yes     yes
mawk   1.3.4 20260129   179 kB        206 kB           no      no      no
nawk   20251225         139 kB        145 kB           yes     yes     no

The Benchmarks

Four benchmarks were applied. They represent common patterns for storing CSV data in AWK, each with different memory characteristics and use cases.

Each benchmark sequence included one initial warmup run followed by three recorded runs. Normalized parameters are based on the median, with 1.0 being the baseline (e.g. lowest peak memory or runtime).

Benchmark #1: Store entire lines in array

x[NR]=$0

This is the simplest storage method and keeps the original line intact without parsing individual fields. The memory footprint includes the full text of each line including all field separators. This method is commonly used when you need to preserve the exact input for later processing or output, or when you need random access to complete lines.
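As an illustration of the random access this pattern enables, the sketch below stores three made-up lines and replays them in reverse order. The input rows and the variable name are assumptions for the example; the benchmark itself only populates the array.

```shell
reversed=$(printf '%s\n' 'first,1' 'second,2' 'third,3' |
    awk '{ x[NR] = $0 }                                  # store each raw line under its record number
         END { for (i = NR; i >= 1; i--) print x[i] }')  # random access: replay in reverse
echo "$reversed"
```

Nothing is parsed at storage time, so each element is an exact copy of the input line, separators included.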

Results of Benchmark #1

Benchmark #1 Result Table

File / variant                 Runtime [s]                                                        Peak Memory [MB]                                             
                               mean ± sdev            min      median    max      Jtr%             mean ± sdev            min      median    max      Jtr%        
sales1K.csv                                                                                                                                                      
  gawk                         0.0016 ± 0.0001        0.0016   0.0016    0.0017   1.1              0.75 ± 0.02            0.73     0.74      0.77     1.4        
  mawk                         0.0010 ± 0.0001        0.0009   0.0010    0.0011   1.3              0.82 ± 0.15            0.73     0.74      1.00     11.6       
  nawk                         0.0015 ± 0.0001        0.0015   0.0015    0.0016   2.3              0.57 ± 0.15            0.48     0.49      0.74     16.8       

sales10K.csv                                                                                                                                                     
  gawk                         0.0048 ± 0.0001        0.0047   0.0049    0.0049   0.9              3.07 ± 0.15            2.98     2.99      3.25     2.8        
  mawk                         0.0027 ± 0.0002        0.0026   0.0028    0.0028   1.2              2.74 ± 0.15            2.73     2.74      2.74     0.0        
  nawk                         0.0089 ± 0.0003        0.0087   0.0087    0.0091   1.5              2.85 ± 0.19            2.73     2.83      2.98     0.6        

sales100K.csv                                                                                                                                                    
  gawk                         0.0330 ± 0.0008        0.0324   0.0328    0.0339   0.7              25.74 ± 0.15           25.73    25.74     25.74    0.0        
  mawk                         0.0174 ± 0.0014        0.0163   0.0169    0.0189   3.0              21.49 ± 0.15           21.48    21.49     21.49    0.0        
  nawk                         0.0766 ± 0.0020        0.0750   0.0760    0.0789   0.9              23.57 ± 0.35           23.24    23.74     23.74    0.7        

sales500K.csv                                                                                                                                                    
  gawk                         0.1488 ± 0.0022        0.1473   0.1480    0.1511   0.5              125.49 ± 0.29          125.23   125.48    125.74   0.0        
  mawk                         0.1025 ± 0.0048        0.0987   0.1012    0.1076   1.3              105.24 ± 0.29          104.99   105.25    105.48   0.0        
  nawk                         0.3974 ± 0.0063        0.3918   0.3966    0.4038   0.2              120.19 ± 0.38          120.02   120.27    120.27   0.1        

sales1.5M.csv                                                                                                                                                    
  gawk                         0.4399 ± 0.0036        0.4367   0.4406    0.4422   0.2              374.99 ± 0.58          374.49   374.98    375.49   0.0        
  mawk                         0.3360 ± 0.0159        0.3192   0.3403    0.3486   1.3              313.24 ± 0.52          312.98   313.00    313.74   0.1        
  nawk                         1.1826 ± 0.0132        1.1753   1.1766    1.1959   0.5              346.26 ± 0.45          346.02   346.26    346.51   0.0        

sales5M.csv                                                                                                                                                      
  gawk                         1.4856 ± 0.0037        1.4848   1.4853    1.4868   0.0              1252.07 ± 0.65         1251.73  1252.23   1252.23  0.0        
  mawk                         1.1790 ± 0.0185        1.1696   1.1788    1.1885   0.0              1042.48 ± 0.58         1042.23  1042.48   1042.73  0.0        
  nawk                         3.9644 ± 0.0155        3.9555   3.9660    3.9717   0.0              1156.11 ± 0.47         1156.02  1156.03   1156.27  0.0        

sales10M.csv                                                                                                                                                     
  gawk                         2.9759 ± 0.0070        2.9691   2.9783    2.9802   0.1              2507.74 ± 0.69         2507.49  2507.74   2507.99  0.0        
  mawk                         2.4046 ± 0.0220        2.3926   2.4044    2.4166   0.0              2083.90 ± 0.60         2083.73  2083.98   2083.98  0.0        
  nawk                         8.0351 ± 0.0158        8.0334   8.0337    8.0383   0.0              2361.94 ± 0.70         2361.53  2361.78   2362.52  0.0                 

Summary Table

Benchmark #1 Summary Table

File size        rt [s]                  pm [MB]        
    [MB]  gawk    mawk    nawk        gawk    mawk    nawk
----------------------------------------------------------
    0.12  0.0016  0.0010  0.0015      0.74    0.74    0.49
     1.2  0.0049  0.0028  0.0087      2.99    2.74    2.83
      12  0.0328  0.0169  0.0760     25.74   21.49   23.74
      60  0.1480  0.1012  0.3966    125.48  105.25  120.27
     178  0.4406  0.3403  1.1766    374.98  313.00  346.26
     595  1.4853  1.1788  3.9660   1252.23 1042.48 1156.03
    1190  2.9783  2.4044  8.0337   2507.74 2083.98 2361.78     

Normalized results: RT (normalized runtime) and MO (memory overhead)

Benchmark #1 Normalized Results

File size   RT                    MO          
    [MB]    gawk   mawk   nawk    gawk   mawk   nawk
----------------------------------------------------
    0.12    1.6    1.0    1.5     6.2    6.2    4.1
     1.2    1.8    1.0    3.1     2.5    2.3    2.4
      12    1.9    1.0    4.5     2.1    1.8    2.0
      60    1.5    1.0    3.9     2.1    1.8    2.0
     178    1.3    1.0    3.5     2.1    1.8    1.9
     595    1.3    1.0    3.4     2.1    1.8    1.9
    1190    1.2    1.0    3.3     2.1    1.8    2.0

Benchmark #2: Populate 2D matrix

for (i=1; i<=NF; i++) x[NR,i] = $i

AWK simulates 2D arrays by concatenating keys with a built-in separator (SUBSEP), so x[row, col] is stored internally as x[row SUBSEP col]. This approach provides indexed access to individual fields and is useful when you need to perform operations on specific columns across all rows. The memory overhead includes both the field data and the composite key structures.
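A small sketch of how the composite keys behave, using a made-up two-row input. The variable names are illustrative; the point is that x[r, c] and a hand-built r SUBSEP c key address the same element.

```shell
matrix_out=$(printf '%s\n' 'a,b,c' 'd,e,f' |
    awk -F, '
        { for (i = 1; i <= NF; i++) x[NR, i] = $i }   # key is NR SUBSEP i
        END {
            print x[2, 3]                              # single cell: row 2, column 3
            hit = ((1 SUBSEP 2) in x)                  # same key built by hand
            print hit
            for (r = 1; r <= NR; r++)                  # walk column 2 across all rows
                printf "%s%s", x[r, 2], (r < NR ? " " : "\n")
        }')
echo "$matrix_out"
```

Because every cell carries its own string key, the number of array elements is rows times columns, which is consistent with this benchmark showing the highest memory overhead of the array-based patterns.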

Results of Benchmark #2

Benchmark #2 Result Table

File / variant                 Runtime [s]                                                        Peak Memory [MB]                                             
                               mean ± sdev            min      median    max      Jtr%             mean ± sdev            min      median    max      Jtr%        
sales1K.csv                                                                                                                                                      
  gawk                         0.0087 ± 0.0003        0.0084   0.0086    0.0090   0.7              5.24 ± 0.00            5.24     5.24      5.24     0.0        
  mawk                         0.0066 ± 0.0011        0.0058   0.0061    0.0078   7.8              2.15 ± 0.29            1.98     1.98      2.49     8.5        
  nawk                         0.0090 ± 0.0007        0.0081   0.0094    0.0094   4.5              2.33 ± 0.13            2.23     2.28      2.48     2.2        

sales10K.csv                                                                                                                                                     
  gawk                         0.0662 ± 0.0006        0.0657   0.0661    0.0668   0.2              46.07 ± 0.14           45.99    45.99     46.24    0.2        
  mawk                         0.0511 ± 0.0027        0.0491   0.0503    0.0539   1.6              16.32 ± 0.32           16.24    16.24     16.48    0.5        
  nawk                         0.0820 ± 0.0010        0.0814   0.0821    0.0827   0.0              19.67 ± 0.20           19.59    19.59     19.84    0.4        

sales100K.csv                                                                                                                                                    
  gawk                         0.7788 ± 0.0262        0.7500   0.7852    0.8011   0.8              452.65 ± 0.20          452.48   452.73    452.74   0.0        
  mawk                         0.9017 ± 0.0144        0.8931   0.8941    0.9180   0.9              156.74 ± 0.32          156.73   156.73    156.74   0.0        
  nawk                         0.7998 ± 0.0117        0.7911   0.7953    0.8131   0.6              178.69 ± 0.25          178.52   178.77    178.78   0.0        

sales500K.csv                                                                                                                                                    
  gawk                         4.5791 ± 0.0277        4.5695   4.5800    4.5878   0.0              2261.98 ± 0.48         2261.72  2261.73   2262.48  0.0        
  mawk                         6.3488 ± 0.0417        6.3037   6.3686    6.3742   0.3              785.24 ± 0.32          785.23   785.23    785.24   0.0        
  nawk                         4.3709 ± 0.0170        4.3583   4.3714    4.3830   0.0              959.94 ± 0.46          959.52   960.04    960.26   0.0        

sales1.5M.csv                                                                                                                                                    
  gawk                         14.8738 ± 0.1380       14.7941  14.7974   15.0298  0.5              6775.31 ± 0.70         6774.75  6775.47   6775.72  0.0        
  mawk                         19.7716 ± 0.0491       19.7417  19.7862   19.7870  0.1              2356.07 ± 0.50         2355.73  2355.98   2356.48  0.0        
  nawk                         12.1783 ± 0.0873       12.1189  12.1395   12.2765  0.3              2677.02 ± 0.52         2676.77  2677.02   2677.27  0.0        

sales5M.csv                                                                                                                                                      
  gawk                         50.6685 ± 0.1408       50.6431  50.6636   50.6989  0.0              22592.04 ± 0.71        22591.95 22591.96  22592.20 0.0        
  mawk                         72.4570 ± 0.1123       72.3429  72.4932   72.5348  0.0              7963.89 ± 0.52         7963.73  7963.96   7963.99  0.0        
  nawk                         40.7116 ± 0.2032       40.5819  40.6313   40.9215  0.2              8988.45 ± 0.65         8988.01  8988.57   8988.76  0.0        

sales10M.csv                                                                                                                                                     
  gawk                         101.8983 ± 0.2029      101.7380 101.9330  102.0240 0.0              45182.70 ± 0.71        45182.70 45182.71  45182.71 0.0        
  mawk                         150.8563 ± 0.2839      150.6080 150.8330  151.1280 0.0              15965.48 ± 0.84        15964.98 15965.23  15966.23 0.0        
  nawk                         84.8894 ± 0.4522       84.5106  84.8431   85.3145  0.1              18777.15 ± 0.68        18776.93 18777.25  18777.26 0.0        

Summary Table

Benchmark #2 Summary Table

File size          rt [s]                      pm [MB]                
    [MB]   gawk     mawk     nawk         gawk     mawk     nawk
----------------------------------------------------------------
    0.12   0.0086   0.0061   0.0094       5.24     1.98     2.28
     1.2   0.0661   0.0503   0.0821      45.99    16.24    19.59
      12   0.7852   0.8941   0.7953     452.73   156.73   178.77
      60   4.5800   6.3686   4.3714    2261.73   785.23   960.04
     178  14.7974  19.7862  12.1395    6775.47  2355.98  2677.02
     595  50.6636  72.4932  40.6313   22591.96  7963.96  8988.57
    1190 101.9330 150.8330  84.8431   45182.71 15965.23 18777.25

Normalized results: RT (normalized runtime) and MO (memory overhead)

Benchmark #2 Normalized Results

File size   RT                   MO          
    [MB]    gawk   mawk   nawk   gawk   mawk   nawk
---------------------------------------------------
    0.12    1.4    1.0    1.5    43.7   16.5   19.0
     1.2    1.3    1.0    1.6    38.3   13.5   16.3
      12    1.0    1.1    1.0    37.7   13.1   14.9
      60    1.0    1.5    1.0    37.7   13.1   16.0
     178    1.2    1.6    1.0    38.1   13.2   15.0
     595    1.2    1.8    1.0    38.0   13.4   15.1
    1190    1.2    1.8    1.0    38.0   13.4   15.8

Benchmark #3: Populate 1D array for each field

x1[NR]=$1; x2[NR]=$2; x3[NR]=$3; ... x14[NR]=$14

This creates 14 independent arrays (hash tables) in memory, avoiding the composite key overhead of the 2D approach. This method is efficient when you frequently access all values of a particular field, because each field's data lives in its own array and can be traversed without touching the other columns. The tradeoff is managing multiple array variables instead of a single unified structure.
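A reduced sketch of the per-field layout, using three fields instead of 14 and made-up rows. The array names follow the benchmark's x1, x2, ... convention; everything else is an assumption for illustration.

```shell
col_out=$(printf '%s\n' 'north,10,2.5' 'south,20,1.5' 'east,30,4.0' |
    awk -F, '
        { x1[NR] = $1; x2[NR] = $2; x3[NR] = $3 }    # one array per field
        END {
            for (r = 1; r <= NR; r++) sum += x2[r]   # column operation touches only x2
            print "rows=" NR, "sum(x2)=" sum, "first(x1)=" x1[1]
        }')
echo "$col_out"
```

Keys here are plain record numbers, so no SUBSEP-joined strings have to be built or stored.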

In Benchmark #3, gawk's native arrays-of-arrays feature was also tested (labeled gawk* in the tables):

for (i=1; i<=NF; i++) x[NR][i]=$i

This creates a true nested structure where each row is a parent array containing 14 child elements.
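The nested structure can be inspected with gawk's isarray() and length(). The sketch below is guarded so that it only runs where gawk is installed; the sample rows and variable name are assumptions.

```shell
if command -v gawk >/dev/null 2>&1; then
    nested_out=$(printf '%s\n' 'a,b,c' 'd,e,f' |
        gawk -F, '
            { for (i = 1; i <= NF; i++) x[NR][i] = $i }   # x[NR] is itself an array
            END { print isarray(x[2]), length(x[2]), x[2][3] }')
    echo "$nested_out"
else
    echo "gawk not installed; skipping"
fi
```

This syntax is a gawk extension, so the same program is rejected by mawk and nawk.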

Results of Benchmark #3

Benchmark #3 Result Table

File / variant                  Runtime [s]                                                        Peak Memory [MB]                                             
                               mean ± sdev            min      median    max      Jtr%             mean ± sdev            min      median    max      Jtr%        
sales1K.csv                                                                                                                                                      
  gawk                         0.0054 ± 0.0004        0.0051   0.0052    0.0059   4.0              2.83 ± 0.14            2.74     2.77      2.99     2.2        
  mawk                         0.0027 ± 0.0000        0.0026   0.0027    0.0027   0.1              1.82 ± 0.14            1.74     1.74      1.99     4.8        
  nawk                         0.0053 ± 0.0001        0.0052   0.0053    0.0053   0.8              2.23 ± 0.00            2.23     2.23      2.24     0.0        
  gawk*                        0.0066 ± 0.0004        0.0062   0.0068    0.0070   1.9              3.76 ± 0.07            3.70     3.74      3.84     0.6        

sales10K.csv                                                                                                                                                     
  gawk                         0.0362 ± 0.0008        0.0355   0.0362    0.0369   0.0              21.40 ± 0.20           21.23    21.49     21.49    0.4        
  mawk                         0.0176 ± 0.0011        0.0164   0.0179    0.0184   1.8              13.07 ± 0.20           12.98    12.99     13.23    0.6        
  nawk                         0.0413 ± 0.0006        0.0408   0.0413    0.0419   0.0              19.07 ± 0.15           18.98    18.99     19.24    0.4        
  gawk*                        0.0489 ± 0.0016        0.0471   0.0497    0.0498   1.7              31.49 ± 0.45           31.24    31.24     32.00    0.8        

sales100K.csv                                                                                                                                                    
  gawk                         0.3390 ± 0.0047        0.3337   0.3406    0.3426   0.5              206.16 ± 0.43          205.74   206.25    206.48   0.0        
  mawk                         0.2401 ± 0.0100        0.2288   0.2444    0.2472   1.7              125.74 ± 0.32          125.49   125.73    125.99   0.0        
  nawk                         0.4159 ± 0.0036        0.4119   0.4177    0.4183   0.4              178.06 ± 0.40          177.73   177.96    178.47   0.1        
  gawk*                        0.4581 ± 0.0026        0.4563   0.4578    0.4602   0.1              308.23 ± 0.51          307.98   308.24    308.48   0.0        

sales500K.csv                                                                                                                                                    
  gawk                         1.6668 ± 0.0108        1.6607   1.6618    1.6780   0.3              1023.65 ± 0.57         1023.24  1023.74   1023.98  0.0        
  mawk                         1.2514 ± 0.0122        1.2445   1.2510    1.2586   0.0              621.15 ± 0.35          620.99   621.23    621.24   0.0        
  nawk                         2.4466 ± 0.0109        2.4348   2.4523    2.4528   0.2              946.89 ± 0.49          946.56   947.04    947.07   0.0        
  gawk*                        2.2579 ± 0.0027        2.2572   2.2582    2.2583   0.0              1537.32 ± 0.53         1537.24  1537.24   1537.49  0.0        

sales1.5M.csv                                                                                                                                                    
  gawk                         5.0319 ± 0.0295        5.0004   5.0445    5.0509   0.2              3064.90 ± 0.69         3064.48  3064.98   3065.24  0.0        
  mawk                         3.5495 ± 0.0221        3.5303   3.5515    3.5669   0.1              1847.24 ± 0.56         1846.73  1847.48   1847.49  0.0        
  nawk                         6.3204 ± 0.0325        6.2918   6.3166    6.3527   0.1              2664.61 ± 0.53         2664.46  2664.56   2664.82  0.0        
  gawk*                        6.7958 ± 0.0173        6.7846   6.7873    6.8155   0.1              4609.57 ± 0.93         4608.73  4609.73   4610.24  0.0        

sales5M.csv                                                                                                                                                      
  gawk                         17.2646 ± 0.0952       17.1816  17.2512   17.3611  0.1              10289.49 ± 0.85        10288.99 10289.49  10289.98 0.0        
  mawk                         12.3195 ± 0.0415       12.2834  12.3215   12.3537  0.0              6174.32 ± 0.76         6173.74  6174.48   6174.74  0.0        
  nawk                         22.0794 ± 0.1297       21.9774  22.0413   22.2196  0.2              8938.40 ± 0.55         8938.31  8938.32   8938.57  0.0        
  gawk*                        22.7869 ± 0.1159       22.6546  22.8511   22.8549  0.3              15367.34 ± 0.94        15367.23 15367.27  15367.50 0.0        

sales10M.csv                                                                                                                                                     
  gawk                         34.8630 ± 0.2366       34.6662  34.8277   35.0951  0.1              20633.40 ± 0.90        20633.23 20633.24  20633.74 0.0        
  mawk                         24.6822 ± 0.0494       24.6660  24.6677   24.7130  0.1              12346.65 ± 0.82        12346.48 12346.49  12346.98 0.0        
  nawk                         48.6894 ± 0.1791       48.5470  48.7520   48.7691  0.1              18576.88 ± 0.57        18576.79 18576.80  18577.05 0.0        
  gawk*                        45.8100 ± 0.1638       45.7327  45.7543   45.9431  0.1              30737.79 ± 1.19        30736.99 30737.99  30738.39 0.0        

Summary Table

Benchmark #3 Summary Table

File size            rt [s]                            pm [MB]                      
    [MB]   gawk     mawk     nawk     gawk*        gawk     mawk     nawk    gawk*
----------------------------------------------------------------------------------
    0.12   0.0052   0.0027   0.0053   0.0068       2.77     1.74     2.23     3.74
     1.2   0.0362   0.0179   0.0413   0.0497      21.49    12.99    18.99    31.24
      12   0.3406   0.2444   0.4177   0.4578     206.25   125.73   177.96   308.24
      60   1.6618   1.2510   2.4523   2.2582    1023.74   621.23   947.04  1537.24
     178   5.0445   3.5515   6.3166   6.7873    3064.98  1847.48  2664.56  4609.73
     595  17.2512  12.3215  22.0413  22.8511   10289.49  6174.48  8938.32 15367.27
    1190  34.8277  24.6677  48.7520  45.7543   20633.24 12346.49 18576.80 30737.99

Normalized results: RT (normalized runtime) and MO (memory overhead)

Benchmark #3 Normalized Results

File size   RT                          MO                  
    [MB]    gawk   mawk   nawk   gawk*  gawk   mawk   nawk   gawk*
-----------------------------------------------------------------
    0.12    1.9    1.0    2.0    2.5    23.1   14.5   18.6   31.2
     1.2    2.0    1.0    2.3    2.8    17.9   10.8   15.8   26.0
      12    1.4    1.0    1.7    1.9    17.2   10.5   14.8   25.7
      60    1.3    1.0    2.0    1.8    17.1   10.4   15.8   25.6
     178    1.4    1.0    1.8    1.9    17.2   10.4   15.0   25.9
     595    1.4    1.0    1.8    1.9    17.3   10.4   15.0   25.8
    1190    1.4    1.0    2.0    1.9    17.3   10.4   15.6   25.8

Benchmark #4: Concatenate entire data in one string

x = x $0

Each line is appended to the existing string, creating progressively larger string values. This pattern can be useful when building complete records for batch output, log aggregation or creating hash/checksum input.
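One detail worth seeing in a toy run: the benchmark pattern drops the record separators, so the result is shorter than the input. The sketch below compares it with a variant that re-appends ORS; the input words and variable names are made up for illustration.

```shell
blob_len=$(printf '%s\n' 'alpha' 'beta' 'gamma' |
    awk '{ x = x $0          # benchmark pattern: newlines are not kept
           y = y $0 ORS }    # variant that preserves line boundaries
         END { print length(x), length(y) }')
echo "$blob_len"
```

Either way, each iteration produces a new, longer string value. How cheaply an implementation can extend an existing string is an internal detail that the runtime figures in this benchmark reflect.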

Results of Benchmark #4

Benchmark #4 Result Table

File / variant                 Runtime [s]                                                        Peak Memory [MB]                                             
                               mean ± sdev            min      median    max      Jtr%             mean ± sdev            min      median    max      Jtr%        
sales1K.csv 0.12                                                                                                                                                 
  gawk                         0.0016 ± 0.0001        0.0016   0.0017    0.0017   1.8              0.75 ± 0.02            0.74     0.74      0.78     1.8        
  mawk                         0.0058 ± 0.0003        0.0054   0.0059    0.0061   0.9              0.98 ± 0.07            0.94     0.95      1.07     3.7        
  nawk                         0.0084 ± 0.0003        0.0081   0.0084    0.0086   0.1              1.06 ± 0.11            0.94     1.08      1.15     2.1        

sales10K.csv 1.2                                                                                                                                                 
  gawk                         0.0041 ± 0.0001        0.0040   0.0040    0.0042   0.8              1.98 ± 0.02            1.98     1.98      1.98     0.1        
  mawk                         1.2520 ± 0.0030        1.2485   1.2533    1.2541   0.1              5.18 ± 0.16            5.02     5.25      5.29     1.2        
  nawk                         1.4993 ± 0.0047        1.4954   1.4979    1.5046   0.1              6.30 ± 0.27            6.03     6.36      6.51     0.9        

sales100K.csv 12                                                                                                                                                 
  gawk                         0.0239 ± 0.0002        0.0238   0.0238    0.0241   0.3              12.73 ± 0.02           12.72    12.73     12.73    0.0        
  mawk                         57.3232 ± 0.4294       56.8661  57.3852   57.7182  0.1              37.42 ± 0.95           36.46    37.46     38.33    0.1        
  nawk                         92.1290 ± 0.9580       91.0598  92.4178   92.9094  0.3              49.94 ± 1.15           49.00    49.64     51.17    0.6        

sales500K.csv 60                                                                                                                                                 
  gawk                         0.1075 ± 0.0015        0.1059   0.1080    0.1087   0.4              60.05 ± 0.13           59.97    59.98     60.21    0.1        
  mawk                         1479.21                                                             152.68 
  nawk                         3854.76                                                             180.05 

sales1.5M.csv 178                                                                                                                                                
  gawk                         0.3163 ± 0.0018        0.3155   0.3160    0.3174   0.1              178.63 ± 0.32          178.46   178.47    178.97   0.1        

sales5M.csv 595                                                                                                                                                  
  gawk                         1.0447 ± 0.0041        1.0408   1.0454    1.0480   0.1              592.62 ± 0.43          592.45   592.45    592.96   0.0        

sales10M.csv 1190                                                                                                                                                
  gawk                         2.0706 ± 0.0041        2.0704   2.0706    2.0709   0.0              1184.01 ± 0.46         1183.92  1183.93   1184.19  0.0        

Summary Table

Benchmark #4 Summary Table

File size    rt [s]                           pm [MB]
    [MB]     gawk       mawk       nawk       gawk       mawk       nawk
------------------------------------------------------------------------
    0.12     0.0017     0.0062     0.0081      0.7        1.1        1.0
     1.2     0.0040     1.2365     1.4840      2.0        5.1        6.4
      12     0.0238    54.5716    88.8361     12.7       36.9       49.4
      60     0.1080  1479.2100  3854.7600     60.0      152.7      180.1
     178     0.3160 ---------- ----------    178.5 ---------- ----------
     595     1.0454 ---------- ----------    592.5 ---------- ----------
    1190     2.0706 ---------- ----------   1183.9 ---------- ----------

Normalized results: RT (normalized runtime) and MO (memory overhead)

Benchmark #4 Normalized Results

File size   RT                       MO          
    [MB]    gawk   mawk     nawk     gawk   mawk   nawk
-------------------------------------------------------
    0.12    1.0     3.6      4.8     6.2    9.6    7.9
     1.2    1.0   309.1    371.0     1.7    4.3    5.4
      12    1.0  2292.9   3732.6     1.1    3.1    4.1
      60    1.0 13696.4  35692.2     1.0    2.5    3.0

Discussion

For the comparative analysis, normalized metrics were used:

  • MO (Memory Overhead): The ratio of peak memory usage to the raw file size. For example, an MO of 2.0 means the process used twice as much RAM as the data occupies on disk. This allows a direct comparison of memory efficiency regardless of the input file size.

  • RT (Normalized Runtime): The execution time divided by that of the fastest variant for the same file size, which serves as the baseline (1.0). For example, an RT of 2.0 means the variant took twice as long as the fastest one, giving a clear picture of relative speed across the AWK variants.
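
As a minimal sketch of how the two metrics follow from raw measurements, the snippet below recomputes RT and MO for the 60MB row of the Benchmark #4 summary table. The helper name normalize and the input layout (variant, runtime in seconds, peak memory in MB) are assumptions made for illustration; this is not the script used to produce the article's tables.

```shell
# Hypothetical helper: reads "variant runtime_s peak_mb" rows on stdin and
# prints RT (runtime / fastest runtime) and MO (peak memory / file size).
normalize() {
    awk -v size_mb="$1" '
        {
            name[NR] = $1; rt[NR] = $2 + 0; mem[NR] = $3 + 0
            if (NR == 1 || rt[NR] < best) best = rt[NR]   # track the fastest variant
        }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s RT=%.1f MO=%.1f\n", name[i], rt[i] / best, mem[i] / size_mb
        }'
}

# 60MB row of the Benchmark #4 summary table
printf '%s\n' 'gawk 0.1080 60.0' 'mawk 1479.2100 152.7' 'nawk 3854.7600 180.1' |
    normalize 60
```

The first argument is the file size in MB; the fastest row always comes out with RT=1.0, which is how the baseline in the normalized tables arises.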

Benchmark #1

The data confirms that for the simple line-storage pattern (x[NR]=$0), memory consumption is a strictly linear function of the input file size across all three variants. As the data scales from 120KB to 1.2GB, the normalized memory overhead (MO) exhibits clear asymptotic behavior: the initial variance caused by interpreter startup costs (which peaked at 6.2x for the smallest file) stabilizes at higher volumes. By the 1.2GB mark, gawk and nawk settle at roughly 2.0x to 2.1x overhead relative to the raw file size, while mawk maintains a leaner 1.8x, making it the most memory-efficient engine for large-scale string retention.

Benchmark #1: Peak Memory vs File Size

In terms of runtime performance, mawk consistently dominated as the fastest variant, serving as the baseline (1.0) for all normalized runtime (RT) measurements above the smallest file size. gawk showed improving efficiency as the workload increased, dropping from 1.9x to 1.2x the runtime of mawk, whereas nawk struggled significantly with this storage pattern, ending with a runtime 3.3x that of mawk at the 10 million row limit. These results highlight that for pure data population tasks where preserving line integrity is key, mawk offers the best balance of speed and a minimized memory footprint.

Benchmark #1: Normalized Results

Benchmark #2

The data for the 2D matrix population x[NR, i] = $i shows a massive increase in resource requirements compared to simple line storage, though the peak memory remains a strictly linear function of the file size. As the dataset scales toward 1.2GB, the normalized memory overhead reaches an asymptotic state where the initial interpreter costs become negligible. In this scenario, mawk proves to be the most memory-efficient, stabilizing at a memory overhead of 13.4, ahead of nawk at 15.8. In contrast, gawk is exceptionally heavy for this storage pattern, requiring 38 times the raw file size in RAM, which is nearly triple the footprint of mawk.

Benchmark #2: Peak Memory vs File Size

The runtime performance results reveal a significant shift in execution efficiency as the number of array elements grows. While mawk is the fastest for small files, its performance degrades significantly at scale, eventually becoming the slowest variant with a normalized runtime of 1.8. Conversely, nawk emerges as the performance leader for large-scale matrix population, maintaining the baseline speed of 1.0 at high volumes. These results illustrate a clear trade-off: mawk is the optimal choice for minimizing the memory footprint in massive stateful operations, but nawk offers superior throughput when processing tens of millions of discrete fields.

Benchmark #2: Normalized Results
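
For readers unfamiliar with the x[NR, i] form: POSIX awk has no true multidimensional arrays, so the subscripts are joined with SUBSEP into a single string key. The sketch below illustrates that language rule on a made-up two-row input; the function name matrix_demo is hypothetical, and the snippet does not measure memory.

```shell
# POSIX awk only; the two-row input is made up for illustration.
matrix_demo() {
    printf '%s\n' 'a,b' 'c,d' |
    awk -F, '
        { for (i = 1; i <= NF; i++) x[NR, i] = $i }
        END {
            for (k in x) cells++           # every cell is its own hash entry
            key = 2 SUBSEP 1               # the string awk builds for x[2, 1]
            n = split(key, part, SUBSEP)   # recover the two subscripts
            printf "cells=%d value=%s row=%s col=%s\n", cells, x[key], part[1], part[2]
        }'
}
matrix_demo
```

Each cell carries its own key string in addition to its value, so the per-element cost is paid once per field rather than once per line. How large that cost is depends on each implementation's internals, which this sketch does not inspect.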

Benchmark #3

In Benchmark #3, using 14 independent 1D arrays proves more memory-efficient than the 2D composite key approach for all variants, most markedly for gawk. The peak memory usage remains a linear function of file size, with normalized memory overhead (MO) reaching a steady state quickly. mawk again demonstrates superior memory management, stabilizing at an MO of 10.4, which is about 40% lower than gawk’s 17.3 and about a third lower than nawk’s 15.6. Interestingly, gawk's native array-of-arrays feature (gawk*) proved to be the most resource-intensive strategy in this test, with a stabilized MO of 25.8. This suggests that the internal overhead of managing nested objects in gawk is substantially higher than managing multiple flat hash tables.

Benchmark #3: Peak Memory vs File Size

Runtime-wise, mawk maintained its lead as the fastest variant, serving as the 1.0 baseline for all file sizes. At scale, gawk finished with about 1.4 times the runtime of mawk, while nawk lagged at 1.9 times. Despite the structural elegance of gawk's nested arrays, the gawk* results showed no performance benefit over the 1D array method, consistently taking about 2.0 times as long as mawk. For users requiring field-level access at scale, the strategy of multiple 1D arrays in mawk provides the best optimization of both execution speed and memory footprint.

Benchmark #3: Normalized Results
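
The multiple 1D arrays strategy can be sketched as follows. The three-column input and the array names id, item and price are hypothetical stand-ins (the benchmark files have 14 fields), and the function name per_field_demo is illustrative only.

```shell
# Hypothetical 3-column CSV; the benchmark files have 14 fields.
per_field_demo() {
    printf '%s\n' '1,apple,2.50' '2,pear,1.25' |
    awk -F, '
        # One flat array per field, all indexed by record number.
        { id[NR] = $1; item[NR] = $2; price[NR] = $3 }
        # gawk-only alternative (arrays of arrays): x[NR][1] = $1, and so on.
        END {
            for (r = 1; r <= NR; r++) total += price[r]
            printf "rows=%d last_item=%s total=%.2f\n", NR, item[NR], total
        }'
}
per_field_demo
```

The flat-array form runs unchanged in gawk, mawk and nawk, whereas the nested x[NR][i] syntax is a gawk extension.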

Benchmark #4

Benchmark #4 reveals a dramatic divergence in performance, highlighting how different engines handle repeated string concatenation (x = x $0). In this scenario, gawk performs exceptionally, maintaining near-linear time complexity as the file size increases. This efficiency is due to gawk's optimized string management, which applies a smarter reallocation strategy than its counterparts. As the dataset scales to 60MB, gawk completes the task in just 0.1 seconds, whereas mawk and nawk suffer a superlinear performance collapse, taking approximately 25 minutes and 64 minutes respectively. Due to these extreme runtime requirements, mawk and nawk were not tested for file sizes larger than 60MB. For any workflow involving large-scale string building, gawk is the only viable option among the three.

Benchmark #4: Peak Memory vs File Size

The memory overhead data also reverses the trend of the previous benchmarks. Besides struggling with time, mawk and nawk show higher memory overhead than gawk at every tested file size, still standing at 2.5 and 3.0 for the 60MB file. gawk’s memory usage, by contrast, remains extremely tight, approaching a 1.0 overhead ratio at the 60MB mark and beyond, effectively matching the raw file size. The massive RT (normalized runtime) values for mawk and nawk, reaching over 13,000x and 35,000x the duration of gawk, point to a fundamental architectural difference: gawk is specifically optimized for efficient string appending, while the others suffer from costly repeated memory copying and reallocations.

Benchmark #4: Normalized Results
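
To put the slowdown in perspective, here is a rough sketch that compares the runtime growth between the 12MB and 60MB rows of the summary table with what a linear and a quadratic cost model would predict. The quadratic model (each append copies the whole accumulated string) is an assumption used for comparison, not something measured here, and the function name growth is illustrative.

```shell
growth() {
    # Rows: variant, runtime at 12MB, runtime at 60MB (seconds, summary table)
    printf '%s\n' 'gawk 0.0238 0.1080' 'mawk 54.5716 1479.2100' 'nawk 88.8361 3854.7600' |
    awk -v small=12 -v large=60 '
        BEGIN {
            ratio = large / small
            printf "size x%d: linear predicts x%d, quadratic predicts x%d\n", ratio, ratio, ratio * ratio
        }
        { printf "%s observed x%.1f\n", $1, $3 / $2 }'
}
growth
```

On these numbers gawk grows by about 4.5x for 5x more data, while mawk and nawk grow by about 27x and 43x, at or beyond the 25x the quadratic model predicts.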

Conclusion

This table summarizes the Memory Overhead (MO, the ratio of peak memory usage relative to the raw file size) of the four benchmarks. These values represent the stable multiplier of peak memory relative to file size once the dataset is large enough to make interpreter startup costs negligible.

Memory Overhead (MO) Summary Table

Benchmark Scenario            gawk    mawk    nawk    Best Efficiency
---------------------------------------------------------------------
#1: Store entire lines         2.1     1.8     2.0    mawk
#2: Populate 2D matrix        38.0    13.4    15.8    mawk
#3: 1D array per field        17.3    10.4    15.6    mawk
#4: String concatenation*      1.0     2.5     3.0    gawk

*Note: Benchmark #4 values are taken from the 60MB file due to the runtime constraints of mawk and nawk.

Key Findings

  • The array efficiency gap: For stateful data population, mawk was consistently the most memory-efficient. In the 2D matrix test, it needed only about a third of the memory gawk used (an MO of 13.4 versus 38.0), highlighting its leaner internal representation of hash tables and strings.

  • Structure penalty: Breaking a CSV line into 14 discrete fields (Benchmark #3) increases memory overhead by approximately 5x to 8x compared to storing the line as a single string (Benchmark #1).

  • gawk’s specialization: While gawk is the heaviest variant for array-based storage, it is uniquely optimized for string management. It was the only variant where memory overhead effectively equaled the file size (1.0) during massive string concatenation, coupled with extremely fast execution.

  • The cost of "Array of Arrays": Though not in the summary table, the results for gawk's native array of arrays (gawk*, 25.8 MO) show that nested structures are significantly more expensive than multiple 1D arrays in the same engine (17.3 MO), likely due to the overhead of managing multiple internal hash table objects.
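
The structure penalty can be reproduced from the summary table by dividing each variant's Benchmark #3 overhead by its Benchmark #1 overhead. The snippet below is a small sketch of that arithmetic; the function name penalty is hypothetical.

```shell
penalty() {
    # Rows: variant, MO for Benchmark #1, MO for Benchmark #3 (summary table)
    printf '%s\n' 'gawk 2.1 17.3' 'mawk 1.8 10.4' 'nawk 2.0 15.6' |
    awk '{ printf "%s %.1fx\n", $1, $3 / $2 }'
}
penalty
```

The per-variant ratios come out at about 8.2x for gawk, 5.8x for mawk and 7.8x for nawk.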

In conclusion, these results demonstrate that using AWK in stateful mode requires careful consideration. While these benchmarks were conducted by populating the entire dataset to test engine limits, significant memory can be saved in practice by populating only the specific fields or records needed for the task. If RAM matters, mawk is the clear leader for population methods involving arrays or matrix simulations. However, for methods requiring large-scale string building, gawk remains the only viable option. Ultimately, selecting the right population method and the appropriate AWK variant is essential for maintaining stability and performance when processing large datasets.
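
As an illustration of populating only what a task needs, the sketch below keeps a single field, and only for rows that pass a filter, instead of storing every field of every record. The column layout (id, region, amount), the "EU" filter and the function name selective_demo are all hypothetical.

```shell
selective_demo() {
    printf '%s\n' 'id,region,amount' '1,EU,10' '2,US,20' '3,EU,5' |
    awk -F, '
        NR == 1 { next }                  # skip the header row
        $2 == "EU" { amount[++n] = $3 }   # keep one field, and only for matching rows
        END {
            for (i = 1; i <= n; i++) sum += amount[i]
            printf "stored=%d sum=%d\n", n, sum
        }'
}
selective_demo
```

Here two values end up in memory rather than nine, and the same idea scales to the 14-field files: the array grows with the number of matching rows, not with the full input.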

P

Thanks for sharing this.

G

Glad it helped

M

this is some seriously thorough benchmarking work. the way you isolated each storage pattern and measured the overhead so precisely is impressive. benchmark #4 results are wild. mawk and nawk just completely falling apart on string concatenation while gawk barely breaks a sweat.

the "structure penalty" finding is the kind of thing you only learn from actually measuring it. 5x to 8x more memory just for splitting fields vs storing raw lines. easy to overlook until it blows up in production.

good stuff.

G

Thanks, glad you find it useful.

G

Thanks