Text processing — grep, sed, awk, cut, sort, uniq

The Unix data-wrangling toolkit. Compose these with pipes and you can handle most everyday log, CSV, and text tasks without writing a full program.

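Regex flavors in grep: default grep uses BRE (Basic Regular Expressions), where (, ), {, and } are literal characters and only take on their special meaning when backslash-escaped as \(, \), \{, \}; \+ and \? are GNU extensions rather than POSIX BRE. grep -E uses ERE, where (, +, ?, and | are operators without escaping, as in most modern regex syntaxes. grep -P uses PCRE (Perl-compatible: lookaheads, \d shorthands), but it is a GNU extension and not available everywhere; the BSD grep shipped with macOS lacks -P. Prefer grep -E for portable, modern syntax.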
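
As a rough illustration of the BRE/ERE difference, the sketch below runs the same two searches in both flavors. The sample lines are made up for the example, and only POSIX syntax is used so it should behave the same under GNU and BSD grep.

```shell
# Hypothetical sample input, one candidate per line.
sample=$(printf '%s\n' 'foo' 'foofoo' 'f(o)o' 'bar')

# BRE: \( \) groups and \{2\} repeats; a bare ( is a literal parenthesis.
bre_group=$(printf '%s\n' "$sample" | grep '^\(foo\)\{2\}$')
bre_literal=$(printf '%s\n' "$sample" | grep 'f(o)o')

# ERE: ( ) groups and {2} repeats; write \( to match a literal parenthesis.
ere_group=$(printf '%s\n' "$sample" | grep -E '^(foo){2}$')
ere_literal=$(printf '%s\n' "$sample" | grep -E 'f\(o\)o')

printf 'BRE group:   %s\n' "$bre_group"
printf 'BRE literal: %s\n' "$bre_literal"
printf 'ERE group:   %s\n' "$ere_group"
printf 'ERE literal: %s\n' "$ere_literal"
```

Both flavors find the same lines here; only the escaping is inverted.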

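sed beyond substitution: sed is a full stream editor. Commands include p (print), d (delete), s (substitute), N (append the next line to the pattern space), D (delete up to the first newline in the pattern space), and b (branch). Addresses can be line numbers (5), ranges (5,10), or regexes (/pattern/). sed -n '/start/,/end/p' prints each block from a line matching start through the next line matching end, markers included. Beware: in-place editing with -i differs between implementations. BSD sed on macOS requires a backup-suffix argument, which may be empty (sed -i '' 's/a/b/' file), while GNU sed takes an optional suffix attached directly to the flag (sed -i 's/a/b/' file). sed -i.bak 's/a/b/' file works on both and leaves a file.bak backup.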
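
A minimal sketch of range addresses and the attached-suffix form of -i, using a throwaway temp file whose contents are invented for the example:

```shell
# Hypothetical six-line file used only for this demonstration.
tmp=$(mktemp)
printf '%s\n' 'header' 'start' 'alpha' 'beta' 'end' 'footer' > "$tmp"

# Regex range: print from the /start/ line through the /end/ line, inclusive.
block=$(sed -n '/start/,/end/p' "$tmp")

# Line-number range with d: delete lines 2 through 5, print the rest.
trimmed=$(sed '2,5d' "$tmp")

# In-place edit with the backup suffix attached to -i.
sed -i.bak 's/alpha/ALPHA/' "$tmp"
edited=$(sed -n '3p' "$tmp")       # line 3 of the edited file
backup=$(sed -n '3p' "$tmp.bak")   # line 3 of the untouched backup

printf '%s\n' "$block" '--' "$trimmed" '--' "$edited $backup"
rm -f "$tmp" "$tmp.bak"
```

The backup file is the price of portability; delete it afterwards if it is not wanted.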

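awk language: programs are pattern { action } pairs. Built-in variables: NR (current record number, which is the line number by default), NF (field count), $0 (whole record), $1, $2, ... (fields), FS/OFS (input/output field separator). Associative arrays are first-class: awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' counts occurrences of the first field (the for-in iteration order is unspecified, so pipe through sort if order matters). awk needs no external dependencies and is specified by POSIX, so it is present on practically every Unix-like system. For multi-line records, CSV with quoted fields, or complex state, reach for gawk (GNU awk) or a general-purpose language.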
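
The built-ins above can be exercised on a few invented log lines. This is a sketch that sticks to POSIX awk features; the IP addresses and paths are placeholders, not real data.

```shell
# Hypothetical access-log lines: client IP, method, path.
log=$(printf '%s\n' '10.0.0.1 GET /a' '10.0.0.2 GET /b' '10.0.0.1 POST /c')

# Count records per first field; sort afterwards because for-in order is unspecified.
counts=$(printf '%s\n' "$log" |
  awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' | sort)

# pattern { action }: the pattern NR == 2 selects the second record;
# NF is its field count and $NF is its last field.
second=$(printf '%s\n' "$log" | awk 'NR == 2 { print NF, $NF }')

# FS/OFS: read colon-separated fields, write comma-separated ones.
# Assigning $1 = $1 forces awk to rebuild $0 using OFS.
csv=$(printf 'a:b:c\n' | awk 'BEGIN { FS = ":"; OFS = "," } { $1 = $1; print }')

printf '%s\n' "$counts" "$second" "$csv"
```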

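Performance ordering: for simple substring filtering, the usual ranking from fastest to slowest is grep, sed, awk, perl, python. grep -F (fixed string, no regex) is typically the fastest option. LC_ALL=C grep treats input as single bytes and skips locale-aware multibyte handling, which is often 3–10× faster on ASCII data. Piping through multiple tools is still fast because the stages run concurrently and stream: each filter processes a line and emits it without reading the whole file first. sort is the exception, since it must consume all of its input before it can write any output.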
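
The sketch below shows only the behavioral side of these options, not timings: how -F changes what matches, and how LC_ALL=C can be scoped to a single command. The request lines are invented, and any speedup on real data should be measured rather than assumed.

```shell
# Hypothetical request lines; note the literal dot in two of them.
data=$(printf '%s\n' 'GET /a.b' 'GET /aXb' 'POST /a.b')

# As a regex, the dot matches any character, so all three lines count.
regex_hits=$(printf '%s\n' "$data" | grep -c 'a.b')

# With -F the pattern is a fixed string, so only the literal "a.b" counts.
fixed_hits=$(printf '%s\n' "$data" | grep -cF 'a.b')

# LC_ALL=C placed before the command applies to that grep invocation only.
c_hits=$(printf '%s\n' "$data" | LC_ALL=C grep -cF 'a.b')

# Streaming: each stage passes lines along as it goes; head stops after one.
first=$(printf '%s\n' "$data" | grep -F 'a.b' | head -n 1)

printf '%s %s %s %s\n' "$regex_hits" "$fixed_hits" "$c_hits" "$first"
```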

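sort flags worth knowing: -n (numeric), -k2 (key starts at field 2 and runs to the end of the line; use -k2,2 to sort by field 2 alone), -t',' (field separator, e.g. a comma for CSV), -u (unique: drop lines whose keys compare equal), -r (reverse), -h (human-readable sizes like 1K, 2M; an extension beyond POSIX, available in GNU sort). Combined: sort -t',' -k3,3n data.csv sorts numerically by the 3rd CSV column, assuming no quoted fields that contain commas.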
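
A short sketch of key-limited sorting on a made-up three-row CSV (name, department, score). It writes the key as -k3,3 so that the comparison stops at the end of field 3, and it uses cut to pull out single columns.

```shell
# Hypothetical CSV rows: name,department,score.
csv=$(printf '%s\n' 'alice,eng,90' 'bob,ops,100' 'carol,eng,9')

# Numeric sort on field 3 only.
by_score=$(printf '%s\n' "$csv" | sort -t',' -k3,3n)

# Same key, reversed, to get the highest score first.
top=$(printf '%s\n' "$csv" | sort -t',' -k3,3nr | head -n 1)

# Without n the key is compared as text, so "100" sorts before "9".
lexical=$(printf '%s\n' "$csv" | LC_ALL=C sort -t',' -k3,3 |
  cut -d',' -f3 | paste -sd' ' -)

# cut extracts column 2; sort -u leaves one line per distinct department.
depts=$(printf '%s\n' "$csv" | cut -d',' -f2 | sort -u)

printf '%s\n' "$by_score" '--' "$top" '--' "$lexical" '--' "$depts"
```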

uniq gotcha: it only collapses ADJACENT duplicates, so sort first, then uniq. uniq -c prefixes each line with its count; uniq -d shows only lines that are duplicated; uniq -u shows only lines that occur once. To deduplicate without sorting (when the original order matters): awk '!seen[$0]++' keeps the first occurrence of each line and preserves order.
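
To make the adjacency rule concrete, here is a small sketch on invented input in which no two equal lines are neighbors. The awk step after uniq -c only strips the count padding, which varies between implementations.

```shell
# Hypothetical input: duplicates exist, but never on adjacent lines.
words=$(printf '%s\n' 'b' 'a' 'b' 'c' 'a' 'b')

# uniq alone changes nothing here because no duplicates are adjacent.
unsorted=$(printf '%s\n' "$words" | uniq)

# Frequency table: sort, count, order by count descending, normalize spacing.
counts=$(printf '%s\n' "$words" | sort | uniq -c | sort -rn |
  awk '{ print $2, $1 }')

# Lines that appear more than once, and lines that appear exactly once.
dups=$(printf '%s\n' "$words" | sort | uniq -d)
singles=$(printf '%s\n' "$words" | sort | uniq -u)

# Order-preserving dedupe: print a line only the first time it is seen.
ordered=$(printf '%s\n' "$words" | awk '!seen[$0]++')

printf '%s\n' "$counts" '--' "$dups" '--' "$singles" '--' "$ordered"
```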

Grounded on https://www.gnu.org/software/gawk/manual/gawk.html
