Text processing — grep, sed, awk, cut, sort, uniq

The Unix data-wrangling toolkit. Compose these with pipes and you can handle most log, CSV, and plain-text tasks without writing a full program.

These tools treat files as streams of lines. Each has a narrow job; pipes compose them. Learn them once, save hours forever.

grep — filter lines matching a pattern. grep 'ERROR' app.log prints every line containing ERROR. Common flags: -v (invert the match), -i (case-insensitive), -c (count matching lines rather than printing them), -r pattern dir/ (search a directory recursively), -E (extended regex), -n (show line numbers), -o (print only the matched part).
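
The flags above can be tried on a small sample. This is a minimal sketch, assuming a made-up four-line log; the file contents and the variable names (log, error_count, and so on) are illustrative only.

```shell
# Sample log in a temporary file (contents are invented for illustration).
log=$(mktemp)
cat > "$log" <<'EOF'
2024-01-01 INFO service started
2024-01-01 ERROR disk full on /dev/sda1
2024-01-01 warn retrying request
2024-01-01 ERROR timeout after 30s
EOF

# -c counts matching lines, not individual matches.
error_count=$(grep -c 'ERROR' "$log")

# -i ignores case; -n prefixes each hit with its line number.
warn_lines=$(grep -in 'WARN' "$log")

# -v inverts the match; -E enables extended regex (alternation here).
other_lines=$(grep -vE 'ERROR|INFO' "$log")

# -o prints only the matched text, here a number followed by "s".
durations=$(grep -oE '[0-9]+s' "$log")

printf '%s\n' "$error_count" "$warn_lines" "$other_lines" "$durations"
rm -f "$log"
```

Combining -i and -n as one argument (-in) is ordinary short-option bundling; the flags can also be passed separately.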

cut — extract columns. cut -d',' -f1,3 data.csv takes columns 1 and 3 of a CSV. cut -c1-10 takes the first 10 characters of each line. Fast and simple, but it splits on every delimiter, so it breaks on commas inside quoted fields (use a real CSV parser for messy data).
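
A short sketch of both modes, plus the quoted-comma failure. The CSV rows and variable names here are hypothetical sample data.

```shell
# Small CSV in a temporary file (rows are invented for illustration).
csv=$(mktemp)
cat > "$csv" <<'EOF'
id,name,city
1,Ada,London
2,Linus,Helsinki
EOF

# Fields 1 and 3, comma-delimited; output is joined with the same delimiter.
cols=$(cut -d',' -f1,3 "$csv")

# First 4 characters of each line.
chars=$(cut -c1-4 "$csv")

# Where it breaks: the quoted field contains a comma, so cut splits inside it
# and field 2 comes back as only the first half of the name.
broken=$(printf '%s\n' '3,"Hopper, Grace",Arlington' | cut -d',' -f2)

printf '%s\n' "$cols" "$chars" "$broken"
rm -f "$csv"
```

The last command returns the fragment "Hopper (with the leading quote), which is why quoted CSV needs a real parser.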

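sort + uniq — sort orders lines; uniq collapses adjacent duplicate lines, so sort first unless the input is already grouped. sort access.log | uniq -c | sort -rn is the classic frequency-count idiom: uniq -c prefixes each distinct line with its count, and sort -rn orders by that count, descending. Append | head -n N to keep only the top N.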
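
To see why the sort has to come first, here is a minimal sketch on six made-up request paths. The paths and variable names are assumptions for illustration, and head -n 2 stands in for whatever N is needed.

```shell
# Six request paths, deliberately interleaved so no duplicates are adjacent.
log=$(mktemp)
printf '%s\n' /home /about /home /contact /home /about > "$log"

# Without sorting, uniq finds no adjacent repeats, so all 6 lines survive.
unsorted=$(uniq "$log" | wc -l | tr -d ' ')

# Sort so duplicates are adjacent, count them, rank by count, keep the top 2.
top2=$(sort "$log" | uniq -c | sort -rn | head -n 2)

printf '%s\n' "$unsorted" "$top2"
rm -f "$log"
```

The count column that uniq -c prints is padded with leading spaces, and the padding width varies between implementations, so pipe through awk '{ print $1, $2 }' if exact formatting matters.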

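sed — stream editor. Most common form: sed 's/old/new/g' file replaces every occurrence of old with new on each line; without the trailing g, only the first occurrence per line is replaced. sed -n '10,20p' file prints lines 10–20. sed -i edits the file in place, but beware that the syntax differs: GNU sed accepts a bare -i, while BSD/macOS sed requires a backup-suffix argument, which may be empty (sed -i '' 's/old/new/g' file).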
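
The following sketch contrasts substitution with and without g, prints a line range, and shows one way to edit a file without relying on -i at all. The temp-file-and-move approach is a suggested workaround rather than something sed itself prescribes, and the sample lines are invented.

```shell
# Three sample lines in a temporary file (contents are invented).
f=$(mktemp)
printf '%s\n' 'cat cat dog' 'dog cat' 'bird' > "$f"

# Without g, only the first match on each line is replaced.
first_only=$(sed 's/cat/cow/' "$f")

# With g, every match on each line is replaced.
all=$(sed 's/cat/cow/g' "$f")

# -n suppresses the default output; 2,3p prints only lines 2 through 3.
range=$(sed -n '2,3p' "$f")

# Portable "in-place" edit: write to a temp file, then move it over the
# original. This avoids the GNU (-i) versus BSD/macOS (-i '') difference.
sed 's/dog/fox/g' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
edited=$(cat "$f")

printf '%s\n' "$first_only" "$all" "$range" "$edited"
rm -f "$f"
```

Unlike sed -i, the move replaces the file with a newly created one, so permissions and ownership of the original may not carry over; for scripts that only ever run on one platform, that platform's -i form is simpler.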

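awk — mini programming language for tabular data. awk '{ print $2 }' prints column 2, splitting on runs of whitespace. awk -F',' '{ print $1 }' splits on commas instead, which handles simple CSV but not quoted fields. awk '$3 > 100 { sum += $3 } END { print sum }' sums column 3 over the rows where it exceeds 100. For one-off table work, an awk one-liner is often quicker than writing a Python script.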
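
A minimal sketch of the three forms on a made-up three-row table. The names, products, and numbers are sample data, and the sum + 0 in the END block is an added tweak (not part of the one-liner above) so that the command prints 0 rather than an empty line when no row matches.

```shell
# Whitespace-separated sample table (rows are invented for illustration).
data=$(mktemp)
cat > "$data" <<'EOF'
alice widgets 150
bob gadgets 80
carol widgets 220
EOF

# Column 2 of every row; fields are split on runs of whitespace by default.
col2=$(awk '{ print $2 }' "$data")

# Pattern + action: add up column 3 only for rows where it exceeds 100.
# "sum + 0" forces a numeric result even if no row matched.
total=$(awk '$3 > 100 { sum += $3 } END { print sum + 0 }' "$data")

# -F',' switches the field separator to a comma for simple CSV input.
first=$(printf '%s\n' 'x,1' 'y,2' | awk -F',' '{ print $1 }')

printf '%s\n' "$col2" "$total" "$first"
rm -f "$data"
```

Here only the 150 and 220 rows pass the $3 > 100 filter, so the total is 370.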

Grounded on https://www.gnu.org/software/gawk/manual/gawk.html