awk

awk is a scripting language named after its developers (Aho, Weinberger, and Kernighan).

Things to know about awk:

awk refers to columns as fields, such as in the variables for number of fields (NF), input field separator (FS), and output field separator (OFS).
awk refers to rows as records, such as in the variables for record number (NR), input record separator (RS), and output record separator (ORS).
By default, awk treats any run of spaces or tabs as a field separator. If your input file uses a different field separator, you need to specify it using the -F flag.
awk has several built-in variables that can be used when writing code:
- $1 = field 1 ($2 = field 2, $3 = field 3, …)
- $0 = entire record
- NF = number of fields
- NR = record number of the current record (in an END block, the total number of records read)
- FS = input field separator; default is white space (i.e. space and tab)
- OFS = output field separator; default is single space
- RS = input record separator; default is new line
- ORS = output record separator; default is new line
- [0-9] = any single digit (a regular expression character class rather than a variable)
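
As a minimal sketch of how these variables behave, the commands below rebuild the sample data.txt used in the examples that follow in a temporary directory and print NR, NF, and the last field of each record. That the file is tab-separated is an assumption, and the names `tmpdir`, `summary`, and `total` are illustrative only.

```shell
# Recreate the sample file in a scratch directory (tab-separated is an assumption).
tmpdir=$(mktemp -d)
printf 'Genes\tSample1\tSample2\tSample3\n' >  "$tmpdir/data.txt"
printf 'GeneX\t3210\t5678\t689\n'           >> "$tmpdir/data.txt"
printf 'GeneY\t2354\t6700\t987\n'           >> "$tmpdir/data.txt"
printf 'GeneZ\t2315\t7890\t123\n'           >> "$tmpdir/data.txt"

# For each record: its record number, its field count, and its last field.
summary=$(awk '{ print NR, NF, $NF }' "$tmpdir/data.txt")
printf '%s\n' "$summary"

# In an END block, NR holds the total number of records read.
total=$(awk 'END { print NR }' "$tmpdir/data.txt")
printf '%s\n' "$total"
```

The first command prints one line per record (for example `1 4 Sample3` for the header), and the second prints `4`.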

Below are examples of how the awk command can be used for common tasks when processing data files.

cat data.txt # View data.txt

Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
GeneY   2354    6700    987
GeneZ   2315    7890    123

Print all of the fields (synonymous with awk '{print $0}' data.txt)

awk '{print}' data.txt

Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
GeneY   2354    6700    987
GeneZ   2315    7890    123

Print a particular field (e.g. field 1)

awk '{print $1}' data.txt

Genes
GeneX
GeneY
GeneZ

If the input file uses a comma (,) as the field separator instead of a space or tab, set the input field separator to a comma with -F, (synonymous with awk 'BEGIN { FS = "," } {print $1}' data.csv).

Try leaving out the -F, and see what happens.

cat data.csv # View comma separated file (.csv)

Genes,Sample1,Sample2,Sample3
GeneX,3210,5678,689
GeneY,2354,6700,987
GeneZ,2315,7890,123

awk -F, '{print $1}' data.csv

Genes
GeneX
GeneY
GeneZ
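
The separator can also be set inside the program rather than on the command line. The sketch below assumes data.csv has the contents shown above and recreates it in a temporary directory (`tmpdir`, `with_flag`, and `with_begin` are illustrative names). Assigning FS in a BEGIN block matters: BEGIN runs before the first record is read, so the separator already applies to the header line.

```shell
# Recreate the comma-separated sample file in a scratch directory.
tmpdir=$(mktemp -d)
printf 'Genes,Sample1,Sample2,Sample3\n' >  "$tmpdir/data.csv"
printf 'GeneX,3210,5678,689\n'           >> "$tmpdir/data.csv"
printf 'GeneY,2354,6700,987\n'           >> "$tmpdir/data.csv"
printf 'GeneZ,2315,7890,123\n'           >> "$tmpdir/data.csv"

# Set the separator with the -F flag ...
with_flag=$(awk -F, '{ print $1 }' "$tmpdir/data.csv")

# ... or assign FS in BEGIN, which runs before any input is read.
with_begin=$(awk 'BEGIN { FS = "," } { print $1 }' "$tmpdir/data.csv")

printf '%s\n' "$with_begin"
```

Both forms should print the same first column.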

Print multiple fields (e.g. field 1 and 3)

awk '{print $1,$3}' data.txt

Genes Sample2
GeneX 5678
GeneY 6700
GeneZ 7890
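
The comma in `print $1,$3` inserts the output field separator (OFS) between the fields, which is why the output above is separated by single spaces. A small sketch, assuming a tab-separated data.txt recreated in a temporary directory (`tmpdir` and `csv` are illustrative names), that sets OFS to a comma to produce CSV-style output:

```shell
# Recreate the sample file in a scratch directory (tab-separated is an assumption).
tmpdir=$(mktemp -d)
printf 'Genes\tSample1\tSample2\tSample3\n' >  "$tmpdir/data.txt"
printf 'GeneX\t3210\t5678\t689\n'           >> "$tmpdir/data.txt"
printf 'GeneY\t2354\t6700\t987\n'           >> "$tmpdir/data.txt"
printf 'GeneZ\t2315\t7890\t123\n'           >> "$tmpdir/data.txt"

# OFS is placed wherever a comma separates the arguments to print.
csv=$(awk 'BEGIN { OFS = "," } { print $1, $3 }' "$tmpdir/data.txt")
printf '%s\n' "$csv"
```

With the sample data, the first line of output is `Genes,Sample2`.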

Print the last field

awk '{print $NF}' data.txt

Sample3
689
987
123

Print all records after the first record (synonymous with awk 'NR!=1 {print}' /path/to/file)

awk 'NR>1 {print}' data.txt

GeneX   3210    5678    689
GeneY   2354    6700    987
GeneZ   2315    7890    123

Print a particular record (e.g. record 3)

awk 'NR==3 {print}' data.txt

GeneY   2354    6700    987

Print all records except for a particular record (e.g. not record 3)

awk 'NR!=3 {print}' data.txt

Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
GeneZ   2315    7890    123

Print a range of records (e.g. records 2 to 3)

awk 'NR==2, NR==3 {print}' data.txt

GeneX   3210    5678    689
GeneY   2354    6700    987

Print records with fewer than a certain number of fields (e.g. fewer than 4 fields)

cat data2.txt # View data2.txt

Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
GeneY   2354    6700    
GeneZ   2315    7890    123

awk 'NF<4 {print}' data2.txt

GeneY   2354    6700

Print records containing a certain string anywhere in record (e.g. abc)

cat data3.txt

abcd    dcba    efgh    aabc
bcde    dabc    cbad    abdc
cdef    defg    efgh    fghi

awk '/abc/ {print}' data3.txt

abcd    dcba    efgh    aabc
bcde    dabc    cbad    abdc

Print records starting with a certain string (e.g. abc)

awk '/^abc/ {print}' data3.txt

abcd    dcba    efgh    aabc

Print records ending with a certain string (e.g. abc)

> One caveat between macOS and Windows (even when using wsl) is that the line ending character in macOS (i.e. unix) is \n while the line ending character in Windows is \r\n. This means that a text file made on a Mac may have a different line ending character than Windows recognizes (and vice versa). To avoid this problem…

awk '/abc$/ {print}' data3.txt

abcd    dcba    efgh    aabc
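
The line-ending caveat above affects the `$` anchor directly: with \r\n endings, a carriage return sits between the last field and the end of the record. One possible way to handle this (an assumption, not necessarily the remedy the note has in mind) is to strip a trailing \r inside awk before matching. The file name `data3_crlf.txt` and the variables `tmpdir`, `naive`, and `fixed` are hypothetical.

```shell
# Write a copy of data3.txt with Windows-style (\r\n) line endings.
tmpdir=$(mktemp -d)
printf 'abcd\tdcba\tefgh\taabc\r\n' >  "$tmpdir/data3_crlf.txt"
printf 'bcde\tdabc\tcbad\tabdc\r\n' >> "$tmpdir/data3_crlf.txt"
printf 'cdef\tdefg\tefgh\tfghi\r\n' >> "$tmpdir/data3_crlf.txt"

# The carriage return follows "abc", so the anchored pattern matches nothing.
naive=$(awk '/abc$/ {print}' "$tmpdir/data3_crlf.txt")

# Remove a trailing carriage return from each record, then test the pattern.
fixed=$(awk '{ sub(/\r$/, "") } /abc$/ {print}' "$tmpdir/data3_crlf.txt")
printf '%s\n' "$fixed"
```

Here `naive` comes back empty, while `fixed` holds the first record, as in the output above.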

Print records that don’t contain a certain string anywhere in record (e.g. abc)

awk '!/abc/ {print}' data3.txt

cdef    defg    efgh    fghi

Print records that don’t start with a certain string (e.g. abc)

awk '!/^abc/ {print}' data3.txt

bcde    dabc    cbad    abdc
cdef    defg    efgh    fghi

Print records that don’t end with a certain string (e.g. abc)

awk '!/abc$/ {print}' data3.txt

bcde    dabc    cbad    abdc
cdef    defg    efgh    fghi

Print records where a particular field contains a string (e.g. abc in field 1)

awk '$1 ~ /abc/ {print}' data3.txt

abcd    dcba    efgh    aabc

Print records where a particular field starts with a string (e.g. abc in field 1)

awk '$1 ~ /^abc/ {print}' data3.txt

abcd    dcba    efgh    aabc

Print records where a particular field ends with a string (e.g. abc in field 4)

awk '$4 ~ /abc$/ {print}' data3.txt

abcd    dcba    efgh    aabc

Print records where a particular field starts with any number (e.g. field 1)

cat data4.txt

1ABC    D1CB    EF1G    AAB1
b2cd    da2b    cba2    2abc
CD3E    DEF3    3EFG    F2GH

awk '$1 ~ /^[0-9]/ {print}' data4.txt

1ABC    D1CB    EF1G    AAB1

Print records where a particular field ends with any number (e.g. field 1)

awk '$1 ~ /[0-9]$/ {print}' data4.txt

This prints nothing for data4.txt, because none of its first fields ends with a digit.

Ignore case when looking for records containing a string (e.g. abc)

awk 'tolower($0) ~ /abc/ {print}' data4.txt

1ABC    D1CB    EF1G    AAB1
b2cd    da2b    cba2    2abc

Print records that contain a certain value in a particular field (e.g. the number 3210 in field 2)

cat data.txt

Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
GeneY   2354    6700    987
GeneZ   2315    7890    123

awk '$2==3210 {print}' data.txt

GeneX   3210    5678    689

Print records that do not contain a certain value in a particular field (e.g. not the number 3210 in field 2)

awk '$2!=3210 {print}' data.txt

Genes   Sample1 Sample2 Sample3
GeneY   2354    6700    987
GeneZ   2315    7890    123

Print records that contain a value greater than a certain value in a particular field (e.g. >2354 in field 2)

awk '$2>2354 {print}' data.txt

Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
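
The header line appears in the output above because its second field, Sample1, is not numeric, so awk compares it with 2354 as a string, and "S" sorts after "2" in the usual character ordering. A sketch of restricting the comparison to the data rows by adding NR>1, assuming a tab-separated data.txt recreated in a temporary directory (`tmpdir` and `rows` are illustrative names):

```shell
# Recreate the sample file in a scratch directory (tab-separated is an assumption).
tmpdir=$(mktemp -d)
printf 'Genes\tSample1\tSample2\tSample3\n' >  "$tmpdir/data.txt"
printf 'GeneX\t3210\t5678\t689\n'           >> "$tmpdir/data.txt"
printf 'GeneY\t2354\t6700\t987\n'           >> "$tmpdir/data.txt"
printf 'GeneZ\t2315\t7890\t123\n'           >> "$tmpdir/data.txt"

# Skip the header (record 1) so only numeric values are compared.
rows=$(awk 'NR>1 && $2>2354 {print}' "$tmpdir/data.txt")
printf '%s\n' "$rows"
```

Only the GeneX record is printed.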

Print records that contain a value less than a certain value in a particular field (e.g. <2354 in field 2)

awk '$2<2354 {print}' data.txt

GeneZ   2315    7890    123

Print records that contain a value less than or equal to a certain value in a particular field (e.g. <=2354 in field 2)

awk '$2<=2354 {print}' data.txt

GeneY   2354    6700    987
GeneZ   2315    7890    123

Sum values in a field (e.g. field 2)

awk '{sum+=$2;} END{print sum;}' data.txt

If your file has a header row, add NR>1 to skip it; otherwise a numeric header value would be included in the sum

awk 'NR>1 {sum+=$2;} END{print sum;}' data.txt
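
To illustrate why the header matters, the sketch below uses a hypothetical file, `years.txt`, whose header fields are themselves numbers (the file and the variable names are assumptions made for this example). A non-numeric header such as Sample1 counts as 0 in the sum, but a numeric one is added like any other value.

```shell
# A hypothetical file whose header row contains numbers (years).
tmpdir=$(mktemp -d)
printf 'Gene\t2020\t2021\n' >  "$tmpdir/years.txt"
printf 'GeneX\t10\t20\n'    >> "$tmpdir/years.txt"
printf 'GeneY\t30\t40\n'    >> "$tmpdir/years.txt"

# Without NR>1 the header value 2020 is added to the sum: 2020 + 10 + 30.
with_header=$(awk '{ sum += $2 } END { print sum }' "$tmpdir/years.txt")

# With NR>1 only the data rows are summed: 10 + 30.
without_header=$(awk 'NR>1 { sum += $2 } END { print sum }' "$tmpdir/years.txt")

printf '%s %s\n' "$with_header" "$without_header"
```

This prints `2060 40`.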

Remove blank lines

cat data5.txt

Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
GeneY   2354    6700    987

GeneZ   2315    7890    123

awk 'NF' data5.txt

Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
GeneY   2354    6700    987
GeneZ   2315    7890    123
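
`awk 'NF'` works because a pattern with no action prints every record for which the pattern is true, and NF is 0 (false) on a line with no fields. The sketch below, using a hypothetical file `blank.txt` (names are assumptions for illustration), shows that this also drops lines containing only spaces, whereas a test on the record's length keeps them.

```shell
# A hypothetical file with one empty line and one line holding only spaces.
tmpdir=$(mktemp -d)
printf 'GeneX\t3210\n\n   \nGeneY\t2354\n' > "$tmpdir/blank.txt"

# NF is 0 for both the empty line and the spaces-only line, so both are dropped.
kept=$(awk 'NF' "$tmpdir/blank.txt")
printf '%s\n' "$kept"

# length($0) is 3 for the spaces-only line, so only the truly empty line is dropped.
kept_by_length=$(awk 'length($0) > 0' "$tmpdir/blank.txt")
```

Choosing between the two depends on whether whitespace-only lines should count as blank.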

Print the record number at beginning of record

awk '{print NR,$0}' data.txt

1 Genes Sample1 Sample2 Sample3
2 GeneX 3210    5678    689
3 GeneY 2354    6700    987
4 GeneZ 2315    7890    123