awk

awk is a scripting language named after its developers (Aho, Weinberger, and Kernighan).


Things to know about awk:


The examples below show common ways to use the awk command when processing data files.

cat data.txt # View data.txt
Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
GeneY   2354    6700    987
GeneZ   2315    7890    123

Print all of the fields of every record (synonymous with awk '{print $0}' data.txt)

awk '{print}' data.txt
Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
GeneY   2354    6700    987
GeneZ   2315    7890    123

Print a particular field (e.g. field 1)

awk '{print $1}' data.txt
Genes
GeneX
GeneY
GeneZ

If the input file uses a comma (,) as the field separator instead of spaces or tabs, set the input field separator to a comma with -F, (synonymous with awk 'BEGIN { FS = "," } {print $1}' data.csv).

Try leaving out the -F, and see what happens.

cat data.csv # View comma separated file (.csv)
Genes,Sample1,Sample2,Sample3
GeneX,3210,5678,689
GeneY,2354,6700,987
GeneZ,2315,7890,123
awk -F, '{print $1}' data.csv
Genes
GeneX
GeneY
GeneZ
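
As a minimal sketch of the two points above, the commands below recreate the sample data.csv, set the separator through FS in a BEGIN block (so it already applies to the first record), and then run the same print with no separator set. Because the lines contain no spaces or tabs, the second command is expected to treat each whole line as field 1. The variable names with_begin and without_fs are only illustrative.

```shell
# Recreate the sample CSV shown above so the commands can be run as-is.
printf '%s\n' \
  'Genes,Sample1,Sample2,Sample3' \
  'GeneX,3210,5678,689' \
  'GeneY,2354,6700,987' \
  'GeneZ,2315,7890,123' > data.csv

# Setting FS in a BEGIN block takes effect before the first record is read.
with_begin=$(awk 'BEGIN { FS = "," } { print $1 }' data.csv)
echo "$with_begin"

# With the default separator (spaces/tabs), each line is a single field.
without_fs=$(awk '{ print $1 }' data.csv)
echo "$without_fs"
```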

Print multiple fields (e.g. field 1 and 3)

awk '{print $1,$3}' data.txt
Genes Sample2
GeneX 5678
GeneY 6700
GeneZ 7890
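
The comma in print $1,$3 joins the fields with awk's output field separator (OFS), which defaults to a single space. If tab-separated output is wanted instead, one option is to set OFS in a BEGIN block. The sketch below assumes data.txt is tab-separated and recreates it first; the variable name tabbed is only illustrative.

```shell
# Recreate data.txt (assumed here to be tab-separated).
printf '%s\t%s\t%s\t%s\n' \
  Genes Sample1 Sample2 Sample3 \
  GeneX 3210 5678 689 \
  GeneY 2354 6700 987 \
  GeneZ 2315 7890 123 > data.txt

# OFS controls what print puts between comma-separated arguments.
tabbed=$(awk 'BEGIN { OFS = "\t" } { print $1, $3 }' data.txt)
echo "$tabbed"
```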

Print the last field

awk '{print $NF}' data.txt
Sample3
689
987
123
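
Since NF holds the number of fields in the current record, the same idea extends to fields counted from the end. As a small sketch (again assuming a tab-separated data.txt, recreated here), $(NF-1) selects the second-to-last field; the variable name second_last is only illustrative.

```shell
# Recreate data.txt (assumed here to be tab-separated).
printf '%s\t%s\t%s\t%s\n' \
  Genes Sample1 Sample2 Sample3 \
  GeneX 3210 5678 689 \
  GeneY 2354 6700 987 \
  GeneZ 2315 7890 123 > data.txt

# $(NF-1) is the field just before the last one.
second_last=$(awk '{ print $(NF-1) }' data.txt)
echo "$second_last"
```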

Print all records after the first record (synonymous with awk 'NR!=1 {print}' /path/to/file)

awk 'NR>1 {print}' data.txt
GeneX   3210    5678    689
GeneY   2354    6700    987
GeneZ   2315    7890    123

Print a particular record (e.g. record 3)

awk 'NR==3 {print}' data.txt
GeneY   2354    6700    987

Print all records except for a particular record (e.g. not record 3)

awk 'NR!=3 {print}' data.txt
Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
GeneZ   2315    7890    123

Print a range of records (e.g. records 2 to 3)

awk 'NR==2, NR==3 {print}' data.txt
GeneX   3210    5678    689
GeneY   2354    6700    987
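
The comma form above is a range pattern: it starts printing at the record matching the first condition and stops after the record matching the second. For a simple span of record numbers, a plain condition on NR should give the same result. The sketch below assumes a tab-separated data.txt (recreated here) and compares the two forms; the variable names are only illustrative.

```shell
# Recreate data.txt (assumed here to be tab-separated).
printf '%s\t%s\t%s\t%s\n' \
  Genes Sample1 Sample2 Sample3 \
  GeneX 3210 5678 689 \
  GeneY 2354 6700 987 \
  GeneZ 2315 7890 123 > data.txt

# Range pattern: from the record where NR==2 through the record where NR==3.
range_pattern=$(awk 'NR==2, NR==3 { print }' data.txt)

# Plain condition: every record whose number lies between 2 and 3.
bounds=$(awk 'NR>=2 && NR<=3 { print }' data.txt)

echo "$range_pattern"
echo "$bounds"
```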

Print records with fewer than a certain number of fields (e.g. fewer than 4 fields)

cat data2.txt # View data2.txt
Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
GeneY   2354    6700    
GeneZ   2315    7890    123
awk 'NF<4 {print}' data2.txt
GeneY   2354    6700    
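
When checking a file for short records, it can help to list the field count of every record first. The sketch below recreates data2.txt (assumed tab-separated, with the third data row missing its last value) and prints each record number next to its NF; the variable name counts is only illustrative.

```shell
# Recreate data2.txt: the GeneY row has only three values.
printf 'Genes\tSample1\tSample2\tSample3\n' >  data2.txt
printf 'GeneX\t3210\t5678\t689\n'           >> data2.txt
printf 'GeneY\t2354\t6700\t\n'              >> data2.txt
printf 'GeneZ\t2315\t7890\t123\n'           >> data2.txt

# Print the record number followed by the number of fields in that record.
counts=$(awk '{ print NR, NF }' data2.txt)
echo "$counts"
```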

Print records containing a certain string anywhere in record (e.g. abc)

cat data3.txt
abcd    dcba    efgh    aabc
bcde    dabc    cbad    abdc
cdef    defg    efgh    fghi
awk '/abc/ {print}' data3.txt
abcd    dcba    efgh    aabc
bcde    dabc    cbad    abdc

Print records starting with a certain string (e.g. abc)

awk '/^abc/ {print}' data3.txt
abcd    dcba    efgh    aabc

Print records ending with a certain string (e.g. abc)

> One caveat between macOS and Windows (even when using WSL) is that the line ending in macOS (i.e. Unix) is \n, while the line ending in Windows is \r\n. This means that a text file made on a Mac may have different line endings than Windows expects (and vice versa). To avoid this problem…

awk '/abc$/ {print}' data3.txt
abcd    dcba    efgh    aabc
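
Line endings matter for this pattern because $ anchors at the very end of the record: in a file with Windows line endings, a carriage return sits between abc and the end of the line, so /abc$/ is not expected to match. One possible workaround inside awk itself (an assumption here, not necessarily the approach the note above has in mind) is to strip a trailing \r from each record before matching. The file name data3_crlf.txt and the variable names are hypothetical.

```shell
# Build a copy of data3.txt with Windows (\r\n) line endings.
printf 'abcd\tdcba\tefgh\taabc\r\n' >  data3_crlf.txt
printf 'bcde\tdabc\tcbad\tabdc\r\n' >> data3_crlf.txt
printf 'cdef\tdefg\tefgh\tfghi\r\n' >> data3_crlf.txt

# The trailing carriage return keeps /abc$/ from matching.
no_strip=$(awk '/abc$/ { print }' data3_crlf.txt)

# Remove a trailing \r from each record first, then match as before.
stripped=$(awk '{ sub(/\r$/, "") } /abc$/ { print }' data3_crlf.txt)
echo "$stripped"
```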

Print records that don’t contain a certain string anywhere in record (e.g. abc)

awk '!/abc/ {print}' data3.txt
cdef    defg    efgh    fghi

Print records that don’t start with a certain string (e.g. abc)

awk '!/^abc/ {print}' data3.txt
bcde    dabc    cbad    abdc
cdef    defg    efgh    fghi

Print records that don’t end with a certain string (e.g. abc)

awk '!/abc$/ {print}' data3.txt
bcde    dabc    cbad    abdc
cdef    defg    efgh    fghi

Print records where a particular field contains a string (e.g. abc in field 1)

awk '$1 ~ /abc/ {print}' data3.txt
abcd    dcba    efgh    aabc

Print records where a particular field starts with a string (e.g. abc in field 1)

awk '$1 ~ /^abc/ {print}' data3.txt
abcd    dcba    efgh    aabc

Print records where a particular field ends with a string (e.g. abc in field 4)

awk '$4 ~ /abc$/ {print}' data3.txt
abcd    dcba    efgh    aabc

Print records where a particular field starts with any number (e.g. field 1)

cat data4.txt
1ABC    D1CB    EF1G    AAB1
b2cd    da2b    cba2    2abc
CD3E    DEF3    3EFG    F2GH
awk '$1 ~ /^[0-9]/ {print}' data4.txt
1ABC    D1CB    EF1G    AAB1

Print records where a particular field ends with any number (e.g. field 1). No field 1 value in data4.txt ends with a digit, so this command prints nothing.

awk '$1 ~ /[0-9]$/ {print}' data4.txt

Ignore case when looking for records containing a string (e.g. abc)

awk 'tolower($0) ~ /abc/ {print}' data4.txt
1ABC    D1CB    EF1G    AAB1
b2cd    da2b    cba2    2abc

Print records that contain a certain value in a particular field (e.g. the number 3210 in field 2)

cat data.txt
Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
GeneY   2354    6700    987
GeneZ   2315    7890    123
awk '$2==3210 {print}' data.txt
GeneX   3210    5678    689

Print records that do not contain a certain value in a particular field (e.g. not the number 3210 in field 2)

awk '$2!=3210 {print}' data.txt
Genes   Sample1 Sample2 Sample3
GeneY   2354    6700    987
GeneZ   2315    7890    123

Print records that contain a value greater than a certain value in a particular field (e.g. >2354 in field 2)

awk '$2>2354 {print}' data.txt
Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
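
The header row appears in the output above because its field 2 (Sample1) is not a number, so awk falls back to comparing it with 2354 as text, and "S" sorts after "2". If the header should be left out, one option is to combine the comparison with NR>1. This is a minimal sketch assuming a tab-separated data.txt (recreated here); the variable name above_cutoff is only illustrative.

```shell
# Recreate data.txt (assumed here to be tab-separated).
printf '%s\t%s\t%s\t%s\n' \
  Genes Sample1 Sample2 Sample3 \
  GeneX 3210 5678 689 \
  GeneY 2354 6700 987 \
  GeneZ 2315 7890 123 > data.txt

# Skip the header (record 1), then apply the numeric comparison.
above_cutoff=$(awk 'NR>1 && $2>2354 { print }' data.txt)
echo "$above_cutoff"
```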

Print records that contain a value less than a certain value in a particular field (e.g. <2354 in field 2)

awk '$2<2354 {print}' data.txt
GeneZ   2315    7890    123

Print records that contain a value less than or equal to a certain value in a particular field (e.g. <=2354 in field 2)

awk '$2<=2354 {print}' data.txt
GeneY   2354    6700    987
GeneZ   2315    7890    123

Sum values in a field (e.g. field 2)

awk '{sum+=$2;} END{print sum;}' data.txt
7879

Remember to add NR>1 if your file has a header row, so that a numeric header value is not added to the sum

awk 'NR>1 {sum+=$2;} END{print sum;}' data.txt
7879
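
The same accumulate-then-report pattern extends to other summaries. As a sketch, the command below keeps a count alongside the sum and prints the mean of field 2 in the END block, skipping the header. It assumes a tab-separated data.txt (recreated here); the variable names n and mean are only illustrative.

```shell
# Recreate data.txt (assumed here to be tab-separated).
printf '%s\t%s\t%s\t%s\n' \
  Genes Sample1 Sample2 Sample3 \
  GeneX 3210 5678 689 \
  GeneY 2354 6700 987 \
  GeneZ 2315 7890 123 > data.txt

# Sum field 2 and count the data rows, then print the mean to two decimals.
mean=$(awk 'NR>1 { sum += $2; n++ } END { if (n > 0) printf "%.2f\n", sum / n }' data.txt)
echo "$mean"
```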

Remove blank lines

cat data5.txt
Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
GeneY   2354    6700    987

GeneZ   2315    7890    123
awk 'NF' data5.txt
Genes   Sample1 Sample2 Sample3
GeneX   3210    5678    689
GeneY   2354    6700    987
GeneZ   2315    7890    123

Print the record number at beginning of record

awk '{print NR,$0}' data.txt
1 Genes Sample1 Sample2 Sample3
2 GeneX 3210    5678    689
3 GeneY 2354    6700    987
4 GeneZ 2315    7890    123