cat data.txt # View data.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneY 2354 6700 987
GeneZ 2315 7890 123
awk is a scripting language named after its developers (Aho, Weinberger, and Kernighan).
Things to know about awk:
awk refers to columns as fields, as in the variables for number of fields (NF), input field separator (FS), and output field separator (OFS).
awk refers to rows as records, as in the variables for record number (NR), input record separator (RS), and output record separator (ORS).
By default, awk recognizes a space or tab as a field separator. If your input file has field separators other than a space or a tab, you need to specify it using the -F flag.
awk has several built-in variables that can be used when writing code:
$1 = field 1 ($2 = field 2, $3 = field 3, …)
$0 = entire record
NF = number of fields
NR = record number of the current record (in an END block, the total number of records read)
FS = input field separator; default is white space (i.e. space and tab)
OFS = output field separator; default is single space
RS = input record separator; default is new line
ORS = output record separator; default is new line
[0-9] = any single digit (a regular expression bracket expression, not a variable)
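As a quick illustration of how several of these variables interact, the sketch below prints the record number, the field count, and the first and last field of each record, joined by a tab. The two-line inline sample and the shell variable name summary are assumptions made for this example only.

```shell
# Two records from the sample table, supplied inline so the sketch is self-contained.
# OFS is set to a tab in BEGIN, so the comma-separated print arguments are joined by tabs.
summary=$(printf 'Genes Sample1 Sample2 Sample3\nGeneX 3210 5678 689\n' |
  awk 'BEGIN { OFS = "\t" } { print NR, NF, $1, $NF }')
printf '%s\n' "$summary"
```

Each output line holds NR, NF, $1, and $NF in that order, which makes it easy to see that NR changes per record while NF stays at 4 for this sample.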
Below are examples of how the awk command can be used to select, filter, and summarize records when processing data files.
cat data.txt # View data.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneY 2354 6700 987
GeneZ 2315 7890 123
Print all of the fields (synonymous with awk '{print $0}' data.txt)
awk '{print}' data.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneY 2354 6700 987
GeneZ 2315 7890 123
Print a particular field (e.g. field 1)
awk '{print $1}' data.txt
Genes
GeneX
GeneY
GeneZ
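If the input file uses a comma (,) as the field separator instead of a space or tab, set the input field separator to a comma with -F, (synonymous with awk 'BEGIN { FS = "," } {print $1}' data.csv; FS must be set in a BEGIN block, otherwise the first record is still split using the default separator).
Try leaving out the -F, and see what happens.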
cat data.csv # View comma separated file (.csv)
Genes,Sample1,Sample2,Sample3
GeneX,3210,5678,689
GeneY,2354,6700,987
GeneZ,2315,7890,123
awk -F, '{print $1}' data.csv
Genes
GeneX
GeneY
GeneZ
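The input and output separators can also be combined to convert between formats. The following is a minimal sketch, assuming a simple CSV with no quoted fields or embedded commas; the inline sample and the variable name tsv are made up for this example.

```shell
# Inline copy of the first two lines of data.csv, so the sketch runs on its own.
# -F, splits input on commas; OFS sets the output separator to a tab.
# Assigning $1 to itself makes awk rebuild the record with OFS before printing.
tsv=$(printf 'Genes,Sample1,Sample2,Sample3\nGeneX,3210,5678,689\n' |
  awk -F, 'BEGIN { OFS = "\t" } { $1 = $1; print }')
printf '%s\n' "$tsv"
```

Without the $1 = $1 assignment, print would emit the record unchanged, commas included, because awk only applies OFS when it rebuilds the record or prints separate arguments.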
Print multiple fields (e.g. field 1 and 3)
awk '{print $1,$3}' data.txt
Genes Sample2
GeneX 5678
GeneY 6700
GeneZ 7890
Print the last field
awk '{print $NF}' data.txt
Sample3
689
987
123
Print all records after the first record (synonymous with awk 'NR!=1 {print}' /path/to/file)
awk 'NR>1 {print}' data.txt
GeneX 3210 5678 689
GeneY 2354 6700 987
GeneZ 2315 7890 123
Print a particular record (e.g. record 3)
awk 'NR==3 {print}' data.txt
GeneY 2354 6700 987
Print all records except for a particular record (e.g. not record 3)
awk 'NR!=3 {print}' data.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneZ 2315 7890 123
Print a range of records (e.g. records 2 to 3)
awk 'NR==2, NR==3 {print}' data.txt
GeneX 3210 5678 689
GeneY 2354 6700 987
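The same range can also be written as two comparisons joined with &&, which some readers find easier to adapt. This is a sketch using an inline copy of data.txt; the variable name middle is an assumption for the example.

```shell
# Inline copy of data.txt so the sketch runs on its own.
# The two comparisons joined with && select records 2 through 3 inclusive.
middle=$(printf 'Genes Sample1 Sample2 Sample3\nGeneX 3210 5678 689\nGeneY 2354 6700 987\nGeneZ 2315 7890 123\n' |
  awk 'NR >= 2 && NR <= 3 { print }')
printf '%s\n' "$middle"
```

For a fixed range of record numbers, both forms should select the same records; the && form can additionally be combined with other conditions in a single expression.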
Print records with fewer than a certain number of fields (e.g. fewer than 4 fields)
cat data2.txt # View data2.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneY 2354 6700
GeneZ 2315 7890 123
awk 'NF<4 {print}' data2.txt
GeneY 2354 6700
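When checking a larger file, it can help to report where the incomplete records are rather than print them. The sketch below assumes every record should have exactly 4 fields; the message wording and the variable name report are illustrative choices.

```shell
# Inline copy of data2.txt, in which the GeneY record is missing its last field.
# For every record whose field count differs from 4, report the record number and the count.
report=$(printf 'Genes Sample1 Sample2 Sample3\nGeneX 3210 5678 689\nGeneY 2354 6700\nGeneZ 2315 7890 123\n' |
  awk 'NF != 4 { print "record " NR " has " NF " fields" }')
printf '%s\n' "$report"
```

Here the only line reported is record 3, the GeneY row.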
Print records containing a certain string anywhere in record (e.g. abc)
cat data3.txt
abcd dcba efgh aabc
bcde dabc cbad abdc
cdef defg efgh fghi
awk '/abc/ {print}' data3.txt
abcd dcba efgh aabc
bcde dabc cbad abdc
Print records starting with a certain string (e.g. abc)
awk '/^abc/ {print}' data3.txt
abcd dcba efgh aabc
Print records ending with a certain string (e.g. abc)
> One caveat between macOS and Windows (even when using WSL) is that the line ending on macOS (i.e. Unix) is \n, while the line ending on Windows is \r\n. This means that a text file made on a Mac may have different line endings than Windows recognizes (and vice versa). To avoid this problem…
awk '/abc$/ {print}' data3.txt
abcd dcba efgh aabc
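To see how a Windows line ending can affect an end-of-record match, the sketch below feeds awk a single record terminated by \r\n. Stripping the carriage return with sub() is one possible workaround shown for illustration only, not necessarily the approach the note above has in mind; the variable names are hypothetical.

```shell
# One record with a Windows-style \r\n line ending (inline sample, assumed for illustration).
# The record awk sees ends in \r, so the pattern anchored with $ does not match.
crlf_match=$(printf 'abcd dcba efgh aabc\r\n' | awk '/abc$/ { print }')

# Strip a trailing carriage return from each record before testing the pattern.
fixed_match=$(printf 'abcd dcba efgh aabc\r\n' |
  awk '{ sub(/\r$/, "") } /abc$/ { print }')

printf 'without stripping: [%s]\n' "$crlf_match"
printf 'with stripping:    [%s]\n' "$fixed_match"
```

The first command prints nothing, while the second prints the record once the trailing \r has been removed.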
Print records that don’t contain a certain string anywhere in record (e.g. abc)
awk '!/abc/ {print}' data3.txt
cdef defg efgh fghi
Print records that don’t start with a certain string (e.g. abc)
awk '!/^abc/ {print}' data3.txt
bcde dabc cbad abdc
cdef defg efgh fghi
Print records that don’t end with a certain string (e.g. abc)
awk '!/abc$/ {print}' data3.txt
bcde dabc cbad abdc
cdef defg efgh fghi
Print records where a particular field contains a string (e.g. abc in field 1)
awk '$1 ~ /abc/ {print}' data3.txt
abcd dcba efgh aabc
Print records where a particular field starts with a string (e.g. abc in field 1)
awk '$1 ~ /^abc/ {print}' data3.txt
abcd dcba efgh aabc
Print records where a particular field ends with a string (e.g. abc in field 4)
awk '$4 ~ /abc$/ {print}' data3.txt
abcd dcba efgh aabc
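Field-level matches can be negated as well, using the !~ operator in place of ~. A minimal sketch, using an inline copy of data3.txt and a hypothetical variable name no_abc:

```shell
# Inline copy of data3.txt.
# !~ is the negated match operator: keep records whose first field does NOT contain abc.
no_abc=$(printf 'abcd dcba efgh aabc\nbcde dabc cbad abdc\ncdef defg efgh fghi\n' |
  awk '$1 !~ /abc/ { print }')
printf '%s\n' "$no_abc"
```

Note that the second record is kept even though its second field (dabc) contains abc, because only field 1 is tested.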
Print records where a particular field starts with any number (e.g. field 1)
cat data4.txt
1ABC D1CB EF1G AAB1
b2cd da2b cba2 2abc
CD3E DEF3 3EFG F2GH
awk '$1 ~ /^[0-9]/ {print}' data4.txt
1ABC D1CB EF1G AAB1
Print records where a particular field ends with any number (e.g. field 1)
awk '$1 ~ /[0-9]$/ {print}' data4.txt
(no output, because no field 1 value in data4.txt ends with a digit)
Ignore case when looking for records containing a string (e.g. abc)
awk 'tolower($0) ~ /abc/ {print}' data4.txt
1ABC D1CB EF1G AAB1
b2cd da2b cba2 2abc
Print records that contain a certain value in a particular field (e.g. the number 3210 in field 2)
cat data.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneY 2354 6700 987
GeneZ 2315 7890 123
awk '$2==3210 {print}' data.txt
GeneX 3210 5678 689
Print records that do not contain a certain value in a particular field (e.g. not the number 3210 in field 2)
awk '$2!=3210 {print}' data.txt
Genes Sample1 Sample2 Sample3
GeneY 2354 6700 987
GeneZ 2315 7890 123
Print records that contain a value greater than a certain value in a particular field (e.g. >2354 in field 2)
awk '$2>2354 {print}' data.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
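The header line appears in the output above because its second field, Sample1, is not a number, so awk falls back to comparing the two values as strings, and S sorts after 2 in the usual character ordering. The sketch below shows one way to keep the header out, assuming the first record is always the header; the variable name numeric_only is made up for the example.

```shell
# Inline copy of data.txt.
# NR > 1 skips the header, so only data records take part in the comparison.
numeric_only=$(printf 'Genes Sample1 Sample2 Sample3\nGeneX 3210 5678 689\nGeneY 2354 6700 987\nGeneZ 2315 7890 123\n' |
  awk 'NR > 1 && $2 > 2354 { print }')
printf '%s\n' "$numeric_only"
```

Only the GeneX record remains, since 3210 is the only value in field 2 greater than 2354.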
Print records that contain a value less than a certain value in a particular field (e.g. <2354 in field 2)
awk '$2<2354 {print}' data.txt
GeneZ 2315 7890 123
Print records that contain a value less than or equal to a certain value in a particular field (e.g. <=2354 in field 2)
awk '$2<=2354 {print}' data.txt
GeneY 2354 6700 987
GeneZ 2315 7890 123
Sum values in a field (e.g. field 2)
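awk '{sum+=$2;} END{print sum;}' data.txt
7879
Remember to add NR>1 if your file has a header row, in case the header values are numeric. In data.txt the total is the same either way, because the non-numeric header Sample1 counts as 0.
awk 'NR>1 {sum+=$2;} END{print sum;}' data.txt
7879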
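The same accumulate-then-report pattern extends to other summaries. As a sketch, the mean of field 2 can be computed by also counting the data records; the counter n, the two-decimal output format, and the variable name mean are choices made for this example.

```shell
# Inline copy of data.txt.
# Skip the header, accumulate field 2, count the data records, then print the mean.
# The n > 0 guard avoids a division by zero on a file with no data records.
mean=$(printf 'Genes Sample1 Sample2 Sample3\nGeneX 3210 5678 689\nGeneY 2354 6700 987\nGeneZ 2315 7890 123\n' |
  awk 'NR > 1 { sum += $2; n++ } END { if (n > 0) printf "%.2f\n", sum / n }')
printf '%s\n' "$mean"
```

With the three data records, this prints 7879 divided by 3, rounded to two decimals.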
Remove blank lines
cat data5.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneY 2354 6700 987
GeneZ 2315 7890 123
awk 'NF' data5.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneY 2354 6700 987
GeneZ 2315 7890 123
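The bare pattern NF works here because a blank record has zero fields, and awk treats 0 as false, so the default print action is skipped. A small sketch with made-up inline data, including a line that contains only spaces:

```shell
# Inline sample with one empty line and one line holding only spaces (assumed data).
# NF is 0 for both, so the bare pattern NF is false and neither line is printed.
cleaned=$(printf 'GeneX 3210\n\n   \nGeneY 2354\n' | awk 'NF')
printf '%s\n' "$cleaned"
```

With the default field separator, whitespace-only lines are dropped along with truly empty ones.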
Print the record number at beginning of record
awk '{print NR,$0}' data.txt
1 Genes Sample1 Sample2 Sample3
2 GeneX 3210 5678 689
3 GeneY 2354 6700 987
4 GeneZ 2315 7890 123
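If the header should not be counted, the numbering can be offset by one. This is a sketch assuming the first record is the header; the variable name numbered is hypothetical.

```shell
# Inline copy of data.txt.
# Print the header unchanged, then number the data records starting from 1.
numbered=$(printf 'Genes Sample1 Sample2 Sample3\nGeneX 3210 5678 689\nGeneY 2354 6700 987\nGeneZ 2315 7890 123\n' |
  awk 'NR == 1 { print; next } { print NR - 1, $0 }')
printf '%s\n' "$numbered"
```

The next statement stops processing of the header record after it is printed, so the second rule only ever sees data records.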