cat data.txt # View data.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneY 2354 6700 987
GeneZ 2315 7890 123
awk is scripting language named after its developers (Aho, Weinberger, and Kernighan) (usage)
Things to know about awk:
awk refers to columns as field, such as in the variables for number of fields (NF
), input field separator (FS
), and output field separator (OFS
).
awk refers to rows as records, such as in the variables for record number (NR
), input record separator (RS
), and output record separator (ORS
).
By default, awk recognizes a space or tab as a field separator. If your input file has field separators other than a space or a tab, you need to specify it using the -F
flag.
awk has several built-in variables that can be used when writing code:
$1
= field 1 ($2 = field 2, $3 = field 3, …)
$0
= entire record
NF
= number of fields
NR
= number of records
FS
= input field separator; default is white space (i.e. space and tab)
OFS
= output field separator; default is single space
RS
= input record separator; default is new line
ORS
= output record separator; default is new line
[0-9]
= any number
Below are examples of how the awk command can be used to achieve lots of desired outcomes when processing data files.
cat data.txt # View data.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneY 2354 6700 987
GeneZ 2315 7890 123
Printing all of the fields (synonymous with awk '{print $0}' data.txt
)
awk '{print}' data.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneY 2354 6700 987
GeneZ 2315 7890 123
Print a particular field (e.g. field 1)
awk '{print $1}' data.txt
Genes
GeneX
GeneY
GeneZ
If input file uses a comma (,
) as a field separator instead of space or tab, set input field separator as (,
) (synonymous with awk '{ FS = "," } ; {print $1}' data.csv
).
Try leaving out the
-F,
and see what happens.
cat data.csv # View comma separated file (.csv)
Genes,Sample1,Sample2,Sample3
GeneX,3210,5678,689
GeneY,2354,6700,987
GeneZ,2315,7890,123
awk -F, '{print $1}' data.csv
Genes
GeneX
GeneY
GeneZ
Print multiple fields (e.g. field 1 and 3)
awk '{print $1,$3}' data.txt
Genes Sample2
GeneX 5678
GeneY 6700
GeneZ 7890
Print the last field
awk '{print $NF}' data.txt
Sample3
689
987
123
Print all records after the first record (synonymous with awk 'NR!=1 {print}' /path/to/file
)
awk 'NR>1 {print}' data.txt
GeneX 3210 5678 689
GeneY 2354 6700 987
GeneZ 2315 7890 123
Print a particular record (e.g. record 3)
awk 'NR==3 {print}' data.txt
GeneY 2354 6700 987
Print all records except for a particular record (e.g. not record 3)
awk 'NR!=3 {print}' data.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneZ 2315 7890 123
Print a range of records (e.g. records 2 to 3)
awk 'NR==2, NR==3 {print}' data.txt
GeneX 3210 5678 689
GeneY 2354 6700 987
Print records with fewer than a certain number of fields (e.g. fewer than 4 fields)
cat data2.txt # View data2.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneY 2354 6700
GeneZ 2315 7890 123
awk 'NF<4 {print}' data2.txt
GeneY 2354 6700
Print records containing a certain string anywhere in record (e.g. abc
)
cat data3.txt
abcd dcba efgh aabc
bcde dabc cbad abdc
cdef defg efgh fghi
awk '/abc/ {print}' data3.txt
abcd dcba efgh aabc
bcde dabc cbad abdc
Print records starting with a certain string (e.g. abc
)
awk '/^abc/ {print}' data3.txt
abcd dcba efgh aabc
Print records ending with a certain string (e.g. abc
) >One caveat between macOS and Windows (even when using wsl) is that the line ending character in macOS (i.e. unix) is \n
while the line ending character in Windows is \r\n
. This means that a text file made on a Mac may have a different line ending character than Windows recognizes (and vice versa). To avoid this problem…
awk '/abc$/ {print}' data3.txt
abcd dcba efgh aabc
Print records that don’t contain a certain string anywhere in record (e.g. abc
)
awk '!/abc/ {print}' data3.txt
cdef defg efgh fghi
Print records that don’t start with a certain string (e.g. abc
)
awk '!/^abc/ {print}' data3.txt
bcde dabc cbad abdc
cdef defg efgh fghi
Print records that don’t end with a certain string (e.g. abc
)
awk '!/abc$/ {print}' data3.txt
bcde dabc cbad abdc
cdef defg efgh fghi
Print records where a particular field contains a string (e.g. abc
in field 1)
awk '$1 ~ /abc/ {print}' data3.txt
abcd dcba efgh aabc
Print records where a particular field starts with a string (e.g. abc
in field 1)
awk '$1 ~ /^abc/ {print}' data3.txt
abcd dcba efgh aabc
Print records where a particular field ends with a string (e.g. abc
in field 4)
awk '$4 ~ /abc$/ {print}' data3.txt
abcd dcba efgh aabc
Print records where a particular field starts with any number (e.g. field 1)
cat data4.txt
1ABC D1CB EF1G AAB1
b2cd da2b cba2 2abc
CD3E DEF3 3EFG F2GH
awk '$1 ~ /^[0-9]/ {print}' data4.txt
1ABC D1CB EF1G AAB1
Print records where a particular field ends with any number (e.g. field 1)
awk '$1 ~ /[0-9]$/ {print}' data4.txt
Ignore case when looking for records containing a string (e.g. abc
)
awk 'tolower($0) ~ /abc/ {print}' data4.txt
1ABC D1CB EF1G AAB1
b2cd da2b cba2 2abc
Print records that contain a certain value in a particular field (e.g. the number 3210 in field 2)
cat data.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneY 2354 6700 987
GeneZ 2315 7890 123
awk '$2==3210 {print}' data.txt
GeneX 3210 5678 689
Print records that do not contain a certain value in a particular field (e.g. not the number 10 in field 2)
awk '$2!=3210 {print}' data.txt
Genes Sample1 Sample2 Sample3
GeneY 2354 6700 987
GeneZ 2315 7890 123
Print records that contain a value greater than a certain value in a particular field (e.g. >2354 in field 2)
awk '$2>2354 {print}' data.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
Print records that contain a value less than a certain value in a particular field (e.g. <2354 in field 2)
awk '$2<2354 {print}' data.txt
GeneZ 2315 7890 123
Print records that contain a value less than or equal to a certain value in a particular field (e.g. <2354 in field 2)
awk '$2<=2354 {print}' data.txt
GeneY 2354 6700 987
GeneZ 2315 7890 123
Sum values in a field (e.g. field 2)
awk '{sum+=$2;} END{print sum;}' data.txt
7879
Remember to add NR>1
if your file has a header in case the headers are numeric
awk 'NR>1 {sum+=$2;} END{print sum;}' data.txt
7879
Remove blank lines
cat data5.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneY 2354 6700 987
GeneZ 2315 7890 123
awk 'NF' data5.txt
Genes Sample1 Sample2 Sample3
GeneX 3210 5678 689
GeneY 2354 6700 987
GeneZ 2315 7890 123
Print the record number at beginning of record
awk '{print NR,$0}' data.txt
1 Genes Sample1 Sample2 Sample3
2 GeneX 3210 5678 689
3 GeneY 2354 6700 987
4 GeneZ 2315 7890 123