FAME Metagenomics workshop 2022

Preparation of Kraken output for Data visualation part 2

You don’t have to do these steps, they’re just a handy reference for messing with Kraken output in bash.

Files I used:

Concatonate and convert using taxonkit.

cat *output.txt > kraken.all.tsv
head kraken.all.tsv
C       SRR1237780.2151832.1    Comamonas piscis (taxid 1562974)        101     1562974:1 0:54 80840:1 0:11
C       SRR1237780.4926005.1    Bifidobacterium adolescentis (taxid 1680)       101     1680:67
C       SRR1237780.501486.1     Bacteria (taxid 2)      101     2:2 1:65
U       SRR1237780.1331889.1    unclassified (taxid 0)  89      0:55
U       SRR1237780.3407698.1    unclassified (taxid 0)  101     0:67
U       SRR1237780.973930.1     unclassified (taxid 0)  101     0:67
U       SRR1237780.3542719.1    unclassified (taxid 0)  101     0:67
U       SRR1237780.4639012.1    unclassified (taxid 0)  101     0:67
C       SRR1237780.1088785.1    Arcobacter cryaerophilus ATCC 43158 (taxid 1032070)     101     0:15 1032070:3 0:37 1032070:2 0:10
U       SRR1237780.1066259.1    unclassified (taxid 0)  101     0:67

This is one line per read, we just want the counts for each taxid per sample.

cut -f2,3 kraken.all.tsv \
  | sed 's/\.\S\+//' \
  | sed 's/\t.\+taxid \([0-9]\+\))/\t\1/' \
  | awk '{if ($2==0){$2=1};print $1"\t"$2}' \
  | sort \
  | uniq -c \
  | sed 's/^\s\+//' \
  | sed 's/ /\t/' \
  > kraken.clean.tsv

head kraken.clean.tsv

This leaves us with a 3-column file: count sampleID taxid.

14382   SRR1237780      0
353     SRR1237780      1
1       SRR1237780      1005058
1       SRR1237780      1028416
1       SRR1237780      1028801
1       SRR1237780      1028989
1       SRR1237780      1029756
14      SRR1237780      1032070
2       SRR1237780      1032071
2       SRR1237780      1032239

We convert taxid to a taxon path with taxonkit.

taxonkit lineage -i 3 kraken.clean.tsv \
  | taxonkit reformat -i 4 -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" -F --fill-miss-rank \
  | cut --complement -f3,4 \
  > kraken.taxon.tsv

head kraken.taxon.tsv
14735   SRR1237780      unclassified root superkingdom  unclassified root phylum        unclassified root class unclassified root order unclassified root family        unclassified root genus unclassified root species
1       SRR1237780      Bacteria        Proteobacteria  Gammaproteobacteria     Pasteurellales  Pasteurellaceae Gallibacterium  Gallibacterium anatis
1       SRR1237780      Bacteria        Proteobacteria  Gammaproteobacteria     Pseudomonadales Moraxellaceae   Psychrobacter   Psychrobacter sp. DAB_AL43B
1       SRR1237780      Bacteria        Proteobacteria  Alphaproteobacteria     Hyphomicrobiales        Rhizobiaceae    Neorhizobium    Neorhizobium galegae
1       SRR1237780      Bacteria        Proteobacteria  Gammaproteobacteria     Pseudomonadales Pseudomonadaceae        Pseudomonas     Pseudomonas sp. StFLB209
1       SRR1237780      Bacteria        Proteobacteria  Alphaproteobacteria     Hyphomicrobiales        Hyphomicrobiaceae       Hyphomicrobium  Hyphomicrobium nitrativorans
14      SRR1237780      Bacteria        Proteobacteria  Epsilonproteobacteria   Campylobacterales       Campylobacteraceae      Aliarcobacter   Aliarcobacter cryaerophilus
2       SRR1237780      Bacteria        Proteobacteria  Epsilonproteobacteria   Campylobacterales       Campylobacteraceae      Aliarcobacter   Aliarcobacter cryaerophilus
2       SRR1237780      Bacteria        Proteobacteria  Epsilonproteobacteria   Campylobacterales       Campylobacteraceae      Aliarcobacter   Aliarcobacter skirrowii
1       SRR1237780      Bacteria        Proteobacteria  Epsilonproteobacteria   Campylobacterales       Campylobacteraceae      Aliarcobacter   Aliarcobacter thereius

Let’s add a header.

sed  -i '1i Count\tSampleID\tKingdom\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies' kraken.taxon.tsv

head kraken.taxon.tsv
Count   SampleID        Kingdom Phylum  Class   Order   Family  Genus   Species
14735   SRR1237780      unclassified root superkingdom  unclassified root phylum        unclassified root class unclassified root order unclassified root family        unclassified root genus unclassified root species
1       SRR1237780      Bacteria        Proteobacteria  Gammaproteobacteria     Pasteurellales  Pasteurellaceae Gallibacterium  Gallibacterium anatis
1       SRR1237780      Bacteria        Proteobacteria  Gammaproteobacteria     Pseudomonadales Moraxellaceae   Psychrobacter   Psychrobacter sp. DAB_AL43B
1       SRR1237780      Bacteria        Proteobacteria  Alphaproteobacteria     Hyphomicrobiales        Rhizobiaceae    Neorhizobium    Neorhizobium galegae
1       SRR1237780      Bacteria        Proteobacteria  Gammaproteobacteria     Pseudomonadales Pseudomonadaceae        Pseudomonas     Pseudomonas sp. StFLB209
1       SRR1237780      Bacteria        Proteobacteria  Alphaproteobacteria     Hyphomicrobiales        Hyphomicrobiaceae       Hyphomicrobium  Hyphomicrobium nitrativorans
14      SRR1237780      Bacteria        Proteobacteria  Epsilonproteobacteria   Campylobacterales       Campylobacteraceae      Aliarcobacter   Aliarcobacter cryaerophilus
2       SRR1237780      Bacteria        Proteobacteria  Epsilonproteobacteria   Campylobacterales       Campylobacteraceae      Aliarcobacter   Aliarcobacter cryaerophilus
2       SRR1237780      Bacteria        Proteobacteria  Epsilonproteobacteria   Campylobacterales       Campylobacteraceae      Aliarcobacter   Aliarcobacter skirrowii