Xander Results Explained
The run_xander_skel.sh script first builds the bloom filter, finds starting kmers, and assembles contigs into target gene sequences. Once the contigs have been assembled, they are corrected for insertions and deletions by FrameBot, clustered at a specified aa distance, and the longest contig from each cluster taken as the representative sequence for the cluster. These are filtered for chimeras using UCHIME to give the final sequence files ending with _final_nucl.fasta, _final_prot.fasta, and _final_prot_aligned.fasta. (For example test_nifH_45_final_prot_aligned.fasta.) The closest match to each sequence is found with FrameBot and a coverage file is generated. The coverage file is used to adjust the sequence counts for coverage, and a taxon abundance file (ending with taxonabundance.txt) based on these corrected counts is generated. This file is the final Xander result for a single sample and gives the percentage of corrected counts for each phylum or class for which sequences were found.
The inputs and outputs for each step in this process are explained below. Input files may be from Xander's gene_resource directory. In many cases only the endings of the output file names are given; they may be prepended with some combination of any of the sample short name, the kmer length used, and the gene name. The output files from the build step are found in the output directory under a sub-directory named for the kmer length (e.g. k45). The other files are found in a sub-directory of this named for the gene or in a sub-directory named cluster under the gene directory.
Build
Build the de Bruijn graph. Do this only once for each data set for a given kmer length.
Input: read files (fasta, fastq or gz format)
Output
de Bruijn graph (
k45.bloom
)bloom file stats (
k45_bloom_stat.txt
). Check the "Predicted false positive rate" in this file to make sure that it is less than 1% ( i.e. < 0.01, see Choosing Xander Parameters.)
Find
Identify the starting kmers. Multiple genes should be run together to save time; there is a multi-thread option.
Input
ref_aligned.faa
files fromgene_resource
directoryread files (fasta, fastq or gz format)
Output: starting nucleotide kmers (
starts.txt
for each gene)
Search: Initial search
Assemble the contigs. Each gene can be done in parallel. Length cutoff or HMM score cutoff filters are used. Caution: no outputs from this step are quality filtered!
Input
forward and reverse HMMs (
for_enone.hmm
andrev_enone.hmm
)de Bruijn graph (
k45.bloom
)starting kmers (
gene_starts.txt
)
Output
unique merged protein contigs (
prot_merged_rmdup.fasta
)merged nucleotide contigs (
nucl_merged.fasta
)unmerged nucleotide and protein contigs (
gene_starts.txt_nucl.fasta
andgene_starts.txt_prot.fasta
)
Search: Post assembly processing
Post assembly steps including clustering, chimera removal, closest-match assignment, and abundance calculation.
Cluster
RDP's mcClust is used to cluster the sequences based on aa identity. For each of the clusters the longest contig is chosen as the representative contig. Caution: no outputs from this step are quality filtered! All outputs for this step are located in the cluster
directory.
Input:
prot_merged_rmdup.fasta
from initial searchOutput
representative contigs at 99% aa identity (
nucl_rep_seqs.fasta
andprot_rep_seqs.fasta
).aligned protein contigs (
aligned.fasta
)complete linkage cluster output (
complete.clust
). Shows how many contigs you have with different distance cutoffs.
Remove Chimeric Contigs
UCHIME in reference mode is used to remove chimeric contigs. All outputs for this step are located in the cluster
directory.
Input
representative nucleotide contigs (
nucl_rep_seqs.fasta
from cluster step)gene nucleotide reference set (
nucl.fa
from thegene_reference/GENE/originaldata
directory)
Output
UCHIME output (
result_uchimealn.txt
,results.uchime.txt
)quality_filtered nucleotide representative contigs (
final_nucl.fasta
)quality_filtered protein representative contigs and their raw abundance (not coverage adjusted; number of contigs)(
final_prot.fasta
)
Note: the quality-filtered contigs in final_nucl.fasta
, final_prot.fasta
and final_prot_aligned.fasta
should be used as the final set of contigs assembled by Xander.
Find Closest Matches
The nearest reference sequence match to each contig is found using RDP's FrameBot tool. RDP's Protein Seqmatch tool could also be used for this step. All outputs for this step are located in the cluster
directory.
Input
quality_filtered nucleotide representative contigs (
final_nucl.fasta
)gene protein reference set (
framebot.fa
from thegene_reference/GENE/originaldata
directory)
Output: the nearest reference seq and % aa identity (
framebot.txt
)
Coverage & Kmer Abundance
Read mapping, contig coverage, and kmer abundance are determined with RDP's KmerFilter tool. There is a multi-thread option for these steps.
Input 1
quality_filtered nucleotide representative contigs (
final_nucl.fasta
)read files (fasta, fastq or gz format)
Output
contig coverage (
coverage.txt
) This file can be used to estimate gene abundance and adjust sequence abundance.kmer abundance and corresponding frequency (
abundance.txt
)
Taxonomic Abundance
This is the final output for a single sample. You can think of it as a summary for OTUs (clusters) found in the sample. For each, it gives the closest reference match and the abundance and fractional abundance.
Input
contig coverage (
coverage.txt
)the nearest reference seq (
framebot.txt
)gene protein reference set (
framebot.fa
from thegene_reference/GENE/originaldata
directory)
Output: taxonomic abundance adjusted by coverage, grouped by lineage (phylum and in some cases class) (
taxonabund.txt
).If taxonomy was added to
framebot.fa
ahead of time, taxonomic abundance is calculated by phylum and lineage.If taxonomy was not added to
framebot.fa
ahead of time, taxonomic abundance of lineage is shown twice.
Multiple Samples
To see how results for multiple samples can be combined into an OTU table for community analyses, see the section Xander Results for Multiple Samples.
Last updated