Xander is run on one sample at a time. To perform any type of community analysis with your assembled contigs, results from all samples must be gathered together and reduced to an OTU table. The script get_OTUabundance.sh in RDPTools/Xander_assembler/bin/ is provided to create coverage-adjusted OTU abundance data matrices from contigs of the same gene from multiple samples. Inputs and outputs for this script are:
Inputs
final_prot_aligned.fasta files for all samples of interest
A file of all the sample coverage files concatenated together
Outputs
rformat_dist_0.##.txt: data matrix files with OTU abundances for each sample at given distances (0 to 0.5 by 0.01 steps by default). The data matrices can then imported into R for more extensive analysis and visualization functions. Currently values in the OTU matrix are rounded to include whole number OTU abundances (counts).
The modified script introduced below, xander_cluster_samples.sh, goes several steps further, providing files that may be used to populate an experiment-level phyloseq object with, in additon to the OTU table, representative sequences for each OTU, a tree of the representative sequences, and a taxonomy table giving the genus, species, and strain of the closest match to each OTU. A corresponding sample data table is usually created separately in a spreadhseet program and then added to the phyloseq object.
Planning ahead
It is easier to collect the necessary files together if you plan ahead. And indeed, xander_cluster_samples.sh "assumes" that you do the following:
Before running Xander, create a directory for your experiment. Within it, create a Xander output directory for each sample in your experiment.
When you run Xander, in the xander_setenv.sh file for each sample give a SAMPLE_SHORTNAME that identifies the sample and point the output to the appropriate sample directory. If you are using MSU's HPCC, you can submit multiple jobs (up to 250, but good luck with that) at the same time.
xander_cluster_samples.sh
The script xander_cluster_samples.sh is listed here for your reference. The script is written to run on MSU's HPCC. To run elsewhere you probably need to comment out the line module load FastTree and set FastTree to the full path and program name appropriate to your installaton. You would also need to set RDPToolsDir to the full path for your installation of RDPTools. The script is run by entering a command of the form:
work_dir is a pre-existing directory to which results will be written
expt_dir is the experiment directory contianing all Xander sample directories for the experiment
distance is the one distance for which the OTU table will be created, e.g. 0.05
gene is the gene being analyzed
The script:
#!/bin/bash
# xander_cluster_samples.sh
# John Quensen
# March 2018
## Configuration section
## Set path to RDPTools
RDPToolsDir=/mnt/research/rdp/public/RDPTools
## Load FastTree. If on MSU's HPCC, then
module load FastTree
## Else, comment out the line above and set FastTree to the full path plus program name:
# FastTree=full_path/fasttree
## Get working directory, experiment directory, distance, gene from command line.
## working directory is where results will be written.
## Experiment directory is directory containing Xander or MegaGTA results for all samples.
## Distance is clustering distance.
## Gene is gene being analyzed, e.g. rplB, nifH, etc.
while getopts "w:e:d:g:" option; do
case ${option} in
w) work_dir=${OPTARG};; # path to working (summary) directory
e) expt_dir=${OPTARG};; # path the experiment directory
d) distance=${OPTARG};; # cluster distance for otu table and corresponding representative sequences
g) gene=${OPTARG};; # gene
esac
done
if [[ -z $work_dir || -z $expt_dir || -z $distance || -z $gene ]]; then
echo "Usage: -w <work_dir> -e <expt_dir> -d <distance> -g <gene>"
exit 1
fi
## Get files.
cd $work_dir
find $expt_dir/ -name "*"$gene"*coverage.txt" -exec cp {} $work_dir/ \;
find $expt_dir/ -name "*"$gene"*final_prot_aligned.fasta" -exec cp {} $work_dir/ \;
mkdir taxa
find $expt_dir/ -name "*"$gene"*framebot.txt" -exec cp {} $work_dir/taxa/ \;
rm taxa/*failed*.txt
## Catenate coverage files.
cat *coverage.txt > sam_coverage.txt
## Dereplilcate
java -Xmx2g -jar $RDPToolsDir/Clustering.jar derep -o derep.fa -m '#=GC_RF' ids samples *.fasta
## Calculate distance matrix
java -Xmx2g -jar $RDPToolsDir/Clustering.jar dmatrix -c 0.5 -I derep.fa -i ids -l 50 -o dmatrix.bin
## Cluster. By default, the resulting clust file includes results for all
## distances from 0 to 0.5 by 0.01.
java -Xmx2g -jar $RDPToolsDir/Clustering.jar cluster -d dmatrix.bin -i ids -s samples -o complete.clust
## Get representative sequences for specified distance.
mkdir alignment
mv *aligned.fasta alignment
java -Xmx2g -jar $RDPToolsDir/AlignmentTools.jar alignment-merger alignment merged_aligned.fasta
java -Xmx2g -jar $RDPToolsDir/Clustering.jar rep-seqs -c -l -m "#=GC_RF" -I ids -s complete.clust $distance merged_aligned.fasta
java -Xmx2g -jar $RDPToolsDir/Clustering.jar to-unaligned-fasta complete.clust_rep_seqs.fasta | cut -f1 -d ' ' > unaligned_rep_seqs.fasta
sed -i 's/cluster/OTU/' unaligned_rep_seqs.fasta
## Get coverage adjusted OTU matrix file for specified distance
java -Xmx2g -jar $RDPToolsDir/Clustering.jar cluster_to_Rformat complete.clust . $distance $distance sam_coverage.txt
## Match FrameBot matches to machine names of representative sequences.
cat taxa/*.txt > matches.txt
grep "STATS" matches.txt | cut -f2,3,4,5,6 | cut -d "_" -f 2- > match_taxa_machine_names.txt
## Match cluster numbers to machine names of representative sequences.
grep ">" complete.clust_rep_seqs.fasta | cut -f1 -d "," | sed 's/>//' | sed 's/seq_id=//' | sed 's/ /\t/' > match_cluster_machine_name.txt
## Make tree file
java -Xmx16g -jar /mnt/research/rdp/public/RDPTools/Clustering.jar derep -f -o rep_seqs_model_only.fasta ids samples complete.clust_rep_seqs.fasta
FastTree rep_seqs_model_only.fasta > my_tree.nwk
Assemble a phyloseq Object
The example below describes how an experiment-level phyloseq object for the gene rplB was created from the results for the 21 metagenome samples referenced in the original Xander paper (Wang et.al., 2015). The R script depends on phyloseq and RDPutils version 1.4.1 or above. Instructions for installing phyloseq are given at https://bioconductor.org/packages/release/bioc/html/phyloseq.html. Instructions for installing RDPutils from GitHub are given at https://john-quensen.com/github/.
Initial steps
Create an R working directory.
Put the following files, created by the script xander_cluster_samples.sh, in the R working directory:
xander_rplbB_rformat_dist_0.05.txt
xander_rplB_unaligned_rep_seqs.fasta
match_taxa_machine_names.txt
match_cluster_machine_name.txt
my_tree.nwk
R Script
Open R in a terminal or in RStudio and set the path to the working directory. Then run the R commands below. They are given in sections so that each step may be explained. R ouput lines begin with ##.
Load packages and functions.
Loading RDPutils will automatically load its dependencies including phyloseq.
The sample names in the R-formatted table are too long. They are the names of the aligned aa sequence files. The first two characters of these file names are the sample names. Use the function shorten_sample_names loaded above to shorten the row names (sample names) to just the first two characters. You may have to edit shorten_sample_names to properly extract your sample names.
There are a large number of empty OTUs because some coverage adjusted counts were less than 0.5. They will be removed later.
Make a sample data table
For this example a simple sample data table including only crop type will be created from the first letter of each sample name. In most cases, a more comprehensive sample data table with environmental data and treatment factors would be created in a spreadsheet program and imported.
Notice that these taxa names begin with "OTU_" but are not padded to the same length. We will have to make adjustments so that taxa names for the OTU table and reference sequences match.
Make a taxonomy table
This will consist of the closest matching reference sequences found by FrameBot.
Notice that my_taxa is already a phyloseq object. The taxa names are of the biom file format, not the R-formatter format. The ranks are Genus, Species, and Strain. Strain is the name of the closest match found by FrameBot, and the genus and species are parsed from the strain name. The percent identity to the reference sequence is appended to the strain name. This should always be taken into account when interpreting results. The closest match can be quite distant from the reference sequence.
Read in the tree file
If you made a tree of the aligned representative sequences, you can add it, too. The tip labels will need to be changed to the same format as taxa names in the other components - i.e. begin with "OTU_" and be unpadded.
Before assembling an experiment-level phyloseq object, we need to make the taxa names consistent. I will do that here by making the taxa names begin with "OTU_" and "un-padding" them. The OTU table and sample data table (crop) also need to be converted to phyloseq objects.
Once our data are packaged into an experiment-level phyloseq object, they may be analyzed by any method available to commmunity ecologists. Two methods are demonstrated below.
Ordination
The ordination presented the original Xander paper was a PCA calculated from the square root of Wisconsin standardized counts that had been adjusted for coverage. This ordination is replicated here, but with ggplot graphics. Instructions for installing QsRutils and ggordiplots are given at https://john-quensen.com/github/.
suppressPackageStartupMessages(library("QsRutils"))
otu <- veganotu(expt)
crop <- vegansam(expt)
otu.std <- sqrt(wisconsin(otu))
pca <- rda(otu.std)
axis.labels <- ord_labels(pca)[1:2]
plt <- gg_ordiplot(pca, groups = crop[ , 1], plot = FALSE)
plt$plot +
xlab(axis.labels[1]) +
ylab(axis.labels[2]) +
guides(color=guide_legend("Crop")) +
ggtitle("PCA for rplB in KBS Soil", subtitle = " Square Root of Wisconsin Transformed Counts")
Trees
Plot a tree for the ten most abundant OTUs. Label the tips with the strain of the closest match.
Two of these OTUs are found only in corn, and one is found almost exclusively in corn. Corn is well separated from the other crops in the ordination. Notice that the strain names (tip labels) end with the percent identity to the closest match in the FrameBot reference file.