Xander Results for Multiple Samples
Xander is run on one sample at a time. To perform any type of community analysis with your assembled contigs, results from all samples must be gathered together and reduced to an OTU table. The script get_OTUabundance.sh in RDPTools/Xander_assembler/bin/ is provided to create coverage-adjusted OTU abundance data matrices from contigs of the same gene from multiple samples. Inputs and outputs for this script are:
Inputs
final_prot_aligned.fasta files for all samples of interest
A file of all the sample coverage files concatenated together
Outputs
rformat_dist_0.##.txt: data matrix files with OTU abundances for each sample at given distances (0 to 0.5 in steps of 0.01 by default). The data matrices can then be imported into R for more extensive analysis and visualization. Currently, values in the OTU matrix are rounded to whole numbers (counts).
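Gathering the two inputs might look like the sketch below. It assumes that each sample directory holds files whose names end in _final_prot_aligned.fasta and _coverage.txt; those suffixes, the sample names, and the directory layout are assumptions made for illustration, so adjust the patterns to match your own Xander output. The demo creates small placeholder files in a temporary directory so that it can run anywhere.

```shell
set -eu

# Hypothetical experiment layout, built in a temp dir so the sketch is runnable.
expt_dir=$(mktemp -d)
gene=rplB
for s in C1 M1; do
    mkdir -p "$expt_dir/$s"
    printf '>%s_contig_1\nMAVK\n' "$s" > "$expt_dir/$s/${s}_${gene}_45_final_prot_aligned.fasta"
    printf '%s_contig_1\t2.5\n' "$s" > "$expt_dir/$s/${s}_${gene}_45_coverage.txt"
done

# Input 1: list the aligned protein files for all samples of interest.
find "$expt_dir" -name "*_${gene}_*final_prot_aligned.fasta" | sort > "$expt_dir/aligned_files.txt"

# Input 2: concatenate every sample's coverage file into a single file.
find "$expt_dir" -name "*_${gene}_*coverage.txt" | sort | xargs cat > "$expt_dir/all_coverage.txt"

aligned_count=$(wc -l < "$expt_dir/aligned_files.txt" | tr -d ' ')
coverage_lines=$(wc -l < "$expt_dir/all_coverage.txt" | tr -d ' ')
echo "aligned files: $aligned_count, coverage lines: $coverage_lines"
```

The list of aligned files and the concatenated coverage file are the two inputs described above; how they are passed to get_OTUabundance.sh is not shown here.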
The modified script introduced below, xander_cluster_samples.sh, goes several steps further. It provides files that may be used to populate an experiment-level phyloseq object with, in addition to the OTU table, representative sequences for each OTU, a tree of the representative sequences, and a taxonomy table giving the genus, species, and strain of the closest match to each OTU. A corresponding sample data table is usually created separately in a spreadsheet program and then added to the phyloseq object.
Planning ahead
It is easier to collect the necessary files together if you plan ahead. And indeed, xander_cluster_samples.sh "assumes" that you do the following:
Before running Xander, create a directory for your experiment. Within it, create a Xander output directory for each sample in your experiment.
When you run Xander, in the xander_setenv.sh file for each sample give a SAMPLE_SHORTNAME that identifies the sample and point the output to the appropriate sample directory. If you are using MSU's HPCC, you can submit multiple jobs (up to 250, but good luck with that) at the same time.
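The directory layout described above could be set up with a few lines of shell. The experiment directory name and the sample short names here are hypothetical placeholders; each sample directory would then serve as the output location in that sample's xander_setenv.sh.

```shell
set -eu

# Hypothetical experiment directory and sample short names; replace with your own.
expt_dir="$(mktemp -d)/my_experiment"
samples="C1 C2 M1 M2"

# One Xander output directory per sample, all inside the experiment directory.
for s in $samples; do
    mkdir -p "$expt_dir/$s"
done

ls "$expt_dir"
```

Keeping the directory names identical to the sample short names makes it easier to trace each result file back to its sample later.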
xander_cluster_samples.sh
The script xander_cluster_samples.sh is listed here for your reference. The script is written to run on MSU's HPCC. To run it elsewhere, you will probably need to comment out the line module load FastTree and set FastTree to the full path and program name appropriate to your installation. You would also need to set RDPToolsDir to the full path for your installation of RDPTools. The script is run by entering a command of the form:
/path_to_script/xander_cluster_samples.sh -w work_dir -e expt_dir -d distance -g gene
where:
work_dir is a pre-existing directory to which results will be written
expt_dir is the experiment directory containing all Xander sample directories for the experiment
distance is the one distance for which the OTU table will be created, e.g. 0.05
gene is the gene being analyzed
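As a concrete illustration, the call might be wrapped as shown below. All paths are hypothetical placeholders, and the sketch only checks the directories and prints the command (a dry run) rather than executing it.

```shell
set -eu

# Hypothetical locations; substitute your own paths.
script=/path_to_script/xander_cluster_samples.sh
work_dir=$(mktemp -d)        # must already exist; results are written here
expt_dir=$(mktemp -d)        # holds one Xander output directory per sample
distance=0.05                # the single distance used for the OTU table
gene=rplB                    # gene being analyzed

# Fail early if either directory is missing.
for d in "$work_dir" "$expt_dir"; do
    [ -d "$d" ] || { echo "missing directory: $d" >&2; exit 1; }
done

# Dry run: print the command instead of executing it.
cmd="$script -w $work_dir -e $expt_dir -d $distance -g $gene"
echo "$cmd"
```

Checking the directories first is a small convenience: work_dir has to exist before the script runs, so a typo in the path is caught before any job is started.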
The script:
Assemble a phyloseq Object
The example below describes how an experiment-level phyloseq object for the gene rplB was created from the results for the 21 metagenome samples referenced in the original Xander paper (Wang et al., 2015). The R script depends on phyloseq and RDPutils version 1.4.1 or above. Instructions for installing phyloseq are given at https://bioconductor.org/packages/release/bioc/html/phyloseq.html. Instructions for installing RDPutils from GitHub are given at https://john-quensen.com/github/.
Initial steps
Create an R working directory.
Put the following files, created by the script xander_cluster_samples.sh, in the R working directory:
xander_rplB_rformat_dist_0.05.txt
xander_rplB_unaligned_rep_seqs.fasta
match_taxa_machine_names.txt
match_cluster_machine_name.txt
my_tree.nwk
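Collecting these files could be scripted along the following lines. The sketch assumes that all five files sit directly in the work directory passed to xander_cluster_samples.sh with -w, which may not match your layout. The directories here are temporary placeholders populated with empty stand-in files so that the example runs on its own.

```shell
set -eu

work_dir=$(mktemp -d)    # stand-in for the script's output directory
r_dir=$(mktemp -d)       # stand-in for the R working directory

files="xander_rplB_rformat_dist_0.05.txt
xander_rplB_unaligned_rep_seqs.fasta
match_taxa_machine_names.txt
match_cluster_machine_name.txt
my_tree.nwk"

# Create empty stand-ins so the demo is self-contained.
for f in $files; do : > "$work_dir/$f"; done

# Copy each file, stopping with a message if one is missing.
for f in $files; do
    [ -f "$work_dir/$f" ] || { echo "missing: $f" >&2; exit 1; }
    cp "$work_dir/$f" "$r_dir/"
done

copied=$(ls "$r_dir" | wc -l | tr -d ' ')
echo "copied $copied files"
```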
R Script
Open R in a terminal or in RStudio and set the path to the working directory. Then run the R commands below. They are given in sections so that each step may be explained. R output lines begin with ##.
Load packages and functions.
Loading RDPutils will automatically load its dependencies including phyloseq.
Make OTU Table
The first step in creating the OTU table is to read in the R-formatted text file.
The sample names in the R-formatted table are too long. They are the names of the aligned aa sequence files. The first two characters of these file names are the sample names. Use the function shorten_sample_names loaded above to shorten the row names (sample names) to just the first two characters. You may have to edit shorten_sample_names to properly extract your sample names.
Now the sample names are the way that we want them. The taxa names (OTUs) begin with OTU_ and are padded with zeroes to be all the same length.
Inspect the OTU table.
There are a large number of empty OTUs because some coverage-adjusted counts were less than 0.5 and were rounded to zero. They will be removed later.
Make a sample data table
For this example a simple sample data table including only crop type will be created from the first letter of each sample name. In most cases, a more comprehensive sample data table with environmental data and treatment factors would be created in a spreadsheet program and imported.
Read in representative sequences
Notice that these taxa names begin with "OTU_" but are not padded to the same length. We will have to make adjustments so that taxa names for the OTU table and reference sequences match.
Make a taxonomy table
This will consist of the closest matching reference sequences found by FrameBot.
Notice that my_taxa is already a phyloseq object. The taxa names are of the biom file format, not the R-formatter format. The ranks are Genus, Species, and Strain. Strain is the name of the closest match found by FrameBot, and the genus and species are parsed from the strain name. The percent identity to the reference sequence is appended to the strain name. This should always be taken into account when interpreting results, because an OTU can be quite distant from its closest matching reference sequence.
Read in the tree file
If you made a tree of the aligned representative sequences, you can add it, too. The tip labels will need to be changed to the same format as taxa names in the other components - i.e. begin with "OTU_" and be unpadded.
Make taxa names consistent
Before assembling an experiment-level phyloseq object, we need to make the taxa names consistent. I will do that here by making the taxa names begin with "OTU_" and "un-padding" them. The OTU table and sample data table (crop) also need to be converted to phyloseq objects.
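The renaming rule is plain string manipulation. As an illustration only (the tutorial itself does this step in R), the sketch below strips the zero padding from OTU names with sed; the example names are made up.

```shell
set -eu

# Made-up padded names of the kind found in the R-formatted OTU table.
padded="OTU_0001
OTU_0042
OTU_1300"

# Remove zeros that directly follow "OTU_", keeping at least one digit.
unpadded=$(printf '%s\n' "$padded" | sed -E 's/OTU_0+([0-9])/OTU_\1/')
printf '%s\n' "$unpadded"
```

Only zeros that immediately follow the prefix are removed, so a name such as OTU_1300 keeps its trailing zeros.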
Assemble experiment-level phyloseq object
Finally, assemble the components into an experiment-level phyloseq object.
We noted above that some OTUs are empty. We can easily remove the empty taxa with phyloseq's prune_taxa function:
Example Analyses
Once our data are packaged into an experiment-level phyloseq object, they may be analyzed by any method available to community ecologists. Two methods are demonstrated below.
Ordination
The ordination presented in the original Xander paper was a PCA calculated from the square root of Wisconsin-standardized counts that had been adjusted for coverage. This ordination is replicated here, but with ggplot graphics. Instructions for installing QsRutils and ggordiplots are given at https://john-quensen.com/github/.

Trees
Plot a tree for the ten most abundant OTUs. Label the tips with the strain of the closest match.

Two of these OTUs are found only in corn, and one is found almost exclusively in corn. Corn is well separated from the other crops in the ordination. Notice that the strain names (tip labels) end with the percent identity to the closest match in the FrameBot reference file.