De novo assembly of the full trimmed data set
Select the paired, trimmed read list generated in Exercise 1, called SRR513053 subset (Trimmed), then click the Align/Assemble button in the Toolbar and choose De novo Assemble. This will open the De novo Assembler Settings window.
The settings window is divided into five sections: Data, Method, Trim, Results and, via the More Options button, Advanced Settings. For an overview of the various settings see section 10.3 of the online manual.
The Geneious de novo assembler parses your input data and selects an appropriate Sensitivity setting. In most cases you will not need to adjust it. The assembler also estimates and reports the amount of memory the assembly is expected to require, and warns you if it believes you will not have enough RAM to assemble the selected data set.
The Sensitivity setting adjusts various Advanced options. To see how these Advanced settings change with sensitivity, click the More Options button in the bottom left corner of the Settings window, change the Sensitivity setting and observe how the Advanced options change. Hover over each Advanced setting to see a tooltip describing it. If you wish to modify specific Advanced settings, set Sensitivity to Custom Sensitivity.
Running de novo assembly
In this exercise we will run the Geneious de novo assembly algorithm with the default settings. To ensure you are using the default settings, click the cog in the bottom left corner of the window and choose Reset to defaults.
In the Results section of the window, check the options to Save an Assembly report, Save contigs and Save consensus sequences.
The Trim section of the settings window is primarily for Sanger reads. Ensure it is set to Do not trim.
Click OK to start the assembly, noting the time it takes for assembly to complete. We will compare this time with that taken after normalization.
Viewing the assembly results
Upon completion of the assembly, three new files will be written: an assembly report, an assembly file, and a consensus sequence generated from the assembly.
Select the Assembly report to view it. In this exercise all reads should assemble into a single contig, so the report will be simple. For more complex assemblies that result in multiple contigs, various statistics, including N50, will be reported.
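For readers unfamiliar with N50, the following is a minimal sketch of the usual definition: the length of the shortest contig among the longest contigs that together cover at least half of the total assembled length. The function name and the example contig lengths are hypothetical, and this is not the code Geneious uses to build its report.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    # Walk through the contigs from longest to shortest.
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running total to at least
        # half of the assembly length defines the N50.
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths, for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```

With a single contig, as expected in this exercise, N50 is simply the length of that contig.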
The Assembly report provides a Show options link that opens the Assembly settings window to show you the Assembler options used during assembly.
Select the contig file to view it. By default, assembled paired reads are colored according to their assembled paired distance. The color depends on how the paired distance differs from the Expected distance that was set when you paired the reads. To see the color scheme, select the Home tab in the side panel and click the Options link.
Use the zoom controls to zoom in and see the sequences at the nucleotide level.
Click on the Statistics tab in the viewer panel to see the Mean Coverage for the assembly file.
Click on the Graphs tab in the viewer panel to see the settings that can be adjusted for the blue coverage graph. These settings allow you to identify areas of high, low or single-stranded coverage.
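Both the mean coverage and the coverage graph are summaries of per-position read depth. As a rough illustration of the idea (not of how Geneious computes these values internally), the sketch below builds a depth profile from read placements and derives the mean and the low-coverage positions from it. The interval representation (0-based start, exclusive end), the function names and the example values are all assumptions made for this example.

```python
def coverage_profile(read_intervals, contig_length):
    """Count how many reads cover each position of a contig.

    read_intervals holds (start, end) pairs, 0-based with end exclusive.
    """
    depth = [0] * contig_length
    for start, end in read_intervals:
        # Clip each read to the contig boundaries before counting.
        for pos in range(max(start, 0), min(end, contig_length)):
            depth[pos] += 1
    return depth


def mean_coverage(depth):
    """Average depth over all positions of the contig."""
    return sum(depth) / len(depth)


def low_coverage_positions(depth, threshold):
    """Positions whose depth falls below the given threshold."""
    return [pos for pos, d in enumerate(depth) if d < threshold]


# Two hypothetical reads placed on a 6 bp contig.
depth = coverage_profile([(0, 4), (2, 6)], contig_length=6)
print(depth, mean_coverage(depth), low_coverage_positions(depth, 2))
```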
Click on the Insert Sizes tab at the top of the viewer panel to see a histogram showing the distribution of insert sizes calculated from the assembly of the paired reads. In this example we can see that the mean paired distance is very close to the expected insert size of 350 bp.
De novo assembly of large data sets requires significant RAM and computational time. We will now "normalize" our trimmed paired read list and assemble again.
Select the SRR513053 subset (trimmed) list and go to the menu Sequence → Error Correct and Normalize reads. Use the default settings, but uncheck the option for Error correction. Click OK to run. When the tool completes, it will output a new normalized list called SRR513053 subset (trimmed) (normalized).
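To give a feel for what normalization does, here is a toy sketch of one common approach, k-mer based digital normalization: a read is kept only while the median count of its k-mers, among the reads already kept, is below a target depth. This is a generic illustration under stated assumptions and not the implementation behind the Geneious tool. The function names, the tiny `k` and `target` values and the example reads are hypothetical, and read pairing is ignored for simplicity.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=5, target=2):
    """Keep a read only if its region is not yet covered to the target depth."""
    counts = Counter()  # k-mer counts over the reads kept so far
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k, nothing to measure
        # The median k-mer count approximates the coverage of this read.
        if median(counts[km] for km in read_kmers) < target:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Five copies of one hypothetical read plus one read from another region.
reads = ["ACGTTGCAAG"] * 5 + ["TTTTTGGGGG"]
print(normalize(reads))
```

In this toy input the over-represented read is thinned to the target depth while the read from the sparsely covered region is retained, which is the behavior that lets normalization shrink a data set without discarding low-coverage regions.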
Select the normalized list. It should contain 3776 reads, a reduction of close to 50% compared to the original list of 7800 reads.
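The size of the reduction can be checked with a line of arithmetic. The sketch below uses the two read counts quoted above; the function name is hypothetical.

```python
def fraction_removed(original_reads, normalized_reads):
    """Fraction of the original reads that normalization discarded."""
    return 1 - normalized_reads / original_reads


# Read counts quoted in this exercise.
print(f"{fraction_removed(7800, 3776):.1%}")
```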
Now select the normalized list and repeat the de novo assembly, again noting the time it takes to complete. You should find that assembly of the normalized data set takes around one third of the time taken for the full 7800-read data set. If you select the "normalized" contig and view the Statistics tab, you will see that this contig has an average coverage of 52.3.
Select the two consensus sequences, one generated by each assembly (named SRR513053 subset (trimmed) Assembly 2 and SRR513053 subset (trimmed) (normalized) Assembly 2), and also the reference sequence, E. coli K12 MG1655 (NC_000913) ref extraction, provided with this tutorial.
Click the Align/Assemble button in the Toolbar, select Multiple Align and choose to align with the MAFFT alignment tool. Check the option to Automatically determine sequences' direction. Click OK to align.
Select the output Nucleotide alignment to view. You should see that the Identity graph at the top of the alignment is a solid green bar, indicating that all three sequences are identical at all positions in the alignment.
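The same check can be expressed programmatically: every column of the alignment must contain a single residue across all sequences. The sketch below is a minimal illustration that assumes the aligned sequences are available as equal-length strings (for example after exporting the alignment); the function name and the example sequences are hypothetical.

```python
def identical_columns(aligned_sequences):
    """Return one boolean per alignment column: True if all rows agree."""
    lengths = {len(seq) for seq in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # zip(*rows) iterates over the alignment column by column.
    return [
        len({base.upper() for base in column}) == 1
        for column in zip(*aligned_sequences)
    ]


# Hypothetical three-sequence alignment with one mismatching column.
columns = identical_columns(["ACGT", "ACCT", "ACGT"])
print(columns, all(columns))
```

An alignment in which all three sequences agree everywhere would give `True` for every column.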
This last exercise shows that normalization can significantly reduce the time required for assembly with no loss in the accuracy of the assembly. For more information on using normalization in combination with de novo assembly see the Summary section of this tutorial.
This exercise has used the Geneious de novo assembly algorithm. A number of other algorithms are available in Geneious: several are bundled with Geneious, and others are available as plugins. See the following Knowledge Base article for a brief summary of the various assemblers.
This concludes the exercises for this tutorial. Go to the final Summary section of this tutorial for further advice on de novo assembly and normalization.