A02

Due: Sep 12, 2024 by 11:59 p.m. Points: 60

Q01

Points: 7

You run FastQC on a set of sequencing reads and observe the following "Per base sequence quality" plot.

a) Describe what this plot is showing.

b) What does this pattern typically indicate about the sequencing run?

c) How might this affect downstream analyses if left uncorrected?

Solution

a

The "Per base sequence quality" plot from FastQC visually represents how the sequencing quality varies across each base position in the reads. On the horizontal axis, it displays base positions ranging from 1 to 239 bp, corresponding to the sequencing reads' length. The vertical axis shows the Phred quality scores from 0 to 40, indicating the confidence level in identifying each base; higher scores reflect greater confidence and lower error probabilities.

A box plot summarizes the distribution of quality scores from all reads at each base position along the X-axis. The central line within each box represents the median quality score, while the top and bottom edges of the box indicate the 25th and 75th percentiles, encompassing the middle 50% of the data. The whiskers extend to show the range of scores outside this interquartile range, highlighting the variability in quality at each position.

The plot's background is color-coded to facilitate quick assessment of quality levels: green indicates high-quality scores above 28, orange signifies reasonable quality scores between 20 and 28, and red denotes poor quality scores below 20. In this plot, the quality scores are consistently high and within the green zone from positions 1 to 150 bp, indicating strong confidence in base calling in this region. Beyond position 150 bp, there is a noticeable decline in quality. From positions 150 to 200 bp, the scores decrease but generally remain within the green zone. However, between positions 200 and 239 bp, there is a sharp decline, with quality scores dropping into the orange and even red zones. Additionally, the spread of the box plots widens towards the end of the reads, reflecting increased variability and uncertainty in the base calls in these later positions.

b

The observed pattern, in which high quality scores at the beginning of the reads progressively decline towards the end, is characteristic of data generated by high-throughput sequencing platforms like Illumina. It indicates that while the initial portion of the sequencing run produced high-quality data, the latter part is compromised by decreased accuracy and increased variability; this is a common phenomenon, but it matters for data interpretation. Several factors contribute to the loss of base-calling accuracy as the sequencing reaction proceeds. One key factor is signal decay: over successive cycles, the fluorescence signals used to identify nucleotides become weaker and less distinguishable, making accurate base identification more challenging. Issues with phasing and pre-phasing can also arise, where some DNA strands within a cluster fall out of sync with the sequencing cycles, leading to cumulative errors in base calling. Instrument calibration or maintenance issues can likewise degrade sequencing efficiency and accuracy over time.

c

If the decline in sequencing quality towards the end of the reads is not addressed, it can affect downstream analyses in several ways. In alignment tasks, low-quality bases at the ends of reads may lead to incorrect mapping to the reference genome, resulting in mismatches or gaps that reduce overall alignment accuracy and confidence. This misalignment can propagate errors into variant calling, where sequencing errors might be misinterpreted as genetic variants, leading to false positives, or genuine variants might be overlooked due to the masking effect of poor-quality data, resulting in false negatives.

In de novo assembly projects, low-quality regions can fragment contigs or cause misassemblies, as the assembly algorithms struggle to accurately piece together sequences with high error rates. For quantitative analyses like gene expression, including erroneous bases can lead to inaccurate conclusions about gene expression levels or community composition.

Moreover, retaining low-quality data increases computational demands and storage requirements without contributing valuable information, potentially slowing down analysis pipelines and complicating data management. To mitigate these issues, it is crucial to perform quality control measures such as trimming low-quality bases from the ends of reads using tools like Trimmomatic or Cutadapt. By removing or correcting poor-quality data, researchers can improve the reliability and accuracy of downstream analyses, ensuring that conclusions drawn from the sequencing data are based on high-quality, trustworthy information.
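
To make the trimming step concrete, below is a minimal sketch of the sliding-window idea used by tools like Trimmomatic (its SLIDINGWINDOW step). The window size and quality threshold are illustrative defaults, not a prescription.

Python

def sliding_window_trim(seq: str, quals: list[int],
                        window: int = 4, min_mean_q: float = 20.0) -> str:
    """Cut the read at the first window whose mean quality drops
    below min_mean_q (the idea behind Trimmomatic's SLIDINGWINDOW)."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < min_mean_q:
            return seq[:i]  # keep only the bases before the failing window
    return seq

# Illustrative read: high quality early, decaying toward the 3' end.
quals = [38] * 10 + [30, 25, 18, 12, 8, 5]
seq = "ACGTACGTACGTACGT"
print(sliding_window_trim(seq, quals))  # trims the low-quality tail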

Q02

Points: 7

In a FastQC report, you notice a high percentage of overrepresented sequences that match known Illumina adapters.

a) What step in data preprocessing should you perform to address this issue?

b) How might the presence of adapter sequences affect genome assembly if not removed?

Solution

a

When you notice a high percentage of overrepresented sequences in a FastQC report that match known Illumina adapters, the appropriate step in data preprocessing is to perform adapter trimming. This involves using specialized software tools like Trimmomatic, Cutadapt, or Trim Galore to scan your sequencing reads and remove any adapter sequences present at the ends. By trimming these adapters, you ensure that downstream analyses—such as genome assembly or alignment—are based solely on genuine genomic data, enhancing the accuracy and reliability of your results.
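
As a toy illustration of what adapter trimming does, the sketch below removes a 3' adapter by exact matching, including a partial adapter prefix hanging off the end of the read. Real trimmers such as Cutadapt also tolerate mismatches and use quality-aware alignment; the adapter shown is the start of the common Illumina TruSeq adapter, used here only as an example.

Python

def trim_adapter_3prime(read: str, adapter: str, min_match: int = 3) -> str:
    """Remove an adapter from the 3' end of a read (exact matching only)."""
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]  # full adapter found: cut it and everything after
    # Check for a partial adapter prefix at the very end of the read.
    for n in range(min(len(adapter), len(read)) - 1, min_match - 1, -1):
        if read.endswith(adapter[:n]):
            return read[:-n]
    return read

# Illustrative: the start of the Illumina TruSeq adapter sequence.
adapter = "AGATCGGAAGAGC"
print(trim_adapter_3prime("ACGTACGTAGATCGGAAGAGC", adapter))  # ACGTACGT
print(trim_adapter_3prime("ACGTACGTACGTAGATCG", adapter))     # ACGTACGTACGT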

b

If adapter sequences are not removed before genome assembly, their presence can adversely affect the assembly process in several ways. Adapter sequences can create erroneous overlaps between reads, leading the assembler to incorrectly join unrelated genomic fragments. This can result in misassembled contigs or scaffolds and the formation of chimeric contigs, where non-adjacent genomic regions are incorrectly assembled together due to shared adapter contamination.

Additionally, the presence of adapters can reduce the overall quality of the assembly by inflating error rates and decreasing assembly metrics like N50 values, leading to a fragmented and less accurate genome assembly.

Adapter contamination can also skew read depth and coverage analyses, affecting downstream interpretations such as gene expression levels or variant calling.

Furthermore, including adapter sequences increases the complexity of the dataset, leading to longer processing times and higher computational resource usage during assembly.

Q03

Points: 7

You observe a bimodal distribution in the "Per sequence GC content" plot of your FastQC report.

a) What does a bimodal distribution in this plot typically suggest?

b) List a possible reason for this observation.

c) How would you further investigate the cause of this pattern?

Solution

a

A bimodal distribution in the "Per sequence GC content" plot typically suggests that your sequencing data contains two distinct groups of sequences with different GC content percentages. This often indicates that there is more than one source of DNA in your sample. In other words, instead of all your reads having a similar GC content (which would result in a unimodal, bell-shaped distribution), the presence of two peaks suggests heterogeneity in the GC content of the sequences.
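
Per-sequence GC content is straightforward to compute, which makes this easy to see in practice. Below is a minimal sketch with toy reads (illustrative only), where a low-GC target mixed with a high-GC contaminant produces two separated modes:

Python

from collections import Counter

def gc_percent(seq: str) -> float:
    seq = seq.upper()
    return 100.0 * sum(seq.count(base) for base in "GC") / len(seq)

# Toy mix: two reads from a low-GC target, two from a high-GC contaminant.
reads = ["ATATTACGAT", "AATTATCGTA", "GCGGCCGCGT", "CCGGGCGCAG"]
gc_bins = Counter(round(gc_percent(read), -1) for read in reads)  # 10%-wide bins
for gc, count in sorted(gc_bins.items()):
    print(f"{gc:.0f}% GC: {count} read(s)")  # two modes appear: 20% and 90%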

b

A possible reason for observing a bimodal distribution in the GC content plot is contamination with DNA from another organism that has a significantly different GC content than your target organism. This contamination could occur during sample collection, preparation, or sequencing.

Alternatively, the presence of technical sequences such as adapters, primers, or other synthetic constructs with distinct GC content can also contribute to the formation of a second peak in the distribution.

c

To further investigate the cause of this bimodal pattern, you can take the following steps:

  1. Sequence Alignment and BLAST Search:
    • Extract Reads: Separate a subset of reads from each peak of the GC content distribution.
    • BLAST Analysis: Perform BLAST searches of these reads against nucleotide databases to identify their potential origins.
    • Interpret Results: Determine if the reads from the unexpected peak align to genomes of other organisms, which would indicate contamination.
  2. Mapping to Reference Genome:
    • Alignment: Map all reads to your target organism's reference genome using alignment tools like BWA or Bowtie2.
    • Assess Mapping Rates: Check if the reads from the second GC peak fail to map to the reference genome, suggesting they originate from a different organism.
  3. Taxonomic Classification:
    • Use Metagenomic Tools: Employ tools like Kraken, Centrifuge, or Kaiju to classify the taxonomic origin of the reads.
    • Quantify Contamination: Determine the proportion of reads assigned to organisms other than your target, providing evidence of contamination.
  4. Check for Technical Sequences:
    • Screen for Adapters and Primers: Use software like FastQC or Cutadapt to identify and quantify any remaining adapter or primer sequences.
    • Trim if Necessary: Remove any technical sequences and reassess the GC content distribution to see if the bimodal pattern persists.
  5. Review Laboratory Procedures:
    • Sample Handling: Re-examine the sample preparation steps to identify any potential sources of contamination.
    • Reagents and Equipment: Ensure that all reagents and equipment were free from contaminants and that proper aseptic techniques were followed.
  6. Consult Metadata and Experimental Design:
    • Mixed Samples: Confirm whether the sample was supposed to contain multiple organisms (e.g., in metagenomic studies).
    • Experimental Controls: Review any controls included in the experiment to rule out technical artifacts.

Q04

Points: 7

Explain the concept of \(k\)-mers in the context of genome assembly. How does the choice of \(k\)-mer size affect the assembly process? Provide examples of potential issues with very small and very large \(k\)-mer sizes.

Solution

In genome assembly, \(k\)-mers are all the possible substrings of length \(k\) that can be extracted from sequencing reads. For example, if you have a read sequence AGCTG, the 3-mers (where \(k\) = 3) would be AGC, GCT, and CTG. \(K\)-mers are fundamental in assembly algorithms, particularly those using de Bruijn graphs, where nodes represent \(k\)-mers, and edges represent overlaps between \(k\)-mers. By analyzing how these \(k\)-mers overlap, assemblers can reconstruct the original genome sequence.
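
A minimal sketch of \(k\)-mer extraction, reproducing the example above:

Python

def kmers(seq: str, k: int) -> list[str]:
    """Return every substring of length k, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmers("AGCTG", 3))  # ['AGC', 'GCT', 'CTG'], matching the example above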

The size of \(k\) in \(k\)-mers significantly impacts the assembly process:

  • Smaller \(k\)-mers are more likely to find overlaps between reads because they require fewer matching bases. This increases sensitivity, helping to connect reads in regions with low coverage or sequencing errors. Larger \(k\)-mers are more specific, reducing the chance of erroneous overlaps but requiring higher-quality data.
  • Genomes often contain repetitive sequences. Smaller \(k\)-mers may not distinguish between these repeats because the \(k\)-mers are too short to capture unique sequences, leading to ambiguities in the assembly. Larger \(k\)-mers can span unique regions within repeats, aiding in correctly assembling these areas.
  • In de Bruijn graph assemblies, smaller \(k\)-mers result in graphs with more connections (edges), which can make the graph more complex and computationally intensive to resolve. Larger \(k\)-mers simplify the graph by reducing the number of connections but may fragment the graph if coverage is insufficient.

Potential Issues with Very Small and Very Large \(k\)-mer Sizes

  • Very Small \(k\)-mer Sizes (e.g., \(k \leq 15\)):
    • High Ambiguity in Repeats: Short \(k\)-mers are common across the genome, especially in repetitive regions. This leads to multiple possible paths in the de Bruijn graph, making it challenging to determine the correct assembly path and potentially causing misassemblies.
    • Error Propagation: Sequencing errors can introduce incorrect \(k\)-mers that overlap with many reads, creating false connections in the graph and complicating error correction.
    • Computational Load: The increased number of connections can make the graph more complex and harder to resolve, increasing computational time and memory usage.
  • Very Large \(k\)-mer Sizes (e.g., \(k \geq 127\)):
    • Fragmented Assemblies: Longer \(k\)-mers require contiguous sequences without errors to match perfectly. In regions with sequencing errors or low coverage, the necessary \(k\)-mers may be missing, leading to breaks in the assembly and resulting in fragmented contigs.
    • Sensitivity to Sequencing Errors: A single nucleotide error can render a large \(k\)-mer unusable because the entire \(k\)-mer must match exactly. This sensitivity can significantly reduce the number of usable \(k\)-mers, especially in datasets with higher error rates.
    • Reduced Overlap Detection: With longer \(k\)-mers, the likelihood of finding overlaps between reads decreases, which can hinder the assembly of regions where reads do not overlap sufficiently.
    • Increased Memory Usage: Storing and handling a vast number of large \(k\)-mers can require more computational resources, potentially exceeding the capacity of standard computing environments.

Some modern assembly algorithms employ multiple \(k\)-mer sizes or adaptive \(k\)-mer strategies to leverage the advantages of different \(k\)-mer lengths. By carefully selecting and possibly combining different \(k\)-mer sizes, assemblers aim to produce the most accurate and contiguous genome assemblies possible given the data quality and computational resources available.

Q05

Points: 6

Describe the greedy algorithm approach to genome assembly:

a) Outline the basic steps of the algorithm.

b) What are the advantages and disadvantages of this approach?

c) Give an example scenario where the greedy algorithm might fail to produce the correct assembly.

Solution

a

The greedy algorithm is one of the earliest methods used for genome assembly and operates on a simple principle: at each step, it makes the locally optimal choice with the hope of finding a global optimum. In the context of genome assembly, the algorithm builds the genome sequence by iteratively merging reads based on the best available overlaps. Here are the basic steps (a minimal code sketch follows the outline):

  1. Compute All Pairwise Overlaps
    • Calculate the overlaps between every possible pair of sequencing reads.
    • Assign scores to overlaps based on their length and quality (longer overlaps with fewer errors receive higher scores).
  2. Select the Best Overlapping Pair
    • Find the pair of reads with the highest-scoring overlap.
    • This pair represents the most significant immediate opportunity for extension.
  3. Merge the Selected Pair
    • Align the two reads at the overlapping region.
    • If there are mismatches, resolve them (often by selecting the most common nucleotide or using quality scores).
    • Combine the two reads into a single sequence (contig).
  4. Update the Read Set
    • Exclude the original reads from the dataset.
    • Treat the newly formed contig as a read for subsequent iterations.
  5. Repeat the Process
    • Go back to step 2 and repeat the selection and merging process with the updated set of reads and contigs.
    • Continue until no further overlaps exceed a predefined threshold, or all reads have been assembled into contigs.
  6. Output the Assembled Genome
    • The remaining contigs represent the assembled genome sequences.
    • Additional steps like scaffolding or gap filling may follow to improve the assembly.
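
The outline above maps directly onto a few lines of code. Below is a minimal, illustrative sketch (no quality weighting, arbitrary tie-breaking, naive all-pairs scoring), not a production assembler:

Python

def best_overlap(a: str, b: str, min_olap: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)) - 1, min_olap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads: list[str], min_olap: int = 3) -> list[str]:
    seqs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(seqs) > 1:
        # Steps 1-2: score every ordered pair and take the best overlap;
        # ties break arbitrarily (the conflict-resolution weakness below).
        olap, a, b = max(
            ((best_overlap(x, y, min_olap), x, y)
             for x in seqs for y in seqs if x != y),
            key=lambda t: t[0],
        )
        if olap < min_olap:
            break  # Step 5: no overlap clears the threshold, so stop
        # Steps 3-4: merge the pair and treat the contig as a new read.
        seqs.remove(a)
        seqs.remove(b)
        seqs.append(a + b[olap:])
    return seqs  # Step 6: whatever remains are the contigs

reads = ["ATGGCT", "GGCTAA", "CTAACG"]  # toy reads for illustration
print(greedy_assemble(reads))  # ['ATGGCTAACG']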

b

Advantages:

  • Simplicity and Ease of Implementation: The algorithm is straightforward and does not require complex data structures, making it easy to understand and implement for educational purposes or small projects.
  • Speed on Small Datasets: For datasets with a small number of reads, the greedy algorithm can be relatively fast since it requires fewer computations.
  • Immediate Results: Because the algorithm makes the best local choice at each step, it quickly extends contigs, providing rapid preliminary assemblies.
  • Minimal Computational Resources for Small Data: It does not demand significant memory or processing power when dealing with limited datasets.

Disadvantages:

  • Suboptimal Global Assembly: The algorithm's local optimization does not guarantee a globally optimal solution. Early decisions may prevent better assemblies later on, leading to fragmented or incorrect genomes.
  • Poor Handling of Repetitive Sequences: Repeats in the genome can cause the algorithm to misassemble reads because it cannot distinguish between identical sequences from different locations, producing collapsed repeats or chimeric contigs.
  • Sensitivity to Sequencing Errors: Sequencing errors can mislead the algorithm into making incorrect merges or missing true overlaps; the algorithm lacks robust error-correction mechanisms.
  • Computational Inefficiency with Large Datasets: Calculating all pairwise overlaps becomes computationally intensive as the number of reads increases, so the approach does not scale well to the high-throughput data common in modern genomics.
  • Lack of Quality Consideration: The algorithm often does not adequately consider base quality scores, which can result in assemblies that include erroneous sequences.
  • No Mechanism for Resolving Conflicts: When multiple overlaps of similar length are found, the algorithm may arbitrarily choose one, potentially leading to incorrect assemblies.

c

Imagine you are assembling the genome of a plant species known to have a high content of repetitive DNA sequences, such as transposable elements, which can be several kilobases long and occur thousands of times throughout the genome.

  • Reads originating from different copies of the repeat are nearly identical.
  • The greedy algorithm, prioritizing the longest overlaps, cannot distinguish between overlaps from unique regions and those from repeats.
  • The algorithm merges reads from different repeat copies, collapsing them into a single contig.
  • This results in a reduced representation of the genome size and loss of repeat copy number information.
  • Merging reads from different genomic locations creates contigs that do not exist in the actual genome.
  • The flanking unique sequences of different repeats get erroneously joined.
  • The algorithm may encounter conflicting overlaps of similar lengths and arbitrarily choose one, potentially breaking contigs or preventing the correct assembly of unique regions adjacent to repeats.

Q06

Points: 6

You are given the following set of reads from a DNA sequencing experiment: {ATGGCTA, GGCTAAC, CTAACGT, AACGTAG, CGTAGCT, TAGCTAA, GCTAACG, TAACGTA, ACGTAGT}.

a) Using the greedy algorithm, show the steps to assemble these reads into contig(s). Assume a minimum overlap of 3 bases. For each step, clearly state which reads you are merging and the resulting sequence. If there are multiple possibilities with the same overlap length, explain your choice.

Clarification

By saying "Assume a minimum overlap of 3", this means you cannot make a merge if the overlap is less than 3. So if you get to a point where you have two sequences and they only have an overlap of 1, you cannot merge them.

If I didn't specify a minimum overlap, you still take the highest overlap each time but you don't have a minimum overlap requirement to merge. If the highest overlap was 2, you can still make that merge. If there is no overlap, you cannot concatenate.

Since this was confusing, I will accept any valid greedy assembly.

b) What is the final assembled sequence?

c) Is this assembly unique? Why or why not?

Clarification

"Unique assembly" in the context of DNA sequence assembly refers to a situation where there is only one possible way to combine the given reads (DNA fragments) to reconstruct the original sequence. In other words, if there are multiple valid ways to combine these reads then the assembly is not unique.

d) Identify a potential problem with this assembly that might not reflect the true original sequence. Explain your reasoning.

Solution

When assembling the given set of reads using the greedy algorithm with a minimum overlap of three bases, the process involves iteratively finding the pair of sequences with the maximum overlap and merging them until no further merges are possible. The reads provided are:

  1. ATGGCTA
  2. GGCTAAC
  3. CTAACGT
  4. AACGTAG
  5. CGTAGCT
  6. TAGCTAA
  7. GCTAACG
  8. TAACGTA
  9. ACGTAGT

a

Step 1: Identify All Possible Overlaps

First, we compute all pairwise overlaps of at least three bases between the reads. We look for overlaps where the suffix of one read matches the prefix of another read. Here are the significant overlaps identified (a short verification snippet follows this list):

  • Read 1 (ATGGCTA) overlaps with Read 2 (GGCTAAC) by five bases (GGCTA).
  • Read 2 (GGCTAAC) overlaps with Read 7 (GCTAACG) by six bases (GCTAAC).
  • Read 7 (GCTAACG) overlaps with Read 3 (CTAACGT) by six bases (CTAACG).
  • Read 3 (CTAACGT) overlaps with Read 8 (TAACGTA) by six bases (TAACGT).
  • Read 8 (TAACGTA) overlaps with Read 4 (AACGTAG) by six bases (AACGTA).
  • Read 4 (AACGTAG) overlaps with Read 9 (ACGTAGT) by six bases (ACGTAG).
  • Read 5 (CGTAGCT) overlaps with Read 6 (TAGCTAA) by five bases (TAGCT).
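
These overlaps can be checked mechanically. The sketch below computes, for each read, its best suffix-prefix successor (illustrative only; ties break by list order):

Python

def overlap(a: str, b: str) -> int:
    """Longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

reads = ["ATGGCTA", "GGCTAAC", "CTAACGT", "AACGTAG", "CGTAGCT",
         "TAGCTAA", "GCTAACG", "TAACGTA", "ACGTAGT"]
for a in reads:
    best = max((b for b in reads if b != a), key=lambda b: overlap(a, b))
    n = overlap(a, best)
    if n >= 3:  # the minimum overlap required by the problem
        print(f"{a} -> {best}  (overlap {n})")

Besides reproducing the overlaps listed above, this check also surfaces a five-base overlap TAGCTAA → GCTAACG that the greedy merges never use, which becomes relevant in part (d) below.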

Step 2: Merge Reads Based on Maximum Overlaps

We begin merging reads starting with the pairs that have the maximum overlap length.

Merge 1: Read 2 and Read 7

  • Overlap: Six bases (GCTAAC).
  • Merged Sequence: GGCTAACG.
  • Reads Remaining: Reads 1, Merged Read (GGCTAACG), Reads 3–6, 8, 9.

Merge 2: Merged Read and Read 3

  • Overlap: Six bases (CTAACG).
  • Merged Sequence: GGCTAACGT.
  • Reads Remaining: Reads 1, Merged Read (GGCTAACGT), Reads 4–6, 8, 9.

Merge 3: Merged Read and Read 8

  • Overlap: Six bases (TAACGT).
  • Merged Sequence: GGCTAACGTA.
  • Reads Remaining: Reads 1, Merged Read (GGCTAACGTA), Reads 4–6, 9.

Merge 4: Merged Read and Read 4

  • Overlap: Six bases (AACGTA).
  • Merged Sequence: GGCTAACGTAG.
  • Reads Remaining: Reads 1, Merged Read (GGCTAACGTAG), Reads 5, 6, 9.

Merge 5: Merged Read and Read 9

  • Overlap: Six bases (ACGTAG).
  • Merged Sequence: GGCTAACGTAGT.
  • Reads Remaining: Reads 1, Merged Read (GGCTAACGTAGT), Reads 5, 6.

Merge 6: Read 5 and Read 6

  • Overlap: Five bases (TAGCT).
  • Merged Sequence: CGTAGCTAA.
  • Reads Remaining: Reads 1, Merged Read (GGCTAACGTAGT), Merged Read (CGTAGCTAA).

Merge 7: Read 1 and Merged Read (GGCTAACGTAGT)

  • Overlap: Five bases (GGCTA).
  • Merged Sequence: ATGGCTAACGTAGT.
  • Reads Remaining: Merged Read (ATGGCTAACGTAGT), Merged Read (CGTAGCTAA).

b

After these merges, no further significant overlaps are found between the remaining sequences. The final assembly consists of two contigs:

  1. Contig 1: ATGGCTAACGTAGT
  2. Contig 2: CGTAGCTAA

c

This assembly is not unique. The reason is that during the assembly process, multiple overlaps of the same length were available, and choices made at each step could lead to different assembly paths. For example, when merging reads with an overlap of six bases, there were several options, and selecting a different pair could have resulted in a different contig sequence. The greedy algorithm does not consider all possible combinations or the global optimal assembly, so the final result depends on the order in which overlaps are considered and merged.

d

A potential issue with this assembly is that it results in two separate contigs instead of a single contiguous sequence, which may not reflect the true original genome sequence. The inability to merge the two contigs suggests that there might be missing reads or insufficient overlap information to bridge the gap between them. Additionally, the greedy algorithm's nature means it makes local optimal choices without considering the overall assembly, potentially leading to misassemblies or fragmented genomes.

Specifically, the reads CGTAGCT and TAGCTAA were merged to form Contig 2, but this contig could not be connected to Contig 1. It's possible that there was an overlap between Contig 2 and an internal region of Contig 1 that was overlooked due to the algorithm's limitations or because the required overlap length was not met. This fragmentation could result in missing genetic information or misrepresentation of genomic structures, which is critical when accurate genome assembly is necessary for downstream analyses.

Q07

Points: 6

You are given the following set of reads from a DNA sequencing experiment:

@read1
ATGCGTAC
+
IIIIIIII
@read2
CGTACGTA
+
IIIIIFFF
@read3
CGTACATA
+
FFIIIIFF
@read4
TACGTAGT
+
FFHHIIII

a) Construct a de Bruijn graph for these reads using \(k\) = 5, where \(k\) is the edge length. Draw the graph, clearly labeling nodes and edges.

b) Identify and explain any features in your graph that might complicate genome assembly, such as bubbles/bulges, tips, or cycles.

c) Propose a possible original sequence that could have generated these reads. If multiple possibilities exist, explain why.

d) How would increasing \(k_{\text{edge}}\) to 6 change the structure of the graph? Discuss both potential benefits and drawbacks of this change.

Note

In your de Bruijn graph, represent the frequency of each edge by labeling it with a number.

Solution

a

To construct a de Bruijn graph using the provided reads and a \(k\)-mer size of 5, we begin by extracting all possible 5-mers from the reads. In this context, the nodes of the graph will be all possible (\(k\)-1)-mers, which are 4-mers, and the edges will represent the 5-mers connecting these nodes.

For each read, we extract the 5-mers and determine the prefix and suffix 4-mers (nodes). Now we identify all unique nodes and edges:

  • ATGC, TGCG, GCGT, CGTA, GTAC, TACG, ACGT, TACA, ACAT, CATA, GTAG, TAGT.

flowchart LR
    A((ATGC)) -->|1| B((TGCG))
    B -->|1| C((GCGT))
    C -->|1| D((CGTA))

    D -->|1| E((GTAG))
    E -->|1| F((TAGT))

    D -->|3| G((GTAC))
    G -->|1| H((TACG))
    H -->|2| I((ACGT))
    I -->|2| D

    G -->|1| J((TACA))
    J -->|1| K((ACAT))
    K -->|1| L((CATA))
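
For reference, the same graph can be built programmatically. Below is a minimal sketch that counts each edge under the conventions above (nodes are 4-mers, edges are 5-mers); its output matches the edge labels in the diagram:

Python

from collections import Counter

def de_bruijn_edges(reads: list[str], k: int) -> Counter:
    """Count each (k-1)-mer -> (k-1)-mer edge, one per k-mer occurrence."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1  # prefix node -> suffix node
    return edges

reads = ["ATGCGTAC", "CGTACGTA", "CGTACATA", "TACGTAGT"]
for (u, v), count in sorted(de_bruijn_edges(reads, k=5).items()):
    print(f"{u} -> {v}: {count}")

Changing k=5 to k=6 in the call shows the structural effect discussed in part (d).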

b

The de Bruijn graph constructed exhibits several features that can complicate genome assembly:

Branches (Forks):

  • At Node CGTA: There are two outgoing edges leading to GTAC and GTAG, creating ambiguity in the assembly path.
  • At Node GTAC: There are two outgoing edges to TACG and TACA, further increasing complexity.

Cycles:

  • Cycle Involving CGTA and ACGT: The path CGTA → GTAC → TACG → ACGT → CGTA forms a loop. Cycles can cause assemblers to get stuck or repeat sections erroneously.

Bubbles/Bulges:

The presence of multiple paths between the same nodes can create bubbles in the graph, representing sequencing errors or genuine variations. These features introduce uncertainty in determining the correct path through the graph, making it challenging for assembly algorithms to reconstruct the original sequence accurately.

c

In order to determine possible contigs, we try various paths from the ATGC tip to either TAGT or CATA.

  • ATGCGTAGT
  • ATGCGTACATA
  • ATGCGTACGTAGT
  • ATGCGTACGTACATA
  • ATGCGTACGTACGTAGT
  • ATGCGTACGTACGTACATA

These contigs represent the main paths through the graph that don't involve repeating significant portions of the sequence. The circular path is included because it represents a potential repetitive element in the genome. It's worth noting that there could be variations of these contigs depending on how many times you traverse the circular path. However, without additional information about the expected genome structure or read coverage, it's difficult to determine how many times this repeat should be included in the final assembly.

d

Since we are already using a \(k\)-mer size of 5, increasing the \(k\)-mer size would mean moving to \(k\) = 6. With longer \(k\)-mers, there are fewer possible \(k\)-mers extracted from the reads, leading to a simpler graph.

The use of longer \(k\)-mers in genome assembly offers several potential benefits. Firstly, it can lead to improved resolution of repeats, as longer \(k\)-mers are better able to span repetitive regions that shorter \(k\)-mers might struggle to distinguish. This can reduce ambiguity in the assembly graph and potentially resolve branching caused by short repeats, resulting in a more linear graph structure. Additionally, longer \(k\)-mers can simplify the overall graph structure by reducing the number of nodes and edges, making the graph easier to navigate and assemble. This simplification often results in fewer cycles and bubbles caused by errors or repetitive \(k\)-mers. Furthermore, longer \(k\)-mers enhance specificity, as they are less likely to occur by chance, which increases confidence in the overlaps between reads.

However, the use of longer \(k\)-mers also comes with potential drawbacks. One significant issue is the increased fragmentation due to sequencing errors. Since longer \(k\)-mers are more sensitive to errors, a single base error can invalidate an entire \(k\)-mer, potentially breaking the path in the graph and leading to disconnected components and gaps in the assembly. Another drawback is the reduced coverage of \(k\)-mers. Longer \(k\)-mers naturally occur less frequently in the dataset, which can result in insufficient coverage to form connections between nodes. This may lead to the loss of low-abundance sequences and the potential exclusion of rare variants. Lastly, longer \(k\)-mers can make it more difficult to assemble low-complexity regions. Areas with low complexity or high GC content might not produce enough valid longer \(k\)-mers, potentially resulting in assembly gaps.

Q08

Points: 6

You are working on assembling a bacterial genome. After initial quality control and assembly, you notice that your assembly is highly fragmented with a low N50 value.

a) List three possible reasons for this poor assembly.

b) For each reason, suggest a strategy to improve the assembly.

c) How would you validate the quality of your improved assembly?

Solution

a

When assembling a bacterial genome, encountering a highly fragmented assembly with a low N50 value can be indicative of several underlying issues. One possible reason for this poor assembly is insufficient sequencing coverage. Low sequencing depth means there aren't enough overlapping reads to reconstruct the genome contiguously, leading to gaps and fragmented contigs. Without adequate coverage, the assembler cannot confidently piece together the genome, resulting in a low N50 value, which reflects the fragmentation of the assembly.

Another potential cause is the presence of high repetitive content in the genome. Bacterial genomes often contain repetitive elements such as transposons, insertion sequences, or ribosomal RNA operons. These repetitive sequences can confuse assembly algorithms because reads from different parts of the genome may appear identical. The assembler may incorrectly merge these reads or fail to assemble them altogether, leading to fragmented assemblies and decreased N50 values.

A third reason could be poor-quality sequencing data. High error rates, uneven quality across reads, or residual adapter sequences can hinder the correct alignment and overlap of reads during assembly. Sequencing errors introduce incorrect bases, making it difficult for the assembler to find true overlaps between reads. Adapter contamination and low-quality bases at the ends of reads can also prevent proper merging, resulting in a fragmented assembly.

b

To address these issues and improve the assembly, you can employ several strategies. If insufficient sequencing coverage is the problem, increasing the sequencing depth is essential. This can be achieved by performing additional sequencing runs to generate more data. Before doing so, it's helpful to assess your current coverage using computational tools that estimate coverage levels and determine how much additional data is needed. Ensuring that your library preparation methods minimize biases can also help achieve more even coverage across the genome, which is crucial for a successful assembly.

In the case of high repetitive content, utilizing long-read sequencing technologies and advanced assembly techniques can make a significant difference. Long-read platforms like PacBio or Oxford Nanopore Technologies produce reads that are long enough to span repetitive regions, helping to resolve ambiguities that short reads cannot. Implementing a hybrid assembly approach, where you combine short-read data (which is typically more accurate) with long-read data (which provides greater contiguity), can leverage the strengths of both technologies. Additionally, creating mate-pair libraries with larger insert sizes can help bridge gaps caused by repeats, connecting contigs that short-insert libraries cannot.

If poor-quality sequencing data is contributing to the fragmented assembly, improving data quality through stringent preprocessing is crucial. Start by assessing the quality of your reads using tools like FastQC, which can identify issues such as low-quality scores or adapter contamination. Trimming low-quality bases and removing adapter sequences using software like Trimmomatic or Cutadapt can enhance the overall quality of your dataset. Applying error correction algorithms can also rectify sequencing errors before assembly, increasing the likelihood of correct read overlaps. If data quality issues persist despite these measures, it may be necessary to resequence your samples using optimized protocols to obtain higher-quality reads.

c

Once you've implemented strategies to improve the assembly, validating the quality of the new assembly is an essential next step. Begin by evaluating assembly metrics such as N50 and L50 values. An increase in N50 and a decrease in the number of contigs indicate improved contiguity of the assembly. Comparing the total length of your assembly to the expected genome size for your bacterial species can also provide insight into its completeness.
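
For concreteness, N50 can be computed directly from the contig lengths. A minimal sketch of the standard definition (toy lengths shown):

Python

def n50(contig_lengths: list[int]) -> int:
    """Smallest contig length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half, running = sum(lengths) / 2, 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0

print(n50([100, 80, 60, 40, 20]))  # 80: the two largest contigs reach 300/2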

Mapping the original sequencing reads back to the improved assembly is another critical validation step. Using alignment tools like BWA or Bowtie2, you can assess how well the reads align to the assembly. Uniform coverage across the genome suggests that the assembly accurately represents the sequencing data. Analyzing mismatch and insertion-deletion (indel) rates during alignment can help identify any regions with potential errors.

Employing tools like BUSCO (Benchmarking Universal Single-Copy Orthologs) can evaluate the completeness of your assembly by checking for the presence of conserved genes expected in your bacterial genome. A high percentage of complete BUSCO genes indicates that your assembly captures most of the essential genomic content. If a closely related reference genome is available, aligning your assembly to it using software like MUMmer can reveal structural variations or misassemblies, providing further validation.

Annotating the genome using tools such as Prokka can help identify genes, ribosomal RNA, transfer RNA, and other genomic features. Ensuring that essential metabolic pathways and housekeeping genes are present supports the functional completeness of your assembly. It's also important to check for contamination by using taxonomic classification tools like Kraken or Kaiju. These tools can detect and quantify any sequences that may have originated from other organisms, allowing you to exclude contaminant contigs from your assembly.

Analyzing repetitive regions with specialized software can assess how well these elements have been assembled. Verifying that repetitive sequences are accurately represented without fragmentation or misassembly is crucial, especially if your bacterial genome contains a high amount of repeats. Laboratory validation methods, such as PCR amplification of specific genomic regions followed by Sanger sequencing, can provide nucleotide-level confirmation of your assembly's accuracy.

Finally, conducting a comprehensive assessment using tools like QUAST (Quality Assessment Tool for Genome Assemblies) can generate detailed reports on various assembly metrics. Evaluating statistics such as GC content distribution, gene density, and \(k\)-mer frequencies can offer additional insights into the quality of your assembly.

Q09

Points: 8

Design a step-by-step workflow for preprocessing raw Illumina sequencing data and performing a de novo genome assembly. For each step, briefly explain its purpose and mention one commonly used bioinformatics tool that could be used to perform that step.

Solution

To preprocess raw Illumina sequencing data and perform a de novo genome assembly, you can follow a structured workflow that ensures each step enhances the quality and accuracy of the final assembly. The process begins with the quality assessment of the raw sequencing data, which is essential for identifying any potential issues that could affect downstream analyses. Tools like FastQC are commonly used for this purpose. FastQC generates detailed reports on various quality metrics, including per-base sequence quality, GC content, sequence length distribution, and the presence of overrepresented sequences or adapter contamination. By thoroughly examining these reports, you can detect problems such as low-quality reads or technical artifacts early in the process.

Following the quality assessment, the next crucial step is quality trimming and adapter removal. This step aims to eliminate low-quality bases from the ends of the reads and remove any residual adapter sequences that might have been introduced during the library preparation. Removing these unwanted sequences improves the overall quality of the data and prevents them from interfering with the assembly process. Trimmomatic is a widely used tool for this purpose. It can perform both quality trimming based on Phred quality scores and adapter removal using provided adapter sequence files. By specifying parameters tailored to your dataset, Trimmomatic efficiently cleans the reads, resulting in a dataset composed of high-quality sequences ready for assembly.

After trimming, it's beneficial to perform error correction on the cleaned reads to reduce the impact of sequencing errors, which can hinder the assembly process by creating false overlaps or introducing incorrect sequences. Error correction algorithms analyze the reads to identify and correct probable errors, enhancing the overall accuracy of the data. BayesHammer, part of the SPAdes assembler suite, is an effective tool for error correction of Illumina sequencing data. It employs a Bayesian approach to distinguish between sequencing errors and true polymorphisms, correcting errors while preserving genuine genetic variation.

With high-quality, error-corrected reads, the next step involves removing potential contaminant sequences. Contamination can originate from various sources, such as microbial DNA in the environment or other samples processed concurrently. Identifying and eliminating these contaminant reads ensures that the assembly reflects only the genome of the target organism. Kraken is a taxonomic classification tool that can efficiently assign reads to known organisms by matching \(k\)-mers to a database of reference sequences. By running Kraken on your dataset, you can detect and remove reads that map to non-target organisms, resulting in a purified dataset for assembly.

Now, you are ready to perform the de novo genome assembly. This process reconstructs the genome without the guidance of a reference sequence by assembling the reads based on their overlaps. SPAdes is a commonly used assembler for Illumina data, particularly well-suited for small genomes like those of bacteria. SPAdes uses a de Bruijn graph approach to assemble genomes from short reads, handling various read lengths and paired-end data effectively. By inputting your cleaned and corrected reads into SPAdes, the assembler constructs contigs and scaffolds that represent the genome sequence.

Once the assembly is complete, it's important to assess the quality of the assembled genome to ensure it meets the necessary standards for your research. Evaluating assembly metrics provides insights into the contiguity, completeness, and accuracy of the assembly. QUAST (Quality Assessment Tool for Genome Assemblies) is a tool designed for this purpose. QUAST generates comprehensive reports that include metrics like N50 (a statistic that indicates the contig length at which 50% of the genome is contained in contigs of that length or longer), total assembly length, number of contigs, GC content, and potential misassemblies. If a reference genome is available, QUAST can also compare your assembly against it to identify structural variations or errors.

Finally, to add functional context to your assembly, you should perform genome annotation. This step involves identifying genes, coding sequences, rRNAs, tRNAs, and other genomic features within the assembled genome, providing valuable insights into the organism's biology. Prokka is an automated annotation tool that streamlines this process for bacterial genomes. It integrates various databases and prediction tools to annotate genes quickly and accurately, producing files compatible with genome browsers and further analyses.
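
The whole workflow can be scripted end to end. Below is a hypothetical driver; command names, flags, file paths, and parameter values are illustrative and depend on how each tool is installed (e.g., via bioconda). Treat it as a sketch of the order of operations, not a verified pipeline.

Python

import subprocess

def run(cmd: list[str]) -> None:
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop the pipeline on any failure

# 1. Quality assessment of the raw reads.
run(["fastqc", "raw_1.fastq.gz", "raw_2.fastq.gz", "-o", "qc_raw"])

# 2. Quality trimming and adapter removal (paired-end, Phred+33).
run(["trimmomatic", "PE", "raw_1.fastq.gz", "raw_2.fastq.gz",
     "trim_1P.fastq.gz", "trim_1U.fastq.gz",
     "trim_2P.fastq.gz", "trim_2U.fastq.gz",
     "ILLUMINACLIP:adapters.fa:2:30:10", "SLIDINGWINDOW:4:20", "MINLEN:36"])

# 3. Contamination screen (optional; requires a Kraken2 database on disk).
# run(["kraken2", "--db", "k2_db", "--paired",
#      "trim_1P.fastq.gz", "trim_2P.fastq.gz",
#      "--report", "kraken.report", "--output", "kraken.out"])

# 4. Error correction + de novo assembly. SPAdes runs BayesHammer
# error correction internally before building its de Bruijn graph.
run(["spades.py", "-1", "trim_1P.fastq.gz", "-2", "trim_2P.fastq.gz",
     "-o", "assembly"])

# 5. Assembly quality metrics (N50, contig count, total length, ...).
run(["quast.py", "assembly/contigs.fasta", "-o", "quast_report"])

# 6. Genome annotation.
run(["prokka", "--outdir", "annotation", "assembly/contigs.fasta"])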

Programming+

These problems are not required and will not impact your BIOSC 1540 grade. The instructor will assess these separately to validate correctness without an assigned grade. Thus, you may work on these problems individually or in a team-based setting; they are "due" by the end of the semester. Happy coding!

Acceptable languages: Python v3.10+, Rust v1.80+, Mojo v24.4+

Files: FASTQ

Rewards

Engaging with these optional programming problems offers several valuable academic and professional growth opportunities.

  • Consistent engagement with these Programming+ problems will allow me to write more detailed, compelling recommendation letters highlighting your computational skills. These personalized letters can significantly boost your applications for future academic programs, internships, or job opportunities.
  • If there is enough interest, optional Friday recitations will be provided. This will give you individualized attention to accelerate learning and actionable feedback on your code and problem-solving approaches.
  • Exceptional solutions may be featured on our course website with the students' permission. This is a way for us to recognize and appreciate your hard work and dedication to these problems.

Note

These problems are similar to ones that would be given in a majors-only version of the class, although that class would include more relevant instruction during lectures and allow more than a week to complete them.

P1

Write a script that does the following:

  • Reads the FASTQ file.
  • Calculates and saves the mean quality score for each read as a CSV. Your score should subtract 33 from each character's ASCII value (i.e., Phred+33 encoding).
  • Makes the "Per base sequence quality" plot as produced by FastQC.

Example output format:

CSV

read,length,mean_quality
310547a8-d2e3-47a6-a9ff-ec4f7eeaf9de,199,9.7

P2

Implement a function that takes a DNA sequence and a \(k\)-mer size as input, and returns a dictionary of unique \(k\)-mer counts. Then use this function to:

  • Count all 3-mers in the sequences from any FASTQ file.
  • Print the top 5 most frequent 3-mers and their counts.
  • Explain how this information might be useful in genome assembly.

P3

Implement a simple de Bruijn graph construction algorithm:

  • Write a function that takes a list of reads and a \(k\)-mer size (for edges) as input.
  • The function should return a dictionary representing the de Bruijn graph, where keys are (\(k-1\))-mers and values are lists of possible next bases.
  • Use your function to construct a de Bruijn graph for the reads in a FASTQ file with k=4.
  • Print the node with the highest out-degree (i.e., the (\(k-1\))-mer with the most possible next bases).
  • Identify any bubbles in the graph (nodes with multiple outgoing edges that later reconverge).

If you can implement a function to traverse this graph and generate a possible assembly, explain your approach and show the result.