Activity
Read trimming
Using potentially erroneous data can bias downstream analyses, so sequences must be cleaned first. We refer to this cleaning process as "trimming", as the problematic base calls are normally at the end of the reads. In general, quality treatments include:
- Trimming/cutting/masking sequences
  - from low quality score regions
  - beginning/end of sequence
  - removing adapters
- Filtering of sequences
  - with low mean quality score
  - too short
  - with too many ambiguous (N) bases
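The treatments listed above can be sketched in a few lines of Python. This is only an illustration of the idea, not how fastp or any other trimmer is implemented; the function names and the thresholds (`cutoff`, `min_mean_q`, `min_len`, `max_n`) are assumptions chosen for the example.

```python
def mean_quality(quals):
    """Average Phred score of a read (0.0 for an empty read)."""
    return sum(quals) / len(quals) if quals else 0.0


def trim_low_quality_tail(seq, quals, cutoff=20):
    """Cut trailing bases whose Phred score is below `cutoff`."""
    end = len(seq)
    while end > 0 and quals[end - 1] < cutoff:
        end -= 1
    return seq[:end], quals[:end]


def passes_filters(seq, quals, min_mean_q=20, min_len=20, max_n=5):
    """Keep a read only if it is long enough, good enough on average,
    and does not contain too many ambiguous (N) bases."""
    return (
        len(seq) >= min_len
        and mean_quality(quals) >= min_mean_q
        and seq.count("N") <= max_n
    )


# Hypothetical read: 24 good bases followed by two low-quality N calls.
seq = "ACGT" * 6 + "NN"
quals = [30] * 24 + [5, 5]
seq, quals = trim_low_quality_tail(seq, quals)
print(len(seq), passes_filters(seq, quals))
```

Trimming runs before filtering here, so a read is judged on what remains after its poor tail has been removed.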
Several tools can trim adapters, but none significantly outperforms the others. They differ mainly in their default parameters, advanced features, and report options.
For this activity, we will use fastp because of its automatic adapter detection and comprehensive analysis report.
In the Galaxy Tools panel on the left, click FASTA/FASTQ and select the "fastp fast all-in-one preprocessing for FASTQ files" tool.
Data selection
Our reads (SRR14933407) are paired-end, so we can select either "Paired" or "Paired Collection".
The only difference is that "Paired" makes you select the forward and reverse reads separately, whereas "Paired Collection" lets you select the collection we downloaded; the results will be the same.
Select the "Paired Collection" option and then "Paired-end data (fastq-dump)".
Adapters
Adapters are unique to the DNA preparation protocol and technology employed. Our data source specifies that the Nextera XT protocol and Illumina HiSeq 2000 were used. FastQC identified the following adapter sequences in our reads:
- Forward: GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTAT
  This is a TruSeq Adapter, Index 1, containing the prefix GATCGGAAGAGCACACGTCTGAACTCCAGTCAC, the D703 i7 index CGCTCATT, and part of the suffix ATCTCGTAT.
- Reverse: GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACGTCCTGGTGTAGATCT (Illumina Single End PCR Primer 1)
Illumina provides a list of adapters used in their products. Since TruSeq adapters were identified, we can go to the TruSeq DNA indexes and note the specified adapter sequences for trimming:
- Read 1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
- Read 2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
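To see how the FastQC hits relate to the trimming sequences, the short Python check below compares the strings quoted above. All sequences are copied from this section; the variable names and the helper `offset_in` are made up for the example. The comparison skips the leading A of each trimming sequence, which is an observation about these particular strings rather than a general rule.

```python
# Sequences reported by FastQC (copied from the text above).
FASTQC_FORWARD = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTAT"
FASTQC_REVERSE = "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTACGTCCTGGTGTAGATCT"

# Parts of the forward hit, as described in the text.
PREFIX = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
D703_INDEX = "CGCTCATT"
SUFFIX_PART = "ATCTCGTAT"

# Trimming sequences listed for Read 1 and Read 2.
READ1_ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
READ2_ADAPTER = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"


def offset_in(adapter, hit, skip=1):
    """Position of `adapter` (minus its first `skip` bases) in `hit`, or -1."""
    return hit.find(adapter[skip:])


# The forward hit is exactly prefix + index + partial suffix.
print(PREFIX + D703_INDEX + SUFFIX_PART == FASTQC_FORWARD)

# Each trimming sequence, minus its leading A, starts the matching hit.
print(offset_in(READ1_ADAPTER, FASTQC_FORWARD))
print(offset_in(READ2_ADAPTER, FASTQC_REVERSE))
```

An offset of 0 means the trimming sequence lines up with the very start of the FastQC hit.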
If we were using a tool like cutadapt, we would have to specify these manually.
Since these adapters are widely used, fastp and trimmomatic can automatically detect and remove them.
For fastp, go to "Adapter Trimming Options" and enable "Adapter sequence auto-detection for paired-end".
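For intuition about what adapter removal does to a single read, here is a minimal Python sketch. It assumes exact matching only and is not fastp's algorithm; real trimmers also tolerate mismatches and, for paired-end data, can use the overlap between the two reads. The function name `trim_adapter`, the `min_overlap` parameter, and the example insert are hypothetical.

```python
READ1_ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"


def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter from `read` using exact matching only."""
    # Case 1: the complete adapter occurs somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with the first k bases of the adapter.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[: len(read) - k]
    # No adapter found: return the read unchanged.
    return read


insert = "ACGTACGTAC"  # hypothetical genomic fragment
print(trim_adapter(insert + READ1_ADAPTER + "GGGG", READ1_ADAPTER))
print(trim_adapter(insert + READ1_ADAPTER[:10], READ1_ADAPTER))
```

The `min_overlap` guard keeps the function from cutting a read just because its last one or two bases happen to match the start of the adapter.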
Quality
Processing data over and over again can consume a lot of resources; thus, tools often combine similar features into one run.
Instead of trimming adapters in one run and filtering on quality in another, we can remove base calls with low accuracy in the same run.
A Phred score below 20 (more than a 1% chance that the base call is wrong) is generally considered poor, so we will raise the fastp cutoff from its default of 15 to 20.
Under "Filter options", change the "Qualified quality phred" to 20.
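The cutoff is easier to interpret once Phred scores are converted to error probabilities with P = 10^(-Q/10). The Python sketch below shows that conversion and decodes a FASTQ quality string, assuming the common Phred+33 ASCII encoding; the function names and the example quality string are illustrative only.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score `q` is wrong."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Convert a FASTQ quality string (Phred+33 encoding) to scores."""
    return [ord(char) - 33 for char in quality_string]


def fraction_below(quals, cutoff=20):
    """Share of bases in a read whose score is under `cutoff`."""
    return sum(q < cutoff for q in quals) / len(quals)


for q in (15, 20, 30):
    print(q, phred_to_error_prob(q))

quals = decode_phred33("II5+")  # hypothetical 4-base read
print(quals, fraction_below(quals))
```

The last helper computes the kind of per-read statistic a quality filter can act on; the exact rule fastp applies to decide whether a read is kept is not shown here.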
Length
Some reads get shorter and shorter as adapters and low-quality base calls are removed. Once they are shorter than about 20 bases, they can put undue strain on downstream processing. By going to "Length filtering options" and setting "Length required" to 20, we can remove these short reads.
Now, we can run our tool.
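For readers working outside Galaxy, the settings chosen above correspond roughly to the following fastp command line, assembled here in Python. This is a sketch that assumes a standalone fastp installation; the file names are placeholders, and the Galaxy wrapper may pass additional options that are not shown.

```python
def build_fastp_command(in1, in2, out1, out2,
                        qualified_phred=20, length_required=20):
    """Assemble a fastp call mirroring the options selected in this activity."""
    return [
        "fastp",
        "--in1", in1, "--in2", in2,      # forward and reverse input reads
        "--out1", out1, "--out2", out2,  # trimmed output reads
        "--detect_adapter_for_pe",       # adapter auto-detection for paired-end data
        "--qualified_quality_phred", str(qualified_phred),
        "--length_required", str(length_required),
    ]


# Placeholder file names; substitute your own FASTQ files.
cmd = build_fastp_command(
    "SRR14933407_1.fastq.gz", "SRR14933407_2.fastq.gz",
    "trimmed_1.fastq.gz", "trimmed_2.fastq.gz",
)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where fastp is installed; the example only prints it, so nothing is executed.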