FASTQ
The FASTQ format is similar to FASTA, but also associates each nucleotide with an error probability. This is relevant because it enables us to do things such as trimming, filtering and estimating sample quality. Each record takes up four lines:
- Sequence id with an optional description. Must start with the
@
character. - Actual sequence.
+
.- Quality (ASCII encoded phred scores).
In addition, the sequence and quality lines must contain the same number of characters. One simple example of a FASTQ record is:
@sequence_1
ATCG
+
????
which tells us that there is a record called sequence_1
with the nucleotide sequence ATCG
and associated qualities ????
. How do we know that these are nucleotides and not aminoacids? Strictly, we don't. However, the FASTQ format is almost exclusively used for nucleotides.
Quality, Phred Scores and Error Probabilities
There are three concepts we need to understand before proceeding:
-
Quality
- The ASCII character associated with a particular nucleotide. This is what is found in the actual FASTQ file. Can be converted to aphred score
by taking the ASCII value and subtracting the phred offset (which is usually33
). For example,?
in ASCII corresponds to the value63
. If the phred offset is33
(depends on the sequencing machine), then our phred score is63
-33
=30
. -
Phred Score
- A logarithmically encoded error probability. These are integers that can be converted to errror probabilities. E.g., a phred score of30
corresponds to an error probability of0.001
. -
Error probabilities
- The probability that a particular nucleotide was called incorrectly by the sequencing machine. For example, a nucleotideA
with error probability0.001
means there is a 0.1% likelihood that thisA
is actually something else, like aC
,G
orT
.
Since ASCII, phred scores and error probabilities are related, we can convert between them.
\[ ASCII \iff phred \iff error \]
Before we proceed with this, why are error probabilities not used directly in the FASTQ file? The answer is convenience, efficiency and discretization. We don't want the FASTQ file to contain lots of (potentially very small) floating point numbers. That would be messy and unfeasible.
We won't go through the math regarding phred scores and error probabilities, but the equality looks something like this:
\[ error = 10^{-(ASCII - \text{phred_offset})/10} \]
We can now test this formula. Assume we'd want to convert ?
to an error probability using phred_offset = 33
. This would equate to:
\[ error = 10^{-(63 - 33)/10} = 10^{-3} = 0.001 \]
To make things a bit more concrete, we'll go through a code example of how to do this in Rust.
fn phred_to_err(phred: u8) -> f64 { 10_f64.powf(-1.0 * ((phred - 33) as f64) / 10.0) } fn main() { assert_eq!(phred_to_err(b'?'), 0.001); assert_eq!(phred_to_err(b'5'), 0.01); assert_eq!(phred_to_err(b'+'), 0.1); assert_eq!(phred_to_err(b'!'), 1.0); }
Note that the conversion from ASCII character to ASCII value is implicit by using b'.'
.
Phred Offset
Why are phred offsets even used? The reason is related to ASCII characters. More precisely, the first ASCII characters with values 0-31 are non printable characters so it does not make sense to use them. In addition, ASCII 32 corresponds to a space which also does not make sense to use.
The first usable character is therefore ASCII 33, which is !
. To adjust for the fact that our zero
starts at 33, we simply subtract 33.