FASTA vs FASTQ: Bioinformatics File Formats
Format Structure
FASTA: In the FASTA format, each sequence entry begins with a single-line description, followed by lines of sequence data. The description line typically starts with a greater-than symbol ">" followed by an identifier and optionally a description or metadata. The sequence data can span multiple lines.
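As a rough illustration of the FASTA layout described above, here is a minimal Python sketch that reads records into a dictionary. The function name `parse_fasta` and the sample records are made up for this example. It assumes the identifier is the first whitespace-delimited token after ">", and it is not meant to replace an established parser.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record identifier to its full (possibly multi-line) sequence."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Assumption: the identifier is the first token after ">".
            parts = line[1:].split(None, 1)
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are joined until the next description line.
            records[current_id] += line
    return records


# Hypothetical records, with the first sequence wrapped over two lines.
example = """>seq1 hypothetical description
ATCGGCTA
GCTAGCTA
>seq2
CTAGC
"""
fasta_records = parse_fasta(example.splitlines())
print(fasta_records)
```

Joining the wrapped lines is the important step: the line breaks inside a FASTA sequence are only layout and carry no biological meaning.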
FASTQ: In the FASTQ format, each sequence entry consists of four lines:
- Header line starting with "@" followed by an identifier and optionally additional information.
- Sequence data represented by letters (A, C, G, T/U) indicating nucleotide bases.
- A separator line usually represented by a plus sign "+".
- Quality scores represented by ASCII characters, one character per base, so this line has the same length as the sequence line. Each score reflects the confidence of the corresponding base call.
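The four-line layout above can be sketched as a small Python generator. The names `FastqRecord` and `parse_fastq` and the sample read are hypothetical. The sketch assumes each record occupies exactly four lines, with no wrapped sequence or quality lines.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four-line block (header, sequence, '+', quality)."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence = next(it).rstrip("\r\n")
            separator = next(it).rstrip("\r\n")
            quality = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@"):
            raise ValueError(f"header does not start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"separator does not start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: the identifier is the first token after "@".
        parts = header[1:].split(None, 1)
        yield FastqRecord(parts[0] if parts else "", sequence, quality)


# Hypothetical single-read input.
example_lines = ["@read1 hypothetical", "GATTACA", "+", "IIIIHH#"]
fastq_records = list(parse_fastq(example_lines))
print(fastq_records)
```

The parser counts lines instead of looking for "@" because "@" is also a valid quality character and can appear at the start of a quality line.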
Information Content
FASTA: FASTA files primarily contain sequence data and minimal metadata in the form of the description line.
FASTQ: FASTQ files contain not only sequence data but also quality scores corresponding to each base in the sequence. These quality scores are crucial for assessing the reliability of base calls generated during sequencing.
Quality Scores
FASTA: Since FASTA files do not include quality scores, there's no inherent information about the reliability or confidence of each base in the sequence.
FASTQ: Quality scores in FASTQ files give the confidence level of each base call. Each score is stored as a single ASCII character; in the widely used Phred+33 encoding, the score is the character's ASCII code minus 33. The scores can be used to assess the accuracy of sequencing data and to filter out low-quality reads.
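To make the decoding and filtering idea concrete, here is a small Python sketch. It assumes the Phred+33 encoding (ASCII code minus 33), and the 20.0 mean-quality cutoff is an arbitrary value chosen for illustration. The function names and sample reads are hypothetical.

```python
from typing import List, Tuple

Read = Tuple[str, str, str]  # (identifier, sequence, quality string)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to numeric scores (assumes Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of the decoded scores; 0.0 for an empty string."""
    scores = decode_quality(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0


def filter_reads(reads: List[Read], min_mean_quality: float = 20.0) -> List[Read]:
    """Keep reads whose mean score reaches the (illustrative) threshold."""
    return [r for r in reads if mean_quality(r[2]) >= min_mean_quality]


# Hypothetical reads: one with high scores, one with very low scores.
reads = [("good", "ACGT", "IIII"), ("poor", "ACGT", "!!#$")]
kept = filter_reads(reads)
print(decode_quality("!5I"), [name for name, _, _ in kept])
```

Averaging the scores directly is a simplification. Real filtering tools often trim low-quality ends or work with error probabilities instead.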
Applications
FASTA: FASTA format is commonly used for representing sequence databases, sequence alignments, and other sequence-related data where quality information is not required.
FASTQ: FASTQ format is specifically designed for storing data generated by sequencing platforms such as Illumina, which produce both sequence data and corresponding quality scores. It's widely used in various bioinformatics applications including read mapping, variant calling, and de novo assembly where base call accuracy is crucial.
Examples
FASTA Example:
>Sequence1
ATCGGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
>Sequence2
CTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC
FASTQ Example:
@SEQ_ID1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@SEQ_ID2
TTGGCAGGCCAAGGCAGGCAGGCAGGCAGGCAGGCAGGCAGGCAGGCAGGCAGGCAGGCA
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
Phred Scores
Phred scores are a common way to represent the quality of base calls in DNA sequencing data. They express the probability of an incorrect base call on a logarithmic scale, so a higher score means greater confidence that the base call is correct.
The formula to convert a Phred score (Q) to the probability (P) that the base call is incorrect is:
P = 10^(-Q/10)
Conversely, the formula to convert an error probability (P) to a Phred score (Q) is:
Q = -10 × log10(P)
In practice, Phred scores typically range from 0 to 40, although higher scores are possible. Here's what these scores represent:
- A Phred score of 10 corresponds to a 1 in 10 chance (or 10%) of the base call being incorrect.
- A Phred score of 20 corresponds to a 1 in 100 chance (or 1%) of the base call being incorrect.
- A Phred score of 30 corresponds to a 1 in 1,000 chance (or 0.1%) of the base call being incorrect.
- And so on.
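The two conversion formulas and the examples above can be checked with a few lines of Python. This is only a sketch; the function names are chosen for this example and are not taken from any library.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is incorrect."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 * log10(P); P must lie in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Print the error probability and implied accuracy for a few common scores.
for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:g}, accuracy {(1 - p) * 100:.2f}%")
```

Each step of 10 in the score divides the error probability by ten. This is why Q30 is often read as 99.9% base call accuracy.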
References
- https://www.ncbi.nlm.nih.gov/genbank/fastaformat/
- https://emea.illumina.com/informatics/sequencing-data-analysis/sequence-file-formats.html
- https://doi.org/10.1016/B978-0-323-89775-4.00016-X