FASTA versus FASTQ

FASTA vs FASTQ: Bioinformatics File Formats

Format Structure

FASTA: In the FASTA format, each sequence entry begins with a single-line description, followed by lines of sequence data. The description line typically starts with a greater-than symbol ">" followed by an identifier and optionally a description or metadata. The sequence data can span multiple lines.
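The FASTA layout described above can be read with a few lines of Python. This is a minimal sketch rather than a reference parser: the function name `parse_fasta` and the sample records are made up for illustration, and it assumes plain nucleotide or protein text with no comment lines.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence data may span several lines; collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input; the first sequence is wrapped over two lines.
example = """>Sequence1 hypothetical description
ATCGGCTAGC
TAGCTAGCTA
>Sequence2
CTAGCTAGC
"""
records = list(parse_fasta(example.splitlines()))
for description, sequence in records:
    print(description, len(sequence))
```

Joining the wrapped lines is the only real work here, which is why FASTA is so easy to produce and consume.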

FASTQ: In the FASTQ format, each sequence entry consists of four lines:

  • Header line starting with "@" followed by an identifier and optionally additional information.
  • Sequence data represented by letters (A, C, G, T/U, and N for an undetermined base) indicating nucleotide bases.
  • A separator line that begins with a plus sign "+", optionally followed by a repeat of the identifier.
  • Quality scores encoded as ASCII characters, one per base, so this line has the same length as the sequence line. Each character reflects the confidence of the corresponding base call.
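The four-line layout above maps naturally onto a small reader. The sketch below assumes strictly four lines per record (no wrapped sequence or quality lines); `FastqRecord` and `parse_fastq` are illustrative names, not part of any library.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ input (assumes no line wrapping)."""
    it = iter(lines)
    for first in it:
        header = first.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence = next(it).rstrip("\r\n")
            separator = next(it).rstrip("\r\n")
            quality = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@"):
            raise ValueError(f"header must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"separator must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# A shortened, hypothetical record: ten bases and ten quality characters.
example = ["@SEQ_ID1", "GATTTGGGGT", "+", "!''*((((**"]
records = list(parse_fastq(example))
print(records[0])
```

The length check is worth keeping even in a sketch, since a mismatch between the sequence and quality lines is a common sign of a corrupted or truncated file.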

Information Content

FASTA: FASTA files primarily contain sequence data and minimal metadata in the form of the description line.

FASTQ: FASTQ files contain not only sequence data but also quality scores corresponding to each base in the sequence. These quality scores are crucial for assessing the reliability of base calls generated during sequencing.
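One way to see the difference in information content is to convert FASTQ to FASTA: the identifier and sequence carry over, while the separator and quality lines are simply dropped. The sketch below assumes four-line FASTQ records; `fastq_to_fasta` is a hypothetical helper name.

```python
def fastq_to_fasta(fastq_text: str) -> str:
    """Convert four-line FASTQ text to FASTA, discarding quality scores."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four lines")
    out = []
    for i in range(0, len(lines), 4):
        header, sequence = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"header must start with '@': {header!r}")
        out.append(">" + header[1:])  # '@' becomes '>'
        out.append(sequence)          # lines i+2 and i+3 are not carried over
    return "".join(line + "\n" for line in out)


# Shortened, hypothetical record for illustration.
fastq = "@SEQ_ID1\nGATTTGGGGT\n+\n!''*((((**\n"
fasta = fastq_to_fasta(fastq)
print(fasta, end="")
```

The conversion only works in this direction: a FASTA file has no per-base quality information from which a FASTQ file could be rebuilt.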

Quality Scores

FASTA: Since FASTA files do not include quality scores, there's no inherent information about the reliability or confidence of each base in the sequence.

FASTQ: Quality scores in FASTQ files provide information about the confidence level associated with each base call. These scores are typically represented using ASCII characters and can be used to assess the accuracy of sequencing data and to filter out low-quality reads.
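As a rough illustration of how the ASCII characters become numbers, the sketch below decodes a quality string and applies a simple mean-quality filter. It assumes the Phred+33 encoding (score = ASCII code minus 33), which is the convention in Sanger-style and current Illumina FASTQ files; older files may use a different offset. The function names and the threshold of 20 are arbitrary choices for the example.

```python
from typing import List


def decode_qualities(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to integer scores (Phred+33 assumed)."""
    scores = [ord(ch) - offset for ch in quality]
    if any(q < 0 for q in scores):
        raise ValueError("character below the assumed offset")
    return scores


def passes_mean_quality(quality: str, min_mean: float = 20.0) -> bool:
    """Keep a read if its average score reaches min_mean (illustrative cutoff)."""
    scores = decode_qualities(quality)
    return bool(scores) and sum(scores) / len(scores) >= min_mean


print(decode_qualities("!5C"))        # [0, 20, 34]
print(passes_mean_quality("CCCCCC"))  # True
print(passes_mean_quality("!!!!!!"))  # False
```

Averaging scores is a crude criterion, because a few very poor bases can hide behind many good ones, but it shows the basic idea of filtering on decoded qualities.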

Applications

FASTA: FASTA format is commonly used for representing sequence databases, sequence alignments, and other sequence-related data where quality information is not required.

FASTQ: FASTQ format is specifically designed for storing data generated by sequencing platforms such as Illumina, which produce both sequence data and corresponding quality scores. It's widely used in various bioinformatics applications including read mapping, variant calling, and de novo assembly where base call accuracy is crucial.

Examples

FASTA Example:

>Sequence1
ATCGGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
>Sequence2
CTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC

FASTQ Example:

@SEQ_ID1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@SEQ_ID2
TTGGCAGGCCAAGGCAGGCAGGCAGGCAGGCAGGCAGGCAGGCAGGCAGGCAGGCAGGCA
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

Phred Scores

Phred scores are a widely used measure of base-call quality in DNA sequencing data. A Phred score is a logarithmically scaled error probability: the higher the score, the lower the estimated probability that the base call is wrong.

The formula to convert a Phred score (Q) to an error probability (P), the estimated probability that the base call is incorrect, is:

P = 10^(-Q/10)

Conversely, the formula to convert an error probability (P) to a Phred score (Q) is:

Q = -10 × log10(P)
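The two formulas translate directly into code. This is a small sketch with illustrative function names; P is taken to be the probability that a base call is wrong.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is incorrect."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P); P must lie in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q={q}: P={p:.4%}, back to Q={error_prob_to_phred(p):.1f}")
```

Each increase of 10 in Q divides the error probability by 10, which is what the round trip in the loop demonstrates.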

In practice, Phred scores typically range from 0 to 40, although higher scores are possible. Here's what these scores represent:

  • A Phred score of 10 corresponds to a 1 in 10 chance (or 10%) of the base call being incorrect.
  • A Phred score of 20 corresponds to a 1 in 100 chance (or 1%) of the base call being incorrect.
  • A Phred score of 30 corresponds to a 1 in 1,000 chance (or 0.1%) of the base call being incorrect.
  • A Phred score of 40 corresponds to a 1 in 10,000 chance (or 0.01%) of the base call being incorrect.

These scores are widely used in bioinformatics for quality assessment and quality control of sequencing data. They are crucial for filtering out low-quality reads and improving the accuracy of downstream analyses such as variant calling and genome assembly.
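One possible way to turn per-base scores into a read-level filter is to sum the error probabilities, which gives the expected number of incorrect bases in the read. The sketch below assumes Phred+33 encoded quality strings; the function names and the cutoff of one expected error are illustrative choices, not a recommended setting.

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities, P = 10^(-Q/10), for one read."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def keep_read(quality: str, max_expected_errors: float = 1.0) -> bool:
    """Keep a read whose expected error count does not exceed the cutoff."""
    return expected_errors(quality) <= max_expected_errors


# '5' encodes Q20 (P = 0.01) and '!' encodes Q0 (P = 1) under Phred+33.
print(expected_errors("55"))
print(keep_read("5" * 50))   # 50 bases at Q20
print(keep_read("!!"))       # two bases at Q0
```

Because it works on probabilities rather than raw scores, this measure penalises a handful of very poor bases more strongly than an average of Phred scores would.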

