FASTA versus FASTQ

FASTA and FASTQ are two common text-based file formats used in bioinformatics to store biological sequence data [1,2]. FASTA can hold nucleotide (DNA or RNA) or protein sequences, while FASTQ holds nucleotide sequencing reads together with their quality scores. The two formats differ in structure and in the information they carry. Here's a breakdown of the main differences between FASTA and FASTQ formats:

  1. Format Structure:

    • FASTA: In the FASTA format, each sequence entry begins with a single-line description, followed by lines of sequence data. The description line typically starts with a greater-than symbol ">" followed by an identifier and optionally a description or metadata. The sequence data can span multiple lines.
    • FASTQ: In the FASTQ format, each sequence entry consists of four lines:
      1. Header line starting with "@" followed by an identifier and optionally additional information.
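      2. Sequence data represented by letters (A, C, G, T/U, and N for a base that could not be determined) indicating nucleotide bases.
      3. A separator line that begins with a plus sign "+", optionally followed by a repeat of the identifier.
      4. Quality scores encoded as ASCII characters, one per base, so this line has the same length as the sequence line. Each character reflects the estimated probability that the corresponding base call is wrong.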
  2. Information Content:

    • FASTA: FASTA files primarily contain sequence data and minimal metadata in the form of the description line.
    • FASTQ: FASTQ files contain not only sequence data but also quality scores corresponding to each base in the sequence. These quality scores are crucial for assessing the reliability of base calls generated during sequencing.
  3. Quality Scores:

    • FASTA: Since FASTA files do not include quality scores, there's no inherent information about the reliability or confidence of each base in the sequence.
    • FASTQ: Quality scores in FASTQ files provide information about the confidence level associated with each base call. Each score is stored as a single ASCII character; in the widely used Phred+33 encoding, the character's ASCII code minus 33 gives the score. The scores can be used to assess the accuracy of sequencing data and to filter out low-quality reads.
  4. Applications:

    • FASTA: FASTA format is commonly used for representing sequence databases, sequence alignments, and other sequence-related data where quality information is not required.
    • FASTQ: FASTQ format is the de facto standard for storing reads generated by sequencing platforms such as Illumina, which produce both sequence data and corresponding quality scores. It's widely used in bioinformatics applications including read mapping, variant calling, and de novo assembly, where base call accuracy is crucial.
Here's an example of a FASTA file containing two DNA sequences:
>Sequence1
ATCGGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
>Sequence2
CTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC
In this example: Sequence1 and Sequence2 are identifiers for the two sequences. Each sequence is represented by a string of nucleotide bases (A, T, C, G) on one or multiple lines. The lines starting with ">" indicate the beginning of a sequence entry and include the sequence identifier.
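
The header-then-sequence layout described above can be read with a few lines of Python. This is a minimal sketch, not a production parser: the function name parse_fasta and the sample records (including the "hypothetical description" text) are assumptions made for illustration, and it takes the identifier to be the first word after ">".

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping of identifier -> sequence from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the identifier is the first word after ">"
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence data may span several lines; concatenate them
            records[current_id] += line
    return records


# Hypothetical input: the first sequence is wrapped over two lines
example = """>Sequence1 hypothetical description
ATCGGCTAGCTAGC
TAGCTAGCTA
>Sequence2
CTAGCTAGCTAGC
"""
records = parse_fasta(example.splitlines())
print(records)
```

Because a sequence may be wrapped over several lines, the parser keeps appending to the current record until it meets the next ">" line.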

Here's an example of a FASTQ file containing two sequence entries, each with its corresponding quality scores:
@SEQ_ID1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@SEQ_ID2
TTGGCAGGCCAAGGCAGGCAGGCAGGCAGGCAGGCAGGCAGGCAGGCAGGCAGGCAGGCA
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC


In this example: @SEQ_ID1 and @SEQ_ID2 are identifiers for the two sequences. The second line represents the nucleotide sequence for each sequence entry. The third line, starting with "+", serves as a separator line between the sequence and quality score information. The fourth line contains quality scores represented by ASCII characters. Each character corresponds to the quality score for the respective base in the sequence. The quality scores are usually Phred scores, which indicate the confidence level of each base call. Higher scores denote higher confidence.
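
The four-line layout can be read in the same spirit. The sketch below assumes strict four-line records and the Phred+33 encoding (ASCII code minus 33), which is an assumption about the input rather than something a FASTQ file states about itself. The names parse_fastq and PHRED_OFFSET, and the short sample record, are made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (identifier, sequence, phred_scores) for each four-line record."""
    kept = [ln.rstrip("\r\n") for ln in lines if ln.strip()]
    if len(kept) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(kept), 4):
        header, seq, sep, qual = kept[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # One ASCII character per base: subtract the offset to get the score
        scores = [ord(ch) - PHRED_OFFSET for ch in qual]
        parts = header[1:].split()
        yield (parts[0] if parts else ""), seq, scores


# Hypothetical single-record input
example = """@SEQ_ID1
GATTACA
+
!5?II+C
"""
records = list(parse_fastq(example.splitlines()))
print(records)  # [('SEQ_ID1', 'GATTACA', [0, 20, 30, 40, 40, 10, 34])]
```

The parser relies on line position rather than on the leading "@", because "@" is itself a valid quality character and can appear at the start of a quality line.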

Phred scores are a commonly used way to represent the quality of base calls in DNA sequencing data [3]. A Phred score is a logarithmic transformation of the estimated probability that a base call is incorrect, so higher scores correspond to greater confidence in the base call's accuracy.
The formula to convert a Phred score (Q) to an error probability (P) is:

$$P = 10^{-Q/10}$$

Conversely, the formula to convert an error probability (P) to a Phred score (Q) is:

$$Q = -10 \log_{10}(P)$$

In practice, Phred scores typically range from 0 to 40, although higher scores are possible. Here's what these scores represent:

  • A Phred score of 10 corresponds to a 1 in 10 chance (or 10%) of the base call being incorrect.
  • A Phred score of 20 corresponds to a 1 in 100 chance (or 1%) of the base call being incorrect.
  • A Phred score of 30 corresponds to a 1 in 1,000 chance (or 0.1%) of the base call being incorrect.
  • And so on.
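
The correspondence in this list can be checked numerically. The sketch below takes the error probability to be P = 10^(-Q/10) and its inverse Q = -10 log10(P); the function names are hypothetical and chosen only for this example.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) to a Phred score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.2f}%")
```

Each increase of 10 in the Phred score divides the error probability by 10, which is why Q30 is often quoted as "99.9% accurate".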

These scores are widely used in bioinformatics for quality assessment and quality control of sequencing data. They are crucial for filtering out low-quality reads and improving the accuracy of downstream analyses such as variant calling and genome assembly.
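
As a rough illustration of such filtering, the sketch below keeps only reads whose mean Phred score reaches a cutoff. The cutoff of 20, the Phred+33 offset, the function names, and the two sample reads are all assumptions made for this example; real quality-control tools apply more refined criteria, such as trimming low-quality ends.

```python
from typing import List, Tuple

PHRED_OFFSET = 33      # assumption: Phred+33 encoding
MIN_MEAN_QUALITY = 20  # hypothetical cutoff, chosen only for illustration

Read = Tuple[str, str, str]  # (identifier, sequence, quality string)


def mean_quality(quality_string: str) -> float:
    """Average Phred score of a non-empty quality string."""
    scores = [ord(ch) - PHRED_OFFSET for ch in quality_string]
    return sum(scores) / len(scores)


def filter_reads(reads: List[Read], min_mean: float = MIN_MEAN_QUALITY) -> List[Read]:
    """Keep reads whose mean Phred score is at least min_mean."""
    return [r for r in reads if r[2] and mean_quality(r[2]) >= min_mean]


reads = [
    ("read_good", "GATTACA", "IIIIIII"),  # every base at Q40
    ("read_poor", "GATTACA", "+++++++"),  # every base at Q10
]
kept = filter_reads(reads)
print([name for name, _, _ in kept])  # ['read_good']
```

Averaging Phred scores is a simplification: because the scale is logarithmic, a few very poor bases affect the expected number of errors more than the mean score suggests.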

In summary, while both FASTA and FASTQ formats are used for storing biological sequence data, FASTQ includes quality scores for each base call, providing valuable information for assessing data reliability in sequencing applications.

References:

1. https://www.ncbi.nlm.nih.gov/genbank/fastaformat/

2. https://emea.illumina.com/informatics/sequencing-data-analysis/sequence-file-formats.html

3. https://doi.org/10.1016/B978-0-323-89775-4.00016-X
