Identification of bacteria by analysis of a 16S rRNA nucleotide sequence
An unknown nucleotide sequence is provided to students to BLAST against the NCBI database (or another database) and identify. The percent identity and the origin of the sequence are discussed based on the statistics that BLAST reports.
Figure 14b.1: NCBI BLAST alignment score for unknown nucleotide sequence 18
The alignment has no gaps and the E value is 0.0. The sequence is a 100% match to the 16S rRNA of Haemophilus influenzae.
An unknown nucleotide sequence is provided to students to identify. The sequence was analyzed using NCBI BLAST and was a 100% match to the 16S rRNA of Haemophilus influenzae. 16S rRNA is the RNA component of the 30S small ribosomal subunit. The gene that encodes it is highly conserved among prokaryotes, so it can be amplified and sequenced from bacteria in general, and species can be identified by the number of nucleotide differences that have accumulated in it. As bacteria diverged, mutations accumulated in the 16S rRNA gene and created sequence differences between species.
Because the identification relies on mutations that accumulated as species diverged, recently diverged bacteria may have nearly identical 16S sequences and cannot be confidently distinguished. A high percent identity is an indicator of a good match. In this case, there is a 100% match to Haemophilus influenzae.
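Percent identity is the fraction of aligned positions at which the unknown sequence and the database sequence carry the same nucleotide. It is lowered by both gaps and mismatches. A gap is a position where one sequence has a nucleotide and the other has none, that is, an insertion or a deletion. A mismatch is a position where the two sequences have different nucleotides. A no-call (N) in the sequence does not count as an identical position, so it also lowers identity. If a sequence is returned with an N, the base can be corrected manually from the chromatogram reading and the sequence BLASTed again. A gap or mismatch may come from a sequencing error, a SNP, or a true difference between the actual sequence and the database sequence it is compared to. The more gaps and mismatches, the lower the identity between the actual sequence and the database sequence. The unknown sequence has no gaps when compared to Haemophilus influenzae.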
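The bookkeeping behind percent identity can be sketched in a few lines of Python. This is a minimal illustration, not BLAST's own implementation: the function names `alignment_stats` and `no_call_positions` and the two short aligned sequences are made up for the example, and it assumes the alignment is already given as two equal-length strings with `-` marking a gap.

```python
def alignment_stats(query: str, subject: str) -> dict:
    """Count identities, mismatches and gaps in a pairwise alignment.

    Both strings must be the same length; '-' marks a gap.
    An N (no-call) is never counted as an identity.
    """
    if len(query) != len(subject) or not query:
        raise ValueError("aligned sequences must be non-empty and equal in length")

    identities = mismatches = gaps = 0
    for q, s in zip(query.upper(), subject.upper()):
        if q == "-" or s == "-":
            gaps += 1                 # insertion or deletion
        elif q == s and q != "N":
            identities += 1           # same nucleotide in both sequences
        else:
            mismatches += 1           # different nucleotide, or a no-call

    length = len(query)
    return {
        "length": length,
        "identities": identities,
        "mismatches": mismatches,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / length,
        "percent_gaps": 100.0 * gaps / length,
    }


def no_call_positions(seq: str) -> list:
    """Return 1-based positions of N so they can be checked on the chromatogram."""
    return [i + 1 for i, base in enumerate(seq.upper()) if base == "N"]


# Hypothetical 12-column alignment: one no-call, one gap, one true mismatch.
query = "ACGTNACG-TAC"
subject = "ACGTAACGGTAT"
print(alignment_stats(query, subject))
print(no_call_positions(query))
```

In this made-up alignment, the N and the final C/T difference are both mismatches and the `-` is the only gap, which shows that gaps and mismatches are counted separately even though both lower the identity.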
The E value (expect value) is the number of alignments with a score at least as good as the one observed that would be expected by chance in a database of that size. The lower the E value, the less likely it is that the match is a chance hit. The E value falls as the alignment score rises, so a long alignment with a high percent identity and few gaps has a very low E value. Many gaps or mismatches lower the score, raise the E value, and increase the possibility that the match of the sequence to the database sequence is incorrect. One exception is a sequence that has a high percent identity but matches only a partial sequence: a short alignment scores lower, and there could be multiple partial sequences in the database that it matches. For this reason, it is better to BLAST longer sequences to get better matches. As mentioned earlier, a recently diverged species may also have few differences from the species from which it diverged. A partial sequence may match both species with a high percent identity, and the E value will be higher. For the unknown sequence analyzed, the returned BLAST result covered the complete sequence with a low E value (0.0). This is an indicator of an acceptable sequence identification.
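The effect of alignment length on the E value can be illustrated with the standard relationship between the E value and the bit score, E = m × n × 2^(−S'), where m is the query length, n is the total database length and S' is the bit score. The sketch below is a simplification: BLAST uses effective (edge-corrected) lengths, and the function name `expect_value`, the database size and the two bit scores are hypothetical values chosen only for illustration.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E value from a bit score: E = m * n * 2**(-S').

    Simplification: uses raw lengths, whereas BLAST uses effective lengths.
    """
    if query_len <= 0 or db_len <= 0:
        raise ValueError("lengths must be positive")
    return query_len * db_len * 2.0 ** (-bit_score)


DB_LEN = 10**9  # hypothetical database size in nucleotides

# Hypothetical perfect matches of two different lengths.
short_e = expect_value(bit_score=40, query_len=20, db_len=DB_LEN)
long_e = expect_value(bit_score=900, query_len=500, db_len=DB_LEN)

print(f"20-nt match:  E = {short_e:.3g}")
print(f"500-nt match: E = {long_e:.3g}")
```

Both hypothetical matches are 100% identical, yet the short one has a far higher E value because its score is lower. An E value reported as 0.0 corresponds to a score so high that the expected number of chance matches is too small to display.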