Sequence Vetting
The sequence vetting functions provide tools for analyzing the complexity of protein sequences. These functions are particularly useful for quality control.
Outlier Detection
The outlier detection function (Peptide.detect_outlier) analyzes a peptide
sequence using composition-based vetting metrics and compares them against
established SwissProt protein distributions to identify potential outliers,
artifacts, or unusual sequences. This function provides an automated way to
flag sequences that may require further investigation.
How it works
The function evaluates the following metrics against 5th and 95th percentile thresholds derived from SwissProt protein analysis:
Entropy: Shannon entropy of amino acid composition
Max Frequency: Maximum frequency of any single amino acid
Longest Run: Length of longest consecutive identical amino acids
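The three metrics above can be sketched in plain Python. This is an illustrative re-implementation, not the library's own code: the function name `composition_metrics` and the returned dictionary keys are assumptions chosen for this example.

```python
import math
from collections import Counter
from itertools import groupby


def composition_metrics(sequence: str) -> dict:
    """Compute the three composition metrics for one sequence.

    Hypothetical helper for illustration; the library may compute
    these values differently.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")

    counts = Counter(sequence)
    n = len(sequence)

    # Shannon entropy (in bits) of the amino acid composition.
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())

    # Frequency of the most common amino acid.
    max_frequency = max(counts.values()) / n

    # Length of the longest stretch of one repeated amino acid.
    longest_run = max(len(list(group)) for _, group in groupby(sequence))

    return {
        "entropy": entropy,
        "max_frequency": max_frequency,
        "longest_run": longest_run,
    }


metrics = composition_metrics("AAAA")
print(metrics)  # entropy 0.0, max_frequency 1.0, longest_run 4
```

A homopolymer such as "AAAA" sits at the extreme of all three metrics: zero entropy, a maximum frequency of 1.0, and a run as long as the sequence itself.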
Use Cases
Quality Control: Automatically flag problematic sequences in large datasets.
Sequence Validation: Ensure sequences meet expected biological parameters.
Artifact Detection: Identify potential sequencing errors or contaminants.
Database Filtering: Remove or flag unusual sequences before analysis.
Research Validation: Verify sequence quality in experimental workflows.
Best Practices
Use as part of automated quality control pipelines.
Review flagged sequences manually to confirm issues.
Consider context when interpreting results (some unusual sequences may be biologically valid).
Combine with other validation methods for comprehensive quality assessment.
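As a rough sketch of the first two practices, a pipeline can collect flagged sequences together with their reported issues so they can be reviewed by hand afterwards. The helper names `flag_sequences` and `peptides_detector` are hypothetical and introduced only for this example; the only library calls used are `Peptide.detect_outlier` and the `is_outlier` and `issues` fields of its result.

```python
from typing import Callable, Dict, List


def flag_sequences(
    sequences: Dict[str, str],
    detect: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Return {name: issues} for every sequence with at least one issue.

    `detect` is any callable mapping a sequence to a list of issue
    strings (hypothetical interface for this sketch).
    """
    flagged = {}
    for name, sequence in sequences.items():
        issues = detect(sequence)
        if issues:
            flagged[name] = list(issues)
    return flagged


def peptides_detector(sequence: str) -> List[str]:
    """Detector backed by the peptides library (assumes version >= 0.5.0)."""
    from peptides import Peptide  # imported lazily so the sketch loads without it

    result = Peptide(sequence).detect_outlier()
    return list(result.issues) if result.is_outlier else []
```

Calling `flag_sequences(my_sequences, peptides_detector)` would then yield a report for manual review rather than silently discarding sequences, which leaves room for unusual but biologically valid entries.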
Result
- class peptides.OutlierResult(typing.NamedTuple)
The result of outlier detection with Peptide.detect_outlier.
The outlier detection function analyzes a peptide sequence using composition-based vetting metrics and compares them against established SwissProt protein distributions to identify potential outliers, artifacts, or unusual sequences. This approach provides an automated way to flag sequences that may require further investigation.
Hint
The following metrics are used:
entropy: This metric is a direct measure of the information content of a peptide sequence.
- Range: 0.0 to log₂(26) ≈ 4.70 bits.
- Interpretation: Most values fall between 3.71 and 4.18 bits for SwissProt data.
max_frequency: This metric is useful for identifying dominant amino acids and assessing sequence diversity.
- Range: 1/sequence_length to 1.0.
- Interpretation: Most values fall between 0.085 and 0.172 for SwissProt data.
longest_run: This metric is useful for detecting repetitive regions, low-complexity sequences, and potential sequencing artifacts.
- Range: 1 to sequence_length.
- Interpretation: Most values fall between 2 and 5 for SwissProt data.
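The percentile comparison can be sketched as follows. This is a minimal illustration under stated assumptions: the thresholds are the rounded values quoted above (the library's internal values may carry more digits), the name `find_issues` is hypothetical, every metric is assumed to be checked on both sides, and the message wording is only modeled on the library's output.

```python
from typing import Dict, List, Tuple

# (5th percentile, 95th percentile) pairs, using the rounded values
# quoted in this section; assumed, not read from the library.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "entropy": (3.71, 4.18),
    "max_frequency": (0.085, 0.172),
    "longest_run": (2, 5),
}


def find_issues(metrics: Dict[str, float]) -> List[str]:
    """List every metric that falls outside its percentile band."""
    issues = []
    for name, (low, high) in THRESHOLDS.items():
        value = metrics[name]
        if value < low:
            issues.append(
                f"{name} ({value:.3f}) below 5th percentile ({low:.3f})"
            )
        elif value > high:
            issues.append(
                f"{name} ({value:.3f}) above 95th percentile ({high:.3f})"
            )
    return issues


# Metric values of the homopolymer "AAAA".
issues = find_issues({"entropy": 0.0, "max_frequency": 1.0, "longest_run": 4})
for issue in issues:
    print(issue)
```

With these values the entropy and maximum frequency are flagged, while a run of 4 stays inside the 2 to 5 band. A sequence would then count as an outlier when the list of issues is non-empty.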
Example
For a real peptide, the large ribosomal subunit protein bL32 of Escherichia coli (UniProt:P0A7N4):
>>> from peptides import Peptide
>>> peptide = Peptide(
...     "MAVQQNKPTRSKRGMRRSHDALTAVTSLSVDKT"
...     "SGEKHLRHHITADGYYRGRKVIAK"
... )
>>> result = peptide.detect_outlier()
>>> result.is_outlier
False
For a problematic sequence:
>>> peptide = Peptide("AAAA")
>>> result = peptide.detect_outlier()
>>> result.is_outlier
True
>>> result.issues[0]
'Entropy (0.000) below 5th percentile (3.714)'
Added in version 0.5.0.