Sequence Vetting#

The sequence vetting functions provide tools for analyzing the complexity of protein sequences. They are particularly useful for quality control, for example to flag low-complexity or repetitive sequences before further analysis.

Outlier Detection#

The outlier detection function (Peptide.detect_outlier) analyzes a peptide sequence using composition-based vetting metrics and compares them against established SwissProt protein distributions to identify potential outliers, artifacts, or unusual sequences. This function provides an automated way to flag sequences that may require further investigation.

How it works#

The function evaluates the following metrics against 5th and 95th percentile thresholds derived from SwissProt protein analysis:

  1. Entropy: Shannon entropy of amino acid composition

  2. Max Frequency: Maximum frequency of any single amino acid

  3. Longest Run: Length of the longest run of consecutive identical amino acids
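The three metrics listed above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `composition_metrics` is hypothetical, and it assumes entropy is measured in bits over the observed residue frequencies.

```python
import math
from collections import Counter
from itertools import groupby


def composition_metrics(sequence: str) -> dict:
    """Compute three composition-based metrics for a non-empty sequence.

    Hypothetical helper for illustration; not part of the peptides API.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    length = len(sequence)
    # Shannon entropy in bits: sum of p * log2(1 / p) over observed residues.
    entropy = sum((n / length) * math.log2(length / n) for n in counts.values())
    # Frequency of the single most common residue.
    max_frequency = max(counts.values()) / length
    # Length of the longest stretch of one repeated residue.
    longest_run = max(len(list(group)) for _, group in groupby(sequence))
    return {
        "entropy": entropy,
        "max_frequency": max_frequency,
        "longest_run": longest_run,
    }


print(composition_metrics("AAAA"))
```

For the homopolymer "AAAA", this sketch yields an entropy of 0.0, a maximum frequency of 1.0, and a longest run of 4.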

Use Cases#

  1. Quality Control: Automatically flag problematic sequences in large datasets.

  2. Sequence Validation: Ensure sequences meet expected biological parameters.

  3. Artifact Detection: Identify potential sequencing errors or contaminants.

  4. Database Filtering: Remove or flag unusual sequences before analysis.

  5. Research Validation: Verify sequence quality in experimental workflows.

Best Practices#

  • Use as part of automated quality control pipelines.

  • Review flagged sequences manually to confirm issues.

  • Consider context when interpreting results (some unusual sequences may be biologically valid).

  • Combine with other validation methods for comprehensive quality assessment.

Result#

class peptides.OutlierResult(typing.NamedTuple)#

The result of outlier detection with Peptide.detect_outlier.

Hint

The following metrics are used:

entropy:

This metric is a direct measure of the information content of a peptide sequence.

Range:

0.0 to log₂(26) ≈ 4.70 bits.

Interpretation:

For SwissProt data, values between the 5th and 95th percentiles range from 3.71 to 4.18 bits.

max_frequency:

This metric is useful for identifying dominant amino acids and assessing sequence diversity.

Range:

1/sequence_length to 1.0

Interpretation:

For SwissProt data, values between the 5th and 95th percentiles range from 0.085 to 0.172.

longest_run:

This metric is useful for detecting repetitive regions, low complexity sequences, and potential sequencing artifacts.

Range:

1 to sequence_length

Interpretation:

For SwissProt data, values between the 5th and 95th percentiles range from 2 to 5.
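To make the comparison step concrete, the sketch below checks a set of metric values against the rounded SwissProt bounds quoted above. The `THRESHOLDS` table, the `flag_issues` helper, and the strict inequalities are assumptions made for illustration. The library may use more precise thresholds, different message wording, or different boundary handling.

```python
# Rounded 5th/95th percentile bounds quoted in the text above (SwissProt).
# Assumption: the library's exact thresholds may differ slightly.
THRESHOLDS = {
    "entropy": (3.71, 4.18),
    "max_frequency": (0.085, 0.172),
    "longest_run": (2, 5),
}


def flag_issues(metrics: dict) -> list:
    """Return one message per metric that falls outside its bounds.

    Hypothetical helper for illustration; not part of the peptides API.
    """
    issues = []
    for name, (low, high) in THRESHOLDS.items():
        value = metrics[name]
        if value < low:
            issues.append(f"{name} ({value:.3f}) below 5th percentile ({low:.3f})")
        elif value > high:
            issues.append(f"{name} ({value:.3f}) above 95th percentile ({high:.3f})")
    return issues


# Metric values for the homopolymer "AAAA": one residue repeated four times.
issues = flag_issues({"entropy": 0.0, "max_frequency": 1.0, "longest_run": 4})
print(issues)
```

With these inputs the sketch reports two issues (entropy too low, maximum frequency too high). The longest run of 4 stays within the 2 to 5 range.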

Example

For a real peptide, the large ribosomal subunit protein bL32 of Escherichia coli (UniProt:P0A7N4):

>>> from peptides import Peptide
>>> peptide = Peptide(
...     "MAVQQNKPTRSKRGMRRSHDALTAVTSLSVDKT"
...     "SGEKHLRHHITADGYYRGRKVIAK"
... )
>>> result = peptide.detect_outlier()
>>> result.is_outlier
False

For a problematic sequence:

>>> peptide = Peptide("AAAA")
>>> result = peptide.detect_outlier()
>>> result.is_outlier
True
>>> result.issues[0]
'Entropy (0.000) below 5th percentile (3.714)'

is_outlier#

A flag indicating whether the peptide is an outlier.

Type:

bool

issues#

A list of specific issues found during sequence vetting.

Type:

list of str

metrics#

A dictionary of calculated vetting metrics.

Type:

dict

Added in version 0.5.0.
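As a usage illustration, the sketch below shows how downstream code might consume a result through the three documented attributes. `OutlierResultSketch` is a hypothetical stand-in that mirrors those attributes so the example runs without the library. The field order, the `summarize` helper, and the `"entropy"` key in `metrics` are assumptions.

```python
from typing import Dict, List, NamedTuple


class OutlierResultSketch(NamedTuple):
    """Hypothetical stand-in mirroring the three documented attributes."""

    is_outlier: bool
    issues: List[str]
    metrics: Dict[str, float]


def summarize(result) -> str:
    """Build a one-line report from any object exposing the documented attributes."""
    if not result.is_outlier:
        return "ok"
    return "outlier: " + "; ".join(result.issues)


example = OutlierResultSketch(
    is_outlier=True,
    issues=["Entropy (0.000) below 5th percentile (3.714)"],
    metrics={"entropy": 0.0},  # key name is an assumption
)
print(summarize(example))
```

Because `summarize` only reads attributes by name, it should work with any object that exposes `is_outlier` and `issues`, regardless of the tuple's field order.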