protein-qc-strict

Strictest protein sequence analysis quality control workflow (3365→456 sequences). Includes literature validation, CD-HIT redundancy removal, complexity check, motif verification, MSA quality assessment, and conservation/coevolution analysis. Based on real research experience with IRED enzyme family.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/billwanttobetop/protein-qc-strict

Protein Sequence Analysis Quality Control Skill

Version: 4.0.0
Created: 2026-04-25
Updated: 2026-04-25 22:16
Purpose: Strictest protein sequence analysis quality control - complete workflow (3365 → 456 sequences)

Quick Start

This skill provides a battle-tested quality control workflow for protein sequence analysis, developed through real research on the IRED (imine reductase) enzyme family, where it reduced 3,365 candidate sequences to 456.

Key Features:

Literature validation
CD-HIT redundancy removal (90%)
Complexity check (Shannon entropy)
Motif verification (Rossmann fold)
MSA quality assessment
Conservation & coevolution analysis

Use this skill when:

Analyzing protein families
Preparing sequences for phylogenetic analysis
Ensuring publication-quality data
Meeting strict literature standards

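🎯 Core Principles (Lessons Learned the Hard Way)

1. Be rigorous; never guess ⭐⭐⭐⭐⭐

In the user's words: "We must be rigorous, extremely rigorous. No step in research may rest on guesswork, and every step must be quality-checked."

2. Always use the original tools ⭐⭐⭐⭐⭐

✅ MAFFT, trimAl, IQ-TREE, CD-HIT, MEME Suite, WebLogo
❌ Do not substitute simplified Python reimplementations

3. Quality-check every step ⭐⭐⭐⭐⭐

Data preparation QC
Alignment quality QC
Analysis result QC
Site position QC

4. Gaps can seriously mislead the analysis ⭐⭐⭐⭐⭐

Sites with more than 50% gaps must be filtered out
The gap proportion of every result must be checked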
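
The gap check in principle 4 can be illustrated with a minimal sketch. It is intended only as a QC inspection of an alignment that has already been produced and trimmed with the original tools, not as a replacement for them. The function names and the toy alignment are hypothetical, and treating both `-` and `.` as gap characters is an assumption.

```python
def column_gap_fractions(alignment):
    """Return the fraction of gap characters in each alignment column."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_rows = len(alignment)
    # Assumption: '-' and '.' both denote gaps
    return [
        sum(row[i] in "-." for row in alignment) / n_rows
        for i in range(length)
    ]


def columns_passing_gap_filter(alignment, max_gap_fraction=0.5):
    """Indices of columns whose gap fraction does not exceed the threshold."""
    return [
        i
        for i, fraction in enumerate(column_gap_fractions(alignment))
        if fraction <= max_gap_fraction
    ]


# Hypothetical toy alignment: column 2 is 75% gaps and should be dropped
toy_alignment = [
    "MK-LA",
    "MK-LA",
    "M--LV",
    "MKTL-",
]
print(column_gap_fractions(toy_alignment))
print(columns_passing_gap_filter(toy_alignment))
```

Any conserved or coevolving site reported downstream can be looked up in the per-column fractions to confirm that it does not sit in a gap-dominated column.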

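📊 Complete QC Workflow (3365 → 456)

Stage 1: Literature Tracing (3365 → 840)

Purpose: Ensure every sequence has experimental validation

Method:

Check the literature source of every sequence
Confirm whether it has been experimentally validated
Remove sequences without experimental validation

Criteria:

✅ Must be experimentally validated
✅ Must be supported by the literature

Results:

Input: 3,365 sequences
Output: 840 sequences
Removed: 2,525 sequences (75.0%)

QC: ✅ All sequences are experimentally validated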
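
The source does not say how the literature evidence was recorded, so the following is only a sketch of the bookkeeping once a manual curation table exists. The `evidence` mapping and its field names (`reference`, `experimentally_validated`) are hypothetical, as are the sequence IDs.

```python
def keep_validated(sequence_ids, evidence):
    """Split IDs into (kept, removed) using a curated evidence table.

    A sequence is kept only if it has both a literature reference and an
    experimental-validation flag. Field names are hypothetical.
    """
    kept, removed = [], []
    for seq_id in sequence_ids:
        record = evidence.get(seq_id, {})
        has_reference = bool(record.get("reference"))
        is_validated = bool(record.get("experimentally_validated"))
        if has_reference and is_validated:
            kept.append(seq_id)
        else:
            removed.append(seq_id)
    return kept, removed


# Hypothetical curation table filled in by hand from the literature
toy_evidence = {
    "seq1": {"reference": "placeholder citation", "experimentally_validated": True},
    "seq2": {"reference": "placeholder citation", "experimentally_validated": False},
    "seq3": {"reference": "", "experimentally_validated": True},
}
kept_ids, removed_ids = keep_validated(["seq1", "seq2", "seq3", "seq4"], toy_evidence)
print(kept_ids, removed_ids)
```

Sequences missing from the table are removed rather than kept, which matches the rule that nothing passes on an assumption.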

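Stage 2: Length Filtering (840 → 793)

Purpose: Remove sequences of abnormal length

Criteria:

Minimum length: 200 aa
Maximum length: 500 aa
Characters: only the 20 standard amino acids (ACDEFGHIKLMNPQRSTVWY)
Rationale: sequences that are too short may be fragments; sequences that are too long may be fusion proteins

Code: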

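from Bio import SeqIO

# The 20 standard amino acids; anything else counts as a non-standard character
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

sequences = list(SeqIO.parse("input.fasta", "fasta"))

# Length filter: keep sequences of 200-500 aa
filtered = []
for seq in sequences:
    length = len(seq.seq)
    if 200 <= length <= 500:
        # Reject sequences containing non-standard characters
        seq_str = str(seq.seq)
        bad_chars = set(seq_str) - STANDARD_AA
        if not bad_chars:
            filtered.append(seq)

SeqIO.write(filtered, "filtered.fasta", "fasta")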

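Results:

Input: 840 sequences
Output: 793 sequences
Removed: 47 sequences (5.6%)
- < 200 aa: 43 sequences
- Non-standard characters: 4 sequences

QC: ✅ All sequences are 200-500 aa with no non-standard characters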
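
The per-reason breakdown above can be produced by classifying each sequence instead of silently dropping it. The sketch below uses plain strings rather than Biopython records so that it is self-contained. The function and label names are hypothetical; the thresholds, the alphabet, and the order of checks (length first, then characters) mirror the Stage 2 code.

```python
from collections import Counter

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def classify(sequence, min_len=200, max_len=500):
    """Return 'pass' or the reason a sequence fails the Stage 2 filter."""
    if len(sequence) < min_len:
        return "too_short"
    if len(sequence) > max_len:
        return "too_long"
    # Case-sensitive, like the Stage 2 code: lowercase residues are rejected
    if set(sequence) - STANDARD_AA:
        return "nonstandard_chars"
    return "pass"


def tally_reasons(sequences):
    """Count how many sequences fall into each outcome."""
    return Counter(classify(s) for s in sequences)


# Hypothetical toy input covering each outcome once
toy_sequences = [
    "A" * 199,        # one residue below the minimum
    "A" * 200,        # passes
    "A" * 501,        # one residue above the maximum
    "A" * 199 + "X",  # right length, but X is not a standard residue
]
print(tally_reasons(toy_sequences))
```

Running the same classification on the filtered output file is a direct way to back the QC statement: every sequence should come back as `pass`.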

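Stage 3: CD-HIT Redundancy Removal (793 → 456) ⭐⭐⭐⭐⭐

Purpose: Remove highly similar sequences to avoid bias

Tool: CD-HIT v4.8.1 (the original tool, mandatory!)

Threshold: 90% (recommended in the literature)

Command: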

cd-hit -i input.fasta \
       -o output.fasta \
       -c 0.90 \
       -n 5 \
       -M 0 \
       -T 0

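Parameters:

-c 0.90: 90% sequence identity threshold
-n 5: word length (5 for thresholds 0.7-1.0)
-M 0: no memory limit
-T 0: use all CPU threads

Results:

Input: 793 sequences
Output: 456 sequences
Clustered away: 337 redundant sequences (42.5%)

QC: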

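# Count clusters (should equal the number of output sequences)
grep -c "^>Cluster" output.fasta.clstr

# Cluster size distribution: number of clusters (first column) per cluster size (second column)
awk '/^>Cluster/ {if (n) print n; n = 0; next} {n++} END {if (n) print n}' \
  output.fasta.clstr | sort -n | uniq -c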

QC criteria:

✅ A redundancy removal rate of 30-50% is reasonable
✅ Most clusters contain 1-3 sequences

QC result: ✅ Passed
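
The two QC criteria can also be checked by parsing the `.clstr` file that CD-HIT writes next to its output. The sketch below only reads CD-HIT's output; it does not reimplement the clustering. It assumes the usual `.clstr` layout, in which each cluster begins with a `>Cluster N` header followed by one line per member. The function names and the toy cluster text are hypothetical.

```python
from collections import Counter


def cluster_sizes(clstr_lines):
    """Return the number of member sequences in each cluster, in file order."""
    sizes = []
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            sizes.append(0)   # a header opens a new cluster
        elif sizes:
            sizes[-1] += 1    # any other line is one member sequence
    return sizes


def redundancy_summary(sizes):
    """Summarise how many sequences were collapsed into representatives."""
    n_input = sum(sizes)
    n_representatives = len(sizes)
    removed = n_input - n_representatives
    return {
        "input": n_input,
        "representatives": n_representatives,
        "removed": removed,
        "removed_fraction": removed / n_input if n_input else 0.0,
        "size_distribution": dict(Counter(sizes)),
    }


# Hypothetical .clstr content with three clusters of sizes 2, 1 and 3
toy_clstr = """>Cluster 0
0\t350aa, >seqA... *
1\t348aa, >seqB... at 95.12%
>Cluster 1
0\t300aa, >seqC... *
>Cluster 2
0\t410aa, >seqD... *
1\t409aa, >seqE... at 97.80%
2\t405aa, >seqF... at 93.33%
""".splitlines()

summary = redundancy_summary(cluster_sizes(toy_clstr))
print(summary)
```

For the numbers reported in this stage, the same arithmetic gives 793 - 456 = 337 removed sequences, or 337 / 793 ≈ 42.5%, which falls inside the 30-50% range.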

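Stage 4: Complexity Check (456 → 456)

Purpose: Remove low-complexity sequences (e.g. repetitive sequences)

Method: Shannon entropy

Threshold: H >= 2.0

Formula:

H = -Σ(p_i * log2(p_i))

where p_i is the relative frequency of the i-th residue type.

Code: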

from Bio import SeqIO
from collections import Counter
import numpy as np
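
The source listing stops after its imports, so the author's implementation is not shown. As a minimal sketch of the formula, assuming the entropy is computed over residue frequencies of the whole sequence (the source does not confirm this), the check could look like the following. It uses only the standard library instead of NumPy and Biopython so that it runs on its own; the function names and test strings are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Shannon entropy (in bits) of the residue composition of a sequence."""
    if not sequence:
        return 0.0
    total = len(sequence)
    # H = -sum(p_i * log2(p_i)) over the residue types present
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(sequence).values()
    )


def passes_complexity(sequence, threshold=2.0):
    """True if the sequence meets the H >= threshold criterion."""
    return shannon_entropy(sequence) >= threshold


# Hypothetical examples: a near-homopolymer versus a fully mixed sequence
low_complexity = "AAAAAAAAAC"
high_complexity = "ACDEFGHIKLMNPQRSTVWY"
print(shannon_entropy(low_complexity), passes_complexity(low_complexity))
print(shannon_entropy(high_complexity), passes_complexity(high_complexity))
```

With 20 residue types the maximum possible value is log2(20) ≈ 4.32 bits, so a threshold of 2.0 only rejects sequences dominated by a handful of residues. This fits the reported result that all 456 sequences passed.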

Metadata

Author: @billwanttobetop

Stars: 4473

Updated: 2026-05-01

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-billwanttobetop-protein-qc-strict": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.