protein-qc-strict
Strictest protein sequence analysis quality control workflow (3365→456 sequences). Includes literature validation, CD-HIT redundancy removal, complexity check, motif verification, MSA quality assessment, and conservation/coevolution analysis. Based on real research experience with IRED enzyme family.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/billwanttobetop/protein-qc-strict
Protein Sequence Analysis Quality Control Skill
Version: 4.0.0
Created: 2026-04-25
Updated: 2026-04-25 22:16
Purpose: Strictest protein sequence analysis quality control - complete workflow (3365 → 456 sequences)
Quick Start
This skill provides a battle-tested quality control workflow for protein sequence analysis, developed through real research on the IRED enzyme family, where it reduced 3,365 candidate sequences to 456.
Key Features:
- Literature validation
- CD-HIT redundancy removal (90% sequence identity)
- Complexity check (Shannon entropy)
- Motif verification (Rossmann fold)
- MSA quality assessment
- Conservation & coevolution analysis
Use this skill when:
- Analyzing protein families
- Preparing sequences for phylogenetic analysis
- Ensuring publication-quality data
- Meeting strict literature standards
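🎯 Core principles (lessons learned the hard way)
1. Be extremely rigorous, never guess ⭐⭐⭐⭐⭐
In the user's own words: "We must be rigorous, extremely rigorous. No step in research may rest on a guess, and every step must be quality-checked."
2. Use the original tools ⭐⭐⭐⭐⭐
- ✅ MAFFT, trimAl, IQ-TREE, CD-HIT, MEME Suite, WebLogo
- ❌ Do not substitute simplified Python reimplementations
3. Quality-check every step ⭐⭐⭐⭐⭐
- Data preparation QC
- Alignment quality QC
- Analysis result QC
- Site position QC
4. Gaps can seriously mislead the analysis ⭐⭐⭐⭐⭐
- Alignment columns with more than 50% gaps must be filtered out
- The gap fraction of every result must be checked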
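The gap check in principle 4 can be sketched as a small QC helper. This is only an illustrative check on an already-aligned set of sequences, not a replacement for the original trimming tool (the source names trimAl but does not give its options). The function names and the toy alignment are hypothetical.

```python
def column_gap_fractions(aligned_seqs, gap_chars="-."):
    """Return the fraction of gap characters in each alignment column."""
    if not aligned_seqs:
        return []
    width = len(aligned_seqs[0])
    if any(len(s) != width for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    n = len(aligned_seqs)
    return [
        sum(s[i] in gap_chars for s in aligned_seqs) / n
        for i in range(width)
    ]


def gappy_columns(aligned_seqs, max_gap=0.5):
    """Return 0-based indices of columns whose gap fraction exceeds max_gap."""
    fractions = column_gap_fractions(aligned_seqs)
    return [i for i, f in enumerate(fractions) if f > max_gap]


# Toy alignment (made-up data): column 2 is 75% gaps, column 3 is exactly 50%.
toy = ["MK-AL", "MK--L", "MK-AL", "MKG-L"]
print(column_gap_fractions(toy))
print(gappy_columns(toy))
```

With the threshold written as "more than 50%", a column with exactly 50% gaps is kept. Any conserved or coevolving site reported downstream can be looked up in the same per-column fractions before it is interpreted.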
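📊 Complete QC workflow (3365 → 456)
Stage 1: Literature tracing (3365 → 840)
Purpose: ensure every sequence is experimentally validated
Method:
- Check the literature source of each sequence
- Confirm whether experimental validation exists
- Remove sequences without experimental validation
Criteria:
- ✅ Must be experimentally validated
- ✅ Must be supported by the literature
Results:
- Input: 3,365 sequences
- Output: 840 sequences
- Removed: 2,525 sequences (75.0%)
QC: ✅ All sequences are experimentally validated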
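The removed count and percentage follow directly from the input and output counts. A minimal sketch of that arithmetic, using a hypothetical helper named `removal_summary`, makes it easy to re-check the figures reported at each stage:

```python
def removal_summary(n_in, n_out):
    """Return (number removed, percent removed rounded to one decimal)."""
    if n_in <= 0 or not 0 <= n_out <= n_in:
        raise ValueError("expected 0 <= n_out <= n_in and n_in > 0")
    removed = n_in - n_out
    return removed, round(100 * removed / n_in, 1)


# Stage 1 counts as reported in this document.
removed, pct = removal_summary(3365, 840)
print(f"removed {removed} sequences ({pct}%)")
```

The same helper can be applied to the counts of the later stages.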
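Stage 2: Length filtering (840 → 793)
Purpose: remove sequences of abnormal length
Criteria:
- Minimum length: 200 aa
- Maximum length: 500 aa
- Rationale: sequences that are too short may be fragments; sequences that are too long may be fusion proteins
Code: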
from Bio import SeqIO

# The 20 standard amino acids; any other character counts as non-standard
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

sequences = list(SeqIO.parse("input.fasta", "fasta"))

# Length filter
filtered = []
for seq in sequences:
    length = len(seq.seq)
    if 200 <= length <= 500:
        # Check for non-standard characters
        seq_str = str(seq.seq)
        bad_chars = set(seq_str) - STANDARD_AA
        if not bad_chars:
            filtered.append(seq)

SeqIO.write(filtered, "filtered.fasta", "fasta")
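Results:
- Input: 840 sequences
- Output: 793 sequences
- Removed: 47 sequences (5.6%)
- < 200 aa: 43 sequences
- Non-standard characters: 4 sequences
QC: ✅ All sequences are 200-500 aa with no non-standard characters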
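The results above break the removals down by reason, which the filtering loop itself does not record. One way to obtain such a breakdown is sketched below. It works on plain sequence strings and uses only the standard library; the function names and category labels are hypothetical. It applies the length test first, in the same order as the filter above, and like that filter it treats lowercase letters as non-standard.

```python
from collections import Counter

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def classify(seq, min_len=200, max_len=500):
    """Label a sequence as 'keep' or with the reason it would be removed."""
    if len(seq) < min_len:
        return "too_short"
    if len(seq) > max_len:
        return "too_long"
    if set(seq) - STANDARD_AA:
        return "nonstandard"
    return "keep"


def tally(seqs, min_len=200, max_len=500):
    """Count how many sequences fall into each category."""
    return Counter(classify(s, min_len, max_len) for s in seqs)


# Made-up sequences for illustration only.
demo = ["A" * 150, "A" * 250, "A" * 249 + "X", "A" * 600]
print(tally(demo))
```

Running `tally` on the filtered output should yield only the `keep` category, which doubles as the QC check for this stage.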
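Stage 3: CD-HIT redundancy removal (793 → 456) ⭐⭐⭐⭐⭐
Purpose: remove highly similar sequences to avoid sampling bias
Tool: CD-HIT v4.8.1 (the original tool, mandatory)
Threshold: 90% (recommended in the literature)
Command: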
cd-hit -i input.fasta \
-o output.fasta \
-c 0.90 \
-n 5 \
-M 0 \
-T 0
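Parameters:
- -c 0.90: 90% sequence identity threshold
- -n 5: word length (5 for thresholds 0.7-1.0)
- -M 0: no memory limit
- -T 0: use all CPU threads
Results:
- Input: 793 sequences
- Output: 456 sequences
- Clustered away: 337 redundant sequences (42.5%)
QC:
# Check the cluster file
grep "^>" output.fasta.clstr | wc -l # should equal the number of output sequences
# Check the cluster size distribution (members per cluster, then a histogram)
awk '/^>/ {if (n) print n; n = 0; next} {n++} END {if (n) print n}' output.fasta.clstr | \
sort -n | uniq -c
QC criteria:
- ✅ A redundancy-removal rate of 30-50% is reasonable
- ✅ Most clusters have 1-3 members
QC result: ✅ Passed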
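The two QC criteria above can also be checked programmatically from the `.clstr` file. The sketch below assumes the usual CD-HIT cluster layout, in which each cluster starts with a `>Cluster N` line followed by one line per member. The function names and the sample text are hypothetical, and the sketch only summarizes CD-HIT's output rather than re-implementing the clustering.

```python
from collections import Counter


def cluster_sizes(clstr_lines):
    """Return the number of member lines in each cluster of a .clstr file."""
    sizes = []
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            sizes.append(0)      # a new cluster begins
        elif sizes:
            sizes[-1] += 1       # one more member of the current cluster
    return sizes


def summarize(sizes):
    """Summarize redundancy-removal rate and share of small clusters."""
    n_in, n_out = sum(sizes), len(sizes)
    return {
        "input": n_in,
        "output": n_out,
        "removal_pct": round(100 * (n_in - n_out) / n_in, 1) if n_in else 0.0,
        "small_cluster_pct": round(
            100 * sum(s <= 3 for s in sizes) / n_out, 1
        ) if n_out else 0.0,
        "size_histogram": dict(Counter(sizes)),
    }


# Made-up sample in the .clstr layout, for illustration only.
sample = """>Cluster 0
0\t312aa, >seqA... *
1\t310aa, >seqB... at 95.16%
>Cluster 1
0\t298aa, >seqC... *
""".splitlines()

print(summarize(cluster_sizes(sample)))
```

For a real run, the lines would come from `open("output.fasta.clstr")`. A `removal_pct` between 30 and 50 and a high `small_cluster_pct` correspond to the two criteria listed above.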
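Stage 4: Complexity check (456 → 456)
Purpose: remove low-complexity sequences (e.g. repetitive sequences)
Method: Shannon entropy
Threshold: >= 2.0
Formula:
H = -Σ(p_i * log2(p_i)), where p_i is the observed frequency of amino acid i
Code: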
from Bio import SeqIO
from collections import Counter
import numpy as np
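A minimal sketch of the entropy check described by the formula above. This is not the author's original implementation: it assumes the entropy is computed over the amino acid composition of the whole sequence, uses the standard library `math` module rather than NumPy, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the residue composition of a sequence."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def passes_complexity(seq, threshold=2.0):
    """True if the sequence meets the entropy threshold (>= threshold)."""
    return shannon_entropy(seq) >= threshold


# Made-up sequences: a homopolymer and a four-residue repeat.
print(shannon_entropy("AAAAAAAA"))
print(shannon_entropy("ACDEACDE"))
print(passes_complexity("ACDEACDE"))
```

With 20 possible amino acids the entropy is bounded by log2(20), roughly 4.32 bits, so a threshold of 2.0 rejects only sequences dominated by a handful of residue types.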
Paste this into your clawhub.json to enable this plugin.
{
"plugins": {
"official-billwanttobetop-protein-qc-strict": {
"enabled": true,
"auto_update": true
}
}
}