Installing ColabDesign Locally and Designing Binder Peptides Against a Target Protein

ColabDesign builds on several state-of-the-art deep learning models for protein structure prediction and design. The main ones are:

  1. AlphaFold: a large-scale machine learning system developed by DeepMind that can predict protein structures without any experimental data. AlphaFold performed outstandingly in CASP (Critical Assessment of protein Structure Prediction) and is currently the most advanced protein structure prediction model.

  2. RoseTTAFold: a deep-learning-based protein tertiary structure prediction tool developed by the David Baker lab. It combines modules based on deep convolutional networks and transformer networks, and can predict protein structures efficiently and accurately.

When using ColabDesign, these models are mainly used to design proteins via gradient descent (GD) on sequence likelihood. The approach comes in three main forms: free hallucination, inverted protein structure prediction, and constrained hallucination.

The hallucination design approach originally came from "hallucination" in image generation in computer vision. After it was successfully applied to protein design on top of trRosetta, this process became known as the "inverted structure model + optimization" design paradigm. Hallucination built on the RoseTTAFold and AlphaFold structure prediction models is called "RFdesign" and "AFdesign", respectively.

All of these methods rely on computing gradients between a predicted protein structure and a target structure, and on using those gradients to update the protein's amino acid sequence. In this way, proteins with desired structural properties can be designed, which matters for downstream applications such as developing protein drugs with specific functions. A conceptual sketch of this optimization loop is given below.
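To make the idea concrete, here is a minimal, purely illustrative sketch of the loop in JAX (the framework ColabDesign itself builds on). The loss here is a made-up placeholder over a residue profile; in ColabDesign the loss is instead computed from an AlphaFold/RoseTTAFold structure prediction:

import jax
import jax.numpy as jnp

L, A = 14, 20  # binder length, 20-letter amino acid alphabet

def loss_fn(logits, target_profile):
  # placeholder differentiable loss: distance to a target residue profile;
  # real hallucination losses score a predicted structure instead
  probs = jax.nn.softmax(logits, axis=-1)
  return jnp.mean((probs - target_profile) ** 2)

key = jax.random.PRNGKey(0)
logits = 0.01 * jax.random.normal(key, (L, A))
target_profile = jnp.ones((L, A)) / A  # dummy design objective

grad_fn = jax.jit(jax.grad(loss_fn))
learning_rate = 0.1
for step in range(100):
  logits = logits - learning_rate * grad_fn(logits, target_profile)

# discretize the continuous logits into an amino acid sequence
alphabet = "ARNDCQEGHILKMFPSTWYV"
print("".join(alphabet[i] for i in jnp.argmax(logits, axis=-1)))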

ColabDesign is applicable to many areas of protein design. Some of the main application areas are:

  1. Drug design: using deep learning to predict, engineer, and design protein structures can help researchers discover small molecules and proteins with therapeutic effects faster, which is important for new drug development.

  2. Biomanufacturing: by designing new proteins, researchers can build biological systems with specific functions, such as engineered bacteria that produce a particular chemical.

  3. Atomic-scale engineering: newly designed proteins enable novel research in materials science, energy, and nanoengineering.

  4. Disease treatment: designing proteins that bind a specific target protein, for example a protein that binds Mdm2 for cancer therapy; Mdm2 is a negative regulator of p53, a protein commonly known as a tumor suppressor.

These are just some of the main uses of ColabDesign in protein design; there are many others. Protein design is a very broad field with many potential research and application directions.

ColabDesign is a tool that makes protein design convenient. Local installation steps:

First, clone the modified repository (forked from the original) with git:

# clone the repository
git clone git@github.com:riveSunder/LocalColabDesign.git
cd LocalColabDesign

# create and activate a Python 3.10 virtual environment
virtualenv ./local_design --python=python3.10
source ./local_design/bin/activate

# install ColabDesign
pip install -e .

# download the AlphaFold2 parameters (the target directory must exist first)
mkdir -p params
curl -fsSL https://storage.googleapis.com/alphafold/alphafold_params_2022-12-06.tar | tar x -C params
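If the download succeeded, the params directory should now contain the model weight files. A quick way to confirm from Python:

import os

# the AlphaFold parameter tarball unpacks to several .npz weight files
print(sorted(os.listdir("params")))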

If you run ColabDesign on a GPU, you may need to install the JAX build that matches your GPU's CUDA version. If your CUDA version is 11, you can install it with:

pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

If your CUDA version is 12, use:

pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
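After installation, a quick sanity check confirms that the GPU build of JAX is active; on a correct install, jax.devices() should list a CUDA device rather than only the CPU:

import jax

# should list a CUDA/GPU device if the matching jax build was installed
print(jax.devices())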

Use the code below to design a binder peptide against PDB ID 4N5T. Two runs produced the designed sequences SEAEFMEQFYRAYE and SEEKFWEYWEEIVN, respectively.

import os
import re

import numpy as np
from scipy.special import softmax

from colabdesign import mk_afdesign_model, clear_mem
from colabdesign.shared.utils import copy_dict
from colabdesign.af.alphafold.common import residue_constants

def get_pdb(pdb_code=""):
  # resolve a structure: upload (Colab only), local file,
  # 4-letter PDB code, or UniProt code (AlphaFoldDB model)
  if pdb_code is None or pdb_code == "":
    # `files` comes from google.colab, so this branch only works in Colab;
    # for local runs, pass a file path or a PDB/UniProt code instead
    from google.colab import files
    upload_dict = files.upload()
    pdb_string = upload_dict[list(upload_dict.keys())[0]]
    with open("tmp.pdb","wb") as out: out.write(pdb_string)
    return "tmp.pdb"
  elif os.path.isfile(pdb_code):
    return pdb_code
  elif len(pdb_code) == 4:
    os.system(f"wget -qnc https://files.rcsb.org/view/{pdb_code}.pdb")
    return f"{pdb_code}.pdb"
  else:
    os.system(f"wget -qnc https://alphafold.ebi.ac.uk/files/AF-{pdb_code}-F1-model_v3.pdb")
    return f"AF-{pdb_code}-F1-model_v3.pdb"

# **target info**
pdb = "4N5T" #@param {type:"string"}
# - enter PDB code or UniProt code (to fetch the AlphaFoldDB model) or leave blank to upload your own
target_chain = "A" #@param {type:"string"}
target_hotspot = "" #@param {type:"string"}
if target_hotspot == "": target_hotspot = None
# - restrict loss to predefined positions on target (e.g. "1-10,12,15")
target_flexible = False #@param {type:"boolean"}
# - allow backbone of target structure to be flexible

# ---
# **binder info**
binder_len = 14 #@param {type:"integer"}
# - length of binder to hallucinate
binder_seq = "" #@param {type:"string"}
# - if defined, will initialize the design with this starting sequence
binder_seq = re.sub("[^A-Z]", "", binder_seq.upper())
if len(binder_seq) > 0:
  binder_len = len(binder_seq)
else:
  binder_seq = None
binder_chain = "" #@param {type:"string"}
if binder_chain == "": binder_chain = None
# - if defined, supervised loss is used (binder_len is ignored)

# ---
# **model config**
use_multimer = False #@param {type:"boolean"}
# - use alphafold-multimer for design
num_recycles = 0 #@param ["0", "1", "3", "6"] {type:"raw"}
num_models = "2" #@param ["1", "2", "3", "4", "5", "all"]
num_models = 5 if num_models == "all" else int(num_models)
# - number of trained models to use during optimization

x = {"pdb_filename":pdb,

     "chain":target_chain,

     "binder_len":binder_len,

     "binder_chain":binder_chain,

     "hotspot":target_hotspot,

     "use_multimer":use_multimer,

     "rm_target_seq":target_flexible}

    

x["pdb_filename"] = get_pdb(x["pdb_filename"])    

if "x_prev" not in dir() or x != x_prev:

  clear_mem()

  model = mk_afdesign_model(protocol="binder",

                            use_multimer=x["use_multimer"],

                            num_recycles=num_recycles,

                            recycle_mode="sample")

  model.prep_inputs(**x,

                    ignore_missing=False)

  x_prev = copy_dict(x)

  print("target length:", model._target_len)

  print("binder length:", model._binder_len)

  binder_len = model._binder_len

optimizer = "pssm_semigreedy" #@param ["pssm_semigreedy", "3stage", "semigreedy", "pssm", "logits", "soft", "hard"]
# - `pssm_semigreedy` - uses the designed PSSM to bias semigreedy opt. (recommended)
# - `3stage` - gradient-based optimization (GD) (logits → soft → hard)
# - `pssm` - GD optimization (logits → soft) to get a sequence profile (PSSM)
# - `semigreedy` - tries X random mutations, accepts those that decrease loss
# - `logits` - GD optimization of the logits inputs (continuous)
# - `soft` - GD optimization of softmax(logits) inputs (probabilities)
# - `hard` - GD optimization of one_hot(logits) inputs (discrete)

# WARNING: The output sequence from `pssm`, `logits`, `soft` is not one-hot.
# To get a valid sequence, use the other optimizers, or redesign the output
# backbone with another protocol such as ProteinMPNN.

# ----
# #### advanced GD settings
GD_method = "sgd" #@param ["adabelief", "adafactor", "adagrad", "adam", "adamw", "fromage", "lamb", "lars", "noisy_sgd", "dpsgd", "radam", "rmsprop", "sgd", "sm3", "yogi"]
learning_rate = 0.1 #@param {type:"raw"}
norm_seq_grad = True #@param {type:"boolean"}
dropout = True #@param {type:"boolean"}

model.restart(seq=binder_seq)
model.set_optimizer(optimizer=GD_method,
                    learning_rate=learning_rate,
                    norm_seq_grad=norm_seq_grad)
models = model._model_names[:num_models]

flags = {"num_recycles":num_recycles,
         "models":models,
         "dropout":dropout}

if optimizer == "3stage":
  model.design_3stage(120, 60, 10, **flags)
  pssm = softmax(model._tmp["seq_logits"],-1)

if optimizer == "pssm_semigreedy":
  model.design_pssm_semigreedy(120, 32, **flags)
  pssm = softmax(model._tmp["seq_logits"],-1)

if optimizer == "semigreedy":
  model.design_pssm_semigreedy(0, 32, **flags)
  pssm = None

if optimizer == "pssm":
  model.design_logits(120, e_soft=1.0, num_models=1, ramp_recycles=True, **flags)
  model.design_soft(32, num_models=1, **flags)
  flags.update({"dropout":False,"save_best":True})
  model.design_soft(10, num_models=num_models, **flags)
  pssm = softmax(model.aux["seq"]["logits"],-1)

O = {"logits":model.design_logits,
     "soft":model.design_soft,
     "hard":model.design_hard}

if optimizer in O:
  O[optimizer](120, num_models=1, ramp_recycles=True, **flags)
  flags.update({"dropout":False,"save_best":True})
  O[optimizer](10, num_models=num_models, **flags)
  pssm = softmax(model.aux["seq"]["logits"],-1)

model.save_pdb(f"{model.protocol}.pdb")
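Once a run finishes, the designed sequence can be read back from the model object. A short follow-up, assuming the run above completed (get_seqs() and get_loss() are the accessors used in the ColabDesign notebooks):

# designed binder sequence(s) as plain strings
print(model.get_seqs())

# last few values of the design loss trajectory
print(model.get_loss()[-5:])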


Source: 纽普生物    2024-03-05