PC_ali

Multiple sequence and structure aligner PC_ali

Author: Ugo Bastolla, Centro de Biologia Molecular Severo Ochoa

(It includes a modified version of the Needleman-Wunsch aligner programmed by Dr. Andrew C. R. Martin in the ProFit suite of programs, (c) SciTech Software 1993-2007)

PC_ali performs hybrid multiple structure and sequence alignments based on the structure+sequence similarity score PC_sim. Besides the MSA, it outputs pairwise similarity and divergence scores, a neighbor-joining phylogenetic tree obtained with the hybrid evolutionary divergence measure based on PC_sim, and a PDB file with the multiply superimposed structures. Optionally, it computes violations of the molecular clock for each pair of proteins.

PC_ali takes as input:

1. either a list of PDB files (option -pdblist; format: column 1: PDB code or file name, column 2: chain, column 3: domain, e.g. 1-222, column 4: domain name, column 5: directory 1 or 2 where the file is stored), or
2. not aligned sequences (option -seq), or
3. an MSA (option -ali).

In cases 2 and 3, PDB file names must be specified as sequence names. In case 1, the PDB files may be stored in two different folders: folder 1 is given with -pdbdir and folder 2 with -pdbdir2. The folder of each PDB file (1 or 2) is specified in the optional 5th column of the pdblist file; the default is folder 1. It is not allowed to input both a list of PDB files and an MSA.

Chains can be input either as characters (e.g. "1opd A", default) or as numbers if the option -chain_num is specified (e.g. "1opd 1"; in this case the program reads the 1st chain in the PDB file). If the chain is not specified either as character or as number, the program reads the first one.
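
To illustrate the pdblist format described above, here is a minimal Python sketch that parses one line of such a file. It is not part of PC_ali: the names `PdbListEntry` and `parse_pdblist_line` are hypothetical, and the sketch assumes whitespace-separated columns where only trailing columns may be omitted. The chain is kept as a string, since whether it is a character or a number depends on -chain_num.

```python
from typing import NamedTuple, Optional

class PdbListEntry(NamedTuple):
    """One line of a pdblist file (hypothetical container, not from PC_ali)."""
    file_name: str                  # column 1: PDB code or file name (mandatory)
    chain: Optional[str] = None     # column 2: chain, as character or number
    domain: Optional[str] = None    # column 3: residue range, e.g. "1-222"
    dom_name: Optional[str] = None  # column 4: domain name
    folder: int = 1                 # column 5: directory 1 or 2 (default 1)

def parse_pdblist_line(line):
    """Parse one whitespace-separated pdblist line; return None for blank lines."""
    fields = line.split()
    if not fields:
        return None
    # Pad missing trailing columns (assumption: only trailing columns are omitted).
    fields += [None] * (5 - len(fields))
    name, chain, domain, dom_name, folder = fields[:5]
    folder = int(folder) if folder is not None else 1
    if folder not in (1, 2):
        raise ValueError("folder must be 1 or 2")
    return PdbListEntry(name, chain, domain, dom_name, folder)

# "dom1" is an invented domain name used only for illustration.
full = parse_pdblist_line("1opd A 1-222 dom1 2")
minimal = parse_pdblist_line("1opd")
```

Here `minimal` keeps the default folder 1 and leaves the chain unset, which corresponds to the case where the program reads the first chain.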

Usage:

PC_ali

-pdblist <List of PDB files> Format: 1 file_name 2 chain 3 domain 4 dom_name 5 dir (only file_name is mandatory)
-seq <sequences in FASTA format, with names of PDB files>
-ali <MSA file in FASTA format, with names of PDB files>
-chain_num ! Interpret chain as number instead of character
# The pdb code is optionally followed by the chain index, e.g. >1opd.pdb A or >1opd 1 or >1opdA or >1opd_A or >1opd_1

-pdbdir <folder of pdb files>  (default: current directory)
-pdbdir2 <2nd folder of pdb files>  (default: current directory)
-pdbext <extension of pdb files>  (default: .pdb)

Optional parameters:

 -id Print statistics of conservation and changes (.id)
 -out <Name of output files> (default: alignment file)
 -sim_thr    <identity above which proteins are joined>
 -print_pdb    ! Print structure superimposition in PDB format
 -print_sim    ! Print similarity measures
 -print_div    ! Print divergence measures
 -print_cv     ! Print clock violations
 -func <file with function similarity for pairs of proteins>
 -clique     ! Initial alignment is based on cliques (may be slow)
 -ali_tm     ! Perform pairwise alignments that target TM score
 -ali_co     ! Perform pairwise alignments that target Contact Overlap
 -ali_ss     ! Perform alignments that target sec.structure
 -ss_mult    ! alignments that target sec.structure are MSA
 -shift_max <Maximum shift for targeting sec.str.>

Computed similarity measures:

Aligned fraction ali,
Sequence identity SI,
Contact overlap CO,
TM-score TM (Zhang & Skolnick Proteins 2004 57:702)
PC_sim, based on the main Principal Component of the four above similarity scores

With the option -print_sim, they are printed in <>.prot.sim for all pairs of protein sequences and, if present, for multiple conformations of the same sequence.

Computed divergence measures:

Tajima-Nei divergence TN=-log((SI-S0)/(1-S0)) with S0=0.06 (Tajima F & Nei 1984, Mol Biol Evol 1:269),
Contact_divergence CD=-log((q-q0(L))/(1-q0(L))) (Pascual-Garcia et al Proteins 2010 78:181-96),
TM_divergence=-log((TM-TM0)/(1-TM0)), TM0=0.167,
PC_divergence=-log((PC-PC0)/(1-PC0)), where PC0 is a linear combination of S0, TM0, CO(L) and nali0=0.5.

With the option -print_div, they are printed in <>.prot.div for all pairs of protein sequences and, if present, for multiple conformations of the same sequence.
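
The divergence measures above all share the form -log((s-s0)/(1-s0)), where s is a similarity score and s0 its background value. The following Python sketch illustrates that formula only; it is not taken from the PC_ali source. The function name `divergence` is hypothetical, the natural logarithm is assumed (the base is not stated above), and returning infinity when s does not exceed s0 is an illustrative convention.

```python
import math

S0 = 0.06    # background sequence identity quoted above for TN
TM0 = 0.167  # background TM score quoted above for TM_divergence

def divergence(sim, sim0):
    """Return -log((sim - sim0) / (1 - sim0)), assuming the natural logarithm.

    Similarities at or below the background have no finite divergence;
    this sketch returns infinity in that case (assumed convention).
    """
    if not 0.0 <= sim0 < 1.0:
        raise ValueError("background similarity must be in [0, 1)")
    if sim <= sim0:
        return math.inf
    return -math.log((sim - sim0) / (1.0 - sim0))

# SI = 0.53 gives (0.53 - 0.06) / 0.94 = 0.5, so TN = log 2.
tn = divergence(0.53, S0)
# TM-score of 0.80 against the background TM0.
tm_div = divergence(0.80, TM0)
```

Identical proteins (s = 1) have zero divergence, and the divergence grows without bound as the similarity approaches its background value.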

Program flow:

In the modality -ali, the program starts from the pairwise alignments obtained from the input MSA. In the modality -seq the starting pairwise alignments are built internally.
The program then modifies the pairwise alignments by targeting PC_sim. The similarity matrix is constructed recursively: the input pairwise alignment is used to compute the shared contacts and the distances after optimal superimposition (maximizing the TM score) for all pairs of residues, and a new alignment is obtained from them. Two iterations are usually enough to get good results. Optionally, for the sake of comparison, the program can target the TM score (-ali_tm), the Contact Overlap (-ali_co) and the secondary structure superposition (-ali_ss).
Proteins with sequence identity > sim_thr are joined together to accelerate computations and reduce the output size; they are treated as different conformations of the same protein. The structural similarity (divergence) between two proteins is computed as the maximum (minimum) across all the examined conformations. The clustering can be avoided by setting -sim_thr 1.
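
The rule for proteins with several conformations can be sketched in Python as follows. This is an illustration under stated assumptions, not the PC_ali implementation: the function names, the conformation labels and the score tables are all invented for the example.

```python
from itertools import product

def protein_similarity(confs_a, confs_b, sim):
    """Maximum similarity over all pairs of conformations of two proteins."""
    return max(sim(a, b) for a, b in product(confs_a, confs_b))

def protein_divergence(confs_a, confs_b, div):
    """Minimum divergence over all pairs of conformations of two proteins."""
    return min(div(a, b) for a, b in product(confs_a, confs_b))

# Invented scores between two conformations of protein A and two of protein B.
toy_sim = {("a1", "b1"): 0.62, ("a1", "b2"): 0.71,
           ("a2", "b1"): 0.58, ("a2", "b2"): 0.66}
toy_div = {("a1", "b1"): 0.50, ("a1", "b2"): 0.40,
           ("a2", "b1"): 0.60, ("a2", "b2"): 0.45}

best_sim = protein_similarity(["a1", "a2"], ["b1", "b2"],
                              lambda a, b: toy_sim[(a, b)])
best_div = protein_divergence(["a1", "a2"], ["b1", "b2"],
                              lambda a, b: toy_div[(a, b)])
```

In this toy case the pair (a1, b2) is the closest pair of conformations, so it sets both the similarity and the divergence between the two proteins.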
Then, the program builds progressive multiple alignments. If the option -clique is set, the starting MSA is based on the maximal cliques of the pairwise alignments, which requires neither a guide tree nor gap penalty parameters; however, it can become slow for large data sets.
Finally, the program iteratively runs progressive multiple alignments, using as guide tree the average linkage tree obtained with the PC_Div divergence measure of the previous step and as starting alignment the previous multiple alignment. The best MSA is selected as the one with the maximum value of the average PC similarity score.
The program prints the optimal MSA and the Neighbor Joining tree obtained from the corresponding PC_Div divergence measure.
If the options -print_sim or -print_div are set, the program prints in files <>.prot.sim and <>.prot.div similarity and divergence scores for the input MSA (if present) and for the final MSA.
If -print_pdb is set, the program prints the multiple superimposition obtained by maximizing the TM score.
Furthermore, if -print_cv is set, the program computes and prints for all four divergence measures the violations of the molecular clock averaged over all possible outgroups identified with the Neighbor-Joining criterion, and the corresponding significance score.

COMPILE:

unzip PC_ali.zip
make
cp PC_ali ~/bin/ (or any other directory in your path)

RUN:

PC_ali -seq -pdbdir

EXAMPLES:

List of PDB files with chain, domain, dom_name, folder (only the PDB file name is mandatory):
PC_ali -pdblist 1.10.287.110.SI60.pdblist -pdbdir
(all PDB files listed in 1.10.287.110.SI60.pdblist must be downloaded in the current folder or in PDBPATH)

Not aligned sequences in FASTA format, with the PDB file name specified in the sequence name:
PC_ali -seq 50044_Mammoth.aln -pdbdir

Aligned sequences in FASTA format, with the PDB file name specified in the sequence name:
PC_ali -ali 50044_Mammoth.aln -pdbdir
(all PDB files named in 50044_Mammoth.aln must be in the current folder or in PDBPATH)

OUTPUT:

MSA (PCAli.fas), NJ tree (PCAli.tree), structure similarity (.sim) and structure divergence scores (.div) for each protein pair, correlations between different types of sequence and structure identity (.id), MSA of secondary structure (_ss.msa)

