modhmm


Table of Contents

1. Introduction
2. The .hmg file format
2.1. The single alphabet format
2.1.1. Header
2.1.2. Modules
2.1.3. Vertices
2.1.4. Emission distribution groups
2.1.5. Transition tie groups
3. Software
3.1. Download
3.2. Installation
3.2.1. Installation with prebuilt package
3.2.1.1. Installation on Windows with .exe file
3.2.1.2. Installation on Ubuntu and Debian
3.2.1.3. Installation on CentOS and Fedora
3.2.1.4. Installation on Mac OS X
3.2.2. Building from source
3.2.2.1. Building from source on Unix
3.2.2.2. Building install packages
3.2.2.2.1. Building an .exe file for Windows
3.2.2.2.2. Building an rpm
3.2.2.2.3. Building a deb package
3.2.2.2.4. Building install package for MacOS X
3.3. Usage
3.3.1. modhmmc
3.3.1.1. Command line options
3.3.1.2. Modules
3.3.1.2.1. Singlenode module
3.3.1.2.2. Singleloop module
3.3.1.2.3. Forward std module
3.3.1.2.4. Forward alt module
3.3.1.2.5. Cluster module
3.3.1.2.6. Profile9
3.3.1.2.7. Profile7 module
3.3.1.2.8. U-turn module
3.3.1.2.9. Highway module
3.3.1.3. basic modhmmc
3.3.1.4. Examples
3.3.2. modhmmt
3.3.2.1. Command line options
3.3.2.2. Examples
3.3.3. modhmms
3.3.3.1. Command line options
3.3.3.2. Examples
3.3.4. modseqalign
3.3.4.1. Command line options
3.3.4.2. Examples
3.3.5. add_alphabet
3.3.5.1. Command line options
3.3.5.2. Examples
3.4. File formats
3.4.1. File formats for discrete alphabets
3.4.1.1. Modhmm fa format
3.4.1.2. Modhmm s discrete format
3.4.1.3. Modhmm msa discrete format
3.4.1.4. Labels
3.4.2. File formats for continuous alphabets
3.4.2.1. Modhmm s continuous format
3.4.2.2. Modhmm msa continuous format
3.4.3. Modhmm prf format
3.4.4. Modhmm replacement letter format
3.4.5. Modhmm subst mtx format
3.4.6. Modhmm frequency format
3.4.7. Modhmm prior format
3.4.8. Modhmm null model format
3.4.9. Modhmm name file format
3.4.10. Modhmm alphabet format
4. Functionality
4.1. Emission scoring methods
4.1.1. DP
4.1.2. GM
4.1.3. PICASSO
4.1.4. DPPI
4.1.5. SMP
4.1.6. SMDP
4.1.7. SMDPP
References

List of Examples

1. seq.modhmm-fa, an example file in the Modhmm fa format
2. seq.modhmm-s-discrete, an example file in the Modhmm s discrete format
3. seq-multiple-alphabet.modhmm-s-discrete, a multiple-alphabet example file in the Modhmm s discrete format
4. seq.modhmm-msa-discrete, an example file in Modhmm msa discrete format
5. seq-multiple-alphabet.modhmm-msa-discrete, an example file in the Modhmm msa discrete format
6. seq.modhmm-s-continuous, an example file in the Modhmm s continuous format
7. seq-multiple-alphabet.modhmm-s-continuous, an example file in the Modhmm s continuous format
8. seq.modhmm-msa-continuous, an example file in Modhmm msa continuous format
9. seq.modhmm-prf, an example file in Modhmm msa prf format
10. file.modhmm-replacement-letter, an example file in Modhmm replacement letter format
11. file.modhmm-subst-mtx, an example file in Modhmm subst mtx format
12. file.modhmm-frequency, an example file in Modhmm frequency format
13. file.modhmm-prior, an example file in Modhmm prior format
14. file.modhmm-null-model, an example file in Modhmm null model format
15. file.modhmm-name-file, an example file in Modhmm name file format
16. file.modhmm-alphabet, an example file in Modhmm alphabet format

1. Introduction

modhmm is a software tool for building, training and scoring hidden Markov models (HMMs). The software is open source and licensed under the GPL. A hidden Markov model is built using a modular approach: there is a set of predefined model subparts (modules) to choose from when putting together a complete HMM. The idea is to simplify the building of large HMMs while still allowing HMMs of arbitrary architecture to be built. The model building tool is called modhmmc.

modhmm consists of three main subparts, modhmmc, modhmmt and modhmms, for creating, training and scoring HMMs respectively. Training and scoring can be done using single sequences, multiple sequence alignments or sequence profiles as input.

The current state of modhmm is still somewhat preliminary.

The primary URL for this document is http://modhmm.sourceforge.net.

2. The .hmg file format

There are two different formats for storing modhmm models. The formats are identical in most respects; they differ only in whether multiple alphabets are used. The multiple alphabet format (up to four parallel alphabets are currently possible) includes some additional entries for this information which are not present in the single alphabet format. Every model created by modhmmc or modhmmt is saved in the .hmg text format. The format is fairly sensitive to minor changes such as extra blank lines or extra blanks within a line. Such changes may work, but nothing is guaranteed, so caution is advised when manually editing a .hmg file.

2.1. The single alphabet format

2.1.1. Header

The header of a .hmg file contains 14 lines (+ 2 compulsory blank lines).

***********************Header*****************************
NAME: tutorialHMM
TIME OF CREATION: May 20, 2003 11:10:29 AM
ALPHABET: A;C;D;E;F;G;H;I;K;L;M;N;P;Q;R;S;T;V;W;Y;
ALPHABET LENGTH: 20
NR OF MODULES: 93
NR OF VERTICES: 135
NR OF TRANSITIONS: 283
NR OF DISTRIBUTION GROUPS: 7
NR OF TRANSITION TIE GROUPS: 42
NR OF EMISSION PRIORFILES: 2
EMISSION PRIORFILES: ./amino_1.pri ./amino_2.pri
NR OF TRANSITION PRIORFILES: 0
TRANSITION PRIORFILES:



Description of headers

NAME

the name of the model

TIME OF CREATION

time for last modification of the model

ALPHABET

the alphabet of the emissions, each letter separated by ';'

ALPHABET LENGTH

the number of letters in the alphabet

NR OF MODULES

the number of hmm modules

NR OF VERTICES

the number of states

NR OF TRANSITIONS

the total number of transitions

NR OF DISTRIBUTION GROUPS

the number of distribution groups. A distribution group is a set of states whose emission probabilities have been tied together so that, during training updates, they are regarded as the same state.

NR OF TRANSITION TIE GROUPS

the number of tied transitions. A transition tie is the same thing as a distribution group, but for transitions.

NR OF EMISSION PRIORFILES

number of emission priorfiles

EMISSION PRIORFILES

names (and paths) of emission priorfiles. An emission priorfile is a file with prior information over the emissions, used to weight the observed emission frequencies against a belief held prior to the observation.

NR OF TRANSITION PRIORFILES

number of transition priorfiles

TRANSITION PRIORFILES

names (and paths) of transition priorfiles. Same thing as an emission priorfile, but for transitions.
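
As a rough illustration of the header layout described above, the following Python sketch reads the "KEY: value" lines into a dictionary. The function name parse_hmg_header and the embedded sample text are assumptions made for this illustration and are not part of modhmm; the sketch relies only on what the example header shows.

```python
SAMPLE_HEADER = """\
***********************Header*****************************
NAME: tutorialHMM
TIME OF CREATION: May 20, 2003 11:10:29 AM
ALPHABET: A;C;D;E;F;G;H;I;K;L;M;N;P;Q;R;S;T;V;W;Y;
ALPHABET LENGTH: 20
NR OF MODULES: 93
NR OF VERTICES: 135
NR OF TRANSITIONS: 283
NR OF DISTRIBUTION GROUPS: 7
NR OF TRANSITION TIE GROUPS: 42
NR OF EMISSION PRIORFILES: 2
EMISSION PRIORFILES: ./amino_1.pri ./amino_2.pri
NR OF TRANSITION PRIORFILES: 0
TRANSITION PRIORFILES:
"""


def parse_hmg_header(text):
    """Parse the 'KEY: value' header lines of a single-alphabet .hmg file (sketch)."""
    header = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '*****Header*****' banner line.
        if not line or line.startswith("*"):
            continue
        # Split at the first ':' only, so the time stamp stays intact.
        key, sep, value = line.partition(":")
        if sep:
            header[key.strip()] = value.strip()
    # Fields that hold counts or lengths become integers.
    for key in list(header):
        if key.startswith("NR OF") or key == "ALPHABET LENGTH":
            header[key] = int(header[key])
    # The alphabet is a ';'-separated list with a trailing ';'.
    header["ALPHABET"] = [s for s in header.get("ALPHABET", "").split(";") if s]
    # Prior file names are separated by whitespace in the example header.
    for key in ("EMISSION PRIORFILES", "TRANSITION PRIORFILES"):
        header[key] = header.get(key, "").split()
    return header


header = parse_hmg_header(SAMPLE_HEADER)
print(header["NAME"], header["ALPHABET LENGTH"], header["EMISSION PRIORFILES"])
```

A reader of this kind can be used to cross-check a hand-edited file, for example that ALPHABET LENGTH matches the number of letters listed under ALPHABET.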

2.1.2. Modules

Each module section contains 4 rows in the beginning (+ a compulsory blank row) and a set of vertex sections, each separated by a blank row. Each module section is ended by a blank row and a row of '-'.

Module: module1
Type: Singlenode
NrVertices: 1
Emission prior file: ./amino.pri
Transition prior file: null

Vertex 70:
Vertex type: standard
Vertex label: M
Transition prior scaler: 1.0
.
.
.

-------------------------------------------------------

Description of modules variables

Module

the name of the module

Type

the type of the module ( see Section 3.3.1.2, “Modules” )

NrVertices

number of states in this module

Emission prior file

possible emission prior file associated with this module, 'null' means that no file is associated

Transition prior file

possible transition prior file associated with this module, 'null' means that no file is associated

2.1.3. Vertices

Each vertex section consists of 9 initial rows, followed by sections for transition probabilities, end transition probabilities and emission probabilities respectively.

Vertex 1:
Vertex type: standard
Vertex label: d
Transition prior scaler: 1.0
Emission prior scaler: 1.0
Nr transitions = 1
Nr end transitions = 0
Nr emissions = 20
Transition probabilities
        Vertex 2: 1.0
End transition probabilities
Emission probabilities
        A: 0.05
        C: 0.05
        .
        .
        .
        T: 0.05
        V: 0.05
        W: 0.05
        Y: 0.05
Vertex

the number of the state

Vertex type

the type of the state. Three types exist: standard, silent and locked. A standard state is a regular emitting state. A silent state is a state that does not emit any symbols. A locked state is an emitting state for which the emission probabilities are fixed, i.e. they will not be updated during training.

Transition prior scaler

the transition prior scaler is a factor which describes how much weight to put on the possible prior distribution associated with this vertex

Emission prior scaler

the emission prior scaler is a factor which describes how much weight to put on the possible Dirichlet prior mixture associated with this vertex.

Nr transitions

is the number of transitions from this state (except to end states)

Nr end transitions

is the number of transitions from this state to end states

Nr emissions

is the number of different possible emissions from this state, usually equal to the alphabet size.

The next three are all listings of transition, end transition and emission probabilities respectively.
Transition probabilities

has one row for each transition, giving the target state of the transition and its probability

Transition probabilities
        Vertex 2: 0.4
        Vertex 3: 0.6
End transition probabilities

same as above, but for end transitions. There is usually no more than one transition of this type, but nothing in the program prohibits more.

End transition probabilities
        Vertex 4: 1.0
Emission probabilities

the probabilities for emitting the different letters in the alphabet, one row for each letter in the alphabet. In the case of continuous emissions, the alphabet is interpreted by the program as follows. The letters are divided into groups of 3. Each group describes one component of a mixture of one-dimensional normal distributions. Within each group, the first letter represents the mean value, the second the variance and the third the coefficient of that mixture component.

Discrete:
Emission probabilities
        A: 0.05
        C: 0.05
        D: 0.05
        E: 0.05
        .
        .
        .
        Y: 0.05
Continuous (2 mixture components):
Emission probabilities
        m1: 22.5
        var1: 0.34
        co1: 0.45
        m2: -22.3
        var2: 3.11
        co2: 0.55
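
To make the continuous case concrete, the following Python sketch evaluates the density of such a mixture from the values in file order (mean, variance, coefficient for each component). The function name mixture_density is an assumption made for this illustration; it shows the standard formula for a mixture of one-dimensional normal distributions and is not taken from the modhmm source.

```python
import math


def mixture_density(x, params):
    """Evaluate a 1-D normal mixture at x.

    params is the flat list of values in file order:
    [mean1, var1, coeff1, mean2, var2, coeff2, ...].
    """
    if len(params) % 3 != 0:
        raise ValueError("expected groups of (mean, variance, coefficient)")
    density = 0.0
    for i in range(0, len(params), 3):
        mean, var, coeff = params[i:i + 3]
        # Normal density with the given mean and variance, weighted by coeff.
        norm = 1.0 / math.sqrt(2.0 * math.pi * var)
        density += coeff * norm * math.exp(-((x - mean) ** 2) / (2.0 * var))
    return density


# Values from the two-component example above.
params = [22.5, 0.34, 0.45, -22.3, 3.11, 0.55]
print(mixture_density(22.5, params))
```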

2.1.4. Emission distribution groups

The emission distribution groups section has one line per distribution group, listing the numbers of the vertices that belong to that group.

Group 1: 1 2
Group 2: 3 4 
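
The following Python sketch reads group lines of this form and shows one way tied states could share a single emission distribution, by pooling observed counts over all members of a group. The function names and the pooling step are assumptions made for this illustration; the modhmm training code may handle tied states differently.

```python
def parse_groups(lines):
    """Map group number -> list of vertex numbers from 'Group N: v1 v2 ...' lines."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("Group"):
            continue
        name, _, members = line.partition(":")
        groups[int(name.split()[1])] = [int(v) for v in members.split()]
    return groups


def pooled_emissions(groups, counts):
    """Give every vertex in a group the same distribution, estimated from pooled counts.

    counts maps vertex number -> {letter: observed count}.
    """
    result = {}
    for members in groups.values():
        total = {}
        for vertex in members:
            for letter, count in counts.get(vertex, {}).items():
                total[letter] = total.get(letter, 0.0) + count
        norm = sum(total.values())
        dist = {letter: c / norm for letter, c in total.items()} if norm else {}
        for vertex in members:
            result[vertex] = dist
    return result


groups = parse_groups(["Group 1: 1 2", "Group 2: 3 4 "])
pooled = pooled_emissions({1: [1, 2]}, {1: {"A": 3, "C": 1}, 2: {"A": 1, "C": 3}})
print(groups, pooled)
```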

2.1.5. Transition tie groups

Each transition tie group entry (one line) describes two or more transitions that are tied together. A transition is written as from-state, arrow, to-state.

Tie 1: 1->2 3->4
Tie 2: 5->6 7->10000 8->10001
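
As a small illustration of this notation, the Python sketch below turns tie lines into lists of (from-state, to-state) pairs. The function name parse_ties is an assumption made for this illustration.

```python
def parse_ties(lines):
    """Map tie number -> list of (from_state, to_state) pairs."""
    ties = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("Tie"):
            continue
        name, _, body = line.partition(":")
        # Each whitespace-separated item has the form 'from->to'.
        ties[int(name.split()[1])] = [
            tuple(int(v) for v in item.split("->")) for item in body.split()
        ]
    return ties


ties = parse_ties(["Tie 1: 1->2 3->4", "Tie 2: 5->6 7->10000 8->10001"])
print(ties)
```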

3. Software

3.1. Download

Download the software from the sourceforge project page. The latest version of modhmm is 1.1.0.

3.2. Installation

3.2.1. Installation with prebuilt package

3.2.1.1. Installation on Windows with .exe file

To install modhmm on Windows, first download the modhmm-1.1.0-win32.exe and then execute the file ( click on it ).

3.2.1.2. Installation on Ubuntu and Debian

To install modhmm on Ubuntu or Debian, first download modhmm-1.1.0.deb, then log in as root and run

# dpkg -i modhmm-1.1.0.deb 

3.2.1.3. Installation on CentOS and Fedora

To install modhmm on CentOS or Fedora, first download modhmm-1.1.0.Linux.rpm, then log in as root and run

# yum localinstall modhmm-1.1.0.Linux.rpm 

3.2.1.4. Installation on Mac OS X

To install modhmm on Mac OS X v10.5 (Leopard) on a Mac with an Intel CPU, first download modhmm-1.1.0-MacOSX10.5.tar.gz and then unpack it

$ tar xfz modhmm-1.1.0-MacOSX10.5.tar.gz  

To install modhmm on Mac OS X v10.4 (Tiger) on a Mac with an Intel CPU, first download modhmm-1.1.0-MacOSX10.4.tar.gz and then unpack it

$ tar xfz modhmm-1.1.0-MacOSX10.4.tar.gz

3.2.2. Building from source

3.2.2.1. Building from source on Unix

To build modhmm on Unix ( e.g. Linux, MacOSX, CygWin ) you need to have this installed

If you have the modhmm source code in the directory /tmp/source and want to install modhmm into the directory /tmp/install, first run cmake, then make, and then make install

$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DCMAKE_INSTALL_PREFIX=/tmp/install /tmp/source && make && make install
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/build
Scanning dependencies of target fastdist
[  3%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist.o
[  6%] Building CXX object src/c++/CMakeFiles/fastdist.dir/BitVector.o
[  9%] Building CXX object src/c++/CMakeFiles/fastdist.dir/Exception.o
[ 12%] Building CXX object src/c++/CMakeFiles/fastdist.dir/InitAndPrintOn_utils.o
[ 15%] Building CXX object src/c++/CMakeFiles/fastdist.dir/Object.o
[ 18%] Building CXX object src/c++/CMakeFiles/fastdist.dir/Sequence.o
[ 21%] Building CXX object src/c++/CMakeFiles/fastdist.dir/SequenceTree.o
[ 25%] Building CXX object src/c++/CMakeFiles/fastdist.dir/SequenceTree_MostParsimonious.o
[ 28%] Building CXX object src/c++/CMakeFiles/fastdist.dir/Simulator.o
[ 31%] Building CXX object src/c++/CMakeFiles/fastdist.dir/arg_utils_ext.o
[ 34%] Building CXX object src/c++/CMakeFiles/fastdist.dir/file_utils.o
[ 37%] Building CXX object src/c++/CMakeFiles/fastdist.dir/stl_utils.o
[ 40%] Building CXX object src/c++/CMakeFiles/fastdist.dir/DNA_b128/DNA_b128_String.o
[ 43%] Building CXX object src/c++/CMakeFiles/fastdist.dir/DNA_b128/Sequences2DistanceMatrix.o
[ 46%] Building CXX object src/c++/CMakeFiles/fastdist.dir/aml/AML_LeafLifting.o
[ 50%] Building CXX object src/c++/CMakeFiles/fastdist.dir/aml/AML_given_edge_probabilities.o
[ 53%] Building CXX object src/c++/CMakeFiles/fastdist.dir/aml/AML_local_improve.o
[ 56%] Building CXX object src/c++/CMakeFiles/fastdist.dir/aml/AML_star.o
[ 59%] Building CXX object src/c++/CMakeFiles/fastdist.dir/aml/Big_AML.o
[ 62%] Building CXX object src/c++/CMakeFiles/fastdist.dir/distance_methods/LeastSquaresFit.o
[ 65%] Building CXX object src/c++/CMakeFiles/fastdist.dir/distance_methods/NeighborJoining.o
[ 68%] Building CXX object src/c++/CMakeFiles/fastdist.dir/sequence_likelihood/Kimura2parameter.o
[ 71%] Building CXX object src/c++/CMakeFiles/fastdist.dir/sequence_likelihood/TamuraNei.o
[ 75%] Building CXX object src/c++/CMakeFiles/fastdist.dir/sequence_likelihood/ambiguity_nucleotide.o
[ 78%] Building CXX object src/c++/CMakeFiles/fastdist.dir/sequence_likelihood/dna_pairwise_sequence_likelihood.o
[ 81%] Building CXX object src/c++/CMakeFiles/fastdist.dir/sequence_likelihood/string_compare.o
[ 84%] Building CXX object src/c++/CMakeFiles/fastdist.dir/DistanceMatrix.o
[ 87%] Building C object src/c++/CMakeFiles/fastdist.dir/arg_utils.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 90%] Building C object src/c++/CMakeFiles/fastdist.dir/std_c_utils.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 93%] Building C object src/c++/CMakeFiles/fastdist.dir/DNA_b128/sse2_wrapper.o
[ 96%] Building CXX object src/c++/CMakeFiles/fastdist.dir/DNA_b128/computeTAMURANEIDistance_DNA_b128_String.o
[100%] Building CXX object src/c++/CMakeFiles/fastdist.dir/DNA_b128/computeDistance_DNA_b128_String.o
Linking CXX executable fastdist
[100%] Built target fastdist
[100%] Built target fastdist
Linking CXX executable CMakeFiles/CMakeRelink.dir/fastdist
Install the project...
-- Install configuration: ""
-- Install configuration: ""
-- Installing /tmp/install/bin/fastdist
-- Install configuration: ""

If you want to build the html documentation ( i.e. this page ) you need to pass the -DBUILD_DOCBOOK=ON option to cmake.

3.2.2.2. Building install packages

This section is mainly intended for package maintainers.

3.2.2.2.1. Building an .exe file for Windows

To build the modhmm nullsoft installer package ( modhmm-1.1.0-win32.exe ) you need to have this installed

on your Windows machine.

Open an MSYS bash shell and run

$ mkdir tmpbuild
$ cd tmpbuild
$ cmake path/to/the/modhmm/source/code  -DSTATIC=ON -G "MSYS Makefiles" && make win32installer

The source code for gengetopt, libz and libxml will be automatically downloaded and built statically.

3.2.2.2.2. Building an rpm

On a CentOS or Fedora machine, first log in as root and install the dependencies

# yum install xmlto libxml2-devel cmake gcc-c++ binutils gengetopt

Check that cmake is version 2.6 or later

$ cmake --version
cmake version 2.6-patch 0

If it is older you could download a cmake binary directly from www.cmake.org

$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DCMAKE_INSTALL_PREFIX=/ -DBUILD_DOCBOOK=ON /tmp/source && make package

3.2.2.2.3. Building a deb package

On a Debian or Ubuntu machine, first log in as root and install the dependencies

# apt-get install libxml2-dev cmake g++ binutils gengetopt

Check that cmake is version 2.6 or later

$ cmake --version
cmake version 2.6-patch 0

If it is older you could download a cmake binary directly from www.cmake.org. Now build the deb package.

$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DCMAKE_INSTALL_PREFIX=/ -DBUILD_DOCBOOK=ON /tmp/source && make package

3.2.2.2.4. Building install package for MacOS X

To build the modhmm install package for MacOS X you need to have this installed

on your MacOS X computer.

Check that cmake is version 2.6 or later

$ cmake --version
cmake version 2.6-patch 0

$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DSTATIC=ON -DCPACK_GENERATOR="TGZ" /tmp/source && make package

3.3. Usage

3.3.1. modhmmc

modhmmc is the program for designing an HMM. The main modhmmc program has a command line based interactive user interface which lets the user specify alphabet, states, transition and emission probabilities, etc. If an HMM with one alphabet is designed, it is automatically saved in the single alphabet format. If an HMM with multiple alphabets is designed, it is automatically saved in the multiple alphabet format.

3.3.1.1. Command line options

Type modhmmc --help to see the command line options

[user@saturn ~]$ modhmmc --help

3.3.1.2. Modules

The states of an HMM are specified as a collection of state modules with transitions between them. A module is a set of states which are interconnected in a predefined fashion. The idea behind modules is to make the creation of large HMMs easier. HMMs with several hundred states are not uncommon in sequence analysis, and specifying each and every state and transition in such a case is impractical. modhmmc currently has 7 module types to choose from: singlenode, singleloop, forward std, forward alt, cluster, profile7 and profile9. modhmmc allows for the creation of both regular and silent states.

Transition probabilities are set by default inside the modules to correspond to the intrinsic properties of that module (see descriptions of the modules). Transition probabilities between modules are by default set uniformly, so that the probabilities of going from a module to another are equally distributed among all its neighbours. Emission probabilities for each state are either set manually by the user or according to a chosen distribution. The choices are to set the values uniformly, randomly or to zero for all letters, which creates a silent state. There are also some special distributions that are particular to certain states.

3.3.1.2.1. Singlenode module

The singlenode module is the most basic module of modhmmc. It consists of only one state, emitting by default, but the user may specify it as silent. All other modules may be built using collections of singlenodes. The singlenode has no input parameters.

3.3.1.2.2. Singleloop module

The singleloop module consists of one state with a transition to itself. It is necessarily emitting, since no loops of silent states are allowed. The singleloop has one input parameter: the user specifies the expected length of the loop, L, which sets the loop transition probability to p_trans = e^(ln(0.5)/L). This makes the probability of a loop length longer than or equal to L equal to the probability of a loop length shorter than L; that is, L is the expected median loop length.
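
Reading the loop formula as p_trans = exp(ln(0.5)/L), the Python sketch below computes the self-transition probability and checks that taking the self-transition L times in a row has probability 0.5, which is what makes L the median loop length. The function name loop_probability is an assumption made for this illustration.

```python
import math


def loop_probability(loop_length):
    """Self-transition probability for a given median loop length L."""
    if loop_length <= 0:
        raise ValueError("loop length must be positive")
    return math.exp(math.log(0.5) / loop_length)


p = loop_probability(10)
# Probability of taking the self-transition 10 times in a row.
print(p, p ** 10)
```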

3.3.1.2.3. Forward std module

The forward std module is a set of states connected in a straight line, indexed 1 to n. All states are emitting by default. The two input parameters (m,n) specify the shortest and longest possible routes through the module. The number of states in the module is equal to the length of the longest possible route, n. From the state with index m-1 there is a transition to every state in the module with a higher index number, up to and including the state with index n. The transition probabilities are by default set so that the total probability is equal for all possible paths through the module, but it is also possible to set the length probabilities according to a binomial distribution, with the base either in the shortest or the longest path. Any state that connects to this module will connect to the state with index 1, and all outgoing connections from this module go from the state with index n.
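
The Python sketch below builds the default transition table for such a module under one reading of the description: state m-1 branches with equal probability to each of the states m..n, and every other state moves to its successor with probability 1. The function name and the restriction 2 <= m <= n are assumptions made for this illustration; how modhmmc treats m = 1 is not covered here.

```python
def forward_std_transitions(m, n):
    """Return {(from_state, to_state): probability} for a forward std module."""
    if not 2 <= m <= n:
        raise ValueError("this sketch assumes 2 <= m <= n")
    trans = {}
    for i in range(1, n):
        if i == m - 1:
            # Branch point: one transition to each state m..n, all equally likely,
            # so every path through the module has the same total probability.
            targets = range(m, n + 1)
            for j in targets:
                trans[(i, j)] = 1.0 / len(targets)
        else:
            trans[(i, i + 1)] = 1.0
    return trans


t = forward_std_transitions(3, 6)
for (src, dst), prob in sorted(t.items()):
    print(src, "->", dst, prob)
```

With (m,n) = (3,6), a jump from state 2 to state j gives a path of m+n-j states, so the four branches correspond to path lengths 6, 5, 4 and 3.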

3.3.1.2.4. Forward alt module

The forward alt module is a set of states connected in a straight line, indexed 1 to n. All states are emitting by default. The two input parameters (m,n) specify the shortest and longest possible routes through the module. The number of states in the module is equal to the length of the longest possible route, n. From every state with index m-1 to n-2 there is a transition to the last state. The transition probabilities are set so that the total probability is equal for all possible paths through the module, but it is also possible to set the length probabilities according to a binomial distribution, with the base either in the shortest or the longest path. Any state that connects to this module will connect to the state with index 1, and all outgoing connections from this module go from the state with index n.
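
For the forward alt layout, equal path probabilities can be obtained by letting state k (for m-1 <= k <= n-2) jump to the last state with probability 1/(n-k), since n-k paths remain open at that state. The Python sketch below builds the table on that basis and recomputes the path length distribution as a check. The function names and the restriction 2 <= m <= n are assumptions made for this illustration, not the modhmmc implementation.

```python
def forward_alt_transitions(m, n):
    """Return {(from_state, to_state): probability} for a forward alt module."""
    if not 2 <= m <= n:
        raise ValueError("this sketch assumes 2 <= m <= n")
    trans = {}
    for k in range(1, n):
        if m - 1 <= k <= n - 2:
            # n - k paths are still possible here: jump now, or at a later state,
            # or walk the full line. Giving the jump 1/(n - k) equalizes them.
            jump = 1.0 / (n - k)
            trans[(k, n)] = jump
            trans[(k, k + 1)] = 1.0 - jump
        else:
            trans[(k, k + 1)] = 1.0
    return trans


def path_length_probabilities(trans, n):
    """Return {path length in states: probability} for the table above."""
    probs = {}
    reach = 1.0  # probability of reaching state k along the straight line
    for k in range(1, n - 1):
        jump = trans.get((k, n), 0.0)
        if jump > 0.0:
            probs[k + 1] = reach * jump  # states 1..k followed by state n
        reach *= trans[(k, k + 1)]
    probs[n] = reach  # full-length path; state n-1 moves to n with probability 1
    return probs


lengths = path_length_probabilities(forward_alt_transitions(3, 6), 6)
print(lengths)
```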

3.3.1.2.5. Cluster module

The cluster module is a fully interconnected set of states. Every state has a transition to every other state. The transition probabilities are evenly distributed by default. All states are emitting. The input parameter is the number of states. Incoming transitions connect to all states, and outgoing connections go from all states.

3.3.1.2.6. Profile9

The profile9 module is equal to the standard profile-HMM architecture. The input parameter specifies the length of the module, i.e. the number of match states. Incoming transitions connect to an initial silent state. From this state there are transitions to a pre-model insert state and to the first match and delete states of the model. For the purpose of local alignment with respect to the model there are also transitions to the following match states. These may be set to zero when global alignment is preferred. From the last match and delete states there are transitions to a silent state at the end. There are also transitions from the previous match states to this state for the purpose of local alignments. These are set to zero when global alignment is used. This silent end state has a transition to an insert state for the latter unaligned part of a sequence and to the next module. Transition parameters are set automatically to default values at this stage. For a profile HMM, they may be updated using the opt_prfhmm program.

3.3.1.2.7. Profile7 module

The profile7 module is identical to the profile9 module, except that it has no delete ⟹ insert or insert ⟹ delete transitions.

3.3.1.2.8. U-turn module

The U-turn module is a set of states that models a symbol sequence of arbitrary length. The "bottom of the U" is represented by a single node with a transition to itself, while the states in the leg leading to the bottom of the U have transitions both to the next state towards the bottom and to two states on the other leg of the U. The states in the leg leading out of the U have a transition to the next state on the path out from the U. At initialization all transition probabilities are set to 0.5. The length parameter defines the number of states in each leg.

3.3.1.2.9. Highway module

A highway module models a symbol sequence of arbitrary but fixed length. The length parameter defines this length.

3.3.1.3. basic modhmmc

Name of HMM?

Specify file name of HMM. modhmmc will add '.hmg'

Nr of alphabets (1-4)?

Specify number of alphabets

Alphabet 1:

The alphabet of an HMM is specified as a set of letters, where each letter is a word of up to 4 characters.

Alphabet 2:

If multiple alphabets exist, these must also be specified

Alphabet 3:

...

Alphabet 4:

...

Start node:

Specify name of start state

Module x:

Specify module type, name and label

End node:

Specify name of end state

Specify interconnectivity

Connection from a to:

Specify transitions from state a

Specify emission distribution groups:

Tie the emission probabilities of (all states of) the specified modules to each other. This means that when updating the value of the emission probabilities during training, all states in a distribution group will be treated as one state.

Specify transition distribution groups:

Tie the transition probabilities of the specified modules to each other. Modules must be identical (same type and size) for this to work. Tying two modules means that corresponding transitions in the two modules are updated as one transition when transition probabilities are updated during training.

Specify initial emission probabilities ...

Initialize the emission probabilities and tie the states to specific prior distribution files (see ??? ). Also specify the weight of the prior against the observations, default is 1.0. This is done for each alphabet.

Specify initial transition probabilities ...

Same as above, but for transitions. (prior files and weights not implemented in training and scoring algorithms yet).

3.3.1.4. Examples

No examples yet

3.3.2. modhmmt

The modhmmt program is used for parameter optimization. It implements the regular Baum-Welch training algorithm as well as conditional maximum likelihood (CML) training, for single sequences, multiple sequence alignments and sequence profiles.

3.3.2.1. Command line options

Type modhmmt --help to see the command line options

[user@saturn ~]$ modhmmt --help
modhmmt 1.1.0

train a hidden markov model

Usage: modhmmt [OPTIONS]...

  -h, --help                    Print help and exit
  -V, --version                 Print version and exit
  -i, --hmminfile=filename      modelfile (in .hmg format)
  -s, --seqnamefile=filename    sequence namefile (for sequences in fasta, 
                                  smod, msamod or prfmod format)
  -f, --seqformat=STRING        format of input sequences (fa=fasta, s=smod, 
                                  msa=msamod, prf=prfmod)
  -o, --outfile=filename        model outfile
  -q, --freqfile=filename       background frequency file
  -x, --smxfile=filename        substitution matrix file
  -r, --replfile=filename       replacement letter file
  -a, --alg=STRING              training algorithm (cml=conditional maximum 
                                  likelihood, bw=baum-welch (default), 
                                  disc=discriminative training)
  -n, --negseqnamefile=filename sequence namefile for negative training 
                                  sequences (for sequences in fasta, smod, 
                                  msamod
                                  or prfmod format)
  -z, --optalpha=STRING         alphabet to optimize (parameters for 
                                  transitions and all other alphabets will not 
                                  be changed
  -M, --msascoring=STRING       scoring method for alignment and profile data 
                                  options = DP/DPPI/GM/GMR/DPPI/PI/PIS 
                                  default=GM
  -c, --usecolumns=STRING       specify which columns to use for alignment 
                                  input data, options = all/nr, where all means 
                                  use all columns
                                  and nr specifies a sequence in the alignment 
                                  and the columns where this sequence have 
                                  non-gap symbls are used
                                  default = all
      --nolabels                do not use labels even though the input 
                                  sequences are labeled  (default=off)
      --noprior                 do not use priors when training even though the 
                                  the model file has prior files specified  
                                  (default=off)
      --tpcounts                use pseudocounts for transition parameter 
                                  updates  (default=off)
      --epcounts                use pseudocounts for emission parameter updates 
                                   (default=off)
      --transonly               only update transition parameters  
                                  (default=off)
      --emissonly               only update emission parameters  (default=off)
  -v, --verbose                 print some information about what is going on  
                                  (default=off)
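
As a sketch of how these options combine, the Python snippet below assembles a modhmmt command line for Baum-Welch training on FASTA input, using only options listed in the help text above. The helper name modhmmt_command and the file names model.hmg, train.names and trained.hmg are hypothetical; the snippet only builds and prints the command, it does not run modhmmt.

```python
import shlex

SEQ_FORMATS = {"fa", "s", "msa", "prf"}   # values accepted by -f per the help text
ALGORITHMS = {"cml", "bw", "disc"}        # values accepted by -a per the help text


def modhmmt_command(model, seqnames, outfile, seqformat="fa", alg="bw", verbose=False):
    """Build the argument list for a modhmmt training run."""
    if seqformat not in SEQ_FORMATS:
        raise ValueError("unknown sequence format: " + seqformat)
    if alg not in ALGORITHMS:
        raise ValueError("unknown training algorithm: " + alg)
    args = ["modhmmt", "-i", model, "-s", seqnames,
            "-f", seqformat, "-o", outfile, "-a", alg]
    if verbose:
        args.append("-v")
    return args


cmd = modhmmt_command("model.hmg", "train.names", "trained.hmg")
# shlex.join quotes arguments where needed, giving a line that can be pasted into a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once modhmmt is installed and the input files exist.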

3.3.2.2. Examples

no examples yet

3.3.3. modhmms

The modhmms program is used for scoring sequences, multiple sequence alignments and sequence profiles against an HMM. The algorithms implemented include forward, Viterbi and 1-best. Output can be either a log-likelihood/log-odds/reverse score, the (approximately) most probable labeling of a sequence or the most probable state path.

3.3.3.1. Command line options

Type modhmms --help to see the command line options

[user@saturn ~]$ modhmms --help
modhmms 1.1.0

score sequences on hidden markov models

Usage: modhmms [OPTIONS]...

  -h, --help                  Print help and exit
  -V, --version               Print version and exit
  -m, --hmmnamefile=filename  model namefile for models in hmg format
  -s, --seqnamefile=filename  sequence namefile (for seuences in fasta, smod, 
                                msamod or prfmod format)
  -f, --seqformat=STRING      format of input sequences (fa=fasta, s=smod, 
                                msa=msamod, prf=prfmod)
  -o, --outpath=dir           output directory
  -q, --freqfile=filename     background frequency file
  -x, --smxfile=filename      substitution matrix file
  -r, --replfile=filename     replacement letter file
  -p, --priorfile=filename    sequence prior file (for msa input files)
  -n, --nullfile=filename     null model file
  -a, --anchor=STRING         hmm=results are hmm-ancored (default), 
                                seq=results are sequence anchored
  -L, --labeloutput           output will print predicted labeling and 
                                posterior label probabilities  (default=off)
  -A, --alignmentoutput       output will print log likelihood, log odds and 
                                reversi scores  (default=off)
  -M, --msascoring=STRING     scoring method for alignment and profile data 
                                options = DP/DPPI/GM/GMR/DPPI/PI/PIS default=GM
  -c, --usecolumns=STRING     specify which columns to use for alignment input 
                                data, options = all/nr, where all means use all 
                                columns
                                and nr specifies a sequence in the alignment 
                                and the columns where this sequence have 
                                non-gap symbls are used
                                default = all
      --nolabels              do not use labels even though the input sequences 
                                are labeled  (default=off)
  -v, --verbose               print some information about what is going on  
                                (default=off)

 Group: score_algs
      --viterbi               Use viterbi algorithm for alignment and/or label 
                                scoring (default no)
      --nbest                 Use n-best (=1-best) algorithm for label scoring 
                                (default yes)
      --forward               Use forward algorithm for alignment scoring 
                                (default yes)
      --max_d                 Retrain model on each sequence using Baum-Welch 
                                before scoring  (default=off)
      --savehmm               Save retrained HMM to file  (default=off)

options for specific output control:
      --path                  Print most likely statepath  (default=off)
      --nopostout             no posterior probability information for label 
                                scoring  (default=off)
      --nolabelout            no predicted labeling for label scoring  
                                (default=off)
      --nollout               no log likelihood score for alignment scoring  
                                (default=off)
      --nooddsout             no log odds score for alignment scoring  
                                (default=off)
      --norevout              no reversi score for alignment scoring  
                                (default=off)
      --alignpostout          print posterior probability information for 
                                alignment scoring  (default=off)
      --alignlabelout         print predicted labeling for alignment scoring  
                                (default=off)
      --labelllout            print log likelihood score for label scoring  
                                (default=off)
      --labeloddsout          print log odds score for label scoring  
                                (default=off)
      --labelrevout           print reversi score for label scoring  
                                (default=off)

3.3.3.2. Examples

no examples yet

3.3.4. modseqalign

The modseqalign program aligns two sequences/multiple alignments/profiles using their most probable state paths through a given hmm.

3.3.4.1. Command line options

Type modseqalign --help to see the command line options

[user@saturn ~]$ modseqalign --help

modseqalign 1.1.0

align 2 sequences to each other using their most likely state path through a 
given HMM

Usage: modseqalign [OPTIONS]...

  -h, --help                Print help and exit
  -V, --version             Print version and exit
  -m, --hmmfile=filename    model namefile for models in hmg format
  -s, --target=filename     sequence namefile (for seuences in fasta, smod, 
                              msamod or prfmod format)
  -t, --template=filename   sequence namefile (for seuences in fasta, smod, 
                              msamod or prfmod format)
  -f, --seqformat=STRING    format of input sequences (fa=fasta, s=smod, 
                              msa=msamod, prf=prfmod)
  -o, --outfile=filename    model outfile
  -q, --freqfile=filename   background frequency file
  -x, --smxfile=filename    substitution matrix file
  -r, --replfile=filename   replacement letter file
  -p, --priorfile=filename  sequence prior file (for msa input files)
  -M, --msascoring=STRING   scoring method for alignment and profile data 
                              options = DP/DPPI/GM/GMR/DPPI/PI/PIS default=GM
  -c, --usecolumns=STRING   specify which columns to use for alignment input 
                              data, options = all/nr, where all means use all 
                              columns
                              and nr specifies a sequence in the alignment and 
                              the columns where this sequence have non-gap 
                              symbls are used
                              default = all
      --nolabels            do not use labels even though the input sequences 
                              are labeled  (default=off)
  -v, --verbose             print some information about what is going on  
                              (default=off)

3.3.4.2. Examples

no examples yet

3.3.5. add_alphabet

The add_alphabet program adds an alphabet with predefined emission probabilities to an existing model file in .hmg format. The alphabet and its emission probabilities are given in an alphabet file ( see Section 3.4.10, “Modhmm alphabet format” ).

3.3.5.1. Command line options

Type add_alphabet --help to see the command line options

[user@saturn ~]$ add_alphabet --help

add_alphabet 1.1.0

add alphabet with predefined emission probabilities to a given model file

Usage: add_alphabet [OPTIONS]...

  -h, --help                Print help and exit
  -V, --version             Print version and exit
  -i, --hmminfile=filename  modelfile (in .hmg format)
  -o, --outfile=filename    model outfile
  -a, --alphafile=filename  alphabet file
  -v, --verbose             print some information about what is going on  
                              (default=off)

3.3.5.2. Examples

no examples yet

3.4. File formats

There are a few different types of input files associated with the modhmm package, mainly files for sequences and various utility files. There are 4 different sequence file formats (fasta, single, multi and profile). Labels are compulsory in the profile format and optional in the other formats. Each sequence file may contain only one sequence, but a sequence may consist of up to 4 different parallel alphabets.

3.4.1. File formats for discrete alphabets

3.4.1.1. Modhmm fa format

The standard fasta sequence file format. Fasta sequences can only consist of one alphabet.

Example 1. seq.modhmm-fa, an example file in the Modhmm fa format

The example file seq.modhmm-fa contains

>seqname1
ACDEFGHIKLMNPQRSTVWY


Labels can also be used in this format ( see Section 3.4.1.4, “Labels” ).

3.4.1.2. Modhmm s discrete format

The Modhmm s discrete format is a single sequence file format specific to modhmm, designed to allow alphabets whose letters consist of more than one character. '<' is the signal for starting a sequence, '>' is the signal for ending it and ';' is the end-of-letter marker. Note that this marker must also be present after the last letter in a sequence, i.e. right before the '>' sign. A sequence may occupy more than one line and can be of arbitrary length. Anything written outside of a <> marking will be ignored.

Example 2. seq.modhmm-s-discrete, an example file in the Modhmm s discrete format

The example file seq.modhmm-s-discrete contains

<A;B;C;D;E;F;G;H;>


The multiple alphabet version of this format has the same basic structure as the single alphabet version. Each sequence consists of 2-4 subsequences which begin and end with '<' and '>' respectively. The number of '<' characters in the beginning is used to indicate which alphabet a subsequence belongs to. The subsequences must be in numerical order, and they must be of equal lengths.

Example 3. seq-multiple-alphabet.modhmm-s-discrete, a multiple-alphabet example file in the Modhmm s discrete format

The example file seq-multiple-alphabet.modhmm-s-discrete contains

<A;B;C;D;E;F;G;H;>
<<a;b;c;d;e;f;g;h;>


Labels can also be used in this format ( see Section 3.4.1.4, “Labels” ).
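
The rules above can be illustrated with a small Python sketch. It is not part of the modhmm package; the function name parse_s_discrete is hypothetical, and the sketch assumes that letters never contain '<', '>' or ';'. The number of leading '<' characters is used as the alphabet number.

```python
import re

def parse_s_discrete(text):
    """Return {alphabet_number: [letters]} for text in the s discrete format.

    Assumption (not from the modhmm sources): letters never contain
    '<', '>' or ';'.
    """
    alphabets = {}
    # One or more '<' open a subsequence and '>' closes it.
    # Anything outside such a marking is ignored.
    for opener, body in re.findall(r"(<+)([^<>]*)>", text):
        body = body.strip()
        if not body.endswith(";"):
            raise ValueError("missing ';' after the last letter")
        # A sequence may span several lines, so surrounding whitespace is dropped.
        letters = [tok.strip() for tok in body[:-1].split(";")]
        alphabets.setdefault(len(opener), []).extend(letters)
    if len({len(v) for v in alphabets.values()}) > 1:
        raise ValueError("subsequences must be of equal length")
    return alphabets
```

Calling parse_s_discrete on the contents of Example 3 would give one list of letters for alphabet 1 and one for alphabet 2.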

3.4.1.3. Modhmm msa discrete format

The Modhmm msa discrete format is the same format as the Modhmm s discrete format ( Section 3.4.1.2, “Modhmm s discrete format” ) but for multiple sequence alignments. This means that one msa sequence file contains a set of sequences aligned to each other. For this purpose, each sequence must be situated on one line only. '-', '_', ' ' and '.' can all be used to signify a gap. All sequences in an msa sequence file are regarded as belonging to the same multiple sequence alignment. Each line must be as long as or shorter than the line containing the template sequence when such a sequence is used. The starting position of an alignment is calculated from the first letter after '<', which means that it is not possible to skip letters in the beginning of a sequence. These must be represented as gaps. However, it is possible to do this at the end of a sequence.

Example 4. seq.modhmm-msa-discrete, an example file in Modhmm msa discrete format

The example file seq.modhmm-msa-discrete contains

<A;B;C;D;E;F;G;H;>
<A;-;C;D;E;P;G;H;>
<A;B;-;-;E; ; ; ;>
<A;B;C;D;-;F>

It uses a single alphabet.


This however will not work:

<A;B;C;D;E;F;G;H;>
<A;-;C;D;E;P;G;H;>
<A;B;-;-;E; ; ; ;>
    <C;D;-;F>

'C' in the fourth line will be aligned with the 'A'-column, 'D' with the 'B'-column and so on.

Example 5. seq-multiple-alphabet.modhmm-msa-discrete, an example file in the Modhmm msa discrete format

The example file seq-multiple-alphabet.modhmm-msa-discrete contains

<A;B;C;D;E;F;G;H;>
<A;-;C;D;E;P;G;H;>
<A;B;-;-;E; ; ; ;>
<<a;b;c;d;d;f;g;h;>
<<a;-;c;d;r;f;g;h;>
<<a;b;-;-;e; ; ; ;>

It uses multiple alphabets.


Labels can also be used in this format ( see Section 3.4.1.4, “Labels” ).
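
As an illustration of how the rows of a single-alphabet msa file line up, the following Python sketch turns the rows into alignment columns. It is not taken from modhmm: the function name msa_columns is hypothetical, the first row is assumed to be the template sequence, and all gap symbols are normalised to '-'.

```python
GAP_SYMBOLS = {"-", "_", " ", "."}

def msa_columns(rows):
    """Turn msa rows such as '<A;-;C;>' into a list of alignment columns.

    Assumptions: single alphabet, the first row is the template, and
    rows that end early are padded with gaps at the end.
    """
    parsed = []
    for row in rows:
        body = row[row.index("<") + 1 : row.rindex(">")]
        if body.endswith(";"):
            body = body[:-1]
        parsed.append(["-" if tok in GAP_SYMBOLS else tok for tok in body.split(";")])
    width = len(parsed[0])
    for letters in parsed:
        if len(letters) > width:
            raise ValueError("row is longer than the template row")
        # Letters may be skipped at the end of a row, never at the beginning.
        letters.extend("-" * (width - len(letters)))
    return [list(col) for col in zip(*parsed)]
```

Applied to the four rows of Example 4, the sketch returns eight columns of four symbols each.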

3.4.1.4. Labels

A line containing a labeling of the sequence may be added to sequences in fa, s or msa format. The labeling is represented as a line at the end of the file (enclosed by '/' characters) with one label character for each sequence position. For a fasta sequence:

>seqname1
ACDEFGHIKLMNPQRSTVWY
/..................../

For an s sequence:

>seqname1
<A;C;D;E;F;G;H;I;K;L;M;N;P;Q;R;S;T;V;W;Y;>
/..................../

For an msa sequence:

>seqname1
<A;C;D;E;F;G;H;I;K;L;M;N;P;Q;R;S;T;V;W;Y;>
<A;C;D;T;F;T;H;I;K;-;-;N;P;Q;R;S;T;-;W;Y;>
<-;-;-;-;-;-;-;-;-;-;-;N;P;Q;R;S;T;-;Y;Y;>
/..................../

A label must only consist of one character. The character '.' is predefined as representing any label, which means that a sequence letter labeled with '.' will match any state label. For multiple alphabets the principle is the same as for unlabeled alignment sequences, and the labels are placed at the bottom (only once, since the labeling is the same over all alphabets).
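
A minimal Python sketch of reading a labeled fasta file is given here. It is only an illustration of the rules above, not the reader used by modhmm, and the function name read_labeled_fasta is hypothetical.

```python
def read_labeled_fasta(text):
    """Return (name, sequence, labels); labels is None if no label line exists."""
    name, seq_parts, labels = None, [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
        elif len(line) >= 2 and line.startswith("/") and line.endswith("/"):
            # The labeling is enclosed by '/' characters.
            labels = line[1:-1]
        else:
            seq_parts.append(line)
    sequence = "".join(seq_parts)
    if labels is not None and len(labels) != len(sequence):
        raise ValueError("need exactly one label character per sequence position")
    return name, sequence, labels
```

The length check reflects the requirement of one label character per sequence position.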

3.4.2. File formats for continuous alphabets

3.4.2.1. Modhmm s continuous format

The Modhmm s continuous format is the continuous version of the Modhmm s discrete format. All lines start and end with '#' and '+' respectively to mark that the alphabet is continuous. Otherwise, there is no difference from the Modhmm s discrete format.

Example 6. seq.modhmm-s-continuous, an example file in the Modhmm s continuous format

The example file seq.modhmm-s-continuous contains

#0.04;-2.34;2.99;104.77;+


The multiple alphabet version of this format has the same basic structure as the single alphabet version. Each sequence consists of 2-4 subsequences which begin and end with '#' and '+' respectively. The number of '#' characters in the beginning is used to indicate which alphabet a subsequence belongs to. The subsequences must be in numerical order, and they must be of equal lengths. It is also possible to mix discrete and continuous alphabets. This is simply done by enclosing continuous subsequences in '#' and '+' and discrete subsequences in '<' and '>'.

Example 7. seq-multiple-alphabet.modhmm-s-continuous, an example file in the Modhmm s continuous format

The example file seq-multiple-alphabet.modhmm-s-continuous contains

<A;W;A;A;>
##0.11;2.99;-1.00;-10.09;+


Labeling sequences from a continuous alphabet works in exactly the same way as for sequences from a discrete alphabet.

3.4.2.2. Modhmm msa continuous format

Alignments of sequences from continuous alphabets do not exist. However, when using multiple alphabets, single sequences from continuous alphabets may be used together with alignments of ones from discrete alphabets. Here, 'X' is a wildcard letter which stands for anything (emission probability 1.0).

Example 8. seq.modhmm-msa-continuous, an example file in Modhmm msa continuous format

The example file seq.modhmm-msa-continuous contains

<A;B;C;D;E;F;G;H;>
<A;-;C;D;E;P;G;H;>
<A;B;-;-;E; ; ; ;>
<A;B;C;D;-;F>
<<s;s;h;h;f;f;s;s>
<<h;-;h;h;h;h;h;h;>
<<s;s;-;-;q; ; ; ;>
<<h;h;h;h;-;h>
###0.01;0.05;0.99;0.88;X;0.11;-1.99;-2.99;+


3.4.3. Modhmm prf format

The Modhmm prf format is slightly different from the other formats as it does not contain letters, but rather frequencies at each position. Labels are compulsory for this sequence format.

Example 9. seq.modhmm-prf, an example file in Modhmm prf format

The example file seq.modhmm-prf contains

Sequence: 1c3wA
NR of aligned sequences: 4
Length of query sequence: 5

START 1
ALPHABET:   A       C       D       E       -     <SPACE> <LABEL> <QUERY>
COL    1:   0.00    0.00    0.00    0.00    0.00   30.00   .       T
COL    2:   0.00    0.00    0.00    0.00    0.00   28.00   .       G
COL    3:   0.00    0.00    0.00    0.00    0.00   28.00   .       R
COL    4:   0.00    0.00    0.00    0.00    0.00   27.00   .       P
COL    5:   1.00    0.00    2.00    9.00    0.00   23.00   .       E
END 1

START 2
ALPHABET:   c         M         m       -       <SPACE> <LABEL> <QUERY>
COL   1:  0.00266   0.08577   0.05604   0.00000   0.00000   .      X
COL   2:  0.00271   0.11234   0.13852   0.00000   0.00000   .      X
COL   3:  0.00240   0.20539   0.20777   0.00000   0.00000   .      X
COL   4:  0.00304   0.26166   0.19852   0.00000   0.00000   .      X
COL   5:  0.00419   0.26735   0.23101   0.00000   0.00000   .      X
END 2


3.4.4. Modhmm replacement letter format

The Modhmm replacement letter format is fairly strict. All lines beginning with '#' (as the very first character), and all empty lines are disregarded. Apart from this, the first line must be the number of alphabets, the second line must be a number giving the size of the first alphabet. The third line must be the characters of the alphabet, separated by and ending with ';'. The fourth line is the number of the replacement letters. Next, depending on the number in the fourth line, there are a number of lines with the specification of the replacement letters for that alphabet. Then the next line contains the number giving the size of the second alphabet if it exists, etc.

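The specification lines have the following layout: <replacement letter> = <substitution> <substitution> <substitution> ... . The substitutions have the form: <alphabet letter>:<probability share>. In the example file, the line B = D:0.5 N:0.5 states that the replacement letter B is treated as D with share 0.5 and as N with share 0.5.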

Example 10. file.modhmm-replacement-letter, an example file in Modhmm replacement letter format

The example file file.modhmm-replacement-letter contains

# Nr of alphabets
2
# Alphabet 1 size
20
# Alphabet 1
A;C;D;E;F;G;H;I;K;L;M;N;P;Q;R;S;T;V;W;Y;
# Nr replacement letters 1
3
# Replacement letters 1
U = M:1.0
B = D:0.5 N:0.5
Z = Q:0.5 E:0.5

# Alphabet 2 size
3
# Alphabet 2
s;h;e;
# Nr replacement letters 2
0
# Replacement letters
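
To make the layout of a specification line concrete, the following Python sketch parses one such line. It is an illustration only; the function name parse_replacement_line is hypothetical and nothing here is taken from the modhmm sources.

```python
def parse_replacement_line(line):
    """Parse a line such as 'B = D:0.5 N:0.5' into ('B', {'D': 0.5, 'N': 0.5})."""
    letter, _, rhs = line.partition("=")
    shares = {}
    for item in rhs.split():
        # Each substitution has the form <letter>:<probability share>.
        target, _, share = item.partition(":")
        shares[target] = float(share)
    return letter.strip(), shares
```

In the example file the shares of each replacement letter add up to 1.0; whether modhmm enforces this is not stated here, so the sketch does not check it.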


3.4.5. Modhmm subst mtx format

The Modhmm subst mtx format specifies a substitution matrix as defined in Section 4.1, “Emission scoring methods”. All lines beginning with '#' (as the very first character), and all empty lines are disregarded. Apart from this, the first line must be a number giving the size of the alphabet of the associated hmm. The second line must be the characters of the alphabet, separated by and ending with ';'. The following lines specify the matrix itself. Each line has the following layout: <alphabet letter> = <score alphabet letter 1> <score alphabet letter 2> ... . There is one line for each letter of the alphabet, not in any particular order, and one line at the end that describes how to interpret letters in the sequences not in the alphabet. This line is only used under certain circumstances in the SMDP and SMDPP scoring methods. There is one column for each letter of the alphabet and the columns must be in the same order as the alphabet specification of the hmm.

Example 11. file.modhmm-subst-mtx, an example file in Modhmm subst mtx format

The example file file.modhmm-subst-mtx contains

#Nr of alphabets
2

# Alphabet size
20

# Alphabet
A;C;D;E;F;G;H;I;K;L;M;N;P;Q;R;S;T;V;W;Y;

# Matrix letters
A = A:1.0 C:0.0 D:0.0 E:0.0 F:0.0 ...
C = A:0.0 C:1.0 D:0.0 E:0.0 F:0.0 ...
D = A:0.0 C:0.0 D:1.0 E:1.0 F:0.0 ...
E = A:0.0 C:0.0 D:1.0 E:1.0 F:0.0 ...
F = A:0.0 C:0.0 D:0.0 E:0.0 F:1.0 ...
G = A:0.0 C:0.0 D:0.0 E:0.0 F:0.0 ...
H = A:0.0 C:0.0 D:0.0 E:0.0 F:0.0 ...
.
.
.
X = A:0.05 C:0.05 D:0.05 E:0.05 ...

#Alphabet 2 size
3

#Alphabet 2
s;h;e;

# Matrix letters 2
s = s:1.0 h:0.0 e:0.0
h = s:0.0 h:1.0 e:0.0
e = s:0.0 h:0.0 e:1.0
x = s:0.33 h:0.33 e:0.34


3.4.6. Modhmm frequency format

The Modhmm frequency format specifies the background frequencies of the alphabet letters in some chosen population as defined in Section 4.1, “Emission scoring methods”. All lines beginning with '#' (as the very first character), and all empty lines are disregarded. Apart from this, the first line must be a number giving the size of the alphabet of the associated hmm. The following lines specify the frequencies. There is one line for each letter of the alphabet, in the same order as the alphabet is listed in the hmm specification.

Example 12. file.modhmm-frequency, an example file in Modhmm frequency format

The example file file.modhmm-frequency contains

#alphabet size
5 

# frequencies
0.51
0.14
0.03
0.08
0.24


Note that there are no multiple alphabet frequency files. These are specified separately for each alphabet.

3.4.7. Modhmm prior format

All lines beginning with '#' (as the very first character), and all empty lines are disregarded. Apart from this, the first line should contain the number of mixture components. Then, for each component there is a line containing the component's probability value and a line containing the component values for each alphabet letter. Note that the order of the component values must match the alphabet order in the hmm specification, i.e. the first component value of a line will be associated with the first alphabet letter in the hmm specification, etc.

Example 13. file.modhmm-prior, an example file in Modhmm prior format

The example file file.modhmm-prior contains

9       # 9 components
# Component 1
0.178091
0.270671 0.039848 0.017576 ...

# Component 2
0.056591
0.021465 0.0103 0.011741 ...
.
.
.


Note that there are no multiple alphabet prior files. These are specified separately for each alphabet.

3.4.8. Modhmm null model format

All lines beginning with '#' (as the very first character), and all empty lines are disregarded. Apart from this, the first line should contain the alphabet size (n). Lines 2 to n+1 should contain the emission probabilities for the letters of the alphabet, ordered the same way as in the hmm specification. The last line should contain the loop transition probability. The null model is built as an hmm with one singleloop module.

Example 14. file.modhmm-null-model, an example file in Modhmm null model format

The example file file.modhmm-null-model contains

2 # Nr of alphabets

20 # alphabet size

0.075520 # A
0.016973 # C
0.053029 # D
0.063204 # E
0.040762 # F
.
.
.
0.012513 # W
0.031985 # Y

3 # alphabet size 2

0.33 # s
0.33 # h
0.34 # e

0.997151 # trans


3.4.9. Modhmm name file format

The files called hmmnamefile and seqnamefile in the options are in the Modhmm name file format. These files are simply lists of the names of hmm or sequence files, each given with a relative or full path. Note that this file format is very sensitive to blank lines, blanks added at the end of lines etc. Each line should contain only the name of the hmm/sequence file.

Example 15. file.modhmm-name-file, an example file in Modhmm name file format

The example file file.modhmm-name-file contains

DATASET/FASTA_FILES/A15_HUMAN_MOLLER_C.fa
DATASET/FASTA_FILES/A26161_HALSS_MPTOPO3D.fa
DATASET/FASTA_FILES/A41616_HUMAN_MPTOPO3D.fa
DATASET/FASTA_FILES/A42226_ECOLI_MPTOPO1D.fa
DATASET/FASTA_FILES/ACC8_CRICR_TMPDB1D.fa
DATASET/FASTA_FILES/ADT2_YEAST_MOLLER_C.fa
DATASET/FASTA_FILES/ADTA_RICPR_MOLLER_B.fa
.
.
.


3.4.10. Modhmm alphabet format

Files in the Modhmm alphabet format are used to specify the alphabet and the emission probabilities when adding an alphabet to an already existing HMM. The first row contains the text 'ALPHABET: ' and a string containing the alphabet where the letters are separated by a ';'. The second row contains the text 'ALPHABET LENGTH: ' and a positive integer giving the number of letters in the alphabet. The remaining rows contain the text 'VERTEX X: ' and a sequence of floating point numbers (separated by a 'SPACE' character). The sequence of floating point numbers gives the emission probabilities for the alphabet letters in state X of the model. This means that there should be as many floats in the sequence of each line as there are letters in the alphabet. The order of the probabilities should follow the order of the letters in the alphabet specification from the first row. There should be one row for each state in the HMM to which the alphabet is to be added. The sum of the numbers from one row should be either 1.0 (for emitter states) or 0.0 (for silent states) for discrete alphabets.

Example 16. file.modhmm-alphabet, an example file in Modhmm alphabet format

The example file file.modhmm-alphabet contains

ALPHABET: A;B;C;
ALPHABET LENGTH: 3
VERTEX 0: 0.0 0.0 0.0
VERTEX 1: 0.1 0.1 0.80
VERTEX 2: 0.23 0.27 0.50
VERTEX 3: 0.0 0.0 0.0
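
The constraints described above can be checked before running add_alphabet with a short script. The following Python sketch is one possible check for discrete alphabets; the function name parse_alphabet_file and the tolerance of 1e-6 are assumptions, not part of modhmm.

```python
import math

def parse_alphabet_file(text):
    """Return (letters, {vertex_number: [probabilities]}) for a discrete alphabet file."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Row 1: 'ALPHABET: A;B;C;'  Row 2: 'ALPHABET LENGTH: 3'
    letters = [x for x in lines[0].split(":", 1)[1].strip().split(";") if x]
    size = int(lines[1].split(":", 1)[1])
    if size != len(letters):
        raise ValueError("ALPHABET LENGTH does not match the alphabet")
    emissions = {}
    for ln in lines[2:]:
        head, values = ln.split(":", 1)       # 'VERTEX 1', ' 0.1 0.1 0.80'
        vertex = int(head.split()[1])
        probs = [float(v) for v in values.split()]
        if len(probs) != size:
            raise ValueError(f"vertex {vertex}: expected {size} values")
        total = sum(probs)
        # Emitter states sum to 1.0, silent states to 0.0 (tolerance is an assumption).
        if not (math.isclose(total, 1.0, abs_tol=1e-6) or math.isclose(total, 0.0, abs_tol=1e-6)):
            raise ValueError(f"vertex {vertex}: probabilities sum to {total}")
        emissions[vertex] = probs
    return letters, emissions
```

Run on the contents of Example 16, the check passes for all four vertices.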


4. Functionality

4.1. Emission scoring methods

There are 7 different ways to calculate the score (probability) for a multiple sequence alignment column to be emitted in a certain state. Let us call this score $e^{msa}$.

4.1.1. DP

The first scoring method, DP, is the dot product

Equation 0. 

$$e_q^{msa}(x_k^{msa}) = \sum_j e_q(\sigma_j) * x_k^{msa}(\sigma_j)$$


Here, $\sigma_j$ represents alphabet symbol $j$, $q$ is a state and $x_k^{msa}$ is sequence column $k$. This formula is easy to use and only increases the time complexity of the forward and backward algorithms by a factor equal to the size of the alphabet. The probabilistic interpretation of this score is that it represents the probability that the same letter is drawn if drawing independently from both the state and the profile distributions.
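
As a minimal sketch of the dot product score (not the modhmm implementation), assume the emission probabilities of state q and the letter shares of column k are given as two Python lists in the same alphabet order. The function name dp_score is hypothetical.

```python
def dp_score(emission, column):
    """Dot product of state emission probabilities and column letter shares."""
    return sum(e * x for e, x in zip(emission, column))
```

For a column that contains only one letter, the score reduces to the ordinary emission probability of that letter.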

4.1.2. GM

The second implementation of $e^{msa}$, GM, calculates the total probability of scoring all sequences of the profile as single sequences but forced to take the same path through the model, normalized on the number of sequences in the alignment

Equation 0. 

$$e_q^{msa}(x_k^{msa}) = \prod_j e_q(\sigma_j)^{x_k^{msa}(\sigma_j)}$$


As for the scoring method above, this formula only increases the time complexity of the forward and backward algorithms by a factor equal to the size of the alphabet.
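
The geometric mean score can be sketched in Python as follows, under the same assumptions as before: two lists in the same alphabet order, with the column shares summing to 1. The function name gm_score is hypothetical and the code is not taken from modhmm. It requires Python 3.8 or later for math.prod.

```python
import math

def gm_score(emission, column):
    """Geometric mean score: product of emission probabilities raised to the column shares."""
    # Letters with share 0 contribute a factor of 1 (Python evaluates 0.0 ** 0.0 as 1.0).
    return math.prod(e ** x for e, x in zip(emission, column))
```

For long alphabets or very small probabilities, summing x * log(e) and exponentiating at the end may be numerically safer than multiplying directly.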

4.1.3. PICASSO

The third scoring method, PICASSO, was originally developed for non-HMM-based profile-profile comparison. The probabilistic interpretation of this score is that it, like GM, calculates the geometric mean, except that an emission probability is not a pure frequency but a frequency weighted according to how common the letter is in a given background distribution. In the symmetric version, the profile columns are weighted in a similar fashion. The respective columns are then normalized to sum to 1. In PICASSO, the emission score is calculated as

Equation 0. 

$$e_q^{msa}(x_k^{msa}) = \prod_{j=1}^{Z} \left( \frac{e_q(\sigma_j)}{\mathrm{freq}_{\sigma_j}} \right)^{x_k^{msa}(\sigma_j)}$$


for the original version [1] or as

Equation 0. 

$$e_q^{msa}(x_k^{msa}) = \prod_{j=1}^{Z} \left( \frac{e_q(\sigma_j)}{\mathrm{freq}_{\sigma_j}} \right)^{x_k^{msa}(\sigma_j)} * \prod_{j=1}^{Z} \left( \frac{x_k^{msa}(\sigma_j)}{\mathrm{freq}_{\sigma_j}} \right)^{e_q(\sigma_j)}$$


for the symmetric version [4]. Here, $\mathrm{freq}_{\sigma_j}$ is the background frequency of alphabet letter $\sigma_j$ and $Z$ is the number of letters in the alphabet. The normalizing procedure has been left out in the above formula. The purpose of using pre-determined background frequencies is to get a score which depends on the column frequencies of the alphabet letters relative to a background distribution. Note that the geometric mean interpretation does not hold for the symmetric version. The product $\prod_{j=1}^{Z} \left( \frac{x_k^{msa}(\sigma_j)}{\mathrm{freq}_{\sigma_j}} \right)^{e_q(\sigma_j)}$ cannot easily be given a probabilistic interpretation, because it contains an inverted view of the sequences and the model, i.e. the emission probabilities are seen as a set of sequence residues, which are emitted by the sequence frequency profile.
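
Both PICASSO variants can be sketched in Python as below. This is an illustration under stated assumptions, not the modhmm code: the emission probabilities, column shares and background frequencies are lists in the same alphabet order, all background frequencies are positive, and the normalizing procedure is left out just as in the formulas. The function names are hypothetical.

```python
import math

def picasso_score(emission, column, freq):
    """Original PICASSO: emission probabilities weighted by background frequencies."""
    return math.prod((e / f) ** x for e, x, f in zip(emission, column, freq))

def picasso_symmetric_score(emission, column, freq):
    """Symmetric PICASSO: the original product times the same product with roles swapped."""
    swapped = math.prod((x / f) ** e for e, x, f in zip(emission, column, freq))
    return picasso_score(emission, column, freq) * swapped
```

When the emission distribution equals the background distribution, the original score is 1 regardless of the column, which shows that the score measures deviation from the background.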

4.1.4. DPPI

The fourth scoring method, DPPI, is a combination of DP ( Section 4.1.1, “DP” ) and PICASSO ( Section 4.1.3, “PICASSO” ). Here, a sum of products of the corresponding elements of the two vectors is calculated as in standard DP, but each term in the sum is also divided by the background frequency of that letter.

Equation 0. 

$$e_q^{msa}(x_k^{msa}) = \sum_j \frac{e_q(\sigma_j) * x_k^{msa}(\sigma_j)}{\mathrm{freq}_{\sigma_j}}$$
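
A minimal Python sketch of this score, assuming three lists in the same alphabet order and positive background frequencies (the function name dppi_score is hypothetical and the code is not taken from modhmm):

```python
def dppi_score(emission, column, freq):
    """Dot product in which every term is divided by the letter's background frequency."""
    return sum(e * x / f for e, x, f in zip(emission, column, freq))
```

Compared with plain DP, a match on a rare letter contributes more to the score than a match on a common one.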


4.1.5. SMP

The fifth implementation of $e^{msa}$, SMP, uses a variant of the so-called log-average score, which originally is

Equation 0. 

$$e_q^{msa}(x_k^{msa}) = \log \sum_{i=1}^{|\Sigma|} \sum_{j=1}^{|\Sigma|} x_k^{msa}(\sigma_i) * e_q(\sigma_j) \frac{p_{rel}(i,j)}{p_i p_j}$$


where $|\Sigma|$ is the size of the alphabet, $p_i$ is the background distribution for letter $i$ and $p_{rel}$ is a measure of "relatedness" for letter pairs. The modhmm implementation differs from this formula only in not taking the log of the double sum. A substitution matrix ( Section 3.4.5, “Modhmm subst mtx format” ) is supplied by the user for the $\frac{p_{rel}(i,j)}{p_i p_j}$ part. This scoring method increases the time complexity of the forward and backward algorithms by a factor equal to the size of the alphabet to the power of 2.
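
The double sum without the log can be sketched in Python as follows. The sketch is not the modhmm implementation: it assumes that the substitution matrix is given as a list of rows whose entry subst[i][j] already holds the ratio of relatedness to background frequencies for letters i and j, and the function name smp_score is hypothetical.

```python
def smp_score(emission, column, subst):
    """Double sum over all letter pairs, weighted by the substitution matrix entries."""
    n = len(emission)
    return sum(
        column[i] * emission[j] * subst[i][j]
        for i in range(n)
        for j in range(n)
    )
```

With an identity matrix the score reduces to the dot product of DP, and the nested loops make the quadratic cost in the alphabet size visible.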

4.1.6. SMDP

The sixth implementation of $e^{msa}$, SMDP, is a faster version of SMP ( Section 4.1.5, “SMP” ), which puts more focus on the query sequence.

Equation 0. 

$$e_q^{msa}(x_k^{msa}) = \sum_{j=1}^{|\Sigma|} x_k^{msa}(\sigma_{query}) * e_q(\sigma_j) \frac{p_{rel}(query,j)}{p_{query} p_j}$$


where $|\Sigma|$ is the size of the alphabet and $x_k^{msa}(\sigma_{query})$ represents the alignment share of the query sequence's letter at position/column $k$. This scoring method is an approximation of the SMP method which has the advantage of being faster by a factor equal to the size of the alphabet.
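
A Python sketch of this single sum is given below, under the same assumptions as for SMP: subst[i][j] holds the substitution matrix entry for letters i and j, and query is the alphabet index of the query sequence's letter in the column. The function name smdp_score is hypothetical and the code is not taken from modhmm.

```python
def smdp_score(emission, column, subst, query):
    """Single sum over the alphabet using only the query letter's column share."""
    return sum(
        column[query] * emission[j] * subst[query][j]
        for j in range(len(emission))
    )
```

Only one row of the substitution matrix is visited, which is where the speed-up over SMP comes from.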

4.1.7. SMDPP

The seventh implementation, SMDPP, is essentially the same as the SMDP method ( Section 4.1.6, “SMDP” ), but for all letters in a certain position for which the substitution matrix value is below a certain threshold, the sequence column shares are evened out, i.e. redistributed uniformly over these letters. The objective is to disregard the significance of a random, "unlikely" mutation.

References

[1] A. Heger and L. Holm “Exhaustive enumeration of protein domain families”. J. Mol. Biol., 328:749-767, 2003.

[2] R. Hughey and A. Krogh “Hidden Markov models for sequence analysis: extension and analysis of the basic method”. Comput. Appl. Biosci., 12(2):95-107, 1996.

[3] A. Krogh “Hidden Markov models for labeled sequences”. Proceedings of the 12th IAPR International Conference on Pattern Recognition, Los Alamitos, California, pages 140-144, 1994. IEEE Computer Society Press.

[4] D. Mittelman, R. Sadreyev and N. Grishin “Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments”. Bioinformatics, 19:1531-1539, 2003.

[5] K. Sjölander, K. Karplus, R. Hughey, A. Krogh, I.S. Mian and D. Haussler “Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology”. Comput. Appl. Biosci., 12(4):327-345, 1996.


modhmm is hosted at SourceForge.net.