BioUML plug-in for nonlinear parameter estimation using multiple experimental data
|, , ,|
Development of experimental technologies in molecular biology led to accumulation of huge volumes of data relating to various levels of life organization. However, the data alone cannot be used to reconstruct the full organization of biological systems. Therefore, the interests of bioinformatics are now focused on the problems of data processing, including the problems of integration and systematization of primary experimental data and the problems of knowledge production based on mathematics and modern information technologies. The challenge of systems biology is construction of mathematical models to describe dynamic behavior of biological systems based on experimental data. Such problems involve studying a large volume of data and require software for their processing and interpreting.
The standard tools for working with biological data include access to biological databases, formalized description of biological systems, as well as visualization, simulation, parameter fitting and analysis of ODE models representing these systems. The BioUML software is an integrated environment that was developed to span all of these capabilities. Here we present optimization tools of this software intended for multi-experiment training of the models created using BioUML notation or imported in the SBML format [Hucka et al, 2003]. These tools are available both in the desktop and web editions of BioUML.
Optimization problem in BioUML
The general nonlinear optimization problem [Runarsson and Yao, 2000] can be formulated as follows: find a minimum of the objective function , where lies in the intersection of the -dimensional search space
and the admissible region defined by a set of equality and/or inequality constraints on . Since the equality can be replaced by two inequalities and , the admissible region can be defined without loss of generality as
In order to get solution situated inside , we minimize the penalty function
The problem could be solved by different optimization methods. We implemented the following of them in the BioUML software:
stochastic ranking evolution strategy (SRES) [Runarsson and Yao, 2000];
cellular genetic algorithm MOCell [Nebro et al, 2009];
particle swarm optimization (PSO) [Sierra and Coello, 2005];
deterministic method of global optimization glbSolve [Björkman and Holmström, 1999];
adaptive simulated annealing (ASA) [Ingber, 1996].
Table 1 shows the generic scheme of the optimization process for these methods. SRES, MOCell, PSO and glbSolve run a predefined number of iterations considering a sequence of sets (populations) , , of potential solutions (guesses). In the case of the first three methods, the size of the population is fixed, whereas in glbSolve the initial population consists of one guess, while the size of the population is found during the iteration with the number . The method ASA considers sequentially generated guesses , , and stops if distance between and defined as Euclidean norm becomes less than a predefined accuracy .
All methods, excepting glbSolve, are stochastic and seek global minimum of the function taking into account the admissible region . Thus, a guess is more preferable than a guess at some iteration of methods, if and or . The method glbSolve is suited to solve only the problems with . Values of the function are calculated but do not affect on the generation of potential solutions.
Implementing the optimization scheme in BioUML, we designed the OptimizationProblem interface (fig. 1) comprising the following procedures:
getParameters specifies a list of parameters to fit including identification of initial values and variation intervals (upper and lower bounds);
testGoodnessOfFit defines type of the functions and and evaluates their values for a population of guesses;
getEvaluationsNumber returns the number of passed evaluations during optimization process.
An abstract class OptimizationMethod provides the number of subclasses representing implementation of the foregoing methods. These subclasses involve search of optimal parameters by calling a procedure getSolution depending on the settings of optimization problem.
Application of non-linear optimization to systems biology
We assume that a mathematical model of some biological process consists of a set of chemical species associated with variables representing their concentrations, and a set of biochemical reactions with rates depending on a set of kinetic constants . Reaction rates are modeled by standard laws of chemical kinetics. A Cauchy problem for ordinary differential equations representing a linear combination of reaction rates is used to describe the model behavior over time:
Here is a stoichiometric matrix of by . We say that is a steady state of the system (1) if
Identification of parameters and initial concentrations is based on experimental data represented by a set of points defining dynamics of variables , , at given times , , where is the number of such points for the concentration , . The problem of parameter identification consists in minimization of the function of deviations defined as the normalized sum of squares [Hoops et al, 2006]:
where normalization factors with are used to make all concentration trajectories have similar importance. The weights are calculated by one of the formulas on experimentally measured concentrations: (mean square value), (mean value) and (standard deviation).
If we want to consider additional constrains
holding for concentrations and parameters for some period of time , the penalty function is defined as
This function assumes summation of values in the nodes of grid defined by an ODE solver to find a numerical solution of the system (1).
In the particular case, experimental data could be represented by steady state values of species concentrations. Then functions and have the simpler forms:
where and , , denote experimental and simulated steady state values.
Typically, researchers want to perform evaluation of model parameters using experimental data obtained with different experimental conditions, i.e. different initial concentrations of species. In such case, we will consider the functions
Implementation of the parameter estimation process in BioUML
Initiation of the parameter estimation process requires definition of many details including specification of the search space, the admissible region, settings of numerical methods to solve ODE system and optimization problem, links to the files with experimental data and model description, etc. In order to structure this information, we designed an appropriate hierarchy of classes (fig. 2) taking into account the following rules:
experimental data must be represented by time-courses or steady states of chemical species concentrations; in the first case, these data may be expressed as percentage values obtained, for example, on the basis of the Western-blot technology;
an initial state of the Cauchy problem must be specified for each considered file with experimental data (these states may be different for some experiments);
parameters to fit may be divided into local and global:
The main class Optimization in fig. 2 comprises definition of the optimization method and parameters including the model parameters to fit, parametric constraints holding for given time intervals, and experiments. The following fields keep information about each experiment:
cellLine defines an experimental group to evaluate local parameters of the model;
diagramStateName corresponds to the model initial state;
experimentType defines the time-course or steady state type of experimental data;
tableSupport contains the link to the file with experimental data, and specifies a method for calculation of weights in the formula (2);
parameterConnections is a list of correspondences between variable names in the model and in the experimental file including information whether data in this file are expressed as exact values or as values related to a given time point.
An object of the OptimizationMethod class contains a set of parameters (methodParameters) including links to the file with the processed model (diagramPath) and the directory to save results when the parameter estimation will be completed (resultPath). The estimation process begins immediately after the object control gets command doRun. Goodness of the current fit is defined through the object optimizationProblem providing correct simulation of the model. Firstly, evaluation of functions (2) and (4) is performed separately for all initial states of the model by calling a method testGoodnessOfFit for each object of the class SingleExperimentParameterEstimation associated with the certain experiment. Then the total values of these functions are calculated in the class ParameterEstimationProblem by the formulas (5).
The parameter estimation process is optimized using the following technologies:
Acceleration of simulation of the Cauchy problem for different values of fitted parameter is achieved by automatic generation and compilation of the Java class file at the first iteration of the optimization algorithm. At the subsequent iterations, current values are passed to the object of this compiled class and a solution of the Cauchy problem is found.
Acceleration of the optimization methods considering a population of guesses is achieved by parallelization of calculations. The following task (SimulationTask in fig. 2) is generated for each guess:
When all tasks are generated, they are passed to the executor service, which distributes their performance between the predefined threads.
Fig. 3 shows a graphical user interface of optimization methods in the BioUML workbench (desktop edition). The upper left panel includes a list of methods. Description of the selected method is provided below. The upper right panel defines the search space. Under it you can find the tab panel with settings of optimization problem. In fig. 3 the selected tab contains method parameters and fields displaying intermediate values of objective/penalty functions and the number of passed evaluations. The next tab includes description of experimental data and specifies settings for all experiment fields listed above.
Web edition optimization
The web edition provides a set of optimization tools, like in the workbench edition. The user can set up the boundaries and initial values of optimization parameters, select and adjust an optimization method, manage experimental data, set up constraints. After all parameters are adjusted, optimization process can be launched. Optimization results can be saved and then viewed as a table of fitted parameter values. Graphical representation of the optimization process is also available.
Analysis of the methods convergence
Stochastic methods (SRES, MOCell, PSO and ASA) for global optimization rely on probabilistic approaches and have weak theoretical guarantees of convergence to the global optimum. However, they can locate its vicinity with relative efficiency [Moles et al, 2003]. In contrast, deterministic method glbSolve guarantees global optimality, if the objective function is continuous or at least continuous in the neighborhood of a global optimum [Björkman and Holmström, 1999]. However, it can not solve general global optimization problems with certainty in finite time.
To analyze convergence rate of the implemented methods, we considered a reaction chain (fig. 4, table 2) extracted from the model by L. Neumann et al. [Neumann et al, 2010] and representing activation of caspase-8 triggered by the receptor CD95 (APO-1/Fas).
We performed estimation of parameters using the search space defined as
, , ,
where upper bounds were chosen based on the order of magnitude of parameter values proposed by authors of the original model. Initial values of variables were fixed according to [Neumann et al, 2010]:
, , .
Estimation was based on the experimental data obtained by Neumann et al. for procaspase-8 and its cleaved products p43/p41 and caspase-8 (table 3).
Fig. 5 shows dependence of the objective function mean values on the number of considered guesses for 100 runs of the optimization process. Statistics of the best, mean and worst values of , as well as the best guesses found by the methods after consideration of 107 guesses are listed in the tables 4 and 5 correspondently.
As can be seen from these tables, the best result was obtained by the particle swarm optimization (PSO) and the cellular genetic algorithm (MOCell). Methods SRES, MOCell and PSO found similar solutions. Methods ASA and glbSolve found dissimilar values of parameters and resulting in lower efficiency compared to the first three methods.
For comparison, some test cases considered in the study by Moles et al. [Moles et al, 2003] resulted in superiority of SRES. However, the authors did not explore such methods as genetic algorithms and particle swarm optimization.
Comparison of parameter estimation features of BioUML with other software
We compared optimization tools of BioUML with the following software applications assigned for analysis of biochemical networks and supporting the procedure of model fitting:
COPASI (Complex Pathway Simulator) – a stand-alone program providing an C++ API [Hoops et al, 2006];
AMIGO (Advanced Model Identification using Global Optimization) – a multi-platform (Windows and Linux) MATLAB-based toolbox [Balsa-Canto and Banga, 2011];
SBToolbox 2 (Systems Biology Toolbox 2) – a part of SBPOP Package requiring MATLAB[Schmidt and Jirstrand, 2005];
PET (Parameter Estimation Toolkit) – a graphical user interface intended to run under Windows, Mac OS X, and Unix [Shaffer et al, 2009];
PottersWheel – a framework designed as a MATLAB toolbox [Maiwald and Timmer, 2008].
Details of the comparison are given in table 6.
Further, we compared computation speed of the optimization methods implemented in BioUML and COPASI. For this purpose, we considered a series of test cases. A brief description of the models used in these test cases is provided below. For more details, including specification of experimental data and fitting parameters, see Additional file 1 of the supplementary materials.
In the first test case, we analyzed three models of CD95-induced caspase-8 activation constructed on the basis of the model by Neumann et al. [Neumann et al, 2010] with varying degrees of detail (fig. 6, A-C). The second test case was proposed by Mendes et al. [Mendes et al, 2009] and corresponded to the model of the MAP kinase cascade (fig. 7) developed by Kholodenko et al. [Kholodenko, 2000]. Finally, we tested the model of Bagci et al. [Bagci et al, 2006] representing the mitochondria-depended apoptosis resulting from the cooperative formation of heptameric apoptosome complex and activation of caspase-9 and caspase-3 (fig. 8).
Analyzing the number of the objective function evaluations per second for these test cases, we found that BioUML showed a better result than COPASI (fig. 9).
In this paper, we considered parameter estimation tools of the BioUML software. These tools can be applied to biological systems characterized with a set of ODEs. The fitting process is based on experimental time course or steady state measurements, and assumes minimization of the function of error between these measurements and the corresponding model prediction. We implemented several stochastic and deterministic global optimization methods as new plug-in for BioUML. None of these methods is effective for all cases. Nevertheless, on the basis of our observations, we concluded that adaptive simulated annealing can be used when it is necessary to quickly find the vicinity of the solution. In the case when adequacy of solution is more preferable than the rate of convergence, it is better to use such methods as MOCell, PSO and SRES.
Parameter fitting is an important part of the quantitative biological modeling. However, if the model includes more elements than are necessary to approximate experimental data with the given accuracy, we face the problem of overfitting [Hawkins, 2004]. In this case, there is no overall best solution and it is expedient to find distribution of the parameter values which are compatible with observed experimental dynamics. For this purpose, we should run parameter estimation process many times (it is better with different optimization methods) and evaluate bounds for all parameters. Implementation of this technique in BioUML is a task for the future work.
We successfully applied our optimization plug-in for creation of the combined model of CD95 and NF-κB signaling pathways [Kutumova et al, 2013], where the problem of parameter overfitting was solved using the methodology of model reduction [Gorban et al, 2010].
- There are currently no refbacks.