Biojournal of Science and Technology
Volume 2, ISSN:2410-9754, Article ID: m140007
A multi-objective evolutionary approach to reconstruct gene regulatory network using recurrent neural network model
Date of Acceptance: 2015/06/14
Published in Online: 2015/07/13
Sumon Ahmed, Institute of Information Technology, University of Dhaka, Dhaka – 1000; email: [email protected]
Academic editor: Dr. Md Saiful Islam
To Cite This Article:
Sumon Ahmed, Md. Nurul Ahad Tawhid, Kazi Sakib, Md. Mustafizur Rahman. A multi-objective evolutionary approach to reconstruct gene regulatory network using recurrent neural network model. Biojournal of Science and Technology. Vol. 2 (2015)
In recent years, the availability of large scale gene expression data has significantly increased the study of the relationship among genes. Gene expression data, whether in time-course format or steady state format open the door to the researchers to observe the interaction among thousands of genes simultaneously under various environmental conditions. Given that large volume of gene expression data is available, in principle it is possible to reverse engineer the detailed quantitative model of the biological network that adequately represents the dynamics of the underlying system (Noman et al., 2013).
Several common practical issues that make the reconstruction process of GRN difficult are small sample size compared to the number of genes, the presence of biological and experimental noise, lack of adequate knowledge of the complex dynamics and nonlinear nature of biological systems. In spite of many technical advances, the gene array technologies are still unable to acquire the quality and quantity of data that is required to capture the precise mechanism in common regulatory pathways (Noman and Iba, 2007, Schena M, 2013). Two major challenges faced by all inference methodologies (Differential Equations, Bayesian Network etc.) in terms of representation accuracy and computational feasibility while reconstructing GRN are 1) detecting the sparse topological architecture of biological network and 2) estimating the regulatory parameters from a limited amount of gene expression data corrupted with a significant level of noise. Generally, with the increase of problem dimension due to large number of genes in network, search complexity increases very rapidly and locating the global optimum solution becomes very difficult.
In order to apply a computational approach to reconstruct GRN from experimental time-series data, a mathematical model is necessary that will adequately formalize the process of gene regulation. The analysis of gene expression networks and metabolic pathways has resulted in various types of GRN models which vary in terms of the details of biochemical interaction incorporated, discrete or continuous expression level used, deterministic of stochastic approach applied, and so forth (Noman and Iba, 2007). Among the mathematical models of GRN, Boolean Network (Sahoo et al., 2013), Linear Model (Dhaeseleer et al., 2013), Bayesian Network (Mazur and Kaderali, 2013), Neural Network (Vohradsky J, 2013), Differential Equations (Chen et al., 2013), Linear Time-Variant Model (Kabir et al., 2013), S-system Model (Savageau M A, 2013) and models including stochastic components on the molecular level (McAdams and Arkin, 2013) are well known. Boolean Networks and Linear Models are simplistic approaches that employ pairwise association measures such as conditional mutual information for inferring the interactions between genes (Chowdhury et al., 2013). Having low computational complexity, these methods can easily scale up to very large networks of thousands of genes (Basso et al., 2013). Bayesian networks (BN) are based on the strong foundation of probability and statistics where directed edges and conditional probability distributions are used to represent dependencies between nodes. Differential Equation (DE) is a member of sophisticated and well established class of models that maintains a balance between model complexity and mathematical tractability. Several linear and non-linear types of DE models such as Linear Time-Variant Model, S-system Model, etc. have the ability to depict system dynamics in continuous time (Chowdhury et al., 2013).
In this work, the Recurrent Neural Network (RNN) model (Wahde and Hertz, 2013) along with a natural computational method is used to extract regulatory interactions among genes from gene expression data sets. Among the reconstruction approaches applied to infer GRN, the RNN model is of particular interest because of its capability to adequately discover the nonlinear and dynamic interactions among the genes (Noman et al., 2013, Wahde and Hertz, 2013, Chiang and Chao, 2013). With the network of nonlinear processing elements, the model can reasonably capture various dynamics and mechanisms that could be present in a complex biological system. However, inferring GRN using RNN model demands the estimation of large parameter sets that also increase with the number of genes present in the target network. Thus the method may get stuck on some locally optimum solution and fail to predict the true skeletal structure in case of larger biological networks. To overcome this problem, the proposed methodology incorporated another objective function that is calculated by summing up the number of regulatory inputs of all the genes in the system (Ahmed et al., 2013). As the most biological systems are sparse (Noman et al., 2013, Noman and Iba, 2007), the smaller values of this second objective function ensure the biological reality in inferred gene regulatory networks.
Applying a mathematical model for inferring GRN requires the development of some algorithmic techniques that will estimate the values of model parameters. Some algorithmic techniques such as particle swarm optimizations (Sultana et al., 2013), evolutionary algorithms (Noman et al., 2013, Noman and Iba, 2007), etc. have already been developed in the field of computational intelligence and machine learning that help the biologists to form new hypothesis about the biological systems (Noman et al., 2013) and to design new experiments. In this work, an Evolutionary Algorithm (EA) based inference technique using Recurrent Neural Network (RNN) model has been used with the aim of providing a method that can fulfill the experimental requirements. As the proposed methodology uses more than one fitness function, a natural multi-objective computational approach known as elitist Differential Evolution for Multi-Objective Optimization (DEMO) is used. DEMO, belonging to the group of evolutionary algorithms, is proven to be very effective in solving different conflicting multi-objective optimization problems arising in different domains (Ahmed et al., 2013). Among the family of multi-objective evolutionary algorithms, the proposed methodology has the unique feature of self adaptation. Based on its objective functions, the algorithm converges rapidly without the need of setting any threshold values on the interactions of a particular gene.
The inference capability of the proposed method has been highlighted in different learning experiments using both artificial and real gene network data. Artificial network data with varying noise levels and characteristics were chosen and simulated to obtain synthetic time-series data set and the underlying skeletal network architecture. The reconstruction results depict the suitability of the proposed approach as it correctly identifies all the regulatory interactions among genes even with noisy time-series data. The proposed method was applied in the reconstruction of well-known SOS DNA repair system in Escherichia coli. Among 40 genes of SOS network, 6 genes have been considered in this work which controls the core repair system (Little et al., 2013). The expression values of this gene network are measured in a 50-step time-series, and documented in Uri Alon Lab1. The experimental result represents biological plausibility of the estimated GRN, which has been validated from various aspects, ranging from the activity of functionally coherent gene sets, to previous experimentally verified interactions among genes.
The rest of the paper is organized as follows. The next section explains the RNN model for reconstructing gene regulatory network followed by the description of the fitness functions used in the proposed methodology. Then, elitist DEMO algorithm for inferring RNN model based GRN has been described which is followed by the section presents the experimental results to highlight the effectiveness of the proposed method. The final section concludes the paper with some general discussions.
RECURRENT NEURAL NETWORK (RNN) MODEL FOR GENE REGULATORY NETWORK
The Recurrent Neural Network (RNN) model offers a good compromise between the biological proximity and mathematical flexibility while reconstructing gene regulatory network. The model formulates the interactions among the genes in terms of a tightly coupled system (Noman et al., 2013, Vohradsky J, 2013, Wahde and Hertz, 2013) expressed as,
where N (i,j ≤ N) is the total number of genes in the network, ei represents the total regulatory input for i-th gene, wij represents the type and strength of the regulatory interaction of gene-j on gene-i which is either positive (activation), negative (repression) or zero (no regulation). denotes the basal expression level and represents the decay rate parameter of gene-i. The non-linearity of GRN is introduced by the function g() which is often given by the sigmoid function. The reconstruction of a gene network using RNN model can be described by the set of parameters which can mimic the experimentally observed gene expression data. (1 http://www.weizmann.ac.il/mcb/UriAlon/)
For biological realism, the expression level of gene-i at time t + 1, i.e. ei(t+1) is obtained by normalizing zi using a sigmoid squashing function:
The dynamic interactions among genes of a network are reflected in the change of the magnitude of parameter . In particular, increased values of indicate strengthened interaction between gene i and j, and decreased values indicate weakened interaction. Another important point is that the interactions between the genes have been modeled in this paper based on their expression levels which is a common choice for many existing methods (Noman et al., 2013, Noman and Iba, 2007, Kabir et al., 2013, Ahmed et al., 2013).
MODEL EVALUATION CRITERIA
The large parameter set of recurrent neural network model emerges the needs of some assessment mechanisms for evaluating the alternate gene networks that come across in the course of evolutionary process. The most commonly used model evaluation process known as Mean Squared Error (MSE) is the quantitative difference between the response generated by the candidate model and the experimentally collected response. The smaller the value of MSE, the better the match between observed and calculated expression dynamics, the better the approximation. Like other dynamic systems, the reverse engineering of GRN achieves higher accuracy if multiple time series for the same gene is used. Using M sets of time dynamics, the MSE based fitness function can be given by
Here, represents the experimentally observed expression level of gene-i at time t in the data set. Whereas, is the numerically calculated expression level of gene-i, at sampling time t in the same data set which is acquired by solving Equations (1) and (2). Here M is the number of experimental data sets used, T is the number of sampling time points and N represents the number of genes in the regulatory system.
In a biological system very few genes or proteins interact with a particular gene. Because of the high degree of freedom of the model, there exist many local minima in the search space that can also mimic the time courses very closely. Therefore, if all possible regulations are allowed, the search algorithm may get stuck on some locally optimum solution and fail to obtain the true skeletal network structure. To overcome this problem another fitness function is used similar to that used in (Ahmed et al., 2013) as the second objective in the proposed multi-objective inference algorithm. The value of this fitness function is calculated by summing up the number of regulatory inputs of all the genes in the system. The smaller the value of this fitness function, the sparser the underlying skeletal network structure, closer approximation of the biological reality. Thus for each set of parameters representing regulation networks in recurrent neural network system, the fitness function for obtaining globally optimal gene network structure is given by
Here, Ii is the number of regulatory inputs to gene-i and N is the number of genes in the regulatory system.
PROPOSED INFERENCE METHOD
In this work, an enhanced Multi-Objective Evolutionary Algorithm (MOEA) has been used to estimate the model parameters for the target gene regulatory network. Elitist version of DEMO is used in the core of our algorithm as the optimizer that minimizes both f1, f2 given in equations (3) and (4) respectively. Like most of the MOEAs, DEMO is a population-based search heuristic, where each individual of the population represents a candidate solution of the problem under consideration. A feasible solution, x dominates another feasible solution y, if and only if x is better than y for at least one objective function value. An optimum solution called Pareto optimum is the one which is not dominated by any other solution in the search spaces. In MOEA, it is impossible to improve a Pareto optimum solution with respect to any objective without worsening at least another objective (Storn and Price, 2013, Robic and Filipic, 2013). After random initialization of first generation, each successive new generation is created as follows:
Let, Pt is the current generation and NP represents the population size. A new offspring population Qt of size NP is generated by using crossover and mutation operator of DE (Storn and Price, 2013). Each individual of Qt can be a newly generated offspring or it can come from Pt, based on the following principles (Robic and Filipic, 2013).
- The candidate replaces the parent if it dominates it.
- If the parent dominates the candidate, the candidate is discarded.
- Otherwise (when the candidate and parent are non-dominated with regard to each other), the candidate is added to the population.
A combined population, of size between NP and 2 × NP, is generated and each individual of Rt is evaluated using equations (3) and (4). If the population has been enlarged, it is truncated to prepare for the next step of the algorithm.
The truncation process used in this paper is based on NSGA-II (Deb et al., 2013) and comprised of two steps. First, the fast non-dominated sorting (Deb et al., 2013) is applied and individuals of Rt are sorted into non-dominated fronts F0 …Fl, where the best non-dominated solutions are stored in F0. The members of one front are non-dominated by each other. The next generation, Pt+1 is filled, firstly with the members of F0 and subsequently adding the members of following fronts. However, all members of a front may not be added, because otherwise NP (number of population) would be exceeded. In this case, crowding distance (Deb et al., 2013) is used to identify diverge non-dominated solutions which will be forwarded to next generation.
Because of the high-degree of freedom of the RNN model, the search space contains many local optimums which may trap the search algorithm and global optimum may remain undiscovered (Noman et al., 2013). Thus, if the fitness values (i.e., f1 and f2) of the best compromised individual does not improve for Gm consecutive generations, the mutation operation of DEMO is evoked which mutates all the other individuals in the current generation. The ,β and parameters of an individual are mutated by adding random numbers drawn from Gaussian distribution with mean and standard deviation and , respectively. The parameter is mutated using random numbers drawn from a distribution with mean and standard deviation .
After the random start, the algorithm proceeds in its regular mode- repeating the above process for all genes until the termination criterion is not met. The output generated by any MOEA is the non-dominated set of solutions known as the Pareto-optimal solutions (Robic and Filipic, 2013, Deb et al., 2013). However the decision maker may have imprecise or fuzzy goals for each objective function. Thus, upon having the Pareto-optimal set, a fuzzy based mechanism described in (Abido M, 2013), has been incorporated in the proposed methodology to extract a Pareto-optimal solution as the best compromise solution.
The suitability of the proposed GRN reconstruction methodology using RNN model has been primarily validated using a synthetic network as the actual structure and parameter values are unknown for real networks. The experiments were carried out under the ideal noise-free condition and with simulated noise corrupted gene expression data. Finally, the proposed methodology was applied in the reconstruction of SOS DNA repair system of Echericha coli using real micro array data.
Artificial Network Inference
At first, this paper investigated whether it is possible to infer the regulatory interactions and correct parameter values for a small scale 5 gene synthetic network that is also studied by (Ahmed et al., 2013). The regulatory weight matrix of this five genes network is shown in Table 1. The network contains both positive and negative regulations along with feedback loop. The initial gene expression level was selected randomly. In order to simulate the noise experienced in real gene expression data, expression profiles have been generated by adding 5% and 10% Gaussian noise. The experiments were conducted for each condition using 10 sets of data where search ranges for RNN parameters were set as follows:
In the inference of this small scale synthetic network, is used for all genes. Thus was not been included in the search as it is fixed for the target. The algorithm was implemented in Java and experiments were run in a Intel(R) Core(TM)2 Duo 2.80 GHz, 2GB RAM – personal computer. Each experiment has been repeated 10 times to confirm the reliability of the proposed GRN reconstruction methodology. This approach ensures that even if the significant solutions of one run miss a true regulation, the subsequent runs may find that. That is, the outputs from all of these run are taken into consideration for ensuring the validity of the algorithm.
Table 1. Weight Matrix for target synthetic network
Table 2. Inferred Weight Matrix for target synthetic network using 5% noisy time-series data
Table 3. Inferred Weight Matrix for target synthetic network using 10% noisy time-series data
Table 4. Average SN, SP of the target network for noise-free, 5% and 10% noisy time-series data
In almost every optimization run with noise-free expression data, fitness score for models reach to zero or very close to zero (<10-6 ) and the estimated parameters are exactly the same as the target. The performance of the reconstruction algorithm is also analyzed using noisy time-series data with the same experimental conditions. Table 2 and 3 shows the estimated network structure and parameter values achieved in a sample run for 5% and 10% noisy data respectively. From Table 2 and 3, it is evident that even in the presence of high level of noise the proposed method has successfully predicted all the regulatory interactions among the genes. Some false positive regulations are also predicted by the search algorithm while working with noisy data. However, the magnitudes of these false positives were pretty small compared to the real regulations. The summary of prediction in terms of sensitivity (SN) and specificity (SP) has been presented in Table 4 using their standard definition based on positive/negative value of wij. This result shows that the prediction contains a full 1.00 sensitivity and the specificity greater than 0.50 even for corrupted GRN data. In the case of 10% noisy data the specificity value 0.55 means prediction of 45% false positive regulations. In an overall, the proposed approach performs a correct prediction of the network structure and a good approximation of the model parameters.
Analysis of Real Microarray Data
The proposed methodology has been analyzed in the reconstruction of well-known SOS DNA repair network in Escherichia coli. It is the longest, most complex and best understood DNA damage-inducible network to be characterized to date. In this work, the experiment was carried out by the gene expression data set collected in Uri Alon Lab. The data set contains expression levels of 8 genes namely uvrD, lexA, umuD, recA, uvrA, uvrY, ruvA, polB. Four experiments were done using various light intensities, in each of which 50 samples were collected at 6 minutes interval for the above 8 genes (Perrin et al., 2013). For reconstructing GRN, this paper used the data sets from experiment 3 and 4. To meet biological reality, data corresponding to each gene was normalized within the range (0, 1] against their maximum value and very small value (~ 10-4) was used to replace all the zero expression levels in these two data sets.
In this work, 6 key regulators namely uvrD, lexA, umuD, recA, uvrA and polB have been considered in the reconstruction process. This sub network is well studied one and the interactions among different genes are known. Being actual microarray data, there is unknown amount of noise inherently present in these data. These noises in the data may have had an influence on the inference method. So, the generated results have been much dispersed. The results have been generated based on the different runs of the algorithm.
The regulations of each gene have been identified using the following search ranges of RNN parameter:
The known regulations and the predicted regulations for all the 6 genes in the SOS repair network identified by the proposed algorithm have been summarized in Table 5.
Table 5. Estimated regulations for SOS DNA repair system
|uvrD||–||–||(Shuhei et al., 2013, Cho et al., 2013, Shuhei et al., 2013)|
|lexA||+||+||+||–||(Shuhei et al., 2013, Cho et al., 2013, Shuhei et al., 2013)|
|umuD||–||+||–||(Noman et al., 2013, Shuhei et al., 2013, Bansal et al., 2013, Gardner et al., 2013)|
|recA||–||+||+||+||(Shuhei et al., 2013, Bansal et al., 2013)|
|uvrA||+||–||–||(Kabir et al., 2013, Shuhei et al., 2013, Perrin et al., 2013)|
|polB||–||+||(Noman et al., 2013, Kabir et al., 2013, Shuhei et al., 2013)|
In each run, the reconstruction process achieves a very small fitness function value which indicates that the inferred network model could match the target time course data pretty well. The comparison between the target dynamics and the estimated model generated dynamics for some selected genes has been shown in Figure 1. From Figure 1, it is evident that the proposed method has the ability to mimic the system dynamics adequately.
The estimated regulations and parameter values were different from run to run in the conducted experiments. However, examining all of the extracted interactions with regard to known roles of selected genes; it is evident that, in most cases, the predictions confirm the prior knowledge, which indicates the suitability of the proposed method. The algorithm also predicts a number of false positives which are either unknown regulations or the side effect of noise presented in micro array data.
Figure 1. Target and estimated dynamics for the SOS DNA repair system
DISCUSSION AND CONCLUSION
Gene regulatory networks are abstract mapping of the more complicated biochemical systems and inherently nonlinear in nature. The inference of the large scale GRN is always impeded by the computational requirements imposed by the underlying model. In this work, recurrent neural network model is used to infer the target gene expression profiles and found very effective in terms of biological actuality and computational feasibility. However, the RNN model contains a large number of parameters and because of the high-degree of freedom of the model; the search becomes very complicated and the global optimum solution may remain undiscovered. To overcome this problem, a second objective function based on skeletal network architecture, has been incorporated in the proposed method which ensures the inference of sparser biological networks. A natural multi-objective computational approach, known as DEMO is used to infer the true structure of underlying biological system. Among the EA based multi-objective search heuristics, elitist version of DEMO is used in the proposed methodology because of its reputation of fast convergence in complex and conflicting search spaces.
Some experimental analysis of the proposed method has been performed to investigate the different components of the algorithm which are necessary for accurate estimation of the regulatory parameters. All of the results are based on experimenting with an artificial gene network and analyzing a real micro array gene expression profile. The performance of a reverse-engineering algorithm always affected by the noise levels presented in the experimental data and the proposed methodology is no exception. Thus, the synthetic gene expression data corrupted with varying noise levels have been used to highlight practicability of the proposed optimization algorithm in estimating robust parameter values. From the experimental results, it is very evident that the proposed method is very efficient in the estimation of true network structure even in the presence of high levels of noise. Moreover, the two performance measures, i.e. SN and SP, showed the resistance of the proposed approach in the case of identifying the false regulations among genes. In the analysis of SOS DNA repair network of E. coli, because of the insufficient amount of gene expression data with high noise, it was very difficult for the proposed method to get any consistent result for the target network in the different experimental runs. Nonetheless, most of the pathways in the reconstructed network were consistent with the results reported in the literature. Although, the proposed reverse-engineering algorithm may not be able to capture the complete network architecture in a single run, because of insufficient data availability corrupted with excessive noise; still, this type of indication can be very useful for the biologists to design additional experiments that may in turn help to identify new interactions among the genes.
With the speedy growth of biological samples categorization and characterization, and enhanced data collection techniques, it is expected that high-dimensional and feature-rich data will be collected which will represent complex dynamics of biological systems. Thus, the development of decoupled version of the proposed method will be a timely contribution to narrow the gap between the imminent methodological needs and the available biological data. Moreover, such decoupling of the original method not only offers a deeper understanding of the mechanisms and processes underlying biological networks, but also eases the immediate parallelization or distributed implementation of the proposed GRN reconstruction algorithm.