Decision Tree Main

ARTICLE IN PRESS
Engineering Applications of Artificial Intelligence 23 (2010) 102–109
Contents lists available at ScienceDirect
journal homepage: www.elsevier.com/locate/engappai

Application of an enhanced decision tree learning approach for prediction of petroleum production
Xiongmin Li, Christine W. Chan
Faculty of Engineering, University of Regina, Regina, SK, Canada S4S0A2

Article info
Article history:
Received 5 January 2008
Received in revised form 19 January 2009
Accepted 4 June 2009
Available online 19 September 2009
Keywords:
Decision tree
Neural network
Attribute dependency
Data mining
Petroleum production prediction

Abstract
Economic evaluation of a new oil well is important for decision-making in the petroleum industry, and this evaluation is based on a good prediction of a well's production. However, it is difficult to accurately predict a well's production due to the complex subsurface conditions of reservoirs. The industrial standard approach is to use either curve-fitting methods or complex and time-consuming reservoir simulations. In this paper, an enhanced decision tree learning approach called the neural-based decision tree (NDT) model is applied in an attempt to investigate its performance in predicting petroleum production. The primary strength of this model is that it can capture dependencies among attributes, and therefore, it is likely to provide an improved or more accurate prediction (Lee and Yen, 2002). This paper presents an application of the NDT model for petroleum prediction. Our models were developed based on the five most significant parameters that affect oil production: permeability, porosity, first shut-in pressure, residual oil and saturation of water. The five parameters were used as input variables, and oil production is the output variable for modeling. Four different models were generated in the modeling process, and each involves a different combination of parameters. First, an overall oil production model is developed using the three geoscience parameters of permeability, porosity and first shut-in pressure. Secondly, two different models, with different input parameters, were developed to predict production in the post-water flooding stage only. The results of the above models indicate that data-driven models may not be effective for classifying the data set. Hence, a trend model was developed in an attempt to improve the effectiveness and accuracy of the predictive model. The result shows that the trend model can provide an improved performance, and its performance is comparable to that of the artificial neural network.
© 2009 Elsevier Ltd. All rights reserved.
1. Introduction
Prediction of oil well production is important in estimating the economic benefit of a well. However, this prediction task is difficult because of the complex subsurface conditions of wells. Even two wells located side by side in the same reservoir may not have the same production (Mattar and Anderson, 2003). In addition, core analysis data obtained from oil fields are limited and tend to be biased. It is also difficult to adequately model the core analysis variables, which are significant factors in oil production. These variables have some dependencies or correlations among each other; however, there is no equation to describe their inter-relationships. The traditional approach of modeling the variables is to use the curve-fitting approach, which is complex and time consuming. The amount of petroleum data collected in databases today has far exceeded our ability to reduce and analyze data without the use of automated analysis techniques. In this paper, we apply a data modeling approach called the neural-decision tree (NDT) model that is distinctive in that it considers the dependencies or correlations among the attributes or variables.
In recent years, artificial intelligence, in its many integrated flavors, has made solid steps towards becoming more accepted in the mainstream of the oil and gas industry (Mohaghegh et al., 2005). Artificial neural network (ANN) models have become a popular choice among the nonlinear petroleum prediction methods (Nguyen and Chan, 2005). Although it is an accurate predictive model, the ANN technique suffers from the weakness that the ANN models generated are black boxes, which do not help users understand the precise relationships among the variables modeled. The arbitrary nature of the internal representation means that there may be dramatic variations between networks of identical architecture trained on the same data.
The objective of this paper is to present the NDT model that can provide explicit information on the processing involved in generating estimates of petroleum production. The model is derived from an integrated approach that combines artificial neural network and decision tree learning algorithms, where the artificial neural network is used to discover the attribute dependencies and the decision tree learning algorithm is applied to predict the petroleum production.

Corresponding author. Tel.: +1 306 5855225; fax: +1 306 5854855.
E-mail addresses: li258@uregina.ca (X. Li), Christine.Chan@uregina.ca (C.W. Chan).
0952-1976/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.engappai.2009.06.003

2. Background literature
2.1. Artificial neural network in petroleum prediction
Artificial neural network (ANN) was first introduced by the psychologist Frank Rosenblatt in 1958 (Minsky, 1969). Since then, the approach has been utilized for predicting oil production based on geological parameters such as porosity, permeability and fluid saturation from conventional well log and core analysis (Mohaghegh et al., 2005). Since the inter-relationships among the parameters are complicated, and the prediction process using the conventional methods can be time consuming and expensive in terms of labor and computational resources, demand for effective models to predict production is increasing rapidly. Compared to the conventional methods, the ANN approach has been shown to generate more accurate and repeatable results (Mohaghegh et al., 1995; Chan and Nguyen, 2003). Some recent efforts at applying ANN methods in petroleum engineering include the following. Nguyen et al. (2004) used a multiple neural network model to make both short- and long-term time series predictions of petroleum production. Mohaghegh et al. (1994) applied ANN to predict the permeability of formations with good accuracy, using the data provided by geophysical well logs. Wong et al. (1998) used ANN to predict permeability with an application to the Ravva oil and gas field, offshore India. Nikravesh et al. (1996) used ANN for the prediction of formation damage during fluid injection into fractured, low permeability reservoirs.
2.2. Decision tree learning for petroleum engineering
A decision tree is an idea generation tool that generally refers to a graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. In data mining, a decision tree is a predictive model; it is a mapping of observations about an item to conclusions about the item's target value (JiaWei and Micheline, 2001).
One of the advantages of decision trees is that they are easy for petroleum engineers to interpret and use. For example, the best potential oil production among the wells can be easily identified by the set of terminal nodes in the tree that have the highest percentage of production, and then the user can focus on the specific wells described by those nodes. In comparison with other methods, a decision tree can be constructed relatively easily and quickly.
Some research works on the application of decision tree learning for petroleum engineering are discussed as follows. Perez et al. (2005) used decision trees to classify the data for permeability predictions based on the well logs. Jensen (1998) applied decision tree analysis to estimate the range of uncertainty in the reservoir production prognosis. Agbon et al. (2003) compared the decision tree model with the fuzzy model for the ranking of reservoirs based on the amount of natural gas production in Venezuela. The existing research on decision tree learning has mostly used implementations that support univariate attribute testing at each node, which is an adequately expressive representation if the training data are assumed to exhibit attribute independence. But in petroleum engineering, the available data sets typically contain attributes that are interdependent. Hence, it is necessary to take the issue of possible dependencies among attributes into consideration when designing decision tree learning models. Since there is often no prior knowledge of whether the data samples under study exhibit attribute dependencies, or what type of dependencies exist between which attributes, a mechanism for exploring these factors must be incorporated into the learning process. These issues are considered in the NDT model we adopted.
3. The neural-decision tree model
The neural-based decision tree learning (NDT) model combines neural networks and decision tree algorithms; details of the NDT model can be found in Li and Chan (2007). The primary strength of this model is that it can capture dependencies among attributes, and therefore, it is likely to provide an improved or more accurate prediction (Lee and Yen, 2002). The architecture of the NDT model for a mixed-type data set, with squared boxes showing the main processing components of the neural network and decision tree, is shown in Fig. 1. The NDT algorithm models a single pass process initialized by the acceptance of training data as input to the neural network model. The generated output is then passed on for decision tree construction, and the resulting rules will be the net output of the NDT model. The arrows show the direction of data flow from the input to the output, with the left-hand-side downward arrows showing the processing stages of nominal attributes, and the right-hand-side arrows indicating the corresponding process for numeric-type data. As illustrated in Fig. 1, the rule extraction steps in this NDT model include the following:
1. Divide the numerical-categorical-mixed dataset into the two parts of numerical subset and nominal subset. For a pure-type data set, no division of the data set is necessary.
2. For the numerical subset, train a feed-forward back-propagation neural network and collect weights between the input layer and the first hidden layer. Then, change each attribute value according to the weights generated by the neural network model.
3. For the categorical subset, train a back-propagation neural network and classify categorical attributes according to the weights generated by the neural network model.
4. Combine the new numerical subset and new categorical subset into a new numerical-categorical-mixed dataset.
5. The new dataset is fed as input to the C4.5 system to generate the decision tree and rules.
6. Some evaluation of the results is conducted to obtain a reliable estimate of the accuracy of the classifier or predictor model.
Fig. 1. Design of the NDT Model. [Flowchart boxes: Mixed Type; Nominal Type / Numeric Type; Encode / Normalization; Neural Network (Feed-forward Back-propagation Model); Collect hidden weights; Adjust data values; Combine/Merge Attributes; Recombine to mixed type; Decision Tree Algorithm with Post Pruning; Decision Rules.]
4. Application problem domain
4.1. Prediction of petroleum production
The objective is to use the NDT model for identifying the relationships among the core variables responsible for oil production prediction. The performance of the NDT model is assessed using data sets obtained from pools of wells in Saskatchewan, Canada.
4.2. Study area
The Weyburn Field is one of the largest reservoirs located in the north of the Williston Basin in southeastern Saskatchewan, Canada. Petroleum production of this reservoir is from Mississippian age carbonates at a depth of more than 1300 m. The field was discovered in 1954 and oil production occurred until February 1963, when water flooding was implemented. The maximum production appeared in 1965 at 46,000 barrels/day.
The Saskatchewan Industry and Resources supplied the dataset on the Weyburn Field. The dataset consists of data records on production rate, core analysis and pressure data from the Weyburn Midale field in Saskatchewan, Canada spanning 50 years, from the 1960s to 2006. The core and Practical Drillstem Test (DST) analysis data provide information about the geoscience parameters, such as permeability, porosity and first shut-in pressure.
Since the modeling process requires accurate identification of input parameters, all the relevant parameters related to petroleum prediction should be considered. However, due to limitations in the data, the five most significant parameters of permeability, porosity, first shut-in pressure, residual oil and water saturation are selected for modeling. In our investigation on modeling production data, four different models were developed based on different combinations of parameters and data. The first model predicted total production using the three geoscience parameters of permeability, porosity and first shut-in pressure as input variables. Then, two different models with different input parameters were developed to predict production in the post-water flooding stage only: (1) one model includes only the three geoscience parameters of permeability, porosity and water saturation, and (2) another model extends that model by using the four geoscience parameters of permeability, porosity, water saturation and residual oil. The fourth model was built based on an assumption that the production data exhibit a harmonic trend in the particular area of study. Therefore, instead of directly estimating the well's production, the last model predicted the behaviors or the harmonic trends of the well's production data based on permeability and porosity. The model predicted the production trend of oil in the pre-water flooding stage. In the model, the geoscience data were selected and grouped into specific rock formations by a petroleum engineer. Details of the variations in these models will be discussed in the following subsections.
4.3. Data preprocessing
The original data set used for modeling suffered from the commonplace inadequacies of being incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors, or outlier values that deviate from the expected), and inconsistent (containing discrepancies in the tag names used to label attributes). In other words, there are reported errors, unusual values, and inconsistencies in the data recorded for some transactions. Due to these inadequacies in the dataset as well as the fact that a large amount of redundant data can slow down the analysis process, data preprocessing was conducted so as to improve the efficiency of the analysis process. Since the geoscience values of permeability, porosity, water saturation, and residual oil are measured in different depth ranges at each well in the core analysis, we calculated the weighted average value of each geoscience parameter and used it as the corresponding input value for each well.
4.4. Experiments and results
This section presents the experimental results and analysis of the four models built by the NDT structures. These four models can be classified into three different types: overall oil production model, post-water flooding model and trend predicting model. Details of each model are presented below.
4.4.1. Overall oil production model

First, the entire oil production data set was used, so the model was called an overall oil production model. The model includes the production rates and the three geoscience parameters of permeability, porosity and first shut-in pressure as input variables for the prediction task. In this study, we attempt to build the classification model by using the data discretization technique to divide the numeric production data into ranges of values. The entropy-based discretization technique (JiaWei and Micheline, 2001) was used here to discretize the numeric class attribute into different specific sub-ranges. This approach is a commonly used splitting technique that explores distribution information in its determination of split-points, which are data values for partitioning an attribute range. For example, to discretize the numeric class attribute, A, the method selects the value of A that has the minimum entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization. The reason we adopted this approach was that the entropy-based discretization technique uses class information to reduce the size of the data set, which makes it more likely that the interval boundaries, or split-points, are defined to occur in places that may help improve classification accuracy. This method generated six interval ranges from the data set, which are listed below from low to high:
Class A: [2,000–97,767], Class B: [97,767–199,633], Class C: [199,633–269,082],
Class D: [269,082–373,839], Class E: [373,839–469,633], Class F: [469,633–620,000].
Fig. 2 shows the data distribution of each class. As seen from
the graph, most of the data fall into Classes A and B. This is
because in reality, most of the oil wells in fact have low
production rates, and only a few of them have high production
capacities.
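The entropy-based splitting described above can be illustrated with a minimal sketch. It assumes the textbook form of the technique (choose the boundary that minimizes the size-weighted class entropy of the two resulting intervals, then recurse); the function names, the `max_depth` stopping rule and the toy data are hypothetical and are not taken from the paper, which does not state its stopping criterion.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (split_point, weighted_entropy) for the best binary split.

    Candidate split points are midpoints between consecutive distinct
    sorted values; the one with minimum weighted entropy is chosen.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        w = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if w < best[1]:
            best = ((lo + hi) / 2.0, w)
    return best


def discretize(values, labels, max_depth=2):
    """Recursively split; return the sorted list of split points found.

    max_depth is an illustrative stopping rule (an assumption).
    """
    if max_depth == 0 or len(set(labels)) < 2:
        return []
    split, _ = best_split(values, labels)
    if split is None:
        return []
    cuts = [split]
    for keep in (lambda v: v <= split, lambda v: v > split):
        part = [(v, lab) for v, lab in zip(values, labels) if keep(v)]
        if part:
            vs, ls = zip(*part)
            cuts += discretize(list(vs), list(ls), max_depth - 1)
    return sorted(cuts)


# Toy data (hypothetical): low values belong to one class, high to another.
toy_values = [1, 2, 3, 10, 11, 12]
toy_labels = ["low", "low", "low", "high", "high", "high"]
toy_split, toy_entropy = best_split(toy_values, toy_labels)
toy_cuts = discretize(toy_values, toy_labels)
```

On the toy data the only useful boundary lies between 3 and 10, so a single split point is returned and both resulting intervals are pure.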
The training result of the model is shown in Fig. 3. From the result, we obtained the link weights between the input layer and the hidden layer. Then, the original numeric data set was transformed using Eqs. (1)–(3) shown below into a new dataset, whose values are equivalent to the inputs to the hidden layer of the trained neural network, and the NDT model was applied to the new dataset to generate decision rules.
H1 0:37 4:24Permeability 4:2Porosity 2:63Pressure
H2 0:76 1:36Permeability1:24Porosity 2:07Pressure
H3 0:82 1:6Permeability 1:31Porosity 2:12Pressure
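The transformation expressed by these equations, where each new attribute is the net input of one hidden node (a bias plus a weighted sum of the original attributes), can be sketched as follows. The weights, biases and sample row below are illustrative placeholders only, not the coefficients learned in the paper, and the min-max normalization is an assumption based on the "Normalization" box in Fig. 1.

```python
def min_max_normalize(column):
    """Scale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def hidden_inputs(row, weights, biases):
    """Net input of each hidden node: bias_j + sum_k weights[j][k] * row[k]."""
    return [b + sum(w * x for w, x in zip(ws, row))
            for ws, b in zip(weights, biases)]


def transform_dataset(rows, weights, biases):
    """Replace every original row by its vector of hidden-node inputs."""
    return [hidden_inputs(row, weights, biases) for row in rows]


# Placeholder parameters (hypothetical): two hidden nodes, three inputs
# ordered as (permeability, porosity, pressure), all already normalized.
example_weights = [[0.5, -1.0, 2.0],
                   [1.0, 1.0, 0.0]]
example_biases = [0.1, -0.2]
example_rows = [[0.2, 0.4, 0.1]]

new_rows = transform_dataset(example_rows, example_weights, example_biases)
```

The transformed rows, together with the original class labels, would then be passed to the decision tree learner in place of the raw attributes.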
The unpruned and pruned tree versions of the NDT model generated 24 rules and 17 rules, respectively; the entire trees are not shown due to their large sizes. The topmost splitting attribute is permeability, which means that the variable of permeability has the maximum correlation with the predicted production. The attribute at the lower level is porosity, which appears in the sub-branches frequently. The first shut-in pressure is the least important attribute and appears at or near the leaf nodes in the trees.
As shown in Table 1, in comparison with the C4.5 decision tree learning algorithm, the NDT model reduces the tree size and number of rules by about half with only a 5% decrease in classification accuracy. A decrease in the number of rules is an improvement because domain experts such as petroleum engineers can more easily validate them. Hence, the NDT model applied in this study provides better explanation capability since it generates a comparatively smaller rule set with an acceptable level of classification accuracy. In other words, with the sacrifice of a little classification accuracy, the NDT model is able to provide some explicit heuristics for classification that support predicting oil production from a new well. Some sample rules generated from the original dataset using the C4.5 decision tree model and the rules generated from the new dataset using the NDT model are shown in Figs. 4 and 5, respectively. Figs. 6 and 7 show the sample trees for these two models.
Fig. 2. Distribution of data in each class.
Fig. 3. Training result of the original dataset.
4.4.2. Zooming-in: post-water flooding model
Water flooding is often used to enhance oil recovery. After water is injected into the well, the mechanism of the well is changed and oil well production is significantly increased. In order to better model the oil production of the post-water flooding stage, two models were built: one with the three geoscience parameters of permeability, porosity and water saturation only, and the other with the four geoscience parameters of permeability, porosity, water saturation and residual oil. Only 121 wells in total are available, of which 81 wells were used for training and 40 for testing. The two post-water flooding models are discussed as follows:
4.4.2.1. Post-water flooding model with three parameters. This model was set up for the stage of post-water flooding prediction with the three most significant geoscience parameters of permeability, porosity and water saturation. Water saturation was adopted as an input geoscience parameter instead of first shut-in pressure because first shut-in pressure was measured during the stage of pre-water flooding and hence is not applicable during the stage of post-water flooding.
Interval ranges of values derived from the data set using the method described above are:
Class A: [0–6707.9], Class B: [6707.9–18001.85], Class C: [18001.85–36400.2],
Class D: [36400.2–55736.95], Class E: [55736.95–65484.95], Class F: [65434.95–83956.2].
An artificial neural network (ANN) model with the same input and output variables was built as well. The model adopts the popular three-layer feed-forward ANN topology, the sigmoid activation function, and the classical back-propagation training method of ANN. The Weka [Weka tool kit, 2007] software tool kit was used for implementation and the performance of the model is summarized in Table 2.
The overall prediction results generated by the NDT model are similar to those of the ANN model: both models have almost the same misclassification rate; the NDT model's RMSE is 0.0568 higher than that of the ANN, its mean absolute error is 0.0145 lower, its relative absolute error is 6.1 percentage points lower and its root relative squared error is 16.45 percentage points higher. However, both NDT and ANN have a high misclassification rate of over 71% during the stage of post-water flooding production. This suggests that the model is inadequate; hence, we attempted to process the input data more efficiently and some new attributes were added to improve the prediction accuracy of the models.
Table 1
Comparison of NDT model and C4.5 results.
Measures                          C4.5 without pruning   C4.5 with pruning   NDT without pruning   NDT with pruning
Tree size (# of nodes)            131                    117                 67                    53
Number of rules (# of leaves)     52                     45                  24                    17
Test set size                     320                    320                 320                   320
Correct classification rate (%)   86.88                  85.31               80.94                 79.69
Misclassification rate (%)        13.12                  14.69               19.06                 20.31
Mean absolute error               0.13                   0.15                0.19                  0.20
Mean squared error                5.51                   6.90                11.63                 13.20
Computation time (milliseconds)   15692                  14070               28211                 26468
Fig. 4. Sample rule generated by C4.5 decision tree model.
Fig. 5. Sample rule generated by NDT model.
Fig. 6. Sample tree generated by C4.5 decision tree model.
Fig. 7. Sample tree generated by NDT model.
Table 2
Comparison of NDT model and ANN results.
Measures                                  NDT cross-valid   ANN cross-valid
Number of rules or hidden nodes in ANN    35                3
Number of samples                         122               122
Misclassification rate (%)                71.0744           71.1
Root mean square error (RMSE)             0.4033            0.3465
Mean absolute error                       0.2132            0.2277
Relative absolute error (%)               89.4032           95.5131
Root relative squared error (%)           116.8512          100.4011
4.4.2.2. Post-water flooding model with four parameters. In order to improve the effectiveness of the predictive model, it was extended by adding another geoscience parameter: residual oil. The same configuration of the model was built and it generated the results shown in Table 3.
As shown in Table 3, both the NDT and ANN models have even higher error compared with the previous one. This means the parameter of residual oil has little influence on oil production and that more attributes do not lead to a better performance. Interestingly, in the nodes of the NDT model corresponding to higher production, porosity instead of permeability begins to appear at the top layer of the generated NDT model tree. This means that the attribute of porosity became the most important one for predicting the post-water flooding production. The physical explanation of this fact is that the oil flow in the rock rapidly increases after water flooding, and with more void space within a rock, the oil can be more easily pushed onto the surface.
Table 3
Comparison of NDT model and ANN results.
Measures                                  NDT cross-valid   ANN cross-valid
Number of rules or hidden nodes in ANN    30                3
Number of samples                         122               122
Misclassification rate (%)                76.8595           72.7273
Root mean square error (RMSE)             0.4233            0.3488
Mean absolute error                       0.2215            0.223
Relative absolute error (%)               92.884            93.5436
Root relative squared error (%)           122.6586          101.0489
4.4.3. Trend predicting model
The error analysis of the post-water flooding model reported above led us to the conclusion that the mechanisms used so far were somewhat superficial, and these configurations may not allow the data-driven models to classify oil production into physically interpretable classes.
In order to improve the effectiveness of the classification model, we attempt to predict the trend of the oil well instead of directly estimating its production. We worked with petroleum engineers who preprocessed the geoscience data by filtering the data based on their expertise. After preprocessing the data, 31 oil wells, where the production values have a harmonic relationship over time, were chosen and used to build the model. The empirical Arps decline equation, which represents the relationship between production rate and time for oil wells during the pseudosteady-state period, is shown as follows (Arps, 1945):
$$q(t) = \frac{q_i}{1 + D_i t}$$
where q(t) is the oil production rate at time t, Di is the initial nominal decline rate and qi is the initial oil production rate. In addition, the values for the parameters of permeability and porosity are not averaged over all depths; instead, the average values from one particular formation only are taken. The production value is calculated as the summation of productions from different formations of one well.
To build this model, we first calculated the qi and Di of each well. The Sigma Plot software package [Sigma Plot, 2007] was used to calculate the values of qi and Di based on the production value of each well. Instead of directly predicting the production, we focused on predicting the qi and Di values in the harmonic equation, which would give us a better understanding of the trend and behavior of an oil well. In this way, we can also capture characteristics of the reservoir and data flow organization. In the model, we only used permeability as the input variable to predict the value of Di, and used porosity to predict the value of qi. 28 wells were used for training and 3 for testing. The value range of
Di is [0, 0.38] and the value range of qi is [23.798, 161.699].
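As an illustration of how qi and Di can be estimated from a well's production history, the sketch below uses a simple linearization of the harmonic decline equation: since 1/q(t) = 1/qi + (Di/qi) t, an ordinary least-squares line through the points (t, 1/q) gives qi from the intercept and Di from the slope. This is a swapped-in technique for illustration; the paper used the Sigma Plot package, whose fitting procedure is not described here. The function names and the synthetic data are hypothetical.

```python
def harmonic_rate(t, qi, di):
    """Arps harmonic decline: q(t) = qi / (1 + di * t)."""
    return qi / (1.0 + di * t)


def fit_harmonic(times, rates):
    """Estimate (qi, di) by least squares on the linearized form.

    1/q = a + b*t with a = 1/qi and b = di/qi, so qi = 1/a and di = b/a.
    Assumes all rates are positive and at least two distinct times.
    """
    ys = [1.0 / q for q in rates]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return 1.0 / intercept, slope / intercept


# Synthetic, noise-free history (hypothetical values within the reported ranges).
true_qi, true_di = 100.0, 0.2
months = list(range(10))
history = [harmonic_rate(t, true_qi, true_di) for t in months]

est_qi, est_di = fit_harmonic(months, history)
```

With noisy field data the linearized fit weights late, low-rate points more heavily than a direct non-linear fit would, so the two approaches can give somewhat different estimates.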
The experimental results of NDT and ANN model are shown in
Tables 4 and 5, respectively.
It can be observed from Tables 4 and 5 that the trend model demonstrated an improved performance in prediction of qi and Di by both NDT and ANN models. In the experiment for Di prediction, the Mean Absolute Error of the NDT model is 0.0086, while that of the ANN model is 0.0395. Thus, the NDT model is better than the ANN model in Di prediction. In comparison to the value range of Di (0–0.38), the Mean Absolute Error in the NDT model is about 10%, which is small. In qi prediction, the Mean Absolute Error of the NDT model is 38.32, while that of the ANN model is 32.11. Since the value range of qi is from 23.798 to 161.699, the Mean Absolute Error is also about 10%, which is still acceptable when compared with the range of qi. However, the performance of the ANN model is better than that of the NDT model in qi prediction.
Table 4
Experimental results of NDT.
Measures                          NDT qi prediction        NDT Di prediction
                                  Training    Testing      Training    Testing
Number of samples                 28          3            28          3
Root mean square error            17.17       42.31        0.0653      0.0094
Mean absolute error               13.48       38.32        0.0315      0.0086
Relative absolute error (%)       62.18       124          92.3        62.8
Root relative squared error (%)   61.39       127.86       94.6        51.3
Correlation coefficient           0.79        0.50         0.52        0.95
Table 5
Experimental results of ANN.
Measures                          ANN qi prediction        ANN Di prediction
                                  Training    Testing      Training    Testing
Number of samples                 28          3            28          3
Root mean square error            32.43       38.46        0.0607      0.0546
Mean absolute error               25.37       32.11        0.039       0.0395
Relative absolute error (%)       117.03      104          115.36      287.83
Root relative squared error (%)   115.96      116.23       87.82       298.03
Correlation coefficient           0.337       0.91         0.484       0.94
The relative absolute error is the total absolute error, and the root relative squared error is the square root of the total squared error, each normalized by the corresponding error of the predictor that always predicts the average value (Witten and Frank, 2005). If the values of these relative errors are high, it means the predicted values are scattered away from the average predicted value. If their values are low, then the predicted values tend to cluster close to the predicted average. Hence, the relative absolute error and the root relative squared error only reflect the basic predictability or unpredictability of the output variable, not the accuracy of the model. Tables 4 and 5 show that for testing, these relative errors of Di in the ANN model are higher than in the NDT model, which means the output value of the NDT model tends to lie fairly close to its average value and is therefore easier to predict compared with the ANN model. However, in qi prediction, these relative errors of both NDT and ANN are high, which demonstrates that predictability of the model is low.
In summary, the comparison of NDT and ANN in predicting qi and Di is inconclusive: the NDT is better in Di prediction according to the Mean Absolute Error, and the reverse is true for qi prediction.
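For reference, the two relative error measures discussed above can be computed as in the following minimal sketch. It follows the standard definitions (total error divided by the error of a predictor that always outputs the mean of the actual values, expressed as a percentage). One assumption to note: the mean is taken here over the supplied actual values, whereas a tool such as Weka may use the mean of the training data instead. The function names and numbers are illustrative.

```python
import math


def relative_absolute_error(actual, predicted):
    """RAE (%): sum|p - a| divided by sum|a - mean(a)|, times 100."""
    mean_a = sum(actual) / len(actual)
    num = sum(abs(p - a) for p, a in zip(predicted, actual))
    den = sum(abs(a - mean_a) for a in actual)
    return 100.0 * num / den


def root_relative_squared_error(actual, predicted):
    """RRSE (%): sqrt(sum(p - a)^2 / sum(a - mean(a))^2), times 100."""
    mean_a = sum(actual) / len(actual)
    num = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    den = sum((a - mean_a) ** 2 for a in actual)
    return 100.0 * math.sqrt(num / den)


# Illustrative values only (not from the paper).
actual_values = [1.0, 2.0, 3.0, 4.0]
mean_predictor = [2.5, 2.5, 2.5, 2.5]   # always predicts the average
perfect_predictor = list(actual_values)

rae_mean = relative_absolute_error(actual_values, mean_predictor)
rrse_mean = root_relative_squared_error(actual_values, mean_predictor)
rae_perfect = relative_absolute_error(actual_values, perfect_predictor)
```

Under these definitions a value of 100% means the model does no better than always predicting the average, and values above 100%, as in several rows of Tables 4 and 5, mean it does worse.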
5. Conclusions
This paper reports on an ongoing research program that has
the objective of predicting petroleum production using different
machine intelligence techniques. The initial research goal is to
perform prediction modeling for petroleum production, and the
approach we used is the neural-decision tree (NDT) model. It is
shown that the NDT model, being analogous to decision tree
algorithms, have some advantages compared to articial neural
network. From the experimentation presented here involving
different strategies and different parameters combinations, the
following conclusions can be made:
Firstly, an overall oil production model was developed using
the three geoscience parameters of permeability, porosity and rst
shut-in pressure, the model has an average classication accuracy
of 80.31%. Although this classication accuracy shows a 5%
decrease compared with the regular C4.5 model, the NDT model
reduces the tree size and number of rules by half, which makes it
easier for petroleum engineers to analyze results from the rules it
generated. In addition, permeability is shown to be the most
important variable in predicting petroleum production.
Secondly, post-water ooding models were developed using
three geoscience parameters and four geoscience parameters. In
spite of low classication accuracy in these models, the models
demonstrate porosity is the most signicant factor in prediction of
petroleum production instead of permeability in the post-water
ARTICLE IN PRESS
108
ooding stage. This nding is consistent with the fact that the oil
well had different characteristics after water ooding was applied.
Also, the reason of the low classication accuracy using just three
attributes is likely because the input data have not been processed
more efciently, and lack of some domain knowledge of the postwater ooding conditions of the well.
Thirdly, a trend model was developed in order to improve the
effectiveness of the prediction model. In this model, the oil wells
with harmonic relationships between the variables of production
rate and time were selected for prediction of the parameters qi
and Di of the empirical Arps decline equation. The model
demonstrated an improved performance in prediction, with a
Mean Absolute Error of 38.32 in predicting qi and 0.0086 in
predicting Di. By comparing these results with the
value ranges of qi [0, 0.38] and Di [23.798, 161.699], we found that
the Mean Absolute Errors are small.
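As an illustration of the quantities discussed above, the following minimal Python sketch evaluates the harmonic form of the Arps decline equation, q(t) = qi / (1 + Di t), recovers qi and Di from a production history through the linearization 1/q = 1/qi + (Di/qi) t, and computes a Mean Absolute Error. The function names and the synthetic data are assumptions made for this example. It is not the procedure used in the study, where qi and Di were predicted by the NDT model from geoscience parameters.

```python
def harmonic_rate(qi, di, t):
    """Harmonic Arps decline: q(t) = qi / (1 + di * t)."""
    return qi / (1.0 + di * t)


def fit_harmonic(times, rates):
    """Estimate (qi, di) by least squares on 1/q = 1/qi + (di/qi) * t."""
    n = len(times)
    inv = [1.0 / q for q in rates]
    mean_t = sum(times) / n
    mean_y = sum(inv) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, inv))
    slope = sxy / sxx                      # equals di / qi
    intercept = mean_y - slope * mean_t    # equals 1 / qi
    qi = 1.0 / intercept
    di = slope * qi
    return qi, di


def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic, noise-free history; the numbers are made up for illustration.
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
rates = [harmonic_rate(100.0, 0.25, t) for t in times]

qi_hat, di_hat = fit_harmonic(times, rates)
fitted = [harmonic_rate(qi_hat, di_hat, t) for t in times]
mae = mean_absolute_error(rates, fitted)
```

Because the synthetic history is exactly harmonic, the fit recovers the generating parameters up to rounding. On field data the reciprocal transform weights low-rate points more heavily, so a nonlinear fit may be preferable.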
In addition, we also found that the performance of the NDT model
is comparable to that of the artificial neural network. The advantages of
the NDT model when compared to the ANN model are:
• The generated tree-like structures are easy to understand for
decision makers. They make it possible for petroleum engineers
to have a good overview of the relationship between the
geoscience attributes if we are able to validate the goodness of
the rules.
• The training process of the NDT model is much faster than that of ANN
and always converges.
• The knowledge inside an NDT tree may also help in parameter
selection and in assessing the interdependent relationships
among attributes.
Lastly, we observed that although the general prediction
performance of the NDT model is good, the prediction
inaccuracies of some models are high. This high
inaccuracy may be due to data-related problems, which include:
• The limited availability of data for the model. For example, only
31 oil wells with a harmonic relationship between production
and time could be identified for training the trend model after
preprocessing the geoscience data.
• The oil wells may contain different rock formations having
different physical properties. However, this difference is
ignored.
• The weighted average data may be too rough to capture the oil
well property. Although the pressure data use weighted
average data of different formations instead of the entire
weighted data for all formations, this is not a
significant attribute in predicting production rate.
In order to validate the model or improve the prediction
accuracy, the following recommendations can be made:
• Instead of focusing on generating perfect rules that are guaranteed
to give the correct classification on all instances in the training
set, we would rather generate good rules that avoid
overfitting the training set and thereby stand a better chance
of performing well on new test instances. For example, we can
split the training data into two sets: a growing set and a pruning
set. The growing set is used to form a rule using the unpruned
NDT model. Then, a conditional is deleted from the rule, and
the effect is evaluated by trying out the truncated rule on the
pruning set and seeing whether its coverage increases
over that of the original rule. This pruning process repeats until the
rule cannot be improved by deleting any further
conditionals. The whole procedure is repeated for
each class, obtaining one best rule for each class of the decision
variable. It is important to separate the growing and pruning
sets because it would be misleading to evaluate a rule on the
data used to form it, and serious errors would result if rules
that overfit the data were preferred (Witten and Frank, 2005).
• Since the relations between the geoscience parameters are highly
nonlinear, the prediction would likely be improved by using
more geoscience parameters as input attributes. In this way,
more dependencies among attributes are considered and
identified in the NDT model.
• When data-driven models are used (NDT,
ANN, etc.), some domain knowledge of petroleum engineering
is needed to analyze and validate the rules and results
generated by these models.
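The growing-set and pruning-set procedure recommended above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the NDT implementation. A rule is assumed to be a list of (attribute, operator, threshold) tests already grown elsewhere, and its worth is measured as accuracy on the pruning set, which is one common choice and stands in for the evaluation criterion described in the text. All names, thresholds and data values are hypothetical.

```python
import operator

OPS = {"<=": operator.le, ">": operator.gt}


def covers(conditions, instance):
    """True if the instance satisfies every (attribute, op, threshold) test."""
    return all(OPS[op](instance[attr], thr) for attr, op, thr in conditions)


def worth(conditions, target, data):
    """Fraction of instances where 'rule fires' agrees with 'class is target'."""
    correct = sum(covers(conditions, x) == (x["class"] == target) for x in data)
    return correct / len(data)


def prune_rule(conditions, target, pruning_set):
    """Greedily delete conditions while worth on the pruning set improves."""
    best = list(conditions)
    best_worth = worth(best, target, pruning_set)
    improved = True
    while improved and len(best) > 1:
        improved = False
        for i in range(len(best)):
            candidate = best[:i] + best[i + 1:]
            w = worth(candidate, target, pruning_set)
            if w > best_worth:
                best, best_worth, improved = candidate, w, True
                break
    return best, best_worth


# Hypothetical rule for class "high", assumed to come from the growing set.
grown_rule = [("permeability", ">", 10.0),
              ("porosity", ">", 0.2),
              ("pressure", "<=", 5.0)]

# Hypothetical pruning set, kept separate from the data that formed the rule.
pruning_set = [
    {"permeability": 15.0, "porosity": 0.25, "pressure": 4.0, "class": "high"},
    {"permeability": 20.0, "porosity": 0.30, "pressure": 9.0, "class": "high"},
    {"permeability": 5.0, "porosity": 0.25, "pressure": 3.0, "class": "low"},
    {"permeability": 12.0, "porosity": 0.10, "pressure": 8.0, "class": "low"},
]

pruned_rule, pruned_worth = prune_rule(grown_rule, "high", pruning_set)
```

In this toy data the pressure test excludes a well that belongs to the target class, so deleting it raises the worth on the pruning set from 0.75 to 1.0, while deleting either remaining test would lower it again.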
6. Future work
It can be observed that the problem of predicting oil
production is difficult. Although we have utilized different
approaches on different aspects of the problem, such as data
preprocessing by the NDT model, different combinations of input
parameters, and different AI techniques, the prediction
results are still not as accurate as desired. It seems that oil
production depends on many factors, some of which have not
been taken into account. Moreover, a factor such as permeability
cannot be measured accurately because its values vary at
different locations and in different rock formations.
The NDT model has been applied for petroleum prediction,
which is a domain that involves primarily numeric data. For
future work, the project can be extended to include more data
of the categorical and mixed types. Further research is
needed to define formal processes for integrating attribute grouping
into the construction of a multivariate decision tree for
categorical data modeling. Without a defined process, the multivariate
tree model generated is entirely dependent on the user's
interpretation of link weights, which are used to prune errors. A
heuristic search based on link weights should also be considered
for constructing a multivariate tree that favours low class entropy
and supports meaningful attribute groupings. It is our belief that,
if all the dependencies among input attributes can be captured,
more improvement in data classification accuracy can be realized.
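To make the notion of low class entropy concrete, the sketch below computes the weighted class entropy of a split on a linear combination of attributes, which is the kind of test a multivariate decision tree places at a node. It is only an illustration: the weights, threshold, attribute names and sample values are hypothetical, and the weights are not link weights taken from a trained network.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def split_entropy(samples, labels, weights, threshold):
    """Weighted class entropy after splitting on sum(w * x) <= threshold."""
    left, right = [], []
    for x, y in zip(samples, labels):
        score = sum(w * x[name] for name, w in weights.items())
        (left if score <= threshold else right).append(y)
    total = len(labels)
    # Empty partitions contribute nothing to the weighted sum.
    return sum(len(part) / total * entropy(part)
               for part in (left, right) if part)


# Hypothetical samples and a hypothetical attribute grouping.
samples = [
    {"permeability": 1.0, "porosity": 0.1},
    {"permeability": 2.0, "porosity": 0.2},
    {"permeability": 8.0, "porosity": 0.3},
    {"permeability": 9.0, "porosity": 0.1},
]
labels = ["low", "low", "high", "high"]
weights = {"permeability": 1.0, "porosity": 10.0}

before = entropy(labels)
after = split_entropy(samples, labels, weights, threshold=5.0)
```

A heuristic search of the kind suggested above would compare candidate groupings by a score such as `after` and prefer those that lower it the most.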
Some models with high reported errors need to be further
investigated in different ways. It would also be of interest to apply
the NDT model for prediction of gas and water in the oil wells,
integrating other geoscience parameters from DST, well log, and
core analysis data sources. It is necessary to collaborate with
petroleum engineers who can help classify geoscience data into
different rock formation groups and develop one model for each
group. In that approach, permeability and porosity will not be
averaged from all depths but instead will be averaged from values
derived from one formation only. Also, the production will be
calculated as the summation of productions from different
formations of one well.
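The per-formation scheme described above can be sketched in a few lines of Python. The record layout, the formation labels and the numeric values are assumptions made for illustration, and the per-formation predictions are placeholders for the output of one model per formation group.

```python
from collections import defaultdict


def average_by_formation(core_samples):
    """Average permeability and porosity separately for each formation."""
    groups = defaultdict(list)
    for s in core_samples:
        groups[s["formation"]].append(s)
    return {
        name: {
            "permeability": sum(s["permeability"] for s in rows) / len(rows),
            "porosity": sum(s["porosity"] for s in rows) / len(rows),
        }
        for name, rows in groups.items()
    }


def well_production(formation_predictions):
    """Total well production as the sum of per-formation predictions."""
    return sum(formation_predictions.values())


# Hypothetical measurements for one well with two formations, "A" and "B".
core_samples = [
    {"formation": "A", "permeability": 10.0, "porosity": 0.125},
    {"formation": "A", "permeability": 20.0, "porosity": 0.375},
    {"formation": "B", "permeability": 4.0, "porosity": 0.5},
]
averages = average_by_formation(core_samples)

# Placeholder outputs of the per-formation models (made-up values).
total = well_production({"A": 120.0, "B": 30.0})
```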
Furthermore, the NDT model implemented in this project
cannot deal with the problem of missing attribute values. This issue
needs to be investigated so that an effective process for dealing
with missing values can be defined. One possible way is to use the
attribute mean to fill in the missing value or to use the attribute
mean for all samples belonging to the same class as the given
tuple. However, this method biases the data, and the filled-in
value may not be correct. Another popular solution to this
problem is to use the most probable value to fill in the missing
value (Han and Kamber, 2001), which uses all the information
from the present data set to predict the missing value. The
judgment on the most probable value is made by the user.
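The two mean-based strategies mentioned above can be illustrated with a short Python sketch. Missing values are assumed to be encoded as None, and the function names, record layout and sample values are hypothetical. The most-probable-value strategy is not shown because it requires a separate predictive model.

```python
def mean_impute(rows, attr):
    """Replace None in rows[attr] with the mean of the observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(observed) / len(observed)
    return [dict(r, **{attr: mean if r[attr] is None else r[attr]})
            for r in rows]


def class_mean_impute(rows, attr, class_key="class"):
    """Replace None with the mean of observed values in the same class.

    Raises KeyError if a class has no observed value for the attribute.
    """
    sums, counts = {}, {}
    for r in rows:
        if r[attr] is not None:
            sums[r[class_key]] = sums.get(r[class_key], 0.0) + r[attr]
            counts[r[class_key]] = counts.get(r[class_key], 0) + 1
    filled = []
    for r in rows:
        value = r[attr]
        if value is None:
            value = sums[r[class_key]] / counts[r[class_key]]
        filled.append(dict(r, **{attr: value}))
    return filled


# Hypothetical records; the last one is missing its permeability.
rows = [
    {"permeability": 1.0, "class": "low"},
    {"permeability": 2.0, "class": "low"},
    {"permeability": 6.0, "class": "high"},
    {"permeability": None, "class": "high"},
]

overall = mean_impute(rows, "permeability")
per_class = class_mean_impute(rows, "permeability")
```

In this toy example the overall mean fills in 3.0 while the class mean fills in 6.0, which shows how strongly the filled-in value depends on the chosen strategy.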
Acknowledgements
The generous support of a grant from the Canada Research
Chairs Program for the first author is gratefully acknowledged. The
authors would also like to thank Hahn H. Nguyen and Jon Hromek
for their contributions to this work.
References
Agbon, I.S., Aldana, G.J., Araque, J.C., 2003. Fuzzy ranking of gas exploitation
opportunities in mature oil fields in Eastern Venezuela. SPE Paper 84337
presented at the SPE Annual Technical Conference and Exhibition, Denver, CO, USA,
5-8 October 2003.
Arps, J.J., 1945. Analysis of decline curves. Trans. AIME 160, 228-247.
Chan, C.W., Nguyen, H.H., 2003. An analysis of artificial neural networks versus
curve estimation techniques in predicting petroleum production. Paper EIA03038, vol. 1, International Society for Environmental Information Sciences, pp.
375-385.
Han, Jiawei, Kamber, Micheline, 2001. Data Mining: Concepts and Techniques.
Morgan Kaufmann Publishers.
Jensen, T.B., 1998. Estimation of production forecast uncertainty for a mature
production license. SPE Paper 49091 presented at the SPE Annual Technical
Conference and Exhibition, New Orleans, USA, September 1998.
Lee, Y.-S., Yen, S.-J., 2002. Neural-based approaches for improving the accuracy of
decision trees. In: Proceedings of the Data Warehousing and Knowledge
Discovery Conference, pp. 114-123.
Li, X., Chan, C.W., 2007. Towards a neural-network-based decision tree learning
algorithm for petroleum production prediction. In: Proceedings of the IEEE CCECE
2007, Vancouver, BC, Canada, April 22-26, 2007.
Mattar, L., Anderson, D.M., 2003. A systematic and comprehensive methodology
for advanced analysis of production data. SPE Paper 84472 presented at the SPE
Annual Technical Conference and Exhibition, Denver, CO, USA, 5-8 October.
Minsky, M., Papert, S., 1969. Perceptrons: An Introduction to
Computational Geometry. MIT Press.
Mohaghegh, S., Arefi, R., Ameri, S., Rose, D., 1994. Design and development of an
artificial neural network for estimation of formation permeability. SPE Paper
28237 presented at the SPE Petroleum Computer Conference, Dallas, USA, 31
July-3 August 1994.
Mohaghegh, S., Balan, B., Ameri, S., 1995. State-of-the-art in permeability
determination from well log data: Part 2: Verifiable, accurate permeability
prediction, the touch-stone of all models. In: SPE Paper 30979, Proceedings of
the SPE Eastern Regional Conference and Exhibition, Morgantown, WV, USA,
17-21 September 1995.
Mohaghegh, S.D., Gaskari, R., Jalali, J., 2005. New method for production data
analysis to identify new opportunities in mature fields: methodology and
application. SPE Paper 98010 presented at the 2005 SPE Eastern Regional
Meeting, Morgantown, WV, USA, 14-16 September 2005.
Nikravesh, M., Kovscek, A.R., Patzek, T.W., 1996. Prediction of formation damage
during fluid injection into fractured low permeability reservoirs via neural
networks. In: Paper SPE 31103, Proceedings of the SPE Formation Damage
Symposium, Lafayette, LA, USA, February 14-15, 1996.
Nguyen, H.H., Chan, C.W., Wilson, M., 2004. Prediction of oil well
production: a multiple-neural-network approach. Intell. Data Anal. 8 (2),
183-196.
Nguyen, H.H., Chan, C.W., 2005. Applications of data analysis techniques for oil
production prediction. Eng. Appl. Artif. Intell. 18 (5), 549-558.
Perez, H., Datta-Gupta, A., Misra, S., 2005. The role of electrofacies, lithofacies and
hydraulic flow units in permeability predictions from well logs: a comparative
analysis using classification trees. SPE Reservoir Eval. Eng. 8 (2), April.
Sigma Plot, 2007. <http://www.systat.com/products/sigmaplot/>.
Weka tool kit, 2007. <http://www.cs.waikato.ac.nz/ml/weka/index_home.html>.
Witten, I.H., Frank, E., 2005. Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations, 2nd ed. Morgan Kaufmann.
Wong, P.M., Henderson, D.H., Brooks, L.J., 1998. Permeability determination using
neural networks in the Ravva field, offshore India. SPE Reservoir Eval. Eng. 1,
99-104.
