
SPE-174900-MS

Using Data Analytics to Analyze Reservoir Databases


Brian B. Lee and Larry W. Lake, The University of Texas at Austin

Copyright 2015, Society of Petroleum Engineers

This paper was prepared for presentation at the SPE Annual Technical Conference and Exhibition held in Houston, Texas, USA, 28-30 September 2015.

This paper was selected for presentation by an SPE program committee following review of information contained in an abstract submitted by the author(s). Contents
of the paper have not been reviewed by the Society of Petroleum Engineers and are subject to correction by the author(s). The material does not necessarily reflect
any position of the Society of Petroleum Engineers, its officers, or members. Electronic reproduction, distribution, or storage of any part of this paper without the written
consent of the Society of Petroleum Engineers is prohibited. Permission to reproduce in print is restricted to an abstract of not more than 300 words; illustrations may
not be copied. The abstract must contain conspicuous acknowledgment of SPE copyright.

Abstract
Many public and private databases of oil field properties exist, and their analysis could lead to
insights in several areas. The recent trend of Big Data has given rise to novel analytic methods that
handle multidimensional data effectively and visualize them to reveal new patterns. Using a commercial
reservoir properties database, we created and tested four models to predict ultimate oil and
gas recovery efficiencies, using the following methods borrowed from data analytics: linear regression,
linear regression with feature selection, Bayesian network, and artificial neural network. The accuracy
and robustness of each approach are presented and discussed.

Introduction
Big Data loosely refers to data sets that are too large and complex to be stored and processed with
traditional data analysis methods. The term also covers the new technologies developed to overcome the
inadequacies of those traditional methods. Beyond these observations, the core idea of Big Data
is to identify and derive value from the data itself. This contrasts with the approach of much of the
petroleum industry, where data mainly serve as a descriptor of some other valuable asset. Big Data
seeks to extract knowledge from data by means of inference (National Research Council, 2013).
Recently, however, there has been a growing effort within the petroleum industry to adopt the vision of
Big Data. For instance, Lake et al. (2014) and Zerafat et al. (2011) used reservoir databases in conjunction
with Bayes’ theorem and a Bayesian network, respectively, for EOR screening. Sharma et al. (2010) used
linear regression with cluster analysis and compared the results with other empirical correlations for
recovery efficiencies. Finally, Rodriguez et al. (2013) used various data pre-processing methods,
principal component analysis, and clustering methods to create an expert system that, when provided with a
reservoir, identifies analogue reservoirs. These are examples of attempts to tap into the inherent value of
data by inference and learning.
As part of this larger body of work, we created four prediction models from a commercial
reservoir database to predict ultimate oil and gas recovery efficiencies. The four models are:
linear regression, linear regression with feature selection, Bayesian network, and artificial neural network.

Data Pre-Processing and Experiment Methods

To analyze recovery efficiencies, we used a commercial reservoir database that contains approximately
1300 reservoir entries, each of which contains variables such as porosity, permeability, well spacing,
recovery efficiency, well location, depth, lithology, water saturation, structural setting, API gravity, and
cumulative production. Some of the values—such as porosity, permeability, well spacing, and water
saturation—are averages that characterize the entire reservoir. The total number of variables available for
each reservoir entry is 261.
Like other large data sets, this data set is not completely populated. To address the missing data,
we used an imputation method called multivariate imputation by chained equations (Van Buuren et al.,
2009). The imputation process works as follows:
1. Discard all observations for which every value is missing.
2. For each missing value, fill in a random draw from the observed values of that variable.
3. Move through the columns of variables and perform single-variable imputation using a chosen
method (here, linear regression).
4. Replace the original (random) replacements with the fitted replacements. Repeat the previous step
until a set number of cycles has completed or the imputed values converge to within a threshold.
5. Repeat the above to create multiple imputed data sets.
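
The imputation steps above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation (the cited mice software is an R package): it uses deterministic regression predictions rather than random draws around the fit, runs a fixed number of cycles instead of checking convergence, and marks missing values with `None`. The function names, the `None` convention, and the tiny ridge term that keeps the normal equations solvable are all assumptions made for this example.

```python
import random

def fit_ols(rows, y, ridge=1e-9):
    """Least-squares coefficients (intercept first) from the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Augmented matrix [X^T X + ridge*I | X^T y]; the ridge is only a numerical safeguard.
    A = [[sum(x[i] * x[j] for x in X) + (ridge if i == j else 0.0) for j in range(p)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))

def without(row, j):
    """Row with column j removed (the predictors for imputing column j)."""
    return row[:j] + row[j + 1:]

def chained_imputation(data, cycles=10, seed=0):
    """Return one completed copy of `data` (list of rows, None = missing)."""
    rng = random.Random(seed)
    # Step 1: discard rows in which every value is missing.
    rows = [list(r) for r in data if any(v is not None for v in r)]
    missing = [[v is None for v in r] for r in rows]
    n_cols = len(rows[0])
    # Step 2: fill each gap with a random draw from that column's observed values.
    for j in range(n_cols):
        observed = [r[j] for r, m in zip(rows, missing) if not m[j]]
        for r, m in zip(rows, missing):
            if m[j]:
                r[j] = rng.choice(observed)
    # Steps 3-4: cycle through the columns, regressing each on all the others
    # and overwriting the earlier fill-ins with the fitted values.
    for _ in range(cycles):
        for j in range(n_cols):
            known = [r for r, m in zip(rows, missing) if not m[j]]
            coef = fit_ols([without(r, j) for r in known], [r[j] for r in known])
            for r, m in zip(rows, missing):
                if m[j]:
                    r[j] = predict(coef, without(r, j))
    return rows
```

Step 5 corresponds to calling `chained_imputation` several times with different `seed` values to obtain multiple imputed data sets.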
After following the above procedure to obtain a complete data set, we randomly separated the data
into two groups to assess the generalizability of the predictive models. The training set and
testing set accounted for 70% and 30% of the data, respectively. Every time we trained and tested a model,
new training and testing sets were generated by random assignment. To test the models, independent
(input) variables were provided to the model, and the predicted recovery efficiency was compared
with the actual recovery efficiency value from the data set.
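
A minimal sketch of the random 70/30 split and of a mean squared error used to compare predicted and actual values. The function names are hypothetical, and the records can be any Python objects (for example, one tuple per reservoir entry).

```python
import random

def train_test_split(records, train_fraction=0.7, rng=None):
    """Randomly assign records to a training set and a testing set."""
    rng = rng or random.Random()
    shuffled = records[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

Calling `train_test_split` again before each experiment reproduces the paper's practice of drawing new training and testing sets for every run.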

Multilinear Regression
Multilinear regression is one of the most common tools for data analysis and modeling. A standard case
of multilinear regression serves here both to illustrate the shortcomings of the traditional approach and as
a benchmark for the other analytic techniques.
Six variables frequently used in reservoir engineering were chosen to create a linear regression model
that predicts the ultimate recovery efficiency. The variables are: initial reservoir pressure, porosity, depth,
oil well spacing, gas well spacing, and water saturation. The well spacing variables were log-transformed because
they had broad ranges and their univariate distributions were approximately lognormal.
Using a training set, standard linear regression was conducted to find the coefficients that yield the
most accurate estimate of oil and gas recovery efficiencies. Testing results for predicting oil and gas
recovery efficiencies are provided in Figure 1.
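
For reference, the six-variable model described above can be written out as follows. The notation is assumed here and does not come from the paper: $p_i$ is initial reservoir pressure, $\phi$ porosity, $D$ depth, $s_o$ and $s_g$ oil and gas well spacing, $S_w$ water saturation, and $RE$ the recovery efficiency. The natural logarithm is also an assumption; the paper says only that well spacing was log-transformed.

```latex
\widehat{RE} = \beta_0 + \beta_1 p_i + \beta_2 \phi + \beta_3 D
             + \beta_4 \ln s_o + \beta_5 \ln s_g + \beta_6 S_w ,
\qquad
\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}}
  \sum_{k=1}^{N_{\mathrm{train}}} \left( RE_k - \widehat{RE}_k \right)^2
```

One such model is fitted for oil recovery efficiency and another for gas recovery efficiency, each on the training set only.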

Figure 1—Oil and gas recovery efficiency prediction performance using standard multilinear regression using six variables.

The scatter plots in Figure 1 show that the multilinear regression performs poorly: the predicted recovery
efficiencies on the vertical axes show little correlation with the actual data set values on the horizontal
axes. For gas recovery efficiencies, the estimated values on the vertical axis
show wider scatter than the actual target values from the data set on the horizontal axis. The
inaccurate estimation may be because none of the individual input variables displayed a bivariate
correlation with oil or gas recovery efficiencies.
To examine whether any subset of the input variables adopted above yields better performance, we
conducted another experiment. We created regression models using all possible subsets of
the six chosen variables and tested each model. We then selected the best-performing model, defined as the one
with the smallest mean squared error. Because each round of training and testing drew from
different randomized training and testing data sets, the process was repeated 100 times. Figure 2 shows
that variables 1-4 made up the highest-performing linear regression model for oil recovery
prediction and that the same variables plus variable 6 were selected for gas recovery. In order, variables
1-6 represent: initial reservoir pressure, average porosity, depth, oil well spacing, gas well spacing, and
average water saturation.
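
The exhaustive search over variable subsets can be sketched as below. This is an illustrative version under stated assumptions, not the authors' code: the helper names are hypothetical, the least-squares fit uses a small hand-written normal-equation solver so that the example needs only the standard library, and the tiny ridge term is a numerical safeguard.

```python
import itertools

def fit_ols(rows, y, ridge=1e-9):
    """Least-squares coefficients (intercept first) from the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) + (ridge if i == j else 0.0) for j in range(p)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def mse(coef, rows, y):
    preds = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

def take(rows, cols):
    """Keep only the listed columns of every row."""
    return [[r[c] for c in cols] for r in rows]

def best_subset(train_X, train_y, test_X, test_y):
    """Fit every non-empty subset of variables; return the one with lowest test MSE."""
    n_vars = len(train_X[0])
    best_cols, best_err, n_models = None, float("inf"), 0
    for k in range(1, n_vars + 1):
        for cols in itertools.combinations(range(n_vars), k):
            coef = fit_ols(take(train_X, cols), train_y)
            err = mse(coef, take(test_X, cols), test_y)
            n_models += 1
            if err < best_err:
                best_cols, best_err = cols, err
    return best_cols, best_err, n_models
```

With six candidate variables the loop fits 63 models per run. Repeating the call on fresh random splits and tallying `best_cols` gives a selection-frequency chart of the kind the paper reports.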

Figure 2—Frequency of variables selected as the best performing multilinear regression model.

For the oil recovery efficiency model, the first four variables were used: (1) initial reservoir
pressure, (2) average porosity, (3) subsurface depth, and (4) oil well spacing. For the gas
recovery efficiency model, the selected variables were identical to those of the oil model, with the addition of
average water saturation (variable 6).
Multilinear regression models created with the selected variables were tested, and the results are
provided in Figure 3.

Figure 3—Oil and gas recovery efficiency prediction performance of multilinear regression models with brute-force method to select variables

The figures reveal that the performance shows little improvement over the initial case; the estimates
are as scattered as in Figure 1.
Initially, the scope of analysis was limited to six input variables, which we chose because we
believed, based on engineering principles, that they directly influence recovery efficiencies. However,
correlations with recovery efficiencies may exist outside the span of the proposed six variables,
or not at all. Because the database has many features (250+) for each reservoir entry, creating linear
regression models with every possible combination of variables is costly in time and computation.
Sequential feature selection, described in the next section, is an elegant alternative to this brute-force
approach.
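
To make the cost argument concrete, the short sketch below counts how many regression models each strategy must fit for m candidate variables. The function names are hypothetical. The forward-selection figure is an upper bound that assumes the procedure keeps adding variables until none remain.

```python
def exhaustive_model_count(m):
    """Number of non-empty subsets of m variables (one model per subset)."""
    return 2 ** m - 1

def forward_selection_model_count(m):
    """Upper bound for forward selection: m + (m - 1) + ... + 1 models."""
    return m * (m + 1) // 2
```

For the six hand-picked variables, `exhaustive_model_count(6)` is 63. For 250 variables the exhaustive count exceeds 10^75, whereas forward selection needs at most 31,375 fits.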

Multilinear Regression with Sequential Feature Selection
Also known as feature subset selection, sequential feature selection selects only the variables that are
relevant to the analysis, effectively reducing the number of dimensions considered. This method differs
from that of the previous section: whereas the previous method required the modeler to select the
variables, here the algorithm selects the variables according to the intervariate relationships present in the data
set. Sequential feature selection is effective when creating models with high-dimensional data, which is
common in machine learning (Fisher et al., 2007). Of the different variations of the selection method, we
employed sequential forward selection, as follows:
1. Start with a blank slate: an empty model that includes no variables.
2. Test individual candidate variables one at a time: if there are m variables in the data set, create m
different linear regression models, each of which contains a single variable.
3. Identify one candidate variable that generates the most accurate model. This variable is added to
the model.
4. Identify the next most important variable. Begin with the model that includes the selected
variable(s). Test the remaining candidate variables by adding them one at a time to
identify the candidate variable whose inclusion improves the accuracy of the previous model the
most.
5. Test for statistical significance. If the new model is significantly more accurate than the previous,
add the candidate variable to the model.
6. Repeat steps 4-5 until the statistical significance test fails. If at any point the new model is not
significantly more accurate than the last model generated, remove the last statistically insignificant
candidate variable from the model, and stop.
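
The forward-selection steps above can be sketched as follows. One substitution is made and should be read as an assumption: instead of a formal statistical significance test, which the paper does not specify, the loop stops when the best remaining candidate reduces the validation error by less than a relative threshold `min_gain`. All function names are hypothetical, and the least-squares solver is a small standard-library stand-in.

```python
def fit_ols(rows, y, ridge=1e-9):
    """Least-squares coefficients (intercept first) from the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) + (ridge if i == j else 0.0) for j in range(p)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def mse(coef, rows, y):
    preds = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

def forward_select(train_X, train_y, val_X, val_y, min_gain=0.05):
    """Greedy forward selection; returns column indices in the order they were added."""
    def error(cols):
        pick = lambda rows: [[r[c] for c in cols] for r in rows]
        return mse(fit_ols(pick(train_X), train_y), pick(val_X), val_y)

    remaining = list(range(len(train_X[0])))
    selected = []
    best_err = error([])                      # step 1: empty (intercept-only) model
    while remaining:
        # Steps 2-4: try each remaining candidate on top of the current model.
        err, best = min((error(selected + [c]), c) for c in remaining)
        # Steps 5-6 (stand-in for the significance test): stop on a small gain.
        if best_err - err < min_gain * best_err:
            break
        selected.append(best)
        remaining.remove(best)
        best_err = err
    return selected
```

Because a rejected candidate is never added in this version, the final "remove the last insignificant variable" step has nothing to undo.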
Table 1 lists the variables selected after the above process is completed.

Table 1—Variables selected after sequential feature selection


Selected Variables Oil RF         Selected Variables Gas RF

API GRAVITY                       API GRAVITY
AVERAGE ANNUAL SURFACE T          DEPTH
FRACTURE PRESSURE                 ELEVATION
GAS SPECIFIC GRAVITY              FLUID CONTACT
MID RESERVOIR DEPTH               FRACTURE PRESSURE
PORE VOLUME COMPRESSIBILITY       FORMATION VOLUME FACTOR
POROSITY                          PORE VOLUME COMPRESSIBILITY
PRODUCTION                        POROSITY
STRATIGRAPHIC COMPARTMENT COUNT   PRESSURE
TEMPERATURE DEPTH                 PRESSURE DEPTH
TEMPERATURE GRADIENT              PRODUCTION
TOTAL ORGANIC CONTENT             STRATIGRAPHIC COMPARTMENT COUNT
VISCOSITY TEMPERATURE             STRUCTURAL COMPARTMENT COUNT
WATER SATURATION                  TEMPERATURE
WELL COUNT                        TEMPERATURE DEPTH
WELL SPACING                      TEMPERATURE GRADIENT
                                  THICKNESS
                                  TOTAL ORGANIC CONTENT
                                  WATER SALINITY
                                  WATER SATURATION
                                  WELL COUNT
                                  WELL SPACING

Figure 4 shows the testing result: the estimated recovery efficiency is on the y-axis, whereas the actual
data set values are on the x-axis. The estimation accuracy improved significantly over Figures 1
and 3, though there are still issues. While the scatter of the data is reduced, the clustering about the 1:1 line
has worsened.

Figure 4 —Oil and gas recovery efficiency prediction performance of linear regression models with sequentially selected variables

Because the separation of training and testing data sets is stochastic, the resulting set of
selected features may differ from run to run. To account for this effect, we ran 1000 cases of sequential feature
selection, each with different training and testing data sets randomly drawn from the same original data
set. Figure 5 shows the number of times each variable was selected for the final model after
sequential variable selection. For concise representation, variable names on the x-axes were replaced by
corresponding variable IDs unique to each variable.

Figure 5—Frequency of variables selected after 1000 runs of sequential feature selection.

The frequency of selection varies greatly for each variable. To create and test the final model, we have
adopted variables that were selected more than 90% of the time. These variables are in Table 2.

Table 2—Variables selected after 1000 runs of sequential variable selection


RF Oil                        RF Gas

POROSITY                      WATER DEPTH
PORE VOLUME COMPRESSIBILITY   DEPTH
WATER SATURATION              FLUID CONTACT
TOTAL ORGANIC CONTENT         STRATIGRAPHIC COMPARTMENT COUNT
WELL COUNT                    THICKNESS
API GRAVITY                   POROSITY
GAS SPECIFIC GRAVITY          PORE VOLUME COMPRESSIBILITY
TEMPERATURE DEPTH             WATER SATURATION
TEMPERATURE GRADIENT          WELL COUNT
FRACTURE PRESSURE             API GRAVITY
WELL SPACING                  TEMPERATURE
                              TEMPERATURE GRADIENT
                              FRACTURE PRESSURE
                              PRESSURE
                              PRESSURE DEPTH
                              WATER SALINITY
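
The repeated-run filtering that produced this table can be sketched as below: run a selection routine on many random 70/30 splits, count how often each variable is chosen, and keep those chosen in more than 90% of the runs. The `select` argument is a placeholder for any routine that takes a training set and a testing set and returns the chosen variable names. It and the other names are hypothetical.

```python
import random
from collections import Counter

def stable_features(records, select, runs=1000, keep_fraction=0.9, seed=0):
    """Return (variables chosen in more than keep_fraction of runs, raw counts)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(runs):
        shuffled = records[:]
        rng.shuffle(shuffled)                       # new random split for each run
        cut = round(0.7 * len(shuffled))
        chosen = select(shuffled[:cut], shuffled[cut:])
        counts.update(set(chosen))                  # count each variable once per run
    kept = sorted(v for v, n in counts.items() if n > keep_fraction * runs)
    return kept, counts
```

The returned `counts` correspond to the bar heights of a selection-frequency chart, and `kept` to the variables that pass the 90% threshold.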

Figure 6 —Oil and gas recovery efficiency prediction performance for variables selected after 1000 runs of sequential variable selection

The horizontal axes (“Target”) represent the values from the original data set, while the vertical axes
(“Output”) represent values estimated by the model. The predictive accuracy is not as high as in the
single sequential variable selection case; however, it is still an improvement over the original multilinear
regression models in Figures 1 and 3.

Bayesian Network
Another model implemented is the Bayesian network. A Bayesian network is a probabilistic graphical
model that represents a set of random variables and their conditional dependencies. In the network, a
random variable is represented as a node, and a conditional dependency is represented as an edge. These
components together represent a successive and/or simultaneous application of Bayes’
theorem, which is used to systematically update prior probability distributions when new observations are
made (Pearl, 2009). The network can take in observations of multiple variables and update
the estimated probabilities in related nodes accordingly. This approach draws on principles from graph theory,
probability theory, machine learning (computer science), and statistics (Bishop, 2007).
The Bayesian network has a few distinct advantages for predicting recovery
efficiency: 1) it can incorporate both the data and the relationships between reservoir properties; 2)
it makes data interdependencies easy to visualize, which can be difficult in other multidimensional models; and 3)
it can handle both continuous and categorical variables, the latter of which were neglected in the previous stages.
To predict oil and gas recovery efficiencies, we created and tested a naïve (meaning no relationship
between inputs) Bayesian network. Figure 7 shows the network.

Figure 7—Bayesian network to estimate oil and gas recovery factors

To allow numeric computation of conditional probabilities, continuous variables were discretized into
three or four segments. Again, the oil and gas well spacing variables were log-transformed. Because each
node shows the probability distribution of the variable rather than a specific value, the numeric estimates
of oil and gas recovery efficiencies were obtained by calculating the expected value of the discretized
probability distribution. The testing results are in Figure 8.
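
A minimal sketch of this idea: a naïve Bayes model over discretized variables whose numeric prediction is the expected value of the posterior over recovery-efficiency bins. Several choices here are assumptions for illustration and are not stated in the paper: equal-width bins, the same number of bins for every variable, Laplace smoothing (`alpha`), and bin midpoints as the representative values. The class and function names are hypothetical.

```python
import math

def equal_width_bin(value, lo, hi, n_bins):
    """Index of the equal-width bin that contains value (clamped to the range)."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)

class NaiveBayesRegressor:
    def __init__(self, n_bins=4, alpha=1.0):
        self.n_bins, self.alpha = n_bins, alpha

    def fit(self, X, y):
        k = self.n_bins
        self.x_ranges = [(min(col), max(col)) for col in zip(*X)]
        self.y_range = (min(y), max(y))
        self.class_counts = [0] * k
        # feat_counts[feature][target_bin][feature_bin]
        self.feat_counts = [[[0] * k for _ in range(k)] for _ in self.x_ranges]
        for row, target in zip(X, y):
            c = equal_width_bin(target, *self.y_range, k)
            self.class_counts[c] += 1
            for j, v in enumerate(row):
                self.feat_counts[j][c][equal_width_bin(v, *self.x_ranges[j], k)] += 1
        return self

    def posterior(self, row):
        """P(target bin | observed inputs), assuming inputs independent given the target."""
        k, a, n = self.n_bins, self.alpha, sum(self.class_counts)
        logs = []
        for c in range(k):
            lp = math.log((self.class_counts[c] + a) / (n + a * k))
            for j, v in enumerate(row):
                b = equal_width_bin(v, *self.x_ranges[j], k)
                lp += math.log((self.feat_counts[j][c][b] + a) / (self.class_counts[c] + a * k))
            logs.append(lp)
        top = max(logs)
        weights = [math.exp(lp - top) for lp in logs]
        total = sum(weights)
        return [w / total for w in weights]

    def predict(self, row):
        """Expected value of the discretized posterior, using bin midpoints."""
        lo, hi = self.y_range
        width = (hi - lo) / self.n_bins
        mids = [lo + (c + 0.5) * width for c in range(self.n_bins)]
        return sum(p * m for p, m in zip(self.posterior(row), mids))
```

Any two inputs that fall into the same combination of bins receive exactly the same prediction, so the predicted values can take only a limited set of levels.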

Figure 8 —Oil and gas recovery factor prediction performance for Bayesian network

The agreement with the actual values is similar to that of simple linear regression. One difference is
that the predicted values of recovery efficiencies show slight striations. These striations are artifacts of
the discretization of continuous variables, which compartmentalized the multidimensional space and placed
the data in its segments. Increasing the number of discrete segments for each variable would make the predictions
smoother; however, the likelihood of a data point falling in each discretized segment quickly
decreases. At the scale of our current database, three or four discrete segments per variable are
appropriate.

Artificial Neural Network


The final data model we employed is the artificial neural network (ANN). The selected model is
a feed-forward neural network, also known as a multilayer perceptron. It is a simple type of ANN that consists
of three layers: input, hidden, and output. Because the activation functions in each node of the
diagram are sigmoidal, a neural network can model complex nonlinear behavior that may be difficult to capture
with other methods such as linear regression. An example diagram is in Figure 9.

Figure 9 —Schematic feed-forward artificial neural network

For our modeling, we have implemented a feed-forward network with 15 input nodes, 30 hidden nodes
and 1 output node. The model was trained using a back-propagation algorithm. The performance of this
model is presented in Figure 10.
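
A minimal sketch of such a 15-30-1 feed-forward network trained by back-propagation, written with the standard library only. The paper does not give the training details, so the following are assumptions: sigmoid hidden units with a linear output node, squared-error loss, plain stochastic gradient descent with learning rate `lr`, and uniform random initial weights. The class name is hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class FeedForwardNet:
    """One hidden layer of sigmoid units and a single linear output node."""

    def __init__(self, n_in=15, n_hidden=30, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        """Return (hidden activations, network output) for one input vector."""
        h = [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(self.W1, self.b1)]
        return h, sum(w * a for w, a in zip(self.W2, h)) + self.b2

    def train_step(self, x, target, lr=0.03):
        """One back-propagation update on a single example; returns its loss."""
        h, out = self.forward(x)
        err = out - target                               # d(0.5 * err^2) / d(out)
        for j, a in enumerate(h):
            # Gradient reaching hidden unit j, passed back through the sigmoid.
            grad_h = err * self.W2[j] * a * (1.0 - a)
            self.W2[j] -= lr * err * a
            self.b1[j] -= lr * grad_h
            self.W1[j] = [w - lr * grad_h * v for w, v in zip(self.W1[j], x)]
        self.b2 -= lr * err
        return 0.5 * err * err
```

Training consists of calling `train_step` on every training example for a number of passes. Inputs are assumed to be scaled to a modest range such as 0 to 1 so that the sigmoid units do not saturate.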

Figure 10 —Test results for oil and gas recovery efficiencies using ANN

The results of the artificial neural network prediction show slightly better accuracy than the standard
linear regression model.

Conclusions
Oil and gas recovery efficiencies were predicted using multilinear regression, multilinear regression with
sequential feature selection, a Bayesian network, and an artificial neural network. The comparison of the first
two models showed that multilinear regression with sequentially selected variables performs better than
multilinear regression with variables selected based on engineering principles. The Bayesian network
model did not perform well, primarily because of the necessity of discretizing numeric variables into
bins with large increments. The artificial neural network results showed little scatter but are not accurate. Of the
results presented in this paper, none seem able to predict recovery efficiency entirely satisfactorily.
There are two major reasons for the models’ inadequacies: the original data set and inaccuracies in the
model design.
The data set does not include some critical variables—such as time since initial production, or reservoir
maturity—whose inclusion may have given statistically significant results. Also, none of the variables in
the data set are indicators of reservoir heterogeneity, which may have a strong correlation with recovery
efficiencies. The analysis could improve further if the data set included primary, secondary, and ultimate
recovery efficiencies separately. Finally, if the data set were more complete, then we could have relied less
on data imputation techniques that may have introduced more noise to the data than necessary.
The experiments with model designs, especially with the Bayesian network and artificial neural
network, are by no means conclusive. For both models there are many possible variations. For
example, the Bayesian network could have a more complex structure than the naïve Bayesian network used in
this paper, or it could include different variables. For the artificial neural network, we could investigate how
different numbers of input, hidden, and output nodes affect the outcome. Also, there exist different types
of artificial neural network models that could be explored.
The discussion above calls for a more elaborate study.

Acknowledgements
This work was supported by the Center for Petroleum Asset Risk Management. Larry W. Lake holds
the Sharon and Shahid Ullah Chair at the University of Texas.

References
Bishop, C. M. 2007. Pattern Recognition and Machine Learning. Springer.
Fisher, D., Lenz, H. 2007. Learning from Data: Artificial Intelligence and Statistics V. Springer.
Lake, L. W., Yang, A., Pan, Z. 2014. Listening to the Data: An Analysis of the Oil and Gas Journal
Database. Presented at the SPE Improved Oil Recovery Symposium held in Tulsa, Oklahoma, USA, 12-16
April 2014. SPE-169056-MS.
National Research Council (U.S.). 2013. Frontiers in Massive Data Analysis.
Pearl, J. 2009. Causality, second edition. Cambridge University Press.
Rodriguez, H. M., Escobar, E., Embid, S., Rodriguez, N., Hegazy, M., and Lake, L. W. 2013. New
Approach to Identify Analogue Reservoirs. Presented at the SPE Annual Technical Conference and
Exhibition held in New Orleans, Louisiana, USA, 30 September-2 October 2013. SPE 166449.
Sharma, A., Srinivasan, S., Lake, L. W. 2010. Classification of Oil and Gas Reservoirs Based on
Recovery Factor: A Data-Mining Approach.
Van Buuren, S., and Groothuis-Oudshoorn, K. 2009. mice: Multivariate Imputation by Chained Equations
in R. Journal of Statistical Software Vol 45 issue 3.
Zerafat, M. M., Ayatollahi, Sh., Mehranbod, N., and Barzegari, D. 2011. Bayesian Network Analysis
as a Tool for Efficient EOR Screening. Presented at the SPE Enhanced Oil Recovery Conference held in
Kuala Lumpur, Malaysia, 19-21 July 2011. SPE 143282.
