SPE-176792-MS

Using Trees, Bagging, and Random Forests to Predict Rate of Penetration During Drilling
Chiranth Hegde, Scott Wallace, and Ken Gray, University of Texas at Austin

Copyright 2015, Society of Petroleum Engineers

This paper was prepared for presentation at the SPE Middle East Intelligent Oil & Gas Conference & Exhibition held in Abu Dhabi, UAE, 15–16 September 2015.

This paper was selected for presentation by an SPE program committee following review of information contained in an abstract submitted by the author(s). Contents
of the paper have not been reviewed by the Society of Petroleum Engineers and are subject to correction by the author(s). The material does not necessarily reflect
any position of the Society of Petroleum Engineers, its officers, or members. Electronic reproduction, distribution, or storage of any part of this paper without the written
consent of the Society of Petroleum Engineers is prohibited. Permission to reproduce in print is restricted to an abstract of not more than 300 words; illustrations may
not be copied. The abstract must contain conspicuous acknowledgment of SPE copyright.

Abstract
Predicting rate of penetration (ROP) has always been of fundamental interest to the drilling industry. Early
predictions can assist the engineer in changing parameters to reduce non-productive time (NPT) and
achieve optimum ROP. This paper illustrates methods to predict the ROP in a computationally efficient
manner using only data available at the surface. These methods can then be incorporated into real time
drilling operations, first through a passive diagnostic tool, and then an integrated real-time control loop.
In this work, statistical learning techniques such as trees, bagged trees, and random forests (RF) are
used to predict ROP. Trees provide easy interpretability and hence are favored over other non-linear
techniques. However, accuracy is imperative in this procedure. Accuracy can be increased by using
bootstrap aggregating (bagging) or random forests. These techniques are applied using the statistical
computing software R and its numerous libraries. The statistical learning techniques have been
applied to a data set with nine predictors.
Applying trees to a data set yields an easily interpreted visualization of the data, but single trees lack
accuracy and can overfit substantially. This shortcoming was rectified using bagging or RF methods to substantially
increase accuracy. The results were promising in all cases and acceptable for real time predictions.
Scalability is another concern for real time operations. Computational efficiency of the methods was
evaluated, and the best method was selected based on a combination of computational efficiency and accuracy.
Potential time savings which would result from applying the model in real-time optimizations and a
demonstration of the power of machine learning techniques are included in this paper. Future
improvements will be incorporated in real-time prediction during drilling.
State of the art statistical learning and machine learning techniques are applied to prediction of ROP,
whereas previous prediction methods have not been based on real-time drilling data. The result is a
computationally efficient model which can determine the right features for prediction at each step, while
also incorporating engineering judgement and maintaining integrity of the statistical principles being
employed.
These methods can easily be extended to other drilling parameters such as MSE or Torque and Drag.

Introduction
Drilling operations are more cost efficient when rate of penetration (ROP) and other drilling parameters
are optimized. Several ROP models have been used over the years (Bingham, 1965; Bourgoyne and
Young, 1974), which take different parameters into account. Models were added to incorporate the type
of drill bit being used (Hareland et al., 2000; Motahari et al., 2010). For changes in formation type or
lithology, various coefficients are used, determined empirically so as to fit a specific drilling situation.
A significant disadvantage of traditional ROP models is the requirement to iterate over some drilling
interval to determine the empirical coefficients in each hole section being drilled. Lithological changes over
short intervals exacerbate the problem. Moreover, traditional models are not predictive.
A predictive model can be used in optimizing drilling parameters such as weight on bit (WOB) and
rotary speed (RPM) to achieve maximum ROP. This paper introduces such a predictive model, the Wider
Windows Statistical Learning Model (WWSLM), which can also be incorporated into real-time drilling
operations.
Principles of statistical learning and machine learning have been applied to create a model which
accurately predicts ROP, as well as providing other drilling-based inferences (Hegde, Wallace and Gray,
2015). Trees, bagging, boosting, and random forests are used to construct the statistical model. The
optimum model can be developed to best suit needs of the user, combining accuracy and computational
efficiency. In real-time applications, computational efficiency of the model may be more limiting than
model accuracy. Neural networks have been used for ROP prediction (Jahanbakhshi et al., 2012).
However, the methods presented herein take a different approach which produces higher computational
efficiency and accuracy. Moreover, the use of trees and random forest techniques provides mechanisms for
evaluating the relative importance of input predictor variables, a very powerful modeling feature and a
highly useful capability for regular field operations.
The WWSLM uses a number of predictors, such as WOB, RPM, depth, etc. as input variables, which
are then utilized to ‘train’ an ROP predictor. Input parameters are user-selected, and resulting accuracy
will depend on the specific input parameters used. In the examples shown in this paper, only surface
measurements were used for ROP prediction. Other variables such as mud properties, drill string and
bottom-hole assemblies were not included, but they could be. Basic requirements for WWSLM are
minimal, user-friendly, and rig adaptable.

Partition of Data Set


The data set used was partitioned into three sets, so as to evaluate errors that would be significant during
real time operations. Overfitting is avoided by splitting the data into training sets, validation sets,
and test sets. A training set is used by the algorithm to develop the model. Then the validation set is used
for cross validation and to restrict the number of parameters in the model, while indicating which
parameters are the most significant. Such fine tuning of the algorithm helps minimize the error. Finally, the
test set is used to “test” accuracy. The WWSLM is blind to the test set, i.e. it never has access to the test
set and never has an opportunity to learn from it. This aids in assessing model accuracy, as error assessment
from the training set alone can lead to overfitting. In the example illustrated herein, the training and validation
sets consist of a subsection of the data (say the first 500 ft.). The test set is the values of ROP that are to be
predicted (2,000 ft. yet to be drilled).
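The depth-ordered partition described above can be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in the paper; the function name `split_by_depth`, the record keys, and the 20% validation share are assumptions made for the example, not values reported by the authors.

```python
def split_by_depth(rows, train_val_end_ft, val_fraction=0.2):
    """Split depth-ordered drilling records into train, validation, and test sets.

    rows: list of dicts with a 'depth' key in ft (hypothetical record layout).
    train_val_end_ft: records shallower than this depth may be used for fitting.
    val_fraction: assumed share of the fitting interval held out for validation.
    """
    ordered = sorted(rows, key=lambda r: r["depth"])
    fit_rows = [r for r in ordered if r["depth"] < train_val_end_ft]
    test_rows = [r for r in ordered if r["depth"] >= train_val_end_ft]
    n_val = int(round(len(fit_rows) * val_fraction))
    split_at = len(fit_rows) - n_val
    # The deepest part of the fitting interval is held out for validation,
    # so the model is always validated on rock drilled after the training data.
    return fit_rows[:split_at], fit_rows[split_at:], test_rows


# Synthetic example: one record per foot, 500 ft drilled and 2,000 ft still to drill
records = [{"depth": float(d), "rop": 50.0} for d in range(2500)]
train, val, test = split_by_depth(records, train_val_end_ft=500.0)
print(len(train), len(val), len(test))  # prints: 400 100 2000
```

Keeping the split ordered by depth, instead of shuffling, mirrors the real-time situation in which only the interval already drilled is available for training.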

Assessment of Model Accuracy


The advantages of an SLM model can be evaluated based on computational efficiency and error rate.
Measurement of error helps determine the accuracy of the model and gauge its applicability in real-time
applications. Assessment of model accuracy utilizes root mean squared error (RMSE) of the model on the
test data. The mathematical definition of RMSE is shown below (Equation 1).

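$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$ (Equation 1)

where $y_i$ is the measured ROP, $\hat{y}_i$ is the predicted ROP, and $n$ is the number of observations in the test set.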
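The root mean squared error on a test set can be computed in a few lines. The sketch below uses the standard RMSE definition in Python; the function name and the sample ROP values are made up for illustration.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between measured and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Squared errors are 4, 25, and 1, so the mean is 10 and the RMSE is sqrt(10)
example = rmse([40.0, 55.0, 60.0], [42.0, 50.0, 61.0])
print(round(example, 4))  # prints: 3.1623
```

Because RMSE is expressed in the units of the response, an RMSE of 35 for ROP means a typical prediction error of about 35 ft/hr.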

Modeling Techniques

Trees
Tree methods can be used for classification and regression purposes. Here trees are employed for
regression and prediction of ROP. A tree has a flowchart-like structure, in which an input variable
is evaluated at each node. Regression trees are used to predict a variable at a given point. In multiple linear
regression, one can determine the outcome using a list of predictors, as shown in Equation 2.
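$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon$$ (Equation 2)

where $y$ is the outcome (here ROP), $x_1, \dots, x_p$ are the predictors, $\beta_0, \dots, \beta_p$ are the fitted coefficients, and $\epsilon$ is the error term.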

Multiple linear regression is a good method of prediction when data are linearly related. However, ROP
data are non-linear in nature, with complex interactions between predictors. For such situations, linear
regression has limited utility.
To simplify such complex data relationships, one approach is to partition the data into smaller, more
manageable sections, as illustrated in Figure 1. The subdivisions can be partitioned again, which
constitutes recursive partitioning, until the subdivisions can be fit with simple linear models. Trees use
a pictorial tree representation to create these partitions, and also aid in interpretation of the data at hand.
Trees are fast in making predictions, a useful feature for large data sets common to drilling operations.
They are easy to understand, as shown in Figure 2, where the split at each variable is shown for the Tyler
formation. An analysis of trees and their superiority over regression is discussed in later sections.
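The basic step of recursive partitioning is a search for the single split that best separates the response. The sketch below is a minimal Python illustration of that step for one predictor, assuming a least-squares split criterion (the usual choice for regression trees, not confirmed as the paper's setting); the RPM and ROP values are invented.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the split on x that minimizes the SSE of y."""
    pairs = sorted(zip(x, y))
    best_threshold, best_error = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical x values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        error = sse(left) + sse(right)
        if error < best_error:
            # Place the threshold midway between the two neighbouring x values
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best_error = error
    return best_threshold, best_error


rpm = [60, 70, 80, 120, 130, 140]
rop = [20.0, 22.0, 21.0, 48.0, 50.0, 52.0]
threshold, error = best_split(rpm, rop)
print(threshold, error)  # prints: 100.0 10.0
```

A full tree applies the same search again inside each of the two resulting partitions, which is the recursion referred to in the text.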

Figure 1—Data partition for a simple tree involving prediction of ROP using RPM as a predictor

Figure 2—Simple tree diagram for Tyler sandstone formation using surface measurements as input parameters

A shortcoming of trees is their limited accuracy. This can be mitigated by ‘pruning’, in which
the tree is cut back to an optimum size. The amount of pruning is determined using cross validation, such that the tree
with the lowest RMSE is chosen as the model for a given formation. This is where the validation set comes into
play: it is used to fine tune the algorithms used to predict ROP and to select those of highest accuracy. A good method
to reduce RMSE in trees is to alternately grow and prune them: first grow the tree, then prune it, and
repeat until pruning no longer decreases RMSE.
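Choosing the tree size from held-out data can be illustrated with a small example. The Python sketch below limits tree depth and keeps the depth with the lowest validation RMSE; this is a simpler stand-in for the cross-validated pruning described above, not the authors' procedure, and all function names and data are hypothetical.

```python
import math


def _sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(x, y, depth):
    """Grow a one-predictor regression tree with at most `depth` levels of splits."""
    if depth == 0 or len(set(x)) < 2:
        return {"value": sum(y) / len(y)}  # leaf: predict the mean response
    pairs = sorted(zip(x, y))
    best = None  # (error, split position, threshold)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        error = _sse([p[1] for p in pairs[:i]]) + _sse([p[1] for p in pairs[i:]])
        if best is None or error < best[0]:
            best = (error, i, (pairs[i - 1][0] + pairs[i][0]) / 2.0)
    _, i, threshold = best
    left, right = pairs[:i], pairs[i:]
    return {
        "threshold": threshold,
        "left": fit_tree([p[0] for p in left], [p[1] for p in left], depth - 1),
        "right": fit_tree([p[0] for p in right], [p[1] for p in right], depth - 1),
    }


def predict(tree, value):
    """Walk from the root to a leaf and return the leaf mean."""
    while "threshold" in tree:
        tree = tree["left"] if value < tree["threshold"] else tree["right"]
    return tree["value"]


def rmse(tree, x, y):
    return math.sqrt(sum((predict(tree, a) - b) ** 2 for a, b in zip(x, y)) / len(y))


def select_depth(train_x, train_y, val_x, val_y, max_depth):
    """Return (depth, validation RMSE) for the tree size with the lowest validation RMSE."""
    best_depth, best_error = 0, float("inf")
    for depth in range(max_depth + 1):
        error = rmse(fit_tree(train_x, train_y, depth), val_x, val_y)
        if error < best_error:
            best_depth, best_error = depth, error
    return best_depth, best_error


# Synthetic step change in ROP, with alternating +/-2 noise in the training data only
train_x = [float(i) for i in range(20)]
train_y = [(20.0 if i < 10 else 50.0) + (2.0 if i % 2 == 0 else -2.0) for i in range(20)]
val_x = [i + 0.25 for i in range(20)]
val_y = [20.0 if i < 10 else 50.0 for i in range(20)]
depth, error = select_depth(train_x, train_y, val_x, val_y, max_depth=4)
print(depth, error)  # prints: 1 0.0
```

In this synthetic case a single split recovers the step exactly, while deeper trees start fitting the noise and score worse on the validation data, which is the behavior pruning is meant to guard against.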
Bagging
The bootstrap (Efron and Tibshirani, 1994) is an extremely powerful idea, which can improve SLM
models such as trees. A disadvantage of trees is that they exhibit high variance, which can be critical in
drilling, especially where lithology varies strongly. Averaging models helps reduce variance. Ideally, many
separate training sets would each be used to build a tree, and the resulting predictions would be
averaged to yield a single low variance SLM. For a single well, however, only one training set is available.
This problem can be circumvented by using the bootstrap to repeatedly sample with replacement from
the data, yielding multiple training sets. This method of using the bootstrap and trees jointly is called
bagging. Bagging provides an additional advantage: calculation of out of bag (OOB) error estimates
obviates the need for a validation set. The OOB estimate uses the observations left out of each bootstrap sample to
calculate the error. James et al. (2013) show that this is a valid estimate of the test
error.
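Bagging with an out-of-bag error estimate can be sketched compactly. The Python example below is an illustration under simplifying assumptions: the base learner is a one-split tree on a single predictor, and the function names, tree count, and synthetic data are hypothetical rather than taken from the paper.

```python
import random


def fit_stump(x, y):
    """One-split regression tree on a single predictor; returns a predict function."""
    pairs = sorted(zip(x, y))
    overall = sum(y) / len(y)
    best = None  # (error, threshold, left mean, right mean)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        error = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or error < best[0]:
            best = (error, (pairs[i - 1][0] + pairs[i][0]) / 2.0, lm, rm)
    if best is None:  # every x value identical: fall back to the overall mean
        return lambda value: overall
    _, threshold, lm, rm = best
    return lambda value: lm if value < threshold else rm


def bagged_oob(x, y, n_trees=200, seed=0):
    """Fit stumps on bootstrap samples; return (predict function, OOB RMSE)."""
    rng = random.Random(seed)
    n = len(x)
    models, oob_sum, oob_count = [], [0.0] * n, [0] * n
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # draw n rows with replacement
        model = fit_stump([x[i] for i in idx], [y[i] for i in idx])
        models.append(model)
        for i in set(range(n)) - set(idx):  # rows this tree never saw
            oob_sum[i] += model(x[i])
            oob_count[i] += 1
    scored = [i for i in range(n) if oob_count[i] > 0]
    mse = sum((y[i] - oob_sum[i] / oob_count[i]) ** 2 for i in scored) / len(scored)

    def predict(value):
        # The bagged prediction is the average over all trees
        return sum(m(value) for m in models) / len(models)

    return predict, mse ** 0.5


# Synthetic step change in ROP with alternating +/-2 noise
x = [float(i) for i in range(40)]
y = [(20.0 if i < 20 else 50.0) + (2.0 if i % 2 == 0 else -2.0) for i in range(40)]
predict, oob_rmse = bagged_oob(x, y)
print(round(predict(5.0), 1), round(predict(35.0), 1), round(oob_rmse, 2))
```

Each observation is scored only by the trees whose bootstrap sample did not contain it, so the OOB error is obtained without setting aside a separate validation set.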

Random Forests
Decision trees capture non-linearity in the data but suffer from high error, high variance, and overfitting.
These problems can be reduced by using random forests, which are an extension of bagging. Random
forests use bootstrapping to create a large number of samples of the training set. At each node (split), only a random sample
of the predictors is considered when constructing the decision tree. Figure 3 is a pictorial depiction of
this method. A number ‘B’ of trees is created from bootstrap samples of the training set.
Their predictions are then averaged to obtain a low variance, highly accurate SLM. The main difference
between bagging and random forests is that in random forests, random subsets of predictors are used to
build trees. This de-correlates the individual trees, which reduces the overall variance of the SLM
when averaged.
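The step that separates a random forest from bagging is the random choice of candidate predictors at each split. The Python sketch below shows only that step, assuming a least-squares split criterion; the function names, the column order (WOB, RPM, depth), and the data values are hypothetical.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_random_subset(rows, y, m, rng):
    """Find the best split while searching only m randomly chosen predictor columns.

    Returns (candidate columns, (error, column index, threshold)).
    Setting m equal to the number of predictors reproduces plain bagging.
    """
    p = len(rows[0])
    candidates = rng.sample(range(p), m)
    best = None
    for j in candidates:
        values = sorted(set(row[j] for row in rows))
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [t for row, t in zip(rows, y) if row[j] < threshold]
            right = [t for row, t in zip(rows, y) if row[j] >= threshold]
            error = sse(left) + sse(right)
            if best is None or error < best[0]:
                best = (error, j, threshold)
    return candidates, best


# Hypothetical columns: WOB, RPM, depth. Only RPM drives ROP in this toy data.
rows = [
    [10.0, 60.0, 100.0],
    [10.0, 120.0, 110.0],
    [12.0, 65.0, 120.0],
    [12.0, 125.0, 130.0],
    [11.0, 70.0, 140.0],
    [11.0, 130.0, 150.0],
]
rop = [20.0, 50.0, 21.0, 51.0, 22.0, 52.0]
rng = random.Random(0)
print(best_split_random_subset(rows, rop, m=3, rng=rng))  # all predictors available
print(best_split_random_subset(rows, rop, m=2, rng=rng))  # RPM may be excluded
```

When the dominant predictor is occasionally withheld, different trees are forced to split on other variables, which is what reduces the correlation between trees.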

Figure 3—Random Forest Visualization (ICCV Tutorial, 2009)

Scalability and Computational Efficiency


It is desirable that an SLM should be capable of real-time operations at the rig. Thus the computational
time taken to run a particular SLM should be analyzed. The time value can be used to determine whether
the algorithm is scalable and can be used in closed-loop automation. Measurement of the time for
execution of the different algorithms was implemented, keeping all other factors such as processing
power, length of data set, software, etc. the same, such that the algorithms could be evaluated for
computational efficiency.
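A like-for-like timing comparison can be set up with a small harness such as the one below. This is a Python sketch, not the measurement code used in the paper; `time_call` and the placeholder `mean_predictor` are hypothetical names, and the 800-row input simply mirrors the size of the Tyler data set mentioned later.

```python
import time


def time_call(func, *args, repeats=5):
    """Return the best wall-clock time in milliseconds over several runs of func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


def mean_predictor(values):
    """Placeholder for a model fit; any training function could be timed instead."""
    return sum(values) / len(values)


elapsed_ms = time_call(mean_predictor, [float(i) for i in range(800)])
print(f"{elapsed_ms:.3f} ms")
```

Taking the best of several repeats reduces the influence of background activity on the machine, which helps when comparing algorithms that each run in a few milliseconds.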

Methodology
Data Preprocessing
Surface measured parameters have been utilized as input parameters to the different algorithms in this
paper. The data consists of surface measurements during drilling through different formations in shale,
sandstone, and limestone. Different sandstone formations were utilized in the WWSLM to predict ROP.
Some exploratory plots were used to inspect the data. Figure 4 shows ROP plotted against depth for the
sandstone formations in the data set.

Figure 4—Plot of ROP against depth for sandstone formations in the data set, colored by formation

Splitting the data set by individual formation allows a closer look at a specific drilling interval.
Because ROP changes with lithology or formation, an SLM should be re-trained for each formation or
periodically for highly variable rock drilling response. Application of the WWSLM to real-time, closed
loop operations can proceed with ROP predictions a few feet ahead of drilling, with the model being
retrained as often as warranted by desired accuracy. The drilling engineer’s judgement and experience
would be utilized to determine what data would be excluded from analysis. However, all measured data
will have been taken into consideration.
Field Examples
Scalability and Application in Real Time Drilling
The aforementioned algorithms can be used to develop an SLM for real-time, closed-loop drilling
operations, which depend heavily on the computational efficiency of the model. There is always a tradeoff
between accuracy and computation time requirements. While real-time operations need an SLM with low
computational requirements, which excludes non-scalable algorithms, the latter may be quite useful in
planning beforehand. Furthermore, the drilling engineer may, depending on the application, utilize data
taken at different sample rates, for example every second or every foot drilled.
Certainly, efficient data management systems increase the computational efficiency of an SLM. In the
WWSLM used herein, trees took about 20 milliseconds for execution over 800 rows of the Tyler data set.
Boosting took 10 milliseconds and the random forests algorithm took about 20 milliseconds. Pruning the
tree took longer due to cross validation, at about 50 milliseconds. Even on the laptop employed here (16
GB RAM and a 2.2 GHz processor), the algorithms are computationally efficient and can be
applied to real time situations.

Using Trees, Bagging and Random Forests for Prediction


The aforementioned algorithms were applied to the field data to check accuracy
and predictive ability in real-time situations. In the Tyler formation, trees produced an RMSE of 35, which
is too high. The accuracy of trees can be improved by pruning. Cross validation was used to
determine the “optimum” length of the tree, and the tree was then grown to that length. The RMSE in
this case remained around 35. Boosting was then performed, in which 5,000 trees were grown to a length of 10
branches and averaged. This reduced the RMSE by 1 to a value of 34. While these methods are powerful,
better methods are needed for real time operations. Random forests on the same formation produced an
RMSE of 7.4, a much lower error. This was achieved by splitting the data, training the model on the first 80%, and
evaluating the error on the remainder of the set.
The same procedure was used for all of the sandstone formations to determine whether random forests
would predict accurately across all formations. The RMSE results are summarized in Figure 5;
random forests produce the lowest RMSE. This is true for all formations that have been evaluated in the
data set. The plot has been restricted to show RMSE less than 50. Formations with low RMSE can be
implemented in real-time automation systems as described by Wallace, Hegde and Gray (2015). Figure
6 is a box plot depicting the RMSE for all methods applied in this paper. It reaffirms the use of random
forests for prediction of ROP, with a mean RMSE about one third of that of the other methods (single trees,
bagging, and pruned trees).

Figure 5—Comparison of RMSE for different prediction methods across different formations

Figure 6—Box plot of RMSE using different methods for ROP prediction in sandstone formations

Bagging and random forests can be used for an additional purpose. Since both these methods employ
bootstrapping, the roughly 37% of observations left out of each bootstrap sample serve as a test set to evaluate OOB error, as explained
previously. The models can also be used to evaluate the relative importance of input variables to the
outcome of prediction. In this case, since only surface parameters have been used, the input variables can
be tested for relative contribution to the model. Figure 7 illustrates one such plot, which shows
the importance of the different input predictor variables. The plot shows that ‘X7’ and ‘X3’ are the two most
important variables. Here the ‘X’s represent input parameters to the model, all of which were surface
measurements taken during drilling. The parameters with the most influence on ROP would then be the
ones targeted for adjustment. Thus the WWSLM can identify the most important variables among those under
control of the driller or engineer on site.
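The 37% figure quoted for the out-of-bag share follows from the bootstrap itself: when n rows are drawn with replacement from n rows, the chance that a given row is never drawn is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The short Python check below compares this expression with a simulation; the function names are hypothetical, and n = 800 is borrowed from the Tyler data set size mentioned earlier.

```python
import math
import random


def expected_oob_fraction(n):
    """Probability that a given row is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n


def simulated_oob_fraction(n, trials=200, seed=0):
    """Average share of rows left out over repeated bootstrap samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        drawn = {rng.randrange(n) for _ in range(n)}  # distinct rows in this sample
        total += (n - len(drawn)) / n
    return total / trials


print(round(expected_oob_fraction(800), 4))
print(round(simulated_oob_fraction(800), 4))
print(round(math.exp(-1.0), 4))
```

All three values should agree to about two decimal places, which is why roughly one third of the data is available to each tree as unseen test observations.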

Figure 7—Relative Importance of Input Variables using Boosting

Prediction of ROP and Drilling Optimization


Future Wider Windows statistical learning models will include Mechanical Specific Energy (MSE) and
torque and drag predictions applicable in closed-loop, automated systems. Prediction of ROP, MSE, and
torque and drag simultaneously would be of considerable utility. Set points, thresholds, etc. would provide
for ‘sweet spot’ determinations. Modeling the inverse problem can provide valuable insight into factors
influential in drilling efficiency. Simultaneous prediction of multiple drilling parameters will be useful in
either forward or inverse modeling because the several parameters reflect different physical phenomena.
Conclusions
From results of this field study, prediction of ROP during real-time operations is possible using statistical
learning methods, especially random forests. Statistical learning methods provide predictive capabilities,
whereas traditional drilling models do not. The WWSLM is adaptive to changes in formation, drill string
and BHA design, and operations, whereas traditional models are not. Such flexibility is applicable
to conventional or unconventional wells of any trajectory: horizontal, vertical, or inclined. In many
situations, excellent accuracy can be achieved with only surface measurements.
Random Forests provide the best accuracy for the data used in this paper, which is of consequence in real-time,
closed-loop applications. While only limited, specific surface measurements were used here, increasing
the number and data volume of input parameters would improve prediction accuracy and inferential
capability. Random forests and bagging can also be utilized to determine the relative importance of input
capability. Random forests and bagging can also be utilized to determine the relative importance of input
parameters, providing both strategic and tactical information for drilling parameter predictions and
on-the-fly rig floor changes towards optimization.

Acknowledgements
The authors would like to thank the sponsors of the Wider Windows Industrial Affiliate Program: British
Petroleum, Chevron, ConocoPhillips, Marathon, National Oilwell Varco, Occidental Oil and Gas, and
Shell. Special thanks are due to Marathon Oil Company for providing field data.

Nomenclature
BHA = Bottom-Hole Assembly
NPT = Non-Productive Time
MSE = Mechanical Specific Energy
OOB = Out of Bag Error
RMSE = Root Mean Squared Error
RF = Random Forests
ROP = Rate of Penetration, ft/hr
RPM = Rotations per Minute
SLM = Statistical Learning Method
WOB = Weight on Bit, klbf

References
Bingham, M. G. 1965. A New Approach to Interpreting Rock Drillability. Oil & Gas Journal.
Bourgoyne Jr., A. T. and Young Jr., F. S. 1974. A multiple regression approach to optimal drilling and
abnormal pressure detection. Society of Petroleum Engineers Journal 14(04): 371–384.
Dunlop, J., Isangulov, R., Aldred, W., Arismendi Sanchez, H., Sanchez Flores, J. L., Alarcon
Herdoiza, J., Belaskie, J., Luppens, J. C. 2011. Increased Rate of Penetration Through Automation.
SPE/IADC 139897.
Efron, B. and Tibshirani, R. 1994. An Introduction to the Bootstrap. New York: Chapman & Hall.
Hareland, G. and Rampersad, P. R. 1994. Drag-bit model including wear. SPE Latin America/
Caribbean Petroleum Engineering Conference. Society of Petroleum Engineers.
Hegde, C. M., Wallace, S. P., and Gray, K. E. 2015. Use of Regression and Bootstrapping in Drilling
Inference and Prediction. Presented at SPE Middle East Intelligent Oil & Gas Conference &
Exhibition, Abu Dhabi, United Arab Emirates, 15–16 September. SPE-176791.
Jahanbakhshi, R., Keshavarzi, R., and Jafarnezhad, A. 2012. Real-time prediction of rate of penetration
during drilling operation in oil and gas wells. 46th US Rock Mechanics/Geomechanics Symposium.
American Rock Mechanics Association.
James, G., Witten, D., Hastie, T., and Tibshirani, R. 2013. An Introduction to Statistical Learning. New
York: Springer.
Motahari, H. R., Hareland, G., and James, J. A. 2010. Improved drilling efficiency technique using integrated
PDM and PDC bit parameters. Journal of Canadian Petroleum Technology 49(10): 45–52.
Wallace, S. P., Hegde, C. M., and Gray, K. E. 2015. System for Real Time Drilling Performance Optimization and
Automation Based on Statistical Learning Methods. Presented at SPE Middle East Intelligent Oil
& Gas Conference & Exhibition, Abu Dhabi, United Arab Emirates, 15–16 September.
SPE-176804.
