Lost Circulation Prediction Using Decision Tree, Random Forest, and Extra Trees Algorithms for an Iraqi Oil Field

Abstract


Introduction
Drilling mud is one of the most important elements of any drilling operation. It builds a mud cake on the areas of contact with the drilled formation that filters the drilling fluid and prevents its penetration into the formation; it also cools and lubricates the bit, removes the drilled cuttings from the hole, and plays an important role in controlling formation pressure and preventing kicks or blowouts (Rabia, 2002). In most cases, the volume of drilling fluid returning to the surface through the annulus is not equal to the volume injected into the well; this difference between the two volumes is called lost circulation, and it ranges from seepage to partial to total losses depending on the loss rate (Alsaba et al., 2017). Drilling fluid losses usually occur in highly permeable, naturally fractured, induced-fractured, and cavernous formations (Datwani, 2012). This problem is one of the costliest in the oil industry: treating it costs about $2 billion annually, and it accounts for approximately 12 percent of non-productive time globally and 46 percent of non-productive time in the Rumaila oil field (Arshad et al., 2015).
The first step in dealing with drilling fluid losses is to determine the type of losses: for seepage or partial losses, a Lost Circulation Materials (LCM) slurry is a good choice, while for severe or total losses a cement squeeze is suitable (Messenger and McNiel, 1952). The cost of treating drilling fluid losses comes from the materials used in addition to the cost of non-productive time; for example, cement, diesel oil bentonite, and diesel oil bentonite cement plugs need about 18, 10, and 12 hours and cost about 27, 15, and 18 thousand USD, respectively (Al-Hameedi et al., 2018).
Many factors cause the loss of drilling fluid, including (a) the types of formations; (b) hole conditions such as homogeneous impermeable walls, well irregularities, intrinsic fractures, permeable zones, and closed hydraulics; and (c) excessive pressure, which is affected by drilling mud weight, flow properties, filtrate rate, high circulating rate, hole enlargements, and surging of pumps (Howard and Scott, 1951). Controlling these factors to prevent the loss of drilling fluid is a very difficult task, so a smart model is needed to predict whether losses will occur, as well as their type, based on these factors; the controllable factors can then be adjusted to prevent or reduce the loss of drilling fluid (Toreifi et al., 2014).
Machine learning is one of the most important techniques for solving complex problems by revealing patterns and complex relationships between the causes of a problem and its outcome (Abbas et al., 2018). Many smart models have been used to predict various problems in the drilling of oil wells, foremost among them the loss of drilling fluid. Most of these models were built from drilling fluid properties and drilling operation parameters, and they showed the ability to predict drilling fluid loss events. Most were neural network models (Moazzeni et al., 2012; Abbas et al., 2018) and hybrid models (Toreifi et al., 2014), which were distinguished by complex construction and a high ability to predict drilling fluid losses. Tree-based models, however, have not received the same attention from researchers for predicting the loss of drilling fluid. In this study, three tree-based models were developed to predict lost circulation events in the Rumaila oil field.
The Rumaila field is considered one of the largest fields in the world and is located in southern Iraq. Fig. 1 shows the stratigraphic column and geological formations of the Rumaila oil field. Wells drilled in this field are exposed to a significant risk of losing drilling fluid, especially when they pass through the Dibdibba, Dammam, Umm Er Radhuma, Tayarat, and Hartha formations.

Gathering the Data
This step is the most important in building a smart model, as many models have failed because the data used were insufficient. In the oil industry it is difficult to obtain data, and what exists is scarce because of oil company policy and property rights. Before starting to build the model, it is therefore necessary to ensure that sufficient data are available to complete the work. The data used in this paper were gathered from 75 wells drilled in the Rumaila oilfield in southern Iraq that suffered from lost circulation problems, and include flow rate, rate of penetration, equivalent circulating density, yield point, plastic viscosity, and losses rate. Table 1 depicts the maximum and minimum values of these parameters.
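As a minimal sketch, the data can be loaded and summarized with pandas. The file name and column names below are hypothetical placeholders, not the study's actual dataset:

```python
# A minimal sketch of the data-gathering step, assuming a CSV file
# "rumaila_wells.csv" with one row per recorded interval. The file name
# and column names are hypothetical.
import pandas as pd

df = pd.read_csv("rumaila_wells.csv")  # flow_rate, rop, ecd, yp, pv, losses_rate

print(df.shape)                           # how many records are available
print(df.isna().sum())                    # check for missing values
print(df.describe().loc[["min", "max"]])  # min/max per parameter, cf. Table 1
```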

Data Processing
After the data gathering and integration process is done, an important next step is to clean the data. Many models fail or perform poorly when this step is neglected, and in practice a large share of the effort goes into processing and filtering the data before applying machine learning algorithms. The following are the steps used to clean up the data:

Data visualization
The first step in cleaning the data is viewing it. Bad or wrong data make the model easy to break and produce unexpected, incorrect results. There are many visualization packages for the Python programming language, such as Matplotlib, Seaborn, and Plotly. The purpose of data visualization is to understand how the data are distributed. Some commonly used graphics are scatter plots (Fig. 2) and distribution plots (Fig. 3).
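A minimal visualization sketch with Matplotlib and Seaborn is shown below; it assumes the DataFrame `df` and the hypothetical column names from the loading sketch:

```python
# Scatter plots of each input feature against the losses rate (cf. Fig. 2)
# and a distribution plot of one feature (cf. Fig. 3).
import matplotlib.pyplot as plt
import seaborn as sns

features = ["flow_rate", "rop", "ecd", "yp", "pv"]

fig, axes = plt.subplots(1, len(features), figsize=(18, 3))
for ax, col in zip(axes, features):
    ax.scatter(df[col], df["losses_rate"], s=8)
    ax.set_xlabel(col)
axes[0].set_ylabel("losses_rate")

sns.histplot(df["ecd"], kde=True)  # distribution of a single feature
plt.show()
```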

Outlier or noise detection (smoothing the data)
The purpose of displaying the data is to find the outlier points. By displaying the data we can also see the pattern or range of the data distribution, which makes it easy to extract unfamiliar points that may result from recording mistakes or other causes, and thus to remove these points before proceeding to build the intelligent model. The box plot is one way to extract unfamiliar points (Fig. 4). Outlier points were found in ECD, YP, and ROP; after removing these points, as shown in Fig. 5, the data were ready for the next step. Table 2 shows a summary of the data after extracting unfamiliar points.
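The paper does not state the exact removal rule, but a common box-plot criterion treats points beyond 1.5 times the interquartile range (IQR) as outliers; a minimal sketch under that assumption, reusing the DataFrame `df` from earlier:

```python
# Box-plot (IQR) based outlier removal. Points outside k * IQR beyond the
# first and third quartiles are treated as unfamiliar points and dropped.
import pandas as pd

def remove_iqr_outliers(frame, columns, k=1.5):
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]

df_clean = remove_iqr_outliers(df, ["ecd", "yp", "rop"])
print(len(df), "->", len(df_clean))  # records before/after, cf. Table 2
```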

Collinearity
One important check is to test collinearity between the features: high collinearity may confuse the model, because inputs with high collinearity give the model the same information about the output. Collinearity was tested with a Pearson correlation coefficient heat map (Fig. 6). The Pearson correlation coefficient ranges from -1 to +1, where a negative value means an inverse relationship and a positive value a direct relationship. A high positive value represents high collinearity between the features.
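A minimal sketch of the heat map, assuming the cleaned DataFrame `df_clean` from the outlier step:

```python
# Pearson correlation coefficient heat map over the cleaned data (cf. Fig. 6).
import matplotlib.pyplot as plt
import seaborn as sns

corr = df_clean.corr(method="pearson")  # values range from -1 to +1
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Pearson correlation coefficients")
plt.show()
```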

Features Ranking and Selection
The next important step is feature ranking followed by feature selection, and here applied experience and domain knowledge play an important role in the success of the model. Without such experience, it is difficult to understand and explain the relationships between the variables themselves and between the inputs and outputs. To determine the effect of the features on the model, and to decide which features to keep or drop, a feature ranking must be done with respect to the desired output, for example using the decision tree, random forest, and extra trees models (Fig. 7).
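A minimal ranking sketch with the three tree-based models is shown below; it assumes `df_clean` and the hypothetical column names used earlier, and the hyperparameters are scikit-learn defaults rather than the study's settings:

```python
# Feature ranking via the impurity-based feature_importances_ attribute
# of the three tree-based models (cf. Fig. 7).
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

features = ["flow_rate", "rop", "ecd", "yp", "pv"]
X, y = df_clean[features].values, df_clean["losses_rate"].values

for model in (DecisionTreeRegressor(random_state=0),
              RandomForestRegressor(random_state=0),
              ExtraTreesRegressor(random_state=0)):
    model.fit(X, y)
    ranking = sorted(zip(features, model.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
    print(type(model).__name__, ranking)
```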

Feature Scaling
To ensure that the algorithm is not biased toward features with large values (whether inputs or outputs), the data must be scaled. In general, scaling increases the speed of model training and reduces resource use.
Here, it is necessary to specify which models require their data to be scaled:
• Distance-based models, such as artificial neural networks, K-nearest neighbors, support vector machines, and K-means clustering, use distances between data points to determine their similarity, which means they are affected by the magnitude of the data; so scaling must be done (see the sketch after this list).
• Tree-based models, such as decision tree, random forest, and gradient boosting, split a node on a feature so as to increase the homogeneity of that node, and the split on one feature is not influenced by the other features; so it is not necessary to scale the data.
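For contrast, a minimal scaling sketch for a distance-based model is shown below, assuming the arrays X and y from the ranking sketch; the tree-based models in this study were trained on unscaled data:

```python
# Standardization for a distance-based model (here K-nearest neighbors).
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # zero mean, unit variance per feature

knn = KNeighborsRegressor(n_neighbors=5).fit(X_scaled, y)
```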

Cross-Validation (Evaluating Estimator Performance)
Using the same data for training and testing the model is a methodological error. Such a model is tested on the same data it was trained on, which yields a high score in both training and testing but fails to predict when new data are used; this is called overfitting (Fig. 8).
To avoid this, one set of data must be isolated to train the model and another to test it, following one of several cross-validation schemes (the hold-out method, the K-fold method, etc.). In this paper, the hold-out method with different training/testing ratios was used to build the model and obtain the best accuracy.
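A minimal hold-out sketch, assuming the arrays X and y from earlier; the ratios shown are illustrative, not necessarily those used in the study:

```python
# Hold-out validation with several training/testing ratios.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor

for test_size in (0.2, 0.25, 0.3):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42)
    model = ExtraTreesRegressor(random_state=0).fit(X_train, y_train)
    print(f"test_size={test_size}: R2 = {model.score(X_test, y_test):.4f}")
```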

Decision Tree (DT)
Decision tree learning is one of the most widely used and practical supervised ML algorithms, applied in many fields from diagnosing medical cases to assessing the credit risk of loan applicants (Mitchell, 1997). DT is used for classification and regression. A DT divides the data into sub-trees, which in turn are divided into further sub-trees. The node at the top, the most influential on the results, is called the "root node"; the intermediate nodes that split into one or more branches are called "decision nodes"; and the "terminal nodes", also referred to as leaf nodes, are the lowest nodes and do not split any more (Fig. 9).

Fig. 9. Decision tree illustration (Belyadi and Haghighat, 2021)
There are many decision tree algorithms; for example, Iterative Dichotomiser 3 (ID3), developed in 1986, is one of the most widely used (Quinlan, 1986).

Attribute selection technique
The process of choosing which feature goes at the top of the tree and which further down is a difficult choice and a real challenge. The following steps explain how to build a regression decision tree (scikit-learn, 2022); a minimal sketch of the threshold search follows this list:
(a) Calculate the threshold points, each of which is the average between two consecutive points of the input data (Fig. 10a).
(b) Calculate the average of the target values of the points located to the left and to the right of the threshold point (Fig. 10b).
(c) Calculate the Sum of Squared Residuals (SSR), i.e., the squared differences between the actual target points and the corresponding left/right averages, summed over all points.
(d) Repeat steps (a) to (c) and choose the threshold point with the lowest SSR to be the "root node" (Fig. 10c).
(e) Repeat steps (a) to (d) for the points located to the left and to the right of the "root node" separately.
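A minimal NumPy sketch of the root-node search for a single input feature, using hypothetical toy data:

```python
# For each candidate threshold (the midpoint between consecutive sorted
# input values), split the target into left/right groups, compute the SSR
# around each group's average, and keep the threshold with the lowest SSR.
import numpy as np

def best_threshold(x, y):
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_ssr = None, np.inf
    for t in (x[:-1] + x[1:]) / 2:                  # step (a): midpoints
        left, right = y[x <= t], y[x > t]           # step (b): two groups
        ssr = ((left - left.mean()) ** 2).sum() \
            + ((right - right.mean()) ** 2).sum()   # step (c): SSR
        if ssr < best_ssr:                          # step (d): keep lowest
            best_t, best_ssr = t, ssr
    return best_t, best_ssr

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0])
print(best_threshold(x, y))  # threshold 6.5 separates the two groups
```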
By these steps, the "decision nodes" and "terminal nodes" are located (Fig. 10d) (Josh, 2019).

Random Forest (RF)
Random forest is a supervised machine learning algorithm used for classification and regression problems. A random forest combines several decision trees into a single model. It is more accurate than a single decision tree, especially in prediction, because of the increase in knowledge resulting from the larger number of predictions.

Fig. 10. Regression Tree technique
Fig. 11 shows a decision tree versus a random forest (Belyadi and Haghighat, 2021). The steps for building a random forest are as follows; a minimal sketch follows this list:
(a) Bootstrap aggregation: this technique first appeared in 1994 with Leo Breiman (Breiman, 1994). The decision trees are trained on randomly sampled subsets of the data, with sampling done with replacement; the idea behind bootstrapping is that an input can be used more than once in a single decision tree (Efron, 1979).
(b) Build a tree considering a random subset of the variables at each step.
(c) Repeat steps (a) and (b) to build a wide variety of decision trees; this variety makes the random forest more effective than a single decision tree.
(d) In regression problems, take the average of the decision trees' outputs as the final prediction.
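A minimal random forest sketch, assuming the hold-out arrays from the cross-validation step; the hyperparameter values are illustrative:

```python
# Random forest regression; bootstrap=True (the scikit-learn default)
# enables the bagging step described above.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=200, bootstrap=True, random_state=0)
rf.fit(X_train, y_train)
print("RF test R2:", rf.score(X_test, y_test))
```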

Extra trees (extremely randomized trees)
Extra Trees is a supervised machine learning algorithm similar to random forest and is used to solve classification and regression problems. The main difference from the random forest is that Extra Trees uses all of the original data, meaning that it does not use bootstrap aggregation; in addition, it chooses candidate split thresholds at random rather than searching for the optimal split. Fig. 12 shows a visual representation of the Extra Trees model.
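A minimal extra trees sketch for comparison, again assuming the hold-out arrays; in scikit-learn, bootstrap=False is the default for ExtraTreesRegressor, so each tree sees the whole original dataset:

```python
# Extra trees regression; no bootstrap sampling, in contrast to the
# random forest sketch above. Hyperparameter values are illustrative.
from sklearn.ensemble import ExtraTreesRegressor

et = ExtraTreesRegressor(n_estimators=200, bootstrap=False, random_state=0)
et.fit(X_train, y_train)
print("ET test R2:", et.score(X_test, y_test))
```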

Model Evaluation
All models must be evaluated, before being applied, to find out whether they have trained well and to avoid obtaining unexpected results. Evaluation of regression problems consists of two parts:
• Accuracy score: this represents the accuracy of the model and must be computed for training, testing, and implementation. The most common regression accuracy metric is the squared linear correlation coefficient (R²), as shown in Eq. (1):

$$R^2 = 1 - \frac{\sum_{i}\left(f(x_i) - y_i\right)^2}{\sum_{i}\left(f(x_i) - \bar{y}\right)^2} \tag{1}$$

• Error score: this represents the difference between the model output and the actual output. The most common regression error metric is the Mean Squared Error (MSE), as shown in Eq. (2):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(f(x_i) - y_i\right)^2 \tag{2}$$

where f(x_i) represents the actual losses, y_i represents the predicted value of losses, and ȳ represents the average value of losses.
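A minimal sketch of computing both metrics with scikit-learn, assuming the fitted model `et` and the hold-out arrays from the earlier sketches:

```python
# R2 (Eq. 1) and MSE (Eq. 2) on the held-out test set.
from sklearn.metrics import r2_score, mean_squared_error

y_pred = et.predict(X_test)
print("R2 :", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
```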

Hyperparameter selection (fine-tuning and optimization)
A supervised ML model has many hyperparameters that determine how it behaves during training. Choosing the best values puts the model in its best shape and thus yields an accurate model. Selecting the best parameters is a real and difficult challenge; to complete this task, we rely on grid search, which loops over the parameter values supplied by the user and checks the model's accuracy for each combination, so that the best parameters can be found.
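A minimal grid search sketch over a few extra trees hyperparameters, assuming the hold-out arrays from earlier; the parameter grid is illustrative, not the grid used in the study:

```python
# Exhaustive grid search with cross-validated R2 scoring.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesRegressor

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}
search = GridSearchCV(ExtraTreesRegressor(random_state=0),
                      param_grid, scoring="r2", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 4))
```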

Results and Discussion
Testing the three models showed that the ET model gave the best result, with a testing R² of 0.9245 and an MSE of 0.0059. Figs. 13 and 14 and Table 4 show the training, testing, and implementation accuracy of each model in addition to the MSE. All models showed good accuracy in training and testing, but what matters most is the implementation, i.e., using the model to predict lost circulation events on new data it has not seen before. The Extra Trees model showed a high ability to predict new lost circulation events with high accuracy.

Implementing the Trained Model
After the training process is completed, it is time for the real test: applying new data that the model has not seen before and measuring its accuracy in predicting drilling fluid losses. For this purpose, the data of 9 wells that suffered drilling fluid losses of various kinds, from seepage to total losses, were set aside. Fig. 15 shows the real loss values and the values predicted by the models: the black points represent the real losses of the 9 wells, the red points the losses predicted by the decision tree model, the green points those predicted by the random forest model, and the blue points those predicted by the extra trees model. In general, the results of all models were good.

Feature Importance Ranking
According to all three models, the most important factor affecting the loss of drilling fluid was ECD (Fig. 16). This factor is related to the drilling fluid pressure, the so-called hydrostatic pressure: if this pressure increases and exceeds the formation pressure, it will fracture relatively weak formations and drive the drilling fluid into them.

Conclusions
After implementing the three models to predict lost circulation events in the Rumaila oilfield, the extra trees model produced the best prediction result, with an R² equal to 0.9681. The results demonstrated the three models' capacity to anticipate various sorts of losses, ranging from seepage to severe losses, with high accuracy. Due to a shortage of training data, the R² of testing is lower than the R² of training; to address this issue, the models need to be trained on more data. The majority of the validation data are reasonably similar to the training data, which makes them simpler for the models to identify, and this is another factor contributing to the high validation R².
The most important parameter influencing the drilling fluid loss process was the equivalent circulating density (ECD), which means that this problem can potentially be prevented or reduced by controlling the value of this most important causative factor.

Fig. 2. Scatter plots showing the relationship between the input features and the losses rate before extracting unfamiliar points
Fig. 3. Distribution plots of the input features

Fig. 5. Scatter plots showing the relationship between the input features and the losses rate after extracting unfamiliar points

Fig. 13. Accuracy score of all three models

Fig. 15. Comparison of the estimated losses of each model and the real losses

Fig. 16. Feature importance score using the three models

Table 1. Summary of the data used in modeling before extracting unfamiliar points

Table 2. Summary of the data after extracting unfamiliar points

Table 4. Summary of the performance of all three models