Sand Dunes Spectral Index Determination Using Machine Learning Model: Case Study of Baiji Sand Dunes Field Northern Iraq

Abstract


Introduction
Creating a new spectral index based on Artificial Intelligence (AI) technologies has been the subject of recent research (Shi et al., 2020). Machine learning, a branch of AI, is the study of algorithms and statistical models that computer systems use to perform tasks automatically, and its methods are used in many everyday applications (Mahesh, 2020). Two subtypes of machine learning approaches have been identified: supervised and unsupervised methods (Halder et al., 2011; Wu et al., 2019). Multiple classifiers derived from statistical techniques, for example, cluster analysis (Haby et al., 2010), decision tree (Reis and Taşdemir, 2011), maximum likelihood classifier (Ke et al., 2010), Artificial Neural Networks (ANN) (Gomez et al., 2010), Support Vector Machine (SVM) (Ferreira et al., 2016), Random Forest (RF) (Nogueira et al., 2017), and expert rule-based approaches (Dalponte et al., 2012), have been used in remote sensing applications to classify images effectively and efficiently (Bayatvarkeshi et al., 2021; Maxwell et al., 2018). Machine learning algorithms have been used to solve a variety of data classification problems (Khorshid and Abdulazeez, 2021; Sulaiman, 2020; Zantalis et al., 2019; Zebari et al., 2020). In recent years, machine learning has advanced rapidly and has been used extensively in many scientific fields (Abdullah and Ahmed, 2021; Al-jaboriy et al., 2019; Wei et al., 2018).
Machine learning techniques have become an effective way to model and extract patterns from remote sensing data (Penghui et al., 2020) due to their high computational efficiency, few required variables, and reliable results (Ali et al., 2015; An et al., 2020; Schauberger et al., 2020; Zhang et al., 2018). Several machine learning (ML) models have been developed for modeling heavy metals over the past two decades, with outstanding progress (Yaseen, 2021). Over the same period, the SVM and RF classifiers have brought image classification to the forefront of remote sensing applications (Novillo et al., 2018). Their relatively high classification accuracy places SVM and RF among the most popular machine learning classifiers in the remote sensing domain, and they show results competitive with Convolutional Neural Networks (CNNs) (Sheykhmousa et al., 2020). Currently, non-parametric algorithms such as ANN, SVM, and RF are the most widely used for land cover classification (Heinl et al., 2009; Leinenkugel et al., 2019). They handle multi-dimensional data well and provide satisfactory classification results (Mellor and Boukir, 2017). Many studies have sought the most suitable and accurate algorithm among the available machine learning classifiers for Land Use Land Cover (LULC) mapping (Camargo et al., 2019; Carranza-García et al., 2019; Jamali, 2019; Li et al., 2016; Rogan et al., 2008), and SVM and RF have been reported as superior to other machine learning methods for LULC classification (Ma et al., 2017; Mountrakis et al., 2011). However, sensor characteristics and image-related factors, such as spatial and temporal resolution and the processing software and hardware, also affect the accuracy of LULC classification (Deng et al., 2008; Maroufpoor et al., 2019).
Dunes in Iraq take various forms, depending on the area's geography and geology, but most notably on the prevailing wind direction and the presence of vegetation. Barchan dunes are the primary dune type in Iraq, in addition to sand sheets and nebkhas. Barchan landforms are characterized by a crescent shape or by a collection of continuous dunes with irregular, dome-shaped, and elongated forms, as found in the Baiji study area (Al-Ani, 1979; Al-Taie et al., 2013). Sand can vary widely in its spectral reflectance pattern. The absorption and reflectance features of gypsum sand are essentially identical to those of its gypsum source material, and sand derived from other sources with different mineral compositions has a spectral reflectance curve indicative of its own source material. Other factors affecting the spectral response of sand include the presence or absence of water and organic matter (Wright, 1980).
Several studies have extracted and mapped sand and drifting (aeolian) sand from Landsat imagery. Al-Quraishi (2013) suggested the Normalized Difference Sand Index (NDSI-1), which distinguishes dunes and drifting sandy areas from the rest of the land. NDSI-1 is defined as the ratio of the observed intensities in the short-wave infrared (SWIR2) and red (R) spectral bands of TM images. Pan et al. (2018) developed a second Normalized Difference Sand Index (NDSI-2), which is based on subtracting the high Band 4 value from the low Band 1 value of Landsat 8 OLI; this contrast is helpful in distinguishing sand from other ground features. Sahar et al. (2021) proposed the Normalized Differential Sandy Land Index (NDSLI) to map and extract sandy areas. The NDSLI computes a normalized difference between the SWIR-1 and red bands, exploiting the moisture sensitivity of SWIR-1 and the capacity of the red band to distinguish between land cover categories such as soil, vegetation, and built-up features. Karnieli (1997) developed the Crust Index (CI) by normalizing the difference between the spectral values in the red and blue bands.
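As a minimal sketch of how such band-based indices are evaluated for one pixel, the snippet below computes a generic normalized difference and the Crust Index in the form 1 - (R - B) / (R + B). The function names and reflectance values are illustrative assumptions, and the band order used for any particular published index should be taken from its original paper.

```python
def normalized_difference(band_a, band_b):
    """Return (a - b) / (a + b) for two reflectance values; 0.0 if the sum is zero."""
    total = band_a + band_b
    if total == 0:
        return 0.0
    return (band_a - band_b) / total


def crust_index(red, blue):
    """Crust Index in the form 1 - (R - B) / (R + B)."""
    return 1.0 - normalized_difference(red, blue)


# Hypothetical surface reflectance values for one pixel (not taken from the study).
pixel = {"blue": 0.12, "red": 0.30, "swir1": 0.45}

nd_swir1_red = normalized_difference(pixel["swir1"], pixel["red"])
ci = crust_index(pixel["red"], pixel["blue"])
print(round(nd_swir1_red, 3), round(ci, 3))
```

Guarding against a zero denominator matters in practice because masked or no-data pixels often carry zero reflectance in both bands.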
The objectives of the current study are as follows:
1. Calculate a spectral index to detect sand dunes and drifting sand using Support Vector Machine (SVM) techniques with Landsat TM/OLI and Sentinel-2 imagery, in order to investigate the value of artificial intelligence methods in remote sensing applications.
2. Study the spatial and temporal variations of sand dunes and drifting sand with the proposed spectral index, using imagery from different sensors.

Study Area
The case study area is the Baiji sand dunes field, which is bounded by latitudes 33° 31' 22" to 35° 40' 33" N and longitudes 42° 39' 05" to 45° 13' 00" E and is located about 230 km north of Baghdad and a few kilometers southwest of Baiji city (Fig. 1). There are two types of sand dunes in Iraq: the first contains a high percentage of silt and clay and is found primarily in the middle and south of the country; the second contains a high percentage of sand and is found in the Baiji area. The most common sand dunes in the study area are barchans (Al-Saadi, 1971; Al-Taie et al., 2013).

Materials and Methods
The diagram in Fig. 2 illustrates the steps of the methodology used in this paper. It covers the research materials, the download of satellite images, pre-processing, data processing with SVM in R software, validation, change detection, and the results that address the research problem.

Materials
Landsat-5, Landsat-8, and Sentinel-2 satellite images are used in this research to investigate the dune features of the Baiji area. These satellites have similar or very close band wavelengths. Baiji was chosen for its importance and for its impact on the neighboring wheat and barley farms and residential areas.
The images were obtained from the United States Geological Survey (USGS) website: https://earthexplorer.usgs.gov. Table 1 shows the details of the images used in this study. Landsat-8 is the main source of data for training and testing. Six bands were chosen: three visible bands, the near-infrared (NIR) band, and two short-wave infrared (SWIR) bands. Data for modeling were selected according to LULC class using polygons. Some of these polygons represent sand dunes, while others represent the other features in the image (bare land, built-up, vegetation, and water), giving the total number of pixels shown in Table 2. Fifteen normalized differences (NDs) were calculated from the six bands as input features for a linear SVM. A total of 125,599 pixels were used in the study, divided into 83,733 training pixels and 41,866 testing pixels. The training data were drawn from across the study area, and the number of pixels was obtained by counting the selected pixels of each LULC class, including the dune class. The dune class was then labeled +1 and all other classes -1.
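The count of fifteen NDs follows from taking every unordered pair of the six bands. A minimal sketch of that feature construction and of the +1 / -1 labeling is given below; the band names, the sign convention (first band minus second), the class string, and the reflectance values are assumptions made for illustration.

```python
from itertools import combinations

# Six band names assumed for illustration: three visible, NIR, and two SWIR bands.
BANDS = ["blue", "green", "red", "nir", "swir1", "swir2"]


def nd_features(reflectance):
    """Return a dict of normalized differences for every unordered band pair."""
    features = {}
    for a, b in combinations(BANDS, 2):
        total = reflectance[a] + reflectance[b]
        value = (reflectance[a] - reflectance[b]) / total if total else 0.0
        features[f"ND_{a}_{b}"] = value
    return features


def label(lulc_class):
    """Dune pixels are labeled +1 and every other class -1."""
    return 1 if lulc_class == "sand_dunes" else -1


# Hypothetical reflectance values for one pixel (not taken from the study).
pixel = {"blue": 0.10, "green": 0.15, "red": 0.25,
         "nir": 0.30, "swir1": 0.40, "swir2": 0.35}
features = nd_features(pixel)
print(len(features))  # 15 pairs from 6 bands
```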
ArcGIS 10.8 and ENVI 5.3 were used to prepare the data for this study. Preparing the data for training involves several stages, the essential one being the conversion of raw data from digital numbers (DN) to reflectance. A surface's reflectance is defined as the ratio of reflected to incident radiation. Because materials are identified by their reflectance in particular bands, conversion to reflectance is typically the first correction applied to an image before attempting to detect or identify its components. Landsat-8 OLI data were converted from digital numbers to Bottom of Atmosphere (BoA) reflectance using the ENVI 5.3 software.
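The study performed this conversion in ENVI 5.3 and targeted BoA reflectance. As a simpler illustration of the idea only, the sketch below applies the linear rescaling with sun-angle correction that is commonly used to obtain top-of-atmosphere (TOA) reflectance from Landsat 8 Level-1 digital numbers. It is not the atmospheric correction needed for BoA, and the function name, gain, offset, and sun elevation are assumptions to be replaced by the values in the scene's metadata file.

```python
import math


def dn_to_toa_reflectance(dn, mult, add, sun_elevation_deg):
    """Rescale a digital number to TOA reflectance and correct for sun elevation."""
    rho_uncorrected = mult * dn + add
    return rho_uncorrected / math.sin(math.radians(sun_elevation_deg))


# Hypothetical inputs; real gain, offset, and sun elevation come from scene metadata.
print(dn_to_toa_reflectance(dn=20000, mult=2.0e-5, add=-0.1, sun_elevation_deg=60.0))
```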

Methods
In this study, a linear SVM was used, as implemented in the LibLINEAR package in R software. In its simplest form, SVM is a linear binary classifier, that is, a single dividing line that separates two groups. The linear SVM assumes that multi-dimensional data in the input space are linearly separable. It uses a subset of the training data, known as support vectors, that lie closest in feature space to the ideal decision boundary in order to maximize the separation, or margin. In particular, it identifies the optimal hyperplane (in the simplest case, a line) for classifying the training data into a specified number of classes. Fig. 3 shows an example of a linear SVM (Bazi and Melgani, 2006; Foody and Mathur, 2004; Kuo et al., 2013; Schölkopf et al., 2002).
The ideal hyperplane, or maximum-margin boundary, can be defined both mathematically and geometrically. It is a decision boundary designed to reduce the number of misclassification errors that occur during training (Mountrakis et al., 2011; Sheykhmousa et al., 2019). The samples that are most difficult to classify directly affect the optimal location of the decision boundary (Kuo et al., 2013). Many pairs of hyperplanes can be generated with no sample between them, and the optimum is selected by finding the greatest spacing between them (Huang et al., 2018). The learning process is iterative, gradually developing a classifier with an accurate decision boundary (Wang, 2005).

Optimal separating hyperplane in linear SVM
The Drifting Sand Index (DSI) is extracted from the Optimal Separating Hyperplane (OSH). In the general equation of the DSI, (i) runs over the NDs in the index, (a) represents the coefficients, and (b) is the bias value that moves the decision threshold to zero, so that DSI > 0 indicates a dune pixel and DSI < 0 a non-dune pixel. This section connects DSI and linear SVM through the OSH, which is used to construct DSI from the trained SVM model (Yaseen et al., 2018).
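Written out under the definitions above, and assuming the index is a simple linear combination of n normalized differences, the general form would read as follows. The symbol n and the case notation are introduced here for illustration.

```latex
\mathrm{DSI} = \sum_{i=1}^{n} a_i \,\mathrm{ND}_i + b,
\qquad
\text{pixel} =
\begin{cases}
\text{dune}     & \text{if } \mathrm{DSI} > 0,\\
\text{non-dune} & \text{if } \mathrm{DSI} < 0.
\end{cases}
```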
SVM's binary-oriented nature makes it particularly well suited to dune classification, dividing land cover into dunes (+1) and non-dunes (-1). Compared with neural networks and other machine learning algorithms (MLAs), a linear SVM yields a trained classifier with a simple, explicit mathematical representation, which is why it was chosen for this research. In summary, the SVM seeks the best hyperplane for separating two groups, which can be defined mathematically as follows. In the original topological space (Zhou et al., 2014), each training sample (xi, yi), i = 1, …, n, with xi ∈ R^d and yi ∈ {+1, -1}, is a pair consisting of the attribute vector x and the binary label y. In DSI dune classification, x holds the reflectance values of the spectral bands, y = +1 denotes a dune pixel, y = -1 denotes a non-dune pixel, and w and b denote the hyperplane's coefficients and constant, respectively. Multiple hyperplanes may satisfy the constraints in Equation (2). The best separating hyperplane is found by using the linear SVM to maximize the margin between two support hyperplanes, which bound the two classes of points (Equation (3)). Locating the OSH is therefore equivalent to finding the support hyperplanes with the largest margin (Chang and Lin, 2011; Tao et al., 2021).
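For reference, the standard textbook forms of the separating hyperplane, the separability constraints, and the two support hyperplanes are given below, using the symbols w, b, xi, and yi defined above. These are the usual linear SVM expressions and are assumed, not confirmed, to match the numbered equations of the original paper.

```latex
\begin{align}
  &\mathbf{w}^{\top}\mathbf{x} + b = 0
    && \text{(separating hyperplane)} \\
  &y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1,
    \quad i = 1, \dots, n
    && \text{(separability constraints)} \\
  &\mathbf{w}^{\top}\mathbf{x} + b = +1,
    \qquad \mathbf{w}^{\top}\mathbf{x} + b = -1
    && \text{(support hyperplanes)}
\end{align}
```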

Connecting DSI to OSH
Sand dune and non-dune pixels are commonly not fully separable in a given topological space, so no hyperplane can classify all pixels correctly. In classification, this is known as the soft-margin (or inseparable) case. A penalty term is therefore added to the support hyperplanes, and the cost parameter C sets the penalty weight, which corresponds to the tolerance for misclassification (Equation (5)).
The term "OSH" refers to the pair of support hyperplanes separated by the largest distance (the margin). The distance d between the two support hyperplanes is inversely proportional to ║w║, so maximizing d is equivalent to minimizing ║w║. When the cost parameter is included, the task therefore becomes a quadratic programming problem. The normal vector w of the OSH is determined by the points that lie on the support hyperplanes, which are referred to in SVM as support vectors (Fan et al., 2008).
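In the standard soft-margin formulation, the margin width and the resulting quadratic program take the form below. The slack variables ξi are a symbol introduced here to express the penalty term; the hinge-loss variant is shown as an assumption, since the text does not state which loss was selected in LibLINEAR.

```latex
\begin{align}
  d &= \frac{2}{\lVert \mathbf{w} \rVert} \\
  \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad
    & \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i \\
  \text{subject to} \quad
    & y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1 - \xi_i,
      \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
\end{align}
```

A large C penalizes margin violations heavily and approaches the separable case, while a small C tolerates more misclassified training pixels in exchange for a wider margin.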
where () denotes the values of the support vectors, (ρ) denotes the bias of the model, and (k) is the kernel function that maps points from the original topological space to the feature space for optimal separability. The following steps connect DSI with OSH.
• Training the linear SVM algorithm using labeled sand and non-sand pixels.
• Using the learned linear SVM model's parameters to build OSH from Equation (1).
• Equating the DSI coefficients to the normal vector w, which connects DSI to OSH.
As a result, the suggested DSI formula follows from Equation (1), and its final form is given by Equation (8). This equation is configured by extracting the coefficients once training for the Baiji area is complete, after which the classification accuracy, error rate, and Kappa coefficient of the model are calculated. All these processes were carried out in R software.
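The three steps above can be sketched in a few lines. The study used LibLINEAR in R; here a plain full-batch sub-gradient descent on the hinge loss stands in for that solver so that the example has no dependencies. The function names, hyperparameters, and toy ND vectors are all illustrative assumptions, not values from the study.

```python
def train_linear_svm(samples, labels, cost=10.0, learning_rate=0.001, epochs=5000):
    """Minimize 0.5*||w||^2 + cost * sum(hinge loss) by full-batch sub-gradient descent."""
    n_features = len(samples[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5*||w||^2 term
        grad_b = 0.0
        for x, y in zip(samples, labels):
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            if y * score < 1:  # sample is inside the margin or misclassified
                for j in range(n_features):
                    grad_w[j] -= cost * y * x[j]
                grad_b -= cost * y
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


def dsi(nd_values, coefficients, bias):
    """Evaluate the index as a weighted sum of NDs plus the bias."""
    return sum(a * nd for a, nd in zip(coefficients, nd_values)) + bias


# Toy ND feature vectors (hypothetical, two NDs per pixel): +1 = dune, -1 = non-dune.
X = [[0.30, 0.20], [0.35, 0.25], [0.40, 0.15],
     [-0.20, -0.10], [-0.30, 0.05], [-0.10, -0.25]]
y = [1, 1, 1, -1, -1, -1]

# Step 1: train. Step 2: w and b define the hyperplane. Step 3: reuse them as the index.
w, b = train_linear_svm(X, y)
predicted = [1 if dsi(x, w, b) > 0 else -1 for x in X]
print(predicted == y)
```

Because the trained weights are used directly as index coefficients, the resulting index can be applied to a whole image with simple band arithmetic, without needing the SVM software at prediction time.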

Accuracy assessment
Classification accuracy and error were determined for all classifiers in three steps using ArcGIS software:
• Generating six sets of random points, from 500 to 1000 points in increments of 100.
• Extracting the class type at the random points from the reference image, the linear SVM result image, and the sand indices from previous studies (Table 3).
• Computing the confusion matrix.
Table 3. Sand indices from previous studies. CI: 1 - (R - B) / (R + B), (0 - 0.17) (Karnieli, 1997).
All accuracy assessment parameters were calculated: the confusion matrix, the overall accuracy (OA), and Cohen's kappa coefficient (K). Commission and omission errors are defined relative to the successfully classified pixels, as in Table 4. True positives (TP) and true negatives (TN) are correct classifications. A false positive (FP) occurs when a result is incorrectly predicted as +1 (positive) while the actual value is -1 (negative). A false negative (FN) occurs when a result is erroneously predicted as negative when it is actually positive (Jiang et al., 2019; Keerthi et al., 2008; Lin et al., 2008; Witten et al., 2005; Aggarwal, 2015). Cohen's Kappa is expressed in terms of the conventional two-dimensional confusion matrix used in machine learning and statistics to assess binary classifications (Chicco et al., 2021). The OA and Kappa are calculated from the entries of this matrix. The F-score, defined as the harmonic mean of the user's accuracy (UA) and producer's accuracy (PA), was used to assess the accuracy of the classification results. The PA, UA, omission error (Om.E), commission error (Co.E), and F-score can be calculated using formulas (Ao et al., 2017).
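A minimal sketch of these measures for the binary dune / non-dune case is given below, using the standard definitions of OA, Cohen's kappa, PA, UA, omission and commission error, and F-score. The function name and the counts in the example are assumptions for illustration, not results from the study.

```python
def accuracy_metrics(tp, fp, fn, tn):
    """Compute binary accuracy measures from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    oa = (tp + tn) / total
    # Agreement expected by chance, from the row and column totals.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    pa = tp / (tp + fn)  # producer's accuracy for the dune class
    ua = tp / (tp + fp)  # user's accuracy for the dune class
    return {
        "OA": oa,
        "Kappa": kappa,
        "PA": pa,
        "UA": ua,
        "Om.E": 1 - pa,  # omission error
        "Co.E": 1 - ua,  # commission error
        "F": 2 * ua * pa / (ua + pa),
    }


# Hypothetical counts for 100 check points.
metrics = accuracy_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(name, round(value, 4))
```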

Results
Initially, the index was built from all the features entered for training, so the proposed equation consists of 15 NDs in the first stage. In the second stage, the unnecessary features (NDs) with the least weight are removed, giving a shorter formula that is easier to apply. Table 5 shows all versions of the DSI.
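The second stage amounts to ranking the NDs by the magnitude of their trained weights and keeping the strongest few. A minimal sketch of that ranking step follows; the function name and the weights are placeholders for illustration only and are not the coefficients obtained in the study. Whether the remaining coefficients are re-estimated afterwards is not stated in the text, so the sketch does not retrain.

```python
def prune_features(weights, keep):
    """Keep the `keep` features with the largest absolute weight."""
    ranked = sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)
    return dict(ranked[:keep])


# Placeholder weights; the sign is ignored when ranking, only the magnitude counts.
weights = {"ND12": 0.4, "ND34": -6.1, "ND47": 5.2, "ND57": -3.8, "ND56": 0.9}
short_form = prune_features(weights, keep=3)
print(sorted(short_form))
```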

Accuracy Assessments
To map the sand dunes, the threshold value needs to be optimized. The threshold is optimized according to the bias and checked visually against the reference image. Fig. 4 shows the binary classifications of the suggested indices together with NDSI-1, NDSI-2, CI, and NDSLI, each with its own threshold for separating sand from non-sand. The reference image provides a baseline against which the results of all sand dune indices are compared. The white area represents sand dunes and drifting sand, whereas the black area represents all other features, such as water, vegetation, built-up areas, and bare land. Table 6 compares DSI-C and DSI-R with NDSI-1, NDSI-2, NDSLI, and CI using five measures: OA, Kappa, F-score, and the two errors. Three sample locations (Site1, Site2, and Site3), marked by red circles, were chosen. Each site includes dunes and other features inside the same circle, so that the differences can be enlarged and seen clearly. In the following, the accuracy of the results is evaluated statistically by calculating classifier performance with three measures, Overall Accuracy (OA), Kappa (K), and F-score, and by calculating the classification error with two criteria, commission and omission error. Charts were drawn to illustrate the differences between these measures for all dune indices.
These measures were calculated from a set of randomly distributed points in both classes for the Baiji area. The points were generated at six levels: 500 points at the first level, 600 at the second, and so on up to 1000 points, distributed randomly and equally between the dune class and the other class. After the indices were applied to the Baiji area, the proposed index (DSI-R) clearly outperformed the other indices in distinguishing dunes. Tables 6 and 7 show the OA, Kappa, F-score, and errors of all indices.
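The six-level sampling scheme can be sketched as an equal random draw from each class. The study generated its points in ArcGIS; the snippet below is a stand-in that uses Python's random module, and the function name, seed, and pixel coordinate lists are hypothetical.

```python
import random


def sample_points(dune_pixels, other_pixels, total_points, seed=0):
    """Draw total_points check points, split equally between the two classes."""
    rng = random.Random(seed)
    per_class = total_points // 2
    return rng.sample(dune_pixels, per_class), rng.sample(other_pixels, per_class)


# Hypothetical pixel coordinates for each class in a reference image.
dune_pixels = [(row, col) for row in range(50) for col in range(50)]
other_pixels = [(row, col) for row in range(50, 100) for col in range(50)]

levels = list(range(500, 1001, 100))  # 500, 600, ..., 1000
for total in levels:
    dunes, others = sample_points(dune_pixels, other_pixels, total)
    print(total, len(dunes), len(others))
```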

Validation and Change Detection
Spatial and temporal variations were analyzed with DSI-R to evaluate the performance of the suggested index, which was applied to a set of images acquired from three satellites. The sand cover maps were obtained by applying DSI-R to Landsat 8 and Sentinel-2 images collected on various dates in 2018 and 2021, and to Landsat 5 TM scenes of the study area taken in 2011. These three satellites were chosen because their bands are very similar in wavelength. Table 8 shows the band characteristics of each satellite. Maps of spatial and temporal change were produced for each year of the study. Fig. 8 shows the maps obtained after applying the proposed index to Landsat 5, Landsat 8, and Sentinel-2 images.
Fig. 9 compares the area of sand dune accumulation between 2011 and 2021. Applying DSI-R to images from different sensors showed that sand dune accumulations in Baiji increased from 850.80 km2 to 905.47 km2 over the three years from 2018 to 2021 according to Sentinel-2, and from 1527.0291 km2 to 1648.4679 km2 over the ten years from 2011 to 2021 according to Landsat-5 TM and Landsat-8. The largest area was observed in 2021, which is attributed to significant land degradation caused by global climatic changes, reflected in Iraq as drought and expanding desertification (Awadh et al., 2022; Beg and Al-Sulttani, 2020; Mail et al., 2016).
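The change figures above can be turned into absolute and relative increases with a few lines of arithmetic. The sketch below uses the areas reported in the text; the helper names and the 30 m pixel size in the conversion example are assumptions for illustration.

```python
def pixel_count_to_km2(n_pixels, pixel_size_m):
    """Convert a count of classified pixels to an area in square kilometers."""
    return n_pixels * pixel_size_m ** 2 / 1e6


def area_change(area_start_km2, area_end_km2):
    """Return the absolute change (km2) and the relative change (percent)."""
    change = area_end_km2 - area_start_km2
    return change, 100.0 * change / area_start_km2


# Areas reported in the text for the two sensor groups.
sentinel_change, sentinel_pct = area_change(850.80, 905.47)
landsat_change, landsat_pct = area_change(1527.0291, 1648.4679)
print(round(sentinel_change, 2), round(sentinel_pct, 2))
print(round(landsat_change, 2), round(landsat_pct, 2))
```

Comparing relative change within each sensor group avoids mixing absolute areas derived from sensors with different spatial resolutions.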

Discussion
Several indices for dune features have been developed in previous years. Among the indices reviewed in this study, DSI classifies and isolates sand dunes in satellite images most accurately. It separates sand dunes from other land cover types such as vegetation, water, buildings, and bare land, whereas with some other sand indices different land cover types appear alongside the dunes. The linear equation of the suggested index was constructed with machine learning techniques, and combining these techniques with remote sensing data gave significantly better classification accuracy for satellite images. This study has proposed the DSI to identify and extract sand dunes and drifting sand from Landsat 8 images using three normalized differences (ND34, ND47, and ND57). The study of spatial and temporal variations shows that the DSI works efficiently on similar multispectral sensors: Landsat 5, Sentinel 2, and Landsat 8, on whose data the DSI was modeled. Rasheed and Al-Ramahi (2021) suggested a dunes index that uses two bands (green and SWIR) but did not state whether the SWIR1 or SWIR2 band was meant. The index was therefore applied to the same study region twice. With SWIR1 and green, the results indicated that vegetation had the highest value; with SWIR2 and green, the index performed poorly compared to the four indices evaluated in this study. This index was therefore not included among the indices compared with our proposed index. Among the four indices chosen for comparison, NDSI-1 is the most accurate and efficient and the closest to our proposed index (DSI-R), although it remains less accurate than DSI, as the obtained results show.
A limitation of the DSI is that it may not work as effectively on drifting sand and dune areas outside the Baiji area. Its accuracy may vary when applied to other dune fields because, owing to the computing capabilities required by the training and testing process, it was tested on samples from the Baiji area only. Future studies are recommended to adopt the same methodology with samples from the various dune fields in Iraq. The methodology is easy to apply and needs only a small amount of training data to produce an accurate index. In the future, it could also be applied to more dune areas around the world to extract spectral indices that work with high accuracy for most drifting sand regions.

Conclusions
This research uses Landsat 5, Landsat 8, and Sentinel-2 imagery to map sandy areas accurately. The strength of the methodology lies mainly in using the linear SVM machine learning technique to develop a sand spectral index. The performance of the new index, DSI, which was proposed in this work and used to identify dune accumulations from several multispectral sensors, was satisfactory, demonstrating its high capacity for mapping and monitoring dunes in the study area. Comparisons with reference sites showed that DSI-R performs well in identifying and separating sandy land from vegetation, water areas, and several other kinds of soil, with an average overall accuracy of more than 87% and an average Kappa of more than 74.9%. DSI works efficiently on different sensors: Landsat 5, Landsat 8, and Sentinel 2. It will help monitor changes in the Baiji sand field and manage the risk of sand creeping onto residential areas, roads, farms, and other infrastructure.

Table 2. Number of pixels

Table 6. Evaluation of the accuracy and error of several sand dune indices in the Baiji area. OM.E = Omission Error, CO.E = Commission Error, Ov.A. = Overall Accuracy, and Non = No sand

Table 7. Summary of the average metrics of all indices