Any opinions, findings, conclusions, or recommendations expressed in this dissertation are those of the authors and do not necessarily reflect the views of UKDiss.com.

Dissertation on Automated Loan Approval by Machine Learning

Info: 6825 words (27 pages) Dissertation
Published: 14th May 2021

Implementation of a Decision Support System for Automatic Loan Approval by Analysing Applicants' Credit Payment Behaviour


This paper evaluates the behaviour of credit card holders in Taiwan and estimates consumer creditworthiness by applying several machine learning techniques, namely Logistic Regression, Random Trees, Bayesian Network and Neural Network, to a customer credit dataset. The dataset is extracted from the 'UCI Machine Learning Repository' (Lichman; 2013) and partitioned into training and testing sets for analysis and evaluation.

The objective of this project is to implement a decision support system that can help organizations approve loans automatically by analysing the credit payment history of customers. To minimize credit risk from the banking perspective, the proposed study concentrates on predicting the probability that a customer will default.

Each algorithm selects the important predictor variables used to train its predictive model. To improve model performance, variables other than customer payment status are also taken from the dataset to forecast default customers. The performance of the implemented models is evaluated by comparing the prediction accuracy of each model on both the training and testing datasets. Among the four algorithms, Logistic Regression was observed to have the highest ability to predict default customers.

Keywords: Credit Card, Loan Approval, Machine Learning, Customer Behaviour.

1 Introduction

According to (Yeh and Lien; 2009), credit card holders in Taiwan suffered a major credit card debt crisis in 2006, and the crisis was expected to worsen in the third quarter of that year. To raise market share, banks in Taiwan over-issued credit and offered it to unqualified applicants. Over the same period, customers increasingly used credit cards for personal spending regardless of their capacity to repay, accumulating debt that was high relative to their bank balances. This created a critical economic situation in which both banks and credit card users struggled to maintain a clean cash flow (Yeh and Lien; 2009). A well-organized financial institution focuses more on predicting financial risk than on managing an economic crisis after the fact (Borodzicz; 2005). Financial transactions and customer payment history are the main sources of information for analysing consumer credit payment behaviour and forecasting default customers.

Data mining comprises various methods for exploring data and turning it into meaningful knowledge (Jiawei and Kamber; 2003). In information technology, data mining plays a significant role in identifying trends in data and hidden relationships between its attributes. Machine learning procedures take the revealed patterns as input and can be used for building clusters, classifying data and selecting features (Cios et al.; 1998). According to (Venkatesh and Jacob; 2016), the application of data mining methods in banking is increasing continuously because machine learning algorithms have a greater capability of extracting meaningful insight from data. Machine learning includes various classification algorithms that can be used to segregate data into proposed categories. (Venkatesh and Jacob; 2016) also stated that credit card transaction data in the banking sector is growing daily. In this situation, computer technology plays an important role in helping banks manage credit risk and secure transactions, by applying machine learning methods and building prediction models for credit risk evaluation.

(Clover; 2013) presented a study on developing an auto loan approval system for banks to minimize credit risk and gain profit from customer credit. (Agarwal et al.; 2008; Heitfield and Sabarwal; 2004) claimed that most auto loan approval systems base their decisions on applicants' demographic and application information, but this data does not capture consumer credit behaviour, which is more important for such a system. To overcome this problem, banks added consumer credit data to auto loan approval systems so that default customers could be predicted by analysing customers' credit payment behaviour (Clover; 2013). To manage credit risk in the banking sector, predicting the probability of default for each customer is as important as dividing customers into good and bad categories, because every bank sets its own criteria when offering a credit limit, based on the number of applicants who applied for a loan. According to (Baesens, Setiono, Mues and Vanthienen; 2003), evaluating whether a predicted probability reflects the real probability is a major challenge, because the true probability is unknown. To tackle this challenge, (Clover; 2013) proposed the 'Sorting Smoothing Method' to determine the actual probability of default by estimating the variance between the classification accuracy of the various data mining methods involved. This approach enabled further research on credit risk evaluation and on predicting the probability of default.

1.1 Research Question: How can data mining support the implementation of a decision system for automatic approval of applicants' loans in the banking sector by analysing consumer credit payment behaviour?

1.2 Project Purpose: The objective of this project is to implement prediction models using various machine learning algorithms, evaluate the probability value for both default and non-default customers, and finally identify the algorithm with the highest prediction accuracy.

1.3 Paper Structure: The research paper is divided into six sections. The first section gives a brief introduction to the research project. The second section summarizes the literature on credit risk evaluation models. The third section describes the methodology and the strategy followed to implement the project. The fourth section gives information about the tools and algorithms used for implementation. The fifth section evaluates and compares the results produced by the implemented models. The last section identifies the best prediction model and suggests future work.

2 Related Work

According to (Krichene and Krichene; 2017), research on credit risk assessment advanced after the failure of banks in Asia. Identifying risk in the banking and finance domain is important for acknowledging and reducing future uncertainty for small and medium scale organizations. As stated by (Wu et al.; 2014), Business Intelligence plays an important role in analysing consumer credit data and helps determine the most influential risk parameters. (Kasiyanto; 2016) stated that credit card transaction data is growing significantly due to the rapid expansion of online payment systems. For example, the PayPal online payment system has captured the market globally and had around 170 million customers as of September 2015 (Perez; 2015). In 2009, 2.5 trillion transactions were made across around 200 countries, and card payment transactions worldwide were estimated at around 10000 every second (Source: American Bankers Association, March 2009). In recent years data mining has proven its importance in various areas including consumer behavioural scoring, fraud recognition and risk evaluation. Neural networks have played a significant role in analysing trends in data and revealing complex associations between parameters (Jiawei and Kamber; 2003).

The framework of financial institutions and banking is complicated across the globe and difficult to understand. This complexity creates a barrier to organizational development. Risk is highly uncertain because it depends directly on the economy. This situation has led researchers and analysts to derive predictive models for risk computation. To address it, (Ni; 2010) implemented a component selection model which groups risks of similar characteristics. This technique chose components from the data based on the similarity between parameters and eliminated unwanted data to refine the prediction result. The method works on the concept of filtering and wrapping. The algorithm assigns a grade value to each selected group and checks recursively for the least error value. Credit risk is a major cause of other risks in the banking sector. It is not feasible to eliminate the risk entirely, but it can be reduced to an acceptable level that gives banks some confidence in making safe transactions.

(Shiri et al.; 2012) designed a model for credit risk evaluation and fraud detection, but this model could not determine the intensity of risk, and this gap leaves room for further study of credit risk assessment.

According to (Yu et al.; 2010), various machine learning methods were explored for credit risk assessment, including Decision Tree, Artificial Neural Network and Support Vector Machines. These algorithms were applied and evaluated on German and Australian credit data. In this experiment, the precision and recall values of each algorithm were compared to choose the best prediction model, and the results showed that Support Vector Machines did not give satisfactory outcomes. In data mining, precision is the share of cases predicted as positive that are actually positive, and recall is the share of actual positive cases that the model predicts correctly. (Ghatasheh; 2014) claimed that Support Vector Machines do not produce significant results when the training dataset is small. On the other hand, the Decision Tree method is easy to understand and has a greater capability of predicting outcomes than Support Vector Machines when the training dataset is large. Support Vector Machine is a machine learning technique which analyses data for classification purposes (Cortes and Vapnik; 1995). Decision Tree is also a machine learning technique; it has a tree-like structure containing a root node, internal nodes and leaf nodes. Each internal node evaluates the input data against a condition, and each leaf assigns a category. The aim of the decision tree algorithm is to design a system which forecasts the value of a target variable by analysing the input dataset (Rokach and Maimon; 2014). In addition, (Yu et al.; 2010) proposed a model for credit risk evaluation that combines the Decision Tree and Support Vector Machine techniques.
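As a small illustration of the two measures, the sketch below computes precision and recall under their usual definitions: precision is the fraction of predicted positives that are truly positive, and recall is the fraction of true positives that are found. The label lists and the helper `precision_recall` are hypothetical and are not taken from the cited experiments.

```python
def precision_recall(actual, predicted, positive=1):
    """Return (precision, recall) for the given positive class label."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels: 1 = default, 0 = non-default
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall = precision_recall(actual, predicted)
print(precision, recall)
```

In a credit setting, low recall on the default class means defaulters are being approved, which is usually the costlier error for a bank.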

Naive Bayes is a classification technique derived from Bayes' theorem under the assumption that the predictor variables are independent (Zhang; 2004). It is an efficient and commonly used method for building classification rules. (Freitas; 2014) applied the Naive Bayes algorithm to credit score evaluation and examined its importance in credit risk assessment, which further enabled researchers to estimate credit scores from customers' credit payment behaviour.
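For reference, the independence assumption leads to the usual Naive Bayes decision rule, written out below. The symbols C (the class, for example default or non-default) and x_1, ..., x_n (the predictor values) are notation assumed here and are not taken from the cited works.

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C),
\qquad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```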

The Naive Bayes method was applied to a Kenyan private bank dataset to improve the effectiveness of the classification model and to evaluate its performance (Wagacha; 2002). In this analysis, the use of appropriate attributes produced effective classification results. (Malekipirbazari and Aksakalli; 2015) proposed a Random Forest model for credit risk evaluation, but this model did not predict customers' payment behaviour. Although the Random Forest algorithm performs well in classifying data, the implemented model placed some good customers in the bad category and vice versa. Random Forest is a machine learning algorithm used for classification. It builds decision trees from the training dataset and assigns the target variable to a category based on the decision rules of each tree (Liaw et al.; 2002a).

(Vallini et al.; 2009) applied Multiple Discriminant Analysis (MDA) and Artificial Neural Network (ANN) methods to a dataset of Italian organizations to forecast the possibility of risk for small and medium scale organizations. Both MDA and ANN are data classification techniques in data mining. The prediction accuracy of the two techniques was 65.9% and 68.4% respectively, which was not high enough to deploy either model for risk computation. Numerous approaches have been considered for forecasting credit risk, but their accuracy measurements do not adequately explain their limitations. As stated by (Miguéis et al.; 2013), despite extensive analysis of credit risk evaluation, there is no agreement on the most suitable classification method. (Baesens, Van Gestel, Viaene, Stepanova, Suykens and Vanthienen; 2003) found that conflicts can arise when comparing the results of various methods. This situation has led researchers to continue investigating credit risk assessment. This paper follows the approach of (Clover; 2013) by incorporating the machine learning methods examined in the literature review to implement a prediction model for customer behavioural analysis.

3 Methodology

3.1 Selection of Implementation Strategy

The research project focused on implementing a model to predict customer behaviour by analysing individual credit transaction history in the banking sector. Various methodologies were reviewed for developing the predictive model. After the scope of the project was understood, the 'Cross Industry Standard Process for Data Mining', generally known by its short form CRISP-DM (Shearer; 2000), was chosen as the implementation strategy. This process model is widely used by data mining professionals to structure their projects.

A survey was undertaken to decide the best model for implementing the data mining process, and CRISP-DM received the most votes (Piatetsky-Shapiro; 2014). The diagram below represents the flow of the research project implementation.

Figure 1: CRISP-DM Model. (Image Source: Wikipedia)

As shown in the above diagram, the process is divided into six stages. The flow of the model is iterative, so that necessary changes can be made at any stage whenever required.

3.2 Problem Identification and Data Acquisition

Problem understanding is the first step of this project and an important stage for defining its aim. The objective of this project is to protect banks from financial loss by evaluating customers' credit payment history and predicting default customers from that analysis. In line with the project statement, multiple datasets were examined to decide on the most appropriate one for the proposed study. The required data had to contain enough demographic information about each customer and at least 6 months of credit payment history. The other candidate datasets were small compared with the one used for this study.

Among several sources, the credit dataset of a financial institution in Taiwan was selected for this research project. The data is extracted from the 'UCI Machine Learning Repository' (Lichman; 2013) and holds 30000 records and 25 variables.

3.3 Data Preparation

Preparing an accurate dataset is a very important stage in the data analysis process, because analysing the wrong data leads down an incorrect path and ultimately produces erroneous output. Preparing quality data for analysis is therefore an important task (Pyle; 1999). For this project, considering the size and number of attributes of the dataset, RStudio, SPSS Modeler and SPSS Statistics were used for data pre-processing, analysis and model building. RStudio was used to check the data for missing and duplicate values. SPSS Statistics was used to encode the variable names; each variable was encoded to a specific value for ease of use. SPSS Modeler was used to define the data type of the variables and to cross-check missing values. The data was confirmed to be in the correct format before any technique was applied. Below is the graphical output generated from RStudio to check for missing values; the graph shows that the data does not contain any.

Figure 2: Graph to check missing values.
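The missing-value and duplicate checks described above were carried out in RStudio. The following Python sketch shows the same kind of audit on a toy sample; the record layout, the column names and the helper `audit_records` are hypothetical and are not taken from the project files.

```python
def audit_records(records):
    """Count missing fields and exact duplicate rows in a list of dict records."""
    missing = 0
    duplicates = 0
    seen = set()
    for row in records:
        # A field counts as missing when it is None or an empty string
        missing += sum(1 for value in row.values() if value is None or value == "")
        key = tuple(sorted(row.items()))  # order-independent fingerprint of the row
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return missing, duplicates

# Hypothetical toy records; column names are illustrative only
records = [
    {"ID": 1, "AGE": 34, "LIMIT_BAL": 20000},
    {"ID": 2, "AGE": None, "LIMIT_BAL": 120000},
    {"ID": 1, "AGE": 34, "LIMIT_BAL": 20000},
]
missing, duplicates = audit_records(records)
print(missing, duplicates)
```

A result of zero for both counts would correspond to the outcome reported for the credit dataset.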

3.4 Outlier Detection

After the check for missing values, an outlier detection test was performed on the input data. In data analytics, outlier detection helps to recognize data entries which differ from the general observed values (Maddala and Lahiri; 1992). For example, a value such as 200 should not appear in the age column. Here, a scatter plot in SPSS Statistics was used to detect outliers, and the result is shown in the graph below. From the output, the data does not appear to contain any outlier values and is therefore appropriate for further processing.

Figure 3: Outlier Detection
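The study relied on a scatter plot for this check. As a complementary illustration, the sketch below applies the common interquartile range (IQR) rule, which is a different technique from the one used in the project, to a hypothetical list of ages containing the implausible value 200 mentioned above. The function name `iqr_outliers` and the multiplier of 1.5 are assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # first, second, third quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical ages, including one implausible entry
ages = [21, 25, 29, 33, 35, 38, 41, 47, 52, 200]
outliers = iqr_outliers(ages)
print(outliers)
```

A numeric rule of this kind is easier to automate than visual inspection, although it can flag legitimate extreme values, so flagged records still need review.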

3.5 Prediction Modelling and Evaluation

The data prepared in the earlier stage is taken as input for building the predictive models. The input data is partitioned into training (80%) and testing (20%) sets using the partition node in SPSS Modeler. Four machine learning techniques were applied to the credit dataset: Logistic Regression, Bayesian Network, Random Trees and Neural Network. All four models were trained on the training (80%) dataset and then validated on the testing dataset. Demographic statistics and suitable graphs were produced to show important features of the dataset; these are discussed in more detail in the descriptive part of the implementation section. Each model reported the percentage of customers predicted correctly and wrongly. The performance of the implemented prediction models was estimated by comparing the prediction accuracy of each algorithm. The architecture shown below is the designed model of the project implementation; it combines all the prediction models used in this project and was developed in SPSS Modeler.

Figure 4: Predictive Models Architecture

4 Implementation

The implementation procedure of the project is divided into two parts. The first part presents a descriptive analysis of the dataset and the second part involves a comparative study which evaluates the predictive algorithms in order to come up with the best algorithm that would best predict a default customer.

The data used in this project is made up of 25 variables. It consists of the customers' demographic information, payment history for a period of six months, and the total amount of credit given for both individual credit and supplementary credit. The dataset contains 24 explanatory variables (14 continuous and 10 categorical) and 1 dichotomous dependent variable. More information on the dataset is presented in the table below.

Figure 5: Variables used in the study and their Definition

For instance, the Pay 0 to Pay 6 columns represent the customer's payment status from April to September. The status is divided into 10 categories based on the payment status of the credit: -2 stands for no consumption, -1 stands for credit that was paid in full, and values from 1 upward stand for payment delayed by one month or more.

Data pre-processing was conducted using the data audit node in SPSS Modeler, where the data was found to be 100% complete. The data consists of 30000 cases. The partition node in SPSS Modeler was used to split the data into two partitions.

Approximately 80% of the data, 23929 cases, were used for training and model building, and approximately 20% of the data, 6071 cases, were used for testing and validation of the model.
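The partition sizes reported above are close to, but not exactly, 24000 and 6000, which is consistent with each record being assigned to a partition at random. The sketch below illustrates that idea in Python; it is not the SPSS Modeler partition node itself, and the function name `partition`, the seed and the use of integer record IDs are assumptions made for illustration.

```python
import random

def partition(records, train_fraction=0.8, seed=42):
    """Randomly assign each record to the training or testing partition."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for record in records:
        # Each record independently lands in training with probability train_fraction
        if rng.random() < train_fraction:
            train.append(record)
        else:
            test.append(record)
    return train, test

records = list(range(30000))  # stand-in for the 30000 customer records
train, test = partition(records)
print(len(train), len(test))
```

Fixing the seed matters when several models are compared, because every model should be trained and tested on the same two partitions.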

4.1 Descriptive Analysis

Descriptive analysis is a method which explains the features of data. It summarizes quantitative data and produces graphs that help in understanding patterns in the dataset (Mann; 2007). Descriptive analysis for our dataset was performed with SPSS Statistics. Initially, the demographic variables were checked for null values. Frequencies for the demographic and other variables were then calculated. The output of the frequency evaluation is explained below.

The table below shows that none of the demographic categorical variables had missing values.

Figure 6: Valid cases against missing cases

The table below represents frequencies of the demographic variables, for both the default customers and those who were not.

From the table, we can see that 77.9% of the customers were not defaulters, while 22.1% were defaulters. More women than men took up credit: 60.4% of customers were female and 39.6% were male. Most of the people who took up credit had a university degree as their highest level of education (46.8%), followed by those holding a master's degree (35.3%) and those who attended high school (16.4%). The largest group of customers were single (53.2%), followed by married people (45.5%). Only 1.1% were divorced.

Figure 7: Demographic statistics of the Customers

The table below shows that the average credit limit was 167,484 NT dollars with a standard deviation of 129,747 NT dollars. The lowest credit limit was 10,000 NT dollars and the highest was 1,000,000 NT dollars. The average age of those who took up credit was 35 years with a standard deviation of 9 years, so it can be inferred that most of the people who take up credit are middle aged. The youngest customer was 21 years old and the oldest was 79 years old.

Figure 8: Demographic Statistics for Continuous Variables
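The summary statistics reported above (mean, standard deviation, minimum and maximum) can be reproduced for any continuous variable along the following lines. The credit limits used here are made-up values, not records from the dataset, and `describe` is a hypothetical helper.

```python
import statistics

def describe(values):
    """Return the mean, sample standard deviation, minimum and maximum."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }

# Hypothetical credit limits in NT dollars
limits = [10000, 50000, 120000, 200000, 500000]
summary = describe(limits)
print(summary)
```

A standard deviation that is large relative to the mean, as in the reported credit limits, indicates a widely spread and likely right-skewed variable.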

The table below shows that most of the customers (53.2%) were using revolving funds. No customers delayed payment for 9 months or more; the longest delay was 8 months, which accounted for 0.02% of customers.

Figure 9: A Cross Tabulation of the Customers Payment Status and Default Payment from April to September

From the graph below, we can infer that most of the defaulters were using revolving funds. It can also be seen that the chance of default increases as the number of delayed months increases. More than 50% of the customers who delayed payment by two months or more went on to default.

Figure 10: Distribution of Customers Across Payment Status

4.2 Implementation of Predictive Models

The following are the four algorithms to be evaluated:

1. Logistic Regression

2. Bayesian Network

3. Random Tree

4. Neural Network

Each algorithm's performance was measured based on its overall prediction accuracy.

Finally, a conclusion was drawn on the best predictor model. The analysis was conducted using the various nodes in SPSS Modeler.

4.2.1 Logistic Regression

The first prediction model built for our project is Logistic Regression. Logistic Regression is a classification method in data mining and one of the most popular classification techniques. It is mostly used when the target variable is binary, such as good/bad or boy/girl (Walker and Duncan; 1967). A total of 23929 cases (approximately 80% of the data) were used for building the model and 6071 cases (approximately 20%) were used for testing and validation. Customers were randomly assigned to the two groups using the partition node in SPSS Modeler. The case processing summary indicated that the data had 0% missing values. Using the logistic regression node in SPSS Modeler, an automated forward stepwise procedure was used to arrive at a model containing the strongest predictor variables. At each step, a variable is tested for its importance to the model using Chi-square, a test used to evaluate the relationship between variables. The model is designed so that new variables can be added in future if required. When a new variable is included in the model, it is compared with the existing predictor variables to check whether it better explains credit default behaviour. If the newly entered predictor is found to be better in terms of prediction, the existing predictor is removed from the model. The forward stepwise procedure continues until all predictor variables have been tested; each variable is included if it satisfies the criteria and removed if it does not. During parameter estimation for the model, the variables below were found to be statistically significant.

i. Amount of given credit in NT dollars

ii. Sex

iii. Education level

iv. Marital Status

v. Age in years

vi. Repayment Status in September

vii. Amount of bill statement in August

viii. Amount of previous payment in September

ix. Amount of previous payment in August

x. Amount of previous payment in April

The loan default model is evaluated by following the simple logistic regression equation. From that equation, the loan default model is estimated as below:

Where Bi is the coefficient estimated in the variable selection process.
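For orientation, the general form of a binary logistic regression model can be written as follows. This is the textbook form rather than the fitted model: the intercept B_0, the predictors X_i and their count k are notation assumed here, and the estimated coefficient values are not reproduced.

```latex
P(\text{default} = 1 \mid X_1, \dots, X_k)
  = \frac{1}{1 + e^{-\left(B_0 + \sum_{i=1}^{k} B_i X_i\right)}},
\qquad
\ln\!\left(\frac{P}{1 - P}\right) = B_0 + \sum_{i=1}^{k} B_i X_i
```

Under this form, a positive B_i means that larger values of X_i raise the estimated odds of default, holding the other predictors fixed.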

The table below gives the Pearson and Deviance goodness-of-fit results, which test whether the predicted probabilities deviate from the observed values.

Figure 11: Goodness of Fit Logistic Regression

As the significance values for both Pearson and Deviance are greater than 0.05, the model fit is appropriate.

4.2.2 Neural Network

The next technique used for model building is Neural Network. It is a machine learning method with a strong capability of identifying and representing relationships between variables. The inspiration behind neural networks is to build an intelligent system which functions like a human brain. The Multilayer Perceptron (MLP) model was followed to develop the predictive model. This procedure trains the system on historical data, learns the association between input and output data, and attempts to generate an outcome when the output is unknown (Yegnanarayana; 2009).

The Neural Net node in SPSS Modeler was used to train the model. The data used for the Neural Network method was the same as for the logistic model: 80% training data and 20% testing data. The main objective in building a default customer predictive model is to achieve the highest accuracy, so the enhanced model accuracy option was selected to boost the predictive model. Probability was used to enhance model accuracy in determining the most valuable inputs.

The table below shows the significant predictor variables, which were obtained through the highest-probability-wins technique. Column V5 shows the probabilities for combinations of variable categories; the nearer the value is to 1, the more valuable the variable.

Figure 12: Significant Predictor Variables: Neural Network

Model Gain:

The graph shown below depicts the fitness of Neural Network model for prediction.

The red diagonal line represents a random model and the blue line represents our model.

The random model is an arbitrary, assumed baseline. The graph shows that the blue line model is better than the red line model in terms of % gain: at the 60th percentile the model achieves a 70.2% gain.

Figure 13: Model Gain: Neural Network
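A gain value of this kind can be computed by ranking customers by predicted default probability and measuring what share of all actual defaulters falls within the top portion of the ranking. The sketch below assumes that reading of the gains chart; the scores, labels and the function `cumulative_gain` are hypothetical.

```python
def cumulative_gain(actual, scores, percent):
    """Share of all positives found in the top `percent` of records ranked by score."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    cutoff = len(ranked) * percent // 100  # number of records in the top slice
    captured = sum(label for _, label in ranked[:cutoff])
    total = sum(actual)
    return captured / total if total else 0.0

# Hypothetical predicted default probabilities and true labels (1 = default)
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
gain_60 = cumulative_gain(actual, scores, 60)
print(gain_60)
```

A model that ranked customers at random would be expected to capture about 60% of defaulters in the top 60% of records, which is what the diagonal line represents.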

4.2.3 Bayesian Network

Bayesian Network is a commonly preferred machine learning method where the problem involves uncertainty and probability is an important factor (Murphy; 1998). The graph of a Bayesian network consists of nodes and edges: nodes denote random variables and edges denote the associations between them. For this project, the Bayesian Network could visualize the associations between the target variable, default payment, and the predictor variables.

Figure 14: Markov Bayesian Network.

The Bayes Net node in SPSS Modeler was used to build the Bayesian Network predictive model. The data used for training was the same as for the previous models. A Markov Blanket was used to structure the Bayesian Network model, in which the target node is shielded by its parent and child nodes. 'Markov Blanket' is a supervised method used to form a Bayesian Network, and it assists in predicting the behaviour of the target variable (Pearl; 2014). Figure 14 is the model generated for the credit input dataset.
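In the standard definition, the Markov blanket of a node consists of its parents, its children and the other parents of its children. The sketch below extracts that set from a small directed graph. The graph structure and the variable names are invented for illustration and do not reproduce the network in Figure 14; `markov_blanket` is a hypothetical helper.

```python
def markov_blanket(parents, target):
    """Return the parents, children and co-parents of `target` in a DAG.

    `parents` maps each node to the set of its parent nodes.
    """
    blanket = set(parents.get(target, set()))  # start with the target's parents
    for node, node_parents in parents.items():
        if target in node_parents:
            blanket.add(node)        # node is a child of the target
            blanket |= node_parents  # the child's other parents (co-parents)
    blanket.discard(target)
    return blanket

# Hypothetical network structure
parents = {
    "default": {"SEX", "AGE"},
    "PAY_SEP": {"default"},
    "PAY_AUG": {"default", "BILL_AUG"},
    "BILL_AUG": set(),
    "SEX": set(),
    "AGE": set(),
    "LIMIT_BAL": {"AGE"},
}
blanket = markov_blanket(parents, "default")
print(sorted(blanket))
```

Given the values of the nodes in its blanket, the target is conditionally independent of every other node, which is why variables outside the blanket (here `LIMIT_BAL`) can be dropped from the prediction.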

The figure above represents the Markov Bayesian Network; each box shows the distribution of the respective explanatory variable. The importance of a predictor variable for default customer prediction is represented by the intensity of the blue colour on the bars: the darker the blue, the more important the variable. From the figure above, the following variables were found to be important:

i. Credit repayment status in August

ii. Repayment Status in September

iii. Repayment status in July

iv. Sex

v. Age

vi. April payment Status

vii. April Bill Amount

viii. August bill Amount

ix. May Bill amount

x. July Bill Amount

Model Gain:

The graph shown below depicts the fitness of the Bayesian Network model for prediction.

The red diagonal line represents a random model and the blue line represents our model.

The graph shows that the blue line model is better than the red line model in terms of % gain: at the 60th percentile the model achieves a 69.2% gain.

Figure 15: Model Gain: Bayesian Network

4.2.4 Random Trees

The Random Tree procedure is an enhanced method of classifying target variables in which the algorithm uses the generated trees to predict the outcome of the target variable for new observations. This method tries to determine the most significant decision rules, those with the highest forecasting rate (Liaw et al.; 2002b).

The random tree node in SPSS Modeler was used for growing and pruning the random trees. The data used for training was the same as for the earlier models. The following predictor variables were found to be important from the results.

i. September repayment Status

ii. Amount of Previous Payment in September

iii. Amount of Previous payment in August

iv. Repayment status in June

v. Amount of given credit in NT Dollars

The table below gives details of the top decision rules identified by the random tree algorithm. The 'Interestingness Index' column represents the probability of accurately predicting a default customer using the decision rule. Based on this probability value, the top 5 decision rules are displayed by default.

Figure 16: Top Decision Rules for Default Customer Prediction

Model Gain:

The graph shown below depicts the fitness of the Random Trees model for prediction. The model gain for the Random Trees model was generated using the Graphical Evaluation node in SPSS Modeler. The graph shows that at the 60th percentile the model achieves a 68.65% gain.

Figure 17: Model Gain: Random Trees

5 Evaluation

In this section, all models are evaluated to decide the best model for predicting default customers. The evaluation is based on the prediction accuracy of the four algorithms employed. For each method, the total number of customers predicted accurately and wrongly is calculated for both the training and testing datasets.

The accurate and erroneous predictions are reported as counts and percentages.

Figure 18: Predictive Models Evaluation: Prediction Accuracy

From the above prediction analysis of both the training and testing data, Logistic Regression produced the best model, with 81.54% correct predictions and 18.46% misclassifications. It was followed closely by the Neural Network, with 81.5% correct predictions and 18.5% misclassifications.
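The accuracy and misclassification percentages used in this comparison follow directly from counts of correct and wrong predictions. The sketch below shows that arithmetic on hypothetical labels. The function name `accuracy_summary` and the toy data are assumptions, not part of the original analysis.

```python
from typing import Dict, Sequence


def accuracy_summary(actual: Sequence[int],
                     predicted: Sequence[int]) -> Dict[str, float]:
    """Count correct and wrong predictions and express both as percentages."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("no predictions to evaluate")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    wrong = len(actual) - correct
    return {
        "correct": correct,
        "wrong": wrong,
        "correct_pct": 100.0 * correct / len(actual),
        "wrong_pct": 100.0 * wrong / len(actual),
    }


# Hypothetical labels: 1 = default, 0 = non-default.
toy_actual = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
toy_predicted = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]

summary = accuracy_summary(toy_actual, toy_predicted)
print(summary)
```

Running the same summary separately on the training and testing partitions gives the per-model figures that are compared in the evaluation.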

6 Conclusion and Future Work

The main objective of this study was to implement various predictive models for forecasting default and non-default customers in banking and other financial organizations, and to approve customer loan applications automatically by analysing an individual's credit payment behaviour. After a thorough investigation of the various data mining algorithms, the Logistic Regression technique was found to have the highest prediction accuracy. The implemented system estimates the probability that a customer will default. This probability can help banks set specific criteria for approving a client's loan application. In future, the performance of the implemented system could be improved by training the predictive models on a larger dataset than the existing one. Future research should also include more explanatory variables in the model, which may further improve prediction accuracy.
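As an illustration of how an estimated default probability could feed an approval criterion, the sketch below applies the standard logistic function to a linear score and compares the result with a cut-off. The coefficients, intercept, feature name `pay_sep` and the 0.5 threshold are hypothetical values chosen for the example. They are not the fitted model from this study.

```python
import math
from typing import Dict


def default_probability(features: Dict[str, float],
                        coefficients: Dict[str, float],
                        intercept: float) -> float:
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(b_i * x_i))))."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def approve_loan(probability_of_default: float, threshold: float = 0.5) -> bool:
    """Approve only when the estimated default probability is below the cut-off."""
    return probability_of_default < threshold


# Hypothetical coefficients, not taken from the dissertation's model.
toy_coefficients = {"pay_sep": 0.9}
toy_intercept = -1.5

p_late = default_probability({"pay_sep": 2}, toy_coefficients, toy_intercept)
p_on_time = default_probability({"pay_sep": 0}, toy_coefficients, toy_intercept)

print(f"late payer:    p = {p_late:.3f}, approve = {approve_loan(p_late)}")
print(f"on-time payer: p = {p_on_time:.3f}, approve = {approve_loan(p_on_time)}")
```

A bank that is more risk-averse could lower the threshold, which rejects more applications in exchange for fewer expected defaults.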


I would like to express my sincere gratitude to my supervisor, Prof. Keith Maycock, for the continuous support he gave me during the completion of my Master's thesis. His guidance helped me immensely throughout the research and writing. I will always be thankful to him.


Agarwal, S., Ambrose, B. W. and Chomsisengphet, S. (2008). Determinants of automobile loan default and prepayment.

Baesens, B., Setiono, R., Mues, C. and Vanthienen, J. (2003). Using neural network rule extraction and decision tables for credit-risk evaluation, Management science 49(3): 312-329.

Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J. and Vanthienen, J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring, Journal of the operational research society 54(6): 627-635.

Borodzicz, E. (2005). Risk, crisis and security management, Wiley.

Cios, K. J., Pedrycz, W. and Swiniarski, R. W. (1998). Rough sets, Data Mining Methods for Knowledge Discovery, Springer, pp. 27-71.

Clover, M. (2013). Yeh, tsun-siou lee., The role of credit card behavior in auto loan grant decision. An application of survival table. Banks and Bank Systems 8(1): 112.

Cortes, C. and Vapnik, V. (1995). Support-vector networks, Machine learning 20(3): 273-

Freitas, A. A. (2014). Comprehensible classification models: a position paper, ACM SIGKDD explorations newsletter 15(1): 1-10.

Ghatasheh, N. (2014). Business analytics using random forest trees for credit risk prediction: A comparison study, International Journal of Advanced Science and Technology 72: 19-30.

Heitfield, E. and Sabarwal, T. (2004). What drives default and prepayment on subprime auto loans?, The Journal of real estate finance and economics 29(4): 457-477.

Jiawei, H. and Kamber, M. (2003). Data mining: Concepts and techniques, (the morgan kaufmann series in data management systems), vol. 2.

Kasiyanto, S. (2016). Security issues of new innovative payments and their regulatory challenges, Bitcoin and Mobile Payments, Springer, pp. 145-179.

Krichene, A. and Krichene, A. (2017). Using a naive bayesian classifier methodology for loan risk assessment: Evidence from a tunisian commercial bank, Journal of Economics, Finance and Administrative Science 22(42): 3-24.

Liaw, A., Wiener, M. et al. (2002a). Classification and regression by randomforest, R news 2(3): 18-22.

Liaw, A., Wiener, M. et al. (2002b). Classification and regression by randomforest, R news 2(3): 18-22.

Lichman, M. (2013). UCI machine learning repository. URL: http://archive.ics.uci.edu/ml

Maddala, G. S. and Lahiri, K. (1992). Introduction to econometrics, Vol. 2, Macmillan New York.

Malekipirbazari, M. and Aksakalli, V. (2015). Risk assessment in social lending via random forests, Expert Systems with Applications 42(10): 4621-4631.

Mann, P. S. (2007). Introductory statistics, John Wiley & Sons.

Miguéis, V. L., Benoit, D. F. and Van den Poel, D. (2013). Enhanced decision support in credit scoring using bayesian binary quantile regression, Journal of the Operational Research Society 64(9): 1374-1383.

Murphy, K. (1998). A brief introduction to graphical models and bayesian networks.

Ni, H. (2010). Consumer credit risk evaluation by logistic regression with self-organizing map, Natural Computation (ICNC), 2010 Sixth International Conference on, Vol. 1, IEEE, pp. 205-209.

Pearl, J. (2014). Probabilistic reasoning in intelligent systems: networks of plausible inference, Morgan Kaufmann.

Perez, S. (2015). Paypal launches paypal. me, a simpler way to request money using your own personalized url.

Piatetsky-Shapiro, G. (2014). Kdnuggets methodology poll.

Pyle, D. (1999). Data preparation for data mining, Vol. 1, morgan kaufmann.

Rokach, L. and Maimon, O. (2014). Data mining with decision trees: theory and applications, World scientific.

Shearer, C. (2000). The crisp-dm model: the new blueprint for data mining, Journal of data warehousing 5(4): 13-22.

Shiri, M. M., Amini, M. T. and Raftar, M. B. (2012). Data mining techniques and predicting corporate financial distress, Interdisciplinary Journal of Contemporary Research in Business 3(12): 61-68.

Vallini, C., Ciampi, F. and Gordini, N. (2009). Using artificial neural networks analysis for small enterprise default prediction modeling: Statistical evidence from italian firms, 2009 Oxford Business & Economics Conference Proceedings, Association for Business and Economics Research (ABER), pp. 1-26.

Venkatesh, A. and Jacob, S. G. (2016). Prediction of credit-card defaulters: A comparative study on performance of classifiers, International Journal of Computer Applications 145(7).

Wagacha, P. W. (2002). Machine learning notes on: I. classifier learning and generalization, ii. data preparation, iii. validation methods, Institute of Computer Science, University of Nairobi, http://www.uonbi.ac.ke/acad depts/ics/course material/machine learning/MLNotes.pdf.

Walker, S. H. and Duncan, D. B. (1967). Estimation of the probability of an event as a function of several independent variables, Biometrika 54(1-2): 167-179.

Wu, D. D., Chen, S.-H. and Olson, D. L. (2014). Business intelligence in risk management: Some recent progresses, Information Sciences 256: 1-7.

Yegnanarayana, B. (2009). Artificial neural networks, PHI Learning Pvt. Ltd.

Yeh, I.-C. and Lien, C.-h. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients, Expert Systems with Applications 36(2): 2473-2480.

Yu, H., Huang, X., Hu, X. and Cai, H. (2010). A comparative study on data mining algorithms for individual credit risk evaluation, Management of e-Commerce and eGovernment (ICMeCG), 2010 Fourth International Conference on, IEEE, pp. 35-38.

Zhang, H. (2004). The optimality of naive bayes, AA 1(2): 3.
