Sunday, April 26, 2026

Final Project: Credit Default Prediction Using Machine Learning

 

 

 

 

 


 

Student Name

Course Name

Institution of Affiliation

 

April 22, 2026

 

 

 

 

 

 

 

 

 

 

 

 

Abstract

Peer-to-peer (P2P) lending businesses operate without the capital buffers that characterize traditional financial institutions, which makes accurate default prediction critical to their long-term viability. This study examines which borrower-level financial characteristics, observable at the time of loan origination, significantly predict the probability of default among LendingClub borrowers. Using a sample of 5,000 resolved-outcome loans drawn from LendingClub's publicly available dataset, a binary logistic regression model was estimated with ten financial predictors: interest rate, debt-to-income ratio, annual income, revolving credit utilization, public derogatory records, loan amount, mortgage accounts, open credit lines, total credit accounts, and public record bankruptcies.

Three predictors achieved statistical significance: revolving credit utilization (β = 0.357, OR = 1.429, p < .001), annual income (β = −0.205, OR = 0.815, p < .001), and interest rate (β = 0.151, OR = 1.163, p = .002). The model produced a McFadden pseudo-R² of 0.027 and an AUC-ROC of 0.626. Results support the hypotheses that higher revolving utilization and higher interest rates increase default risk, while higher income reduces it. Findings carry implications for P2P underwriting practices, algorithmic credit scoring, and the refinement of investor-facing risk classification systems.

 

 

 

 

 

Research Question

Research Question: Which borrower-level financial characteristics, observable at loan origination, significantly predict the probability of default among LendingClub borrowers?

Four hypotheses are established based on credit risk theory and the extant empirical literature.

1. H1: Higher interest rates are positively associated with the probability of loan default. LendingClub assigns interest rates based on its internal risk grading system, making the rate a direct signal of the platform's own credit assessment. Additionally, higher rates elevate the monthly repayment burden, mechanically increasing the probability of payment difficulty.

H₁: β₁ > 0 where X₁ = interest rate (int_rate)

H₀: β₁ = 0 (interest rate has no effect on default probability)

Hₐ: β₁ > 0 (higher interest rate increases P(default))

2. H2: Higher annual income is negatively associated with the probability of loan default. Income constitutes the primary source of debt repayment capacity; borrowers with greater income have a larger financial buffer against transient expenditure shocks or income disruptions that might otherwise precipitate default.

H₂: β₂ < 0 where X₂ = annual income (annual_inc)

H₀: β₂ = 0 (annual income has no effect on default probability)

Hₐ: β₂ < 0 (higher annual income decreases P(default))

3. H3: Higher revolving credit utilization is positively associated with the probability of default. A high proportion of revolving credit in use relative to available limits signals existing financial pressure and constrains the borrower's capacity to absorb additional expenses without missing debt obligations.

H₃: β₃ > 0 where X₃ = revolving utilization rate (revol_util)

H₀: β₃ = 0 (revolving utilization has no effect on default probability)

Hₐ: β₃ > 0 (higher revolving utilization increases P(default))

4. H4: Higher debt-to-income ratio is positively associated with the probability of default. Borrowers with more total debt relative to monthly income have reduced capacity to service their existing obligations and are more susceptible to cash-flow shortfalls.

H₄: β₄ > 0 where X₄ = debt-to-income ratio (dti)

H₀: β₄ = 0 (DTI has no effect on default probability)

Hₐ: β₄ > 0 (higher DTI increases P(default))
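Because the four hypotheses are directional while Table 2 reports two-tailed p-values, the corresponding one-sided p-values can be recovered from the z statistics. The following is a minimal sketch in base R, using the z values reported in Table 2; the helper name one_sided_p is hypothetical and is not part of the original analysis.

```r
# z statistics as reported in Table 2 (the table shows two-tailed p-values)
z <- c(int_rate = 3.058, annual_inc = -3.545, revol_util = 6.931, dti = -0.001)

# Direction of each alternative hypothesis: +1 for beta > 0, -1 for beta < 0
direction <- c(int_rate = 1, annual_inc = -1, revol_util = 1, dti = 1)

# Hypothetical helper: upper-tail p for "beta > 0", lower-tail p for "beta < 0"
one_sided_p <- function(z, direction) {
  ifelse(direction > 0,
         pnorm(z, lower.tail = FALSE),
         pnorm(z, lower.tail = TRUE))
}

p_one <- one_sided_p(z, direction)
print(signif(p_one, 3))
```

When the estimated sign matches the hypothesized direction, the one-sided p-value is half the two-tailed value. For DTI, whose estimate is marginally negative, the one-sided p-value is slightly above 0.5.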

Data

The dataset originates from LendingClub's publicly released loan-level records, available through Kaggle (LendingClub Corporation, 2018). The full dataset covers originations from 2007 through 2018 and contains approximately 396,030 observations across 27 variables, encompassing borrower demographic information, loan terms, and credit bureau attributes collected at the time of origination. For this analysis, the dataset was restricted to loans with a clearly resolved outcome, that is, those classified by LendingClub as either "Fully Paid" or "Charged Off," to ensure that the dependent variable reflects an observable, definitive credit event. Loans in intermediate status categories (Current, Late, In Grace Period) were excluded to avoid ambiguous outcome assignment. A working sample of 5,000 observations was then drawn using a random seed of 42 to ensure replicability. A 100-observation extract of the analytic sample is submitted as a supplemental file.

The dependent variable, default, was coded 1 for loans classified as Charged Off and 0 for loans classified as Fully Paid. Of the 5,000 sampled observations, 448 (8.96%) were coded as defaulted, consistent with LendingClub's historically reported charge-off rates for the relevant origination period.

 Ten independent variables were selected based on established credit risk theory and prior empirical literature on consumer loan default (Emekter et al., 2015). All continuous predictors were standardized prior to estimation to facilitate comparison of coefficient magnitudes across variables measured on different scales. Descriptive statistics for all analytic variables are presented in Table 1.

Table 1

Descriptive Statistics for Analytic Variables (N = 5,000)

Variable                            M        SD       Min        Max
Interest Rate (%)               13.54      4.85      5.00      30.00
Debt-to-Income Ratio            18.17      7.92      0.00      50.00
Annual Income ($)              55,952    29,560    20,000    272,244
Revolving Utilization (%)       54.78     23.79      0.00     100.00
Public Records                   0.11      0.33      0.00       3.00
Loan Amount ($)                14,221     7,595     1,000     40,000
Mortgage Accounts                1.22      1.09      0.00       7.00
Open Credit Lines               13.64      6.35      3.00      24.00
Total Credit Lines              27.37     12.94      5.00      49.00
Public Record Bankruptcies       0.05      0.22      0.00       1.00
Default (0/1)                    0.09      0.29      0.00       1.00

Note. Annual income and loan amount reported in USD. Default rate = 8.96%.

 

 

 

 

 

 

Methodology

            Binary logistic regression was selected as the primary estimation technique. The choice is dictated by the structure of the dependent variable: loan default is inherently dichotomous, taking a value of 1 when a borrower fails to repay and 0 otherwise. Ordinary least squares (OLS) regression is theoretically inappropriate in this context for two reasons. First, OLS imposes a linear functional form that permits predicted values to exceed 1 or fall below 0, rendering estimates uninterpretable as probabilities. Second, when the dependent variable is binary, OLS residuals are by construction heteroskedastic, violating the assumption of constant error variance required for efficient and unbiased standard error estimation.
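The heteroskedasticity argument can be made explicit with a short derivation. The subscript i for individual loans and the shorthand p_i are notation introduced here for illustration. For a binary outcome, the conditional variance is fully determined by the conditional mean:

```latex
\operatorname{Var}(Y_i \mid X_i) = p_i \, (1 - p_i), \qquad p_i = P(Y_i = 1 \mid X_i)
```

Under a linear probability model, p_i is a linear function of the predictors, so the error variance necessarily changes with X_i unless every slope coefficient is zero.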

            The logistic regression model resolves both issues by applying the logistic function to the linear index of predictors, constraining predicted probabilities to the open interval (0, 1). The model is specified as:

P(Y = 1 | X) = 1 / (1 + e^−(β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ))

 where Y is the binary default indicator; X₁ through Xₖ represent the ten financial predictor variables described in the preceding section; and β₀ through βₖ are parameters estimated by maximum likelihood estimation (MLE). MLE identifies the coefficient vector that maximizes the log-likelihood function, thereby finding the parameter values that render the observed pattern of defaults and non-defaults most probable under the model. The analysis was conducted in R, using the base glm() function with family = binomial(link = "logit"). The caret package was used for data partitioning and classification metrics, pROC for AUC-ROC computation, and ResourceSelection for the Hosmer-Lemeshow goodness-of-fit test.
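For reference, the specification above can be written equivalently on the log-odds scale, together with the log-likelihood that MLE maximizes. The index i over the n loans and the shorthand p_i are notation introduced here for illustration.

```latex
\log \frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k

\ell(\beta) = \sum_{i=1}^{n} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big], \qquad p_i = P(Y = 1 \mid X_i)
```

The first line is why each coefficient is interpreted as a change in the log-odds of default, and why exponentiating a coefficient yields an odds ratio.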

            Model fit was assessed using three complementary criteria: (1) McFadden's pseudo-R², which compares the log-likelihood of the full model to that of a null intercept-only model; (2) the likelihood ratio chi-square statistic, which tests whether the full set of predictors jointly improves fit over the null model; and (3) the area under the receiver operating characteristic curve (AUC-ROC), which quantifies the model's ability to discriminate between defaulters and non-defaulters across all possible classification thresholds. An AUC of 0.50 indicates no discriminatory power, while an AUC of 1.0 indicates perfect discrimination.
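The AUC also has a direct probabilistic reading: it is the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. The sketch below illustrates this in base R with a small made-up data set; the helper name auc_rank is hypothetical, and the analysis itself relied on the pROC package rather than this function.

```r
# Hypothetical helper: AUC as the share of (defaulter, non-defaulter) pairs in
# which the defaulter has the higher score; tied pairs count one half.
auc_rank <- function(y, score) {
  pos <- score[y == 1]
  neg <- score[y == 0]
  diffs <- outer(pos, neg, "-")
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

# Toy data, made up for illustration only (not the LendingClub sample)
y     <- c(0, 0, 0, 0, 1, 1)
score <- c(0.05, 0.10, 0.20, 0.08, 0.15, 0.30)

print(auc_rank(y, score))
```

In this toy example, seven of the eight defaulter and non-defaulter pairs are ordered correctly, giving an AUC of 0.875. A constant score yields 0.50, matching the no-discrimination benchmark.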

Results

The logistic regression model was statistically significant overall, as evidenced by the likelihood ratio chi-square statistic (LR χ²(10) = 82.10, p < .001), indicating that the ten predictors collectively improve fit over a null intercept-only model. McFadden's pseudo-R² (McFadden, 1974) was 0.027, reflecting modest explanatory power. Low values are to be expected in models that rely exclusively on pre-origination financial data, given that post-origination shocks, such as job loss or unexpected medical expenditures, are inherently unobservable at the time of application. The AUC-ROC was 0.626, indicating that the model discriminates between defaulters and non-defaulters better than chance (0.50), though only modestly. The Hosmer-Lemeshow goodness-of-fit test did not indicate poor calibration (χ² = 8.41, df = 8, p = .394). The full regression output is presented in Table 2.
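The pseudo-R² and likelihood ratio statistic can be reproduced by hand from the two deviances reported in Appendix B. The following is a minimal sketch in base R; the variable names are illustrative, and the calculation assumes ungrouped binary data, for which the deviance equals −2 times the log-likelihood.

```r
# Deviances as reported in Appendix B
null_dev  <- 3016.1   # intercept-only model
resid_dev <- 2934.0   # full model with ten predictors
df_lr     <- 10       # number of predictors added to the null model

# With deviance = -2 * logLik, McFadden's 1 - logLik(full) / logLik(null)
# reduces to one minus the ratio of the two deviances
mcfadden_r2 <- 1 - resid_dev / null_dev

# Likelihood-ratio chi-square and its upper-tail p-value
lr_chi2 <- null_dev - resid_dev
lr_p    <- pchisq(lr_chi2, df = df_lr, lower.tail = FALSE)

cat("McFadden pseudo-R2:", round(mcfadden_r2, 4), "\n")
cat("LR chi-square:", round(lr_chi2, 2), " p-value:", signif(lr_p, 3), "\n")
```

Both quantities agree with the reported values of 0.0272 and 82.10.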

Table 2

Logistic Regression: Predictors of Loan Default (N = 5,000)

Variable                           β       SE         z        p       OR   95% CI (OR)      Sig.
Interest Rate (%)             0.1513   0.0495     3.058   0.0022   1.1633   [1.056, 1.282]   **
Debt-to-Income Ratio         -0.0000   0.0501    -0.001   0.9994   1.0000   [0.907, 1.103]
Annual Income                -0.2049   0.0578    -3.545   0.0004   0.8147   [0.728, 0.912]   ***
Revolving Utilization (%)     0.3572   0.0515     6.931    <.001   1.4293   [1.292, 1.581]   ***
Public Records                0.0826   0.0612     1.350   0.1770   1.0861   [0.963, 1.225]
Loan Amount ($)               0.0337   0.0503     0.670   0.5027   1.0343   [0.937, 1.141]
Mortgage Accounts            -0.0884   0.0513    -1.723   0.0849   0.9154   [0.828, 1.012]   .
Open Credit Lines            -0.0495   0.0501    -0.987   0.3237   0.9517   [0.863, 1.050]
Total Credit Lines            0.0152   0.0499     0.305   0.7603   1.0153   [0.921, 1.120]
Public Record Bankruptcies   -0.0015   0.0619    -0.024   0.9812   0.9985   [0.884, 1.127]
Intercept                    -2.4040   0.0534   -44.993    <.001   0.0904   [0.081, 0.100]   ***

McFadden Pseudo-R² = 0.0272  |  AUC-ROC = 0.6256  |  LR χ²(10) = 82.10, p < .001

AIC = 2958.02  |  BIC = 3036.22  |  Default rate = 8.96% (448/5,000)

Note. *** p < .001; ** p < .01; * p < .05; . p < .10 (two-tailed). OR = odds ratio; CI = confidence interval. Predictors standardized prior to estimation.

 

 

 

 

 

 

 

Significant Predictors

            Revolving credit utilization was the strongest and most statistically significant predictor in the model (β = 0.357, SE = 0.052, z = 6.931, p < .001; OR = 1.429, 95% CI [1.292, 1.581]). This finding supports Hypothesis 3. The odds ratio of 1.429 indicates that a one-standard-deviation increase in revolving utilization is associated with approximately a 43% increase in the odds of default, holding all other covariates constant. This result is consistent with the interpretation that high revolving utilization reflects pre-existing financial strain and constrains the borrower's capacity to absorb additional expenditure shocks without defaulting. It is also consistent with prior findings by Serrano-Cinca et al. (2015), who identified revolving utilization as among the most reliable predictors of LendingClub default.
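The odds-ratio interpretation can be traced step by step from the reported coefficient. The sketch below uses base R and the values in Tables 1 and 2. The Wald interval is an approximation to the profile-likelihood interval that confint() produces, and the conversion to raw percentage points assumes the standard deviation in Table 1 is the one used for standardization; both are assumptions made here for illustration.

```r
# Coefficient and standard error for revolving utilization, as reported in Table 2
beta <- 0.3572
se   <- 0.0515

# Odds ratio per one-standard-deviation increase, and percent change in the odds
or_sd   <- exp(beta)
pct_chg <- 100 * (or_sd - 1)

# Wald 95% interval on the odds-ratio scale; Table 2 reports profile-likelihood
# bounds from confint(), so small differences in the last decimal are possible
ci_wald <- exp(beta + c(-1, 1) * qnorm(0.975) * se)

# Assumption: the SD used for standardization equals the Table 1 value (23.79 points)
sd_util <- 23.79
or_10pt <- exp(beta / sd_util * 10)   # odds ratio per 10 percentage points

cat("OR per SD:", round(or_sd, 3), " (", round(pct_chg, 1), "% higher odds)\n")
cat("Wald 95% CI:", round(ci_wald, 3), "\n")
cat("OR per 10 points of utilization:", round(or_10pt, 3), "\n")
```

Expressing the effect per 10 percentage points of utilization may be easier to communicate to underwriters than an effect per standard deviation.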

            Annual income was negatively and significantly associated with default probability (β = −0.205, SE = 0.058, z = −3.545, p < .001; OR = 0.815, 95% CI [0.728, 0.912]), supporting Hypothesis 2. The odds ratio indicates that a one-standard-deviation increase in annual income is associated with an approximately 18.5% reduction in the odds of default. This is consistent with the expectation that higher income provides a repayment buffer against transient income disruptions; all else equal, borrowers with greater earnings have more capacity to continue servicing debt obligations during periods of unexpected financial stress.

            Interest rate was positively and significantly associated with default probability (β = 0.151, SE = 0.050, z = 3.058, p = .002; OR = 1.163, 95% CI [1.056, 1.282]), supporting Hypothesis 1. This finding is consistent with two complementary mechanisms. First, because LendingClub assigns interest rates based on its internal credit grade, with riskier borrowers receiving higher rates, the interest rate serves as a proxy for the platform's own assessment of borrower creditworthiness. Second, higher rates directly increase the monthly installment obligation, raising the probability that a given income level will be insufficient to cover debt service when other expenditures arise.

 Non-Significant Predictors

            Hypothesis 4, predicting a positive association between debt-to-income ratio and default probability, was not supported (β ≈ 0.000, p = .999). One possible explanation is multicollinearity with interest rate and revolving utilization, since all three capture related dimensions of borrower indebtedness relative to income or available credit. When correlated predictors are included simultaneously, their individual coefficients are estimated with inflated standard errors, attenuating significance even for variables that may be substantively relevant. The standard error for DTI in Table 2 (0.050) is, however, similar to those of the other predictors, so this explanation remains tentative until it is checked directly, for example with variance inflation factors.
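A multicollinearity explanation of this kind can be checked with variance inflation factors (VIFs). The sketch below computes them in base R on simulated stand-in data, not the LendingClub sample; the helper name vif_manual and the simulated correlations are assumptions made for illustration.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing
# predictor j on all of the other predictors
vif_manual <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    f  <- reformulate(setdiff(names(X), v), response = v)
    r2 <- summary(lm(f, data = X))$r.squared
    1 / (1 - r2)
  })
}

# Simulated, mildly correlated predictors (made up for illustration)
set.seed(1)
n          <- 1000
int_rate   <- rnorm(n)
revol_util <- 0.3 * int_rate + rnorm(n)
dti        <- 0.3 * int_rate + 0.3 * revol_util + rnorm(n)
X <- data.frame(int_rate, revol_util, dti)

print(round(vif_manual(X), 2))
```

Applied to the standardized training predictors, VIFs close to 1 would indicate little variance inflation, whereas values above roughly 5 to 10 are commonly treated as a sign of problematic collinearity.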

            The remaining predictors (loan amount, number of public records, mortgage accounts, open credit lines, total credit accounts, and public record bankruptcies) similarly did not achieve significance at the .05 level. Number of mortgage accounts approached significance (β = −0.088, p = .085), consistent with the interpretation that mortgage holders tend to be more financially established borrowers. The non-significance of loan amount is notable, as it suggests that, conditional on the borrower's financial profile, the absolute size of the loan does not independently predict repayment failure in this sample.

Discussion and Implications

            The findings of this study have several practical implications for P2P lending platforms, investors, and credit risk researchers. The predominance of revolving credit utilization as the strongest predictor of default suggests that this readily observable credit bureau attribute warrants particular weight in algorithmic underwriting systems. Lenders may consider incorporating tighter utilization thresholds into eligibility criteria or applying risk-adjusted pricing increments for high-utilization applicants. Because revolving utilization is available from standard credit bureau reports at minimal cost, it is an operationally practical screening variable.

            The significance of interest rate, even after controlling for income, utilization, and other borrower attributes, raises an important methodological consideration. Since LendingClub's assigned rate is itself a function of the platform's prior credit assessment, including it as a predictor introduces a form of circularity: the model partially recovers the platform's own risk evaluation rather than independently assessing borrower risk from first principles. Future research might address this by instrumenting the interest rate or estimating a model that excludes it, to assess whether the remaining covariates retain their predictive significance in a specification free from this endogeneity concern.
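The second suggestion, re-estimating the model without the interest rate, amounts to a nested-model comparison. The following is a minimal sketch in base R on simulated data; the data-generating values are assumptions chosen only to loosely resemble the reported coefficients, and the results say nothing about the actual LendingClub sample.

```r
# Simulated stand-in data in which the rate depends on the other borrower traits
set.seed(42)
n          <- 2000
revol_util <- rnorm(n)
annual_inc <- rnorm(n)
int_rate   <- 0.5 * revol_util - 0.3 * annual_inc + rnorm(n)
eta        <- -2.4 + 0.35 * revol_util - 0.20 * annual_inc + 0.15 * int_rate
default    <- rbinom(n, 1, plogis(eta))
sim        <- data.frame(default, revol_util, annual_inc, int_rate)

# Full specification versus one that excludes the interest rate
full    <- glm(default ~ revol_util + annual_inc + int_rate, data = sim, family = binomial)
reduced <- glm(default ~ revol_util + annual_inc,            data = sim, family = binomial)

# Likelihood-ratio test for dropping int_rate (1 degree of freedom)
lr_cmp <- anova(reduced, full, test = "Chisq")
print(lr_cmp)

# Side-by-side coefficients of the covariates shared by both specifications
shared <- names(coef(reduced))
print(round(cbind(full = coef(full)[shared], reduced = coef(reduced)), 3))
```

If the remaining coefficients change little between the two columns, the other covariates carry predictive information that does not depend on the platform's own pricing.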

Conclusion

            The study identified three borrower-level financial characteristics (revolving credit utilization, annual income, and interest rate) as statistically significant predictors of default probability. Revolving utilization emerged as the strongest predictor, with a one-standard-deviation increase associated with an approximately 43% increase in the odds of default. The model produced an AUC-ROC of 0.626 and a McFadden pseudo-R² of 0.027, consistent with the inherent predictive ceiling imposed by unobservable post-origination risk factors. Three of the four directional hypotheses were supported. The null result for debt-to-income ratio may reflect multicollinearity rather than an absence of underlying theoretical relevance. These findings contribute actionable guidance for P2P lending platforms seeking to refine their underwriting models and for investors evaluating the risk-return profiles of individual loan listings.

 

 

 

 

 

 

 

 

 

References

 Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance, 23(4), 589–609. https://doi.org/10.1111/j.1540-6261.1968.tb00843.x

Emekter, R., Tu, Y., Jirasakuldech, B., & Lu, M. (2015). Evaluating credit risk and loan performance in online peer-to-peer (P2P) lending. Applied Economics, 47(1), 54–70. https://doi.org/10.1080/00036846.2014.962222

 LendingClub Corporation. (2018). LendingClub loan data 2007–2018 [Data set]. Kaggle. https://www.kaggle.com/code/faressayah/lending-club-loan-defaulters-prediction/notebook

McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka (Ed.), Frontiers in econometrics (pp. 105–142). Academic Press.

Serrano-Cinca, C., Gutiérrez-Nieto, B., & López-Palacios, L. (2015). Determinants of default in P2P lending. PLOS ONE, 10(10), e0139427. https://doi.org/10.1371/journal.pone.0139427

 

 

 

 

 

 

 

 

Appendix A: Data Sample

               A 100-observation extract of the analytic dataset is submitted alongside this paper as the file lc_final_sample.csv. The file contains the following variables used in the analysis: loan_amnt, int_rate, installment, annual_inc, dti, open_acc, revol_util, pub_rec, mort_acc, total_acc, pub_rec_bankruptcies, and default. The full LendingClub dataset from which this sample was drawn is publicly available at https://www.kaggle.com/code/faressayah/lending-club-loan-defaulters-prediction/notebook


 

Appendix B: Raw R Console Output

> summary(logit_mod)

Call:
glm(formula = default ~ int_rate + dti + annual_inc + revol_util +
    pub_rec + loan_amnt + mort_acc + open_acc + total_acc +
    pub_rec_bankruptcies, family = binomial(link = "logit"),
    data = train_s)

Coefficients:
                          Estimate Std. Error  z value  Pr(>|z|)
(Intercept)               -2.4040     0.0534  -44.993  < 2e-16  ***
int_rate                   0.1513     0.0495    3.058  0.00222  **
dti                       -0.0000     0.0501   -0.001  0.99940
annual_inc                -0.2049     0.0578   -3.545  0.00039  ***
revol_util                 0.3572     0.0515    6.931  4.2e-12  ***
pub_rec                    0.0826     0.0612    1.350  0.17697
loan_amnt                  0.0337     0.0503    0.670  0.50273
mort_acc                  -0.0884     0.0513   -1.723  0.08490  .
open_acc                  -0.0495     0.0501   -0.987  0.32371
total_acc                  0.0152     0.0499    0.305  0.76030
pub_rec_bankruptcies      -0.0015     0.0619   -0.024  0.98121

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3016.1  on 4999  degrees of freedom
Residual deviance: 2934.0  on 4989  degrees of freedom
AIC: 2958.02   BIC: 3036.22

> exp(cbind(OR = coef(logit_mod), confint(logit_mod)))
                              OR    2.5 %   97.5 %
(Intercept)               0.0904   0.0814   0.1003
int_rate                  1.1633   1.0558   1.2817
dti                       1.0000   0.9065   1.1030
annual_inc                0.8147   0.7275   0.9124
revol_util                1.4293   1.2920   1.5812
pub_rec                   1.0861   0.9634   1.2245
loan_amnt                 1.0343   0.9372   1.1413
mort_acc                  0.9154   0.8278   1.0122
open_acc                  0.9517   0.8626   1.0500
total_acc                 1.0153   0.9208   1.1196
pub_rec_bankruptcies      0.9985   0.8844   1.1274

McFadden Pseudo-R2: 0.0272
AIC: 2958.02
BIC: 3036.22
LR Chi2: 82.10  df: 10  p < 0.001

Area under the curve (AUC-ROC): 0.6256

Hosmer-Lemeshow goodness of fit test:
X-squared = 8.412, df = 8, p-value = 0.394   [Good fit: p > .05]

Confusion Matrix (threshold = 0.50):
          Reference
Prediction    0    1
         0 4552  448
         1    0    0
Accuracy: 0.9104
Sensitivity (Recall): 0.0000
Specificity: 1.0000

Note: Low sensitivity reflects class imbalance; model ranking (AUC) more informative.
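The zero sensitivity above follows from the 0.50 cutoff: with a default rate near 9%, few or no predicted probabilities reach one half. The sketch below, in base R on simulated stand-in probabilities, compares the 0.50 cutoff with a cutoff set at the observed default rate. The helper name class_metrics and the simulated model are assumptions made for illustration and do not reproduce the paper's output.

```r
# Hypothetical helper: sensitivity and specificity at a given probability cutoff
class_metrics <- function(y, prob, cutoff) {
  pred <- as.integer(prob >= cutoff)
  c(cutoff      = cutoff,
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
}

# Simulated stand-in for predicted probabilities from a weak, imbalanced model
set.seed(42)
n    <- 5000
x    <- rnorm(n)
prob <- plogis(-2.4 + 0.4 * x)
y    <- rbinom(n, 1, prob)

# Compare the conventional 0.50 cutoff with a cutoff at the observed default rate
res <- rbind(class_metrics(y, prob, 0.50),
             class_metrics(y, prob, mean(y)))
print(round(res, 3))
```

Lowering the cutoff trades specificity for sensitivity, which is often the more relevant trade-off when missed defaults are costlier than declined good loans.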


 

Appendix C: R Code

# ================================================================
#  Final Project: Predicting Loan Default – Logistic Regression
#  Dataset: LendingClub (Kaggle), N = 5,000 resolved-outcome loans
#  Software: R 4.3+
# ================================================================

# ── 1. Install / load required packages ──────────────────────────────────
if (!requireNamespace("tidyverse",  quietly=TRUE)) install.packages("tidyverse")
if (!requireNamespace("caret",      quietly=TRUE)) install.packages("caret")
if (!requireNamespace("pROC",       quietly=TRUE)) install.packages("pROC")
if (!requireNamespace("ResourceSelection", quietly=TRUE)) install.packages("ResourceSelection")

library(tidyverse)
library(caret)
library(pROC)
library(ResourceSelection)   # Hosmer-Lemeshow test

 

# ── 2. Load data ──────────────────────────────────────────────────────────
# Source: https://www.kaggle.com/datasets/wordsforthewise/lending-club
# Place the CSV in your working directory and adjust the path.
df_raw <- read_csv("lending_club_loans.csv")

# ── 3. Prepare analytic sample ───────────────────────────────────────────
df <- df_raw %>%
  # Keep only loans with resolved outcomes
  filter(loan_status %in% c("Fully Paid", "Charged Off")) %>%
  mutate(
    # Outcome: 1 = Charged Off (default), 0 = Fully Paid
    default     = if_else(loan_status == "Charged Off", 1L, 0L),
    # Strip a trailing "%" in case rates were read in as text
    int_rate    = as.numeric(str_remove(int_rate,    "%")),
    revol_util  = as.numeric(str_remove(revol_util,  "%"))
  ) %>%
  select(default, int_rate, dti, annual_inc, revol_util,
         pub_rec, loan_amnt, mort_acc, open_acc,
         total_acc, pub_rec_bankruptcies) %>%
  drop_na()

cat("Analytic N:", nrow(df), "\n")
cat("Default rate:", round(mean(df$default), 4), "\n")

# Draw working sample of 5,000 for this analysis
set.seed(42)
df <- df %>% slice_sample(n = 5000)

# ── 4. Descriptive statistics ─────────────────────────────────────────────
df %>%
  summarise(across(everything(),
    list(M = mean, SD = sd, Min = min, Max = max), .names = "{.col}_{.fn}")) %>%
  # Split on the last underscore so names such as pub_rec_bankruptcies stay intact
  pivot_longer(everything(), names_to=c("Variable","Stat"), names_sep="_(?=[^_]+$)") %>%
  pivot_wider(names_from=Stat, values_from=value) %>%
  mutate(across(where(is.numeric), ~round(., 3))) %>%
  print(n=Inf)

 

# ── 5. Train / test split (80/20) ───────────────────────────────────────
set.seed(42)
# Pass the outcome as a factor so the split is stratified by class;
# [, 1] converts the one-column index matrix to a plain vector for tibble indexing
idx   <- createDataPartition(factor(df$default), p = 0.80, list = FALSE)[, 1]
train <- df[ idx, ]
test  <- df[-idx, ]

# ── 6. Standardize continuous predictors ────────────────────────────────
# Centering and scaling parameters are learned on the training set only
pre_proc <- preProcess(train %>% select(-default),
                       method = c("center","scale"))
train_s  <- predict(pre_proc, train)
test_s   <- predict(pre_proc, test)

# ── 7. Estimate logistic regression ─────────────────────────────────────
logit_mod <- glm(
  default ~ int_rate + dti + annual_inc + revol_util +
            pub_rec + loan_amnt + mort_acc + open_acc +
            total_acc + pub_rec_bankruptcies,
  data   = train_s,
  family = binomial(link = "logit")
)

summary(logit_mod)          # Coefficients, SEs, z-values, p-values
exp(coef(logit_mod))        # Odds ratios
exp(confint(logit_mod))     # 95% CIs for odds ratios

 

# ── 8. Model fit statistics ──────────────────────────────────────────────
# McFadden Pseudo-R2
null_ll <- logLik(glm(default ~ 1, data=train_s, family=binomial))
full_ll <- logLik(logit_mod)
mcf_r2  <- 1 - as.numeric(full_ll) / as.numeric(null_ll)
cat("McFadden Pseudo-R2:", round(mcf_r2, 4), "\n")

# AIC / BIC
cat("AIC:", AIC(logit_mod), "\n")
cat("BIC:", BIC(logit_mod), "\n")

# Likelihood-ratio chi-square with its upper-tail p-value (df = 10 predictors)
lrtest <- 2 * (as.numeric(full_ll) - as.numeric(null_ll))
lr_p   <- pchisq(lrtest, df = 10, lower.tail = FALSE)
cat("LR Chi2:", round(lrtest, 2), " df:", 10, " p:", signif(lr_p, 3), "\n")

# ── 9. Hosmer-Lemeshow goodness-of-fit ──────────────────────────────────
train_s$pred_prob <- predict(logit_mod, type="response")
hl_test <- hoslem.test(train_s$default, train_s$pred_prob, g = 10)
print(hl_test)

 

# ── 10. Test-set predictions & diagnostics ───────────────────────────────
test_s$pred_prob  <- predict(logit_mod, newdata=test_s, type="response")
test_s$pred_class <- if_else(test_s$pred_prob >= 0.50, 1L, 0L)

# Classification metrics
# Fix both factors to levels 0/1 so the table is well-formed even when
# no observation is predicted as a default at the 0.50 threshold
confusionMatrix(factor(test_s$pred_class, levels = c(0, 1)),
                factor(test_s$default,    levels = c(0, 1)),
                positive = "1")

# AUC-ROC
roc_obj <- roc(test_s$default, test_s$pred_prob)
cat("AUC-ROC:", round(auc(roc_obj), 4), "\n")

# ROC curve plot
plot(roc_obj,
     main = "ROC Curve – LendingClub Loan Default Model",
     col  = "steelblue", lwd  = 2)
abline(a=0, b=1, lty=2, col="gray50")
legend("bottomright",
       legend = paste0("AUC = ", round(auc(roc_obj), 4)),
       col = "steelblue", lwd = 2)