
Risky Business: Predicting Cancellations in Imbalanced Multi-Classification Settings

Anand Deshmukh, Meena Kewlani, Yash Ambegaokar, Matthew A. Lanham
Purdue University Krannert School of Management


Identifying new sales opportunities and allocating resources to the accounts with the highest revenue potential is a challenging problem for companies. When a customer reneges on a prior commitment, the company bears not only the loss of potential revenue but also the sunk costs of acquiring the customer's business.

Graph: Status of projects

Our industry partner is in the business of fixing/improving a product owned by their customer. Acquiring the customers is a very involved and resource-intensive process and once the customer signs the contract, several internal and external resources are employed in the planning and execution of these projects. Hence, such unforeseen cancellations pose a significant risk to our industry partner.

We study the use of machine learning techniques in predicting if a customer might cancel a deal after initially agreeing, and build a model to identify at-risk projects, thereby providing our industry partner the decision-support required to proactively engage with the customer and save their business.

We use classification techniques to classify a project as:

  1. Closed: Project is successfully completed
  2. Declined: Project is declined by management
  3. Cancelled: Customer reneges on the contract

As the project-status graph shows, the classes are heavily imbalanced. We treat this imbalance using the following techniques:

  1. Resampling
  2. Class weights
  3. Combination of both


Data Sources

  1. Industry Partner: Our industry partner provided attributes of all the projects undertaken in the past 12 months, with over 300,000 observations. It included information regarding the projects, customers, the product being fixed, leads, lead sources, industry partner’s employees, representatives that are involved in the project, and many more. The status of the projects (Closed/Declined/Cancelled) is the response variable.

  2. Publicly available (zip code level) demographic data about income, educational levels, unemployment rates, and population were used to create clusters of zip codes.


Figure of methodology

Graph: Feature Engineering

  1. Discount ratio: Discounted price offered versus market price
  2. Clusters of zip codes using K-means algorithm
  3. Features created to quantify the experience and win-rate of salesforce
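
The first and third features above can be sketched roughly as follows. This is a minimal illustration, not the exact computation used in the study: the field names (`rep`, `status`, `discounted`, `market`), the sample records, and the definition of win-rate as the share of a representative's projects that ended as Closed are all assumptions made for the example.

```python
from collections import defaultdict


def discount_ratio(discounted_price, market_price):
    """Discounted price offered relative to the market price."""
    if market_price <= 0:
        raise ValueError("market_price must be positive")
    return discounted_price / market_price


def win_rates(projects):
    """Share of each representative's projects that ended as Closed.

    Assumed definition of win-rate; the study does not spell one out.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for project in projects:
        totals[project["rep"]] += 1
        if project["status"] == "Closed":
            wins[project["rep"]] += 1
    return {rep: wins[rep] / totals[rep] for rep in totals}


# Hypothetical records, for illustration only.
projects = [
    {"rep": "A", "status": "Closed", "discounted": 80.0, "market": 100.0},
    {"rep": "A", "status": "Cancelled", "discounted": 60.0, "market": 100.0},
    {"rep": "B", "status": "Closed", "discounted": 90.0, "market": 120.0},
]

ratios = [discount_ratio(p["discounted"], p["market"]) for p in projects]
rates = win_rates(projects)
```

In practice a win-rate like this would be computed from projects that precede the one being scored, so that the feature does not leak the outcome it is meant to predict.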

An interesting finding from exploratory data analysis: the tougher the customer, the greater the discount (refer to graph).

After feature engineering, exploratory data analysis (EDA), and data preprocessing, we treat the class imbalance on the training set.

Model Building and Comparison/Selection

We use Random Forest and Gradient Boosting classifiers for this multi-class classification problem.

Treatment of Class Imbalance

1. Resampling using Synthetic Minority Over-Sampling Technique (SMOTE):

     i. Auto: Over-sample all classes to match the majority class
     ii. Minority: Over-sample the minority class to match the majority class
     iii. Custom: Over-sample using a dictionary passed as an argument

2. Class weights
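
To make the options above concrete, the sketch below computes the per-class sample counts each over-sampling strategy would aim for, along with "balanced" class weights. It is an illustration under stated assumptions, not the study's code: the function names and the 70/20/10 class split are hypothetical, the custom dictionary is assumed to map a class to its desired sample count, and the weight formula `n_samples / (n_classes * class_count)` is the "balanced" heuristic documented by scikit-learn. The sketch does not generate synthetic samples; it only shows the target sizes.

```python
from collections import Counter


def oversampling_targets(labels, strategy="auto"):
    """Target number of samples per class after over-sampling."""
    counts = Counter(labels)
    majority_size = max(counts.values())
    if isinstance(strategy, dict):
        # Custom: explicit target counts for selected classes (assumed semantics).
        targets = dict(counts)
        for cls, target in strategy.items():
            if cls not in counts:
                raise ValueError(f"unknown class: {cls}")
            if target < counts[cls]:
                raise ValueError("over-sampling cannot shrink a class")
            targets[cls] = target
        return targets
    if strategy == "auto":
        # Every class is raised to the size of the majority class.
        return {cls: majority_size for cls in counts}
    if strategy == "minority":
        # Only the smallest class is raised to the majority size.
        minority_class = min(counts, key=counts.get)
        targets = dict(counts)
        targets[minority_class] = majority_size
        return targets
    raise ValueError(f"unsupported strategy: {strategy}")


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


# Hypothetical class distribution, for illustration only.
labels = ["Closed"] * 70 + ["Declined"] * 20 + ["Cancelled"] * 10

auto_targets = oversampling_targets(labels, "auto")
minority_targets = oversampling_targets(labels, "minority")
custom_targets = oversampling_targets(labels, {"Cancelled": 40})
weights = balanced_class_weights(labels)
```

With this split, the rarest class (Cancelled) receives the largest weight, so misclassifying a cancelled project is penalised more heavily during training.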

Since the classes are imbalanced, we use the following evaluation metrics:

Data table: Metrics with formula
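
The metrics table is not reproduced in this text, so the sketch below only illustrates per-class precision and recall, the two scores named in the conclusions. The confusion-matrix counts are hypothetical, and the layout (rows are actual classes, columns are predicted classes) is an assumption.

```python
def per_class_precision_recall(confusion_matrix, classes):
    """Return {class: (precision, recall)} from a square confusion matrix.

    Assumes rows are actual classes and columns are predicted classes.
    """
    scores = {}
    for k, name in enumerate(classes):
        true_positives = confusion_matrix[k][k]
        predicted_as_k = sum(row[k] for row in confusion_matrix)
        actually_k = sum(confusion_matrix[k])
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        scores[name] = (precision, recall)
    return scores


classes = ["Closed", "Declined", "Cancelled"]

# Hypothetical counts, for illustration only.
confusion_matrix = [
    [50, 5, 5],   # actual Closed
    [4, 20, 1],   # actual Declined
    [6, 0, 9],    # actual Cancelled
]

scores = per_class_precision_recall(confusion_matrix, classes)
```

Reporting the scores per class rather than as a single accuracy figure keeps the minority classes visible, since overall accuracy is dominated by the majority class.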

Additionally, we calculate the costs our industry partner would save by using our model.

Cost Matrix ($C_{ij}$)

Data Table: Actual versus Predicted Status

Assumption: 10% of projects that would have been cancelled can get saved.

Confusion Matrix for each model ($3 \times 3$): $CF_{ij}$

$$\text{Cost Saving Per Project} = \frac{1}{N} \sum_{i=1}^{3} \sum_{j=1}^{3} C_{ij} \times CF_{ij}$$

where $N$ is the number of projects.
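
As a worked illustration of the formula above, the sketch below multiplies a cost matrix element-wise with a confusion matrix and divides by the number of projects. All numbers are hypothetical: the actual cost matrix is not reproduced here, and the example assumes a project value of 1,000, a wasted-outreach cost of 50 for a Closed project flagged as Cancelled, and a saving of 10% of project value for each correctly flagged cancellation (mirroring the 10% assumption stated above).

```python
def cost_saving_per_project(cost_matrix, confusion_matrix):
    """Sum of element-wise products of the two matrices, divided by N."""
    size = len(confusion_matrix)
    n_projects = sum(sum(row) for row in confusion_matrix)
    total = sum(
        cost_matrix[i][j] * confusion_matrix[i][j]
        for i in range(size)
        for j in range(size)
    )
    return total / n_projects


# Rows: actual status, columns: predicted status
# (order: Closed, Declined, Cancelled). Hypothetical values only.
cost_matrix = [
    [0.0, 0.0, -50.0],   # Closed flagged as Cancelled: assumed outreach cost
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 100.0],   # Cancelled caught: 10% of an assumed 1,000 value
]
confusion_matrix = [
    [50, 5, 5],
    [4, 20, 1],
    [6, 0, 9],
]

saving = cost_saving_per_project(cost_matrix, confusion_matrix)
```

Here the total is 5 × (−50) + 9 × 100 = 650 over 100 projects, or 6.5 per project. Computing this figure for each candidate model gives a single business-facing number to rank them by.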

Finally, we pick the model that has the best evaluation metrics across the three classes and that enables our industry partner to save the most potential revenue and cost by correctly classifying the projects that would close, get declined by management, or get cancelled by customers.


Data table: combinations of models we ran for the purpose of this study
The table above lists the combinations of models we ran for this study. For each model, we calculated the cost savings per project using the formula given earlier. We observe that while SMOTE and class weights increased the cost savings for the Random Forest classifier (individually as well as in combination), the base Gradient Boosting classifier outperformed all other models and performed better without any treatment of class imbalance.

On the basis of the cost savings, the top 3 models (highlighted above) were:
1. Gradient Boosting (Base model)

Data table: Gradient Boosting (Base model)

2. Gradient Boosting (with Custom SMOTE)

Data table: Gradient Boosting (with Custom SMOTE)

3. Random Forest (with Auto SMOTE and Balanced Class Weights)

Data table: Random Forest (with Auto SMOTE and Balanced Class Weights)

Best Performing Model

Data table: Best Performing Model


  1. We develop a model to predict whether a project undertaken by our industry partner will be completed successfully, be declined by the company, or be cancelled because the customer reneges on the contract.
  2. We use a Random Forest classifier and a Gradient Boosting classifier for this multi-class classification problem. The imbalance in classes is treated using SMOTE, setting class weights, and a combination of the two.
  3. Models are evaluated by comparing the potential revenue and costs they save as well as the precision and recall scores of their predictions. The model with the highest cost savings also has the highest precision and recall scores among the models developed.
  4. By deploying our best performing model, our industry partner can save $10.63 million annually.



We thank our industry partner for sharing their business problem and data with us. The Purdue BIAC partially funded this work.