dc.description.abstract |
The vast accessibility and advancement of the internet have made it an essential
component of modern companies and organizations. Particularly in recent times, with
the emergence of the COVID-19 pandemic, the adoption of online platforms and digital
solutions has become increasingly prevalent among businesses to connect with their
customers. E-commerce refers to the buying and selling of products or services over the
internet. Online firms interact with clients under non-contractual terms, making it
difficult to track customer retention. One of the major challenges encountered by
e-commerce businesses is churn, which occurs when a customer stops buying a product
or service for a prolonged period. The churn rate in e-commerce is closely linked to a
company's revenue, as retaining customers leads to higher margins compared to
randomly acquiring new customers. It is estimated that the cost of acquiring new
customers is four to five times that of retaining existing customers. The foremost
objective of this research is to determine the most effective approach for identifying
potential customer churn in the e-commerce industry. To carry out the analysis, an
unlabelled dataset obtained from an e-commerce store is used to gain insights
into customer purchasing patterns. The data undergoes various stages of
preprocessing, during which new features are derived from the original
dataset. To label the customer data, three distinct churn indicator techniques have been
applied. These techniques include a comparison of the average purchase duration of
customers, the implementation of the RFM (recency, frequency, and monetary) method,
and the application of a K-means unsupervised learning algorithm. Ultimately, a
comparative analysis of several machine learning classification algorithms is performed
to develop an accurate churn prediction model. This study constructed nine models by
combining the Random Forest, Support Vector Machine, and Extreme Gradient
Boosting algorithms with the three labelling criteria. These models were then
evaluated based on a range of performance metrics, including precision, recall, F1-score,
accuracy, and AUROC. The models attained their highest accuracy when trained on data
that had been labelled using the RFM method, with accuracies of 86% and 82%,
respectively. Additionally, the memory and time consumption of the models were
assessed, and the Support Vector Machine classifier was found to use the least
memory, while the Extreme Gradient Boosting approach was the most
time-efficient. |
en_US |