Cost-aware Retraining for Machine Learning


Retraining a machine learning (ML) model is essential for maintaining its performance as the underlying data change over time. However, retraining is also costly, as it typically requires re-processing the entire dataset. A trade-off therefore arises: on the one hand, retraining an ML model too frequently incurs unnecessary computing costs; on the other hand, not retraining frequently enough leaves the model stale and incurs a cost in lost accuracy. To resolve this trade-off, we envision ML systems that make automated and cost-optimal decisions about when to retrain an ML model.

In this work, we study the decision problem of whether to retrain or keep an existing ML model based on the data, the model, and the predictive queries answered by the model. Crucially, we consider the costs associated with each decision and aim to optimize the trade-off. Our main contribution is a Cost-Aware Retraining Algorithm, CARA, which optimizes the trade-off over streams of data and queries. To explore the performance of CARA, we first analyze synthetic datasets and demonstrate that CARA can adapt to different data drifts and retraining costs while performing similarly to an optimal retrospective algorithm. Subsequently, we experiment with real-world datasets and demonstrate that CARA achieves better accuracy than drift-detection baselines while making fewer retraining decisions, thus incurring lower total costs.
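To make the retrain-or-keep trade-off concrete, the following is a minimal illustrative sketch, not the actual CARA algorithm: it retrains once the estimated accumulated staleness cost of answering queries with the current model exceeds the one-time retraining cost. The function name and cost model are hypothetical assumptions for illustration only.

```python
# Hypothetical sketch (NOT the CARA algorithm from the paper): a simple
# cost-aware decision rule. We retrain when the accumulated per-query
# staleness cost since the last retraining exceeds the fixed cost of
# retraining on the full dataset.

def should_retrain(staleness_costs, retrain_cost):
    """Decide whether to retrain or keep the existing model.

    staleness_costs: estimated loss-of-accuracy costs for the queries
        answered since the last retraining (one float per query).
    retrain_cost: fixed cost of retraining the model on the full dataset.
    Returns True if retraining is the cheaper decision under this model.
    """
    return sum(staleness_costs) > retrain_cost

# Example: accumulated staleness cost 0.9 + 1.2 + 1.5 = 3.6 exceeds a
# retraining cost of 3.0, so retraining is preferred here.
print(should_retrain([0.9, 1.2, 1.5], 3.0))  # True
print(should_retrain([0.5, 0.4], 3.0))       # False: keep the stale model
```

The actual algorithm must make this decision online, without knowing future data drifts or queries, which is what distinguishes CARA from the optimal retrospective baseline it is compared against.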

Knowledge-Based Systems
Ananth Mahadevan