A Machine Learning (ML) pipeline configures the workflow of a learning task using the APIs provided by ML libraries. However, a pipeline's performance can vary significantly across different configurations of ML library versions. Misconfigured pipelines can result in inferior performance, such as long execution time, high memory usage, numeric errors, and even crashes. A pipeline is subject to misconfiguration if it exhibits significantly inconsistent performance upon changes in the versions of its configured libraries or the combination of these libraries. We refer to such performance inconsistency as a pipeline configuration (PLC) issue.
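To make the notion concrete, the sketch below shows a minimal scikit-learn pipeline; it is not taken from the studied competitions, and the version pins in the comments are hypothetical. The point is only that the same pipeline code, executed under different library version configurations, can exhibit the inconsistent scores, run times, or memory footprints described above.

```python
# Illustrative only: a minimal scikit-learn pipeline whose observed behavior
# can depend on the configured library versions (version pins are hypothetical).
#
#   Configuration A: scikit-learn==1.0.2, numpy==1.21.6
#   Configuration B: scikit-learn==1.3.0, numpy==1.24.4
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Running this same code under configuration A and configuration B may yield
# noticeably different scores, execution times, or memory usage -- the kind of
# performance inconsistency referred to here as a PLC issue.
print(cross_val_score(pipeline, X, y, cv=5).mean())
```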
There is no prior systematic study on the pervasiveness, impact, and root causes of PLC issues. A systematic understanding of these issues helps developers configure effective ML pipelines and identify misconfigured ones. In this paper, we present the first systematic study of PLC issues. We propose Piecer, an infrastructure that automatically generates a set of pipeline variants by varying the version combinations of ML libraries and compares the variants' performance to detect inconsistencies. We apply Piecer to the 3,380 deployable pipelines among the 11,363 ML pipelines collected from multiple ML competitions on the Kaggle platform. The study results show that 1,092 (32.3%) of the 3,380 pipelines manifest significant performance inconsistencies in at least one variant. We find that 399, 243, and 440 pipelines could achieve better competition scores, execution time, and memory usage, respectively, by adopting a different configuration. Based on our empirical findings, we constructed a repository of 164 defective APIs and 106 API combinations across 418 library versions, which helped us capture PLC issues in 309 real-world ML pipelines.
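The sketch below illustrates the general variant-and-compare idea in a simplified form. It is not Piecer's implementation: the library names, version numbers, and the pipeline.py script it measures are assumptions for illustration, and a real infrastructure would isolate environments and collect richer metrics (scores, memory usage) than wall-clock time and crash status.

```python
# A minimal sketch of generating pipeline variants by varying library version
# combinations and comparing their performance. All pins and the pipeline.py
# script are hypothetical; this is not the Piecer implementation.
import itertools
import subprocess
import sys
import time

LIBRARY_VERSIONS = {
    "scikit-learn": ["1.1.3", "1.3.0"],
    "numpy": ["1.23.5", "1.24.4"],
}

def run_variant(pipeline_script: str, pins: dict) -> dict:
    """Install one version combination, then run and time the pipeline script."""
    reqs = [f"{lib}=={ver}" for lib, ver in pins.items()]
    subprocess.run([sys.executable, "-m", "pip", "install", "--quiet", *reqs],
                   check=True)
    start = time.perf_counter()
    result = subprocess.run([sys.executable, pipeline_script],
                            capture_output=True, text=True)
    return {
        "pins": pins,
        "crashed": result.returncode != 0,
        "seconds": time.perf_counter() - start,
    }

def compare_variants(pipeline_script: str) -> list:
    """Enumerate all version combinations and collect per-variant measurements."""
    libs = list(LIBRARY_VERSIONS)
    reports = []
    for combo in itertools.product(*(LIBRARY_VERSIONS[lib] for lib in libs)):
        reports.append(run_variant(pipeline_script, dict(zip(libs, combo))))
    return reports

if __name__ == "__main__":
    # Large differences between reports hint at a potential PLC issue.
    for report in compare_variants("pipeline.py"):
        print(report)
```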