Data science and machine learning are often associated with mathematics, statistics, algorithms and data wrangling. While these skills are core to implementing machine learning successfully in an organization, another function is gaining importance – DevOps for Data Science.
DevOps involves infrastructure provisioning, configuration management, continuous integration and deployment, testing and monitoring. DevOps teams have been working closely with development teams to manage the lifecycle of applications effectively.
Data science brings additional responsibilities to DevOps. Data engineering, a niche domain dealing with the complex pipelines that transform data, demands close collaboration between data science teams and DevOps. Operators are expected to provision highly available clusters of Apache Hadoop, Apache Kafka, Apache Spark and Apache Airflow that tackle data extraction and transformation. Data engineers acquire data from a variety of sources and then leverage Big Data clusters and complex pipelines to transform it.
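As a rough illustration of what such a pipeline looks like in code, the sketch below wires an extract step to a transform step in an Apache Airflow DAG (Airflow 2.x-style imports). The DAG name, task names and callables are hypothetical placeholders rather than part of any specific pipeline.

```python
# Minimal Airflow DAG sketch: a daily extract -> transform pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders():
    # Pull raw records from a source system (placeholder logic).
    print("extracting raw order data")


def transform_orders():
    # Clean and reshape the extracted records (placeholder logic).
    print("transforming order data")


with DAG(
    dag_id="orders_etl",                 # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_orders)
    transform = PythonOperator(task_id="transform", python_callable=transform_orders)

    # Run the transform step only after extraction succeeds.
    extract >> transform
```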
Data scientists explore the transformed data to find insights and correlations. They use a different set of tools, including Jupyter Notebooks, Pandas, Tableau and Power BI, to explore and visualize data. DevOps teams are expected to support data scientists by creating environments for data exploration and visualization.
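The snippet below is a small sketch of the kind of exploration a data scientist might run in a Jupyter Notebook once an environment is available. The file name and column names are hypothetical.

```python
# Quick exploration of a transformed dataset with Pandas.
import pandas as pd

df = pd.read_csv("transformed_orders.csv")   # hypothetical output of the data pipeline

print(df.describe())                          # summary statistics for numeric columns
print(df["region"].value_counts())            # distribution of a categorical column
print(df.select_dtypes("number").corr())      # pairwise correlations between numeric columns
```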
Building machine learning models is fundamentally different from traditional application development. The development is not only iterative but also heterogeneous. Data scientists and developers use a variety of languages, libraries, toolkits and development environments to evolve machine learning models. Popular languages for machine learning development such as Python, R and Julia are used within development environments based on Jupyter Notebooks, PyCharm, Visual Studio Code, RStudio and Juno. These environments must be available to data scientists and developers solving ML problems.
Machine learning and deep learning demand massive compute infrastructure running on powerful CPUs and GPUs. Frameworks such as TensorFlow, Caffe, Apache MXNet and Microsoft CNTK exploit GPUs to perform the complex computations involved in training ML models. Provisioning, configuring, scaling and managing these clusters is a typical DevOps function. DevOps teams may have to create scripts to automate the provisioning and configuration of the infrastructure for a variety of environments. They will also need to automate the termination of instances once the training job is done.
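A hedged sketch of such automation is shown below, using boto3 (the AWS SDK for Python) to launch a GPU instance for training and terminate it afterwards. The AMI ID, instance type, region and tags are placeholders; a real script would also wait for instance readiness, handle errors and enforce cost controls.

```python
# Provision and tear down GPU training infrastructure with boto3 (sketch).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region

# Launch a GPU-backed instance for a training job.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder deep learning AMI
    InstanceType="p3.2xlarge",         # placeholder GPU instance type
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "purpose", "Value": "ml-training"}],
    }],
)
instance_ids = [i["InstanceId"] for i in response["Instances"]]

# ... run the training job on the provisioned instance ...

# Terminate the instances once training completes to avoid idle GPU costs.
ec2.terminate_instances(InstanceIds=instance_ids)
```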
Similar to modern application development, machine learning development is iterative. New datasets result in training and evolving new ML models that need to be made available to users. Some of the best practices of continuous integration and deployment (CI/CD) are applied to ML lifecycle management. Each version of an ML model is packaged as a container image with a distinct tag. DevOps teams bridge the gap between the ML training environment and the model deployment environment through sophisticated CI/CD pipelines.
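One way such a pipeline step might look is sketched below, using the Docker SDK for Python to build and push a versioned model image. The registry, repository name and version scheme are hypothetical.

```python
# CI/CD step sketch: package a trained model into a versioned container image.
import docker

MODEL_VERSION = "1.4.0"                                # hypothetical, e.g. derived from the training run
REPOSITORY = "registry.example.com/ml/churn-model"     # hypothetical registry/repository

client = docker.from_env()

# Build an image from a Dockerfile that bundles the serving code and the
# trained model artifact, tagging it with the model version.
image, _ = client.images.build(path=".", tag=f"{REPOSITORY}:{MODEL_VERSION}")

# Push the versioned image so the deployment stage of the pipeline can pick it up.
client.images.push(REPOSITORY, tag=MODEL_VERSION)
```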
When a fully trained ML model is available, DevOps teams are expected to host it in a scalable environment. They may take advantage of orchestration engines such as Apache Mesos or Kubernetes to scale the model deployment.
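The sketch below shows, under assumed names, how the official Kubernetes Python client could be used to deploy a versioned model image with multiple replicas. The image reference, deployment name and replica count are placeholders; a production setup would add a Service, health probes and resource limits.

```python
# Deploy a versioned model image as a Kubernetes Deployment (sketch).
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() when running inside a cluster
apps = client.AppsV1Api()

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "churn-model"},                 # hypothetical name
    "spec": {
        "replicas": 3,                                   # scale out by raising the replica count
        "selector": {"matchLabels": {"app": "churn-model"}},
        "template": {
            "metadata": {"labels": {"app": "churn-model"}},
            "spec": {
                "containers": [{
                    "name": "model-server",
                    "image": "registry.example.com/ml/churn-model:1.4.0",  # image built by the CI/CD step
                    "ports": [{"containerPort": 8080}],
                }]
            },
        },
    },
}

apps.create_namespaced_deployment(namespace="default", body=deployment)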
The rise of containers and container management tools makes ML development more manageable and efficient. DevOps teams are leveraging containers to provision development environments, data processing pipelines, training infrastructure and model deployment environments. Emerging technologies such as Kubeflow and MLflow focus on enabling DevOps teams to tackle the new challenges involved in dealing with ML infrastructure.
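As a brief example of what tools like MLflow add, the sketch below records the parameters, metrics and model artifact of a training run so that downstream pipelines can trace and promote specific model versions. The dataset and model are purely illustrative.

```python
# Track a training run with MLflow (sketch).
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    params = {"C": 0.5, "max_iter": 200}                 # illustrative hyperparameters
    model = LogisticRegression(**params).fit(X, y)

    mlflow.log_params(params)                             # record hyperparameters
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")              # store the model artifact for later deployment
```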
Machine learning brings a new dimension to DevOps. Along with developers, operators will have to collaborate with data scientists and data engineers to support businesses embracing the ML paradigm.