A recent episode of The Feast Podcast featured the co-creator of Kubeflow, David Aronchick, along with hosts Willem Pienaar and Demetrios Brinkmann. David, Willem, and Demetrios talked about the complexities of setting up machine learning (ML) infrastructure today and what’s needed in the future to improve this process. You can read about the highlights from the podcast below or listen to the full episode here.
Creation and philosophy behind Kubeflow
Kubeflow is an open-source project that simplifies deploying machine learning (ML) workflows on Kubernetes, a system for managing containers. Originally based on Google's internal method for deploying TensorFlow models, it is now available for public use and can deploy workflows anywhere Kubernetes runs: on-premises installations, Google Cloud, AWS, and Azure.
For ML practitioners, training is usually done in one of two ways. If the data set is small, users typically work in a Jupyter notebook, which allows them to quickly iterate on the necessary parameters without much manual setup. On the other hand, if the data set is very large, distributed training across many physical or virtual machines is required.
Kubeflow originally started as a way to connect these two worlds, so a practitioner could start in a Jupyter notebook and then move into distributed training, picking up more features, pipelines, and feature stores as the data set grows. Kubeflow did not provide these additional capabilities by itself; instead, it aimed to partner with services that did, which marked the beginning of a great collaboration with Feast.
David described how Kubeflow is built on a mix of services: Kubeflow defines the pipeline language, Feast provides the feature store, Argo does the work under the hood, Katib provides hyperparameter sweeps, and Seldon provides an inference endpoint. As Kubeflow matures, the goal is to restructure it from a monolithic installation, where many services are installed at once, into something cleaner and more specialized, so users install only the services they need. We can already see this happening with the graduation of KServe.
Improving the collaboration between data scientists and software engineers
Next, David discussed how data scientists and software engineers work together to build and deploy ML systems. Data scientists fine-tune the parameters of the model while engineers work on productionizing it, that is, making sure it runs smoothly without interruptions. Unfortunately, the production deployment process cannot be fully automated yet.
One of the core problems is that the APIs for ML systems are complicated to use, which is a hindrance to data scientists. A lot of work in ML is closer to science, where hypotheses are made and tested, as opposed to software development, where there is an iterative process and new versions are always being shipped. If you start building a distributed model based on a large data set, it may be hard or impossible to work in an interactive notebook like Jupyter unless a completely new, smaller model is created.
The general process for ML practitioners is a pipeline, but the individual steps are often not clearly described so it is difficult to map each step to the correct tool for the job. A data scientist’s daily work can often look like downloading a CSV, deleting a column of data, uploading it to a feature store, running a Python script, and then doing training. Willem stated the need for a better solution: “Small groups should be able to independently deploy solutions for specific use cases that solve business problems.” David wants to make this pipeline easier to perform with existing tools: “While there certainly are components of that available in Kubeflow and Kubernetes and others, I’m thinking about what that next abstraction layer looks like and it should be available shortly.”
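The ad-hoc workflow David describes can be sketched as a handful of explicit steps. This is a minimal stdlib-only sketch, not any tool's real API; the column name, feature-store stand-in, and pipeline structure are hypothetical illustrations:

```python
import csv
import io


def drop_column(csv_text: str, column: str) -> str:
    """Remove one column from CSV text (a stand-in for manual data cleaning)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = [f for f in reader.fieldnames if f != column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow({f: row[f] for f in fields})
    return out.getvalue()


def run_pipeline(csv_text: str) -> list[dict]:
    # 1. "Download" the CSV (here: passed in as a string).
    # 2. Delete a column the model should not see (hypothetical PII column).
    cleaned = drop_column(csv_text, "user_email")
    # 3. "Upload" to a feature store (here: just parse into feature rows).
    rows = list(csv.DictReader(io.StringIO(cleaned)))
    # 4. Hand the feature rows to a training step (omitted in this sketch).
    return rows
```

The point of spelling the steps out is that each one could, in principle, map onto a dedicated tool (data loader, transformation, feature store, trainer) instead of living in a one-off script.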
What’s needed to accelerate the industry
The landscape of ML operations platforms is very complex. There are several infrastructure options out there; Kubernetes was chosen as the backbone of Kubeflow because it is simple to set up and tweak. Willem talked about the consolidation of ML and ML operations tools: “It’s going to happen eventually because there’s just too many out there, and they’re not all going to make it. Right now, it’s the breeding grounds, and then it’s going to be survival of the fittest.” We can already see this playing out with DataRobot acquiring Algorithmia, Snowflake purchasing Streamlit, and, just a few days ago, Databricks buying Cortex Labs.
For open-source projects like Kubeflow, there should be a working group around core components that establishes standards. It isn’t necessary to have one person who makes all of the decisions in this space. If a new feature is needed, discussing the code is only about 10% of the problem; the majority of the work is deciding on implementation details and making sure that it works. The fastest way to get something done is just to build it yourself and try to get it merged.
David mentioned that to really improve the ecosystem for ML, we “need to develop not just a standard layer for describing the entire platform, but also a system that describes many of the common objects in machine learning: a feature, a feature store, a data set, a training run, an experiment, a serving inference, and so on. It will spur innovation because it defines a set of clear contracts that users can produce and consume.” Currently, this is hard to do programmatically because the variety of systems means that auxiliary tools need to be written to connect data sets.
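One way to read David's point is that if the common ML objects had shared, typed schemas, tools could produce and consume them without bespoke glue code. The classes below are a hypothetical sketch of what such contracts could look like, not an existing standard; all names and fields are illustrative:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Feature:
    """A single named feature with a declared type."""
    name: str
    dtype: str  # e.g. "int64", "float32"


@dataclass
class Dataset:
    """A named collection of features."""
    name: str
    features: list


@dataclass
class TrainingRun:
    """One training run within an experiment, with its inputs and outputs."""
    experiment: str
    dataset: Dataset
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


# Any tool that produces or consumes these shapes could interoperate,
# regardless of which feature store or trainer sits behind them.
run = TrainingRun(
    experiment="churn-v1",
    dataset=Dataset("users", [Feature("age", "int64")]),
    params={"lr": 0.01},
)
```

The value of contracts like these is precisely the "produce and consume" property David describes: a feature store emits `Dataset`s, a trainer emits `TrainingRun`s, and neither needs to know the other's internals.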
If expanding the future of ML infrastructure sounds exciting to you, there are many contributions that are needed! You can learn more about Feast, the feature store connected to Kubeflow, and start using it today. Jump into our Slack and say hi!