While searching for infrastructure ideas for an ML platform project that I am about to start at my current company, I found that there are two competing candidates for orchestrating ML workflows: MLflow and Kubeflow. It is not quite right to call MLflow an orchestrator, but it still provides some of the functionality required for proper MLOps pipelines, so I guess that’s why people compare the two side by side. Since in our case there is an existing Kubernetes cluster on Azure Cloud, maintained by the SW group, it made much more sense to dive deep into the Kubeflow world and get some Kubernetes experience along the way. But deploying the Kubeflow app on Azure Cloud was quite a hassle in the first place. There are a couple of tutorials that go through the process smoothly, but they did not work out for me. So I ended up deploying with the manifests manually, which are provided for “advanced users” (a group I am totally not a member of), but anyway…
Here are the relevant notes for the deployment that I tried:
The first part of this tutorial, where you set up your resource group and configure the Kubernetes cluster, works fine:
az login
az account set --subscription id_of_my_subscription
az group create -n kubeflowdep -l westeurope
az aks create -g kubeflowdep -n KubeCluster -s Standard_D4s_v3 -c 2 -l westeurope --generate-ssh-keys
az aks get-credentials -n KubeCluster -g kubeflowdep
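The `az aks create` call above does not pin a Kubernetes version, so AKS uses its current default, which is worth knowing given the version trouble that follows. A minimal sketch for reading it back; `cluster_k8s_version` is a hypothetical helper, while `az aks show` with `--query kubernetesVersion` is standard Azure CLI. If a specific version is needed, `az aks create` also accepts `--kubernetes-version`, limited to the versions AKS still offers in the region (`az aks get-versions -l westeurope -o table` lists them).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Kubernetes version of an AKS cluster.
# Arguments: resource group and cluster name (defaults match the commands above).
cluster_k8s_version() {
  local group="${1:-kubeflowdep}"
  local name="${2:-KubeCluster}"
  # --query takes a JMESPath expression; kubernetesVersion is a cluster property.
  az aks show -g "$group" -n "$name" --query kubernetesVersion -o tsv
}

# Usage (requires a logged-in Azure CLI):
#   cluster_k8s_version kubeflowdep KubeCluster
```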
The tutorial's next step, however, fails in my case with this annoying error:
WARN[0013] Encountered error applying application application: (kubeflow.error): Code 500 with message: Apply.Run : [unable to recognize "/tmp/kout852172664": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1", unable to recognize "/tmp/kout852172664": no matches for kind "Application" in version "app.k8s.io/v1beta1"] filename="kustomize/kustomize.go:284"
WARN[0013] Will retry in 4 seconds.
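The `CustomResourceDefinition` API in `apiextensions.k8s.io/v1beta1` was removed in Kubernetes 1.22, so a cluster on 1.22 or newer rejects manifests written against it. One quick way to see what a cluster actually serves is `kubectl api-versions`; a minimal sketch, where `serves_api` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the current cluster serves the given group/version.
serves_api() {
  # `kubectl api-versions` prints one "group/version" per line;
  # -x matches whole lines only, -F treats the argument as a fixed string.
  kubectl api-versions | grep -qxF "$1"
}

# Usage:
#   serves_api apiextensions.k8s.io/v1beta1 || echo "CRD v1beta1 API not served"
```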
There is an open issue on the Kubeflow GitHub, but nobody seems to have provided a proper explanation of its possible root cause, let alone a fix. In the end I found out that it is a version issue: Kubeflow 1.5.0 is only supported by Kubernetes 1.21.0 or below. So I decided to install another kubectl version, but since I did not want to remove the latest kubectl, I found a workaround using asdf, a tool that manages multiple versions of the same app by providing a kind of “app environment” (the idea came from here). Following their documentation, this is what I did:
git clone https://github.com/asdf-vm/asdf.git ~/.asdf --branch v0.10.2
Then I added the following two lines to ~/.bashrc:
. $HOME/.asdf/asdf.sh
. $HOME/.asdf/completions/asdf.bash
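If the setup is scripted and may be re-run, appending blindly would duplicate these lines. A minimal sketch of an idempotent append; `add_line_once` is a hypothetical helper, and the single quotes in the usage keep `$HOME` literal so it is expanded when `~/.bashrc` is sourced, as in the asdf instructions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append LINE to FILE unless FILE already contains it
# as a whole line, so re-running the setup does not duplicate entries.
add_line_once() {
  local file="$1" line="$2"
  touch "$file"
  # If the file does not end with a newline, add one so LINE starts on its own line.
  if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
    printf '\n' >> "$file"
  fi
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage:
#   add_line_once ~/.bashrc '. $HOME/.asdf/asdf.sh'
#   add_line_once ~/.bashrc '. $HOME/.asdf/completions/asdf.bash'
```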
Finally, I installed the plugin:
asdf plugin add kubectl
asdf install kubectl 1.21.0
You can also add plugins by their Git URL, as described here.
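`asdf plugin add` accepts an optional Git URL as its second argument. A minimal sketch that adds the kubectl plugin only when it is missing and then installs the pinned version; `ensure_kubectl` is a hypothetical helper, and the repository URL below is an assumption to check against the asdf plugin index before use.

```shell
#!/usr/bin/env bash
# Assumed plugin repository; verify it against the asdf plugin index.
KUBECTL_PLUGIN_URL="https://github.com/asdf-community/asdf-kubectl.git"

# Hypothetical helper: add the kubectl plugin if needed, then install VERSION.
ensure_kubectl() {
  local version="${1:-1.21.0}"
  # `asdf plugin list` prints one installed plugin name per line.
  if ! asdf plugin list 2>/dev/null | grep -qx kubectl; then
    asdf plugin add kubectl "$KUBECTL_PLUGIN_URL" || return 1
  fi
  asdf install kubectl "$version"
}

# Usage:
#   ensure_kubectl 1.21.0
```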
Then I made sure that the version active in my current working directory is 1.21.0:
asdf local kubectl 1.21.0
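`asdf local` records this choice in a `.tool-versions` file in the current directory, one `<tool> <version>` pair per line. A minimal sketch for reading the pinned version back, for instance to make a deployment script refuse to run with a different kubectl; `pinned_version` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the version pinned for TOOL in a .tool-versions file.
# Assumed format: "<tool> <version>" per line, as written by `asdf local`.
pinned_version() {
  local tool="$1" file="${2:-.tool-versions}"
  [ -f "$file" ] || return 1
  # Print the first match; exit status is 0 only if the tool was found.
  awk -v t="$tool" '$1 == t { print $2; found = 1; exit } END { exit !found }' "$file"
}

# Usage:
#   [ "$(pinned_version kubectl)" = "1.21.0" ] || echo "unexpected kubectl pin"
```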
With kubectl 1.21.0 in place, I followed the instructions on the Kubeflow page, which require:
- Kubernetes (up to 1.21.0) - check
- kustomize (v3.2) - downloaded and made executable - check
- kubectl - check
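The requirements above can be checked from a script before applying anything. A minimal sketch under a few assumptions: `require_tools`, `version_le` and `kubeflow_k8s_ok` are hypothetical helper names, `sort -V` comes from GNU coreutils, and the 1.21 ceiling is the limit stated in this post (any 1.21.x patch release is treated as within it), not an official compatibility matrix. It only checks that the tools exist; their versions can be inspected with `kubectl version --client` and `kustomize version`.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers for the requirement list above.

# require_tools CMD...: report every missing command; fail if any is absent.
require_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# version_le A B: succeed if version A sorts at or before version B.
# GNU sort -V orders versions naturally, so 1.9 comes before 1.21.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# kubeflow_k8s_ok VERSION: succeed if VERSION's major.minor is at most 1.21.
kubeflow_k8s_ok() {
  local v="${1#v}"        # drop the leading "v" kubectl prints, e.g. v1.22.6
  local minor="${v%.*}"   # "1.22.6" -> "1.22" (expects major.minor.patch)
  version_le "$minor" "1.21"
}

# Usage:
#   require_tools kubectl kustomize || exit 1
#   kubeflow_k8s_ok 1.22.6 || echo "cluster too new for this setup"
```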
I downloaded the v1.3.0 manifests, extracted them, and finally applied all the official components in the example directory with:
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
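The loop above retries forever. The first attempts are expected to fail, largely because custom resources can only be created once their CustomResourceDefinitions are established. A variant with an upper bound keeps a genuinely broken manifest from looping indefinitely; `retry` and `apply_example` are hypothetical helpers, and 30 attempts with a 10-second delay are arbitrary choices.

```shell
#!/usr/bin/env bash
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between failed attempts.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "Retrying to apply resources (attempt $attempt failed)" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# apply_example: the pipeline from the loop above, run in a subshell so that
# pipefail (a kustomize failure also counts as a failed attempt) stays local.
apply_example() (
  set -o pipefail
  kustomize build example | kubectl apply -f -
)

# Usage, from the directory containing the extracted manifests:
#   retry 30 10 apply_example
```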
Applying the manifests creates all the services and pods; it takes a while for all of them to spin up, so be patient. At that point you are supposed to be done. However, for reasons I don't know yet, some of the pods never came up; that will be investigated further! That’s it for an “un-fruitful” tutorial! :D
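To find the pods that never came up, listing everything that is neither `Completed` nor fully ready and `Running` narrows the search before reaching for `kubectl describe pod` and `kubectl logs`. A minimal sketch; `stuck_pods` is a hypothetical helper that parses the default `kubectl get pods -A --no-headers` columns (NAMESPACE, NAME, READY, STATUS, ...), which is an assumption about the output layout rather than a stable API.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "namespace/pod STATUS READY" for every pod that is
# not Completed and is either not Running or not fully ready (e.g. 0/1).
# Assumes the default layout: NAMESPACE NAME READY STATUS RESTARTS AGE
stuck_pods() {
  kubectl get pods -A --no-headers |
    awk '{ split($3, r, "/") }
         $4 != "Completed" && ($4 != "Running" || r[1] != r[2]) { print $1 "/" $2, $4, $3 }'
}

# Usage, then inspect a reported pod (placeholders in angle brackets):
#   stuck_pods
#   kubectl describe pod -n <namespace> <pod>
```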