DevOps & Observability
for ML

Helix brings it all together

Helix takes the best open source MLOps components and integrates them in an enterprise-ready, supported and secure experience so that your data science and ML teams can be productive immediately within a robust governance framework.

Because it stands on the shoulders of giants, Helix itself is small and lightweight, yet surprisingly powerful. It is a hosted control plane that can be used to integrate existing systems and/or create an MLOps platform in your own cloud account from scratch.

An End-to-End Machine Learning & Data Science Platform with Helix

Teams, Projects & Collaboration

Helix supports Single Sign-On with OIDC, OAuth 2.0 and SAML 2.0, user federation with LDAP and Active Directory, and advanced fine-grained authorization policies.

Users can create secure, isolated projects and add collaborators to share notebook servers, pipelines, models & dashboards for maximum productivity, all while keeping critical business data and models secure; every event in the system can be fully logged and audited.

Notebooks

Helix deploys secured JupyterLab notebook servers for projects. We support the latest collaborative editing features in JupyterLab, integrated with the Single Sign-On system, for amazing and seamless collaboration between users.

These notebook servers are automatically configured to log experiments and models to MLflow for experiment and model management. The notebook servers support GPUs, auto-scaled from the underlying infrastructure where available.

Experiments & Model Management

Helix deploys a secured MLflow instance per project; read and write access can be managed between users and teams. It also mediates access to an object storage bucket per MLflow instance, so that the S3-compatible credentials handed out to clients (notebooks, pipelines, etc.) can be controlled securely.

MLflow provides detailed experiment tracking so data scientists can see and compare model metrics and metadata between experiment runs, promote runs to named models, and push those models into the build & deploy system so they can be run in development and production.

Data & Pipelines (coming soon)

Helix can integrate with Pachyderm, a data versioning & pipeline system that can be used for data engineering and automated model training after a model has been prototyped and developed in a notebook environment.

Pachyderm supports highly efficient, distributed, incremental data processing, as well as full provenance, so that, for example, you can trace a model misbehaving in production back to the specific version of the dataset it was trained on (and how that dataset was processed and created). NB: Pachyderm Auth is an enterprise feature which is available under a separate license agreement to Helix.

Build, Deploy, A/B

Helix supports building models from MLflow into KServe deployments that reference models living in object storage. These models can then be securely deployed via the Helix UI into development and production use-cases in KServe on Kubernetes.
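The deployment target can be sketched as a KServe v1beta1 InferenceService manifest, built here as a plain Python dict for illustration; the model name and bucket path are hypothetical, and in practice Helix would generate and apply such a manifest via GitOps:

```python
def inference_service(name: str, storage_uri: str) -> dict:
    """Build a minimal KServe InferenceService manifest whose storageUri
    points at model artifacts pushed to object storage by the build step."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "model": {
                    "modelFormat": {"name": "mlflow"},
                    "storageUri": storage_uri,
                },
            },
        },
    }

# Hypothetical model name and bucket path.
svc = inference_service("churn-model", "s3://helix-models/churn-model/1")
```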

Models can be easily integrated into applications using the SimpleML project (coming soon). Once models are deployed, live experiments can be run either in shadow mode (all traffic sent to both models; shadow responses are logged but never returned to the caller) or as A/B tests (traffic split, e.g. 90%/10%, for evaluation). Deployed models can then be monitored for data and model drift.
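The A/B traffic split can be sketched as deterministic hashing of a request ID into buckets (the split percentage and variant names below are illustrative, not Helix's API):

```python
import hashlib

def choose_variant(request_id: str, candidate_pct: int = 10) -> str:
    """Deterministically assign a request to 'control' or 'candidate'.

    Hashing the request ID keeps the assignment stable across retries,
    yielding roughly a 90%/10% split for candidate_pct=10.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = digest[0] % 100  # roughly uniform bucket in [0, 100)
    return "candidate" if bucket < candidate_pct else "control"

# Shadow mode differs: every request goes to the control model and is
# also mirrored to the shadow model, whose responses are only logged.
```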

Deployments are fully GitOps driven for DevOps best practice.

Statistical Monitoring (coming soon)

Helix supports Prometheus and Grafana based monitoring of models for error rates and latencies, but that's just the start for monitoring ML models. In order to understand the behavior of models in production, you also need to understand how the data being passed into models compares to the data the models were trained on, and how the predictions or classifications the models make differ between training time and production.

Helix integrates with the Boxkite project which supports recording a histogram of data inputs and model predictions at training time, and comparing both at runtime using secure Grafana dashboards created per model, using KL Divergence and K-S Test metrics.
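These drift metrics are straightforward to compute; below is a stdlib-only sketch of KL divergence over binned distributions and the two-sample K-S statistic (Boxkite's actual implementation may differ):

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two normalized histograms of equal length;
    eps avoids division by zero for empty bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    values = sorted(set(a) | set(b))
    ecdf = lambda s, x: sum(v <= x for v in s) / len(s)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in values)
```

Both metrics are zero when the production distribution matches training and grow as the distributions diverge, which is what makes them useful as dashboard signals.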

Cloud/On-prem, Kubernetes & GPUs

Helix can be deployed in any environment: in the cloud or on-prem. It interfaces with the Kubernetes APIs, as well as MLflow, Jupyter, KServe etc. On AWS, Azure and GCP, it supports deploying a Kubernetes cluster with auto-scaling GPUs configured.

This means you can give your ML team self-serve access to the resources they need for efficient model training, while keeping costs under control by automatically shutting down notebooks when idle.
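The idle-shutdown policy can be sketched as a simple check against a notebook server's last-activity timestamp (the one-hour threshold and function names are illustrative, not Helix's configuration):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

IDLE_THRESHOLD = timedelta(hours=1)  # illustrative cutoff

def should_shut_down(last_activity: datetime,
                     now: Optional[datetime] = None) -> bool:
    """Return True when a notebook server has been idle past the threshold."""
    now = now or datetime.now(timezone.utc)
    return now - last_activity > IDLE_THRESHOLD
```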

Deployment & upgrades are all handled using best practice GitOps.