14 Principles To Secure Your Data Pipelines

Part 3 of the 7 Layers of MLOps Security Guide

A simple ETL flow using Airflow in AWS with S3, Redshift and QuickSight

Series Links

Part 1. Intro and Data Security
Part 2. Protecting Your Data Storage
Part 3. Securing Your Orchestrator
Part 4. ML Model Security
Part 5. ML Model Hosting
Part 6. Securely Exposing ML Models to Users
Part 7. Logging and Monitoring MLOps Infra

What are data pipelines?

Data pipelines are at the heart of building a robust data practice. These pipelines can help clean your data, trigger ML model retraining, or notify an analyst when a certain metric is hit. Since pipelines are so ubiquitous, you need a way to manage them, which we’ll cover next.

What is an orchestrator?

Once we build out the infrastructure and our pipelines, how do we actually make the data flow?

Just as a choir conductor designates the flow of the songs, an orchestrator designates when and how your pipelines run. Popular orchestrators include:
  • Airflow
  • Airbyte
  • Apache Camel
  • Azure Data Factory
  • AWS Glue
  • GCP Dataflow
  • Kubeflow

An example data pipeline

Before going further, let’s present an example of a sample data pipeline we would use, and later secure.

A simple Apache Airflow architecture: an ETL flow in AWS
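
To make this concrete, here is a minimal sketch of what this pipeline could look like as an Airflow 2.x DAG, assuming the Amazon provider package is installed and that the aws_default / redshift_default connections exist. The bucket, schema and table names are placeholders for illustration, not part of the original architecture.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

with DAG(
    dag_id="s3_to_redshift_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Copy the day's extract from S3 into Redshift, where QuickSight can query it.
    load_to_redshift = S3ToRedshiftOperator(
        task_id="load_sales_to_redshift",
        schema="analytics",                 # placeholder schema
        table="daily_sales",                # placeholder table
        s3_bucket="example-data-lake",      # placeholder bucket
        s3_key="sales/{{ ds }}/sales.csv",  # templated daily partition
        copy_options=["CSV", "IGNOREHEADER 1"],
        redshift_conn_id="redshift_default",
        aws_conn_id="aws_default",
    )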

Protecting a data pipeline — 7 steps and 14 principles

With that background on data pipelines, we'll now walk through the steps to secure them. For each step below, I've included principles to keep in mind and apply to your own work.

  1. Defining user personas
  2. Defining their actions
  3. Understanding the platform
  4. Securing the platform
  5. Writing secure pipelines/jobs
  6. Granting Access
  7. Keeping the platform up to date

Who would be involved in this flow?

Principle 1: Understand who will use your platform

  • Platform Engineers
  • Data Engineers
  • Data Analysts
  • Operations Engineers
  • Security Engineers

What actions does each user take?

Principle 3: Establish a baseline of actions to set boundaries for each user

Our user actions
Don’t encourage your employees to sneak! | Looney Tunes

Understanding the platform— Airflow

Principle 5: Understand the tools you are working with.

Airflow GUI with different jobs (DAGs) | Apache
Airflow Architecture | Apache
  • Airflow is a platform that lets you build and run workflows. A workflow is represented as a DAG (a Directed Acyclic Graph), and contains individual pieces of work called Tasks, arranged with dependencies and data flows taken into account.
  • A webserver, which presents a handy user interface to inspect, trigger and debug the behaviour of DAGs and tasks.
  • A scheduler, which handles both triggering scheduled workflows, and submitting Tasks to the executor to run.
  • An executor, which handles running tasks. In the default Airflow installation, this runs everything inside the scheduler, but most production-suitable executors actually push task execution out to workers.
  • A folder of DAG files, read by the scheduler and executor (and any workers the executor has)
  • A metadata database, used by the scheduler, executor and webserver to store state.
Airflow component access

Protecting a managed version of Airflow

Principle 6: Use trusted managed services when possible, embrace the shared responsibility model.

AWS Managed Airflow (MWAA) components | AWS

IAM and RBAC

Principle 7: Set up effective use permissions and remember least privilege!

MWAA IAM policy | AWS
Airflow role assignments
MWAAFullConsoleAccess snippet | AWS
MWAA Architecture | AWS
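
As a sketch of what least privilege can look like in practice, the snippet below creates a narrowly scoped policy for a single MWAA environment instead of a broad MWAAFullConsoleAccess-style grant. The account ID, region, environment name and policy name are placeholders, and the exact actions and resource ARNs should be verified against the AWS MWAA documentation for your use case.

import json

import boto3

# Placeholder ARNs for one environment; adjust region, account and name.
ENVIRONMENT_ARN = "arn:aws:airflow:us-east-1:111122223333:environment/example-env"
WEB_LOGIN_ROLE_ARN = "arn:aws:airflow:us-east-1:111122223333:role/example-env/User"

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow describing and opening the Airflow UI for this one environment,
            # rather than granting airflow:* on all resources.
            "Effect": "Allow",
            "Action": ["airflow:GetEnvironment", "airflow:CreateWebLoginToken"],
            "Resource": [ENVIRONMENT_ARN, WEB_LOGIN_ROLE_ARN],
        }
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="mwaa-example-env-viewer",  # hypothetical policy name
    PolicyDocument=json.dumps(policy_document),
)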

Creating Jobs

Once you have access to the platform, it’s time to create your jobs. Every job you create will be slightly different, and each team will have different use cases, but that’s the power of Airflow: there’s plenty of flexibility! Below we’ll cover some important principles to keep in mind when building your DAGs.

  • Avoid deploying premature code
  • Automate testing (see the sketch after this list)
  • Deploy to prod only after tests pass
  • Roll forward on bugs
  • Deploy code only from a trusted location
  • Have good visibility into what you have running
  • Avoid overly complicated jobs that touch everything, when possible
Monitoring a complex job in the Airflow UI | Airflow
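A minimal sketch of what automated DAG testing can look like, assuming your DAG files live in a dags/ folder and the suite runs under pytest in CI before anything is promoted to prod:

from airflow.models import DagBag

def test_dags_import_cleanly():
    # Any DAG file that raises on import shows up in import_errors.
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    assert not dag_bag.import_errors, f"DAG import failures: {dag_bag.import_errors}"

def test_every_dag_has_an_owner():
    # An owner gives you a clear point of contact for every job that runs.
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    for dag_id, dag in dag_bag.dags.items():
        assert dag.owner, f"{dag_id} has no owner"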
  • Set up a standard logging framework that will be used by many teams. It should be someone’s responsibility to maintain it, test it and improve it.
  • Use types that censor sensitive data, like SecretStr in Python
  • Strike a balance between throwing descriptive errors and revealing key pieces of information about your app. The name of the S3 bucket might be appropriate to log, but the ARN contains more compromising info
  • Treat log files as potentially containing sensitive information; scan and sanitize them regularly
class UserAccount {
  id: string
  username: string
  passwordHash: string
  firstName: string
  lastName: string

  ...

  // Only the id is exposed when the object is stringified or logged,
  // never the password hash or other personal fields.
  public toString() {
    return `UserAccount(${this.id})`;
  }
}
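
For comparison, here is a minimal Python sketch of the same idea using pydantic's SecretStr; the model and values are made up for illustration:

import logging

from pydantic import BaseModel, SecretStr

class UserAccount(BaseModel):
    id: str
    username: str
    password_hash: SecretStr  # rendered as '**********' when printed or logged

logging.basicConfig(level=logging.INFO)

account = UserAccount(id="u-123", username="ada", password_hash="bcrypt$notreal")
logging.getLogger(__name__).info("Loaded %s", account)  # hash stays masked

# The real value requires an explicit call, which is easy to spot in code review.
real_hash = account.password_hash.get_secret_value()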

Network Access

Principle 12: Think about your network topology, but never rely on it

  • Have good IAM policies and roles set up
  • Configure MFA
  • Enable privileged role management
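
One way to avoid relying on the network alone is to enforce MFA in IAM itself. Below is a sketch of a common AWS pattern, written as a Python dict purely for illustration; attach something like this to your human user groups and verify the action list against your own needs.

# Deny everything unless the caller authenticated with MFA, regardless of
# which network the request came from. NotAction leaves room to enroll a device.
deny_without_mfa = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAllWhenNoMFA",
            "Effect": "Deny",
            "NotAction": [
                "iam:ListMFADevices",
                "iam:EnableMFADevice",
                "sts:GetSessionToken",
            ],
            "Resource": "*",
            "Condition": {"BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}},
        }
    ],
}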

Segmenting Environments

Principle 13: Create development environments for trying new things, and a staging (non-prod) environment that mirrors prod

Baby Yoda touching everything | Disney

Protecting a deployment — Airflow

Now, if you don’t want to use a managed version of Airflow, the security work becomes more involved. We’ve already seen the architecture of AWS MWAA; below is the architecture of GCP Composer, GCP’s managed Airflow.

GCP Composer public-facing architecture | GCP
We’ll look at three areas of a self-managed deployment:
  1. Externalized Storage — PostgreSQL and Redis
  2. Kubernetes Network Policy
  3. Kubernetes RBAC

Externalized Storage

Often teams use a managed service for their PostgreSQL and Redis instances to improve reliability and observability. These managed services may also improve your security with automatic key rotations, log correlation and backups. If you are deploying a significant cluster, it is often worth making the change.

postgresql:
  enabled: false

externalDatabase:
  type: postgres
  host: postgres.example.org
  port: 5432
  database: airflow_cluster1
  user: airflow_cluster1
  passwordSecret: "airflow-cluster1-postgres-password"
  passwordSecretKey: "postgresql-password"
  # use this for any extra connection-string settings, e.g. ?sslmode=disable
  properties: ""
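
Note that the database password itself never appears in the values file: passwordSecret and passwordSecretKey point at an existing Kubernetes Secret, so the credential stays out of version control and can be rotated independently of the chart.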

Kubernetes Network Policy

After deploying Airflow, we may want to limit who can access what inside our cluster. By default we deploy Airflow into its own namespace, and perhaps we have other apps living in adjacent namespaces. We can restrict network connectivity between the namespaces so that only pods in the airflow-labelled namespace can communicate with one another.

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: intra-namespace
  namespace: airflow
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: airflow

Kubernetes RBAC + Pod Identity

We may also want to change the underlying implementation of Airflow so it does not rely on a username and password for connecting to Redis and PostgreSQL. In that case we can use something like pod identity on AKS, or workload identity on GCP, to connect directly. This would require you to rebuild the images to leverage this environment-based connection, and to update the Helm charts with any additional specs, like the one listed below for GCP.

spec:
  serviceAccountName: Kubernetes_service-account
Beyond these three areas there are further hardening steps, such as:
  1. Enable OAuth
  2. Enable Logging
  3. Many more…

An end to end example

Principle 14: Understand what you are building towards and secure the whole system, not just the individual parts.

DataOps flow | MadeWithML
ML Model Training and Deployment Flow | MadeWithML
ML Model Update Flow | MadeWithML

Using other orchestrators

Today we focused mainly on Airflow, but all the principles we covered still apply to other orchestrators, such as cloud-native or ML-focused tools. Each technology varies slightly, but the concepts are similar. Leverage them, and your knowledge of the tools, to refine and improve what you’ve built.

Conclusion

You made it to the end, congrats! Hopefully you take the 14 principles you learned and apply them to your work; they’re neatly compiled here for easy access!

Next Time — Part 4: ML Model Security

We’ve secured our data, transformed it, and are ready to train our models, now what? That’s what Part 4 is for!
