14 Principles To Secure Your Data Pipelines
Part 3 of the 7 Layers of MLOps Security Guide
Welcome to Part 3 of our MLOps security guide! To recap, last time we discussed how to protect your data storage and went through a comprehensive list of storage solutions. Today we’ll dive deeper into understanding and protecting our data pipelines.
Series Links
Part 1. Intro and Data Security
Part 2. Protecting Your Data Storage
Part 3. Securing Your Orchestrator
Part 4. ML Model Security
Part 5. ML Model Hosting
Part 6. Securely Exposing ML Models to Users
Part 7. Logging and Monitoring MLOps Infra
What are data pipelines?
Data pipelines are at the heart of building a robust data practice. These pipelines can help clean your data, trigger ML model retraining, or notify an analyst when a certain metric is hit. Since pipelines are so ubiquitous, you need a way to manage them, which we’ll cover next.
What is an orchestrator?
Once we build out the infrastructure and our pipelines, how do we actually make the data flow?
For this we need an orchestrator, something that executes each of our tasks and returns their status. Orchestrators come in many forms, and just like in the real world, some are loud and in your face, others prefer to operate from the shadows like a puppeteer.
Depending on the use case and your infrastructure, you will have several options for orchestrators, with some listed below:
- Airflow
- Airbyte
- Apache Camel
- Azure Data Factory
- AWS Glue
- GCP Dataflow
- Kubeflow
The list is not exhaustive, but each orchestrator has a different use case, ranging from Airflow, which is very general, to Kubeflow, which specializes in ML workflows.
An example data pipeline
Before going further, let’s walk through a sample data pipeline that we’ll use as a running example and later secure.
In this flow we’re taking raw data from an S3 bucket, loading it into Redshift, creating a few aggregations and then emailing a business analyst when it’s ready.
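If you’re curious what that might look like in code, below is a minimal, hypothetical sketch of such a DAG. It assumes a recent Airflow 2.x with the Amazon provider package installed, SMTP configured, and placeholder bucket, table, connection and recipient names — it is an illustration, not the exact pipeline from the diagram.

from datetime import datetime

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="s3_to_redshift_report",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Load the raw files from S3 into a staging table in Redshift
    load_raw = S3ToRedshiftOperator(
        task_id="load_raw_data",
        s3_bucket="raw-data-bucket",            # placeholder bucket
        s3_key="events/{{ ds }}/",
        schema="staging",
        table="raw_events",
        copy_options=["FORMAT AS PARQUET"],
        redshift_conn_id="redshift_default",
        aws_conn_id="aws_default",
    )

    # Build the aggregate table the analyst will query
    aggregate = SQLExecuteQueryOperator(
        task_id="build_aggregates",
        conn_id="redshift_default",
        sql="sql/build_daily_aggregates.sql",   # placeholder SQL file
    )

    # Let the analyst know the data is ready
    notify = EmailOperator(
        task_id="notify_analyst",
        to="analyst@example.com",
        subject="Daily aggregates are ready",
        html_content="The {{ ds }} aggregates are ready in Redshift.",
    )

    load_raw >> aggregate >> notify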
Protecting a data pipeline — 7 steps and 14 principles
Now with some background on data pipelines we’ll go through different steps to secure them. For each of the steps below, I’ve included principles to keep in mind and apply to your own work.
- Defining user personas
- Defining their actions
- Understanding the platform
- Securing the platform
- Writing secure pipelines/jobs
- Granting Access
- Keeping the platform up to date
Who would be involved in this flow?
Principle 1: Understand who will use your platform
For each data pipeline and process, there will be different users who will be interested in the systems. Before building your platform, it is important to understand who will use it and how they will use it. In a large company these could include:
- Platform Engineers
- Data Engineers
- Data Analysts
- Operations Engineers
- Security Engineers
Each persona will have a different reason to check and work on an orchestrator. So let’s go through them briefly.
Platform Engineers: responsible for building out the cloud infrastructure and deploying the three services listed in our example above.
Data Engineers: write the Airflow jobs and queries to prepare the data.
Data Analysts: find insights in the data leveraging our platform.
Operations Engineers: respond to any outages and work with platform and data engineers to make sure the platform is performant and resilient.
Security Engineers: ensure the platform is well configured and audit any possible issues with the system.
In a startup you may only have two types of users, those creating the platform and those consuming it. In that case, it’s still important to have a set of use cases for each persona.
Principle 2: Everyone has a responsibility to keep the platform secure
It’s also important to remember that security is everyone’s responsibility. Each team member should be vigilant and also offer their domain knowledge to improve security. A data analyst may know when large queries are run, so they can inform the operations team if things are unusually slow when nothing should be running. Likewise, a data engineer should understand how all the data flows and what might cause a leak or unusual usage.
We covered the who of our orchestration platform, now let’s cover the what and how.
What actions does each user take?
Principle 3: Establish a baseline of actions to set boundaries for each user
As we saw, each user in our example cares about their own domain and will take certain actions to accomplish their goals. A data analyst cares that the BI tool works and that they have access to the data they need.
Sometimes there are conflicting goals; a data analyst may want access to all possible data, while a security engineer wants to reduce the data exposure to follow least privilege.
Below is a sample table of actions each user would like to take.
These actions may vary from company to company, but it’s important that you gather these requirements in a structured way.
Principle 4: Provide an easy, but secure pathway for doing unusual actions
For every 100 normal operations, there will be exceptions that need to be made. It might be restarting a job, viewing a different data schema or deploying a fix. Even though these actions fall outside everyday activity, we need to make it easy and simple for a user to do them.
Why? Because the alternative is much worse.
If it is challenging to complete a rare but necessary task, users will find workarounds to get their jobs done and will cover their tracks. That alternative world limits our visibility into the platform and encourages users to perform potentially insecure actions.
As an example, let’s say a data engineer needs to do a quick deployment to prod to fix a bug they introduced. If there is an easy way to make this change, get it approved, and not be criticized for it, the data engineer will make the change with full visibility.
On the other hand, if the fix will take a week and will cause the data engineer to be shamed by management, they may instead message their operations engineer friend, have them disable prod logging, open up a screen share, and deploy with a manual code update. Now you have developers sneaking around, disabling systems and losing key logging and lineage to prod changes!
I went into developing a healthy culture in more detail in the Part 1: Intro and Data Security article; if this is a topic of interest, give it a read!
Understanding the platform — Airflow
Principle 5: Understand the tools you are working with.
With our user goals and actions defined, let’s get into protecting the data platform. In previous articles we already discussed protecting the data and the data storage, so we can focus on our example orchestrator, Airflow.
Airflow is a fairly comprehensive platform with a collection of components. At its core, Airflow lets you build and run workflows: a workflow is represented as a DAG (a Directed Acyclic Graph) and contains individual pieces of work called Tasks, arranged with their dependencies and data flows taken into account. A typical deployment consists of:
- A webserver, which presents a handy user interface to inspect, trigger and debug the behaviour of DAGs and tasks.
- A scheduler, which handles both triggering scheduled workflows, and submitting Tasks to the executor to run.
- An executor, which handles running tasks. In the default Airflow installation, this runs everything inside the scheduler, but most production-suitable executors actually push task execution out to workers.
- A folder of DAG files, read by the scheduler and executor (and any workers the executor has).
- A metadata database, used by the scheduler, executor and webserver to store state.
While it may be easy to gloss over all these components, debugging them and securing them is very difficult without knowing what they do, and what they can do if they are compromised.
After diving deeper into each part of Airflow, let’s map resource access to our user personas.
This might have seemed like a silly exercise with all the debugs and reads, but it’s important to understand why someone needs access to each component. You may also have different philosophies on platform ownership. I wrote that all technical users need access to the platform for debugging purposes, but your organization may have a different delineation of responsibilities. There are many ways to define a platform, you just need to be able to justify it.
Now that we’ve defined which personas need access to which component, let’s make their access secure, first with a managed version of Airflow and then a deploy-your-own version.
Protecting a managed version of Airflow
Principle 6: Use trusted managed services when possible, embrace the shared responsibility model.
As you’ve seen in the previous section, Airflow is a multipart deployment that requires you to understand several moving parts. Many teams want to focus on creating better Airflow jobs rather than deploying and managing Airflow, so they choose a managed version!
We use managed resources all the time in the cloud, so using managed Airflow is no different. We deploy a service, it scales elastically and integrates with other cloud resources. AWS and GCP currently offer well-managed versions of Airflow, and for today’s example, we’ll be using the AWS version (MWAA).
In the next few sections we will cover how to secure our MWAA instance. Many of these concepts have been covered in previous articles, so I’ll focus on how they relate to Airflow.
IAM and RBAC
Principle 7: Set up effective user permissions and remember least privilege!
When using a managed version of Airflow, you get the benefit of integrating with the cloud native IAM system. This allows you to assign roles to your personas in the AWS console and access the resources with those roles. To get us started, let’s give our users access to log into Airflow and view the UI.
For this, we will need to grant users the permission AmazonMWAAWebServerAccess as described with the following JSON.
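The managed policy looks roughly like the following (trimmed here; check the AWS documentation for the authoritative, current version), with the region, account id and environment name as placeholders you fill in:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "airflow:CreateWebLoginToken",
      "Resource": [
        "arn:aws:airflow:{region}:{account-id}:role/{environment-name}/{airflow-role}"
      ]
    }
  ]
}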
Under the {airflow-role} you would specify one of the Airflow roles, such as Admin, Op, User or Viewer. For someone like a data engineer, we could assign an Op role in non-prod and a Viewer role in prod, since we want only automation accounts to have write access in prod. Below is a more detailed table of possible role combinations.
Apart from Airflow access, the platform engineering and ops teams will need access to the underlying cloud environment to make sure it is stable, and well integrated with other systems. Two such roles are AmazonMWAAReadOnlyAccess and AmazonMWAAFullConsoleAccess. These provide access to underlying cloud resources that Airflow MWAA uses, with some highlighted in the JSON below.
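As a flavour of what these grant, the read-only policy covers roughly the Airflow list/describe actions sketched below (an approximation, not the full managed policy), while the full console policy additionally covers resources like the EC2 networking, S3, KMS and IAM pieces that MWAA depends on:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "airflow:ListEnvironments",
        "airflow:GetEnvironment",
        "airflow:ListTagsForResource"
      ],
      "Resource": "*"
    }
  ]
}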
These will allow the team to debug any issues that occur by understanding the underlying architecture.
Principle 8: Don’t manage credentials when you don’t have to
We also want to configure SSO when possible so that users don’t need to remember extra passwords. They may store them insecurely, or reuse an existing password, and neither is a good option!
Creating Jobs
Once you have access to the platform it’s time to create your jobs. Every job you create will be slightly different, and each team will have different use cases, but that’s the power of Airflow: there’s plenty of flexibility! Below we’ll cover some important principles to keep in mind when building your DAGs.
Principle 9: Version and trace code before deploying it in prod
As we discussed in a previous anecdote, you want everything deployed to production to be transparent, tested and simple. This means we:
- Avoid deploying premature code
- Automate testing
- Deploy to prod after tests pass
- Roll forward on bugs
- Deploy code only from a trusted location
- Have good visibility on what we have running
- Avoid mega-complicated jobs that touch everything, when possible
All these things help with security, since many exploits are based on small issues that go unnoticed. We also want to avoid drowning in complexity, as it will make issues harder to debug. A minimal automated check along these lines is sketched below.
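For example, a small pytest sketch (assuming Airflow is importable in your CI environment and your DAGs live on the default path) that blocks broken or untraceable DAGs from being promoted:

import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag():
    # Parse every DAG file once, without loading Airflow's bundled examples
    return DagBag(include_examples=False)


def test_dags_parse_cleanly(dag_bag):
    # Broken imports should never reach production
    assert dag_bag.import_errors == {}


def test_dags_have_owners(dag_bag):
    # Every job should be traceable to a team
    for dag_id, dag in dag_bag.dags.items():
        assert dag.owner, f"{dag_id} has no owner set"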
Principle 10: Keep track of your logs, and trigger automatic events based on them
Mature data pipelines and orchestrators will have good logging integrations built in, allowing you to automatically write logs to storage or send them further downstream. As we’ve seen with the managed solutions, they integrate nicely with CloudWatch and Composer’s logging.
These logs can be fed into applications to detect potential security events and notify teams if there are issues!
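As an illustration, here is a toy Python sketch, not a production detector: it scans a batch of exported log lines for suspicious patterns and posts matches to a hypothetical team webhook (the URL, file name and patterns are placeholders).

import json
import re
import urllib.request

SUSPICIOUS_PATTERNS = [
    re.compile(r"Failed to authenticate"),         # repeated login failures
    re.compile(r"PermissionDenied|AccessDenied"),  # probing for access
]

ALERT_WEBHOOK = "https://hooks.example.com/security-alerts"  # placeholder URL


def scan_logs(lines):
    """Return the log lines that match any suspicious pattern."""
    return [line for line in lines if any(p.search(line) for p in SUSPICIOUS_PATTERNS)]


def notify(matches):
    if not matches:
        return
    body = json.dumps({"text": f"{len(matches)} suspicious log lines", "samples": matches[:5]})
    req = urllib.request.Request(
        ALERT_WEBHOOK, data=body.encode(), headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)


if __name__ == "__main__":
    with open("airflow-task.log") as f:  # placeholder export from your log store
        notify(scan_logs(f.readlines()))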
Principle 11: Avoid logging things that are sensitive, be careful with error handling.
This may seem obvious, but when things go wrong, some pretty sensitive information can end up being leaked.
Some ways to minimize the risks:
- Set up a standard logging framework that will be used by many teams. It should be someone’s responsibility to maintain it, test it and improve it.
- Use types that censor sensitive data, like SecretStr in Python
- Strike a balance between throwing descriptive errors and revealing key pieces of information about your app. The name of an S3 bucket might be appropriate to log, but the ARN contains more compromising info
- Treat log files as potentially containing sensitive information, and scan and sanitize them regularly
Below is a code snippet that avoids logging sensitive user info by only including the account’s id in its string representation.
class UserAccount {
  id: string;
  username: string;
  passwordHash: string;
  firstName: string;
  lastName: string;
  // ...

  public toString() {
    // Only expose the id, never the credential or PII fields
    return `UserAccount(${this.id})`;
  }
}
This snippet is from the article linked below which has more details and code examples.
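The same idea is available in Python; here is a minimal sketch using pydantic’s SecretStr (assuming pydantic is installed, with placeholder config values), which masks the value in logs and reprs unless you explicitly unwrap it:

from pydantic import BaseModel, SecretStr


class RedshiftConfig(BaseModel):
    host: str
    user: str
    password: SecretStr


cfg = RedshiftConfig(host="redshift.internal", user="etl", password="hunter2")

print(cfg)                                 # password is rendered as SecretStr('**********')
print(f"connecting as {cfg.user}")         # safe to log
secret = cfg.password.get_secret_value()   # only unwrap at the point of use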
Network Access
Principle 12: Think about your network topology, but never rely on it
If you’re working in a large enterprise you may have your own VPN and/or data center setup. As such, your VPC may only be accessible through your corporate network. In this case it makes sense to block external access to your data pipelines. However, don’t rely on network access as your perimeter; remember other good identity practices:
- Have good IAM policies and roles setup
- Configure MFA
- Enable privileged role management
Segmenting Environments
Principle 13: Create development environments for trying new things, and a staging (non prod) environment that mirrors prod
Similar to any good software development practice, you want to give developers freedom in a test environment, have a stable mirror environment and an end-user prod environment. It also forces you to double-check your IAM permissions to make sure they line up with your environment expectations.
It’s especially important since orchestrators may touch everything in your environment, because that’s their job! Make sure you have good checks and balances or you might accidentally start an intergalactic war.
Protecting a deployment — Airflow
Now, if you don’t want to use a managed version of Airflow, the security work gets more involved. We’ve already seen the architecture of AWS MWAA; below is the architecture of GCP Composer, GCP’s managed Airflow.
There are many moving parts involved, with potential security hazards along the way. If we want to deploy our own version we can use the Airflow Helm chart. The Helm chart has many nice configurations built in, but if you are building a comprehensive Airflow offering, you may want to augment some parts to improve security.
- Externalized Storage — Postgresql and Redis
- Kubernetes Network Policy
- Kubernetes RBAC
Externalized Storage
Often teams utilize a managed service for their Postgresql and Redis instances to improve reliability and observability. These managed services may also improve your security with automatic key rotations, log correlations and backups. If you are deploying a significant cluster it is often worth making the change.
Below is the configuration change you would need to make within the Helm chart.
postgresql:
  enabled: false

externalDatabase:
  type: postgres
  host: postgres.example.org
  port: 5432
  database: airflow_cluster1
  user: airflow_cluster1
  passwordSecret: "airflow-cluster1-postgres-password"
  passwordSecretKey: "postgresql-password"

  # use this for any extra connection-string settings, e.g. ?sslmode=disable
  properties: ""
Kubernetes Network Policy
After deploying Airflow, we may want to limit who can access what inside our cluster. By default we deploy Airflow into its own namespace, and perhaps we have other apps living in adjacent namespaces. We can restrict the network connectivity between the namespaces to only allow pods from namespaces carrying the airflow label to communicate with one another.
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: intra-namespace
  namespace: airflow
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: airflow
We can then add additional constraints based on who needs to access the UI, webserver, etc.
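For example, here is a sketch of a policy that only admits traffic to the webserver pods on port 8080 from namespaces carrying a hypothetical trusted label. The component: webserver label matches what the official chart applies, but verify the labels against your own deployment:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: webserver-ingress
  namespace: airflow
spec:
  podSelector:
    matchLabels:
      component: webserver        # label used by the official chart; verify yours
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              team: data-platform  # hypothetical label on trusted namespaces
      ports:
        - protocol: TCP
          port: 8080               # default Airflow webserver port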
Kubernetes RBAC + Pod Identity
We may also want to change the underlying implementation of Airflow so it does not rely on a username and password for connecting to Redis and PostgreSQL. In that case we can use something like pod identity on AKS, or workload identity on GCP, to connect directly. This would require you to rebuild the images to leverage this environment-based connection, and update the Helm charts with any additional specs, as listed below for GCP.
spec:
  serviceAccountName: kubernetes-service-account
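On GKE, the referenced Kubernetes service account also needs the Workload Identity annotation binding it to a GCP service account. A minimal sketch, where the account and project names are placeholders:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: kubernetes-service-account   # must match serviceAccountName in the pod spec above
  namespace: airflow
  annotations:
    # Workload Identity binding to a GCP service account (placeholder names)
    iam.gke.io/gcp-service-account: airflow-sa@my-project.iam.gserviceaccount.com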
There are also many additional steps you can take such as:
- Enable OAuth
- Enable Logging
- Many more…
Now you know why many people opt for managed solutions :)
An end to end example
Principle 14: Understand what you are building towards and secure the whole system, not just the individual parts.
Now that we’ve discussed the components of Airflow, let’s examine some end to end ML workflows, leveraging those designed by Made With ML.
We have three flows defined: a data ops flow, a model training flow, and a model update flow. In the data ops flow, we access a data warehouse, create features in Spark/pandas, and write them to a feature store.
In this case, we need to store credentials for each of those services and make sure that they are properly configured.
In the model training flow, we train and optimize the model, saving and storing it if it meets our performance requirements. Similar to the previous example we have at least three services to read from or write to, plus the added complexity of logging failures to a notification service.
Lastly, we may want to monitor and improve our model, so we set up a retraining process and update our models in storage. This requires more data integrations and is a more automated process, so it needs to be well observed, otherwise things can go wrong and no one will notice until it’s really bad.
Each of these examples should illustrate the bigger picture view of what you are building, and hopefully allow you to apply what you’ve learned so far to each component!
Using other orchestrators
Today we focused mainly on Airflow, but all the principles we covered still apply to other orchestrators, such as cloud-native tools or ML-focused tools. Each technology will vary slightly, but the concepts are similar. Leverage them and your knowledge of the tools to refine and improve what you’ve built.
Conclusion
You made it to the end, congrats! Hopefully you take the 14 principles you learned and apply them to your work, they’re neatly compiled here for easy access!
Principle 1: Understand who will use your platform
Principle 2: Everyone has a responsibility to keep the platform secure
Principle 3: Establish a baseline of actions to set boundaries for each user
Principle 4: Provide an easy, but secure pathway for doing unusual actions
Principle 5: Understand the tools you are working with.
Principle 6: Use trusted managed services when possible, embrace the shared responsibility model
Principle 7: Set up effective user permissions and remember least privilege!
Principle 8: Don’t manage credentials when you don’t have to
Principle 9: Version and trace code before deploying it in prod
Principle 10: Keep track of your logs, and trigger automatic events based on them
Principle 11: Avoid logging things that are sensitive, be careful with error handling.
Principle 12: Think about your network topology, but never rely on it
Principle 13: Create development environments for trying new things, and a staging (non prod) environment that mirrors prod
Principle 14: Understand what you are building towards and secure the whole system, not just the individual parts.
Next Time — Part 4: ML Model Security
We’ve secured our data, transformed it, and are ready to train our models, now what? That’s what Part 4 is for!
In Part 4 we will discuss what to do now that you are ready to train a model, and also how to deploy it. We will focus on model security during training and inference time, followed by model hosting in Part 5.