7 Layers of MLOps Security
Part 1, Securing your data
So you’ve heard of MLOps and you’ve deployed a model to production! Awesome, but now that you’ve hit your milestone, you should probably double check to make sure that everything is nice and secure.
Security is a never-ending topic, but implementing the foundational aspects will help you sleep better at night. This article gives you a small tasting platter of the security world, plus some tips and frameworks for assessing potential solutions. It should also help spark discussions in your organization about topics like data access, privacy, human intervention and many more.
Today’s article is part one of seven, where we’ll discuss seven layers of securing your ML pipelines. Let’s get started!
Part 1. Intro and Data Security
Part 2. Protecting Your Data Storage
Part 3. Securing Your Orchestrator
Part 4. ML Model Security
Part 5. ML Model Hosting
Part 6. Securely Exposing ML Models to Users
Part 7. Logging and Monitoring MLOps Infra
We start with the reason we need an ML Platform — to analyze the data! The data itself can be protected and organized in ways to limit security risks, a couple of which we’ll go through.
1. Data Access
First and foremost, you should ask yourself: “who needs access to this data?” This follows the security principle of least privilege and helps us plan who will need access to the data. Most data stores support some form of discretionary access control (DAC — permissions assigned directly to users) or role-based access control (RBAC — permissions granted to the roles users hold).
For example, does everyone need to be able to read the data in the first place? Probably not.
There are different ways of configuring access management, but functionally, different people or roles should be able to do different things. A customer service rep has a different use case than a data scientist, a marketer, or a product developer.
TIP 1: Creating Roles. For each employee type, create short access descriptions and use this to construct your access model. For new use cases, review this model to see if it still makes sense.
You should also limit who has write access to avoid unintended data changes. After you define your data access model, you’ll need to segment your data.
Privileged Identity/Access Management
Often you will need to verify manually that certain processes are working on live data or debug issues. As such, having a way to assume a more powerful role helps keep up a concept of least privilege while also allowing people to do their jobs. Below is an example with Azure PIM.
In the example above, we want to make sure that anyone accessing prod data has admin approval and a limited time window for their action. Access should also be logged and monitored for suspicious activity, as prod data is quite sensitive. Also note that we are not granting full admin permissions for this role; we scope it to a specific scenario where a data quality check may fail.
Mapping Company Roles to Functional Roles
Now that we know who needs access to data in our company, we can create roles in our database.
Ideally these roles map 1 to 1 to our titles: Analysts have Analyst roles, Developers have Developer roles. With these roles we can grant specific access to code, tables, infrastructure and applications. These will also vary across environments, as demonstrated by the table below.
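As a minimal sketch of this idea (the role names, environments and permissions below are illustrative, not taken from the table), such a mapping can be kept as configuration in code:

```python
# Illustrative role-to-permission map, kept as configuration in code.
ACCESS_MODEL = {
    "analyst":   {"dev": {"read"},          "prod": {"read"}},
    "developer": {"dev": {"read", "write"}, "prod": set()},
    "admin":     {"dev": {"read", "write"}, "prod": {"read", "write"}},
}

def can(role: str, env: str, action: str) -> bool:
    """Check whether a role may perform an action in an environment."""
    return action in ACCESS_MODEL.get(role, {}).get(env, set())
```

Keeping the model in code (or Terraform, SQL grants, etc.) makes it reviewable and auditable, which helps when you revisit it for new use cases as TIP 1 suggests.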
2. Data Segmentation
When we store data, it’s important to think about how it’s grouped together. Typically we think about this as a normalized vs. denormalized question: how do we want to optimize reading, writing and data storage overall?
We should also think about organizing our data so it’s clean and secure. As an analogy, your house would be very confusing if you found socks in your safe, and financial statements in your sock drawer. It may seem fine when you are living alone, but once you bring more people to your household, it can become very confusing and chaotic.
Companies can become even more chaotic and eventually their data could turn from a data lake into a data swamp, with data poorly labeled, organized and from a security perspective, a nightmare. So how can we categorize our data?
- Sensitivity: Different data stores for public/internal/sensitive data
- Customer: Many companies have many customers, so segmenting data stores for each customer keeps things organized
- Analytical/Transactional/Object Store: Data has different storage requirements and purposes. Keep them separate and optimize for the best practices
- Line of Business/Use case: Different teams will work with data and depending on your client agreements there may need to be isolation
- Test/Prod: This one is fairly obvious, but testing systems with mock data when possible limits mistakes and leaks, especially when you are developing new systems. Having good mock data or anonymized data makes it much easier.
- Data Residency: Certain jurisdictions require data to be stored within that region for compliance reasons (ex: Canada, Germany, Russia). These may also carry additional processes and security requirements, like demonstrating that data has been deleted (ex: GDPR) or how holders of compromised personal information should act. These vary from region to region, so security, compliance and legal teams need to be aware. As such, you probably want to separate your data by jurisdiction.
Once you map out these divisions, you can think about which data systems should contain which data.
- Buckets (Object storage for S3 and other formats)
- Databases (SQL, NoSQL, Graph)
- Tables (Within databases)
- Topics (For streaming applications)
Each use case will benefit from a different data setup and segmentation, but keeping a few principles in mind will help you get set up.
- Least privilege: limit access to each segment to those who need it
- Data classification: group data of the same sensitivity together, and keep sensitive and non-sensitive data apart
- Avoid sensitive values (like SSNs) as primary keys and join keys
- Use strong, consistent naming conventions
- Keep configuration as code and automate the creation of your segments
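The last two principles can reinforce each other: a naming convention enforced in code. Below is a sketch assuming a hypothetical `<org>-<env>-<line of business>-<sensitivity>-<region>` pattern for bucket names:

```python
import re

# Hypothetical convention: <org>-<env>-<lob>-<sensitivity>-<region>
BUCKET_PATTERN = re.compile(
    r"^[a-z0-9]+-(dev|test|prod)-[a-z0-9]+-(public|internal|sensitive)-[a-z0-9-]+$"
)

def bucket_name(org: str, env: str, lob: str, sensitivity: str, region: str) -> str:
    """Build a segment (bucket) name and reject anything off-convention."""
    name = f"{org}-{env}-{lob}-{sensitivity}-{region}".lower()
    if not BUCKET_PATTERN.match(name):
        raise ValueError(f"bucket name violates convention: {name}")
    return name
```

Wiring this into your infrastructure-as-code means a misclassified or misnamed segment fails at creation time rather than being discovered in an audit.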
TIP 2: Dealing with bad data. When building out your ML Platform, you will run into bad data. Budget time and money for it.
3. Metadata and Schemas
Metadata is extremely important for a well-governed data platform. In the context of security and MLOps, knowing what data lives where, how it’s accessed and how it’s protected will help your platform run seamlessly and help you sleep better at night.
Data schemas are important for validating the quality of your data and ensuring it won’t cause issues downstream. Schemas can be documented in SQL, in a programming-language model, or in a JSON-like format. Even “schema-less” storage follows a schema in practice, so define what you expect explicitly.
Example — JSON From BigQuery (field names illustrative)
[
  {"name": "sales_rep", "type": "STRING", "mode": "NULLABLE", "description": "sales representative"},
  {"name": "total_sales", "type": "NUMERIC", "mode": "NULLABLE", "description": "total sales"}
]
Example — Python — With Marshmallow

from datetime import date
from marshmallow import Schema, fields

class ArtistSchema(Schema):
    name = fields.Str()

class AlbumSchema(Schema):
    title = fields.Str()
    release_date = fields.Date()
    artist = fields.Nested(ArtistSchema())

album = dict(artist=dict(name="David Bowie"), title="Hunky Dory", release_date=date(1971, 12, 17))
print(AlbumSchema().dump(album))
TIP 3: Zero Trust. Every time data moves between environments, sample/validate data to make sure that it’s what you are expecting. Your services should also be authenticated/authorized against each other.
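As a rough sketch of the sampling half of TIP 3 (the record shape and field names are made up for illustration), a promotion step might spot-check records before data crosses environments:

```python
import random

# Illustrative expected shape of a review record.
EXPECTED = {"review_id": str, "user_id": str, "rating": int}

def validate_sample(records, sample_size=3, seed=0):
    """Spot-check a random sample of records against the expected schema."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    for rec in sample:
        if set(rec) != set(EXPECTED):
            raise ValueError(f"unexpected fields: {sorted(rec)}")
        for field, ftype in EXPECTED.items():
            if not isinstance(rec[field], ftype):
                raise TypeError(f"{field} should be {ftype.__name__}")
    return True
```

In a real pipeline you would use a proper schema library (like the Marshmallow example above) and validate at every environment boundary, not just one.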
Every time data is aggregated, it should be treated as a new dataset. This helps you keep track of where it came from and how it was acquired, and lets you audit who has accessed the data, where it originated, and how changing it may affect downstream processes.
4. Data Protection (masking, tokenizing, encrypting, hashing)
In general, we should apply as much protection to our data as possible, but protection often creates overhead when consuming the data. Since most data needs to be analyzed, we need to think about the trade-offs between protecting it and keeping it usable.
As such, there are many ways to protect the data, with three common techniques depicted below.
Encryption: Encrypting each data point to a different value, with no guarantee that two identical values share the same encrypted form. Example: AES-256 encrypting a file.
Hashing/Tokenizing: Mapping the data to another value that is unique but obfuscated. Example: SHA-256 hashing an email address.
Masking: Changing the data on read so that it preserves some characteristics but does not reveal the full value. Example: dynamic masking in Snowflake.
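To make the first two techniques concrete, here is a small standard-library sketch of hashing/tokenizing and masking (real encryption would use a vetted library such as `cryptography`, so it is omitted here; the pepper value is a placeholder):

```python
import hashlib

def hash_token(value: str, pepper: str = "app-secret") -> str:
    """Deterministic, obfuscated token: the same input always yields the same token."""
    return hashlib.sha256((pepper + value).encode()).hexdigest()[:16]

def mask_email(email: str) -> str:
    """Masking on read: keep the shape of the value, hide most of it."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

# mask_email("jane.doe@example.com") -> "j***@example.com"
```

Note the trade-off the article describes: the token is stable (useful for deduplication) but reveals nothing, while the mask reveals a little structure (useful for analysts eyeballing data) but is not reversible or joinable.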
How do we choose which type of protection to use?
For this article I’ve listed out 6 parameters that help make decisions on what kind of protection to use. While there is no easy rule for the decisions, these factors can make decisions more systematic.
Note: Boundary ratings give you some extremes on how to score/assess each category.
- Data Sensitivity
- Data Usage
- Data Size
- Data Access Model
- Downstream processes
- Joining requirements
Each use case you have can then be used to determine which protection mechanism to use, with three examples given below.
TIP 4: Business Sensitivity. Note that internal company data may be very sensitive. A set of images of cars may not seem valuable, but images of cars that are labelled and core to your revenue generation could mean life or death for your company.
TIP 5: Avoid concentration risk. Similar to least privilege, we want to avoid co-locating data that can be combined into more sensitive datasets. Doing so can turn moderately sensitive data (first name, last name, email) into a dataset ripe for identity theft (first name, last name, email, date of birth, address, car model).
TIP 6: Don’t join on sensitive data. A requirement to join or correlate data on a sensitive field limits your protection options. It also increases the number of accounts that need access to the data and the number of reads, all leading to higher exposure.
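If a join on a sensitive field truly cannot be avoided, one common mitigation is a deterministic keyed token: tables join on an HMAC of the value, so the raw identifier never reaches the analytical store. A sketch, with a hypothetical secret that in practice would live in a key store:

```python
import hmac
import hashlib

JOIN_KEY_SECRET = b"rotate-me"  # placeholder; keep the real secret in a key store

def surrogate_key(ssn: str) -> str:
    """Deterministic keyed token usable as a join key instead of the raw SSN."""
    return hmac.new(JOIN_KEY_SECRET, ssn.encode(), hashlib.sha256).hexdigest()
```

Unlike a plain hash, the secret key means an attacker who steals the analytical store cannot brute-force short identifiers like SSNs back out of the tokens without also compromising the key.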
5. Data loss prevention (DLP)
DLP is a more general term for preventing data from leaving your company’s environment. Many of the techniques we have mentioned help limit data leaving your environment, but may not stop other forms of data leakage (email, USB transfer, etc.).
Overall, your data access and exchange points need their logs and data aggregated and ready for consumption. Using these logs, people or models can decide whether the actions being executed pose security risks.
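As an illustrative sketch of consuming such logs (the log format and threshold are assumptions, and real DLP tooling is far more sophisticated), even a simple heuristic can surface accounts worth reviewing:

```python
from collections import Counter

def flag_heavy_readers(access_log, threshold=1000):
    """Flag accounts whose total rows read exceed a threshold (toy heuristic)."""
    totals = Counter()
    for entry in access_log:
        totals[entry["user"]] += entry["rows_read"]
    return sorted(user for user, total in totals.items() if total > threshold)
```

In practice you would correlate across sources (database reads, object-store downloads, email attachments) and feed the aggregate into a SaaS DLP product or anomaly model rather than a fixed threshold.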
Attack Vector — Data Poisoning
If someone manages to inject malicious data into your dataset, it can muddy your downstream analytics workloads. But in an ML context, the data might be used further down the line to allow an action or result that should otherwise have been rejected.
Non-ML Example: Allowlist of licence plates for garage entry — add your licence plate to it
ML Example: Car Model allowlist — add some pictures of your car to the training set
Attack Vector — Infrastructure Compromise
If your application gets compromised at the infrastructure level, your exposure depends on which data and corresponding keys it has access to.
If your app has access to all the data and the keys to decrypt it, any protection you have matters much less. If an attacker can get access to a scripting environment (python, sh), they can fetch these secrets, connect to your data and key stores, and pipe out your data unless you have strong monitoring and network restrictions.
Even if your data is encrypted with the best standards, if your services hold the keys to encrypt and decrypt it, you’re not completely safe; depending on the attack, encryption may make a compromise only marginally harder. With this in mind, you shouldn’t forget about the other forms of protection: the principle of defence in depth.
Attack Vector — Employee Data Leak
When people access your data, there can be leaks. Whether it’s a picture taken, an email forwarded or an infrastructure access policy changed, malicious employees can leak data through every point of access they have.
Similar to the previous attack vector, all your protections may be moot if the employee has unmonitored and unfettered access to data and keys, and can slowly copy it out or replace it.
Attack Vector — SQL Injection
An attack vector we should all be familiar with is the mighty SQL injection. It generalizes to any scenario where users reach raw data stores by exploiting poorly sanitized or poorly secured inputs. This includes classic SQL injection, security token hijacking and poorly written CLI/API commands.
Whenever we have applications that touch data, we need to make sure that our input is validated, sanitized and monitored so that we can prevent and track unusual behaviour.
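A minimal sketch of that principle using Python’s built-in sqlite3 (the table and data are made up): parameterized queries keep user input as data rather than executable SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin'), ('bob', 'analyst')")

def find_user(name: str):
    # Parameterized query: the driver treats `name` strictly as a value,
    # so it can never change the shape of the SQL statement.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()

# A classic injection payload is just an unmatched literal name:
# find_user("' OR '1'='1")  ->  []
```

Had `find_user` built the query with string formatting instead, that same payload would have returned every row in the table.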
End to End Example — Verified Reviews
Now that we’ve gone through five different data protection considerations, let’s go through a real-life example.
A Restaurant Review Application!
You’ve recently joined a company that is building a verified restaurant review platform. The premise of the app is that each user needs to confirm their identity before being allowed to post a review, so reviews can be more honest and transparent, while allowing restaurants to use this info to predict future customers.
How could we design it? Let’s take a look in the next section.
The Context — Applications Services
Below is a summary of services for the reviewers and the business users that we will assume we have developed. For each service we’ll map out the data model and how we’ll protect the data.
- Profile — Demographics, location, etc.
- Browse Restaurants
- Leave a Review
- Upload a Photo
- Popular Restaurants
- Profile — Location, description, photo
- Predictive Analytics:
  - Review NLP — Sentiment, loyalty, etc.
  - Image Analysis — Food pictures, restaurant profile
  - Visitors Model — Forecasting traffic and purchases, etc.
Applying the theory
So now that we’re familiar with the goal and some of the services, let’s figure out how to protect our data! We’ll go through each of the principles we outlined earlier.
- Data Access
- Data Protection
- Metadata
- Data Loss Prevention
Data Access — Roles at the Company
- Data Scientist/ML Engineer — Access to live data, other than sensitive data which isn’t used for analytics. Ability to create new aggregated data sets/views.
- Security Engineer — Access to infra and data protection services, including potential identity verification.
- Software Engineer — Access to infrastructure, mock data and prod data when supporting on call.
- Marketing/Sales Data Analysts — Access to dashboards and aggregated data, ability to create new views.
- Product Managers — Access to dashboards and aggregated data, ability to create new views.
- Project Managers/Executives — Access to dashboards.
Here we didn’t define specific read/write permissions for the data, but generally we will limit prod data access to people on call, or to data scientists looking at specific problems, avoiding sensitive data when possible.
Data Access Roles based on Company Roles
Mapping our permissions granularly for every role would be quite lengthy, so we’ve mapped it out in detail for the developer role and added some notes for the other roles.
Note 1: Live transactional data should be restricted except in troubleshooting scenarios. Access to sensitive data (passwords, document hashes) should be restricted and audited even further.
Note 2: Most roles will only need access to the analytical store for their operations. Exceptions may include customer support (real-time systems troubleshooting), software engineers (bug fixes) and security engineers (system assessments).
Note 3: You should also avoid reading from the transactional store for performance reasons. Intensive analytical workloads tax your prod DB, which can lead to a slower user experience and some very unhappy customers.
Note: For this example, we’ll assume you are using cloud best practices for data protection, such as customer managed keys, disk level encryption, etc., which we will cover in the next part of the series
Note: We assume analytical use cases use the same table schemas.
Note: We’re actually using three data stores, SQL database, Object store and our analytical store.
Data Roles Per Table
In this case, we’ll only apply additional access restrictions to the sensitive data.
Protecting Each Data Type
- User Passwords: hashed in memory and then stored with best practices (salt, pepper, etc.)
- Verification Documents: verified with a third-party API; the unique identifier is hashed and stored like a password
- Names and Emails: masked for any analytics use case (BI, models, etc.)
- Reviews: protected at the infra level with storage-level keys
- ML Models: protected at the infra level, behind an API gateway
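As a sketch of the password bullet (the iteration count is illustrative, not a tuning recommendation), salted password hashing with the standard library might look like:

```python
import hashlib
import hmac
import os

def hash_password(password: str, salt=None):
    """Salted, deliberately slow hash (PBKDF2). A pepper held in a secret
    store could additionally be mixed into the password before hashing."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(candidate, digest)
```

The random per-user salt means two users with the same password get different digests, defeating precomputed rainbow tables; dedicated schemes like bcrypt, scrypt or Argon2 are the usual production choices.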
To manage our metadata, we’ll assume we have a service that does this for us, such as Glue, Purview, Collibra or something custom-written. We just need to register our datasets with it.
For our practical use case, let’s go through an example of creating a new dataset and what metadata we would store for it.
Let’s say we are launching a campaign to help restaurants attract young professionals (aged 22–30) to their stores. The marketing analyst would like to understand what the current sentiment is, so they ask the data engineering team to make sure this data is available for the marketing analyst to do some queries on.
In this case we need data from 3 of our tables: 2 base tables (User Demographics, Reviews) and 1 derived table (Sentiment analysis on Reviews). So what does our schema look like?
Note: In this schema we keep userID and reviewID for future joins, but prevent the marketing analyst from accessing the user information data otherwise as indicated in the data access model.
Now the schema above doesn’t have any datatypes or other metadata, so let’s add that in and assume we’ll store it in JSON (field names and values are illustrative):

{
  "name": "marketing_review_sentiment",
  "description": "For the marketing team to use for review analytics",
  "lineage": ["Reviews", "User Demographics", "Review Sentiment"],
  "sensitivity": "internal",
  "registeredAt": "2023-01-01T00:00:00Z",
  "fields": [
    {"name": "reviewID", "type": "STRING", "description": "Which Review this was based off"},
    {"name": "userID", "type": "STRING", "description": "Which User the review and demographics info came from"},
    {"name": "sentiment", "type": "FLOAT", "description": "How positive or negative this review was"},
    {"name": "age", "type": "INTEGER", "description": "How old the reviewer is"},
    {"name": "gender", "type": "STRING", "description": "Gender of the reviewer"},
    {"name": "location", "type": "STRING", "description": "Location of the reviewer"}
  ]
}
We have information about lineage, data sensitivity, and registration time. We should also store when this schema was accessed, but we’ll leave that for a further part of the series.
For DLP, we’ll assume we are aggregating data access logs across environments (databases, object stores, email, USB drives, etc.) and correlating them with a SaaS product or a custom model.
Usually DLP is the hardest control to enforce and could be its own series, but we’re assuming you’re either just getting started or a small startup that’s not too concerned yet :)
We just walked through some key data protection principles for MLOps, learning the foundations to ask some good questions along the way. Hopefully you can take these ideas and apply them to your own use cases; everyone’s data platform will be a little different because everyone’s data is different!
You may feel that we didn’t go in depth on some aspects, like infrastructure, ML models, logging, application security, etc. And you’re right, but you’ll be able to read more about those later in this series!
Next Time — Part 2: Data Storage Security
If you were rich and had bars of gold you’d want to protect them. You might put them in a safe. But if that safe says “gold here, pls don’t steal” on it, while in Times Square, it will be hard to keep it in your possession. Similarly for data protection, we need to secure its storage and infrastructure to keep it safe, which is what Part 2 of this series is all about.