7 Layers of MLOps Security

Part 1: Securing Your Data

Series Links

Part 1. Intro and Data Security
Part 2. Protecting Your Data Storage
Part 3. Securing Your Orchestrator
Part 4. ML Model Security
Part 5. ML Model Hosting
Part 6. Securely Exposing ML Models to Users
Part 7. Logging and Monitoring MLOps Infra

Protecting Data

We start with the reason we need an ML Platform — to analyze the data! The data itself can be protected and organized in ways that limit security risks, and we’ll go through several of them.

1. Data Access

First and foremost, you should ask yourself: “who needs access to this data?” This follows the security principle of least privilege and helps us plan the roles and permissions we will need. Most data stores support some form of discretionary access control (DAC, where permissions are assigned directly to users) or role-based access control (RBAC, where permissions are granted to roles that users hold).

[Image: Azure PIM example across different roles: a VM role in engineering, a non-prod data reader and a prod database reader]

Mapping Company Roles to Functional Roles

Now that we know who needs access to data in our company, we can create roles in our database.
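A minimal sketch of what that mapping could look like as configuration. The role names, schema names and generated GRANT syntax below are illustrative assumptions; adjust them to your own database's RBAC dialect.

# Company roles mapped to the database grants they need, per environment.
ROLE_GRANTS = {
    "data_scientist": {
        "dev":  ["SELECT ON ALL TABLES IN SCHEMA analytics"],
        "prod": ["SELECT ON ALL TABLES IN SCHEMA analytics"],
    },
    "software_engineer": {
        "dev":  ["SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA app"],
        "prod": [],  # prod access granted temporarily only while on call
    },
}

def grant_statements(role: str, env: str) -> list:
    """Generate GRANT statements for a company role in one environment."""
    return [f"GRANT {grant} TO ROLE {role}_{env};" for grant in ROLE_GRANTS[role][env]]

print(grant_statements("data_scientist", "prod"))
# ['GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO ROLE data_scientist_prod;']

Keeping this mapping in version control also gives you an audit trail of who was granted what, and when.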

[Image: Sample developer permissions across dev, staging and prod]

2. Data Segmentation

When we store data it’s important to think about how it’s grouped together. Typically we approach this as a normalized vs. denormalized question, optimizing for reading, writing and storage overall. From a security standpoint, there are several other dimensions worth segmenting on:

  1. Sensitivity: Different data stores for public/internal/sensitive data
  2. Customer: Most companies serve multiple customers, so segmenting data stores per customer keeps things organized
  3. Analytical/Transactional/Object Store: Data has different storage requirements and purposes. Keep them separate and optimize each for its own best practices
  4. Line of Business/Use case: Different teams will work with the data, and depending on your client agreements there may need to be isolation between them
  5. Test/Prod: This one is fairly obvious, but testing systems with mock data when possible limits mistakes and leaks, especially when you are developing new systems. Having good mock or anonymized data makes this much easier.
  6. Data Residency: Certain jurisdictions require data to be stored within their borders for compliance reasons (ex: Canada, Germany, Russia). These may also carry additional processes and security requirements, like demonstrating that data has been deleted (ex: GDPR) or how holders of compromised personal information must respond. These vary from region to region, so security, compliance and legal teams need to be aware. As such, you probably want to separate your data by jurisdiction.
Segmentation can happen at several levels of your storage stack:

  • Buckets (object storage such as S3 and similar services)
  • Databases (SQL, NoSQL, Graph)
  • Tables (within databases)
  • Topics (for streaming applications)
[Image: Kafka topic example from CloudKarafka, showing data across three topics]
A few practices keep segmentation manageable:

  1. Least privilege: limit access to those who actually need each segment
  2. Data classification: classify data by sensitivity and avoid mixing sensitive and non-sensitive data in the same segment
  3. Avoid sensitive values (like SSNs) as primary keys and join keys
  4. Use strong, consistent naming conventions
  5. Keep configuration as code and automate the creation of your segments (see the sketch below)
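For item 5, here is a rough sketch of automating segment creation. The bucket naming convention and the boto3 calls are illustrative assumptions; in practice you would more likely express this in Terraform or another IaC tool.

import boto3

# Hypothetical naming convention: <company>-<environment>-<sensitivity>-<region>
ENVIRONMENTS = ["dev", "staging", "prod"]
SENSITIVITIES = ["public", "internal", "sensitive"]
REGION = "ca-central-1"  # e.g. a Canadian data-residency requirement

s3 = boto3.client("s3", region_name=REGION)

for env in ENVIRONMENTS:
    for sensitivity in SENSITIVITIES:
        bucket = f"acme-{env}-{sensitivity}-{REGION}"
        # One bucket per environment/sensitivity pair, pinned to the region.
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": REGION},
        )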

3. Metadata

Metadata is extremely important for a well-governed data platform. In the context of security and MLOps, knowing what data lives where, how it’s accessed and how it’s protected will help your platform run seamlessly and help you sleep better at night.

Data Schemas

Data schemas are important for validating the quality of your data and ensuring it won’t cause issues downstream. Schemas can be documented in SQL (DDL), as a model in your programming language (for example an ORM or serialization class), or in a JSON-like definition. Even “schema-less” storage follows a schema in practice, so define the structure you expect, including any nested fields. Below is a JSON table schema followed by a Python marshmallow schema as examples.

[
  {
    "description": "quarter",
    "mode": "REQUIRED",
    "name": "qtr",
    "type": "STRING"
  },
  {
    "description": "sales representative",
    "mode": "NULLABLE",
    "name": "rep",
    "type": "STRING"
  },
  {
    "description": "total sales",
    "mode": "NULLABLE",
    "name": "sales",
    "type": "FLOAT"
  }
]
from marshmallow import Schema, fields


# Schemas describing the records we expect to store or serialize.
class ArtistSchema(Schema):
    name = fields.Str()


class AlbumSchema(Schema):
    title = fields.Str()
    release_date = fields.Date()
    artist = fields.Nested(ArtistSchema())
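A schema like this can gate writes into your data store. Here is a small sketch using the AlbumSchema above; the record values are made up.

from marshmallow import ValidationError

schema = AlbumSchema()

# A well-formed record deserializes cleanly.
album = schema.load({
    "title": "Hunky Dory",
    "release_date": "1971-12-17",
    "artist": {"name": "David Bowie"},
})

# A malformed record is rejected before it ever reaches storage.
try:
    schema.load({"title": "Unknown", "release_date": "not-a-date"})
except ValidationError as err:
    print(err.messages)  # {'release_date': ['Not a valid date.']}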

Data Lineage

Every time data is aggregated, it should be treated as a new item, so you can keep track of where it came from and how it was acquired. This lets you audit who has accessed the data, where it originated, and how changing it may affect downstream processes.

[Image: Keeping track of data lineage across connected database schemas. Image credit: Octopai]
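As a minimal sketch (the field and dataset names here are hypothetical, not taken from any particular catalog tool), you could record a lineage entry every time an aggregation job writes a new dataset:

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    """Minimal lineage entry for a derived dataset (hypothetical fields)."""
    dataset: str
    sources: list            # upstream datasets this one was derived from
    transformation: str      # job or query that produced it
    produced_by: str         # service account or user
    produced_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Registered every time an aggregation job writes a new table or view.
lineage = LineageRecord(
    dataset="analytics.review_sentiment_daily",
    sources=["transactional.Reviews", "mlmodel.Sentiment"],
    transformation="jobs/aggregate_review_sentiment.py",
    produced_by="ServiceAccount1",
)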

4. Data Protection (masking, tokenizing, encrypting, hashing)

In general we should apply as much data protection as possible, but protection adds overhead when consuming the data. Since most data eventually needs to be analyzed, we have to weigh the trade-off between protecting it and keeping it usable.

[Image: Different forms of data protection: a lock for encryption, a token for tokenization and stars for masking | icons from theNounProject]
[Image: Masking policies in Snowflake: an unauthorized user sees masked data while an authorized user sees raw data | from the Snowflake website]
When choosing which protection technique to apply, consider:

  1. Data sensitivity
  2. Data usage
  3. Data size
  4. Data access model
  5. Downstream processes
  6. Joining requirements
[Image: Radar charts comparing data protection trade-offs across transactional, ML and BI use cases]
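To make the difference between these techniques concrete, here is a small sketch of masking, hashing and tokenizing the same value. This is purely illustrative; in production, tokenization is usually handled by a dedicated vault or service.

import hashlib
import secrets
import uuid

email = "reviewer@example.com"

# Masking: hide most of the value but keep it recognizable for support/BI.
local, domain = email.split("@")
masked = local[0] + "***@" + domain             # "r***@example.com"

# Hashing: one-way, stable across rows, so it can still be joined/grouped on.
salt = secrets.token_bytes(16)                  # per-dataset salt, stored separately
hashed = hashlib.sha256(salt + email.encode()).hexdigest()

# Tokenizing: replace the value with a random token and keep the mapping
# in a separate, tightly controlled store (a dict here only for illustration).
token_vault = {}
token = str(uuid.uuid4())
token_vault[token] = email                      # real systems use a tokenization service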

5. Data loss prevention (DLP)

DLP is a broader term for preventing data from leaving your company’s environment. Many of the techniques we’ve mentioned help limit data leaving your data stores, but they may not stop other forms of data leakage (email, USB transfer, etc.).

[Image: Media covered by DLP, including email, servers, cloud and mobile | from “What is Data Loss Prevention (DLP) | Data Leakage Mitigation”, Imperva]
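As a toy illustration of the pattern-matching piece of DLP (the patterns and function below are assumptions; real DLP products combine this with classification, fingerprinting and egress controls):

import re

# Hypothetical patterns a DLP scanner might flag in outbound content.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "internal_host": re.compile(r"\b[\w.-]+\.internal\.example\.com\b"),
}

def scan_outbound(text: str) -> list:
    """Return the names of any sensitive patterns found in outbound text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]

print(scan_outbound("Customer SSN is 123-45-6789"))  # ['ssn']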

Attack Vector — Data Poisoning

If someone manages to inject malicious data into your dataset, it can muddy your downstream analytics workloads. In an ML context it can be worse: the poisoned data may later be used to train a model into allowing an action or result that should have been rejected.

Attack Vector — Infrastructure Compromise

If your application gets compromised at the infrastructure level, your exposure depends on which data and corresponding keys it has access to.

Attack Vector — Employee Data Leak

Whenever people can view or access your data, there can be leaks. Whether it’s a photo of a screen, a forwarded email or a changed infrastructure access policy, a malicious employee can leak data through every point of access they have.

Attack Vector — SQL Injection

An attack vector we should all be familiar with is the mighty SQL injection. More generally, this class covers any scenario where users reach raw data stores by exploiting poorly sanitized or poorly secured interfaces: a classic SQL injection, a hijacked security token, or sloppily written CLI/API commands.
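The standard mitigation is to never build queries by concatenating raw user input. A minimal sketch with Python's built-in sqlite3 module (the table and values are made up for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (user_id TEXT, body TEXT)")

user_id = "42'; DROP TABLE reviews; --"   # hostile input

# Vulnerable: user input concatenated straight into the SQL string.
# query = f"SELECT body FROM reviews WHERE user_id = '{user_id}'"

# Safer: a parameterized query, where the driver handles escaping.
rows = conn.execute(
    "SELECT body FROM reviews WHERE user_id = ?", (user_id,)
).fetchall()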

End to End Example — Verified Reviews

Now that we’ve gone through five data protection considerations, let’s walk through a real-life example.

A Restaurant Review Application!

You’ve recently joined a company that is building a verified restaurant review platform. The premise of the app is that each user needs to confirm their identity before being allowed to post a review, so reviews can be more honest and transparent, while allowing restaurants to use this info to predict future customers.

Application Architecture

The Context — Application Services

Below is a summary of services for the reviewers and the business users that we will assume we have developed. For each service we’ll map out the data model and how we’ll protect the data.

Account

  1. Signup
  2. Verification
  3. Authentication

User

  1. Profile — Demographics, location, etc.
  2. Browse Restaurants
  3. Leave a Review
  4. Upload a Photo
  5. Popular Restaurants
Business

  1. Profile — Location, description, photo
  2. Menu
  3. Predictive Analytics

ML Models

Review NLP — sentiment, loyalty, etc.

Applying the theory

So now that we’re familiar with the goal and some of the services, let’s figure out how to protect our data! We’ll go through each of the principles outlined earlier.

  1. Data Access
  2. Data Segmentation
  3. Metadata
  4. Data Protection
  5. DLP

Data Access — Roles at the Company

  1. Data Scientist/ML Engineer — Access to live data, other than sensitive data which isn’t used for analytics. Ability to create new aggregated data sets/views.
  2. Security Engineer — Access to infra and data protection services, including potential identity verification.
  3. Software Engineer — Access to infrastructure, mock data and prod data when supporting on call.
  4. Marketing/Sales Data Analysts — Access to dashboards and aggregated data, ability to create new views.
  5. Product Managers — Access to dashboards and aggregated data, ability to create new views.
  6. Project Managers/Executives — Access to dashboards.

Data Access Roles based on Company Roles

Mapping our permissions for each role granularly would be quite comprehensive, so we’ve mapped it out in detail for developers and added some general notes for the other roles.

[Image: Sample developer permissions across dev, staging and prod]

Data Schema

Note: For this example, we’ll assume you are using cloud best practices for data protection, such as customer-managed keys, disk-level encryption, etc., which we will cover in the next part of the series.

Tables for our data store

Data Roles Per Table

In this case, we’ll only add additional access restrictions for the sensitive data.

Protecting Each Data Type

  1. User Passwords: hashed in memory and then stored following best practices (salt, pepper, etc.); see the sketch after this list
  2. Verification Documents: verified with a third-party API; the unique identifier is hashed and stored like a password
  3. Names and emails: masked for any analytics use case (BI, models, etc.)
  4. Reviews: protected at the infra level with storage-level keys
  5. ML Models: protected at the infra level, behind an API gateway
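A minimal sketch of item 1. The parameters and pepper value are illustrative assumptions, and in practice you would lean on a vetted library (bcrypt, scrypt, argon2) rather than rolling your own:

import hashlib
import hmac
import secrets

# Illustrative parameters; tune work factors to current guidance.
PEPPER = b"app-wide-secret-kept-outside-the-db"   # hypothetical value
ITERATIONS = 600_000

def hash_password(password: str) -> tuple:
    """Return (salt, derived_key) to store alongside the user record."""
    salt = secrets.token_bytes(16)                 # unique per user
    key = hashlib.pbkdf2_hmac(
        "sha256", password.encode() + PEPPER, salt, ITERATIONS
    )
    return salt, key

def verify_password(password: str, salt: bytes, stored_key: bytes) -> bool:
    key = hashlib.pbkdf2_hmac(
        "sha256", password.encode() + PEPPER, salt, ITERATIONS
    )
    return hmac.compare_digest(key, stored_key)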

Metadata

To manage our metadata we can assume we have a service that does this for us, such as Glue, Purview, Collibra or something custom-written. We just need to register our datasets with it. A registration might look something like this:

[
  {
    "registrationTime": "1623167711",
    "registeredBy": "ServiceAccount1",
    "description": "For the marketing team to use for review analytics",
    "version": "1",
    "columns": [
      {
        "name": "ReviewID",
        "description": "Which Review this was based off",
        "mode": "REQUIRED",
        "type": "STRING",
        "originatingSource": "transactional:Reviews",
        "sensitivity": "Regular",
        "protection": "None"
      },
      {
        "name": "UserID",
        "description": "Which User the review and demographics info came from",
        "mode": "REQUIRED",
        "type": "STRING",
        "originatingSource": "transactional:UserDemographics",
        "sensitivity": "Regular",
        "protection": "None"
      },
      {
        "name": "ReviewSentiment",
        "description": "How positive or negative this review was",
        "mode": "REQUIRED",
        "type": "Decimal",
        "originatingSource": "mlmodel:Sentiment",
        "sensitivity": "Regular",
        "protection": "None"
      },
      {
        "name": "BirthYear",
        "description": "How old the reviewer is",
        "mode": "REQUIRED",
        "type": "Integer",
        "originatingSource": "transactional:UserDemographics",
        "sensitivity": "Regular",
        "protection": "None"
      },
      {
        "name": "Gender",
        "description": "Gender of the reviewer",
        "mode": "REQUIRED",
        "type": "Integer",
        "originatingSource": "transactional:UserDemographics",
        "sensitivity": "Regular",
        "protection": "None"
      },
      {
        "name": "Location",
        "description": "Location of the reviewer",
        "mode": "REQUIRED",
        "type": "String",
        "originatingSource": "transactional:UserDemographics",
        "sensitivity": "Regular",
        "protection": "None"
      }
    ]
  }
]
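Whatever catalog you use, it is worth enforcing that every registered column actually carries its classification fields before the dataset is approved. A small sketch assuming the JSON layout above; the file name and the "Sensitive" value are assumptions for illustration.

import json

REQUIRED_COLUMN_FIELDS = {"name", "type", "originatingSource", "sensitivity", "protection"}

def check_registration(registration: dict) -> list:
    """Return a list of governance problems found in a dataset registration."""
    problems = []
    for column in registration.get("columns", []):
        missing = REQUIRED_COLUMN_FIELDS - column.keys()
        if missing:
            problems.append(f"{column.get('name', '<unnamed>')}: missing {sorted(missing)}")
        if column.get("sensitivity") == "Sensitive" and column.get("protection") == "None":
            problems.append(f"{column['name']}: sensitive column registered with no protection")
    return problems

with open("review_analytics_registration.json") as f:   # the document above
    for registration in json.load(f):
        print(check_registration(registration))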

DLP

For this one, we’ll assume we are aggregating data access logs across environments (databases, object stores, email, USB drives, etc.) and correlating them with a SaaS product or a custom model.

Conclusion

We just walked through some key data protection principles for MLOps, learning the foundations to ask some good questions along the way. Hopefully you can take these ideas and apply them to your own use cases; everyone’s data platform will be a little different because everyone’s data is different!

Next Time — Part 2: Data Storage Security

If you were rich and had bars of gold, you’d want to protect them. You might put them in a safe. But if that safe sits in Times Square with “gold here, pls don’t steal” written on it, it will be hard to keep in your possession. Data is similar: we need to secure its storage and infrastructure to keep it safe, which is what Part 2 of this series is all about.
