Protecting Your Data Engineering and MLOps Storage

Part 2 of the 7 Layers of MLOps Security Guide

An ML Storage Architecture

Series Links

Part 1. Intro and Data Security
Part 2. Protecting Your Data Storage
Part 3. Securing Your Orchestrator
Part 4. ML Model Security
Part 5. ML Model Hosting
Part 6. Securely Exposing ML Models to Users
Part 7. Logging and Monitoring MLOps Infra

How to read this article

After seeing the reading time and word count you may be thinking: what kind of person would read, let alone write, this kind of article? Unfortunately, you’re interested and already here, so I’ll offer a few tips.

  1. There are many sections — you probably aren’t working on a problem that covers all of them, so feel free to jump to the section you need.
  2. This article assumes intermediate programming knowledge. Some concepts I define; others I assume you know.
  3. After reading (and understanding) this article you should be able to have a good conversation with your peers about data storage protection. Your basic questions should be answered and you’ll have the vocabulary and concepts to google or discuss more complicated ones.

Table of Contents

  1. MLOps Data Architecture
  2. Transactional Datastores
  3. Handling Mistakes
  4. Network Security
  5. Encryption at Rest
  6. Column Level Encryption
  7. Encryption In Transit
  8. Key Management
  9. Data Warehouses
  10. Scaling Warehouses for Multi Tenancy
  11. Datalakes
  12. Streaming and Messaging
  13. Federated Access
  14. Zero Trust and Access
  15. Conclusion

An architecture for the bits and pieces

Data storage infrastructure is a core component of any MLOps solution since it houses and protects the data — without it, your data has nowhere to live. MLOps also involves a significant amount of data engineering and movement, so today’s article, together with Part 3 on orchestrators, will help us position the data correctly for model training and publishing. Each cloud provider has its own interpretation of a data strategy (like the AWS graphic below), so we’ll also call out the nuances of each approach.

AWS Lakehouse Architecture with many data components | AWS
Simple MLOps architecture in Azure for the data storage layers

Transactional Data Stores

Starting off with our bread and butter: the transactional database! Transactional, ACID compliant databases have been around for decades, but their security continues to evolve.

Mistakes and Outages

First days at a new job can be stressful. We try to make a good impression, but sometimes a nightmare scenario unfolds, such as accidentally wiping out production data. Beyond the immediate outage, losing data hurts security in several ways.

  1. Harms other investigations. If we lose data for an ongoing investigation, whether direct data or data that can be correlated, we lose the ability to find and patch issues.
  2. Limits ability to do anomaly detection. Many actions don’t exist in a vacuum, so correlating past actions to current actions is important to determine whether an event is malicious. The credential someone just used to read prod data might be fine if it belongs to a support person fixing a P1 bug, but if it’s a random login from a random IP, you might be in trouble.
  3. Limits cross-org coordination. In large companies and development environments, attackers may compromise a simple system and then move laterally to a more valuable one. With data loss on previous cross-team incidents, a centralized or decentralized security team cannot be as effective.

How Should We Handle Mistakes?

Blameless Postmortems and Root Cause Analysis

Disaster Recovery and High Availability

Another important aspect to discuss is how to minimize the risk of downtime and data loss if your systems go down. Most cloud databases/data stores make this easy to configure, so we’ll go over the foundational concepts first.

Achieving an SLA over 99.99% for applications is challenging | Wikipedia
Regions and Availability Zones in Azure | Microsoft Docs
How Azure replicates data for Blob Storage | Azure
Azure SQL with no region in the URL | Microsoft
Load balancing is becoming easier with new tech | GCP
Recovery Time (RTO) and Recovery Point (RPO) in a DR strategy | MSP360
DR strategies related to RPO/RTO | AWS
AWS Aurora Global Clusters | AWS
Disaster recovery in Snowflake across regions/accounts | Snowflake

Network Protections

We’ll cover network protection in more detail in a dedicated section, but for transactional data stores the rule is simple: don’t expose them to end users directly.

Users can only connect via the API gateway, which maps to an Azure Function and then to a private Azure SQL instance
Azure VNET connection between an Azure VM and Azure SQL with VNET peering | Azure Docs
AWS VPC with a public application subnet and private DB subnet | AWS Website
GCP VPC with a private Cloud SQL instance reached from a VM over private IP | GCP Website

Attack Vector — DB DDoS Attacks

One of the reasons to isolate your database from end users is to prevent it from going down due to excessive traffic. Malicious actors could open many connections and overwhelm the database directly, causing connection exhaustion and timeouts.

Attack Vector — Dictionary Attack

A dictionary attack uses a list of known passwords to try to log into a system. If we make it more difficult to reach production systems, attackers won’t have an opportunity to try. Or rather, they’d have to compromise another part of the system to gain access, hopefully triggering alarms along the way and slowing them down enough for a defensive team to respond.

Database Encryption

An important aspect when you’re protecting your database is to think about how you encrypt the data. You can encrypt in a variety of ways which we will cover below.

  1. Oracle and SQL Server TDE. This technique focuses on encrypting the database at the page level and storing the encryption key separate from the data store. When in a cloud environment you can either provide a Customer Managed Key (CMK) or rely on a random one from the cloud provider. TDE is usually set up at creation time.
  2. Aurora Cluster Encryption. Similar to TDE, Amazon’s managed database encrypts data, snapshots and logs. Likewise you can set up a CMK for the cluster.
  3. PostgreSQL partition encryption. This is the technology used for encryption at rest in managed PostgreSQL services like RDS, Cloud SQL or Azure Database for PostgreSQL. Each cloud provider uses a slightly different technique for encryption management, so it’s worthwhile to read up: GCP, Azure, AWS
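To make this concrete, here is a minimal sketch of provisioning an encrypted Aurora PostgreSQL cluster with a customer managed KMS key using boto3. The identifiers, credentials and key ARN are placeholders, and your own networking and secret handling will add more arguments.

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# StorageEncrypted + KmsKeyId turn on encryption at rest with a CMK.
# The key ARN below is a placeholder for a key you already manage.
rds.create_db_cluster(
    DBClusterIdentifier="feature-store-cluster",
    Engine="aurora-postgresql",
    MasterUsername="admin_user",
    MasterUserPassword="use-a-secrets-manager-instead",  # prefer pulling this from a secret store
    StorageEncrypted=True,
    KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/your-cmk-id",
)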

Column Level Encryption

Encryption at rest is great since it protects the hardware, but once data reaches the client side, it’s decrypted. Column level encryption focuses on protecting data within the client, limiting a user’s or application’s ability to see it unless they have the right permissions.

Example of encrypting and decrypting a string in MySQL | MySQL 8.0 docs
A sample app with AWS Aurora column level encryption using KMS | AWS
Immuta SQL Workflow Architecture | Immuta Docs
Third-party governance tools such as Immuta or Privacera can be worth adopting when:
  1. You don’t want to change your app architecture.
  2. You have multiple databases and want central access controls and key management.
  3. You want centralized data visibility and classification.
A list of Privacera partners, which means fewer connectors to build | Privacera
The main tradeoffs to weigh with these platforms are:
  1. Performance
  2. Vendor lock-in
  3. Fragmented IAM model
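If a full governance platform is overkill, a minimal sketch of client-side column encryption might look like the following. It uses the Python cryptography library; in practice the key would come from a KMS or vault rather than being generated inline, and the column value here is hypothetical.

from cryptography.fernet import Fernet

# In practice, fetch this key from KMS / Key Vault instead of generating it here.
column_key = Fernet.generate_key()
fernet = Fernet(column_key)

def encrypt_pii(value: str) -> bytes:
    # Encrypt a sensitive column value before it ever reaches the database.
    return fernet.encrypt(value.encode("utf-8"))

def decrypt_pii(token: bytes) -> str:
    # Only callers holding the column key can read the plaintext back.
    return fernet.decrypt(token).decode("utf-8")

ciphertext = encrypt_pii("4111-1111-1111-1111")
print(decrypt_pii(ciphertext))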

Encryption in Transit

HTTPS Communication and TLS

We’ve discussed column level encryption and encryption at rest; the third layer is encryption in the networking channel. Encrypting in transit prevents anyone with access to your traffic from snooping on your data and potentially intercepting it. This might be someone on the internet staging a man-in-the-middle attack, or a disgruntled employee connected to the network in your datacentre.

A fun comic sample from How HTTPS Works

Connecting to A Database

After our primer on TLS/HTTPS, how do we use it with our database? Most providers enable (and often enforce) encryption in transit out of the box, so you don’t have to worry about it. If you are running your own database server, you will have to handle certificate setup yourself, and each database has its own steps to follow.
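As an illustration, forcing certificate verification when connecting to a PostgreSQL instance might look like this with psycopg2; the host, credentials and CA bundle path are placeholders.

import psycopg2

# sslmode="verify-full" both encrypts the channel and checks that the server
# certificate matches the hostname, using the CA bundle you provide.
conn = psycopg2.connect(
    host="mydb.example.internal",
    dbname="appdb",
    user="app_user",
    password="fetch-from-a-secret-store",
    sslmode="verify-full",
    sslrootcert="/etc/ssl/certs/my-db-ca.pem",
)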

Beyond channel encryption

In addition to relying on TLS to encrypt your data in transit, you can apply the same principles with your own keys at the data level. One example is a banking PIN: a critical piece of information you may want to wrap with your own key rather than relying only on the TLS channel and its certificate authority.

Different Types of Data Keys

We started talking about keys, so let’s continue! When encrypting data there are usually several keys involved, and they vary depending on the service. Let’s go over some common terms.

Snowflake’s key hierarchy: the root key encrypts account master keys, which encrypt table master keys, which encrypt file keys | Snowflake

Master Key

This may be held at the account or database level, and may include multiple masters as seen above in Snowflake.

KEK — Key Encryption Key

The key encryption key is used to encrypt/decrypt another key or collection of keys.

DEK — Data Encryption Key

A data encryption key is used to encrypt a data file, table or partition. Sometimes there is more than one key in the data key hierarchy, as seen in the Snowflake example above. A good read on the topic can be found below.
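To illustrate how a KEK and DEK work together, here is a rough envelope-encryption sketch using AWS KMS to generate the data key; the key alias and payload are placeholders.

import os
import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kms = boto3.client("kms", region_name="us-east-1")

# KMS returns the DEK twice: in plaintext (use it, then discard it) and
# encrypted under the KEK (store it next to the ciphertext).
data_key = kms.generate_data_key(KeyId="alias/my-kek", KeySpec="AES_256")
plaintext_dek = data_key["Plaintext"]
encrypted_dek = data_key["CiphertextBlob"]

nonce = os.urandom(12)
ciphertext = AESGCM(plaintext_dek).encrypt(nonce, b"sensitive training data", None)

# Persist ciphertext + nonce + encrypted_dek. To decrypt later,
# call kms.decrypt on encrypted_dek to recover the DEK.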

CMK — Customer Managed Key

Large enterprises often want to use their own keys for encrypting data within cloud infrastructure or SaaS platforms. This manifests in different ways: customer managed keys, bring your own key, or hold your own key. Depending on the platform, your data may be encrypted with just your key or with a composed key; an example of how Snowflake does it is below.

Snowflake Tri-Secret Secure with Bring Your Own Key | Snowflake

BYOK — Bring your own key

To extend on CMKs, some platforms differentiate between BYOK and HYOK. The main distinction is that with BYOK, the customer key is used to decrypt data at the infrastructure level and is stored in an integrated keystore. With HYOK, the data infrastructure and keystore are more decoupled, and data may or may not be decrypted before being returned to users, depending on the integration points.

Comparison of bring your own key (BYOK) and hold your own key (HYOK) | Secupi

Re-Keying

It’s good practice to rotate your keys on a schedule to limit the window of exposure if a key is ever compromised. When you do so, data that is encrypted (and any downstream keys) needs to be decrypted with the old key and re-encrypted with the new one. Ideally this happens in the background across different nodes so it doesn’t disrupt users, but depending on the service, it may cause outages.

Snowflake re-keying after rotation | Snowflake
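When the key being rotated is a wrapping key, only the wrapped keys beneath it need to be re-encrypted, not the bulk data. A rough sketch with AWS KMS, continuing the envelope-encryption example above (key aliases are placeholders):

import boto3

kms = boto3.client("kms", region_name="us-east-1")

# Re-wrap the stored, encrypted DEK under the new KEK. The bulk data
# encrypted with the DEK itself does not need to be rewritten.
rewrapped = kms.re_encrypt(
    CiphertextBlob=encrypted_dek,   # from the earlier envelope-encryption sketch
    DestinationKeyId="alias/my-kek-v2",
)
new_encrypted_dek = rewrapped["CiphertextBlob"]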

Data Warehouses

So we’ve talked about dealing with transactional data, but how do we analyze that data? Typically we don’t want to impact our production datastore for analytics, so we either stream those events somewhere else or extract them with batch jobs. We can move the data either into a datalake with some compute to execute the queries, or into a data warehouse.

According to every major data warehouse, it’s cheaper than the competition :)

Multi Tenancy

When setting up your data warehousing solution, you should think about how it will scale as more applications onboard. Whether you are an enterprise doing a POC or a startup needing to analyze your data, you’ll usually start by moving one app’s data into your warehouse. At this point you’re probably writing your setup scripts by hand and trying to get things working. Hopefully you solidify your design and automation before the next dataset onboards, otherwise it will be harder to change and manage.

Snowflake object hierarchy: account at the top, followed by database, schema and table | Snowflake
Cross-account AWS Redshift architecture with EC2 access from another account | AWS
A database can be scoped at several levels of granularity:
  1. Store all the data
  2. All data for a line of business
  3. All data for an app/small collection of apps
Some warehouses push you toward the first option. Azure Synapse dedicated SQL pools, for example, recommend that you:
  • Use one database to run your entire analytics workload.
  • Consolidate your existing analytics environment to use one database.
  • Leverage user-defined schemas to provide the boundary previously implemented using databases.

Multi Tenancy Decision Making

Now that we’ve covered the main building blocks for how a data warehouse is configured, let’s look at what factors influence our design.

  1. Development environments
  2. Dollar Cost
  3. Time Cost
  4. Data Sharing
  5. Compliance
  6. Contractual Obligations
  7. Disaster recovery / high availability
What factors can influence multi tenancy design

Multi Tenancy Case Study— Snowflake

Since we don’t have time in this article to do a deep dive on multi tenancy for each data warehouse, we’ll do it for one, Snowflake.

One Account, Database Per Environment

Depicting our multi tenancy use case
Some drawbacks of this layout:
  1. You can’t experiment on roles and hierarchies since the user admin is a privileged role. You can very easily wipe out users/roles in prod.
  2. If you are connecting over a private link, you will have to choose which network you are connecting to. This would not mirror your infrastructure deployment and may cause confusion and backdoor channels on getting prod data into dev, if you aren’t careful.
  3. If your business continues to grow and you add many more product lines, you will be limited in how you segment them.
  4. Billing by default is at the database level: if you need to implement chargeback you need to do some extra work to parse out data storage costs at the schema level.
  5. Your roles will need to include which environment you’re working with, which may create challenges in doing upgrades across environments as the naming convention will not be the same. It may also cause issues with your SSO/SCIM solution.

Account Per Environment, Database Per App

Multiple Accounts With a Database per App
Things to watch with this layout:
  1. Creating databases is a fairly privileged action, so your automation will need to be granted this permission to make the process seamless.
  2. Schemas will need to be a well-defined construct, otherwise chaos will ensue. There are many options here, but having experimentation, project and stable zones is usually beneficial. You also need to make sure proper schema level permissions are granted, since they govern the data at the end of the day.

Account Per Environment, Database Per LOB, Schema Per App

Multiple Accounts With a Database per LOB
Considerations with this layout:
  1. Billing by default is at the database level, so if each app needs chargeback, you will have to do some extra work to parse out data storage costs at the schema level.
  2. You have to be more careful with schema permissions since there is more data per database.

Additional Considerations

  1. Account setup. When considering a multi-tenant environment, onboarding a new team and doing the initial setup has its own steps.
  2. Creation of roles. This is its own topic, but you’ll need to define new roles at onboarding time for new apps. They should follow a well defined hierarchy and be automated.
  3. Read/Write/Delete permissions. In Snowflake you must be able to read data if you would like to write data. However, masks can be applied to turn the read data into meaningless strings. You may also need to create intermediary tables and views, since your user may be assigned many roles but can only use one role at a time.
  4. Virtual Warehouses per app/role. We haven’t talked about the compute portion of Snowflake, but to execute queries you’ll need a Virtual Warehouse. Similar to storage, you need to configure this as well.
  5. Automation. You will probably need a control plane application that runs the automation on top of SCIM/SSO. This will help assign roles dynamically for new roles or workflows.
  6. Python SDK or SQL connector. Snowflake recently announced a REST API for executing most commands, but you can also use SQL drivers and execute the commands within an application (a short sketch follows below).
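As a rough sketch of what that automation might execute, the snowflake-connector-python library can run the onboarding SQL for a new app; the account, role, warehouse and schema names here are hypothetical.

import snowflake.connector

conn = snowflake.connector.connect(
    account="myorg-myaccount",
    user="AUTOMATION_SVC",
    authenticator="externalbrowser",  # or key-pair auth for headless automation
)
cur = conn.cursor()

# Carve out storage, compute and a role for the new app, then tie them together.
cur.execute("CREATE SCHEMA IF NOT EXISTS ANALYTICS_DB.APP_ORDERS")
cur.execute("CREATE WAREHOUSE IF NOT EXISTS APP_ORDERS_WH WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60")
cur.execute("CREATE ROLE IF NOT EXISTS APP_ORDERS_READ")
cur.execute("GRANT USAGE ON SCHEMA ANALYTICS_DB.APP_ORDERS TO ROLE APP_ORDERS_READ")
cur.execute("GRANT USAGE ON WAREHOUSE APP_ORDERS_WH TO ROLE APP_ORDERS_READ")
cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA ANALYTICS_DB.APP_ORDERS TO ROLE APP_ORDERS_READ")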

Data Warehouse Details

For each of the data warehouses, I’ve included a couple of important points below in each category for further reading. This list is by no means exhaustive; it’s only a prompt.

Snowflake

One of the items I didn’t cover was private link connectivity to Snowflake. This is a cool feature that many SaaS providers don’t have; it allows you to extend your virtual network to Snowflake. In combination with security groups, private networking and firewalls, you can make sure all your data movement happens within your governed network. This feature is available on AWS and Azure, with an AWS diagram below.

Snowflake Private Link With AWS | Snowflake

Redshift

For Redshift, there are many useful integrations with the AWS logging, monitoring, identity and other data systems. One of the aspects I wanted to highlight is the multi tenancy model described below. Depending on your tenancy model, other aspects will follow, so it’s important to understand what the data sharing capabilities are.

Redshift Multi-Tenancy Models | AWS

BigQuery

As with Redshift in the AWS ecosystem, BigQuery has nice integrations with the rest of GCP. One integration worth highlighting is the DLP feature with column level protection.

Column level protection and tagging in BigQuery | GCP

Synapse Analytics

One of the differences between the other three warehouses and Synapse Analytics is that it runs on an Azure SQL Server. This gives you many of the integrations and features found in Azure SQL and makes it easier to up-skill SQL Server developers to become Synapse Analytics users. It is also part of the larger Synapse ecosystem, which aims to provide a single pane of glass for data analytics. More info can be found in the link below.

Synapse Studio connects many data components together in Azure | Azure

Datalakes

We’ve talked about mainly structured/tabular data so far, but many ML models need to use other data, which is where a datalake comes in handy. There have been a few iterations of datalakes, but one of the main challenges was storing the data in an organized fashion and being able to run useful queries on them. With advances in distributed storage and computing, there are now many systems that can do this such as Spark or Amazon Athena.

ACL (Access Control List)

We’ve already briefly spoken about identity and RBAC, but more broadly, many tech platforms use ACLs as a way to limit who can access what. In our datalake case, we define ACLs to control operations such as reads, writes, deletes and a few others. We can make ACLs long lived or provision them on the fly to allow just-in-time access.

S3 datalake account with multiple S3 buckets and objects within | AWS
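One lightweight way to grant that just-in-time access on S3, for example, is a short-lived presigned URL; a minimal sketch (bucket and key are placeholders):

import boto3

s3 = boto3.client("s3")

# The URL embeds a signature from the caller's IAM identity and
# expires after 15 minutes, so no long-lived ACL is handed out.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "ml-datalake-raw", "Key": "features/2021/07/events.parquet"},
    ExpiresIn=900,
)
print(url)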

Network Access

Similar to databases, we typically restrict network access to our datalake. As a company scales, it’s usually a good idea to put the datalake into a different subscription/account to separate business apps from the analytics layer. You can then peer VPCs/VNets and use storage private endpoints to connect.

Datalake Specific Nuances

Now that we have covered some general protection tips for datalakes, we’ll cover three technologies with some nuances.

Azure Datalake Gen 2

Azure Data Lake Storage Gen2 (ADLS) is hierarchical storage that combines the capabilities of the original Azure Data Lake with the benefits of Azure Blob Storage. When deploying, you may use ARM templates, and you can use the SDK to manage the hierarchy, which consists of containers, directories and files.

Azure Data Lake Gen2 supports many types of users | Azure
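As a sketch of managing that hierarchy and its ACLs programmatically, the azure-storage-file-datalake SDK can be used as below; the account, filesystem and AAD group object ID are placeholders.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mymldatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

fs = service.get_file_system_client("raw")
directory = fs.create_directory("training-data")

# Grant an AAD group read/execute on the directory via a POSIX-style ACL entry.
directory.set_access_control(
    acl="user::rwx,group::r-x,other::---,group:00000000-0000-0000-0000-000000000000:r-x",
)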

S3

S3 is a core AWS service and has increasingly been used as a datalake. With fine-grained permissions, policies and strong performance, it’s a great candidate. AWS doubled down on this potential by releasing Lake Formation a couple of years ago, a framework for managing data and metadata for datalakes.

AWS Lake Formation helps configure key parts of a datalake | AWS
How AWS services integrate with Lake Formation | AWS

Deltalake

Deltalake is an open source project that builds on top of datalakes. It provides a parquet-based table format with additional benefits out of the box, listed below.

Deltalake overview | Deltalake
  • ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
  • Scalable metadata handling: Leverages Spark distributed processing power to handle all the metadata for petabyte-scale tables with billions of files at ease.
  • Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, interactive queries all just work out of the box.
  • Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.
  • Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
  • Upserts and deletes: Supports merge, update and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.
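To read and write Delta tables on a datalake like ADLS Gen2 securely, Spark needs to authenticate with an identity rather than shared account keys; the configuration below uses an Azure AD service principal (OAuth client credentials):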
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net","<password>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

Streaming and Messaging

We’ve mainly covered standard storage, but as streaming technologies evolve and mature to accommodate real-time needs, they can also be considered a data storage environment, even if a temporary one. With many streaming services having replay capabilities to avoid data loss, it’s important to make sure your stream is protected, since data will be persisted. I’ve also included a cool DZone article on whether Kafka is a database.

Network, Identity, Encryption, oh my!

We’ve covered these topics numerous times already, so please make sure to read about how to secure your streaming service. In this section we’ll focus on some of the interesting security features for common streaming solutions.

Kafka and Managed Kafka

Apache Kafka is a leading open source streaming platform originally developed by LinkedIn. It has become a standard for processing real-time data and for event driven architectures, including handling driver matching at Uber. Uber has also written about how Kafka can be configured for multi-region DR.

Confluent builds on Kafka to offer an ecosystem | Confluent
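On the client side, securing Kafka largely comes down to TLS plus authentication. A minimal producer sketch with kafka-python follows; the brokers, topic, credentials and CA path are placeholders, and managed offerings expose the same settings.

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1.internal:9093"],
    security_protocol="SASL_SSL",        # encrypt in transit and authenticate
    sasl_mechanism="SCRAM-SHA-512",
    sasl_plain_username="ml-ingest",
    sasl_plain_password="fetch-from-a-secret-store",
    ssl_cafile="/etc/ssl/certs/kafka-ca.pem",
)
producer.send("feature-events", b'{"user_id": 42, "event": "click"}')
producer.flush()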

Azure Eventhubs

Azure Event Hubs is Microsoft’s real-time streaming solution. It has a number of nice features in terms of integrations, security and Kafka interoperability. You can integrate it with the Azure data suite and connect your applications with the SDK. As usual, it also integrates with your VNET, AAD and replication services.

Event Hubs integration with Databricks | Azure

Kinesis and Kinesis Firehose

Kinesis and Kinesis Firehose are AWS’s data streaming technologies. They focus on streaming data from applications to data sinks and integrate natively with other AWS services. You’ll notice that the best practices doc covers the topics we’ve been discussing, so hopefully that gives you some confidence :).

AWS Kinesis for data streaming | AWS

Integrating With A Data warehouse or Datalake

Two of the potential destinations for a data stream are a data warehouse or a datalake. With the major providers this is usually a built-in capability, allowing a secure and scalable integration so you don’t have to build custom apps to do it. For AWS and Azure, Kinesis Data Firehose and Stream Analytics are the respective products for the task. You can find more details in the articles below.

Azure Stream Analytics flow pattern | Azure
AWS Kinesis Data Firehose potential destinations | AWS

Federated Access Models

One of the most compelling reasons for sticking to one cloud provider is having a consistent set of identities that can be provisioned and used by services to authenticate against each other. When using multiple clouds or SaaS products, you’ll usually need a mechanism to bridge the gap. Some SaaS products are quite mature, allowing identities such as Service Principals and IAM roles to be granted between your cloud environment and theirs. Others require you to use SSO + SCIM with local ACLs to manage access. The least mature ones only support local accounts. Below is an example of how Snowflake can create an Azure SP in your account to do data ingestion, and how you can easily manage its access in AAD.

Connecting to Power BI with one identity

Azure Active Directory With Azure SQL/Synapse Analytics

With Azure Synapse Analytics, you can allow users to log in with their Azure AD credentials. These users can be added directly to the database, or groups can be added that are then synched with AAD. Each user and group, however, needs to be added via a T-SQL command, but this can be done with a configured automation identity (managed identity) that runs the scripts; a sketch follows below.

Setting up a SQL Server AAD admin | Azure
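A rough sketch of that automation, using pyodbc with a managed identity and the CREATE USER ... FROM EXTERNAL PROVIDER command; the server, database and group names are placeholders.

import pyodbc

# Authentication=ActiveDirectoryMsi lets an Azure VM / Function use its
# managed identity instead of a stored password (ODBC Driver 17.6+).
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:my-synapse.sql.azuresynapse.net,1433;"
    "Database=analytics;"
    "Authentication=ActiveDirectoryMsi;"
)
cursor = conn.cursor()

# Map an AAD group into the database and grant it read access.
cursor.execute("CREATE USER [DataScienceReaders] FROM EXTERNAL PROVIDER")
cursor.execute("ALTER ROLE db_datareader ADD MEMBER [DataScienceReaders]")
conn.commit()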

AWS IAM with Aurora

AWS Aurora also allows IAM roles to be used to log in, by granting temporary access tokens so that local credentials are avoided. An excerpt from the documentation provides a good description, with a full link below, and a short connection sketch follows the excerpt.

Connecting to Postgresql using IAM policy | AWS
  • Network traffic to and from the database is encrypted using Secure Socket Layer (SSL) or Transport Layer Security (TLS). For more information about using SSL/TLS with Amazon Aurora, see Using SSL/TLS to encrypt a connection to a DB cluster.
  • You can use IAM to centrally manage access to your database resources, instead of managing access individually on each DB cluster.
  • For applications running on Amazon EC2, you can use profile credentials specific to your EC2 instance to access your database instead of a password, for greater security.
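A minimal sketch of IAM database authentication against an Aurora PostgreSQL endpoint; the hostname, database, user and region are placeholders.

import boto3
import psycopg2

rds = boto3.client("rds", region_name="us-east-1")

# The token is a short-lived (15 minute) password signed by the caller's IAM identity.
token = rds.generate_db_auth_token(
    DBHostname="my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com",
    Port=5432,
    DBUsername="iam_app_user",
)

conn = psycopg2.connect(
    host="my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com",
    port=5432,
    dbname="appdb",
    user="iam_app_user",
    password=token,
    sslmode="require",   # IAM authentication requires SSL
)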

Using Vendor Products

Sometimes the cloud native services don’t fit your needs, so what do you do? When choosing vendor products, you should still, from a security perspective, try to integrate your identities with their permissions framework. We’ll cover three important terms to look for.

Shopify app SSO options, there are many | Shopify
Snowflake SSO with AD groups and users mapped to roles | Mika Heino

Zero Trust and MFA

One of the hot recent terms in security is Zero Trust and the methods to achieve it. The premise is that when two entities make a connection, we should put both of them through a validation process to verify who they are, rather than assuming they are safe because of pre-existing conditions, such as already being on the corporate network.

Zero Trust reference architecture | Microsoft

Conclusion: Bringing it back to MLOps

So we went down many rabbit holes to discuss data storage security, but hopefully this has given you a foundation for securing this infrastructure. From each section you’ll have a set of tools to make architectural recommendations on how to secure your storage, and to encourage engineers to ask good questions about the security of their solutions.

Our Initial MLOps Architecture

Next Time — Part 3: Moving and Transforming Data

After securing all your different data storage components you may ask, now what? We typically need to move, process and transform our data before it is ready to be used by an ML model. That’s what Part 3 is for!


ML Architect @ Voiceflow
