Protecting Your Data Engineering and MLOps Storage
Part 2 of the 7 Layers of MLOps Security Guide
Welcome to Part 2 of our MLOps security guide! To recap, last time we discussed how to protect data and walked through an example for a restaurant review app. We’ll continue to reference concepts from that article and build on them as seen in the Table of Contents. Today we’ll dive deeper into protecting the storage that houses this data, focusing on different data storage technologies and how to enable these protections for our previous use case.
Series Links
Part 1. Intro and Data Security
Part 2. Protecting Your Data Storage
Part 3. Securing Your Orchestrator
Part 4. ML Model Security
Part 5. ML Model Hosting
Part 6. Securely Exposing ML Models to Users
Part 7. Logging and Monitoring MLOps Infra
How to read this article
After seeing the reading time and word count you may be thinking: what kind of person would read, let alone write, this kind of article? Unfortunately, you’re interested and already here, so I’ll provide you with some tips.
- There are many sections — you probably aren’t working on a problem that covers all of them, so feel free to jump to the section that’s relevant to you.
- This article assumes intermediate programming knowledge. Some concepts I define, others I assume you know.
- After reading (and understanding) this article you should be able to have a good conversation with your peers about data storage protection. Your basic questions should be answered and you’ll have the vocabulary and concepts to google or discuss more complicated ones.
Table of Contents
- MLOps Data Architecture
- Transactional Datastores
- Handling Mistakes
- Network Security
- Encryption at Rest
- Column Level Encryption
- Encryption In Transit
- Key Management
- Data Warehouses
- Scaling Warehouses for Multi Tenancy
- Datalakes
- Streaming and Messaging
- Federated Access
- Zero Trust and Access
- Conclusion
We have lots of material to cover, so let’s get started!
An architecture for the bits and pieces
Data storage infrastructure is an important component of any MLOps solution since it houses and protects the data. Without the storage, your data can’t really exist anywhere. MLOps requires a significant amount of data engineering and movement, so today’s article, along with Part 3 on orchestrators, will help us position the data correctly for model training and publishing. Each of the cloud providers has its own interpretation of a data strategy (like the AWS graphic below), so we’ll also call out nuances to each approach.
Finally, we’ve included a general architecture for an MLOps flow below. We can see a couple of components: our transactional data store, object storage, data warehousing and streaming. We also connect these with virtual networks and identity providers (plus logging and monitoring, which we’ll cover in another article).
We’ll cover each of the 6 main components, enjoy!
Transactional Data Stores
Starting off with our bread and butter: the transactional database! Transactional, ACID compliant databases have been around for decades, but their security continues to evolve.
These databases usually contain our mission critical information: transactions, customer info, SKUs and more. For this data the priority is redundancy, uptime, and throughput. Security has also been important, but before the rise of cloud, companies put these important databases within their datacentres and assumed with some network isolation they’d be safe.
With VPN, cloud, and distributed working models becoming critical to how many companies run, protecting key infrastructure has changed. In this section we’ll discuss different forms of protection: at the network, data, and human level.
Mistakes and Outages
First days at a new job can be stressful. We try to make a good impression, but sometimes a nightmare scenario unfolds.
Today was my first day on the job as a Junior Software Developer and was my first non-internship position after university. Unfortunately i screwed up badly.
I was basically given a document detailing how to setup my local development environment. Which involves run a small script to create my own personal DB instance from some test data. After running the command i was supposed to copy the database url/password/username outputted by the command and configure my dev environment to point to that database. Unfortunately instead of copying the values outputted by the tool, i instead for whatever reason used the values the document had.
Unfortunately apparently those values were actually for the production database (why they are documented in the dev setup guide i have no idea). Then from my understanding that the tests add fake data, and clear existing data between test runs which basically cleared all the data from the production database. Honestly i had no idea what i did and it wasn’t about 30 or so minutes after did someone actually figure out/realize what i did.
While what i had done was sinking in. The CTO told me to leave and never come back. He also informed me that apparently legal would need to get involved due to severity of the data loss.
(Story from Reddit)
We’ve probably all heard a story similar to this: a new employee accidentally breaks something in production, leading to full on panic mode. We may think of this as a process or software development issue, but it’s actually a security problem as well. Poorly documented processes can severely hurt a company, whether through data loss, leaked credentials or the opportunity cost of recovery.
Leaked credentials are fairly obvious in how they can harm a company, but what about data loss? Let’s go through a couple of scenarios.
- Harms other investigations. If we lose data for an ongoing investigation whether direct data, or data that can be correlated, we lose the ability to find and patch issues.
- Limits ability to do anomaly detection. Many actions don’t exist in a vacuum, so correlating past actions to current actions is important to determine if an event is malicious. The credential someone just used to read prod data might be ok if this is a support person fixing a P1 bug, but if it’s a random login from a random IP, you might be in trouble.
- Limits cross-org coordination. In large companies and development environments, attackers may want to compromise a simple system and then move around to a more valuable system. With data loss on previous cross-team incidents, a centralized or decentralized team cannot be as effective.
How Should We Handle Mistakes?
Blameless Postmortems and Root Cause Analysis
One of the biggest dangers to security and system stability within a company is a culture of blame. When teams finger point and try to gain political points for the success of their systems, a collaborative and problem-solving culture begins to break down. Incidents, risks and responses become harder to find, identify and voice due to a fear of repercussions. One of Google’s SRE (Site Reliability Engineering) recommendations is to have blameless postmortems, i.e., figuring out why something happened and instituting fixes (technical and social) so it doesn’t happen again.
Blameless postmortems can be challenging to write, because the postmortem format clearly identifies the actions that led to the incident. Removing blame from a postmortem gives people the confidence to escalate issues without fear. It is also important not to stigmatize frequent production of postmortems by a person or team. An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug, leading to greater risk for the organization [Boy13].
With a culture and leadership practice that focuses on speedy remediation and prevention, teams can be transparent about their past issues and assist each other in the future. It will also make it easier for individuals to admit fault and learn from their mistakes, leading to good learning opportunities and rapport for the individual and team.
Bug Bounties
Even with the best processes and people, internal teams still make mistakes. Whether it’s familiarity blindness or pressure to ship features, having an extra set of eyes is always useful to remediate security issues. Many companies bring 3rd party firms or consultants to do so, but those are usually expensive and occur on an infrequent basis.
Another option for continuous monitoring and engagement is to have a policy of bug bounties: rewards to incentivize security professionals to report security vulnerabilities through a secure channel for a financial reward. The idea is to promote a financial reward and recognition for finding issues rather than keeping them quiet and risking the exploit being sold to criminals.
What if I say your Email ID is all I need to takeover your account on your favorite website or an app. Sounds scary right? This is what a bug in Sign in with Apple allowed me to do.
The excerpt above is from a developer who found an exploit in Apple’s sign-in process and was paid $100k for securely reporting it. Big tech companies have been embracing these programs for a number of years, and governments have also joined in. The US Department of Defense has a program called “Hack the Pentagon” that was launched in 2016 and continues to help identify thousands of bugs.
Bug bounties are definitely an interesting topic and you can read more about them here.
Disaster Recovery and High Availability
Another important aspect to discuss is how to minimize the risk of downtime and data loss if your systems go down. Most cloud databases/data stores make this easy to configure, so we’ll go over the foundational concepts first.
Service Level Agreement (SLA)
When designing an app or using a cloud service, you need to decide how often you’re willing for it to be unavailable. While we do want our apps and data to be running 100% of the time, this is often not feasible with technology. As such, we use SLAs to define our requirements. SLAs are given by “number of nines”, such as 99.9% uptime.
As the image above shows, getting beyond “four nines” is increasingly challenging, and most companies max out around there for transactional stores and applications. Some data stores have higher availability but make trade-offs in other dimensions.
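To make the nines concrete, here’s a quick back-of-the-envelope calculation of the yearly downtime budget each tier allows (a rough sketch; real contracts often measure per month and exclude planned maintenance):

import math

MINUTES_PER_YEAR = 365 * 24 * 60

# Approximate yearly downtime budget for common SLA tiers
for sla in (99.9, 99.95, 99.99, 99.999):
    budget = (1 - sla / 100) * MINUTES_PER_YEAR
    print(f"{sla}% uptime -> ~{budget:,.0f} minutes of downtime per year")

# 99.9%  -> ~526 minutes (~8.8 hours)
# 99.99% -> ~53 minutes
# 99.999% -> ~5 minutes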
Regions and Availability Zones
In the world of cloud, your applications and data are replicated across many datacenters, which are called availability zones. These then form a cluster called a region, as demonstrated in the diagram below.
DNS Load Balancing
Now that we’ve talked about regions and availability zones, how do we get users to access data in them? Traditionally, different datacenters were connected through DNS so that applications could be resolved and load balanced across them.
Today some datastores, like Azure SQL, do DNS resolution automatically, so that the end user does not see the region. In the background they are doing something called DNS load balancing. A global load balancer does health checks to see which regions are available to run the app on.
Let’s say we have an app and it’s running on myapp.com. If we deployed the app on two regions, the load balancer will check each one to see if it’s still up. If we are running Active-Passive, traffic is only being redirected to one region/CName, which in the example below is Canada Central.
Active — myapp.canada-central.com
Passive — myapp.canada-east.com
The Snowflake data warehouse still uses regions in its URL explicitly, so applications need to know which datastore to point to.
Since we talked about regions and load balancing, let’s talk about the time it takes to recover data.
RPO and RTO
So we are ready to back up our data, but what are our metrics? RPO and RTO are the two important ones. RPO stands for recovery point objective: the maximum window of data you can afford to lose, i.e., how far back your last recoverable copy can be when ingestion stops. RTO is the recovery time objective: how long it takes to bring the system back up. The diagram below provides a great reference.
Many data stores have trade offs between RTO, RPO and cost, which the AWS article below does a great job explaining.
Examples of Cross Regional Replication
Below are two diagrams, one for AWS Aurora and one for Snowflake. The Aurora diagram describes how clusters replicate across regions and their behaviour by using read and write replicas in the cluster. The Snowflake diagram demonstrates the leader/follower flip for region syncs when the main region goes down.
Network Protections
We’ll cover more detailed network protection in the dedicated section, but for transactional data stores you don’t want to expose them to end users directly.
A typical architecture has a front end/mobile device going through a middleware/API gateway to a backend service, which then connects to your DB. As such, your database can sit in an isolated virtual network that can only be accessed by the backend services.
In AWS and GCP these network isolations are known as Virtual Private Clouds (VPCs) with private subnets and in Azure as VNETs. You also want to enable private endpoints so that your data flows between different connection points through the cloud provider’s backbone rather than the public internet. Below we have three diagrams that reference private network access in Azure, AWS and GCP.
Attack Vector — DB DDoS Attacks
One of the reasons to isolate your database from end users is to prevent it from going down due to excessive traffic. Malicious actors could try connecting many times and downing the database directly with connection timeouts and requests.
At provision, Databases for PostgreSQL sets the maximum number of connections to your PostgreSQL database to 115. 15 connections are reserved for the superuser to maintain the state and integrity of your database, and 100 connections are available for you and your applications.
PostgreSQL typically can handle about 100 connections, so a small DDoS attack can easily take down the database or even a cluster. So instead of running the risk, we can isolate the network policies and connections such that only our apps can connect.
Attack Vector — Dictionary Attack
A dictionary attack focuses on using a list of known passwords to try to log into a system. If we make it more difficult for users to access production systems, they won’t have an opportunity to attack. Or rather they’d have to compromise another part of the system to gain access, hopefully triggering alarms while this is happening and slowing them down enough to have a defensive team fix things.
Database Encryption
An important aspect when you’re protecting your database is to think about how you encrypt the data. You can encrypt in a variety of ways which we will cover below.
Encryption At Rest
In general, encryption at rest focuses on protecting your hardware if it is stolen. If someone runs into your server space or datacenter and takes a drive, they won’t be able to decrypt it. Each company does it in a slightly different way, but it’s a well supported best practice and available for almost every cloud service. Your applications will not notice that the data is encrypted because it is decrypted at runtime by the database; the encryption only applies to the data sitting on disk before it is read.
- Oracle and SQL Server TDE. This technique focuses on encrypting the database at the page level and storing the encryption key separate from the data store. When in a cloud environment you can either provide a Customer Managed Key (CMK) or rely on a random one from the cloud provider. TDE is usually set up at creation time.
- Aurora Cluster Encryption. Similar to TDE, Amazon’s managed database encrypts data, snapshots and logs. Likewise you can set up a CMK for the cluster.
- PostgreSQL Partition Encryption. This is the technology used for encryption at rest in managed PostgreSQL services like RDS, Cloud SQL or Azure Database for PostgreSQL. Each cloud provider uses a slightly different technique for encryption management so it’s worthwhile to read up: GCP, Azure, AWS
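As a rough illustration of what enabling TDE looks like on SQL Server, the sketch below runs the standard T-SQL statements through pyodbc. The certificate, database and connection string names are placeholders, and it assumes a server certificate already exists; in managed cloud offerings this is usually just a setting at creation time.

import pyodbc

# Placeholder connection string for a SQL Server instance where a server
# certificate (MyServerCert) has already been created under the master key.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=reviews;UID=admin;PWD=...",
    autocommit=True,
)
cur = conn.cursor()

# Create a database encryption key protected by the server certificate,
# then turn on transparent data encryption for the database.
cur.execute("CREATE DATABASE ENCRYPTION KEY WITH ALGORITHM = AES_256 ENCRYPTION BY SERVER CERTIFICATE MyServerCert;")
cur.execute("ALTER DATABASE reviews SET ENCRYPTION ON;")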
Column Level Encryption
Encryption at rest is great since it protects the hardware, but when data reaches the client side, it’s decrypted. Column level encryption focuses on protecting data when it’s within the client, and limiting a user’s or application’s ability to see it unless they have the right permissions.
Many databases in the past haven’t focused on this feature, so it has given rise to external services that decrypt the data with an intermediary application.
Database native services
Postgresql
The pgcrypto module allows certain fields to be stored encrypted. This is useful when only some of the data is sensitive. The client supplies the decryption key and the data is decrypted on the server and then sent to the client. This process is described in the pgcrypto module docs below.
“The decrypted data and the decryption key are present on the server for a brief time while it is being decrypted and communicated between the client and server. This presents a brief moment where the data and keys can be intercepted by someone with complete access to the database server, such as the system administrator.”
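Here’s a minimal sketch of pgcrypto from Python, assuming a reviews table with a bytea column and a key fetched from a secret store; the table, column and DSN are invented for illustration.

import psycopg2

key = "key-from-secret-store"  # symmetric passphrase; never hard-code this in real apps
conn = psycopg2.connect("dbname=reviews sslmode=require")  # placeholder DSN
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS pgcrypto;")

# Encrypt on the way in...
cur.execute(
    "INSERT INTO reviews (user_email) VALUES (pgp_sym_encrypt(%s, %s))",
    ("alice@example.com", key),
)
conn.commit()

# ...and decrypt on the way out; note the plaintext and key briefly exist on the server
cur.execute("SELECT pgp_sym_decrypt(user_email, %s) FROM reviews", (key,))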
MySQL
MySQL has the capability to encrypt data using a variety of cryptographic functions, including encryption and hashing. These are usually applied at query time using built-in functions such as AES_ENCRYPT and AES_DECRYPT.
It is also possible to decouple encryption from the database by inserting already-encrypted values and then using the decrypting functions with cryptographic parameters (initialization vector, password, etc.). While this requires more setup per column and storing the keys somewhere (or restricting access within the DB), it does make column level encryption possible in MySQL.
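A minimal sketch of the first approach using MySQL’s built-in functions via pymysql; the table, column (a VARBINARY/BLOB) and key handling are illustrative only.

import pymysql

key = "key-from-secret-store"
conn = pymysql.connect(host="mysql.internal", user="app", password="...", database="reviews")
cur = conn.cursor()

# AES_ENCRYPT/AES_DECRYPT run inside MySQL, so the key travels with the query
cur.execute("INSERT INTO customers (card_no) VALUES (AES_ENCRYPT(%s, %s))", ("4111111111111111", key))
conn.commit()
cur.execute("SELECT AES_DECRYPT(card_no, %s) FROM customers", (key,))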
Client-side encryption
SQL Always Encrypted
Another interesting technique for encrypting transactional databases is Microsoft’s Always Encrypted technology. The premise is to send encrypted data to the database, store the keys in a supported keystore (like Azure Key Vault), and then at run time retrieve the keys and decrypt within the database driver.
This approach works when you want to keep the data encrypted until it reaches the client and to segregate data management from data access patterns, as described in the Microsoft docs.
Always Encrypted is a feature designed to protect sensitive data, such as credit card numbers or national identification numbers (for example, U.S. social security numbers), stored in Azure SQL Database or SQL Server databases. Always Encrypted allows clients to encrypt sensitive data inside client applications and never reveal the encryption keys to the Database Engine (SQL Database or SQL Server). As a result, Always Encrypted provides a separation between those who own the data and can view it, and those who manage the data but should have no access.
Two key types are used: column encryption keys, which encrypt the data, and column master keys, which wrap the column encryption keys. These keys can be assigned with role separation as part of the encryption process.
Setting up Always Encrypted is beyond the scope of this tutorial but Microsoft lists the steps for its different databases and those can be found below:
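For a feel of the client side, here is a hedged sketch using the Microsoft ODBC driver through pyodbc; treat the connection string keywords, server and credentials as assumptions and verify them against the driver docs for your version. The point is that decryption happens in the driver, not in the Database Engine.

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=reviews;"
    "UID=app_user;PWD=...;"
    "ColumnEncryption=Enabled;"                      # encrypt/decrypt in the driver
    "KeyStoreAuthentication=KeyVaultClientSecret;"   # fetch the column master key from Key Vault
    "KeyStorePrincipalId=<app-id>;KeyStoreSecret=<app-secret>;"
)

# Columns protected by Always Encrypted come back as plaintext only because the
# driver can unwrap the column encryption key; the server never sees the keys.
row = conn.cursor().execute("SELECT ssn FROM customers WHERE id = ?", 1).fetchone()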
AWS Aurora
A similar approach is supported by AWS Aurora, with integrations with AWS KMS for the key store. The architecture involves using a Customer Managed Key that wraps a Data Encryption Key and then stores the data in the database.
More details can be found within the blog post below.
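The pattern boils down to envelope encryption: ask KMS for a data key wrapped by your CMK, encrypt the value locally, and store the wrapped key next to the ciphertext. A rough boto3 sketch, where the key alias and storage details are placeholders and nothing here is Aurora-specific:

import boto3

kms = boto3.client("kms")

# Ask KMS for a fresh data encryption key wrapped by our customer managed key
resp = kms.generate_data_key(KeyId="alias/reviews-cmk", KeySpec="AES_256")
plaintext_dek = resp["Plaintext"]       # use this to encrypt the column value locally
wrapped_dek = resp["CiphertextBlob"]    # store this alongside the ciphertext in the table

# Later, to read the data: unwrap the DEK with KMS and decrypt the value locally
plaintext_dek_again = kms.decrypt(CiphertextBlob=wrapped_dek)["Plaintext"]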
External Services
We’ve now reviewed a couple of column level encryption features within popular databases, but there are a decent number of cross db encryption services out there. These usually involve setting up a proxy to intercept your traffic to/from the database, checking your role and then decrypting the data if the role matches. The data is usually stored in virtual tables with metadata to allow the service to know which steps have been taken for encryption and decryption.
Why would you want to use a cross db/proxy service?
- You don’t want to change your app architecture.
- You have multiple databases and want central access controls and key management.
- You want centralized data visibility and classification.
The first reason is based on modifying how apps interact with databases. If you have a legacy or tightly coupled monolith, changing the database layer to accept encrypted values and decrypting them within the app requires refactoring and careful redeployments. If you’re running a micro-service oriented application, or have the ability to run a sidecar application within your infrastructure, using a third-party service to handle data protection becomes less compelling.
The second reason focuses on leveraging connectors that vendors have built that connect to different flavours of SQL and other data stores. These abstract away some of the challenges of writing your own custom code to manage the different database connections. Likewise, each database (or cloud) may use different key stores and integrating them to your database takes time. Finally, because many databases have different access patterns, setting up single sign on (SSO) for a proxy is easier than trying to do it across many tools, again delegating the access responsibility to the vendor service.
The third reason is related to data governance and data discovery. If you use a common service for access patterns, you have one place to look for your logs.
There are also a couple of downsides to using third-party encryption services.
- Performance
- Vendor lock-in
- Fragmented IAM model
Performance as a concern is fairly obvious as now you have both an encryption overhead and a proxy application overhead added to the equation.
Vendor lock-in refers to how your data is usually encrypted and structured within the database. Even if you have your own keys, data is generally stored in a type of virtual table that requires the proxy application to decode and use the metadata to transform the encrypted data into usable data. Reverse engineering this may take substantial time, as can doing bulk copies out through the proxy.
Fragmented IAM may seem a little counter intuitive given one of the advantages was central management. The challenge is most of your IAM policies won’t run purely through your vendor encryption proxy. You’ll have other services that will use policies within your cloud environment. Since security incidents often involve privilege and role escalations, correlation between logs and permissions from your IDP and infra vs data proxy will be more difficult. Now you have to monitor another system and deploy IAM policies across multiple infrastructure environments.
Finally, encryption vendors don’t magically solve encryption, they just hide the problem from you. They still have to deal with implementing all the security controls like network protections, least privilege, key rotations, etc.; it just becomes part of their mandate. This might be a good thing to lower your overhead, but it isn’t a magic bullet.
One more aspect to bring up is the “last key” problem. No matter how many times you encrypt your data, you have to store a key somewhere, and to make your data useful, that key needs to be accessible. This is important when designing your application: you’ll add processes that add complexity and delay, but still have to give all your applications and users access to the decryption service. You can read more about data level protection in the previous blog post, 7 Layers of MLOps Security: securing your data.
Encryption in Transit
HTTPS Communication and TLS
We’ve discussed data encryption and encryption at rest; the third layer is encryption in our networking channel. Encrypting in transit is important to prevent people with access to your traffic from snooping on your data and potentially intercepting it. This might be someone on the internet staging a man-in-the-middle attack, or a disgruntled employee in your datacentre connected to the network.
For a more detailed (and more fun) intro to HTTPS, see the set of comics below.
The basic premise is to use asymmetric encryption via certificates to establish trust between the parties, and then exchange symmetric session keys that are used to encrypt the actual messages.
If you’re interested in learning more about the TLS and HTTPS details, this blog offers a good reference and so does the wiki page with links to the RFC standards. While this section doesn’t go into depth about the differences and algorithms for symmetric vs asymmetric encryption, it should provide you with a review or launching point for further research.
Connecting to A Database
After our primer on TLS/HTTPS, how do we use it with our database? Most providers enable (and enforce) encryption in transit out of the box, so you don’t have to worry about it. If you are running your own database server, you will have to deal with certificate setup, and each database has its own steps to follow.
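If you do manage your own certificates, the client side is mostly a matter of passing the right TLS options to your driver. A PostgreSQL-flavoured sketch, where the host, CA bundle path and credentials are placeholders:

import os
import psycopg2

conn = psycopg2.connect(
    host="reviews-db.internal",
    dbname="reviews",
    user="app",
    password=os.environ["DB_PASSWORD"],
    sslmode="verify-full",                          # require TLS and verify cert + hostname
    sslrootcert="/etc/ssl/certs/db-ca-bundle.pem",  # CA that signed the server certificate
)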
Beyond channel encryption
In addition to relying on TLS to encrypt your data, you can also use the same principles with different keys to do so at the data level. One example is configuring a banking PIN. This is a critical piece of information which you may want to wrap with your own key rather than just relying on a certificate authority.
While there are now more modern ways to deal with authenticating, some data may need additional keys in case the internet’s (or certificates’) security gets compromised. If you’re sending a password or a pin, you may want to encrypt it with a different private/public key pair so that the attacker would need both to compromise your application.
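A minimal sketch of that idea with the Python cryptography library: the client encrypts the PIN with an application-level public key distributed out of band (the key file name is hypothetical), so even if the TLS channel were compromised, only the backend holding the matching private key could recover it.

from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

with open("app_public_key.pem", "rb") as f:   # hypothetical key shipped to the client
    public_key = serialization.load_pem_public_key(f.read())

ciphertext = public_key.encrypt(
    b"4921",                                  # the PIN, already travelling inside TLS
    padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()), algorithm=hashes.SHA256(), label=None),
)
# Send ciphertext to the backend, which decrypts it with the matching private key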
Different Types of Data Keys
We’ve started talking about keys, so let’s continue! When encrypting data, there are usually many keys involved, and they may vary depending on the service. Let’s go over some common terms.
Master Key
This may be held at the account or database level, and may include multiple masters as seen above in Snowflake.
KEK — Key Encryption Key
The key encryption key is used to encrypt/decrypt another key or collection of keys.
DEK — Data Encryption Key
A data encryption key is used to encrypt the data file, table or partition. Sometimes there may be more than one key in the data key hierarchy, as seen in the Snowflake example above. A good read on the topic can be found below.
CMK — Customer Managed Key
Often for large enterprises, there is a desire to use your own keys for encrypting data within cloud infrastructure or SAAS platforms. This has different ways of manifesting whether it’s customer managed keys, bring your own key or hold your own key. Depending on the platform, your data may be encrypted with just your key or with a composed key, with an example of how Snowflake does it below.
BYOK — Bring your own key
To extend on CMKs, some platforms differentiate between BYOK and HYOK. The main distinction is that for BYOK, the customer key is used for decrypting data at the infrastructure level and is stored in an integrated keystore. For HYOK, the data infrastructure and keystores are more decoupled and may or may not decrypt data before returning it to users, depending on the integration points.
Each data platform’s implementation is different, so make sure to read more into it if you have requirements.
Re-Keying
It’s good practice to rotate your keys on a schedule to make sure they aren’t compromised. When you do so, data that is encrypted (and any downstream keys) needs to be decrypted with the old key and re-encrypted with the new key. Ideally this happens in the background across different nodes so it doesn’t disrupt users, but depending on the service, it may cause outages.
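At the application level the same idea can be sketched with the cryptography library’s MultiFernet, which decrypts with the old key and re-encrypts with the new one; managed services do an equivalent dance over every partition in the background.

from cryptography.fernet import Fernet, MultiFernet

old_key = Fernet(Fernet.generate_key())   # stand-in for the key currently protecting data
new_key = Fernet(Fernet.generate_key())   # key produced by the scheduled rotation

token = old_key.encrypt(b"customer-email@example.com")

# MultiFernet encrypts with the first key and can decrypt with any of them
rotator = MultiFernet([new_key, old_key])
rotated = rotator.rotate(token)           # decrypt with the old key, re-encrypt with the new
assert new_key.decrypt(rotated) == b"customer-email@example.com"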
Impact On Service
Depending on the service and integration, you may have some downtime when rotating keys. When using a customer managed key with Snowflake in Azure, there is downtime while rekeying is done based on how Azure Key Vault secret versions are used.
At other times, when customer keys or platform keys are not accessible, different services take different approaches.
Please consult your storage provider’s docs and customer service if you have questions, because this may affect your disaster recovery, uptime (SLA/SLOs) and deployment/maintenance processes. One of the worst things a company can do is to crypto-lock their data due to poor key management.
Below are two examples on how GCP handles keys for BigQuery and Spanner, including key availability.
Data Warehouses
So we’ve talked about dealing with transactional data, but how do we analyze that data? Typically we don’t want to impact our production datastore for analytics, so we either stream those events somewhere else or use batch jobs to extract them. We can either move the data into a datalake with some compute to execute the queries or into a data warehouse.
Data warehouses are typically SQL based and optimized for executing large queries. TPC provides a common set of benchmark datasets and query workloads that many organizations measure their performance against, designed to mimic common query types.
Many data warehouses now support semi structured data, and even unstructured data, so they are blending in with traditional datalakes. Databricks and AWS even released the concept of a data lakehouse (datalake warehouse), making it even more ambiguous.
At the end of the day, if you need to query big data, data warehouses are great, but also need to be secured since they may be housing the majority of your data.
Tip: Proper RBAC (Role Based Access Control) is even more important for data warehouses since they hold many applications’ data.
Tip: When configuring any data warehouse remember the key principles: network, identity, least privilege, encryption, logging/alerting.
One topic we haven’t touched on is multi tenancy, how we can set up our data warehouse to accommodate all our datasets and users. In the following section we’ll deep dive into the topic.
Multi Tenancy
When setting up your data warehousing solution you should be thinking about how to scale it as more applications onboard. Whether you are an enterprise doing a POC or a startup needing to analyze your data, you’ll usually start by moving one app’s data into your warehouse. At this point you’re probably writing your setup scripts by hand and trying to get things working. Hopefully before the next dataset onboards, you solidify your design and automation, otherwise it will be harder to change and manage.
So how can we scale our warehouse to accommodate more teams? Let’s discuss some terminology first.
Account/Subscription
At the cloud infrastructure level, we deploy our resources into Accounts or Subscriptions, in the case of AWS/Snowflake and Azure respectively. Typically we isolate these by LOB, environment (dev vs prod) and workload type. Depending on your architecture, this may vary, but it’s a good logical separation to be aware of. Usually at this logical level you’ll deploy your data warehouse. Below is an example of deploying your Redshift cluster to a separate account from your main applications.
Cluster/Server
Once you’ve created your account/subscription, you can deploy your cluster/server within it. Clusters/servers can contain many databases, and sometimes, in the case of Azure, different types of SQL databases (SQL, Synapse, Hyperscale, Serverless). Many of your security settings will be configured at the server level, and you may be able to do cross-database queries depending on the platform.
Database
A database in the traditional sense focuses on storing data for one application or a collection of applications. Since we are aggregating data across applications in a data warehouse, we can use a database to store:
- All the data
- All data for a line of business
- All data for an app or small collection of apps
Depending on your platform, databases can be quite flexible, allowing 100s or even 1000s to be created (Redshift, Snowflake), or follow a pattern of just one database (Synapse).
Schema
A schema is a collection of datasets/tables with similar permissions or business use cases. We would typically group application data under one schema, where our schema could be our “app” or line of business, depending on how big your company is.
In the case of Snowflake, you’d have an account which you’d connect to as a user. Then you’d have a database, schema and table that you’d specify at query time. So your select statement would be something like select * from [database].[schema].[table].
Redshift would be very similar to Snowflake.
SELECT * from db1.public.table1 ORDER BY c1;
When setting up Azure Synapse Analytics (previously SQL Data Warehouse), Microsoft recommends using schemas and placing all your data into one database, as outlined below:
Included are recommendations for consolidating workloads, security, domain, and functional boundaries by using user-defined schemas:
- Use one database to run your entire analytics workload.
- Consolidate your existing analytics environment to use one database.
- Leverage user-defined schemas to provide the boundary previously implemented using databases.
In this case your query would be select * from [schema].table.
Table/Dataset
Tables are where your data is stored usually with a predefined schema.
Multi Tenancy Decision Making
Now that we’ve covered the main building blocks for how a data warehouse is configured, let’s look at what factors influence our design.
1. Development environments
2. Dollar Cost
3. Time Cost
4. Data Sharing
5. Compliance
6. Contractual Obligations
7. Disaster Recovery/High Availability
Factors 1, 4 and 7 are, in my view, the most important when designing your solutions. We’ll see in the case study below how factor #1 matters when determining your multi tenancy structure in Snowflake. Factor #4 becomes a challenge when your data warehousing strategy involves more teams, but is rapidly being addressed by cloud vendors as they try to make it easier to share data. Factor #7 matters as your solution gains maturity and disaster recovery becomes important. You can check out some links on disaster recovery below.
Multi Tenancy Case Study— Snowflake
Since we don’t have time in this article to do a deep dive on multi tenancy for each data warehouse, we’ll do it for one, Snowflake.
Snowflake accounts are free and compute is serverless, so you don’t have to worry about per-account cost and can focus on effective segmentation. There are a couple of approaches with different pros and cons. To illustrate our use case, let’s assume we are a store that has two main product lines, food and outdoor equipment. We have dev, non-prod and prod environments and are looking to build out our analytics capabilities.
One Account, Database Per Environment
We create three databases within our Snowflake account. Each database has the same schemas and tables, but only prod has the real data. There are two schemas, one for each LOB: food and outdoors. Finally, the foods schema has customers and products tables.
This approach is pretty simple and will accommodate many organizations’ needs. Data ingested can be loaded via standard staging tables and most permissions are granted at the schema level for data management.
However there are some limitations:
- You can’t experiment on roles and hierarchies since the user admin is a privileged role. You can very easily wipe out users/roles in prod.
- If you are connecting over a private link, you will have to choose which network you are connecting to. This would not mirror your infrastructure deployment and may cause confusion and backdoor channels for getting prod data into dev if you aren’t careful.
- If your business continues to grow and you add many more product lines, you will be limited in how you segment them.
- Billing by default is at the database level: if you need to implement chargeback you need to do some extra work to parse out data storage costs at the schema level.
- Your roles will need to include which environment you’re working with, which may create challenges in doing upgrades across environments as the naming convention will not be the same. It may also cause issues with your SSO/SCIM solution.
Account Per Environment, Database Per App
The second approach is to create three Snowflake accounts, with non-prod and prod mirroring each other’s configuration. The dev account is more flexible for experimentation and allows infrastructure, security and data engineers to be granted privileged accounts for faster iteration with minimum risk.
There are fewer limitations, but some to take note of:
- Creating databases is a fairly privileged action, so your automation will need to be granted this permission to make the process seamless.
- Schemas will need to be a well defined construct otherwise chaos will ensue. There are many options here, but having experimentation, project and stable zones is usually beneficial. You also need to make sure proper schema level permissions are granted since they govern the data at the end of the day.
Account Per Environment, Database Per LOB, Schema Per App
The third approach is to create three Snowflake accounts with non-prod and prod mirroring each other’s configuration as before, but with fewer databases and more schemas in each. This limits the number of database onboarding steps, but also has limitations:
- Billing by default is at the database level, so if each app needs chargeback you’ll have to do some extra work to parse out data storage costs at the schema level.
- You have to be more careful with schema permissions since there is more data per database.
Additional Considerations
- Account setup. When considering a multi-tenant environment, onboarding a new team and doing the initial setup has its own steps.
- Creation of roles. This is its own topic, but you’ll need to define new roles at onboarding time for new apps. They should follow a well defined hierarchy and be automated.
- Read/Write/Delete permissions. In Snowflake you must be able to read data if you would like to write data. However, masks can be applied to turn the read data into meaningless strings. You may also need to create intermediary tables and views since your user may be assigned many roles, but can only use one role at a time.
- Virtual Warehouses per app/role. We haven’t talked about the compute portion of Snowflake, but to execute queries you’ll need a Virtual Warehouse. Similar to storage, you need to configure this as well.
- Automation. You will probably need a control plane application that runs the automation on top of SCIM/SSO. This will help assign roles dynamically for new roles or workflows.
- Python SDK or SQL connector. Snowflake recently announced a REST API for executing most commands, but you can also use SQL drivers and execute commands within an application.
I’ll be writing a more detailed article on Snowflake best practices in the future for more details.
Data Warehouse Details
For each of the data warehouses, I’ve included a couple important points below in each category for further reading. This list is by no means exhaustive and only a prompt.
Snowflake
One of the items I didn’t cover was private link connectivity to Snowflake. This is a cool feature that many SAAS providers don’t have, and it allows you to extend your virtual network to Snowflake. In combination with security groups, private networking and firewalls, you can make sure all your data movement happens within your governed network. This feature is available on AWS and Azure, with an AWS diagram below.
Snowflake also has some interesting security features, with data classification, masking and encryption options available at the column and row level.
https://docs.snowflake.com/en/user-guide/security-column-ddm-use.html
https://www.snowflake.com/blog/bringing-the-worlds-data-together-announcements-from-snowflake-summit
Redshift
For Redshift, there are many useful integrations with the AWS logging, monitoring, identity and other data systems. One of the aspects I wanted to highlight is the multi tenancy model described below. Depending on your tenancy model, other aspects will follow, so it’s important to understand what the data sharing capabilities are.
BigQuery
As with the AWS ecosystem, BigQuery has nice integrations with the GCP ecosystem. One particular integration to highlight is the DLP feature combined with column level protection.
You can scan a BigQuery table and see a description of what the DLP engine thinks the metadata is. Afterwards you can use BigQuery’s column level protection policies to restrict user access at a granular level, especially if you found some sensitive fields, like SSNs.
Synapse Analytics
One of the differences between the other three warehouses and Synapse Analytics is that it runs on an Azure SQL Server. This gives you many of the integrations and features found in Azure SQL and makes it easier to up-skill SQL Server developers to become Synapse Analytics users. It is also part of the larger Synapse ecosystem, which aims to provide a single pane of glass for data analytics. More info can be found in the link below.
The AAD native federation is also nice, especially for enterprises with hybrid architectures, so you can use on-prem IDs to access your cloud workloads. We’ll cover this more in the identity federation section.
Datalakes
We’ve talked about mainly structured/tabular data so far, but many ML models need to use other data, which is where a datalake comes in handy. There have been a few iterations of datalakes, but one of the main challenges was storing the data in an organized fashion and being able to run useful queries on them. With advances in distributed storage and computing, there are now many systems that can do this such as Spark or Amazon Athena.
With compute and storage now largely separate for big data technology, data can be stored in object storage such as Azure Blob or AWS S3, and then be processed by analytics or ML systems. We also want to avoid data silos, where different apps or businesses keep their own datasets and standards, so creating a unified standard with strong data segregation and security is important. We’ll talk about two important concepts to achieve this: ACLs and network restrictions.
ACL (Access Control List)
We’ve already briefly spoken about identity and RBAC, but more broadly, many tech platforms use ACLs as a way to limit who can access what. In our datalake case, we define ACLs to control operations such as reads, writes, deletes and a couple of others. We can make ACLs long lived or provision them on the fly to allow just-in-time access.
These ACLs can be very specific (user level) or broad (everyone on the internet), but configuring them allows you to keep data in one place and delineate it. In the example below you could delineate multiple S3 buckets with specific ACLs at the Bucket and Object level.
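As a concrete example (bucket, role and prefix names are invented), the boto3 sketch below grants a training role read-only access to a curated prefix and blocks any public access on the bucket:

import json
import boto3

s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowTrainingRoleReadOnly",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/ml-training"},
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::reviews-datalake/curated/*",
    }],
}
s3.put_bucket_policy(Bucket="reviews-datalake", Policy=json.dumps(policy))

# Belt and braces: make sure no ACL or policy can ever open the bucket to the public
s3.put_public_access_block(
    Bucket="reviews-datalake",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True, "IgnorePublicAcls": True,
        "BlockPublicPolicy": True, "RestrictPublicBuckets": True,
    },
)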
Network Access
Similar to databases, we typically restrict network access to our datalake. It’s usually a good idea as a company is scaling to put it into a different subscription/account to separate business apps from the analytics layer. You can then peer VPCs/VNets and use storage private end points to connect.
Datalake Specific Nuances
Now that we have covered some general protection tips for datalakes, we’ll cover three technologies with some nuances.
Azure Datalake Gen 2
Azure Datalake Gen 2 (ADLS) is a hierarchical storage that combines the capabilities of the original Azure Datalake with the benefits of Azure Blob Storage. When deploying you may use ARM templates and use the SDK to manage the hierarchy which consists of containers and files.
One of the benefits of Azure Datalake is you can use AAD to create roles and permissions for your different containers and files. This means when you’re ingesting or analyzing data you’ll have a common set of identities to use. You can also configure Azure Security Centre to ingest your emitted logs and events from ADLS to see who is accessing your data or changing the configurations of the datalake. You may also use credential passthrough to check user permissions when accessing Azure Datalake with Azure Databricks.
More information on Azure Datalake security can be found below with a set of Microsoft provided recommendations.
S3
S3 is a core AWS service and has recently been used more often as a datalake. With fine-grained permissions, policies and performance, it’s a great candidate for a datalake. AWS doubled down on the potential by releasing Lake Formation a couple of years ago, a framework for managing data and metadata for datalakes.
With additional controls and easy integrations with AWS’s data ecosystem, Lake Formation reduces operational overhead, security risk and complexity.
Deltalake
Delta Lake is an open source project that builds on top of datalakes. It provides a Parquet-based format with additional benefits out of the box, listed below.
- ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
- Scalable metadata handling: Leverages Spark distributed processing power to handle all the metadata for petabyte-scale tables with billions of files at ease.
- Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, interactive queries all just work out of the box.
- Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.
- Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
- Upserts and deletes: Supports merge, update and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.
Note that for ADLS gen 2, you need to use a service principal to access data, while in S3 you can use IAM roles.
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net","<password>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
You can find more setup info at https://docs.delta.io/latest/delta-storage.html
Streaming and Messaging
We’ve mainly covered standard storage, but as streaming technologies evolve and mature to accommodate real-time needs, they can also be considered a data storage environment, even if temporary. With many streaming services having replay capabilities to avoid data loss, it’s important to make sure that your stream is protected, as data will be persisted. I’ve also included a cool article from DZone on whether Kafka is a database.
https://dzone.com/articles/is-apache-kafka-a-database-the-2020-update
Network, Identity, Encryption, oh my!
We’ve covered these topics numerous times already, so please make sure to read about how to secure your streaming service. In this section we’ll focus on some of the interesting security features for common streaming solutions.
One thing to remember is to handle multi tenancy well for streams, otherwise you could get missing data, duplicated data, and general leakage. With a good IAM model and infrastructure automation, each app can get its own topic or deployment, depending on your model. The article below illustrates some of these challenges, covering the integration points between GCP Pub/Sub and Dataflow.
Kafka and Managed Kafka
Apache Kafka is a leading open source streaming platform originally developed by LinkedIn. It has become a standard for processing real-time data and for event-driven architectures, including handling driver matching at Uber. Uber has also written about how Kafka can be configured for multi-region DR.
Kafka can be deployed in its open source version or through a managed provider either in the cloud or on prem. AWS, GCP and Azure all offer a managed version of Kafka, with their own flavours to it. Confluent also offers a managed version of Kafka with many plugins.
You can also checkout this article from DZone about some Kafka security details.
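To make the network/identity/encryption checklist concrete for Kafka, here’s a hedged producer sketch using the confluent-kafka client with TLS and SASL authentication; the broker address, credentials and topic are placeholders, and your cluster may use mTLS or OAuth instead.

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1.internal:9093",
    "security.protocol": "SASL_SSL",              # encrypt traffic in transit
    "sasl.mechanisms": "SCRAM-SHA-512",           # authenticate this client to the brokers
    "sasl.username": "reviews-ingest",
    "sasl.password": "<from-secret-store>",
    "ssl.ca.location": "/etc/ssl/certs/ca.pem",   # CA used to verify the brokers
})

# Per-app topics plus broker-side ACLs keep tenants from reading each other's streams
producer.produce("reviews.raw", key=b"order-42", value=b'{"stars": 5}')
producer.flush()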
Azure Eventhubs
Azure Event Hubs is Microsoft’s real-time streaming solution. It has a number of nice features in terms of integrations, security and Kafka interoperability. You can integrate it with the Azure data suite and connect your applications with the SDK. As per usual, it also integrates with your VNET, AAD and replication services.
https://docs.microsoft.com/en-us/azure/databricks/scenarios/databricks-stream-from-eventhubs
Kinesis and Kinesis Firehose
Kinesis and Kinesis Firehose are AWS’s data streaming technologies. They focus on streaming data from applications to data sinks, and integrate natively with all other AWS services. You’ll notice that the best practices doc covers the topics we’ve been discussing, so hopefully that gives you some confidence :).
Integrating With A Data warehouse or Datalake
Two of the potential destinations for a data stream are a data warehouse or a datalake. In the major providers this is usually a built-in capability, and it allows a secure and scalable integration so you don’t have to build custom apps to do so. For AWS and Azure, Kinesis Firehose and Stream Analytics are the products for the task respectively. You can find more details in the articles below.
Federated Access Models
One of the most compelling reasons for sticking to one cloud provider is having a consistent set of identities that can be provisioned and used for services to authenticate against each other. When using multiple clouds or SAAS products, you’ll usually need a mechanism to bridge the gap. Some SAAS products are quite mature, allowing identities such as Service Principals and IAM roles to be granted between your cloud environment and theirs. Others require you to use SSO + SCIM with local ACLs to do security. The least mature ones support just local accounts. Below is an example of how Snowflake can create an Azure SP in your account to do data ingestion, and how you can manage its access in AAD easily.
Another important part of data security is being able to correlate access and keep it consistent across layers. Many applications use different permissions and permission frameworks between the UI, backend and data layer, which can make it difficult to manage and audit. Ideally you’re using passthrough authentication in at least two layers, so that users have roles that work across the stack, and those roles are a source of truth.
An example is creating an Azure Active Directory group for business analysts that need to access all line of business datasets. Let’s say we call it BA_RETAIL_PROD and add all BAs into it. Then we grant this group read access on our Synapse table. When the user wants to use Databricks or Power BI, these credentials are used for access at all levels of the tool: data, data compute service, visualization.
This example is using the Azure stack, but is applicable to other cloud and customer solutions.
Azure Active Directory With Azure SQL/Synapse Analytics
With Azure Synapse Analytics, you can allow users to log in with their Azure credentials. These users can be added directly to the database, or groups can be added that will then be synced with AAD. Each user and group, however, will need to be added via a T-SQL command, but this can be done using a configured automation identity (managed identity) that can run the scripts.
The downside is that your permissions will only exist in the SQL database/warehouse and not be auditable in Azure AD.
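A small sketch of that onboarding step, run by an automation identity with sufficient rights; the group comes from our earlier example, and the connection string and driver options are purely illustrative.

import pyodbc

# Placeholder connection string: an admin/managed identity that can create users
admin_connection_string = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=mysynapse.sql.azuresynapse.net;DATABASE=retail;"
    "Authentication=ActiveDirectoryMsi"
)
conn = pyodbc.connect(admin_connection_string, autocommit=True)
cur = conn.cursor()

# Map the AAD group to a database user, then grant it read access
cur.execute("CREATE USER [BA_RETAIL_PROD] FROM EXTERNAL PROVIDER;")
cur.execute("ALTER ROLE db_datareader ADD MEMBER [BA_RETAIL_PROD];")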
AWS IAM with Aurora
AWS Aurora also allows IAM roles to be used to login via granting temporary access tokens to avoid using local credentials. An excerpt from the documentation provides a great description with a full link below.
You can authenticate to your DB cluster using AWS Identity and Access Management (IAM) database authentication. IAM database authentication works with Aurora MySQL and Aurora PostgreSQL. With this authentication method, you don’t need to use a password when you connect to a DB cluster. Instead, you use an authentication token.
An authentication token is a unique string of characters that Amazon Aurora generates on request. Authentication tokens are generated using AWS Signature Version 4. Each token has a lifetime of 15 minutes. You don’t need to store user credentials in the database, because authentication is managed externally using IAM. You can also still use standard database authentication.
IAM database authentication provides the following benefits:
- Network traffic to and from the database is encrypted using Secure Socket Layer (SSL) or Transport Layer Security (TLS). For more information about using SSL/TLS with Amazon Aurora, see Using SSL/TLS to encrypt a connection to a DB cluster.
- You can use IAM to centrally manage access to your database resources, instead of managing access individually on each DB cluster.
- For applications running on Amazon EC2, you can use profile credentials specific to your EC2 instance to access your database instead of a password, for greater security.
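A hedged sketch of what that looks like from an application: ask RDS for a short-lived token and hand it to the driver as the password. The cluster endpoint, region, database and user below are placeholders.

import boto3
import psycopg2

ENDPOINT = "reviews.cluster-abc123.ca-central-1.rds.amazonaws.com"

rds = boto3.client("rds", region_name="ca-central-1")
token = rds.generate_db_auth_token(DBHostname=ENDPOINT, Port=5432, DBUsername="app_reader")

conn = psycopg2.connect(
    host=ENDPOINT,
    port=5432,
    dbname="reviews",
    user="app_reader",
    password=token,        # 15-minute IAM token instead of a stored password
    sslmode="require",     # IAM authentication requires an encrypted connection
)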
Using Vendor Products
Sometimes the cloud native services don’t fit your needs, so what do you do? When choosing products from a security perspective, you should still try to integrate your identities with their permissions framework. We’ll cover three important terms on what to look for.
Single Sign on (SSO)
When using a vendor product for data-related capabilities, one of the technical features you will see advertised is the use of Single Sign On. SSO allows users to remember only their main credential and use it across multiple applications. Each application uses an Identity Provider (IDP) to verify that the user credentials are correct and, based on an IDP token/signature, accept the user’s login. The protocol that’s used varies, with some popular ones including SAML, Kerberos and OAuth2. You can read more in the articles below.
On the consumer side, FB, Google, Microsoft and Apple are commonly seen for SSO options. On the enterprise side it’s mainly Microsoft, Okta and Google.
To set up SSO, you need to configure a trust between your application and identity provider, which is fairly straightforward to do. What you’ll also usually end up doing is creating accounts for the users within the application. You’ll need to do this because the app must take the SSO claim that is created and correlate it to something in the system, otherwise it will reject the claim. It’s like showing up to a new doctor or dentist with a health card: they don’t have your basic health records yet, so you’ll still need to fill out some forms.
In many apps, it’s possible to create user profiles by hand, but if you’re part of a large organization, the initial creation is a pain, and keeping track of who joins and leaves the organization is even worse. Thankfully there is a protocol in place to make this easier: SCIM.
System for Cross-domain Identity Management (SCIM)
SCIM allows systems that are trusted to sync identities between each other, usually with one system acting as the source of truth. This allows companies to onboard new applications quickly and have an active sync going to keep things up to date. Most IDPs allow you to use SCIM automatically, so you don’t need to write your own jobs to manage the sync.
Below is a good article to learn more about the SCIM protocol, which is standardized as RFC 7644.
What’s also handy is the ability to map groups and roles using SCIM. With Azure AD, you can typically sync users and their groups, which map to users and roles in many SAAS applications. This means that you can assign permissions to the SAAS roles, and as you make updates within Azure Active Directory, they will automatically be updated in your application. Pretty neat!
Tip: Check to see if service accounts can be synced as well; Azure Databricks enables this, which makes running automation jobs much easier!
OAuth + SCIM for Automation
We mentioned OAuth briefly earlier, but one of the benefits we didn’t discuss was using it for automation accounts. We can use the protocol to grab an access token for an application and use it to log in to our data app! One example is using external (AAD) OAuth with Snowflake and configuring Snowflake to accept tokens with Service Principal identities, as seen in the article below.
Another good example is getting an Azure AD token for a Service Principal in Azure and then using that to log into Azure Databricks, no intermediary steps required.
With these configurations, you can now automate many of your jobs while using only one identity!
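A rough sketch of the client credentials flow behind both examples; the tenant, application IDs and target scope are placeholders, and the exact scope/resource value depends on the platform you’re logging into.

import requests

TENANT_ID = "<tenant-id>"
CLIENT_ID = "<service-principal-app-id>"
CLIENT_SECRET = "<service-principal-secret>"

token_url = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"
resp = requests.post(token_url, data={
    "grant_type": "client_credentials",
    "client_id": CLIENT_ID,
    "client_secret": CLIENT_SECRET,
    "scope": "<target-resource-app-id>/.default",   # e.g. the Databricks or Snowflake OAuth resource
})
access_token = resp.json()["access_token"]
# Pass access_token as a bearer token to the data platform's API or SQL driver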
Bonus Tip: Beyond identity integration, you should also look for the ability to set up a private link between your network and the SAAS vendor.
Zero Trust and MFA
One of the hot recent terms in security is Zero Trust and the methods to achieve it. The premise is that when two entities make a connection, we should put both of them through a validation process to verify who they are, rather than assuming they are safe due to prerequisite conditions.
One example: a company running its own datacentre may assume all computers within it are safe, and thus not have strong identity practices. Now that public cloud has risen to prominence, this is becoming trickier because fewer companies manage their infrastructure directly, and most new companies just run on cloud.
So when logging in as a human user, or service account, verifying identity becomes crucial because your employees and services can be connecting from anywhere (almost)! The ability to create credentials on the fly, quickly disable permissions, use behavioral analytics, and log & audit become key parts to protecting resources and the identities used to access them. This is also why having multi factor authentication becomes even more important, as Identity is your true perimeter.
We mentioned the use of Azure AAD/PIM in the last article, but there are multiple identity solutions out there for SSO like Okta or AWS Cognito. These are useful for user logins especially for enabling SSO across many services.
For production workloads, you may want to use a proxy server/jump box to login for an extra level of logging and authentication. This can make it more difficult to login as an adversary by requiring multiple credentials and having more fine tuned logging. This also encourages using service ids for production login or separate credentials for dev environments, making it easier to limit potentially toxic access.
While logging into your storage infrastructure, it’s important to think about how your identities are managed, and use some of the recommendations from the previous section to keep your data safe!
Conclusion: Bringing it back to MLOps
So we went down many rabbit holes to discuss data storage security, but hopefully it has provided you with a foundation for securing your infrastructure. From each section you’ll have a set of tools to make architectural recommendations on how to secure your storage and to encourage engineers to ask good questions about the security of their solutions.
Below is how we could secure our application from Part 1 if we were deploying it in Azure. We have our private networks configured with private endpoints, encryption in the database using keys from Azure Key Vault, identity configured with Azure AD, and events/logs streamed to Security Center and Azure Storage!
Next Time — Part 3: Moving and Transforming Data
After securing all your different data storage components you may ask, now what? We typically need to move, process and transform our data before it’s ready to be used by an ML model. That’s what Part 3 is for!
In Part 3 we will discuss how to secure your orchestration layer, covering open source (Airflow) and cloud native technologies (ADF, Azure Functions, and AWS Lambdas).
Part 3. Securing Your Orchestrator