Protecting Your Data Engineering and MLOps Storage

Part 2 of the 7 Layers of MLOps Security Guide

45 min readJul 13, 2021

Welcome to Part 2 of our MLOps security guide! To recap, last time we discussed how to protect data and went through an example for a restaurant review app. We’ll continue to reference concepts from that article and build on them as seen in the Table of Contents. Today we’ll dive deeper into protecting the storage that houses this data, focusing on different data storage technologies, and enabling these settings for our previous use case.

Series Links

Part 1. Intro and Data Security
Part 2. Protecting Your Data Storage
Part 3. Securing Your Orchestrator
Part 4. ML Model Security
Part 5. ML Model Hosting
Part 6. Securely Exposing ML Models to Users
Part 7. Logging and Monitoring MLOps Infra

How to read this article

After seeing the reading time and word count you may be thinking, what kind of person would read, let alone write this kind of article. Unfortunately, you’re interested and already here so I’ll provide you some tips.

There are many sections — you probably aren’t working on a problem that covers all of them so feel free to jump to the section.
This article assumes intermediary programming knowledge. Some concepts I define, others I assume you know.
After reading (and understanding) this article you should be able to have a good conversation with your peers about data storage protection. Your basic questions should be answered and you’ll have the vocabulary and concepts to google or discuss more complicated ones.

MLOps Data Architecture
Transactional Datastores
Handling Mistakes
Network Security
Encryption at Rest
Column Level Encryption
Encryption In Transit
Key Management
Data Warehouses
Scaling Warehouses for Multi Tenancy
Datalakes
Streaming and Messaging
Federated Access
Zero Trust and Access
Conclusion

So we have lots of material to cover, so let’s get started!

An architecture for the bits and pieces

Data storage infrastructure is an important component for any MLOps solution since it houses and protects the data. Without the storage, your data can’t really exist anywhere. MLOps requires a significant portion of data engineering and movement, so today’s article and Part 3 on orchestrators, will help us position the data correctly for model training and publishing. Each of the cloud providers has their own interpretation of a data strategy (like the AWS graphic below), so we’ll also call out nuances to each approach.

AWS Lakehouse with many data services — AWS Lakehouse Architecture with many data components| AWS

Build a Lake House Architecture on AWS | Amazon Web Services

Organizations can gain deeper and richer insights when they bring together all their relevant data of all structures…

aws.amazon.com

Finally, we’ve included a general architecture for a MLOps flow below. We can see a couple components: our transactional data store, object storage, data warehousing and streaming. We also connect these with virtual networks and identity providers (and logging, monitoring, but in another article).

The Data Storage Layers of a MLOps Architecture — Simple MLOps architecture in Azure for the Data Storage Layers

We’ll cover each of the 6 main components, enjoy!

Transactional Data Stores

Starting off with our bread and butter: the transactional database! Transactional, ACID compliant databases have been around for decades, but their security continues to evolve.

These databases usually contain our mission critical information: transactions, customer info, SKUs and more. For this data the priority is redundancy, uptime, and throughput. Security has also been important, but before the rise of cloud, companies put these important databases within their datacentres and assumed with some network isolation they’d be safe.

With VPN, cloud, and distributed working models becoming critical to how many companies run, protecting key infrastructure has changed. In this section we’ll discuss different forms of protection: at the network, data, and human level.

Mistakes and Outages

First days at a new job can be stressful. We try to make a good impression, but sometimes a nightmare scenario unfolds.

Today was my first day on the job as a Junior Software Developer and was my first non-internship position after university. Unfortunately i screwed up badly.
I was basically given a document detailing how to setup my local development environment. Which involves run a small script to create my own personal DB instance from some test data. After running the command i was supposed to copy the database url/password/username outputted by the command and configure my dev environment to point to that database. Unfortunately instead of copying the values outputted by the tool, i instead for whatever reason used the values the document had.
Unfortunately apparently those values were actually for the production database (why they are documented in the dev setup guide i have no idea). Then from my understanding that the tests add fake data, and clear existing data between test runs which basically cleared all the data from the production database. Honestly i had no idea what i did and it wasn’t about 30 or so minutes after did someone actually figure out/realize what i did.
While what i had done was sinking in. The CTO told me to leave and never come back. He also informed me that apparently legal would need to get involved due to severity of the data loss.
(Story from Reddit)

We’ve probably all heard a story similar to this, a new employee accidentally breaks something in production leading to full on panic mode. We may think of this is as a process or software development issue,but it’s actually a security problem as well. Poorly documented processes can severely hurt a company, whether through data loss, leaked credentials or opportunity cost of recovery.

Leaked credentials are fairly obvious in how they can harm a company, but what about data loss? Let’s go through a couple of scenarios.

Harms other investigations. If we lose data for an ongoing investigation whether direct data, or data that can be correlated, we lose the ability to find and patch issues.
Limits ability to do anomaly detection. Many actions don’t exist in a vacuum, so correlating past actions to current actions is important to determine if an event is malicious. The credential someone just used to read prod data might be ok if this is a support person fixing a P1 bug, but if it’s a random login from a random IP, you might be in trouble.
Limits cross org coordination. In large companies and development environments, attackers may want to compromise a simple system and then move around to a more valuable system. With dataloss on previous cross team incidents, a centralized or decentralized team cannot be as effective.

How Should We Handle Mistakes?

Blameless Postmortems and Root Cause Analysis

One of the biggest dangers to security and system stability within a company is a culture of blame. When teams finger point and try to gain political points for the success of their systems a collaborative and problem solving culture begins to break down. Incidents, risks and responses become harder to find, identify and voice due to a fear of repercussions. One of Google’s SRE (Site Reliability Engineer) recommendations is to have blameless postmortems, i.e figuring out why something happened and instituting fixes (tech and social) so it doesn’t happen again.

Blameless postmortems can be challenging to write, because the postmortem format clearly identifies the actions that led to the incident. Removing blame from a postmortem gives people the confidence to escalate issues without fear. It is also important not to stigmatize frequent production of postmortems by a person or team. An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug, leading to greater risk for the organization [Boy13].

With a culture and leadership practice that focuses on speedy remediation and prevention, teams can be transparent about their past issues and assist each other in the future. It will also make it easier for individuals to admit fault and learn from their mistakes, leading to good learning opportunities and rapport for the individual and team.

Bug Bounties

Even with the best processes and people, internal teams still make mistakes. Whether it’s familiarity blindness or pressure to ship features, having an extra set of eyes is always useful to remediate security issues. Many companies bring 3rd party firms or consultants to do so, but those are usually expensive and occur on an infrequent basis.

Another option for continuous monitoring and engagement is to have a policy of bug bounties: rewards to incentivize security professionals to report security vulnerabilities through a secure channel for a financial reward. The idea is to promote a financial reward and recognition for finding issues rather than keeping them quiet and risking the exploit being sold to criminals.

What if I say your Email ID is all I need to takeover your account on your favorite website or an app. Sounds scary right? This is what a bug in Sign in with Apple allowed me to do.

The excerpt above is from a developer who found an exploit in Apple’s sign in process, and was paid $100k for securely reporting it. Big tech companies have been embracing these programs for a number of years and governments have also joined in. The US Department of Defence also has a program called “Hack the Pentagon” that was launched in 2016, and continues to help identify thousands of bugs.

Bug bounties are definitely an interesting topic and you can read more about them here.

Disaster Recovery and High Availability

Another important aspect to discuss is how to minimize the risk of downtime and dataloss if your systems go down. Most cloud databases/data stores make this easy to configure, so we’ll go over the foundational concepts first.

Service Level Agreement (SLA)

When using designing an app or using a cloud service, you need to decide how often you’re willing for it be unavailable. While we do want our apps and data to be running 100% of the time, this is often not feasible with technology. As such, we use SLAs to define our requirements. SLAs are given by “number of nines” such as a 99.9% uptime.

SLA by number of nines — Achieving an SLA over 99.99% for applications is challenging | Wikipedia

As the image above shows, getting beyond “four nines” is increasingly challenging, and most companies max out for transactional stores and applications. Some data-stores have higher availability but make trade offs in other dimensions.

Service Level Agreements in the Cloud: Who cares?

Service Level Agreements (SLAs) in the Cloud. There have been many articles written on the topic, but still there is…

www.wired.com

Regions and Availability Zones

In the world of cloud, your applications and data are replicated across many datacenters, which are called availability zones. These then form a cluster called a region, as demonstrated in the diagram below.

How Azure Replicates Data for Blob Storage | Azure

Data redundancy - Azure Storage

Azure Storage always stores multiple copies of your data so that it is protected from planned and unplanned events…

docs.microsoft.com

DNS Load Balancing

Now that we’ve talked about regions and availability zones, how do we get users to access data in them? Typically different datacenters were connected by DNS to be able to load balance and resolve applications.

Today some datastores, like Azure SQL, do DNS resolution automatically, so that the end user does not see the region. In the background they are doing something called DNS load balancing. A global load balancer does health checks to see which regions are available to run the app on.

Azure SQL with no region in the URL | Microsoft

Let’s say we have an app and it’s running on myapp.com. If we deployed the app on two regions, the load balancer will check each one to see if it’s still up. If we are running Active-Passive, traffic is only being redirected to one region/CName, which in the example below is Canada Central.

Active — myapp.canada-central.com

Passive — myapp.canada-east.com

The Snowflake data warehouse still uses regions in its url explicitly, so applications need to know which datastore to point to.

Load balancing is becoming easier with new tech | GCP

Since we talked about regions and load balancing, let’s talk about the time it takes to recover data.

RPO and RTO

So we are ready to back up our data, but what are our metrics? RPO and RTO are two important metrics. RPO stands for recovery point objective, which is the time that data is lost during downtime, i.e when data cannot be ingested. RTO is the recovery time objective: how long it takes to bring up the system. The diagram below provides a great reference.

Recovery Time (RTO) and Recovery Point (RPO) in DR Strategy | MSP360

RPO and RTO: Understanding the Differences | Enterprise Storage Forum

Disasters come in many forms. Corruption, theft, loss, or natural disaster can all take down your applications and…

www.enterprisestorageforum.com

Many data stores have trade offs between RTO, RPO and cost, which the AWS article below does a great job explaining.

Disaster Recovery (DR) Architecture on AWS, Part I: Strategies for Recovery in the Cloud | Amazon…

As lead solutions architect for the AWS Well-Architected Reliability pillar, I help customers build resilient workloads…

aws.amazon.com

Examples of Cross Regional Replication

Below are two diagrams, one for AWS Aurora and one for Snowflake. The Aurora diagram describes how clusters replicate across regions and their behaviour by using read and write replicas in the cluster. The Snowflake diagram demonstrates the leader/follower flip for region syncs when the main region goes down.

Disaster recovery in Snowflake across regions/accounts | Snowflake

Network Protections

We’ll cover more detailed network protection in the dedicated section, but for transactional data stores you don’t want to expose them to end users directly.

The architecture is typically a front end/mobile device, through a middleware/api gateway that connects to a backend service and then to your db. As such your database can sit in an isolated virtual network that can only be accessed by the backend services.

User Connecting to API Gateway that maps to an Azure function then to a private Azure SQL — Users can only connect via the API gateway

In AWS and GCP these network isolations are known as Virtual Private Clouds (VPCs) with private subnets and in Azure as VNETs. You also want to enable private endpoints so that your data flows between different connection points through the cloud provider’s backbone rather than the public internet. Below we have three diagrams that reference private network access in Azure, AWS and GCP.

Azure VNET connection between a Azure VM and Azure SQL with a vnet peering — Azure VNET connection between a Azure VM and Azure SQL | Azure Docs

AWS VPC with a public application subnet and private db subnet | AWS Website

Cloud SQL connection from VM using private IP — GCP VPC with a private Cloud SQL instance | GCP Website

Attack Vector —DB DDOS Attacks

One of the reasons to isolate your database from the end user to prevent it from going down due to excessive traffic. Malicious actors could try connecting many times and downing the database directly with connection timeouts and requests.

At provision, Databases for PostgreSQL sets the maximum number of connections to your PostgreSQL database to 115. 15 connections are reserved for the superuser to maintain the state and integrity of your database, and 100 connections are available for you and your applications.

Postgresql, typically can handle 100 connections so you can easily down the database and even a cluster with a small DDOS attack. So instead of running the risk, we can isolate the network policies and connections such that only our apps can connect.

Attack Vector — Dictionary Attack

A dictionary attack focuses on using a list of known passwords to try to log into a system. If we make it more difficult for users to access production systems, they won’t have an opportunity to attack. Or rather they’d have to compromise another part of the system to gain access, hopefully triggering alarms while this is happening and slowing them down enough to have a defensive team fix things.

Database Encryption

An important aspect when you’re protecting your database is to think about how you encrypt the data. You can encrypt in a variety of ways which we will cover below.

Encryption At Rest

In general encryption at rest focuses on protecting your hardware if it is stolen. If someone runs into your server space or datacenter and takes a drive, they won’t be able to decrypt it. Each company does it in a slightly different way, but it’s a well supported best practice and available for almost every cloud service. Your applications will not notice that the data is encrypted because it is decrypted at runtime in the db, the encryption exists only before the data is read.

Oracle and SQL Server TDE. This technique focuses on encrypting the database at the page level and storing the encryption key separate from the data store. When in a cloud environment you can either provide a Customer Managed Key (CMK) or rely on a random one from the cloud provider. TDE is usually set up at creation time.
Aurora Cluster Encryption. Similar to TDE, Amazon’s managed database encrypts data, snapshots and logs. Likewise you can set up a CMK for the cluster.
Postresql Partition Encryption. This is the technology that’s used for encryption at rest for managed Postgresql for services like RDS, Cloud SQL or Azure managed Postgresql. Each cloud provider uses a slightly different technique for encryption management so it’s worthwhile to read up: GCP, Azure, AWS

Column Level Encryption

Encryption at rest is great since it protects the hardware, but when it reaches the client side, it’s decrypted. Column level encryption focuses on protecting data when it’s within the client, and limiting a user’ or applications’ ability to see it unless they have the right permissions.

Many databases in the past haven’t focused on this feature, so it has given rise to external services that decrypt the data with an intermediary application.

Database native services

Postgresql

The pgcrypto module allows certain fields to be stored encrypted. This is useful when only some of the data is sensitive. The client supplies the decryption key and the data is decrypted on the server and then sent to the client. This process is described by pgcrypto module docs below.

“The decrypted data and the decryption key are present on the server for a brief time while it is being decrypted and communicated between the client and server. This presents a brief moment where the data and keys can be intercepted by someone with complete access to the database server, such as the system administrator.”

MySQL

MySQL has the capability to encrypt data using a variety of cryptographic functions including encryption and hashing. These are usually done at run time by using the functions found within the SQL Statement below.

It is also possible to decouple encrypting data in the database, by inserting encrypted values and then using the decrypting functions with cryptographic parameters (initialization vector, password, ect). This would require more setup per column and require the storage of the keys somewhere (or restricting access within the db), it is possible to set up column level encryption in MySQL.

MySQL :: MySQL 8.0 Reference Manual :: 12.14 Encryption and Compression Functions

Passwords or other sensitive values supplied as arguments to encryption functions are sent as cleartext to the MySQL…

dev.mysql.com

Client-side encryption

SQL Always Encrypted

Another interesting technique for encrypting transactional databases is Microsoft’s Always Encrypted technology. The premise is to send encrypted data to the database, and store the keys in a supported keystone (like an Azure Keyvault) and then at run time retrieve the keys and decrypt within the database driver.

This approach works if the client wants to keep the data encrypted until it reaches the client and segregate data management and access patterns, as described in the Microsoft docs.

Always Encrypted is a feature designed to protect sensitive data, such as credit card numbers or national identification numbers (for example, U.S. social security numbers), stored in Azure SQL Database or SQL Server databases. Always Encrypted allows clients to encrypt sensitive data inside client applications and never reveal the encryption keys to the Database Engine (SQL Database or SQL Server). As a result, Always Encrypted provides a separation between those who own the data and can view it, and those who manage the data but should have no access.

Two key types are used for encrypting, the column encryption keys for the data and the column master keys to wrap the column encryption keys. These keys can be assigned with role separation as part of the encryption process.

Setting up Always Encrypted is beyond the scope of this tutorial but Microsoft lists the steps for its different databases and those can be found below:

Configure Always Encrypted using PowerShell - SQL Server

Applies to: SQL Server (all supported versions) Azure SQL Database Azure SQL Managed Instance The SqlServer PowerShell…

docs.microsoft.com

Configure Always Encrypted by using Azure Key Vault - Azure SQL Database

APPLIES TO: Azure SQL Database Azure SQL Managed Instance This article shows you how to secure sensitive data in a…

docs.microsoft.com

Always Encrypted - SQL Server

Applies to: SQL Server (all supported versions) Azure SQL Database Azure SQL Managed Instance Always Encrypted is a…

docs.microsoft.com

AWS Aurora

A similar approach is supported by AWS Aurora, with integrations with AWS KMS for the key store. The architecture involves using a Customer Managed Key that wraps a Data Encryption Key and then stores the data in the database.

More details can be found within the blog post below.

Performing SQL database client-side encryption for multi-Region high availability | Amazon Web…

Important Update: On 06/16/2021 AWS Key Management Service (AWS KMS) introduced multi-Region keys, a new capability…

aws.amazon.com

External Services

We’ve now reviewed a couple of column level encryption features within popular databases, but there are a decent number of cross db encryption services out there. These usually involve setting up a proxy to intercept your traffic to/from the database, checking your role and then decrypting the data if the role matches. The data is usually stored in virtual tables with metadata to allow the service to know which steps have been taken for encryption and decryption.

Immuta SQL Workflow Architecture | Immuta Docs

Why would you want to use a cross db/proxy service?

You don’t want to change your app architecture.
You have multiple databases and want central access controls and key management.
You want centralized data visibility and classification.

The first reason is based on modifying how apps interact with databases. If you have a legacy or tightly coupled monolith, changing the database layer to accept encrypted values and decrypting them within the app requires refactoring and careful redeployments. If you’re running a micro-service oriented application, or have the ability to run a sidecar application within your infrastructure, using a third-party service to handle data protection becomes less compelling.

The second reason focuses on leveraging connectors that vendors have built that connect to different flavours of SQL and other data stores. These abstract away some of the challenges of writing your own custom code to manage the different database connections. Likewise, each database (or cloud) may use different key stores and integrating them to your database takes time. Finally, because many databases have different access patterns, setting up single sign on (SSO) for a proxy is easier than trying to do it across many tools, again delegating the access responsibility to the vendor service.

A list of Privecera Partners, which means fewer connectors to build — A list of Privacera Partners, which means fewer connectors to build | Privecera

The third reason is related to data governace and data discovery. If you use a common service for access patterns, you have one place to look for your logs.

There are also a couple downsides for using third-party encryption services.

Performance
Vendor lockin
Fragmented IAM model

Performance as a concern is fairly obvious as now you have both an encryption overhead and a proxy application overhead added to the equation.

Vendor lockin refers to how your data is usually encrypted and structured within the database. Even if you have your own keys, data is generally stored in a type of virtual table that requires the proxy application to decode and use the metadata to transform the encrypted data to usable data. Reverse engineering this may take substantial time, likewise to doing bulk copies from the proxy.

Fragmented IAM may seem a little counter intuitive given one of the advantages was central management. The challenge is most of your IAM policies won’t run purely through your vendor encryption proxy. You’ll have other services that will use policies within your cloud environment. Since security incidents often involve privilege and role escalations, correlation between logs and permissions from your IDP and infra vs data proxy will be more difficult. Now you have to monitor another system and deploy IAM policies across multiple infrastructure environments.

Finally, encryption vendors don’t magically solve encryption, they just hide the problem from you. They still have to deal with implementing all the security controls like network protections, least privilege, key rotations etc, it just becomes part of their mandate. This might be a good thing to lower your overhead, but it isn’t a magic bullet.

One more aspect to bring up is the last key problem. No matter how many times you encrypt your data, you have to store the key somewhere. And to make your data useful, you need the key to be accessible. This is important when designing your application: you’ll add processes that add complexity and delay, but then have to still give all your applications and users access to the decryption service. You can read more about data level protection in the previous blog post, 7 Layers of MLOPS Security, securing your data.

Encryption in Transit

HTTPS Communication and TLS

We’ve discussed data encryption and encryption at rest, the third layer is encryption in our networking channel. Encrypting in transit is important to prevent people with access to your traffic from snooping on your data and potentially intercepting it. This might be someone on the internet staging a man-in-the-middle attack, or a disgruntled employee in your datacentre connected to the network.

For a more detailed and funner intro to the HTTPs, see set of comics below.

How HTTPS works

🙀 A cat explains how HTTPS works...in a comic! 😻

howhttps.works

The basic premise is to use asymmetric encryption via certificates to establish a way to communicate between each party. After exchanging public keys, you send encrypted data with symmetric keys that will be used later to encrypt your core message.

If you’re interested in learning more about the TLS and HTTPs details, this blog offers a good reference and so does the wiki page with links to the RFC standards. While this section doesn't go into depth about the differences and algorithms for symmetric vs asymmetric encryption, it should provide you with a review or launching point for further research.

Connecting to A Database

After our primer on TLS/HTTPS, how do we use it with our database? Usually most providers enable (and enforce) encryption in transit and offer it out of the box, so that you dont have to worry about it. If you are running your own database server, you will have to deal with certificate setup which each database has it’s own steps to follow.

Configuring SSL/TLS certificates | Cloud SQL for MySQL | Google Cloud

"type": "thumb-down", "id": "hardToUnderstand", "label":"Hard to understand" },{ "type": "thumb-down", "id"…

cloud.google.com

Using SSL/TLS to encrypt a connection to a DB cluster

You can use Secure Socket Layer (SSL) or Transport Layer Security (TLS) from your application to encrypt a connection…

docs.aws.amazon.com

Connectivity settings for Azure SQL Database and Azure Synapse Analytics - Azure SQL Database and…

APPLIES TO: Azure SQL Database Azure Synapse Analytics This article introduces settings that control connectivity to…

docs.microsoft.com

Beyond channel encryption

In addition to relying on TLS to encrypt your data, you can also use the same principles with different keys to do so at the data level. One example is configuring a banking pin. This is a critical piece of information which you may want to wrap with your own key rather than just relying on a certificate authority.

While there are now more modern ways to deal with authenticating, some data may need additional keys in case the internet’s (or certificates’) security gets compromised. If you’re sending a password or a pin, you may want to encrypt it with a different private/public key pair so that the attacker would need both to compromise your application.

Different Types of Data Keys

We started talking about keys, let’s continue! When encrypting data, there are usually many keys involved, and they may vary depending on the service. Let’s go over some common terms.

Snowflake’s Tri Security Key Hierarchy, root key encrypts account master, encryts table master, which encrypts file keys — Snowflake’s Tri Security Key Hierarchy | Snowflake

Master Key

This may be held at the account or database level, and may include multiple masters as seen above in Snowflake.

KEK — Key Encryption Key

The key encryption key is used to encrypt/decrypt another key or collection of keys.

DEK — Data Encryption Key

A data encryption key is used to encrypt the data file, table or partition. Sometimes there may be more the one key in the data key hierarchy as seen in the Snowflake example above. A good read on the topic can be found below

The Definitive Guide to Encryption Key Management Fundamentals

Back to Top The encryption key life-cycle, defined by NIST as having a pre-operational, operational, post-operational…

info.townsendsecurity.com

CMK — Customer Managed Key

Often for large enterprises, there is a desire to use your own keys for encrypting data within cloud infrastructure or SAAS platforms. This has different ways of manifesting whether it’s customer managed keys, bring your own key or hold your own key. Depending on the platform, your data may be encrypted with just your key or with a composed key, with an example of how Snowflake does it below.

Snowflake Tri Security with Bring Your Own Key | Snowflake

BYOK — Bring your own key

To extend on CMKs, some platforms differentiate between BYOK and HYOK. The main distinction is that for BYOK, the customer key is used for decrypting data at the infrastructure level and is stored at in integrated keystore. For HYOK usually data infra and keystores are more decoupled and may or may not decrypt data before returning it to users, depending on the integration points.

Each data platforms’ implementation is different so make sure to read more into it if you have requirements.

Comparison of bring your own key and hold your own key — BYOK vs HYOK | Secupi

Re-Keying

It’s good practice to rotate your keys on a schedule to make sure they aren’t compromised. When you do so, data that is encrypted (and any downstream keys) need to be decrypted with the old key, and encrypted with the new key. Ideally this happens in the background across different nodes so it doesn’t disrupt users, but depending on the service, it may cause outages.

Impact On Service

Depending on the service and integration, you may have some downtime when rotating keys. When using a customer managed key with Snowflake in Azure, there is downtime while rekeying is done based on how Azure Key Vault secret versions are used.

Other times, when customer keys, or platform keys are not accessible, different services have different approaches.

Please consult your storage provider’s docs and customer service if you have questions, because this may affect your disaster recovery, uptime (SLA/SLOs) and deployment/maintenance processes. One of the worst things a company can do is to crypto-lock their data due to poor key management.

Below are two examples on how GCP handles keys for BigQuery and Spanner, including key availability.

Customer-managed encryption keys (CMEK) | Cloud Spanner | Google Cloud

By default, all data at rest in Google Cloud, including the data in Cloud Spanner, is protected using Google-managed…

cloud.google.com

Protecting data with Cloud KMS keys | BigQuery | Google Cloud

By default, BigQuery encrypts customer content stored at rest. BigQuery handles and manages this default encryption for…

cloud.google.com

Data Warehouses

So we’ve talked about dealing with transactional data, but how do we analyze that data? Typically we don’t want to impact our production datastore for analytics, so we either stream those events somewhere else or use batch jobs to extract them. We can either move the data into a datalake with some compute to execute the queries or into a data warehouse.

Data warehouses typically are SQL based and are optimized for executing large queries. TPC is a common collection of datasets that many organizations measure their performance against, trying to mimic common query types.

Many data warehouses now support semi structured data, and even unstructured data, so they are blending in with traditional datalakes. Databricks and AWS even released the concept of a data lakehouse (datalake warehouse), making it even more ambiguous.

At the end of the day, if you need to query big data, data warehouses are great, but also need to be secured since they may be housing the majority of your data.

According to every major data warehouse, it’s cheaper than the competition :)

Tip : Proper RBAC (Role Based Access) is even more important for data warehouses since there will be many application’s data within

Tip: When configuring any data warehouse remember the key principles: network, identity, least privilege, encryption, logging/alerting.

One topic we haven’t touched on is multi tenancy, how we can set up our data warehouse to accommodate all our datasets and users. In the following section we’ll deep dive into the topic.

Multi Tenancy

When setting up your data warehousing solution you should be thinking about how to scale it as more applications onboard. Whether you are an enterprise doing a POC or a startup needing to analyze your data, you’ll usually start with moving one app’s data into your warehouse. At this point you’re probably write your setup scripts by hand and try to get things working. Hopefully before the next datasets onboards, you solidify your design and automation, otherwise it will be harder to change and manage.

Snowflake object hierarchy, account at the top follow by db, schema and table — Snowflake object hierarchy | Snowflake

So how can we scale our warehouse to accommodate more teams? Let’s discuss some terminology first.

Account/Subscription

At the cloud infrastructure level, we deploy our resources into Accounts or Subscriptions in the case of AWS/Snowflake and Azure respectively. Typically we isolate these for LOBs, environments (dev vs prod) and workload types. Depending on your architecture, this may vary, but it’s a good logical separation to be aware of. Usually at this logical level you’ll deploy your data warehouse. Below is an example of deploying your Redshift cluster to another account compared to your main applications.

Cross AWS Redshift architecture with EC2 access from another account — Cross AWS Redshift architecture | AWS

Cluster/Server

Once you’ve created your account/subscription, you can deploy your cluster/server within it. Clusters/Servers can contain many databases, and sometimes in the case of Azure, different types of SQL databases (SQL, Synapase, Hyperscale, Serverless). Many of your security setting will be configured at the server level and you may be able to do cross database queries depending on the platform.

Database

A database in a traditional sense focused on storing data for one application or collection of applications. Since we are aggregating data across applications in a data warehouse, we can use a database to:

Store all the data
All data for a line of business
All data for an app/small collection of apps

Depending on your platform, a database can be quite flexible and allow 100s or even 1000s to be created (Redshift, Snowflake) or have a pattern of just one database (Synapse)

Schema

A schema is a collection of datasets/tables with similar permissions or business use cases. We would typically group application data under one schema, where our schema could be our “app” or line of business, depending how big your company is.

In the case of Snowflake, you’d have an account which you’d connect to. as a user. Then you’d have a database, schema and table you’d specify at query time. So your select statement would be something like select * from [database].[schema].[table].

Redshift would be very similar to Snowflake.

SELECT * from db1.public.table1 ORDER BY c1;

When setting up Azure Synapse Analytics (prev Sql Data Warehouse) Microsoft recommends using schemas and placing all your data into one database as outlined below:

Included are recommendations for consolidating workloads, security, domain, and functional boundaries by using user-defined schemas:

Use one database to run your entire analytics workload.
Consolidate your existing analytics environment to use one database.
Leverage user-defined schemas to provide the boundary previously implemented using databases.

User-defined schemas within Synapse SQL - Azure Synapse Analytics

In the sections below, you'll find various tips for using T-SQL user-defined schemas to develop solutions within…

docs.microsoft.com

In this case your query would be select * from [schema].table.

Table/Dataset

Tables are where your data is stored usually with a predefined schema.

Multi Tenancy Decision Making

Now that we’ve covered the main building blocks for how a data warehouse is configured, let’s look at what factors influence our design.

Development environments
Dollar Cost
Time Cost
Data Sharing
Compliance
Contractual Obligations
Disaster recovery/ High availability

What factors can influence multi tenancy design

Factors 1,4,7 are probably the most important in my view when designing your solutions. We’ll see in a case study below how factor #1 is important with determining your multi tenancy structure in Snowflake. Factor #4 becomes a challenge when your data warehousing strategy involves more teams, but is rapidly being addressed by cloud vendors as they try to make it easier to share data. Factor #7 is important as your solution gains maturity and disaster recovery becomes important. You can checkout some links below on disaster recovery below.

Building Multi-AZ or Multi-Region Amazon Redshift Clusters | Amazon Web Services

Erik Swensson is an AWS Solutions Architect. AWS Solutions Architect Patrick Shumate also contributed to this post…

aws.amazon.com

Database restore points for Azure Synapse Analytics - Cloud Adoption Framework

One of the key benefits of a modern cloud-based infrastructure such as Microsoft Azure is that features for database…

docs.microsoft.com

Dataset locations | BigQuery | Google Cloud

This page explains the concept of data location and the different locations where you can create datasets. To learn how…

cloud.google.com

Sharing Data Securely Across Regions and Cloud Platforms - Snowflake Documentation

Global Snowflake utilizes database replication to allow data providers to securely share data with data consumers…

docs.snowflake.com

Multi Tenancy Case Study— Snowflake

Since we don’t have time in this article to do a deep dive on multi tenancy for each data warehouse, we’ll do it for one, Snowflake.

Snowflake accounts are free and compute is serverless, so you don’t have to worry about cost, you can focus on effective segmentation. There are a couple of approaches with different pros and cons. To illustrate our use case let’s assume we are store that has two main product lines, food and outdoor equipment. We have dev, non-prod and prod environments and are looking to build out our analytics capabilities.

One Account, Database Per Environment

We’ve create three databases within our Snowflake account. Each database has the same schemas and tables, but only prod has the real data. There are two schemas, one for each LOB, food and outdoors. Finally, the foods schema has customers and products tables.

This approach is pretty simple and will accommodate many organizations’ needs. Data ingested can be loaded via standard staging tables and most permissions are granted at the schema level for data management.

However there are some limitations:

You can’t experiment on roles and hierarchies since the user admin is a privileged role. You can very easily wipe out users/roles in prod.
If you are connecting over a private link, you will have to choose which network you are connecting to. This would not mirror your infrastructure deployment and may cause confusion and backdoor channels on getting prod data into dev, if you aren’t careful.
If your business continues to grow and you add many more product lines, you will be limited in how you segment them.
Billing by default is at the database level: if you need to implement chargeback you need to do some extra work to parse out data storage costs at the schema level.
Your roles will need to include which environment you’re working with, which may create challenges in doing upgrades across environments as the naming convention will not be the same. It may also cause issues with your SSO/SCIM solution.

Account Per Environment, Database Per App

The second approach is to create three Snowflake accounts with non prod and prod mirroring each others’ configuration. The dev account is more flexible for experimentation and allows infrastructure, security and data engineers to be granted privileged accounts for faster iteration with minimum risk.

There are fewer limitations, but some to take note of:

Creating databases is a fairly privileged action, so your automation will need to be granted this permission to make the process seamless.
Schemas will need to be a well defined construct otherwise chaos will ensue. There are many options here, but having experimentation, project and stable zones is usually beneficial. You also need to make sure proper schema level permissions are granted since they govern the data at the end of the day.

Account Per Environment, Database Per LOB, Schema Per App

The third approach is to create three Snowflake accounts with non prod and prod mirroring each others configuration as before, but have fewer databases and with more schemas in each. This limits the number of database onboarding steps, but also has limitations:

Billing by default is at the database level, so if you need to implement chargeback you need to do some extra work to parse out data storage costs at the schema level if each app code needs chargeback.
You have to be more careful with schema permissions since there is more data per database.

Additional Considerations

Account setup. When considering a multi-tenant environment, onboarding a new team and doing the initial setup has its own steps.
Creation of roles. This is its own topic, but you’ll need to define new roles at onboarding time for new apps. They should follow a well defined hierarchy and be automated.
Read/Write/Delete permissions. In snowflake you must be able to read data if you would like to write data . However, masks can be applied to turn the read data into meaningless strings. You may also need to create intermediary tables and views since your user may be assigned many roles, but you can only use one role at a time.
Virtual Warehouses per app/role. We haven’t talked about the compute portion of Snowflake, but to execute queries you’ll need a Virtual Warehouse. Similar to storage you need to configure this aswell.
Automation. You will probably needs a control plane application that runs the automation on top of SCIM/SSO. This will help assign roles dynamically for new roles or workflows.
Python sdk or SQL connector. Snowflake recently announced a Rest API for executing most commands, but you can also use SQL drivers and execute it within an application.

I’ll be writing a more detailed article on Snowflake best practices in the future for more details.

Data Warehouse Details

For each of the data warehouses, I’ve included a couple important points below in each category for further reading. This list is by no means exhaustive and only a prompt.

Snowflake

One of the items I didn’t cover was private link connectivity to Snowflake. This is a cool feature that many SAAS providers don’t have, and allows you to extend your virtual network to Snowflake. In combination with security groups, private networking and firewalls, you can make sure all your data movement happens on within your governed network. This feature is available on AWS and Azure, with an AWS diagram below.

Snowflake also has some interesting security features, with data classification, masking and encryption options at the available at the column and row level.

https://docs.snowflake.com/en/user-guide/security-column-ddm-use.html

https://www.snowflake.com/blog/bringing-the-worlds-data-together-announcements-from-snowflake-summit

Redshift

For Redshift, there are many useful integrations with the AWS logging, monitoring, identity and other data systems. One of the aspects I wanted to highlight is the multi tenancy model described below. Depending on your tenancy model, other aspects will follow, so it’s important to understand what the data sharing capabilities are.

https://www.snowflake.com/blog/bringing-the-worlds-data-together-announcements-from-snowflake-summit/ — Redshfit Multi-Tenancy Models | AWS

Implementing multi-tenant patterns in Amazon Redshift using data sharing | Amazon Web Services

Software service providers offer subscription-based analytics capabilities in the cloud with Analytics as a Service…

aws.amazon.com

Big Query

As with the AWS ecosystem, Big Query has nice integrations with the GCP ecosystem. To highlight a particular integration is the DLP feature with column level protection.

Using Cloud DLP to scan BigQuery data | Google Cloud

"type": "thumb-down", "id": "hardToUnderstand", "label":"Hard to understand" },{ "type": "thumb-down", "id"…

cloud.google.com

You can scan a BigQuery table and see a decryption on what the DLP engine thinks the metadata is. Afterwords you can use BigQueries column level protection policies to restrict user access at a granular level, especially if you found some sensitive fields, like SSN.

Introduction to BigQuery column-level security | Google Cloud

BigQuery provides fine-grained access to sensitive columns using policy tags, or type-based classification, of data…

cloud.google.com

Synapse Analytics

One of the differences between the other three warehouses and Synapse Analytics is that it runs on an Azure SQL Server. This give you many of the integrations and features that are found in Azure SQL and makes it easier to up-skill SQL server developers to become Synapse Analytics users. It is also part of the larger Synapse ecosystem, which aims to provide a single pane of glass for data analytics. More info can be found in the link below.

Azure Synapse Analytics | Microsoft Azure

Experience a new class of analytics Azure Synapse Analytics is a limitless analytics service that brings together data…

azure.microsoft.com

Synapse Studio Connects Many Data Compoennts Together in Azure — Synapse Studio Connects Many Data Components Together in Azure | Azure

The AAD native federation is also nice especially for enterprises with hybrid architectures, so you man use on prem ids to access your cloud workloads. We’ll cover this more in the identity federation section.

Datalakes

We’ve talked about mainly structured/tabular data so far, but many ML models need to use other data, which is where a datalake comes in handy. There have been a few iterations of datalakes, but one of the main challenges was storing the data in an organized fashion and being able to run useful queries on them. With advances in distributed storage and computing, there are now many systems that can do this such as Spark or Amazon Athena.

History and evolution of data lakes - Databricks

In the early days of data management, the relational database was the primary method that companies used to collect…

databricks.com

A Brief History of Data Lakes - DATAVERSITY

Data Lakes are consolidated, centralized storage areas for raw, unstructured, semi-structured, and structured data…

www.dataversity.net

With compute and storage now largely separate for big data technology, data can be stored in object storages such Azure Blob, or AWS S3, and then be processed by analytics or ML systems. We also want to also avoid data silos, where different apps or businesses keep their own datasets and standards, so creating a unified standard with strong data segregation and security is important. We’ll talk about two important concepts to achieve this, ACLs and network restrictions.

ACL (Access Control List)

We’ve already briefly spoken about identity and RBAC, but more broadly, many tech platforms use ACLs as a way to limit who can access what. In our datalake case, we define ACLs to control operations such as reads, writes, deletes and a couple others. We can make ACLs longed lived or provision them on the fly to allow just in time access.

These ACLs can be very specific (user level) or broad (everyone on the internet), but configuring them allows you to keep data in one place and delineate it. In the example below you could delineate multiple S3 buckets with specific ACLs at the Bucket and Object level.

S3 datalake account with multiple S3 buckets and objects within | AWS

Network Access

Similar to databases, we typically restrict network access to our datalake. It’s usually a good idea as a company is scaling to put it into a different subscription/account to separate business apps from the analytics layer. You can then peer VPCs/VNets and use storage private end points to connect.

Datalake Specific Nuances

Now that we have covered some general protection tips for datalakes, we’ll cover three technologies with some nuances.

Azure Datalake Gen 2

Azure Datalake Gen 2 (ADLS) is a hierarchical storage that combines the capabilities of the original Azure Datalake with the benefits of Azure Blob Storage. When deploying you may use ARM templates and use the SDK to manage the hierarchy which consists of containers and files.

One of the benefits of Azure Datalake is you can use AAD to create roles and permissions for your different containers and files. This means when you’re ingesting or analyzing data you’ll have a common set of identities to use. You can also configure Azure Security Centre to ingest your emitted logs and events from ADLS to see who is accessing your data or changing the configurations of the datalake. You may also use credential passthrough to check user permissions when accessing Azure Datalake with Azure Databricks.

More information on Azure Datalake security can be found below with a set of Microsoft provided recommendations.

Security recommendations for Blob storage - Azure Storage

This article contains security recommendations for Blob storage. Implementing these recommendations will help you…

docs.microsoft.com

S3

S3 is a core AWS service and has recently been used more often as a datalake. With fine grain permissions, policies and performance, it’s a great candidate for a datalake. AWS has doubled down on the potential by releasing LakeFormation a couple of years ago, a framework for managing data and metadata for datalakes.

AWS Lake formation helps configure key parts of a datalake | AWS

With additional controls and easy integrations with AWS’s data ecosystem, Lakeformation reduces operational overhead, security risk and complexity.

How AWS Services integrate with LakeFormation | AWS

Deltalake

Deltalake is an open source project that builds on top of datalakes. It offers a unique parquet based format that offers additional benefits out of the box that can be found below.

ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
Scalable metadata handling: Leverages Spark distributed processing power to handle all the metadata for petabyte-scale tables with billions of files at ease.
Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, interactive queries all just work out of the box.
Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.
Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
Upserts and deletes: Supports merge, update and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.

Delta Lake - Reliable Data Lakes at Scale

Scalable Metadata Handling In big data, even the metadata itself can be "big data." Delta Lake treats metadata just…

delta.io

Note that for ADLS gen 2, you need to use a service principal to access data, while in S3 you can use IAM roles.

spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
 spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net",  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
 spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
 spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net","<password>")
 spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

You can find more setup info at https://docs.delta.io/latest/delta-storage.html

Streaming and Messaging

We’ve mainly covered standard storage, but as streaming technologies evolve and mature to accommodate real time needs, they can also be included as a data storage environment, even if temporary. With many streaming services having replay capabilities to avoid dataloss, it’s important to make sure that your stream is protected, as data will be persisted. I’ve also included a cool article on whether Kafka is a database by DZone.

https://dzone.com/articles/is-apache-kafka-a-database-the-2020-update

Network, Identity, Encryption, oh my!

We’ve covered these topics numerous times already, so please make sure to read about how to secure your streaming service. In this section we’ll focus on some of the interesting security features for common streaming solutions.

One thing to remember is to handle multi tenancy well for streams, otherwise you could get missing data, duplicated data, and general leakage. With a good IAM model and infrastructure automation, each app can get it’s own topic or deployment, depending on your model. The article below illustrates some of these challenges, covering the integration points between GCP Pub/Sub and Dataflow.

Streaming with Pub/Sub | Cloud Dataflow | Google Cloud

"type": "thumb-down", "id": "hardToUnderstand", "label":"Hard to understand" },{ "type": "thumb-down", "id"…

cloud.google.com

Kafka and Managed Kafka

Apache Kafka is a leading opensource streaming platform originally developer by Linkedin. It has become a standard for processing realtime data and for event driven architectures, including handling driver matching at Uber. Uber has also written about how Kafka can be configured for Multi Region DR.

Disaster Recovery for Multi-Region Kafka at Uber

Uber has one of the largest deployments of Apache Kafka in the world, processing trillions of messages and multiple…

eng.uber.com

Kafka can be deployed in its open source version or through a managed provider either in the cloud or on prem. AWS, GCP and Azure all offer a managed version of Kafka, with their own flavours to it. Confluent also offers a managed version of Kafka with many plugins.

Confluent builds on Kafka to offer an ecosystem | Confluent

You can also checkout this article from DZone about some Kafka security details.

Apache Kafka Security: Features and Uses of Kafka - DZone Security

There are a number of features added in Kafka community in release 0.9.0.0. There is a flexibility for their usage…

dzone.com

Azure Eventhubs

Azure Eventhubs is Microsoft’s realtime streaming solution. It has a number of nice features in terms of integrations, security features and Kafka inter-operability. You can integrate it with Azure data suite and connect your applications with the SDK. As per usual it also integrates with your VNET, AAD and Replication services.

Eventhub integration with Databricks | Azure

https://docs.microsoft.com/en-us/azure/databricks/scenarios/databricks-stream-from-eventhubs

Kinesis and Kinesis Firehose

Kinesis and Kinesis Firehose are AWS’s data streaming technologies. They focus on streaming data from applications to data sinks, and integrate natively with all other AWS services. You’ll notice that best practices doc covers the topics we’ve been discussing, so hopefully that gives you some confidence :).

AWS Kinesis For Datastreaming — AWS Kinesis For Data streaming | AWS

Security Best Practices for Kinesis Data Streams

Amazon Kinesis Data Streams provides a number of security features to consider as you develop and implement your own…

docs.aws.amazon.com

Integrating With A Data warehouse or Datalake

Two of the potential destinations for a data stream are a data warehouse or datalake. In the major providers this is usually a build in capability, and allows a secure and scalable integration so you don’t have to build custom apps to do so. For AWS and Azure, Kinesis Firehouse and Stream Analytics are the products for the task respectively. You can find more details in the articles below.

Amazon Kinesis Data Firehose - Streaming Data Pipeline - Amazon Web Services

Prepare and load real-time data streams into data stores and analytics services Amazon Kinesis Data Firehose is the…

aws.amazon.com

Azure Stream Analytics Flow Pattern | Azure

AWS Firehose potential desitnations — AWS Kinesis Data Firehouse | AWS

Federated Access Models

One of the most compelling reasons for sticking to one cloud provider is having a consistent set of identities that can be provisioned and used for services to authenticate against each other. When using multiple clouds or SAAS products, you’ll usually need a mechanism to bridge the gap. Some SAAS products are quite mature allowing identities such as Service Principles and IAM roles to be granted between your cloud environment and theirs. Others require you to use SSO + SCIM with local ACLs to do security. The least mature ones require just local accounts. Below is an example on how Snowflake can create an Azure SP in your account to do data ingestion, and how you can manage its access in AAD easily.

Configuring an Azure Container for Loading Data - Snowflake Documentation

This section describes how to use storage integrations to allow Snowflake to read data from and write data to an Azure…

docs.snowflake.com

Another important part of data security is being able correlate access and keep it consistent across layers. Many applications use different permissions and permission frameworks between the UI, Backend and Datalayer, which can make it difficult to manage and audit. Ideally you’re using passthrough authentication in at least two layers, so that users have roles that work across the stack, and those roles are a source of truth.

An example is creating an Azure Active Directory group for business analyst that need to access all line of business datasets. Let’s say we call it BA_RETAIL_PROD, and add all BAs into it. Then we grant this group read access on our Synapse table. When the user wants to use Databricks or Powerbi, these credentials are used for access at all levels of the tool: data, data compute service, visualization.

This example is using the Azure stack, but is applicable to other cloud and customer solutions.

Azure Active Directory With Azure SQL/Synapse Analytics

With Azure Synapse Analytics, you can allow users to login with their Azure credentials. These users can be added directly to the database, or groups can be added that will then be synched with AAD. Each user and group however will need to be added via a T-SQL command, but this can be done using a configured automation identity (Managed identity) that can run the scripts.

Authorize server and database access using logins and user accounts - Azure SQL Database & SQL…

Learn about how Azure SQL Database, SQL Managed Instance, and Azure Synapse authenticate users for access using logins…

docs.microsoft.com

The downside is that your permissions will only exist in the SQL database/warehouse and not be auditable in Azure AD.

Setting up a SQL Server AAD Admin | Azure

AWS IAM with Aurora

AWS Aurora also allows IAM roles to be used to login via granting temporary access tokens to avoid using local credentials. An excerpt from the documentation provides a great description with a full link below.

Connecting to Postgresql using IAM policy | AWS

You can authenticate to your DB cluster using AWS Identity and Access Management (IAM) database authentication. IAM database authentication works with Aurora MySQL and Aurora PostgreSQL. With this authentication method, you don’t need to use a password when you connect to a DB cluster. Instead, you use an authentication token.

An authentication token is a unique string of characters that Amazon Aurora generates on request. Authentication tokens are generated using AWS Signature Version 4. Each token has a lifetime of 15 minutes. You don’t need to store user credentials in the database, because authentication is managed externally using IAM. You can also still use standard database authentication.

IAM database authentication provides the following benefits:

Network traffic to and from the database is encrypted using Secure Socket Layer (SSL) or Transport Layer Security (TLS). For more information about using SSL/TLS with Amazon Aurora, see Using SSL/TLS to encrypt a connection to a DB cluster.
You can use IAM to centrally manage access to your database resources, instead of managing access individually on each DB cluster.
For applications running on Amazon EC2, you can use profile credentials specific to your EC2 instance to access your database instead of a password, for greater security.

IAM database authentication

Authenticate to your DB instance or cluster using IAM database authentication.

docs.aws.amazon.com

Using Vendor Products

Sometimes the cloud native services don’t fit your needs, so what do you do? When choosing products from a security perspective, you should still try to integrate your identities with their permissions framework. We’ll cover three important terms on what to look for.

Single Sign on (SSO)

When using a vendor product for data related capabilities, one of the technical features you will see advertised is the use of Single Sign On. SSO allows users to only remember their main credential and be able to use it across multiple applications. Each application uses an Identity Provider (IDP) to verify if the user credentials are correct, and based on an IDP token/signature, accept the users login. The protocol that’s used varies, with some popular ones including SAML, Kerberos and OAuth2. You can read more in the articles below.

Single sign-on: What is it & how does it work?

Single sign-on (SSO) is an authentication method that enables users to securely authenticate with multiple applications…

www.onelogin.com

What Is and How Does Single Sign-On Authentication Work?

Forrester Consulting analysis determines that using Auth0 can yield a 548% ROI and $3.7M in identity-related savings…

auth0.com

Choosing an SSO Strategy: SAML vs OAuth2 | Mutually Human

Chances are you've logged into an application (mobile app or web app) by clicking on a 'Log in with Facebook' button…

www.mutuallyhuman.com

On the consumer side, FB, Google, Microsoft and Apple are commonly seen for SSO options. On the enterprise side it’s mainly Microsoft, Okta and Google.

Shopify app SSO options, there are many — Shopify app SSO options | Shopify

To set up SSO, you need to configure a trust between your application and identity provider which is fairly straight forward to do. What you’ll also usually end up doing is creating accounts for the users within the application. You’ll need to do this because the app must use the SSO claim that is created and be able to correlate something in the system, otherwise it will reject the claim. It’s like showing up to a new doctor or dentist with a health card, but them not having your basic health records, you’ll need to fill out some more forms.

In many apps, it’s possible to create user profiles by hand, but if you’re part of a large organization, the initial creation is a pain, and keeping track of who joins and leaves the organization is even worse. Thankfully there is protocol in place to make this easier, SCIM.

System for Cross-domain Identity Management (SCIM)

SCIM allows systems that are trusted to sync identities between each other, usually with one system acting as the source of truth. This allows companies to onboard new applications quickly and have an active sync going to keep things up to date. Most IDPs allow you to use SCIM automatically, so you don’t need to write your own jobs to manage the sync.

Below is a good article to learn more about the SCIM protocol, as RFC 7644.

SCIM: System for Cross-domain Identity Management

The System for Cross-domain Identity Management (SCIM) specification is designed to make managing user identities in…

www.simplecloud.info

What’s also handy is the ability to map groups and roles using SCIM. With Azure AD, you can typically sync users and their groups, which map to users and roles in many SAAS applications. This means that you can assign permissions to the SAAS roles, and as you make updates within Azure Active Directory, they will automatically be updated in your application. Pretty neat!

AD synch for groups and users to snowflake — Snowflake SSO with AD mapping roles and users | by Mika Heino

Tip: Check to see if Service Accounts can be synched as well, Azure Databricks enables this feature which makes running automation jobs much easier!

OAuth + SCIM for Automation

We mentioned OAuth briefly earlier, but one of the benefits we didn’t discuss was using it for Automation accounts. We can use the protocol to grab an access token for an application, and use it to login to our data app! One example is using external (AAD) OAuth with Snowflake and configuring Snowflake to accept tokens with Service Principal identities as seen in the article below.

Configure Microsoft Azure AD for External OAuth - Snowflake Documentation

Steps 1-3 are derived from the Azure AD documentation on OAuth 2.0 and Authentication. For more information on how…

docs.snowflake.com

Another good example is getting an Azure AD token for a Service Principal in Azure and then using that to log into Azure Databricks, no intermediary steps required.

Get an Azure Active Directory token using a service principal - Azure Databricks - Workspace

This article describes how a service principal defined in Azure Active Directory (Azure AD) can also act as a principal…

docs.microsoft.com

With these configurations, you can now automate many of your jobs while using only one identity!

Bonus Tip: Beyond Identity integration, you should also look for the ability to set up a private links between your network and the SAAS vendor provider.

Zero Trust and MFA

One of the hot recent terms in security is Zero Trust and the methods to achieve it. The premise is that when two entities make a connection, we should put both of them through a validation process to verify who they are, rather than assuming they safe due to pre-requisite conditions.

One example is if a company is running a datacentre, they may assume all computers within it are safe, and thus not have strong identity practices. Now that public cloud has risen to prominence, this is becoming trickier because fewer companies manager their infrastructure directly, and most new companies just run on cloud.

So when logging in as a human user, or service account, verifying identity becomes crucial because your employees and services can be connecting from anywhere (almost)! The ability to create credentials on the fly, quickly disable permissions, use behavioral analytics, and log & audit become key parts to protecting resources and the identities used to access them. This is also why having multi factor authentication becomes even more important, as Identity is your true perimeter.

We mentioned the use of Azure AAD/PIM in the last article, but there are multiple identity solutions out there for SSO like Okta or AWS Cognito. These are useful for user logins especially for enabling SSO across many services.

Infographic illustrating the Zero Trust reference architecture — Zero Trust Diagram | Microsoft

For production workloads, you may want to use a proxy server/jump box to login for an extra level of logging and authentication. This can make it more difficult to login as an adversary by requiring multiple credentials and having more fine tuned logging. This also encourages using service ids for production login or separate credentials for dev environments, making it easier to limit potentially toxic access.

While logging into your storage infrastructure, it’s important to think about how your identities are managed, and use some of the recommendations from the previous section to keep your data safe!

Conclusion: Bringing it back to MLOps

So we went down many rabbit holes to discuss data storage security, but hopefully it has provided you a foundation on how to secure infrastructure. From each section you’ll have a set of tool to make architectural recommendations on how to secure your storage and how to encourage engineers to ask good questions about the security of their solutions.

Below is how we could secure our application from Part 1, if we were deploying it Azure. We have our private networks configure with private endpoints, encryption in the database using keys from Azure key vault, identity configurations with Azure AD and events/logs streamed to security centre and Azure storage!

Next Time — Part 3: Moving and Transforming Data

After securing all your different data storage components you may ask, now what? We typically need to move, process and transform our data before it ready to be used by an ML model. That’s what Part 3 is for!

In Part 3 we will discuss how to secure your orchestration layer, covering open source (Airflow) and cloud native technologies (ADF, Azure Functions, and AWS Lambdas).

Part 3. Securing Your Orchestrator