Protecting Your Data Engineering and MLOps Storage

Part 2 of the 7 Layers of MLOps Security Guide

An ML Storage Architecture

Series Links

How to read this article

Table of Contents

An architecture for the bits and pieces

AWS Lakehouse with many data services
AWS Lakehouse Architecture with many data components | AWS
The Data Storage Layers of an MLOps Architecture
Simple MLOps architecture in Azure for the Data Storage Layers

Transactional Data Stores

Today was my first day on the job as a Junior Software Developer and was my first non-internship position after university. Unfortunately i screwed up badly.

I was basically given a document detailing how to setup my local development environment. Which involves run a small script to create my own personal DB instance from some test data. After running the command i was supposed to copy the database url/password/username outputted by the command and configure my dev environment to point to that database. Unfortunately instead of copying the values outputted by the tool, i instead for whatever reason used the values the document had.

Unfortunately apparently those values were actually for the production database (why they are documented in the dev setup guide i have no idea). Then from my understanding that the tests add fake data, and clear existing data between test runs which basically cleared all the data from the production database. Honestly i had no idea what i did and it wasn’t about 30 or so minutes after did someone actually figure out/realize what i did.

While what i had done was sinking in. The CTO told me to leave and never come back. He also informed me that apparently legal would need to get involved due to severity of the data loss.
(Story from Reddit)

Blameless postmortems can be challenging to write, because the postmortem format clearly identifies the actions that led to the incident. Removing blame from a postmortem gives people the confidence to escalate issues without fear. It is also important not to stigmatize frequent production of postmortems by a person or team. An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug, leading to greater risk for the organization [Boy13].

What if I say your Email ID is all I need to takeover your account on your favorite website or an app. Sounds scary right? This is what a bug in Sign in with Apple allowed me to do.

SLA by number of nines
Achieving an SLA over 99.99% for applications is challenging | Wikipedia
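To make the "number of nines" concrete, here is a minimal sketch that converts an availability target into a yearly downtime budget. It assumes a 365-day year and measures downtime in minutes; the function name is only for illustration.

```python
# Yearly downtime budget for a given availability target ("number of nines").
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime budget per year, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability -> {minutes:.1f} minutes of downtime per year")
```

At 99.99% the budget is roughly 52.6 minutes per year, which has to cover every outage, deployment mistake, and failover combined.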
Regions and Availability Zones in Azure | Microsoft Docs
How Azure Replicates Data for Blob Storage | Azure
Azure SQL with no region in the URL | Microsoft
Load balancing is becoming easier with new tech | GCP
Recovery Time (RTO) and Recovery Point (RPO) in DR Strategy | MSP360
DR strategies related to RPO/RTO | AWS
AWS Aurora Global Clusters | AWS
Disaster recovery in Snowflake across regions/accounts | Snowflake
User Connecting to API Gateway that maps to an Azure function then to a private Azure SQL
Users can only connect via the API gateway
Azure VNET connection between an Azure VM and Azure SQL with VNET peering
Azure VNET connection between an Azure VM and Azure SQL | Azure Docs
AWS VPC with a public application subnet and private db subnet | AWS Website
Cloud SQL connection from VM using private IP
GCP VPC with a private Cloud SQL instance | GCP Website

At provision, Databases for PostgreSQL sets the maximum number of connections to your PostgreSQL database to 115. 15 connections are reserved for the superuser to maintain the state and integrity of your database, and 100 connections are available for you and your applications.
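A connection limit like this is easy to exhaust once several application instances each open their own pool. The sketch below divides the usable connections across instances; the 115 and 15 figures come from the quote above, while the instance count and the headroom kept back for ad hoc admin sessions are hypothetical values chosen for illustration.

```python
def pool_size_per_instance(
    max_connections: int, reserved: int, instances: int, headroom: int = 0
) -> int:
    """Return the largest per-instance pool size that stays within the database limit."""
    if instances <= 0:
        raise ValueError("instances must be positive")
    usable = max_connections - reserved - headroom
    if usable < instances:
        raise ValueError("not enough connections for one per instance")
    return usable // instances


# 115 total and 15 reserved are from the quoted docs; 8 instances and a
# headroom of 10 connections are assumptions for this example.
size = pool_size_per_instance(max_connections=115, reserved=15, instances=8, headroom=10)
print(f"Each of the 8 instances can hold up to {size} pooled connections")
```

Capping each pool this way means a burst of traffic queues inside the application rather than failing at the database with connection errors.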

Example of encrypting and decrypting a string in MySQL | MySQL docs 8.0

Always Encrypted is a feature designed to protect sensitive data, such as credit card numbers or national identification numbers (for example, U.S. social security numbers), stored in Azure SQL Database or SQL Server databases. Always Encrypted allows clients to encrypt sensitive data inside client applications and never reveal the encryption keys to the Database Engine (SQL Database or SQL Server). As a result, Always Encrypted provides a separation between those who own the data and can view it, and those who manage the data but should have no access.

AWS Aurora column level encryption with KMS
A sample app with client side encryption | AWS
Immuta SQL Workflow Architecture | Immuta Docs
A list of Privacera Partners, which means fewer connectors to build | Privacera

Encryption in Transit

A fun comic sample from How HTTPS works

Different Types of Data Keys

Snowflake’s Tri Security Key Hierarchy: the root key encrypts the account master keys, which encrypt the table master keys, which encrypt the file keys
Snowflake’s Tri Security Key Hierarchy | Snowflake
Snowflake Tri Security with Bring Your Own Key | Snowflake
Comparison of bring your own key and hold your own key
BYOK vs HYOK | Secupi
Snowflake Trisecurity Re-keying after rotation
Snowflake Trisecurity Re-keying | Snowflake

Data Warehouses

According to every major data warehouse, it’s cheaper than the competition :)
Snowflake object hierarchy: account at the top, followed by database, schema, and table
Snowflake object hierarchy | Snowflake
Cross AWS Redshift architecture with EC2 access from another account
Cross AWS Redshift architecture | AWS
What factors can influence multi tenancy design
Depicting our multi tenancy use case
Multiple Accounts With Database per App
Multiple Accounts With Database per LOB
Snowflake Private Link With AWS | Snowflake
https://www.snowflake.com/blog/bringing-the-worlds-data-together-announcements-from-snowflake-summit/
Redshift Multi-Tenancy Models | AWS
Column Level Protection and tagging BigQuery | GCP
Synapse Studio Connects Many Data Components Together in Azure | Azure

Datalakes

S3 datalake account with multiple S3 buckets and objects within | AWS
Azure datalake gen2 supports many types of users | Azure
AWS Lake formation helps configure key parts of a datalake | AWS
How AWS Services integrate with LakeFormation | AWS
Deltalake overview | Deltalake
# Configure OAuth (service principal) access from Spark to an Azure Data Lake Storage Gen2 account.
# Replace the <...> placeholders with your own values; ideally load the client secret from a secret store rather than hard-coding it.
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", "<password>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
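
The five settings above repeat the same storage account suffix, so one way to keep them consistent is to build them in a single place. This is a minimal sketch, not the author's setup: the helper name, the example account and IDs, and the `ABFS_CLIENT_SECRET` environment variable are all hypothetical, and reading the secret from the environment is just one option for keeping it out of a notebook.

```python
import os


def abfs_oauth_conf(
    storage_account: str, application_id: str, directory_id: str, client_secret: str
) -> dict:
    """Build the Spark settings for OAuth access to one ADLS Gen2 storage account."""
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": application_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


conf = abfs_oauth_conf(
    storage_account="examplestorage",            # hypothetical account name
    application_id="example-application-id",     # hypothetical service principal ID
    directory_id="example-directory-id",         # hypothetical tenant ID
    client_secret=os.environ.get("ABFS_CLIENT_SECRET", ""),  # hypothetical env var
)

# Inside a Spark session the settings would be applied with:
#   for key, value in conf.items():
#       spark.conf.set(key, value)
print(sorted(conf))
```

Building the dictionary separately from applying it also makes the configuration easy to check in a unit test without a running cluster.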

Streaming and Messaging

Confluent builds on Kafka to offer an ecosystem | Confluent
Eventhub integration with Databricks | Azure
AWS Kinesis for Data Streaming | AWS
Azure Stream Analytics Flow Pattern | Azure
AWS Firehose potential destinations
AWS Kinesis Data Firehose | AWS

Federated Access Models

Connecting to Power BI with one identity
Setting up a SQL Server AAD Admin | Azure
Connecting to PostgreSQL using IAM policy | AWS
Shopify app SSO options, there are many
Shopify app SSO options | Shopify
AD sync of groups and users to Snowflake
Snowflake SSO with AD mapping roles and users | by Mika Heino

Zero Trust and MFA

Infographic illustrating the Zero Trust reference architecture
Zero Trust Diagram | Microsoft

Conclusion: Bringing it back to MLOps

An ML Storage Architecture
Our Initial MLOps Architecture

Next Time — Part 3: Moving and Transforming Data


ML Lead @ Voiceflow
