ELTIMS — The New Data Acronym
ETL doesn’t cover our modern data needs
Every morning, most of us start our day by checking our phones. We catch up on what’s happening in the world, chat with friends and family and watch a couple of cat videos. All these activities require dozens of things to happen in the background, including curating and loading the data that powers what we want to see. Collectively we generate an astronomical amount of data, close to 2.5 quintillion bytes a day (Forbes, 2018).
Someone has to manage all this data, but most of the vocabulary we use to describe these processes is outdated. That’s why I wanted to introduce the idea of ELTIMS, a new acronym that better captures what companies need to build in order to meet modern consumers’ needs.
Apart from going through the acronym and its use cases, this article also provides insights into the direction of many data-focused companies based on modern data needs. Let’s start with a history of data transformations, a commentary on some of the largest players and why ELTIMS is a natural way forward.
The original acronym — ETL
In the 1970s, computers were coming online and starting to generate digital data. Apart from storing that data, we needed a way to prepare it for downstream applications, which is where ETL, Extract, Transform, Load, came in: extract data from source systems, transform it in an intermediary step, then load it into its destination.
As data warehouses emerged in the early 1990s, we began to see a centralized location for data and its transformations. These data warehouses were still limited in the formats they supported and the operations you could run within them, so intermediary transformation steps were needed before loading the data in.
Since then, data warehouses, and data stores in general, have added transformation capabilities through SQL support, views, jobs and stored procedures. Combined with highly scalable cloud offerings, this let us try out a new paradigm.
Get the data in, deal with it later! — ELT
The concept of ELT is quite compelling: rather than having an intermediary system that bottlenecks your data transfer, you move the data downstream and let the destination system deal with it. While this sounds like a shift of responsibility, you get a couple of benefits (a rough sketch of the pattern follows the list below):
- The compute for transformations sits closer to the storage, which usually leads to better performance.
- The data storage team now owns this process, so you’re likely to see faster iteration and better collaboration.
- You have more flexibility in which new views you create, since the data is centralized.
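To make the pattern concrete, here is a rough sketch of load-then-transform using the Snowflake Python connector. The connection details, stage and table names are hypothetical placeholders, and the raw table is assumed to have a single VARIANT column named payload; any warehouse with in-database transformation support would follow the same shape.

```python
# Minimal sketch of the ELT pattern with the Snowflake Python connector.
# All names and credentials below are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",
    database="RAW",
    schema="EVENTS",
)
cur = conn.cursor()

# "E" and "L": copy raw, semi-structured files straight into the warehouse.
# raw_events is assumed to have one VARIANT column named payload.
cur.execute("""
    COPY INTO raw_events
    FROM @my_event_stage
    FILE_FORMAT = (TYPE = 'JSON')
""")

# "T": transform after loading, inside the warehouse, as a view.
cur.execute("""
    CREATE OR REPLACE VIEW clean_events AS
    SELECT
        payload:user_id::STRING     AS user_id,
        payload:event_type::STRING  AS event_type,
        payload:ts::TIMESTAMP_NTZ   AS event_time
    FROM raw_events
""")

cur.close()
conn.close()
```

The key point is that the transformation happens after the load, inside the destination system, rather than in a separate intermediary tool.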
There are downsides, especially if your downstream system is inflexible or expensive. To mitigate this, solution providers have decoupled storage and compute costs and introduced greater format flexibility. Some of these feature improvements include:
- Snowflake and Databricks providing storage at cloud provider cost (S3, Azure Blob Storage, etc.) while enabling full analytics capabilities
- Snowflake and Databricks decoupling compute from storage, with several solution providers getting closer to enabling true serverless capabilities
- Snowflake, BigQuery, Redshift, Azure Synapse and Databricks supporting JSON and semi-structured data
- Major players (listed above) allowing native transformation tasks and jobs to be run within their storage platforms
For the past couple of years these advancements have been focused on winning customers’ compute and storage needs for their ELT processes. By controlling these two aspects, vendors are more likely to keep customers happy and committed to their platforms.
Competitive insight: controlling compute and storage
As mentioned earlier, controlling both compute and storage has been the name of the game. Previous leaders like Teradata and IBM have fallen behind due to their slow start in the cloud and a large base of on-prem legacy customers, which has limited their ability to offer competitive elastic solutions.
One approach for winning both storage and compute is to decouple them, so customers receive the flexibility benefits of serverless compute and the low costs of object storage. This decoupling was one of the biggest selling points of Snowflake and, later, Databricks. Other vendors are still catching up on serverless, but the gap is closing quickly, with products like Redshift Spectrum, Azure Synapse and BigQuery making gains backed by enormous engineering teams. This creates a need to build a moat around your offering, with new formats and custom workflows being the tools of choice.
Building moats — custom workflows and new formats
As the ELT space matures, major players will need to defend their existing accounts. The best way to do this is to offer customers convenient features that make their lives easier but also increase the switching costs.
The first, and the most closely related to ELT, is custom workflows. A key part of ELT is the customizability and cost of running all the transformations. This has led to features being built into existing platforms, or deeper integrations with adjacent products. Some examples include the following, with a short sketch of triggering one such managed job after the list:
- Serverless Azure Data Factory with dozens of integrations
- Serverless AWS Glue jobs
- Snowpipe and custom SQL/UDF tasks in Snowflake
- The Jobs API and cheaper job clusters introduced in Databricks
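As an illustration of what these managed jobs look like in practice, here is a rough sketch of kicking off a serverless AWS Glue job programmatically with boto3; the job name, arguments and region are hypothetical.

```python
# Minimal sketch: triggering a serverless AWS Glue job from Python.
# "nightly-clickstream-transform" is a hypothetical job name.
import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Start the managed transformation job with runtime arguments.
run = glue.start_job_run(
    JobName="nightly-clickstream-transform",
    Arguments={"--target_date": "2021-11-01"},
)
run_id = run["JobRunId"]

# Poll until the platform-managed run reaches a terminal state.
while True:
    status = glue.get_job_run(JobName="nightly-clickstream-transform", RunId=run_id)
    state = status["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        print(f"Job finished with state: {state}")
        break
    time.sleep(30)
```

The transformation logic itself lives inside the platform; your orchestration code only needs to start the run and watch its state.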
For a consumer, a platform-based workflow lets you leverage the platform’s native capabilities and reduces the number of tools you need to manage.
If you’re an enterprise with potentially hundreds of workflows, migrating them becomes a project of its own, increasing the likelihood that you stick with the existing solution. To combat this inertia, many vendors develop migration tools, such as Babelfish, the translation layer that lets SQL Server applications run against PostgreSQL, unveiled at AWS re:Invent 2020. Babelfish was also promised to be open sourced in 2021 to provide broader community benefits.
The interesting thing about open sourcing tools is that they provide huge opportunities to build successful businesses and carve out market niches. Companies like Red Hat, HashiCorp and Databricks are some of the largest success stories, with hundreds of other successful companies mixed in.
So where does the moat come in? By creating a new open source standard and getting widespread adoption, companies can create strong lock-in to their ecosystems. In the case of Databricks, MLflow and Delta Lake are great open source tools that also give it an immense competitive advantage: customers who adopt these tools and their integrations with Databricks are less likely to move. Delta Lake is used by Databricks as the storage format underpinning its lakehouse architecture, which allows rapid and versioned data queries.
Customers get a better data experience but also a harder time migrating, so what’s not to like?
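To show what those versioned ("time travel") queries look like, here is a minimal sketch using the open source delta-spark package with a local Spark session; the table path and sample data are purely illustrative.

```python
# Minimal sketch of Delta Lake's versioned reads with the open source
# delta-spark package. The path and rows below are illustrative only.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write an initial version of the table (version 0).
df_v0 = spark.createDataFrame([(1, "signup"), (2, "purchase")], ["user_id", "event"])
df_v0.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# Append more rows, creating version 1 of the same table.
df_v1 = spark.createDataFrame([(3, "churn")], ["user_id", "event"])
df_v1.write.format("delta").mode("append").save("/tmp/events_delta")

# Query the table as of an earlier version, alongside the latest state.
original = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events_delta")
latest = spark.read.format("delta").load("/tmp/events_delta")
print(original.count(), latest.count())  # 2 rows vs. 3 rows
```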
Limited operations, rising expenses
While the past couple of paragraphs have commented positively on Databricks and Snowflake as case studies, they still have some key limitations that have prevented them from winning the space.
Databricks is still quite expensive, since it was designed as a Spark big data processing tool. As such, running non-Spark jobs, such as everyday Python analysis and visualization, is costly. Before the lakehouse architecture it also required data to be loaded into memory to run SQL jobs, rather than persisting views like a traditional warehouse.
Snowflake, on the other hand, has most of the traditional ELT stack figured out, but is only starting to move into the scripting and visualization spaces. It recently added Python support through a partnership with Anaconda and introduced a lightweight BI tool called Snowsight.
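That Python support is, to my understanding, delivered through Snowpark; below is a rough sketch of what pushed-down Python analysis looks like, with the connection parameters and the CLEAN_EVENTS table as placeholders.

```python
# Minimal sketch of running Python analysis inside Snowflake via Snowpark.
# Connection parameters and the CLEAN_EVENTS table are placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, count

connection_parameters = {
    "account": "my_account",
    "user": "my_user",
    "password": "my_password",
    "warehouse": "ANALYTICS_WH",
    "database": "ANALYTICS",
    "schema": "PUBLIC",
}
session = Session.builder.configs(connection_parameters).create()

# The aggregation is pushed down and executed in the warehouse,
# so the data never leaves Snowflake.
events_per_type = (
    session.table("CLEAN_EVENTS")
    .group_by(col("EVENT_TYPE"))
    .agg(count(col("USER_ID")).alias("N_EVENTS"))
)
events_per_type.show()
session.close()
```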
Get some insights, do some modeling — ELTIM
At the end of the day, companies invest in data infrastructure to get insights that can advance their business. Customer, industry and product trends all need to be understood and made actionable, with data being the key ingredient.
These insights are typically gathered through data aggregation, statistical tests and visualization, each of which has traditionally needed its own technology stack. Aggregation is easily done in a traditional data warehouse, but the other approaches have usually required moving your data into a separate tool or ecosystem.
This creates challenges: additional tooling setup and integrations, increased operational costs from data egress, and latency from moving data between sources. Even with these downsides, tools such as Tableau, DataRobot and Dataiku have been able to build out or maintain significant followings; often the stickiness of an existing tool and feature set will win out over an integrated experience that’s incomplete.
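To ground the aggregation, statistical test and visualization trio, here is a small self-contained sketch in Python on synthetic data; in practice the DataFrame would be pulled from (or computed inside) your warehouse, and the column names are purely illustrative.

```python
# Minimal sketch of the aggregation / statistical test / visualization trio
# on synthetic data with illustrative column names.
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "variant": rng.choice(["control", "treatment"], size=1_000),
    "spend": rng.gamma(shape=2.0, scale=25.0, size=1_000),
})

# Aggregation: average spend per experiment variant.
summary = df.groupby("variant")["spend"].agg(["mean", "count"])
print(summary)

# Statistical test: does spend differ between variants?
control = df.loc[df["variant"] == "control", "spend"]
treatment = df.loc[df["variant"] == "treatment", "spend"]
t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=False)
print(f"t={t_stat:.2f}, p={p_value:.3f}")

# Visualization: distribution of spend by variant.
df.boxplot(column="spend", by="variant")
plt.savefig("spend_by_variant.png")
```

Today each of those three steps often lives in a different product, which is exactly the friction an integrated platform aims to remove.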
Worldwide spending on big data and business analytics (BDA) solutions is forecast to reach $215.7 billion this year, an increase of 10.1% over 2020, according to a new update to the Worldwide Big Data and Analytics Spending Guide from International Data Corporation (IDC).
With the data market growing quickly and new features being released every day, there will be significant churn in the market as companies continue to invest in their data practices. A fully integrated data platform is especially compelling because it helps users get their products to market faster, reduces engineering cost and provides better security.
Ultimately, this platform must integrate with other technology solutions and be usable by the end customer, which leads us to the core part of the article — ELTIMS.
Serving what you made — ELTIMS
The core of our article just needs one more step: the “serve” step. One of the biggest challenges companies face is getting their models into production in a customer-facing environment. Often this step takes several weeks, as shown in the chart below, based on a 700-company study by Algorithmia.
To capture this part of the market, we’ve seen existing players introduce model serving features, including scalable endpoints, model versioning, GPU support and more. This capability makes existing platforms quite compelling: you can take a model you trained and push it to production without leaving the environment.
The big cloud players (AWS, Azure, GCP) have been integrating their tooling to make this happen. AWS SageMaker, Azure Synapse and Vertex AI all promise a seamless ELTIMS experience while also integrating nicely with your application stack. Databricks has recently added model serving as well, and VC-funded unicorns like DataRobot and Weights & Biases continue to push into this area.
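As a simplified picture of the serve step, here is a sketch using MLflow’s model registry, which came up earlier in the article; it assumes a tracking server with a registry is configured, and the registered model name is hypothetical.

```python
# Minimal sketch of the "serve" step: log a trained model as a registered
# version, then load that specific version for inference.
# "churn_model" is a hypothetical registered model name, and a tracking
# server with a model registry is assumed to be configured.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Log and register the model; re-running this creates a new version.
with mlflow.start_run():
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        registered_model_name="churn_model",
    )

# Later, in a serving process: load version 1 and score new data.
loaded = mlflow.pyfunc.load_model("models:/churn_model/1")
print(loaded.predict(X[:5]))
```

From there, the same versioned URI can be handed to a serving process, for example via MLflow’s `mlflow models serve` CLI or a managed endpoint in one of the platforms above.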
The serve step is important, but it can’t exist in isolation either. Recently we’ve been using MLOps as a catch-all term for getting models into production, but a good data practice goes beyond just building models; it’s foundational to any business expansion.
ELTIMS is exciting because it captures the requirements of the new generation of data platforms, one that focuses on a seamless and integrated user experience. It gives us an acronym that describes what we want a platform to do, and it builds on a rich history of data engineering.
ELTIMS Is The Future, For Now
We’ve gone through the evolution of data products over the past 40 years, and many things have certainly changed. With machine learning currently the leading frontier for digital transformation, achieving fully functional ELTIMS is the target state for many companies.
The largest companies in the world have largely solved these problems internally, but apart from the FAANG+ tech giants, very few companies are mature in their ELTIMS processes. With such a large market and so much stickiness in the data space, we can expect billions of dollars to be invested in ELTIMS progress for years to come.