Data lake + data warehouse = lakehouse: Modernize your data warehouse to a lakehouse

The View from the Lakehouse Blog Series PART 2

Over the past few years, the term “data lakehouse” has grown in popularity. Why? 

Fundamentally, a data lakehouse combines the best of data warehouses and data lakes, supporting business intelligence and advanced analytics workloads in a single platform. In other words, a data lakehouse provides a means to support all data (even voice and video) as well as all analytics and AI use cases. With a data lakehouse, data engineers don’t need to spend precious time copying subsets of data from the lake to the warehouse. Nor do they need to design and oversee complex processes related to schema management and record tracking. That’s a win for efficiency, security, and governance. 

The lakehouse paradigm is resonating with leaders and practitioners alike. Technology providers are taking notice, even latching on to the term “data lakehouse” to characterize their solutions. At times, the characterization is accurate. But often, it isn’t.

At Lovelytics, we help customers create a strategy for the use of data—one that ultimately drives outcomes for the business. We also help customers cut through noise by answering two all-important questions: 

  1. Should I invest in this technology? 
  2. Does this technology actually do what it claims to do?

Time and money aren’t limitless. That’s why you need to feel confident that your investment choices are in alignment with your strategy. We’re your partner in this journey, and we take our advisory responsibility seriously.

In this blog, we’ll define the critical capabilities of a data lakehouse, so you can easily separate the true lakehouses from those that are riding the coattails of the term’s success but don’t support the paradigm in principle. 

Ultimately, there’s only one data lakehouse that checks all the boxes: Databricks. That’s why we bet our business on Databricks, and why you should, too.

A lakehouse … eliminates data silos

A data lakehouse must allow users to perform data transformations in the lake. Without this capability, data engineers have to copy subsets of data from the lake to the warehouse to support business intelligence use cases. It’s as simple as that.

The Databricks Lakehouse Platform enables users to do this with unmanaged (external) tables. An unmanaged table is one for which Databricks manages only the metadata; the data itself stays in your cloud storage, in the lake.

Data engineers prefer unmanaged tables because transformations (like aggregating or pivoting data for reporting purposes) don’t require data extraction—everything happens (and is accessible) within a single platform. Clearly, there’s also an organizational benefit to this. Because all data (and transformations performed against that data) lives within a single platform, all teams can leverage this data for a wide range of use cases.
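To make that concrete, here is a minimal PySpark sketch, assuming a Databricks notebook (where `spark` is predefined) and hypothetical paths and table names. It registers an unmanaged table over files already sitting in the lake, then writes an aggregated result straight back to the lake rather than copying anything into a separate warehouse.

```python
# Register an unmanaged (external) table: only metadata is recorded in the
# metastore; the data stays at the external LOCATION in the lake.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_raw
    USING DELTA
    LOCATION 's3://example-bucket/sales/raw'
""")

# Transform in place: aggregate for reporting and persist back to the lake,
# with no copy into a separate warehouse.
(spark.table("sales_raw")
      .groupBy("region")
      .agg({"amount": "sum"})
      .withColumnRenamed("sum(amount)", "total_amount")
      .write.format("delta")
      .mode("overwrite")
      .save("s3://example-bucket/sales/by_region"))
```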

Some solutions that have adopted the data lakehouse name do not offer this functionality. For example, one popular solution only allows users to interact with the data lake in a read-only fashion. In other words, transformations cannot be written back to the lake; their results have to be persisted in the warehouse. As a result, users have no choice but to store data in two places: the lake and the warehouse.

A lakehouse … is the extensible brain that powers your business

A data lakehouse must support data of all shapes and sizes: structured (relational data), semi-structured (XML data), and unstructured (documents, videos). A lakehouse must also support use cases at any stage in the analytics maturity curve. And, it needs the flexibility to produce outputs to support integrations with other technologies in the data, analytics, and AI ecosystems. That’s a lot, no doubt. But it is the lakehouse paradigm. Anything less is counter to the values we laid out previously—efficiency, security, and governance.

One platform delivers against this transformative vision: Databricks. At its core, the Databricks architecture has a few key components: a landing area in the data lake; metadata handling that is separate from compute; and the ability to select the best architectural components, as if they were building blocks. This logical breakout of components is what enables Databricks to work its magic. Let’s explore each area.

LANDING DATA IN DATABRICKS

While Databricks can acquire data from transactional systems directly with code, it also lets dedicated data acquisition tools become part of the lakehouse architecture. Options include Fivetran, Apache Airflow, Matillion, and other tools native to your company’s cloud(s). These technologies integrate seamlessly with the Databricks Lakehouse Platform, reducing the burdens of API management and Change Data Capture.
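As one hedged illustration of the code-based route, the sketch below pulls a table from a transactional database over JDBC and lands it in the lake as Delta. The connection details, secret scope, table, and paths are hypothetical, and `spark` and `dbutils` are assumed to come from a Databricks notebook.

```python
# Hypothetical JDBC pull from a transactional system, landed as Delta in the lake.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db.example.com:5432/shop")
          .option("dbtable", "public.orders")
          .option("user", dbutils.secrets.get("jdbc", "user"))
          .option("password", dbutils.secrets.get("jdbc", "password"))
          .load())

# Append the pulled rows to a landing area in the data lake.
(orders.write.format("delta")
       .mode("append")
       .save("s3://example-bucket/landing/orders"))
```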

METADATA CAPABILITIES IN THE LAKEHOUSE

In Databricks, you can easily convert your data into Delta format with a single line of code. Delta Lake is entirely open source and stores data as Parquet files with Snappy compression, striking a balance of portability, cost, and performance by default. Metadata can also be taken to a much higher scope: Unity Catalog brings lineage and governance into the lake, unifying the metastore across your files, tables, dashboards, and models.
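As a sketch of that one-liner (the path below is hypothetical), an existing Parquet directory can be converted to Delta in place:

```python
# Convert an existing Parquet directory to Delta in place; the data files are
# not rewritten, a Delta transaction log is simply added alongside them.
spark.sql("CONVERT TO DELTA parquet.`s3://example-bucket/events/`")
```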

CHOOSE THE BEST BUILDING BLOCKS

Transformations and workloads (batch and streaming alike) can be written in multiple languages, including SQL, Python, Scala, and R. Furthermore, Databricks supports modern transformation frameworks such as dbt. All of this gives data engineers a range of options for performing tasks, helping them feel at home quickly and onboard faster.
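As a small, hypothetical illustration of that flexibility, the same aggregation can be expressed through the DataFrame API or as SQL, against the same tables on the same engine:

```python
# Same hypothetical aggregation, two styles, one engine.

# PySpark DataFrame API
daily = (spark.table("sales_raw")
              .groupBy("order_date")
              .sum("amount"))

# Spark SQL
daily_sql = spark.sql("""
    SELECT order_date, SUM(amount) AS total_amount
    FROM sales_raw
    GROUP BY order_date
""")
```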

Databricks also offers a wide selection of compute types and sizes for transformations and workflows, each with transparent reporting on utilization. Moreover, capabilities such as autoscaling help ensure your organization uses the minimum number of resources necessary to complete a particular job.
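For a sense of what autoscaling looks like in practice, here is a hypothetical cluster specification of the sort you might submit through the Databricks Clusters API or SDK; the cluster name, runtime version, node type, and worker counts are illustrative assumptions.

```python
# Hypothetical autoscaling cluster spec: the cluster scales between
# 2 and 8 workers depending on load.
cluster_spec = {
    "cluster_name": "nightly-etl",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {
        "min_workers": 2,
        "max_workers": 8,
    },
}
```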

Finally, Databricks can prep data and train machine learning models through multiple methods. You can use AutoML to quickly generate baseline models, or bring your own models built on packages such as scikit-learn, MLlib, or TensorFlow. With just a few clicks, you can take a model from development to production, unlocking predictive and prescriptive use cases.
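As a hedged sketch of the AutoML route, the `databricks.automl` Python API can produce a baseline classifier from a table; the feature table and target column here are hypothetical.

```python
from databricks import automl

# Hypothetical feature table and target column. AutoML explores candidate
# models, logs each trial to MLflow, and returns a summary of the best run.
summary = automl.classify(
    dataset=spark.table("churn_features"),
    target_col="churned",
    timeout_minutes=30,
)
print(summary.best_trial.model_path)
```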

A lakehouse … accelerates your innovation

A lakehouse is, by definition, innovative and evolutionary. Let’s go back to the top of this blog. Organizations need a data warehouse and a data lake to support the breadth of use cases demanded by today’s hyper-competitive, constantly changing business environment. Data warehouses are great for business intelligence, explaining what’s happening at any given moment in your business. Data lakes are needed for advanced analytics, predicting what will happen next and prescribing an action to take. Bringing the data warehouse and the data lake together into a single platform is innovative, and many Databricks Lakehouse Platform adopters are already seeing measurable results.

Equally innovative is the notion of driving collaboration across data, analytics, and business teams within a single platform. Transformations in the lake can be picked up by any third-party tool, and the data itself can be read by third-party tools as long as they support the Delta format. Even better, Databricks Unity Catalog offers enhanced governance and data asset management, and Delta Sharing lets you share live feeds of your data with other platforms.
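To illustrate the receiving end of Delta Sharing, here is a minimal sketch using the open source `delta-sharing` Python connector; the profile file and the share, schema, and table names are hypothetical.

```python
import delta_sharing

# Profile file issued by the data provider, plus share.schema.table coordinates.
table_url = "config.share#retail_share.sales.daily_revenue"

# Load the live, governed share directly into a pandas DataFrame.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```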

To speed up data ingestion and schema management, Databricks can natively inherit the schema from Parquet files in one line of code. As a result, users can quickly add new data sources and automate the process of incorporating more.
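As a minimal sketch (with a hypothetical path), that schema inheritance looks like this; the schema is read from the Parquet file footers, with no DDL required:

```python
# Schema is inferred directly from the Parquet file footers; no DDL needed.
events = spark.read.parquet("s3://example-bucket/landing/events/")
events.printSchema()
```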

With Databricks, a novice sees simplicity when looking from the outside in, while an expert can dive into the details and make finely tuned performance adjustments for their workloads. This is due to the many out-of-the-box optimizations enabled by default in newer runtimes built on Spark 3.x. So, whether or not someone digs into execution plans, Adaptive Query Execution is working behind the scenes to handle unpredictable data and situations such as skewed or otherwise complex joins.
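As a small illustration, Adaptive Query Execution is governed by ordinary Spark configuration, so you can confirm or toggle it without touching your queries:

```python
# Check whether Adaptive Query Execution and its skew-join handling are on;
# both are enabled by default on recent Databricks runtimes.
print(spark.conf.get("spark.sql.adaptive.enabled"))
print(spark.conf.get("spark.sql.adaptive.skewJoin.enabled"))
```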

Don’t give the “data lakehouse” the Kleenex treatment

In the 1920s, a consumer product hit the streets: Kleenex. Over the years, the Kleenex name would become genericized, used to describe any facial tissue regardless of brand.

Is the data lakehouse being given the Kleenex treatment? We certainly hope not. A data lakehouse has specific capabilities that enable organizations to bring together the best of the data warehouse and the data lake. The Databricks Lakehouse Platform is the only solution that checks all of these boxes, which is why we’re bullish and building a long-term strategy, for our business and your own, with Databricks at the core.


About Lovelytics

Lovelytics is a data and analytics consultancy. We partner with people, teams, and organizations to deliver services and build solutions that drive outcomes while promoting self-sufficiency through hands-on enablement. At Lovelytics, we work across the entire data lifecycle, including data and analytics strategy, full-scale implementations, data engineering, data visualization, advanced analytics (AI/ML), managed products and services, training, and custom development.
