I have previously described how data as a product was initially closely aligned with data mesh, a cultural and organizational approach to distributed data processing. As a result of data mesh’s association with distributed data, many assumed that the concept was diametrically opposed to the data lake, which offered a platform for combining large volumes of data from multiple data sources. That assumption was always misguided: There was never any reason why data lakes could not be used as a data persistence and processing platform within a data mesh environment. In recent years, data as a product has gained momentum outside the context of data mesh, while data lakes have evolved into data lakehouses. It has become increasingly clear that data lakehouses and data as a product are well matched, as the data intelligence cataloging capabilities of a lakehouse environment can serve as the foundation for the development, sharing and management of data as a product.
The concept of the data lakehouse has become so ubiquitous that it is easy to forget that just a few years ago, it was closely associated with only a few providers and derided by many others, not least established purveyors of data warehouses. Today, many of those early naysayers have not only dropped their objections but have actively aligned with data lakehouse architecture to enable the unification of data from multiple sources in support of various workloads, including analytics and artificial intelligence (AI). Potential adopters today would be hard-pressed to find an analytic data platform provider that does not claim at least coexistence with the data lakehouse.
Several factors influenced this change of perspective. The first is the widespread adoption of cloud object storage and open file formats, which fundamentally altered the economics of storing and processing large volumes of data by decoupling low-cost, scalable storage from the compute engines used to process it.
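To make that shift concrete, here is a minimal sketch of the underlying pattern: writing data in an open file format (Parquet) directly to cloud object storage, where it can later be read by any number of processing engines. The bucket, path and column names are hypothetical, and credentials are assumed to come from the environment.

```python
# Minimal sketch: persisting data in an open file format (Parquet) on cloud
# object storage. Bucket, path and columns are hypothetical examples.
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# A small batch of raw records
orders = pa.table({
    "order_id": [1001, 1002, 1003],
    "amount": [25.50, 99.99, 14.00],
})

# S3FileSystem reads credentials from the standard environment variables;
# object storage holds the data independently of any compute engine
s3 = fs.S3FileSystem(region="us-east-1")

pq.write_table(orders, "example-bucket/raw/orders.parquet", filesystem=s3)
```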
The second factor that altered the perception of the data lakehouse was the popularization of open table formats, which provide the consistency and reliability guarantees required for business-critical data as well as interoperability with multiple data processing engines.
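Taking Apache Iceberg (discussed further below) as one example, the sketch that follows assumes a Spark session already configured with the Iceberg runtime and a catalog named lakehouse; the schema and table names are hypothetical. It illustrates the guarantees a table format layers on top of raw files: atomic table creation, metadata-only schema evolution and snapshot-based reads.

```python
# Rough sketch, assuming Spark is already configured with the Iceberg runtime
# and a catalog named "lakehouse"; schema and table names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("open-table-format-sketch").getOrCreate()

# Creating an Iceberg table: writes are atomic and readers always see a
# consistent snapshot, unlike loose Parquet files on object storage
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.sales.orders (
        order_id BIGINT,
        amount   DOUBLE,
        order_ts TIMESTAMP
    ) USING iceberg
""")

# Schema evolution is a metadata-only change; existing data files are untouched
spark.sql("ALTER TABLE lakehouse.sales.orders ADD COLUMNS (region STRING)")

# Every committed write produces a snapshot, which is the basis for
# time travel and rollback
spark.sql("SELECT snapshot_id, committed_at FROM lakehouse.sales.orders.snapshots").show()
```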
Of the three primary open table formats (Apache Hudi, Apache Iceberg and Delta Lake), Apache Iceberg has seen a significant uptick in community and provider support in recent years. One of the key features that encouraged data platform providers to coalesce around Apache Iceberg is the Iceberg REST Catalog, which provides a common API for interacting with any compatible Iceberg implementation. This facilitates governance and access controls across the diverse processing platforms and query engines that implement Iceberg. Interoperability via standard catalog APIs across the growing ecosystem of tools supporting Iceberg enables enterprises to use multiple data engines to access and query data in a data lakehouse. These include not only query engines such as Apache Spark, Trino and Presto but also analytic databases offered by data warehousing providers.
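The sketch below shows how a Spark session might be pointed at an Iceberg REST Catalog. The catalog name, endpoint URL and warehouse path are hypothetical; the point is that any engine implementing the same REST Catalog API can be configured against the same endpoint and resolve the same tables, subject to the governance and access controls enforced by the catalog.

```python
# Rough sketch: registering an Iceberg REST Catalog with Spark. The endpoint,
# warehouse path and catalog name are hypothetical, and the Iceberg runtime
# jars are assumed to be on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-rest-catalog-sketch")
    # Register an Iceberg catalog named "lakehouse" backed by a REST catalog service
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "https://catalog.example.com/api")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

# Any engine configured against the same REST endpoint sees the same tables
spark.sql("SELECT * FROM lakehouse.sales.orders LIMIT 10").show()
```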
The third factor that changed the perception of the data lakehouse was the widespread adoption of the medallion architecture design pattern. This involves using data operations pipelines to transform and refine data through three stages: bronze tables for raw ingested data; silver tables for cleansed, enriched and normalized data; and gold tables for curated data suitable for serving domain-oriented business requirements. When combined with the application of product thinking to data initiatives, the medallion pattern culminates in the delivery of data products suitable for sharing and consumption by business users, data scientists and agentic AI. The delivery of trusted data products has been facilitated by incorporating into the data lakehouse a metadata catalog layer that provides data intelligence capabilities, supporting a unified view of data in the lakehouse environment as well as the identification and management of sensitive data, data lineage, auditing and access controls. The ability to support the delivery of data products is likely to become an increasingly important consideration for enterprises evaluating data lakehouse providers.
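As a rough illustration of the medallion pattern, the following PySpark pipeline refines a hypothetical bronze table into silver and gold tables; the catalog, table and column names are illustrative rather than drawn from any particular product.

```python
# Rough sketch of a bronze -> silver -> gold pipeline; all table and column
# names are hypothetical and the "lakehouse" catalog is assumed to exist.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: raw ingested records, retained as-is for auditability and replay
bronze = spark.read.table("lakehouse.bronze.orders_raw")

# Silver: cleansed and normalized - deduplicate, enforce types, drop bad rows
silver = (
    bronze.dropDuplicates(["order_id"])
    .withColumn("amount", F.col("amount").cast("double"))
    .filter(F.col("amount") > 0)
)
silver.writeTo("lakehouse.silver.orders").createOrReplace()

# Gold: a curated, domain-oriented table ready to be shared as a data product
gold = (
    spark.read.table("lakehouse.silver.orders")
    .groupBy("region")
    .agg(
        F.sum("amount").alias("total_revenue"),
        F.count("*").alias("order_count"),
    )
)
gold.writeTo("lakehouse.gold.revenue_by_region").createOrReplace()
```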
By nature, a data lakehouse is a complex environment. The breadth of capabilities spans multiple ISG Buyers Guides, including Analytic Data Platforms, Data Intelligence, Data Governance, Data Operations and Data Products. As such, considerable skill and resources are required to configure and maintain a data lakehouse, addressing a combination of ingestion, pipeline management, table optimization, data governance and lineage, catalog integrations and query tuning across a variety of query engines. I recommend that enterprises investigating the data lakehouse approach evaluate potential providers based on not only core data processing functionality but also the availability of additional tooling that facilitates data operations pipelines and the delivery of data as a product.
Regards,
Matt Aslett