I have previously described how data as a product was initially closely aligned with data mesh, a cultural and organizational approach to distributed data processing. As a result of data mesh’s association with distributed data, many assumed that the concept was diametrically opposed to the data lake, which offered a platform for combining large volumes of data from multiple data sources. That assumption was always misguided: There was never any reason why data lakes could not be used as a data persistence and processing platform within a data mesh environment. In recent years, data as a product has gained momentum outside the context of data mesh, while data lakes have evolved into data lakehouses. It has become increasingly clear that data lakehouses and data as a product are well matched, as the data intelligence cataloging capabilities of a lakehouse environment can serve as the foundation for the development, sharing and management of data as a product.
The concept of the data lakehouse has become so ubiquitous that it is easy to forget that just a few years ago, it was closely associated with only a few providers and derided by many others, not least established purveyors of data warehouses. Today, many of those early naysayers have not only dropped their objections but have actively aligned with data lakehouse architecture to enable the unification of data from multiple sources in support of various workloads, including analytics and artificial intelligence (AI). Potential adopters today would be hard-pressed to find an analytic data platform provider that does not claim at least coexistence with the data lakehouse.
Several factors influenced this change of perspective. The first is the widespread adoption of cloud object storage and open file formats, which fundamentally altered the economics of storing and processing large volumes of data by decoupling low-cost, scalable storage from the compute engines used to process it.
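To make that shift concrete, here is a minimal sketch of the underlying pattern: writing data in an open file format (Parquet) directly to cloud object storage, where it can later be read by any number of processing engines. The bucket, path and column names are hypothetical, and credentials are assumed to come from the environment.

```python
# Minimal sketch: persisting data in an open file format (Parquet) on cloud
# object storage. Bucket, path and columns are hypothetical examples.
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# A small batch of raw records
orders = pa.table({
    "order_id": [1001, 1002, 1003],
    "amount": [25.50, 99.99, 14.00],
})

# S3FileSystem reads credentials from the standard environment variables;
# object storage holds the data independently of any compute engine
s3 = fs.S3FileSystem(region="us-east-1")

pq.write_table(orders, "example-bucket/raw/orders.parquet", filesystem=s3)
```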
The second factor that altered the perception of the data lakehouse was the popularization of open table formats, which provide the consistency and reliability guarantees required for business-critical data as well as interoperability with multiple data processing engines.
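Taking Apache Iceberg (discussed further below) as one example, the sketch that follows assumes a Spark session already configured with the Iceberg runtime and a catalog named lakehouse; the schema and table names are hypothetical. It illustrates the guarantees a table format layers on top of raw files: atomic table creation, metadata-only schema evolution and snapshot-based reads.

```python
# Rough sketch, assuming Spark is already configured with the Iceberg runtime
# and a catalog named "lakehouse"; schema and table names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("open-table-format-sketch").getOrCreate()

# Creating an Iceberg table: writes are atomic and readers always see a
# consistent snapshot, unlike loose Parquet files on object storage
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.sales.orders (
        order_id BIGINT,
        amount   DOUBLE,
        order_ts TIMESTAMP
    ) USING iceberg
""")

# Schema evolution is a metadata-only change; existing data files are untouched
spark.sql("ALTER TABLE lakehouse.sales.orders ADD COLUMNS (region STRING)")

# Every committed write produces a snapshot, which is the basis for
# time travel and rollback
spark.sql("SELECT snapshot_id, committed_at FROM lakehouse.sales.orders.snapshots").show()
```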
Of the three primary open table formats (Apache Hudi, Apache Iceberg and Delta Lake), Apache Iceberg has seen a significant uptick in community and provider support in recent years. One of the key features that encouraged data platform providers to coalesce around Apache Iceberg is the Iceberg REST Catalog, which provides a common API for interacting with any compatible Iceberg implementation. This facilitates governance and access controls across the diverse processing platforms and query engines that implement Iceberg. Interoperability via standard catalog APIs across the growing ecosystem of tools supporting Iceberg enables enterprises to use multiple data engines to access and query data in a data lakehouse. These include not only query engines such as Apache Spark, Trino and Presto but also analytic databases offered by data warehousing providers.
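The sketch below shows how a Spark session might be pointed at an Iceberg REST Catalog. The catalog name, endpoint URL and warehouse path are hypothetical; the point is that any engine implementing the same REST Catalog API can be configured against the same endpoint and resolve the same tables, subject to the governance and access controls enforced by the catalog.

```python
# Rough sketch: registering an Iceberg REST Catalog with Spark. The endpoint,
# warehouse path and catalog name are hypothetical, and the Iceberg runtime
# jars are assumed to be on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-rest-catalog-sketch")
    # Register an Iceberg catalog named "lakehouse" backed by a REST catalog service
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "https://catalog.example.com/api")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

# Any engine configured against the same REST endpoint sees the same tables
spark.sql("SELECT * FROM lakehouse.sales.orders LIMIT 10").show()
```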
The third factor that changed the perception of the data lakehouse was the widespread adoption of the medallion architecture design pattern. This involves using data operations pipelines to transform and refine data through three stages: bronze tables for raw ingested data; silver tables for cleansed, enriched and normalized data; and gold tables for curated data suitable for serving domain-oriented business requirements. When combined with the application of product thinking to data initiatives, the medallion pattern culminates in the delivery of data products suitable for sharing and consumption by business users, data scientists and agentic AI. The delivery of trusted data products has been facilitated by incorporating into the data lakehouse a metadata catalog layer that provides data intelligence capabilities, supporting a unified view of data in the lakehouse environment as well as the identification and management of sensitive data, data lineage, auditing and access controls. The ability to support the delivery of data products is likely to become an increasingly important consideration for enterprises evaluating data lakehouse providers.
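As a rough illustration of the medallion pattern, the following PySpark pipeline refines a hypothetical bronze table into silver and gold tables; the catalog, table and column names are illustrative rather than drawn from any particular product.

```python
# Rough sketch of a bronze -> silver -> gold pipeline; all table and column
# names are hypothetical and the "lakehouse" catalog is assumed to exist.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: raw ingested records, retained as-is for auditability and replay
bronze = spark.read.table("lakehouse.bronze.orders_raw")

# Silver: cleansed and normalized - deduplicate, enforce types, drop bad rows
silver = (
    bronze.dropDuplicates(["order_id"])
    .withColumn("amount", F.col("amount").cast("double"))
    .filter(F.col("amount") > 0)
)
silver.writeTo("lakehouse.silver.orders").createOrReplace()

# Gold: a curated, domain-oriented table ready to be shared as a data product
gold = (
    spark.read.table("lakehouse.silver.orders")
    .groupBy("region")
    .agg(
        F.sum("amount").alias("total_revenue"),
        F.count("*").alias("order_count"),
    )
)
gold.writeTo("lakehouse.gold.revenue_by_region").createOrReplace()
```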
By nature, a data lakehouse is a complex environment. The breadth of capabilities spans multiple ISG Buyers Guides, including Analytic Data Platforms, Data Intelligence, Data Governance, Data Operations and Data Products. As such, considerable skill and resources are required to configure and maintain a data lakehouse, addressing a combination of ingestion, pipeline management, table optimization, data governance and lineage, catalog integrations and query tuning across a variety of query engines. I recommend that enterprises investigating the data lakehouse approach evaluate potential providers based on not only core data processing functionality but also the availability of additional tooling that facilitates data operations pipelines and the delivery of data as a product.
Regards,
Matt Aslett