The Data Lakehouse is a data management architecture that combines the advantages of a data warehouse with those of a data lake. areto offers Data Lakehouse reference architectures based on Microsoft Azure, AWS, and Databricks.
A data lakehouse is an open data management architecture that combines the advantages of a data lake with those of a data warehouse. It combines the high flexibility, scalability, and cost efficiency of a data lake with the data management and ACID (Atomicity, Consistency, Isolation, and Durability) transactions of a data warehouse. This enables faster access for data teams, as well as simplifying the integration of business intelligence tools and machine learning on a platform’s existing data.
The first use of the term data lakehouse is attributed to the company Jellyvision in 2017; AWS then used the name to describe its own data and analytics services. The architecture has gained traction since 2017 because data warehouses often reach their limits with growing data volumes and cannot be scaled flexibly. Since then, data lakehouses have reshaped the industry, meeting the high demand for flexible infrastructures that combine speed with operational efficiency.
A data warehouse collects structured data, usually in table form. The data model must be defined in advance so that the architecture can be adapted to the specific requirements of the company.
Data warehouses offer high robustness and easy data maintenance. They follow the ETL model and are particularly suitable for business analysts, KPI reporting, and analytical reuse. They consist of several layers: staging, storage, data marts, and serving. Data warehouses are available on-premises, hybrid, and in the cloud, and offer a comprehensive solution for Big Data.
A data lake takes in data from different sources and at different processing stages and stores it in its raw, unstructured format. This results in a flat data hierarchy. The raw data retains its full information value, since nothing is discarded through upfront processing.
Data can be stored regardless of whether, how, or when it will be needed later, and independently of the storage location. In this respect, the data lake follows the ELT model. Central storage democratizes the data, so data silos cannot arise. Data lakes also offer high scalability and can be operated on-premises, hybrid, or in the cloud.
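To make the ELT pattern concrete, here is a minimal PySpark sketch: data is first loaded into the lake unchanged and only transformed when a consumer actually needs a curated view. The bucket paths and column names are illustrative assumptions, not part of a specific areto setup.

```python
# Minimal ELT sketch: land raw data first, transform later (paths are illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# 1. Extract & Load: copy source data into the lake unchanged, in its raw format.
raw_orders = spark.read.json("s3a://example-landing/orders/2024/*.json")
raw_orders.write.mode("append").parquet("s3a://example-lake/raw/orders/")

# 2. Transform on demand, only when a consumer needs a curated view.
curated = (
    spark.read.parquet("s3a://example-lake/raw/orders/")
    .filter(F.col("status") == "completed")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_spent"))
)
curated.write.mode("overwrite").parquet("s3a://example-lake/curated/customer_totals/")
```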
Companies in today’s competitive environment cannot afford to wait long for data or analysis results, or to work with incomplete data. Nor can they regularly adapt their data architecture to changing requirements. Managing data across various tools demands ever more resources, and the amount of data keeps growing.
A data lakehouse combines the advantages of a data lake with those of a data warehouse. It thus offers data-driven companies a clear advantage and a secure foundation for growing data volumes and the analysis tools of the future. The data infrastructure in companies is simplified and innovation is encouraged, especially with regard to the needs of machine learning and artificial intelligence. Reliability, performance, and data quality do not have to be sacrificed. All data is unified, cleansed as needed, processed, and available at a single point of truth for all data teams as well as management.
The model uses cloud elasticity and can therefore be scaled cost-efficiently without abandoning the existing architecture.
Open standards allow data to be stored and processed independently of tools.
Different data teams can access and use the data: data scientists, data analysts, and data engineers work on one platform.
The model can be integrated into your own environment via a variety of interfaces.
Fine-grained access controls ensure compliance requirements are met.
ACID transactions log all operations, making errors easy to detect, and historical data is retained automatically.
Distributed processing power (e.g. Apache Spark) ensures that no speed is lost.
The ability to separate compute and storage resources makes it easy to scale storage as needed.
Data lakehouses build on existing data lakes, which often contain more than 90% of the data in the enterprise, and extend them with traditional data warehouse functions.
The basic layer of a data lakehouse is based on a data lake. This requires a low-cost (multi-)cloud object store such as Microsoft Azure Blob Storage or AWS S3, holding data in a standard file format (e.g. Apache Parquet). Structured, semi-structured, and unstructured data can be stored there.
This is followed by the transactional metadata and governance layer, which enables ACID transactions directly on the data layer. Data pipeline tools such as Apache Spark, Kafka, and Azure services prepare the raw data in this layer and provide efficient, secure data transfer. Powerful software in the serve layer, such as Snowflake, prepares the data and makes it available.
The data can then be used for analytics, business intelligence, and machine learning via many interfaces, with solutions such as Power BI, Tableau, R, and Python. These present the data graphically and help your company use its data even more efficiently.
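The following PySpark sketch illustrates this layer flow, using Delta Lake as one possible transactional metadata layer on top of the object store; the bucket paths and table names are illustrative assumptions.

```python
# Sketch of the layer flow: raw files in object storage -> transactional table -> BI query.
# Requires the delta-spark package; paths and names are illustrative.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-layers")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Storage layer: raw Parquet files in a low-cost object store.
events = spark.read.parquet("s3a://example-lake/raw/events/")

# Metadata/governance layer: write as a Delta table to gain ACID transactions.
events.write.format("delta").mode("overwrite").save("s3a://example-lake/delta/events")

# Serve layer: expose the table so SQL-based BI tools can query it.
spark.sql("CREATE TABLE IF NOT EXISTS events USING DELTA "
          "LOCATION 's3a://example-lake/delta/events'")
spark.sql("SELECT event_type, count(*) AS n FROM events GROUP BY event_type").show()
```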
Cost-efficient storage
Data lake storage on cost-effective object stores such as Amazon S3 or Microsoft Azure Blob Storage.
Performance optimization
Optimization techniques such as caching, multi-dimensional clustering, and data skipping based on data statistics and summaries can be used to minimize the amount of data that has to be read.
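As a hedged illustration, the sketch below shows two of these techniques on a Delta table: caching a hot subset and Z-ordering to enable data skipping. It assumes the Delta-enabled Spark session from the earlier sketch; OPTIMIZE/ZORDER requires a recent Delta Lake release or a Databricks runtime, and the table path and columns are illustrative.

```python
# Two optimization techniques on a Delta table (assumes the Delta-enabled `spark`
# session from the earlier sketch; path and column names are illustrative).
df = spark.read.format("delta").load("s3a://example-lake/delta/events")

# Caching: keep a hot, frequently queried subset in memory.
recent = df.filter("event_date >= '2024-01-01'").cache()
recent.count()  # materializes the cache

# Multi-dimensional clustering + data skipping: co-locate rows by common filter
# columns so file-level statistics let the engine skip irrelevant files.
spark.sql(
    "OPTIMIZE delta.`s3a://example-lake/delta/events` "
    "ZORDER BY (customer_id, event_date)"
)
```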
Open standard formats
An open API provides direct data access without vendor lock-in, with language support for Python and R libraries.
Reliability and Quality
Reliability and quality in the data lake through ACID transaction support, with SQL for concurrent reads and writes by multiple parties. Schema support, as in data warehouse architectures such as Snowflake, enables robust governance and auditing mechanisms.
Addition of governance and security controls
DML support through Java, Python, and SQL for updating and deleting records. The data history captures all changes and provides a complete audit trail. In addition, data snapshots allow developers to access previous versions of the data for audits or for reproducing experiments. Role-based access control allows fine-grained permissions down to row and column level.
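A minimal sketch of these capabilities with the Delta Lake Python API, again assuming the Delta-enabled session from the earlier sketches; the table path and column names are illustrative.

```python
# DML and time travel on a Delta table (assumes the Delta-enabled `spark` session;
# table path and columns are illustrative).
from delta.tables import DeltaTable

tbl = DeltaTable.forPath(spark, "s3a://example-lake/delta/events")

# Update and delete individual records (DML).
tbl.update(condition="event_type = 'click'", set={"event_type": "'page_click'"})
tbl.delete("customer_id IS NULL")

# Every change is recorded in the transaction log -> full audit trail.
tbl.history().select("version", "timestamp", "operation").show()

# Time travel: read the table as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0) \
    .load("s3a://example-lake/delta/events")
```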
Support for Machine Learning
Support for a wide variety of data types for storage, analysis, and access, including images, video, audio, and semi-structured files. Large amounts of data can also be read efficiently, for example with Python or R libraries. In addition, data access in ML workloads is supported by an integrated DataFrame API (Application Programming Interface).
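The sketch below illustrates such access patterns with the Spark DataFrame API, reading a curated table and raw binary files side by side; the paths and column names are illustrative assumptions.

```python
# ML-oriented access patterns (assumes the Delta-enabled `spark` session;
# paths and columns are illustrative).
clicks = spark.read.format("delta").load("s3a://example-lake/delta/events")
images = spark.read.format("binaryFile").load("s3a://example-lake/raw/images/*.jpg")

# The DataFrame API feeds ML workloads directly, e.g. building a feature table ...
features = clicks.groupBy("customer_id").count().withColumnRenamed("count", "n_events")

# ... or handing a manageable subset to Python libraries such as pandas/scikit-learn.
pdf = features.limit(100_000).toPandas()
```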
Databricks’ Delta Lake is an open source storage layer that enables the creation of a lakehouse architecture based on a data lake. This provides reliability, security and performance on the data lake.
The Delta Lake layer provides ACID transactions, schema enforcement, and scalable metadata handling, enabling both streaming and batch operations. The data is stored in Apache Parquet format so that it can be read and processed by any compatible program.
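As a sketch of what this looks like in practice, the example below writes to the same Delta table in both batch and streaming mode; the schema and paths are illustrative assumptions, and it reuses the Delta-enabled Spark session from the earlier sketches.

```python
# Batch and streaming writes to the same Delta table, with schema enforcement
# (assumes the Delta-enabled `spark` session; schema and paths are illustrative).
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("value", DoubleType()),
])

# Batch write: Delta rejects appends whose schema does not match the table (schema enforcement).
batch = spark.createDataFrame([("s-1", 21.5)], schema)
batch.write.format("delta").mode("append").save("s3a://example-lake/delta/sensors")

# Streaming write into the very same table; the transaction log keeps the writes consistent.
stream = (
    spark.readStream.schema(schema).json("s3a://example-landing/sensors/")
    .writeStream.format("delta")
    .option("checkpointLocation", "s3a://example-lake/checkpoints/sensors")
    .start("s3a://example-lake/delta/sensors")
)
```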
Combine data warehouse performance with data lake flexibility. The Databricks platform, built on the Lakehouse architecture, brings data warehouse quality and reliability to open, flexible data lakes. This simplified architecture provides a single environment for analytics, streaming data, data science, and machine learning.
The Azure Data Lakehouse reference architecture developed by areto offers many benefits.
Using areto’s Data Lakehouse Reference Architecture provides customers with best practices for developing and operating reliable, secure, efficient, and cost-effective systems. areto’s Azure Data Lakehouse architecture solutions are consistently measured against Microsoft best practices to deliver the highest value to customers.
The areto Azure reference architecture is based on five pillars: operational excellence, security, reliability, performance efficiency, cost optimization.
Operational Excellence
Optimal design of system operation and monitoring as well as continuous improvement of supporting processes and procedures.
Security
Protection of information, systems, and assets, including risk assessments and mitigation strategies.
Cost optimization
Maximizing ROI through the continuous process of improving the system throughout its lifecycle.
Reliability
Ensuring security and disaster recovery for business continuity, as data is mirrored across multiple redundant sites.
Performance efficiency
Efficient use of computing resources, scalability to meet short-term peaks in demand, future-proofing.
Unifying data and bundling it at a single point of truth makes it possible to run analytics and machine learning in one place, without additional architecture. By storing raw data and transforming it only when needed, no important information is lost. This fulfills the requirements for business intelligence.
Machine learning also requires large amounts of raw data that can be manipulated using open source tooling. The unstructured part of the data lakehouse therefore supports direct access to the raw data in different formats and supports both the ETL and the ELT model.
Become a data-driven company with areto Data Lakehouse experts!
Find out where your company currently stands on the way to becoming a data-driven company.
We analyze the status quo and show you which potentials are available.
How do you want to start?
Free consulting & demo appointments
Do you already have a strategy for your future data lakehouse solution? Are you already taking advantage of modern cloud platforms and automation? We would be happy to show you examples of how our customers are already using areto’s agile and scalable architecture solutions.
Workshops / Coachings
Our workshops and coaching sessions provide you with the necessary know-how to build a modern data lakehouse architecture. The areto Data Lakehouse TrainingCenter offers a wide range of learning content.
Proofs of Concept
Till Sander
CTO
Phone: +49 221 66 95 75-0
E-mail: till.sander@areto.de