The Data Lakehouse is a data management architecture that combines the advantages of a data warehouse with those of a data lake. areto offers Data Lakehouse reference architectures based on Microsoft Azure, AWS, and Databricks.
A data lakehouse is an open data management architecture that combines the advantages of a data lake with those of a data warehouse. It combines the high flexibility, scalability, and cost efficiency of a data lake with the data management and ACID (Atomicity, Consistency, Isolation, and Durability) transactions of a data warehouse. This enables faster access for data teams, as well as simplifying the integration of business intelligence tools and machine learning on a platform’s existing data.
The first use of the term data lakehouse is attributed to the company Jellyvision in 2017; AWS then used the name to describe its own data and analytics services. The architecture has gained traction since 2017 because data warehouses often reach their limits with growing data volumes and cannot be scaled flexibly. Since then, data lakehouses have reshaped the industry, meeting the high demand for flexible infrastructures that combine speed with operational efficiency.
A data warehouse collects structured data, usually in table form. The data model must be defined in advance so that the architecture can be adapted to the specific requirements of the company.
Data warehouses offer high robustness and easy data maintenance. They follow the ETL model and are particularly suitable for business analysts, KPI reporting, and analytical reuse. They consist of several layers: staging, storage, data marts, and serving. Data warehouses are available on-premises, hybrid, and in the cloud, and offer a comprehensive solution for Big Data.
A data lake takes in data from different sources and at different processing stages and stores it in its raw, unstructured format. This results in a flat data hierarchy. The raw data retains its full information value, since nothing is discarded through upfront processing.
Data can be stored regardless of whether, how, or when it will be needed later, and independently of the storage location. In this respect, the data lake follows the ELT model. Central storage democratizes the data, so data silos cannot arise. Data lakes also offer high scalability and can be operated on-premises, hybrid, or in the cloud.
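To make the ELT pattern concrete, here is a minimal PySpark sketch: data is first loaded into the lake unchanged and only transformed when a consumer actually needs a curated view. The bucket paths and column names are illustrative assumptions, not part of a specific areto setup.

```python
# Minimal ELT sketch: land raw data first, transform later (paths are illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# 1. Extract & Load: copy source data into the lake unchanged, in its raw format.
raw_orders = spark.read.json("s3a://example-landing/orders/2024/*.json")
raw_orders.write.mode("append").parquet("s3a://example-lake/raw/orders/")

# 2. Transform on demand, only when a consumer needs a curated view.
curated = (
    spark.read.parquet("s3a://example-lake/raw/orders/")
    .filter(F.col("status") == "completed")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_spent"))
)
curated.write.mode("overwrite").parquet("s3a://example-lake/curated/customer_totals/")
```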
Companies in today’s competitive environment cannot afford to wait long for data or analysis results, or to work with incomplete data. Nor can they regularly adapt their data architecture to changing requirements. Managing data across various tools demands ever more resources, and the amount of data keeps growing.
A data lakehouse combines the advantages of a data lake with those of a data warehouse. It thus offers data-driven companies a clear advantage and a secure foundation for growing data volumes and the analysis tools of the future. The data infrastructure in companies is simplified and innovation is encouraged, especially with regard to the needs of machine learning and artificial intelligence. Reliability, performance, and data quality do not have to be sacrificed. All data is unified, cleansed as needed, processed, and available at a single point of truth for all data teams as well as management.
The model uses cloud elasticity and can therefore be scaled cost-efficiently without abandoning the existing architecture.
Open standards allow data to be stored and processed independently of tools.
Different data teams can access and use the data: data scientists, data analysts, and data engineers work on one platform.
The model can be integrated into your own environment via a variety of interfaces.
Fine-grained access controls ensure compliance requirements are met.
ACID transactions log all operations, making errors easy to detect, and historical data is retained automatically.
Distributed processing power (e.g. Apache Spark) ensures that no speed is lost.
The ability to separate compute and storage resources makes it easy to scale storage as needed.
Data lakehouses build on existing data lakes, which often contain more than 90% of the data in the enterprise, and extend them with traditional data warehouse functions.
The basic layer of a data lakehouse is based on a data lake. This requires a low-cost (multi-)cloud object store such as Microsoft Azure Blob Storage or AWS S3, holding data in a standard file format (e.g. Apache Parquet). Structured, semi-structured, and unstructured data can be stored there.
This is followed by the transactional metadata and governance layer, which enables ACID transactions directly on the data layer. Data pipeline tools such as Apache Spark, Kafka, and Azure services prepare the raw data in this layer and provide efficient, secure data transfer. Powerful software in the serve layer, such as Snowflake, prepares the data and makes it available.
The data can then be used for analytics, business intelligence, and machine learning via many interfaces, with solutions such as Power BI, Tableau, R, and Python. These present the data graphically and help your company use its data even more efficiently.
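The following PySpark sketch illustrates this layer flow, using Delta Lake as one possible transactional metadata layer on top of the object store; the bucket paths and table names are illustrative assumptions.

```python
# Sketch of the layer flow: raw files in object storage -> transactional table -> BI query.
# Requires the delta-spark package; paths and names are illustrative.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-layers")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Storage layer: raw Parquet files in a low-cost object store.
events = spark.read.parquet("s3a://example-lake/raw/events/")

# Metadata/governance layer: write as a Delta table to gain ACID transactions.
events.write.format("delta").mode("overwrite").save("s3a://example-lake/delta/events")

# Serve layer: expose the table so SQL-based BI tools can query it.
spark.sql("CREATE TABLE IF NOT EXISTS events USING DELTA "
          "LOCATION 's3a://example-lake/delta/events'")
spark.sql("SELECT event_type, count(*) AS n FROM events GROUP BY event_type").show()
```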
Cost-efficient storage
Data lake storage on cost-effective object stores such as Amazon S3 or Microsoft Azure Blob Storage.
Performance optimization
Optimization techniques such as caching, multi-dimensional clustering, and data skipping based on data statistics and summaries can be used to minimize the amount of data that has to be read.
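As a hedged illustration, the sketch below shows two of these techniques on a Delta table: caching a hot subset and Z-ordering to enable data skipping. It assumes the Delta-enabled Spark session from the earlier sketch; OPTIMIZE/ZORDER requires a recent Delta Lake release or a Databricks runtime, and the table path and columns are illustrative.

```python
# Two optimization techniques on a Delta table (assumes the Delta-enabled `spark`
# session from the earlier sketch; path and column names are illustrative).
df = spark.read.format("delta").load("s3a://example-lake/delta/events")

# Caching: keep a hot, frequently queried subset in memory.
recent = df.filter("event_date >= '2024-01-01'").cache()
recent.count()  # materializes the cache

# Multi-dimensional clustering + data skipping: co-locate rows by common filter
# columns so file-level statistics let the engine skip irrelevant files.
spark.sql(
    "OPTIMIZE delta.`s3a://example-lake/delta/events` "
    "ZORDER BY (customer_id, event_date)"
)
```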
Open standard formats
An open API provides direct data access without vendor lock-in, with language support for Python and R libraries.
Reliability and Quality
Reliability and quality in the data lake through ACID transaction support, with SQL for concurrent reads and writes by multiple parties. Schema support, as in data warehouse architectures such as Snowflake, enables robust governance and auditing mechanisms.
Addition of governance and security controls
DML support through Java, Python, and SQL for updating and deleting records. The data history captures all changes and provides a complete audit trail. In addition, data snapshots allow developers to access previous versions of the data for audits or for reproducing experiments. Role-based access control allows fine-grained permissions down to row and column level.
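A minimal sketch of these capabilities with the Delta Lake Python API, again assuming the Delta-enabled session from the earlier sketches; the table path and column names are illustrative.

```python
# DML and time travel on a Delta table (assumes the Delta-enabled `spark` session;
# table path and columns are illustrative).
from delta.tables import DeltaTable

tbl = DeltaTable.forPath(spark, "s3a://example-lake/delta/events")

# Update and delete individual records (DML).
tbl.update(condition="event_type = 'click'", set={"event_type": "'page_click'"})
tbl.delete("customer_id IS NULL")

# Every change is recorded in the transaction log -> full audit trail.
tbl.history().select("version", "timestamp", "operation").show()

# Time travel: read the table as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0) \
    .load("s3a://example-lake/delta/events")
```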
Support for Machine Learning
Support for a wide variety of data types for storage, analysis, and access, including images, video, audio, and semi-structured files. Large amounts of data can also be read efficiently, for example with Python or R libraries. In addition, data access in ML workloads is supported by an integrated DataFrame API (Application Programming Interface).
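The sketch below illustrates such access patterns with the Spark DataFrame API, reading a curated table and raw binary files side by side; the paths and column names are illustrative assumptions.

```python
# ML-oriented access patterns (assumes the Delta-enabled `spark` session;
# paths and columns are illustrative).
clicks = spark.read.format("delta").load("s3a://example-lake/delta/events")
images = spark.read.format("binaryFile").load("s3a://example-lake/raw/images/*.jpg")

# The DataFrame API feeds ML workloads directly, e.g. building a feature table ...
features = clicks.groupBy("customer_id").count().withColumnRenamed("count", "n_events")

# ... or handing a manageable subset to Python libraries such as pandas/scikit-learn.
pdf = features.limit(100_000).toPandas()
```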
Databricks’ Delta Lake is an open source storage layer that enables the creation of a lakehouse architecture based on a data lake. This provides reliability, security and performance on the data lake.
The Delta Lake layer provides ACID transactions, schema enforcement, and scalable metadata handling, enabling both streaming and batch operations. The data is stored in Apache Parquet format so that it can be read and processed by any compatible program.
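As a sketch of what this looks like in practice, the example below writes to the same Delta table in both batch and streaming mode; the schema and paths are illustrative assumptions, and it reuses the Delta-enabled Spark session from the earlier sketches.

```python
# Batch and streaming writes to the same Delta table, with schema enforcement
# (assumes the Delta-enabled `spark` session; schema and paths are illustrative).
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("value", DoubleType()),
])

# Batch write: Delta rejects appends whose schema does not match the table (schema enforcement).
batch = spark.createDataFrame([("s-1", 21.5)], schema)
batch.write.format("delta").mode("append").save("s3a://example-lake/delta/sensors")

# Streaming write into the very same table; the transaction log keeps the writes consistent.
stream = (
    spark.readStream.schema(schema).json("s3a://example-landing/sensors/")
    .writeStream.format("delta")
    .option("checkpointLocation", "s3a://example-lake/checkpoints/sensors")
    .start("s3a://example-lake/delta/sensors")
)
```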
Combine data warehouse performance with data lake flexibility. The Databricks platform, built on the Lakehouse architecture, brings data warehouse quality and reliability to open, flexible data lakes. This simplified architecture provides a single environment for analytics, streaming data, data science, and machine learning.
The Azure Data Lakehouse reference architecture developed by areto offers many benefits.
Using areto’s Data Lakehouse Reference Architecture provides customers with best practices for developing and operating reliable, secure, efficient, and cost-effective systems. areto’s Azure Data Lakehouse architecture solutions are consistently measured against Microsoft best practices to deliver the highest value to customers.
The areto Azure reference architecture is based on five pillars: operational excellence, security, reliability, performance efficiency, cost optimization.
Operational Excellence
Optimal design of system operation and monitoring as well as continuous improvement of supporting processes and procedures.
Security
Protection of information, systems, and assets, including risk assessments and mitigation strategies.
Cost optimization
Maximizing ROI through the continuous process of improving the system throughout its lifecycle.
Reliability
Ensuring security and disaster recovery for business continuity, as data is mirrored across multiple redundant sites.
Performance efficiency
Efficient use of computing resources, scalability to meet short-term peaks in demand, future-proofing.
Unifying data and bundling it at a single point of truth makes it possible to run analytics and machine learning in one place, without additional architecture. By storing raw data and transforming it only when needed, no important information is lost. This fulfills the requirements for business intelligence.
Machine learning also requires large amounts of raw data that can be manipulated using open source tooling. The unstructured part of the data lakehouse therefore supports direct access to the raw data in different formats and supports both the ETL and the ELT model.
Become a data-driven company with areto Data Lakehouse experts!
Find out where your company currently stands on the way to becoming a data-driven company.
We analyze the status quo and show you which potentials are available.
How do you want to start?
Free consulting & demo appointments
Do you already have a strategy for your future data lakehouse solution? Are you already taking advantage of modern cloud platforms and automation? We would be happy to show you examples of how our customers are already using areto’s agile and scalable architecture solutions.
Workshops / Coachings
Our workshops and coaching sessions provide you with the necessary know-how to build a modern data lakehouse architecture. The areto Data Lakehouse TrainingCenter offers a wide range of learning content.
Proofs of Concept
Till Sander
CTO
Phone: +49 221 66 95 75-0
E-mail: till.sander@areto.de