Setting Up a Cloud-Native Lakehouse

A "warehouse" bucket was created for the Lakehouse

A sample dataset was produced for Redpanda

Spark was used to consume data from Redpanda and write it as an Iceberg table

Trino was used to query the Iceberg table and Redpanda topic

PostgreSQL was also queried using Trino

Sample dashboards were created on Superset using Trino queries

Prometheus and Grafana were configured to monitor the Kubernetes cluster
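
The Spark step above depends on a handful of settings that tie Spark, Iceberg, Nessie, and the "warehouse" bucket together. The following is a minimal sketch of what those settings could look like, not the configuration used in this project. The catalog name, endpoints, and topic name are hypothetical placeholders; the property keys are the standard Iceberg, Nessie, and Spark Kafka source options. Because Redpanda speaks the Kafka API, Spark's `kafka` source is assumed for reading the topic.

```python
"""Sketch: Spark settings for a Redpanda -> Iceberg job with a Nessie catalog.

All endpoint values and names passed in below are hypothetical placeholders.
"""


def iceberg_nessie_conf(catalog, nessie_uri, warehouse, s3_endpoint, ref="main"):
    """Return Spark properties for an Iceberg catalog tracked by Nessie."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg's SQL extensions in the Spark session.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the catalog and selects Nessie as its implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": nessie_uri,
        f"{prefix}.ref": ref,
        # Table data and metadata files live in the object storage bucket.
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.path-style-access": "true",
    }


def kafka_source_options(bootstrap_servers, topic):
    """Return reader options for Spark's Kafka source pointed at Redpanda."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def to_spark_submit_args(conf):
    """Flatten a property dict into repeated `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    conf = iceberg_nessie_conf(
        catalog="lakehouse",                                # hypothetical name
        nessie_uri="http://nessie.example.internal:19120/api/v1",  # hypothetical
        warehouse="s3://warehouse/",                        # the "warehouse" bucket
        s3_endpoint="http://ceph-rgw.example.internal",     # hypothetical
    )
    print(" ".join(to_spark_submit_args(conf)))
    print(kafka_source_options("redpanda.example.internal:9092", "sample-events"))
```

In a real job these dictionaries would be passed to the Spark session builder and to the streaming reader; the Iceberg runtime, Nessie, and Kafka connector packages would also need to be on the Spark classpath.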

Challenge

High costs associated with traditional data lake solutions
Vendor lock-in and limited flexibility
Migrating existing big data workloads to a new platform

Solution

We successfully built a Lakehouse using 100% open-source tools, achieving the following:

Kubernetes and Ceph Storage Cluster: Deployed and validated a Kubernetes and Ceph Storage Cluster using open-source tools.
Hadoop Cluster Replacement: Verified the functionality of the Lakehouse by running on it the existing big data workloads from the Hadoop cluster.
Open-Source Lakehouse: Demonstrated the feasibility of creating a Lakehouse with 100% open-source software, eliminating license costs.

Cost-effectiveness:

No upfront license fees or ongoing subscription costs

Vendor neutrality:

Avoid vendor lock-in and maintain control over your data

Open-source flexibility:

Freedom to choose the best tools for your specific needs

Future-proof:

Easily adapt to new technologies and changing requirements

CONCLUSION

This project demonstrates the viability of building a powerful and cost-effective Lakehouse using 100% open-source tools.

By leveraging open-source technologies, organizations can achieve greater flexibility, control, and cost savings while future-proofing their data infrastructure.

KEY COMPONENTS

Rancher Kubernetes Engine 2 (RKE2): Kubernetes distribution that simplifies cluster installation and operations
MetalLB: Provides load balancing for Kubernetes services
Rancher: Provides a centralized platform for Kubernetes cluster management
Ceph Storage (Rook): Offers object and block storage for the Lakehouse
Redpanda: Provides a real-time streaming platform
Spark (Operator): Facilitates batch and stream processing
Airflow: Orchestrates workflows and data pipelines
Argo CD: Provides GitOps-based continuous delivery (CD) for application deployment
Trino: Enables distributed data querying
Apache Superset: Visualizes data through interactive dashboards
Prometheus and Grafana: Collect and monitor metrics for the Lakehouse
Harbor: Serves as a private container image registry
Project Nessie: Manages catalog and metadata for the Lakehouse
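
Trino's role as a single query layer can be illustrated with a short sketch. Trino addresses every table as `catalog.schema.table`, which is what lets one query span the Iceberg table, the Redpanda topic, and PostgreSQL. The helper names, catalog names (`iceberg`, `postgresql`), schemas, tables, and join key below are all hypothetical; the code only builds SQL strings and does not connect to a cluster.

```python
"""Sketch: building Trino SQL that spans several catalogs.

Catalog, schema, table, and column names used here are hypothetical.
"""


def quote_ident(name):
    """Double-quote an identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def qualified(catalog, schema, table):
    """Return a fully qualified Trino table name: catalog.schema.table."""
    return ".".join(quote_ident(part) for part in (catalog, schema, table))


def row_count_sql(catalog, schema, table):
    """Return a query that counts the rows of one table."""
    return f"SELECT count(*) AS row_count FROM {qualified(catalog, schema, table)}"


def federated_join_sql(left, right, key):
    """Return a join across two tables that may live in different catalogs.

    `left` and `right` are (catalog, schema, table) tuples; `key` is a
    column assumed to exist in both tables.
    """
    k = quote_ident(key)
    return (
        f"SELECT l.*, r.* FROM {qualified(*left)} AS l "
        f"JOIN {qualified(*right)} AS r ON l.{k} = r.{k}"
    )


if __name__ == "__main__":
    events = ("iceberg", "demo", "events")           # hypothetical Iceberg table
    customers = ("postgresql", "public", "customers")  # hypothetical PostgreSQL table
    print(row_count_sql(*events))
    print(federated_join_sql(events, customers, "customer_id"))
```

Queries built this way could be saved in Superset as datasets, so that dashboards read from the same federated view.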

Partner with Us

Contact us today to discuss your specific needs and explore how we can empower your business with insightful, cost-effective data solutions.

Unleash the power of your data with a custom-built, open-source big data platform.
