Setting Up a Cloud-Native Lakehouse

A "warehouse" bucket was created for the Lakehouse
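As a rough illustration of this step, the sketch below creates the bucket through an S3-compatible API using boto3. It assumes the bucket lives on the object gateway of the Ceph cluster described under Solution; the endpoint URL, the credentials, and the helper names `make_s3_client` and `ensure_bucket` are hypothetical and not taken from the project.

```python
def make_s3_client(endpoint_url, access_key, secret_key):
    """Build a boto3 client for an S3-compatible endpoint (assumed: Ceph object gateway)."""
    import boto3  # imported lazily so ensure_bucket can be exercised without boto3

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )


def ensure_bucket(client, name="warehouse"):
    """Create the bucket if it is missing; return True only when it was created."""
    existing = {b["Name"] for b in client.list_buckets().get("Buckets", [])}
    if name in existing:
        return False
    client.create_bucket(Bucket=name)
    return True


# Usage with a hypothetical endpoint and placeholder credentials:
#   s3 = make_s3_client("http://ceph-rgw.example.internal:8080", "ACCESS_KEY", "SECRET_KEY")
#   ensure_bucket(s3)
```

Checking for the bucket first keeps the helper safe to run repeatedly, which is convenient when the setup is scripted.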

Spark was used to consume data from Redpanda and write it as an Iceberg table
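A minimal sketch of what such a Spark job could look like follows. Redpanda exposes the Kafka API, so Spark's Kafka source can read from it. Everything specific here is an assumption rather than the project's actual setup: the catalog name `lake`, the Hive metastore catalog type and its URI, the endpoints, the topic and table names, and the choice of a batch read instead of a streaming one. The job also needs the Iceberg Spark runtime and the Spark Kafka connector on the classpath.

```python
def iceberg_conf(catalog="lake",
                 metastore_uri="thrift://hive-metastore.example.internal:9083",
                 warehouse="s3a://warehouse/",
                 s3_endpoint="http://ceph-rgw.example.internal:8080"):
    """Spark settings for an Iceberg catalog whose data lives in the warehouse bucket."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hive",  # assumption: a REST catalog would work as well
        f"{prefix}.uri": metastore_uri,
        f"{prefix}.warehouse": warehouse,
        "spark.hadoop.fs.s3a.endpoint": s3_endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def kafka_options(bootstrap="redpanda.example.internal:9092", topic="sample-events"):
    """Options for Spark's Kafka source, pointed at a (hypothetical) Redpanda broker."""
    return {
        "kafka.bootstrap.servers": bootstrap,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def copy_topic_to_iceberg(table="lake.demo.sample_events"):
    """Read the whole topic once and store it as an Iceberg table."""
    from pyspark.sql import SparkSession  # lazy import: only needed when the job runs

    builder = SparkSession.builder.appName("redpanda-to-iceberg")
    for key, value in iceberg_conf().items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    raw = spark.read.format("kafka").options(**kafka_options()).load()
    events = raw.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )
    namespace = table.rsplit(".", 1)[0]
    spark.sql(f"CREATE NAMESPACE IF NOT EXISTS {namespace}")
    events.writeTo(table).createOrReplace()
```

Keeping the configuration in plain dictionaries makes it easy to review the settings separately from the job logic.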

A sample dataset was produced to a Redpanda topic
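One way to generate and publish such a dataset is sketched below. The event fields, the topic name, and the broker address are made up for illustration. The `publish` helper only assumes a producer object with the `send` and `flush` methods of kafka-python's `KafkaProducer`, which works against Redpanda because Redpanda speaks the Kafka protocol.

```python
import json
import random


def sample_events(count, seed=0):
    """Generate deterministic, made-up order events as dictionaries."""
    rng = random.Random(seed)
    return [
        {
            "order_id": i,
            "customer_id": rng.randint(1, 50),
            "amount": round(rng.uniform(5, 500), 2),
        }
        for i in range(count)
    ]


def publish(producer, topic, events):
    """Send each event as a JSON message keyed by its order id; return the count sent."""
    for event in events:
        producer.send(
            topic,
            key=str(event["order_id"]).encode("utf-8"),
            value=json.dumps(event).encode("utf-8"),
        )
    producer.flush()  # block until buffered messages are delivered
    return len(events)


# Usage against a real cluster (hypothetical address):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="redpanda.example.internal:9092")
#   publish(producer, "sample-events", sample_events(1000))
```

Seeding the generator makes the dataset reproducible, so later query results can be compared across runs.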

PostgreSQL was also queried using Trino

Sample dashboards were created on Superset using Trino queries

Trino was used to query the Iceberg table and Redpanda topic
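The queries involved might resemble the sketch below, which uses the `trino` Python client. The catalog names `iceberg` and `kafka`, the schema and table names, the host, and the user are all assumptions; in Trino they depend on how the catalogs are configured, and the Redpanda topic is assumed to be exposed through Trino's Kafka connector.

```python
def iceberg_query(table="iceberg.demo.sample_events", limit=10):
    """Preview rows from the Iceberg table (hypothetical catalog and table names)."""
    return f"SELECT * FROM {table} LIMIT {int(limit)}"


def topic_query(topic="sample-events", catalog="kafka", schema="default"):
    """Count messages in the topic; the name is quoted because it may contain dashes."""
    return f'SELECT count(*) FROM {catalog}.{schema}."{topic}"'


def run(sql, host="trino.example.internal", port=8080, user="analyst"):
    """Execute a statement on Trino and return all rows."""
    from trino.dbapi import connect  # lazy import: only needed for real queries

    conn = connect(host=host, port=port, user=user)
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()


# Usage (requires a reachable Trino coordinator):
#   rows = run(iceberg_query(limit=5))
#   total = run(topic_query())[0][0]
```

Because both sources are addressed as catalogs of one engine, the same client code serves the table and the topic.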

Prometheus and Grafana were configured to monitor the Kubernetes cluster
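A quick way to confirm that such monitoring is collecting data is to ask Prometheus which scrape targets are down. The sketch below uses Prometheus's HTTP query API with only the Python standard library; the base URL is a placeholder and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def down_targets(payload):
    """Return the instance labels of series whose sample value is 0."""
    results = payload["data"]["result"]
    return [
        series["metric"].get("instance", "unknown")
        for series in results
        if float(series["value"][1]) == 0
    ]


def fetch(base_url, promql="up"):
    """Run an instant query and return the decoded JSON response."""
    with urlopen(query_url(base_url, promql), timeout=10) as resp:
        return json.load(resp)


# Usage (hypothetical address):
#   print(down_targets(fetch("http://prometheus.example.internal:9090")))
```

The `up` metric is recorded by Prometheus for every scrape target, so an empty list suggests that all configured targets are currently reachable.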

Challenge

  • High costs associated with traditional data lake solutions
  • Vendor lock-in and limited flexibility
  • Migrating existing big data workloads to a new platform

Solution

We successfully built a Lakehouse using 100% open-source tools, achieving the following:

  • Kubernetes and Ceph Storage Cluster: Deployed and validated a Kubernetes and Ceph Storage Cluster using open-source tools.
  • Hadoop Cluster Replacement: Verified the Lakehouse as a replacement for the Hadoop cluster by running existing big data workloads from that cluster on it.
  • Open-Source Lakehouse: Demonstrated the feasibility of creating a Lakehouse with 100% open-source software, eliminating license costs.

Cost-effectiveness:

No upfront license fees or ongoing subscription costs

Vendor neutrality:

Avoid vendor lock-in and maintain control over your data

Open-source flexibility:

Freedom to choose the best tools for your specific needs

Future-proof:

Easily adapt to new technologies and changing requirements

Conclusion

This project demonstrates the viability of building a powerful and cost-effective Lakehouse using 100% open-source tools.

By leveraging open-source technologies, organizations can achieve greater flexibility, control, and cost savings while future-proofing their data infrastructure.

Partner with Us

Contact us today to discuss your specific needs and explore how we can empower your business with insightful, cost-effective data solutions.

Unleash the power of your data with a custom-built, open-source big data platform.