
9 best practices for building data lakes with Apache Hadoop

Written by Gaurav Gandhi | May 22, 2019 3:11:31 PM

Estimated reading time: 4 mins

Over the last few years, data types and data quality have varied dynamically, and data volumes have grown exponentially. As a result, traditional data warehouses, which are built to process only structured data, find it increasingly difficult to store, process, and analyze data at scale, leading to bottlenecks and task failures. A data lake built with Apache Hadoop, whether on-premise or on-cloud, is fast becoming the new-age solution for storing, processing, and analyzing multi-variate, high-volume data in real time.

Benefits of data lakes over data warehouses

Data lakes integrate seamlessly with traditional database systems, analytics tools, and query engines used for business reporting. They can systematically extend a traditional data warehouse. At a broad level, data lakes provide the following benefits:

  • Distributed file system: HDFS offers highly scalable, fault-tolerant, distributed storage and serves multiple data access applications concurrently through YARN.

  • YARN: It allows multiple data processing engines to run simultaneously and provide analytics at scale; for example, interactive SQL, real-time streaming, data science, analytics workbenches, and batch processing.

  • Support for Apache Spark: Spark offers lightning-fast unified analytics for large-scale data processing and allows writing parallel applications in Java, Scala, Python, R, and SQL (a minimal sketch follows this list).

  • Reduced TCO of data analysis: Data lakes reduce the cost of data management and analytics, cutting both time-to-insight and deployment time from days to minutes.

  • Unified storage platform: A single storage platform reduces the number of data marts and lets organizations apply business-centric rules and access policies in one place.

  • New insight creation: With a wide range of analytics tools, the platform helps answer queries quickly, shortening the time needed for insights and complex analytics.
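
As a quick illustration of the Spark point above, here is a minimal PySpark sketch of a parallel aggregation over raw data landed in the lake. The input path and column names (timestamp, event_type) are illustrative assumptions, not a prescribed layout:

```python
# Minimal PySpark sketch: a parallel aggregation over raw lake data.
# The HDFS path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("data-lake-spark-demo")
         .getOrCreate())

# Read raw events straight from the data lake in their native format.
events = spark.read.json("hdfs:///datalake/raw/events/")  # assumed path

# The aggregation below is automatically parallelized across the cluster.
daily_counts = (events
                .groupBy(F.to_date("timestamp").alias("day"), "event_type")
                .count()
                .orderBy("day"))

daily_counts.show()
spark.stop()
```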


Business impact of using data lakes

  • Data lakes offer a centralized repository that scales across innumerable machines.

  • Unlike data warehouses, they allow ingesting, storing, processing, and transforming both structured and unstructured data.

  • They empower a business with multi-functional tools such as data discovery, reporting, data science & analytics, and visual reporting on stored data irrespective of the native format.

  • They enable drawing insights in real time and accessing them on demand.

Best practices for data lakes

Here are some best practices for building a data lake solution as a new initiative or as a re-architecture of a data warehouse:

  1. Configure data lakes to be flexible and scalable for aggregating and storing all types of data.

  2. Include Big Data Analytics components, which support data encryption, search, complex analysis, interactive analytics, and querying.

  3. Implement access control policies and data security mechanisms to protect the stored data.

  4. Provide data search mechanisms for quick and easy search and retrieval, enabling 360-degree analysis.

  5. Ensure data movement mechanisms that allow importing any amount of data in its native format from disparate sources into unified storage (practices 5-7 are combined in the first sketch after this list).

  6. Securely store, index, and catalog data so that data streaming in from mobile apps, IoT devices, and social media is easy to understand and search.

  7. Perform comprehensive analytics using popular Big Data frameworks, such as Apache Hadoop and Spark, without moving the data to a separate analytics system.

  8. Use Machine Learning to derive valuable insights: train self-learning models, predict outcomes, and suggest actions for achieving optimal results (see the second sketch after this list).

  9. Use BI tools that integrate seamlessly with the data lake platform to provide faster business analytics, dashboards, and visualizations accessible from any browser or mobile device.
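
The following is a hedged PySpark sketch combining practices 5, 6, and 7: ingesting files in their native format, cataloging them as a metastore table, and querying them in place. The paths, database, and column names (orders, order_date, amount) are illustrative assumptions:

```python
# Hedged sketch of practices 5-7: land raw data in its native format,
# catalog it as a table, and query it in place with Spark SQL.
# Paths, database, and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lake-ingest-and-query")
         .enableHiveSupport()   # register tables in the Hive metastore
         .getOrCreate())

# 5. Ingest: read CSV files as-is from the landing zone.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("hdfs:///datalake/landing/orders/"))

# 6. Catalog: persist as a partitioned Parquet table so it is indexed
#    and discoverable through the metastore.
(raw.write
    .mode("overwrite")
    .partitionBy("order_date")
    .format("parquet")
    .saveAsTable("lake.orders"))

# 7. Analyze in place: no copy into a separate analytics system.
spark.sql("""
    SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM lake.orders
    GROUP BY order_date
    ORDER BY order_date
""").show()
```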
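
And a second, equally hedged sketch of practice 8: training a simple predictive model directly on lake data with Spark MLlib. The curated path and feature columns (quantity, discount, amount) are hypothetical:

```python
# Hedged sketch of practice 8: train a model on lake data with MLlib.
# The path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("lake-ml-demo").getOrCreate()

df = spark.read.parquet("hdfs:///datalake/curated/orders/")  # assumed path

# Assemble numeric columns into the feature vector MLlib expects.
assembler = VectorAssembler(inputCols=["quantity", "discount"],
                            outputCol="features")
train = assembler.transform(df).select("features",
                                       df["amount"].alias("label"))

model = LinearRegression().fit(train)
print("Coefficients:", model.coefficients)
```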

Strategies to extend a traditional data warehouse using a data lake

  • Retain frequently used data in the warehouse and offload unused data and ETL workloads to the data lake repository. Use a Big Data Analytics framework such as Apache Spark to perform fast in-memory analytics and maintain business continuity.

  • Migrate data in batches using Network File System (NFS) or Apache Sqoop, or in real time using methods such as Kafka Connect. Subsequently, store the data in Hive tables or as Parquet or Avro files (a sketch of this offload pattern follows this list).

  • Use unified SQL engines to deliver data to Business Intelligence teams, leveraging the stored tables through BI tools. BI teams can query the offloaded data using SQL, while Data Science teams analyze newly sourced data in an analytics workbench.
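
Sqoop and Kafka Connect are configured outside Spark, but as one hedged illustration of the offload pattern, the same batch migration can be sketched with Spark's built-in JDBC source. It assumes the JDBC driver JAR is on the cluster classpath; the URL, credentials, and table names are placeholders:

```python
# Hedged sketch of the offload pattern: pull cold warehouse tables over
# JDBC and persist them as Parquet-backed Hive tables for BI queries.
# The JDBC URL, credentials, and table names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("warehouse-offload")
         .enableHiveSupport()
         .getOrCreate())

cold_data = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://warehouse-host:5432/edw")
             .option("dbtable", "sales_history")   # rarely queried table
             .option("user", "etl_user")
             .option("password", "***")
             .load())

# Store offloaded rows as Parquet so BI teams can keep querying via SQL.
(cold_data.write
    .mode("append")
    .format("parquet")
    .saveAsTable("lake.sales_history"))

spark.sql("SELECT COUNT(*) FROM lake.sales_history").show()
```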

Data lakes on cloud

Data lakes augment data storage, intelligent processing, and complex analytics, especially on cloud platforms delivered as Infrastructure as a Service (IaaS). Data lake usage can also be extended through Data Analysis as a Service (DAaaS) platforms:

  • Data lake analytics: Scaling data storage and processing over an Infrastructure as a Service (IaaS) platform, such as those provided by AWS and Microsoft Azure, lets organizations develop and run massively parallel data transformation and processing programs without infrastructure overheads.

  • On-premise or on-cloud Apache Spark and Hadoop services: AWS EMR and Azure HDInsight provide fully managed cloud-based Hadoop clusters with analytics capabilities and extended support for Machine Learning libraries. They help quickly spin up clusters on demand and scale them up or down based on organizational requirements.

  • Data lake storage to power Big Data Analytics: Amazon S3-powered data lake solutions provide massively scalable and secure storage. They are highly available, designed for 99.999999999% (11 nines) durability, and store data for millions of applications. They also provide 'query in place' functionality, which allows running queries on data sets at rest (sketched below). Amazon S3 is supported by a large community of third-party applications and AWS services.
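
A minimal sketch of 'query in place' on S3-resident data with Spark, assuming a cluster (for example, EMR) already configured for S3 access; the bucket name and prefix are illustrative:

```python
# Hedged sketch of 'query in place' over S3-resident data with Spark.
# Assumes S3 access is already configured; the bucket is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-query-in-place").getOrCreate()

# Read Parquet data directly from S3; no copy into a separate system.
clicks = spark.read.parquet("s3a://example-datalake/clickstream/")

clicks.createOrReplaceTempView("clickstream")
spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM clickstream
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""").show()
```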


In summary

Data lakes solve challenges related to business intelligence and analytics. However, business needs constantly evolve, so future-proofing data lake implementations so that they evolve with the organization's business needs is the way ahead. Data lakes built on the Hadoop platform empower businesses to grow around existing and new data assets and easily derive business insights without limitations.