Apache Hadoop on Amazon EMR

Apache Hadoop is a free, open-source software framework for efficiently processing massive data sets. Rather than relying on a single large computer to store and process the data, Hadoop clusters commodity hardware together to analyze massive data sets in parallel.

The Hadoop ecosystem includes many different applications and execution engines, giving you a wide range of options to tailor your tools to your specific analytics workloads. With Amazon Elastic MapReduce (EMR), Hadoop and other Hadoop ecosystem applications can be deployed and managed in fully configured, elastic clusters of Amazon EC2 instances.

Software and infrastructure in the Hadoop ecosystem

Hadoop commonly refers to the Apache Hadoop project itself, which consists of MapReduce (the execution framework), YARN (the resource manager), and HDFS (distributed storage). Apache Tez, a newer framework, can also be installed and used as an alternative execution engine to Hadoop MapReduce. Amazon EMR additionally includes EMRFS, a connector that lets Hadoop use Amazon S3 as a storage layer.

The Hadoop ecosystem also includes other applications and frameworks, such as tools for low-latency queries, graphical user interfaces for interactive querying, a variety of query interfaces (including SQL), and distributed NoSQL databases. Many of these open-source tools, including Hive, Pig, Hue, Ganglia, Oozie, and HBase, can be installed and configured on your cluster by Amazon EMR. Beyond Hadoop, you can also use Amazon EMR to run other frameworks, such as Apache Spark for in-memory processing or Presto for interactive SQL.

Hadoop: the basic components

Amazon EMR programmatically installs and configures the applications in the Hadoop project, including Hadoop MapReduce, YARN, HDFS, and Apache Tez, on every node in your cluster.

Processing with Hadoop MapReduce, Tez, and YARN

Hadoop MapReduce and Tez, the execution engines in the Hadoop ecosystem, process workloads using frameworks that break jobs down into smaller pieces of work that can be distributed across the nodes of your Amazon EMR cluster. They are designed for fault tolerance, on the assumption that any machine in your cluster could fail at any time. If a server running a task fails, Hadoop reruns that task on another machine until it completes.

To work with Apache Hadoop, you can write MapReduce and Tez programs in Java, use Hadoop Streaming to run custom scripts in parallel, or use Hive and Pig for higher-level abstractions over MapReduce and Tez.
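The way a job is split into map and reduce work can be illustrated with a minimal word-count sketch. It runs in a single Python process and only imitates the map, shuffle, and reduce phases; the function names (`mapper`, `reducer`, `run_job`) are hypothetical and are not part of Hadoop or Amazon EMR. On a real cluster, each phase would be spread across nodes by the framework.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Sum all the counts collected for one word."""
    return word, sum(counts)


def run_job(lines):
    """Imitate map, shuffle, and reduce locally, in a single process."""
    # Map: on a cluster, different lines could be handled by different nodes.
    pairs = chain.from_iterable(mapper(line) for line in lines)
    # Shuffle: group the emitted values by key between the two phases.
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    # Reduce: on a cluster, different keys could be handled by different nodes.
    return dict(reducer(word, counts) for word, counts in grouped.items())


result = run_job(["big data on Hadoop", "Hadoop on Amazon EMR"])
print(result)
```

Because the mapper and reducer only see one line or one key at a time, a failed piece of work can be rerun elsewhere without affecting the rest of the job.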

YARN (Yet Another Resource Negotiator) was introduced in Hadoop 2. It keeps track of all the resources across your cluster and allocates them dynamically to the tasks in your processing jobs. YARN can manage Hadoop MapReduce and Tez workloads as well as other distributed frameworks such as Apache Spark.

Using Amazon S3 and EMRFS as a Data Store

The EMR File System (EMRFS) allows you to use Amazon S3 as a data layer for Hadoop on your Amazon EMR cluster. Amazon Simple Storage Service (S3) is well suited to big data processing because it is scalable, inexpensive, and built for durability. Storing your data in Amazon S3 decouples your compute layer from your storage layer, so you can size your Amazon EMR cluster for the CPU and memory your workloads need rather than adding nodes just to maximize on-cluster storage. Your data also remains accessible in Amazon S3 when you shut down your Amazon EMR cluster during idle periods.

EMRFS is optimized for Hadoop to read and write to Amazon S3 in parallel, and it can process objects encrypted with Amazon S3 server-side or client-side encryption. With EMRFS, you can use Amazon S3 as your data lake and Hadoop on Amazon EMR as an elastic query layer.
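As a rough illustration of reading from and writing to Amazon S3 through EMRFS, the sketch below builds a Hadoop Streaming step whose input, output, and scripts are all `s3://` paths. The dictionary follows the step shape accepted by the boto3 EMR client, but nothing is submitted here. The bucket name, script names, and the `streaming_step` helper are assumptions made for the example.

```python
def streaming_step(name, bucket, mapper, reducer):
    """Build a Hadoop Streaming step definition that uses S3 paths throughout.

    The bucket layout (scripts/, input/, output/) is hypothetical.
    """
    scripts = f"s3://{bucket}/scripts/{mapper},s3://{bucket}/scripts/{reducer}"
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets a step invoke tools installed on the cluster.
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", scripts,
                "-mapper", mapper,
                "-reducer", reducer,
                "-input", f"s3://{bucket}/input/",
                "-output", f"s3://{bucket}/output/",
            ],
        },
    }


step = streaming_step("word-count", "example-bucket", "mapper.py", "reducer.py")
print(step["HadoopJarStep"]["Args"])
```

A step like this could be passed in the `Steps` list of `add_job_flow_steps` or `run_job_flow`. The output path must not already exist when the job starts, as with any Hadoop job.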

On-cluster storage with HDFS

The Hadoop Distributed File System (HDFS) is the distributed file system bundled with Hadoop. It stores data in large blocks on the local disks of your cluster. HDFS has a configurable replication factor (3x by default), which increases availability and durability. HDFS monitors replication and rebalances your data across nodes as nodes fail or are added.
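The effect of the replication factor on disk usage can be worked through with a little arithmetic. The sketch below assumes a 128 MB block size, which is the stock Hadoop default but may be configured differently on a given cluster. The `hdfs_footprint` helper is hypothetical.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (blocks, block replicas, raw MB consumed) for one file.

    block_size_mb=128 is an assumption; check dfs.blocksize on your cluster.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The final block only occupies as much disk as it holds, so raw usage
    # is the file size multiplied by the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, blocks * replication, raw_mb


# A 1 GB file at the default 3x replication.
print(hdfs_footprint(1024))  # (8, 24, 3072)
```

In other words, 1 GB of input occupies about 3 GB of cluster disk at the default setting, which is worth factoring in when sizing on-cluster storage.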

HDFS is installed automatically with Hadoop on your Amazon EMR cluster, and you can use it alongside Amazon S3 to store your input and output data. You can encrypt HDFS using an Amazon EMR security configuration. Even when your input data is stored in Amazon S3, Amazon EMR configures Hadoop to use HDFS and local disk for the intermediate data created during your Hadoop MapReduce jobs.

Hadoop’s benefits when used with Amazon Elastic MapReduce

Increased speed and agility

The time it takes to make resources available to your users and data scientists can be drastically reduced by using Amazon Elastic MapReduce (EMR) to quickly and dynamically initialize a new Hadoop cluster or to add servers to an existing cluster. By reducing the time and money needed to allocate resources for experimentation and development, using Hadoop on the AWS platform can greatly increase your organization’s agility.

Reduced administrative complexity

Setting up Hadoop can be difficult and time-consuming: it involves Hadoop configuration, networking, server installation, security configuration, and ongoing administration. Amazon EMR is a managed service that takes care of your Hadoop infrastructure so you can concentrate on running your business.

Connectivity to other Amazon Web Services

Hadoop on Amazon EMR integrates with other AWS services such as Amazon S3, Amazon Kinesis, Amazon Redshift, and Amazon DynamoDB, enabling data movement, workflows, and analytics across the AWS platform. You can also use the AWS Glue Data Catalog as a managed metadata repository for Apache Hive and Apache Spark.

Pay for clusters only when you need them

Many Hadoop jobs are spiky by nature. An ETL job, for instance, might run once a day, once a week, or once a month, while financial modeling or genetic sequencing jobs might run only a few times a year. With Hadoop on Amazon EMR, you can quickly spin up a cluster for each of these workloads, save the results, and shut down Hadoop resources when they are no longer needed, so you avoid paying for idle infrastructure. EMR 6.x supports Hadoop 3, which allows the YARN NodeManager to launch containers either directly on the EMR cluster host or inside a Docker container. See the Amazon EMR documentation for details.
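The saving from releasing idle clusters can be illustrated with a back-of-the-envelope comparison. The hourly rate, node count, and job duration below are made-up figures chosen for easy arithmetic, not real Amazon EMR prices. The `monthly_cost` helper is hypothetical.

```python
def monthly_cost(nodes, hourly_rate, hours):
    """Cost of running `nodes` instances for `hours` at a flat hourly rate."""
    return nodes * hourly_rate * hours


# Hypothetical figures: 10 nodes at 0.25 per node-hour over a 30-day month.
NODES = 10
RATE = 0.25

# Cluster left running all month.
always_on = monthly_cost(NODES, RATE, hours=24 * 30)
# Cluster launched for a 2-hour daily ETL job, then shut down.
transient = monthly_cost(NODES, RATE, hours=2 * 30)

print(always_on, transient, always_on - transient)  # 1800.0 150.0 1650.0
```

Under these assumptions the daily job costs one twelfth of the always-on cluster, because it runs for 2 of every 24 hours.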

Improved availability and disaster recovery

With Amazon EMR, you can launch your Hadoop clusters in any number of Availability Zones in any AWS Region. If a problem or threat arises in one Region or zone, you can work around it quickly by launching a cluster in a different zone.

Flexible capacity

Capacity planning done before deploying a Hadoop environment can leave you with expensive idle resources or with too little capacity. With Amazon EMR, you can create clusters with the capacity you need within minutes and use EMR Managed Scaling to add and remove nodes dynamically.
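As a sketch of what the scaling bounds might look like, the snippet below builds a managed scaling policy in the shape accepted by the boto3 EMR client's `put_managed_scaling_policy` call. Nothing is sent to AWS. The limits of 2 and 10 instances and the `managed_scaling_policy` helper are assumptions for illustration.

```python
def managed_scaling_policy(min_units, max_units, unit_type="Instances"):
    """Build a managed scaling policy dict with basic sanity checks."""
    if not 0 < min_units <= max_units:
        raise ValueError("expected 0 < min_units <= max_units")
    return {
        "ComputeLimits": {
            "UnitType": unit_type,
            # The cluster is never scaled below or above these bounds.
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }


policy = managed_scaling_policy(min_units=2, max_units=10)
print(policy)
```

The resulting dictionary could be passed as the `ManagedScalingPolicy` argument together with a cluster ID.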

How are Hadoop and big data related?

Hadoop’s high scalability makes it a popular choice for handling large data processing jobs. Your Hadoop cluster’s processing power can be increased by adding more servers with sufficient CPU and memory to handle your workload.

Hadoop provides a high level of durability and availability while still being able to process computational analytical workloads in parallel. This combination of availability, durability, and scalable processing makes Hadoop a natural fit for big data workloads. With Amazon EMR, you can create and configure a cluster of Amazon EC2 instances running Hadoop within minutes.

Running Apache Hadoop on AWS

Amazon EMR lets you process and analyze large data sets using the latest versions of popular big data processing frameworks, such as Apache Hadoop, Spark, HBase, and Presto, on fully configurable clusters.

  • Easy to use: You can launch an Amazon EMR cluster in minutes. You don't need to worry about node provisioning, cluster setup, Hadoop configuration, or cluster tuning.
  • Low cost: Amazon EMR pricing is simple and predictable. You pay an hourly rate for each instance hour you use, and you can use Spot Instances for further savings.
  • Elastic: With Amazon EMR, you can provision one, hundreds, or thousands of compute instances to process data at any scale.
  • Transient: You can use EMRFS to run clusters on demand against data stored persistently in Amazon S3. As jobs finish, you can shut down the cluster and keep the data saved in Amazon S3. You pay only for the compute time that the cluster is running.
  • Secure: Amazon EMR uses all the standard AWS security features, including:
    • Identity and Access Management (IAM) roles and policies to manage permissions.
    • Encryption in transit and at rest to help you protect your data and meet compliance standards such as HIPAA.
    • Security groups to control inbound and outbound network traffic to your cluster nodes.
    • AWS CloudTrail to record all Amazon EMR API calls made in your account for security analysis, resource change tracking, and compliance auditing.
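The on-demand cluster pattern described in the list above can be sketched as a request for a cluster that shuts itself down once its steps finish. The parameter names follow the boto3 EMR client's `run_job_flow` call, but the request is only built here, not sent. The release label, instance types, instance count, bucket name, and the `transient_cluster_request` helper are assumptions, and the two role names are the commonly used EMR defaults, which may differ in your account.

```python
def transient_cluster_request(name, bucket, steps):
    """Build keyword arguments for a cluster that terminates after its steps."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",        # illustrative EMR 6.x release
        "Applications": [{"Name": "Hadoop"}],
        "LogUri": f"s3://{bucket}/logs/",    # hypothetical bucket layout
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # illustrative instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False means the cluster shuts down when no steps remain,
            # so billing stops when the work is done.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Steps": steps,
    }


request = transient_cluster_request("nightly-etl", "example-bucket", steps=[])
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

With results written to Amazon S3 by the steps themselves, nothing of value is lost when the cluster terminates.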