How Hadoop Works


    How Hadoop Works (2018)

    In this video you will learn how Hadoop works, the architecture of Hadoop, the core components of Hadoop, and what the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker are.

    In this session, let us try to understand how Hadoop works.

    The Hadoop framework comprises the Hadoop Distributed File System (HDFS) and the MapReduce framework.

    Let us try to understand how data is managed and processed by the Hadoop framework.

    The Hadoop framework divides the data into smaller chunks and stores each part of the data on a separate node within the cluster.

    Let us say we have around 4 terabytes of data and a 4 node Hadoop cluster.

    HDFS would divide this data into 4 parts of 1 terabyte each.
    By doing this, the time taken to store this data onto disk is significantly reduced.

    The total time taken to store the entire dataset is roughly the time taken to store one part, because all the parts are written simultaneously on different machines.
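    The splitting and parallel-write idea above can be sketched in a few lines. The numbers here follow the simplified example (an even 4-way split); note that real HDFS actually splits files into fixed-size blocks (128 MB by default in recent versions), and the throughput figure below is an assumed value for illustration only.

    ```python
    # Sketch of the simplified example above: split 4 TB evenly across 4 nodes
    # and compare sequential vs. parallel write time.

    def split_into_parts(total_bytes, num_nodes):
        """Divide the data evenly across the nodes (last part takes any remainder)."""
        part = total_bytes // num_nodes
        sizes = [part] * num_nodes
        sizes[-1] += total_bytes - part * num_nodes
        return sizes

    TB = 1024 ** 4
    parts = split_into_parts(4 * TB, 4)        # four 1 TB parts

    # Writes happen simultaneously on different machines, so the wall-clock
    # time is governed by the largest part, not the sum of all parts.
    DISK_WRITE_BPS = 200 * 1024 ** 2           # assumed ~200 MB/s per disk
    sequential_time = sum(parts) / DISK_WRITE_BPS
    parallel_time = max(parts) / DISK_WRITE_BPS

    print(len(parts), parallel_time / sequential_time)
    ```

    With 4 equal parts, the parallel write takes a quarter of the sequential time, which is the point the transcript makes.
    
    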

    In order to provide high availability, Hadoop replicates each part of the data onto other machines that are present within the cluster.

    The number of copies it will replicate depends on the “Replication Factor”.

    By default the replication factor is set to 3.

    With the default replication factor, there will be 3 copies of each part of the data, on 3 different machines.
    In order to reduce bandwidth usage and latency, Hadoop stores 2 copies of the same part on nodes within the same rack, and the last copy on a node in a different rack.

    Let’s say Node 1 and Node 2 are on Rack 1, and Node 3 and Node 4 are on Rack 2.

    Then the first 2 copies of Part 1 will be stored on Node 1 and Node 2, and the 3rd copy of Part 1 will be stored on either Node 3 or Node 4.

    A similar process is followed for storing the remaining parts of the data.
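    The rack-aware placement just described can be sketched as a small simulation. The node and rack names are the hypothetical ones from the example; in a real cluster this decision is made by the NameNode, and the actual HDFS policy is more nuanced than this toy version.

    ```python
    # Toy sketch of the placement rule above: two replicas on nodes in one
    # rack, the third replica on a node in a different rack.

    def place_replicas(racks, replication_factor=3):
        """racks: dict mapping rack name -> list of node names.
        Returns the nodes chosen for one part of the data."""
        rack_names = list(racks)
        first_rack, other_rack = rack_names[0], rack_names[1]
        # Two copies within the same rack, the last copy on another rack.
        placement = [racks[first_rack][0],
                     racks[first_rack][1],
                     racks[other_rack][0]]
        return placement[:replication_factor]

    racks = {"Rack 1": ["Node 1", "Node 2"], "Rack 2": ["Node 3", "Node 4"]}
    print(place_replicas(racks))   # → ['Node 1', 'Node 2', 'Node 3']
    ```

    Keeping two replicas in the same rack keeps most replication traffic on the fast in-rack network, while the off-rack copy survives the loss of an entire rack.
    
    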

    Since this data is distributed across the cluster, HDFS takes care of the networking required for these nodes to communicate.
    Another advantage of distributing the data across the cluster is that it can be processed simultaneously, which saves a lot of processing time.
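    The "process simultaneously" advantage can be illustrated with a MapReduce-style word count run over several parts in parallel. This is a single-machine sketch using threads; the sample texts are made up for illustration, and a real Hadoop job would run the map step on the nodes that hold each part.

    ```python
    # Simplified MapReduce-style sketch: count words in each part in
    # parallel (the "map" step), then merge the partial counts (the
    # "reduce" step).
    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    parts = [
        "hadoop stores data in blocks",
        "blocks are replicated across nodes",
        "nodes process their local blocks",
        "results are combined at the end",
    ]

    def count_words(text):
        """Map step: each node counts words in its own part of the data."""
        return Counter(text.split())

    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(count_words, parts))

    # Reduce step: merge the per-part counts into one result.
    total = sum(partials, Counter())
    print(total["blocks"])   # → 3
    ```

    Because each part is counted independently, the map step parallelizes across as many nodes as there are parts; only the small partial counts need to travel over the network to be merged.
    
    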

    This was an overview of how Hadoop works. We will learn how data is written to, and read from, the Hadoop cluster in later sessions.
