The Big Data problem & HDFS.

Devanshu Singh
6 min read · Sep 17, 2020


We can no longer ignore data. Now that we have begun to define it and find new ways of collecting it, we see it everywhere and in everything humans do. Our current output of data is roughly 2.5 quintillion bytes a day and as the world becomes ever more connected with an ever-increasing number of electronic devices, it will grow to numbers we haven’t even conceived of yet.

Before talking about big data, though, each of us has to know what we are dealing with. Here, I cover some of the challenges we face in Big Data through its 8 V's.

1. Volume:

When we talk about Big Data, volume is probably the very first criterion to consider. The sheer amount of data determines whether it should be considered 'big' or not: usually, only data whose volume goes beyond gigabytes, into terabytes, petabytes, or even exabytes, is considered big data from a volume perspective. These thresholds are based on data surveys of different organizations.

2. Velocity:

Stream analytics is a popular term today: high-speed data is processed as it arrives, using dedicated tools. But do you know which characteristic of big data stream analytics is associated with? No doubt, it is velocity, the speed at which data is generated, delivered, and analyzed.

Now, the amount of data generated today is massive, and most importantly, it often needs real-time processing for analysis. For example, Google alone handles more than 40,000 search queries per second, which works out to over 3.4 billion queries a day. Hence, we can imagine how fast processing has to be to get timely insights from data.

3. Variety:

Big data deals with any data format: structured, unstructured, semi-structured, or even very complex structures. Storing and processing such unformatted data in a traditional RDBMS is not easy. However, unstructured data often provides valuable insights that we rarely get from structured data alone. Variety also refers to the diversity of data sources, so this characteristic tells us something about where the data comes from.

4. Veracity:

Not all data that comes in for processing is valuable. Unless the data is cleansed correctly, it is not wise to store or process all of it, especially at such massive volumes. This is where the veracity dimension of big data comes in. This characteristic also helps determine whether the data comes from a reliable source and whether it is the right fit for the analytic model.

5. Variability:

In big data analysis, inconsistency is a common scenario because the data comes from many different sources and contains many different data types. To extract meaningful information from such an enormous amount of data, anomaly and outlier detection are essential. This is why variability is considered one of the characteristics of big data.

6. Value:

The primary interest in big data is probably its business value, and this is perhaps the most crucial characteristic: unless you can extract business insights from the data, the other characteristics are meaningless.

7. Visualization:

Processing big data is not, by itself, enough to get a meaningful result out of it. Unless the outcome is represented and visualized in a meaningful way, there is little point in analyzing it. Hence, big data must be visualized with appropriate tools that help data scientists and analysts understand it better.

However, plotting billions of data points is not an easy task, and it calls for specialized techniques such as treemaps, network diagrams, and cone trees.

8. Validity:

Validity has some similarities with veracity. As the word suggests, the validity of big data means how correct the data is for the purpose it is used for. Interestingly, a considerable portion of big data is never put to use; this is known as 'dark data.' The remaining collected unstructured data is cleansed first before analysis.

To conclude, each of the eight characteristics above is associated with certain advantages, but none is without challenges. These characteristics also help determine the root cause of failures or defects in data in real time. In addition, analysis based on them feeds a company's risk portfolio and helps prevent fraudulent activities.

Now the question arises: how do we overcome these problems?

In today's world, we have become very advanced technologically, so there must be a way to deal with Big Data. Let's see how we can conquer the Big Data problem.

The answer is a distributed storage cluster.

HDFS:

The Hadoop Distributed File System (HDFS) is a Java-based distributed file system that is fault-tolerant, scalable, and extremely easy to expand. It is designed to run on low-cost commodity hardware. HDFS is the primary distributed storage for Hadoop applications, and it provides interfaces for applications to move themselves closer to where the data is located.
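Applications talk to HDFS through Hadoop's `org.apache.hadoop.fs.FileSystem` API. Here is a minimal sketch of writing a file; the NameNode URI `hdfs://namenode:9000` and the path are placeholders you would replace with your own cluster's values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster's NameNode (placeholder host/port).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Write a small file; for larger files HDFS splits the data
            // into blocks and replicates them across DataNodes for us.
            Path path = new Path("/user/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("Hello, HDFS!");
            }
            System.out.println("Wrote " + path + ", size = "
                + fs.getFileStatus(path).getLen() + " bytes");
        }
    }
}
```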

Architecture

The HDFS architecture consists of a NameNode, DataNodes, and a Secondary NameNode.

HDFS has a master/slave architecture.

NameNode — An HDFS cluster consists of a single NameNode (Master Server), which manages the file system namespace and regulates access to files by clients. It maintains and manages the file system metadata; e.g. what blocks make up a file, and on which data nodes those blocks are stored.

DataNode — There are several DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on. DataNodes store the actual data in HDFS; we can add more DataNodes to increase the available space.
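You can see the NameNode's metadata at work from the client side. This sketch (same `FileSystem` setup as above; the file path is again a placeholder) asks which DataNodes hold each block of a file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder URI

        try (FileSystem fs = FileSystem.get(conf)) {
            // Ask the NameNode for the block map of one file (placeholder path).
            FileStatus status = fs.getFileStatus(new Path("/user/demo/big.log"));
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

            // Each entry lists the DataNodes holding the replicas of one block.
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }
}
```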

Secondary NameNode — The secondary NameNode service is not a standby secondary NameNode, despite its name. Specifically, it does not offer High Availability (HA) for the NameNode.

Why Secondary NameNode?

  • The NameNode stores modifications to the file system as a log appended to a native file system file.
  • When the NameNode starts up, it reads the HDFS state from an image file, fsimage, and then applies the edits from the edits log file.
  • It then writes the new HDFS state to fsimage and starts normal operation with an empty edits file.
  • Since the NameNode merges fsimage and the edits file only during startup, the edits log can grow very large over time on a busy cluster.
  • Another side effect of a large edits file is that the next restart of the NameNode takes longer.
  • The Secondary NameNode merges the fsimage and edits log files periodically, keeping the edits log size within a limit (see the toy sketch after this list).
  • It usually runs on a different machine than the primary NameNode, since its memory requirements are of the same order as the primary NameNode's.
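To make the fsimage/edits relationship concrete, here is a toy sketch in plain Java. It is not the actual HDFS implementation: the namespace is reduced to a path-to-size map and the edit log to a list, purely to illustrate how replaying edits onto a snapshot produces a fresh checkpoint and an empty log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy checkpoint: the "fsimage" is a snapshot of the namespace,
// the "edits" list records changes made since that snapshot.
public class ToyCheckpoint {
    record Edit(String op, String path, long size) {}

    public static void main(String[] args) {
        // fsimage: path -> file size (a stand-in for real metadata).
        Map<String, Long> fsimage = new HashMap<>(Map.of("/a.txt", 10L));

        // Edits accumulated since the last checkpoint.
        List<Edit> edits = new ArrayList<>(List.of(
            new Edit("create", "/b.txt", 20L),
            new Edit("delete", "/a.txt", 0L)));

        // The Secondary NameNode's job, in miniature: replay edits
        // onto the image...
        for (Edit e : edits) {
            if (e.op().equals("create")) fsimage.put(e.path(), e.size());
            else if (e.op().equals("delete")) fsimage.remove(e.path());
        }
        edits.clear(); // ...so the edit log can start empty again.

        System.out.println("checkpointed namespace: " + fsimage);
    }
}
```

In real HDFS the image and log are binary files on disk, but the shape of the operation, replay then reset, is the same.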

Key Features

Failure tolerant — data is duplicated across multiple DataNodes to protect against machine failures. The default is a replication factor of 3: every block is stored on three machines, provided at least three DataNodes are available.
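Replication is also adjustable per file. A small sketch using the same `FileSystem` API (URI and path are again placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder URI

        try (FileSystem fs = FileSystem.get(conf)) {
            // Set the replication factor of one file to 3 copies;
            // HDFS re-replicates the blocks in the background.
            fs.setReplication(new Path("/user/demo/hello.txt"), (short) 3);
        }
    }
}
```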

Scalability — data transfers happen directly with the DataNodes so your read/write capacity scales fairly well with the number of DataNodes.

Space — need more disk space? Just add more DataNodes and re-balance.

Industry-standard — Other distributed applications are built on top of HDFS (HBase, MapReduce).

HDFS is designed to process large data sets with write-once-read-many semantics; it is not meant for low-latency access.

Data Organization

  • Each file written into HDFS is split into data blocks of 64 MB (the older Hadoop default) or 128 MB (the default since Hadoop 2.x); see the quick calculation after this list.
  • Each block is stored on one or more nodes.
  • Each copy of the block is called a replica.
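As a quick sanity check of the arithmetic, here is how the block count falls out for a hypothetical 1 GB file at the 128 MB default:

```java
public class BlockCount {
    public static void main(String[] args) {
        long fileSize  = 1024L * 1024 * 1024; // a hypothetical 1 GB file
        long blockSize = 128L * 1024 * 1024;  // 128 MB, the Hadoop 2.x+ default
        long blocks = (fileSize + blockSize - 1) / blockSize; // ceiling division
        System.out.println(blocks + " blocks of up to 128 MB"); // prints 8
    }
}
```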

Block Placement Policy

  • The first replica is placed on the local node (the node where the writing client runs, if that node is a DataNode).
  • The second replica is placed on a node in a different rack.
  • The third replica is placed on a different node in the same rack as the second replica.

My final thoughts

When it comes to Big Data, it’s easy to get overwhelmed with its endless exciting possibilities. Nevertheless, critical assessment, the understanding of shortcomings and vulnerabilities (technological, ethical, and legal), as well as strategies to address them should be at the core of any Big Data implementation.

THANK YOU FOR READING. GIVE IT A LIKE IF YOU LOVED MY ARTICLE; IT WILL ENCOURAGE ME TO WRITE MORE ARTICLES.
