Saturday, February 24, 2018

Big Data 101 - What Is Big Data And Why Hadoop?

Like I mentioned in one of my previous posts, I'm exploring the big data ecosystem. In this post, I will briefly talk about big data and Hadoop, and why they are needed. I'm envisioning this as a series of 4 blog posts, and here goes the first one.


What is big data?
What counts as a huge amount of data today may not be considered big a few years from now. So there are three important vectors to check when deciding whether a problem needs a big data solution:
  • Volume of data
  • Velocity at which data is being generated: depends on the growth rate
  • Variety of data: structured vs unstructured vs multi factored vs linked vs dynamic
Example: site analytics, clickstream data, etc. can all be considered big data problems.

Big data comes with big problems
  • We need efficiency in storage since data volume, velocity and variety are all high
  • Data losses due to corruption and hard disk failures get magnified when working with big data, and accordingly, the recovery strategies need to be adapted.
  • The time it takes to analyze data also goes up significantly, thus requiring better techniques for analysis
  • Finally, the monetary cost of analysis also shoots up due to huge storage & computation needs
As such, traditional RDBMS solutions don’t help.
Grid computing approaches don’t help either.
When the amount of data is huge, nodes may end up spending too much time on data transfer. While grid computing works well for compute-heavy analysis on smaller amounts of data, it requires low-level programming and thus may not prove as efficient for data-heavy workloads.
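To get a feel for the data transfer cost mentioned above, here is a rough back-of-the-envelope sketch. The dataset size (1 TB) and link speed (1 Gbit/s) are assumptions picked purely for illustration, and the calculation ignores protocol overhead and disk speed.

```python
def transfer_seconds(data_bytes: float, link_bits_per_second: float) -> float:
    """Time to push data_bytes over a single link, ignoring all overhead."""
    return data_bytes * 8 / link_bits_per_second


ONE_TB = 10**12    # assumed dataset size: 1 terabyte (decimal)
ONE_GBPS = 10**9   # assumed link speed: 1 gigabit per second

seconds = transfer_seconds(ONE_TB, ONE_GBPS)
print(f"{seconds:.0f} s, about {seconds / 3600:.1f} hours")
# prints "8000 s, about 2.2 hours"
```

Even under these generous assumptions, simply moving the data to the compute node takes hours, before any analysis has started.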

Hadoop was built to overcome the above shortcomings
Its key features are that it:
  • Is cost effective
  • Can handle huge volumes of data
  • Is efficient in storage
  • Has good recovery solutions
  • Scales horizontally
  • Minimizes the learning curve

So is Hadoop better than other databases?
Well, it depends on the use case. There are some use cases where RDBMS solutions like MySQL, PostgreSQL, MSSQL etc. shine, and then others where Hadoop is the better alternative. In general,
  • RDBMS work exceptionally well with low volume data, while Hadoop works well with larger datasets
  • RDBMS models use a static schema, while Hadoop allows dynamic schemas
  • RDBMS can scale vertically (you can upgrade the machine itself) but typically won’t scale horizontally (you can’t improve the performance of a query by adding more nodes)
  • Database solutions require dedicated servers, which can get expensive quickly, while Hadoop runs on commodity computers
  • Hadoop is a batch processing system, so you can’t expect millisecond latencies. Thus for most practical purposes where you need to return a response quickly, Hadoop won’t be the ideal choice.
  • Hadoop encourages you to write data once into the storage and analyze it multiple times, while databases support reading and writing multiple times.

It is important to note here that newer databases like Cassandra and DynamoDB can process huge volumes of data - millions of columns and billions of rows - and give RDBMS competition. They still have limitations on querying fields other than the primary and secondary indexes, but for many practical purposes they can replace the RDBMS variant.

So what is Hadoop?
Hadoop is a framework for distributed processing of large data sets across clusters of commodity computers (nodes). All the nodes that we need are commodity hardware - these are enterprise grade servers that need no customisation, and thus can be bought off the shelf as is. In the world of cloud computing, these nodes can sit inside a VPC as well.

Hadoop has two core components:
  • HDFS (Hadoop Distributed File System): Takes care of all storage related complexities, such as deciding which data goes where and replicating data. HDFS is a virtual file system that sits on top of each node's local file system, so the two co-exist
  • MapReduce: Takes care of all computation related complexities
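To make the MapReduce idea a little more concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word counting as the example. This is plain Python and not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and in a real cluster each step would run in parallel across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines: List[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence (hypothetical helper)."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data needs big storage", "Hadoop stores big data"])
print(counts["big"], counts["data"], counts["hadoop"])
# prints "3 2 1"
```

The point of the split is that each `map_phase` call only needs its own line, and each `reduce_phase` call only needs the values for its own key, which is what lets the work be spread over many machines.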

In the next posts, we will explore HDFS, MapReduce and the Hadoop ecosystem in detail.
