Saturday, February 24, 2018

Big Data 101 - What Is Big Data And Why Hadoop?

Like I mentioned in one of my previous posts, I'm exploring the big data ecosystem. In this post, I will briefly talk about big data and Hadoop, and why they are needed. I'm envisioning this as a series of 4 blog posts, and here goes the first one.

What is big data?
What is huge amount of data today need not be considered big a few years from now. So, there are 3 important vectors to check upon if the problem needs a big data solution or not:
  • Volume of data
  • Velocity at which data is being generated: depends on the growth rate
  • Variety of data: structured vs unstructured vs multi factored vs linked vs dynamic
Example: site analyics, clickstream data etc can all be considered big data problems.

Big data comes with big problems
  • We need efficiency in storage since data volume/velocity/variety is high
  • Data losses due to corruption and hard disk failures get magnified when working with big data, and accordingly, the recovery strategies needs to be adapted.
  • The time it takes to analyze data also goes up significantly, thus requiring better techniques for analysis
  • Finally, the monetary cost of analysis also shoots up due to huge storage & computation needs
As such, traditional RDBMS databases don’t help
Grid computing approaches don’t help either
When the amount of data is huge, nodes may end up spending too much time in data transfer. While Grid Computing works well for high analysis with lower amount of data, it requires low level programming and thus may not prove as efficient.

Hadoop was built to overcome above shortcomings
Its key features are:
  • Is cost effective
  • Can handle huge volume of data
  • Efficient in storage
  • Has good recovery solutions
  • Is horizontally scale
  • Minimizes learning curve

So is Hadoop better than other databases?
Well, it depends on the use case. There are some use cases where RDBMS solutions like MySQL, PostgreSQL, MSSQL etc shine, and then others where Hadoop is the better alternate. In general,
  • RDBMS work exceptionally well with low volume data, while Hadoop with larger datasets
  • RDBMS models are static schema while Hadoop allows dynamic schemas
  • RDBMS can scale vertically (you can improve the process itself) but won’t scale horizontally (can’t improve performance of query by adding more nodes)
  • Database solutions require dedicated server requirements which can get more expensive quickly, Hadoop is made of commodity computers
  • Hadoop is a batch interactive system, and so can’t expect millisecond latencies. Thus for most practical purposes where you need to return a response quickly, Hadoop won’t be the ideal choice.
  • Hadoop encourages you to write data once into the storage and analyze it multiple times, while databases support both read and write multiple times.

It is important to note here that newer databases like Cassandra and DynamoDB allow huge volume of data to be processed - millions of columns and billions of columns and give RDBMS competition. They still have limitations on querying on fields other than primary and secondary index, but for many practical purposes, can replace the RDBMS variant.

So what is Hadoop?
Hadoop is a framework for distributed processing of large data sets, across clusters of commodity computers (nodes). All the nodes that we need are commodity hardware - it is enterprise grade servers with no customisation needed, and thus can be bought off the shelf as is. In the world of cloud computing, these nodes can sit inside a VPC as well.  

Hadoop has two core components:
  • HDFS (Hadoop Distributed File System): Takes care of all storage related complexities, which data goes where, replicating data. HDFS is virtual, so the local file system and HDFS co-exist
  • Mapreduce: Takes care of all computation related complexities

In the next posts, we will explore HDFS, Mapreduce and Hadoop ecosystem in detail.


  1. The advertisement crusades are advanced with the view to spike change rate. It very well may be any computerized crusade by means of Adwords or TV plugs. Indeed, A/B testing empowers investigating the pace of traffic-pulling and its transformation proportion. data science course in pune

  2. I was blown out after viewing the article which you have shared over here. So I just wanted to express my opinion on Data Science, as this is best trending medium to promote or to circulate the updates, happenings, knowledge sharing.. Aspirants & professionals are keeping a close eye on Data science course in Mumbai to equip it as their primary skill.

  3. Just saying thanks will not just be sufficient, for the fantastic lucidity in your writing. I will instantly grab your articles to get deeper into the topic. And as the same way ExcelR also helps organisations by providing data science courses based on practical knowledge and theoretical concepts. It offers the best value in training services combined with the support of our creative staff to provide meaningful solution that suits your learning needs

  4. Thanks for sharing your valuable information to us, it is very useful.
    digital marketing course

  5. Such a very useful article. I have learn some new information.thanks for sharing.
    data scientist course in mumbai

  6. Really impressed! Everything is very open and very clear clarification of issues. It contains truly facts. Your website is very valuable. Thanks for sharing.

    Data science course in mumbai

  7. Such a very useful Blog. Very interesting to read this article. I have learn some new information.thanks for sharing. know more about

  8. Pretty good post. I just stumbled upon your blog and wanted to say that I have really enjoyed reading your blog posts. Any way I’ll be subscribing to your feed and I hope you post again soon.
    ExcelR data analytics

  9. I am impressed by the information that you have on this blog. It shows how well you understand this subject.
    ExcelR Business Analytics Course

  10. Very nice blog here and thanks for post it.. Keep blogging...
    ExcelR data science training

  11. Attend The PMP Certification in Abu Dhabi From ExcelR. Practical PMP Certification in Abu Dhabi Sessions With Assured Placement Support From Experienced Faculty. ExcelR Offers The PMP Certification in Abu Dhabi.
    ExcelR PMP Certification in Abu Dhabi

  12. Really nice and interesting post. I was looking for this kind of information and enjoyed reading this one. Keep posting. Thanks for sharing.
    ExcelR data analytics courses

  13. I have to search sites with relevant information on given topic and provide them to teacher our opinion and the article.
    ExcelR data science course in mumbai

  14. A good blog always comes-up with new and exciting information and while reading I have feel that this blog is really have all those quality that qualify a blog to be a one.
    data science course in india

  15. An information store contains a subset of corporate-wide information that is of incentive to a particular gathering of clients.Data Analytics Course in Bangalore

  16. I just got to this amazing site not long ago. I was actually captured with the piece of resources you have got here. Big thumbs up for making such wonderful blog page!
    data analytics course mumbai

  17. I am a new user of this site so here i saw multiple articles and posts posted by this blog,I curious more interest in some of them hope you will give more information on this topics in your next articles.
    Data Science Courses
    Data Scientist

  18. Pretty good post. I just stumbled upon your blog and wanted to say that I have really enjoyed reading your blog posts. Any way I’ll be subscribing to your feed and I hope you post again soon.
    Please check this Data Scientist Course

  19. keep up the good work. this is an Ossam post. This is to helpful, i have read here all post. i am impressed. thank you. this is our Data Science course Mumbai
    data science course mumbai |

  20. This is a wonderful article, Given so much info in it, Thanks for sharing. CodeGnan offers courses in new technologies and makes sure students understand the flow of work from each and every perspective in a Real-Time environmen python training in vijayawada. , data scince training in vijayawada . , java training in vijayawada. ,

  21. I’m excited to uncover this page. I need to to thank you for ones time for this particularly fantastic read!! I definitely really liked every part of it and i also have you saved to fav to look at new information in your science course

  22. The information provided on the site is informative. Looking forward more such blogs. Thanks for sharing .
    Artificial Inteligence course in Aurangabad
    AI Course in Aurangabad

  23. I am genuinely thankful to the holder of this web page who has shared this wonderful paragraph at at this place
    data science course

  24. Your blog is splendid, I follow and read continuously the blogs that you share, they have some really important information. M glad to be in touch plz keep up the good work.
    Data Scientist Courses

  25. Hello Admin!

    Thanks for the post. It was very interesting and meaningful. I really appreciate it! Keep updating stuffs like this. If you are looking for the Advertising Agency in Chennai | Printing in Chennai , Visit Inoventic Creative Agency Today..

  26. wonderful article. Very interesting to read this article.I would like to thank you for the efforts you had made for writing this awesome article. This article resolved my all queries.
    Data science Interview Questions

  27. Very informative article, which you have shared here about the big data and hadoop. After reading your article I got very much information and it is very useful for us. I am thankful to you for sharing this article here. best big data trainer

  28. Attend The Data Science Courses From ExcelR. Practical Data Science Courses Sessions With Assured Placement Support From Experienced Faculty. ExcelR Offers The Data Science Courses.
    Data Science Courses
    Data Science Interview Questions

  29. Thanks for taking the time to discuss this, I feel strongly about it and love learning more on this topic. Data Blending in Tableau

  30. Very interesting to read this article.I would like to thank you for the efforts you had made for writing this awesome article. This article inspired me to read more. keep it up.
    Correlation vs Covariance