Azure HDInsight
Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform (HDP). It brings Hadoop and Spark to the same table so that enterprises can manage both with a common set of tools. The platform also offers a standard notebook experience through support for Jupyter and Zeppelin. Azure HDInsight is a strong choice for enterprises that want to run both Hadoop and Spark and enjoy consistent manageability across Big Data workloads.
Note that HDInsight is Apache Hadoop running on Microsoft Azure, which means that we have a cluster available in the cloud. In other words, it is a cloud service that allows cost-effective data processing using open-source frameworks such as Hadoop, Spark, Hive, Storm, and Kafka, among others.
Using Apache Sqoop, we can import and export data to and from a multitude of sources, but the native file system that HDInsight uses is either Azure Data Lake Store or Azure Blob Storage.
HDInsight sits closest to IaaS, since some amount of cluster management is involved. Billing is on a per-minute basis, and activities can be scheduled on demand using Data Factory, although this limits the storage to Blob Storage.
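To make the effect of per-minute billing concrete, here is a minimal sketch that compares a cluster left running for a whole month with one that is only created on demand for a daily job. The hourly rate, node count, and job duration are made-up values for illustration, not actual Azure prices, and `cluster_cost` is a hypothetical helper.

```python
def cluster_cost(rate_per_node_hour: float, nodes: int, minutes: int) -> float:
    """Cost of a cluster billed per minute at a given hourly rate per node."""
    return rate_per_node_hour * nodes * minutes / 60


# Hypothetical figures, chosen only to illustrate the arithmetic.
RATE_PER_NODE_HOUR = 0.50   # assumed price per node per hour
NODES = 4                   # assumed cluster size
DAYS = 30                   # one billing month

# Cluster that stays up all month.
always_on = cluster_cost(RATE_PER_NODE_HOUR, NODES, DAYS * 24 * 60)

# Cluster created on demand for a 2-hour job each day.
on_demand = cluster_cost(RATE_PER_NODE_HOUR, NODES, DAYS * 2 * 60)

if __name__ == "__main__":
    print(f"always on: {always_on:.2f}")
    print(f"on demand: {on_demand:.2f}")
```

The gap between the two totals depends entirely on how many minutes the cluster actually exists, which is why on-demand scheduling matters for workloads that only run for part of the day.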
Let’s understand “Hadoop” in short…
It is an open-source framework for storing data and running applications on clusters. It offers massive storage for any kind of data and a lot of processing power, and it can handle virtually “limitless” concurrent tasks. Hadoop is maintained as an open-source project by the Apache Software Foundation, hence the name Apache Hadoop.
In other words, Apache Hadoop is a framework or software library that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
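The "simple programming model" most often associated with Hadoop is MapReduce. The sketch below imitates its three stages (map, shuffle, reduce) for a word count in plain Python on a single machine. It does not use any Hadoop API; the function names are illustrative only, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    sample = ["big data on Azure", "big clusters process big data"]
    print(word_count(sample))
```

Because each line is mapped independently and each key is reduced independently, the same logic can be spread over as many machines as the data requires, which is the property that lets Hadoop scale from one server to thousands.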
In Azure we can pick one of the following cluster types, depending on the business need. Note that we can only select one type of cluster during the configuration of HDInsight. Also, an HDInsight cluster cannot be turned off, which can result in high costs during periods of low use. The available types are:
- Hadoop: Petabyte-scale processing.
- HBase: Fast and Scalable NoSQL database.
- Spark: Fast data analytics and cluster computing using in-memory processing.
- Kafka: A high-throughput, low-latency, real-time streaming platform using a publish-subscribe messaging system.
- ML Services: A server for hosting and managing parallel distributed R processes.
- Interactive Query: Uses Hive (SQL on Hadoop) and LLAP (Low Latency Analytical Processing).
- Storm: Real-time streams of data through reliable processes.
Azure HDInsight runs on top of Azure Data Lake and gives us the benefit of analyzing large-scale data workloads in Hadoop. Usability and support from Microsoft are outstanding.
When compared to the maintenance cost of a classic self-managed (on-premises or IaaS) Apache Hadoop installation, Azure HDInsight is very cost effective and provides lots of room to optimize our data.
For Active Directory integration with HDInsight, we need a few components to make it work: the Enterprise Security Package (ESP) and, to support it, a deployment of Azure Active Directory Domain Services. Microsoft also provides a high availability guarantee.
In short, Azure HDInsight provides the most popular open-source frameworks, all easily accessible from the portal. If you need a combination of multiple clusters, for example HDInsight Kafka for streaming together with Interactive Query, it is a great choice.
Azure Databricks
Databricks was founded by the creators of Spark. The team behind Databricks keeps optimizing the Apache Spark engine to run faster and faster. The Databricks platform is said to deliver around five times the performance of open-source Apache Spark. With Databricks, you get collaborative notebooks, integrated workflows, and enterprise security, all in a fully managed cloud platform.
Azure Databricks is often the best choice for an enterprise running Azure Cloud Services, as it is a Spark-based analytics platform optimized specifically for the Microsoft Azure cloud. It is ideal for enterprises that wish to increase collaboration between their data scientists and run Spark-based workloads efficiently at much higher performance.
The Databricks Lakehouse Platform (Unified Analytics Platform) makes the power of Spark accessible. It is a highly adaptable solution for data engineering, data science, and AI. On the downside, load times are not always consistent, and there is no ability to restrict data access to specific users or groups.
Azure Databricks runs on a premium Spark cluster, which is faster than open-source Spark. Azure Databricks is a PaaS solution and doesn’t require a lot of admin work after the initial setup. It provides security through Azure Active Directory integration without any need for custom configuration. In short, it brings all the benefits of Databricks to Azure.
Let’s understand “Spark” in short…
It is a general-purpose distributed data processing engine that can be used in a wide range of circumstances. It ships with libraries for SQL, machine learning, graph computing, and stream processing. Spark does not provide storage, only a computation engine. Spark extends the Hadoop MapReduce model to work in a more optimized way.
Analytics Solution Architecture Considerations
Comparison - HDInsight and Databricks
Let’s look at a full comparison between Azure HDInsight and Azure Databricks services to see where each one excels:
Feature | HDInsight | Databricks |
Pricing | Per Cluster Time | Per Cluster Time (VM cost + DBU processing time) |
Engine | Apache Hive or Apache Spark | Apache Spark, but optimized for Databricks since founders were creators of Spark |
Default Environment | Ambari (Hortonworks), Zeppelin if using Spark | Databricks Notebooks, RStudio for Databricks |
De Facto Language | HiveQL, open source | R, Python, Scala, Java, SQL, mostly open-source languages |
Integration with Data Factory | Yes, to run MapReduce jobs, Pig, and Spark scripts | Yes, to run notebooks, or Spark scripts (Scala, Python) |
Scalability | Not scalable, requires cluster shutdown to resize | Easy to change machines and allows auto-scaling |
Testing | Easy, Ambari allows interactive query execution (if Hive). If using Spark, Zeppelin | Very easy, notebook functionality is extremely flexible |
Setup and Administration | Complex, we must decide cluster types and sizes | Easy, Databricks offers two main types of services and clusters can be modified with ease |
Sources | Wide variety, ADLS, Blob and databases with Sqoop | Wide variety, ADLS, Blob, flat files in cluster and databases via JDBC |
Migration Possibility | Easy as long as new platform supports MapReduce or Spark | Easy as long as new platform supports Spark |
Learning Curve | Flexible as long as developers know basic SQL | Very flexible as almost all analytic-based languages are supported |
Reporting Services | Tableau, Power BI (if using Spark), Qlik | Tableau, open-source packages such as ggplot2, matplotlib, Bokeh, etc. |
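The Pricing row says that a Databricks cluster is charged as VM cost plus DBU processing time. A minimal sketch of that calculation is shown below. All rates and the DBU consumption per node are hypothetical placeholders, not published prices, and `databricks_cluster_cost` is an illustrative helper rather than part of any Databricks API.

```python
from typing import NamedTuple


class CostBreakdown(NamedTuple):
    vm_cost: float
    dbu_cost: float
    total: float


def databricks_cluster_cost(
    vm_rate_per_hour: float,    # assumed VM price per node per hour
    dbu_rate: float,            # assumed price of one DBU
    dbus_per_node_hour: float,  # assumed DBUs consumed per node per hour
    nodes: int,
    hours: float,
) -> CostBreakdown:
    """Split the cost of a cluster run into its VM part and its DBU part."""
    vm_cost = vm_rate_per_hour * nodes * hours
    dbu_cost = dbu_rate * dbus_per_node_hour * nodes * hours
    return CostBreakdown(vm_cost, dbu_cost, vm_cost + dbu_cost)


if __name__ == "__main__":
    # Hypothetical 4-node cluster running for 10 hours.
    cost = databricks_cluster_cost(0.50, 0.25, 1.5, nodes=4, hours=10)
    print(cost)
```

Keeping the two components separate makes it easier to see which part of the bill grows when you pick a larger VM size versus when a job simply runs longer.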
Pros and Cons - HDInsight and Databricks
Pros of HDInsight
- Highly scalable and highly available
- Great backup facility and disaster recovery
- Simplified cluster creation and deletion
- Easy data management and it is available for retrieval at any time
- It is low cost when compared to on-premise Hadoop. It is cost-effective to collect and store structured or unstructured data.
- The flat network storage technology offers a high-speed connection between the nodes and Blob Storage.
- A 99.9% SLA at large scale. Very good support available from Microsoft.
- Transparent data encryption for end-to-end security
- More than 30 popular applications to choose from. These can be deployed to the cluster within minutes.
- Easy transformation of high-volume data
- A Hadoop cluster can be built within minutes
Cons of HDInsight
- It does not support the full set of Hadoop features
- Lack of integration with other Azure platforms
- There is room for improvement in workload-based scaling
- The bundled Spark version is old and lags behind current releases
- Not easy to use: the log report hardly shows anything and the user interface is not user friendly
- Azure expertise is needed to handle errors and adapt the application
- Some customers notice performance issues when dealing with large volumes of data
- Although the cost is lower than on-premise Hadoop, it is still on the higher side, and not every customer can afford it.
Pros of Databricks
- The Databricks Lakehouse Platform uses Apache Spark in the backend to make all computation faster and distributed. It helps data pipelines process huge amounts of big data in less time and at low cost.
- Supports major data sources: seamless integration with Azure cloud platform services like Azure Data Lake Storage, Blob Storage, Azure Data Factory and Azure DevOps.
- SQL based & hence easy to adopt
- Interactive analysis with notebook-style coding
- User friendly – a new user can easily navigate through SQL/Python queries
- Great performance
- Ready-to-use Spark environment with zero configuration required
- It supports the main data science programming languages, such as R, Scala, Python, SQL and Java
- There are many training resources available for developers, data scientists, data engineers and other IT professionals to learn Apache Spark.
- It takes only a few minutes to deploy models into production
- It has tools that support collaboration between developers
- Supports complex transformations
- There is a Databricks Community edition, which is a free version. It gives beginners an easy start with a big data platform. It does not have every feature of the full version but is still adequate for very new coders.
Cons of Databricks
- No data backup feature
- Hard to debug code
- Errors can be difficult to understand at times
- It is difficult to regenerate tables when the connection is lost: the session resets automatically at times, which wipes the temporary tables from memory
- Integration with Git has challenges
- Cost is said to be high by many customers
Conclusion
HDInsight has always been very reliable when we know the workloads and the cluster sizes we’ll need to run them. Scaling in this case is tedious, as machines must be deleted and activated iteratively until we find the right configuration. Using Hive is a perk: because it is open source and very similar to SQL, we can get straight down to developing without further training. By using Hive, we take full advantage of the power of MapReduce, which shines in situations where there are huge amounts of data. In general, remember that if you have a lot of long-running jobs that need high power, Azure HDInsight could be better than Azure Databricks.
Databricks seems an ideal choice when an interactive notebook experience (like Jupyter) is a must, and when data engineers and data scientists must work together to get insights from data and adapt smoothly to different situations, as scalability is extremely easy. Another perk of using Databricks is its speed, thanks in large part to the Spark creators. Also remember that if you only need a Spark cluster, Azure Databricks will bring you that with better performance than an open-source Spark cluster.
In other words, the choice between Azure HDInsight and Azure Databricks depends on the use case that you want to solve. The most important question is how the data scientists are going to work. If they are going to work without collaborating, it could be wiser to choose Azure HDInsight. If there will be a lot of collaboration, Azure Databricks can take you the extra mile thanks to its shared notebooks and readily available workflows.
Last but not least, if you would like a Kafka based streaming service that is connected to a transformation tool, then the combination of HDInsight Kafka and Azure Databricks is the right solution.