Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform (HDP). It essentially brings Hadoop and Spark to the same table, helping enterprises manage both with a common set of tools. The platform also offers a standard notebook experience through support for Jupyter and Zeppelin. Azure HDInsight is a strong choice for enterprises that want to run both Hadoop and Spark and enjoy ease of manageability across Big Data workloads.
Note that HDInsight is Apache Hadoop running on Microsoft Azure, which means that we have a cluster available in the cloud. In other words, it is a cloud service that allows cost-effective data processing using open-source frameworks such as Hadoop, Spark, Hive, Storm, and Kafka, among others.
Using Apache Sqoop, we can import and export data to and from a multitude of sources, but the native file system that HDInsight uses is either Azure Data Lake Store or Azure Blob Storage.
HDInsight is the closest to an IaaS, since there is some amount of cluster management involved. Billing is on a per-minute basis, but activities can be scheduled on demand using Data Factory, even though this limits the use of storage to Blob Storage.
Let’s understand “Hadoop” in short…
It is an open-source framework for storing data and running applications on clusters. It offers massive storage for any kind of data, plenty of processing power, and the ability to handle a virtually limitless number of concurrent tasks. Hadoop is developed as an open-source project of the Apache Software Foundation under the name Apache Hadoop.
In other words, Apache Hadoop is a framework or software library that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
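To make the "simple programming model" concrete, here is a minimal sketch of the MapReduce idea as a word count. It is plain Python running on a single machine, not the Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are illustrative assumptions. On a real cluster, the map and reduce steps would run in parallel on many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data on Hadoop", "big data on Spark"]
    print(word_count(sample))
```

Because each line is mapped independently and each key is reduced independently, the work can be split across machines, which is what lets the model scale from a single server to thousands.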
In Azure, we can pick one of the following cluster types depending on the business scenario. Note that only one cluster type can be selected when configuring HDInsight. Also, an HDInsight cluster cannot be turned off, which can result in high costs during periods of low use. The available cluster types are:
- Hadoop: Petabyte-scale processing.
- HBase: Fast and Scalable NoSQL database.
- Spark: Fast data analytics and cluster computing using in-memory processing.
- Kafka: A high-throughput, low-latency, real-time streaming platform built on a publish-subscribe messaging system.
- ML Services: A server for hosting and managing parallel distributed R processes.
- Interactive Query: Uses Hive (SQL on Hadoop) and LLAP (Low Latency Analytical Processing).
- Storm: Reliable processing of real-time data streams.
Azure HDInsight can run on top of Azure Data Lake, which lets us analyze large-scale data workloads in Hadoop. Usability and support from Microsoft are outstanding.
Compared with the maintenance cost of a classic on-premise Apache Hadoop deployment, Azure HDInsight is very cost-effective and leaves plenty of room to optimize our data.
Active Directory integration with HDInsight requires a few components: the Enterprise Security Package (ESP) and, to support it, a deployment of Azure Active Directory Domain Services. Microsoft also provides a high-availability guarantee.
In short, Azure HDInsight provides the most popular open-source frameworks, all easily accessible from the portal. If you need a combination of multiple clusters, for example HDInsight Kafka for streaming together with Interactive Query, this would be a great choice.
Databricks was founded by the creators of Spark. The team behind Databricks keeps optimizing the Apache Spark engine to run faster and faster, and the Databricks platform is reported to deliver around five times the performance of open-source Apache Spark. With Databricks, you get collaborative notebooks, integrated workflows, and enterprise security, all in a fully managed cloud platform.
Azure Databricks is often the best choice for an enterprise running Azure cloud services, as it is a Spark-based analytics platform optimized specifically for the Microsoft Azure cloud. It is ideal for enterprises that want to increase collaboration among their data scientists and run Spark-based workloads efficiently at much higher performance.
The Databricks Lakehouse Platform (Unified Analytics Platform) makes the power of Spark accessible. It is a highly adaptable solution for data engineering, data science, and AI. On the downside, load times are not always consistent, and there is no ability to restrict data access to specific users or groups.
Azure Databricks runs on a premium Spark cluster that is faster than open-source Spark. It is a PaaS solution that does not require much administration after the initial setup, and it provides security through Azure Active Directory integration without any custom configuration. In short, it brings all the advantages of Databricks to Azure.
Let’s understand “Spark” in short…
It is a general-purpose distributed data processing engine that can be used in a wide range of circumstances. It ships with libraries for SQL, machine learning, graph computing, and stream processing. Spark does not provide storage, only a computation engine. It extends the Hadoop MapReduce model to work in a more optimized way.
Analytics Solution Architecture Considerations
Comparison - HDInsight and Databricks
Let’s look at a full comparison between Azure HDInsight and Azure Databricks services to see where each one excels:
| Feature | Azure HDInsight | Azure Databricks |
|---|---|---|
| Pricing | Per Cluster Time | Per Cluster Time (VM cost + DBU processing time) |
| Engine | Apache Hive or Apache Spark | Apache Spark, but optimized for Databricks since founders were creators of Spark |
| Development environment | Ambari (Hortonworks), Zeppelin if using Spark | Databricks Notebooks, RStudio for Databricks |
| De Facto Language | HiveQL, open source | R, Python, Scala, Java, SQL, mostly open-source languages |
| Integration with Data Factory | Yes, to run MapReduce jobs, Pig, and Spark scripts | Yes, to run notebooks, or Spark scripts (Scala, Python) |
| Scalability | Not scalable, requires cluster shutdown to resize | Easy to change machines and allows auto-scaling |
| Testing | Easy, Ambari allows interactive query execution (if Hive). If using Spark, Zeppelin | Very easy, notebook functionality is extremely flexible |
| Setup and Administration | Complex, we must decide cluster types and sizes | Easy, Databricks offers two main types of services and clusters can be modified with ease |
| Sources | Wide variety, ADLS, Blob and databases with Sqoop | Wide variety, ADLS, Blob, flat files in cluster and databases with Sqoop |
| Migratability | Easy as long as new platform supports MapReduce or Spark | Easy as long as new platform supports Spark |
| Flexibility | Flexible as long as developers know basic SQL | Very flexible as almost all analytic-based languages are supported |
| Reporting and visualization | Tableau, Power BI (if using Spark), Qlik | Tableau, open-source packages such as ggplot2, matplotlib, bokeh, etc. |
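The pricing entries in the comparison can be turned into a small back-of-the-envelope calculation. The sketch below assumes that both services charge VM time per node and that Databricks adds a DBU charge on top, as the comparison states. Every rate in it is a made-up placeholder, not a real Azure price, and the names `ClusterRates` and `cluster_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterRates:
    """Hourly rates for one node. All example values are placeholders."""
    vm_rate_per_hour: float          # VM price per node-hour
    dbu_per_node_hour: float = 0.0   # DBUs consumed per node-hour (0 for HDInsight)
    dbu_rate: float = 0.0            # price per DBU


def cluster_cost(rates, nodes, minutes):
    """Estimate the cost of running `nodes` machines for `minutes` minutes."""
    hours = minutes / 60
    vm_cost = rates.vm_rate_per_hour * nodes * hours
    dbu_cost = rates.dbu_per_node_hour * rates.dbu_rate * nodes * hours
    return round(vm_cost + dbu_cost, 2)


if __name__ == "__main__":
    # Placeholder rates chosen only to make the arithmetic easy to follow.
    hdinsight = ClusterRates(vm_rate_per_hour=0.50)
    databricks = ClusterRates(vm_rate_per_hour=0.50,
                              dbu_per_node_hour=1.0,
                              dbu_rate=0.30)

    print(cluster_cost(hdinsight, nodes=4, minutes=90))   # VM time only
    print(cluster_cost(databricks, nodes=4, minutes=90))  # VM time plus DBUs
```

With these placeholder rates, a 4-node cluster running for 90 minutes comes to 3.0 for the VM-only case and 4.8 once the DBU charge is added. The sketch does not model the difference the article highlights elsewhere: an HDInsight cluster keeps accruing cluster time while idle because it cannot be turned off.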
Pros and Cons - HDInsight and Databricks
Pros of HDInsight
- Highly scalable and highly available
- Great backup facility and disaster recovery
- Simplified cluster creation and deletion
- Easy data management, with data available for retrieval at any time
- It is low cost when compared to on-premise Hadoop. It is cost-effective to collect and store structured or unstructured data.
- The flat network storage technology offers a high-speed connection between the nodes and the Blob Storage system.
- A 99.9% SLA at large scale, with very good support available from Microsoft.
- Transparent data encryption for end to end security
- More than 30 popular applications to choose from. These can be deployed to the cluster within minutes.
- Easy transformation of high volume data
- A Hadoop cluster can be built within minutes
Cons of HDInsight
- It does not support the full range of Hadoop features
- Lack of integration with other Azure platforms
- Workload-based scaling still has room for improvement
- The bundled Spark version is outdated
- Not easy to use: log reports show very little and the user interface is not user-friendly
- Azure expertise is needed to handle errors and adapt the application
- Some customers have noticed performance issues when dealing with large volumes of data
- Although the cost is lower than on-premise Hadoop, it is still on the higher side, and not every customer can afford it.
Pros of Databricks
- The Databricks Lakehouse Platform uses Apache Spark in the backend so that all computation is fast and distributed. This helps data pipelines process huge amounts of big data in less time and at low cost.
- Supports major data sources, with seamless integration with Azure cloud platform services such as Azure Data Lake Storage, Blob Storage, Azure Data Factory, and Azure DevOps.
- SQL based & hence easy to adopt
- Interactive analysis with notebook-style coding
- User friendly – a new user can easily navigate through SQL/Python queries
- Great performance
- Ready-to-use Spark environment with zero configuration required
- It supports the major data science programming languages, such as R, Scala, Python, SQL, and Java
- Plenty of training resources are available for developers, data scientists, data engineers, and other IT professionals who want to learn Apache Spark.
- Models can be deployed to production within minutes
- It has tools that support collaboration between developers
- Supports complex transformations
- Databricks Community Edition is a free version aimed at beginners who want an easy start with a big data platform. It does not have every feature of the full version but is still adequate for those who are new to coding.
Cons of Databricks
- No data backup feature
- Hard to debug code
- Errors can be difficult to understand at times
- It is difficult to regenerate tables when the connection is lost: sessions sometimes reset automatically, which wipes temporary tables from memory
- Integration with Git has challenges
- Many customers consider the cost high
HDInsight has always been very reliable when we know the workloads and the cluster sizes we will need to run them. Scaling, however, is tedious: machines must be deleted and activated iteratively until we find the right configuration. Using Hive is a perk: because it is open source and very similar to SQL, we can get straight down to developing without further training. By using Hive, we also take full advantage of the power of MapReduce, which shines when there are huge amounts of data. In general, remember that if you have many long-running jobs that need a lot of power, Azure HDInsight could be a better fit than Azure Databricks.
Databricks seems an ideal choice when an interactive notebook experience (like Jupyter) is a must and when data engineers and data scientists need to work together to get insights from data and adapt smoothly to different situations, since scaling is extremely easy. Another perk of Databricks is its speed, thanks in large part to the Spark creators behind it. Remember too that if all you need is a Spark cluster, Azure Databricks will provide one with better performance than an open-source Spark cluster.
In other words, the choice between Azure HDInsight and Azure Databricks depends on the use case you want to solve. The most important question is how the data scientists are going to work. If they will work without collaborating, it could be wiser to choose Azure HDInsight. If there will be a lot of collaboration, Azure Databricks can go the extra mile thanks to its shared notebooks and readily available workflows.
Last but not least, if you would like a Kafka based streaming service that is connected to a transformation tool, then the combination of HDInsight Kafka and Azure Databricks is the right solution.