This is part 1 in a series of blog posts about web analytics using Hadoop in the cloud.
Why would you want to run your Hadoop cluster in the cloud? There may be several reasons. Perhaps you don't have the resources (time, money and skills) to operate a Hadoop cluster in-house, or your demand fluctuates frequently; the arguments are the same as for cloud services in general. The important part when applying this to Hadoop is how to keep your data and metadata when taking down your Hadoop cluster. It is also preferable that the service doesn't take too long to spin up your cluster.
I have tried out both Amazon Elastic MapReduce and Azure HDInsight, and I'm impressed by both offerings. However, being more comfortable and experienced with Windows than Linux, and working for a company that runs pretty much all its systems on Microsoft technology (even BI), the choice was never hard when deciding on a service for my proof of concept. I look forward to a GA release of Azure HDInsight so that I can fully recommend it as ready for production use cases.
So, how do you get started? First, apply for the Azure HDInsight feature preview (you need to log in to Azure first); it may take a while to be granted access, so don't wait, just do it. Second, download the local HDInsight distribution so you can start developing on your local machine before throwing your code at a bigger dataset.
If you don't already have a good use case for your big data/Hadoop pilot, I suggest that you start by analysing your web logs. The reason: even if you don't have tens of TBs of web logs, the data is by nature well suited for practicing development and analysis on Hadoop. The data doesn't change once it is written, and it is pretty easy to transport and parse. Also, in most businesses nowadays, the web presence is a crucial part of the business. That use case may even be your Trojan horse to get a Hadoop implementation through the finance department, as some web analytics solutions are ridiculously overpriced data collection tools. Cost savings usually serve as a stronger argument than strategic objectives; perhaps there is some truth in "it is easier to save a buck than to earn one". For example, it is not unusual for a proprietary web analytics SaaS solution to cost 200,000 USD annually. That is approximately what you would pay for a 25-30 node cluster running 24/7, and that is a lot of computing power for a standard web analytics implementation at a medium/large company.
I strongly suggest you get your hands on "Programming Hive" and "Programming Pig" to accompany you on your big data journey. If you are interested in operating a Hadoop cluster in-house, then "Hadoop: The Definitive Guide" is probably a good investment as well.
But this blog series will focus on Hadoop in the cloud, or more specifically:
- Azure HDInsight as Hadoop compute cluster
- Azure Blob storage for persistent data storage
- Azure SQL Database as persistent metastore
- Excel PowerPivot and Power View as analytics frontend
- Pig for data processing/enhancement
- HCatalog for metadata management
- Hive for analysis
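To give a taste of the Pig part of that stack, here is a minimal sketch of the kind of web log processing the coming posts will build on. Everything in it is hypothetical for illustration: the input path, the space-separated log schema and the output location are made up, and a real log format would need a proper loader:

```pig
-- Hypothetical sketch: count successful page requests per URL in a web log.
-- The path and schema below are placeholders, not a real log format.
logs    = LOAD '/example/data/weblogs'
          USING PigStorage(' ')
          AS (ip:chararray, ts:chararray, url:chararray, status:int);
ok      = FILTER logs BY status == 200;      -- keep only successful requests
by_url  = GROUP ok BY url;                   -- one group per URL
counts  = FOREACH by_url GENERATE
              group AS url,
              COUNT(ok) AS hits;             -- requests per URL
STORE counts INTO '/example/output/url_counts';
```

On HDInsight the input and output paths would typically point at Azure Blob storage rather than local HDFS; later posts in the series will cover that in detail.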
Please comment, connect or send me a message if you have feedback or questions. Let the big data journey begin.