This is an email I sent to my colleagues sharing my Hadoop learning experience. I hope it can be of some help to others on the Internet as well.
I thought it would be good to share some of my experience learning Hadoop, to help others avoid the mistakes I made and to pass along some useful links. I learned from the Apache distribution, so this probably applies mainly to Apache (and perhaps partially to other distributions). In any case, I hope it helps. Here are my suggestions:
1. To start, you may want to follow this tutorial to get a MapReduce job running in single-node mode.
This will definitely help you understand how Hadoop works, and it gives you a good prototype to scale out to a real cluster later. In this tutorial you'll learn how to set up a single-node cluster and how to write a MapReduce job. Here's another article on setting up a single-node cluster that is easier to read.
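To give you a concrete picture, here is a minimal sketch of the classic WordCount job against the current MapReduce API (org.apache.hadoop.mapreduce). The class and variable names are my own, and the tutorial's version may differ slightly:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Mapper: split each input line into tokens and emit (word, 1).
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum up the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }

Once this runs in single-node mode, the exact same classes carry over unchanged to a real cluster, which is what makes it such a good prototype.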
2. Set up a multi-node cluster
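One quick way to verify that a client machine can actually talk to the new cluster is a tiny HDFS listing program. This is just a sketch; the namenode host and port below are placeholders for whatever your cluster's fs.defaultFS actually is:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder address: point fs.defaultFS at your cluster's namenode.
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
            FileSystem fs = FileSystem.get(conf);
            // List the HDFS root to confirm the cluster is reachable.
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }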
3. Run a bigger Hadoop job
When you scale out and start processing a much bigger data set, new issues always appear. So you may want to run a job on a real cluster to see what it feels like. It doesn't have to be a complicated job, but the input dataset should be HUGE. That way you learn how the cluster runs jobs, distributes data, manages its nodes, and so on.
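For reference, a driver that submits such a job looks roughly like the sketch below, assuming the WordCount classes from step 1. The input/output paths and the reducer count are just examples, but tuning the number of reducers is exactly the kind of knob you only start to appreciate on a big input:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            // The combiner pre-aggregates map output locally,
            // which matters a lot on big inputs.
            job.setCombinerClass(WordCount.IntSumReducer.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Example value only: more reducers spread the shuffle across the cluster.
            job.setNumReduceTasks(8);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. a big HDFS dataset
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

    // Submit it to the cluster with something like:
    //   hadoop jar wordcount.jar WordCountDriver /data/big-input /data/output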
Beyond the above, the Hadoop documentation and Stack Overflow are always my top choices for troubleshooting. The Hadoop Wiki is also a very valuable resource for the common questions that come up while learning Hadoop.
Hope this helps. Cheers