In this post we continue our series implementing the algorithms found in the Data-Intensive Text Processing with MapReduce book, this time discussing data joins. While we are going to discuss the techniques for joining data in Hadoop and provide sample code, in most cases you probably won’t be writing code to perform joins yourself. Instead, joining data is better accomplished using tools that work at a higher level of abstraction, such as Hive or Pig. Why take the time to learn how to join data if there are tools that can take care of it for you? Joining data is arguably one of the biggest uses of Hadoop. Gaining a full understanding of how Hadoop performs joins is critical for deciding which join to use and for debugging when trouble strikes. Also, once you fully understand how the different joins are performed in Hadoop, you can better leverage tools like Hive and Pig. Finally, there may be the one-off case where a tool just won’t get you what you need and you’ll have to roll up your sleeves and write the code yourself.
The Need for Joins
When processing large data sets, the ability to join data by a common key can be very useful, if not essential. Joining data lets you gain further insight; for example, joining records on timestamps can correlate events with a time of day. The reasons for joining data are many and varied. We will be covering 3 types of joins over 3 separate posts: Reduce-Side joins, Map-Side joins and the Memory-Backed Join. In this installment we will be working with Reduce-Side joins.
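Before diving in, here is a minimal sketch of the general shape of a reduce-side join in Hadoop’s Java MapReduce API. This is an illustrative example only, not the implementation from the book: each mapper tags its records with a source marker so the reducer can tell the two data sets apart, and the reducer buffers one side in memory to emit the join. The class names (ReduceSideJoinExample, DatasetAMapper, JoinReducer), the "A|"/"B|" tags, the comma-delimited input, and the assumption that the join key is the first field are all inventions for this sketch.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoinExample {

    // Mapper for the first data set: extracts the join key (assumed to be
    // the first comma-delimited field) and tags the record with its source.
    // A second mapper for the other data set would be identical except for
    // the "B|" tag.
    public static class DatasetAMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String record = line.toString();
            String joinKey = record.split(",")[0];
            context.write(new Text(joinKey), new Text("A|" + record));
        }
    }

    // Reducer: all records sharing a join key arrive together; separate
    // them by source tag and emit the cross product (an inner join).
    public static class JoinReducer
            extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> sideA = new ArrayList<>();
            List<String> sideB = new ArrayList<>();
            for (Text value : values) {
                String record = value.toString();
                if (record.startsWith("A|")) {
                    sideA.add(record.substring(2));
                } else {
                    sideB.add(record.substring(2));
                }
            }
            for (String a : sideA) {
                for (String b : sideB) {
                    context.write(key, new Text(a + "," + b));
                }
            }
        }
    }
}

Note that buffering one side in the reducer can exhaust memory for heavily skewed keys; the pattern described in the book avoids this with a composite key and secondary sort, which we will look at in detail.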
Source: http://www.javacodegeeks.com/2013/07/mapreduce-algorithms-understanding-data-joins-part-1.html