MapReduce

“People You Might Know” social network friendship recommendation algorithm builds on the idea that if two people have a lot of common friends, then the system should recommend them to friend each other.

Let us use MapReduce to implement an algorithm that recommends N = 10 users with the largest number of friends in common with an ego user to this user.

Input:

The input data is made up of multiple lines, each representing a list of friends of a user and being of the following format:

:

Here is a unique integer ID corresponding to a unique user and is a comma-separated list of unique IDs corresponding to the friends of the user.

A tiny sample of the data (assume that the IDs in the friend list of each user are sorted in ascending order):

=================================
0: 1, 2, 3, 6, 8, 9
1: 0, 5, 9
2: 0, 9
3: 0, 4, 5, 7, 9
4: 3, 8
5: 1, 3, 6, 8
6: 0, 5, 9
7: 3, 8
8: 0, 4, 5, 7

9: 0, 1, 2, 3, 6

Each line, keyed by the identifier of a user, has a list of identifiers for the friends of the user. For instance, the first line implies that user 0 has friended user 1, user 2, user 3, user 6, user 8, and user 9. Besides, the relationship is mutual, which means that, if user 1 is present in the friend list of user 0, user 0 should also be present in the friend list of user 1.

We can observe that users 1, 3, 6, and 8 have friended with both user 0 and user 5, which means user 0 and user 5 share at least 4 common friends. If they have not friended each other, the recommendation algorithm should suggest them to do so.

Output:

The output should contain one line per user in the following format:

:

where is a unique ID representing a unique user and is a comma-separated list of at most 10 tuples each consisting of a unique ID representing a recommended friend that might know and the number of common friends s/he shares with (i.e., ). And this list of recommended friends should be sorted in descending order by the number of friends in common (the second field of the tuple).

For this naïve implementation, let's ignore the possible existence of already-friend pairs, meaning that you don't have to exclude a pair of users who have friended each other from recommendation.

In relation to the above example, it means that even user 1 and user 9 have friended each other, if they co-occur many times in other users' friend lists, user 1 can also show up in user 9's recommendation, and vice versa.

Assume that we use two consecutive MapReduce jobs to achieve this goal, with the output of the first MapReduce job used as the input to the second one.

The first MapReduce job treats each row in the input data as a record and counts the number of common friends each co-occurring pair of users have (or how many users have a certain pair of users appearing concurrently in their friend lists).

The second MapReduce job that consumes the first job’s output generates the sorted list of recommended friends for .

Questions:

Please provide a description of how you are going to solve this problem. Specifically, describe

1) the key-value pairs emitted from the map functions of both MapReduce jobs;

2) the main operations performed by the reduce functions of both jobs as well;

3) What algorithm design techniques can be used, and how to partition, sort, combine, and group data if necessary?