Reduce a key-value pair into a key-list pair with Apache Spark

Asked on November 23, 2018 in Apache-spark.


  • 3 Answer(s)

    For Map and ReduceByKey

    Here the input and output types of reduceByKey must be the same, so if you need to aggregate values into a list, first map each value into a single-element list, then combine those lists into one.

    For Combining Lists

    Python provides several ways to combine lists into one.

    append modifies the first list in place, adding its argument as a single element, and always returns None:

    x = [1, 2, 3]
    x.append([4, 5])
    # x is [1, 2, 3, [4, 5]]
    

    extend also modifies the list in place and returns None, but it adds each element of its argument individually:

    x = [1, 2, 3]
    x.extend([4, 5])
    # x is [1, 2, 3, 4, 5]
    

    Both methods return None, but reduceByKey needs a function that returns the combined list. For that, simply use the plus operator:

    x = [1, 2, 3] + [4, 5]
    # x is [1, 2, 3, 4, 5]
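
    As a minimal sketch with made-up data (not part of the original answer), this is why the return value matters once the combining happens inside reduceByKey:

    pairs = sc.parallelize([("k", [1]), ("k", [2])])

    # extend mutates its operand and returns None, so the merged value becomes None
    broken = pairs.reduceByKey(lambda a, b: a.extend(b))
    # + builds and returns a new list, which is what reduceByKey needs
    works = pairs.reduceByKey(lambda a, b: a + b)

    print(broken.collect())   # [('k', None)]
    print(works.collect())    # [('k', [1, 2])]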
    

    For Spark

    file = sc.textFile("hdfs://...")
    counts = (file.flatMap(lambda line: line.split(" "))
        # pair each word with its leading name: "name,..." becomes (name, word)
        .map(lambda actor: (actor.split(",")[0], actor))
        # transform each value into a list
        .map(lambda nameTuple: (nameTuple[0], [nameTuple[1]]))
        # combine lists: [1, 2, 3] + [4, 5] becomes [1, 2, 3, 4, 5]
        .reduceByKey(lambda a, b: a + b))
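
    As an illustration with hypothetical input (the original reads from HDFS, so the exact record layout is only implied by the code), the same chain on a small in-memory RDD produces one list of records per leading name:

    # hypothetical "name,title" words mirroring the format assumed above
    lines = sc.parallelize(["Neeson,Taken Neeson,Silence", "Hanks,Greyhound"])
    grouped = (lines.flatMap(lambda line: line.split(" "))
        .map(lambda actor: (actor.split(",")[0], actor))
        .map(lambda nameTuple: (nameTuple[0], [nameTuple[1]]))
        .reduceByKey(lambda a, b: a + b))
    print(grouped.collect())
    # e.g. [('Neeson', ['Neeson,Taken', 'Neeson,Silence']), ('Hanks', ['Hanks,Greyhound'])]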
    

    The same problem can also be solved with combineByKey or groupByKey; a combineByKey sketch follows.
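
    As a minimal sketch with made-up data (combineByKey is mentioned above but not shown in the original answer): createCombiner starts a list from the first value of a key, mergeValue appends further values within a partition, and mergeCombiners concatenates partial lists from different partitions.

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    grouped = pairs.combineByKey(
        lambda value: [value],             # createCombiner: start a new list
        lambda acc, value: acc + [value],  # mergeValue: add one more value
        lambda acc1, acc2: acc1 + acc2     # mergeCombiners: join partial lists
    )
    print(grouped.collect())   # e.g. [('a', [1, 3]), ('b', [2])]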

     

    Answered on November 23, 2018.

    Here is an alternative solution to the problem (the lambda is written without tuple unpacking so it also runs on Python 3):

    >>> foo = sc.parallelize([(1, ('a','b')), (2, ('c','d')), (1, ('x','y'))])
    >>> foo.map(lambda kv: (kv[0], [kv[1]])).reduceByKey(lambda p, q: p + q).collect()
    [(1, [('a', 'b'), ('x', 'y')]), (2, [('c', 'd')])]
    

     

    Answered on November 23, 2018.

    This solution uses the RDD groupByKey method.

    Try this code:

    data = [(1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')]
    rdd = sc.parallelize(data)
    # groupByKey yields an iterable per key; mapValues(list) turns it into a plain list
    result = rdd.groupByKey().mapValues(list).collect()
    

    The output will be:

    [(1, ['a', 'b']), (2, ['c', 'd', 'e']), (3, ['f'])]
    
    Answered on November 23, 2018.

