What is the difference between map and flatMap and a good use case for each?

Asked on November 14, 2018 in Apache-spark.

  • 3 Answer(s)

    This is a good example of the difference, shown as a spark-shell session:

    First, some data – two lines of text:

    val rdd = sc.parallelize(Seq("Roses are red", "Violets are blue")) // lines
    rdd.collect
        res0: Array[String] = Array(Roses are red, Violets are blue)

    Here, map transforms an RDD of length N into another RDD of length N.

    For instance, it maps from two lines into two line-lengths:

    rdd.map(_.length).collect
        res1: Array[Int] = Array(13, 16)

        Here flatMap (loosely speaking) converts an RDD of length N into a collection of N collections, then flattens these into a single RDD of results.

    rdd.flatMap(_.split(" ")).collect
        res2: Array[String] = Array("Roses", "are", "red", "Violets", "are", "blue")

        Here we have multiple words per line and multiple lines, but we end up with a single output array of words.

    To clarify, flatMapping from a collection of lines to a collection of words looks like:

    ["aa bb cc", "", "dd"] => [["aa","bb","cc"],[],["dd"]] => ["aa","bb","cc","dd"]

    The input and output RDDs will typically be of different sizes for flatMap.
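    The diagram above can be checked without a Spark cluster: plain Scala collections have the same map/flatMap semantics as RDDs. A minimal sketch (the filter on empty strings mirrors how the empty line produces an empty inner collection):

    ```scala
    // Plain Scala collections behave like RDDs for map/flatMap semantics.
    val lines = Seq("aa bb cc", "", "dd")

    // map keeps exactly one result per input line: a nested structure.
    val nested = lines.map(_.split(" ").filter(_.nonEmpty).toList)

    // flatMap flattens the per-line word lists into a single sequence.
    val words = lines.flatMap(_.split(" ").filter(_.nonEmpty))

    println(nested) // List(List(aa, bb, cc), List(), List(dd))
    println(words)  // List(aa, bb, cc, dd)
    ```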

         Had we used map with our split function, we would have ended up with a nested structure (an RDD of arrays of words, with type RDD[Array[String]]), because map must produce exactly one result per input:

    rdd.map(_.split(" ")).collect
        res3: Array[Array[String]] = Array(
                       Array(Roses, are, red),
                       Array(Violets, are, blue))

         Finally, a good special case is mapping with a function which might not return an answer, and so returns an Option. Then flatMap can be used to filter out the elements that return None and to extract the values from those that return a Some:

    val rdd = sc.parallelize(Seq(1, 2, 3, 4))
    def myfn(x: Int): Option[Int] = if (x <= 2) Some(x * 10) else None
    rdd.flatMap(myfn).collect
        res4: Array[Int] = Array(10, 20)
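    The same Option-flattening can be verified without Spark, on a plain Scala Seq, reusing the myfn defined above:

    ```scala
    // Returns Some for small inputs, None otherwise, as in the session above.
    def myfn(x: Int): Option[Int] = if (x <= 2) Some(x * 10) else None

    // flatMap unwraps the Somes and drops the Nones in one step,
    // which is exactly the filtering behavior described for RDD.flatMap.
    val result = Seq(1, 2, 3, 4).flatMap(myfn)
    println(result) // List(10, 20)
    ```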
    Answered on November 14, 2018.

    As for the difference between RDD.map and RDD.flatMap in Spark: map transforms an RDD of size N into another one of size N.

    For example:

    myRDD.map(x => x * 2)

    doubles each element (for instance, if the RDD is composed of Doubles), producing an RDD of the same size N.

    While flatMap can transform the RDD into another one of a different size:

    For example:

    myRDD.flatMap(x => Seq(2 * x, 3 * x))

    This will return an RDD of size 2*N. Or:

    myRDD.flatMap(x => if (x < 10) Seq(2 * x, 3 * x) else Seq(x))
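    A quick plain-collections sketch of those size changes (the input values here are illustrative, not from the answer):

    ```scala
    val xs = Seq(5, 12, 20)

    // Every element yields two results, so the output size is 2 * N.
    val doubledAndTripled = xs.flatMap(x => Seq(2 * x, 3 * x))
    println(doubledAndTripled) // List(10, 15, 24, 36, 40, 60)

    // Elements below 10 yield two results, the rest one: the size varies.
    val mixed = xs.flatMap(x => if (x < 10) Seq(2 * x, 3 * x) else Seq(x))
    println(mixed) // List(10, 15, 12, 20)
    ```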
    Answered on November 14, 2018.

    Here test.md is used as an example:

    ➜ spark-1.6.1 cat test.md
    This is the first line;
    This is the second line;
    This is the last line.
    scala> val textFile = sc.textFile("test.md")
    scala> textFile.map(line => line.split(" ")).count()
    res2: Long = 3
    scala> textFile.flatMap(line => line.split(" ")).count()
    res3: Long = 15
    scala> textFile.map(line => line.split(" ")).collect()
    res0: Array[Array[String]] = Array(Array(This, is, the, first, line;), Array(This, is, the, second, line;), Array(This, is, the, last, line.))
    scala> textFile.flatMap(line => line.split(" ")).collect()
    res1: Array[String] = Array(This, is, the, first, line;, This, is, the, second, line;, This, is, the, last, line.)

        Here the map method preserves the number of lines of test.md, while the flatMap method yields the number of words.

         Both map and flatMap return a new RDD. map is typically used for one-to-one transformations, while flatMap is typically used to split lines into words.
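    The line and word counts above can be reproduced with an in-memory stand-in for test.md, using plain Scala collections in place of sc.textFile:

    ```scala
    // Three lines standing in for the contents of test.md.
    val textFile = Seq(
      "This is the first line;",
      "This is the second line;",
      "This is the last line."
    )

    // map keeps one element per line; flatMap yields one element per word.
    val lineCount = textFile.map(_.split(" ")).size
    val wordCount = textFile.flatMap(_.split(" ")).size
    println(s"$lineCount lines, $wordCount words") // 3 lines, 15 words
    ```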

    Answered on November 14, 2018.
