Extract column values of Dataframe as List in Apache Spark


Asked on November 19, 2018 in Apache-spark.


  • 3 Answers

    This should return a collection containing a single list of the column's values:

    dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()
    
    

    Without the mapping, you just get a Row object, which contains every column from the DataFrame.

    Note that this will probably get you a list of type Any. If you need to specify the result type, use .asInstanceOf[YOUR_TYPE] in the mapping: r => r(0).asInstanceOf[YOUR_TYPE]

    P.S. Due to automatic conversion, you can skip the .rdd part.
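
    For illustration, here is a minimal sketch of the typed mapping described above, assuming the column holds strings (dataFrame and YOUR_COLUMN_NAME are placeholders, as in the snippet above):

    dataFrame.select("YOUR_COLUMN_NAME")
      .rdd
      .map(r => r(0).asInstanceOf[String]) // cast each value; fails at runtime if the column is not a string
      .collect()
      .toList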

    Answered on November 19, 2018.

    With Spark 2.x and Scala 2.11, there are three possible ways to convert the values of a specific column to a List.

    Common setup code, shared by all three approaches:

    import org.apache.spark.sql.SparkSession
     
    val spark = SparkSession.builder.getOrCreate
    import spark.implicits._ // for the .toDF() method
     
    val df = Seq(
        ("first", 2.0),
        ("test", 1.5),
        ("choose", 8.0)
    ).toDF("id", "val")
    

    Method 1

    df.select("id").collect().map(_(0)).toList
    // res9: List[Any] = List(first, test, choose)
    

    Here we collect the data to the driver with collect() and pick element zero from each record.
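
    As a sketch (not part of the original answer), if the column type is known, Row's typed getter avoids List[Any]:

    // assumes the "id" column contains strings, as in the setup above
    df.select("id").collect().map(_.getString(0)).toList
    // expected: List[String] = List(first, test, choose)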

    Method 2

    df.select("id").rdd.map(r => r(0)).collect.toList
    //res10: List[Any] = List(first, test, choose)
    

    Here the map transformation is distributed among the workers rather than running on the single driver.

    However, rdd.map(r => r(0)) does not seem elegant; a tidier variant follows.
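
    A small sketch (an addition, not from the original answer) that keeps the distributed map but uses the typed getter:

    df.select("id").rdd.map(_.getString(0)).collect.toList
    // expected: List[String] = List(first, test, choose)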

    Method 3

    df.select("id").map(r => r.getString(0)).collect.toList
    //res11: List[String] = List(first, test, choose)
    

    Here the DataFrame is not converted to an RDD. Note that map will not accept r => r(0) (or _(0)) as in the previous approach, due to encoder issues in DataFrame, so you end up using r => r.getString(0); this may be addressed in future versions of Spark.
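    As a related sketch (an addition, not claimed by the original answer), converting the single column to a typed Dataset with as[String] sidesteps the Row getter entirely; this assumes spark.implicits._ is in scope, as in the setup above:

    df.select("id").as[String].collect().toList
    // expected: List[String] = List(first, test, choose)
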

    Final conclusion
    All three methods give the same output, but methods 2 and 3 are more effective; the third one is both effective and elegant.

    Answered on November 19, 2018.

         The question and the answers above assume Scala, so here is a little snippet of Python code in case a PySpark user is curious. The syntax is similar to the given answers, but to properly pop the list out you have to reference the column name a second time in the mapping function, and there is no need for a select statement.

    For example, given a DataFrame containing a column named "Raw":

    To combine the row values of "Raw" into a list, where each entry is one row's value from "Raw", simply use:

    MyDataFrame.rdd.map(lambda x: x.Raw).collect()
    
    Answered on November 19, 2018.

