Extract column values of DataFrame as List in Apache Spark
This should return the collection containing a single list:
dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()
Without the mapping, you just get a Row object, which contains every column from the database.
Keep in mind that this will probably get you a list of type Any. If you want to specify the result type, use .asInstanceOf[YOUR_TYPE] in the mapping: r => r(0).asInstanceOf[YOUR_TYPE]
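For example, here is a minimal sketch of a typed extraction; the name dataFrame and the column "name" are illustrative, and it assumes the column actually holds strings:

import org.apache.spark.sql.DataFrame

// Collect a single string column as a List[String]; the cast will
// throw a ClassCastException at runtime if the column is not a string.
def columnAsStringList(dataFrame: DataFrame): List[String] =
  dataFrame
    .select("name")
    .rdd
    .map(r => r(0).asInstanceOf[String])
    .collect()
    .toList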
P.S. Due to automatic conversion, you can skip the .rdd part.
With Spark 2.x and Scala 2.11
There are three possible ways to convert the values of a specific column to a List.
Common code setup for all three approaches:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate
import spark.implicits._ // for .toDF() method

val df = Seq(
  ("first", 2.0),
  ("test", 1.5),
  ("choose", 8.0)
).toDF("id", "val")
df.select("id").collect().map(_(0)).toList // res9: List[Any] = List(one, two, three)
In this there are collection of data to Driver with collect() and picking element zero from each record.
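If you want a typed list instead of List[Any], a small variant (assuming the "id" column holds strings) is to use the typed Row getter:

df.select("id").collect().map(_.getString(0)).toList // List[String] = List(first, test, choose)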
df.select("id").rdd.map(r => r(0)).collect.toList //res10: List[Any] = List(one, two, three)
In this we got distributed map transformation load among the workers rather than single Driver.
Here rdd.map(r => r(0)) does not seems elegant.
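The typed getter from the first approach works here as well; a sketch, again assuming "id" holds strings:

df.select("id").rdd.map(_.getString(0)).collect.toList // List[String] = List(first, test, choose)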
df.select("id").map(r => r.getString(0)).collect.toList //res11: List[String] = List(one, two, three)
In this DataFrame is not converted to RDD. Look at map it won’t accept r => r(0)(or _(0)) as the previous approach due to encoder issues in DataFrame. So end up using r => r.getString(0) and it would be addressed in next versions of Spark.
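A related alternative, not one of the three approaches above, is to convert the column to a typed Dataset with .as[String]; this sidesteps the Row getter entirely and relies on the spark.implicits._ import from the setup code (it assumes "id" is a string column):

df.select("id").as[String].collect.toList // List[String] = List(first, test, choose)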
All of the methods above give the same output, but the second and third are more efficient; the third is both efficient and elegant.
The question and the answers above assume Scala, so here is a small snippet of Python code in case a PySpark user is curious. The syntax is similar to the answer above, but to properly pop the list out you have to reference the column name a second time in the mapping function, and there is no need for the select statement.
i.e., given a DataFrame containing a column named "Raw":
To get each row value in "Raw" combined as a list, where each entry is a row value from "Raw", simply use:
MyDataFrame.rdd.map(lambda x: x.Raw).collect()