Difference between DataFrame (in Spark 2.0, i.e. Dataset[Row]) and RDD in Spark
A DataFrame is well defined; searching Google for “DataFrame definition” gives:
A data frame is nothing but a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
Here, a DataFrame has extra metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
An RDD, on the other hand, is merely a Resilient Distributed Dataset: more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.
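To make the contrast concrete, here is a minimal sketch, assuming Spark 2.x and a local SparkSession. The Person case class, the sample rows and the object name are made up for illustration. The RDD version filters with an arbitrary Scala function, which Spark can only run as-is, while the DataFrame version filters with a column expression that becomes part of the query plan.

```scala
import org.apache.spark.sql.SparkSession

object OpaqueVsStructured {
  // Hypothetical record type and data, used only for this illustration.
  case class Person(name: String, age: Int)
  val people = Seq(Person("Ann", 31), Person("Bob", 25), Person("Cid", 47))

  // RDD: the predicate is an ordinary Scala closure; Spark cannot look inside it.
  def olderThan30ViaRdd(spark: SparkSession): Seq[String] =
    spark.sparkContext.parallelize(people)
      .filter(p => p.age > 30)
      .map(_.name)
      .collect()
      .toSeq

  // DataFrame: the predicate is a column expression that is visible in the query plan.
  def olderThan30ViaDataFrame(spark: SparkSession): Seq[String] = {
    import spark.implicits._ // enables toDF(), the $"..." syntax and as[String]
    people.toDF()
      .filter($"age" > 30)
      .select("name")
      .as[String]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    // master("local[*]") is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder().appName("opaque-vs-structured").master("local[*]").getOrCreate()
    println(olderThan30ViaRdd(spark))
    println(olderThan30ViaDataFrame(spark))
    spark.stop()
  }
}
```

Both methods return the same names; the difference is only in how much of the computation Spark is able to see.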
However, you can go from a DataFrame to an RDD through its rdd method, and from an RDD to a DataFrame (if the RDD is in a tabular format) through the toDF method.
In general, it is recommended to use a DataFrame where possible because of the built-in query optimization.
The RDD is Spark's core abstraction, whereas the DataFrame is an API introduced in Spark 1.3.0.
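The round trip between the two can be sketched as follows, again assuming Spark 2.x with a local SparkSession. The Person case class and the helper names toDataFrame and toRowRdd are hypothetical and exist only for this example; rdd and toDF are the methods mentioned above.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object RddDataFrameRoundTrip {
  // Hypothetical record type, used only for this illustration.
  case class Person(name: String, age: Int)

  // RDD -> DataFrame: toDF() works because the case class gives the RDD a tabular shape.
  def toDataFrame(spark: SparkSession, rdd: RDD[Person]): DataFrame = {
    import spark.implicits._ // brings the toDF() conversion into scope
    rdd.toDF()
  }

  // DataFrame -> RDD: the rdd method exposes the rows as an RDD[Row].
  def toRowRdd(df: DataFrame): RDD[Row] = df.rdd

  def main(args: Array[String]): Unit = {
    // master("local[*]") is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder().appName("rdd-df-round-trip").master("local[*]").getOrCreate()
    val peopleRDD = spark.sparkContext.parallelize(Seq(Person("Ann", 31), Person("Bob", 25)))

    val peopleDF = toDataFrame(spark, peopleRDD)
    peopleDF.printSchema() // column names and types come from the case class fields

    val rowsRDD = toRowRdd(peopleDF)
    println(rowsRDD.count())
    spark.stop()
  }
}
```

Note that going back through rdd yields generic Row objects rather than the original Person instances.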
An RDD is a group of data partitions. RDDs have to follow certain properties, such as:
- Fault tolerant
An RDD can hold either structured or unstructured data.
The DataFrame API is available in Scala, Java, Python and R, and can process structured and semi-structured data. By definition, a DataFrame is a distributed collection of data organized into named columns. Because of that structure, Spark can optimize DataFrame operations in ways it cannot for plain RDDs. JSON data, Parquet data and Hive data can all be processed through the same DataFrame API.
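val sampleRDD = sc.textFile("hdfs://localhost:9000/jsondata.json") // RDD[String]: raw JSON text, one record per line
val sample_DF = sqlContext.read.json(sampleRDD) // DataFrame with a schema inferred from the JSON records
Here sampleRDD is an RDD holding the raw data with no schema, and sample_DF is the DataFrame built from it. The raw file is read with sc.textFile because, from Spark 1.3 on, sqlContext.jsonFile itself already returns a DataFrame rather than an RDD.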
A DataFrame is conceptually equivalent to a table in an RDBMS, and can also be manipulated in ways similar to the “native” distributed collections of RDDs. Unlike RDDs, DataFrames keep track of the schema and support various relational operations that lead to more optimized execution.
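As a small illustration of schema tracking and relational operations, the sketch below assumes Spark 2.x with a local SparkSession. The Employee case class, the sample rows and the avg_salary column name are invented for the example. The aggregation is roughly what "SELECT dept, AVG(salary) FROM employees GROUP BY dept" would express in SQL.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.avg

object SchemaAndRelationalOps {
  // Hypothetical record type, used only for this illustration.
  case class Employee(name: String, dept: String, salary: Double)

  // Builds a small DataFrame whose schema is derived from the case class.
  def sampleEmployees(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      Employee("Ann", "eng", 120.0),
      Employee("Bob", "eng", 100.0),
      Employee("Cid", "ops", 90.0)
    ).toDF()
  }

  // Relational-style aggregation: average salary per department.
  def averageSalaryByDept(employees: DataFrame): DataFrame =
    employees.groupBy("dept").agg(avg("salary").as("avg_salary"))

  def main(args: Array[String]): Unit = {
    // master("local[*]") is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder().appName("schema-and-relational-ops").master("local[*]").getOrCreate()
    val employees = sampleEmployees(spark)
    employees.printSchema()               // the schema travels with the DataFrame
    averageSalaryByDept(employees).show() // one row per department
    spark.stop()
  }
}
```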
Every DataFrame object represents a logical plan, but because of its “lazy” nature no execution occurs until the user calls a specific “output operation”.
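The lazy behaviour can be observed with a sketch like the one below, assuming Spark 2.x and a local SparkSession. The object name, the evenSquares helper and the column names are hypothetical. Building the DataFrame only assembles a plan, explain prints that plan, and the computation itself runs only when an action such as count is called.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LazyPlanDemo {
  // Only describes the computation; calling this method does not process any rows.
  def evenSquares(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.range(0, 10).toDF("n")     // numbers 0 to 9 in a column named n
      .filter($"n" % 2 === 0)        // keep the even ones
      .selectExpr("n * n AS square") // square them
  }

  def main(args: Array[String]): Unit = {
    // master("local[*]") is an assumption so the sketch runs without a cluster.
    val spark = SparkSession.builder().appName("lazy-plan-demo").master("local[*]").getOrCreate()

    val squares = evenSquares(spark)
    squares.explain(true)    // prints the logical and physical plans
    println(squares.count()) // count() is an action, so execution happens here
    spark.stop()
  }
}
```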