‘PipelinedRDD’ object has no attribute ‘toDF’ in PySpark

Asked on November 24, 2018 in Apache-spark.

  • 1 Answer

    The toDF method is a monkey patch applied by the SparkSession constructor (the SQLContext constructor in Spark 1.x), so to use it you need to create a SQLContext (or SparkSession) first:

    # SQLContext or HiveContext in Spark 1.x
    from pyspark.sql import SparkSession
    from pyspark import SparkContext

    sc = SparkContext()

    rdd = sc.parallelize([("a", 1)])
    hasattr(rdd, "toDF")
    ## False

    # Constructing the SparkSession is what monkey-patches toDF onto RDDs
    spark = SparkSession(sc)
    hasattr(rdd, "toDF")
    ## True

    rdd.toDF().show()
    ## +---+---+
    ## | _1| _2|
    ## +---+---+
    ## |  a|  1|
    ## +---+---+
    

    A SparkSession (or SQLContext in Spark 1.x) is required to work with DataFrames; creating it is what attaches toDF to RDDs.
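
    As a side note, in Spark 2.x the session is more commonly created with SparkSession.builder, which runs the same constructor and therefore also makes toDF available; spark.createDataFrame works on the RDD without relying on the patch at all. A minimal sketch (the app name and column names below are just placeholders):

    from pyspark.sql import SparkSession

    # builder.getOrCreate() constructs the SparkSession (and SparkContext if needed)
    spark = SparkSession.builder.appName("toDF-example").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize([("a", 1)])

    # toDF is available because the SparkSession constructor has already run;
    # an optional list of column names can be passed
    rdd.toDF(["letter", "number"]).show()

    # Alternative that does not depend on the monkey patch at all
    spark.createDataFrame(rdd, ["letter", "number"]).show()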

    Answered on November 24, 2018.

