Spark RDD to DataFrame python



  • 2 Answer(s)

    There are two ways to convert a Spark RDD to a DataFrame in Python:

    • toDF()
    • createDataFrame(rdd, schema)

    Using the toDF() method

    Row() accepts **kwargs, so each RDD record can be turned into a dict of column names and values, unpacked into a Row, and the resulting RDD of Rows converted with toDF():

    from pyspark.sql import Row
     
    # Map each positional value in a record to a column named by its index ("0", "1", ...).
    def f(x):
        return {str(i): value for i, value in enumerate(x)}
     
    # Turn every record into a Row, then convert the RDD of Rows to a DataFrame.
    df = rdd.map(lambda x: Row(**f(x))).toDF()
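
    One caveat with index-based column names, offered as a sketch and not as part of the original answer: in Spark 2.x, a Row built from keyword arguments sorts its fields by name, so a column named "10" ends up before "2". Zero-padding the names keeps string order equal to numeric order. The helper names padded_names and to_named_dict below are hypothetical, and the snippet is plain Python so it can be tried without a cluster.

```python
def padded_names(n):
    """Return n column names, zero-padded so string order matches numeric order."""
    width = len(str(max(n - 1, 0)))
    return [str(i).zfill(width) for i in range(n)]


def to_named_dict(record):
    """Pair each positional value in a record with its zero-padded column name."""
    return dict(zip(padded_names(len(record)), record))


# Unpadded names sort as strings, which scrambles the numeric order.
plain = sorted(str(i) for i in range(12))

# Padded names keep the same order after sorting.
padded = sorted(padded_names(12))
```

    With such a helper, the mapping step would become rdd.map(lambda x: Row(**to_named_dict(x))).toDF(), assuming every record has the same length.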
    

    Using the createDataFrame(rdd, schema) method

    Build an explicit schema and pass it to createDataFrame() together with the RDD:

    from pyspark.sql.types import StructType, StructField, StringType
     
    # One nullable string column per field, named "0" through "31".
    # This assumes every record in the RDD has exactly 32 values.
    schema = StructType([StructField(str(i), StringType(), True) for i in range(32)])
     
    df = sqlContext.createDataFrame(rdd, schema)
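
    Because the schema declares every column as StringType, and createDataFrame() verifies each value against the declared type by default, records holding non-string values such as integers are rejected. A minimal sketch of one way to handle this is to convert the values before building the DataFrame. The helper name to_string_record is hypothetical and not part of the original answer.

```python
def to_string_record(record):
    """Convert every value in a record to str, leaving None so it stays a null."""
    return tuple(None if value is None else str(value) for value in record)


# Hypothetical usage with the rdd and schema from the answer above:
# df = sqlContext.createDataFrame(rdd.map(to_string_record), schema)

sample = to_string_record((1, 2.5, None, "x"))
```

    If the columns should keep their numeric types, declaring the matching types in the schema is the alternative to converting the data.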
    

    The second method is generally the better choice: with an explicit schema, Spark does not have to infer the column types from the data.

    Answered on January 11, 2019.

    Alternatively, try the code below:

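    # spark is an existing SparkSession; RddName is the RDD to convert.
    sc = spark.sparkContext
     
    # Infer the schema from the data, then register the DataFrame as a
    # temporary view so it can be queried with spark.sql().
    schemaPeople = spark.createDataFrame(RddName)
    schemaPeople.createOrReplaceTempView("RddName")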
    
    Answered on January 11, 2019.