How to loop through each row of a DataFrame in PySpark

Asked on January 7, 2019 in Apache-spark.


  • 3 Answer(s)

    You can use map on the DataFrame's underlying RDD and pass it a custom function.

    def customFunction(row):
        return (row.name, row.age, row.city)
    sample2 = sample.rdd.map(customFunction)
    

    Or, with a lambda:

    sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))
    

    The custom function is applied to every row of the DataFrame. Note that sample2 is an RDD, not a DataFrame.

    map is needed for more complex computations. If you only need to add a simple derived column, use withColumn, which returns a DataFrame.

    sample3 = sample.withColumn('age2', sample.age + 2)
    
    Answered on January 7, 2019.

    DataFrames, like other distributed data structures, are not iterable and can be accessed only through dedicated higher-order functions and/or SQL methods.

    To iterate locally, you can use collect, which brings all rows to the driver:

    for row in df.rdd.collect():
        do_something(row)
    

    or use toLocalIterator, which fetches the rows one partition at a time:

    for row in df.rdd.toLocalIterator():
        do_something(row)
    

    In both cases the loop itself runs on the driver, as shown above.

    Answered on January 7, 2019.

    In Python, a list comprehension can collect an entire column of values into a list in just two lines:

    df = sqlContext.sql("show tables in default")
    tableList = [x["tableName"] for x in df.rdd.collect()]
    

    The example above returns a list of the tables in the database 'default', but it can be adapted by replacing the query passed to sql().

    Or, more compactly:

    tableList = [x["tableName"] for x in sqlContext.sql("show tables in default").rdd.collect()]
    

    For the three-column example, you can build a list of dictionaries and then iterate through it in a for loop.

    sql_text = "select name, age, city from user"
    tupleList = [{"name": x["name"], "age": x["age"], "city": x["city"]}
              for x in sqlContext.sql(sql_text).rdd.collect()]
    for row in tupleList:
        print("{} is a {} year old from {}".format(
            row["name"],
            row["age"],
            row["city"]))
    
    Answered on January 7, 2019.

