Filtering DataFrame using the length of a column

Filtering DataFrame using the length of a column

Add Comment


  • 2 Answer(s)

    Here  size function can be used for the version Spark >= 1.5:

    from pyspark.sql.functions import col, size
     
    df = sqlContext.createDataFrame([
        (["L", "S", "Y", "S"], ),
        (["L", "V", "I", "S"], ),
        (["I", "A", "N", "A"], ),
        (["I", "L", "S", "A"], ),
        (["E", "N", "N", "Y"], ),
        (["E", "I", "M", "A"], ),
        (["O", "A", "N", "A"], ),
        (["S", "U", "S"], )],
        ("tokens", ))
     
    df.where(size(col("tokens")) <= 3).show()
     
    ## +---------+
    ## | tokens  |
    ## +---------+
    ## |[S, U, S]|
    ## +---------+
    

    For the version Spark < 1.5 an UDF can be used:

    from pyspark.sql.types import IntegerType
    from pyspark.sql.functions import udf
     
    size_ = udf(lambda xs: len(xs), IntegerType())
     
    df.where(size_(col("tokens")) <= 3).show()
     
    ## +---------+
    ## | tokens  |
    ## +---------+
    ## |[S, U, S]|
    ## +---------+
    

    The size UDF with raw SQL will work with any version when HiveContext  is used:

    df.registerTempTable("df")
    sqlContext.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()
     
    ## +--------------------+
    ## | tokens             |
    ## +--------------------+
    ## |ArrayBuffer(S, U, S)|
    ## +--------------------+
    

    Here use an udf defined above or length function for string columns:

    from pyspark.sql.functions import length
     
    df = sqlContext.createDataFrame([("fooo", ), ("bar", )], ("k", ))
    df.where(length(col("k")) <= 3).show()
     
    ## +---+
    ## | k |
    ## +---+
    ## |bar|
    ## +---+
    
    Answered on January 3, 2019.
    Add Comment

    This is the alternative instance for String in scala:

    val stringData = Seq(("Maheswara"), ("Mokshith"))
    val df = sc.parallelize(stringData).toDF
    df.where((length($"value")) <= 8).show
    +--------+
    | value  |
    +--------+
    |Mokshith|
    +--------+
    df.withColumn("length", length($"value")).show
    +---------+------+
    | value   |length|
    +---------+------+
    |Maheswara| 9    |
    | Mokshith| 8    |
    +---------+------+
    
    Answered on January 3, 2019.
    Add Comment


  • Your Answer

    By posting your answer, you agree to the privacy policy and terms of service.