Cleanest, most efficient syntax to perform DataFrame self-join in Spark

Asked on January 8, 2019 in Apache-spark.

  • 1 Answer(s)

    There are at least two ways to approach this. The first is to use aliases:

    df.as("df1").join(df.as("df2"), $"df1.foo" === $"df2.foo")
    

    Or name-based equality joins can be used:

    // Note that it will result in ambiguous column names
    // so using aliases here could be a good idea as well.
    // df.as("df1").join(df.as("df2"), Seq("foo"))
     
    df.join(df, Seq("foo"))
    

    The safest practice across all Spark versions is to rename the columns before joining: a few known bugs are related to column resolution, and some details can differ between parsers (HiveContext vs. the standard SQLContext) when raw expressions are used.
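
    As a rough sketch of the renaming approach (the foo/bar column names and the _l/_r suffixes are illustrative, not from the original answer):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("self-join").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("foo", "bar")

    // Rename every column on one side so the joined result has no ambiguous names.
    val left  = df.toDF(df.columns.map(_ + "_l"): _*)
    val right = df.toDF(df.columns.map(_ + "_r"): _*)

    val joined = left.join(right, $"foo_l" === $"foo_r")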

    That said, the best suggestion is to use aliases, because they resemble idiomatic SQL and can be used outside the scope of a specific DataFrame object.
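
    For example, once the self-join is aliased, the alias names alone are enough to disambiguate columns in later expressions, with no reference to the original DataFrame variable (the bar column here is hypothetical):

    import org.apache.spark.sql.functions.col

    val joined = df.as("df1").join(df.as("df2"), $"df1.foo" === $"df2.foo")

    // The alias prefixes resolve columns by name, so no df variable is needed here.
    joined.select(col("df1.foo"), col("df2.bar")).show()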

    Answered on January 8, 2019.

