Cleanest, most efficient syntax to perform DataFrame self-join in Spark
There are at least two ways to approach this. First, using aliases:
df.as("df1").join(df.as("df2"), $"df1.foo" === $"df2.foo")
Alternatively, a name-based equality join can be used:
// Note that it will result in ambiguous column names,
// so using aliases here could be a good idea as well:
// df.as("df1").join(df.as("df2"), Seq("foo"))
df.join(df, Seq("foo"))
The safest practice across all versions is to rename the columns on one side before joining. A few known bugs are related to column resolution, and some details may differ between parsers (HiveContext vs. the standard SQLContext) when raw expressions are used.
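A minimal sketch of the renaming approach (the sample data, column names `foo`/`bar`, and the `_r` suffix are illustrative assumptions, not from the original):

```scala
import org.apache.spark.sql.SparkSession

object SelfJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("self-join-rename")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (1, "c")).toDF("foo", "bar")

    // Rename every column on one side so the joined result
    // contains no ambiguous names to resolve.
    val renamed = df
      .withColumnRenamed("foo", "foo_r")
      .withColumnRenamed("bar", "bar_r")

    // Column references are now unambiguous on both sides.
    val joined = df.join(renamed, df("foo") === renamed("foo_r"))
    joined.show()

    spark.stop()
  }
}
```

Because the two sides share no column names, `joined.select("bar_r")` and similar references resolve without any alias bookkeeping, which is why this pattern is robust across Spark versions.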
That said, the best general suggestion is to use aliases: they resemble idiomatic SQL, and they scope column references to a specific DataFrame object rather than relying on global name resolution.