How to join on multiple columns in Pyspark?

Asked on January 11, 2019 in Apache-spark.


  • 2 Answer(s)

    The best solution is to use the & / | operators, making sure to wrap each equality condition in parentheses, because & and | bind more tightly than ==:

    df1 = sqlContext.createDataFrame(
        [(1, "a", 2.0), (2, "b", 3.0), (3, "c", 3.0)],
        ("x1", "x2", "x3"))
     
    df2 = sqlContext.createDataFrame(
        [(1, "f", -1.0), (2, "b", 0.0)], ("x1", "x2", "x3"))
     
    # Parentheses are required: & binds more tightly than ==
    df = df1.join(df2, (df1.x1 == df2.x1) & (df1.x2 == df2.x2))
    df.show()
     
    ## +---+---+---+---+---+---+
    ## | x1| x2| x3| x1| x2| x3|
    ## +---+---+---+---+---+---+
    ## |  2|  b|3.0|  2|  b|0.0|
    ## +---+---+---+---+---+---+
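
    If chaining many & conditions gets unwieldy, the on argument of join can also take a list of Column expressions, which Spark combines with AND; the duplicated join columns coming from df2 can then be removed with drop(). A minimal sketch building on the DataFrames above:

    # Equivalent join: a list of Column conditions is ANDed together
    df = df1.join(df2, [df1.x1 == df2.x1, df1.x2 == df2.x2])

    # Remove the duplicated join columns that came from df2
    df = df.drop(df2.x1).drop(df2.x2)
    df.show()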
    
    Answered on January 11, 2019.

    Alternatively, this problem can be solved by passing a list of column names to join:

    df1 = sqlContext.createDataFrame(
        [(1, "a", 2.0), (2, "b", 3.0), (3, "c", 3.0)],
        ("x1", "x2", "x3"))
     
    df2 = sqlContext.createDataFrame(
        [(1, "f", -1.0), (2, "b", 0.0)], ("x1", "x2", "x4"))
     
    df = df1.join(df2, ['x1','x2'])
    df.show()
    

    The result will be:

    +---+---+---+---+
    | x1| x2| x3| x4|
    +---+---+---+---+
    |  2|  b|3.0|0.0|
    +---+---+---+---+
    

    In the output, the columns on which the tables are joined are not duplicated.

    This method also avoids errors such as:

    org.apache.spark.sql.AnalysisException: Reference 'x1' is ambiguous, could be: x1#50L, x1#57L.
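
    For reference, this exception is raised when an ambiguous column is selected after a Column-expression join like the one in the first answer. If both copies of a column are needed, one way to disambiguate is with DataFrame aliases; a minimal sketch, using the DataFrames from this answer:

    a = df1.alias("a")
    b = df2.alias("b")
    joined = a.join(b, (a.x1 == b.x1) & (a.x2 == b.x2))

    # Selecting "x1" directly would raise the ambiguity error;
    # qualifying with the alias avoids it:
    joined.select("a.x1", "a.x2", "a.x3", "b.x4").show()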

    When the join columns have different names in the two tables, the following syntax can be used:

    df = df1.join(df2.withColumnRenamed('y1','x1').withColumnRenamed('y2','x2'), ['x1','x2'])
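
    For example, if df2 instead had the (hypothetical) join column names y1 and y2, renaming them first lets the list syntax work unchanged:

    df2 = sqlContext.createDataFrame(
        [(1, "f", -1.0), (2, "b", 0.0)], ("y1", "y2", "x4"))

    # Rename y1/y2 to match df1, then join on the shared column names
    df = df1.join(
        df2.withColumnRenamed('y1', 'x1').withColumnRenamed('y2', 'x2'),
        ['x1', 'x2'])
    df.show()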
    
    Answered on January 11, 2019.

