Spark specify multiple column conditions for dataframe join

Asked on November 23, 2018 in Apache-spark.


  • 3 Answer(s)

    To join on several column conditions at once, use the Spark Column/expression API:

    Leaddetails.join(
        Utm_Master,
        Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
            && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
            && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
            && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
        "left"
    )
    

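    The <=> operator is an “equality test that is safe for null values”.

    The main difference from the plain equality test (===) is that <=> is safe to use when either column may contain null values. With ===, a comparison involving null evaluates to null rather than true, so rows with a null join key never match; with <=>, two nulls compare as equal.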
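
    To make the difference concrete, here is a minimal plain-Python sketch of the two comparison semantics applied to a naive left join. It is a model of the behavior, not Spark code: None stands in for null, and the helper names (sql_equals, null_safe_equals, left_join) and the sample rows are hypothetical.

```python
from typing import Optional


def sql_equals(a, b) -> Optional[bool]:
    """Model of === (SQL '='): the result is unknown (None) if either side is null."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_equals(a, b) -> bool:
    """Model of <=>: never unknown; two nulls are equal, null vs. a value is not."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def left_join(left, right, keys, eq):
    """Naive left join: a pair of rows matches only if eq is True for every key."""
    joined = []
    for left_row in left:
        matches = [r for r in right
                   if all(eq(left_row[k], r[k]) is True for k in keys)]
        if matches:
            joined.extend((left_row, r) for r in matches)
        else:
            joined.append((left_row, None))  # unmatched left rows are kept
    return joined


# Hypothetical sample rows; None stands in for a null column value.
leads = [
    {"LeadSource": "web", "Utm_Source": None, "lead_id": 1},
    {"LeadSource": "ad", "Utm_Source": "google", "lead_id": 2},
]
utm_master = [
    {"LeadSource": "web", "Utm_Source": None, "tag": "A"},
    {"LeadSource": "ad", "Utm_Source": "google", "tag": "B"},
]
keys = ["LeadSource", "Utm_Source"]


def matched_tags(eq):
    """Return (lead_id, matched tag or None) for each joined row."""
    return [(lead["lead_id"], utm["tag"] if utm else None)
            for lead, utm in left_join(leads, utm_master, keys, eq)]


plain = matched_tags(sql_equals)
null_safe = matched_tags(null_safe_equals)
print(plain)      # lead 1 has a null key, so it finds no match
print(null_safe)  # lead 1 now matches the row whose key is also null
```

    In this sketch, lead 1 is left unmatched under the plain comparison and is paired with tag "A" under the null-safe one, which is the reason to prefer <=> when join keys may be null.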

    Answered on November 23, 2018.

    In Spark 1.5.0 and later, you can join on multiple DataFrame columns by passing a list of column names. Refer to SPARK-7990: Add methods to facilitate equi-join on multiple join keys.

    For Python

    Leads.join(
        Utm_Master,
        ["LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"],
        "left_outer"
    )
    

    For Scala

    Leads.join(
        Utm_Master,
        Seq("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
        "left_outer"
    )
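
    Joining on a list of column names behaves like SQL's USING clause: each join column appears once in the result, whereas joining on an explicit expression keeps both copies. The sketch below illustrates that difference with Python's built-in sqlite3 module rather than Spark, so it runs without a cluster. The table contents and the lead_id and tag columns are hypothetical, and Spark's exact column order may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE leads (LeadSource TEXT, Utm_Source TEXT, lead_id INTEGER);
    CREATE TABLE utm_master (LeadSource TEXT, Utm_Source TEXT, tag TEXT);
    INSERT INTO leads VALUES ('web', 'google', 1), ('ad', 'bing', 2);
    INSERT INTO utm_master VALUES ('web', 'google', 'A');
""")

# USING: the analogue of passing a list of column names to join().
using_cur = conn.execute(
    "SELECT * FROM leads LEFT JOIN utm_master"
    " USING (LeadSource, Utm_Source) ORDER BY lead_id")
using_cols = [d[0] for d in using_cur.description]
using_rows = using_cur.fetchall()

# ON: the analogue of joining on an explicit column expression.
on_cur = conn.execute(
    "SELECT * FROM leads LEFT JOIN utm_master"
    " ON leads.LeadSource = utm_master.LeadSource"
    " AND leads.Utm_Source = utm_master.Utm_Source ORDER BY lead_id")
on_cols = [d[0] for d in on_cur.description]

print(using_cols)    # join keys listed once
print(len(on_cols))  # both tables' key columns are kept
print(using_rows)
```

    With the expression form you usually have to drop or rename the duplicated key columns afterwards; the name-list form avoids that step.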
    
    Answered on November 23, 2018.

    Another approach is to use raw SQL:

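    // Two small example tables whose join keys have different column names.
    case class Bar(x1: Int, y1: Int, z1: Int, v1: String)
    case class Foo(x2: Int, y2: Int, z2: Int, v2: String)

    val bar = sqlContext.createDataFrame(sc.parallelize(
      Bar(1, 1, 2, "bar") :: Bar(2, 3, 2, "bar") ::
      Bar(3, 1, 2, "bar") :: Nil))

    val foo = sqlContext.createDataFrame(sc.parallelize(
      Foo(1, 1, 2, "foo") :: Foo(2, 1, 2, "foo") ::
      Foo(3, 1, 2, "foo") :: Foo(4, 4, 4, "foo") :: Nil))

    // Register both DataFrames as temporary tables so SQL can refer to them by name.
    // (Spark 2.0+ deprecates registerTempTable in favor of createOrReplaceTempView.)
    foo.registerTempTable("foo")
    bar.registerTempTable("bar")

    // Left join on all three key columns at once.
    sqlContext.sql(
      "SELECT * FROM foo LEFT JOIN bar ON x1 = x2 AND y1 = y2 AND z1 = z2")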
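
    To see which rows this query pairs up, the sketch below runs the same join condition on the same sample data using Python's built-in sqlite3 module instead of Spark. That is a stand-in chosen only so the example runs anywhere; the ORDER BY clause is an addition to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bar (x1 INTEGER, y1 INTEGER, z1 INTEGER, v1 TEXT);
    CREATE TABLE foo (x2 INTEGER, y2 INTEGER, z2 INTEGER, v2 TEXT);
    INSERT INTO bar VALUES (1, 1, 2, 'bar'), (2, 3, 2, 'bar'), (3, 1, 2, 'bar');
    INSERT INTO foo VALUES (1, 1, 2, 'foo'), (2, 1, 2, 'foo'),
                           (3, 1, 2, 'foo'), (4, 4, 4, 'foo');
""")

# Same join condition as the Spark SQL query; ORDER BY added for stable output.
rows = conn.execute(
    "SELECT * FROM foo LEFT JOIN bar ON x1 = x2 AND y1 = y2 AND z1 = z2 "
    "ORDER BY x2"
).fetchall()

for row in rows:
    print(row)
```

    Every foo row is kept. The rows with x2 = 1 and x2 = 3 find a bar row that agrees on all three keys, while the rows with x2 = 2 and x2 = 4 get nulls (None here) in the bar columns because at least one key differs.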
    
    Answered on November 23, 2018.

