How to avoid duplicate columns after a join?



  • 4 Answers

    The issue can be solved by expressing the join columns as a sequence of strings (or a single string) rather than as a predicate.

    Below is an example with two join columns.

    The left DataFrame:

    import spark.implicits._   // needed for toDF (imported automatically in spark-shell)
     
    val llist = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23", 10))
     
    val left = llist.toDF("firstname","lastname","date","duration")
     
    left.show()
     
    /*
    +---------+--------+----------+--------+
    |firstname|lastname| date     |duration|
    +---------+--------+----------+--------+
    | bob     | b      |2015-01-13| 4      |
    | alice   | a      |2015-04-23| 10     |
    +---------+--------+----------+--------+
    */
    

    The right DataFrame:

    val right = Seq(("alice", "a", 100),("bob", "b", 23)).toDF("firstname","lastname","upload")
     
    right.show()
     
    /*
    +---------+--------+------+
    |firstname|lastname|upload|
    +---------+--------+------+
    | alice   | a      | 100  |
    | bob     | b      | 23   |
    +---------+--------+------+
    */
    

    Defining the join columns as the predicate left("firstname") === right("firstname") && left("lastname") === right("lastname") does not solve the problem.

    The firstname and lastname columns are duplicated in the joined DataFrame, which is the incorrect result:

    left.join(right, left("firstname")===right("firstname") &&
                 left("lastname")===right("lastname")).show
     
    /*
    +---------+--------+----------+--------+---------+--------+------+
    |firstname|lastname| date     |duration|firstname|lastname|upload|
    +---------+--------+----------+--------+---------+--------+------+
    | bob     | b      |2015-01-13| 4      | bob     | b      | 23   |
    | alice   | a      |2015-04-23| 10     | alice   | a      | 100  |
    +---------+--------+----------+--------+---------+--------+------+
    */
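
    A side note on why the duplication matters: once both firstname columns are present, referring to the column by bare name is ambiguous and fails at analysis time. A minimal sketch with the same DataFrames:

    // Throws org.apache.spark.sql.AnalysisException: Reference 'firstname' is ambiguous
    left.join(right, left("firstname") === right("firstname") &&
                 left("lastname") === right("lastname"))
      .select("firstname")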
    

    The correct approach is to define the join columns as a sequence of strings, Seq("firstname", "lastname"). The resulting DataFrame does not contain duplicate columns:

    left.join(right, Seq("firstname", "lastname")).show
     
    /*
    +---------+--------+----------+--------+------+
    |firstname|lastname| date     |duration|upload|
    +---------+--------+----------+--------+------+
    | bob     | b      |2015-01-13| 4      | 23   |
    | alice   | a      |2015-04-23| 10     | 100  |
    +---------+--------+----------+--------+------+
    */
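
    If the predicate form is still needed (for example, for a non-equi join condition), an alternative sketch is to drop the duplicated columns from one side after the join; Dataset.drop(col: Column) has been available since Spark 2.0:

    left.join(right, left("firstname") === right("firstname") &&
                 left("lastname") === right("lastname"))
      .drop(right("firstname"))
      .drop(right("lastname"))
      .show
    // same five-column result as the Seq-based join above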
    
    Answered on January 7, 2019.

    The DataFrame.join method with a join expression behaves like the SQL join below:

    SELECT * FROM a JOIN b ON joinExprs
    

    If you want to avoid the duplicate columns, simply drop them or select only the columns of interest afterwards. To disambiguate them, refer to the columns through the parent DataFrames:

    val a: DataFrame = ???
    val b: DataFrame = ???
    val joinExprs: Column = ???
     
    a.join(b, joinExprs).select(a("id"), b("foo"))
    // drop equivalent
    a.alias("a").join(b.alias("b"), joinExprs).drop(b("id")).drop(a("foo"))
    

    Aliases can be used as well:

    // As for now aliases don't work with drop
    a.alias("a").join(b.alias("b"), joinExprs).select($"a.id", $"b.foo")
    

    There also exists a special shortcut syntax for equi-joins, which takes either a sequence of strings:

    val usingColumns: Seq[String] = ???
     
    a.join(b, usingColumns)
    

    or a single string:

    val usingColumn: String = ???
     
    a.join(b, usingColumn)
    

    With this syntax, only one copy of the join columns is kept in the result.
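
    One caveat worth noting: with the string-based syntax, an outer join keeps a single join-key column coalesced from both sides, rather than one nullable key per side. A minimal sketch using the placeholders above:

    // "full_outer" yields one copy of each column in usingColumns,
    // coalesced from the left and right keys
    a.join(b, usingColumns, "full_outer")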

    Answered on January 7, 2019.

    Here is another solution, which is quite straightforward.

    Say a is:

    scala> val a = Seq(("a", 1), ("b", 2)).toDF("key", "vala")
    a: org.apache.spark.sql.DataFrame = [key: string, vala: int]
     
    scala> a.show
    +---+----+
    |key|vala|
    +---+----+
    |  a|   1|
    |  b|   2|
    +---+----+

    and b is:

    scala> val b = Seq(("a", 1)).toDF("key", "valb")
    b: org.apache.spark.sql.DataFrame = [key: string, valb: int]
     
    scala> b.show
    +---+----+
    |key|valb|
    +---+----+
    |  a|   1|
    +---+----+
    

    To select only the columns from DataFrame a, this can be done as follows:

    scala> a.join(b, a("key") === b("key"), "left").select(a.columns.map(a(_)) : _*).show
    +---+----+
    |key|vala|
    +---+----+
    |  a|   1|
    |  b|   2|
    +---+----+
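
    An equivalent and arguably more readable variant (a sketch, assuming spark.implicits._ is in scope, as it is in spark-shell) aliases the DataFrames and selects every column from one side with "a.*":

    scala> a.alias("a").join(b.alias("b"), $"a.key" === $"b.key", "left").select("a.*").show

    This produces the same key/vala result as the select above.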
    
    Answered on January 7, 2019.

    The same problem exists in plain SQL; avoid SELECT * and list (and alias) the needed columns explicitly, so that no duplicate names survive the join:

    SELECT
        dealing_record.Name AS dealing_name,
        shares.ID AS share_id,
        shares.Name AS share_name,
        transaction_type.Name AS transaction_type_name,
        transaction_type.ID AS transaction_type_id
    FROM dealing_record
    INNER JOIN shares ON shares.share_id = dealing_record.share_id
    INNER JOIN transaction_type ON transaction_type.transaction_type_id = dealing_record.transaction_type_id;
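
    The same idea expressed with the DataFrame API (a sketch; dealing_record, shares, and transaction_type are assumed to be already-loaded DataFrames with the columns used below):

    dealing_record
      .join(shares, shares("share_id") === dealing_record("share_id"))
      .join(transaction_type,
            transaction_type("transaction_type_id") === dealing_record("transaction_type_id"))
      .select(
        dealing_record("Name").alias("dealing_name"),
        shares("ID").alias("share_id"),
        shares("Name").alias("share_name"),
        transaction_type("Name").alias("transaction_type_name"),
        transaction_type("ID").alias("transaction_type_id"))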

    Answered on March 5, 2019.

