How to split Vector into columns – using PySpark

Asked on January 3, 2019 in Apache-spark.


  • 1 Answer

    One approach is to convert the DataFrame to an RDD and back:

    from pyspark.ml.linalg import Vectors
     
    df = sc.parallelize([
        ("assert", Vectors.dense([1, 2, 3])),
        ("require", Vectors.sparse(3, {1: 2}))
    ]).toDF(["word", "vector"])
     
    def extract(row):
        return (row.word, ) + tuple(row.vector.toArray().tolist())
     
    df.rdd.map(extract).toDF(["word"]) # Vector values will be named _2, _3, ...
     
    ## +-------+---+---+---+
    ## | word  | _2| _3| _4|
    ## +-------+---+---+---+
    ## | assert|1.0|2.0|3.0|
    ## |require|0.0|2.0|0.0|
    ## +-------+---+---+---+
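    The per-row conversion done by `extract` can be checked without a Spark session. The sketch below mimics `toArray().tolist()` for the two example vectors; `sparse_to_list` is a hypothetical stand-in (not part of PySpark) that expands a `Vectors.sparse(size, {index: value})`-style spec, filling missing indices with 0.0:

    ```python
    # Stand-in for Vectors.sparse(size, entries).toArray().tolist():
    # indices absent from `entries` become 0.0 (hypothetical helper).
    def sparse_to_list(size, entries):
        return [float(entries.get(i, 0.0)) for i in range(size)]

    # Mirrors the extract() above: (row.word,) + tuple of float values.
    def extract(word, values):
        return (word,) + tuple(float(v) for v in values)

    dense_row = extract("assert", [1, 2, 3])
    sparse_row = extract("require", sparse_to_list(3, {1: 2}))

    print(dense_row)   # ('assert', 1.0, 2.0, 3.0)
    print(sparse_row)  # ('require', 0.0, 2.0, 0.0)
    ```

    Note that the sparse vector's stored entry at index 1 surfaces as 2.0 and every other position becomes 0.0, which is exactly what the `require` row shows in the output above.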
    

    An alternative solution is to create a UDF:

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import ArrayType, DoubleType
     
    def to_array(vector_col):
        # renamed parameter to avoid shadowing pyspark.sql.functions.col
        def to_array_(v):
            return v.toArray().tolist()
        return udf(to_array_, ArrayType(DoubleType()))(vector_col)
     
    (df
        .withColumn("xs", to_array(col("vector")))
        .select(["word"] + [col("xs")[i] for i in range(3)]))
    ## +-------+-----+-----+-----+
    ## | word  |xs[0]|xs[1]|xs[2]|
    ## +-------+-----+-----+-----+
    ## | assert| 1.0 | 2.0 | 3.0 |
    ## |require| 0.0 | 2.0 | 0.0 |
    ## +-------+-----+-----+-----+
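    The UDF route first materializes the whole vector as an array column, then picks out elements by index, so the number of output columns (`range(3)` above) must be known up front. The plain-Python sketch below shows that indexing step on already-converted lists; the names `rows` and `n` are illustrative only:

    ```python
    # Illustrative: after to_array(), each row is (word, xs) with xs a list
    # of doubles; selecting col("xs")[i] for i in range(n) corresponds to
    # plain list indexing with a fixed, known length n.
    rows = [
        ("assert", [1.0, 2.0, 3.0]),
        ("require", [0.0, 2.0, 0.0]),
    ]

    n = 3  # must be known in advance, like range(3) in the Spark version
    split = [(word,) + tuple(xs[i] for i in range(n)) for word, xs in rows]

    print(split)
    # [('assert', 1.0, 2.0, 3.0), ('require', 0.0, 2.0, 0.0)]
    ```

    On Spark 3.0 and later, `pyspark.ml.functions.vector_to_array` provides a built-in replacement for the hand-written UDF, returning the same array-of-doubles column.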
    
    Answered on January 3, 2019.
