Getting Spark, Python, and MongoDB to work together


Asked on December 10, 2018 in Apache-spark.


  • 2 Answer(s)

    There are two different ways to connect to MongoDB from Spark: the official MongoDB connector (mongodb/mongo-spark) and Stratio/Spark-MongoDB.

    While the former seems relatively immature, the latter looks like a much better choice than the Mongo-Hadoop connector and provides a Spark SQL API.

    # Adjust Scala and package version according to your setup
    # although officially 0.11 supports only Spark 1.5
    # I haven't encountered any issues on 1.6.1
    bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.11:0.11.0
    
    df = (sqlContext.read
      .format("com.stratio.datasource.mongodb")
      .options(host="mongo:27017", database="foo", collection="bar")
      .load())
    df.show()
    ## +---+----+--------------------+
    ## |  x|   y|                 _id|
    ## +---+----+--------------------+
    ## |1.0|-1.0|56fbe6f6e4120712c...|
    ## |0.0| 4.0|56fbe701e4120712c...|
    ## +---+----+--------------------+
    

    The Stratio connector seems to be much more stable than mongo-hadoop-spark, supports predicate pushdown without static configuration, and simply works.
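    As a quick illustration of the pushdown point, a filter on the DataFrame loaded above should be translated into a MongoDB query rather than a full collection scan followed by Spark-side filtering. This is only a sketch; exactly which predicates get pushed down depends on the connector version:

    # Using the df loaded through com.stratio.datasource.mongodb above.
    # The filter below is a candidate for predicate pushdown, i.e. it can be
    # sent to MongoDB as a query instead of being evaluated in Spark.
    df.filter(df.y > 0).select("x", "y").show()
    # Given the two rows shown above, only the (0.0, 4.0) row is expected.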

    Since there are quite a few moving parts here, I tried to make it a little bit more manageable by building a simple Docker image which roughly matches the described configuration (I've omitted the Hadoop libraries for brevity, though).

    You can find the complete source on GitHub (DOI 10.5281/zenodo.47882) and build it from scratch:

    git clone https://github.com/zero323/docker-mongo-spark.git
    cd docker-mongo-spark
    docker build -t zero323/mongo-spark .
    

    or just download the image pushed to Docker Hub, so you can simply pull it:
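    docker pull zero323/mongo-spark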

    Start images:

    docker run -d --name mongo mongo:2.6
    docker run -i -t --link mongo:mongo zero323/mongo-spark /bin/bash
    

    Start the PySpark shell passing --jars and --driver-class-path:

    pyspark --jars ${JARS} --driver-class-path ${SPARK_DRIVER_EXTRA_CLASSPATH}
    
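    If you are not using the Docker image, the same invocation can be spelled out by pointing --jars and --driver-class-path at the mongo-hadoop artifacts yourself. The jar names, versions, and paths below are assumptions for illustration only, not values taken from the image:

    # Hypothetical locations of the mongo-hadoop-core, mongo-hadoop-spark and
    # mongo-java-driver jars - adjust names, versions and paths to your setup.
    JARS=/jars/mongo-hadoop-core-1.5.2.jar,/jars/mongo-hadoop-spark-1.5.2.jar,/jars/mongo-java-driver-3.2.2.jar
    # --jars takes a comma-separated list; the driver class path is colon-separated.
    SPARK_DRIVER_EXTRA_CLASSPATH=$(echo ${JARS} | tr ',' ':')

    # pymongo and pymongo_spark also have to be importable on the driver
    # (e.g. installed with pip from the mongo-hadoop sources).
    pyspark --jars ${JARS} --driver-class-path ${SPARK_DRIVER_EXTRA_CLASSPATH}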

    And finally see how it works:

    import pymongo
    import pymongo_spark
     
    mongo_url = 'mongodb://mongo:27017/'
     
    client = pymongo.MongoClient(mongo_url)
    client.foo.bar.insert_many([
        {"x": 1.0, "y": -1.0}, {"x": 0.0, "y": 4.0}])
    client.close()
     
    pymongo_spark.activate()
    rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
        .map(lambda doc: (doc.get('x'), doc.get('y'))))
    rdd.collect()
     
    ## [(1.0, -1.0), (0.0, 4.0)]
    

    Note that mongo-hadoop seems to close the connection after the first action, so calling for instance rdd.count() after the collect will throw an exception.
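    A minimal sketch of that caveat, plus a possible workaround. Persisting the RDD before the first action is my assumption and not part of the original answer; it only helps as long as the cached partitions don't have to be recomputed:

    # After the first action mongo-hadoop appears to close the connection,
    # so a second action that has to re-read from MongoDB can fail.
    rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
        .map(lambda doc: (doc.get('x'), doc.get('y'))))

    # Assumed workaround: persist before the first action so later actions
    # reuse the cached partitions instead of going back to MongoDB
    # (use MEMORY_AND_DISK if the data may not fit in memory).
    rdd.persist()
    rdd.collect()   # materializes and caches the partitions
    rdd.count()     # served from the cache; no new MongoDB read is needed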

     

    Answered on December 10, 2018.

    Here the --packages option can be used instead of --jars … in the spark-submit command:

    spark-submit --packages org.mongodb.mongo-hadoop:mongo-hadoop-core:1.3.1,org.mongodb:mongo-java-driver:3.1.0 [REST OF YOUR OPTIONS]
    

    Note that some of these jar files are not uber jars and need more dependencies to be downloaded before this can work.

    Answered on December 10, 2018.

