How to read from hbase using spark

Asked on November 24, 2018 in Apache-spark.

  • 2 Answer(s)

    This Scala code reads HBase data using Spark's newAPIHadoopRDD:

    import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark._

    object HBaseRead {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local[2]")
        val sc = new SparkContext(sparkConf)
        val conf = HBaseConfiguration.create()
        val tableName = "table1"
        System.setProperty("HADOOP_USER_NAME", "hdfs")
        conf.set("hbase.master", "localhost:60000")
        conf.setInt("timeout", 120000)
        conf.set("hbase.zookeeper.quorum", "localhost")
        conf.set("zookeeper.znode.parent", "/hbase-unsecure")
        conf.set(TableInputFormat.INPUT_TABLE, tableName)

        // Create the table if it does not exist yet
        val admin = new HBaseAdmin(conf)
        if (!admin.isTableAvailable(tableName)) {
          val tableDesc = new HTableDescriptor(TableName.valueOf(tableName))
          admin.createTable(tableDesc)
        }

        val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
          classOf[ImmutableBytesWritable], classOf[Result])
        println("Number of Records found : " + hBaseRDD.count())
        sc.stop()
      }
    }


    The Spark-HBase Connector (it.nerdammer.spark.hbase) can be used with Spark 1.0.x and later.

    The Maven dependency to include:

      <version>1.0.3</version> <!-- change to match your Spark version; this example uses Spark 1.6.x -->
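    Only the version element was given above. For reference, a full dependency declaration would look roughly like the following; the groupId and artifactId here are an assumption inferred from the `it.nerdammer.spark.hbase` import below, so verify them against your artifact repository:

```xml
<!-- Assumed coordinates for the Spark-HBase Connector; verify before use -->
<dependency>
    <groupId>it.nerdammer.bigdata</groupId>
    <artifactId>spark-hbase-connector_2.10</artifactId>
    <version>1.0.3</version>
</dependency>
```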

    Sample code using the connector:

    import org.apache.spark._
    import it.nerdammer.spark.hbase._

    object HBaseRead extends App {
      val sparkConf = new SparkConf().setAppName("Spark-HBase").setMaster("local[4]")
      sparkConf.set("spark.hbase.host", "<YourHostname>") // e.g. localhost or your HBase host
      val sc = new SparkContext(sparkConf)

      // Example: an HBase table 'Document' with column family 'SMPL'
      // and qualifiers 'DocID' and 'Title':
      val docRdd = sc.hbaseTable[(Option[String], Option[String])]("Document")
        .select("DocID", "Title")
        .inColumnFamily("SMPL")
      println("Number of Records found : " + docRdd.count())
    }


    The SHC connector (Hortonworks Spark-HBase Connector) can be used with Spark 1.6.x and later:

      <version>1.0.0-2.0-s_2.11</version> <!-- version depends on your Spark version; supported up to Spark 2.x -->
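    Again, only the version element was given. A full declaration would presumably look like the following; the groupId and artifactId are assumptions based on the Hortonworks SHC project, and SHC artifacts are typically published to the Hortonworks repository rather than Maven Central, so check both before use:

```xml
<!-- Assumed coordinates for the Hortonworks SHC connector; verify the repository and version -->
<dependency>
    <groupId>com.hortonworks</groupId>
    <artifactId>shc-core</artifactId>
    <version>1.0.0-2.0-s_2.11</version>
</dependency>
```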


    Answered on November 24, 2018.

    Here the suggestion is to read from HBase and do the JSON manipulation entirely in Spark.
    Spark provides the JavaSparkContext.newAPIHadoopRDD method to read data from Hadoop-compatible storage, including HBase. You pass it the HBase configuration (with the table name and scan set on it), the table input format class, and the key and value classes.

    The TableInputFormat class reads the table name and scan configuration from the job parameters.

    For instance:

    conf.set(TableInputFormat.INPUT_TABLE, "tablename");
    JavaPairRDD<ImmutableBytesWritable, Result> data =
        jsc.newAPIHadoopRDD(conf, TableInputFormat.class,
            ImmutableBytesWritable.class, Result.class);

    Then the JSON manipulation can be done in Spark. Since Spark evaluates lazily and can recompute partitions when memory fills up, it only loads the data needed for each computation step (correct me if I'm wrong), so the data size should not be a concern.

    Answered on November 24, 2018.
