How using Spark read Hive DOUBLE value stored in Avro logical format

apache-spark binary hive double avro

652 просмотра

1 ответ

96 Репутация автора

I have existing Hive data stored in Avro format. For whatever reason reading these data by executing SELECT is very slow. I didn't figure out yet why. The data is partitioned and my WHERE clause always follows the partition columns. So I decided to read the data directly by navigating to the partition path and using Spark SQLContext. This works much faster. However, the problem I have is reading the DOUBLE values. Avro stores them in a binary format. When I execute the following query in Hive:

select myDoubleValue from myTable;

I'm getting the correct expected values

841.79
4435.13
.....

but the following Spark code:

    val path="PathToMyPartition"
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.avro(path)
    df.select("myDoubleValue").rdd.map(x => x.getAs[Double](0))

gives me this exception

java.lang.ClassCastException : [B cannot be cast to java.lang.Double

What would be the right way either to provide a schema or convert the value that is stored in a binary format into a double format?

Автор: Michael D Источник Размещён: 18.07.2016 06:28

Ответы (1)


0 плюса

96 Репутация автора

I found a partial solution how to convert the Avro schema to a Spark SQL StructType. There is com.databricks.spark.avro.SchemaConverters developed by Databricks that has a bug in converting Avro logical data types in their toSqlType(avroSchema: Schema) method which was incorrectly converting the logicalType

{"name":"MyDecimalField","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":18}],"doc":"","default":null}

into

StructField("MyDecimalField",BinaryType,true)

I fixed this bug in my local version of the code and now it is converting into

StructField("MyDecimalField",DecimalType(38,18),true)

Now, the following code reads the Avro file and creates a Dataframe:

val avroSchema = new Schema.Parser().parse(QueryProvider.getQueryString(pathSchema))
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.schema(MyAvroSchemaConverter.toSqlType(avroSchema).dataType.asInstanceOf[StructType]).avro(path)

However, when I'm selecting the filed that I expect to be decimal by

df.select("MyDecimalField")

I'm getting the following exception:

scala.MatchError: [B@3e6e0d8f (of class [B)

This is where I stuck at this time and would appreciate if anyone can suggest what to do next or any other work around.

Автор: Michael D Размещён: 29.07.2016 03:05
Вопросы из категории :
32x32