Skip to content

Support for converting case class to Row #2

@jonas

Description

@jonas

Motivation

The library currently offer the possibility of reading a Spark Row and convert it to a case class (or custom structure - but optimized for case classes). What's missing is going the other way around, from the case class to the Row, to be able to serialize case classes and load them back. A collateral product of this is likely be to produce a Spark StructType for a case class.

When we have bidirectional conversion Row <==> case class, then we can use Parquet to take case of the Row <==> Array[Byte] part to get full serialization of case classes and get Spark's SQL support on that data.

Input

Output

  • an assessment of what the above failure examples mean.
    • is parquet serialization lossy? if so, can it be fixed? is it recoverable?
    • if parquet serialization to binary isn't lossy, then working code for converting case classes to Row with some of the following cases covered
      • custom field types with private/non-public constructor
      • nested field types
      • case classes with non-case class fields in them. a good example is joda DataTime
        • it's possible that the above may not be doable, or feasible, but in that case documentation regarding how to tackle such scenarios is to be added (eg. is the library user expected to define their own serializer/deserializers to the subset support by us, and if so what is the subset of cases our library would cover?)

Test

  • CI (with unit tests covering the above) passes

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions