Motivation
The library currently offers the possibility of reading a Spark Row and converting it to a case class (or a custom structure, though it is optimized for case classes). What's missing is the other direction, from the case class to the Row, so that case classes can be serialized and loaded back. A likely by-product of this is the ability to produce a Spark StructType for a case class.
Once we have bidirectional conversion Row <==> case class, we can use Parquet to take care of the Row <==> Array[Byte] part, which gives us full serialization of case classes plus Spark's SQL support on that data.
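To make the missing direction concrete, here is a minimal sketch of case class to Row conversion, together with deriving a StructType. It is not this library's API: `Person`, `Address`, `toRow` and `toCell` are hypothetical names invented for illustration, and the sketch relies only on Spark's own `Row.fromSeq` and `Encoders.product`. It assumes fields are primitives, Strings, Options, Seqs, or other case classes; anything else is passed through unchanged.

```scala
import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.types.StructType

// Hypothetical example types, not part of the library.
case class Address(city: String, zip: String)
case class Person(name: String, age: Int, address: Address)

object CaseClassToRowSketch {
  // Turn a case class (any Product) into a Row, field by field.
  def toRow(p: Product): Row = Row.fromSeq(p.productIterator.map(toCell).toSeq)

  // Convert a single field value into something a Row can hold.
  // Option and Seq are matched before Product because Some and :: are
  // themselves Products and would otherwise be turned into Rows.
  private def toCell(value: Any): Any = value match {
    case Some(inner)     => toCell(inner)
    case None            => null
    case seq: Seq[_]     => seq.map(toCell)
    case nested: Product => toRow(nested) // nested case class becomes a nested Row
    case other           => other         // assumption: already Row-compatible
  }

  def main(args: Array[String]): Unit = {
    val person = Person("Ada", 36, Address("London", "N1"))
    val row: Row = toRow(person)

    // Spark can derive the schema of a case class without a SparkSession.
    val schema: StructType = Encoders.product[Person].schema

    println(row)
    println(schema.treeString)
  }
}
```

The sketch deliberately ignores the harder cases this issue asks about (private constructors, non-case-class fields), since those are exactly where a generic `productIterator` walk stops being sufficient.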
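Input
- failure examples from the parquet <-> binary serialization layer, which would have a direct impact on the motivation for this issue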
Output
- an assessment of what the above failure examples mean.
- is Parquet serialization lossy? If so, can it be fixed? Is it recoverable?
- if Parquet serialization to binary isn't lossy, then working code for converting case classes to Row, with some of the following cases covered:
  - custom field types with a private/non-public constructor
  - nested field types
  - case classes with non-case-class fields in them (a good example is Joda DateTime)
- it's possible that the above is not doable or feasible; in that case, documentation on how to tackle such scenarios is to be added (e.g. is the library user expected to define their own serializers/deserializers to the subset supported by us, and if so, what is the subset of cases our library would cover?)
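
As an illustration of the "user-defined serializer/deserializer" fallback mentioned in the last point, the following is a rough sketch of what such a user-supplied mapping for Joda DateTime could look like. `FieldCodec` and `JodaDateTimeCodec` are hypothetical names, not something the library provides, and decoding everything as UTC is an assumption made here for determinism. It only uses Joda-Time's `getMillis` and the `DateTime(long, DateTimeZone)` constructor, plus `java.sql.Timestamp`, which Spark maps to its TimestampType.

```scala
import java.sql.Timestamp
import org.joda.time.{DateTime, DateTimeZone}

// Hypothetical user-supplied codec: maps a custom field type A onto a type B
// that the Row/Parquet layer already understands.
trait FieldCodec[A, B] {
  def encode(a: A): B
  def decode(b: B): A
}

object JodaDateTimeCodec extends FieldCodec[DateTime, Timestamp] {
  // Only the instant survives; the original time zone is dropped.
  def encode(a: DateTime): Timestamp = new Timestamp(a.getMillis)

  // Assumption: always decode as UTC so the result does not depend on the JVM default zone.
  def decode(b: Timestamp): DateTime = new DateTime(b.getTime, DateTimeZone.UTC)
}

object JodaDateTimeCodecDemo {
  def main(args: Array[String]): Unit = {
    val original = new DateTime(2020, 1, 2, 3, 4, 5, DateTimeZone.forID("Europe/Paris"))
    val restored = JodaDateTimeCodec.decode(JodaDateTimeCodec.encode(original))

    // Same instant, but the zone information has been replaced by UTC.
    println(s"original: $original, restored: $restored")
    println(s"same instant: ${original.getMillis == restored.getMillis}")
  }
}
```

This kind of mapping is itself lossy with respect to the time zone, which is worth keeping separate from the question of whether the Parquet layer loses information.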
Test
- CI (with unit tests covering the above) passes