Handle Schema Changes and Evolution in Hadoop
Approaches for handling schema evolution in Hadoop using Avro and ORC file formats, including a practical workflow for managing schema changes with Hive.
In Hadoop, Hive lets different partitions of a table carry different schemas, but only up to a point: you cannot insert a field in the middle of the schema.
If fields are added at the end, Hive handles the change natively.
However, things break if a field is inserted in the middle, because older partitions are read positionally against the new column list.
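A toy Python sketch can illustrate why: text-backed Hive tables map columns to values by position, not by name. The column names and values below are made up for illustration.

```python
# Toy illustration (not Hive code): rows written under an older schema
# are reinterpreted positionally under the newer column list.

old_row = ["1001", "alice", "2015-01-02"]   # schema v1: id, name, created

# Appending a column at the end is safe: the missing trailing value
# simply has no entry, which Hive would surface as NULL.
schema_appended = ["id", "name", "created", "email"]
rec = dict(zip(schema_appended, old_row))
assert rec == {"id": "1001", "name": "alice", "created": "2015-01-02"}

# Inserting "email" in the middle shifts every later column: old
# "created" values now land under "email", silently corrupting reads.
schema_inserted = ["id", "name", "email", "created"]
bad = dict(zip(schema_inserted, old_row))
assert bad["email"] == "2015-01-02"   # old "created" value misread as email
```

The failure is silent, which is what makes mid-schema inserts so dangerous for existing partitions.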
There are a few ways to handle schema evolution and changes in Hadoop.
Use Avro
For flat schemas of database tables (or files), generate an Avro schema. This schema can then be used directly in application code or mapped to a Hive table using AvroSerde.
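A minimal sketch of such generation, assuming flat column metadata is available as (name, type) pairs; the table and column names are hypothetical. Making every field a nullable union with a null default is what later allows old data to be read under a newer schema.

```python
import json

# Hypothetical flat table definition pulled from column metadata.
columns = [("id", "long"), ("name", "string"), ("created", "string")]

def avro_schema(table, cols):
    """Build an Avro record schema (as a dict) from flat column metadata."""
    return {
        "type": "record",
        "name": table,
        "fields": [
            # Nullable union with a null default: fields added later can
            # be absent in old data without breaking schema resolution.
            {"name": n, "type": ["null", t], "default": None}
            for n, t in cols
        ],
    }

schema = avro_schema("employees", columns)
print(json.dumps(schema, indent=2))   # ready to store as employees.avsc
```

Note that Avro requires a union's default to match its first branch, which is why `"null"` comes first.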
I am exploring various JSON APIs that could be used, as well as other generation methods:
- Avro Schema from JAXB
- Nokia has released code to generate Avro schemas from XML: Avro-Schema-Generator
My problem statement and solution are simple. The idea is:
- Store schema details of the table in some database
- Read the database field details and generate an Avro schema
- Store it at some location in HDFS, e.g. /schema/tableschema
- Map Hive to use this Avro schema location in HDFS
- When the schema changes, update the database; the system then generates a new Avro schema
- Push the new schema to HDFS
- Hive uses the new schema without breaking: old data remains readable, supporting schema changes and evolution for data in Hadoop
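The workflow above can be sketched end to end. This is a toy sketch under loud assumptions: a plain dict stands in for the schema database and a local temp directory stands in for the HDFS `/schema` location; real code would query the metadata store and push the `.avsc` file through an HDFS client.

```python
import json
import os
import tempfile

# Stand-ins: a dict for the schema database, a temp dir for HDFS.
schema_db = {"employees": [("id", "long"), ("name", "string")]}
hdfs_root = tempfile.mkdtemp()          # pretend this is /schema in HDFS

def publish_schema(table):
    """Read field details from the 'database' and push an Avro schema."""
    fields = [{"name": n, "type": ["null", t], "default": None}
              for n, t in schema_db[table]]
    schema = {"type": "record", "name": table, "fields": fields}
    path = os.path.join(hdfs_root, table + ".avsc")
    with open(path, "w") as f:
        json.dump(schema, f)            # Hive's AvroSerde would point here
    return path

path = publish_schema("employees")

# A change arrives: add a column. The null default keeps old data readable.
schema_db["employees"].append(("email", "string"))
publish_schema("employees")             # regenerate and push the new schema

with open(path) as f:
    names = [x["name"] for x in json.load(f)["fields"]]
assert names == ["id", "name", "email"]
```

Because Hive reads the schema from the fixed HDFS location, republishing the file is all it takes for the table to pick up the new definition.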
Most NoSQL databases have a similar approach. Check Oracle’s documentation:
- Avro Schemas in Oracle NoSQL — Oracle NoSQL manages the schema information and changes in the KVStore
- Providing Schema
Use ORC
ORC, a file format backed by Hortonworks, offers a similar feature: like Avro, it stores the schema within the data.
From the ORC documentation:
The “versioned metadata” means that the ORC file’s metadata is stored in ProtoBufs so that we can add (or remove) fields to the metadata. That means that for some changes to ORC file format we can provide both forward and backward compatibility.
ORC files, like Avro files, are self-describing: they include the type structure of the records in the metadata of the file. It will take more integration work with Hive to make schemas as flexible with ORC.
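What "self-describing" buys you can be shown with a toy analogy. This is emphatically not the real ORC layout (ORC stores its metadata as protobufs in a footer); it only illustrates that when the schema travels inside the file, a reader needs no external schema to recover the structure.

```python
import io
import json

# Toy self-describing file: a schema line followed by data lines.
schema = {"id": "int", "name": "string"}
rows = [[1, "alice"], [2, "bob"]]

buf = io.StringIO()
buf.write(json.dumps({"schema": schema}) + "\n")   # embedded "metadata"
for r in rows:
    buf.write(json.dumps(r) + "\n")

# The reader recovers the column names from the file itself.
buf.seek(0)
meta = json.loads(buf.readline())
recovered = [dict(zip(meta["schema"], json.loads(line))) for line in buf]
assert recovered[0] == {"id": 1, "name": "alice"}
```

Self-description is what lets both Avro and ORC add fields over time: new readers consult the embedded schema of each file rather than assuming a single fixed layout.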