Handle Schema Changes and Evolution in Hadoop
Approaches for handling schema evolution in Hadoop using Avro and ORC file formats, including a practical workflow for managing schema changes with Hive.
In Hadoop, Hive lets different partitions of a table carry different schemas, but only up to a point: you cannot insert a field in the middle of the schema.
If fields are added at the end, Hive handles the change natively.
However, things break if a field is inserted in the middle, because older partitions are read positionally against the new column list.
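A toy Python sketch can illustrate why: text-backed Hive tables map columns to values by position, not by name. The column names and values below are made up for illustration.

```python
# Toy illustration (not Hive code): rows written under an older schema
# are reinterpreted positionally under the newer column list.

old_row = ["1001", "alice", "2015-01-02"]   # schema v1: id, name, created

# Appending a column at the end is safe: the missing trailing value
# simply has no entry, which Hive would surface as NULL.
schema_appended = ["id", "name", "created", "email"]
rec = dict(zip(schema_appended, old_row))
assert rec == {"id": "1001", "name": "alice", "created": "2015-01-02"}

# Inserting "email" in the middle shifts every later column: old
# "created" values now land under "email", silently corrupting reads.
schema_inserted = ["id", "name", "email", "created"]
bad = dict(zip(schema_inserted, old_row))
assert bad["email"] == "2015-01-02"   # old "created" value misread as email
```

The failure is silent, which is what makes mid-schema inserts so dangerous for existing partitions.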
There are a few ways to handle schema evolution and changes in Hadoop.
Use Avro
For flat schemas of database tables (or files), generate an Avro schema. This schema can then be used directly in application code or mapped to a Hive table using AvroSerde.
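A minimal sketch of such generation, assuming flat column metadata is available as (name, type) pairs; the table and column names are hypothetical. Making every field a nullable union with a null default is what later allows old data to be read under a newer schema.

```python
import json

# Hypothetical flat table definition pulled from column metadata.
columns = [("id", "long"), ("name", "string"), ("created", "string")]

def avro_schema(table, cols):
    """Build an Avro record schema (as a dict) from flat column metadata."""
    return {
        "type": "record",
        "name": table,
        "fields": [
            # Nullable union with a null default: fields added later can
            # be absent in old data without breaking schema resolution.
            {"name": n, "type": ["null", t], "default": None}
            for n, t in cols
        ],
    }

schema = avro_schema("employees", columns)
print(json.dumps(schema, indent=2))   # ready to store as employees.avsc
```

Note that Avro requires a union's default to match its first branch, which is why `"null"` comes first.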
I am exploring various JSON APIs that could be used, as well as other generation methods:
- Avro Schema from JAXB
- Nokia has released code to generate Avro schemas from XML: Avro-Schema-Generator
My problem statement and solution are simple. The idea is:
- Store schema details of the table in some database
- Read the database field details and generate an Avro schema
- Store it at some location in HDFS, e.g. /schema/tableschema
- Map Hive to use this Avro schema location in HDFS
- When the schema changes, update the database; the system then generates a new Avro schema
- Push the new schema to HDFS
- Hive uses the new schema without breaking: old data remains readable, supporting schema changes and evolution for data in Hadoop
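The workflow above can be sketched end to end. This is a toy sketch under loud assumptions: a plain dict stands in for the schema database and a local temp directory stands in for the HDFS `/schema` location; real code would query the metadata store and push the `.avsc` file through an HDFS client.

```python
import json
import os
import tempfile

# Stand-ins: a dict for the schema database, a temp dir for HDFS.
schema_db = {"employees": [("id", "long"), ("name", "string")]}
hdfs_root = tempfile.mkdtemp()          # pretend this is /schema in HDFS

def publish_schema(table):
    """Read field details from the 'database' and push an Avro schema."""
    fields = [{"name": n, "type": ["null", t], "default": None}
              for n, t in schema_db[table]]
    schema = {"type": "record", "name": table, "fields": fields}
    path = os.path.join(hdfs_root, table + ".avsc")
    with open(path, "w") as f:
        json.dump(schema, f)            # Hive's AvroSerde would point here
    return path

path = publish_schema("employees")

# A change arrives: add a column. The null default keeps old data readable.
schema_db["employees"].append(("email", "string"))
publish_schema("employees")             # regenerate and push the new schema

with open(path) as f:
    names = [x["name"] for x in json.load(f)["fields"]]
assert names == ["id", "name", "email"]
```

Because Hive reads the schema from the fixed HDFS location, republishing the file is all it takes for the table to pick up the new definition.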
Most NoSQL databases have a similar approach. Check Oracle’s documentation:
- Avro Schemas in Oracle NoSQL — Oracle NoSQL manages the schema information and changes in the KVStore
- Providing Schema
Use ORC
ORC, a file format backed by Hortonworks, offers a similar feature: like Avro, it stores the schema within the data.
From the ORC documentation:
The “versioned metadata” means that the ORC file’s metadata is stored in ProtoBufs so that we can add (or remove) fields to the metadata. That means that for some changes to ORC file format we can provide both forward and backward compatibility.
ORC files, like Avro files, are self-describing: they include the type structure of the records in the metadata of the file. It will take more integration work with Hive to make schemas as flexible with ORC.
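What "self-describing" buys you can be shown with a toy analogy. This is emphatically not the real ORC layout (ORC stores its metadata as protobufs in a footer); it only illustrates that when the schema travels inside the file, a reader needs no external schema to recover the structure.

```python
import io
import json

# Toy self-describing file: a schema line followed by data lines.
schema = {"id": "int", "name": "string"}
rows = [[1, "alice"], [2, "bob"]]

buf = io.StringIO()
buf.write(json.dumps({"schema": schema}) + "\n")   # embedded "metadata"
for r in rows:
    buf.write(json.dumps(r) + "\n")

# The reader recovers the column names from the file itself.
buf.seek(0)
meta = json.loads(buf.readline())
recovered = [dict(zip(meta["schema"], json.loads(line))) for line in buf]
assert recovered[0] == {"id": 1, "name": "alice"}
```

Self-description is what lets both Avro and ORC add fields over time: new readers consult the embedded schema of each file rather than assuming a single fixed layout.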