Data Engineering · Big Data · Hive · Hadoop

Handle Schema Changes and Evolution in Hadoop

Approaches for handling schema evolution in Hadoop using Avro and ORC file formats, including a practical workflow for managing schema changes with Hive.

30 March 2013 · 2 min read

In Hadoop, if you use Hive and try to have different schemas for different partitions, you cannot insert a field in the middle of the schema.

If new fields are appended at the end, Hive handles them natively. However, inserting a field in the middle breaks things, because Hive's text-based formats map columns to fields by position.
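A minimal sketch of why mid-schema inserts break old partitions (the column names are hypothetical; Hive's delimited text formats map columns by position, which this simulates):

```python
old_row = "1\talice\t2013-03-30"            # written under (id, name, ts)
new_schema = ["id", "name", "email", "ts"]  # "email" inserted in the middle

# Positional mapping: pair each stored value with the new column list.
record = dict(zip(new_schema, old_row.split("\t")))
print(record)
# → {'id': '1', 'name': 'alice', 'email': '2013-03-30'}
```

The timestamp now lands in `email` and `ts` is missing entirely. Had `email` been appended at the end instead, old rows would still read correctly, with `email` simply coming back as NULL.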

There are a few ways to handle schema evolution and changes in Hadoop.

Use Avro

For flat schemas of database tables (or files), generate an Avro schema. This Avro schema can then be used directly in application code or mapped to a Hive table using AvroSerde.
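A minimal sketch of generating an Avro schema from a flat table definition; the table name, field list, and SQL-to-Avro type mapping below are assumptions for illustration:

```python
import json

# Hypothetical mapping from SQL column types to Avro primitive types.
SQL_TO_AVRO = {"int": "int", "bigint": "long", "varchar": "string", "date": "string"}

def table_to_avro_schema(table, fields):
    """fields: list of (column_name, sql_type) tuples from the table definition."""
    return {
        "type": "record",
        "name": table,
        "fields": [
            # A union with null plus a default keeps old data readable
            # after new fields are added — the core of Avro schema evolution.
            {"name": name, "type": ["null", SQL_TO_AVRO[sql_type]], "default": None}
            for name, sql_type in fields
        ],
    }

schema = table_to_avro_schema("orders", [("id", "bigint"), ("customer", "varchar")])
print(json.dumps(schema, indent=2))
```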

I am exploring various JSON APIs and methods that could be used for this. My problem statement and the solution I have in mind are simple:

  1. Store schema details of the table in some database
  2. Read the database field details and generate an Avro schema
  3. Store it to some location in Hadoop: /schema/tableschema
  4. Map Hive to use this Avro schema location in HDFS
  5. If some change comes in the schema, update the database and the system would again generate a new Avro schema
  6. Push the new schema to HDFS
  7. Hive would use the new schema without breaking — old data should be able to support schema changes and evolution for data in Hadoop
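The steps above can be sketched as follows, assuming a sqlite3 metadata database, the standard `hdfs dfs` CLI, and illustrative table and column names:

```python
import json
import sqlite3
import tempfile

# 1. Schema details of the table stored in some database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE table_schema (tbl TEXT, field TEXT, avro_type TEXT, pos INT)")
db.executemany("INSERT INTO table_schema VALUES (?, ?, ?, ?)",
               [("orders", "id", "long", 1), ("orders", "customer", "string", 2)])

# 2. Read the field details and generate an Avro schema.
rows = db.execute("SELECT field, avro_type FROM table_schema "
                  "WHERE tbl = 'orders' ORDER BY pos").fetchall()
schema = {"type": "record", "name": "orders",
          "fields": [{"name": f, "type": ["null", t], "default": None}
                     for f, t in rows]}

# 3 & 6. Write the schema file and push it to HDFS. On a schema change
# (step 5), the same code regenerates and re-pushes it.
with tempfile.NamedTemporaryFile("w", suffix=".avsc", delete=False) as fh:
    json.dump(schema, fh)
# e.g.: hdfs dfs -put -f <local .avsc> /schema/tableschema/orders.avsc

# 4 & 7. Hive reads the schema from its HDFS location via avro.schema.url,
# so a newly pushed schema is picked up without recreating the table.
hive_ddl = """
CREATE EXTERNAL TABLE orders
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schema/tableschema/orders.avsc');
"""
```

Because each field is a union with null and carries a default, data written under the old schema remains readable after new fields are added.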

Most NoSQL databases take a similar approach; see Oracle's documentation for one example.

Use ORC

ORC, a file format backed by Hortonworks, stores the schema within the data, much like Avro does.

From the ORC documentation:

> The “versioned metadata” means that the ORC file’s metadata is stored in ProtoBufs so that we can add (or remove) fields to the metadata. That means that for some changes to ORC file format we can provide both forward and backward compatibility.

ORC files, like Avro files, are self-describing: they include the type structure of the records in the file's metadata. However, making schemas fully flexible with ORC will take more integration work with Hive.