DDIA book: Encoding and decoding data files

12 - 07 - 2022

Textual formats such as XML, JSON and CSV are popular mostly because everyone agrees on them, not because they are efficient. All three have their weaknesses, CSV especially, since it has no schema at all, whereas XML and JSON both have optional schema support. What’s the point of having a schema anyway? Well, it turns out that a schema makes encoding and decoding more efficient: when the writer and the reader agree on the fields up front, the encoded data doesn’t have to carry the field names. That’s why binary formats like Thrift, Protobuf and Avro each come with their own schema language.
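To see why a schema helps, here is a minimal sketch in Python. It is not Thrift, Protobuf or Avro: the `SCHEMA` list and the `encode`/`decode` helpers are a toy format I made up for illustration, and the record is the example one from the book. The only point is that once both sides know the field order and types, the field names drop out of the bytes.

```python
import json
import struct

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# Hypothetical toy schema: field order and types are agreed on in advance,
# so field names never appear in the encoded bytes.
SCHEMA = [
    ("userName", "string"),
    ("favoriteNumber", "int64"),
    ("interests", "string_list"),
]


def _encode_string(value):
    data = value.encode("utf-8")
    return struct.pack(">H", len(data)) + data  # 2-byte length prefix


def encode(record, schema):
    """Write only the values, in schema order."""
    out = bytearray()
    for name, kind in schema:
        value = record[name]
        if kind == "string":
            out += _encode_string(value)
        elif kind == "int64":
            out += struct.pack(">q", value)  # fixed 8-byte integer
        elif kind == "string_list":
            out += struct.pack(">H", len(value))  # 2-byte item count
            for item in value:
                out += _encode_string(item)
        else:
            raise ValueError(f"unknown type: {kind}")
    return bytes(out)


def decode(data, schema):
    """Read the values back; the schema says what comes next."""
    pos = 0

    def read_string():
        nonlocal pos
        (length,) = struct.unpack_from(">H", data, pos)
        pos += 2
        text = data[pos:pos + length].decode("utf-8")
        pos += length
        return text

    result = {}
    for name, kind in schema:
        if kind == "string":
            result[name] = read_string()
        elif kind == "int64":
            (result[name],) = struct.unpack_from(">q", data, pos)
            pos += 8
        elif kind == "string_list":
            (count,) = struct.unpack_from(">H", data, pos)
            pos += 2
            result[name] = [read_string() for _ in range(count)]
        else:
            raise ValueError(f"unknown type: {kind}")
    return result


json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")
binary_bytes = encode(record, SCHEMA)
print(len(json_bytes), len(binary_bytes))
```

On this record the toy encoding comes to 40 bytes against 81 for the compact JSON. The real formats do better still with tricks like variable-length integers, but the saving from dropping field names is already visible. The flip side is that the bytes are meaningless without the schema.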

In DDIA’s example, Thrift’s CompactProtocol and Protobuf encode the record in less than half the size of the JSON version (34 and 33 bytes versus 81), and Avro is only slightly smaller than those two (32 bytes). But having a schema means you also have to manage it somewhere, say in a database. The author suggests:

A database of schema versions is a useful thing to have in any case, since it acts as documentation and gives you a chance to check schema compatibility [24]. As the version number, you could use a simple incrementing integer, or you could use a hash of the schema.

And then you can make use of the schema database this way:

A reader can fetch a record, extract the version number, and then fetch the writer’s schema for that version number from the database. Using that writer’s schema, it can decode the rest of the record. (Espresso [23] works this way, for example.)
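The flow in that quote can be sketched roughly like this. Everything here is an assumption for illustration: `SCHEMA_REGISTRY` is a plain dict standing in for the schema database, the 4-byte version header and the field encoding are a toy format of my own, and none of it reflects how Espresso or Avro actually lay out bytes.

```python
import struct

# Hypothetical in-memory stand-in for the schema database:
# version number -> writer's schema as an ordered list of (field, type).
SCHEMA_REGISTRY = {
    1: [("userName", "string")],
    2: [("userName", "string"), ("favoriteNumber", "int64")],
}


def write_record(record, version):
    """Encode with the writer's schema and prefix the schema version."""
    schema = SCHEMA_REGISTRY[version]
    out = bytearray(struct.pack(">I", version))  # 4-byte version header
    for name, kind in schema:
        if kind == "string":
            data = record[name].encode("utf-8")
            out += struct.pack(">H", len(data)) + data
        elif kind == "int64":
            out += struct.pack(">q", record[name])
        else:
            raise ValueError(f"unknown type: {kind}")
    return bytes(out)


def read_record(data):
    """Extract the version, fetch the writer's schema, decode the rest."""
    (version,) = struct.unpack_from(">I", data, 0)
    schema = SCHEMA_REGISTRY[version]  # the "fetch from the database" step
    pos = 4
    record = {}
    for name, kind in schema:
        if kind == "string":
            (length,) = struct.unpack_from(">H", data, pos)
            pos += 2
            record[name] = data[pos:pos + length].decode("utf-8")
            pos += length
        elif kind == "int64":
            (record[name],) = struct.unpack_from(">q", data, pos)
            pos += 8
        else:
            raise ValueError(f"unknown type: {kind}")
    return version, record


# Records written at different times with different schema versions
# can sit side by side; the reader sorts them out by version number.
old = write_record({"userName": "Martin"}, version=1)
new = write_record({"userName": "Martin", "favoriteNumber": 1337}, version=2)
print(read_record(old))
print(read_record(new))
```

This sketch stops at decoding with the writer’s schema. In Avro the reader would then also translate the result into its own schema, which I left out here.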

What a hassle!