
From "Designing Data-Intensive Applications"

Author: Martin Kleppmann
Publisher: "O'Reilly Media, Inc."
Year: 2017
Category: Computers

Chapter 4: Encoding and Evolution
Key Insight 4 from this chapter

Avro's Unique Approach to Schema Evolution

Key Insight

Apache Avro is a binary encoding format distinct from Protocol Buffers and Thrift. It began as a subproject of Hadoop, because Thrift was not a good fit for Hadoop's use cases, and like those formats it uses a schema to define the structure of the data. Avro schemas can be written in Avro IDL, which is intended for human editing, or in an equivalent JSON-based form that is easier for machines to read. A key difference is that the schema contains no field tag numbers, and the encoded bytes carry nothing to identify fields or their datatypes: values are simply concatenated in the order the schema lists them. The example record therefore encodes to just 32 bytes, the most compact of the formats discussed.
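The concatenation described above can be sketched in a few lines of Python. This is a hand-rolled illustration, not the Avro library: the helper names (`encode_long`, `encode_string`, `encode_example_record`) are hypothetical, and the sample values are assumed to be those of the book's running example (a user name, a nullable favorite number, and a list of interests), which this summary does not reproduce.

```python
def encode_long(n: int) -> bytes:
    """Zigzag-encode a signed 64-bit integer, then write it as a varint."""
    z = (n << 1) ^ (n >> 63)          # zigzag: small magnitudes -> small codes
    out = bytearray()
    while True:
        byte = z & 0x7F               # low 7 bits
        z >>= 7
        if z:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_example_record(user_name, favorite_number, interests) -> bytes:
    """Concatenate field values in schema order; no tags, no type markers.

    Assumed schema (hypothetical field order):
      userName: string
      favoriteNumber: union { null, long }
      interests: array<string>
    """
    out = bytearray()
    out += encode_string(user_name)
    if favorite_number is None:
        out += encode_long(0)                 # union branch 0: null, no payload
    else:
        out += encode_long(1)                 # union branch 1: long
        out += encode_long(favorite_number)
    if interests:
        out += encode_long(len(interests))    # one block holding every item
        for item in interests:
            out += encode_string(item)
    out += encode_long(0)                     # zero-length block ends the array
    return bytes(out)


encoded = encode_example_record("Martin", 1337, ["daydreaming", "hacking"])
print(len(encoded), encoded.hex())
```

Nothing in the output says which bytes belong to which field, so a reader can only parse it by walking the writer's schema in the same order.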

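Avro's core innovation lies in its distinction between the 'writer's schema' and the 'reader's schema'. Data is encoded using the writer's schema, which the reader must also be able to obtain: it can be embedded once at the start of a file containing many records, identified by a version number stored with each record in a database, or negotiated when a network connection is set up. Data is decoded using the reader's schema, which does not have to be identical to the writer's schema, only compatible with it. The Avro library resolves the differences between the two at read time, matching fields by name, so the fields may appear in a different order. A field present in the writer's schema but not in the reader's is ignored. A field that the reader expects but the writer's schema lacks is filled in with the default value declared in the reader's schema.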
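The resolution rules can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the record is taken to be already decoded with the writer's schema into a dict, schemas are plain dicts shaped like Avro's JSON form, type checking is omitted, and `resolve_record` plus the sample field names are hypothetical rather than part of any Avro library.

```python
def resolve_record(written: dict, reader_schema: dict) -> dict:
    """Project a record decoded with the writer's schema onto the reader's schema.

    Fields are matched by name. Writer-only fields are dropped; reader-only
    fields take the default declared in the reader's schema.
    """
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            result[name] = written[name]        # present in both schemas
        elif "default" in field:
            result[name] = field["default"]     # reader-only: use the default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return result                               # writer-only fields are ignored


# Record as decoded with an older writer's schema (hypothetical fields).
written = {"userName": "Martin", "favoriteNumber": 1337, "photoURL": "x.png"}

# The reader's schema drops photoURL and adds favoriteColor with a default.
reader_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "favoriteColor", "type": ["null", "string"], "default": None},
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    ],
}

resolved = resolve_record(written, reader_schema)
print(resolved)
```

The reader's schema lists its fields in a different order from the written record, which makes no difference because matching is done by name.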

To maintain compatibility, Avro allows only fields that have a default value to be added or removed. Adding a field without a default breaks backward compatibility, because new readers cannot fill it in when reading old data; removing one breaks forward compatibility. A field accepts null only if its type is a union that includes null: for instance, 'favoriteNumber' can be declared as `union { null, long }` with a default of 'null'. Avro thus expresses nullability explicitly through union types rather than through the optional and required markers of Protocol Buffers and Thrift. Renaming a field is backward compatible if the reader's schema lists the old name as an alias, but it is not forward compatible. Because the schema contains no tag numbers and code generation is optional, Avro is particularly well suited to dynamically generated schemas (e.g., derived from a relational database) and to dynamically typed languages like Python or JavaScript. Avro files are self-describing, since the writer's schema is embedded in them.
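These compatibility rules can be checked mechanically. The following Python sketch assumes schemas are dicts shaped like Avro's JSON form and looks only at field names, aliases, and defaults (type compatibility is ignored). The function `can_read` and the schema versions are hypothetical, not an Avro API.

```python
def can_read(reader_schema: dict, writer_schema: dict) -> bool:
    """True if every reader field can be filled from the writer's data.

    A reader field is satisfied if the writer has a field with the same name
    or one of the reader's aliases, or if the reader declares a default.
    """
    writer_names = {f["name"] for f in writer_schema["fields"]}
    for field in reader_schema["fields"]:
        accepted = {field["name"], *field.get("aliases", [])}
        if accepted & writer_names:
            continue                  # matched by name or alias
        if "default" not in field:
            return False              # nothing to fill this field with
    return True


def record(*fields):
    return {"type": "record", "name": "Person", "fields": list(fields)}


user_name = {"name": "userName", "type": "string"}
fav_number = {"name": "favoriteNumber", "type": ["null", "long"], "default": None}

v1 = record(user_name, fav_number)

# Adding a field WITH a default: safe in both directions.
v2 = record(user_name, fav_number,
            {"name": "favoriteColor", "type": ["null", "string"], "default": None})

# Adding a field WITHOUT a default: new readers cannot read old data.
v2_bad = record(user_name, fav_number, {"name": "age", "type": "long"})

# Renaming userName to name, keeping the old name as an alias.
v2_renamed = record({"name": "name", "type": "string", "aliases": ["userName"]},
                    fav_number)

for label, new in [("default", v2), ("no default", v2_bad), ("rename", v2_renamed)]:
    backward = can_read(new, v1)      # new reader, old writer
    forward = can_read(v1, new)       # old reader, new writer
    print(f"{label}: backward={backward} forward={forward}")
```

Backward compatibility here means a reader on the new schema can consume data written with the old one, and forward compatibility is the reverse. The rename case fails in the forward direction because the old reader knows nothing about the new name.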
