From "Designing Data-Intensive Applications"

Author: Martin Kleppmann
Publisher: "O'Reilly Media, Inc."
Year: 2017
Category: Computers

Chapter 4: Encoding and Evolution
Key Insight 3 from this chapter

Schema-Driven Binary Encoding Formats: Thrift and Protocol Buffers

Key Insight

Apache Thrift and Protocol Buffers (protobuf) are binary encoding libraries developed at Facebook and Google, respectively, based on a shared principle: they require an explicit schema for all encoded data. Schemas are defined using specialized Interface Definition Languages (IDLs), such as Thrift IDL or Protocol Buffers' `message` syntax. Both frameworks provide code generation tools that translate these schemas into classes for various programming languages, enabling applications to efficiently encode and decode data.

These formats achieve compactness by omitting field names from the encoded data and using the numeric field tags (e.g., 1, 2, 3) specified in the schema instead. Each encoded field carries its tag, a type annotation (e.g., string, integer, list), and, where needed, a length indicator. Thrift's CompactProtocol goes further than its BinaryProtocol by packing the field type and tag into a single byte and by using variable-length integers, so that the number 1337 takes two bytes rather than a full eight. This reduces the example record from 59 bytes under BinaryProtocol to 34 bytes. Protocol Buffers does its bit packing slightly differently but is otherwise very similar, and encodes the same record in 33 bytes.
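The byte counts above can be reproduced with a minimal sketch of a Protocol Buffers-style encoder. It assumes the example record is the book's user record (user name "Martin" under tag 1, favorite number 1337 under tag 2, and the interests "daydreaming" and "hacking" under tag 3). The function names are hypothetical and are not part of any protobuf library; only the key layout, `(tag << 3) | wire_type`, and the varint scheme follow the published wire format.

```python
WIRE_VARINT = 0  # wire type for variable-length integers
WIRE_LEN = 2     # wire type for length-delimited values (strings, bytes)


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer 7 bits per byte, low-order group first."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # top bit set: more bytes follow
        else:
            out.append(low7)         # top bit clear: last byte
            return bytes(out)


def field_key(tag: int, wire_type: int) -> bytes:
    """Pack the field tag and wire type together; no field name is written."""
    return encode_varint((tag << 3) | wire_type)


def encode_int_field(tag: int, value: int) -> bytes:
    return field_key(tag, WIRE_VARINT) + encode_varint(value)


def encode_string_field(tag: int, text: str) -> bytes:
    data = text.encode("utf-8")
    return field_key(tag, WIRE_LEN) + encode_varint(len(data)) + data


# Assumed example record; a repeated field is written once per element.
record = (
    encode_string_field(1, "Martin")
    + encode_int_field(2, 1337)
    + encode_string_field(3, "daydreaming")
    + encode_string_field(3, "hacking")
)

print(encode_varint(1337).hex())  # b90a (two bytes)
print(len(record))                # 33
```

The field names never appear in `record`; a reader needs the schema to map tags 1, 2, and 3 back to names.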

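Schema evolution in these formats rests on field tags: a field's name can change freely because encoded data never refers to it, but its tag must never change, since that would invalidate all existing encoded data. Adding a new field with a previously unused tag number preserves forward compatibility: old code that does not recognize the tag ignores the field, using the datatype annotation to determine how many bytes to skip. For backward compatibility, every field added after the initial deployment must be `optional` or have a default value; otherwise new code would fail when reading old data that lacks the field. Removal mirrors addition: only an `optional` field can be removed, and its tag number can never be reused. Changing a field's datatype is sometimes possible but risky: if a 32-bit integer is widened to 64 bits, new code reads old data without trouble, but old code reading new data may truncate values that do not fit in 32 bits. Protocol Buffers’ `repeated` marker allows an `optional` single-valued field to evolve into a multi-valued one: new code reading old data sees a list with zero or one elements, and old code reading new data sees only the last element.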
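The compatibility rules can be illustrated with a toy decoder for a Protocol Buffers-style byte stream. This is a sketch under stated assumptions, not the behavior of any real library's generated code: it handles only varint and length-delimited fields, treats every known length-delimited field as a UTF-8 string, and the `nickname` field under tag 4 is a hypothetical addition made up for the example.

```python
def decode_varint(buf: bytes, pos: int):
    """Return (value, next_position) for the varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # top bit clear: this was the last byte
            return result, pos
        shift += 7


def decode_record(buf: bytes, schema: dict, defaults: dict = None) -> dict:
    """Decode fields listed in schema (tag -> name); skip every other tag."""
    record = dict(defaults or {})  # absent fields keep their default value
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        name = schema.get(tag)
        if name is None:
            continue  # unknown tag: the wire type already told us how far to skip
        record[name] = value.decode("utf-8") if wire_type == 2 else value
    return record


OLD_SCHEMA = {1: "user_name", 2: "favorite_number"}
NEW_SCHEMA = {1: "user_name", 2: "favorite_number", 4: "nickname"}  # hypothetical
NEW_DEFAULTS = {"nickname": ""}

old_data = bytes.fromhex("0a064d617274696e10b90a")  # tag 1 = "Martin", tag 2 = 1337
new_data = old_data + bytes.fromhex("22026869")     # adds tag 4 = "hi"

# Forward compatibility: old code reads new data and ignores tag 4.
print(decode_record(new_data, OLD_SCHEMA))
# Backward compatibility: new code reads old data and falls back to the default.
print(decode_record(old_data, NEW_SCHEMA, NEW_DEFAULTS))
```

Renaming a field amounts to changing a value in the schema dictionary, which leaves the encoded bytes untouched; changing a key would make existing data unreadable under that name.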
