From "Designing Data-Intensive Applications"

Author: Martin Kleppmann
Publisher: "O'Reilly Media, Inc."
Year: 2017
Category: Computers

Chapter 4: Encoding and Evolution
Key Insight 2 from this chapter

Challenges and Limitations of Standard Textual and Language-Specific Data Formats

Key Insight

Language-specific encoding formats, such as Java's 'java.io.Serializable' or Python's 'pickle', offer convenience for saving and restoring in-memory objects but pose significant problems. They are tightly coupled to a specific programming language, making cross-language interoperability difficult and locking systems into a particular technology stack. This hinders integration with systems from other organizations using different languages.

These formats are also a frequent source of security vulnerabilities: to restore data into the same object types, the decoder must be able to instantiate arbitrary classes. An attacker who can get the application to decode a malicious byte sequence can therefore often achieve remote code execution. Versioning is frequently an afterthought, so forward and backward compatibility tend to be neglected. Efficiency is often overlooked as well; Java's built-in serialization, for instance, is known for poor performance and bloated encoding. For these reasons, using a language's built-in encoding is generally discouraged for anything beyond highly transient data.
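The code-execution risk can be illustrated with Python's 'pickle'. The following is a minimal, deliberately harmless sketch: the class name Innocuous is hypothetical, and the built-in sum stands in for whatever callable an attacker would actually choose.

```python
import pickle


class Innocuous:
    """Hypothetical class standing in for attacker-controlled data."""

    def __reduce__(self):
        # pickle stores "call this callable with these arguments" and
        # replays that call at load time. The harmless built-in sum is an
        # assumption for the demo; any importable callable could be named.
        return (sum, ([40, 2],))


payload = pickle.dumps(Innocuous())

# The receiver only calls pickle.loads, yet the callable chosen by the
# sender runs, and its return value replaces the object that was "saved".
result = pickle.loads(payload)
print(type(result).__name__, result)
```

The receiver gets back an int rather than an Innocuous instance, which shows that decoding ran a function chosen by whoever produced the bytes. This is why pickled data should only be loaded from trusted sources.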

Common textual formats like JSON, XML, and CSV are widely used and human-readable, but they have subtle problems. Number encoding is ambiguous: XML and CSV cannot distinguish a number from a string that happens to consist of digits without an external schema, and JSON distinguishes strings from numbers but not integers from floating-point numbers, and it does not specify a precision. Large numbers suffer as a result. Integers greater than 2^53 cannot be represented exactly in a double-precision floating-point number, so they become inaccurate when parsed in JavaScript, which is why Twitter returns tweet IDs both as a JSON number and as a decimal string. JSON and XML also lack native support for binary strings (byte sequences without a character encoding), so binary data has to be Base64-encoded, which increases its size by about 33%. Optional schema support exists for XML and JSON, but it can be complex. CSV has no schema at all, so the meaning of each row and column is left to the application, which makes format evolution difficult, and its vague escaping rules make parsing unreliable.
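Two of these issues are easy to reproduce. The sketch below uses only Python's standard library; because Python integers are arbitrary-precision, the parse_int=float argument is used to simulate a parser that maps every JSON number to a double, as JavaScript does. The field names id and id_str are illustrative assumptions.

```python
import base64
import json

# 2**53 + 1 is the smallest positive integer that an IEEE 754 double
# cannot represent exactly.
big_id = 2**53 + 1

# Python's own json module keeps integers exact...
text = json.dumps({"id": big_id})
exact = json.loads(text)["id"]

# ...but a parser that treats every JSON number as a double silently
# rounds the value. parse_int=float simulates that behaviour here.
as_double = json.loads(text, parse_int=float)["id"]
lost_precision = int(as_double) != big_id

# Workaround: transmit the identifier as a decimal string as well.
safe = json.loads(json.dumps({"id": big_id, "id_str": str(big_id)}))

# Binary data has to be Base64-encoded for JSON: every 3 bytes of input
# become 4 ASCII characters of output.
raw = bytes(range(256)) + bytes(44)  # 300 arbitrary bytes
encoded = base64.b64encode(raw).decode("ascii")
overhead = len(encoded) / len(raw) - 1

print(lost_precision, safe["id_str"], len(raw), len(encoded), round(overhead, 3))
```

The string copy of the identifier survives any parser unchanged, while 300 bytes of binary data grow to 400 characters once encoded.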
