Apache Avro
Apache Avro is a data serialization format. Data can be stored on disk as .avro files. Avro files are often used with Spark, but Spark is completely independent of Avro. Avro is a row-based format that is well suited to evolving data schemas. One benefit of using Avro is that the schema and metadata travel with the data: if you have an .avro file, you have the schema of the data as well. The Apache Avro Specification provides easy-to-read yet detailed information.
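To make "the schema travels with the data" concrete, here is a minimal standard-library sketch of the object container file layout as described in the Avro specification: a 4-byte magic marker, a metadata map that holds the schema as JSON under the key avro.schema, a 16-byte sync marker, and then data blocks. The helper names (encode_long, write_container, read_schema, and so on) are made up for illustration and are not part of any Avro library. Real code should use an Avro package rather than hand-rolled encoding.

```python
import io
import json
import os

MAGIC = b"Obj\x01"  # 4-byte marker that opens every Avro container file


def encode_long(n):
    """Avro int/long: zigzag-encode, then emit as a variable-length integer."""
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_long(stream):
    """Inverse of encode_long, reading one byte at a time from a stream."""
    shift = result = 0
    while True:
        byte = stream.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def encode_bytes(data):
    """Avro bytes/string: length as a long, followed by the raw bytes."""
    return encode_long(len(data)) + data


def encode_user(name, age):
    """Hypothetical record: fields are written in schema order, no delimiters."""
    return encode_bytes(name.encode("utf-8")) + encode_long(age)


def write_container(stream, schema, encoded_records):
    """Write a header (magic, metadata map, sync marker) and one data block."""
    meta = {
        "avro.schema": json.dumps(schema).encode("utf-8"),
        "avro.codec": b"null",  # no compression
    }
    sync = os.urandom(16)
    stream.write(MAGIC)
    stream.write(encode_long(len(meta)))  # one map block with len(meta) entries
    for key, value in meta.items():
        stream.write(encode_bytes(key.encode("utf-8")))
        stream.write(encode_bytes(value))
    stream.write(encode_long(0))  # a zero count ends the map
    stream.write(sync)
    body = b"".join(encoded_records)
    stream.write(encode_long(len(encoded_records)))  # objects in this block
    stream.write(encode_long(len(body)))  # block size in bytes
    stream.write(body)
    stream.write(sync)


def read_schema(stream):
    """Recover the writer's schema from the file header alone."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while True:
        count = decode_long(stream)
        if count == 0:
            break
        if count < 0:  # a negative count is followed by the block's byte size
            decode_long(stream)
            count = -count
        for _ in range(count):
            key = stream.read(decode_long(stream)).decode("utf-8")
            meta[key] = stream.read(decode_long(stream))
    return json.loads(meta["avro.schema"])


schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

buf = io.BytesIO()
write_container(buf, schema, [encode_user("Ada", 36)])
buf.seek(0)
recovered = read_schema(buf)  # the schema comes back out of the file itself
```

The reader needs nothing but the file to learn the field names and types, which is what lets Avro data be passed between systems without a separate schema file.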
Third-party Avro Packages
While avro-python3 is the official Avro package, it appears to be very slow because it is written in pure Python. In comparison, fastavro uses C extensions (with regular CPython), making it much faster. Another benefit of fastavro is that it installs the same way on both Python 2 and Python 3, and its API is also the same for both.