> For the complete documentation index, see [llms.txt](https://ai.nuhil.net/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://ai.nuhil.net/data-engineering/data-format/apache-avro.md).

# Apache Avro

[Apache Avro](https://avro.apache.org/docs/current/) is a data serialization format. We can store data as `.avro` files on disk. Avro files are typically used with Spark but Spark is completely independent of Avro. Avro is a row-based format that is suitable for evolving data schemas. <mark style="background-color:green;">One benefit of using Avro is that schema and metadata travels with the data. If you have an</mark> <mark style="background-color:green;"></mark><mark style="background-color:green;">`.avro`</mark> <mark style="background-color:green;"></mark><mark style="background-color:green;">file, you have the schema of the data as well.</mark> The [Apache Avro Specification](https://avro.apache.org/docs/current/spec.html) provides easy-to-read yet detailed information.

```python
# Python 3 with `avro-python3` package available
import copy
import json
import avro
from avro.datafile import DataFileWriter, DataFileReader
from avro.io import DatumWriter, DatumReader

# Note that we combined namespace and name to get "full name"
schema = {
    'name': 'avro.example.User',
    'type': 'record',
    'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'int'}
    ]
}

# Parse the schema so we can use it to write the data
schema_parsed = avro.schema.Parse(json.dumps(schema))

# Write data to an avro file
with open('users.avro', 'wb') as f:
    writer = DataFileWriter(f, DatumWriter(), schema_parsed)
    writer.append({'name': 'Pierre-Simon Laplace', 'age': 77})
    writer.append({'name': 'John von Neumann', 'age': 53})
    writer.close()

# Read data from an avro file
with open('users.avro', 'rb') as f:
    reader = DataFileReader(f, DatumReader())
    metadata = copy.deepcopy(reader.meta)
    schema_from_file = json.loads(metadata['avro.schema'])
    users = [user for user in reader]
    reader.close()

print(f'Schema that we specified:\n {schema}')
print(f'Schema that we parsed:\n {schema_parsed}')
print(f'Schema from users.avro file:\n {schema_from_file}')
print(f'Users:\n {users}')

# Schema that we specified:
#  {'name': 'avro.example.User', 'type': 'record',
#   'fields': [{'name': 'name', 'type': 'string'}, {'name': 'age', 'type': 'int'}]}
# Schema that we parsed:
#  {"type": "record", "name": "User", "namespace": "avro.example",
#   "fields": [{"type": "string", "name": "name"}, {"type": "int", "name": "age"}]}
# Schema from users.avro file:
#  {'type': 'record', 'name': 'User', 'namespace': 'avro.example',
#   'fields': [{'type': 'string', 'name': 'name'}, {'type': 'int', 'name': 'age'}]}
# Users:
#  [{'name': 'Pierre-Simon Laplace', 'age': 77}, {'name': 'John von Neumann', 'age': 53}]
```

### Third-party Avro Packages <a href="#third-party-avro-packages" id="third-party-avro-packages"></a>

While `avro-python3` is the official Avro package, it appears to be [very slow](https://medium.com/@abrarsheikh/benchmarking-avro-and-fastavro-using-pytest-benchmark-tox-and-matplotlib-bd7a83964453). This is because it is written in pure python. In comparison, [`fastavro`](https://github.com/fastavro/fastavro)uses C extensions (with regular CPython) making it much faster. Another benefit of using `fastavro` is that you can install it the same way in both Python 2 and Python 3. `fastavro` API is also the same[2](https://www.perfectlyrandom.org/2019/11/29/handling-avro-files-in-python/#fn:2) for both Python 2 and 3.

{% embed url="<https://www.perfectlyrandom.org/2019/11/29/handling-avro-files-in-python/>" %}


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://ai.nuhil.net/data-engineering/data-format/apache-avro.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
