Announcing AvroEx v2.0

Published

What is Avro?

Avro is a data serialization system that is sponsored as an Apache project. It is typically used for serializing data in conjunction with publishing to Kafka. This setup is often used for real-time data streaming architectures, where you’re ingesting large quantities of data and processing them in realtime. If you’re working with Kafka and the Confluent ecosystem, you’re probably using Avro.

My Usage of Avro

I’ve had an on-and-off (and love/hate) relationship with Avro. An early version of the Simplebet product used Kafka and Avro for serving market data to our customers. Last summer, I spoke at RabbitMQ Summit on why we moved to RabbitMQ and AMQP for this customer interface.

More recently, we’ve been revisiting our use Kafka for ingesting real-time product and analytics data into our Databricks Delta Lake. To ease publishing, we put RabbitMQ as an intermediary, with a service in between to dispatch to Kafka. Services are still required to register Avro schemas in the schema registry and publish them with Avro, so we needed a good Avro solution in our Elixir applications.

Avro in the BEAM

Since working with the earlier customer-facing product, I’ve worked with several Avro libraries in the BEAM ecosystem.

All of these libraries are great, and I’ve written several PRs to improve these libraries, especially better interop with elixir terms. However, I still felt that when things went wrong, the quality of error messages produced by libraries built on top of this ecosystem left a lot to be desired. With that in mind, we chose AvroEx to serialize Avro data in our data streaming project, and I began making improvements over the last few weeks to AvroEx.

AvroEx v2.0.0

The culmination of this effort is the release of AvroEx v2.0.0 🥳! This release builds upon the great foundation that CJ Poll and contributors have started, and includes a number of improvements to AvroEx that I’m really excited to share.

Schema Decoding - Error Messages

The schema decoder previously leveraged Ecto.Changeset to cast and validate schema data. However, important validations were missing, such as reference validation, name and symbol duplication, union nesting, and field validation. Now if you write an invalid schema, the decoder will raise a helpful error message. This was made possible by writing a hand-rolled schema parser. The side effect is that we were also able to drop Ecto as a dependency. What follows is a few examples of how error messages have improved in Schema decoding, but note that it is not an exhaustive list!

Missing required fields

Schema’s that are missing required fields defined by Avro will raise AvroEx.Schema.DecodeError.

iex(1)> AvroEx.decode_schema!(%{"type" => "record", "fields" => []})
** (AvroEx.Schema.DecodeError) Schema missing required key `name` for AvroEx.Schema.Record in %{"fields" => [], "type" => "record"}
    (avro_ex 2.0.0) lib/avro_ex/schema/parser.ex:43: AvroEx.Schema.Parser.parse!/2

Duplicate Union symbols

Union’s cannot have duplicate types, or named types with duplicate names.

iex> AvroEx.decode_schema(["int", "int"])
** (AvroEx.Schema.DecodeError) Union contains duplicated int in ["int", "int"]
    (avro_ex 2.0.0) lib/avro_ex/schema/parser.ex:43: AvroEx.Schema.Parser.parse!/2

Nested Unions

Union’s cannot include Unions as immediate children.

iex> AvroEx.decode_schema!(["int", ["string", "int"]])
** (AvroEx.Schema.DecodeError) Union contains nested union Union<possibilities=string|int> as immediate child in ["int", ["string", "int"]]
    (avro_ex 2.0.0) lib/avro_ex/schema/parser.ex:43: AvroEx.Schema.Parser.parse!/2

Invalid Names

Names that don’t adhere to the full name Regex are invalid.

iex> AvroEx.decode_schema!(%{"type" => "record", "name" => "invalid name",  "fields" => []})
** (AvroEx.Schema.DecodeError) Invalid name `invalid name` for `name` in %{"fields" => [], "name" => "invalid name", "type" => "record"}
    (avro_ex 2.0.0) lib/avro_ex/schema/parser.ex:43: AvroEx.Schema.Parser.parse!/2

Strict Schema Validation

Avro is pretty loose in what you’re allowed to put into a schema. For example, arbitrary metadata is technically allowed on any field

Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.

However, I’ve found that this has bitten me when adding fields, such as logicalType, to a Record Field instead of a primitive. Now, AvroEx.decode_schema/2 accepts a :strict option that will raise an error if you have unrecognized fields

iex> AvroEx.decode_schema!(%{"type" => "map", "values" => "int","symbols" => ["a"]}, strict: true)
** (AvroEx.Schema.DecodeError) Unrecognized schema key `symbols` for AvroEx.Schema.Map in %{"symbols" => ["a"], "type" => "map", "values" => "int"}
    (avro_ex 2.0.0) lib/avro_ex/schema/parser.ex:43: AvroEx.Schema.Parser.parse!/2

Schema Encoding

When working with schemas, you may want to define them in code, and then serialize them to JSON for writing to disk or sending to a schema registry. This is now achieved with AvroEx.encode_schema/2

iex> schema = AvroEx.decode_schema!(["int", "string"])
iex> AvroEx.encode_schema(schema)
"[{\"type\":\"int\"},{\"type\":\"string\"}]"

Parsing Canonical Form

Equality of schema’s can be performed by converting a schema to Parsing Canonical Form and then comparing the output strings. AvroEx.encode_schema/2 supports this by passing the :canonical option.

iex> schema = AvroEx.decode_schema!(["int", "string"])
iex> AvroEx.encode_schema(schema, canonical: true)
"[\"int\",\"string\"]"

Encoding and Decoding Errors

In addition to schemas, encoding and decoding of data saw some improvements to error messages.

Schema Mismatch

iex> schema = AvroEx.decode_schema!("int")
iex> AvroEx.encode(schema, "not an int")
** (AvroEx.EncodeError) Schema Mismatch: Expected value of int, got "not an int"
    (avro_ex 2.0.0) lib/avro_ex.ex:146: AvroEx.encode!/2

Invalid Fixed

iex> schema = AvroEx.decode_schema!(~S({"type": "fixed", "name": "sha", "size": 40}))
iex> AvroEx.encode!(schema, "FFFF")
** (AvroEx.EncodeError) Invalid size for Fixed<name=sha, size=40>. Size of 4 for "FFFF"
    (avro_ex 2.0.0) lib/avro_ex.ex:146: AvroEx.encode!/2

Decoding invalid string

iex> schema = AvroEx.decode_schema!("string")
iex> AvroEx.decode!(schema, <<"\nhell", 0xFFFF::16>>)
** (AvroEx.DecodeError) Invalid UTF-8 string found <<104, 101, 108, 108, 255>>.
    (avro_ex 2.0.0) lib/avro_ex.ex:184: AvroEx.decode!/2

What’s next?

This release saw many improvements for developer experience through error messages and helpful feedback. In future releases, there’s a lot we can still improve for encoding and decoding error messages. Additionally, I would like to get better test-coverage to guarantee correctness through property testing. Other areas to support are multi-file schemas, Object Container Files, benchmarking, and the rest of the Avro 1.11.0 spec!

If you are interested, need a feature, or would like to help out, please reach out to me on the Elixir slack or Twitter. Thanks for reading!

See a mistake/bug? Edit this article on Github