Announcing AvroEx v2.0
What is Avro?
Avro is a data serialization system that is sponsored as an Apache project. It is typically used for serializing data in conjunction with publishing to Kafka. This setup is often used for real-time data streaming architectures, where you’re ingesting large quantities of data and processing them in realtime. If you’re working with Kafka and the Confluent ecosystem, you’re probably using Avro.
My Usage of Avro
I’ve had an on-and-off (and love/hate) relationship with Avro. An early version of the Simplebet product used Kafka and Avro for serving market data to our customers. Last summer, I spoke at RabbitMQ Summit on why we moved to RabbitMQ and AMQP for this customer interface.
More recently, we’ve been revisiting our use Kafka for ingesting real-time product and analytics data into our Databricks Delta Lake. To ease publishing, we put RabbitMQ as an intermediary, with a service in between to dispatch to Kafka. Services are still required to register Avro schemas in the schema registry and publish them with Avro, so we needed a good Avro solution in our Elixir applications.
Avro in the BEAM
Since working with the earlier customer-facing product, I’ve worked with several Avro libraries in the BEAM ecosystem.
- erlavro - Low-level Erlang Avro encoder/decoder
-
Avrora - High-level library that include schema registration, uses
erlavro
underneath -
AvroSchema - Another high-level library similar to Avrora, also uses
erlavro
- AvroEx - Low-level Avro encoder/decoder written in pure Elixir
- ConfluentSchemaRegistry - A client for publishing Avro schemas to the Confluent Schema Registry
All of these libraries are great, and I’ve written several PRs to improve these libraries, especially better interop with elixir terms. However, I still felt that when things went wrong, the quality of error messages produced by libraries built on top of this ecosystem left a lot to be desired. With that in mind, we chose AvroEx to serialize Avro data in our data streaming project, and I began making improvements over the last few weeks to AvroEx
.
AvroEx v2.0.0
The culmination of this effort is the release of AvroEx v2.0.0 🥳! This release builds upon the great foundation that CJ Poll and contributors have started, and includes a number of improvements to AvroEx
that I’m really excited to share.
Schema Decoding - Error Messages
The schema decoder previously leveraged Ecto.Changeset to cast and validate schema data. However, important validations were missing, such as reference validation, name and symbol duplication, union nesting, and field validation. Now if you write an invalid schema, the decoder will raise a helpful error message. This was made possible by writing a hand-rolled schema parser. The side effect is that we were also able to drop Ecto
as a dependency. What follows is a few examples of how error messages have improved in Schema decoding, but note that it is not an exhaustive list!
Missing required fields
Schema’s that are missing required fields defined by Avro will raise AvroEx.Schema.DecodeError
.
iex(1)> AvroEx.decode_schema!(%{"type" => "record", "fields" => []})
** (AvroEx.Schema.DecodeError) Schema missing required key `name` for AvroEx.Schema.Record in %{"fields" => [], "type" => "record"}
(avro_ex 2.0.0) lib/avro_ex/schema/parser.ex:43: AvroEx.Schema.Parser.parse!/2
Duplicate Union symbols
Union’s cannot have duplicate types, or named types with duplicate names.
iex> AvroEx.decode_schema(["int", "int"])
** (AvroEx.Schema.DecodeError) Union contains duplicated int in ["int", "int"]
(avro_ex 2.0.0) lib/avro_ex/schema/parser.ex:43: AvroEx.Schema.Parser.parse!/2
Nested Unions
Union’s cannot include Unions as immediate children.
iex> AvroEx.decode_schema!(["int", ["string", "int"]])
** (AvroEx.Schema.DecodeError) Union contains nested union Union<possibilities=string|int> as immediate child in ["int", ["string", "int"]]
(avro_ex 2.0.0) lib/avro_ex/schema/parser.ex:43: AvroEx.Schema.Parser.parse!/2
Invalid Names
Names that don’t adhere to the full name
Regex are invalid.
iex> AvroEx.decode_schema!(%{"type" => "record", "name" => "invalid name", "fields" => []})
** (AvroEx.Schema.DecodeError) Invalid name `invalid name` for `name` in %{"fields" => [], "name" => "invalid name", "type" => "record"}
(avro_ex 2.0.0) lib/avro_ex/schema/parser.ex:43: AvroEx.Schema.Parser.parse!/2
Strict Schema Validation
Avro is pretty loose in what you’re allowed to put into a schema. For example, arbitrary metadata is technically allowed on any field
Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.
However, I’ve found that this has bitten me when adding fields, such as logicalType, to a Record Field instead of a primitive. Now, AvroEx.decode_schema/2
accepts a :strict
option that will raise an error if you have unrecognized fields
iex> AvroEx.decode_schema!(%{"type" => "map", "values" => "int","symbols" => ["a"]}, strict: true)
** (AvroEx.Schema.DecodeError) Unrecognized schema key `symbols` for AvroEx.Schema.Map in %{"symbols" => ["a"], "type" => "map", "values" => "int"}
(avro_ex 2.0.0) lib/avro_ex/schema/parser.ex:43: AvroEx.Schema.Parser.parse!/2
Schema Encoding
When working with schemas, you may want to define them in code, and then serialize them to JSON for writing to disk or sending to a schema registry. This is now achieved with AvroEx.encode_schema/2
iex> schema = AvroEx.decode_schema!(["int", "string"])
iex> AvroEx.encode_schema(schema)
"[{\"type\":\"int\"},{\"type\":\"string\"}]"
Parsing Canonical Form
Equality of schema’s can be performed by converting a schema to Parsing Canonical Form and then comparing the output strings. AvroEx.encode_schema/2
supports this by passing the :canonical
option.
iex> schema = AvroEx.decode_schema!(["int", "string"])
iex> AvroEx.encode_schema(schema, canonical: true)
"[\"int\",\"string\"]"
Encoding and Decoding Errors
In addition to schemas, encoding and decoding of data saw some improvements to error messages.
Schema Mismatch
iex> schema = AvroEx.decode_schema!("int")
iex> AvroEx.encode(schema, "not an int")
** (AvroEx.EncodeError) Schema Mismatch: Expected value of int, got "not an int"
(avro_ex 2.0.0) lib/avro_ex.ex:146: AvroEx.encode!/2
Invalid Fixed
iex> schema = AvroEx.decode_schema!(~S({"type": "fixed", "name": "sha", "size": 40}))
iex> AvroEx.encode!(schema, "FFFF")
** (AvroEx.EncodeError) Invalid size for Fixed<name=sha, size=40>. Size of 4 for "FFFF"
(avro_ex 2.0.0) lib/avro_ex.ex:146: AvroEx.encode!/2
Decoding invalid string
iex> schema = AvroEx.decode_schema!("string")
iex> AvroEx.decode!(schema, <<"\nhell", 0xFFFF::16>>)
** (AvroEx.DecodeError) Invalid UTF-8 string found <<104, 101, 108, 108, 255>>.
(avro_ex 2.0.0) lib/avro_ex.ex:184: AvroEx.decode!/2
What’s next?
This release saw many improvements for developer experience through error messages and helpful feedback. In future releases, there’s a lot we can still improve for encoding and decoding error messages. Additionally, I would like to get better test-coverage to guarantee correctness through property testing. Other areas to support are multi-file schemas, Object Container Files, benchmarking, and the rest of the Avro 1.11.0 spec!
If you are interested, need a feature, or would like to help out, please reach out to me on the Elixir slack or Twitter. Thanks for reading!