
Haskell Avro Encoding and Decoding Native Support (no RPC)


haskell-works/avro


Native Haskell implementation of Avro


This is a Haskell Avro library for decoding and encoding Avro data structures. Avro can be thought of as a serialization format and an RPC specification, which induces two separable tasks:

  • Serialization/Deserialization - This library has been used "in anger" for:
    • Deserialization of Avro container files
    • Serialization/deserialization of Avro messages to/from Kafka topics
  • RPC - There is currently no support for Avro RPC in this library.

Generating code from Avro schema

The preferred method to use Avro is to be "schema first".
This library supports this idea by providing the ability to generate all the necessary entries (types, class instances, etc.) from Avro schemas.

import Data.Avro
import Data.Avro.Deriving (deriveAvroFromByteString, r)

deriveAvroFromByteString [r|
{
  "name": "Person",
  "type": "record",
  "fields": [
    { "name": "fullName", "type": "string" },
    { "name": "age", "type": "int" },
    { "name": "gender",
      "type": { "name": "Gender", "type": "enum", "symbols": ["Male", "Female"] }
    },
    { "name": "ssn", "type": ["null", "string"] }
  ]
}
|]

This code will generate the following entries:

data Gender = GenderMale | GenderFemale

schema'Gender :: Schema
schema'Gender = ...

data Person = Person
  { personFullName  :: Text
  , personAge       :: Int32
  , personGender    :: Gender
  , personSsn       :: Maybe Text
  }

schema'Person :: Schema
schema'Person = ...

It also generates all the useful instances for these types: Eq, Show, Generic and, most notably, HasAvroSchema, FromAvro and ToAvro.

See the Data.Avro.Deriving module for more options, such as generating code from Avro schemas stored in files and specifying strictness and field prefixes.
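Once the types and instances are generated, a value can be encoded and decoded without writing any instance by hand. The following is a minimal sketch of such a round trip. It assumes that Data.Avro exports encodeValue and decodeValue for types that have HasAvroSchema, ToAvro and FromAvro instances; the names samplePerson and roundTrip are made up for this illustration, and the exact set of LANGUAGE pragmas may need adjusting for your compiler version.

```haskell
{-# LANGUAGE DeriveAnyClass    #-}
{-# LANGUAGE DeriveGeneric     #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE QuasiQuotes       #-}
{-# LANGUAGE TemplateHaskell   #-}

import Data.Avro          (decodeValue, encodeValue)
import Data.Avro.Deriving (deriveAvroFromByteString, r)

-- Generates Gender, Person, their schemas and their Avro instances.
deriveAvroFromByteString [r|
{
  "name": "Person",
  "type": "record",
  "fields": [
    { "name": "fullName", "type": "string" },
    { "name": "age", "type": "int" },
    { "name": "gender",
      "type": { "name": "Gender", "type": "enum", "symbols": ["Male", "Female"] }
    },
    { "name": "ssn", "type": ["null", "string"] }
  ]
}
|]

-- A hypothetical value used only for this example.
samplePerson :: Person
samplePerson = Person
  { personFullName = "Ada Lovelace"
  , personAge      = 36
  , personGender   = GenderFemale
  , personSsn      = Nothing
  }

-- Encode with the generated schema, then decode with the same schema.
-- Assumption: decodeValue reports failures as Left with a message.
roundTrip :: Either String Person
roundTrip = decodeValue (encodeValue samplePerson)
```

Because the writer and the reader use the same generated schema here, roundTrip is expected to give back the original value.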

Using Avro with existing Haskell types

Note: This is an advanced topic. Prefer generating from schemas unless it is required to make Avro work with manually defined Haskell types.

In this section we assume that the following Haskell type is manually defined:

data Person = Person
  { fullName  :: Text
  , age       :: Int32
  , ssn       :: Maybe Text
  } deriving (Eq, Show, Generic)

For a Haskell type to be encodable to Avro it must have a ToAvro instance, and to be decodable from Avro it must have a FromAvro instance.

There is also the HasAvroSchema class. An instance of it is useful to have, although it is not strictly required.

Creating a schema

A schema can still be generated using TH:

import Data.Avro.Deriving (makeSchemaFromByteString, r)

schema'Person :: Schema
schema'Person = $(makeSchemaFromByteString [r|
{
  "name": "Person",
  "type": "record",
  "fields": [
    { "name": "fullName", "type": "string" },
    { "name": "age", "type": "int" },
    { "name": "ssn", "type": ["null", "string"] }
  ]
}
|])

Alternatively, a schema can be defined manually:

import Data.Avro
import Data.Avro.Schema.Schema (mkUnion)
import Data.List.NonEmpty (NonEmpty (..))  -- provides the (:|) constructor

schema'Person :: Schema
schema'Person =
  Record "Person" []  Nothing
    [ fld "fullName"  (String Nothing)                        Nothing
    , fld "age"       (Int Nothing)                           Nothing
    , fld "ssn"       (mkUnion $ Null :| [(String Nothing)])  Nothing
    ]
  where
     fld nm ty def = Field nm [] Nothing Nothing ty def

NOTE: When a Schema is created separately from a data type, there is no way to guarantee that the schema actually matches the type. It is up to the developer to make sure that it does.

Prefer generating data types with Data.Avro.Deriving when possible.


Instantiating FromAvro

When working with FromAvro directly it is important to understand the difference between Schema and ReadSchema.

Schema (as in the example above) is just a regular data schema for an Avro type.

ReadSchema is a similar type, but it is capable of capturing and resolving differences between a "writer schema" and a "reader schema". See the Specification to learn more about schema resolution and de-conflicting.

The FromAvro class requires a ReadSchema because Avro allows data to be read with a schema that differs from the one used to write it.

A ReadSchema can be obtained by converting an existing Schema with the readSchemaFromSchema function, or by actually deconflicting two schemas with the deconflict function.
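The two ways of obtaining a ReadSchema can be sketched as follows. This is an illustration under assumptions rather than a reference: it assumes that Data.Avro exports readSchemaFromSchema, deconflict, encodeValueWithSchema and decodeValueWithSchema, and that deconflict takes the writer schema first and the reader schema second. The names intSchema, sameSchemaReader, resolvedReader and decoded are made up for this example.

```haskell
import Data.Avro
  ( decodeValueWithSchema
  , deconflict
  , encodeValueWithSchema
  , readSchemaFromSchema
  )
import Data.Avro.Schema.ReadSchema (ReadSchema)
import Data.Avro.Schema.Schema     (Schema (..))
import Data.Int                    (Int32)

-- A plain Avro "int" schema without a logical type.
intSchema :: Schema
intSchema = Int Nothing

-- Writer and reader agree: convert the schema directly.
sameSchemaReader :: ReadSchema
sameSchemaReader = readSchemaFromSchema intSchema

-- Writer and reader differ: data was written as "int" and is read as "long".
-- Assumption: the argument order is writer schema, then reader schema.
resolvedReader :: Either String ReadSchema
resolvedReader = deconflict intSchema (Long Nothing)

-- Encode with the writer schema, decode with the matching ReadSchema.
decoded :: Either String Int32
decoded =
  decodeValueWithSchema sameSchemaReader
    (encodeValueWithSchema intSchema (7 :: Int32))
```

deconflict returns an Either because two schemas may be incompatible, whereas converting a single schema with readSchemaFromSchema cannot fail.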

Another important fact is that field values in an Avro payload are written and read in the same order in which the fields are defined in the schema.

This fact can be exploited in writing FromAvro instance for Person:

import           Data.Avro.Encoding.FromAvro (FromAvro (..))
import qualified Data.Avro.Encoding.FromAvro as FromAvro
import qualified Data.Vector                 as Vector

instance FromAvro Person where
  -- Fields are read by position, in the order the schema declares them.
  fromAvro (FromAvro.Record _schema vs) = Person
    <$> fromAvro (vs Vector.! 0)
    <*> fromAvro (vs Vector.! 1)
    <*> fromAvro (vs Vector.! 2)
  -- Any value that is not a record cannot be decoded as a Person.
  fromAvro _ = Left "Unable to decode Person: expected a record"

Field resolution by name can also be performed here, since we have a reference to the schema. In this case, however, it is simpler (and faster) to rely on the known order of the values and to access them by position.

Instantiating ToAvro

The ToAvro class is defined as:

class ToAvro a where
  toAvro :: Schema -> a -> Builder

A Schema is provided to help with disambiguating how exactly the specified value should be encoded.

For example, UTCTime can be encoded as milliseconds or as microseconds, depending on the schema's logical type, according to the Specification:

instance ToAvro UTCTime where
  toAvro s = case s of
    Long (Just TimestampMicros) ->
      toAvro @Int64 s . fromIntegral . utcTimeToMicros

    Long (Just TimestampMillis) ->
      toAvro @Int64 s . fromIntegral . utcTimeToMillis

A ToAvro instance for the Person data type defined above could look like this:

import Data.Avro.Encoding.ToAvro (ToAvro(..), record, ((.=)))

instance ToAvro Person where
  toAvro schema value =
    record schema
      [ "fullName"  .= fullName value
      , "age"       .= age value
      , "ssn"       .= ssn value
      ]

The record helper function is responsible for propagating the individual fields' schemas (found in the provided schema) when toAvro'ing nested values.
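Putting the pieces of this section together, a manually defined type can be encoded by passing its schema explicitly. The sketch below assumes that Data.Avro exports encodeValueWithSchema and that it returns a lazy ByteString; the name encoded and the sample field values are made up for this illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE QuasiQuotes       #-}
{-# LANGUAGE TemplateHaskell   #-}

import           Data.Avro                 (encodeValueWithSchema)
import           Data.Avro.Deriving        (makeSchemaFromByteString, r)
import           Data.Avro.Encoding.ToAvro (ToAvro (..), record, (.=))
import           Data.Avro.Schema.Schema   (Schema)
import qualified Data.ByteString.Lazy      as BL
import           Data.Int                  (Int32)
import           Data.Text                 (Text)

-- The manually defined type from this section.
data Person = Person
  { fullName :: Text
  , age      :: Int32
  , ssn      :: Maybe Text
  } deriving (Eq, Show)

-- The schema is parsed at compile time; it is not checked against Person.
schema'Person :: Schema
schema'Person = $(makeSchemaFromByteString [r|
{
  "name": "Person",
  "type": "record",
  "fields": [
    { "name": "fullName", "type": "string" },
    { "name": "age", "type": "int" },
    { "name": "ssn", "type": ["null", "string"] }
  ]
}
|])

-- Fields are listed in the same order as in the schema.
instance ToAvro Person where
  toAvro schema value =
    record schema
      [ "fullName" .= fullName value
      , "age"      .= age value
      , "ssn"      .= ssn value
      ]

-- Assumption: encodeValueWithSchema takes the schema first, then the value.
encoded :: BL.ByteString
encoded = encodeValueWithSchema schema'Person (Person "Ada Lovelace" 36 Nothing)
```

Passing the schema explicitly avoids the need for a HasAvroSchema instance, at the cost of having to keep the schema and the instance in agreement by hand.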

Type mapping

The full list can be found in the ToAvro and FromAvro modules.

This library provides the following conversions between Haskell types and Avro types:

Haskell type        Avro type
()                  "null"
Bool                "boolean"
Int, Int64          "long"
Int32               "int"
Double              "double"
Text                "string"
ByteString          "bytes"
Maybe a             ["null", "a"]
Either a b          ["a", "b"]
Identity a          ["a"]
Map Text a          { "type": "map", "values": "a" }
Map String a        { "type": "map", "values": "a" }
HashMap Text a      { "type": "map", "values": "a" }
HashMap String a    { "type": "map", "values": "a" }
[a]                 { "type": "array", "items": "a" }
UTCTime             { "type": "long", "logicalType": "timestamp-millis" }
UTCTime             { "type": "long", "logicalType": "timestamp-micros" }
LocalTime           { "type": "long", "logicalType": "local-timestamp-millis" }
LocalTime           { "type": "long", "logicalType": "local-timestamp-micros" }
DiffTime            { "type": "int", "logicalType": "time-millis" }
DiffTime            { "type": "long", "logicalType": "time-micros" }
Day                 { "type": "int", "logicalType": "date" }
UUID                { "type": "string", "logicalType": "uuid" }

User-defined data types should provide HasAvroSchema / ToAvro / FromAvro instances in order to be encoded to and decoded from Avro.