Schema Registry in Kafka


Recently, I was inspired by an article of Amarpreet Singh. I decided to do a research.


Why Schema Registry?

Kafka, transfers data in byte format. There is no data verification that’s being done at the Kafka cluster level. In fact, Kafka doesn’t even know what kind of data it is sending or receiving. Whether it is a string or integer.

Due to the decoupled nature of Kafka, producers and consumers do not communicate with each other directly, but rather information transfer happens via Kafka topic. At the same time, the consumer still needs to know the type of data the producer is sending in order to deserialize it. What if the producer starts sending bad data to Kafka or if the data type of your data gets changed? Than your consumers will start breaking. We need a way to have a common data type that must be agreed upon.

That’s where Schema Registry comes into the picture. It is an application that resides outside of your Kafka cluster and handles the distribution of schemas to the producer and consumer by storing a copy of schema in its local cache.


Schema Registry Architecture

With the schema registry in place, the producer, before sending the data to Kafka, talks to the schema registry first and checks if the schema is available. If it doesn’t find the schema then it registers and caches it in the schema registry. Once the producer gets the schema, it will serialize the data with the schema and send it to Kafka in binary format prepended with a unique schema ID. When the consumer processes this message, it will communicate with the schema registry using the schema ID it got from the producer and deserialize it using the same schema. If there is a schema mismatch, the schema registry will throw an error letting the producer know that it’s breaking the schema agreement.


Conclusion

Schema Registry is a simple concept but it’s really powerful in enforcing data governance within your Kafka architecture. Schemas reside outside of your Kafka cluster, only the schema ID resides in your Kafka, hence making schema registry a critical component of your infrastructure. If the schema registry is not available, it will break producers and consumers. So it is always a best practice to ensure your schema registry is highly available.





Go to source code

Comments

Popular posts from this blog

Debug Java Stream in Intellij Idea

Creating Efficient Docker Images with Spring Boot 2.3

CRUD Goes Even Easier With JPABuddy