by Julian Bergner

Topic Governance, Automation & Data Engineering Best Practices

Apache Kafka is often introduced with a simple demo: create a topic, produce a few messages, start a consumer, and watch the events flow.

That is a useful first step. But it is not where Kafka becomes difficult.

In real data engineering environments, the harder questions usually start after the first successful demo:

  • Who is allowed to create Kafka topics?
  • Who owns a topic once it exists?
  • Which applications may write to it?
  • Which teams may consume from it?
  • What schema should the events follow?
  • How long should the data be retained?
  • How do we prevent dozens of unclear, duplicate, or abandoned topics?
  • How do we make Kafka usable without turning every change into a manual ticket?

For data engineers, Apache Kafka is not only a technology for moving events. It becomes a shared data infrastructure layer. And shared infrastructure needs governance.

Good Kafka governance does not mean slowing teams down. Done well, it does the opposite: it gives teams a safe, automated, and predictable way to create and use event streams.

This field guide focuses on exactly that: how data engineers can think about Kafka topics, automation, schemas, access control, and lifecycle management in practical environments.

Kafka Is Easy to Start, but Harder to Manage at Scale

A single Kafka topic is simple enough to understand. A producer writes events. A consumer reads events. Kafka stores those events in a durable log that is ordered within each partition.

But production Kafka environments rarely stay that simple.

Over time, more teams start using Kafka. More applications produce events. More consumers depend on those events. Topics become interfaces between systems. Some topics carry operational data. Some carry analytical data. Some contain sensitive information. Some are temporary experiments that never get cleaned up.

Without governance, this usually leads to familiar problems:

Problem                      Typical result
---------------------------  -------------------------------------------------
No naming standards          Teams cannot understand what a topic is for
No ownership                 Nobody knows who maintains or supports a topic
No schema rules              Producers accidentally break consumers
No access control process    Too many applications get broad permissions
No retention strategy        Data is deleted too early or stored for too long
No lifecycle management      Deprecated topics stay forever
No automation                Every change becomes a manual platform task

Kafka itself does not solve these organizational problems automatically. It gives you the technical foundation. Data engineering teams need to add the operating model around it. That operating model starts with a simple idea:

A Kafka topic should be treated as a governed data product, not as a random technical object.

Think of Topics as Data Contracts

In Kafka, a topic is more than a place where messages are stored. In practice, it is a contract between producers and consumers.

A producer promises:

  • This topic contains a specific kind of event.
  • Events follow a defined structure.
  • The meaning of fields is stable and documented.
  • Changes will be made in a compatible way.
  • The topic has an owner.

Consumers rely on that promise. If the producer suddenly removes a field, changes a data type, renames an event, or changes the meaning of a value, downstream applications can break. That is why topic design belongs in the data engineering discipline. It is not only a developer convenience.

A well-governed Kafka topic should answer at least these questions:

  • What business or technical event does this topic represent?
  • Who owns the topic?
  • Which system produces the events?
  • Which teams or applications consume the events?
  • What schema do events follow?
  • What compatibility rules apply?
  • How long is data retained?
  • Is the data sensitive?
  • How is the topic monitored?
  • When should the topic be deprecated?

If these questions cannot be answered, the topic is probably not ready for production.

Topic Creation Should Be a Process, Not a Side Effect

One of the most important governance decisions is how topics are created.

In early development environments, it can be convenient to let the broker create topics automatically (the broker setting auto.create.topics.enable). A producer writes to a topic name that does not exist yet, and the topic appears. In production, that convenience can become a problem.

Automatic topic creation can lead to misspelled topic names, inconsistent configurations, unclear ownership, and topics with default settings that do not match the actual use case. A topic called orders.created.v1 and a typo like order.created.v1 may both exist before anyone notices.

A more mature model is to make topic creation explicit. A good topic creation process might look like this:

  1. A team requests or defines a new topic.
  2. The topic name is validated against naming standards.
  3. Ownership and business domain are assigned.
  4. Partitions and replication settings are reviewed.
  5. Retention and cleanup policies are defined.
  6. A schema or event contract is registered.
  7. Producer and consumer permissions are configured.
  8. Metadata is published to a catalog.
  9. Monitoring and alerting are attached automatically.
  10. The topic lifecycle is tracked over time.

This process should not be slow or bureaucratic. Ideally, it should be automated. The goal is not to prevent teams from creating topics. The goal is to help them create topics correctly.

Topic-as-Code: A Practical Automation Pattern

One of the most effective patterns for Kafka governance is topic-as-code. The idea is simple: Kafka topics are defined in a version-controlled file, reviewed like application code, and applied automatically by a pipeline or platform service.

Instead of creating a topic manually with a command, a team defines what it needs:

name: customer.orders.created.v1
owner: data-platform
domain: customer
description: Events emitted when a customer order is created

partitions: 12
replicationFactor: 3

retention:
  policy: delete
  retentionMs: 604800000  # 7 days

schema:
  format: avro
  compatibility: backward

access:
  producers:
    - commerce-api
  consumers:
    - fraud-detection
    - analytics-platform
    - customer-service-dashboard

classification:
  dataSensitivity: internal
  containsPersonalData: true

A platform automation can then read this definition and create or update the required resources:

  • Kafka topic
  • Topic configuration
  • Access control rules
  • Schema registration
  • Metadata catalog entry
  • Monitoring dashboard
  • Alerts
  • Ownership information

This turns Kafka governance into an engineering workflow. It also gives teams a history of changes. If retention was changed from seven days to thirty days, the change is visible in Git. If a new consumer was granted access, there is a review trail. If a topic was deprecated, the lifecycle is documented.

For data engineering teams, this is much more reliable than maintaining Kafka configuration manually.
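Here is a minimal sketch of the translation step such an automation might start with. The Python function below turns a parsed topic definition into the name, partition count, replication factor, and topic configs needed to create the topic. It assumes the YAML file has already been parsed into a dictionary. The function name and output shape are hypothetical. cleanup.policy and retention.ms are real Kafka topic-level config keys.

```python
from typing import Any


def to_topic_spec(definition: dict[str, Any]) -> dict[str, Any]:
    """Translate a parsed topic-as-code definition into the values a
    topic-creation call needs (hypothetical output shape)."""
    retention = definition.get("retention", {})
    # Map the definition's field names onto Kafka's topic-level config keys.
    config = {"cleanup.policy": retention.get("policy", "delete")}
    if "retentionMs" in retention:
        # Topic configs are passed to Kafka admin APIs as string values.
        config["retention.ms"] = str(retention["retentionMs"])
    return {
        "name": definition["name"],
        "num_partitions": int(definition["partitions"]),
        "replication_factor": int(definition["replicationFactor"]),
        "config": config,
    }


# Part of the YAML example above, as a YAML parser would return it.
definition = {
    "name": "customer.orders.created.v1",
    "owner": "data-platform",
    "partitions": 12,
    "replicationFactor": 3,
    "retention": {"policy": "delete", "retentionMs": 604800000},
}
print(to_topic_spec(definition))
```

Keeping this translation separate from the code that talks to the cluster makes it easy to test in CI. With the confluent-kafka Python client, for example, these values line up with the arguments of NewTopic, which is then passed to AdminClient.create_topics.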

Naming Conventions Are the First Layer of Governance

A naming convention is one of the cheapest and most useful Kafka governance tools. Good topic names make event streams easier to discover, understand, and manage. Bad topic names create confusion for years.

A practical topic naming pattern could look like this:

...

Examples:

customer.profile.updated.v1
orders.order.created.v1
payments.payment.authorized.v1
logistics.shipment.delayed.v1
iot.sensor.temperature-recorded.v1

This pattern gives consumers useful information immediately:

  • The business domain
  • The entity or subject
  • The event that happened
  • The version of the event contract

There is no single perfect naming convention for every organization. What matters is that the convention is documented, enforced, and easy to understand.

Avoid topic names like events, data, test, orders2, new-topic, or stream-prod-final. These names may be convenient in the moment, but they do not scale across teams.

A good rule of thumb:

A data engineer should be able to look at a topic name and make a reasonable guess about what kind of event it contains.
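A naming standard is easiest to enforce as a machine check. The sketch below assumes a four-part convention inferred from the example names and the list above: domain, entity, event, and a trailing vN version. It also assumes lowercase parts with hyphens allowed inside a part. The regular expression is illustrative; adapt it to whatever convention your organization documents.

```python
import re

# Assumed convention, inferred from the example names: four dot-separated parts
# (domain, entity, event, version). Parts are lowercase, may contain digits and
# hyphens, and the version is "v" followed by a positive integer.
TOPIC_NAME = re.compile(r"[a-z][a-z0-9-]*\.[a-z][a-z0-9-]*\.[a-z][a-z0-9-]*\.v[1-9][0-9]*")


def is_valid_topic_name(name: str) -> bool:
    """Return True if the whole name matches the assumed convention."""
    return TOPIC_NAME.fullmatch(name) is not None


for name in ["iot.sensor.temperature-recorded.v1", "orders2", "stream-prod-final"]:
    print(name, is_valid_topic_name(name))
```

All five example names above pass this check. Names like events, orders2, or new-topic fail, as does a three-part typo such as order.created.v1.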

Partitions and Retention Are Governance Decisions Too

Kafka partitions are often explained as a scalability mechanism. That is true, but from a data engineering perspective, partitions are also a design decision with long-term consequences.

The number of partitions affects:

  • Parallelism
  • Consumer group scaling
  • Ordering guarantees
  • Resource usage
  • Operational complexity
  • Future growth options

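Choosing a partition count arbitrarily is rarely a good idea. Kafka lets you add partitions to an existing topic but not remove them. Adding partitions also changes which partition a given key maps to, so new records for a key can land on a different partition than its older records. A topic that receives a few events per minute does not need the same partitioning strategy as a high-volume clickstream topic. A topic that requires strict ordering by customer ID may need a different keying strategy than a topic used for independent sensor events.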
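The link between keys, partitions, and ordering can be shown with a short sketch. Hash-based partitioning sends every record with the same key to the same partition, which is what preserves per-key ordering. However, the mapping depends on the partition count. The code uses zlib.crc32 purely as a stand-in hash. The Java client's default partitioner uses murmur2, so real partition numbers differ, but the modulo effect is the same. The function name and customer IDs are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Hash the key bytes and take the result modulo the partition count.
    NOTE: CRC32 is a stand-in here, not the hash Kafka's Java client uses."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


customer_ids = [f"customer-{i}" for i in range(100)]

# With a fixed partition count, a key always maps to the same partition.
before = {key: partition_for(key, 12) for key in customer_ids}

# Growing the topic from 12 to 16 partitions remaps many keys.
moved = sum(before[key] != partition_for(key, 16) for key in customer_ids)
print(f"{moved} of {len(customer_ids)} keys map to a different partition")
```

This is why the initial partition count for a keyed, order-sensitive topic deserves a deliberate decision. Increasing it later is possible, but records for a moved key can then be read out of order across the old and new partitions.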

Retention is similar. Kafka can retain data for a short time, a long time, or compact records based on keys. Each option has implications. Ask:

  • Is this topic used for real-time processing only?
  • Do consumers need to replay historical data?
  • Is the topic part of an audit trail?
  • Does the data contain personal or regulated information?
  • How much storage will the topic use over time?
  • Should old records be deleted or compacted?

Retention should not be left to defaults. It should reflect the use case. For example:

Topic type                     Possible retention approach
-----------------------------  -------------------------------------
Operational event stream       3–7 days
Analytics ingestion stream     7–30 days
Audit-related event stream     Longer retention, depending on policy
State-like changelog topic     Log compaction
Temporary development topic    Short retention

These are not universal rules. They are starting points for discussion. The important point is this: partitioning and retention should be part of the topic design, not an afterthought.
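To keep retention out of the defaults, a platform can turn topic types into explicit configs. The sketch below treats two of the table's starting points as hypothetical bounds: 3 to 7 days for operational streams and 7 to 30 days for analytics ingestion. It maps state-like changelog topics to compaction. The topic-type names and the function are illustrative. cleanup.policy and retention.ms are real Kafka topic configs.

```python
from typing import Optional

# Hypothetical policy derived from the table's starting points, in days.
# Audit topics are left out because their retention depends on external policy.
RETENTION_BOUNDS_DAYS = {
    "operational": (3, 7),
    "analytics-ingestion": (7, 30),
}
MS_PER_DAY = 24 * 60 * 60 * 1000


def retention_config(topic_type: str, days: Optional[int] = None) -> dict[str, str]:
    """Return Kafka topic configs for a topic type, rejecting out-of-policy values."""
    if topic_type == "changelog":
        # State-like topics keep the latest record per key instead of expiring by age.
        return {"cleanup.policy": "compact"}
    bounds = RETENTION_BOUNDS_DAYS.get(topic_type)
    if bounds is None:
        raise ValueError(f"no retention guidance for topic type {topic_type!r}")
    low, high = bounds
    if days is None or not low <= days <= high:
        raise ValueError(f"{topic_type} topics need {low}-{high} days of retention, got {days}")
    return {"cleanup.policy": "delete", "retention.ms": str(days * MS_PER_DAY)}


print(retention_config("operational", 7))
# → {'cleanup.policy': 'delete', 'retention.ms': '604800000'}
```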

Schema Governance: The Missing Half of Topic Governance

A topic name tells you where events are. A schema tells you what those events mean.

Without schema governance, Kafka can become a collection of loosely structured messages that are difficult to trust. This is especially dangerous when multiple teams consume the same topic.

Schema governance helps answer questions like:

  • Which fields are required?
  • Which fields are optional?
  • What data types are expected?
  • What does each field mean?
  • Can a producer add new fields?
  • Can a producer remove fields?
  • Are changes backward compatible?
  • How are breaking changes handled?

Common schema formats in Kafka environments include Avro, JSON Schema, and Protobuf. The specific format matters less than the discipline around compatibility and ownership.

A simple rule is:

Producers should not be able to break consumers silently.

That means schema changes should be checked before they reach production. If a change removes a required field or changes a data type in an incompatible way, automation should catch it. A mature Kafka platform can include schema checks in CI/CD:

  1. Developer changes event schema.
  2. Pipeline validates that the schema is well-formed.
  3. Pipeline checks compatibility.
  4. Breaking changes are rejected.
  5. Compatible schemas are registered.
  6. Application deployment continues.

This is governance as automation, not governance as paperwork.
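The compatibility step in such a pipeline is normally delegated to a schema registry. Here is a deliberately simplified sketch for flat Avro-style record schemas that shows what the check examines. In the registry sense, backward compatibility means consumers on the new schema can read events written with the old one. Forward compatibility means consumers still on the old schema can read events written with the new one. The function name is hypothetical. The sketch ignores type promotion, unions, aliases, and nested records, all of which a real registry handles.

```python
from typing import Any


def _fields(schema: dict[str, Any]) -> dict[str, dict[str, Any]]:
    return {field["name"]: field for field in schema["fields"]}


def compatibility_errors(old: dict[str, Any], new: dict[str, Any], mode: str = "backward") -> list[str]:
    """Simplified compatibility check for flat record schemas."""
    if mode not in ("backward", "forward", "full"):
        raise ValueError(f"unsupported compatibility mode: {mode}")
    old_fields, new_fields = _fields(old), _fields(new)
    errors = []
    for name in sorted(old_fields.keys() & new_fields.keys()):
        if old_fields[name]["type"] != new_fields[name]["type"]:
            errors.append(f"field '{name}' changed type")
    if mode in ("backward", "full"):
        # New readers need a default for fields that old events do not contain.
        for name in sorted(new_fields.keys() - old_fields.keys()):
            if "default" not in new_fields[name]:
                errors.append(f"added field '{name}' has no default")
    if mode in ("forward", "full"):
        # Old readers need a default for fields that new events no longer contain.
        for name in sorted(old_fields.keys() - new_fields.keys()):
            if "default" not in old_fields[name]:
                errors.append(f"removed field '{name}' has no default")
    return errors


v1 = {"type": "record", "name": "OrderCreated", "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
]}
v2 = {"type": "record", "name": "OrderCreated", "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "EUR"},
]}
print(compatibility_errors(v1, v2))  # → []
```

In CI, this would run against the latest registered version, and a non-empty list would fail the build before the schema is registered.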

Access Control Is Part of Data Governance

Kafka security is often discussed as an infrastructure topic. But for data engineers, access control is also a data governance topic.

Kafka topics may contain customer events, financial transactions, system logs, operational metrics, or personally identifiable information. Not every application should be able to read or write every topic.

A governed Kafka environment should define:

  • Who can create topics
  • Who can produce to each topic
  • Who can consume from each topic
  • Who can change topic configuration
  • Who can delete topics
  • Who can grant access to others

Broad permissions are convenient, but risky. For example, an application that only needs to consume orders.order.created.v1 should not have read access to every topic in the cluster. A producer for payment events should not be able to write to unrelated customer profile topics.

Access control should follow the same principle as other production systems:

Grant only the access that is needed, and make that access visible.

This is another area where automation helps. If topic access is defined together with the topic itself, teams can review access changes before they are applied.

access:
  producers:
    - order-service
  consumers:
    - analytics-ingestion
    - fulfillment-service
    - fraud-detection

This is easier to review than a collection of manual permission changes made over time.
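Kafka's built-in authorization uses ACLs. Each ACL names a principal, an operation such as WRITE or READ, and a resource such as a topic or a consumer group. A consuming application needs READ on the topic and READ on its consumer group. The sketch below derives literal, per-topic rules from the access block above. Several parts are hypothetical: the AclRule type, the User:<app> principal mapping, and the rule that each consumer group is named after its application. A platform would then apply the rules through its admin tooling.

```python
from typing import NamedTuple


class AclRule(NamedTuple):
    principal: str
    operation: str
    resource_type: str
    resource_name: str


def acl_rules(topic: str, access: dict[str, list[str]]) -> list[AclRule]:
    """Least-privilege rules for one topic: literal names only, never wildcards."""
    rules = []
    for app in access.get("producers", []):
        rules.append(AclRule(f"User:{app}", "WRITE", "TOPIC", topic))
    for app in access.get("consumers", []):
        rules.append(AclRule(f"User:{app}", "READ", "TOPIC", topic))
        # Assumption: each consuming application uses a group named after itself.
        rules.append(AclRule(f"User:{app}", "READ", "GROUP", app))
    return rules


access = {
    "producers": ["order-service"],
    "consumers": ["analytics-ingestion", "fulfillment-service", "fraud-detection"],
}
for rule in acl_rules("orders.order.created.v1", access):
    print(rule)
```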

Self-Service Kafka Without Chaos

A common challenge in data platforms is finding the balance between control and speed. If every Kafka topic requires a manual ticket to the platform team, delivery slows down and teams may look for shortcuts. If every team can create anything at any time, the platform becomes chaotic.

The better model is governed self-service. A self-service Kafka workflow could look like this:

  1. A data or application team submits a topic definition.
  2. Automated checks validate the name, owner, schema, retention, and access rules.
  3. Low-risk changes are approved automatically.
  4. High-risk changes require review.
  5. The platform creates the topic and related resources.
  6. Metadata is published to a catalog.
  7. Monitoring is configured automatically.

This approach gives teams autonomy while still enforcing standards. The key is to move governance into the platform. Instead of asking humans to remember every rule, encode the rules into automation. Examples of automated checks:

  • Topic name follows the required pattern.
  • Owner field is present.
  • Retention is within allowed limits.
  • Sensitive topics require stricter access rules.
  • Schema compatibility is valid.
  • Consumer group names follow standards.
  • Production topics cannot use development prefixes.
  • Deprecated topics cannot receive new consumers.

This makes governance repeatable. It also makes platform teams more effective. Instead of manually creating topics all day, they design the rules and automation that allow other teams to move safely.
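Several of these checks can be sketched as one Python function that returns a list of violations for a proposed definition. The following are all assumptions chosen for illustration: the 30-day retention limit, the development prefixes, the rule that personal-data topics may not grant wildcard (*) consumer access, and the field names. Name-pattern and schema checks would run as separate steps.

```python
from typing import Any, Optional

MAX_RETENTION_MS = 30 * 24 * 60 * 60 * 1000  # hypothetical platform limit: 30 days
DEV_PREFIXES = ("dev.", "test.", "sandbox.")  # hypothetical development prefixes


def policy_violations(topic: dict[str, Any], environment: str,
                      previous: Optional[dict[str, Any]] = None) -> list[str]:
    """Return human-readable violations; an empty list means every check passed."""
    problems = []
    if not topic.get("owner"):
        problems.append("owner is missing")
    retention_ms = topic.get("retention", {}).get("retentionMs")
    if retention_ms is not None and retention_ms > MAX_RETENTION_MS:
        problems.append("retention exceeds the allowed maximum")
    if environment == "prod" and topic.get("name", "").startswith(DEV_PREFIXES):
        problems.append("production topics cannot use development prefixes")
    consumers = set(topic.get("access", {}).get("consumers", []))
    if topic.get("classification", {}).get("containsPersonalData") and "*" in consumers:
        problems.append("personal-data topics cannot grant wildcard consumer access")
    if previous and previous.get("lifecycle", {}).get("status") == "deprecated":
        new_consumers = consumers - set(previous.get("access", {}).get("consumers", []))
        if new_consumers:
            problems.append(f"deprecated topic cannot gain new consumers: {sorted(new_consumers)}")
    return problems


proposal = {
    "name": "orders.order.created.v1",
    "owner": "commerce-team",
    "retention": {"policy": "delete", "retentionMs": 7 * 24 * 60 * 60 * 1000},
    "access": {"producers": ["order-service"], "consumers": ["fraud-detection"]},
}
print(policy_violations(proposal, environment="prod"))  # → []
```

One way to separate low-risk from high-risk changes is to auto-approve a change with no violations and send the rest to review.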

Metadata and Discoverability Matter

Kafka topics are often created for one team but later become useful to others. That only works if people can find and understand them.

A Kafka topic catalog does not need to be complicated at first. Even a simple internal page or metadata repository can help. Useful metadata includes:

  • Topic name
  • Description
  • Owner
  • Producing system
  • Consuming systems
  • Schema link
  • Data classification
  • Retention policy
  • Example event
  • Contact channel
  • Deprecation status

This turns Kafka from a hidden technical system into a discoverable data platform. For data engineers, discoverability is important because duplicate pipelines often appear when teams do not know that useful data already exists. One team creates a topic for customer updates. Another team creates a similar topic months later because they could not find the first one.

Governance should make reuse easier.
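A catalog can also be queried before a new topic is created. As a rough heuristic, the sketch below assumes string similarity is a useful signal and uses Python's difflib to flag existing topics whose names are close to the requested one. This catches the orders.created.v1 versus order.created.v1 case described earlier. The function name and the 0.85 cutoff are arbitrary starting points, not a recommendation.

```python
import difflib


def similar_topics(requested: str, existing: list[str], cutoff: float = 0.85) -> list[str]:
    """Return up to three existing topic names that look like near-duplicates."""
    return difflib.get_close_matches(requested, existing, n=3, cutoff=cutoff)


catalog = ["orders.created.v1", "customer.profile.updated.v1", "payments.payment.authorized.v1"]
print(similar_topics("order.created.v1", catalog))  # → ['orders.created.v1']
```

Name similarity will not find a topic with the same meaning but a different name. Descriptions and owners in the catalog still matter.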

Observability: Governance After Creation

Kafka governance does not end when the topic is created. A topic can be well-designed on day one and still become unhealthy later. Producers may stop sending data. Consumers may fall behind. Storage may grow faster than expected. Schemas may evolve. Teams may stop using a topic without deleting it.

Data engineering teams should monitor:

  • Producer throughput
  • Consumer lag
  • Error rates
  • Message size
  • Storage usage
  • Under-replicated partitions
  • Schema compatibility failures
  • Unauthorized access attempts
  • Topic age and activity
  • Deprecated or unused topics

Consumer lag deserves special attention. It tells you whether consumers are keeping up with the stream. A topic may look healthy from the producer side while downstream consumers are hours behind.
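Consumer lag per partition is the log end offset minus the offset the consumer group has committed. The sketch below computes it from two dictionaries that, in practice, would be fetched from the cluster with a Kafka client. The function names, example offsets, and alert rule are assumptions for illustration. The alert fires when total lag exceeds a threshold or when a partition has no committed offset at all.

```python
from typing import Optional


def consumer_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, Optional[int]]:
    """Lag per partition; partitions without a committed offset are reported as None."""
    lag: dict[int, Optional[int]] = {}
    for partition, end_offset in end_offsets.items():
        offset = committed.get(partition)
        lag[partition] = None if offset is None else max(end_offset - offset, 0)
    return lag


def should_alert(lag: dict[int, Optional[int]], threshold: int) -> bool:
    # Alert on total known lag, or when some partition was never committed.
    known = [value for value in lag.values() if value is not None]
    return sum(known) > threshold or len(known) < len(lag)


lag = consumer_lag({0: 1_000, 1: 980, 2: 1_200}, {0: 1_000, 1: 400, 2: 1_150})
print(lag, should_alert(lag, threshold=500))  # → {0: 0, 1: 580, 2: 50} True
```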

A practical Kafka platform should automatically create dashboards and alerts when a topic is provisioned. For example:

  • Alert if consumer lag exceeds a threshold.
  • Alert if no messages are produced for a critical topic.
  • Alert if storage grows unexpectedly.
  • Alert if partitions become under-replicated.
  • Alert if a schema compatibility check fails.

This is where Kafka operations and data engineering meet. Reliable event streaming is not only about sending messages. It is about knowing whether the entire data flow is working.

Lifecycle Management: Topics Should Not Live Forever by Accident

Many Kafka environments accumulate old topics. Some were created for tests. Some belonged to retired applications. Some were replaced by newer versions. Some no longer have active consumers. Some still receive data, but nobody knows why.

Without lifecycle management, Kafka becomes harder to operate and understand. A topic should have a lifecycle:

  1. Proposed
  2. Approved
  3. Active
  4. Deprecated
  5. Retired
  6. Deleted or archived

Deprecation is especially important. If a topic is replaced by a new version, consumers need time to migrate. The old topic should be marked as deprecated, but not removed immediately. Owners should know who still consumes it. Consumers should receive a migration deadline. After the migration period, the topic can be retired safely.

A simple metadata field can make this visible:

lifecycle:
  status: deprecated
  replacementTopic: customer.profile.updated.v2
  deprecationDate: 2026-06-01
  plannedRemovalDate: 2026-09-01

This is much better than deleting topics without warning or leaving old topics forever.
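Lifecycle rules can be encoded as a table of allowed transitions plus a few metadata checks. The sketch below assumes stages that move strictly forward, matching the list above, and validates the deprecation fields from the YAML example. The transition table and function names are hypothetical.

```python
from datetime import date
from typing import Any

# Allowed transitions between the lifecycle stages (strictly forward, by assumption).
TRANSITIONS = {
    "proposed": {"approved"},
    "approved": {"active"},
    "active": {"deprecated"},
    "deprecated": {"retired"},
    "retired": {"deleted", "archived"},
}


def check_transition(current: str, target: str) -> None:
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move a topic from {current!r} to {target!r}")


def _as_date(value: Any) -> date:
    # YAML parsers may already return dates; strings are parsed as ISO dates.
    return value if isinstance(value, date) else date.fromisoformat(str(value))


def deprecation_problems(lifecycle: dict[str, Any]) -> list[str]:
    if lifecycle.get("status") != "deprecated":
        return []
    problems = []
    if not lifecycle.get("replacementTopic"):
        problems.append("deprecated topic should name its replacement")
    if "deprecationDate" not in lifecycle or "plannedRemovalDate" not in lifecycle:
        problems.append("deprecation and removal dates are required")
    elif _as_date(lifecycle["plannedRemovalDate"]) <= _as_date(lifecycle["deprecationDate"]):
        problems.append("planned removal must come after deprecation")
    return problems


lifecycle = {
    "status": "deprecated",
    "replacementTopic": "customer.profile.updated.v2",
    "deprecationDate": "2026-06-01",
    "plannedRemovalDate": "2026-09-01",
}
check_transition("active", "deprecated")  # allowed, so no exception
print(deprecation_problems(lifecycle))  # → []
```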

Common Kafka Governance Mistakes

Many Kafka problems are not caused by Kafka itself. They are caused by missing decisions around Kafka. Here are common mistakes to avoid.

  1. Treating Kafka like a simple queue. Kafka can support queue-like patterns, but it is more than a queue. It is a distributed event log with replay, retention, partitions, and multiple independent consumers. Designing topics as if they were temporary queues often leads to poor data contracts.
  2. Creating topics without owners. Every production topic needs an owner. Without ownership, incidents become difficult, schema changes become risky, and deprecation becomes almost impossible.
  3. Ignoring naming standards. A weak naming convention creates long-term confusion. Topic names should communicate domain, meaning, and version.
  4. Forgetting schema compatibility. If producers can change event structures freely, consumers will eventually break. Schema compatibility checks should be automated.
  5. Using default retention everywhere. Retention should match the use case. Some topics need short retention. Some need replayability. Some need compaction. Defaults are not a strategy.
  6. Granting broad access. Access should be specific. Applications should produce and consume only what they need.
  7. Not monitoring consumer lag. Consumer lag is one of the most important Kafka signals. If consumers fall behind, data may technically be flowing but business processes may still be delayed.
  8. Never retiring topics. Old topics create noise and operational overhead. A deprecation and retirement process keeps Kafka understandable.
  9. Solving governance manually. Manual governance does not scale. Standards should be encoded into automation wherever possible.

A Practical Governance Checklist

Before creating a production Kafka topic, ask these questions:

  • Does the topic name follow the standard?
  • Is there a clear owner?
  • Is the business domain documented?
  • Is the producing system known?
  • Are expected consumers known?
  • Is the event schema defined?
  • Are compatibility rules clear?
  • Is the partition count justified?
  • Is the retention policy appropriate?
  • Is the data classification known?
  • Are producer and consumer permissions defined?
  • Will the topic appear in a catalog?
  • Will monitoring be created?
  • Is there a lifecycle or deprecation plan?

If this sounds like a lot, that is exactly why automation matters. The goal is not for every engineer to fill out long forms manually. The goal is to create a platform workflow where these questions are answered as part of the normal development process.
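One way to answer the checklist automatically is to derive most answers from the topic definition itself and report what is missing. The sketch below maps several checklist questions to fields of a definition shaped like the topic-as-code example above, plus a lifecycle block like the one in the lifecycle section. The mapping is an assumption. Questions such as whether a partition count is justified still need human review.

```python
from typing import Any

# Hypothetical mapping from checklist questions to dotted paths in a definition.
CHECKLIST = {
    "clear owner": "owner",
    "business domain": "domain",
    "producing system": "access.producers",
    "expected consumers": "access.consumers",
    "compatibility rules": "schema.compatibility",
    "partition count": "partitions",
    "retention policy": "retention.policy",
    "data classification": "classification.dataSensitivity",
    "lifecycle plan": "lifecycle.status",
}


def _lookup(definition: dict[str, Any], dotted: str) -> Any:
    value: Any = definition
    for key in dotted.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def unanswered(definition: dict[str, Any]) -> list[str]:
    """List checklist questions whose field is missing or empty."""
    return [question for question, path in CHECKLIST.items()
            if _lookup(definition, path) in (None, "", [])]


draft = {
    "owner": "data-platform",
    "domain": "customer",
    "partitions": 12,
    "retention": {"policy": "delete"},
    "schema": {"format": "avro", "compatibility": "backward"},
    "access": {"producers": ["commerce-api"], "consumers": []},
}
print(unanswered(draft))  # → ['expected consumers', 'data classification', 'lifecycle plan']
```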

What Data Engineers Should Learn Next

For data engineers, Kafka knowledge should go beyond producing and consuming messages. A practical learning path looks like this:

  1. Understand topics, partitions, offsets, producers, and consumers.
  2. Learn how consumer groups scale processing.
  3. Understand how keys affect partitioning and ordering.
  4. Learn topic configuration, especially retention and cleanup policies.
  5. Work with schemas and compatibility rules.
  6. Understand Kafka Connect for system integration.
  7. Learn access control and security basics.
  8. Monitor consumer lag, throughput, and broker health.
  9. Practice topic creation automation.
  10. Design event streams as governed data products.

This is where Kafka becomes more than a tool. It becomes part of the data platform.

Conclusion: Governed Kafka Is Better Kafka

Apache Kafka is powerful because it allows many systems to exchange events reliably and at scale. But that same power can create complexity when teams, topics, schemas, and consumers grow.

For data engineers, the real challenge is not only learning how Kafka works. It is learning how to design Kafka usage so that other teams can trust it. That means:

  • treating topics as data contracts
  • defining naming standards
  • managing schemas carefully
  • controlling access
  • monitoring data flows
  • automating topic creation instead of relying on manual processes
  • retiring what is no longer needed

Good Kafka governance is not about slowing teams down. It is about making Kafka safe enough for teams to move faster. When topic creation, schemas, permissions, metadata, and monitoring are automated, Kafka becomes easier to use and easier to operate. That is the foundation of a reliable real-time data platform.

For data engineers, this is the practical side of Kafka: not just sending events, but building event streams that are discoverable, trustworthy, and production-ready.
