deduplicate
Removes duplicate events based on a common key.
deduplicate [key:any, limit=int, distance=int, create_timeout=duration, write_timeout=duration, read_timeout=duration]
Description
Section titled “Description”The deduplicate
operator removes duplicates from a stream of events, based
on the value of one or more fields.
key: any (optional)
Section titled “key: any (optional)”The key to deduplicate. To deduplicate multiple fields, use a record expression
like {foo: bar, baz: qux}
.
Defaults to this
, i.e., deduplicating entire events.
limit = int (optional)
Section titled “limit = int (optional)”The number of duplicate keys allowed before an event is suppressed.
Defaults to 1
, which is equivalent to removing all duplicates.
distance = int (optional)
Section titled “distance = int (optional)”Distance between two events that can be considered duplicates. A value of 1
means that only adjacent events can be considered duplicate.
When unspecified, the distance is infinite.
create_timeout = duration (optional)
Section titled “create_timeout = duration (optional)”The time that needs to pass until a surpressed event is no longer considered a duplicate. The timeout resets when the first event for a given key is let through.
write_timeout = duration (optional)
Section titled “write_timeout = duration (optional)”The time that needs to pass until a suppressed event is no longer considered a duplicate. The timeout resets when any event for a given key is let through.
For a limit of 1
, the write timeout is equivalent to the create timeout.
The write timeout must be smaller than the create timeout.
read_timeout = duration (optional)
Section titled “read_timeout = duration (optional)”The time that needs to pass until a suppressed event is no longer considered a duplicate. The timeout resets when a key is seen, even if the event is suppressed.
The read timeout must be smaller than the write and create timeouts.
Examples
Section titled “Examples”Simple deduplication
Section titled “Simple deduplication”Consider the following data:
{"foo": 1, "bar": "a"}{"foo": 1, "bar": "a"}{"foo": 1, "bar": "a"}{"foo": 1, "bar": "b"}{"foo": null, "bar": "b"}{"bar": "b"}{"foo": null, "bar": "b"}{"foo": null, "bar": "b"}
For deduplicate
, all duplicate events are removed:
{"foo": 1, "bar": "a"}{"foo": 1, "bar": "b"}{"foo": null, "bar": "b"}{"bar": "b"}
If deduplicate bar
is used, only the field bar
is considered when
determining whether an event is a duplicate:
{"foo": 1, "bar": "a"}{"foo": 1, "bar": "b"}
And for deduplicate foo
, only the field foo
is considered. Note, how the
missing foo
field is treated as if it had the value null
, i.e., it’s not
included in the output.
{"foo": 1, "bar": "a"}{"foo": null, "bar": "b"}
Get up to 10 warnings per hour for each run of a pipeline
Section titled “Get up to 10 warnings per hour for each run of a pipeline”diagnostics live=truededuplicate {id: pipeline_id, run: run}, limit=10, create_timeout=1h
Get an event whenever the node disconnected from the Tenzir Platform
Section titled “Get an event whenever the node disconnected from the Tenzir Platform”metrics "platform", live=truededuplicate connected, distance=1where not connected