to_hive
Writes events to a URI using hive partitioning.
to_hive uri:string, partition_by=list<field>, format=string, [timeout=duration, max_size=int, compression=string]
Description
Hive partitioning is a partitioning scheme where a set of fields is used to
partition events. For each combination of these fields, a directory is derived
under which all events with the same field values will be stored. For example,
if the events are partitioned by the fields year and month, then the files in
the directory /year=2024/month=10 will contain all events where year == 2024
and month == 10.
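The path derivation described above can be sketched in a few lines of Python. This is an illustrative model only, not part of the operator itself; the function name and signature are invented for the example.

```python
def hive_partition_path(base_uri, partition_fields, event):
    """Sketch: derive the hive-style directory for an event and
    return the event with the partitioned fields elided."""
    # Each partition field becomes a "field=value" path segment.
    segments = [f"{field}={event[field]}" for field in partition_fields]
    # Partitioned fields are dropped from the stored event, since
    # their values are already encoded in the path.
    remaining = {k: v for k, v in event.items() if k not in partition_fields}
    return "/".join([base_uri.rstrip("/")] + segments), remaining

path, rest = hive_partition_path(
    "/tmp/out", ["year", "month"], {"year": 2024, "month": 10, "value": 42}
)
# path == "/tmp/out/year=2024/month=10"
# rest == {"value": 42}
```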
uri: string
The base URI for all partitions.
partition_by = list<field>
A list of fields that will be used for partitioning. Note that these fields
will be elided from the output, as their values are already specified by the
path.
format = string
The name of the format that will be used for writing, for example json or
parquet. This will also be used for the file extension.
timeout = duration (optional)
The time after which a new file will be opened for the same partition group.
Defaults to 5min.
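For instance, to rotate files more frequently than the default, the timeout can be lowered. This snippet is a sketch following the syntax shown in the examples below; the output path is hypothetical:

```tql
to_hive "/tmp/out/", partition_by=[a], format="json", timeout=1min
```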
max_size = int (optional)
The total file size after which a new file will be opened for the same
partition group. Note that files will typically be slightly larger than this
limit, because a new file is only opened after the limit has been exceeded.
Defaults to 100M.
compression = string (optional)
Compress the output files with the given compression algorithm. See the docs
for the compress operator for supported compression algorithms.
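As a sketch, assuming gzip is among the algorithms supported by the compress operator, compressed JSON output would look like this (hypothetical path):

```tql
to_hive "/tmp/out/", partition_by=[a], format="json", compression="gzip"
```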
Examples
Partition by a single field into local JSON files
from {a: 0, b: 0}, {a: 0, b: 1}, {a: 1, b: 2}
to_hive "/tmp/out/", partition_by=[a], format="json"
// This pipeline produces two files:
// -> /tmp/out/a=0/1.json:
// {"b": 0}
// {"b": 1}
// -> /tmp/out/a=1/2.json:
// {"b": 2}
Write a Parquet file into Azure Blob Store
Write as Parquet into the Azure Blob Filesystem, partitioned by year, month,
and day.
to_hive "abfs://domain/bucket", partition_by=[year, month, day], format="parquet"
// -> abfs://domain/bucket/year=<year>/month=<month>/day=<day>/<num>.parquet
Write partitioned JSON into an S3 bucket
Write JSON into S3, partitioned by year and month, opening a new file after
1 GB.
year = ts.year()
month = ts.month()
to_hive "s3://my-bucket/some/subdirectory", partition_by=[year, month], format="json", max_size=1G
// -> s3://my-bucket/some/subdirectory/year=<year>/month=<month>/<num>.json