Parse nested data
Logs are often a mess. There’s no other way to put it. Here’s an example that we found in the wild just recently:
2023-06-09T05:51:36.433Z mrs www[929]: I103 1703182696 request baddr=194.1.1.1 bport=58630 c_ctype=image/png caddr=194.1.1.1 cin=860 cout=898 cport=58630 duration=0.00 forwarded_for='192.1.1.1, 191.1.1.1' laddr=127.128.11.145 lport=1 method=GET paddr=172.0.0.1 pport=80 proto=6 relay_name=wwwserver response_code=200 rnum=6489895 s_ctype=image/png saddr=172.0.0.1 sin=898 sout=819 sport=80 status=OK url=http://test.de/images/icons/144x144.png
Let’s take this apart. First, we notice that this looks like Syslog. Let’s throw the Syslog parser at it.
load_file "sample.txt"read_syslog
{ facility: null, severity: null, timestamp: 2023-06-09T05:51:36.433Z, hostname: "mrs", app_name: "www", process_id: "929", content: "I103 1703182696 request baddr=194.1.1.1 bport=58630 c_ctype=image/png caddr=194.1.1.1 cin=860 cout=898 cport=58630 duration=0.00 forwarded_for='192.1.1.1, 191.1.1.1' laddr=127.128.11.145 lport=1 method=GET paddr=172.0.0.1 pport=80 proto=6 relay_name=wwwserver response_code=200 rnum=6489895 s_ctype=image/png saddr=172.0.0.1 sin=898 sout=819 sport=80 status=OK url=http://test.de/images/icons/144x144.png",}
The content
field appears to contain the actual log message. It has four sections:
I103
looks like some identifier.1703182696
looks like a Unix timestamp in seconds.request
looks like it specified an action.baddr=194.1.1.1 bport=58630 …
looks like a list of key-value pairs.
Let’s start by taking the content
field apart:
load_file "sample.txt"read_syslogcontent = content.parse_grok("%{DATA:id} %{DATA:ts} %{DATA:action} %{GREEDYDATA:params}")
{ facility: null, severity: null, timestamp: 2023-06-09T05:51:36.433Z, hostname: "mrs", app_name: "www", process_id: "929", content: { id: "I103", ts: 1703182696, action: "request", params: "baddr=194.1.1.1 bport=58630 c_ctype=image/png caddr=194.1.1.1 cin=860 cout=898 cport=58630 duration=0.00 forwarded_for='192.1.1.1, 191.1.1.1' laddr=127.128.11.145 lport=1 method=GET paddr=172.0.0.1 pport=80 proto=6 relay_name=wwwserver response_code=200 rnum=6489895 s_ctype=image/png saddr=172.0.0.1 sin=898 sout=819 sport=80 status=OK url=http://test.de/images/icons/144x144.png", },}
Let’s further take apart the content.params
:
load_file "sample.txt"read_syslogcontent = content.parse_grok("%{DATA:id} %{DATA:ts} %{DATA:action} %{GREEDYDATA:params}")content.params = content.params.parse_kv()
{ facility: null, severity: null, timestamp: 2023-06-09T05:51:36.433Z, hostname: "mrs", app_name: "www", process_id: "929", content: { id: "I103", ts: 1703182696, action: "request", params: { baddr: 194.1.1.1, bport: 58630, c_ctype: "image/png", caddr: 194.1.1.1, cin: 860, cout: 898, cport: 58630, duration: 0.0, forwarded_for: "192.1.1.1, 191.1.1.1", laddr: 127.128.11.145, lport: 1, method: "GET", paddr: 172.0.0.1, pport: 80, proto: 6, relay_name: "wwwserver", response_code: 200, rnum: 6489895, s_ctype: "image/png", saddr: 172.0.0.1, sin: 898, sout: 819, sport: 80, status: "OK", url: "http://test.de/images/icons/144x144.png", }, },}
Let’s fix up a few types:
load_file "sample.txt"read_syslogcontent = content.parse_grok("%{DATA:id} %{DATA:ts} %{DATA:action} %{GREEDYDATA:params}")content.params = content.params.parse_kv()// Interpret the timestamp as a seconds since Unix epoch.content.ts = from_epoch(content.ts * 1s)// Interpret the duration parameter as seconds.content.params.duration = content.params.duration * 1s// Split the forwarded_for field into an array of IP addresses.content.params.forwarded_for = content.params.forwarded_for.split(", ").map(x, x.ip())
{ facility: null, severity: null, timestamp: 2023-06-09T05:51:36.433Z, hostname: "mrs", app_name: "www", process_id: "929", content: { id: "I103", ts: 2023-12-21T18:18:16Z, action: "request", params: { baddr: 194.1.1.1, bport: 58630, c_ctype: "image/png", caddr: 194.1.1.1, cin: 860, cout: 898, cport: 58630, duration: 0ns, forwarded_for: [ 192.1.1.1, 191.1.1.1, ], laddr: 127.128.11.145, lport: 1, method: "GET", paddr: 172.0.0.1, pport: 80, proto: 6, relay_name: "wwwserver", response_code: 200, rnum: 6489895, s_ctype: "image/png", saddr: 172.0.0.1, sin: 898, sout: 819, sport: 80, status: "OK", url: "http://test.de/images/icons/144x144.png", }, },}
Let’s now reshape the data, keeping only the fields we’re interested in:
load_file "sample.txt"read_syslogcontent = content.parse_grok("%{DATA:id} %{DATA:ts} %{DATA:action} %{GREEDYDATA:params}")content.params = content.params.parse_kv()// Interpret the timestamp as a seconds since Unix epoch.content.ts = from_epoch(content.ts * 1s)// Interpret the duration parameter as seconds.content.params.duration = content.params.duration * 1s// Split the forwarded_for field into an array of IP addresses.content.params.forwarded_for = content.params.forwarded_for.split(", ").map(x, x.ip())this = { // We keep both timestamps, labelling the one from the message `time` and the // one from Syslog itself `forwarded_time`: time: content.ts, forwarded_time: timestamp, // Copy the other id and the action from the Syslog message: id: content.id, action: content.action, // Lastly, take all the params: ...content.params,}
{ time: 2023-12-21T18:18:16Z, forwarded_time: 2023-06-09T05:51:36.433Z, id: "I103", action: "request", baddr: 194.1.1.1, bport: 58630, c_ctype: "image/png", caddr: 194.1.1.1, cin: 860, cout: 898, cport: 58630, duration: 0ns, forwarded_for: [ 192.1.1.1, 191.1.1.1, ], laddr: 127.128.11.145, lport: 1, method: "GET", paddr: 172.0.0.1, pport: 80, proto: 6, relay_name: "wwwserver", response_code: 200, rnum: 6489895, s_ctype: "image/png", saddr: 172.0.0.1, sin: 898, sout: 819, sport: 80, status: "OK", url: "http://test.de/images/icons/144x144.png",}
Now we just need to switch from reading from a file to accepting Syslog over UDP
directly, and we’re done. That’s as simple as replacing load_file "…"
with
load_udp "…"
or load_tcp "…"
.