Parse nested data
Logs are often a mess. Long lines of text without structure. Especially when onboarding data from Syslog feeds or other dump-everything-in-here data streams.
This guide shows you how to deconstruct such messy log lines into its atomic data values, so that you can begin proceed working with them in a structured fashion.
Deconstruct Syslog messages into atomic values
Section titled “Deconstruct Syslog messages into atomic values”Here’s a Syslog example:
2023-06-09T05:51:36.433Z mrs www[929]: I103 1703182696 request baddr=194.1.1.1 bport=58630 c_ctype=image/png caddr=194.1.1.1 cin=860 cout=898 cport=58630 duration=0.00 forwarded_for='192.1.1.1, 191.1.1.1' laddr=127.128.11.145 lport=1 method=GET paddr=172.0.0.1 pport=80 proto=6 relay_name=wwwserver response_code=200 rnum=6489895 s_ctype=image/png saddr=172.0.0.1 sin=898 sout=819 sport=80 status=OK url=http://test.de/images/icons/144x144.png
The prefix looks like Syslog, so we’ll attempt to parse it as Syslog parser.
from {line: "2023-06-09T05:51:36.433Z mrs www[929]: I103 1703182696 request baddr=194.1.1.1 bport=58630 c_ctype=image/png caddr=194.1.1.1 cin=860 cout=898 cport=58630 duration=0.00 forwarded_for='192.1.1.1, 191.1.1.1' laddr=127.128.11.145 lport=1 method=GET paddr=172.0.0.1 pport=80 proto=6 relay_name=wwwserver response_code=200 rnum=6489895 s_ctype=image/png saddr=172.0.0.1 sin=898 sout=819 sport=80 status=OK url=http://test.de/images/icons/144x144.png"}this = line.parse_syslog()
{ facility: null, severity: null, timestamp: 2023-06-09T05:51:36.433Z, hostname: "mrs", app_name: "www", process_id: "929", content: "I103 1703182696 request baddr=194.1.1.1 bport=58630 c_ctype=image/png caddr=194.1.1.1 cin=860 cout=898 cport=58630 duration=0.00 forwarded_for='192.1.1.1, 191.1.1.1' laddr=127.128.11.145 lport=1 method=GET paddr=172.0.0.1 pport=80 proto=6 relay_name=wwwserver response_code=200 rnum=6489895 s_ctype=image/png saddr=172.0.0.1 sin=898 sout=819 sport=80 status=OK url=http://test.de/images/icons/144x144.png",}
After parsing the Syslog framing, we can see that content
field contains the
actual log message. It has four sections:
I103
looks like some identifier.1703182696
looks like a Unix timestamp in seconds.request
looks like it specified an action.baddr=194.1.1.1 bport=58630 …
looks like a list of key-value pairs.
To parse an arbitrary string into values, the parse_grok
function comes in
handy. Let’s take apart the content
:
from {line: "2023-06-09T05:51:36.433Z mrs www[929]: I103 1703182696 request baddr=194.1.1.1 bport=58630 c_ctype=image/png caddr=194.1.1.1 cin=860 cout=898 cport=58630 duration=0.00 forwarded_for='192.1.1.1, 191.1.1.1' laddr=127.128.11.145 lport=1 method=GET paddr=172.0.0.1 pport=80 proto=6 relay_name=wwwserver response_code=200 rnum=6489895 s_ctype=image/png saddr=172.0.0.1 sin=898 sout=819 sport=80 status=OK url=http://test.de/images/icons/144x144.png"}this = line.parse_syslog()content = content.parse_grok("%{DATA:id} %{DATA:ts} %{DATA:action} %{GREEDYDATA:params}")
{ facility: null, severity: null, timestamp: 2023-06-09T05:51:36.433Z, hostname: "mrs", app_name: "www", process_id: "929", content: { id: "I103", ts: 1703182696, action: "request", params: "baddr=194.1.1.1 bport=58630 c_ctype=image/png caddr=194.1.1.1 cin=860 cout=898 cport=58630 duration=0.00 forwarded_for='192.1.1.1, 191.1.1.1' laddr=127.128.11.145 lport=1 method=GET paddr=172.0.0.1 pport=80 proto=6 relay_name=wwwserver response_code=200 rnum=6489895 s_ctype=image/png saddr=172.0.0.1 sin=898 sout=819 sport=80 status=OK url=http://test.de/images/icons/144x144.png", },}
The params
field follows a key-value pair pattern. The parse_kv
function
fits this pattern best:
from {line: "2023-06-09T05:51:36.433Z mrs www[929]: I103 1703182696 request baddr=194.1.1.1 bport=58630 c_ctype=image/png caddr=194.1.1.1 cin=860 cout=898 cport=58630 duration=0.00 forwarded_for='192.1.1.1, 191.1.1.1' laddr=127.128.11.145 lport=1 method=GET paddr=172.0.0.1 pport=80 proto=6 relay_name=wwwserver response_code=200 rnum=6489895 s_ctype=image/png saddr=172.0.0.1 sin=898 sout=819 sport=80 status=OK url=http://test.de/images/icons/144x144.png"}this = line.parse_syslog()content = content.parse_grok("%{DATA:id} %{DATA:ts} %{DATA:action} %{GREEDYDATA:params}")content.params = content.params.parse_kv()
{ facility: null, severity: null, timestamp: 2023-06-09T05:51:36.433Z, hostname: "mrs", app_name: "www", process_id: "929", content: { id: "I103", ts: 1703182696, action: "request", params: { baddr: 194.1.1.1, bport: 58630, c_ctype: "image/png", caddr: 194.1.1.1, cin: 860, cout: 898, cport: 58630, duration: 0.0, forwarded_for: "192.1.1.1, 191.1.1.1", laddr: 127.128.11.145, lport: 1, method: "GET", paddr: 172.0.0.1, pport: 80, proto: 6, relay_name: "wwwserver", response_code: 200, rnum: 6489895, s_ctype: "image/png", saddr: 172.0.0.1, sin: 898, sout: 819, sport: 80, status: "OK", url: "http://test.de/images/icons/144x144.png", }, },}
The above output still has several values that we can improve in terms of typing:
content.ts
: numeric value instead of timestampcontent.params.duration
: numeric value instead of durationcontent.params.forwarded_for
: string instead of list of IP addresses
Let’s fix those up:
from {line: "2023-06-09T05:51:36.433Z mrs www[929]: I103 1703182696 request baddr=194.1.1.1 bport=58630 c_ctype=image/png caddr=194.1.1.1 cin=860 cout=898 cport=58630 duration=0.00 forwarded_for='192.1.1.1, 191.1.1.1' laddr=127.128.11.145 lport=1 method=GET paddr=172.0.0.1 pport=80 proto=6 relay_name=wwwserver response_code=200 rnum=6489895 s_ctype=image/png saddr=172.0.0.1 sin=898 sout=819 sport=80 status=OK url=http://test.de/images/icons/144x144.png"}this = line.parse_syslog()content = content.parse_grok("%{DATA:id} %{DATA:ts} %{DATA:action} %{GREEDYDATA:params}")content.params = content.params.parse_kv()// Interpret the timestamp as a seconds since Unix epoch.content.ts = from_epoch(content.ts * 1s)// Interpret the duration parameter as seconds.content.params.duration = content.params.duration * 1s// Split the forwarded_for field into an array of IP addresses.content.params.forwarded_for = content.params.forwarded_for.split(", ").map(x, x.ip())
{ facility: null, severity: null, timestamp: 2023-06-09T05:51:36.433Z, hostname: "mrs", app_name: "www", process_id: "929", content: { id: "I103", ts: 2023-12-21T18:18:16Z, action: "request", params: { baddr: 194.1.1.1, bport: 58630, c_ctype: "image/png", caddr: 194.1.1.1, cin: 860, cout: 898, cport: 58630, duration: 0ns, forwarded_for: [ 192.1.1.1, 191.1.1.1, ], laddr: 127.128.11.145, lport: 1, method: "GET", paddr: 172.0.0.1, pport: 80, proto: 6, relay_name: "wwwserver", response_code: 200, rnum: 6489895, s_ctype: "image/png", saddr: 172.0.0.1, sin: 898, sout: 819, sport: 80, status: "OK", url: "http://test.de/images/icons/144x144.png", }, },}
Reconstruct values into the desired form
Section titled “Reconstruct values into the desired form”We’ve now reached an important state: all data is deconstructed into atomic values with proper types. We can’t take the data apart any further. This is the inflection point where we shift from disassembly of the original data into reassembly into the desired form. For example, you could map your events to OCSF.
In our toy example, we’ll just create a slightly flatter structure:
from {line: "2023-06-09T05:51:36.433Z mrs www[929]: I103 1703182696 request baddr=194.1.1.1 bport=58630 c_ctype=image/png caddr=194.1.1.1 cin=860 cout=898 cport=58630 duration=0.00 forwarded_for='192.1.1.1, 191.1.1.1' laddr=127.128.11.145 lport=1 method=GET paddr=172.0.0.1 pport=80 proto=6 relay_name=wwwserver response_code=200 rnum=6489895 s_ctype=image/png saddr=172.0.0.1 sin=898 sout=819 sport=80 status=OK url=http://test.de/images/icons/144x144.png"}this = line.parse_syslog()content = content.parse_grok("%{DATA:id} %{DATA:ts} %{DATA:action} %{GREEDYDATA:params}")content.params = content.params.parse_kv()content.ts = from_epoch(content.ts * 1s)content.params.duration = content.params.duration * 1scontent.params.forwarded_for = content.params.forwarded_for.split(", ").map(x, x.ip())this = { // We keep both timestamps, labelling the one from the message `time` and the // one from Syslog itself `forwarded_time`: time: content.ts, forwarded_time: timestamp, host: hostname, id: content.id, action: content.action, source: content.params.caddr, destination: content.params.saddr,}
{ time: 2023-12-21T18:18:16.000Z, forwarded_time: 2023-06-09T05:51:36.433Z, host: "mrs", id: "I103", action: "request", source: 194.1.1.1, destination: 172.0.0.1}
This pipeline structure is very common, except that we would perform the data shaping on live input as opposed to a single synthetic log line. Check out our Syslog integration page for further related topics.