Distributed bpftrace Deployment

Pixie can deploy bpftrace programs to your cluster, collect the resulting data, and display it using the Live UI. This tutorial will demonstrate how to run an iovisor/bpftrace program using a PxL script and discuss the guidelines for running arbitrary bpftrace code using Pixie.

Background

Much of the data in Pixie's no-instrumentation monitoring platform is collected with eBPF. Pixie Edge Modules (PEMs), deployed on the nodes in your cluster, use eBPF-based tracing to collect network transactions without any code changes.

One increasingly popular way to write BPF programs is to use bpftrace, an open source high-level tracing language for Linux. bpftrace is easy to use and is generally written as one-liners or stand-alone scripts.

Now, using Pixie, developers can dynamically run their own bpftrace programs on their cluster. Pixie will handle:

  • deploying the bpftrace program to all of the nodes in your cluster.
  • capturing the output of the bpftrace program from the printf statement.
  • pushing this data into a table which has a column per printf argument.
  • making the data available to be queried and visualized in the Pixie UI.
  • removing the probe(s) after a set expiration time.
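To make the "column per printf argument" idea concrete, here is a minimal Python sketch of how a format string written in the `name:%spec` convention used later in this tutorial could be split into column names. It only illustrates the naming convention; `printf_columns` and the regular expression are hypothetical and are not Pixie's actual parser.

```python
import re

# Matches "name:%spec" pairs such as "src_port:%d". This pattern is an
# assumption based on the printf shown in this tutorial, not Pixie's parser.
FIELD = re.compile(r"(\w+):(%[a-z]+)")


def printf_columns(fmt):
    """Return (column_name, format_spec) pairs in printf argument order."""
    return FIELD.findall(fmt)


fmt = "time_:%llu pid:%u src_ip:%s src_port:%d"
columns = printf_columns(fmt)
for name, spec in columns:
    print(f"{name:10} {spec}")
```

Each name before a colon becomes one column of the output table, and the matching printf argument supplies that column's value for every record.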

Limitations

This beta feature has limitations:

  • kprobes only (for now).
  • The probe must output its results using a printf statement, rather than another output function such as print, which dumps the contents of a map.*
  • The program can only contain one printf statement, because the output of that printf statement defines the record that will be pushed to our tables. Most programs in the bpftrace repository that have more than one printf statement can be easily adapted to meet this requirement.
  • The bpftrace program's END block will not be run, so the printf should not be in the END block.
  • Remove any calls to time(). This is a different type of print statement which violates the "only 1 printf" rule.
  • The column names in the printf statement cannot contain any whitespace.
  • To output time in a manner that is recognizable by Pixie, label the column time_ and pass the argument nsecs.

*Note that it is possible to convert many of the existing bpftrace tools that accumulate data into a global structure to instead print the data on a regular interval using an interval block. pidpersec.bt is a good example of this design pattern.
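Several of the rules above can be checked mechanically before a program is deployed. The following is a rough Python sketch of such a check, covering only three rules: a single printf, no time() calls, and no printf in the END block. The `check_program` helper and the sample program strings are hypothetical, and the regex-based matching is a simplification that assumes the END block contains no nested braces.

```python
import re


def check_program(src):
    """Return a list of rule violations found in a bpftrace program string."""
    problems = []

    # Rule: the program may contain only one printf statement.
    printfs = len(re.findall(r"\bprintf\s*\(", src))
    if printfs != 1:
        problems.append(f"expected exactly 1 printf, found {printfs}")

    # Rule: calls to time() must be removed.
    if re.search(r"\btime\s*\(", src):
        problems.append("remove calls to time()")

    # Rule: the END block is never run, so printf must not live there.
    # Simplification: assumes no nested braces inside the END block.
    end_block = re.search(r"^END\s*\{([^}]*)\}", src, re.MULTILINE)
    if end_block and "printf" in end_block.group(1):
        problems.append("printf must not be in the END block")

    return problems


# Made-up sample programs, used only to exercise the checks.
good = 'kprobe:tcp_drop { printf("time_:%llu pid:%u", nsecs, pid); }'
bad = (
    'kprobe:tcp_drop { time("%H:%M:%S "); printf("pid:%u", pid); }\n'
    'END { printf("done"); }'
)

print(check_program(good))
print(check_program(bad))
```

A check like this would not catch every limitation (for example, it does not verify the probe type), so it is best treated as a quick pre-flight test rather than a guarantee.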

Tutorial

In this demo, we'll deploy Dale Hamel's bpftrace TCP retransmit tool using Pixie. TCP retransmits are usually a sign of poor network health, and this open-source tool will help us discover whether any connections in our cluster are experiencing a high number of retransmits.

Running the PxL Script in the Live UI

We've incorporated this trace into a PxL script called px/tcp_retransmits. To run this script:

  • Open up Pixie's Live View and select your cluster.
  • Select the px/tcp_retransmits script using the drop-down script menu or with Pixie Command. Pixie Command can be opened with the ctrl/cmd+k keyboard shortcut.
  • Run the script using the Run button in the top right, or with the ctrl/cmd+enter keyboard shortcut.

Once the probe is deployed to all the nodes in the cluster, it will begin to push data into a table. The PxL script queries this data, and the Vis Spec defines how the data will be displayed.

Pixie Live UI view of TCP Retransmissions

In the Live View, we'll see a graph of the pods (hexagonal grey box icons) and the services (hexagonal grey tree icons) that are experiencing TCP retransmits.

The color and weight of the arrows between these entities indicate the number of retransmits. Hovering over an arrow will display the number of retransmits for a particular connection. The data displayed in this graph can also be seen in the Data Drawer (use the ctrl/cmd+d keyboard shortcut to open and close this table).

In this particular example, the 3 pods experiencing high levels of retransmits are located on the same node, perhaps indicating an issue with that particular node.

How does the PxL script work?

  1  # Copyright (c) Pixie Labs, Inc.
  2  # Licensed under the Apache License, Version 2.0 (the "License")
  3
  4  import pxtrace
  5  import px
  6
  7  # Adapted from https://github.com/iovisor/bpftrace/blob/master/tools/tcpretrans.bt
  8  program = """
  9  // tcpretrans.bt Trace or count TCP retransmits
 10  // For Linux, uses bpftrace and eBPF.
 11  //
 12  // Copyright (c) 2018 Dale Hamel.
 13  // Licensed under the Apache License, Version 2.0 (the "License")
 14
 15  #include <linux/socket.h>
 16  #include <net/sock.h>
 17
 18  BEGIN
 19  {
 20    // See include/net/tcp_states.h:
 21    @tcp_states[1] = \"ESTABLISHED\";
 22    @tcp_states[2] = \"SYN_SENT\";
 23    @tcp_states[3] = \"SYN_RECV\";
 24    @tcp_states[4] = \"FIN_WAIT1\";
 25    @tcp_states[5] = \"FIN_WAIT2\";
 26    @tcp_states[6] = \"TIME_WAIT\";
 27    @tcp_states[7] = \"CLOSE\";
 28    @tcp_states[8] = \"CLOSE_WAIT\";
 29    @tcp_states[9] = \"LAST_ACK\";
 30    @tcp_states[10] = \"LISTEN\";
 31    @tcp_states[11] = \"CLOSING\";
 32    @tcp_states[12] = \"NEW_SYN_RECV\";
 33  }
 34
 35  kprobe:tcp_retransmit_skb
 36  {
 37    $sk = (struct sock *)arg0;
 38    $inet_family = $sk->__sk_common.skc_family;
 39    $AF_INET = (uint16) 2;
 40    $AF_INET6 = (uint16) 10;
 41    if ($inet_family == $AF_INET || $inet_family == $AF_INET6) {
 42      // initialize variable type:
 43      $daddr = ntop(0);
 44      $saddr = ntop(0);
 45      if ($inet_family == $AF_INET) {
 46        $daddr = ntop($sk->__sk_common.skc_daddr);
 47        $saddr = ntop($sk->__sk_common.skc_rcv_saddr);
 48      } else {
 49        $daddr = ntop($sk->__sk_common.skc_v6_daddr.in6_u.u6_addr8);
 50        $saddr = ntop($sk->__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr8);
 51      }
 52      $lport = $sk->__sk_common.skc_num;
 53      $dport = $sk->__sk_common.skc_dport;
 54      // Destination port is big endian, it must be flipped
 55      $dport = ($dport >> 8) | (($dport << 8) & 0x00FF00);
 56      $state = $sk->__sk_common.skc_state;
 57      $statestr = @tcp_states[$state];
 58      printf(\"time_:%llu pid:%u pid_start_time:%llu src_ip:%s src_port:%d dst_ip:%s dst_port:%d state:%s\",
 59             nsecs,
 60             pid,
 61             ((struct task_struct*)curtask)->group_leader->start_time / 10000000,
 62             $saddr,
 63             $lport,
 64             $daddr,
 65             $dport,
 66             $statestr);
 67    }
 68  }
 69
 70  END
 71  {
 72    clear(@tcp_states);
 73  }
 74  """
 75
 76
 77  def demo_func():
 78      name = 'tcp_retransmits'
 79      pxtrace.UpsertTracepoint(name,
 80                               name,
 81                               program,
 82                               pxtrace.kprobe(),
 83                               "10m")
 84      # Select the columns of interest from the tracepoint's output table.
 85      df = px.DataFrame(table=name,
 86                        select=['time_', 'src_ip', 'src_port', 'dst_ip', 'dst_port', 'state'])
 87
 88      # Convert IPs to domain names.
 89      df.resolved_src = px.pod_id_to_pod_name(px.ip_to_pod_id(df.src_ip))
 90      df.resolved_dest = px.pod_id_to_pod_name(px.ip_to_pod_id(df.dst_ip))
 91      df.ns_src = px.nslookup(df.src_ip)
 92      df.ns_dst = px.nslookup(df.dst_ip)
 93      df.src = df.resolved_src
 94      df.dst = px.Service(df.ns_dst)
 95
 96      # Count retransmits.
 97      df = df.groupby(['src', 'dst']).agg(retransmits=('ns_src', px.count))
 98
 99      # Filter for a particular service, if desired.
100      df = df[px.contains(df['dst'], '')]
101
102      # Set a threshold to display, if desired.
103      df = df[df['retransmits'] > 0]
104
105      return df

Pixie's scripts are written using the Pixie Language (PxL), a domain-specific language that is heavily influenced by the popular Python data processing library Pandas.

On line 8, we've included Dale Hamel's tcpretrans.bt bpftrace tool from the iovisor/bpftrace repo as a string. We've tweaked the original trace in order to work with Pixie's bpftrace rules (seen in the "Limitations" section above):

  • removed the informational print statements on lines 25-26 of tcpretrans.bt so that the program contains a single printf statement.
  • removed the time() call on line 71 of tcpretrans.bt.
  • modified the printf statement on line 72 of tcpretrans.bt to name the output columns (with no whitespace).
  • modified the same printf statement to output time using the reserved column name time_, passing it the nsecs argument.
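One detail of the bpftrace program that is easy to overlook is the byte swap on line 55 of the script, which converts the destination port from network (big-endian) byte order. The Python sketch below reproduces the same arithmetic on a sample value. The `swap_port_bytes` name is made up for illustration, and the example assumes a little-endian host, where the raw 16-bit field reads with its two bytes reversed.

```python
def swap_port_bytes(raw):
    """Swap the two bytes of a 16-bit value, mirroring line 55 of the script."""
    # The script masks with 0x00FF00, which is the same value as 0xFF00.
    return (raw >> 8) | ((raw << 8) & 0xFF00)


# Port 8080 is 0x1F90. Stored in network byte order and read as a
# little-endian integer, it appears as 0x901F.
raw = 0x901F
port = swap_port_bytes(raw)
print(port)  # 8080
```

The source port needs no such conversion in the script because skc_num is already stored in host byte order.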

On line 79, we call UpsertTracepoint with the following arguments, in order:

  • the name of the tracepoint
  • the name of the table to push data into
  • the bpftrace program
  • the type of the trace probe
  • the expiration time for the tracepoint

Lines 85-105 query the collected data, resolve known IPs to pod and service names, and group the retransmits by source and destination, tallying the number of retransmits for each pair.

If you'd like to filter the results to a particular service, modify line 100 to include the namespace:

df = df[px.contains(df['dst'], 'sock-shop')]
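For readers less familiar with dataframe-style queries, the group, filter, and threshold steps can be approximated in plain Python. This is only a sketch of the logic: the event rows are invented, `count_retransmits` is a hypothetical helper, and it assumes px.contains behaves like a substring test, which would explain why the default empty string keeps every row.

```python
from collections import Counter

# Made-up (source pod, destination service) pairs, one per retransmit event.
events = [
    ("sock-shop/carts", "sock-shop/carts-db"),
    ("sock-shop/carts", "sock-shop/carts-db"),
    ("sock-shop/orders", "sock-shop/payment"),
    ("default/load-test", "px-demo/frontend"),
]


def count_retransmits(events, dst_filter="", threshold=0):
    """Count events per (src, dst) pair, then filter by destination and count."""
    counts = Counter(events)
    return {
        pair: n
        for pair, n in counts.items()
        # An empty dst_filter is a substring of every string, so it keeps all rows.
        if dst_filter in pair[1] and n > threshold
    }


print(count_retransmits(events))
print(count_retransmits(events, dst_filter="sock-shop"))
print(count_retransmits(events, threshold=1))
```

Raising the threshold is a quick way to hide connections with only occasional retransmits and focus on the noisiest ones.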

Running other bpftrace programs

With a few modifications to satisfy the limitations listed above, all of the TCP bpftrace programs are known to work with Pixie:

  • tcpaccept.bt
  • tcpconnect.bt
  • tcpdrop.bt
  • tcplife.bt
  • tcpretrans.bt
  • tcpsynbl.bt

Based on a quick visual inspection of the code, the following programs should theoretically work with modifications, but have not yet been tested:

  • biosnoop.bt
  • dcsnoop.bt
  • execsnoop.bt
  • opensnoop.bt
  • statsnoop.bt
  • syncsnoop.bt
  • capable.bt
  • loads.bt
  • mdflush.bt
  • naptime.bt
  • oomkill.bt
  • writeback.bt

If you have any questions about this feature or how to incorporate your own bpftrace code, we'd be happy to help out over on our Slack.

Copyright © 2020 Pixie Labs Inc.