This is a quick overview of Pixie's dynamic tracing capability, which lets developers add tracepoints without any instrumentation. To test this out yourself, check out the tutorial here.
The goal is to save developers hours or days of debugging code-level performance issues by letting them dynamically add tracepoints to production code without any instrumentation.
What are the arguments and return value of calls to Foo(x, y, z)?
What is the latency (call to return) of calls to Login(username)?
Here's a quick overview video in which we dynamically inject a tracepoint into a function (link) within the "Online-Boutique" demo application's checkout service:
For reference, here's the PxL script used in the video:
    import pxtrace
    import px

    # Change to the Pod you want to trace.
    pod_name = 'online-boutique/checkoutservice'

    # The name of the table where results will be gathered.
    # Make sure to change this name if you change the data being collected in the pxtrace.probe below,
    # otherwise there may be a column mis-match.
    table_name = 'checkout_tracer_table'

    # The following pxtrace.probe specifies the application code to be traced.
    # func Sum(l, r pb.Money) (pb.Money, error)
    @pxtrace.probe('github.com/GoogleCloudPlatform/microservices-demo/src/checkoutservice/money.Sum')
    def probe_func():
        return [{'lUnits': pxtrace.ArgExpr('l.Units')},
                {'lNanos': pxtrace.ArgExpr('l.Nanos')},
                {'rUnits': pxtrace.ArgExpr('r.Units')},
                {'rNanos': pxtrace.ArgExpr('r.Nanos')},
                {'retUnits': pxtrace.RetExpr('$0.Units')},
                {'retNanos': pxtrace.RetExpr('$0.Nanos')}]

    # This UpsertTracepoint deploys the dynamic tracepoint on the specified pod.
    pxtrace.UpsertTracepoint('checkout_tracer',
                             table_name,
                             probe_func,
                             pxtrace.PodProcess(pod_name),
                             '10m')

    # Query and output the results to screen.
    df = px.DataFrame(table_name)
    px.display(df)
Note that there is a known bug in which re-running the script after modifying the probe_func definition causes the tracepoint to fail to deploy. To work around this bug, whenever you modify the probe_func definition, rename the table_name (and update the table_name in the df = px.DataFrame(table_name) line as well).
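One way to make that rename hard to forget is to derive the table name from what the probe collects, so that any change to the probe's columns or expressions yields a new name. The helper below is a hypothetical sketch in plain Python, intended to be run locally to produce a name to paste into the script. `versioned_table_name` is not part of the PxL API, and the hashing scheme is an assumption chosen for illustration.

```python
import hashlib


def versioned_table_name(base, columns):
    """Return base plus a short digest of the probe's (column, expression) pairs.

    Hypothetical helper, not part of PxL: run it locally and paste the
    result into the script as table_name.
    """
    # Order is included because it determines the table's column layout.
    spec = ";".join(f"{name}={expr}" for name, expr in columns)
    digest = hashlib.sha1(spec.encode("utf-8")).hexdigest()[:8]
    return f"{base}_{digest}"


# A subset of the columns collected by probe_func in the script above.
columns = [
    ("lUnits", "l.Units"),
    ("lNanos", "l.Nanos"),
    ("retUnits", "$0.Units"),
]
print(versioned_table_name("checkout_tracer_table", columns))
```

Because the digest depends only on the column specification, re-running the script without changing the probe keeps the same table name, while any edit to the probe produces a fresh one.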
Dynamic tracing has currently been tested on Go, with limited support for C++. Other compiled languages, such as Rust and Haskell, are well supported by our approach.
Our system does not currently work with interpreted or VM-based languages. These languages usually have fairly sophisticated debug environments that we will integrate with in the future.
We currently require DWARF information to be present in the binary. Optimized binaries are supported as long as they contain debug symbols, although some cases, such as inlined functions, are not yet fully supported by Stirling. Future versions of Pixie will add support for remotely hosted symbol files. We are actively seeking feedback about how remote symbol files are used in practice, in order to design proper features.
Dynamic tracepoints connect to the Pixie platform. Native streaming support is core to Pixie and will be available in the near future.
We currently support only tracepoints that are generated by Pixie. Our approach could be extended to add support for this in the future if there is significant demand.
Since Dynamic Tracepoints slot natively into Pixie, they can leverage the platform's visualization environment. We will add support for views such as flame graphs in the future.
We currently support capturing function arguments, return values and latencies.
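To make the latency measurement concrete: a call-to-return latency is the difference between a timestamp taken when the function is entered and one taken when it returns, matched per thread of execution. The following is a minimal Python model of that pairing, not Pixie's implementation. The event tuple layout and the `pair_latencies` name are assumptions made for illustration.

```python
from collections import defaultdict


def pair_latencies(events):
    """Match entry/return events per thread and return latencies in nanoseconds.

    events: iterable of (timestamp_ns, thread_id, kind) tuples ordered by
    timestamp, where kind is "entry" or "return".
    """
    open_calls = defaultdict(list)  # thread_id -> stack of entry timestamps
    latencies = []
    for timestamp_ns, thread_id, kind in events:
        if kind == "entry":
            open_calls[thread_id].append(timestamp_ns)
        elif kind == "return" and open_calls[thread_id]:
            # A stack keeps nested (recursive) calls matched innermost-first.
            latencies.append(timestamp_ns - open_calls[thread_id].pop())
    return latencies


events = [
    (1_000, 1, "entry"),
    (1_200, 2, "entry"),
    (1_900, 1, "return"),
    (2_700, 2, "return"),
]
print(pair_latencies(events))  # [900, 1500]
```

Return events with no matching entry are ignored in this sketch, which is one simple way to handle a tracepoint that was deployed while a call was already in flight.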
We don’t currently support any operators that will mutate the state of the application.
Not currently supported, but since this is such a useful feature we will explore adding it.
Dynamic tracepoints don’t rely on any K8s specific features. They will be supported outside of K8s when Pixie can be installed there.
Tracepoints work on a declarative specification. Since Pixie is designed to work both inside and outside of K8s we don’t leverage CRDs to transmit the specification. In the future we might add support for providing specs from CRDs that are read into Pixie.
Minimal. A few tracepoints should have very little to no visible impact on non-trivial applications. Our studies on BPF probes have shown <1% overhead to capture full messages from a simple HTTP server. How often a tracepoint is triggered, and the amount of data being collected will affect this number.
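As a rough back-of-the-envelope model of that dependence, the CPU time a tracepoint adds scales with how often it fires and how long each firing takes. The sketch below uses made-up numbers purely for illustration. The per-hit cost is an assumption, not a measured Pixie figure, and `overhead_fraction` is a hypothetical helper.

```python
def overhead_fraction(hits_per_second, cost_per_hit_ns):
    """Fraction of one CPU core spent in the probe, under a simple linear model."""
    return hits_per_second * cost_per_hit_ns / 1_000_000_000


# Hypothetical inputs: 2,000 calls per second at an assumed 2,000 ns per probe hit.
print(f"{overhead_fraction(2_000, 2_000):.2%}")  # 0.40%
```

Under this model, collecting more data per hit raises the per-hit cost, and tracing a hotter function raises the hit rate, so both push the fraction up.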
Since Dynamic Tracepoints can observe essentially any function and its arguments, there are significant privacy and security concerns. We will address these by adding RBAC support, with the ability to restrict deployment to specific templates that have been reviewed and approved. This feature can also leverage PII masking and other future enhancements to Pixie.
Unlike most existing approaches we don’t actually stop execution of the program or mutate state. This allows us to easily capture data in production environments with limited overhead.
Tracepoints have a TTL (time to live) when registered. This allows automatic garbage collection of old tracepoints. They can also be manually deleted.
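The TTL behavior can be pictured as a registry that records an expiry time for each tracepoint and drops entries once that time has passed. This is an illustrative Python sketch only. `TracepointRegistry`, its methods, and the `login_tracer` name are hypothetical and do not reflect how Pixie tracks tracepoints internally.

```python
class TracepointRegistry:
    """Toy registry: each tracepoint expires ttl_seconds after its last upsert."""

    def __init__(self):
        self._expiry = {}  # name -> absolute expiry time in seconds

    def upsert(self, name, ttl_seconds, now):
        # In this toy model, re-registering a name refreshes its expiry.
        self._expiry[name] = now + ttl_seconds

    def delete(self, name):
        # Manual removal before the TTL runs out.
        self._expiry.pop(name, None)

    def collect_expired(self, now):
        """Remove and return the names whose TTL has elapsed."""
        expired = [name for name, t in self._expiry.items() if t <= now]
        for name in expired:
            del self._expiry[name]
        return expired

    def active(self):
        return sorted(self._expiry)


registry = TracepointRegistry()
registry.upsert("checkout_tracer", ttl_seconds=600, now=0)  # '10m', as in the script above
registry.upsert("login_tracer", ttl_seconds=60, now=0)
print(registry.collect_expired(now=120))  # ['login_tracer']
print(registry.active())                  # ['checkout_tracer']
```

Time is passed in explicitly rather than read from a clock so that the expiry logic is easy to follow and to test.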