Solve problems with the Tanzu Observability tile and the TAS integration.

This doc page looks at possible causes for problems you might encounter with your Tanzu Application Service (TAS) to Tanzu Observability integration and explains how to address them.

Sizing and Scaling for Large TAS Foundations

Larger TAS foundations are more demanding to monitor than smaller foundations.

  • If more application instances are running on a foundation, then more container-level metrics have to be collected and forwarded to Tanzu Observability.
  • If more virtual machines are in a foundation, then more VM-level metrics are reported.

If your foundation is large, tune the following parameters, in this order:

  1. Increase the size of your Telegraf Agent Virtual Machine. The Telegraf agent is responsible for collecting metrics and transforming them into the Wavefront data format. The agent is typically CPU and memory bound, so increasing the virtual machine size can improve performance.
  2. Reduce the scrape frequency. If collection times for some scrape targets are greater than 12 seconds, consider changing the scrape interval for your environment so that metrics are collected less frequently. Typically, an interval of 120% of the longest observed collection time is safe; see the illustration after this list.
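
For reference, in a standard Telegraf configuration the collection interval is the agent-level interval setting. The snippet below is only a sketch of the 120% guideline with assumed example numbers; with this integration you change the interval through the tile in Ops Manager rather than by editing the Telegraf configuration directly.

    [agent]
      ## Assumed example: the longest observed collection time is 15 seconds,
      ## so a safe scrape interval is at least 1.2 * 15 s = 18 s (rounded up to 20 s).
      interval = "20s"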

Symptom: No Data Flowing or Dashboards Show No Data

You have successfully set up the nozzle and the integration. However, you don’t see any data for the out-of-the-box dashboards. The most common cause is a problem with sending data to Tanzu Observability.

Potential Solutions:

  • Ensure that the setup flow has completed. Check back a few hours after you perform setup.
  • Verify that the proxy uses the correct API token and Wavefront instance URL. You specify that information in Ops Manager in the Proxy Config page.
  • Go through the Proxy Troubleshooting information and the Telegraf Troubleshooting information.
  • Verify that data are flowing from the Wavefront proxy to your Wavefront instance. See Proxy Troubleshooting.
  • In your Tanzu Application Service environment, verify that the BOSH jobs for the Wavefront proxy and for the Telegraf agent are running. See the example commands after this list.
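
For example, you can check the status of these jobs with the BOSH CLI. This is a sketch; the deployment name and instance group below are placeholders that depend on your environment.

    # List instances and their processes; confirm that the Wavefront proxy and Telegraf jobs are running.
    bosh -d <nozzle-deployment> instances --ps

    # If a process is failing, pull its logs for more detail.
    bosh -d <nozzle-deployment> logs <instance-group>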

Symptom: Higher than Expected PPS Rate

The PPS (points-per-second) rate can affect performance and potentially the cost of using Tanzu Observability.

  • 4.x: The PPS generated by the TAS Nozzle version 4.x should be predictable and relatively consistent for any given foundation, because metrics are scraped at a fixed interval.
  • 3.x: Version 3.x of the Nozzle follows a push-based model. PPS varies based on factors such as HTTP requests being served by the gorouter, so PPS is less predictable.

However, it can be difficult to predict the average PPS of a TAS foundation ahead of time because several factors affect the total number of metrics that are generated:

  • The TAS version
  • The size of the foundation
  • Other TAS components running on the foundation

PPS might increase or decrease when individual TAS components are installed, upgraded or removed. Each individual component contributes its own metrics.
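
To see the rate that your foundation actually generates, you can chart the points received by the Wavefront proxy. The query below is a sketch that assumes the proxy's standard internal metrics and the default listening port 2878; adjust the port to match your proxy configuration.

    rate(ts(~proxy.points.2878.received))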

Solution:

  • Increase the Telegraf agent’s scrape interval. Metrics are collected less frequently, and the average PPS decreases.

Future releases will allow more targeted approaches to reducing PPS, for example, by filtering out unwanted metrics.

Symptom: Incomplete Data in Tanzu Observability

Data from your TAS foundation are visible in Tanzu Observability dashboards and charts, but seem incomplete.

Potential Cause:

Incomplete data is most likely caused by one or more components failing to keep up with the volume of metrics generated by TAS. Typically, this happens when the gauge exporter emits large numbers of metrics and the Telegraf agent cannot ingest them and forward them to the Wavefront proxy before the next collection cycle begins. Errors might result, and metrics might be dropped as the Telegraf agent tries to catch up.

Investigation:

Here are some things you can do.

  • Look for errors in bpm logs on the Telegraf agent or in the Wavefront proxy logs. See Proxy Troubleshooting and Telegraf Troubleshooting for details.
  • Look for collection errors from Telegraf (tas.observability.telegraf.internal_gather.errors).
  • Look for long collection times from Telegraf (tas.observability.telegraf.internal_gather.gather_time_ns). See the example queries after this list.
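
For example, the following queries chart those two internal metrics. They are sketches; the second query divides by 1,000,000,000 to convert nanoseconds to seconds.

    ts(tas.observability.telegraf.internal_gather.errors)
    ts(tas.observability.telegraf.internal_gather.gather_time_ns) / 1000000000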

Potential Solutions (in the Ops Manager tile):

  • Increase the size of the Telegraf Agent Virtual Machine
  • Increase the Telegraf scrape interval