
Using Visibility and Automation Together

by LindsayHillBRCD on 01-09-2017 11:42 AM

In Nabil Bukhari's blog on December 6, he detailed the vision outlined in our announcement that day: multidimensional agility, with visibility and automation working in concert.

 

On December 14, we showed at a very high level how Brocade Workflow Composer can be used to automate troubleshooting and remediation by using information from the SLX Insight Architecture and Visibility Services.

 

In this blog, we’ll walk through a specific example of how to use this combination of platform and automation features to determine the root cause of an application performance problem, whether it lies in the network, the application itself, or the physical or virtual compute resources.

 

Our Scenario

Let’s assume that your software team has deployed a distributed application on a scale-out leaf-spine IP fabric network (Figure 1). The fabric uses BGP-EVPN, with application isolation, to provide Layer 2 services across the fabric.

Figure 1: Distributed Application across Scale-Out IP Fabric


 

Selected users have been reporting intermittent, inconsistent performance problems. The software team suspect a network problem, and have passed it to the network team to investigate further.

 

How do we go about troubleshooting the problem? We start at a high level, then work down to deeper detail until we isolate the issue:

  1. Check overall traffic – look for any link congestion problems
  2. Drill into specific per-application server traffic levels to identify abnormalities
  3. Capture traffic from anomalous servers to drill down into specific packets

Streaming Data: The Big Picture

Brocade SLX switches support streaming interface counters. Using Brocade Workflow Composer, we can run a workflow to configure the streaming settings on our switches. This pushes out a profile that defines the statistics we want to stream and where to send the data. There is no need to log in to each individual switch. For this exercise, our profile needs to include interface counters.

 

This data can be collected and displayed by tools such as Splunk, InfluxDB, Grafana, or the Elastic Stack.
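Utilization graphs are usually derived from the difference between successive counter samples. The following is a minimal Python sketch of that calculation, not the method used by any particular tool. The function name, the 64-bit counter width, and the sample values are assumptions made for illustration.

```python
def utilization_percent(prev_octets, curr_octets, interval_s, link_bps,
                        counter_bits=64):
    """Return link utilization (%) between two octet-counter samples.

    Assumes the counter wrapped at most once between the two samples.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Modulo arithmetic gives the correct delta across a single counter wrap.
    delta_octets = (curr_octets - prev_octets) % (1 << counter_bits)
    bits_per_second = delta_octets * 8 / interval_s
    return 100.0 * bits_per_second / link_bps


# Hypothetical samples: 125,000,000 octets in 10 seconds on a 1 Gbps link.
print(utilization_percent(0, 125_000_000, 10, 1_000_000_000))
```

Comparing this figure per interface against a threshold is one simple way to flag congested links.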

Our starting point is to log in to a dashboard showing interface utilization graphs (Figure 2). This will tell us if there is any congestion occurring on links within the fabric, or at the edge ports:

 


 

Figure 2: Sample Dashboard Showing Interface Utilization

 

But these graphs don’t show anything unusual. Traffic levels are normal, and no interfaces are showing congestion. We need to go deeper.

 

Visibility Services

SLX Visibility Services gives us multilayer classification capabilities including network parameter filters such as IP and MAC addresses, port numbers, VNIs, and workload matching. We can then take action on matching packets, such as count, drop or mirror.

 

We want to get traffic counters for each of our application servers, at every leaf switch that the application currently uses.

 

We need to:

  • Identify the IP addresses used by our application, which compute nodes they currently run on, and which switch ports they are connected to
  • Figure out which VNIs are used for that traffic
  • Create rules to match that traffic, and install on all relevant leaf switches
  • Monitor the results

The first three steps are tedious, repetitive work: a perfect case for automation. So we run a workflow to gather the IP addresses from our compute system, identify the VNIs used, and pass the details through to a workflow that sets up the matching rules, with a “count” action.
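The rule-building part of such a workflow can be sketched in Python as follows. This is an illustration only: the `Endpoint` fields, the rule dictionary layout, and the addresses, ports, and VNIs are all hypothetical, and none of it represents the actual Brocade Workflow Composer or SLX Visibility Services API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """One application server as reported by the compute system (hypothetical)."""
    ip: str
    leaf: str   # leaf switch the server is attached to
    port: str   # switch port on that leaf
    vni: int    # VNI carrying the application's traffic


def build_count_rules(endpoints):
    """Group endpoints by leaf switch and emit one 'count' rule per server."""
    rules_by_leaf = {}
    for ep in endpoints:
        rule = {
            "match": {"ip": ep.ip, "vni": ep.vni},
            "port": ep.port,
            "action": "count",
        }
        rules_by_leaf.setdefault(ep.leaf, []).append(rule)
    return rules_by_leaf


# Hypothetical inventory: three servers spread over two leaf switches.
inventory = [
    Endpoint("10.0.1.11", "leaf1", "eth0/1", 5001),
    Endpoint("10.0.1.12", "leaf1", "eth0/2", 5001),
    Endpoint("10.0.2.21", "leaf2", "eth0/1", 5001),
]

for leaf, rules in build_count_rules(inventory).items():
    print(leaf, rules)
```

Grouping by leaf switch means each switch receives only the rules for servers attached to it. Because the action is a plain field, the same structure could later carry a different action for the same matches.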

 

Watching the results, we can then see traffic on a per-IP basis, rather than the aggregated interface stats we had earlier. This reveals something unusual: one of the servers has noticeably lower traffic volumes than the others. It’s not zero, but it is well below its peers. What’s going on with that server?
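Spotting the quiet server by eye works for a handful of hosts, but the same comparison is easy to automate. The sketch below flags any server whose byte count falls well below the median of its peers. The function name, the 50% threshold, and the counter values are assumptions chosen for illustration, not output from a real fabric.

```python
from statistics import median


def low_traffic_servers(bytes_by_ip, threshold=0.5):
    """Return IPs whose byte count is below threshold * median of all servers."""
    if not bytes_by_ip:
        return []
    typical = median(bytes_by_ip.values())
    return sorted(ip for ip, count in bytes_by_ip.items()
                  if count < threshold * typical)


# Hypothetical per-IP byte counters gathered from the "count" rules.
counters = {
    "10.0.1.11": 9_800_000,
    "10.0.1.12": 10_100_000,
    "10.0.2.21": 3_200_000,
    "10.0.2.22": 10_400_000,
}

print(low_traffic_servers(counters))
```

Using the median rather than the mean keeps a single outlier from dragging the baseline toward itself.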

 

SLX Insight Architecture

So now we want to dig deeper into that traffic. We run a new workflow that applies a “mirror” action to the interesting traffic, and sets up a packet capture on our Guest VM in the SLX Insight Architecture. No dedicated taps or hardware needed.

 

Now we have a pcap file that we can analyze in Wireshark. Looking at the packets in more detail, we see something a little unusual: one of the application components isn’t loading. Clients are timing out with that component, and failing over to another server.

 

Armed with this information, we can go back to the software team, who resolve the issue. Traffic is now balanced properly across all systems, all are working as expected, and users are happy.

 

Finally, we run a “cleanup” workflow that removes our packet capturing rules, and we’re done!

 

Related Links  

SLX Insight Architecture User Guide

Visibility in the Modern Data Center with SLX Switches and Routers

Brocade Workflow Composer Automation Platform