How we got reliable sampled netflow data from the Nexus 7k

With Cisco Nexus 7ks making an appearance in more and more of our customers’ workplaces, we found getting reliable data from them into a Netflow Analyzer somewhat troublesome. The typical “turn it on” approach doesn’t quite cut it with these devices, and more than a few engineers have struggled with this for a while before getting it right. We thought we’d share our experience, and indeed our journey, toward the solution we found.

Right from the outset, we realised that the data we were getting from the Nexus was inaccurate. After verifying the configuration on the Nexus and doing some packet captures using our Netflow Analyzer in debug mode, we concluded that we were receiving Netflow version 9, and all the sources and destinations seemed right. No surprises there. However, the byte-count and packet-count were way off, and I mean WAY off! Clearly something was very wrong…

Our debugging methodology

Time for the lab. For debugging purposes, we generated 4Gbps of random traffic, which we verified by graphing the SNMP counters on the interface. Once we had verified that this was correct, we began looking at the netflow data.

Immediately we saw the discrepancy. Our netflow collector told us the interface was moving 4Mbps, so while the figure was steady, it was out by a factor of 1,000. An incorrect sampling rate was the obvious culprit, so we started looking there.

From the netflow export configuration on the router, we knew that the data received was meant to be sampled 1000 to 1. So, with this in mind, we started hunting around for the sampling information.
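As a quick sanity check on the numbers above, the arithmetic can be sketched in a few lines of Python. The function name is ours and purely illustrative. It assumes simple 1-in-N sampling, where the collector multiplies what it sees by N to estimate the real rate.

```python
def scale_sampled_rate(observed_bps: float, sampling_interval: int) -> float:
    """Estimate the real rate from a rate computed on 1-in-N sampled flows."""
    if sampling_interval < 1:
        raise ValueError("sampling interval must be at least 1")
    return observed_bps * sampling_interval


# 4 Mbps reported by the collector, sampled 1000 to 1 on the exporter.
estimated_bps = scale_sampled_rate(4_000_000, 1000)
print(f"{estimated_bps / 1e9:.1f} Gbps")  # prints "4.0 Gbps"
```

With the interval applied, the 4Mbps we saw lines up with the 4Gbps we generated, which is what pointed us at the sampling information rather than at the flow data itself.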

The problem we ran into

Netflow version 9 sends its data very differently from version 5, as it sends a Template FlowSet that tells us how to make sense of the actual data we receive in the Data FlowSet. We needed four things:

  • The Data FlowSet, which contains the actual netflow data.
  • A Template FlowSet that tells us how to decode the information in the Data FlowSet.
  • A special kind of Data FlowSet, called the Option FlowSet, which contains the sampling information.
  • The Option Template FlowSet, which in a similar way tells us how to make sense of that Option FlowSet.
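To make the four pieces above more concrete, here is a minimal Python sketch that walks a Netflow version 9 export packet and labels each FlowSet by its ID. It relies on the layout in RFC 3954: a 20-byte packet header, then FlowSets that each start with a 2-byte ID and a 2-byte length, where ID 0 is a Template FlowSet, ID 1 is an Option Template FlowSet, and IDs of 256 and above are Data FlowSets. What we call the Option FlowSet is a Data FlowSet whose ID matches an Option Template. The function name and the hand-built demo packet are illustrative only, not code from our collector.

```python
import struct

# version, count, sys_uptime, unix_secs, sequence, source_id (20 bytes)
HEADER = struct.Struct("!HHIIII")
# flowset_id, length (the length includes this 4-byte header)
FLOWSET_HEADER = struct.Struct("!HH")


def classify_flowsets(packet: bytes) -> list:
    """Return a (kind, flowset_id) pair for every FlowSet in a v9 packet."""
    version = HEADER.unpack_from(packet, 0)[0]
    if version != 9:
        raise ValueError(f"not a Netflow v9 packet (version={version})")

    kinds = []
    offset = HEADER.size
    while offset + FLOWSET_HEADER.size <= len(packet):
        flowset_id, length = FLOWSET_HEADER.unpack_from(packet, offset)
        if length < FLOWSET_HEADER.size:
            break  # malformed length; stop rather than loop forever
        if flowset_id == 0:
            kind = "template"
        elif flowset_id == 1:
            kind = "options_template"
        elif flowset_id >= 256:
            kind = "data"  # flow data, or option data if the ID matches an Option Template
        else:
            kind = "reserved"
        kinds.append((kind, flowset_id))
        offset += length
    return kinds


# Hand-built packet: an empty Option Template FlowSet and one 8-byte Data FlowSet.
demo = (
    HEADER.pack(9, 2, 0, 0, 0, 0)
    + FLOWSET_HEADER.pack(1, 4)
    + FLOWSET_HEADER.pack(256, 8)
    + b"\x00" * 4
)
print(classify_flowsets(demo))
```

Run against a capture, a check like this shows quickly which of the four FlowSet types an exporter is actually sending.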

What we really wanted to end up with was this information:

Option Flowset:
FLOW_SAMPLER_ID = 12345
FLOW_SAMPLER_NAME = foobar
FLOW_SAMPLER_RANDOM_INTERVAL = 1000
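Once a record like the one above has been decoded, using it is straightforward. The sketch below is a hypothetical helper, not our collector's actual code. It assumes records arrive as already-decoded dictionaries, that each flow record carries a FLOW_SAMPLER_ID to match on, and that IN_BYTES and IN_PKTS are the counters to scale. The exporter address 192.0.2.1 is a documentation placeholder.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SamplerTable:
    """Maps (exporter address, sampler id) to the sampling interval."""

    intervals: dict = field(default_factory=dict)

    def learn(self, exporter: str, option_record: dict) -> None:
        """Remember the interval announced in a decoded option record."""
        key = (exporter, option_record["FLOW_SAMPLER_ID"])
        self.intervals[key] = option_record["FLOW_SAMPLER_RANDOM_INTERVAL"]

    def scale(self, exporter: str, flow: dict) -> Optional[dict]:
        """Return a copy of the flow with counters multiplied by the interval."""
        interval = self.intervals.get((exporter, flow.get("FLOW_SAMPLER_ID")))
        if interval is None:
            return None  # no sampling info yet: flag it rather than guess
        scaled = dict(flow)
        scaled["IN_BYTES"] = flow["IN_BYTES"] * interval
        scaled["IN_PKTS"] = flow["IN_PKTS"] * interval
        return scaled


table = SamplerTable()
table.learn(
    "192.0.2.1",
    {
        "FLOW_SAMPLER_ID": 12345,
        "FLOW_SAMPLER_NAME": "foobar",
        "FLOW_SAMPLER_RANDOM_INTERVAL": 1000,
    },
)
flow = {"FLOW_SAMPLER_ID": 12345, "IN_BYTES": 1500, "IN_PKTS": 1}
print(table.scale("192.0.2.1", flow))
```

Returning None when no interval is known is a deliberate choice in this sketch: silently treating such flows as unsampled is what produces figures that are out by a factor of 1,000.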

 

But for some reason, we were just not getting it in our collector. Time to look at the actual raw packet – time for Wireshark.

Down to the packet level

Using Wireshark, we found that despite running the trace for several hours, we never got the Option FlowSet. We didn’t even get an Option Template FlowSet explaining how to decode the missing FlowSet.

Basically, we had no information whatsoever.

To make matters worse, even the Template FlowSet was sent so infrequently that it was a while before Wireshark could decode a single packet!

So, not only were we getting the sampling rate wrong, we were also losing data for long periods at the start of flow collection, because nothing can be decoded until the Template arrives. The latter is an issue with all collectors, but routers usually send the Template FlowSet often enough that you won’t wait longer than a minute or two for a new one. Our collector actually persists Templates in case it restarts, so that we don’t have to wait for the Template – but should the Template change, it would spell problems if the new one were not sent frequently enough.
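The idea of persisting Templates across restarts can be sketched as follows. This is an illustration under our own assumptions, not how our collector is actually implemented: the class name, the JSON file format and the (exporter, source ID, template ID) key are all hypothetical, and a Template is reduced to a list of (field type, length) pairs.

```python
import json
from pathlib import Path
from typing import Optional


class TemplateCache:
    """Keeps decoded Templates on disk so a restart can decode flows at once."""

    def __init__(self, path: Path):
        self.path = Path(path)
        self.templates = {}
        if self.path.exists():
            self.templates = json.loads(self.path.read_text())

    @staticmethod
    def _key(exporter: str, source_id: int, template_id: int) -> str:
        return f"{exporter}/{source_id}/{template_id}"

    def store(self, exporter: str, source_id: int, template_id: int, fields) -> bool:
        """Save a Template; return True if it is new or has changed."""
        key = self._key(exporter, source_id, template_id)
        normalized = [list(pair) for pair in fields]  # JSON has no tuples
        if self.templates.get(key) == normalized:
            return False
        self.templates[key] = normalized
        self.path.write_text(json.dumps(self.templates))
        return True

    def lookup(self, exporter: str, source_id: int, template_id: int) -> Optional[list]:
        """Return the stored field list, or None if no Template is known."""
        return self.templates.get(self._key(exporter, source_id, template_id))
```

A collector built this way would call store() for every Template FlowSet it receives and lookup() before decoding each Data FlowSet. The stale-Template risk is visible here too: until the exporter sends the changed Template, lookup() keeps returning the old field list.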

So our problem was that instead of seeing the data in gigabytes as we were expecting, we were seeing it in megabytes. Since some of our clients use this data for billing, any inaccuracy could result in serious loss of revenue – especially if the data was out by a factor of 1,000.

The Fix

We tried a few options in the configuration and finally we managed to see the Option Template FlowSet coming through and with it the FlowSet containing the sampling information. Now that we had the Nexus sending us all the necessary information, we turned to our Netflow Analyzer and we started seeing the sampling rate being applied.

Here is the configuration we used:

 

Router Configuration:
flow exporter FLOWEXPORTER-NEW-IRIS*
  version 9
    template data timeout 60
    option exporter-stats timeout 60
    option interface-table timeout 60
    option sampler-table timeout 60

 

We heard a few opinions about these devices from engineers, including “it cannot be done”, “we hardcode the sampling rate” and “we don’t export from these devices for that reason”. So, now we know that it CAN be done; it just requires a bit of configuration.

The issue here simply seems to be that unless you actually specify a timeout value, the Nexus 7k doesn’t “default” to some value, but actually never sends the packet. Perhaps it does eventually (you’d have to ask Cisco), but we didn’t have time to wait all day for it while we were losing data.
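Because the failure is silent, it can help to have the collector notice when an exporter sends flows but never announces its sampling information. The following is a minimal sketch of such a check. The class, its method names and the 180-second threshold are our own illustrative choices, not a feature of any particular product.

```python
import time
from typing import Optional


class OptionWatchdog:
    """Flags exporters that send flows but no recent sampler options."""

    def __init__(self, max_age_seconds: float = 180.0):
        self.max_age = max_age_seconds
        self.first_flow = {}    # exporter -> time of first flow seen
        self.last_options = {}  # exporter -> time of last sampler option seen

    def saw_flow(self, exporter: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.first_flow.setdefault(exporter, now)

    def saw_sampler_options(self, exporter: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.last_options[exporter] = now

    def stale_exporters(self, now: Optional[float] = None) -> list:
        """Exporters whose sampling info is missing or older than max_age."""
        now = time.monotonic() if now is None else now
        stale = []
        for exporter, first in self.first_flow.items():
            reference = self.last_options.get(exporter, first)
            if now - reference > self.max_age:
                stale.append(exporter)
        return sorted(stale)
```

With the 60-second timeouts from the configuration above, an exporter should refresh its sampler table well inside a 180-second window, so anything this check reports is worth a look at the export configuration.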

For more information on IRIS Networking Systems, or if you have any more questions about getting reliable results from the Nexus 7k, or to find out more about our network management software, please don’t hesitate to contact us.


Image credit: Networkworld