One of the biggest strengths of the Tealeaf Customer Experience platform is the vast amount of data it is able to capture and analyze. However, that very same torrential flow of data through Tealeaf can also pose a challenge. What do you do when your Tealeaf servers no longer seem capable of processing a day’s worth of web session traffic in real time?
The beauty of Tealeaf is the near real-time availability of session replay and reporting data. At first, session spooling might seem like an issue only on extremely high-volume days. Before long, though, spooling is happening every day, throughout the day: users can't find recent (let alone active) sessions, and dashboard reporting alerts are delayed to the point where they're not particularly useful. Proactive alerting is often a large cost-justifier for a Tealeaf solution. How can you get it back under control?
Note: For the purposes of this article, we're talking about Tealeaf spooling on the Windows-based servers (HBRs, Canisters, etc.). Capacity issues on the Linux capture servers are a different topic for another day.
Getting the Data Pipeline Under Control
Rule out the hardware
Before diving into possible configuration changes and intrusive, potentially outage-causing efforts, it's a good idea to rule out throughput issues in the disk I/O subsystem. If the canister can't write to disk quickly enough, spooling upstream is a strong possibility. Tealeaf's minimum requirements specify 50 MB/s write speeds on the HDD or SAN drives.
Microsoft provides a free tool, SQLIO, to test and benchmark the I/O capacity of the disk subsystem. Once installed, stop the Tealeaf services on the server and run the following tests:
sqlio.exe -kW -t1 -s60 -b32 -dn
sqlio.exe -kR -t1 -s60 -b32 -dn
Replace the "n" with the drive letter where the Tealeaf data (i.e., the Canister.dbs directory) is stored. These run a single-threaded, 60-second write test and read test, respectively, using 32 KB blocks. Here's a result you wouldn't want to see:
C:\SQLIO>sqlio.exe -kW -t1 -s60 -b32 -dF
sqlio v1.5.SG
1 thread writing for 60 secs to file F:testfile.dat
using 32KB IOs over 2048KB stripes with 64 IOs per run
size of file F:testfile.dat needs to be: 134217728 bytes
current file size: 0 bytes
need to expand by: 134217728 bytes
expanding F:testfile.dat ... done.
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec: 84.85
MBs/sec: 2.65
Looking at this output, the reported write speed is 2.65 MB/s, which is well under the 50 MB/s minimum. Time to engage the hardware support team to help identify whether a drive or controller is bad.
Finding bottlenecks in the Tealeaf configuration files
So if it’s not defective hardware, is it simply a matter of too much traffic for the existing hardware to handle? Maybe, but before making that pitch to purchase new hardware, there are a number of configuration options to try out.
Transport process memory
While the Canister process itself is 64-bit and can use about as much RAM as you can throw at it, the Tealeaf Transport process (TealeafCaptureSocket.exe) is still 32-bit. That gives it a theoretical maximum of 2 GB of addressable memory; in practice it tops out around 1700 MB. If your HBR's Transport process is hitting that limit, trouble is brewing. The Transport service handles not just the main pipeline but all of the child pipelines as well, and memory usage across all of them adds up.
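A quick way to check is from a command prompt on the HBR (Task Manager's memory column for TealeafCaptureSocket.exe tells you the same thing):
tasklist /FI "IMAGENAME eq TealeafCaptureSocket.exe"
If the Mem Usage figure is routinely approaching 1,700,000 K, the Transport process is running out of headroom.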
If the Transport service is bumping up against that 1700MB limit, you can make a change in the [TLTREF] section of the TealeafCaptureSocket.cfg file, changing SpoolRollSizeMB from its default value of 100 to 60. In each of the child HBR pipeline files, cut the SpoolRollSizeMB value from 20 to 10. If you have many canisters (and therefore child pipeline processes) you might recover substantial memory overhead with just this adjustment.
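As a rough sketch (file locations and child pipeline section names will vary by installation), the two edits look like this:
In TealeafCaptureSocket.cfg:
[TLTREF]
SpoolRollSizeMB=60
In each child HBR pipeline .cfg file (down from the previous 20):
SpoolRollSizeMB=10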
TLTREF cache size and placement
If high memory usage is NOT the problem, but hits queuing to disk is, you can try changing MaxCacheSize in the [TLTREF] section from 2000 to 8000, as shown below. TLTREF can be a bottleneck, and this may help get those hits evaluated and moved through the pipeline more quickly.
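Again, only a sketch, against the same section of TealeafCaptureSocket.cfg:
[TLTREF]
MaxCacheSize=8000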
If that doesn't help and spooling is still occurring, consider moving the TLTREF agent to the canisters instead. You likely have more canisters than HBRs, so spreading out this workload can help. If you do this, be sure to update the Browscap.csv and WURFL definition files on those canisters, since the work of evaluating browser versions has moved to them.
Privacy agent overhead
The Privacy agent might also be causing a bottleneck in the Tealeaf pipeline. If you have quite a few privacy rules and regular expressions being evaluated, Privacy and PrivacyEx can weigh heavily on the Transport service's resources. This is a good time to review the existing privacy rules to see whether some are no longer needed (these config files can end up as a graveyard of past experiments and long-ended projects) or could be combined with others, as in the sketch below. Make sure you're not running privacy rules that already exist on the PCA servers; that just reprocesses rules that have already been applied to the hit data. If memory usage for TealeafCaptureSocket isn't excessive, you can move Privacy to each of the child pipelines. Note that you should use Privacy, not PrivacyEx, if you plan to do this, due to the latter's higher memory requirements.
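As a purely hypothetical illustration (the field names are invented and the surrounding Privacy.cfg rule syntax is omitted), two rules that each scan every hit for a 16-digit card value:
cc_number=\d{16}
card_no=\d{16}
can usually be collapsed into a single rule using one alternation, roughly halving the number of passes over the request data:
(?:cc_number|card_no)=\d{16}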
As with the TLTREF agent above, moving the Privacy agent to the Canisters can help as well. You might be uncomfortable with sensitive customer data making it that far through the Tealeaf pipeline and only being redacted at the canister. Keep in mind, though, that this hit data exists as memory objects that are not readable outside of the pipeline process and would still be PCI-compliant. Company policy (and future PCI auditors looking for an easy "find", since memory objects can wind up in the server's paging file) may dictate otherwise.
Maybe you need new hardware after all
If the hardware checks out okay, the config files are optimized, and Tealeaf still can't handle the traffic load, maybe it is time to look into new hardware. It's very possible the hardware was spec'd out years ago based on a now-obsolete prediction of web traffic growth, and that new apps have come on board (and are now being captured by Tealeaf) in the meantime.
It's easy to forget the needs of Tealeaf because the product usually works quietly, happily processing sessions, until it finally fails. The four canisters installed four years ago might simply be unable to keep up with today's demand. At Stratigent we help organizations deal with the growing pains that Tealeaf can experience, and we can help you work through them as well. Don't fear: with just a little bit of help, you can safely recover the near real-time availability of session replay and reporting data that you expect.
Have questions?