Connectwise Automate Patch 2021.5 was released on May 11, and it contains security fixes for device inventory. It also contains performance issues for large Automate partners.
Patch performance woes
After applying the 2021.5 patch partners with large servers started to report miscellaneous problems — login issues, scripts not running as expected, UI slowness, IIS timeouts/errors, and high resource usage. The impact of these issues ranged based on the particulars of the partner environment, and many partners reported that a reset of IIS fixed the issue. However, after investigating a particularly impacted instance, we found that these were symptoms of a larger problem (that will cripple larger Automate servers if they aren’t tuned correctly).
Inventory sending — what changed in patch 2021.5?
The release notes for that patch state that additional sanitation for agent inventory was introduced (the discerning reader will likely suspect that this is a fix for a SQL injection vulnerability). The release notes also state that this would be a breaking change until the agents updated, but a review of the schema shows that nothing major changed in the database structure between patch 4 and patch 5.
However, here at Automation Theory, we have some database monitoring tools, and we noticed that there were new queries introduced after the patch, and they were running a lot (more than computer check-ins on large servers). We didn’t think much of this initially until we dug deeper into IIS performance (long-running requests). Upon review, we found that for processes, services, and event logs each row was being inserted individually and that for each insert, a new connection to MySQL was being opened and closed.
Database Overload — death by 1,000
cuts connections and query congestion
MySQL can open connections very rapidly, but there is overhead for each connection, so it’s best practice to open one connection and reuse it wherever possible. In a 2019 blog post, the MySQL Server Team put out research showing that MySQL can handle 80,000 connection requests a second, and at a glance, it seems like plenty of capacity.
So, what would it take to hit this 80k/second limit? Surprising, not a lot after patch 2021.5. We took an Automate server and counted the average number of services and processes per computer, and the results were 636 and 172 respectively. That means that each PC would open 808 connections to the database when doing an inventory resend for services and processes alone. This means that 100 devices sending inventory would theoretically max out Automate’s database connection throughput (event log collection brings this number to 73). The saving grace of this situation is that inventory sending isn’t constant and simultaneous for all agents — the scheduling and natural usage patterns of the remote devices normally prevent this limit from being reached.
However, a second issue also emerges: there are a lot more queries running after the 2021.5 patch (as these inserts weren’t done one-by-one before). One larger insert is up to 30% more efficient than multiple smaller inserts. As a result, this causes the MySQL equivalent of a traffic jam (think a congested highway; traffic moving, but slowly). A MySQL server that is performing well probably isn’t going to respond optimally to surges of new connections and up to 30% more overhead in responding to those requests. This back-pressure is the root cause of the resource issues, UI slowness, IIS timeouts, and other application behavior issues.
Fixing the patch 2021.5 issues
The first issue of connection throughput doesn’t have a perfect fix. Ideally, this would be corrected in the application so that a single connection would be reused as applicable. However, we can try and shore up MySQL as much as possible. The big-ticket item here is the thread cache (which the official docs suggest hard-coding to 50). As mentioned here, the “one-size-fits-all” database tuning approach isn’t helpful, as the proper value will depend on how many simultaneous new requests your server would experience on average (based on agent count and inventory schedules). The algorithm in the full version of our plugin Database Tuner has been updated to calculate the number of thread cache misses and compensate dynamically in real-time.
The congestion due to the higher volume of queries, unfortunately, doesn’t have a “one-size-fits-all” solution. Both Database Tuner Lite and Database Tuner contain optimizations for transaction handling, and that might bring things back into balance. We’d also suggest looking at our guide for IIS tuning to ensure that the web front end is also operating efficiently.
We hope this information has been helpful to you. Here at Automation Theory, we’re certified MySQL DBAs dedicated exclusively to the Connectwise Automate software stack. If you need any additional assistance we’re happy to offer consulting services, and you can find more information here.