Issue
I have a simple project that allows you to add keys to a distributed cache in an application that is running Infinispan version 13 in embedded mode. It is all published here.
I run a kubernetes setup that can run in minikube. I observe that when I run my example with six pods and perform a rolling update, my infinispan performance degrades from the start of the roll out up until four minutes after the last pod has restarted and created its cache. After this time the cluster operates as normal again. With degrading I mean that the operation of getting the count of items in the cache takes 2-3 seconds to execute, compared to below 0.5 seconds in normal mode. With my setup this is consistently happening, and consistently working again after four minutes.
When running the project on my local machine without a kubernetes environment I have not experienced the same kind of delays.
I have tried using TRACE logs, but can see no event of significance that happens after these four minutes.
Is there something obvious that I'm missing in my configuration of Infinispan (that you can see in my referenced project), or some additional operation that needs to be performed? (currently I start the cache on startup, and perform stop on shutdown).
Solution
A colleague found the following logs when running Infinispan in non embedded mode:
2022-01-09 14:56:45,378 DEBUG (jgroups-230,infinispan-server-2) [org.jgroups.protocols.UNICAST3] infinispan-server-2: removing expired connection for infinispan-server-0 (240058 ms old) from recv_table
After these logs the service performance was returned to normal again. This lead us to suspect that JGroups somehow tries to use old connections to pods that have been removed. By changing the conn_close_timeout setting on UNICAST3 for Jgroups to 10 seconds instead of the default value 4 minutes we could confirm that service degradation was fixed in 10s instead of 4 minutes.
Additionally it seems that this fix only works when the service is running as a StatefulSet and not when it runs as a Deployment. I don't have explanation for exactly why this is, but in conclusion make the service to a stateful set and changing the conn_close_timeout on UNICAST3 in the JGroups configuration fixed our problem.
Answered By - Johan Carlsson
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.