What are the limits on actor events in Service Fabric?

I am currently testing how my application scales and ran into something I did not expect. The application runs on a 5-node cluster, has multiple services and actor types, and uses the shared process model. One component uses actor events as a best-effort pub/sub system (there are fallbacks in place, so a dropped notification is not an issue).

The problem arises when the number of actors (that is, subscription topics) grows. The actor service currently has 100 partitions. At the point where things break there are around 160,000 topics, each subscribed 1-5 times (once per node where it is needed), with an average of 2.5 subscriptions per topic, so roughly 400k subscriptions in total.

At that point communication in the cluster starts breaking down: new subscriptions are not created and unsubscribes time out. It also affects other services: internal calls to a diagnostics service (which ask each of the 5 replicas) time out. This is probably related to resolving partition/replica endpoints, because outside calls to the web page are fine (these endpoints use the same technology/code stack). The Event Viewer is full of warnings and errors like:

EventName: ReplicatorFaulted Category: Health EventInstanceId PartitionId ReplicaId 132580461505725813 FaultType: Transient, Reason: Cancelling update epoch on secondary while waiting for dispatch queues to drain will result in an invalid state, ErrorCode: -2147017731
10.3.0.9:20034-10.3.0.13:62297 send failed at state Connected: 0x80072745
Error While Receiving Connect Reply : CannotConnect , Message : 4ba737e2-4733-4af9-82ab-73f2afd2793b:382722511 from Service 15a5fb45-3ed0-4aba-a54f-212587823cde-132580461224314284-8c2b070b-dbb7-4b78-9698-96e4f7fdcbfc
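
To put the numbers above in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes topics and subscriptions are spread evenly across partitions and nodes, which may not hold in practice, and the variable names are purely illustrative.

```python
# Figures taken from the question.
topics = 160_000            # actors acting as subscription topics
avg_subs_per_topic = 2.5    # each topic is subscribed 1-5 times
partitions = 100            # partitions of the actor service
nodes = 5                   # cluster size

total_subscriptions = int(topics * avg_subs_per_topic)

# Assumption: perfectly even spread across partitions and nodes.
topics_per_partition = topics // partitions
subs_per_partition = total_subscriptions // partitions
subs_per_node = total_subscriptions // nodes

print(f"total subscriptions:         {total_subscriptions}")
print(f"topics per partition:        {topics_per_partition}")
print(f"subscriptions per partition: {subs_per_partition}")
print(f"subscriptions per node:      {subs_per_node}")
```

Under those assumptions each partition would track about 1,600 topics and 4,000 subscriptions, and each node would hold about 80,000 subscriptions on the subscriber side.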
P. Gramberg asked Feb 26, 2021 at 8:24

As mentioned earlier this week in youtube.com/watch?v=oX9cX69mk5o (at 44:17), there is no hard limit on actor events. From that answer I inferred that your best option may be to file a support ticket, so that the actual resource utilization on your cluster can be examined to identify the root cause of your problem.

Commented Mar 20, 2021 at 7:56

Thanks, I ended up giving more info to Matt, after which he asked me to open a support ticket. I will update/answer this question when I know more.