Change propagation between AOS nodes

When working with D365FO, we usually expect every change to take effect immediately across the whole application. D365FO runs on a multi-node cluster, so all nodes should see the same data. However, most data changes rarely, so several caching mechanisms are used to improve performance.

I was curious whether these caching mechanisms have unexpected side effects, such as delays in change propagation. Since I could not find anything about this on the internet, I did the research myself, and in this article I am happy to share the results of my investigation with you.

Testing setup

In a Tier 1 (development) environment, different user sessions run in the same web server application domain, so they can share certain objects (in .NET and in the kernel). Batch job sessions run in a separate process (Batch.exe), so there are two processes running on one Windows server. My understanding is that RPC is used to propagate changes between them, exactly as if the processes ran on different servers of a multi-node environment.

Non-development environments (Tier 2+) run on a cluster of multiple Windows server nodes running AOS and Batch instances. (An environment also contains other types of nodes, but they are not relevant to this topic.) For more realistic testing, I ran all tests on a Tier 2 self-service environment with 2 AOS nodes and 1 Batch node.

SysGlobalObjectCache

Static variables, singletons and SysGlobalCache can be used for sharing data between code parts in one session, while SysGlobalObjectCache is intended for sharing across all sessions.

According to the documentation (Link 1, Link 2), it is divided into several cache scopes, each containing several key-value pairs. Inserted data is stored in one instance and not propagated, while clear calls are propagated and should clear the selected cache scope on all instances.

Tests on Tier 2 environments have shown that each AOS or Batch node has one cache instance, so all sessions that share the same cache instance see the same data. Sessions on other cache instances will see different data (the data that was inserted in the cache scope on that instance), but the clear call will clear the selected cache scope on all nodes (cache instances).

On the Tier 2 environment it took 30-60s to propagate the clear call from AOS1 to AOS2. In the meantime, the cache scope on AOS2 still contained the old data. I did not manage to measure the delay between AOS1/AOS2 and BATCHAOS1, probably because I could not time the batch job execution precisely enough. I also did not manage to measure the delay between the AOS and Batch instances on Tier 1 environments. My understanding is that the same RPC mechanisms for inter-node change propagation are used in all environments, so the delays are probably the same or similar.
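The observed behavior can be illustrated with a small toy model. This is plain Python, not X++ and not the actual kernel implementation: the class names `NodeCache` and `Cluster`, the scope name `MyScope`, and the fixed 45s delay (picked from the measured 30-60s range) are all assumptions made for illustration. The model also assumes that the clear takes effect immediately on the node that issued it.

```python
from dataclasses import dataclass, field


@dataclass
class NodeCache:
    """Hypothetical stand-in for the single cache instance on one node."""
    name: str
    scopes: dict = field(default_factory=dict)

    def insert(self, scope, key, value):
        # Inserts stay local: no other node sees this entry.
        self.scopes.setdefault(scope, {})[key] = value

    def find(self, scope, key):
        return self.scopes.get(scope, {}).get(key)


class Cluster:
    """Toy cluster in which a clear reaches remote nodes after a fixed delay."""

    def __init__(self, node_names, clear_delay_s):
        self.nodes = {n: NodeCache(n) for n in node_names}
        self.clear_delay_s = clear_delay_s  # assumed constant, for simplicity
        self.now = 0.0
        self.pending = []  # (due_time, node_name, scope)

    def clear(self, origin, scope):
        for name, node in self.nodes.items():
            if name == origin:
                # Assumption: the issuing node clears its scope at once.
                node.scopes.pop(scope, None)
            else:
                due = self.now + self.clear_delay_s
                self.pending.append((due, name, scope))

    def advance(self, seconds):
        # Move the simulated clock and apply every clear that is now due.
        self.now += seconds
        still_pending = []
        for due, name, scope in self.pending:
            if due <= self.now:
                self.nodes[name].scopes.pop(scope, None)
            else:
                still_pending.append((due, name, scope))
        self.pending = still_pending


cluster = Cluster(["AOS1", "AOS2"], clear_delay_s=45)
cluster.nodes["AOS1"].insert("MyScope", "rate", 1.10)
cluster.nodes["AOS2"].insert("MyScope", "rate", 1.05)  # each node has its own data

cluster.clear("AOS1", "MyScope")
cleared_on_aos1 = cluster.nodes["AOS1"].find("MyScope", "rate")
stale_on_aos2 = cluster.nodes["AOS2"].find("MyScope", "rate")

cluster.advance(60)
after_delay_on_aos2 = cluster.nodes["AOS2"].find("MyScope", "rate")
```

In this model, AOS2 keeps serving its old entry until the simulated clear arrives, which is the window in which the two nodes disagree.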

Table caching

Even though the table data is stored in a single SQL Server database shared by all nodes, some data is still cached on the nodes to improve performance. This happens automatically in the kernel, without the developer having to do anything explicitly. The Cache Lookup table property defines the table’s caching behavior. If you set it to EntireTable, all table records are read and stored in the cache when the table is first accessed or after the cache has been cleared. If you set it to FoundAndEmpty, Found, NotInTTS or None, then only some records are cached, and only in certain cases.

I have tested change propagation on the Tier 2 environment for different tables. For tables with Cache Lookup=EntireTable, change propagation normally takes 15-30s. During that time, other nodes still see the old content of the table. This applies to inserting, deleting and updating records. If you flush the table buffer before reading, the changes are visible immediately. As I understand it, this means that changes are always written to the database instantly, and once the reader has flushed the table buffer, its next read also comes from the database. A flush on the writer side has no effect.
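The reader-side versus writer-side flush behavior can be sketched with another toy model. Again, this is plain Python and only mirrors my interpretation of the test results, not the kernel's actual algorithm: the names `Database` and `TableCacheNode` and the sample rows are hypothetical, the writer is assumed to see its own change at once, and the delayed 15-30s invalidation of other nodes is deliberately left out.

```python
class Database:
    """Stand-in for the single SQL database shared by all nodes."""

    def __init__(self):
        self.rows = {}


class TableCacheNode:
    """Hypothetical node holding an EntireTable-style snapshot of one table."""

    def __init__(self, db):
        self.db = db
        self.snapshot = None  # None means "not loaded yet"

    def read(self, key):
        if self.snapshot is None:
            # First access, or first access after a flush: load the whole table.
            self.snapshot = dict(self.db.rows)
        return self.snapshot.get(key)

    def write(self, key, value):
        # Writes always go straight to the database.
        self.db.rows[key] = value
        if self.snapshot is not None:
            # Assumption: the writing node sees its own change immediately.
            self.snapshot[key] = value

    def flush(self):
        # Drops only this node's snapshot; other nodes are untouched.
        self.snapshot = None


db = Database()
db.rows["EUR"] = "Euro"
writer, reader = TableCacheNode(db), TableCacheNode(db)

before = reader.read("EUR")            # loads the snapshot on the reader node
writer.write("EUR", "Euro (updated)")
writer.flush()                         # writer-side flush does not help the reader
stale = reader.read("EUR")
reader.flush()                         # reader-side flush forces a reload
fresh = reader.read("EUR")
```

The database always holds the new value; what differs is whether the reading node consults it or its own snapshot.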

The change propagation delay described above applies to writing and reading data from code and also from most forms. Interestingly, some forms, such as SysTableBrowser (made editable with an extension), propagate changes instantly. This is probably because the datasets are added to the form a bit differently, but I have not investigated it further. However, it probably means that the data is read from SQL Server on every access, which can be slow if the form is widely used.

For tables with Cache Lookup=FoundAndEmpty, changes are normally propagated instantly. I have also seen cases with a 15s delay, but I was not able to reproduce them reliably, so they could simply be testing errors. For tables with Cache Lookup=None, changes are always propagated instantly. My understanding is that the FoundAndEmpty, Found and NotInTTS caching strategies use a smarter algorithm that propagates changes instantly, but I have not confirmed that with tests.

Conclusion

There are use cases where some delay is acceptable and others where it is not. Whenever you use SysGlobalObjectCache or the EntireTable cache, you have to live with the fact that different nodes will temporarily see different data. Even if the delay is just a second, it can lead to a split-brain problem and data inconsistencies.

So, it’s essential that you know where you can afford it. Caching can significantly improve the performance of frequent operations but use it thoughtfully.
