SQL Azure, Dynamics NAV & Data I/O Issues

When I write a blog post (and I know I should write more often), it's usually to share a tip or to give my opinion on something.

This blog is more of a question, or a call to share knowledge of a piece of technology that is not used by many.

About four months ago my largest customer moved their 200GB database to SQL Azure, after a lot of testing and careful consideration.

I’m not going to post about the business value of the Azure platform and the benefits of running on it. I’m going to share some issues we ran into, and I’m looking for others with experience who might help solve them.

We have very good performance running on P4. This gives us 500 DTUs and we typically stay below 50% usage.

We could probably switch to P2, were it not that we have moments where DTU usage goes up to 80%; to avoid these peaks we need some time to optimise processes.

Cost & ‘Auto’ Scaling

To save on Azure cost (P4 is pretty expensive) we use a PowerShell script to scale down to P2 after business hours and to P1 during the night and at weekends.

This brings the Azure SQL cost to 800 euro/month = 9,600 euro/year = 48,000 euro over five years.
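The scale-down logic itself is simple. As a sketch (in Python rather than the PowerShell we actually use, and with the exact switchover hours as my assumptions, since the post only says "after business hours", "night" and "weekends"):

```python
from datetime import datetime

def target_tier(now: datetime) -> str:
    """Pick the Azure SQL service objective for a given moment.

    Illustrative schedule only: P1 at night and at weekends,
    P2 in the evening after business hours, P4 otherwise.
    """
    if now.weekday() >= 5:      # Saturday or Sunday
        return "P1"
    if 8 <= now.hour < 18:      # business hours
        return "P4"
    if 18 <= now.hour < 22:     # evening, after business hours
        return "P2"
    return "P1"                 # night

# The real job then applies the tier, e.g. via the Azure PowerShell
# cmdlet Set-AzureRmSqlDatabase with -RequestedServiceObjectiveName.
print(target_tier(datetime(2018, 6, 4, 10)))  # a Monday morning -> "P4"
```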

Yes, this is a lot of money, but so is setting up an on-premises SQL cluster with maintenance, backups and the occasional hardware issue.

We expect to be able to scale down to P2 which will cut the cost down a bit more.

The Problem

So far, so good. We’ve been running for four months and on average everyone is happy, except for a few moments. That is when the shit hits the fan, the phone starts ringing and users get angry at us.

The system is slow! Fix it!

Funnily enough, when this happens we don’t see slow queries on the database. DTU usage does not go up, and even the processes on the Job Queue don’t take longer than average. Only the end users complain.

We have four NAV service tiers on two virtual machines. The users and the Job Queue are on different machines. The other two service tiers we need for interfacing with different credential types, to support some older add-ons we are using that don’t support AAD.

When, during performance issues, we ask a user to switch to another service tier on a different machine, performance is fine. No problems.

Then we restart the service tier and everybody restarts NAV and works happily ever after.

The Telemetry

Unfortunately we don’t have a lot of telemetry when this happens. The only consistent number is what SQL Azure calls “Data I/O Percentage”.

Normally (like 99.5% of the time) this number does not top 0.5%.

However, when users call us with performance issues we see this going up to 25%.
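Given how far the incident numbers sit above the baseline, this metric is easy to watch automatically. A minimal sketch, assuming you sample Azure SQL’s `sys.dm_db_resource_stats` view (which exposes the metric as `avg_data_io_percent`); the query text and the 5% alert threshold are my own illustration, not our actual monitoring:

```python
# Hypothetical query against the Azure SQL DMV that backs the
# "Data I/O Percentage" chart in the portal.
DATA_IO_QUERY = """
SELECT end_time, avg_data_io_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
"""

def data_io_spike(samples: list[float], threshold: float = 5.0) -> bool:
    """Return True when any recent avg_data_io_percent sample exceeds threshold.

    Normal operation in our case sits below 0.5%, incidents reach ~25%,
    so anything above a few percent is worth an alert.
    """
    return any(s > threshold for s in samples)
```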

Most often this happens when we scale down the database, but sometimes, not often, it happens in the middle of the day.

In some cases the memory usage of the users’ service tier is quite high, like 90% or 95%, but we’ve also seen it happen at only 75% or 80%.

Any Ideas?

We discussed this issue with our SQL Azure experts and with Microsoft in Redmond and Lyngby, but nobody seems to have a real answer as to what causes this.

If anyone has an idea, you would most definitely make my day.

The situation is under control. We have procedures in place that allow the users to switch service tiers when this happens, but it’s just annoying and bad for the “reputation” of NAV in this company.

We run NAV 2018 build 33015 with 80 concurrent Windows client users and a few hundred interfaces, portals and jobs that pound the system pretty hard.

4 Comments

  1. The simplest comment I can make on this is that right now I see NAV on Azure as being to business what NAV 4.00 on SQL was in 2005 🙂


  2. kriki says:

    It looks to me like a few queries read (past tense) a lot of data, kicking other, useful, data out of memory. So SQL needs to read that data again, going from responses of a few milliseconds (or less) to some tens of milliseconds.
    Result: slower responses and more reads.


  3. Marek Mengel says:

    A lot of things can happen (in Azure) when the Change Log is enabled, the Change Log Entry table is in use and its size is more than X GB. So my suggestion is to move Change Log Entries older than 3 months to another database and use that as an external archive. A second thing is to enable an elastic database job for optimising table indexes.

