I wanted to share my recent experience of optimising workflow performance on a SharePoint 2010 environment.

I’m running a number of Nintex workflows for one of my clients, one of which is very heavy on server resources whilst it provisions a team site from a complex template amongst other things.

The challenge is multiplied by concurrency: not only does the workflow take a long time to run, it also needs to run potentially hundreds of times per day.

My goal was to increase the number of instances that the environment could successfully process each day.

“To the SharePoint lab environment, Batman…”

With so many variables that could impact performance, a structured series of tests was in order, focused on a number of key areas:

  • How to simulate load
  • Make sure the SQL environment is well fed and watered
  • Workflow settings and topology

How to simulate load

This is the easy part.  In my situation I’m picking up XML files from an email-enabled list and “triaging” these to process them accordingly.  One particular format is responsible for generating the workflows that represent the high load.

For me, it was a simple case of writing a basic PowerShell script to generate n XML files, each with some slightly different values.
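
Something along these lines does the job.  This is a minimal sketch only: the SiteRequest element names and the output folder are made up, as the real schema belongs to the client’s triage process.

# Generate n test XML files, each with slightly different values.
$outputFolder = "C:\LoadTest"
New-Item -ItemType Directory -Path $outputFolder -Force | Out-Null

1..50 | ForEach-Object {
    # Vary a couple of values so each workflow instance is unique
    $xml = @"
<?xml version="1.0" encoding="utf-8"?>
<SiteRequest>
  <Title>Load Test Site $_</Title>
  <RequestId>$([guid]::NewGuid())</RequestId>
</SiteRequest>
"@
    $xml | Out-File -FilePath (Join-Path $outputFolder ("request_{0}.xml" -f $_)) -Encoding UTF8
}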

Uploading these manually all at once to my library was sufficient to replicate a large spike in load.  It’s probably worth pointing out that in the real world I would receive these high load requests in batches so it’s considered reasonable for these to take a long time to process.
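
If you would rather not drag and drop the files by hand, the upload can be scripted too.  Again just a sketch; the site URL and library name below are placeholders for the real email-enabled library.

# Upload every generated XML file to the library in one hit
# (run from the SharePoint 2010 Management Shell).
$web = Get-SPWeb "http://sharepoint/sites/provisioning"
$library = $web.Lists["Incoming Requests"]
Get-ChildItem "C:\LoadTest\*.xml" | ForEach-Object {
    $stream = $_.OpenRead()
    $library.RootFolder.Files.Add($_.Name, $stream, $true) | Out-Null
    $stream.Close()
}
$web.Dispose()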

First, the control.  My focus is not on directly comparing metrics between the live environment and my development lab, but on the relative change in performance between test runs.

For the environment I went with a basic three-server farm:

  • WFE
  • APP
  • SQL

This is a virtual lab dedicated to me, so I can ignore external load from other users impacting the tests.  To avoid wondering whether physical resources were a significant factor (as I’m focusing on logical optimisations), I gave each server a generous 32GB RAM and 4 virtual CPU cores.

And yes, I’m aware that more is not always better with virtualisation, given resource contention and the scheduling overhead that comes with a larger number of vCPUs.  Let’s just pretend that this is “OK” for this test 😉

A batch of 50 items was generated and then uploaded to my document library, which in turn kicked off a workflow for each item.

Control results:

  • SQLIO (20GB test file, 120 second 64KB random access test)

    Drive      IOPs/sec   MBs/sec   Max latency (ms)
    Standard   1,069.1    66.81     753

  • Batch processing time

[Image performance_1: batch processing time for the control run]

Make sure the SQL environment is well fed and watered

There are a number of more comprehensive performance improvements that can be considered in addition to the list below.  I have simply picked a handful of optimisations, based on lots of reading, that just “make sense”.  I won’t go into detail on each one as that’s a tangent I may never make it back from…

The usual disclaimer applies: I am not a SQL Server ninja and fall into the category of knowing enough to be dangerous.  The below works for me, so it may well work for you too.

Firstly, give your disks a fighting chance to service SQL as quickly as possible.  SQL Server works in extents of 8 pages (8KB each, i.e. 64KB), so if our disk can work with 64KB blocks of data, that’s going to help.  Note that if you are a “next, next, nexter” on wizards, you will have ended up with the default 4KB block size when formatting your data drive.  That’s not always a bad thing, it just is in this case 🙂

To verify what you have in place, navigate to your SQL data drive and create a text file called “1.txt”.

Next, edit this and just insert the number 1.  Save the file, then look at the size on disk.  For a file containing a single byte of information, the size on disk effectively reveals the current block (allocation unit) size for that drive.
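
If you prefer a command to the text-file trick, fsutil reports the same information.  The drive letter below is just whichever drive currently holds your SQL data.

# Run from an elevated prompt; "Bytes Per Cluster" is the allocation unit size.
# A value of 4096 means you are on the default 4KB blocks.
fsutil fsinfo ntfsinfo D: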

[Image performance_2: size on disk of the 1-byte test file, revealing the current allocation unit size]

Now through the power of virtualisation, spin up a new drive but this time don’t format it right away.  We want to format this one with a 64KB block size and also set a partition alignment of 1,024KB.

You can do this simply with diskpart, as below.  In my example, we want to optimise disk 3 for SQL.

C:\>diskpart
 Microsoft DiskPart version 6.0.6001
 Copyright (C) 1999-2007 Microsoft Corporation.
 On computer: ASPIRINGGEEK
 DISKPART> list disk
   Disk ###  Status      Size     Free     Dyn  GPT
   --------  ----------  -------  -------  ---  ---
   Disk 0    Online       186 GB      0 B
   Disk 1    Online       100 GB      0 B
   Disk 2    Online       120 GB      0 B
   Disk 3    Online       150 GB   150 GB

DISKPART> select disk 3
Disk 3 is now the selected disk.
DISKPART> create partition primary align=1024
DiskPart succeeded in creating the specified partition.
DISKPART> assign letter=F
DiskPart successfully assigned the drive letter or mount point.
DISKPART> format fs=ntfs unit=64K label="SQL Data" nowait quick

Now that we have a new potential data disk, how does this stack up with disk IO? Enter SQLIO again.
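
For reference, a SQLIO call along these lines will give you the 120 second, 64KB random access test used above (the flags and parameter file shown are indicative rather than gospel):

# param.txt contains a single line describing the 20GB test file, e.g.:
#   F:\sqlio_test.dat 2 0x0 20480
# -kR = read test, -s120 = 120 seconds, -frandom = random access,
# -o8 = 8 outstanding IOs, -b64 = 64KB IOs, -LS = report latency
sqlio -kR -s120 -frandom -o8 -b64 -LS -Fparam.txt

Repeat with -kW if write performance is also of interest.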

Have a look at those results (remembering SQL has an unhealthy obsession with 64KB).

Drive       IOPs/sec   MBs/sec   Max latency (ms)
Standard    1,069.1    66.81     753
Optimised   1,258      78.02     540

Holy disks on steroids, Batman!

Again, to keep our experiment simple, let’s just assume that I love putting all data, such as the search DBs, on the same data disk.  Of course I would recommend separate disks for logs and TempDB, purely because they have different IOPS requirements for SharePoint and a different I/O profile (e.g. logs are write intensive), which also keeps your SAN guys loving you.

Now let’s go a little further and look at our SQL instance itself…

There are a few good-practice things to configure, such as:

  • Set the maximum memory for the instance
  • Set the max degree of parallelism to 1 (compulsory for SharePoint 2013, strongly recommended for 2010)
  • Enable instant file initialisation
  • Set the fill factor to 80
  • Create a 1-1 relationship between the number of CPU cores and tempDB files.  Note that there is some debate on how much impact this really has unless you have separate disks in place for each file etc.

I’ll leave the above items as homework as there are plenty of other sites which cover the steps in detail.
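
That said, here is a rough sketch of the sp_configure based items, assuming the SQL Server PowerShell tools (Invoke-Sqlcmd) are available and that the values shown suit your environment:

# Values are examples only - size max server memory to leave headroom for the OS.
# Instant file initialisation is a Windows rights grant ("Perform volume maintenance tasks")
# for the SQL service account, and extra tempdb files are added via ALTER DATABASE,
# so neither appears here.
Invoke-Sqlcmd -ServerInstance "SQL" -Query @"
EXEC sp_configure 'show advanced options', 1;      RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 28672; RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 1;  RECONFIGURE;
EXEC sp_configure 'fill factor (%)', 80;           RECONFIGURE;  -- needs a SQL restart to take effect
"@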

Now that we have given SQL a fighting chance, let’s run another batch of tests:

[Image performance_3: batch processing time after the SQL optimisations]

As you can see, we have made a significant impact on the average processing time of each workflow.  The initial batch is also much more stable, with fewer performance spikes.

Workflow settings and topology

We have a number of built-in settings at our disposal:

  • Workflow postpone threshold: the maximum number of workflow events that can be processed at once against a content database; anything above this is postponed to the Workflow Timer Service (note that paused workflows don’t count towards this).
    STSADM: stsadm -o setproperty -pn workflow-eventdelivery-throttle
    PowerShell: Set-SPFarmConfig -WorkflowPostponeThreshold

  • Workflow batch size: the number of postponed workflow work items picked up in a single run of the workflow timer job on each server.
    STSADM: stsadm -o setproperty -pn workflow-eventdelivery-batchsize
    PowerShell: Set-SPFarmConfig -WorkflowBatchSize

  • Workflow timer job frequency: the length of time between workflow batch executions.
    STSADM: stsadm -o setproperty -pn job-workflow
    PowerShell: Set-SPTimerJob -Identity job-workflow

The key thing to note is that everyone’s environment is genuinely different; based on what your workflows are doing, you will need to tune these values appropriately.

For me, my focus was on getting as many workflows to run as quickly as possible, and in parallel.

I also had to consider how to involve other servers in the farm to share the burden of running the workflows.  The problem with the batch size settings is that your servers are greedy and want to be a hero to their fellow servers.

Firstly, we can set the threshold to be as low as possible (1) and increase the Workflow Timer Service job frequency from the 5 minute default to run every minute.
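
In PowerShell terms (from the SharePoint 2010 Management Shell), that looks roughly like this.  The schedule string follows the standard SPSchedule recurrence format, and I revisit the threshold value itself later in the testing.

# Postpone (almost) everything to the Workflow Timer Service...
Set-SPFarmConfig -WorkflowPostponeThreshold 1
# ...and run the workflow timer job every minute instead of the default five.
Set-SPTimerJob -Identity job-workflow -Schedule "Every 1 minutes between 0 and 59"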

Why is this important? It reduces the number of workflows processed directly by the WFE server and pushes them to be scheduled by the Workflow Timer Service instead.

By deferring as much as possible to the workflow timer service we can control which servers run this.  One additional objective is to keep the WFE server sitting on its hands ready and waiting to fulfil user requests without being burdened with bulk workflows.

If you then stop the Workflow Timer Service on the WFE, the WFE only processes the first few synchronous events before everything else is deferred to the timer service on the remaining servers.
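
A sketch of doing that in PowerShell (the server name is a placeholder; the same thing can be done from Services on Server in Central Administration):

# Stop the workflow timer service instance on the WFE so postponed workflows
# are only picked up by the APP servers.
Get-SPServiceInstance -Server "WFE01" |
    Where-Object { $_.TypeName -eq "Microsoft SharePoint Foundation Workflow Timer Service" } |
    Stop-SPServiceInstance -Confirm:$false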

Running this through with a couple of APP servers, I started to see good results.  The not-so-obvious issue is that they were inconclusive: the results were great, but the improvement in processing time was not linear with the number of APP servers.

How far could I go? I have my own virtual host with 192GB of RAM that was starting to write its CV in search of more fulfilling opportunities.

Enter some stubbornness with a splash of OCD…

I expanded my matrix to ramp up to 4 APP servers and also tried more combinations of workflow threshold settings.

                  Workflow threshold
                  1      10     100    1000
1 APP server      Test   Test   Test   Test
2 APP servers     Test   Test   Test   Test
3 APP servers     Test   Test   Test   Test
4 APP servers     Test   Test   Test   Test

The results quickly favoured a threshold of 10 for me.  Note that this will almost certainly differ on another environment with different workflow profiles.

Firstly, we can look at the processing time per test instance.  There is enough fluctuation here to make this graph look messy…

[Image performance_4: processing time per test instance]

If we instead shift our focus to the overall elapsed time, we see a graph that is much easier to follow:

[Image performance_5: overall elapsed time per configuration]

This was a worthwhile activity and resulted in a more optimal configuration.  I was initially worried by the tapering off of improvement between 3 and 4 APP servers, but this corresponds nicely with where Microsoft pitches the approximate point of diminishing returns.

Conclusion

As you can hopefully see, the challenge is working out where your particular bottleneck is.  Fixing it is typically straightforward.

My test workflow went from an average of 4 minutes to just over 1.5 minutes, simply from configuration changes and adding some application servers to the topology to take advantage of them.

There are plenty of other ways to optimise things (like changing the approach to how the workflows themselves are written), but in this case I wanted to understand and demonstrate how configuration and topology alone can influence your workflow performance on SharePoint 2010.

Links

I found the following people/sites useful in the above experience.  It’s worth checking them out for more information:
