Matrices and object field relative speed.

SilverShaded

Well-known member
Joined
Mar 7, 2020
Messages
93
Programming Experience
10+
Some background: I'm an engineer, not a professional programmer, so apologies for any dumb questions. I have a program that is calculation intensive and I want to speed it up. My understanding (from reading forums) was that C# is optimised for Jagged Arrays, but my test below seems to say otherwise. If I'm doing something wrong, please let me know. Also, the test which uses the tray object (it could be any moderately complex object) seems too fast; is there a problem with the code?

The relative speeds I'm getting (in debug) are below. (Note: I just realised the times include array initialisation, but that's OK; the arrays will be cleared or re-initialised many times in the program.)

Object/field access 0ms
Jagged Array access 87ms
Normal Array access 12ms

If I don't count initialising the arrays, I get 0, 7 and 7 ms respectively.



C#:
var watch = Stopwatch.StartNew();
double res=0;
int count = 10000;
Tray tray = column[0][4];
for (int i = 0; i < count; i++)
{
    tray = column[0][4];
    tray.T = i;
    res = tray.T;
}
watch.Stop();

var elapsedMs = watch.ElapsedMilliseconds;
Debug.WriteLine("Column Solution Time1 ms: " + elapsedMs.ToString() + " " + res.ToString());
MessageBox.Show("Object Field" + elapsedMs.ToString());


watch = Stopwatch.StartNew();
double res2 = 0;
int count2 = count;
double[][] tray1 = new double[count2][];
for (int i = 0; i < count2; i++)
    tray1[i] = new double[count2];

for (int i = 0; i < count2; i++)
{
    tray1[i][i] = i;
    res2 = tray1[i][i];
}
watch.Stop();

elapsedMs = watch.ElapsedMilliseconds;
Debug.WriteLine("Column Solution Time2 ms: " + elapsedMs.ToString() +" "+ res2.ToString());
MessageBox.Show("Jagged Matrix" + elapsedMs.ToString());

watch = Stopwatch.StartNew();
double res3 = 0;
int count3 = count;
double[,] tray3 = new double[count3,count3];

for (int i = 0; i < count3; i++)
{
    tray3[i,i] = i;
    res3 = tray3[i,i];
}
watch.Stop();

elapsedMs = watch.ElapsedMilliseconds;
Debug.WriteLine("Column Solution Time3 ms: " + elapsedMs.ToString() + " " + res3.ToString());
MessageBox.Show("Matrix" + elapsedMs.ToString());
 
My understanding (from reading forums) was that C# is optimised for Jagged Arrays,
Can you share some links? This is the first I've heard of this.
 
The relative speeds I'm getting (in debug) are
Do not do perf testing against debug. The code is completely unoptimized. The code generated is to help the programmer find bugs faster, not to help the program run faster.
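Expanding on that, here is a minimal sketch of a warm-up-then-measure harness (the names `Bench` and `TimeMs` are mine, not from the thread). Run it in Release without the debugger attached; BenchmarkDotNet is the more rigorous tool if you want proper statistics.

```csharp
using System;
using System.Diagnostics;

static class Bench
{
    // Times an action after warm-up passes so JIT compilation is not
    // included in the measurement. Run in Release, outside the debugger
    // (Ctrl+F5), for meaningful numbers.
    public static long TimeMs(Action body, int warmups = 3)
    {
        for (int i = 0; i < warmups; i++) body();   // warm up JIT and caches
        var sw = Stopwatch.StartNew();
        body();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

Make sure the timed body stores its result somewhere visible (a field, or a checksum you print), otherwise the optimizer may delete the whole loop as dead code.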
 
Note the section above for tray1 should be accessing the jagged array; it seems to have pasted incorrectly.
Can you post the appropriate code in code tags? Chances are that the indexing you had got stripped away because you posted your code as plain text instead of using code tags. The forum software saw some text in square brackets and thought the text was invalid BBCode markup.
 
It all comes down to compile-time vs. runtime computation of what memory needs to be accessed. Anyway, to explain the speed differences:

Field access
When you have code that looks like this
C#:
for (int i = 0; i < count; i++)
{
    tray = column[0][4];
    tray.T = i;
    res = tray.T;
}
On line 3, the compiler sees the access to the jagged array (see below), but the indexes are the fixed values 0 and 4. Normally it would insert the standard runtime bounds-check code, but if it can infer the size of the array, it will remove the runtime bounds checks and just get a reference to the array element directly. Furthermore, if you were compiling for Release rather than Debug, the compiler is usually smart enough to lift line 3 out of the loop, since you are always accessing the same element.

Anyway, the field accesses on lines 4 and 5 are fast because the offset of the field T is computed at compile time, and then values are written to/read from that address.
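To make the hoisting concrete, here is a sketch of the loop with the element lookup done once, up front (`Tray` here is a minimal stand-in for the class in the thread):

```csharp
public class Tray { public double T; }

public static class Hoisting
{
    // Look the element up once, outside the loop: the two jagged-array
    // bounds checks happen a single time instead of on every iteration.
    public static double Run(Tray[][] column, int count)
    {
        Tray tray = column[0][4];   // hoisted: same element every pass
        double res = 0;
        for (int i = 0; i < count; i++)
        {
            tray.T = i;             // plain field store at a fixed offset
            res = tray.T;           // plain field load
        }
        return res;
    }
}
```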

Array access
When you have code that looks like this:
C#:
for (int i = 0; i < count; i++)
{
    tray3[i,i] = i;
    res3 = tray3[i,i];
}
On lines 3 and 4, the array index i,i is computed at runtime. What makes this faster than the jagged array (see below) is that typically there is only one bounds check after the memory offset is computed.

Jagged array access
When you have code that looks like this:
C#:
for (int i = 0; i < count; i++)
{
    tray1[i][i] = i;
    res2 = tray1[i][i];
}
On lines 3 and 4, there are two runtime array bounds checks. The first check is done by tray1[i] to make sure that i is within range for the first dimension. If it is valid, that returns an array for the second dimension. To access an element in that second-dimension array, another bounds check is performed.
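A practical consequence: when you do need to walk a jagged array, fetching the inner array once per row pays for the outer check a single time. A sketch:

```csharp
public static class JaggedRows
{
    // Caching the inner array removes the repeated outer-dimension
    // bounds check and pointer chase from the inner loop.
    public static double SumRowCached(double[][] m)
    {
        double sum = 0;
        for (int i = 0; i < m.Length; i++)
        {
            double[] row = m[i];                // one outer lookup per row
            for (int j = 0; j < row.Length; j++)
                sum += row[j];                  // single inner bounds check
        }
        return sum;
    }
}
```

Using `row.Length` as the inner loop bound also lets the JIT recognise the access as in-range and elide the inner bounds check entirely.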
 

Thanks, I think that is making some sense to me now. I think this may have been a better example, as the thing I was really trying to test was the likely calculation speed. Column here is an indexed object of an indexed object, so not actually an array; sorry I didn't make that clear. In reality the tray object is retrieved once and then has lots of calculations done on some fields. The alternative was to store everything in arrays, either jagged or multidimensional. I think what you're saying is that multidimensional arrays are likely to be faster than a jagged array, especially if I have to re-initialise them fairly frequently? And where possible, storing data as fields on objects may be faster where the object is retrieved once and then has several calculations performed on its fields? Presumably if the object (i.e. tray) had to be retrieved a lot using indexers, it would also be slower, and maybe similar to using an array?

tray = column[0][4];
C#:
for (int i = 0; i < count; i++)
{
    tray.T = i;
    res = tray.T;
}
 
Do not do perf testing against debug. The code is completely unoptimized. The code generated is to help the programmer find bugs faster, not to help the program run faster.

Thanks, yes, I tried it both in Debug and Release; Release was obviously faster but gave similar relative performance.
 
As a famous quote goes: premature optimization is the root of all evil. :)

Often what will get you more speed is a better algorithm. If you can prove using complexity analysis that you are already using the best algorithm, that is the time to look at micro-optimizations like these where you are trying to improve access speed.

As a rule of thumb, code with fewer branches goes faster. Bounds checks involve branches. Modern processors implement pipelining by reading/executing several instructions ahead. When they hit branches, they try to predict which branch will be taken and continue filling the pipeline with those instructions. If the prediction is wrong, the pipeline needs to be cleared, which slows down processing.

Another rule of thumb is that code which makes better use of data locality runs faster. So if you can keep the data that you need to play with close together, you will take advantage of modern processors' fast memory caches. When data is not in the fast memory cache, it needs to be pulled from main memory.
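.NET rectangular arrays are stored row-major, so traversal order decides whether you stream through memory sequentially or stride across it. A sketch of the two orders (both compute the same sum; on matrices much larger than the cache, the row-major version is typically much faster):

```csharp
public static class Locality
{
    // Row-major traversal touches memory sequentially, which suits
    // cache lines and the hardware prefetcher.
    public static double SumRowMajor(double[,] m)
    {
        double sum = 0;
        for (int i = 0; i < m.GetLength(0); i++)
            for (int j = 0; j < m.GetLength(1); j++)
                sum += m[i, j];    // sequential access
        return sum;
    }

    // Column-major traversal strides by a whole row per step and
    // misses the cache far more often on large matrices.
    public static double SumColumnMajor(double[,] m)
    {
        double sum = 0;
        for (int j = 0; j < m.GetLength(1); j++)
            for (int i = 0; i < m.GetLength(0); i++)
                sum += m[i, j];    // strided access
        return sum;
    }
}
```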

Try searching for the video of Bjarne Stroustrup's talk at Microsoft regarding the darling of C and C++ data structures teachers, and the favourite interview question topic: linked lists. He covers why, even though it has a theoretical O(1) complexity for insertion, it gets its pants beaten by the "bad" O(n) array.
 

The algorithm is fairly straightforward: Newton-Raphson, lots of matrix building and inversion, etc. Apart from inverting a tri-diagonal matrix with off-band elements, the other slow steps tend to involve Exp and Log functions. There is probably not a huge amount I can do about those (I am aware of approximate, faster methods for log/exp etc., but haven't really tried them yet). Some simple improvements, like using fields instead of properties and the like, have easily halved the solution time. I've tried solving the matrices with the processor's built-in routines, but the overhead of doing that made it not worth the effort. There are some possible tricks with the thermodynamics routines which may help. The current solution time for a simple distillation column problem is about 400ms; my feeling is that maybe it can be squeezed down to about 300ms, but probably no less than that.

The time is critical, as a full chemical plant simulation can be very iterative, and this 300ms is only one step out of hundreds or thousands.
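For the purely tridiagonal part, the Thomas algorithm solves the system in O(n) rather than forming an inverse; a sketch (note it does not handle the off-band elements mentioned above; those need a bordered/Sherman-Morrison style correction or a banded solver):

```csharp
public static class Tridiagonal
{
    // Thomas algorithm: O(n) solve of a purely tridiagonal system
    //   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]
    // where a[0] and c[n-1] are unused. Stable for diagonally
    // dominant systems; no pivoting is performed.
    public static double[] Solve(double[] a, double[] b, double[] c, double[] d)
    {
        int n = d.Length;
        var cp = new double[n];    // modified super-diagonal
        var dp = new double[n];    // modified right-hand side
        cp[0] = c[0] / b[0];
        dp[0] = d[0] / b[0];
        for (int i = 1; i < n; i++)
        {
            double m = b[i] - a[i] * cp[i - 1];   // pivot after elimination
            cp[i] = (i < n - 1) ? c[i] / m : 0;
            dp[i] = (d[i] - a[i] * dp[i - 1]) / m;
        }
        var x = new double[n];
        x[n - 1] = dp[n - 1];
        for (int i = n - 2; i >= 0; i--)          // back substitution
            x[i] = dp[i] - cp[i] * x[i + 1];
        return x;
    }
}
```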
 
Personally, I would have stopped at Newton-Raphson, too. Looking at Wikipedia, there look to be two other methods, one made during Newton's time and another as recent as 1956, which converge on roots faster. They may be worth investigating.
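As an illustration of the trade-off, here is a sketch of Newton-Raphson next to Halley's method (a classical higher-order variant dating from Newton's era) on f(x) = x² − 2: the higher-order step converges in fewer iterations, but each iteration also needs the second derivative, so it only wins when f'' is cheap.

```csharp
using System;

public static class Roots
{
    // Newton-Raphson: quadratic convergence, needs f and f'.
    public static double Newton(Func<double, double> f, Func<double, double> df,
                                double x, int iters = 20)
    {
        for (int i = 0; i < iters; i++)
            x -= f(x) / df(x);
        return x;
    }

    // Halley's method: cubic convergence, needs f'' as well,
    // so each iteration costs more.
    public static double Halley(Func<double, double> f, Func<double, double> df,
                                Func<double, double> d2f, double x, int iters = 20)
    {
        for (int i = 0; i < iters; i++)
        {
            double fx = f(x), dfx = df(x);
            x -= 2 * fx * dfx / (2 * dfx * dfx - fx * d2f(x));
        }
        return x;
    }
}
```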

I am going to assume that you have already explored some reputable off the shelf numerics libraries, or that the built in Matrix class is too small, or that you have decided not to look at them because you want the joy of writing your own.

Look at places where you can unroll some loops. The current C# compiler and JIT compiler are not yet as good as C/C++ compilers at automatically unrolling loops where optimal. This let me gain some speed in some of my code where performance matters.
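A sketch of what manual unrolling looks like, using a dot product; whether it actually wins depends on the CPU and the JIT version, so measure before keeping it:

```csharp
public static class DotProduct
{
    // 4-way manual unroll: fewer loop-condition branches per element,
    // and four independent accumulators that pipeline better than one
    // serial dependency chain.
    public static double Unrolled(double[] a, double[] b)
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        int limit = a.Length - (a.Length % 4);
        for (; i < limit; i += 4)
        {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < a.Length; i++)      // remainder loop
            s0 += a[i] * b[i];
        return s0 + s1 + s2 + s3;
    }
}
```

Note the four accumulators change the summation order, so floating-point results can differ from the naive loop in the last bits.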

Next, look for opportunities for parallelization. Matrix operations are usually ripe for this, but as you are discovering, there may be a need to change the underlying representation to let each thread access data independently and quickly. In one bit of code I had, it was faster to make two copies of the same matrix, one row-major and another column-major, and then use Span&lt;T&gt; to get more speed. (I was expecting the overhead of making the copy to overwhelm the savings, but I was surprised to find that it was worth it.)
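A sketch of the row-per-thread pattern with `Parallel.For`: each outer iteration owns one row of the result, so no locking is needed.

```csharp
using System.Threading.Tasks;

public static class ParallelMatMul
{
    // Multiplies a (n x k) by b (k x m). Each outer iteration writes
    // only to its own row of c, so rows can run on separate threads
    // with no synchronization.
    public static double[,] Multiply(double[,] a, double[,] b)
    {
        int n = a.GetLength(0), k = a.GetLength(1), m = b.GetLength(1);
        var c = new double[n, m];
        Parallel.For(0, n, i =>
        {
            for (int j = 0; j < m; j++)
            {
                double sum = 0;
                for (int t = 0; t < k; t++)
                    sum += a[i, t] * b[t, j];   // b is read column-wise here;
                c[i, j] = sum;                  // a column-major copy of b helps
            }
        });
        return c;
    }
}
```

For small matrices the thread-scheduling overhead can exceed the gain, so it is worth gating on size before going parallel.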

If you are willing to compile code with the /unsafe flag and use unsafe blocks of code, you can shave off several more microseconds by using pointers. This will involve computing matrix offsets yourself. Personally, if/when computing the offsets becomes too unwieldy, I would rather just write some C++ code in another DLL and P/Invoke the C++ code.
 