Array Access Speed

SilverShaded

Well-known member
Joined
Mar 7, 2020
Messages
93
Programming Experience
10+
Hello, I have a program which is heavy on array access, so I've been trying to optimise the speed. This is important for the overall program, which is highly iterative with lots of matrix manipulation. I was hoping to get some criticism of the code below or any further suggestions to speed it up more.

Current run times (in release mode) are listed below in ms; the relative time differences are fairly constant. At the moment option 4 is looking like the best, which represents a significant increase in speed over option 2, the current method in the program. Before I go changing a whole bunch of arrays, are there any obvious improvements possible, better ways to do it, or any problems with the code below that may give odd results?

elapsedMs1 128
elapsedMs2 29
elapsedMs3 228
elapsedMs4 7

Array Access Speed Test:
        [TestMethod]
        [MethodImpl(MethodImplOptions.NoInlining)]
        public unsafe void TestDoubleDoubleArrayWithPts()
        {
            double[,] Array = new double[10000, 1000];
            var watch = Stopwatch.StartNew();
            int UpperBound0 = Array.GetUpperBound(0);
            int UpperBound1 = Array.GetUpperBound(1);

            // Option 1: indexer access, calling GetUpperBound on every loop test.
            // (GetUpperBound returns Length - 1 for a dimension, so all four options
            // stop one short of the last row/column.)
            watch.Restart();
            for (int i = 0; i < Array.GetUpperBound(0); i++)
                for (int y = 0; y < Array.GetUpperBound(1); y++)
                    Array[i, y] = 10D * 10D;

            var elapsedMs1 = watch.ElapsedMilliseconds;

            // Option 2: indexer access with the bounds cached in locals (the current method in the program).
            watch.Restart();
            for (int i = 0; i < UpperBound0; i++)
                for (int y = 0; y < UpperBound1; y++)
                    Array[i, y] = 10D * 10D;

            var elapsedMs2 = watch.ElapsedMilliseconds;

            // Option 3: pointer access, still calling GetUpperBound inside the loops.
            watch.Restart();
            fixed (double* ptr = Array)
            {
                for (int i = 0; i < Array.GetUpperBound(0); i++)
                    for (int y = 0; y < Array.GetUpperBound(1); y++)
                        *(ptr + i * Array.GetUpperBound(1) + y) = 10D * 10D;
            }

            var elapsedMs3 = watch.ElapsedMilliseconds;

            // Option 4: pointer access with cached bounds. Note the row stride used here is
            // UpperBound1 (999) rather than GetLength(1) (1000), so options 3 and 4 write a
            // contiguous block rather than exactly the same cells as options 1 and 2.
            watch.Restart();
            fixed (double* ptr = Array)
            {
                for (int i = 0; i < UpperBound0; i++)
                    for (int y = 0; y < UpperBound1; y++)
                        *(ptr + i * UpperBound1 + y) = 10D * 10D;
            }

            var elapsedMs4 = watch.ElapsedMilliseconds;

            System.Windows.Forms.MessageBox.Show(elapsedMs1.ToString() + " " + elapsedMs2.ToString() + " " + elapsedMs3.ToString() + " " + elapsedMs4.ToString());
        }
 
Interesting, yes, I'm not familiar with Spans. A quick search seems to give mostly examples with one-dimensional arrays; are they as good for two-dimensional?
 
Multidimensional arrays (as opposed to jagged arrays) are laid out contiguously in memory. That is why the pointer code in your fourth option works. A span just represents a contiguous range of memory.

That block could be re-written as:
C#:
fixed (double* ptr = Array)
{
    double * dst = ptr;
    for (int i = 0; i < UpperBound0; i++)
        for (int y = 0; y < UpperBound1; y++)
            *dst++ = 10D * 10D;
}
to show the contiguous nature of the array elements.
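As a side note (not something from the thread), because of that contiguity a single Span<double> can be laid over the whole 2-D array; a minimal sketch, assuming .NET Core 2.1 or later where MemoryMarshal.CreateSpan is available (the names here are just illustrative):
C#:
using System;
using System.Runtime.InteropServices;

double[,] array2D = new double[10000, 1000];

// One span over all elements; Length of a 2-D array is the total element count.
Span<double> span = MemoryMarshal.CreateSpan(ref array2D[0, 0], array2D.Length);

for (int i = 0; i < span.Length; i++)
    span[i] = 10D * 10D;   // same work as the pointer version, without an unsafe block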
 
I'm not familiar with Spans
I mentioned Spans in your other thread where you were also asking about optimizing matrix access:
 
Thanks, yes, it probably didn't fully register with me at the time. I'm kind of slow to pick up new programming tricks and tend to just plough on with the bulk of the code. It's a 'spare time' project, so paid work (chemical engineering/thermodynamics) often interrupts my train of thought.

Actually, I'm quite impressed with how fast C# is with some of these tricks. My program's performance isn't quite up to commercially available program performance yet (e.g. stuff probably written in C), but it is getting there for this particular application.
 
C#:
fixed (double* ptr = Array)
{
    double* dst = ptr;
    for (int i = 0; i < UpperBound0; i++)
        for (int y = 0; y < UpperBound1; y++)
            *dst++ = 10D * 10D;
}
Thanks for that, it's reliably faster than the methods I was trying... and easier to read for me. I may go with this approach.
 
Also, as I noted in the other thread, always try to find a more efficient algorithm before you start going down the path of micro-optimizations. Go beyond the Newton-Raphson method and use something more efficient.

Out of curiosity, are you using simulated annealing as an optimization method? Is that why you have to do multiple iterations of what sounds like a simulation? Why doesn't the simplex method for optimization work for your problem space?

If you also need to do matrix multiplication and the profiler is identifying it as the bottleneck, make note of the comment I made in the other thread about creating two versions of the matrix: one in the standard row-major layout, and a transposed version that is effectively column-major. This will speed things up because the memory accesses will be contiguous and take advantage of the processor's memory cache, as well as its predictors regarding what memory to bring into the cache.

Micro-optimizations may speed up processing, but they slow down the developer because the code becomes much harder to parse and understand. With more complexity, it's easier for bugs to sneak in, and it is harder to keep a big mental picture of all the moving pieces. Without that big-picture understanding, it's harder to see where other optimizations could be made, or where refactoring can be done.
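For illustration (not from the thread), a minimal sketch of that row-major/column-major idea applied to matrix multiplication, with illustrative names; the transposed copy lets the inner loop read both operands sequentially in memory:
C#:
// Sketch only: multiply a (n x k) by b (k x m) via bT, a transposed copy of b,
// so both a[i, p] and bT[j, p] are walked contiguously in the inner loop.
static double[,] Transpose(double[,] b)
{
    int rows = b.GetLength(0), cols = b.GetLength(1);
    var bT = new double[cols, rows];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            bT[j, i] = b[i, j];
    return bT;
}

static double[,] MultiplyUsingTranspose(double[,] a, double[,] b)
{
    int n = a.GetLength(0), k = a.GetLength(1), m = b.GetLength(1);
    double[,] bT = Transpose(b);
    var c = new double[n, m];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
        {
            double sum = 0;
            for (int p = 0; p < k; p++)
                sum += a[i, p] * bT[j, p];   // sequential reads through both rows
            c[i, j] = sum;
        }
    return c;
}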
 

Yes, you are correct, it's a distillation tower simulation. Matrices come into it in various aspects; one of the main industry-standard approaches is an 'inside-out' method where the inner problem is solved with simplified thermodynamics and the outer problem iterates around by updating the inner simplified equations. There are matrices to solve the component balance for every component (so for a crude oil tower that would be ~100 matrices), solved directly by matrix inversion. There is a Jacobian matrix which contains the gradients (evaluated by perturbing the 100 component matrices * the number of variables, up to possibly 100 for a large, complex tower). This Jacobian is then solved by typical Newton-Raphson-type iterations until all the constraints are met. Currently I'm inverting the Jacobian at every step rather than using the Sherman-Morrison update formula.

The slowest step in all of this is actually in the thermodynamics, even though it's the outer loop of the inside-out method, and not in the component matrices or the Jacobian matrix inversion and updating (this may change as I try larger problems).

There is a small function which was consuming about 20% or so of the overall runtime (tower solving time is roughly 300 ms to 1 s, and this will grow with bigger problems). This small function is buried in the thermodynamics and is used throughout the simulation, not just in tower problems. So this is where I'm currently targeting improvement(s).

OK, so 300 ms doesn't sound a lot to solve a tower (I recently heard from an expert that the 1980s mantra was 'a tower an hour'), but it's still significant if there is a simulation with several towers and there are recycles in the flowsheet from one tower to the next and back.
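For reference (not from the thread), the Sherman-Morrison formula mentioned above is what makes rank-one Jacobian updates cheap: rather than re-inverting, the existing inverse is patched after a rank-one change,

$$(A + uv^{\mathsf T})^{-1} = A^{-1} - \frac{A^{-1}uv^{\mathsf T}A^{-1}}{1 + v^{\mathsf T}A^{-1}u}, \qquad 1 + v^{\mathsf T}A^{-1}u \neq 0,$$

so each quasi-Newton (Broyden-style) step costs roughly O(n²) instead of the O(n³) of a fresh inversion.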


Slow Step:
        static public double CalcAmix(Components cc, double[] XY, double[] ai, double[,] kijm, out double[,] Aij, out double[] sumAi)
        {
           // Debug.Print(": CalcAmix");
            //
            // T is in units of Kelvin; Tc in Kelvin; Pc in Pa
            //
            int n = cc.components.Count;
            double amix = 0;
            Aij = new double[n,n];
            sumAi = new double[n];

            // Build the symmetric Aij matrix: ai[i] on the diagonal, sqrt(ai*aj)*(1-kij)
            // above it, and a copy of the already-computed upper triangle below it.
            for (int i = 0; i < n; i++)
            {
                for (int j = 0; j < n; j++)
                {
                    if (i == j)
                        Aij[i, j] = ai[i];
                    else if (i > j)
                        Aij[i, j] = Aij[j, i];
                    else
                        Aij[i, j] = Math.Sqrt(ai[i] * ai[j]) * (1 - kijm[i, j]);
                }
            }

            // sumAi[i] = sum over j of XY[j] * Aij[i, j]; amix = sum over i of XY[i] * sumAi[i].
            for (int i = 0; i < n; i++)
            {
                for (int j = 0; j < n; j++)
                    sumAi[i] += XY[j] * Aij[i, j];
               
                amix += sumAi[i] * XY[i];
            }

            return amix;
        }
 
Yes, I recall that my father used to say "a tower an hour", but he used to say it with such glee. That's because he was a chemical engineer working with oil refineries since the 60s, so bringing the simulation down to an hour with a lot of fidelity was a major improvement over their overnight runs on the mainframe with just rough numbers. I also recall him having tons of fun doing optimization problem solving with Lotus 1-2-3 and a 1980s PC clone, and what a time difference adding in a math co-processor made to the calculations.
 

It has moved on a long way, and these days simulation is almost treated like a simple calculator; instant, accurate (unrealistically so) answers are often demanded.

I am intrigued that there may be better alternatives to Newton-Raphson; I hadn't really thought about it much. NR can have a tendency to diverge on this problem without good initial guesses. I spent a lot of time improving the original published algorithm (which often requires good guesses for starting conditions) so that guesses are not required; they are now estimated by the program (this is a (small) improvement over several commercial simulators). However, a more robust method than NR (or a better version of NR than my own) wouldn't do any harm at all.
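(For what it's worth, and not something raised in the thread: one standard way to make Newton-Raphson more robust against divergence from poor starting points is a damped/line-search step, $x_{k+1} = x_k - \lambda_k J(x_k)^{-1} F(x_k)$ with $\lambda_k \in (0, 1]$ chosen so that $\lVert F(x_{k+1}) \rVert < \lVert F(x_k) \rVert$, falling back to the full step $\lambda_k = 1$ near the solution.)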
 
Found time to try it out; it's no longer the slowest step.

Actually, calling Math.Sqrt is now my slowest step; it's called 35,700 times, consuming 10% of the total run time... so maybe I'm running out of significant potential improvements...

Speeded up version:
        static public unsafe double CalcAmix(Components cc, double[] XY, double[] ai, double[,] kijm, out double[,] Aij, out double[] sumAi)
        {
            // Debug.Print(": CalcAmix");
            //
            // T is in units of Kelvin; Tc in Kelvin; Pc in Pa
            //
            // Pointer-based version of the CalcAmix function above.

            int n = cc.components.Count;
            double amix = 0;
            Aij = new double[n,n];
            sumAi = new double[n];
            int UpperBound0 = Aij.GetUpperBound(0);
            int UpperBound1 = Aij.GetUpperBound(1);

            // Fill Aij by walking the contiguous 2-D array with a single moving pointer.
            fixed (double* ptr = Aij, aiPtr = ai, kijmPtr = kijm)
            {
                double* element = ptr;
                double* kijelement = kijmPtr;
                double* aielement = aiPtr;
                for (int i = 0; i <= UpperBound0; i++)
                {                
                    for (int j = 0; j <= UpperBound1; j++)
                    {
                        if (i == j)
                            *element = *aielement;
                        //else if (i > j)
                        //    *(ptr + i * UpperBound1 + j) = *(ptr + j * UpperBound1 + i);
                        else
                            *element = Math.Sqrt(*(aiPtr + i) * *(aiPtr + j)) * (1 - *kijelement);

                        element++;
                        kijelement++;
                    }
                   
                    aielement++;
                }
            }

            // Accumulate sumAi and amix with sequential pointer walks over Aij and XY.
            fixed (double* ptrAij = Aij, ptrXY = XY, ptrsumAI = sumAi)
            {
                double* elementAij = ptrAij;
                double* elementsumAI = ptrsumAI;
                double* elementXy = ptrXY;

                for (int i = 0; i < n; i++)
                {                
                    for (int j = 0; j < n; j++)
                    {
                        *elementsumAI += *elementXy * *elementAij;
                        elementXy++;
                        elementAij++;
                    }

                    amix += *elementsumAI * *(ptrXY+i);
                    elementXy = ptrXY;
                    elementsumAI++;
                }
            }
            return amix;
        }
 
I think that there are opportunities for parallelism to get more speed out of it, because it looks like your outputs don't overlap. I'll look at it a bit more when I'm off work.

Unless you can roll your own faster lookup-based square root function, I don't think you can get square root to go any faster.

But in the meantime, I suspect that there is something wrong with the elements that you are accessing to pass into the square root:
C#:
Math.Sqrt(*(aiPtr + i) * *(aiPtr + j)) * (1 - *kijelement)
translates to array notation:
C#:
Math.Sqrt(Aij[0,i] * Aij[0,j]) * (1 - kijm[i,j])
I doubt that you always want just the first row of the Aij matrix.
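On the parallelism point, here is a minimal sketch (not the thread's code) of what that could look like using the managed-index loops of the original CalcAmix; it assumes ai.Length equals the component count, and it builds the outputs in locals because out parameters can't be captured by a lambda:
C#:
using System;
using System.Threading.Tasks;

// Sketch only: each iteration of i writes only row i of aij and rowSums[i], so rows don't collide.
static double CalcAmixParallel(double[] XY, double[] ai, double[,] kijm,
                               out double[,] Aij, out double[] sumAi)
{
    int n = ai.Length;
    var aij = new double[n, n];
    var rowSums = new double[n];

    Parallel.For(0, n, i =>
    {
        for (int j = 0; j < n; j++)
            aij[i, j] = (i == j)
                ? ai[i]
                : Math.Sqrt(ai[i] * ai[j]) * (1 - kijm[i, j]);

        double rowSum = 0;
        for (int j = 0; j < n; j++)
            rowSum += XY[j] * aij[i, j];
        rowSums[i] = rowSum;
    });

    Aij = aij;
    sumAi = rowSums;

    // The final reduction is cheap, so it stays sequential.
    double amix = 0;
    for (int i = 0; i < n; i++)
        amix += rowSums[i] * XY[i];
    return amix;
}
Whether this wins depends on n; for small component counts the Parallel.For overhead can outweigh the work per row.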
 
Yes, you're right, it looks odd now you mention it. When I tested it the results looked OK; it's late here, so I'll check it tomorrow. If the results are actually correct, maybe that routine can be simplified; the final matrix should be symmetrical around the diagonal, so it may be some odd thing in the math. Could just be wrong though. I should be able to cut the sqrt calls in half actually, as it's symmetrical, but I was too knackered to get it working today. Need to think about it with a fresh head on.

Parallel might be interesting!

I can't reduce the number of calls to this function for pure tower simulations, but definitely can for streams and general flash calculations.
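A minimal sketch of that "cut the Sqrt calls in half" idea (not the thread's final code), using the same variable names as CalcAmix and assuming kijm is symmetric, which the original Aij[i, j] = Aij[j, i] branch already implied:
C#:
// Sketch only: one Math.Sqrt per (i, j) pair instead of one per off-diagonal element.
for (int i = 0; i < n; i++)
{
    Aij[i, i] = ai[i];                                        // diagonal
    for (int j = i + 1; j < n; j++)
    {
        double a = Math.Sqrt(ai[i] * ai[j]) * (1 - kijm[i, j]);
        Aij[i, j] = a;                                        // upper triangle
        Aij[j, i] = a;                                        // mirrored lower triangle
    }
}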
 
Solution
My mistake. I was confused by your naming convention. So on closer look now that I'm off work, this:
C#:
Math.Sqrt(*(aiPtr + i) * *(aiPtr + j)) * (1 - *kijelement)
translates over to:
C#:
Math.Sqrt(ai[i] * ai[j]) * (1 - kijm[i,j])
which now looks more reasonable.
 