Excel using NPOI - Efficiency

b^3 · Apr 6, 2020

New to C# and my first project is for reading Excel (.xlsx) files.
I am reading every cell from a spreadsheet and then formatting to output as a CSV file. Currently, I am just writing to STDOUT.
My first attempt just pulled all cell entries and processed .ToString(). This, understandably, returns formulas, when, what I really want is the value created by the formula.
Therefore, I put additional code to test whether the entry was a value type and then processed through the .ToString() as mentioned above, otherwise used .NumericCellValue.ToString() instead.
However, once this check was added, the processing time increased by 8x (e.g. <2min to ~15min). Is that a reasonable expectation or can this be made more efficient?
Here is my code with comments:

C#:

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using System.Threading.Tasks;

using NPOI.XSSF.UserModel;

using NPOI.HSSF.UserModel;

using NPOI.SS.UserModel;

using System.IO;



namespace ExcelReader

{

    class ReadDataNPOI

    {

        IWorkbook Book;

        ISheet DesiredSheet;

        const string Delimiter = ",";

        DateTime StartTime = DateTime.Now;

        StringBuilder RowContent = new StringBuilder(String.Empty);

        int LNoRow = 8, LNoCol = 1, CustDRow = 7, CustDCol = 1;

        string LNo, CustD, IntD = "DRDZ-050-120-C_HEC", GlobalD = "GlobalD62941.4Z";  //Need to get the IntD and GlobalD from MES.

        string Filename = "";

        string LInfo = "";

      

        public ReadDataNPOI(string path, int Sheet)

        {

            try

            {

                FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);

                Filename = path;

                try

                {

                    Book = new XSSFWorkbook(fs);

                }

                catch

                {

                    Book = null;

                }

                // If reading fails, try to read workbook as XLS:

                if (Book == null)

                {

                    Book = new HSSFWorkbook(fs);

                }





            }

            catch (Exception ex)

            {

                Console.WriteLine(ex.Message, "Excel read error");

                return;

            }

            try

            {

                DesiredSheet = Book.GetSheetAt(Sheet);

            }

            catch (Exception ex)           {

                Console.WriteLine(ex.Message, "ERROR: Sheet read error or not available.");

                return;

            }



        }

        public  string ReadWorkbook(int i, int j)

        {

            //Get L number from spreadsheet.

            if (null != DesiredSheet.GetRow(LNoRow).GetCell(LNoCol) && !string.IsNullOrEmpty(DesiredSheet.GetRow(LNoRow).GetCell(LNoCol).ToString()))

            {

                LNo = DesiredSheet.GetRow(LNoRow).GetCell(LNoCol).ToString();

            }

            else

            {

                if (null != DesiredSheet.GetRow(LNoRow - 1).GetCell(LNoCol) && !string.IsNullOrEmpty(DesiredSheet.GetRow(LNoRow - 1).GetCell(LNoCol).ToString()))

                    LNo = DesiredSheet.GetRow(LNoRow - 1).GetCell(LNoCol).ToString();

            }

            if (LNo == null)

            {

                Console.WriteLine("No L number recorded in file:" + Filename);

                return "";

            }



            //Get CustD from spreadsheet.

            //This checks to see if the L Number is in its proper cell; if so that implies the CustD will also

            //be in its proper cell.  This will hopefully handle the older files and the newer files with L number location correction.

            if (null != DesiredSheet.GetRow(LNoRow).GetCell(LNoCol) && !string.IsNullOrEmpty(DesiredSheet.GetRow(LNoRow).GetCell(LNoCol).ToString()))

            {

                CustD = DesiredSheet.GetRow(CustDRow).GetCell(CustDCol).ToString();

            }

            else

            {

                CustD = "";

            }



            //Console.WriteLine(LNo + Delimiter + IntD + Delimiter + GlobalD + Delimiter + CustD + System.Environment.NewLine);

            LInfo = LNo + Delimiter + IntD + Delimiter + GlobalD + Delimiter + CustD + Delimiter;



            for (int Row=30; Row <= DesiredSheet.LastRowNum; Row++)

            // NOTE:  This is hard coded for now.  Will later be defined in a global variable.

            //  This is where the main data section begins in all workbooks to be processed.

            {

                RowContent.Append(LInfo);



                if (DesiredSheet.GetRow(Row) != null)

                {

                    for (int Col = 0; Col <= DesiredSheet.GetRow(Row).LastCellNum; Col++)

                    {

                      

                        if (null != DesiredSheet.GetRow(Row).GetCell(Col) && !string.IsNullOrEmpty(DesiredSheet.GetRow(Row).GetCell(Col).ToString()))

                        {

                            CurrentCell = DesiredSheet.GetRow(Row).GetCell(Col);

                            // Before checking for IsValueType

                            // RowContent.Append(CurrentCell.NumericCellValue.ToString());

                            // This is now the exception to IsValueType.

                            if (CurrentCell.GetType().IsValueType)

                            {

                                RowContent.Append(CurrentCell.ToString());

                            }

                            else

                            {

                                RowContent.Append(CurrentCell.NumericCellValue.ToString());

                            }

                            if (Col < DesiredSheet.GetRow(Row).LastCellNum-1)

                            {

                                RowContent.Append(Delimiter);

                            }

                            //The next else was used to place all data in one long string.  Took too long.

                            //else

                            //{

                            //    Console.WriteLine(DesiredSheet.GetRow(Row).LastCellNum);

                            //    RowContent.Append(System.Environment.NewLine);

                            //}

                        }



                    }

                    Console.WriteLine(RowContent);

                    RowContent.Clear();

                }

            }

            Book.Close();

            Console.WriteLine("Start: " + StartTime);

            DateTime EndTime = DateTime.Now;

            Console.WriteLine("End: " + EndTime);



            return String.Empty;

        }

    }

}

Any feedback is greatly appreciated.

Skydiver · Apr 7, 2020

Out of curiosity... If you are writing code in C#, why are you using POI? Here's the first few words on Apache POI's web site:
Apache POI - the Java API for Microsoft Documents
If you are using C#, use C# why go to call a Java library?

b^3 · Apr 8, 2020

@Skydiver, valid question. What libraries would you recommend?
Regarding NPOI, here is my best response: The NuGet project description of NPOI is "This project is the .NET port of POI from Apache Foundation. NPOI can read and write xls (Excel 97-2003), xlsx (Excel 2007+) and docx (Word 2007+)".
I started with Microsoft.Office.Interop .Excel, however, this requires Excel to be installed on whatever system is running the program; because of the work environment, this may not be suitable. Also, the interop libraries are far slower than NPOI. What was taking ~35 minutes with the interop libraries took under 2 minutes using NPOI. This, I believe, is partly because the program does not have to start a background instance of Excel when using NPOI.

I have made some improvements since this was first posted and the processing time has improved, but if anyone has an idea on how to improve efficiency I am still interested in your thoughts.

Thanks.

NoUserHere · Apr 8, 2020

What makes you think you need a library?

b^3 · Apr 8, 2020

@Sheepings please excuse my ignorance here, but I'll admit I don't follow you.
Care to elaborate?

NoUserHere · Apr 8, 2020

Not particularly. I think my question was quite profound.

Skydiver · Apr 9, 2020

If you only have to deal with Office documents from 2007 and onwards (.xlsx) for efficiency, I would suggest using OpenXML that comes packaged with the .NET Framework SDK. It is the same library used by Microsoft. It is is quite efficient and has gone through a lot of performance testing (for the scenario that Office consider as critical).

The down side of using the OpenXML (directly) is that although it is very efficient in terms of operations, it is not very efficient in terms of developer time. You will have to learn the intricacies of how Office documents are stored at the lowest levels. There are a lot of "magic numbers", and strange XML heirarchies that are embedded in the documents. To make matters worse, documentation is disjoint. All the information that you need exists, it just may not be where you would expect to find it. (I am speaking from experience.)

So essentially, using OpenXML is like programming in assembly. Lots of power. Lots of efficiency. But also requires lots of developer time.

So if you want something without that steep of a learning curve, several teams at my place of work have been very happy with the speed of Aspose. Aspose also supports the older document formats as well. As someone who have to keep an eye on the servers running those teams code, I can say that I'm pretty happy with the memory and CPU usage of their apps. I've not seen their apps become hogs, unlike other apps who are using two other different libraries whose names escape me right now.

b^3 · Apr 9, 2020

@Skydiver Thank you that was very informative. As time permits I'll give it a go.

Excel using NPOI - Efficiency

b^3

New member

Skydiver

b^3

New member

NoUserHere

Well-known member

b^3

New member

NoUserHere

Well-known member

Skydiver

b^3

New member

Similar threads

Share this page

Latest posts