Resolved How to get values from elements in JSON without indexing?

WeyardWiz

Member
Joined
Oct 23, 2020
Messages
23
Programming Experience
3-5
I have the following code that extracts JSON element values and outputs them to CSV:

C#:
public static void Json_to_Csv(string jsonInputFile, string csvFile)
{
    using (var p = new ChoJSONReader(jsonInputFile).WithJSONPath("$..readResults")) // "readResults": [
    {
        using (var w = new ChoCSVWriter(csvFile).WithFirstLineHeader())
        {
            w.Write(p
                .Select(r1 =>
                {
                    var lines = (dynamic[])r1.lines;
                    return new
                    {
                        FileName = jsonInputFile,
                        Page = r1.page,
                        PracticeName = lines[2].text,
                        OwnerFullName = lines[4].text,
                        OwnerEmail = lines[6].text,
                    };
                }));
        }
    }
}

csv output:

File Name,Page,Practice Name,Owner Full Name,Owner Email
file1.json,1,Some Practice Name,Bob Lee,Bob@someemail.com

Currently there is no other contextual information on each item to reference it by, so the only way is by indexing, e.g. lines[2].

This works for now, but I may have other JSON files that have an extra field, in which case the values pulled will be wrong.

To address this scenario, how can I pull the values contextually instead of indexing the lines?

I've tried
C#:
PracticeName = lines["Practice Name"].text

but I get a "Cannot implicitly convert type string to int" error.


file1.json sample:

JSON:
{
  "status": "succeeded",
  "createdDateTime": "2020-10-22T19:35:35Z",
  "lastUpdatedDateTime": "2020-10-22T19:35:36Z",
  "analyzeResult": {
    "version": "3.0.0",
    "readResults": [
      {
        "page": 1,
        "angle": 0,
        "width": 8.5,
        "height": 11,
        "unit": "inch",
        "lines": [        
          {
            "boundingBox": [
              0.5016,
              1.9141,
              2.5726,
              1.9141,
              2.5726,
              2.0741,
              0.5016,
              2.0741
            ],          
            "text": "Account Information",
            "words": [
              {
                "boundingBox": [
                  0.5016,
                  1.9345,
                  1.3399,
                  1.9345,
                  1.3399,
                  2.0741,
                  0.5016,
                  2.0741
                ],
                "text": "Account",
                "confidence": 1
              },
              {
                "boundingBox": [
                  1.3974,
                  1.9141,
                  2.5726,
                  1.9141,
                  2.5726,
                  2.0741,
                  1.3974,
                  2.0741
                ],
                "text": "Information",
                "confidence": 1
              }
            ]
          },
          {
            "boundingBox": [
              1.7716,
              2.4855,
              2.8793,
              2.4855,
              2.8793,
              2.6051,
              1.7716,
              2.6051
            ],
            "text": "Practice Name",
            "words": [
              {
                "boundingBox": [
                  1.7716,
                  2.4855,
                  2.3803,
                  2.4855,
                  2.3803,
                  2.6051,
                  1.7716,
                  2.6051
                ],
                "text": "Practice",
                "confidence": 1
              },
              {
                "boundingBox": [
                  2.4362,
                  2.4948,
                  2.8793,
                  2.4948,
                  2.8793,
                  2.6051,
                  2.4362,
                  2.6051
                ],
                "text": "Name",
                "confidence": 1
              }
            ]
          },
          {
            "boundingBox": [
              2.9993,
              2.5257,
              4.7148,
              2.5257,
              4.7148,
              2.714,
              2.9993,
              2.714
            ],
            "text": "Some Practice Name",
            "words": [
              {
                "boundingBox": [
                  3.0072,
                  2.5385,
                  3.6546,
                  2.5284,
                  3.6516,
                  2.7131,
                  3.0105,
                  2.712
                ],
                "text": "Some",
                "confidence": 0.984
              },
              {
                "boundingBox": [
                  3.6887,
                  2.5281,
                  4.2112,
                  2.5262,
                  4.2028,
                  2.7159,
                  3.6854,
                  2.7132
                ],
                "text": "Practice",
                "confidence": 0.986
              },
              {
                "boundingBox": [
                  4.2453,
                  2.5263,
                  4.7223,
                  2.5297,
                  4.7091,
                  2.72,
                  4.2366,
                  2.7161
                ],
                "text": "Name",
                "confidence": 0.986
              }
            ]
          },
          {
            "boundingBox": [
              1.6116,
              2.9999,
              2.8816,
              2.9999,
              2.8816,
              3.1158,
              1.6116,
              3.1158
            ],
            "text": "Owner Full Name",
            "words": [
              {
                "boundingBox": [
                  1.6116,
                  3.0039,
                  2.1026,
                  3.0039,
                  2.1026,
                  3.1157,
                  1.6116,
                  3.1157
                ],
                "text": "Owner",
                "confidence": 1
              },
              {
                "boundingBox": [
                  2.1541,
                  2.9999,
                  2.3784,
                  2.9999,
                  2.3784,
                  3.1158,
                  2.1541,
                  3.1158
                ],
                "text": "Full",
                "confidence": 1
              },
              {
                "boundingBox": [
                  2.4384,
                  3.0052,
                  2.8816,
                  3.0052,
                  2.8816,
                  3.1155,
                  2.4384,
                  3.1155
                ],
                "text": "Name",
                "confidence": 1
              }
            ]
          },
          {
            "boundingBox": [
              2.9993,
              3.0242,
              3.6966,
              3.0242,
              3.6966,
              3.2125,
              2.9993,
              3.2014
            ],
            "text": "Bob Lee",
            "words": [
              {
                "boundingBox": [
                  3.0063,
                  3.0303,
                  3.3439,
                  3.0349,
                  3.3461,
                  3.2125,
                  3.007,
                  3.2081
                ],
                "text": "Bob",
                "confidence": 0.987
              },
              {
                "boundingBox": [
                  3.3788,
                  3.0349,
                  3.6931,
                  3.0326,
                  3.697,
                  3.2121,
                  3.3813,
                  3.2125
                ],
                "text": "Lee",
                "confidence": 0.983
              }
            ]
          },
          {
            "boundingBox": [
              1.945,
              3.5063,
              2.8748,
              3.5063,
              2.8748,
              3.6261,
              1.945,
              3.6261
            ],
            "text": "Owner Email",
            "words": [
              {
                "boundingBox": [
                  1.945,
                  3.5143,
                  2.4359,
                  3.5143,
                  2.4359,
                  3.6261,
                  1.945,
                  3.6261
                ],
                "text": "Owner",
                "confidence": 1
              },
              {
                "boundingBox": [
                  2.4874,
                  3.5063,
                  2.8748,
                  3.5063,
                  2.8748,
                  3.6259,
                  2.4874,
                  3.6259
                ],
                "text": "Email",
                "confidence": 1
              }
            ]
          },
          {
            "boundingBox": [
              3.0104,
              3.5005,
              4.6042,
              3.5005,
              4.6042,
              3.6888,
              3.0104,
              3.6777
            ],
            "text": "bob@gmail.com",
            "words": [
              {
                "boundingBox": [
                  3.0212,
                  3.5047,
                  4.5837,
                  3.5039,
                  4.5769,
                  3.6886,
                  3.0129,
                  3.6787
                ],
                "text": "bob@gmail.com",
                "confidence": 0.951
              }
            ]
          }
        ]
      }
    ]
  }
}
 

Skydiver

Staff member
Joined
Apr 6, 2019
Messages
2,393
Location
Chesapeake, VA
Programming Experience
10+
You would use LINQ to Objects to filter down to the object instance that you want. That means you need to deserialize your JSON into objects. I'm not familiar with ChoJSONReader, but I already don't like it because it does not follow the .NET Framework naming conventions.
 

WeyardWiz

I tried going the deserialization route, but for what I'm looking to do it seemed too complicated to implement. I found ChoJSONReader as an alternative because so far it's been the only way I could achieve what I want.
I would appreciate it if you could show me an example of what you mentioned above, as that may help improve my current design and make things much more flexible.
 

Sheepings

Retired Programmer
Staff member
Joined
Sep 5, 2018
Messages
1,877
Location
UK
Programming Experience
10+
If you know the contents of your file, you can populate your data to a class. From that class, you can then serialise to your csv file rather easily. One method I was showing Skydiver on another topic was the populate method (Populate an Object), which he admitted was useful. Serialising and deserialising are covered on those pages, among other useful examples.
 

WeyardWiz

Under normal circumstances, we will have a property, and then give it a value, like this:

C#:
public string Test { get; set; }

public Program()
{
    Test = "Test";
}

Then we can get the value based on the property name in other places.

But in this JSON, "Owner Full Name" and "Bob Lee" are not related as property and value; they are the values of the Text property in two unrelated objects, like this:

C#:
public class Line
{
    public float[] BoundingBox { get; set; }
    public string Text { get; set; }
    public Word[] Words { get; set; }
}

// ...

new Line() { Text = "Owner Full Name" };
new Line() { Text = "Bob Lee" };

We can't establish a connection between them, except by specifying it manually as in my original code.

Therefore, I'd have to reconstruct a qualified JSON before attempting to import it into a CSV file, but the problem is that this JSON is the response from the Azure Computer Vision Read API.

I guess my current code using ChoJSONReader is the only way to accomplish this.
 

Sheepings

If you can be bothered to read the documentation on the links I gave you, you will see it is more than possible and very easy.
 

WeyardWiz

I have read it.
Populating the object works in the Account class example they show because it's a property/attribute to value relationship, so deserializing it dynamically is easy. However, in the JSON I've given in my post, this method does not work because the attributes and supposed "values" have no connection. As a human, I can tell, for example, that "Bob Lee" is the value of the "Owner Full Name" property, but the program cannot make that distinction, because there is no connection between them.

The only way is to populate a class for every file manually, which defeats the purpose of using a program to do this, since I could just fill in the CSV by hand while reading the original PDF file.
 

Skydiver

Looking at the JSON there, it looks like it roughly maps to the following class structure:
C#:
class RootObject
{
    AnalyzeResult AnalyzeResult { get; set; }
}

class AnalyzeResult
{
    ReadResults[] ReadResults { get; set; }  // "readResults" is an array in the JSON
}

class ReadResults
{
    Line [] Lines { get; set; }
}

class Line
{
    string Text { get; set; }
    Word [] Words { get; set; }
}

class Word
{
    string Text { get; set; }
}

If you know that each pair of lines is always a name-value pair, then you can just ingest the lines in pairs and set up the values to go out into the CSV.

As an aside, I took a glance at the source code for ChoJSONReader on GitHub. It's just a wrapper around the Newtonsoft Json.NET library.
 

WeyardWiz

Interesting... could you demonstrate "ingesting the lines in pairs and setting up the values to go out into the CSV"? I think what compelled me to use ChoJSONReader was ChoCSVWriter, since ultimately that's what I want: to write properties/values to CSV.
If it's just a wrapper around the Json.NET library, does this mean it's possible to pull the values contextually instead of indexing the lines? And if so, how?

Mod edit : No need to quote in whole or quote the person directly above you.
 

Skydiver

The following outputs the pairs to the console:
C#:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Newtonsoft.Json;

class RootObject
{
    [JsonProperty("analyzeResult")]
    public AnalyzeResult AnalyzeResult { get; set; }
}

class AnalyzeResult
{
    [JsonProperty("readResults")]
    public ReadResults[] ReadResults { get; set; }
}

class ReadResults
{
    [JsonProperty("lines")]
    public Line[] Lines { get; set; }
}

class Line
{
    [JsonProperty("text")]
    public string Text { get; set; }

    public override string ToString() => Text;
}

public static class IEnumerableExtensions
{
    // Groups a flat sequence into consecutive (name, value) pairs.
    public static IEnumerable<KeyValuePair<T, T>> Pairs<T>(this IEnumerable<T> items)
    {
        // Dispose the enumerator once iteration ends.
        using (var enumerator = items.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                var name = enumerator.Current;
                if (enumerator.MoveNext())
                    yield return new KeyValuePair<T, T>(name, enumerator.Current);
                else
                    throw new InvalidDataException("Odd number of items found in IEnumerable<T>");
            }
        }
    }
}

class Program
{
    static IEnumerable<Line> GetLines(string jsonText)
    {
        var root = JsonConvert.DeserializeObject<RootObject>(jsonText);
        return root.AnalyzeResult
                   .ReadResults
                   .SelectMany(r => r.Lines);
    }

    static void Main(string[] args)
    {
        var lines = GetLines(File.ReadAllText("response.json"));

        // Skip(1) to skip over the "Account Information" Line.
        var pairs = lines.Skip(1).Pairs();

        foreach(var pair in pairs)
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}

which produces the following output:
Code:
Practice Name: Some Practice Name
Owner Full Name: Bob Lee
Owner Email: bob@gmail.com
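
Since the original goal was a CSV file rather than console output, here is a minimal sketch of how the pairs could be turned into a header row plus one data row using only the base class library. The names CsvSketch, Escape, and ToCsv are made up for this illustration, and the quoting rule (quote a field only if it contains a comma, a double quote, or a line break) is an assumption about the CSV dialect you need.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Json.NET or ChoETL.
public static class CsvSketch
{
    // Quote a field only when it contains a comma, a double quote, or a line break.
    public static string Escape(string field)
    {
        if (field == null) return "";
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Field names become the header row; field values become a single data row.
    public static string ToCsv(IEnumerable<KeyValuePair<string, string>> pairs)
    {
        var list = pairs.ToList();
        string header = string.Join(",", list.Select(p => Escape(p.Key)));
        string row = string.Join(",", list.Select(p => Escape(p.Value)));
        return header + Environment.NewLine + row + Environment.NewLine;
    }
}
```

With the Line class above, the call would be something along the lines of File.WriteAllText(csvFile, CsvSketch.ToCsv(pairs.Select(p => new KeyValuePair<string, string>(p.Key.Text, p.Value.Text)))).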
 

WeyardWiz

This is awesome, thank you Skydiver. Although, wouldn't the fact that I have to explicitly skip over a certain Line, i.e. "Account Information", mean that I'm still technically confined to the JSON structure? In other words, if I had another JSON file with more fields preceding even Account Information, wouldn't I have to adjust the skip again to make sure the first field it reads is Practice Name?
Is there a way to make it more dynamic, so that it goes directly to Practice Name instead of having to Skip() over a fixed number of Lines?
Something like:
C#:
var pairs = lines.SkipAllUntil("Practice Name").Pairs();
 

Skydiver

I only put in the skip there as hard coded because I was trying to highlight the pulling of JSON elements in pairs since that is what you asked about. With software anything is possible. It just depends how much time, energy, and money you want to invest. To answer your new question, you can operate on any IEnumerable using LINQ's SkipWhile().
 

WeyardWiz

I see. So I've tried the following:
C#:
var pairs = lines.SkipWhile(r => r == "Practice Name").Pairs();
However, I am getting "Operator '==' cannot be applied to operands of type 'Line' and 'String'".

I think I understand what this error means: Line is not of type String, so a direct comparison like that is not possible.

So I figured, OK, easy enough, I just have to convert each line to a string:
C#:
var pairs = lines.SkipWhile(r => r.ToString() == "Practice Name").Pairs();

However, this not only prints out "Account Information", but the pairing became messed up in the console output, and then I got an exception:
"Unreachable code, 'Odd number of items found in IEnumerable<T>'"

But anyway, wouldn't this mean that lines will always have to be == to Practice Name when using while, which means only that part of the JSON gets processed?

Pardon my asking a lot of questions; the last time I programmed in C# was 6 years ago and I've recently had to start using it again. Almost there though, truly appreciate your guidance so far!
 

Skydiver

Flip the logic.
C#:
var pairs = lines.SkipWhile(l => l.Text != "Practice Name").Pairs();
foreach(var pair in pairs)
    Console.WriteLine($"{pair.Key}: {pair.Value}");
seems to do the right thing for me.
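
To spell out why the flip matters: SkipWhile() drops items for as long as the predicate returns true and then yields everything from the first false onward. The sketch below uses plain strings instead of Line objects to keep it self-contained; the class name SkipWhileDemo and the shortened list of texts are assumptions made only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustration only: a shortened stand-in for the line texts in the sample JSON.
public static class SkipWhileDemo
{
    static readonly string[] Texts =
    {
        "Account Information",
        "Practice Name", "Some Practice Name",
        "Owner Full Name", "Bob Lee",
    };

    // The predicate is already false for the first item, so nothing is skipped.
    public static List<string> WithEquals() =>
        Texts.SkipWhile(t => t == "Practice Name").ToList();

    // The predicate stays true until "Practice Name" appears,
    // so everything before that item is dropped.
    public static List<string> WithNotEquals() =>
        Texts.SkipWhile(t => t != "Practice Name").ToList();
}
```

With `==`, all five items come back, starting with "Account Information". That shifts the pairing by one and leaves an odd count, which is consistent with the exception reported above. With `!=`, the sequence starts at "Practice Name" and pairs up cleanly.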
 

WeyardWiz

Awesome, this does the trick indeed :)
By the way, this JSON structure is generated through the Azure Computer Vision REST API from a PDF input like this:

[Attached image: 1603905939908.png]


I mention this because I understand what the IEnumerable code is doing, but there is one edge case that may not conform to how it operates. Basically, my understanding is that:

A scan of the JSON seems to show that the .text of odd-numbered lines is the name of a field,
and the .text of even-numbered lines is the value of that field.
For example:
If lines[3].text is "Owner Full Name",
then lines[3+1].text is "Bob Lee"

The skipped variable would be the 'lines' input with everything prior to the field of interest
removed. We then just skip over the field name line and return the .text property of the next line.
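
The "skip to the field name, then take the next line" idea described above could be sketched roughly like this. FieldLookup and ValueAfter are hypothetical names, the case-insensitive comparison is an assumption, and the helper only holds up if a value always directly follows its label.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for looking up one field by its label text.
public static class FieldLookup
{
    // Returns the text that follows the first line equal to fieldName,
    // or null when the label is missing or is the last line.
    public static string ValueAfter(IEnumerable<string> lineTexts, string fieldName)
    {
        return lineTexts
            .SkipWhile(t => !string.Equals(t, fieldName, StringComparison.OrdinalIgnoreCase))
            .Skip(1)            // step over the label itself
            .FirstOrDefault();  // null if nothing follows
    }
}
```

For the sample JSON, FieldLookup.ValueAfter(lines.Select(l => l.Text), "Owner Full Name") would give "Bob Lee". Because each field is looked up on its own, an odd number of lines does not throw, but a label followed by more than one line would still only return the first of them.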

The full JSON from the picture/pdf is derived by the API as:

JSON:
{
  "status": "succeeded",
  "createdDateTime": "2020-10-22T19:35:35Z",
  "lastUpdatedDateTime": "2020-10-22T19:35:36Z",
  "analyzeResult": {
    "version": "3.0.0",
    "readResults": [
      {
        "page": 1,
        "angle": 0,
        "width": 8.5,
        "height": 11,
        "unit": "inch",
        "lines": [       
          {
            "boundingBox": [
              0.5016,
              1.9141,
              2.5726,
              1.9141,
              2.5726,
              2.0741,
              0.5016,
              2.0741
            ],         
            "text": "Account Information",
            "words": [
              {
                "boundingBox": [
                  0.5016,
                  1.9345,
                  1.3399,
                  1.9345,
                  1.3399,
                  2.0741,
                  0.5016,
                  2.0741
                ],
                "text": "Account",
                "confidence": 1
              },
              {
                "boundingBox": [
                  1.3974,
                  1.9141,
                  2.5726,
                  1.9141,
                  2.5726,
                  2.0741,
                  1.3974,
                  2.0741
                ],
                "text": "Information",
                "confidence": 1
              }
            ]
          },
          {
            "boundingBox": [
              1.7716,
              2.4855,
              2.8793,
              2.4855,
              2.8793,
              2.6051,
              1.7716,
              2.6051
            ],
            "text": "Practice Name",
            "words": [
              {
                "boundingBox": [
                  1.7716,
                  2.4855,
                  2.3803,
                  2.4855,
                  2.3803,
                  2.6051,
                  1.7716,
                  2.6051
                ],
                "text": "Practice",
                "confidence": 1
              },
              {
                "boundingBox": [
                  2.4362,
                  2.4948,
                  2.8793,
                  2.4948,
                  2.8793,
                  2.6051,
                  2.4362,
                  2.6051
                ],
                "text": "Name",
                "confidence": 1
              }
            ]
          },
          {
            "boundingBox": [
              2.9993,
              2.5257,
              4.7148,
              2.5257,
              4.7148,
              2.714,
              2.9993,
              2.714
            ],
            "text": "Some Practice Name",
            "words": [
              {
                "boundingBox": [
                  3.0072,
                  2.5385,
                  3.6546,
                  2.5284,
                  3.6516,
                  2.7131,
                  3.0105,
                  2.712
                ],
                "text": "Some",
                "confidence": 0.984
              },
              {
                "boundingBox": [
                  3.6887,
                  2.5281,
                  4.2112,
                  2.5262,
                  4.2028,
                  2.7159,
                  3.6854,
                  2.7132
                ],
                "text": "Practice",
                "confidence": 0.986
              },
              {
                "boundingBox": [
                  4.2453,
                  2.5263,
                  4.7223,
                  2.5297,
                  4.7091,
                  2.72,
                  4.2366,
                  2.7161
                ],
                "text": "Name",
                "confidence": 0.986
              }
            ]
          },
          {
            "boundingBox": [
              1.6116,
              2.9999,
              2.8816,
              2.9999,
              2.8816,
              3.1158,
              1.6116,
              3.1158
            ],
            "text": "Owner Full Name",
            "words": [
              {
                "boundingBox": [
                  1.6116,
                  3.0039,
                  2.1026,
                  3.0039,
                  2.1026,
                  3.1157,
                  1.6116,
                  3.1157
                ],
                "text": "Owner",
                "confidence": 1
              },
              {
                "boundingBox": [
                  2.1541,
                  2.9999,
                  2.3784,
                  2.9999,
                  2.3784,
                  3.1158,
                  2.1541,
                  3.1158
                ],
                "text": "Full",
                "confidence": 1
              },
              {
                "boundingBox": [
                  2.4384,
                  3.0052,
                  2.8816,
                  3.0052,
                  2.8816,
                  3.1155,
                  2.4384,
                  3.1155
                ],
                "text": "Name",
                "confidence": 1
              }
            ]
          },
          {
            "boundingBox": [
              2.9993,
              3.0242,
              3.6966,
              3.0242,
              3.6966,
              3.2125,
              2.9993,
              3.2014
            ],
            "text": "Bob Lee",
            "words": [
              {
                "boundingBox": [
                  3.0063,
                  3.0303,
                  3.3439,
                  3.0349,
                  3.3461,
                  3.2125,
                  3.007,
                  3.2081
                ],
                "text": "Bob",
                "confidence": 0.987
              },
              {
                "boundingBox": [
                  3.3788,
                  3.0349,
                  3.6931,
                  3.0326,
                  3.697,
                  3.2121,
                  3.3813,
                  3.2125
                ],
                "text": "Lee",
                "confidence": 0.983
              }
            ]
          },
          {
            "boundingBox": [
              1.945,
              3.5063,
              2.8748,
              3.5063,
              2.8748,
              3.6261,
              1.945,
              3.6261
            ],
            "text": "Owner Email",
            "words": [
              {
                "boundingBox": [
                  1.945,
                  3.5143,
                  2.4359,
                  3.5143,
                  2.4359,
                  3.6261,
                  1.945,
                  3.6261
                ],
                "text": "Owner",
                "confidence": 1
              },
              {
                "boundingBox": [
                  2.4874,
                  3.5063,
                  2.8748,
                  3.5063,
                  2.8748,
                  3.6259,
                  2.4874,
                  3.6259
                ],
                "text": "Email",
                "confidence": 1
              }
            ]
          },
          {
            "boundingBox": [
              3.0104,
              3.5005,
              4.6042,
              3.5005,
              4.6042,
              3.6888,
              3.0104,
              3.6777
            ],
            "text": "bob@gmail.com",
            "words": [
              {
                "boundingBox": [
                  3.0212,
                  3.5047,
                  4.5837,
                  3.5039,
                  4.5769,
                  3.6886,
                  3.0129,
                  3.6787
                ],
                "text": "bob@gmail.com",
                "confidence": 0.951
              }
            ]
          },
          {
            "boundingBox": [
              1.945,
              6.5768,
              2.8886,
              6.5768,
              2.8886,
              6.7271,
              1.945,
              6.7271
            ],
            "text": "Server Setup",
            "words": [
              {
                "boundingBox": [
                  1.945,
                  6.5768,
                  2.4165,
                  6.5768,
                  2.4165,
                  6.6884,
                  1.945,
                  6.6884
                ],
                "text": "Server",
                "confidence": 1
              },
              {
                "boundingBox": [
                  2.4643,
                  6.5768,
                  2.8886,
                  6.5768,
                  2.8886,
                  6.7271,
                  2.4643,
                  6.7271
                ],
                "text": "Setup",
                "confidence": 1
              }
            ]
          },
          {
            "boundingBox": [
              3.5085,
              6.5025,
              3.7298,
              6.5136,
              3.7188,
              6.7351,
              3.4974,
              6.7241
            ],
            "text": "V",
            "words": [
              {
                "boundingBox": [
                  3.5672,
                  6.5046,
                  3.7293,
                  6.5128,
                  3.7183,
                  6.734,
                  3.5561,
                  6.7259
                ],
                "text": "V",
                "confidence": 0.984
              }
            ]
          },
          {
            "boundingBox": [
              3.7471,
              6.6145,
              4.1792,
              6.6145,
              4.1792,
              6.7304,
              3.7471,
              6.7304
            ],
            "text": "Cloud",
            "words": [
              {
                "boundingBox": [
                  3.7471,
                  6.6145,
                  4.1792,
                  6.6145,
                  4.1792,
                  6.7304,
                  3.7471,
                  6.7304
                ],
                "text": "Cloud",
                "confidence": 1
              }
            ]
          },
          {
            "boundingBox": [
              4.904,
              6.6105,
              5.5344,
              6.6105,
              5.5344,
              6.7301,
              4.904,
              6.7301
            ],
            "text": "Location",
            "words": [
              {
                "boundingBox": [
                  4.904,
                  6.6105,
                  5.5344,
                  6.6105,
                  5.5344,
                  6.7301,
                  4.904,
                  6.7301
                ],
                "text": "Location",
                "confidence": 1
              }
            ]
          },
          {
            "boundingBox": [
              6.2924,
              6.6037,
              7.8618,
              6.6037,
              7.8618,
              6.752,
              6.2924,
              6.752
            ],
            "text": "Central (multi-location)",
            "words": [
              {
                "boundingBox": [
                  6.2924,
                  6.6145,
                  6.8385,
                  6.6145,
                  6.8385,
                  6.7301,
                  6.2924,
                  6.7301
                ],
                "text": "Central",
                "confidence": 1
              },
              {
                "boundingBox": [
                  6.8929,
                  6.6037,
                  7.8618,
                  6.6037,
                  7.8618,
                  6.752,
                  6.8929,
                  6.752
                ],
                "text": "(multi-location)",
                "confidence": 1
              }
            ]
          },
          {
            "boundingBox": [
              0.6466,
              7.0788,
              2.8775,
              7.0788,
              2.8775,
              7.2388,
              0.6466,
              7.2388
            ],
            "text": "Number of Locations Enrolling",
            "words": [
              {
                "boundingBox": [
                  0.6466,
                  7.0832,
                  1.2496,
                  7.0832,
                  1.2496,
                  7.1991,
                  0.6466,
                  7.1991
                ],
                "text": "Number",
                "confidence": 1
              },
              {
                "boundingBox": [
                  1.2969,
                  7.0788,
                  1.4364,
                  7.0788,
                  1.4364,
                  7.1988,
                  1.2969,
                  7.1988
                ],
                "text": "of",
                "confidence": 1
              },
              {
                "boundingBox": [
                  1.4892,
                  7.0793,
                  2.2013,
                  7.0793,
                  2.2013,
                  7.1988,
                  1.4892,
                  7.1988
                ],
                "text": "Locations",
                "confidence": 1
              },
              {
                "boundingBox": [
                  2.2576,
                  7.0793,
                  2.8775,
                  7.0793,
                  2.8775,
                  7.2388,
                  2.2576,
                  7.2388
                ],
                "text": "Enrolling",
                "confidence": 1
              }
            ]
          },
          {
            "boundingBox": [
              3.4421,
              7.0342,
              3.6413,
              7.0453,
              3.6413,
              7.3001,
              3.4421,
              7.289
            ],
            "text": "1",
            "words": [
              {
                "boundingBox": [
                  3.4757,
                  7.0352,
                  3.6451,
                  7.0446,
                  3.631,
                  7.299,
                  3.4616,
                  7.2896
                ],
                "text": "1",
                "confidence": 0.987
              }
            ]
          },
          {
            "boundingBox": [
              4.1835,
              7.0896,
              7.8999,
              7.0896,
              7.8999,
              7.2158,
              4.1835,
              7.2158
            ],
            "text": "*If more than 1 location, add info on the locations form",
            "words": [
              {
                "boundingBox": [
                  4.1835,
                  7.0896,
                  4.3291,
                  7.0896,
                  4.3291,
                  7.1979,
                  4.1835,
                  7.1979
                ],
                "text": "*If",
                "confidence": 1
              },
              {
                "boundingBox": [
                  4.3611,
                  7.1193,
                  4.725,
                  7.1193,
                  4.725,
                  7.1988,
                  4.3611,
                  7.1988
                ],
                "text": "more",
                "confidence": 1
              },
              {
                "boundingBox": [
                  4.7701,
                  7.0936,
                  5.0809,
                  7.0936,
                  5.0809,
                  7.1988,
                  4.7701,
                  7.1988
                ],
                "text": "than",
                "confidence": 1
              },
              {
                "boundingBox": [
                  5.1307,
                  7.0985,
                  5.1613,
                  7.0985,
                  5.1613,
                  7.1979,
                  5.1307,
                  7.1979
                ],
                "text": "1",
                "confidence": 1
              },
              {
                "boundingBox": [
                  5.2006,
                  7.09,
                  5.7803,
                  7.09,
                  5.7803,
                  7.2158,
                  5.2006,
                  7.2158
                ],
                "text": "location,",
                "confidence": 1
              },
              {
                "boundingBox": [
                  5.8268,
                  7.0936,
                  6.102,
                  7.0936,
                  6.102,
                  7.1988,
                  5.8268,
                  7.1988
                ],
                "text": "add",
                "confidence": 1
              },
              {
                "boundingBox": [
                  6.1394,
                  7.0896,
                  6.3896,
                  7.0896,
                  6.3896,
                  7.1988,
                  6.1394,
                  7.1988
                ],
                "text": "info",
                "confidence": 1
              },
              {
                "boundingBox": [
                  6.435,
                  7.1193,
                  6.6005,
                  7.1193,
                  6.6005,
                  7.1988,
                  6.435,
                  7.1988
                ],
                "text": "on",
                "confidence": 1
              },
              {
                "boundingBox": [
                  6.6481,
                  7.0936,
                  6.865,
                  7.0936,
                  6.865,
                  7.1988,
                  6.6481,
                  7.1988
                ],
                "text": "the",
                "confidence": 1
              },
              {
                "boundingBox": [
                  6.9081,
                  7.09,
                  7.5365,
                  7.09,
                  7.5365,
                  7.1988,
                  6.9081,
                  7.1988
                ],
                "text": "locations",
                "confidence": 1
              },
              {
                "boundingBox": [
                  7.5783,
                  7.0896,
                  7.8999,
                  7.0896,
                  7.8999,
                  7.1988,
                  7.5783,
                  7.1988
                ],
                "text": "form",
                "confidence": 1
              }
            ]
          }
        ]
      }
    ]
  }
}

I think the API is not perfect, so we have an edge case when multiple elements share a bounding box. For example, the API interprets "Server Setup" the same way it interprets "Owner Full Name": it simply sticks to the convention that the .text of odd-numbered lines is the name of a field and the .text of even-numbered lines is the value of that field.
It fails to place the supposed "values" of that field into an "inner" bounding box inside the "Server Setup" text, so we end up with an output like this:

Practice Name: Some Practice Name
Owner Full Name: Bob Lee
Owner Email: bob@gmail.com
Server Setup: V
Cloud: Location
Central (multi-location): Number of Locations Enrolling
1: *If more than 1 location, add info on the locations form

While the Practice Name, Owner Full Name, and Owner Email fields/values are correct, the Server Setup field and its values unfortunately are not. That is understandable, because the JSON is structured that way to begin with: it lacks the "child"-like element dependency that we would otherwise observe in the PDF/image.
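
To make the failure concrete, here is a minimal sketch of the odd/even pairing convention described above. The line texts are copied from the sample in the order implied by the output listing; the variable names are only for illustration, and the real code reads them from the JSON rather than from an array.

```csharp
using System;
using System.Linq;

// Line texts in the order the OCR result lists them (copied from the sample).
string[] texts =
{
    "Practice Name", "Some Practice Name",
    "Owner Full Name", "Bob Lee",
    "Owner Email", "bob@gmail.com",
    "Server Setup", "V", "Cloud", "Location", "Central (multi-location)",
    "Number of Locations Enrolling", "1",
    "*If more than 1 location, add info on the locations form",
};

// Naive pairing: element 2k is treated as a label, element 2k+1 as its value.
var pairs = texts
    .Select((text, index) => (text, index))
    .GroupBy(t => t.index / 2, t => t.text)
    .Select(g => g.ToArray())
    .Where(p => p.Length == 2)               // ignore a trailing unpaired line
    .Select(p => (Label: p[0], Value: p[1]))
    .ToList();

foreach (var (label, value) in pairs)
    Console.WriteLine($"{label}: {value}");
```

The first three pairs come out right, but from "Server Setup" onward every label is paired with whatever line happens to follow it, which is exactly the misaligned listing shown above.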

1603906337179.png


Note: "V" represents the checkmark, since it looks like the API is incapable of interpreting symbols into the JSON.

The ideal output, however, should be this:

Practice Name: Some Practice Name
Owner Full Name: Bob Lee
Owner Email: bob@gmail.com
Server Setup: Cloud
Location: Central (multi-location)
Number of Locations Enrolling: 1

How do I adjust the IEnumerable code to accommodate this edge case (if that's even possible)?
 

Skydiver

Staff member
Joined
Apr 6, 2019
Messages
2,393
Location
Chesapeake, VA
Programming Experience
10+
Recall that in post #8, I said:
If you know that each pair of lines is always a name-value pair, then you can just ingest the lines in pairs and set up the values to go out into the CSV.

Your data doesn't meet that condition.

Take time to read about how LINQ pipelining and fluent interfaces work. You can add extra filters and modifiers to change how the data flows down the pipeline, and try to adjust the enumeration of the fields.
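
As one illustration of adding a filter stage to such a pipeline, here is a minimal sketch. It rests on two assumptions that are not from the thread: a checkmark always comes through as a line whose text is exactly "V", and footnotes always start with "*". The `IsNoise` helper is a made-up name, and the line texts are copied from the sample.

```csharp
using System;
using System.Linq;

string[] texts =
{
    "Practice Name", "Some Practice Name",
    "Owner Full Name", "Bob Lee",
    "Owner Email", "bob@gmail.com",
    "Server Setup", "V", "Cloud", "Location", "Central (multi-location)",
    "Number of Locations Enrolling", "1",
    "*If more than 1 location, add info on the locations form",
};

// Hypothetical noise rules (assumptions): checkmark token and footnote lines.
static bool IsNoise(string text) =>
    text == "V" || text.StartsWith("*", StringComparison.Ordinal);

// Filter stage: drop the noise before any pairing happens.
var cleaned = texts.Where(t => !IsNoise(t)).ToList();

// Pairing stage: even positions are labels, odd positions are values.
var labels = cleaned.Where((_, i) => i % 2 == 0);
var values = cleaned.Where((_, i) => i % 2 == 1);
var fields = labels.Zip(values, (label, value) => (Label: label, Value: value)).ToList();

foreach (var (label, value) in fields)
    Console.WriteLine($"{label}: {value}");
```

On this sample the result matches the desired listing, but the rules are tied to this one template: a field whose real value is "V" would be dropped as well, and a different checkbox layout could shift the pairs again.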

Personally, I think that doing OCR first and then trying to massage the data that comes out is the wrong approach. My recommendation is to first run some AI classification on the scanned PDFs or papers to sort the different kinds of forms into buckets. Then, for each bucket, you apply the Strategy pattern and use a custom mask with the OCR so that it only scans in the data that matters to you. As the data from each bucket comes out, it gets fed into a POCO that has the shape of the data you eventually want to save to your CSV. My gut says that you'll have to go down this path either way, because you'll have variances where some forms may say "Owner Full Name", while other forms will have "Owner's Name", or just "Name". Other forms may have the fields in a different order (and so that SkipWhile() will end up skipping over important data). Also consider what happens when you have to deal with languages that are RTL, where the field labels will be on the right and the values will be on the left.
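
A rough sketch of what the Strategy-plus-POCO part could look like is given here. Every name in it (`IFormExtractor`, `EnrollmentFormExtractor`, `EnrollmentRecord`, the "enrollment-v1" key) is hypothetical, and the classification and OCR masking steps are left out: the dictionary key simply stands in for whatever bucket the classifier would assign.

```csharp
using System;
using System.Collections.Generic;

// One strategy per form kind; the key stands in for the classifier's bucket.
var strategies = new Dictionary<string, IFormExtractor>
{
    ["enrollment-v1"] = new EnrollmentFormExtractor(),
};

var lines = new[]
{
    "Practice Name", "Some Practice Name",
    "Owner Full Name", "Bob Lee",
    "Owner Email", "bob@gmail.com",
};

EnrollmentRecord record = strategies["enrollment-v1"].Extract(lines);
Console.WriteLine($"{record.PracticeName},{record.OwnerFullName},{record.OwnerEmail}");

// POCO with the shape of one CSV row.
public class EnrollmentRecord
{
    public string PracticeName { get; set; } = "";
    public string OwnerFullName { get; set; } = "";
    public string OwnerEmail { get; set; } = "";
}

// Strategy interface: each form kind knows how to read its own layout.
public interface IFormExtractor
{
    EnrollmentRecord Extract(IReadOnlyList<string> lines);
}

public class EnrollmentFormExtractor : IFormExtractor
{
    public EnrollmentRecord Extract(IReadOnlyList<string> lines) => new EnrollmentRecord
    {
        PracticeName = ValueAfter(lines, "Practice Name"),
        OwnerFullName = ValueAfter(lines, "Owner Full Name"),
        OwnerEmail = ValueAfter(lines, "Owner Email"),
    };

    // Finds a label by its text and returns the line after it, or "" if absent.
    private static string ValueAfter(IReadOnlyList<string> lines, string label)
    {
        for (int i = 0; i + 1 < lines.Count; i++)
        {
            if (lines[i] == label) return lines[i + 1];
        }
        return "";
    }
}
```

Because each extractor looks labels up by text instead of by position, an extra line elsewhere on the form does not shift the values, and a form that says "Owner's Name" would get its own extractor rather than a special case in a shared one.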
 

WeyardWiz

Your data doesn't meet that condition.

Take time to read about how LINQ pipelining and fluent interfaces work. You can add extra filters and modifiers to change how the data flows down the pipeline, and try to adjust the enumeration of the fields.
because you'll have variances where some forms may say "Owner Full Name", while other forms will have "Owner's Name", or just "Name". Other forms may have the fields in a different order (and so that SkipWhile() will end up skipping over important data). Consider what happens when you have to deal with languages that are RTL, where the field labels will be on the right and the values will be on the left.
This is a very good point. I've reviewed this with the team, and they said this form will be the official template, so we won't have to worry about variances.
Since that's the case, does the output I'm seeking from the JSON still not meet the condition in post #8? In other words, would the above code only work for JSON items that meet that condition, like Owner Full Name?
 

Skydiver

Even if there are no variances, the data still doesn't meet the condition that adjacent pairs of lines are related to each other. The "Server Setup" line is followed by 4 lines which are all related to it. The "Number of Locations" has 2 lines which are related to it.
 

JohnH

C# Forum Moderator
Staff member
Joined
Apr 23, 2011
Messages
992
Location
Norway
Programming Experience
10+
Perhaps try to group the lines by the second number in boundingBox, which seems to be the line's Y coordinate. A proximity comparison of ±0.12 would be needed; see "LINQ (Or pseudocode) to group items by proximity".
Then you could handle each group, where the first text would be the label and the second the value. The "Server Setup" line would need special treatment to find out which text follows the "V" selection.
 

JohnH

Following up on my suggestion: this adds to @Skydiver's code in post #10.
Line class:
[JsonProperty("boundingBox")]
public double[] BoundingBox { get; set; }
IEnumerableExtensions:
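internal static IEnumerable<IEnumerable<Line>> GroupByProximity(this IEnumerable<Line> source, double threshold)
{
    // Collect consecutive lines whose Y coordinate (BoundingBox[1]) stays
    // within threshold of the first line in the current group.
    var g = new List<Line>();
    foreach (var x in source)
    {
        if ((g.Count != 0) && (!x.BoundingBox[1].IsProximity(g[0].BoundingBox[1], threshold)))
        {
            // Y moved too far: emit the finished group and start a new one.
            yield return g;
            g = new List<Line>();
        }
        g.Add(x);
    }
    // Emit the last group, but not an empty one when source has no lines.
    if (g.Count != 0)
    {
        yield return g;
    }
}

// True when value lies within compareTo ± threshold.
private static bool IsProximity(this double value, double compareTo, double threshold)
{
    return value >= compareTo - threshold && value <= compareTo + threshold;
}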
example usage:
var lines = GetLines(File.ReadAllText("response.json"));
var groups = lines.GroupByProximity(0.12);

var dictionary = groups.ToDictionary(g => g.First().Text, g => g.Skip(1).Select(line => line.Text));
var value1 = dictionary["Practice Name"].First();
var value2 = dictionary["Server Setup"].SkipWhile(s => s != "V").Skip(1).FirstOrDefault();
var value3 = string.Join(" ", dictionary["Number of Locations Enrolling"]);
ToDictionary requires that the labels are distinct; if they are not, you could use ToLookup instead.
The value2 example allows for no selection; it will be null in that case.
value3 is just an example that joins all of the values in the group into a single string.
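
For anyone who wants to try the grouping without the rest of the project, here is a self-contained sketch that also shows the ToLookup variant. It is a simplification, not the code above: lines are represented as (Text, Y) tuples instead of the Line class, `GroupByY` is a made-up helper name, and the Y values are the second boundingBox number of each line in the sample JSON.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// (Text, Y) pairs: Y is boundingBox[1] of each line in the sample.
var lines = new List<(string Text, double Y)>
{
    ("Server Setup", 6.5768), ("V", 6.5025), ("Cloud", 6.6145),
    ("Location", 6.6105), ("Central (multi-location)", 6.6037),
    ("Number of Locations Enrolling", 7.0788), ("1", 7.0342),
    ("*If more than 1 location, add info on the locations form", 7.0896),
};

// Starts a new group whenever Y is more than threshold away from the
// first line of the current group.
static IEnumerable<List<(string Text, double Y)>> GroupByY(
    IEnumerable<(string Text, double Y)> source, double threshold)
{
    var group = new List<(string Text, double Y)>();
    foreach (var line in source)
    {
        if (group.Count != 0 && Math.Abs(line.Y - group[0].Y) > threshold)
        {
            yield return group;
            group = new List<(string Text, double Y)>();
        }
        group.Add(line);
    }
    if (group.Count != 0) yield return group;
}

// ToLookup accepts repeated labels (their values are merged under one key)
// and returns an empty sequence for a label that is not present.
var lookup = GroupByY(lines, 0.12)
    .SelectMany(g => g.Skip(1), (g, value) => (Label: g[0].Text, Value: value.Text))
    .ToLookup(p => p.Label, p => p.Value);

var selected = lookup["Server Setup"].SkipWhile(s => s != "V").Skip(1).FirstOrDefault();
var count = lookup["Number of Locations Enrolling"].FirstOrDefault();

Console.WriteLine($"Server Setup: {selected}");               // Server Setup: Cloud
Console.WriteLine($"Number of Locations Enrolling: {count}"); // Number of Locations Enrolling: 1
Console.WriteLine(lookup["No Such Label"].Any());             // False
```

One thing to keep in mind with this variant: a group that contains only a label and no values contributes nothing to the lookup, so a missing label and a label without values look the same to the caller.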
 