[Resolved] GZipStream not reading all of file

I have been using GZipStream with .NET Framework 4.8 successfully to read compressed blocks (40,000 bytes) into memory.
C#:
fs.Seek(offset, SeekOrigin.Begin);
GZipStream zs = new GZipStream(fs, CompressionMode.Decompress, true);
zs.Read(block[blockNo]!, 0, 40000);
zs.Close();
fs.Close();
Now that I have moved to .NET 7, it does not seem to read the whole block successfully. It is exactly the same code.
I have analyzed the data and it seems to read bytes 0 to 32801 the same, but beyond that, they are all zero.
(I thought I was onto something when I noticed that was near 32767, but that can't be it as it reads to 32801.)

Unfortunately I can't change the block size to something smaller as the files have already been created with blocks of 40,000 bytes and would take weeks to recreate at smaller size.

This could be a show stopper unless I can fix it.

Anyone had this problem before or could shed any light on it? Maybe .NET 7 has a newer decompression class?
 
Did you check the return value of Read()? Was it also 40000? There are no guarantees that the Read() call will read everything in the first pass. You are supposed to loop around as needed.
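For illustration, a minimal sketch of what such a loop might look like around the snippet from the first post (assuming the same fs, offset, block and blockNo variables):
C#:
fs.Seek(offset, SeekOrigin.Begin);
using (var zs = new GZipStream(fs, CompressionMode.Decompress, true))
{
    byte[] buffer = block[blockNo]!;
    int totalRead = 0;
    // Keep reading until the whole block is filled or the stream ends (Read returns 0).
    while (totalRead < buffer.Length)
    {
        int read = zs.Read(buffer, totalRead, buffer.Length - totalRead);
        if (read == 0)
            break; // end of compressed data
        totalRead += read;
    }
}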
 
Seems to work without problems for me:
[Screenshot of console output showing the test program below reading the full 40,000 bytes and reporting that the data matches.]


I was using the following code for both versions:
C#:
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

class Program
{
    IEnumerable<byte> GetBytes()
    {
        byte value = 0;
        while (true)
            yield return value++;
    }

    void WriteOutData(string filename, byte[] data)
    {
        Console.WriteLine($"Writing compressed data to {filename}...");
        using (var stream = File.OpenWrite(filename))
        using (var zip = new GZipStream(stream, CompressionLevel.Optimal, true))
            zip.Write(data, 0, data.Length);
    }

    void ValidateData(string filename, byte[] data)
    {
        Console.WriteLine($"Validating compressed data in {filename}...");
        var buffer = new byte[data.Length];
        using (var stream = File.OpenRead(filename))
        using (var zip = new GZipStream(stream, CompressionMode.Decompress, true))
        {
            int offset = 0;
            int totalRead = 0;
            while (totalRead < buffer.Length)
            {
                Console.WriteLine($"Reading into offset {offset} ...");
                int read = zip.Read(buffer, offset, buffer.Length - offset);
                totalRead += read;
                offset += read;
            }
        }

        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] != buffer[i])
            {
                Console.WriteLine("Data read back in doesn't match.");
                return;
            }
        }

        Console.WriteLine("Data matches.");
    }

    void Run(string[] args)
    {
        string filename = Path.Combine(Path.GetTempPath(), "MyTestData.zip");
        var data = GetBytes().Take(40000).ToArray();

        if (args.Length >= 1 && args[0]?.ToLower() == "create")
            WriteOutData(filename, data);

        ValidateData(filename, data);
    }

    public static void Main(string[] args)
        => new Program().Run(args);
}
 
Did you check the return value of Read()? Was it also 40000? There are no guarantees that the Read() call will read everything in the first pass. You are supposed to loop around as needed.

No, I had not been checking the return value. Having now checked, it was 32802, but with .NET Framework it was 40000.

So, now I'll have to experiment to see if I need to do multiple passes, or if there is some reason why it is not reading all 40000 bytes in .NET 7.
I'll probably do a separate test using your code.
 
On further investigation, I noticed that there is a method zs.ReadExactly.
I tried that and, lo and behold, it now works and gets the same result. There is no return value from ReadExactly either.
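For reference, a sketch of how that looks in the context of the snippet from the first post (the surrounding variables are assumed to be as they were there):
C#:
fs.Seek(offset, SeekOrigin.Begin);
using (var zs = new GZipStream(fs, CompressionMode.Decompress, true))
{
    // ReadExactly (new in .NET 7) loops internally and throws EndOfStreamException
    // if the stream ends before the requested 40,000 bytes have been read.
    zs.ReadExactly(block[blockNo]!, 0, 40000);
}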

I will have to do a lot more testing though to ensure it works in all cases.
 
Read returns the number of bytes read, because it may not be what you asked for. I haven't looked but I would assume that ReadExactly will always read exactly what you tell it to and throw if it can't, so a return value would be pointless.
 
I just had a look at the documentation and it appears that ReadExactly was added in .NET 7 and does indeed throw an exception if there are fewer bytes available than you specify to read. If there's any chance that the file will not be as big as you specify, you'd have to catch that exception and handle the short read. Using Read in a loop enables you to read files of any size in blocks of a specific size. That is generally preferable as it avoids monopolising large amounts of memory and also handles files of unknown sizes. Obviously there are times when you do know exactly how many bytes there are and it's not too many to read in one go.
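To illustrate that looping approach, a rough sketch for reading a decompressed stream of unknown size in blocks of up to 40,000 bytes (path and ProcessChunk are hypothetical, standing in for your file and whatever you do with each block):
C#:
using (var stream = File.OpenRead(path))
using (var zip = new GZipStream(stream, CompressionMode.Decompress))
{
    var buffer = new byte[40000];
    int read;
    // Each call reads up to 40,000 decompressed bytes; any chunk, including the last, may be shorter.
    while ((read = zip.Read(buffer, 0, buffer.Length)) > 0)
    {
        ProcessChunk(buffer, read);   // hypothetical consumer of 'read' bytes
    }
}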
 
I just had a look at the documentation and it appears that ReadExactly was added in .NET 7 and does indeed throw an exception if there are fewer bytes available than you specify to read. If there's any chance that the file will not be as big as you specify, you'd have to catch that exception and handle the short read. Using Read in a loop enables you to read files of any size in blocks of a specific size. That is generally preferable as it avoids monopolising large amounts of memory and also handles files of unknown sizes. Obviously there are times when you do know exactly how many bytes there are and it's not too many to read in one go.

I have experimented and ReadExactly does indeed throw an exception if fewer bytes are available. However, in my case I always know the block size (it is 40000 except for the last block of the file, which I can easily calculate - these are the uncompressed sizes). So, for me, ReadExactly is the right solution.
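For example, a hypothetical helper along these lines could compute the size of each block, assuming the total uncompressed length is known:
C#:
// Hypothetical: every block except the last holds 40,000 uncompressed bytes,
// so the last block's size falls out of the total uncompressed length.
static int BlockSize(long totalUncompressed, int blockNo, int fullBlockSize = 40000)
{
    long remaining = totalUncompressed - (long)blockNo * fullBlockSize;
    return (int)Math.Min(fullBlockSize, remaining);
}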

I am still curious though as to why zs.Read in .NET 7 does not always read the full block, whereas in .NET Framework it does - it does not even throw an error on the last block, which is smaller than 40000.
In my first example, it consistently reads 32802 bytes of a 40000-byte block, but Skydiver has no problem reading 40000 bytes.
Something must have changed in how zs.Read works, probably long before .NET 7 but at some point after .NET Framework 4.8.
 
I am still curious though as to why zs.Read in .NET 7 does not always read the full block, whereas in .NET Framework it does - it does not even throw an error on the last block, which is smaller than 40000

Read has always worked the same; you supply a buffer and tell it to read up to X bytes. It tells you how many it read and you proceed from there. Read in this way is flexible enough to allow you to patch bytes into a buffer in various places; it's not always `Read(buf, 0, buf.Length)`.

The underlying IO system perhaps works differently between the two versions - it's pointless to ask why, because you can't change it and it's by design. You just use Read as it is specified, and if you don't get enough bytes you read again. Perhaps ReadExactly blocks until the requested number of bytes are available, unless there are definitely no more bytes (end of stream) - but perhaps you don't want to block until the requested number of bytes are available, and would rather do something else while you wait for more to come in.

IO is nearly always the slowest part of the system, and we avoid waiting for it if we can.
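For what it's worth, .NET 7 also added async counterparts, so if blocking is a concern, something along these lines might be used instead (a sketch only, assuming an async method and the variables from the first post):
C#:
fs.Seek(offset, SeekOrigin.Begin);
using (var zs = new GZipStream(fs, CompressionMode.Decompress, true))
{
    // Awaiting frees the thread to do other work while the bytes arrive.
    await zs.ReadExactlyAsync(block[blockNo]!, 0, 40000);
}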
 
For those curious about how it is implemented in .NET Core:

ReadExactly:
public void ReadExactly(Span<byte> buffer) =>
    _ = ReadAtLeastCore(buffer, buffer.Length, throwOnEndOfStream: true);

public void ReadExactly(byte[] buffer, int offset, int count)
{
    ValidateBufferArguments(buffer, offset, count);

    _ = ReadAtLeastCore(buffer.AsSpan(offset, count), count, throwOnEndOfStream: true);
}

ReadAtLeastCore:
private int ReadAtLeastCore(Span<byte> buffer, int minimumBytes, bool throwOnEndOfStream)
{
    Debug.Assert(minimumBytes <= buffer.Length);

    int totalRead = 0;
    while (totalRead < minimumBytes)
    {
        int read = Read(buffer.Slice(totalRead));
        if (read == 0)
        {
            if (throwOnEndOfStream)
            {
                ThrowHelper.ThrowEndOfFileException();
            }

            return totalRead;
        }

        totalRead += read;
    }

    return totalRead;
}
 
In my first example, it consistently reads 32802 bytes of a 40000 block, but Skydiver has no problem reading 40000 bytes.

It's likely down to the nature of the data being compressed. In my test program, I filled 40000 bytes with a simple repeating pattern of 256 bytes. This compresses very well, especially for modern dictionary-based compression algorithms. I suspect that your compressed data causes zlib's internal data structures to reset the dictionary at that particular boundary in your data. Also recall that the version of zlib used by .NET Framework 4.8 is likely different from the version used in later versions of .NET Core. Consider that .NET Framework predates 2022, but in 2022 this critical vulnerability was found in zlib's decompression: CVE-2022-37434.
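One way to see this effect is a small experiment along these lines, comparing how much a single Read call returns for highly compressible data versus random data (a sketch only; the exact numbers will depend on the data and the runtime):
C#:
using System;
using System.IO;
using System.IO.Compression;

class SingleReadTest
{
    static int SingleRead(byte[] data)
    {
        // Compress to memory, then see how many bytes one Read call returns.
        var ms = new MemoryStream();
        using (var zip = new GZipStream(ms, CompressionLevel.Optimal, true))
            zip.Write(data, 0, data.Length);
        ms.Position = 0;

        using var unzip = new GZipStream(ms, CompressionMode.Decompress);
        var buffer = new byte[data.Length];
        return unzip.Read(buffer, 0, buffer.Length);
    }

    static void Main()
    {
        var pattern = new byte[40000];
        for (int i = 0; i < pattern.Length; i++) pattern[i] = (byte)i;   // repeating 0..255

        var random = new byte[40000];
        new Random(42).NextBytes(random);

        Console.WriteLine($"Pattern data, first Read returned: {SingleRead(pattern)}");
        Console.WriteLine($"Random data,  first Read returned: {SingleRead(random)}");
    }
}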
 
It's likely down to the nature of the data being compressed. In my test program, I filled 40000 bytes with a simple repeating pattern of 256 bytes. This compresses very well, especially for modern dictionary-based compression algorithms. I suspect that your compressed data causes zlib's internal data structures to reset the dictionary at that particular boundary in your data. Also recall that the version of zlib used by .NET Framework 4.8 is likely different from the version used in later versions of .NET Core. Consider that .NET Framework predates 2022, but in 2022 this critical vulnerability was found in zlib's decompression: CVE-2022-37434.

That's a bit worrying. I hope it won't mean I'll have to regenerate all my compressed file data.
So far, I have not found any problems and it seems to be reading the data correctly using ReadExactly.
 
