Why does X4 on Windows not cache data files?
-
- Moderator (English)
- Posts: 4933
- Joined: Fri, 21. Dec 18, 18:23
Re: Why does X4 on Windows not cache data files?
People are overlooking the problems introduced by binary formats that XML solves, such as moddability, robustness, etc.
-
- Moderator (English)
- Posts: 3230
- Joined: Mon, 14. Jul 08, 13:07
Re: Why does X4 on Windows not cache data files?
There's no reason why modding can't be allowed via some user-friendly data interchange format (of which, by the way, XML is not one), while keeping the game's data and save files in more efficient structures.
As for robustness, I wouldn't call a 300 MB file robust when ~225 MB of it is wasted on useless overhead, and the most repeated string, counting upwards of 250 thousand occurrences, is "</ware>", which has nothing to do with useful data.
-
- Moderator (English)
- Posts: 4933
- Joined: Fri, 21. Dec 18, 18:23
Re: Why does X4 on Windows not cache data files?
radcapricorn wrote: ↑Mon, 18. Mar 19, 23:14
There's no reason why modding can't be allowed via some user-friendly data interchange format, while keeping the game's data and save files in more efficient structures. As for robustness, I wouldn't call a 300 MB file robust when ~225 MB of it is wasted on useless overhead, and the most repeated string, counting upwards of 250 thousand occurrences, is "</ware>".

Except the files are not 300 MB, because they are compressed. They are only 30 MB or so. The repetition of "</ware>" does not matter much due to compression.
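That claim is easy to sanity-check with the JDK's Deflater; here is a toy sketch (the 250,000 count comes from the post above, everything else is illustration):

Code: Select all

import java.util.zip.Deflater;

public class RepetitionDemo {
    public static void main(String[] args) {
        // 250,000 copies of "</ware>" -- about 1.75 MB of raw text.
        byte[] input = "</ware>".repeat(250_000).getBytes();
        Deflater d = new Deflater(Deflater.DEFAULT_COMPRESSION);
        d.setInput(input);
        d.finish();
        byte[] out = new byte[input.length];
        int compressed = 0;
        while (!d.finished()) {
            compressed += d.deflate(out, compressed, out.length - compressed);
        }
        d.end();
        // Highly repetitive input deflates down to a few kilobytes.
        System.out.println(input.length + " -> " + compressed + " bytes");
    }
}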
Currently, if someone wants to change something in their save, they can edit it fairly easily in textual form. Good luck doing that with an arbitrary, closed-source binary format.
Another advantage is that the structure is very open. There is (or at least could be) no concern about field order, size, positioning, and other general data-layout concepts which are a big part of binary file formats.
One cannot even say that processing the XML is slow. Yes, it could be faster, but something else is clearly the cause of most of the loading time. Moving to a highly optimized binary format might save a second or two, but ultimately will not touch the 15+ seconds the game spends doing other stuff unrelated to directly parsing the XML.
-
- Posts: 42
- Joined: Sat, 23. Nov 13, 13:18
Re: Why does X4 on Windows not cache data files?
Imperial Good wrote: ↑Mon, 18. Mar 19, 20:07
> Seriously am surprised that X4 doesn't use multi-threaded compression. It's got a number of open-source libs in it, so one more for multi-threaded compression doesn't seem like it'd make a difference.
Since a single stream of data is being compressed, multi-threaded compression might not yield much gain. It also depends on the compression algorithm used, as some just do not scale well across multiple threads when compressing a single file.

The zlib/gzip format is known to be slow, especially if you dial up the compression level. The default in most 'gzip' invocations is 6, a balanced tradeoff of compression versus speed. Lowering that value gives you faster speed but less compression, and vice versa for raising it. As for multi-threading, what programs like 7-Zip on Windows do is break the data stream up into multiple chunks and compress each individually in its own thread. As with gzip/zlib's compression level, this affects speed versus compression ratio, but at least with 7-Zip the difference in compressed output is not that much. That said, I don't think current gzip/zlib implementations support multi-threading. It's a very old algorithm at the heart of the zlib format, so it just may not be capable of it.
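For illustration, here is a minimal Java sketch of that chunked scheme. The 1 MiB chunk size is an arbitrary assumption, and unlike a real container format this doesn't record chunk boundaries, which a decompressor would need in order to inflate the pieces independently:

Code: Select all

import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.Deflater;

public class ChunkedDeflate {
    static final int CHUNK = 1 << 20; // 1 MiB per chunk (assumption)

    // Compress each chunk independently on its own thread, 7-Zip style.
    // A smaller 'level' is faster but compresses less, as with gzip -1..-9.
    public static byte[] compress(byte[] data, int level) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<byte[]>> parts = new ArrayList<>();
        for (int off = 0; off < data.length; off += CHUNK) {
            final int start = off;
            final int len = Math.min(CHUNK, data.length - off);
            parts.add(pool.submit(() -> deflateChunk(data, start, len, level)));
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Future<byte[]> part : parts) out.write(part.get()); // keep original order
        pool.shutdown();
        return out.toByteArray();
    }

    static byte[] deflateChunk(byte[] data, int off, int len, int level) {
        Deflater d = new Deflater(level);
        d.setInput(data, off, len);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[64 * 1024];
        while (!d.finished()) out.write(buf, 0, d.deflate(buf));
        d.end();
        return out.toByteArray();
    }
}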
There are newer alternatives, though. LZ4, for example, supports parallelism. Its ratio is not as good as zlib's at the default setting, but it's insanely fast. The ZFS filesystem under Solaris/FreeBSD recommends it for filesystem compression over the default 'gzip' option, since a modern CPU can compress/decompress the data faster than it can write it to disk. There are newer algorithms too, like Facebook's Zstd, though that might be too new for X4, since development likely started on it before Zstd was even publicly known.
I'm not sure if there's a library for it, but there's a parallelized version of the old Burrows-Wheeler-based compressor, commonly known as bzip2, which is really fast if you have a decent number of cores. In the open-source world it's available under the package name "lbzip2". On my Linux machine, I can lbzip2 a ~2.2GB tarball of source code, at compression level 9, in about 8-12 seconds, squashing it down to ~140-160MB. Granted, that specific measurement is on a large ramdrive, but that machine has roughly the same disk setup as the desktop I described earlier, so writing to spinning rust would only add an extra second or two, since such a compressed file fits into the RAID cache and the write operation would return quickly (with the card scribbling out to disk at its leisure).
Anyway, just saying there are options. Maybe it's something Egosoft can consider for X5, whenever development on that starts up.
-
- Moderator (English)
- Posts: 3230
- Joined: Mon, 14. Jul 08, 13:07
Re: Why does X4 on Windows not cache data files?
Imperial Good wrote: ↑Tue, 19. Mar 19, 07:09
Except the files are not 300 MB, because they are compressed. They are only 30 MB or so. The repetition of "</ware>" does not matter much due to compression.

Yes, it is 300 MB. In fact, it's 300+30 MB that goes through your CPU caches once at decompression, and 300 MB again at parsing. What's the size of your L1, L2, and L3? Most of those 300 MB are dead weight, wasted cache.
Imperial Good wrote: ↑Tue, 19. Mar 19, 07:09
Currently, if someone wants to change something in their save, they can edit it fairly easily in textual form. Good luck doing that with an arbitrary, closed-source binary format.

You're fixated on using one data format for everything. Mods could use a different structure. Or there could be, you know, tools...
Imperial Good wrote: ↑Tue, 19. Mar 19, 07:09
Another advantage is that the structure is very open. There is (or at least could be) no concern about field order, size, positioning, and other general data-layout concepts which are a big part of binary file formats.

Which is what makes it even slower. When rebuilding your data structures, you want as much information as possible: sizes, offsets, layouts... X4 save files and metadata files contain none of that. You must sequentially parse them to find out.
Imperial Good wrote: ↑Tue, 19. Mar 19, 07:09
One cannot even say that processing the XML is slow.

Yes, one can. It is slow. It has to pump a lot of useless data through the CPU. It is highly redundant. It is sequential. It is an arbitrary tree, for crying out loud.

Imperial Good wrote: ↑Tue, 19. Mar 19, 07:09
Yes, it could be faster, but something else is clearly the cause of most of the loading time. Moving to a highly optimized binary format might save a second or two, but ultimately will not touch the 15+ seconds the game spends doing other stuff unrelated to directly parsing the XML.

Moving to an optimized binary format can enable parallelized universe reconstruction (and saving as well), laying out the memory in advance and filling it with data directly, with minimal or no parsing.
Most of the game's metadata is XML. Just look at the catalog files. That's just wasted time and space.
It seems I'm derailing this thread into a different topic. If you wish, we could continue this discussion in another one.
-
- Moderator (English)
- Posts: 4933
- Joined: Fri, 21. Dec 18, 18:23
Re: Why does X4 on Windows not cache data files?
radcapricorn wrote:
Yes, it is 300 MB. In fact, it's 300+30 MB that goes through your CPU caches once at decompression, and 300 MB again at parsing. What's the size of your L1, L2, and L3? Most of those 300 MB are dead weight, wasted cache.

A 10-year-old i7 920 with tri-channel DDR3 has a stock memory bandwidth of 25.6 GB/s (no memory OC). By this logic it takes at most 0.013 seconds, or 13 milliseconds, to move that volume of data between RAM and CPU. Processing will almost certainly be the bottleneck, and even then one is looking at potentially gigabytes per second. Cache has little to do with performance in this case. A few mechanical-drive seek operations take longer than this.
A more modern processor like a 9700K running dual-channel DDR4 has a memory bandwidth of 41.6 GB/s.
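Spelled out with the figures above: 330 MB / 25.6 GB/s ≈ 0.013 s, and 330 MB / 41.6 GB/s ≈ 0.008 s, so a full pass over the decompressed save costs on the order of ten milliseconds of pure memory traffic.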
radcapricorn wrote:
You're fixated on using one data format for everything. Mods could use a different structure. Or there could be, you know, tools...

Or they could just use XML, which is what they do. No need to invent solutions for problems which do not exist.
radcapricorn wrote:
Which is what makes it even slower. When rebuilding your data structures, you want as much information as possible: sizes, offsets, layouts... X4 save files and metadata files contain none of that. You must sequentially parse them to find out.

Such information might yield an order-of-magnitude performance improvement at best. XML decompression and parsing should not be a major performance problem to begin with, so it is basically a solution to a problem that does not exist.
radcapricorn wrote:
Yes, one can. It is slow. It has to pump a lot of useless data through the CPU. It is highly redundant. It is sequential. It is an arbitrary tree, for crying out loud.

Well, here are the results of a Java-based test program. This is trivial to write, since the standard API supports everything one needs to parse X4 save files. The program opens the save file as a stream, decompresses the stream, and constructs a DOM model of the XML from the decompressed stream.
For accurate measurement the process was repeated 33 times, with the timing of the first run discarded (to avoid quirks of Java and file IO and produce a more reliable XML parse speed), leaving 32 measured runs. The save being measured has a compressed size of 40,094,805 bytes (~40 MB) and an uncompressed size of 330,080,356 bytes (~330 MB), and is the most recent save from my 50-hour accumulated playthrough, with a functioning wharf and hundreds of ship assets.
The running time for each parse was 12,560 ms, or 12.5 seconds. That is a rate of about 26.3 MB/sec of XML parsed.
It is worth noting that DOM parsing of XML is inherently slow, since it generates a lot of useless metadata, such as nodes to preserve indentation. Java is also slower than C/C++ for some tasks, and this is likely one of them. The entire DOM tree had to be garbage-collected between cycles, which is itself a non-trivial operation. The stream pipeline used was entirely single-threaded; one could further optimize by putting decompression on a separate thread from parsing, effectively making the entire process bottleneck on whichever of the two is slower.
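For reference, a minimal sketch of what such a harness might look like, using only the standard JDK (the original program wasn't posted, so the details here are assumptions):

Code: Select all

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class SaveParseTimer {
    public static void main(String[] args) throws Exception {
        String path = args[0]; // path to the compressed XML save file
        long total = 0;
        for (int run = 0; run < 33; run++) {
            long start = System.nanoTime();
            try (InputStream in = new GZIPInputStream(
                    new BufferedInputStream(new FileInputStream(path)))) {
                // Build the full DOM tree from the decompressed stream.
                Document doc = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder().parse(in);
            }
            long ms = (System.nanoTime() - start) / 1_000_000;
            if (run > 0) total += ms; // first run discarded as warm-up
        }
        System.out.println("avg parse time: " + (total / 32) + " ms");
    }
}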
If one searches the internet, one can easily find references to XML parsers achieving 100+ MB/sec; some quote as high as 250 MB/sec on ancient Core 2 Duo processors. Of course, if more speed were desired, saves could always be fragmented by sector or some other top-level granularity to allow multi-core parsing, for performance improvements near 300-700% depending on the processor (quad or octa-core).
radcapricorn wrote:
Moving to an optimized binary format can enable parallelized universe reconstruction (and saving as well), laying out the memory in advance and filling it with data directly, with minimal or no parsing.

One could also enable parallel universe reconstruction by separating the save into separate XML files for each sector. Multiple separate XML files can be decompressed and parsed in parallel.
Even if a binary format is used, the data will still need to be parsed, because it has to be stored in a platform-independent way. Additionally, concepts such as cross-version save compatibility become much harder, since properties may be added or removed between versions, and all the code to deal with that starts to slow parse speed and makes maintenance harder.
Without performance profiling one cannot say that XML is the cause of the long save/load times. If it turns out not to be the case, then one would spend a lot of development resources making a process faster for minimal gains.
As it stands currently, simply upgrading from a mechanical drive to a SSD will cut load (not save) times far more than migrating to a binary save format ever could.
-
- Moderator (Deutsch)
- Posts: 4624
- Joined: Sun, 26. Jan 14, 09:56
Re: Why does X4 on Windows not cache data files?
Beat me to it. How hard can it be to read and parse some 300 MB file?
I have a parser running in the background every day at work which chews through approximately 500 XML files (~700 MB) in ~1 minute. Just not long enough to get a coffee in that time >.<
Also written in Java.
It raises one point though: Why does saving still take so much time in X4?
-
- Moderator (English)
- Posts: 4933
- Joined: Fri, 21. Dec 18, 18:23
Re: Why does X4 on Windows not cache data files?
I thought I should update the original topic of this thread: with 2.20 there has been a massive improvement to resource loading. Specifically, resource stalls occur less frequently and for less time. Even loading saves seems faster, with one getting in-game at 35%. Still running it off the same mechanical drive as well, so it's great to see such improvements.
-
- Moderator (English)
- Posts: 3230
- Joined: Mon, 14. Jul 08, 13:07
Re: Why does X4 on Windows not cache data files?
Imperial Good wrote: ↑Wed, 20. Mar 19, 02:24
A 10-year-old i7 920 with tri-channel DDR3 has a stock memory bandwidth of 25.6 GB/s (no memory OC). By this logic it takes at most 0.013 seconds, or 13 milliseconds, to move that volume of data between RAM and CPU. Processing will almost certainly be the bottleneck, and even then one is looking at potentially gigabytes per second. Cache has little to do with performance in this case. A few mechanical-drive seek operations take longer than this.

As far as parsing is concerned, cache has everything to do with performance. An average tag in the save translates to 16 bytes of game data (most of the tags seem to represent a pointer to a string plus some int/float value, maybe a hash key as well). Some more, some less. Most of the tags themselves approach or exceed the size of a single cache line. So when parsing, you're likely to get 3 or more L1 misses per one L1 write. That's just nasty.
But that's only as far as parsing itself is concerned, which is not at all what I'm talking about.
Imperial Good wrote: ↑Wed, 20. Mar 19, 02:24
Well, here are the results of a Java-based test program...
The running time for each parse was 12,560 ms, or 12.5 seconds. That is a rate of about 26.3 MB/sec of XML parsed.

Congrats, you've measured DOM parsing. Which is far from everything the game has to do when loading. It has to reconstruct (or patch) all the game's data structures (and that's if we "forget" about loading the assets). You can't hope to achieve any speed there if for every entry you're doing a hash lookup, a strcmp or three, a (re-)allocation, a hash insertion, a conversion from string to int/float, etc., while jumping all over memory to boot. Which brings us back to cache, which would have to be shared between the 300 MB of uncompressed XML and all the game's data structures. All of which wouldn't fit into any L3, let alone L1, no matter how badly you'd like it to.
(Actually, if one's not careful when parsing that 300 MB of XML, the resulting in-memory DOM may end up significantly larger, and fragmented into oblivion, but I don't know what EGO does there.)
Imperial Good wrote: ↑Wed, 20. Mar 19, 02:24
Without performance profiling one cannot say that XML is the cause of the long save/load times. If it turns out not to be the case, then one would spend a lot of development resources making a process faster for minimal gains.

Yes, one can. The further you diverge your on-disk data representation from your in-memory representation, the slower your conversion gets. XML loses nearly everything, from layout to lookups. But you're right in that we can't know how large the impact is without measuring. Perhaps I should take a weekend to write some tests, even if only to shut myself up.
-
- Moderator (English)
- Posts: 4933
- Joined: Fri, 21. Dec 18, 18:23
Re: Why does X4 on Windows not cache data files?
radcapricorn wrote:
As far as parsing is concerned, cache has everything to do with performance. An average tag in the save translates to 16 bytes of game data (most of the tags seem to represent a pointer to a string plus some int/float value, maybe a hash key as well). Some more, some less. Most of the tags themselves approach or exceed the size of a single cache line. So when parsing, you're likely to get 3 or more L1 misses per one L1 write. That's just nasty.

Well, since the data is only read once, there will always be cache misses, no matter how compact the data is. Cache only really matters when processing data where related values might be read or written multiple times in a short period of time.
In theory, optimum performance for single-threaded loading would be achieved using an intermediate buffer for the decompressed data that fits inside the L1 cache, for example 8 or 16 KB. Any larger and it spills over into the L2 and eventually L3 caches, which are still faster than memory. That way the only full cache misses would be for reading file data or writing game state, both of which are unavoidable, and most of the parsing would happen in very fast L1 cache. If using a multi-threaded pipeline to process the data (one thread decompresses, another thread parses), then the total buffer volume for decompressed data must easily fit within the shared cache for optimum performance when moving the data between cores.
Of course, this all assumes memory-mapped IO where all file pages already reside in main memory. Otherwise the kernel-level switching needed for the IO operations will dominate the read time far more than a cache miss ever could.
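A rough sketch of that small-buffer pipeline, in Java for consistency with the earlier test (the 16 KB figure comes from the description above; parseChunk stands in for a hypothetical incremental parser):

Code: Select all

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class SmallBufferLoad {
    public static void main(String[] args) throws Exception {
        // Decompress through a small fixed buffer so the hot parse loop
        // mostly stays within L1/L2 cache, as described above.
        byte[] buf = new byte[16 * 1024];
        try (InputStream in = new GZIPInputStream(new FileInputStream(args[0]))) {
            int n;
            while ((n = in.read(buf)) > 0) {
                parseChunk(buf, n); // hypothetical incremental parser
            }
        }
    }

    static void parseChunk(byte[] data, int len) {
        // feed bytes to a streaming parser / state machine
    }
}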
radcapricorn wrote:
Which brings us back to cache, which would have to be shared between the 300 MB of uncompressed XML and all the game's data structures. All of which wouldn't fit into any L3, let alone L1, no matter how badly you'd like it to.

Which is why the data should be processed in smaller buffers, refilled as needed, to avoid this. Not that this is much of an issue, as modern CPUs can handle gigabytes of cache misses every second. As mentioned above, it only really becomes an issue if one memory-maps the file for IO and all pages of the file are in memory; otherwise, the kernel-level switching needed to perform IO will limit performance far more than cache misses ever could.
radcapricorn wrote:
(Actually, if one's not careful when parsing that 300 MB of XML, the resulting in-memory DOM may end up significantly larger, and fragmented into oblivion, but I don't know what EGO does there.)

DOM is a dumb way to parse XML. It is very convenient for hacking together something to read or manipulate save files, since it does not need to know how the XML data is structured, but it is painfully slow as a result, and it even creates objects for indentation and comments, which are not useful data. The DOM itself is another form of the data, effectively another stage of processing, which is a huge overhead for this much data.
Since the developers know the structure of the data, they likely use a stream-based parser, which will be a lot faster, as it can create useful game objects directly and work with optimum buffer sizes.
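We don't know what Egosoft's loader actually looks like, but a stream-based pass using the JDK's StAX API would be shaped roughly like this ("ware" is just an example element borrowed from earlier in the thread):

Code: Select all

import java.io.FileInputStream;
import java.util.zip.GZIPInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamParseDemo {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        try (GZIPInputStream in = new GZIPInputStream(new FileInputStream(args[0]))) {
            XMLStreamReader r = factory.createXMLStreamReader(in);
            long wares = 0;
            while (r.hasNext()) {
                // Events are pulled one at a time; no tree is ever built.
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && r.getLocalName().equals("ware")) { // example element
                    wares++; // a real loader would build game objects here
                }
            }
            r.close();
            System.out.println("ware elements: " + wares);
        }
    }
}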
radcapricorn wrote:
Yes, one can. The further you diverge your on-disk data representation from your in-memory representation, the slower your conversion gets. XML loses nearly everything, from layout to lookups. But you're right in that we can't know how large the impact is without measuring. Perhaps I should take a weekend to write some tests, even if only to shut myself up.

Online there are references to XML files being parsed at hundreds of MB/sec. X4 saves are hardly big by the standards of big XML data processing. It could well be that the slow load times are due to state reconstruction, something a binary format would not speed up.
Binary formats still need to be parsed, for portability and version support. On top of that, the game would still need to rebuild internal state, since it may use caches, lookup tables and the like which are not worth saving, as they are massive and merely a faster form of data already in the save.
-
- Moderator (English)
- Posts: 3230
- Joined: Mon, 14. Jul 08, 13:07
Re: Why does X4 on Windows not cache data files?
I feel we're not on the same page here. Let's try an example: zcat a save and look for the first occurrence of "<resourceareas>". In one of my saves, that element alone is 177,662 bytes. It's an array of 554 "areas", which means each area is ~320 bytes on average.
But what actual data is there?
- Three coordinates per area. Assuming those are in world space, let's be generous and say 64 bits each. So 24 bytes.
- Per resource: max recharge (whatever that means), time, and yield. Looking at the value ranges, I'm assuming unsigned 16-bit integers for recharge and time, and 8 bits for yield (that's an enum). So, padded, each resource is 8 bytes.
Parsing each "area", on average:
- Read 320 bytes (and since we're always parsing, that's 320 table lookups, one per character, even granting this particular all-ASCII case).
- ~56 strcmps and branches for tags and attributes.
- ~6 atols for converting data.
- Not including branches for choosing correct storage location, as these can be replaced by offsets.
- Check if your staging buffer is full (another branch).
- If it is, grow the sector's areas array. That's another branch plus a costly operation on the slow path; then memcpy the staging buffer into game storage.
- Repeat until the closing tag (which ends up being, obviously, 554 times).
Let's translate this into a hypothetical binary storage. The game has 6 resource types (ore, silicon, nividium, helium, hydrogen, methane), so each area will be represented by 24 + 6*8 = 72 bytes. Even resources absent from an area will be there, simply all zeroes. 554 areas = 39,888 bytes; add 4 more bytes for the array length: 39,892. With a "huge" overhead of 0.01%, this is four and a half times smaller than an XML that, in this particular sample, represents only a third of the data (2 resources per area instead of 6). Parsing? Read the 4 bytes of array length, and parsing's done. Grab your sector, allocate length*72 bytes, memcpy length*72 bytes. That's it. And, unlike with the XML, that's not an average.
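As code, reading that hypothetical layout is little more than a loop over fixed-size records (field order, byte order, and padding are all assumptions carried over from the estimate above):

Code: Select all

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ResourceAreaReader {
    // Hypothetical fixed layout from the estimate above:
    // 3 x 64-bit coordinates (24 bytes) + 6 resources x 8 bytes = 72 bytes/area.
    static void readAreas(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN); // assumed on-disk byte order
        int count = buf.getInt();           // the 4-byte array length
        for (int i = 0; i < count; i++) {
            double x = buf.getDouble();
            double y = buf.getDouble();
            double z = buf.getDouble();
            for (int res = 0; res < 6; res++) {
                int recharge = buf.getShort() & 0xFFFF; // u16
                int time = buf.getShort() & 0xFFFF;     // u16
                int yieldVal = buf.get() & 0xFF;        // u8 enum
                buf.position(buf.position() + 3);       // pad to 8 bytes
                // a real loader would copy straight into game arrays here
            }
        }
    }
}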
Of course, this is just a random example. Without a thorough look at all the data in saves it may even be extreme. But hopefully illustrates my general point.
Versioning? It's no less a concern for XML (at least if we're talking pull streaming). As for binary portability: as far as I'm aware, X4 isn't currently shipped to big-endian non-IEEE754 potatoes. But lookups? That isn't even funny. It would be an interesting exercise, though, to count how many logarithmic lookups have to be done while parsing the XML and fetching values from hash tables for all those string attributes, where trivial O(1) lookups could be used instead.
Addendum:
Continuing on lookups: in that same save there are ~5 million quoted alphanumeric strings (attribute values), excluding numbers and bracketed hex indices. They amount to ~61 MB of data. Of those 5 million, only ~35 thousand are unique, which would reduce the total size 100 times, to ~650 KB.
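A dictionary encoding along those lines takes only a few lines of code (a sketch; the class and method names are made up):

Code: Select all

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: dictionary-encode repeated attribute strings. Each unique
// string is stored once; every occurrence becomes a fixed-size index,
// turning ~61 MB of repeated text into a ~650 KB table plus indices.
public class StringTable {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> strings = new ArrayList<>();

    public int intern(String s) {
        return ids.computeIfAbsent(s, k -> {
            strings.add(k);
            return strings.size() - 1;
        });
    }

    public String lookup(int id) {
        return strings.get(id); // O(1), no hashing on the read path
    }
}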