Character set handling in libopenmpt

Started by cspiegel, July 10, 2023, 03:49:58

cspiegel

This is about libopenmpt.

I've noticed that some module messages contain "weird" characters, and when those characters are interpreted as CP-437 instead of Unicode/UTF-8, things look as they should. Looking over the libopenmpt source, I see that character sets are handled via m_modFormat.charset. However, it's not always right. There seem to be some heuristics to determine the charset, but they're not perfect. So this may be an unsolvable issue in general, but I'll still describe some of what I've found.

First, I wrote a small program to try to ascertain whether there are encoding issues in module messages. It's available here: https://github.com/cspiegel/openmpt-charset

If you run this against a MOD (or MODs) it'll do conversions from CP-437 to UTF-8 and show the differences. Things have to be eyeballed because, of course, the conversion will happily turn an already-correct UTF-8 message into garbage, as there's no automatic way to know what's right. But here are some examples of what it found (left side is as returned by libopenmpt, right is the conversion assuming CP-437):

https://modarchive.org/index.php?request=view_by_moduleid&query=156520

þþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþ           | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
þþ  þþþþ  þþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþ           | ■■  ■■■■  ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
þþ  þþþþ  þþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþ           | ■■  ■■■■  ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
þþ  þþþþ  þþþþ    þþþþ  þþþþþþþþ  þ     þþþ   þþþ  þþþþ    þþþ  þþ  þþ           | ■■  ■■■■  ■■■■    ■■■■  ■■■■■■■■  ■     ■■■   ■■■  ■■■■    ■■■  ■■  ■■
þþ        þþþ  þþ  þþþ   þþþþþþ   þ  þþþþþ  þ  þþþþþþþ  þþ  þþ  þ þþþþ           | ■■        ■■■  ■■  ■■■   ■■■■■■   ■  ■■■■■  ■  ■■■■■■■  ■■  ■■  ■ ■■■■
þþ  þþþþ  þþ  þþþþ  þþ  þ þþþþ þ  þ     þþþ  þþþþ  þþ  þþþþþþþ   þþþþþ           | ■■  ■■■■  ■■  ■■■■  ■■  ■ ■■■■ ■  ■     ■■■  ■■■■  ■■  ■■■■■■■   ■■■■■
þþ  þþþþ  þþ  þþþþ  þþ  þþ þþ þþ  þ  þþþþþþþ  þþþ  þþ  þþþþþþþ   þþþþþ           | ■■  ■■■■  ■■  ■■■■  ■■  ■■ ■■ ■■  ■  ■■■■■■■  ■■■  ■■  ■■■■■■■   ■■■■■
þþ  þþþþ  þþþ  þþ  þþþ  þþþ  þþþ  þ  þþþþþ  þ  þþ  þþþ  þþ  þþ  þ  þþþ           | ■■  ■■■■  ■■■  ■■  ■■■  ■■■  ■■■  ■  ■■■■■  ■  ■■  ■■■  ■■  ■■  ■  ■■■
þþ  þþþþ  þþþþ    þþþþ  þþþ  þþþ  þ     þþþ   þþþ  þþþþ    þþþ  þþ  þþ           | ■■  ■■■■  ■■■■    ■■■■  ■■■  ■■■  ■     ■■■   ■■■  ■■■■    ■■■  ■■  ■■
þþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþ           | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
þþþþ    þþþþ  þþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþ           | ■■■■    ■■■■  ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
þþþ  þþ  þþþ  þþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþ           | ■■■  ■■  ■■■  ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
þþ  þþþþ  þþ  þþ  þþ     þþ  þþþþ  þþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþ           | ■■  ■■■■  ■■  ■■  ■■     ■■  ■■■■  ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
þþ  þþþþ  þþ  þþþþþþ  þþþþþ   þþþ  þþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþ           | ■■  ■■■■  ■■  ■■■■■■  ■■■■■   ■■■  ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
þþ        þþ  þþ  þþ     þþ  þ þþ  þþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþ           | ■■        ■■  ■■  ■■     ■■  ■ ■■  ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
þþ  þþþþ  þþ  þþ  þþ  þþþþþ  þþ þ  þþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþ           | ■■  ■■■■  ■■  ■■  ■■  ■■■■■  ■■ ■  ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
þþ  þþþþ  þþ  þþ  þþ  þþþþþ  þþþ   þþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþ           | ■■  ■■■■  ■■  ■■  ■■  ■■■■■  ■■■   ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
þþ  þþþþ  þþ  þþ  þþ     þþ  þþþþ  þþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþ           | ■■  ■■■■  ■■  ■■  ■■     ■■  ■■■■  ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
þþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþþ           | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■

The attached "phorte_-_airborne.xm", as I can't find it on the MOD archive:

ÉÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍ»  | ╔═════════════════════════════════════════════════════════════════════════════╗
 º good evening all...                                                         º |  ║ good evening all...                                                         ║
 º                                                                             º |  ║                                                                             ║

etc...
So in at least these files the encoding is clearly CP-437, but libopenmpt apparently is using Windows-1252.  However:

https://modarchive.org/index.php?request=view_by_moduleid&query=40235

Copyright © 2002 by Petr Cvikl.                                                  | Copyright ⌐ 2002 by Petr Cvikl.

So this one is right... which circles back to, which heuristic is best to use? Maybe you've gone through this in much more detail and this has the fewest false positives, but I want to at least bring attention to it. Actually, as far as I can tell, these are the only 3 .it/.xm files I have which have any non-ASCII characters in their messages at all, so it's a tiny sample size.
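
For anyone who wants to reproduce the left/right comparison above without the linked tool, here is a minimal Python sketch. It is not the code from that repository: the function names are made up for illustration, and it assumes the text libopenmpt returned was decoded as Windows-1252, so the original bytes can be recovered by encoding back to that codepage.

```python
def reinterpret_as_cp437(text, assumed="cp1252"):
    """Undo a decode done with `assumed` and decode the same bytes as CP-437.

    `assumed` is a guess about what the library used; characters that do not
    exist in that codepage are replaced with '?' instead of raising.
    """
    raw = text.encode(assumed, errors="replace")
    # Python's cp437 codec defines all 256 byte values, so this cannot fail.
    return raw.decode("cp437")


def show_differences(message):
    """Return (as_returned, as_cp437) pairs for the lines that change."""
    pairs = []
    for line in message.splitlines():
        converted = reinterpret_as_cp437(line)
        if converted != line:
            pairs.append((line, converted))
    return pairs


if __name__ == "__main__":
    for left, right in show_differences("Copyright © 2002\nþþ  þþþþ  þþ"):
        print(f"{left:<20} | {right}")
```

As with the examples above, the output still has to be judged by eye: the copyright line comes out worse after conversion, the block-graphics line comes out better.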

And there's one more point here: Composer 669 and Digitrakker files get converted to UTF-8 properly, except for control characters! For example:

ftp://ftp.modland.com/pub/modules/Composer%20669/Steve%20Mason/euphorium.669

          ^N Euphorium ^N                                                          |           ♫ Euphorium ♫

Or ftp://ftp.modland.com/pub/modules/Composer%20669/Maxwell%20The%20Madman/new%20beginnings.669

        ^C^C^C^CJennifer A.^C^C^C^C  (C)1993                                             |         ♥♥♥♥Jennifer A.♥♥♥♥  (C)1993

Finally: ftp://ftp.modland.com/pub/modules/Composer%20669/Buttwheet/the%20ritual!.669

***|^N       Thë Rïtüäl!        ^N|***                                             | ***|♫       Thδ R∩tⁿΣl!        ♫|***   
*^B*|   ^N        By:         ^N   |*^B*                                             | *☻*|   ♫        By:         ♫   |*☻*   
 ^E |^N        ButtWheet         ^N| ^E                                              |  ♣ |♫        ButtWheet         ♫| ♣

In the first two, the ^N and ^C represent actual control characters, which map to symbols in CP-437/Extended ASCII. In the last one, you can see that libopenmpt has the UTF-8 right ("Rïtüäl" instead of my garbage "R∩tⁿΣl", for example), but it's just not doing control characters.

So this is a separate issue. I'd recommend that when CP-437 is being used, control characters (apart from newline) should be translated as well. MODs written in the DOS days clearly intended for them to be displayed as such.

I'm attaching a proof-of-concept patch which does this. It adds Unicode translations for characters 1-31 (except 10) and 127, when converting from CP-437. This applies to libopenmpt 0.7.2.
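
To illustrate the idea separately from the patch (which is against the libopenmpt sources and is not reproduced here), this is a rough Python sketch of the same mapping. The glyph list is the conventional IBM PC text-mode rendering of bytes 1-31 and 127; the names `C0_TABLE`, `KEEP_AS_CONTROL` and `decode_cp437_with_c0` are invented for this example.

```python
# Glyphs an IBM PC text-mode display shows for bytes 0x01..0x1F, in order.
C0_GLYPHS = "☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼"

# Bytes that stay control characters. Only newline is kept, matching the
# description above; this set is an assumption and easy to extend.
KEEP_AS_CONTROL = {0x0A}

# Translation table: control code point -> printable Unicode glyph.
C0_TABLE = {
    code: glyph
    for code, glyph in enumerate(C0_GLYPHS, start=1)
    if code not in KEEP_AS_CONTROL
}
C0_TABLE[0x7F] = "⌂"  # DEL is drawn as a small house


def decode_cp437_with_c0(raw):
    """Decode CP-437 bytes, then replace control characters with glyphs."""
    # The standard cp437 codec maps 0x00..0x7F straight to ASCII code points,
    # so the control characters survive the decode and can be translated after.
    return raw.decode("cp437").translate(C0_TABLE)


if __name__ == "__main__":
    print(decode_cp437_with_c0(b"\x0e Euphorium \x0e"))
```

Note that with only newline excluded, a carriage return (13) turns into ♪, so files with CR/LF line endings would probably need 0x0D added to `KEEP_AS_CONTROL` as well.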

manx


So, there are 2.5 "issues":


1: Windows codepages in ModPlug Tracker files

greatness_awaits_you.it
ModPlug Tracker 1.09 - 1.16

phorte_-_airborne.xm
ModPlug Tracker 1.09

cvikl.it
ModPlug Tracker 1.09 - 1.16

These are all saved with old versions of ModPlug Tracker. ModPlug Tracker did not care about character encoding and just wrote verbatim what the user entered. As ModPlug Tracker was/is an ANSI-codepage Windows application, these would be displayed to the user in the current active Windows codepage (CP_ACP) by ModPlug Tracker, which is Windows-1252 on most Western Windows installations. OpenMPT itself also still has this issue in some cases (but not all). libopenmpt assumes Windows-1252 for these files, as this is arguably the best guess. It is wrong for a lot of files: (a) anything made on a Japanese Windows installation, for example, and (b) files made with Impulse Tracker 2 or Fast Tracker 2 and then saved again (for whatever reason) by ModPlug Tracker. Additionally, it is also wrong (c) if the user actively worked around what ModPlug Tracker did, and entered text that would be displayed as intended on the DOS trackers.

However, I do not think there is anything we can do for these files. In most ModPlug Tracker installations, the person saving the file would have seen exactly what libopenmpt is returning, so this is "correct" in that sense. The examples you gave are likely case (b) or (c).
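
As a small illustration of why the stored bytes alone cannot settle this, the following Python sketch decodes one byte string under three candidate codepages. The candidate list is only an example: cp1252 for Western Windows, cp437 for DOS, and cp932 for a Japanese Windows installation. It says nothing about how libopenmpt picks an encoding.

```python
# Candidate codepages a file written "verbatim" might have been typed in.
# This list is illustrative, not exhaustive.
CANDIDATES = ("cp1252", "cp437", "cp932")


def decode_candidates(raw):
    """Decode the same bytes under each candidate codepage."""
    return {name: raw.decode(name, errors="replace") for name in CANDIDATES}


if __name__ == "__main__":
    # Byte 0xA9 is a copyright sign in Windows-1252 but something else
    # in the other two codepages.
    for name, text in decode_candidates(b"Copyright \xa9 2002").items():
        print(f"{name:>7}: {text}")
```

Every decode succeeds, and nothing in the file says which result the author actually saw.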


2: CP437 control characters

euphorium.669

new%20beginnings.669

the%20ritual!.669

This is a simpler case. We currently do not differentiate between CP437 (as defined by Unicode, see https://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT) and CP437-with-printable-C0 (I think I have seen an official Unicode mapping for that in the past, but I cannot find it right now).

In any case, the fix would not be to change the meaning of CP437 in the complete code base. The meaning of CP437 is clearly defined by Unicode, and we would rather not deviate from that. Also, we are not using our custom conversion table on Windows at all because we use the platform API to transcode these common character encodings.

The proper fix would be to add a new encoding, named cp437c0 (or something similar, preferably an official Unicode name), and determine which original DOS trackers display these values as actual glyphs, and then use this encoding for matching file types.


3:

Then, there is also the case of DOS trackers in text mode (I cannot remember which these are right now, Saga Musix likely knows better) that are running on a system which does not even use CP437. This ends up in a very similar situation to the ModPlug Tracker case, and there is likely not much we can do except assume the most likely encoding, which is CP437.

cspiegel

Thank you for the detailed response!

1. This is what I expected, and does make sense. Without an encoding being specified, there's no way to ensure things are "right". If module authors paste the wrong characters in, there's not much to do. Understood.

2. I did a bit of searching and came up with a couple of things that aren't definitive, but here they are anyway.  First is a document at unicode.org entitled "IBM PC memory-mapped video graphics to Unicode":

https://unicode.org/Public/MAPPINGS/VENDORS/MISC/IBMGRAPH.TXT

This document may have been informed by a 1984 IBM document entitled "REGISTRY, Graphic Character Sets and Code Pages" which purports to describe code page 00437:

https://public.dhe.ibm.com/software/globalization/gcoc/attachments/CP00437.txt

Being from 1984 it clearly isn't Unicode related, but at least it may be a useful historical reference.

Since you sound open to at least exploring the idea of rendering control characters (in appropriate circumstances), I'll do some research into the relevant DOS trackers and how they were displayed. If I get useful information, I'll open a request on the bug tracker to add in a new "control-enabled" 437 and we'll see how things look.

Thanks again for the in-depth explanation.

Saga Musix

In addition to what manx mentioned, there are actually more like 2.6 issues. ;)

ModPlug Tracker and earlier OpenMPT versions used the OEM_CHARSET flag for creating the comments font, which, on my particular Windows XP virtual machine, causes those characters in the first example to show up as triangles, for instance:

oem_font_mpt.png

I do remember that on my Windows 98 machine, when I was using the same ModPlug Tracker version, they would show up as CP437 characters. It is also possible that they did show up as CP437 characters on the machine of the creator, or it could be that they wrote the message in Impulse Tracker but later re-saved the file in ModPlug Tracker, effectively breaking our detection heuristics. Running the same ModPlug Tracker build on a modern Windows 10 machine does not show those triangles but uses the locale (so they show up as ß).
Effectively this means that for modules saved with old MPT versions, there might be two completely independent unknown charsets being used for encoding 1) the song message (which uses the OEM charset flag) and 2) all other text strings, which is not something libopenmpt can currently handle. Once the codebase gets updated to handle these encodings individually, we could, in theory, assume CP437 encoding for the song message in old MPT-made files (but not for OpenMPT-made files of course).

Quote: In any case, the fix would not be to change the meaning of CP437 in the complete code base. The meaning of CP437 is clearly defined by Unicode, and we would rather not deviate from that.
It would be safe to assume that in all supported tracker formats, control characters were supposed to show up as printable characters such as hearts or musical notes. I think it would make sense to convert them to their Unicode equivalents instead of keeping them as control codes.