SoundFont Format Extensions Specification

Started by stgiga, February 08, 2025, 22:07:54


stgiga

After 2 years and a computer failure, some other people and I have managed to turn our 2023 attempt at extending the SoundFont specification into an actual, proper specification, called SFe, available here:

https://github.com/SFe-Team-was-taken/SFe/releases/tag/4.0

We have Polyphone's developer, as well as other player developers, in on this.

We have also resurrected Silicon SoundFonts (Section 11 of the SoundFont spec, which is an official SF2-to-ROMpler mode). I've documented it with a reference implementation at https://github.com/stgiga/siliconsfe/tree/main and it's factored into our new spec.

I've talked about this with Saga before, and I'm willing to wait until the format gets adopted before it's added. This post is just to signify that it exists in a dedicated spec now rather than just a bunch of documentation as a post.
I use they/them pronouns and do tech.

Saga Musix

Okay, I tried to, but I really don't know how to put this nicely. What a train wreck.
How can you claim to write a standard when there isn't a single reference implementation? (A demo soundfont file is not a reference implementation, it is reference data.)
How can you design all of this without directly involving any of the people that you expect to implement your specification, or without having apparently ever written a soundfont player yourself?
I hope you realize that this is not a viable approach when the first thing that happens after releasing your specification is people pointing out the errors in the design.

Random observation just by skimming the specification: Why is sample compression a bitset and not an enum?! A bitset implies that several bits can be set at the same time, but how can a sample be both a FLAC sample and an Opus sample at the same time? Why do you design a file format in 2025 that can contain MP3 samples, without even going into detail about how you intend to address inherent problems of the MP3 format for this use case, such as encoder pre-delay and the inability to have samples of arbitrary length (unless you are using LAME/Xing extensions)? Both of these problems were already solved by Ogg Vorbis 25 years ago.
This smells very much like a feature you thought would be cool to have, without ever having implemented it.
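
To illustrate the bitset-versus-enum point, here is a minimal sketch. The names `CodecFlags`, `Codec`, and every numeric value in it are hypothetical and are not taken from the SFe specification; it only shows that a bitset can encode a contradictory "FLAC and Opus" state, while an enum rejects it at parse time.

```python
from enum import Enum, IntFlag


class CodecFlags(IntFlag):
    """Bitset-style field. Bit assignments are hypothetical, not from SFe."""
    VORBIS = 0x1
    OPUS = 0x2
    FLAC = 0x4
    MP3 = 0x8


class Codec(Enum):
    """Enum-style field. Values are hypothetical, not from SFe."""
    PCM = 0
    VORBIS = 1
    OPUS = 2
    FLAC = 3
    MP3 = 4


def codecs_claimed(raw: int) -> int:
    """Count how many codecs a bitset value claims at once."""
    return bin(int(CodecFlags(raw))).count("1")


def parse_codec(raw: int) -> Codec:
    """Parse an enum value; raises ValueError for anything undefined."""
    return Codec(raw)


# 0x6 sets both the OPUS and FLAC bits: representable, but meaningless.
ambiguous = codecs_claimed(0x6)

# The enum has no way to express "two codecs at once".
try:
    parse_codec(6)
    enum_rejected = False
except ValueError:
    enum_rejected = True
```

With a bitset, every reader has to add its own validation for the combinations that make no sense; with an enum, exactly one codec per sample falls out of the encoding itself.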

And yet another inconsistency regarding compressed samples: "Compressed samples are always 16-bit samples" (page 48). So what happens if I were to store a compressed 32-bit FLAC sample? You probably meant to say instead that the sm24 chunk simply does not apply to compressed samples, but the language is too vague for that. Specifications must use watertight language or someone is bound to implement them incorrectly.

And seriously, Silicon SoundFonts?! Do you seriously think anyone is going to build a hardware soundfont player based on this 30-year-old specification in the year 2025? And the worst thing is, the "improved" documentation actually makes many things less clear than the "poor documentation" of the original spec. It seems like you just wanted to write documentation for the sake of writing documentation, and not because there is an actual need for it.
Not that it would even matter, but here's just one extremely obvious example, where the new documentation is actually worse than the original:
You renamed "checksum" and "checksum2sComplement" to "bankChecksum" and "bankChecksum2sComplement", probably because you assumed that the checksum includes just the soundfont itself, but not the ROM header. At the same time, you claim that it's a CRC-16 checksum. Both of these things are, very obviously, in direct contradiction to the original documentation, which becomes clear if you read the documentation on the "checksum2sComplement" value: "for updating checksum variable w/o changing file checksum value".
Anyone that has ever worked with checksummed ROMs would be able to tell you that this implies that the checksum is for the entire ROM contents, including this very header, and that a CRC-16 checksum would be technically impossible to use that way, because adding the checksum and its complement to the ROM would change the CRC-16. Most likely the checksumming algorithm that Creative intended to use here was a simple running sum of all bytes (or words) in the ROM modulo 65536, because adding the checksum and its complement to such a sum would not change the final checksum. Nobody would have added circuitry for calculating a CRC-16 checksum 30 years ago, just for verifying the correctness of ROM contents.
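
The argument above can be sketched in a few lines. This is only an illustration under stated assumptions: the ROM layout (two 16-bit fields at the start of the image), the little-endian word order, and the helper names `word_sum16` and `twos_complement16` are all hypothetical, and the running sum is the guessed algorithm, not something the original spec confirms. `binascii.crc_hqx` (CRC-CCITT) stands in for "some CRC-16".

```python
import binascii
import struct


def word_sum16(data: bytes) -> int:
    """Running sum of little-endian 16-bit words, modulo 65536."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    words = struct.unpack("<%dH" % (len(data) // 2), data)
    return sum(words) % 65536


def twos_complement16(value: int) -> int:
    """16-bit two's complement: value + result == 0 (mod 65536)."""
    return (-value) % 65536


# Hypothetical ROM image: checksum and complement fields, then the contents.
payload = bytes(range(256)) * 4
blank = struct.pack("<HH", 0, 0) + payload  # both fields still zero

checksum = word_sum16(blank)
complement = twos_complement16(checksum)
patched = struct.pack("<HH", checksum, complement) + payload

# The two fields cancel each other out, so the sum over the whole image,
# header included, still equals the stored checksum.
sum_unchanged = word_sum16(patched) == checksum

# A CRC has no such cancelling pair: writing the fields changes the CRC.
crc_unchanged = binascii.crc_hqx(patched, 0) == binascii.crc_hqx(blank, 0)
```

Here `sum_unchanged` comes out true and `crc_unchanged` false, which is why a checksum/complement pair points to an additive checksum over the entire ROM rather than a CRC-16.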

Please remove my name from the credits. All I have ever told you is the things you shouldn't be doing (like in this post, again), and I really don't want to be associated with such a half-baked specification.
» No support, bug reports, feature requests via private messages - they will not be answered. Use the forums and the issue tracker so that everyone can benefit from your post.

stgiga

Addressing a few things:

Spessasynth's developer had a similar idea and joined forces later on, and said user's player may well be the first player supporting it.

Truth be told, a significant chunk of my involvement was in the earlier stages, and by the time Spessasynth's developer got involved, I mostly was in the role of either spec research or line-item changes/vetoes. As in, there were times where much of what I was doing was determining the sensibility of certain changes and offering feedback, or doing research in the spec, not codifying anything. Sometimes there were moments where I, busy in uni, allowed the other devs to figure out certain things on their own.

That said, in regards to the Silicon stuff, during a LOT of the research phase, especially into unused fields, I DID say that some field info was conjecture, and I wanted that made clear but that didn't happen. As for the "modern hardware synth being a wild idea", I mean, Silicon SoundFonts itself was a wild idea, but as a chiptune fan it just felt alluring. Prior to learning of it I had already wanted my banks on a chip, so the whole thing felt like a dream come true. And some of my teammates did agree with you on stuff like the checksum of the entire file. The CRC16 situation was also conjecture. Some of it came out of the fact that the pre-Silicon bank's CRC16 was a clean hex number. I don't think modulus ever crossed my mind given my lack of math experience. I ALSO wasn't aware of CRC16 being harder to decode than a modulus.

I was not involved in any of the compression parts, nor did I choose to rename the fields in Silicon. Also, the rest of the team shot down putting the sample chunks at the end, and I didn't feel like starting contention over it. The consensus was that you were right that sample chunks at the end isn't the best idea.

None of us wrote this "for the sake of writing documentation". I may be studying to be a technical writer, but I didn't use THIS as my testbed. My contribution to the wording was not significant. Oh and I DID make the font used for codeblocks (UnifontEX, whose documentation started out as a technical manual but became so much more). And sure, lines like the "orphaned" one (orphaned sm24) are things I coined, in this case even borrowed. It's not the only one in which a quote from me on Discord or a Github Issue/Discussion became part of the standard. But the "leaf" thing was coined by the other founding dev. Both of our semi-neologisms made it into the spec in one way or another, even though my role was more of a research one. Or by the end, R&D to an extent. But this effort was a serious effort. And of note is that I was the person who found Silicon SoundFonts in the original spec. I had been looking for information on sample rate to prove the 50kHz figure was to keep cards happy, but I went too far down and saw Silicon SoundFonts and my jaw dropped, and then one thing led to another.


Furthermore, in regards to "reference data without a reference implementation", I have actually done the reverse at one point.

Another problem is that communication with the other developers is slower than it used to be given one of them going on a hiatus from a quicker form of social media, and the other not being involved. We are ALSO in wildly-different time zones. I'm 8 hours behind the other founding dev. Both of us are night owls. Given how by 2024 Discord contact between us became spotty, I felt like a LOT of things I felt needed a quicker response than Github Issues weren't addressed ideally.

And do note that I don't think or talk like most people, in more ways than one. Even the other founding dev sometimes gets what I say wrong, no matter how much specific detail I go into. That is part of the whole debate we had over Silicon's quirks, especially when the AWE32 situation went down. It seemed like the point was being missed, most likely on both sides. We both had issues interpreting what the other meant at times.

The other members often rapid-fire communicated faster than I can on Github, hence why in conjunction with university I at some stages let them try to figure things out. Maybe there were some elements where I should have done something. And I think we all were stumped in some areas of Silicon and went for educated hypotheses based on what seemed likely. Emphasis on seemed. I think I took the programmer's approach. As for the other founding dev, they were surmising Silicon was just a clone of the AWE32 ROM, and that didn't seem correct, and so it was something we were mixed about. I also was mixed about the romRsrc field's purpose, with myself even. Admittedly, I even theorized about using trailing sample chunks plus the offset shenanigans in the Silicon header to amplify the effects of trailing sample chunks, but that fell to the wayside.

And if you think this spec is bad, well... let's just say that my own code is hacky, and I've created other unconventional file formats, one of which I'm now having to extend after the fact to correct something I missed. And the readme for my extension of GNU Unifont is 200KiB (and one that definitely is quirky), and it comes in niche formats, some of them I created. So, like, I've done worse. That being said, most of the wording of the SFe spec was *not* my doing. I was more of the research side of things. I didn't create the reference data, like sine8192.bin, plus the bin file translating the header spec to an actual file. Also, the dev who did, I had to correct on how large the header should be. We both had mixed views on the more-unclear parts of the spec. And I admit, I was somewhat idealistic about certain sections. We had some pitched brainstorming sessions over some of those unclear parts. In some aspects, I did feel a bit guilty about certain sections that had a more-idealistic origin. As in, I felt like I should have just *not* said anything on them. But I did. Sometimes the gut, even educated, lies.

As for "not consulting", well, the other founding dev is more-stringent on open-source than I am, and in 2022, they had creative differences with another collaborator who I had gotten them involved with in 2019, which is why I'm now second-in-command. I have my own issues with communication too. So the main problem with "not communicating" was that it was already hard for the team to coordinate by the later era, and believe me, coordinating all SF2 devs wasn't easy. Oh and we tried talking about our extensions as early as 2023 to SoundFont people and it took 23 months for any of it to get anywhere. We weren't doing an "after-the-fact" communication. We just ran into gridlock until quite recently. And honestly, the reason the majority dropped trailing sample chunks for 8GiB rather than RIFF64 was because of your stance on it. The consensus listened and also felt that extension wasn't needed.

I think this is what immediately comes to mind, but it likely isn't everything.