SoundFont Format Extensions Specification

Started by stgiga, February 08, 2025, 22:07:54


stgiga

After 2 years and a computer failure, I and some other people have managed to put our 2023 attempt at extending the SoundFont specification into an actual and proper specification, called SFe, available here:

https://github.com/SFe-Team-was-taken/SFe/releases/tag/4.0

We have Polyphone's developer, as well as other player developers, in on this.

We have also resurrected Silicon SoundFonts (Section 11 of the SoundFont spec, an official SF2-to-ROMpler mode). I've documented it with a reference implementation here: https://github.com/stgiga/siliconsfe/tree/main and it's factored into our new spec.

I've talked about this with Saga before, and I'm willing to wait until the format gets adopted before it's added. This post is just to signify that it exists in a dedicated spec now rather than just a bunch of documentation as a post.
I use they/them pronouns and do tech.

Saga Musix

Okay, I tried to, but I really don't know how to put this nicely. What a train wreck.
How can you claim to write a standard when there isn't a single reference implementation? (A demo soundfont file is not a reference implementation, it is reference data.)
How can you design all of this without directly involving any of the people that you expect to implement your specification, or without having apparently ever written a soundfont player yourself?
I hope you realize that this is not a viable approach when the first thing that happens after releasing your specification is people coming out pointing out the errors in the design.

Random observation just by skimming the specification: Why is sample compression a bitset and not an enum?! A bitset implies that several bits can be set at the same time, but how can a sample be both a FLAC sample and an Opus sample at the same time? Why do you design a file format in 2025 that can contain MP3 samples, without even going into detail on how you intend to address inherent problems of the MP3 format for this use case, such as encoder pre-delay and the inability to have samples of arbitrary length (unless you are using LAME/Xing extensions)? Both of these problems have already been solved by Ogg Vorbis, 25 years ago.
This smells very much like a feature you thought would be cool to have, without ever having implemented it.
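
To make the bitset-versus-enum point concrete, here is a minimal sketch. The codec names and numeric values are hypothetical and are not taken from the SFe specification; the only point is that a bitset can encode a contradictory state that every reader must then detect by hand, while an enum cannot express it at all.

```python
from enum import Enum, IntFlag


# Hypothetical identifiers for illustration only; not the values used by SFe.
class CodecFlags(IntFlag):
    """Bitset encoding: one bit per codec."""
    VORBIS = 0x01
    OPUS = 0x02
    FLAC = 0x04
    MP3 = 0x08


class Codec(Enum):
    """Enum encoding: exactly one value per sample."""
    PCM = 0
    VORBIS = 1
    OPUS = 2
    FLAC = 3
    MP3 = 4


def names_one_codec(flags: int) -> bool:
    """Extra check a bitset reader needs: exactly one bit must be set."""
    value = int(flags)
    return value != 0 and (value & (value - 1)) == 0


def parse_codec(raw: int) -> Codec:
    """With an enum, an unknown or combined value is rejected on lookup."""
    return Codec(raw)  # raises ValueError if raw names no codec


# The bitset can represent "FLAC and Opus at once", which has no meaning.
contradictory = CodecFlags.FLAC | CodecFlags.OPUS
print(names_one_codec(contradictory))    # the reader has to catch this itself
print(names_one_codec(CodecFlags.FLAC))  # a single codec passes the check
print(parse_codec(3))                    # enum lookup for one valid value
```

With the enum variant there is no combined value to validate, so the "what does FLAC plus Opus mean" question never reaches an implementer.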

And yet another inconsistency regarding compressed samples: "Compressed samples are always 16-bit samples" (page 48). So what happens if I were to store a compressed 32-bit FLAC sample? You probably meant to say instead that the sm24 chunk simply does not apply to compressed samples, but the language is too vague for that. Specifications must use watertight language or someone is bound to implement them incorrectly.

And seriously, Silicon SoundFonts?! Do you seriously think anyone is going to build a hardware soundfont player based on this 30-year-old specification in the year 2025? And the worst thing is, the "improved" documentation actually makes many things less clear than the "poor documentation" of the original spec. It seems like you just wanted to write documentation for the sake of writing documentation, and not because there is an actual need for it.
Not that it would even matter, but here's just one extremely obvious example, where the new documentation is actually worse than the original:
You renamed "checksum" and "checksum2sComplement" to "bankChecksum" and "bankChecksum2sComplement", probably because you assumed that the checksum includes just the soundfont itself, but not the ROM header. At the same time, you claim that it's a CRC-16 checksum. Both of these things are, very obviously, in direct contradiction to the original documentation, which becomes clear if you read the documentation on the "checksum2sComplement" value: "for updating checksum variable w/o changing file checksum value".
Anyone that has ever worked with checksummed ROMs would be able to tell you that this implies that the checksum is for the entire ROM contents, including this very header, and that a CRC-16 checksum would be technically impossible to use that way, because adding the checksum and its complement to the ROM would change the CRC-16. Most likely the checksumming algorithm that Creative intended to use here was a simple running sum of all bytes (or words) in the ROM modulo 65536, because adding the checksum and its complement to such a sum would not change the final checksum. Nobody would have added circuitry for calculating a CRC-16 checksum 30 years ago, just for verifying the correctness of ROM contents.
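
A minimal sketch of the additive scheme described above, assuming a sum of little-endian 16-bit words modulo 65536. This is the conjectured algorithm, not a confirmed description of what Creative used, and the ROM layout below (a payload followed by two header words) is hypothetical.

```python
import struct


def word_sum(rom: bytes) -> int:
    """Sum of all little-endian 16-bit words in the image, modulo 65536."""
    if len(rom) % 2:
        raise ValueError("ROM image must contain a whole number of 16-bit words")
    words = struct.unpack(f"<{len(rom) // 2}H", rom)
    return sum(words) % 65536


def twos_complement16(value: int) -> int:
    """The 16-bit word that sums with `value` to 0 modulo 65536."""
    return (-value) % 65536


# Hypothetical image: payload plus two header words that start out as zero.
# The real Silicon SoundFont header layout is not modelled here.
payload = bytes(range(256)) * 4
blank = payload + struct.pack("<HH", 0, 0)
target = word_sum(blank)

# Write the checksum and its two's complement into the two header words.
stamped = payload + struct.pack("<HH", target, twos_complement16(target))

# The pair cancels out, so the sum over the whole image, header included,
# still equals the value stored in the header.
assert word_sum(stamped) == target
```

The cancellation relies on addition modulo 65536, which is why the same trick does not carry over to a CRC-16.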

Please remove my name from the credits. All I have ever told you is the things you shouldn't be doing (like in this post, again), and I really don't want to be associated with such a half-baked specification.
» No support, bug reports, feature requests via private messages - they will not be answered. Use the forums and the issue tracker so that everyone can benefit from your post.

n0cturn


stgiga

Addressing a few things:

Spessasynth's developer had a similar idea and joined forces later on, and said user's player may well be the first player supporting it.

Truth be told, a significant chunk of my involvement was in the earlier stages, and by the time Spessasynth's developer got involved, I mostly was in the role of either spec research or line-item changes/vetoes. As in, there were times where much of what I was doing was determining the sensibility of certain changes and offering feedback, or doing research in the spec, not codifying anything. Sometimes there were moments where I, busy in uni, allowed the other devs to figure out certain things on their own.

That said, in regards to the Silicon stuff, during a LOT of the research phase, especially into unused fields, I DID say that some field info was conjecture, and I wanted that made clear but that didn't happen. As for the "modern hardware synth being a wild idea", I mean, Silicon SoundFonts itself was a wild idea, but as a chiptune fan it just felt alluring. Prior to learning of it I had already wanted my banks on a chip, so the whole thing felt like a dream come true. And some of my teammates did agree with you on stuff like the checksum of the entire file. The CRC16 situation was also conjecture. Some of it came out of the fact that the pre-Silicon bank's CRC16 was a clean hex number. I don't think modulus ever crossed my mind given my lack of math experience. I ALSO wasn't aware of CRC16 being harder to decode than a modulus.

I was not involved in any of the compression parts, nor did I choose to rename the fields in Silicon. Also, the rest of the team shot down putting the sample chunks at the end, and I didn't feel like starting contention over it. The consensus was that you were right that sample chunks at the end isn't the best idea.

None of us wrote this "for the sake of writing documentation". I may be studying to be a technical writer, but I didn't use THIS as my testbed. My contribution to the wording was not significant. Oh and I DID make the font used for codeblocks (UnifontEX, whose documentation started out as a technical manual but became so much more). And sure, lines like the "orphaned" one (orphaned sm24) are things I coined, in this case even borrowed. It's not the only one in which a quote from me on Discord or a Github Issue/Discussion became part of the standard. But the "leaf" thing was coined by the other founding dev. Both of our semi-neologisms made it into the spec in one way or another, even though my role was more of a research one. Or by the end, R&D to an extent. But this effort was a serious effort. And of note is that I was the person who found Silicon SoundFonts in the original spec. I had been looking for information on sample rate to prove the 50kHz figure was to keep cards happy, but I went too far down and saw Silicon SoundFonts and my jaw dropped, and then one thing led to another.


Furthermore, in regards to "reference data without a reference implementation", I have actually done the reverse at one point.

Another problem is that communication with the other developers is slower than it used to be given one of them going on a hiatus from a quicker form of social media, and the other not being involved. We are ALSO in wildly-different time zones. I'm 8 hours behind the other founding dev. Both of us are night owls. Given how by 2024 Discord contact between us became spotty, I felt like a LOT of things I felt needed a quicker response than Github Issues weren't addressed ideally.

And do note that I don't think or talk like most people, in more ways than one. Even the other founding dev sometimes gets what I say wrong, no matter how much specific detail I go into. That is part of the whole debate we had over Silicon's quirks, especially when the AWE32 situation went down. It seemed like the point was being missed, most likely on both sides. We both had issues interpreting what the other meant at times.

The other members often rapid-fire communicated faster than I can on Github, hence why in conjunction with university I at some stages let them try to figure things out. Maybe there were some elements where I should have done something. And I think we all were stumped in some areas of Silicon and went for educated hypotheses based on what seemed likely. Emphasis on seemed. I think I took the programmer's approach. As for the other founding dev, they were surmising Silicon was just a clone of the AWE32 ROM, and that didn't seem correct, and so it was something we were mixed about. I also was mixed about the romRsrc field's purpose, with myself even. Admittedly, I even theorized about using trailing sample chunks plus the offset shenanigans in the Silicon header to amplify the effects of trailing sample chunks, but that fell by the wayside.

And if you think this spec is bad, well... let's just say that my own code is hacky, and I've created other unconventional file formats, one of which I'm now having to extend after the fact to correct something I missed. And the readme for my extension of GNU Unifont is 200KiB (and one that definitely is quirky), and it comes in niche formats, some of them I created. So, like, I've done worse. That being said, most of the wording of the SFe spec was *not* my doing. I was more of the research side of things. I didn't create the reference data, like sine8192.bin, plus the bin file translating the header spec to an actual file. Also, the dev who did, I had to correct on how large the header should be. We both had mixed views on the more-unclear parts of the spec. And I admit, I was somewhat idealistic about certain sections. We had some pitched brainstorming sessions over some of those unclear parts. In some aspects, I did feel a bit guilty about certain sections that had a more-idealistic origin. As in, I felt like I should have just *not* said anything on them. But I did. Sometimes the gut, even educated, lies.

As for "not consulting", well, the other founding dev is more-stringent on open-source than I am, and in 2022, they had creative differences with another collaborator who I had gotten them involved with in 2019, which is why I'm now second-in-command. I have my own issues with communication too. So the main problem with "not communicating" was that it was already hard for the team to coordinate by the later era, and believe me, coordinating all SF2 devs wasn't easy. Oh and we tried talking about our extensions as early as 2023 to SoundFont people and it took 23 months for any of it to get anywhere. We weren't doing an "after-the-fact" communication. We just ran into gridlock until quite recently. And honestly, the reason the majority dropped trailing sample chunks for 8GiB rather than RIFF64 was because of your stance on it. The consensus listened and also felt that extension wasn't needed.

I think this is what immediately comes to mind, but it likely isn't everything.

Exhale

Saga being a puffed up pratt about an open source project that might take some of his suggestions but otherwise functions as a healthy and dynamic open source project... how surprising and unexpected.
___________________
No longer helping. Do not expect replies.

Saga Musix

Quote from: Exhale on February 18, 2025, 07:02:07
Saga being a puffed up pratt about an open source project that might take some of his suggestions but otherwise functions as a healthy and dynamic open source project... how surprising and unexpected.
If all you have left to contribute to this community is insulting me or manx in quite literally every post outside of the song sharing forum, without apparently even understanding even a single bit of the criticism in my post above, or even accepting the fact that this format specification may wreak complete havoc in the soundfont community: Please leave. This is the last time I (and manx) will accept this kind of slander from you.

manx


My comments from skimming through the specification:


Technical:

 *  I was not able to find a concise high level overview of features that SFe4 adds on top of SF2.04. I have to read the whole spec, and compare it with prior knowledge about SF2.04, to even get a glimpse at the fundamental "WHY?".
 *  SFe4 files are not compatible or readable by an SF2.04 parser anyway, so why keep all the historic format nonsense in a supposedly (but failed) backwards-compatible fashion if no old software will be able to read it anyway?
 *  RIFF file structure is really not explained, and neither is any one of the various RIFF specifications explicitly mentioned or linked. This assumes prior knowledge of the reader, probably including a lot of unclear aspects about RIFF.
 *  Is the order of chunks mandated by specification? semantically significant? flexible? dependent on context/chunk? There are examples for all of these in the wild, by Microsoft alone.
 *  It is unclear to me whether "legacy SF2.04" refers to what is specified in sfspec24.pdf, or whether the SFe4 spec tries to retro-actively amend SF2.04 with specification clarifications contained within the SFe4 spec. If the latter is the intention, these should be separate documents (or you are doing what USB and HDMI did - please just don't). If the former is the intention, why even mention the legacy format if the spec does not add anything of value to it? It just confuses the reader, except when explicitly mentioned as a comparison to previous behavior. Also, you really should have made the document independent (which implies re-wording everything written by E-mu) of sfspec24.pdf, of which I was not even able to locate a properly licensed copy that I would be allowed to have received.
 *  Why even keep 32bit RIFF around? That's just confusing for SF2.04 software.
 *  And why even a big-endian variant on top of that? That's 3 distinct file format definitions in one specification. Why do you need *3*?
 *  Why keep 4 byte FOURCC magic values that just read like gibberish and are severely limited in namespace?
 *  Like most people who never implemented portable code that has to deal with fourcc codes specified in the specification using the following C code `DWORD fourcc = 'abcd';`, you completely missed that this is not portable C code, and even for compilers where this is implemented, the resulting in-memory byte order is INDEPENDENT of platform endianness and purely an arbitrary choice by the compiler, potentially different even on the same compiler on different platforms or on different compilers on the same platform. YOU NEED TO EXPLICITLY SPECIFY THE STORED ORDER OF YOUR FOURCC BYTES, DAMMIT. YOU REALLY NEED TO! Or even better, stop calling the data type for your fourcc codes "DWORD", and use "char[4]" instead.
 *  Why RIFF in the first place? XML or JSON are well understood encodings of structural data and perfectly fit the needs of SFe4, and can structurally and syntactically be validated without analyzing semantic information. If you are so concerned with file size (which arguably you are not, given various examples of space wasting in the specification), they both have well-specified binary representations.
 *  Why x86/Windows-specific type names? DWORD, SHORT, ... in a portable format specification, seriously?
 *  Why the random mixup of 2byte-aligned and non-2byte-aligned RIFF chunks? 2byte alignment makes NO sense on modern systems. If anything, use 8byte alignment, or better remove all alignment padding altogether, as this is a file format with encoded data, and not an in-memory runtime data structure. Anything else is just stupid.
 *  The non-2byte-aligned smpl sub-chunk does not specify what precisely may be non-aligned. The stored length? The actual stored data? The stored chunk? The following chunks with respect to file start? The following chunks with respect to previous chunks? Hell, this is THE SINGLE ONE PRIMARY REASON why RIFF is a stupid format, and in particular fucking around with its 2byte-alignment just causes even more severe parsing troubles. The specification needs to be absolutely watertight in this area, because getting this wrong desyncs the structural chunk parsing. I should not have any such questions left unanswered at all after reading the spec. With the current writing, I seriously have no idea what the spec wants me to implement here. The intention might be "whatever Werner SF3 does". But it is really up to the spec writers to figure that out and put it in writing. Whatever the actual semantics are, with this point you finally made it absolutely impossible to validate syntactic structure of the format without requiring semantic information. Exceptionally bad design for no reason. You require information stored in contained layers to validate correctness of the containing layers. That's just asking for security bugs in implementations.
 *  Absolutely bonkers 24bit and 32bit sample encoding. If samples are supposed to ever be streamable (which they supposedly are later in the spec), this just kills performance for absolutely no reason.
 *  Why keep the total nonsense of 46 zero samples between sample data? If people cannot bounds check their sample reading functions, they better should stop programming. ... Or do you maybe not keep that, making it incompatible with SF2.04? The introduction of smpl sub-chunk lists the 46 samples as removed, however the compressed sample section right after it talks about the 46 zero samples being missing (supposedly? maybe?) only for compressed samples (talking decoded samples or encoded binary data here? also unclear).
 *  Why only integer sample rates? The sample rate may very well be off by fractions of a Hz between different recording situations.
 *  What do the File Format Level and Sample Format Level have to do with the recorded sample rates? If anything, software (or hardware) samplers will be limited by the effectively replayed sample rate, which will depend on the effective pitch and is 100% unrelated to the recorded sample rate.
 *  Speaking of hardware. Just no. Why even talk about 31 year old hardware synthesizer chips and the limits of their ROM sizes in one particular use case?
 *  Sample rate is underspecified in meaning when the compressed encoding also encodes a sample rate itself, which almost all compressed formats do, especially if they are used in their normal file format container as opposed to a custom framing which might skip such redundant metadata.
 *  Compressed sample support does not specify how to deal with decoded sample values outside of 0 dBFS. Are they truncated to [-1.0..1.0], or are they kept as is? If they are kept, why is there no uncompressed 32bit float sample support?
 *  "OGG" probably means "Ogg Vorbis", but that should be mentioned explicitly. Opus and FLAC can also be contained within an Ogg container. ".opus" files in fact always use an Ogg container. ".flac" files normally use their own container format, however Ogg FLAC is also well-defined (and most commonly used with a ".oga" file extension, but ".flac" and ".ogg" are also common). Does SFe4 require Ogg FLAC support or is that excluded?
 *  Why are compressed samples constrained to mono? Are stereo compressed samples expected to be stored as 2 independent mono compressed samples? That firstly fucks up compression performance, and secondly introduces uncorrectable phase differences between left and right channels because a mono lossy encoder can ignore phase differences between source and encoded data in order to achieve better compression ratio. In general, being limited to at most stereo feels like an arbitrary limitation nowadays.
 *  Are chained Ogg streams allowed? forbidden?
 *  Are Frankenstein MP3s allowed? forbidden?
 *  Why multiple lossy compression formats at all? That just unnecessarily increases library dependencies of a full implementation. MPTM did not add Vorbis or MP3 samples, even though the code was already there due to MO3. MPTM did also not add internal FLAC samples because this would add another dependency to libopenmpt. The added dependency was not worth the gain in compression efficiency over the already included IT sample compression.
 *  Why is WavPack mentioned in the writing but not supported in the actual format signalling bits?
 *  For MP3, there is no encoder and decoder delay specified. Is the LAME/Xing Tag mandatory? Otherwise, sample offsets will be somewhat random.
 *  Is metadata in the encoded sample data in general allowed? ignored? honored?
 *  In general, using general purpose audio file container formats as an encoding of solely raw sample data is a bad idea. It is always done in order to "simplify" things at first, but later it causes ambiguities in data interpretation. Defining a custom framing of raw codec data is more design work, but in the end a much more favorable approach. Having personally dealt with AVI and WAV (with MP3 contents) and OGM and OggXM and MO3 (which all got this completely wrong), this is absolutely blatantly obvious.
 *  Why is the sample name length limited?
 *  Why use the word "program" in a MIDI-related specification with explicitly not the MIDI meaning? "application" or "app" is the correct term to use here, or use "software" or "implementation". "program" put me off several times during reading.
 *  "If the sample rate is zero, then it should be corrected automatically to the correct value via automatic pitch detection and then highlighted for the user to verify. Bad original pitch values should be corrected to the value with automatic pitch detection." ... W H A T   T H E   A C T U A L   F U C K   ? ? ?
 *  Why even have repairing instructions for out-of-spec behavior *inside* the spec? That makes out-of-spec files somewhat "in-spec-ish". If the file format is so complex that implementations regularly fuck it up to write correct files, maybe, even only maybe, reconsider the whole design of the whole file format in the first place? Frankly, the whole section is utter madness. It is a concern totally separate from a "format specification". If anything, this should be a sub-section of a potential "implementation recommendations" document.
 *  Specifying conversion between SFe4 and SF2.04 on a structural level makes absolutely no sense. No sane application will ever do format conversion at this abstraction level.
 *  As someone not intricately familiar with SoundFont, even after reading over most of that specification, I am really not sure if what you specified is existing practice by various implementations, or if it consolidates various custom/proprietary extensions into a common and new encoded file format, or if you are inventing new features (maybe even out of thin air) without existing implementation practice, or any combination of these 3? Am I supposed to look up the motivation for the new features in GitHub git history? Issues? Discussions? Wiki? on Discord?
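
The FOURCC and padding points in the list above can be sketched as follows. This is a minimal illustration assuming classic RIFF rules (four raw ID bytes in file order, a little-endian 32-bit size, and one pad byte after odd-sized data that is not counted in the size). It does not describe what SFe4 actually mandates, and the chunk IDs used are made up.

```python
import io
import struct
from typing import BinaryIO, Iterator, Tuple


def iter_chunks(stream: BinaryIO, end: int) -> Iterator[Tuple[bytes, int, int]]:
    """Yield (fourcc, data_offset, size) for each chunk before offset `end`."""
    while stream.tell() < end:
        header = stream.read(8)
        if len(header) != 8:
            raise ValueError("truncated chunk header")
        fourcc = header[:4]                        # kept as 4 raw bytes (char[4])
        (size,) = struct.unpack("<I", header[4:])  # byte order stated explicitly
        data_offset = stream.tell()
        if data_offset + size > end:
            raise ValueError(f"chunk {fourcc!r} overruns its container")
        yield fourcc, data_offset, size
        # Skip the data plus the pad byte that follows odd-sized data.
        stream.seek(data_offset + size + (size & 1))


def make_chunk(fourcc: bytes, data: bytes) -> bytes:
    """Serialize one chunk, appending the pad byte when the size is odd."""
    if len(fourcc) != 4:
        raise ValueError("a FOURCC is exactly four bytes")
    pad = b"\x00" if len(data) % 2 else b""
    return fourcc + struct.pack("<I", len(data)) + data + pad


# Two made-up chunks; the first has an odd size and therefore a pad byte.
blob = make_chunk(b"abcd", b"\x01\x02\x03") + make_chunk(b"efgh", b"\x04\x05")
found = [(cc, size) for cc, _, size in iter_chunks(io.BytesIO(blob), len(blob))]
print(found)

# Treating the same four bytes as an integer gives two different values
# depending on the byte order chosen, which is why the stored order has to
# be spelled out if the ID is typed as a DWORD.
as_little = int.from_bytes(b"abcd", "little")
as_big = int.from_bytes(b"abcd", "big")
print(hex(as_little), hex(as_big))
```

If a writer omits the pad byte while the reader expects it, or the reverse, every chunk after the first odd-sized one is read from the wrong offset, which is the desync described above.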


All in all, there appear to be more questions raised than answered by the spec. This is not a good sign.


Meta:

 *  You really should have consulted existing implementers (numerous really big ones come to mind that you have not even mentioned having talked to *at all*) of SF2.04 *before* touting a final released specification. Maybe you are all very experienced in file format design (I seriously doubt that, though) and SF2.04, however releasing a *final* specification without acquiring any feedback on a draft before that just feels very very wrong to me, especially in a field with lots and lots of existing implementations.
 *  Also, a very basic question is, why are you even piggy-backing on top of an ancient proprietary format? The format is ancient, and a lot has been learned from its shortcomings, as well as from other formats' shortcomings in the same field (DLS and SFZ come to mind). IMHO, starting over would have been a lot simpler and much more future proof. The way you did it just accumulated more cruft on top of other ancient cruft.
 *  As you are not endorsed by the original creator of SF2.04 (E-mu / Creative Labs), you really should not be re-using their format's name. That's IMHO somewhat rude. Even if you explicitly extended/changed the name (which is good) to avoid confusion, it still takes away the name "SoundFont enhanced" from them. Even OpenMPT is nowadays often associated/confused with ModPlug, sometimes to our advantage, however more often to our disadvantage. SFe4 has painted itself into a similar corner, for really no reason, which will probably hurt it on a marketing level in the long run. For OpenMPT, this ship has long sailed, and rebranding now would probably cause even more harm than good. In the introduction you apparently seem to even acknowledge the problematic naming situation and put your specification in jeopardy over this fact. If your spec is conditioned on E-Mu / Creative Labs never doing anything with SoundFont again, why the hell should anyone implement your spec which *according to your own writing* voids itself based on future actions by a sole third party? Also, why the heck did you on purpose overload a pre-existing file extension in the same field that you knew about? If someone used .sf4 in public, even if unsuccessful, you simply use a different extension. The namespace is vast and there is no reason to limit yourself to 3 characters. The plain and simple .sfe4 would have worked.
 *  A lot can also be learned from extending pre-existing RIFF formats with newer features and/or 64bit support in the past. This can be seen by EBU WAV->RF64->BW64, or Sony WAV->Wave64 or even in OpenMPT's IT->MPTM, and probably also other formats. This has always resulted in very complicated parsers that have to handle all sorts of weird corner cases, and as history has shown, they get these wrong more often than not. The supposed compatibility between RF64 and WAV or between MPTM and IT was really never useful in practice. In most cases, this "compatibility" fools non-aware older applications into silently losing information amended by the newer format.


I have not even looked at anything remotely synthesis related at all, and have not mentioned any unclear and suboptimal aspects in this area that have been inherited from SF2.04. All this would come on top of what I just outlined.


So, the motivation is unclear, the scope is unclear, the actually specified thing is unclear. All of these should be reasonably clear before even considering publishing a *draft*, yet you already did a *final release*. What kind of response do you expect? You unsolicitedly put a final thing in front of people, and expect them to like it, and even to implement it? Yeah, that obviously does not work, sorry. What you released as a final specification would better have been a sketch of a draft of a specification.


And last, and most importantly, I agree with Saga Musix' wish to not be credited. If he (or I), or really anyone else with 10+ years of experience implementing various file formats and audio applications, had had any meaningful input, the spec would look radically different. The thanks paragraph (even if maybe not meant to read as such) also to me reads like some form of acquired endorsement, which clearly is not the case, and I would thus consider this an active mis-representation. Please stop doing that immediately. Also please do not demand a GitHub pull request for this change, as that would result in Saga Musix or me appearing in your git version history, which we also would prefer not to be associated with.


tl;dr: It's sad. It's bad, it's very very bad.


Foot note: I am not interested in taking part in SFe4. I frankly do not even understand the motivation for its existence. Amending SF2.04 by clarifications? Sure. Adding crucially missing features (if they exist) in a compatible manner and organized between interested major implementations? Sure. Extending the file format to 64bit? Maybe. Inventing a new file format? Also maybe. Doing all 4 things half-baked at once? Absolutely NO. If you want to regain some respect, you IMHO should withdraw your specification as soon as possible. I also do not want to get credited at all. All questions I posed can and will be posed in a similar fashion by any sufficiently knowledgeable audio developer. There is no reason to refer to me when answering any of these questions. They can all be asked frankly straight up from common sense with a tiny bit of domain knowledge.


@stgiga

All your explanations of why your internal or external communication might or might not have been bad honestly do not matter. You did put out a final spec, and that is what we are judging. If it is bad because you failed to communicate, you need to improve your communication as a team, and do so BEFORE inflicting some final specification on the world. Apparently almost all of your supposed discussion with users of SoundFont happened in private and is not accessible publicly. At least it is not linked from your project's GitHub page in any way that I could find. That might look fine for you on your part, but it inhibits a very crucial part of a specification process: discussing amongst the target audience. You should not mediate all discussion solely privately through yourself. Nobody knows what other developers actually have told you. You make it sound the way you want it to sound, but that could equally well be totally wrong. I am not saying that you are lying, but the way you have been communicating makes it difficult for you to prove otherwise, and Saga Musix' reaction somewhat supports my view of things here. You are really missing a key point of an *open* specification process. Just to illustrate this even more: As a 10+ year OpenMPT developer, I absolutely positively knew nothing about what your team was doing or even that it existed at all before this very forum post. The *final* spec is likely a surprise to everyone, and this alienates people.


addendum:
The spec is so bad that absolutely nobody should adopt it. And I hope major implementers also see this and agree with our assessment. I absolutely agree with Saga Musix that this would wreak havoc if adopted. Extending such a widely used format requires great care and puts a severe burden of responsibility on the extension inventors. You appear to not be aware of this and of the risks a bad specification involves when adopted. I sadly do not know how to educate here, and arguably it's not my job either.

stgiga

Firstly, the reason the spec didn't include what Creative already put in and was only a diff was because the lead author was paranoid about infringing Creative's copyright. I never requested it. Also, we had made a PDF infographic about what was offered, but it did seem incomplete.

Backwards-compatibility proved to be quite a point of contention among the team. I don't have a definitive and/or simple answer on the "chunk order" aspect. The "legacy SF2.04" refers to `sfspec24.pdf`. Yes, the intent was for people to look at THAT, because of the aforementioned "copyright issue", a decision I had no bearing on. Honestly, I don't recall that we intended to do the USB/HDMI route. The "properly licensed copy" scenario for the sfspec24.pdf was part of the reason for the diff nature, according to the maker of that decision.

Keeping 32-bit RIFF around was done for the reason of making it less of a major change, essentially a gradual phase-in of extensions. I do see how it could be confusing. I do agree that confusing developers is a bad thing. I wasn't involved with the "big-endian" choice. The only FourCC value I had anything to do with was what would go into romRsrc. I had nothing to do with any of the C code. My apps (which have nothing to do with SF2) are written in JS for a reason. I had nothing to do with any of the stuff relating to the typing, though I was the person who noticed the two-byte nature of wBank and wPreset in the official spec, and I did interpret the byte sizing of certain unclear parts of Silicon (for what good that did), but I didn't have anything to do with defining the type names beyond this, and I don't recall rewriting the actual type names in Silicon. Also, I will say that when the Silicon extensions were being worked on, I was surprised that the text wasn't byteswapped, but I DID try and make the numbers little-endian like how SFSPEC24 seemed to prefer.

As for *why* RIFF, well, the whole effort to do better than SF2 started with giving up on SF2 altogether, as something called NGMSS (Next Generation Musical Sampler Standard), created by the repo creator. But around 2023, after I had dug into the spec and posted my observations in the Discord, the repo owner shifted the entire effort to extending SF2 rather than making a whole new format as was originally planned. I think that they wanted to make the lives of developers easier. That said, I know that implementing format hacks can be quite an involved process (as I can attest, having done so myself to formats I have created).
The XML/JSON idea would have been nice, though it ended up resembling SFZ. Originally, we had intended with NGMSS to make a monolithic structure like SF2's optional. I had some ambivalence about that, namely because I've run into clunkiness when dealing with file formats that are exceedingly non-monolithic. The consensus to extend SF2 more-or-less killed the SFZ split-file idea in most forms.

The SF2 spec has DWORD and SHORT in it. As for the "byte alignment" stuff, I wasn't involved with THAT either.

SF2 also encodes 24-bit as 16+8, so to get 32-bit, an extra 8-bit chunk was added. I do see how that can be clunky. I regrettably came up with the idea to make 8-bit sample depth out of orphaned 8-bit chunks. We actually chose to remove the 46 zero samples thing. And I am aware the compressed samples section is bugged, and I had no involvement with it.
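
For readers unfamiliar with the 16+8 layout mentioned above, here is a minimal Python sketch. It is not taken from the SFe spec or from any player. It assumes the upper 16 bits of each sample are little-endian signed words in the `smpl` chunk and the low 8 bits are the matching bytes in the `sm24` chunk, as SF2.04 describes; the function name `combine_24bit` is made up for illustration.

```python
import struct

def combine_24bit(smpl: bytes, sm24: bytes) -> list[int]:
    """Merge 16-bit words (smpl) and low-order bytes (sm24) into signed 24-bit samples."""
    count = len(smpl) // 2
    if len(sm24) < count:
        raise ValueError("sm24 must hold one byte per 16-bit sample")
    highs = struct.unpack(f"<{count}h", smpl[: count * 2])
    # The signed 16-bit word supplies the top bits; the sm24 byte fills the low 8 bits.
    return [(hi << 8) | lo for hi, lo in zip(highs, sm24[:count])]
```

A third byte-wide chunk for 32-bit would follow the same pattern with one more shift, which is where the clunkiness comes from: three separate chunks have to stay in lockstep.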

SF2's sample rate is stored as a 32-bit integer. We didn't know what doing a float would do. I think the "format level" interaction with the sample rate was chosen so that environments that couldn't go above the recommended 50kHz wouldn't complain. Personally, I REALLY didn't like that idea.

As for the "talking about older hardware" section, I think it was done because some of SF2's quirks arose out of its roots. That being said, I DID try to steer the head dev away from trying to make Silicon a clone of the AWE32 ROM format. Also, the reason the ROM sample emulation was done is because some SF2s out there use it, and in non-Creative players they make no sound, even though the samples HAVE been dumped. We actually went to the effort of creating stand-ins that behave similarly but are libre. I think the reason we even cared was because the SF2 that I was most famous for, for the longest time, had ROM samples in it that I had to surgically convert to using AweRomGM.sf2 samples late in development. In essence, we wanted players that played SF2s with such data to not produce empty sound. THAT is why we went to all that trouble.

In the wild, I've seen SF2s with sample rates that are far from the conventional rates; SF2 basically records the sample rate of each sample individually.

Apparently, the SF2 derivatives that used Ogg used Ogg Vorbis, and FLAC was supported by another SF2 derivative. There's a non-zero chance the individuals responsible for the ill-fated compression section actually wanted to allow freedom, though I do see how that could be a mess. The 0dBFS part I had nothing to do with either.

SF2 natively stores stereo samples as two linked mono samples, so we went that direction as part of our ill-fated "backwards compatibility" plan. I think doing it for compressed samples is a bad idea, given that Polyphone for years had a bug that would result in broken links when certain edits were made to only one half. I've run into cases of this happening myself.
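
To make the linked-mono layout concrete, here is a rough Python sketch of how a tool could spot the broken links described above. It assumes the SF2.04 `shdr` record layout (46 bytes: a 20-byte name, five 32-bit fields including the integer sample rate, two pitch bytes, then `wSampleLink` and `sfSampleType`) with type values 2 for right and 4 for left. The helper names `parse_shdr` and `broken_stereo_links` are hypothetical and not from any existing editor or player.

```python
import struct
from typing import NamedTuple

SHDR_RECORD = struct.Struct("<20s5IBbHH")  # 46 bytes per sample header
RIGHT, LEFT = 2, 4  # sfSampleType values for the two halves of a stereo pair

class SampleHeader(NamedTuple):
    name: str
    sample_rate: int  # stored as a 32-bit unsigned integer
    link: int         # wSampleLink: index of the partner sample
    sample_type: int  # sfSampleType

def parse_shdr(data: bytes) -> list[SampleHeader]:
    """Decode the raw contents of an shdr chunk into sample headers."""
    headers = []
    for fields in SHDR_RECORD.iter_unpack(data):
        raw_name, _start, _end, _loop_start, _loop_end, rate, _pitch, _corr, link, stype = fields
        name = raw_name.split(b"\0", 1)[0].decode("ascii", "replace")
        headers.append(SampleHeader(name, rate, link, stype))
    return headers

def broken_stereo_links(headers: list[SampleHeader]) -> list[int]:
    """Return indices of left/right samples whose partner does not point back."""
    broken = []
    for i, h in enumerate(headers):
        kind = h.sample_type & 0x7FFF  # ignore the ROM flag bit
        if kind not in (RIGHT, LEFT):
            continue
        partner = headers[h.link] if h.link < len(headers) else None
        expected = LEFT if kind == RIGHT else RIGHT
        if partner is None or partner.link != i or (partner.sample_type & 0x7FFF) != expected:
            broken.append(i)
    return broken
```

Running `broken_stereo_links(parse_shdr(chunk_bytes))` on a file after editing one half of a pair would list the samples left pointing at the wrong partner.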

We actually have plans to use the sample link type that *isn't* R or L to store surround sound samples, but unfortunately the ill-fated "support features in stages" idea was a hindrance to that.

As for the "chained Ogg streams" and "Frankenstein MP3s" parts, I don't think that was even considered, and even I am a bit lost on what those exactly do.

The reason for "multiple lossy compression formats" was to give users choice, as well as the fact that some SF2 compression schemes like SF2Pack didn't force a single format. Do I like that this was done? No. In my view, lossy and SF2 don't exactly harmonize, as useful as it can sometimes be.

As for the whole "WavPack" being mentioned without actually adding it, either it was being used as an analogy or it was earmarked. I didn't write that section.

The "delays" in MP3 and the LAME tag shenanigans weren't considered, and this is basically my first exposure to them because I don't store my music in lossy formats. I don't know about any of the other team members. I didn't even touch the MP3 stuff.

If it were up to me, metadata would be allowed, but there's a non-zero chance the other devs would want space saving and remove it.

I think that the inclusion of "common" codecs was designed so that samples downloaded from, say, FreeSound, could be directly stored without reconversion, though personally that DOES rub me the wrong way.

The "name length being limited" was also a victim of the whole "backwards compatibility" mess. If it were my decision there would be no limits.

As for the use of the word "program" rather than "app" or "application", A: I wasn't the chooser of that, and B: I feel like we targeted the SF2 spec literally, looking over every inch of it, while being so focused on it that looking at, say, MP3, FLAC, Ogg, WavPack, and MIDI's specs may have fallen by the wayside, and I admit that it was pretty silly that it happened. And I'm one of the people who has the least attachment to the idea of "scope creep".

As for the "if the sample rate is zero" line, I wasn't involved with it, and admittedly "asking the user" sounds like a cop-out. I don't see it working too well.

I think the "out of spec repair function" existed because earlier versions of my 4GiB SF2 had experienced subtle corruption in Viena and it was a nightmare to fix, so the lead dev wanted to make sure there were means of preventing it from happening to others, though checksums would probably have worked better. It wasn't directly my doing.
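
As a rough idea of what the checksum approach could look like, here is a minimal Python sketch using CRC-32 from the standard library. Neither SF2 nor the SFe spec defines this; where the expected value would be stored is left open, and the function names are hypothetical.

```python
import zlib

def crc32_of(data: bytes) -> int:
    """Compute an unsigned CRC-32 over a chunk's raw bytes."""
    return zlib.crc32(data) & 0xFFFFFFFF

def is_intact(data: bytes, expected_crc: int) -> bool:
    """Check the data against a CRC-32 recorded when the bank was saved."""
    return crc32_of(data) == expected_crc
```

A check like this only detects corruption; it cannot repair anything, so an editor would still have to fall back on a backup or on the user.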

Your idea of "implementation recommendations" is honestly the better option.

The "conversion level" part wasn't my decision.

We consolidated proprietary extensions, and added new features. We went for the "combination of those 3" in actuality.

The Discord is where a LOT of the earlier stuff went on, and the link IS public, even on Disboard (though I don't know if it ever made it to the Github, given the owner taking a hiatus from Discord for personal reasons). Later talks (2024-2025) happened on the Github Issues AND Discussions.

I have designed file formats, though nothing like the type SF2 is involved in.

We extended SF2 because of its ubiquity and the fact that it means fewer files to move around than SFZ does. We had originally wanted to make a new format, and in hindsight that would have given us infinite freedom.

As for the "reusing the name" part, well, the diff part of the format was done to avoid using ALL of the old standard for copyright reasons, the same reason the "SF" part of the name existed. I never requested any of it, and the lead dev is more stringent about that type of thing than I am. The lack of a unilateral full rebrand was meant to not completely alienate SF2 users, though I do agree that it kind of went against the whole "differentiating" choice. I wasn't very involved with the wording of the spec; I was in more of a research, and later R&D, role.

We had originally wanted to use SFe32 and SFe64 or SF5 and SF6 as file extensions, but for some silly reason, the lead dev chose to hijack SF4 from Cognitone solely because, by their determination, THAT program never actually worked. At least when I claimed the .B3K extension for BWTC32Key I intentionally chose a file extension that had never been used by anyone else. Unfortunately, the message didn't get through.

Oh and we at least have attempted to lift certain stuff related to synthesis, though admittedly some of it is being pushed to the "less-compatible" future, even though admittedly that whole compatibility approach sucks as has been pointed out.

I honestly felt that there were too many things still in-development to release. And heck, the "final" was supposed to be for stage 1. Releasing the entire spec fully-formed would have been a far better idea.

As for the "bad communication" aspect, I agree. And I sorely wish that there would be more actual conversations done rather than stuff delayed by a day and done on the slow-moving Github compared to the fast-moving Discord.

But even confining it to Discord seemed like a bad idea. Also, I had told most player devs about plans to extend the SF2 format in the early days when we just wanted to extend wBank and hadn't gone to the level we did. And some other devs like the dev of Spessasynth jumped in very late. Perhaps the reason things were accelerated without outside input was a lack of outside feedback or even any replies to the start of the ideas. Honestly, even I had trouble keeping up with the most recent developer actions.

We did choose to get rid of the section in Github's "Releases" with the bad spec, though I wish I knew why it was amended so soon afterwards. No, we DO need to seriously rework it. And more than just line-item vetoes are needed.

Also, I wasn't at all thinking of doing a pull request.

There's a LOT of things that went on that could have been better, and so much so. There's a lot of work that needs to be done to remedy these very real problems that exist. Also, I'm not someone who condones lying. I genuinely want to do the right thing, as does the rest of the team. These problems SHOULD be corrected, and they are serious.

manx

I have the feeling you did not get at all what we are saying.

I did explicitly ask you to not answer my questions directly. They frankly have solely been illustrations of the overall quality and state of your specification.

Also, again, any explanatory reasons why something went as it did do not matter, especially if they are solely rooted in your team's internal communication problems or opinion differences.

Except maybe for the single paragraph directed explicitly towards you as a person, every single time Saga Musix or I said or implied "you", this was directed at the team as a whole. We are not talking to you as a person here, but instead as a representative of the team that issued that final specification. We are both not native English speakers, and maybe we should have said "the team" instead of "you" every single time, but my language intuition tells me that "you" implying the whole team would not be completely uncommon phraseology. Yet, you prefer to talk from your personal perspective, which frankly is irrelevant. We do not want to get dragged into your team's internal opinion differences. They do not matter to us when interacting with you as a team. If your team cannot even communicate a somewhat coherent opinion to outside people, you should really stop everything you are doing.

Saga Musix is still mentioned in the credits/thanks section. Each reply you write here without fixing this very basic thing makes you appear even more offensive and rude. YOU HAVE TO STOP MENTIONING HIM. FUCKING IMMEDIATELY. We did ask nicely twice.

Some of your responses to individual questions reveal that you do not have sufficient knowledge about the things you are trying to specify (either explicitly mentioned by you, or in other cases frankly implied by how/what you answered). This must never happen. Maybe this is also only your personal knowledge level, but in that case you had better consult with your team and give a proper educated response after that (or at the very least redirect the answering responsibility). There is really no need to reply to everything within 3 hours. I will not go into more detail about individual questions any more. This is not the point here, really.

Every single question's intention is to spark wondering inside your team about why you have not thought of any of these particular (and other) questions before releasing a final specification. And the sum of these wonderings should hopefully then cause some amount of self-reflection about the quality of your work.

Of all the things you mentioned in your reply, you did not even address the single most important one: The immensely large existing SoundFont implementation space, and the interaction (or really non-communication) of your team with their developers. Even from solely a moral perspective you are simply not allowed to touch SoundFont in the way you did. My comment about great care and an absolutely crucial and large extent of responsibility associated with modifying such a widely adopted format totally did not sink in. You do not know what you are doing, seriously. And this is not fixable by addressing any individual technical aspect of the criticism.