How does OpenMPT's audio pipeline work?

Started by nyanpasu64, August 28, 2019, 23:05:33


nyanpasu64

I'm working on a new cross-platform tracker in Rust, using SDL to output audio. Currently, its audio pipeline is loosely based on 0CC-FamiTracker's, whose source code I spent weeks studying in an attempt to understand it (link to Google Doc). Does anyone want to discuss how OpenMPT's audio pipeline works, and whether it is worth cloning? (As a user, I only have 0CC-FT experience, not OpenMPT or others.)

  • 0CC's engine, like the NES native driver, runs a fixed number of times a second. These "engine frames" are usually synchronized with vblank, but can be changed.
  • My tracker will also have the property that all notes are quantized to "engine frames", and delay effects will be an integer number of engine frames.
  • 0CC's audio pipeline runs on a "synth thread". It renders an entire "engine frame" of audio at once, then copies it 1 sample at a time to a single-threaded circular buffer. Whenever it's full, it pushes the entire buffer to a circular queue.
  • 0CC maintains a fairly large circular queue, but divides it into "pages", each the size of the single-threaded circular buffer (unsure if true). The synth thread writes to this queue one page at a time, and DirectSound (Windows-only) reads this queue at its own pace (not separated into pages, I think?). A rough sketch of this hand-off follows the list.
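
Roughly how I picture that hand-off, sketched in Rust with a bounded channel standing in for the chunked circular queue (names and sizes are mine, not 0CC's):

```rust
// Sketch (my naming and sizes, not 0CC's): a synth thread renders one
// engine frame at a time, re-chunks it into fixed-size pages, and pushes
// them into a bounded queue; the audio output drains the queue at its own
// pace. The bounded sync_channel is what keeps the synth from running ahead.
use std::sync::mpsc::sync_channel;
use std::thread;

const CHUNK: usize = 512;     // one "page" of the queue, in samples
const QUEUE_PAGES: usize = 8; // queue depth in pages

fn render_engine_frame() -> Vec<i16> {
    vec![0; 800] // 1/60 s of silence at 48 kHz, as a placeholder
}

fn play(_page: &[i16]) {
    // hand the page to the output device here
}

fn main() {
    let (tx, rx) = sync_channel::<Vec<i16>>(QUEUE_PAGES);

    // Synth thread: render an engine frame, then split it into pages.
    thread::spawn(move || {
        let mut pending: Vec<i16> = Vec::new();
        loop {
            pending.extend_from_slice(&render_engine_frame());
            while pending.len() >= CHUNK {
                let page: Vec<i16> = pending.drain(..CHUNK).collect();
                // Blocks when the queue is full -> backpressure.
                if tx.send(page).is_err() {
                    return;
                }
            }
        }
    });

    // Stand-in for the audio output side (DirectSound / SDL callback):
    for _ in 0..16 {
        play(&rx.recv().unwrap());
    }
}
```
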
Is this a good design? One noticeable flaw is that interesting things (no audio output, repeated blocks) happen if the size of the circular buffer is too small (below 30-ish milliseconds).

How does OpenMPT's audio pipeline differ? How much audio does it render at a time? Do different output device types have significantly different pipelines? Is OpenMPT worth copying?

Saga Musix

First off: I don't think SDL is necessarily a good idea for audio output. It has very few configuration options, and those can be vital for high-quality low-latency audio playback. A library like PortAudio or RtAudio may be more suitable if you don't want to roll the low-level API code yourself. OpenMPT uses PortAudio (and optionally RtAudio, mostly for its Wine support) with some custom patches to iron out some bugs, plus custom ASIO/DirectSound/WaveOut implementations. For a modern tracker, you will probably only care about ASIO, WASAPI and WaveRT if you're on Windows. DirectSound is only an emulation layer on top of WASAPI since Windows Vista.

How audio threading works also largely depends on the API you are using; audio APIs are either push APIs or pull APIs, so either you have to actively and regularly feed them with audio data from your own thread, or they ask you to deliver a specific amount of data in their own thread. In either case, OpenMPT renders a variable amount of audio data directly in one of those two threads. Whether or not that's a good idea very much depends on the rest of the architecture, I'd say. You have to consider that many choices in OpenMPT are historical ones and it could very well be that some things should be done differently these days, e.g. having the actual rendering happen in a separate thread.
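
To make the push/pull distinction concrete, the two shapes look roughly like this (a sketch with made-up traits, not any particular library's API):

```rust
// Sketch of the two shapes (made-up traits, not any real library's API).

// Pull ("callback") style: the backend calls you, on its own thread,
// for exactly as many samples as it needs right now.
trait PullRenderer {
    fn render(&mut self, out: &mut [i16]);
}

// Push ("synchronous") style: you call the backend from your own thread,
// and the call blocks until the device has accepted the data.
trait PushDevice {
    fn write(&mut self, samples: &[i16]);
}

struct Silence;

impl PullRenderer for Silence {
    fn render(&mut self, out: &mut [i16]) {
        // OpenMPT renders a *variable* amount here: whatever out.len() is.
        for s in out.iter_mut() {
            *s = 0;
        }
    }
}

fn main() {
    let mut synth = Silence;
    let mut buf = vec![0i16; 960]; // whatever the backend happens to request
    synth.render(&mut buf);
}
```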


Now regarding the tracker engine itself:
Whether anything OpenMPT does or what your bullet points describe is a good idea very much depends on what your goal is. Should it work exactly like an oldskool tracker but not be a direct clone (i.e. building on top of existing formats)? Then my next question would be: Why?
If you want a modern tracker: Scrap the idea of having a low amount of ticks ("engine frames") per second. Offer much higher granularity. For example, one approach could be to divide every row into 256 ticks for fine-grained delays, and not make this amount variable. Combined with how OpenMPT's modern tempo mode works, this would offer very flexible timing and more understandable effect behaviour.

Generally I would say that a modern implementation that doesn't have to support legacy formats should have envelopes / slides on a per-frame (not engine frames, but audio frames) basis, i.e. do not increment/decrement volume on every engine frame but on every audio frame (sampling point). CPUs are powerful enough for that these days. OpenMPT doesn't do that as it's mostly building on top of legacy formats, but I'd want to have this kind of granularity in the future, at least for its own MPTM format.

It's not very easy to have all of that in the same engine though, hence you should choose wisely before even starting to write a single line of code and be sure what kind of tracker you want to build.
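
As a rough illustration of what that fixed granularity gives you (a sketch with illustrative numbers, not OpenMPT's actual tempo code):

```rust
// Sketch: fixed 256 ticks per row, constant tempo. The tempo and
// rows-per-beat values below are illustrative, not OpenMPT defaults.
const TICKS_PER_ROW: u32 = 256;

fn frames_per_row(sample_rate: u32, bpm: f64, rows_per_beat: u32) -> f64 {
    // one beat lasts 60/bpm seconds and is split into rows_per_beat rows
    sample_rate as f64 * 60.0 / (bpm * rows_per_beat as f64)
}

/// Audio-frame offset of (row, tick) from the start of the pattern.
fn frame_offset(row: u32, tick: u32, sample_rate: u32, bpm: f64, rows_per_beat: u32) -> u64 {
    let fpr = frames_per_row(sample_rate, bpm, rows_per_beat);
    ((row as f64 + tick as f64 / TICKS_PER_ROW as f64) * fpr).round() as u64
}

fn main() {
    // Row 4, tick 128 (halfway into the row) at 48 kHz, 125 BPM, 4 rows per beat:
    println!("{}", frame_offset(4, 128, 48_000, 125.0, 4)); // 25920
}
```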

nyanpasu64

I intend to produce a new format which allows placing notes at rational fractions of a "beat", with no fixed "row" duration. This matches neither MIDI nor tracker paradigms, but I have implemented this in a format which compiles to MML, and I found it useful and easy to understand.

However, my tracker is intended to compile to the NSF format (NES/Famicom music), where the driver only runs once every vblank or "engine frame" (but can be customized using delay loops instead of vblank). During the compilation process, all gaps between notes will be precomputed into a MIDI/PPMCK (note, delay) stream. My sequencer will probably allow users to move notes later or earlier in increments of "engine frames".

Thanks for warning me about SDL. Is SFML (which has a Rust wrapper) good at low-latency audio playback? It has some 3D-positioned audio that I definitely don't need.
Seems https://github.com/RustAudio/rust-portaudio exists, and I may look into it. Don't see any RtAudio Rust wrappers.
https://github.com/RustAudio/cpal hmmmmm
Is outputting to JackAudio a good choice? Too Linux-centric? Unnecessary for monolithic programs where I don't need an audio routing graph?

I've tried OpenMPT on Wine a few months back, and the "alsa passthrough" was impossible to prevent from stuttering (I think PulseAudio was running), whereas Wine playback was smooth.

(Not sure how much of this message is worth responding to.)

manx

Quote from: nyanpasu64 on August 29, 2019, 22:10:28
Thanks for warning me about SDL.
Well, even with all its quirks, SDL should be fine to get you started. Its (default) callback-based paradigm is suitable for music production applications (less so for games, which is awkward, given SDL's mission). Locking is kind of strange and inflexible though. Avoid the newer non-callback-based API (QueueAudio), because it introduces yet another buffering layer internally in SDL.

Quote from: nyanpasu64 on August 29, 2019, 22:10:28
Seems https://github.com/RustAudio/rust-portaudio exists, and I may look into it. Don't see any RtAudio Rust wrappers.
PortAudio has a severe limitation on modern Linux systems in that it does not feature a native PulseAudio backend. PulseAudio's ALSA emulation is fragile at best. A major part of that is due to the sheer complexity and awkwardness of the ALSA API. Sadly, this is the single most important reason for PulseAudio's bad reputation.

Quote from: nyanpasu64 on August 29, 2019, 22:10:28
https://github.com/RustAudio/cpal hmmmmm
Looks fine, even though the choice of API backend on Linux is questionable. Writing anything against plain ALSA is a bad decision, since PulseAudio exists and is the default on every major distribution.

Quote from: nyanpasu64 on August 29, 2019, 22:10:28
Is outputting to JackAudio a good choice? Too Linux-centric? Unnecessary for monolithic programs where I don't need an audio routing graph?
If you want your users to not be able to use your program, use Jack. More seriously though: install 10 random Linux distributions and see that Jack is neither configured nor even installed by default on any single one of them. Jack still does not work properly with PulseAudio on the same system, which further limits its applicability to standard Linux setups. It is only used on special installs or distributions geared towards audio production.

Quote from: nyanpasu64 on August 29, 2019, 22:10:28
I've tried OpenMPT on Wine a few months back, and the "alsa passthrough" was impossible to prevent from stuttering (I think PulseAudio was running), whereas Wine playback was smooth.
Well, do not use ALSA if you have PulseAudio running. It's that simple. ALSA will either use PulseAudio's emulation, or fight with PulseAudio for a single device, or fight with PulseAudio while sharing a single device.

Altogether, in the long term, you will probably be better off implementing multiple backends. The SoundDevice abstraction layer in OpenMPT is probably a good example of how to do that.
However, just to get started, that will be too much work. If you are using SDL to do your graphics output anyway, just stick to SDL for audio for now. Otherwise, use cpal or PortAudio if you are developing primarily on Windows, or, if you do not care about compatibility with other systems at the beginning, even just PulseAudio (Simple API) for now if you are developing on Linux.

nyanpasu64

I think that picking Rust has definitely made it harder to find audio libraries. RtAudio seems to support PulseAudio in addition to ALSA, but is C++ and has no Rust wrapper currently. However, Rust comes with easier/safer concurrency and match{}, "move by default", no need to edit header files separately, and no copy construction by default. And SDL is sufficient as a baseline (and the latency can't be worse than FamiTracker?).

On Windows, a 5ms sleep during each SDL callback causes intermittent gaps in the audio, even when I set each callback to supply 4096 samples (48000 smp/s). Maybe SDL calls the callback when there are <5ms * 48000smp/s = <240smp or 128smp left in the buffer? Or maybe at an arbitrary point? I assume the inability to tweak this parameter is what you mean by "SDL isn't configurable for low latency".
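
Roughly what my test looks like, sketched with the sdl2 crate (the sleep just stands in for synth work; everything else is schematic):

```rust
// Sketch of the experiment with the sdl2 crate: a callback that asks for
// 4096-sample buffers at 48 kHz and wastes 5 ms per call.
use sdl2::audio::{AudioCallback, AudioSpecDesired};
use std::time::Duration;

struct SleepySynth;

impl AudioCallback for SleepySynth {
    type Channel = i16;
    fn callback(&mut self, out: &mut [i16]) {
        std::thread::sleep(Duration::from_millis(5)); // simulated synth cost
        for s in out.iter_mut() {
            *s = 0;
        }
    }
}

fn main() -> Result<(), String> {
    let sdl = sdl2::init()?;
    let audio = sdl.audio()?;
    let spec = AudioSpecDesired {
        freq: Some(48_000),
        channels: Some(2),
        samples: Some(4096), // samples per callback
    };
    let device = audio.open_playback(None, &spec, |_spec| SleepySynth)?;
    device.resume();
    std::thread::sleep(Duration::from_secs(5)); // let it play for a bit
    Ok(())
}
```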

My GUI is currently built in GTK (because Qt is hard to use with Rust), not SDL.

What files does OpenMPT use for its sequencer and synth? Other than that, there isn't much more for me to ask, and I should probably pick a design myself (or stick to SDL).

Surprisingly, OpenMPT WaveRT 2ms plays smoothly (never mind, I heard a pop) with "0%" CPU usage, on Realtek, Windows 10, and Microsoft drivers (not Realtek). But apparently it breaks other audio devices trying to play (or Audacity trying to record via WASAPI loopback).

manx

Quote from: nyanpasu64 on August 30, 2019, 10:40:10
I think that picking Rust has definitely made it harder to find audio libraries. RtAudio seems to support PulseAudio in addition to ALSA, but is C++ and has no Rust wrapper currently. However, Rust comes with easier/safer concurrency and match{}, "move by default", no need to edit header files separately, and no copy construction by default.
I won't argue against Rust. It's a great language.

Quote from: nyanpasu64 on August 30, 2019, 10:40:10
And SDL is sufficient as a baseline (and the latency can't be worse than FamiTracker?).

On Windows, a 5ms sleep during each SDL callback causes intermittent gaps in the audio, even when I set each callback to supply 4096 samples (48000 smp/s). Maybe SDL calls the callback when there are <5ms * 48000smp/s = <240smp or 128smp left in the buffer? Or maybe at an arbitrary point? I assume the inability to tweak this parameter is what you mean by "SDL isn't configurable for low latency".
Last time I looked at its source, SDL's implementation was not really suitable for low latency for various reasons. However, you should not estimate actual real-world performance by simulating it with a sleep. Every operating system scheduler interprets an actual sleep as a "this thread is unimportant, not latency sensitive, and not high priority with regard to compute"-hint.
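
If you want to fake load anyway, a busy loop is a closer stand-in for real synth work than a sleep, since the thread actually consumes CPU (a sketch):

```rust
// Sketch: burn CPU for ~5 ms instead of sleeping, so the scheduler sees a
// genuinely busy thread rather than a "this thread is unimportant" hint.
use std::time::{Duration, Instant};

fn simulate_synth_work(budget: Duration) {
    let start = Instant::now();
    let mut acc: u64 = 0;
    while start.elapsed() < budget {
        // arbitrary arithmetic so the loop does real work
        acc = acc.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    }
    std::hint::black_box(acc); // keep the optimizer from removing the loop
}

fn main() {
    simulate_synth_work(Duration::from_millis(5));
}
```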

Quote from: nyanpasu64 on August 30, 2019, 10:40:10
My GUI is currently built in GTK (because Qt is hard to use with Rust), not SDL.
In that case, PortAudio or cpal should probably be better choices, even if they do not support PulseAudio natively. PortAudio generally does work ok-ish with the ALSA emulation of PulseAudio, albeit not with as good latencies as native PulseAudio is able to achieve.

Just to summarize again, the go-to audio backend APIs for initial development should be WASAPI on Windows, PulseAudio on Linux, CoreAudio on Mac, and OSS on FreeBSD; I have no idea about Android (probably not your focus anyway though).
Other APIs are either outdated and nowadays emulated (WaveOut/MME, DirectSound, OpenAL on Windows; OpenAL, OSS on Linux), or low-level device-hogging APIs (WaveRT on Windows, ALSA on Linux), or special purpose APIs (ASIO for direct device access on Windows, Jack for audio session management and routing on Linux and Mac).

Quote from: nyanpasu64 on August 30, 2019, 10:40:10
What files does OpenMPT uses for its sequencer and synth? Other than that, there isn't much more for me to ask, and I should probably pick a design myself (or stick to SDL).
Most aspects are spread out over various files. Mainly Sndfile.* Snd_fx.* Sndmix.* pattern.* RowVisitor.* for the "sequencer" (pattern playback and effect interpretation), Mixer.* Resampler.*, IntMixer.* Sndmix.* Sndflt.* Tables.* WindowedFIR.* for the "synth" (sampler). Various other aspects (like plugin handling) are in even other files, and I might have also missed some files right now.

Quote from: nyanpasu64 on August 30, 2019, 10:40:10
Surprisingly, OpenMPT WaveRT 2ms plays smoothly (never mind, I heard a pop) with "0%" CPU usage, on Realtek, Windows 10, and Microsoft drivers (not Realtek). But apparently it breaks other audio devices trying to play (or Audacity trying to record via WASAPI loopback).
Yes, WaveRT bypasses all upper audio layers and is thus not generally useful in a desktop context.


nyanpasu64

What are the reasons that SDL is unsuitable for real-time audio?

Currently my synth runs in a separate thread (pushing data to a channel). Maybe I can also make it work as a generator/coroutine (called by the audio callback: whenever it synthesizes enough data, it yields the data to the audio thread and suspends its stack frame until the audio callback runs again). I could synthesize one engine frame of audio at once, or incrementally. But I don't see an easy way to spread emulation logic out over time, and the original FamiTracker updates the rightward channels a bit later than the leftward channels, to simulate the slow NES CPU updating channels in turn.

I've been looking into rust cpal, and it's worse to use than SDL. Its macro-laden design makes IDE autocompletion fail. Also the API picks a sample format, rate, and channel count for you at runtime, expecting your code to know how to render to any of u16, i16, or f32 (though you can override its choices at the risk of them being rejected, I picked i16 and 2 channels which should work everywhere). I think I cannot pick the size that the callback is expected to fill, nor how much buffering is done by cpal or the host API. Is this also unsuitable for real-time audio? What features should I expect from a good API?

I have some extra notes from their issue tracker at https://docs.google.com/document/d/149xFMivBZGAXRCEze1UUUhAKayucMPXFRehjYeSHj24/edit#heading=h.mbtlgdwouzoh which seem to indicate that cpal uses mutexes during audio processing at one point (supposedly fixed in master, but I can't find any commits from the author at the time he said so, and unsure if 0.10.0 has no mutexes), and picks a buffer size for you (10ms on WASAPI). Maybe not good signs.

Thanks for the OpenMPT file list, I'll look when I have time and the motivation to do it.

manx

Quote from: nyanpasu64 on August 31, 2019, 09:07:35
What are the reasons that SDL is unsuitable for real-time audio?
It ties the callback period (SDL_AudioSpec.samples) either to the buffer size (implying a simple double-buffer scheme, which the implementation in SDL is not fit for when using a small callback buffer), or introduces an unknown amount of additional internal buffering. Combined with no API to determine the overall latencies, this IMHO makes it unfit for realtime use.

Quote from: nyanpasu64 on August 31, 2019, 09:07:35
Also the API picks a sample format, rate, and channel count for you at runtime, expecting your code to know how to render to any of u16, i16, or f32 (though you can override its choices at the risk of them being rejected, I picked i16 and 2 channels which should work everywhere).
I think I cannot pick the size that the callback is expected to fill, nor how much buffering is done by cpal or the host API. Is this also unsuitable for real-time audio? What features should I expect from a good API?
I have some extra notes from their issue tracker at https://docs.google.com/document/d/149xFMivBZGAXRCEze1UUUhAKayucMPXFRehjYeSHj24/edit#heading=h.mbtlgdwouzoh which seem to indicate that cpal uses mutexes during audio processing at one point (supposedly fixed in master, but I can't find any commits from the author at the time he said so, and unsure if 0.10.0 has no mutexes), and picks a buffer size for you (10ms on WASAPI). Maybe not good signs.
I have not looked at Rust cpal in detail, thus I am not really qualified to comment. However, if the points you mention are true, I would not want to use it.

Quote from: nyanpasu64 on August 31, 2019, 09:07:35
What features should I expect from a good API?
Difficult question ;).
First off, having worked with all kinds of different audio APIs in the past 20 years, I can say that I like no single one of them. They all have their own particular quirks or problems.
Second, every time an API tries to support both pull and push (or better, refer to these variants as "callback" ("pull") vs. "synchronous" ("push"), as this makes the concept clearer when also considering recording) at the same time, it fails, and at least one of the variants is far from perfect. Some APIs emulate one on top of the other, however in my experience one is far better off implementing that emulation oneself. In particular, converting a synchronous API to a callback one is as simple as doing the synchronous calls in a separate thread. Converting the other way around is less simple, as it involves implementing a properly synchronized buffer between the callback thread and the thread that calls the synchronous functions. Both conversions induce additional implicit or explicit buffering (and thus latency), respectively.
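
The simpler direction looks roughly like this (a sketch with made-up trait names, not any real library's API): the blocking writes of the synchronous backend happen on a dedicated thread, which effectively turns it into a callback driving the synth.

```rust
// Sketch: wrapping a synchronous ("push") backend in a dedicated thread so
// that, from the synth's point of view, it behaves like a callback API.
// The trait names are made up, not any real library's.
use std::thread;
use std::time::Duration;

trait PushDevice: Send {
    /// Blocks until the device has accepted the data.
    fn write(&mut self, samples: &[i16]);
}

struct NullDevice;

impl PushDevice for NullDevice {
    fn write(&mut self, _samples: &[i16]) {
        thread::sleep(Duration::from_millis(10)); // pretend the device paces us
    }
}

fn run_as_callback<D, F>(mut device: D, mut render: F) -> thread::JoinHandle<()>
where
    D: PushDevice + 'static,
    F: FnMut(&mut [i16]) + Send + 'static,
{
    thread::spawn(move || {
        let mut buf = vec![0i16; 2 * 480]; // 10 ms of stereo at 48 kHz (arbitrary)
        loop {
            render(&mut buf);   // the "callback": pull audio from the synth
            device.write(&buf); // the blocking write paces this loop
        }
    })
}

fn main() {
    let _audio_thread = run_as_callback(NullDevice, |out| out.fill(0));
    thread::sleep(Duration::from_millis(50));
    // A real program would signal the thread to stop instead of just exiting.
}
```
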
A good audio API also provides a clearly defined way to determine timing information: either an instantaneous current output sample position (example: MME/WaveOut), or a precise amount of latency at a precisely defined point in time (like callback begin or write position) (example: PulseAudio), or correlated timestamps of the sample clock vs. some system clock (example: ASIO). PortAudio tries to do all three variants together, which mainly leads to confusion. SDL and DirectSound provide none, which is awful and requires the application to somehow guess, based on buffer sizes.
A good audio API also abstracts away sample format and sample rate, lets the application choose whatever suits it best, and handles all conversion internally. The only exception to this rule should be low-level hardware APIs (like ASIO, WaveRT, ALSA), which need to give fine-grained control to the application so that it can configure the hardware exactly as needed. Low-level APIs should *NEVER* be the default for any application, as that breaks the "casual user" use case, because low-level APIs tend to interfere with system audio for other applications.
In any kind of even halfway serious music production application, the audio rendering requires somewhat low latencies, which a GUI eventloop thread cannot provide (because it might be handling some GUI interaction/drawing). This implies having to do the rendering in a separate thread, in which case callback-based APIs are far more suitable than synchronous ones.

Having ruled out cpal and SDL, the solution for you is probably: Use PortAudio until it causes problems with PulseAudio on Linux for you.

nyanpasu64

>A good audio API also provides a clearly defined way to determine timing information.
>SDL and DirectSound provide none
Q: Does IDirectSoundBuffer::GetCurrentPosition() not count? 0CC-FamiTracker uses this function to monitor how full its chunked buffer is.

0CC is based around "synth thread renders entire engine frames at once, splits it into fixed-size chunks and writes to a queue or "chunked circular buffer", which is read by the audio output", and I approximately emulated that in my tracker. (I noticed my current SDL code behaves well at 11ms-ish latencies plus SDL buffering, whereas 0CC's audio output malfunctions at 20ms and below, possibly because DirectSound functions poorly on Vista+. Also my sound synthesis is just 2 white-noise generators, far simpler than 0CC's chip emulation.)

Also 0CC's latency slider is a lie. CDSound::OpenChannel() increments the latency 1ms at a time until the audio buffer can be divided evenly into 2 or more blocks. And no other code uses the latency.

(My code renders simulated "engine frames" of 800 samples which is 60/second at 48000Hz, sends them to a length-1 queue of 512 samples/frames, followed by SDL's internal buffering. My 11ms calculation may be wrong or too low, but may be wrong in the same way as 0CC is wrong.)

I think 0CC was (from above post) "implementing a properly synchronized buffer between the callback thread and the thread that calls synchronous functions" (and I copied that decision). I think it makes the synth code easier to read, more straightforward, and eliminates issues and edge cases around the "gap between callbacks". However, it was hard for me to discover what prevented the synth thread from running ahead (it was queue backpressure).

Q: Is this a reasonable arrangement, or does it introduce too much latency, or am I just perpetuating bad decisions and writing more and more code with those assumptions in mind?

(i swear i'll port my program to portaudio someday, but not today)

Q: Is cubeb good? It was suggested by someone on Discord saying "it may have been fixed" (but didn't specify the past issues), it supports Pulse, Firefox uses it, Dolphin uses it (>50ms of latency on Windows, but someone on Discord said that's good for Windows), and there's a Rust wrapper with active development but only 5 github stars.

Q: Is latency actually a problem in a tracker? I usually enter notes into 0CC when the tracker is paused and not playing. I've tried entering notes in real time, but ended up spending time fixing note placement afterwards. But I hear some people play MIDI keyboards while 0CC-famitracker is playing. Entering notes quickly and in time would require good piano skills (which i lack), or computer keyboard input skills (which I either lack or my ergonomic split keyboard makes it more difficult for me) or maybe the latency on my computers is too high. (Using 0CC on Wine requires latencies of 70ms or so, which some people would likely cringe at.)

(Should I keep making these posts, or are they annoying or too long or irrelevant?)

Saga Musix

I'll leave the first few questions to manx as he's more experienced in those regards.

QuoteIs latency actually a problem in a tracker? I usually enter notes into 0CC when the tracker is paused and not playing. I've tried entering notes in real time, but ended up spending time fixing note placement afterwards. But I hear some people play MIDI keyboards while 0CC-famitracker is playing. Entering notes quickly and in time would require good piano skills (which i lack), or computer keyboard input skills (which I either lack or my ergonomic split keyboard makes it more difficult for me) or maybe the latency on my computers is too high. (Using 0CC on Wine requires latencies of 70ms or so, which some people would likely cringe at.)
Latency is a problem whenever you want to do realtime recording or play along with a song. The default options used to be a lot worse (in particular, MME on Windows added a lot of latency, which made these things impractical); these days it's a lot better with WASAPI on Windows, especially for casual use, but still not quite perfect. The lower the latency, the more precisely you can place recorded notes, and in particular if you are good at live playing, this also means that fewer notes need to be fixed afterwards. Depending on your audience this may or may not be relevant, but I for example do use low-latency (5ms) ASIO with OpenMPT because I can perceive the difference when recording compared to, say, 30ms latency with WASAPI. High latency can be quite confusing to the brain in this scenario. If you're not aiming for MIDI support, it's probably less relevant for you.

manx

Quote from: nyanpasu64 on August 31, 2019, 12:45:16
>A good audio API also provides a clearly defined way to determine timing information.
>SDL and DirectSound provide none
Q: Does IDirectSoundBuffer::GetCurrentPosition() not count? 0CC-FamiTracker uses this function to monitor how full its chunked buffer is.
No, IDirectSoundBuffer::GetCurrentPosition() is the most awkward interface ever invented for querying the amount of buffer space that is writable for the application. It is also the only way to guess latency in DirectSound (and fails at that for various reasons, for example because it does not, and by the design of its interface cannot, represent additional latency added by lower layers). It is unsuitable for tying a sample output position to the wall clock.
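
For illustration, the bookkeeping an application ends up doing around that play cursor is plain ring-buffer arithmetic, and none of it says anything about buffering below DirectSound (a sketch):

```rust
// Sketch: the ring-buffer arithmetic an application ends up doing around the
// play cursor it gets from GetCurrentPosition(). Note that none of this
// accounts for buffering and latency below DirectSound.
fn writable_bytes(buffer_len: u32, play_cursor: u32, write_pos: u32) -> u32 {
    // Free space runs from our write position forward to the play cursor,
    // wrapping around the ring (ambiguous when the two are equal).
    if write_pos >= play_cursor {
        buffer_len - (write_pos - play_cursor)
    } else {
        play_cursor - write_pos
    }
}

fn main() {
    // 64 KiB ring, play cursor at 10000, our next write position at 30000:
    println!("{}", writable_bytes(65_536, 10_000, 30_000)); // 45536
}
```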

Quote from: nyanpasu64 on August 31, 2019, 12:45:16
0CC is based around "synth thread renders entire engine frames at once, splits it into fixed-size chunks and writes to a queue or "chunked circular buffer", which is read by the audio output", and I approximately emulated that in my tracker.
I'm confused about what this wants to tell me. In any case, it sounds overly complicated. Note that whatever structure you choose, if the amount of rendered PCM data by your synth is not directly influenced by the audio callback, but instead works in any kind of unrelated chunking, you will either introduce additional latency, and/or limit your maximum available CPU time to less than 100%. Also note that some synthesis algorithms imply internal chunking (e.g. a FFT) and thus by necessity add additional latency.

Quote from: nyanpasu64 on August 31, 2019, 12:45:16
(My code renders simulated "engine frames" of 800 samples which is 60/second at 48000Hz, sends them to a length-1 queue of 512 samples/frames, followed by SDL's internal buffering. My 11ms calculation may be wrong or too low, but may be wrong in the same way as 0CC is wrong.)
Q: Is this a reasonable arrangement, or does it introduce too much latency, or am I just perpetuating bad decisions and writing more and more code with those assumptions in mind?
Well, you render to a 800 sample frames buffer (1), submit that to a 512 sample frames buffering layer (2), send that in whatever chunk size SDL uses (let's assume 256 or 1024, just to make things more interesting) (3), which on Linux talks to PulseAudio, which in turn has its own internal buffering (4), which then sends the data to the soundcard via ALSA with its own ringbuffer (5).
So, 5 layers of buffering. At the *very* least, you should get rid of that 800-to-512 layering. It serves no purpose whatsoever. Unless you are required to process in chunks (e.g. because you are using an FFT or something like that), I highly suggest getting rid of any synth-internal chunking completely. And even if you are required to do internal chunking, abstract it away at the interface level (of the synth), i.e. by introducing *internal* (internal to the synth) buffering and exposing its latency.

Quote from: nyanpasu64 on August 31, 2019, 12:45:16
Q: Is cubeb good? It was suggested by someone on Discord saying "it may have been fixed" (but didn't specify the past issues), it supports Pulse, Firefox uses it, Dolphin uses it (>50ms of latency on Windows, but someone on Discord said that's good for Windows), and there's a Rust wrapper with active development but only 5 github stars.
Never used it. At least it should be very compatible with various system setups, as Firefox relies on it. 50ms seems weird on Windows, WASAPI should trivially provide 20..30ms. 50ms is to be expected with MME/WaveOut however. cubeb supports both.

nyanpasu64

Quotesubmit that to a 512 sample frames buffering layer (2), send that in whatever chunk size SDL uses (let's assume 256 or 1024, just to make things more interesting) (3)...
My buffering layer always uses the same chunk size as the audio callback.

I switched to PortAudio, which I had to manually build on Windows and place the .lib in an undocumented path, but it works fine. Problem is, it supports nothing but MME! Do I need the DirectX headers to get WASAPI?
I can specify how many samples to be generated by each callback, but not how many buffers are used (double-buffer or more?). And if I ask for 1 second of audio and sleep half a second, there's no stuttering at all (unlike SDL).
Q: Is PortAudio an acceptable low-latency API if I can't control how it buffers audio? Does it always double-buffer and is that good enough?
Q: Is it normal that PortAudio works fine on Windows with non-power-of-2 buffer sizes (accidentally set buffer size, not sampling rate, to 48000)?
Q: Should I file a feature request in CPAL for an API to configure "how many samples are generated per callback"? The maintainer of rust-portaudio has moved on to CPAL (and you say its API is awful) and stopped working on PortAudio.

Cubeb's API seems unstable. The current example code in cubeb-rs crashes on Windows with a COM-related error due to upstream cubeb changes which are still in flux (I think cubeb is a firefox library with no stable release cycle, and cubeb-rs just imports master as a submodule). Fixing the code requires me to add Windows-specific code (or maybe revert to older cubeb-rs/cubeb that doesn't make the user manage COM threading.)

Quote50ms seems weird on Windows, WASAPI should trivially provide 20..30ms.
https://dolphin-emu.org/blog/2017/06/03/dolphin-progress-report-may-2017/#50-3937-add-cubeb-audio-backend-by-ligfx claims that XAudio2 has 62-68ms of latency (possibly some from Dolphin).

Latency, buffering, and NES hardware
Quoteif the amount of rendered PCM data by your synth is not directly influenced by the audio callback, but instead works in any kind of unrelated chunking, you will either introduce additional latency, and/or limit your maximum available CPU time to less than 100%. And even if you are required to do internal chunking, abstract it away at the interface level (of the synth), i.e. by introducing *internal* (internal to the synth) buffering and exposing its latency.
...you should get rid of that 800-to-512 layering. It serves no purpose whatsoever.
I'm ripping out the queue soon. I think that even a 0-length queue introduces latency, by allowing the synth to run ahead of the callback up to 1 chunk of audio (the synth blocks trying to push to the queue, until the callback tries to pull from the queue).

In the NES, all audio chips run in lockstep off the master clock, which also controls vblank. Most audio engines including Famitracker only run once per vblank (though Famitracker/NSF allows the engine to be called at a custom rate). I think there's nothing wrong with synthesizing new audio once per engine frame. Even if I were to render audio more finely, I wouldn't get any latency advantages (since all inputs must be quantized to 1/60 of a second), I think.

My idea is for the callback's "persistent object" to own the synth, and I only synthesize audio within the callback. Whenever the synth is out of audio, the callback runs the synth for 1 (or more?) frames into a buffer until I have 1 or more chunks of audio. Then each subsequent callback will pull audio out of the buffer, until it's empty.

However this will result in some callbacks running engine logic and synthesis (high CPU usage), while some don't (minimal CPU usage). Q: Does OpenMPT also behave that way, but maintain low latency anyway? (I haven't read OpenMPT's code yet since I was busy with classes and other tracker research, should I read it?) (I assume with a 5ms period or block size, the synth function can take up to 5ms to complete without stuttering.)

Alternatively, the callback could run engine logic once per vblank, but synthesize audio on demand: look at the buffer size, ask the library "how many clock cycles should I advance CPU time so that X samples of audio are available?", and run all sound chips (and possibly the engine) for that many cycles. Q: How much will this spread out CPU usage between callbacks? Should I run a profiler on FamiTracker and see where most CPU is being spent (probably drawing the GUI, not running audio)?

Q: How is latency computed? Are there any techniques for this, like concurrency/timing diagrams on paper? Note that user inputs can happen at any time within a NES frame (the timing granularity of NES sound engines tied to vblank), leading to an inherent 16ms of variance.

I can assign inputs to either the next NES frame, or the previous NES frame to hide latency. It's possible to "insert note into pattern" as if the key was pressed earlier, but I can't retroactively play audio as if the key was pressed earlier. And I have no clue how famitracker combines "when playing a pattern, use whatever channel the note is located in" and "in edit mode, add user input to the pattern and play in current channel only" and "in read-only mode, shove newly played notes into the cursor channel, but look in the next channel modulo N if this one's occupied, and steal channels too".

NES Audio Synthesis

All NES audio (except for an FM chip called VRC7 used in one game) is made of a series of flat lines separated by steps (though the FDS has extra audio filtering after the steps). Famitracker uses the blip_buffer library (by blargg) for all chips (except FM) to generate audio out of bandlimited steps (positioned at CPU clocks), and I think that's a reasonable design to keep.

Unfortunately blip_buffer is (unnecessarily) heavily templated in C++, making it hard to wrap in Rust, and the audio processing is incomprehensible. I'm planning to use the blip_buf library (also by blargg), written in C and with a Rust wrapper. Incidentally, "how many clocks do I need" is bugged in blip_buf for 4096 or more audio samples. I can fix this by vendoring the dependency and patching the C. (I removed some discussion related to synthesis and not latency.)
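
Rough usage as I understand it, assuming the blip_buf crate mirrors blargg's C API (set_rates / clocks_needed / add_delta / end_frame / read_samples):

```rust
// Rough usage sketch, assuming the blip_buf crate mirrors blargg's C API.
use blip_buf::BlipBuf;

const CLOCK_RATE: f64 = 1_789_773.0; // NES CPU clock (NTSC)
const SAMPLE_RATE: f64 = 48_000.0;

fn main() {
    let mut blip = BlipBuf::new(4096); // capacity in output samples
    blip.set_rates(CLOCK_RATE, SAMPLE_RATE);

    // "How many clock cycles do I need to advance to get 800 samples?"
    let clocks = blip.clocks_needed(800);

    // Run the emulated chip for that long; every amplitude step becomes a
    // bandlimited delta at its clock-cycle timestamp.
    blip.add_delta(0, 3000);
    blip.add_delta(clocks / 2, -3000);
    blip.end_frame(clocks);

    let mut out = vec![0i16; 800];
    let n = blip.read_samples(&mut out, false); // n should be 800 here
    println!("read {} samples", n);
}
```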

manx

Quote from: nyanpasu64 on September 08, 2019, 06:30:50
I switched to PortAudio, which I had to manually build on Windows and place the .lib in an undocumented path, but it works fine. Problem is, it supports nothing but MME! Do I need the DirectX headers to get WASAPI?
PortAudio supports MME, WASAPI, DirectSound on Windows (and ASIO, which you should not care about). All required headers come with any supported Windows SDK. I have no idea what went wrong for your setup.

Quote from: nyanpasu64 on September 08, 2019, 06:30:50
I can specify how many samples to be generated by each callback, but not how many buffers are used (double-buffer or more?). And if I ask for 1 second of audio and sleep half a second, there's no stuttering at all (unlike SDL).
Q: Is PortAudio an acceptable low-latency API if I can't control how it buffers audio? Does it always double-buffer and is that good enough?
You can: PaStreamParameters::suggestedLatency.

Quote from: nyanpasu64 on September 08, 2019, 06:30:50
Q: Is it normal that PortAudio works fine on Windows with non-power-of-2 buffer sizes (accidentally set buffer size, not sampling rate, to 48000)?
Sure, there is no reason whatsoever why any API should even require power-of-2 buffer sizes. That's a totally arbitrary limitation.

Quote from: nyanpasu64 on September 08, 2019, 06:30:50
Quote50ms seems weird on Windows, WASAPI should trivially provide 20..30ms.
https://dolphin-emu.org/blog/2017/06/03/dolphin-progress-report-may-2017/#50-3937-add-cubeb-audio-backend-by-ligfx claims that XAudio2 has 62-68ms of latency (possibly some from Dolphin).
XAudio2 is yet another audio API which we have not talked about yet. You probably should not care. It's a higher level API on top of WASAPI, and also available on XBox. Not sure what contributes to those given latency numbers. WASAPI for sure works completely fine with 20ms..30ms latency.

Quote from: nyanpasu64 on September 08, 2019, 06:30:50
Quoteif the amount of rendered PCM data by your synth is not directly influenced by the audio callback, but instead works in any kind of unrelated chunking, you will either introduce additional latency, and/or limit your maximum available CPU time to less than 100%. And even if you are required to do internal chunking, abstract it away at the interface level (of the synth), i.e. by introducing *internal* (internal to the synth) buffering and exposing its latency.
...you should get rid of that 800-to-512 layering. It serves no purpose whatsoever.
I'm ripping out the queue soon. I think that even a 0-length queue introduces latency, by allowing the synth to run ahead of the callback up to 1 chunk of audio (the synth blocks trying to push to the queue, until the callback tries to pull from the queue).
Yes, even a "0-length-queue" introduces latency, implicitly, because you calculate your chunk beforehand.

Quote from: nyanpasu64 on September 08, 2019, 06:30:50
In the NES, all audio chips run in lockstep off the master clock, which also controls vblank. Most audio engines including Famitracker only run once per vblank (though Famitracker/NSF allows the engine to be called at a custom rate). I think there's nothing wrong with synthesizing new audio once per engine frame. Even if I were to render audio more finely, I wouldn't get any latency advantages (since all inputs must be quantized to 1/60 of a second), I think.
My idea is for the callback's "persistent object" to own the synth, and I only synthesize audio within the callback. Whenever the synth is out of audio, the callback runs the synth for 1 (or more?) frames into a buffer until I have 1 or more chunks of audio. Then each subsequent callback will pull audio out of the buffer, until it's empty.
However this will result in some callbacks running engine logic and synthesis (high CPU usage), while some don't (minimal CPU usage).
Q: Does OpenMPT also behave that way, but maintain low latency anyway? (I haven't read OpenMPT's code yet since I was busy with classes and other tracker research, should I read it?) (I assume with a 5ms period or block size, the synth function can take up to 5ms to complete without stuttering.)
Alternatively, the callback could run engine logic once per vblank, but synthesize audio on demand: look at the buffer size, ask the library "how many clock cycles should I advance CPU time so that X samples of audio are available?", and run all sound chips (and possibly the engine) for that many cycles.
Q: How much will this spread out CPU usage between callbacks? Should I run a profiler on FamiTracker and see where most CPU is being spent (probably drawing the GUI, not running audio)?
OpenMPT also *wants* to render audio chunks of a given length (the tick duration), which however can change during playback. However, it doesn't. It renders precisely as much audio as is requested by the callback, and remembers how much audio is yet to be rendered to complete the current tick. Input processing also only happens on tick boundaries as necessary. Compared to generating the actual audio, tick processing has negligible CPU requirements, which results in an almost constant CPU requirement per callback (which is good). Pre-rendering a complete tick would introduce a complete tick's worth of additional latency.
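
A sketch of that structure (my own simplification, not OpenMPT's actual code):

```rust
// Sketch: render exactly what the callback asks for, and carry the
// unfinished part of the current tick over to the next callback.
// Mono samples for brevity.
struct Player {
    frames_per_tick: usize,
    frames_left_in_tick: usize, // audio frames until the next tick boundary
}

impl Player {
    fn process_tick(&mut self) {
        // Advance the sequencer: read the next row/tick, apply effects,
        // trigger notes. Cheap compared to rendering audio.
    }

    fn render_frames(&mut self, out: &mut [i16]) {
        for s in out.iter_mut() {
            *s = 0; // mix all channels for these audio frames here
        }
    }

    /// Called from the audio callback with whatever size it requests.
    fn render(&mut self, out: &mut [i16]) {
        let mut pos = 0;
        while pos < out.len() {
            if self.frames_left_in_tick == 0 {
                self.process_tick();
                self.frames_left_in_tick = self.frames_per_tick;
            }
            let n = self.frames_left_in_tick.min(out.len() - pos);
            self.render_frames(&mut out[pos..pos + n]);
            self.frames_left_in_tick -= n;
            pos += n;
        }
    }
}

fn main() {
    let mut player = Player { frames_per_tick: 800, frames_left_in_tick: 0 };
    let mut buf = vec![0i16; 512]; // whatever the callback requests
    player.render(&mut buf);
}
```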

Quote from: nyanpasu64 on September 08, 2019, 06:30:50
Q: How is latency computed? Are there any techniques for this, like concurrency/timing diagrams on paper? Note that user inputs can happen at any time within a NES frame (the timing granularity of NES sound engines tied to vblank), leading to an inherent 16ms of variance.
OutputLatency = worst-case sum of all output buffering.
InputLatency = processing chunk size
RoundtripLatency = InputLatency + OutputLatency
Yes, diagrams do help.
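
Plugging the buffer sizes mentioned in this thread into those formulas (a back-of-the-envelope sketch at 48 kHz; the SDL-internal figure is an assumption):

```rust
// Back-of-the-envelope sketch at 48 kHz, using the buffer sizes mentioned
// in this thread. The SDL-internal figure is an assumption.
fn ms(frames: f64, sample_rate: f64) -> f64 {
    frames / sample_rate * 1000.0
}

fn main() {
    let sample_rate = 48_000.0;
    let engine_frame = 800.0;  // one pre-rendered "engine frame"
    let queue = 512.0;         // the length-1 chunk queue
    let sdl_internal = 1024.0; // unknown in reality; assumed here

    let output = ms(engine_frame + queue + sdl_internal, sample_rate);
    let input = ms(engine_frame, sample_rate); // processing chunk size
    println!("output ~= {:.1} ms, roundtrip ~= {:.1} ms", output, output + input);
}
```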

Quote from: nyanpasu64 on September 08, 2019, 06:30:50
I can assign inputs to either the next NES frame, or the previous NES frame to hide latency. It's possible to "insert note into pattern" as if the key was pressed earlier, but I can't retroactively play audio as if the key was pressed earlier.
That's precisely why all output buffering (which constitutes audio that has already been rendered) contributes to output latency. If you need to react to input faster, reduce the latency.

nyanpasu64

QuotePortAudio supports MME, WASAPI, DirectSound on Windows (and ASIO, which you should not care about). All required headers come with any supported Windows SDK. I have no idea what went wrong for your setup.
I accidentally printed the wrong variable (default MME twice). 🤦‍♀️ Actually, portaudio-rs only picked up MME and WDM-KS. I built PortAudio in CLion using CMake and MSVC (2019?) x64. And WASAPI was explicitly disabled when running cmake, because the author couldn't get it to build (even though it builds fine for me). They also ship a project in some ancient Visual Studio format which I could try building instead.

QuoteCompared to generating the actual audio, tick processing has negligible CPU requirements, which results in an almost constant CPU requirement per callback (which is good).
Good to know, thanks!

QuotePre-rendering a complete tick would introduce a complete tick's worth of additional latency.
Famitracker encodes instrument volume envelopes as one volume level per tick (volumes[tick]). Assume that I receive a new note halfway into a tick, and need to play a preview of that note. If I were to match Famitracker behavior, the actual sound driver (both NSF and software player) only runs once a tick, so I'd have to wait until the tick ends before triggering the new note. I could deviate from "behavior when playing a pattern" and trigger a new note right away, where its initial volume would be volumes[0]. Half a tick later when the engine actually runs, does it stay at volumes[0] or switch to volumes[1]? It's probably doable, but is it a good idea? Does OpenMPT preview audio immediately when keys are pressed, even in the middle of a tick?

I'm probably going to "run engine logic once per vblank, but synthesize audio on demand" at some point. I'll look into PAStreamParameters::suggestedLatency later. But I'll first build a prototype of my new note-placement system (which differs from other trackers), before working more on audio.