How I Use Audio Worklet to Replace AudioBufferSourceNode

Background

Recently I have been refactoring Nikku, a web-based BRSTM audio player. After I moved the audio decoding to a worker thread, I noticed that passing data between the main thread and the worker thread caused a significant delay before the audio could be played back.

Previously, when the decoding happened on the main thread, my favorite track was fully decoded in around 250 ms. That means when a file is selected, the user has to wait about that long before audio can start playing. However, when the decoding happened in the worker thread, this delay increased to 300 ms, which starts to be pretty noticeable. The increase comes from the data being passed between the main thread and the worker thread. Reading what Surma said about postMessage, I agreed that it is slow precisely because the data I’m passing is pretty large (the file is around 4 MB).

So what to do when you have a big chunk of file? Well, you can split the file into smaller segments. The idea is that I split the file into two segments: the first 3 seconds of the file (first segment) and the rest of the file (second segment). I decode the first segment, load it into the player, and start playing, while simultaneously decoding the second segment and loading it up when it’s ready.

The first 3 s of the file can be decoded (including the data passing across threads) in less than 30 ms, so this first-start delay shouldn’t be noticeable anymore.
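To make it concrete, here is a rough sketch of what that flow looks like on the main thread (decodeInWorker and player are hypothetical stand-ins, not Nikku’s actual API):

// Hypothetical sketch of the two-segment flow; `decodeInWorker` and `player`
// are stand-ins for Nikku's actual worker call and playback code.
async function loadAndPlay(fileBuffer) {
  const SPLIT_SECONDS = 3;

  // Kick off both decodes; the second one runs in the worker while the
  // first segment is already playing.
  const firstPromise = decodeInWorker(fileBuffer, { from: 0, to: SPLIT_SECONDS });
  const secondPromise = decodeInWorker(fileBuffer, { from: SPLIT_SECONDS });

  const first = await firstPromise;
  player.load(first.samples); // load the first ~3 s and start playback
  player.play();

  const second = await secondPromise;
  player.addSamples(second.samples, first.samples[0].length); // append the rest
}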

Now I face a roadblock: when the second segment has been decoded, how do I load it up?

Side Track: Web Audio API

The Web Audio API has the concept of audio nodes. There are three kinds of nodes: input nodes (e.g. an oscillator, a media stream, a buffer source), effect nodes (e.g. a gain node to manipulate volume), and the destination node (which represents the user’s speakers). Developers choose from a range of predefined nodes to suit their use cases.

However, these predefined nodes are not sufficient to cater to every use case. That’s why, in recent years, Audio Worklet emerged as a solution, basically letting developers implement their own custom nodes.

In the context of Nikku, the big picture is as follows:

AudioBufferSourceNode --> AudioMixerNode (Worklet) --> GainNode --> AudioDestinationNode

The fully decoded samples are loaded into the AudioBufferSourceNode. This node acts as the “source” of the audio; it controls looping too.

AudioMixerNode is an audio worklet that mixes different channels/tracks together to produce a two-channel output. This is needed because a BRSTM file can contain more than one track, and although each track typically has 2 channels, it can sometimes have more (like 3) or fewer (1). These are described in the file itself (in the “track descriptions” file header). Although there is a built-in ChannelMergerNode, we cannot use it because it has a predefined algorithm for how to mix different numbers of inputs together.

GainNode controls the volume of the output. AudioDestinationNode is the speaker.
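For context, wiring up a graph like this with the Web Audio API looks roughly like the sketch below (the module path, processor name, and decodedBuffer are placeholders, not Nikku’s actual code):

const audioContext = new AudioContext();

// Load the worklet module that registers the mixer processor.
// 'audio-mixer.js' and 'audio-mixer' are placeholder names.
await audioContext.audioWorklet.addModule('audio-mixer.js');

// `decodedBuffer` is an AudioBuffer holding the fully decoded samples (placeholder).
const sourceNode = new AudioBufferSourceNode(audioContext, { buffer: decodedBuffer });
const mixerNode = new AudioWorkletNode(audioContext, 'audio-mixer', {
  outputChannelCount: [2],
});
const gainNode = new GainNode(audioContext);

sourceNode
  .connect(mixerNode)
  .connect(gainNode)
  .connect(audioContext.destination);

sourceNode.start();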

How to Update a Playing Audio Buffer?

I tried to write the new values into the buffer object of the AudioBufferSourceNode. There was no error, but it did not seem to take any effect (audio stops after 3 s).

Naturally I started to search StackOverflow and found a similar question. One option is to recreate the AudioBufferSourceNode with the fully decoded samples, and then seek to the “current” playback time to restart playback. (By the way, I don’t have an accurate playback time, because the API doesn’t provide one, so I have to keep track of it myself using Date.now().) After I implemented that, I noticed that when the second segment is loaded (and the AudioBufferSourceNode is recreated & reconnected), there seems to be a disturbance in the playing audio. I guess that’s because my “current” playback time is inaccurate?
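In code, that “recreate and seek” approach looks roughly like this sketch (the function and variable names are illustrative; the playback time comes from my own rough Date.now() bookkeeping):

// Replace the old source node with one holding the fully decoded buffer,
// then restart playback near where we were. `playbackStartedAt` comes from
// rough Date.now() bookkeeping on the main thread.
function swapInFullBuffer(audioContext, oldSourceNode, fullBuffer, nextNode, playbackStartedAt) {
  const elapsedSeconds = (Date.now() - playbackStartedAt) / 1000;

  oldSourceNode.stop();
  oldSourceNode.disconnect();

  const newSourceNode = new AudioBufferSourceNode(audioContext, { buffer: fullBuffer });
  newSourceNode.connect(nextNode);

  // Seek by starting playback at an offset (in seconds) into the buffer.
  newSourceNode.start(0, elapsedSeconds);
  return newSourceNode;
}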

Then I found a similar GitHub issue on the Web Audio standard itself, where recent commenters suggested implementing an Audio Worklet as the source of the audio buffer. I thought that might make things very complicated, since it meant re-implementing the behavior of AudioBufferSourceNode, but it turned out simpler than I expected.

Audio Worklet to the Rescue

This custom audio source node has 0 inputs and 1 output of 2 channels. When initialized, it takes in the first audio segment, and it listens to a message event to receive the second audio segment (samples & offset). In the process() method, I keep an internal counter “bufferHead” to read from the expected sample index and write to the output. The assumption here is that the browser calls the process() method sequentially, one call at a time.

class AudioSourceNode extends AudioWorkletProcessor {
  constructor(options) {
    super();
    /** @type {Float32Array[]} one Float32Array per channel */
    this.samples = options?.processorOptions?.initialSamples;
    /** @type {number} index of the next sample to play */
    this._bufferHead = 0;

    this.port.onmessage = (event) => {
      switch (event.data.type) {
        case 'ADD_SAMPLES': {
          const samples = /** @type {Float32Array[]} */ (
            event.data.payload.samples
          );
          const offset = /** @type {number} */ (event.data.payload.offset);

          // Copy the new samples into place, channel by channel.
          for (let c = 0; c < samples.length; c++) {
            for (let s = 0; s < samples[c].length; s++) {
              this.samples[c][s + offset] = samples[c][s];
            }
          }
          break;
        }
      }
    };
  }

  process(_inputs, outputs, _parameters) {
    const output = outputs[0];
    let absoluteSampleIndex = this._bufferHead;
    for (let s = 0; s < output[0].length; s++) {
      output[0][s] = this.samples[0][absoluteSampleIndex];
      output[1][s] = this.samples[1][absoluteSampleIndex];

      absoluteSampleIndex += 1;
    }
    this._bufferHead = absoluteSampleIndex;
    // Return true to keep the processor alive.
    return true;
  }
}

In this initial implementation, I have one class property “samples” (type Float32Array[]; each Float32Array represents one channel) that contains the decoded audio samples. My idea is to write directly into the Float32Arrays when I receive new samples.

Message Event Handler Silent Failure

However, this does not work as expected. Somehow, the message event handler runs, but the loop inside it does not. Neither Firefox nor Chrome throws an error here.

          console.log('a')                                 // this is printed in console
          for (let c = 0; c < samples.length; c++) {       // samples.length is 2
            for (let s = 0; s < samples[c].length; s++) {  // samples[0].length is around 3,000,000
              this.samples[c][s + offset] = samples[c][s];
              console.log('b', samples[c][s]);             // this does not print anything!?
            }
          }

My guess is that there is some kind of limit on how much “work” can be done in an audio worklet’s message event handler, and looping that many times isn’t allowed. It would be great if the browser threw some kind of error instead of failing silently.

I worked around this problem by changing the class property “samples” into a list of segments, i.e. type Float32Array[][]. When I receive new samples, I just push them (and their offset) onto the “samples” property.

class AudioSourceNode extends AudioWorkletProcessor {
  constructor(options) {
    super();
    /** @type {Float32Array[][]} segments of multi-channel samples */
    this.samples = [options?.processorOptions?.initialSamples];
    /** @type {Array<number>} list of offsets, one for each segment in this.samples */
    this.samplesOffsets = [0];
    /** @type {number} */
    this._bufferHead = 0;
    this.port.onmessage = (event) => {
      switch (event.data.type) {
        case 'ADD_SAMPLES': {
          const samples = /** @type {Float32Array[]} */ (
            event.data.payload.samples
          );
          const offset = /** @type {number} */ (event.data.payload.offset);

          // Just store the new segment; no big copy loop needed.
          this.samples.push(samples);
          this.samplesOffsets.push(offset);
          break;
        }
      }
    };
  }
}

This approach means that in the process() method, given an absolute sample index, I should read from the relevant segment.

  process(_inputs, outputs, _parameters) {
    const output = outputs[0];
    let absoluteSampleIndex = this._bufferHead;
    for (let s = 0; s < output[0].length; s++) {
      const i = this.getSegmentIndex(absoluteSampleIndex); // To be implemented

      const segmentOffset = this.samplesOffsets[i];
      const segment = this.samples[i];
      const segmentSampleIndex = absoluteSampleIndex - segmentOffset;

      output[0][s] = segment[0][segmentSampleIndex];
      output[1][s] = segment[1][segmentSampleIndex];

      absoluteSampleIndex += 1;
    }
    this._bufferHead = absoluteSampleIndex;
    return true;
  }

But given a sample index, how do I determine which segment it belongs to?

Well, a naive implementation is to loop through all the sample offsets. This is actually sufficient, as there should not be that many segments.

  getSegmentIndex(s) {
    for (let i = 0; i < this.samplesOffsets.length - 1; i++) {
      if (this.samplesOffsets[i] <= s && s < this.samplesOffsets[i + 1]) {
        return i;
      }
    }
    // Past the last offset: the sample belongs to the last segment.
    return this.samplesOffsets.length - 1;
  }

But it bugs me that it runs in O(N), so I implemented a binary search instead.

  getSegmentIndex(s) {
    // https://en.wikipedia.org/wiki/Binary_search_algorithm#Alternative_procedure
    let l = 0,
      r = this.samplesOffsets.length - 1;
    while (l !== r) {
      let mid = Math.ceil((l + r) / 2);
      if (this.samplesOffsets[mid] > s) {
        r = mid - 1;
      } else {
        l = mid;
      }
    }
    return l;
  }

After these changes, although the worklet is still missing some features, the audio playback is working!

More Features

Implementing a custom source node means that I need to re-implement the features built into AudioBufferSourceNode.

Mixing

At this point, the audio player still relied on a separate worklet for mixing the different audio tracks together. I thought it would be much simpler to fold it into this worklet, and I did just that.

Mixing tracks just means that, given multiple tracks, we need to combine them into a single 2-channel track. Given a sample from each track, I add them up and clamp the sum so it doesn’t go out of bounds. The actual code is pretty long, so I linked it here.
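The core of the mixing, though, looks roughly like this sketch (track layout handling simplified; mixTracksAt and trackSamples are illustrative names):

// Mix N tracks down to 2 output channels at one sample index.
// `trackSamples[t][ch]` is a Float32Array for track t, channel ch;
// a single-channel track is duplicated to both outputs.
function mixTracksAt(trackSamples, sampleIndex) {
  let left = 0;
  let right = 0;
  for (const track of trackSamples) {
    left += track[0][sampleIndex];
    right += (track[1] ?? track[0])[sampleIndex];
  }
  // Clamp so the summed signal stays within [-1, 1].
  return [
    Math.max(-1, Math.min(1, left)),
    Math.max(-1, Math.min(1, right)),
  ];
}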

Seeking

Seeking to a specific playback timestamp simply means setting the worklet’s “bufferHead” to the corresponding sample index. My approach is to listen to a “seek” message and then update the “bufferHead”. There is a slight error here, since the new value is only used on the next process() call, but I think it’s small enough to ignore (I observed that the process() method is called by the browser every 128 samples).

class AudioSourceNode extends AudioWorkletProcessor {
  constructor(options) {
    // ...
    this.port.onmessage = (event) => {
      switch (event.data.type) {
        // ...
        case 'SEEK': {
          const playbackTimeInS = /** @type {number} */ (
            event.data.payload.playbackTimeInS
          );
          this._bufferHead = Math.floor(playbackTimeInS * this.sampleRate);
          break;
        }

      }
    }
  }
}
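On the main thread, seeking then becomes a single message to the worklet node’s port. A sketch (audioSourceWorkletNode and the 42.5 s value are illustrative):

// Main-thread side: tell the worklet to jump to 42.5 s.
audioSourceWorkletNode.port.postMessage({
  type: 'SEEK',
  payload: {
    playbackTimeInS: 42.5,
  },
});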

Looping

We need to handle two scenarios: when the user wants looping and when they don’t. To get this information, we listen to an “update should loop” message.

class AudioSourceNode extends AudioWorkletProcessor {
  constructor(options) {
    // ...
    this.port.onmessage = (event) => {
      switch (event.data.type) {
        // ...
        case 'UPDATE_SHOULD_LOOP': {
          this.shouldLoop = /** @type {boolean} */ (
            event.data.payload.shouldLoop
          );
          break;
        }

      }
    }
  }
}

When looping is desired, it means that when we increment the sample index and it goes out of bounds, the sample index is moved back to the specified “loop start” point. If looping is not desired, we keep the index past the end but write 0 to the output. We can also send a message back to the main thread saying that we have reached the end of the buffer, or that the buffer has looped.

class AudioSourceNode extends AudioWorkletProcessor {
  process(_inputs, outputs, _parameters) {
    const output = outputs[0];

    let absoluteSampleIndex = this._bufferHead;

    for (let s = 0; s < output[0].length; s++) {
      if (absoluteSampleIndex >= this.totalSamples) {
        // We've reached the end, just output 0
        output[0][s] = 0;
        output[1][s] = 0;
        continue;
      }

      // ... retrieving sample from segments + mixing tracks

      absoluteSampleIndex += 1;
      if (absoluteSampleIndex >= this.totalSamples) {
        if (this.shouldLoop) {
          absoluteSampleIndex = this.loopStartSample;
          this.port.postMessage({
            type: 'BUFFER_LOOPED',
          });
        } else {
          this.port.postMessage({
            type: 'BUFFER_ENDED',
          });
        }
      }
    }
    this._bufferHead = absoluteSampleIndex;
    // Keep the processor alive even after the buffer ends.
    return true;
  }
}

Why do I write 0 instead of stopping the node (returning false from the process method)? This is so that I can “reactivate” this audio worklet node with a message event. If I ever returned false from the process method, I would need to reinitialize the node again (and set up all the sample data).

Current Playback Position

One of the biggest issues I faced when building this audio player in the past was retrieving the current playback position. Apparently it is also a long-standing issue in the Web Audio standard itself.

My workaround in the existing audio player was to keep track of when I started & paused the audio buffer (using Date.now()). Using this information, I calculated roughly where the playback position was. It wasn’t accurate, but that was fine because it was only meant for displaying the progress bar.
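That old bookkeeping looked roughly like this sketch (simplified; it only really holds while audio is playing):

// Rough playback-position tracking with Date.now(), as used previously.
let playStartedAt = 0;        // Date.now() when playback (re)started
let elapsedBeforePauseMs = 0; // accumulated playback time before the last pause

function onPlay() {
  playStartedAt = Date.now();
}

function onPause() {
  elapsedBeforePauseMs += Date.now() - playStartedAt;
}

function roughPlaybackPositionInSeconds() {
  return (elapsedBeforePauseMs + (Date.now() - playStartedAt)) / 1000;
}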

But the audio source worklet itself knows exactly where it is, and there is a message-passing mechanism to send data from the worklet back to the main thread. What if we pass the current playback position back to the main thread? If so, how often should the update be?

My approach is to have the main thread send a message to the worklet to ask for the timestamp, and the worklet sends back a reply message. The other design I considered was to have the worklet send it on every process() call, but I felt that would update too often. So I approached it from the UI point of view. In the main thread, I already have a loop to update the playback time (using requestAnimationFrame), so I extended it to also ask for the current timestamp. In a way, the playback timestamp shown in the UI will always have a slight error, since there is a small delay for the message to travel across threads. But for UI display purposes, I think that’s totally acceptable.

class AudioSourceNode extends AudioWorkletProcessor {
  constructor(options) {
    // ...
    this.port.onmessage = (event) => {
      switch (event.data.type) {
        // ...
        case 'TIMESTAMP_QUERY': {
          this.port.postMessage({
            type: 'TIMESTAMP_REPLY',
            payload: {
              timestamp: this._bufferHead / this.sampleRate,
            },
          });
          break;
        }

      }
    }
  }
}
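The main-thread side of this exchange looks roughly like the sketch below (audioSourceWorkletNode and updateProgressBar are placeholders):

// Ask the worklet for its position once per animation frame and
// update the progress bar when the reply arrives.
audioSourceWorkletNode.port.onmessage = (event) => {
  if (event.data.type === 'TIMESTAMP_REPLY') {
    updateProgressBar(event.data.payload.timestamp);
  }
};

function tick() {
  audioSourceWorkletNode.port.postMessage({ type: 'TIMESTAMP_QUERY' });
  requestAnimationFrame(tick);
}
requestAnimationFrame(tick);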

Conclusion

In the end, the audio node graph in my player looks quite simple:

AudioSourceNode (Worklet) --> GainNode --> AudioDestinationNode

This is because all the heavy lifting is handled in the audio worklet. Writing my own audio worklet to replace AudioBufferSourceNode turned out to be simpler than I thought. Not only did it resolve all my annoyances about working with the Web Audio API, it is also able to handle my custom requirements.

If you hit a roadblock with the built-in Web Audio nodes, maybe you should write your own Audio Worklet 😂
