Synchronizing Sound PCM with Image FPS

This post introduces a couple of techniques I used to synchronize audio content with the output of my animation-fractal project. In two parts, I present:

  • How to decode audio data into raw Pulse-Code Modulation (PCM) samples.
  • An audio player that plays the audio along with the video's Frames Per Second (FPS).

The goal of this synchronization is to keep the video in step with the audio, as demonstrated in this new demo:


Normalize the audio data

First, we need to decode the audio data into a raw format suitable for real-time playback. The main requirement is to pick a fixed sample rate so that the data can be divided into per-frame chunks. In this case the video runs at 60 frames per second, so we can use a 44100 Hz sampling frequency. The resulting audio chunk size is: \(44100 / 60 = 735\) samples.
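The relationship between the sample rate and the frame rate can be captured in a tiny helper (the name chunkSize is mine, not from the project):

```haskell
-- Number of audio samples to play per video frame.
-- At 44100 Hz and 60 FPS this is exactly 735; picking rates that
-- divide evenly avoids fractional chunk sizes.
chunkSize :: Int -> Int -> Int
chunkSize sampleRate fps = sampleRate `div` fps
```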

Using the typed-process library to run ffmpeg, we decode a given file with this function:

import Data.ByteString.Internal (toForeignPtr0)
import Data.ByteString.Lazy qualified as BSL
import Data.Vector.Storable qualified as VectorS
import Foreign (ForeignPtr, castForeignPtr)
import Foreign.Storable (sizeOf)
import System.Process.Typed (proc, readProcessStdout_)

type Samples = VectorS.Vector Float

decodeFile :: FilePath -> IO Samples
decodeFile fname = do
    let args = ["-i", fname]    -- input
            <> ["-ac", "1"]     -- convert to mono
            <> ["-ar", "44100"] -- sample at 44100 Hz (fits 60 fps)
            <> ["-f", "f32le"]  -- output raw 32-bit little-endian floats
            <> ["-"]            -- write to stdout
    pcmBuf <- readProcessStdout_ $ proc "ffmpeg" args
    -- readProcessStdout_ returns a lazy ByteString; make it strict
    -- before accessing the underlying memory.
    let (wordPtr, wordSZ) = toForeignPtr0 (BSL.toStrict pcmBuf)
        pcmPtr = castForeignPtr wordPtr :: ForeignPtr Float
        pcmSZ = wordSZ `div` sizeOf (0 :: Float)
        samples = VectorS.unsafeFromForeignPtr0 pcmPtr pcmSZ
    pure samples

This function relies on toForeignPtr0 from the bytestring library to access the underlying memory. Then we convert it into a Vector using unsafeFromForeignPtr0.
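The pointer cast is valid because each f32le sample is exactly 4 bytes wide; the byte-to-sample conversion can be made explicit with a small helper (the name is mine):

```haskell
import Foreign.Storable (sizeOf)

-- Convert a raw f32le byte count into a number of Float samples;
-- each sample occupies sizeOf (0 :: Float) = 4 bytes.
byteCountToSamples :: Int -> Int
byteCountToSamples nbytes = nbytes `div` sizeOf (0 :: Float)
```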

Playback through pipewire

Using the pulse-simple library, we send the samples to the speaker with this function:

import Data.ByteString.Internal (fromForeignPtr0)
import Sound.Pulse.Simple qualified as PS

playSample :: Samples -> PS.Simple -> IO ()
playSample samples client = PS.simpleWriteRaw client bs
  where
    -- Expose the vector's memory and re-interpret it as raw bytes.
    (floatPtr, floatSZ) = VectorS.unsafeToForeignPtr0 samples
    wordPtr = castForeignPtr floatPtr
    wordSZ = floatSZ * sizeOf (0 :: Float) -- 4 bytes per sample
    bs = fromForeignPtr0 wordPtr wordSZ

This function does the opposite operation from decodeFile to convert the data back into a ByteString.

pulse-simple exposes a synchronous API: the simpleWriteRaw call blocks until the samples are completely consumed. It's also amazing to see that the library was published 10 years ago (2012) and still compiles today!

Audio player thread

We use a dedicated thread to handle the synchronous audio connection with a channel to receive the samples to be played:

clientThread :: TChan (Maybe Samples) -> IO ()
clientThread chan = do
    -- the pulse client reference.
    mClientV <- newTVarIO Nothing
    let startClient = do
            readTVarIO mClientV >>= \case
                -- the client is already started
                Just client -> pure client
                Nothing -> do
                    client <- newClient
                    atomically $ writeTVar mClientV (Just client)
                    pure client

        stopClient = do
            readTVarIO mClientV >>= \case
                Nothing -> pure ()
                Just client -> do
                    PS.simpleDrain client
                    PS.simpleFree client
                    atomically $ writeTVar mClientV Nothing

        run = forever do
            atomically (readTChan chan) >>= \case
                Nothing -> stopClient
                Just samples -> playSample samples =<< startClient

    run `finally` stopClient
  where
    name = "animation-fractal"
    newClient = PS.simpleNew Nothing name PS.Play Nothing "pulse-pipe" spec Nothing Nothing
    spec = PS.SampleSpec (PS.F32 PS.LittleEndian) 44100 1
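The channel protocol is worth spelling out: Just samples asks the thread to (re)start the client and play, while Nothing drains and frees it. The following stand-in consumer illustrates the same protocol without pulseaudio (it returns on Nothing instead of looping forever, and all names here are mine, for illustration only):

```haskell
import Control.Concurrent.STM

-- Consume the channel like clientThread does: count each Just
-- (standing in for playSample) and stop on Nothing (stopClient).
demoConsumer :: TChan (Maybe Int) -> TVar Int -> IO ()
demoConsumer chan counter = loop
  where
    loop = do
        msg <- atomically (readTChan chan)
        case msg of
            Nothing -> pure ()
            Just _ -> do
                atomically (modifyTVar' counter (+ 1))
                loop
```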

Synchronize the audio with the video

Finally, we submit the audio samples by calling this function in the video render loop:

playAudioFrame :: Frame -> Samples -> TChan (Maybe Samples) -> IO ()
playAudioFrame (Frame position) samples chan =
    -- Submit on even frames only: each chunk covers two video frames.
    when (even position) do
        let frameSize = 44100 `div` 30 -- 1470 samples, 1/30th of a second
            startingPos = fromIntegral position `div` 2
            chunk = VectorS.slice (startingPos * frameSize) frameSize samples
        atomically $ writeTChan chan (Just chunk)

Submitting 60 audio chunks per second turned out to be too fast, resulting in audible clicks, so this function submits only 30 larger chunks per second instead.
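Note that VectorS.slice is a partial function: it raises an error if the requested range extends past the end of the samples, which can happen on the last frame of a track. One way to guard against this (clampRange is a hypothetical helper, not in the original code) is to clamp the range before slicing:

```haskell
-- Clamp a (start, length) range so it fits within `total` samples;
-- the final frame then yields a shorter chunk instead of an
-- out-of-range error.
clampRange :: Int -> Int -> Int -> (Int, Int)
clampRange total start len =
    let start' = min start total
        len' = min len (total - start')
    in (start', len')
```

playAudioFrame could then slice with the clamped start and length computed from VectorS.length samples.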

Conclusion

Similarly to the previous post, Capturing Vulkan Framebuffer with Massiv, we used Haskell's Foreign capabilities to efficiently manipulate raw data. While this initial implementation does not prevent buffer under- or over-runs, it already performs quite well: we can jump to arbitrary locations and the audio plays in sync with the video.

Haskell offers efficient low-level interfaces that are available from a high level of abstraction.

#haskell #blog