Bi-Directional Audio Streaming Integration Document

This document outlines the Web Socket (WS) event structure for bi-directional audio streaming between our system and an endpoint. Our system sends the caller's voice data, along with information about the caller, to the endpoint (a Web Socket server); the endpoint can return voice data, which is then played out to the caller. The integration sends and receives audio data and metadata in real time using predefined event types.


1. Overview

The integration streams audio data and metadata between our system and the Web Socket endpoint. The following events are exchanged to manage the audio stream lifecycle:

  • Events Sent to the Vendor (endpoint): Used to initiate, manage, and terminate the audio stream.
  • Events Received from the Vendor (endpoint): Used to receive audio data and metadata from the vendor.

2. Events Sent to the Vendor

2.1. Connected

The connected event acts as a handshake and sets expectations between the client and server. We send it as soon as the Web Socket connection is established, so it is the first message the Web Socket server receives.

Property    Description
event       Describes the type of Web Socket message. In this case, connected.

Payload Structure:

{ 
 "event": "connected"
}

2.2. Start Message

The start message contains metadata about the Stream and is sent immediately after the connected message. It is only sent once at the start of the Stream.

Property                      Description
event                         Describes the type of Web Socket message. In this case, start.
sequenceNumber                Number used to keep track of message sending order. The first message has a value of 1, and the value is incremented for each subsequent message.
start                         An object containing Stream metadata.
start.streamSid               The unique identifier of the Stream.
start.accountSid              The unique identifier of the Account for which the Stream was created.
start.callSid                 The unique identifier of the Call for which the Stream was started.
start.from                    The number from which the call to the above-mentioned account originated.
start.to                      The number of the account to which the call was placed.
start.direction               The direction of the call (inbound or outbound).
start.mediaFormat             An object containing the format of the payload in the media messages.
start.mediaFormat.encoding    The encoding of the data in the upcoming payload. Value is always audio/x-mulaw, also known as G.711 µ-law (PCMU).
start.mediaFormat.sampleRate  The sample rate, in hertz, of the upcoming audio data. Value is always 8000.
start.mediaFormat.bitRate     The number of bits used to represent one second of audio in the input audio data. Value is always 64 kbps.
start.mediaFormat.bitDepth    The number of bits used to represent each sample (8-bit).
start.customParameters        An object containing the custom parameters that were set when defining the Stream.
streamSid                     The unique identifier of the Stream.

Payload Structure:

{
  "event": "start",
  "sequenceNumber": "1",
  "start": {
    "accountSid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "callSid": "CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "from": "XXXXXXXXXX",
    "to": "XXXXXXXXXX",
    "direction": "outbound",
    "mediaFormat": {
      "encoding": "audio/x-mulaw",
      "sampleRate": 8000,
      "bitRate": 64,
      "bitDepth": 8
    },
    "customParameters": {
      "FirstName": "Jane",
      "LastName": "Doe",
      "RemoteParty": "Bob"
    }
  },
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
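
As an illustration of how an endpoint might read the start message, the following is a minimal Python sketch. The StreamInfo class and the parse_start function are hypothetical names introduced here, not part of the integration; the sketch only assumes the field names listed in the table above.

```python
import json
from dataclasses import dataclass


@dataclass
class StreamInfo:
    """Hypothetical container for the fields an endpoint usually needs."""
    stream_sid: str
    call_sid: str
    encoding: str
    sample_rate: int
    custom_parameters: dict


def parse_start(raw: str) -> StreamInfo:
    """Extract stream metadata from the JSON text of a start message."""
    msg = json.loads(raw)
    if msg.get("event") != "start":
        raise ValueError(f"expected a start event, got {msg.get('event')!r}")
    start = msg["start"]
    fmt = start["mediaFormat"]
    return StreamInfo(
        stream_sid=msg["streamSid"],
        call_sid=start["callSid"],
        encoding=fmt["encoding"],
        sample_rate=int(fmt["sampleRate"]),
        # Treated as optional here; that is an assumption of this sketch.
        custom_parameters=start.get("customParameters", {}),
    )
```

Keeping the streamSid from this message is useful because every message the endpoint sends back must carry it.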

2.3. Media message

This message type encapsulates the raw audio data.

Note that a media message is sent to the vendor every 100 ms.

Property          Description
event             Describes the type of Web Socket message. In this case, "media".
sequenceNumber    Number used to keep track of message sending order. The first message has a value of 1, and the value is incremented for each subsequent message.
media             An object containing media metadata and payload.
media.chunk       The chunk number of the message. The first message begins with 1, and the value is incremented with each subsequent message.
media.timestamp   Presentation timestamp, in milliseconds, from the start of the stream.
media.payload     Raw audio packets, encoded in base64.
streamSid         The unique identifier of the Stream.

Payload Structure:

{ 
 "event": "media",
 "sequenceNumber": "3", 
 "media": {  
   "chunk": "1", 
   "timestamp": "5",
   "payload": "no+JhoaJjpz..."
 },
 "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
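
To show how the media payload relates to the audio format announced in the start message, here is a small Python sketch that decodes a media message and computes how much audio it carries. The function names are hypothetical. The arithmetic assumes 8-bit µ-law at 8000 Hz, so one byte is one sample and 8 bytes are 1 ms of audio.

```python
import base64
import json
from typing import Tuple

SAMPLE_RATE_HZ = 8000   # start.mediaFormat.sampleRate
BYTES_PER_SAMPLE = 1    # 8-bit samples, per start.mediaFormat.bitDepth


def decode_media(raw: str) -> Tuple[int, bytes]:
    """Return (chunk number, raw mu-law bytes) from a media message."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        raise ValueError(f"expected a media event, got {msg.get('event')!r}")
    media = msg["media"]
    audio = base64.b64decode(media["payload"])
    # chunk appears as a string in the example payload, so convert it.
    return int(media["chunk"]), audio


def duration_ms(audio: bytes) -> float:
    """Length of the audio in milliseconds at 8000 Hz, 8 bits per sample."""
    return len(audio) * 1000 / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
```

Under these assumptions, a message that carries exactly one 100 ms interval of audio would decode to 800 bytes. The source does not state the payload size explicitly, so an endpoint should not rely on that figure.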

2.4. Stop Message

This message indicates when the Stream has stopped, or the call has ended.

Property          Description
event             Describes the type of Web Socket message. In this case, stop.
sequenceNumber    Number used to keep track of message sending order. The first message has a value of 1, and the value is incremented for each subsequent message.
stop              An object containing Stream metadata.
stop.accountSid   The Account identifier that created the Stream.
stop.callSid      The Call identifier that started the Stream.
stop.reason       The reason for ending the Stream.
streamSid         The unique identifier of the Stream.

Payload Structure:

{ 
 "event": "stop",
 "sequenceNumber": "5",
 "stop": {
    "accountSid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "callSid": "CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "reason": "The caller disconnected the call"
  },
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" 
}

2.5. DTMF message

A dtmf message is sent when someone presses a touch-tone number key in the inbound stream, typically in response to a prompt in the outbound stream.

Property          Description
event             Describes the type of Web Socket message. In this case, dtmf.
streamSid         The unique identifier of the Stream.
sequenceNumber    Number used to keep track of message sending order. The first message has a value of 1, and the value is incremented for each subsequent message.
dtmf.digit        The number-key tone detected.

An example dtmf message is shown below. The dtmf.digit value is 1, indicating that someone pressed the 1 key on their handset.

Payload:

{ 
  "event": "dtmf", 
  "streamSid":"MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", 
  "sequenceNumber":"5", 
  "dtmf": { 
      "digit": "1"
  }
}
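
An endpoint that prompts for multi-digit input (for example, a PIN) needs to accumulate digits across several dtmf messages. The sketch below shows one way to do that in Python; collect_digits is a hypothetical helper, and it assumes each dtmf message carries exactly one key press in dtmf.digit.

```python
import json
from typing import Iterable


def collect_digits(messages: Iterable[str]) -> str:
    """Concatenate the digits from all dtmf messages, in arrival order."""
    digits = []
    for raw in messages:
        msg = json.loads(raw)
        # Other event types (media, mark, ...) are ignored here.
        if msg.get("event") == "dtmf":
            digits.append(msg["dtmf"]["digit"])
    return "".join(digits)
```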

2.6. Mark message

After the endpoint sends a media message, it can send a mark message with a label. When playback of that media is complete, we send a mark message back to the endpoint with the same label in mark.name, indicating that the media has been played.

If the endpoint (Web Socket server) sends a clear message, we empty the audio buffer and send back mark messages matching any marks from the server that are still outstanding.

Property          Description
event             Describes the type of Web Socket message. In this case, "mark".
sequenceNumber    Number used to keep track of message sending order. The first message has a value of 1, and the value is incremented for each subsequent message.
streamSid         The unique identifier of the Stream.
mark              An object containing the mark metadata.
mark.name         A custom value. We send back the mark.name that you specified in the mark message we received from you.

Payload Structure:

{ 
 "event": "mark",
 "sequenceNumber": "4",
 "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
 "mark": {
   "name": "mark label"
 }
}
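
On the endpoint side, the mark round trip can be tracked with a small amount of state. The following Python sketch is one possible approach, not the prescribed one: MarkTracker and its methods are hypothetical names, and the sketch assumes each label is used once per stream.

```python
import json
from collections import deque
from typing import Optional


class MarkTracker:
    """Remembers which mark labels were sent and not yet echoed back."""

    def __init__(self, stream_sid: str) -> None:
        self.stream_sid = stream_sid
        self.pending = deque()

    def mark_message(self, name: str) -> str:
        """Build the mark message to send after a media message."""
        self.pending.append(name)
        return json.dumps({
            "event": "mark",
            "streamSid": self.stream_sid,
            "mark": {"name": name},
        })

    def on_message(self, raw: str) -> Optional[str]:
        """Process an incoming message; return the label if it is a mark."""
        msg = json.loads(raw)
        if msg.get("event") != "mark":
            return None
        name = msg["mark"]["name"]
        if name in self.pending:
            self.pending.remove(name)
        return name

    def all_played(self) -> bool:
        """True once every mark sent so far has been echoed back."""
        return not self.pending
```

Because marks are also returned after a clear message, an echoed mark tells the endpoint that the audio is no longer queued, not necessarily that the caller heard all of it.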

3. Events Received from the Vendor

3.1. Media

The payload must be mono audio encoded as audio/x-mulaw at a sample rate of 8000 Hz, and then encoded in base64. The audio can be of any length, subject to the sizing guidance below.

Note that the media payload received from the vendor should be at least 160 bytes long and a multiple of 160 bytes (for example, 320, 800, or 4000). If the payload is not a multiple of 160 bytes, audio gaps might occur when it is played over the call.

Property          Description
event             Describes the type of Web Socket message. In this case, "media".
streamSid         The SID of the Stream that should play the audio.
media             An object containing the media payload.
media.payload     Raw mulaw/8000 audio, encoded in base64.
media.chunk       The chunk number of the message. The first message begins with 1, and the value is incremented with each subsequent message.

Payload Structure:

{
  "event": "media",
  "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
  "media": {
    "payload": "a3242sa...",
    "chunk" : 1
  }
}
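
The 160-byte sizing guidance can be met by padding the final frame with silence before encoding. The Python sketch below is illustrative only: pad_to_frame and media_message are hypothetical helpers, the sketch assumes the 160-byte rule applies to the audio before base64 encoding, and it pads with 0xFF, the µ-law code for zero amplitude.

```python
import base64
import json

FRAME_BYTES = 160      # frame size from the sizing guidance
MULAW_SILENCE = 0xFF   # mu-law code for zero amplitude


def pad_to_frame(audio: bytes) -> bytes:
    """Pad mu-law audio with silence up to a non-zero multiple of 160 bytes."""
    remainder = len(audio) % FRAME_BYTES
    if audio and remainder == 0:
        return audio
    # Empty input becomes one full frame of silence.
    return audio + bytes([MULAW_SILENCE]) * (FRAME_BYTES - remainder)


def media_message(stream_sid: str, audio: bytes, chunk: int) -> str:
    """Build a media message for the given stream from raw mu-law bytes."""
    payload = base64.b64encode(pad_to_frame(audio)).decode("ascii")
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": payload, "chunk": chunk},
    })
```

Padding adds at most 159 bytes, which is just under 20 ms of silence at 8000 Hz, so it is best applied only to the last piece of an utterance rather than to every message.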

3.2. Mark

The vendor sends a mark event message after a media event message to be notified when playback of the audio it sent has completed. We send back a mark event with a matching name when the audio ends (or if there is no audio buffered).

The Web Socket Server also receives an incoming mark event message if the buffer was cleared using the clear event message.

Property          Description
event             Describes the type of Web Socket message. In this case, "mark".
streamSid         The SID of the Stream that should receive the mark.
mark              An object containing mark metadata and payload.
mark.name         A name specific to your needs that will help you recognize the mark event when it is returned.

Payload Structure:

{ 
 "event": "mark",
 "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
 "mark": {
   "name": "my label"
 }
}

3.3. Clear

The vendor sends a clear message when it wants to interrupt audio that it has already sent in media messages. This empties all buffered audio and causes any outstanding mark messages to be sent back to the Web Socket server.

Property          Description
event             Describes the type of Web Socket message. In this case, "clear".
streamSid         The SID of the Stream in which you wish to interrupt the audio.

Payload Structure:

{ 
 "event": "clear",
 "streamSid": "MZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
}
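
A common reason to send clear is to stop a prompt as soon as the caller reacts. The Python sketch below shows one hypothetical policy, interrupting playback when a dtmf message arrives; the function name on_caller_event and the policy itself are assumptions for illustration, not behavior required by this integration.

```python
import json
from typing import Optional


def on_caller_event(raw: str) -> Optional[str]:
    """Return a clear message if the incoming message is a key press."""
    msg = json.loads(raw)
    if msg.get("event") == "dtmf":
        # Reuse the streamSid of the incoming message so the clear
        # targets the same stream.
        return json.dumps({"event": "clear", "streamSid": msg["streamSid"]})
    return None
```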

4. Stream Lifecycle

  1. Connection Establishment:
    • Establish a Web Socket connection with the vendor.
    • Send the connected event to initiate the handshake.
  2. Stream Initialization:
    • Send the start event with stream metadata.
  3. Audio Streaming:
    • Send media events with base64 encoded audio data.
    • Receive media events from the vendor with base64 encoded audio data.
  4. Stream Termination:
    • Send the stop event to terminate the stream.
    • Handle the clear event from the vendor by emptying any buffered audio.
  5. End of Input:
    • Send the mark event when all media from the bot has been played.
    • Handle the mark event from the vendor to denote the end of input.
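
The ordering of the events sent to the vendor can be sketched as a small state machine that an endpoint could use to reject out-of-order messages. This is a minimal Python sketch under stated assumptions: the StreamLifecycle class and its state names are hypothetical, and it assumes that stop is only sent after start and that nothing follows stop.

```python
import json


class StreamLifecycle:
    """Tracks the state of one stream: new, connected, streaming, stopped."""

    IN_STREAM_EVENTS = ("media", "dtmf", "mark")

    def __init__(self) -> None:
        self.state = "new"

    def feed(self, raw: str) -> str:
        """Advance the state for one incoming message and return the state."""
        event = json.loads(raw).get("event")
        if self.state == "new" and event == "connected":
            self.state = "connected"
        elif self.state == "connected" and event == "start":
            self.state = "streaming"
        elif self.state == "streaming" and event in self.IN_STREAM_EVENTS:
            pass  # stays in the streaming state
        elif self.state == "streaming" and event == "stop":
            self.state = "stopped"
        else:
            raise ValueError(f"unexpected {event!r} in state {self.state!r}")
        return self.state
```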

5. Example Workflow

  1. Client to Vendor:
    • Send connected → start → media → stop.
  2. Vendor to Client:
    • Receive media → mark → clear.

6. Notes

  • Ensure that the streamSid is unique for each stream and consistent across all events for a given stream.
  • Base64 encoding is used for audio data to ensure compatibility and ease of transmission.
  • The mark event is used to synchronize the end of input between the client and vendor.