XEP-xxxx: Jingle Synchronized Real-Time Text

Abstract
This specification defines a Jingle application extension for negotiating real-time text as part of the same conversational session as audio and video.
Author
Edward Tie
Copyright
© 2026 – 2026 XMPP Standards Foundation. SEE LEGAL NOTICES.
Status

ProtoXEP

WARNING: This document has not yet been accepted for consideration or approved in any official manner by the XMPP Standards Foundation, and this document is not yet an XMPP Extension Protocol (XEP). If this document is accepted as a XEP by the XMPP Council, it will be published at <https://xmpp.org/extensions/> and announced on the <standards@xmpp.org> mailing list.
Type
Standards Track
Version
0.0.2 (2026-05-30)
Document Lifecycle
  1. Experimental
  2. Proposed
  3. Stable
  4. Final

1. Introduction

Real-time text is already defined for XMPP by In-Band Real Time Text (XEP-0301) [1]. Jingle is already used to negotiate real-time audio and video sessions, most commonly using Jingle RTP Sessions (XEP-0167) [2] and Jingle ICE-UDP Transport Method (XEP-0176) [3]. However, when a client establishes a Jingle audio-video call and sends real-time text as ordinary XMPP messages outside the Jingle session, the user experience can look like one conversation while the protocol state is split into two unrelated paths.

This specification defines a way to negotiate real-time text as a Jingle content in the same session as audio and video. The text content can be human-typed RTT, captions, ASR output, interpreter text, translation text or transcript text. The goal is Total Conversation: audio, video and text presented as one conversational unit.

The motivating implementation problem is simple: a call can exist, text can exist, and yet the text might not be part of the negotiated Jingle session. In that case the receiver cannot reliably treat the text as synchronized conversational media.

2. Requirements

This specification is designed to meet the following requirements.

  1. Enable a Jingle initiator to offer real-time text in the same session as audio and video.
  2. Enable a responder to accept or reject real-time text independently from audio and video.
  3. Define a first-class Jingle content for text, for example with content name text or rtt.
  4. Allow endpoints to identify the text purpose, source and language.
  5. Allow endpoints to indicate whether the text is synchronized to a media clock, a session clock, the call session only, or not synchronized.
  6. Allow fallback to In-Band Real Time Text (XEP-0301) [1] when synchronized Jingle text is not supported.
  7. Prevent clients from silently presenting fallback RTT as synchronized text.

2.1 Implementation levels

Implementations can support different levels without falsely claiming full synchronization.

Table 1: Implementation levels
Level | Name | Minimum capability | User-visible promise
0 | XEP-0301 fallback | Ordinary in-band RTT outside Jingle | Live text, not media synchronized
1 | Jingle co-session text | Text is negotiated by the same Jingle session but does not share a media clock | Belongs to the call, limited synchronization
2 | Session-clock text | Text has timestamps relative to a shared call or session clock | Call-synchronized text
3 | Media-clock text | RTP/T.140 or equivalent media-clock timing with audio/video correlation | Strict synchronized Total Conversation

An implementation MUST NOT advertise a higher level than it can actually deliver. In particular, a WebRTC data channel that is merely opened during a call is Level 1 unless it can demonstrate a shared session clock or media clock.
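
As a non-normative illustration of this rule, the following Python sketch caps the sync-mode a client offers at the level it can actually deliver. The mapping between the sync-mode values of Section 7 and the levels of Table 1 (in particular, treating none as Level 0) is an assumption of this sketch, and the names Level, deliverable_level and sync_mode_to_offer are hypothetical.

```python
from enum import IntEnum


class Level(IntEnum):
    """Implementation levels from Table 1."""
    FALLBACK = 0
    CO_SESSION = 1
    SESSION_CLOCK = 2
    MEDIA_CLOCK = 3


# Assumed mapping between Table 1 levels and sync-mode values.
LEVEL_TO_SYNC_MODE = {
    Level.FALLBACK: "none",
    Level.CO_SESSION: "co-session",
    Level.SESSION_CLOCK: "session-clock",
    Level.MEDIA_CLOCK: "media-clock",
}
SYNC_MODE_TO_LEVEL = {mode: level for level, mode in LEVEL_TO_SYNC_MODE.items()}


def deliverable_level(in_jingle_session, has_session_clock, has_media_clock):
    """Return the highest level the local text path can honestly claim."""
    if not in_jingle_session:
        return Level.FALLBACK
    if has_media_clock:
        return Level.MEDIA_CLOCK
    if has_session_clock:
        return Level.SESSION_CLOCK
    return Level.CO_SESSION


def sync_mode_to_offer(wanted_mode, capability):
    """Never offer a sync-mode above the deliverable level."""
    wanted = SYNC_MODE_TO_LEVEL[wanted_mode]
    return LEVEL_TO_SYNC_MODE[min(wanted, capability)]
```

With these definitions, a data channel that is merely opened during the call has deliverable_level(True, False, False), so a request for media-clock is reduced to co-session.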

3. Glossary

RTT
Real-Time Text, transmitted while it is being typed or created.
Total Conversation
A conversation containing simultaneous audio, video and real-time text.
Jingle content
A named component inside a Jingle session, such as audio, video or text.
Conversation group
A set of Jingle contents intended to be presented as one synchronized conversational unit.

4. Use Cases

4.1 Offering Total Conversation

An initiator offers audio, video and text contents in one Jingle session. The receiver accepts all three contents and presents them as a single Total Conversation.

Example 1. Total Conversation session overview
Jingle session sid = abc123
  content audio -> RTP audio
  content video -> RTP video (for example, sign language)
  content text  -> RTP T.140 or WebRTC datachannel T.140

4.2 Adding text during a call

A participant starts an audio-video call and later adds captions, ASR or typed text by sending a Jingle content-add action for the text content.
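
The following non-normative Python sketch shows how a client might build the payload of such a content-add action with the standard library. It assumes the RTP/T.140 profile of Section 8 with the same payload type number (98) used in the examples of this document. The function name build_text_content_add and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET

NS_JINGLE = "urn:xmpp:jingle:1"
NS_RTP = "urn:xmpp:jingle:apps:rtp:1"
NS_ICE = "urn:xmpp:jingle:transports:ice-udp:1"
NS_SYNC = "urn:xmpp:jingle:apps:rtt-sync:0"


def build_text_content_add(sid, creator, role, source, sync_group):
    """Build a <jingle action='content-add'/> element that adds a text content."""
    jingle = ET.Element(
        f"{{{NS_JINGLE}}}jingle", {"action": "content-add", "sid": sid}
    )
    content = ET.SubElement(
        jingle, f"{{{NS_JINGLE}}}content", {"creator": creator, "name": "text"}
    )
    description = ET.SubElement(
        content, f"{{{NS_RTP}}}description", {"media": "text"}
    )
    # Payload type 98 mirrors the examples in this document; it is not mandated.
    ET.SubElement(
        description,
        f"{{{NS_RTP}}}payload-type",
        {"id": "98", "name": "t140", "clockrate": "1000"},
    )
    ET.SubElement(
        description,
        f"{{{NS_SYNC}}}rtt-sync",
        {
            "role": role,
            "source": source,
            "sync-group": sync_group,
            "sync-reference": "audio",
            "sync-mode": "media-clock",
        },
    )
    # Transport candidates are omitted; they depend on the ICE agent in use.
    ET.SubElement(content, f"{{{NS_ICE}}}transport")
    return jingle
```

The returned element would be sent inside an IQ stanza of type set addressed to the peer, reusing the sid of the call that is already in progress.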

4.3 Fallback to XEP-0301

If the peer does not support this specification, a client can fall back to In-Band Real Time Text (XEP-0301) [1]. The fallback MUST be visible to the user when synchronized text is required.

5. Protocol Overview

A Total Conversation call SHOULD contain three Jingle contents:

Example 2. Jingle contents for Total Conversation
<content name='audio'> ... </content>
<content name='video'> ... </content>
<content name='text'>  ... </content>

The text content is not an ordinary XMPP message stream. It is part of the Jingle session and is described by this extension.

The binding key is the Jingle sid plus the content name and the sync-group. A client MUST NOT infer synchronization only from the peer JID, because a user can have multiple simultaneous sessions, devices or fallback chat streams with the same peer.
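
A minimal, non-normative Python sketch of this binding rule follows. It keys every text stream on the triple described above and never on the peer JID alone. The class names BindingKey and TextBindings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BindingKey:
    """Jingle sid plus content name plus sync-group."""
    sid: str
    content_name: str
    sync_group: str


class TextBindings:
    """Tracks which text contents were negotiated in which Jingle session."""

    def __init__(self):
        self._peers = {}

    def bind(self, key, peer_jid):
        # The peer JID is stored for display only; it is not part of the key.
        self._peers[key] = peer_jid

    def is_bound(self, sid, content_name, sync_group):
        return BindingKey(sid, content_name, sync_group) in self._peers
```

Two simultaneous sessions with the same peer therefore yield two distinct keys, and text that arrives for a sid that was never bound is not presented as synchronized call media.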

6. Discovery

An entity supporting this specification MUST advertise the following feature:

Example 3. Primary discovery feature
<feature var='urn:xmpp:jingle:apps:rtt-sync:0'/>

If the entity supports RTP/T.140, it SHOULD advertise:

Example 4. RTP/T.140 discovery feature
<feature var='urn:xmpp:jingle:apps:rtt-sync:rtp-t140:0'/>

If the entity supports WebRTC datachannel T.140, it SHOULD advertise:

Example 5. Datachannel/T.140 discovery feature
<feature var='urn:xmpp:jingle:apps:rtt-sync:dc-t140:0'/>

If the entity supports fallback to In-Band Real Time Text (XEP-0301) [1], it SHOULD also advertise the normal XEP-0301 feature.
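
The following non-normative Python sketch shows one way a client could choose a text path from the service discovery features of both parties. It assumes the XEP-0301 feature is urn:xmpp:rtt:0, prefers RTP/T.140 because Section 8 names it the preferred profile for strict synchronization, and treats a peer that advertises no profile feature as offering no usable Jingle profile. The function name choose_text_path and its return values are hypothetical.

```python
NS_SYNC = "urn:xmpp:jingle:apps:rtt-sync:0"
NS_SYNC_RTP = "urn:xmpp:jingle:apps:rtt-sync:rtp-t140:0"
NS_SYNC_DC = "urn:xmpp:jingle:apps:rtt-sync:dc-t140:0"
NS_XEP_0301 = "urn:xmpp:rtt:0"


def choose_text_path(local_features, peer_features):
    """Return 'rtp-t140', 'dc-t140', 'xep-0301' or None."""
    common = set(local_features) & set(peer_features)
    if NS_SYNC in common:
        if NS_SYNC_RTP in common:
            return "rtp-t140"
        if NS_SYNC_DC in common:
            return "dc-t140"
    if NS_XEP_0301 in common:
        # Fallback path: must not be presented as synchronized text.
        return "xep-0301"
    return None
```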

7. Application Format

This specification defines an rtt-sync element qualified by the urn:xmpp:jingle:apps:rtt-sync:0 namespace.

Table 2: Attributes of the rtt-sync element
Attribute | Required | Values | Meaning
role | yes | conversation, caption, transcript, translation, interpreter | Purpose of the text stream
source | no | human, asr, captioner, interpreter, translation, system | Origin of the text
lang | no | BCP 47 language tag | Language of the text
sync-group | yes | token | Group shared by audio, video and text contents
sync-reference | no | content name | Content this text is synchronized with, usually audio
sync-mode | yes | media-clock, session-clock, co-session, none | Synchronization model
max-skew | no | milliseconds | Maximum target presentation difference
finality | no | partial, final, mixed | Whether text can change

Example 6. RTT synchronization element
<rtt-sync xmlns='urn:xmpp:jingle:apps:rtt-sync:0'
          role='caption'
          source='asr'
          lang='nl-NL'
          sync-group='tc1'
          sync-reference='audio'
          sync-mode='media-clock'
          max-skew='500'
          finality='partial'/>
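
A receiver can check an incoming rtt-sync element against Table 2 before acting on it. The following non-normative Python sketch does so with the standard library XML parser. The function name parse_rtt_sync is hypothetical, and the sketch checks only the attributes listed in Table 2.

```python
import xml.etree.ElementTree as ET

NS_SYNC = "urn:xmpp:jingle:apps:rtt-sync:0"

# Allowed values, copied from Table 2.
ALLOWED = {
    "role": {"conversation", "caption", "transcript", "translation", "interpreter"},
    "source": {"human", "asr", "captioner", "interpreter", "translation", "system"},
    "sync-mode": {"media-clock", "session-clock", "co-session", "none"},
    "finality": {"partial", "final", "mixed"},
}
REQUIRED = ("role", "sync-group", "sync-mode")


def parse_rtt_sync(xml_text):
    """Parse and validate an <rtt-sync/> element; raise ValueError if invalid."""
    element = ET.fromstring(xml_text)
    if element.tag != f"{{{NS_SYNC}}}rtt-sync":
        raise ValueError("not an rtt-sync element")
    attrs = dict(element.attrib)
    for name in REQUIRED:
        if name not in attrs:
            raise ValueError(f"missing required attribute: {name}")
    for name, allowed in ALLOWED.items():
        if name in attrs and attrs[name] not in allowed:
            raise ValueError(f"invalid value for {name}: {attrs[name]}")
    if "max-skew" in attrs:
        raw = attrs["max-skew"]
        # max-skew is a non-negative integer number of milliseconds.
        if not (raw.isascii() and raw.isdigit()):
            raise ValueError("max-skew must be a non-negative integer")
        attrs["max-skew"] = int(raw)
    return attrs
```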

8. RTP/T.140 Profile

The RTP/T.140 profile is the preferred profile when strict synchronization with audio and video is required. The initiator offers a Jingle RTP content with media='text' and payload types for t140 and optionally red.

Example 7. Session initiation with text media
<iq from='romeo@example.org/desktop'
    to='juliet@example.org/mobile'
    id='j1'
    type='set'>
  <jingle xmlns='urn:xmpp:jingle:1'
          action='session-initiate'
          initiator='romeo@example.org/desktop'
          sid='abc123'>
    <content creator='initiator' name='audio'>
      <description xmlns='urn:xmpp:jingle:apps:rtp:1' media='audio'>
        <payload-type id='111' name='opus' clockrate='48000' channels='2'/>
      </description>
      <transport xmlns='urn:xmpp:jingle:transports:ice-udp:1'/>
    </content>
    <content creator='initiator' name='video'>
      <description xmlns='urn:xmpp:jingle:apps:rtp:1' media='video'>
        <payload-type id='96' name='VP8' clockrate='90000'/>
      </description>
      <transport xmlns='urn:xmpp:jingle:transports:ice-udp:1'/>
    </content>
    <content creator='initiator' name='text'>
      <description xmlns='urn:xmpp:jingle:apps:rtp:1' media='text'>
        <payload-type id='98' name='t140' clockrate='1000'/>
        <payload-type id='100' name='red' clockrate='1000'>
          <parameter name='fmtp' value='98/98/98'/>
        </payload-type>
        <rtt-sync xmlns='urn:xmpp:jingle:apps:rtt-sync:0'
                  role='conversation'
                  source='human'
                  lang='nl-NL'
                  sync-group='tc1'
                  sync-reference='audio'
                  sync-mode='media-clock'
                  max-skew='500'
                  finality='mixed'/>
      </description>
      <transport xmlns='urn:xmpp:jingle:transports:ice-udp:1'/>
    </content>
  </jingle>
</iq>

When sync-mode='media-clock' is negotiated, endpoints SHOULD use the same RTCP CNAME for audio, video and text RTP streams belonging to the same endpoint. Receivers SHOULD use RTP/RTCP timing to align text with audio or video where possible. If timing information is unavailable, the receiver MAY fall back to session arrival time and SHOULD indicate reduced synchronization quality.
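
The alignment described above can be sketched as follows. An RTCP sender report pairs an RTP timestamp with a wallclock time for each stream, so a receiver can map the RTP timestamps of text and audio packets onto one timeline and compare them. This non-normative Python sketch assumes the sender report values have already been extracted from RTCP. The function names rtp_to_wallclock and within_max_skew are hypothetical.

```python
def rtp_to_wallclock(rtp_ts, sr_rtp_ts, sr_wallclock, clock_rate):
    """Map an RTP timestamp to sender wallclock seconds.

    sr_rtp_ts and sr_wallclock are the RTP timestamp and wallclock time
    (in seconds) taken from the most recent sender report of the same stream.
    """
    # RTP timestamps are 32-bit and wrap around, so use a signed difference.
    delta = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
    if delta >= 0x80000000:
        delta -= 0x100000000
    return sr_wallclock + delta / clock_rate


def within_max_skew(text_time, reference_time, max_skew_ms):
    """True if text and reference media are within the negotiated max-skew."""
    return abs(text_time - reference_time) * 1000.0 <= max_skew_ms


# Text uses a 1000 Hz clock and Opus audio a 48000 Hz clock in Example 7.
text_time = rtp_to_wallclock(5500, 5000, 100.0, 1000)
audio_time = rtp_to_wallclock(984000, 960000, 100.0, 48000)
in_sync = within_max_skew(text_time, audio_time, 500)
```

In this example both packets map to the same point on the sender timeline, so the text is within the max-skew of 500 milliseconds declared in Example 7.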

9. WebRTC Datachannel/T.140 Profile

The datachannel profile supports browser/WebRTC deployments using T.140 over a reliable, ordered data channel. This profile is useful when a WebRTC implementation naturally uses data channels for RTT. However, data channels do not automatically share the RTP media clock, so the declared sync-mode MUST reflect the timing relationship the implementation can actually deliver (see Section 2.1).

Example 8. Illustrative datachannel text content
<content creator='initiator' name='text'>
  <description xmlns='urn:xmpp:jingle:apps:rtt-sync:0'
               profile='dc-t140'>
    <datachannel subprotocol='t140'
                 reliability='reliable'
                 order='in-order'
                 label='rtt'/>
    <rtt-sync role='conversation'
              source='human'
              lang='nl-NL'
              sync-group='tc1'
              sync-reference='audio'
              sync-mode='co-session'
              max-skew='700'/>
  </description>
  <transport xmlns='urn:xmpp:jingle:transports:dtls-sctp:1'/>
</content>

The exact Jingle mapping for WebRTC data channel negotiation should be aligned with the relevant Jingle data channel signalling specification. This document does not attempt to replace that signalling.

10. Fallback to XEP-0301

If the responder does not support urn:xmpp:jingle:apps:rtt-sync:0, the initiator MAY fall back to In-Band Real Time Text (XEP-0301) [1]. Fallback MUST be explicit in the user interface when synchronization is required.

Example 9. Informing the peer about fallback
<message from='romeo@example.org/desktop'
         to='juliet@example.org/mobile'
         type='chat'>
  <rtt-fallback xmlns='urn:xmpp:jingle:apps:rtt-sync:0'
                sid='abc123'
                method='xep-0301'
                sync-mode='none'
                reason='peer-unsupported'/>
</message>

Fallback is a state transition, not just a transport choice. If a Jingle text content is rejected but audio and video are accepted, the call MAY continue without synchronized text. If fallback RTT is started for the same conversation, it SHOULD be bound to the Jingle sid and shown as fallback rather than synchronized captions.
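
The state transition can be sketched as follows. This non-normative Python example derives the text state from the responder's answer and records whether the user interface has to show an explicit fallback notice. The names TextState and text_state_after_answer, and the three mode strings, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TextState:
    mode: str                  # "synchronized", "fallback" or "unavailable"
    bound_sid: Optional[str]   # Jingle sid the text is bound to, if any
    explicit_fallback_notice: bool


def text_state_after_answer(sid, text_accepted, peer_supports_xep0301, sync_required):
    """Decide how text is carried once the responder has answered."""
    if text_accepted:
        return TextState("synchronized", sid, False)
    if peer_supports_xep0301:
        # Fallback RTT stays bound to the sid and is never shown as
        # synchronized; the notice is mandatory when sync was required.
        return TextState("fallback", sid, sync_required)
    # The call may continue with audio and video but without text.
    return TextState("unavailable", None, False)
```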

11. Business Rules

11.1 Sender rules

  1. A sender that offers synchronized RTT MUST include an rtt-sync element.
  2. A sender MUST identify whether the stream is conversation text, caption text, transcript text, interpreter text or translation text.
  3. A sender SHOULD include a language tag when known.
  4. A sender MUST NOT label ASR text as human captioning.
  5. A sender MUST route Jingle text for the negotiated content through the negotiated Jingle transport, not through an unrelated ordinary chat message path.

11.2 Receiver rules

  1. A receiver MUST treat a Jingle synchronized RTT content as part of the call, not as normal chat.
  2. A receiver SHOULD use the negotiated sync-mode to determine presentation.
  3. A receiver MUST bind incoming synchronized text to the Jingle sid and content name before presenting it as part of a call.
  4. A receiver SHOULD detect duplicate text received through both Jingle text and XEP-0301 fallback and avoid showing it twice.
  5. A receiver SHOULD expose diagnostics when RTT is present in chat but absent from the Jingle session.
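
Receiver rule 4 can be sketched as follows. This non-normative Python example suppresses text that arrives on one path when the same text for the same Jingle sid was shown recently from the other path. The class name DuplicateFilter, the path labels and the five-second window are assumptions of this sketch, not requirements of this specification.

```python
from collections import deque


class DuplicateFilter:
    """Suppress text that was already shown via the other delivery path."""

    def __init__(self, window_seconds=5.0):
        self.window = window_seconds
        self._shown = deque()  # entries: (time, sid, normalized text, path)

    def should_show(self, sid, text, path, now):
        """path is 'jingle' or 'xep-0301'; now is a timestamp in seconds."""
        # Forget entries that are older than the comparison window.
        while self._shown and now - self._shown[0][0] > self.window:
            self._shown.popleft()
        normalized = " ".join(text.split())
        for _, seen_sid, seen_text, seen_path in self._shown:
            if seen_sid == sid and seen_text == normalized and seen_path != path:
                return False
        self._shown.append((now, sid, normalized, path))
        return True
```

Text repeated on the same path is still shown, because a user can legitimately type the same line twice.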

12. User Interface Guidance

A user interface SHOULD distinguish at least these cases: live text, live captions, AI captions, human captions, translation and unsynchronized fallback.
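
One possible mapping from the negotiated role and source attributes to these cases is sketched below in Python. The mapping itself is an assumption of this non-normative sketch (for example, showing a caption stream with an unknown source as "live captions" and every other role as "live text"), and the function name ui_label is hypothetical.

```python
def ui_label(role, source=None, fallback=False):
    """Return a user-visible label for a text stream."""
    if fallback:
        # Fallback RTT is never labelled as synchronized text.
        return "unsynchronized fallback"
    if role == "translation":
        return "translation"
    if role == "caption":
        if source == "asr":
            return "AI captions"
        if source in ("human", "captioner"):
            return "human captions"
        return "live captions"
    return "live text"
```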

During call setup, a client SHOULD expose whether synchronized text was negotiated, whether live text fallback is active or whether text is unavailable in the call.

Example 10. Example user-visible states
Synchronized text: negotiated
Live text fallback: active
Text in call: unavailable

13. Accessibility Considerations

This specification is specifically motivated by accessibility and Total Conversation use cases. A deaf or hard-of-hearing user MUST be able to distinguish between typed text, human captions, AI or ASR captions and translated text where this information is known.

A client SHOULD visibly indicate late captions, uncertain ASR captions or unsynchronized fallback text. A client SHOULD allow users to prefer synchronized captions over lowest-latency captions, or lowest-latency captions over strict synchronization.

14. Internationalization Considerations

Text content MUST support Unicode. Language tags SHOULD use BCP 47. Clients SHOULD support multiple simultaneous text streams where translation or interpreter text is provided in addition to original captions.

15. Security Considerations

Synchronized RTT and captions can contain highly sensitive conversation content. Implementations SHOULD use end-to-end encrypted signalling and encrypted media where available.

For RTP/T.140, implementations SHOULD use SRTP or an equivalent encrypted RTP transport, authenticate the sender of the text stream and protect against injection of false captions. Implementations SHOULD prevent downgrade attacks from synchronized RTT to unsynchronized fallback without user indication.

Clients SHOULD avoid misrepresenting AI captions as human or verified text.

16. Privacy Considerations

Real-time text can reveal text before the sender considers it final. Captions can reveal speech content to captioning, relay or ASR services. A client SHOULD obtain user consent before sending typed RTT and before sending audio to ASR or captioning services.

A client SHOULD NOT store partial captions or partial RTT as a final transcript unless the user has enabled this. A client SHOULD indicate when a third-party captioning, ASR, relay or interpreting service is active.

17. IANA Considerations

This document makes no direct IANA request unless future revisions define new SDP attributes or new media types. The RTP/T.140 profile uses existing text/t140 and text/red media formats.

18. XMPP Registrar Considerations

This specification requests registration of the following namespace:

urn:xmpp:jingle:apps:rtt-sync:0

The following service discovery features are requested:

urn:xmpp:jingle:apps:rtt-sync:0
urn:xmpp:jingle:apps:rtt-sync:rtp-t140:0
urn:xmpp:jingle:apps:rtt-sync:dc-t140:0

19. Design Considerations

This document does not replace In-Band Real Time Text (XEP-0301) [1]. XEP-0301 remains appropriate for chat-oriented real-time text and as a fallback. The distinction is that this specification binds text to a Jingle session when an implementation needs Total Conversation semantics.

RTP/T.140 is the preferred strict synchronization profile. WebRTC datachannel T.140 is useful for browser deployments, but MUST NOT be described as media-clock synchronized unless the implementation can provide the required timing relationship.

20. Implementation Experience

An experimental browser implementation has tested the WebRTC datachannel profile at Level 1. Two browser sessions negotiated one Jingle audio-video session plus a text content using urn:xmpp:jingle:apps:rtt-sync:0, opened a reliable ordered data channel labelled rtt, exchanged live RTT updates, and delivered final text bound to the Jingle session. The client presented the call as live text synchronized with the call session.

The same implementation retained In-Band Real Time Text (XEP-0301) [1] fallback for peers that do not negotiate the Jingle text content, so ordinary live text remains available without being presented as synchronized call media.

21. XML Schema

The following schema is an initial sketch.

<xs:schema
    xmlns:xs='http://www.w3.org/2001/XMLSchema'
    targetNamespace='urn:xmpp:jingle:apps:rtt-sync:0'
    xmlns='urn:xmpp:jingle:apps:rtt-sync:0'
    elementFormDefault='qualified'>

  <xs:element name='rtt-sync'>
    <xs:complexType>
      <xs:attribute name='role' use='required'>
        <xs:simpleType>
          <xs:restriction base='xs:NCName'>
            <xs:enumeration value='conversation'/>
            <xs:enumeration value='caption'/>
            <xs:enumeration value='transcript'/>
            <xs:enumeration value='translation'/>
            <xs:enumeration value='interpreter'/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name='source' use='optional'>
        <xs:simpleType>
          <xs:restriction base='xs:NCName'>
            <xs:enumeration value='human'/>
            <xs:enumeration value='asr'/>
            <xs:enumeration value='captioner'/>
            <xs:enumeration value='interpreter'/>
            <xs:enumeration value='translation'/>
            <xs:enumeration value='system'/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name='lang' type='xs:language' use='optional'/>
      <xs:attribute name='sync-group' type='xs:NCName' use='required'/>
      <xs:attribute name='sync-reference' type='xs:NCName' use='optional'/>
      <xs:attribute name='sync-mode' use='required'>
        <xs:simpleType>
          <xs:restriction base='xs:NCName'>
            <xs:enumeration value='media-clock'/>
            <xs:enumeration value='session-clock'/>
            <xs:enumeration value='co-session'/>
            <xs:enumeration value='none'/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name='max-skew' type='xs:nonNegativeInteger' use='optional'/>
      <xs:attribute name='finality' use='optional'>
        <xs:simpleType>
          <xs:restriction base='xs:NCName'>
            <xs:enumeration value='partial'/>
            <xs:enumeration value='final'/>
            <xs:enumeration value='mixed'/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>

22. Open Issues

  1. Should this be a new Jingle application format or an extension to Jingle RTP Sessions (XEP-0167) [2]?
  2. Should RTP/T.140 be mandatory-to-implement for strict synchronization?
  3. Which existing Jingle datachannel signalling elements should be used for the WebRTC datachannel profile?
  4. Should emergency-service profiles have stricter requirements?
  5. Should multiparty RTT support be included here or deferred to a separate specification?

Appendices

Appendix A: Document Information

Series
XEP
Number
xxxx
Publisher
XMPP Standards Foundation
Status
ProtoXEP
Type
Standards Track
Version
0.0.2
Last Updated
2026-05-30
Approving Body
XMPP Council
Dependencies
XEP-0166, XEP-0167, XEP-0176, XEP-0301, RFC 4103, RFC 8865
Supersedes
None
Superseded By
None
Short Name
jingle-rtt-sync

Appendix B: Author Information

Edward Tie
Email
info@tiedragon.com

Appendix C: Legal Notices

Copyright

This XMPP Extension Protocol is copyright © 1999 – 2024 by the XMPP Standards Foundation (XSF).

Permissions

Permission is hereby granted, free of charge, to any person obtaining a copy of this specification (the "Specification"), to make use of the Specification without restriction, including without limitation the rights to implement the Specification in a software program, deploy the Specification in a network service, and copy, modify, merge, publish, translate, distribute, sublicense, or sell copies of the Specification, and to permit persons to whom the Specification is furnished to do so, subject to the condition that the foregoing copyright notice and this permission notice shall be included in all copies or substantial portions of the Specification. Unless separate permission is granted, modified works that are redistributed shall not contain misleading information regarding the authors, title, number, or publisher of the Specification, and shall not claim endorsement of the modified works by the authors, any organization or project to which the authors belong, or the XMPP Standards Foundation.

Disclaimer of Warranty

## NOTE WELL: This Specification is provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. ##

Limitation of Liability

In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall the XMPP Standards Foundation or any author of this Specification be liable for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising from, out of, or in connection with the Specification or the implementation, deployment, or other use of the Specification (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if the XMPP Standards Foundation or such author has been advised of the possibility of such damages.

IPR Conformance

This XMPP Extension Protocol has been contributed in full conformance with the XSF's Intellectual Property Rights Policy (a copy of which can be found at <https://xmpp.org/about/xsf/ipr-policy> or obtained by writing to XMPP Standards Foundation, P.O. Box 787, Parker, CO 80134 USA).

Visual Presentation

The HTML representation (you are looking at) is maintained by the XSF. It is based on the YAML CSS Framework, which is licensed under the terms of the CC-BY-SA 2.0 license.

Appendix D: Relation to XMPP

The Extensible Messaging and Presence Protocol (XMPP) is defined in the XMPP Core (RFC 6120) and XMPP IM (RFC 6121) specifications contributed by the XMPP Standards Foundation to the Internet Standards Process, which is managed by the Internet Engineering Task Force in accordance with RFC 2026. Any protocol defined in this document has been developed outside the Internet Standards Process and is to be understood as an extension to XMPP rather than as an evolution, development, or modification of XMPP itself.

Appendix E: Discussion Venue

The primary venue for discussion of XMPP Extension Protocols is the <standards@xmpp.org> discussion list.

Discussion on other xmpp.org discussion lists might also be appropriate; see <https://xmpp.org/community/> for a complete list.

Given that this XMPP Extension Protocol normatively references IETF technologies, discussion on the <xsf-ietf@xmpp.org> list might also be appropriate.

Errata can be sent to <editor@xmpp.org>.

Appendix F: Requirements Conformance

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

Appendix G: Notes

1. XEP-0301: In-Band Real Time Text <https://xmpp.org/extensions/xep-0301.html>.

2. XEP-0167: Jingle RTP Sessions <https://xmpp.org/extensions/xep-0167.html>.

3. XEP-0176: Jingle ICE-UDP Transport Method <https://xmpp.org/extensions/xep-0176.html>.

Appendix H: Revision History

Note: Older versions of this specification might be available at https://xmpp.org/extensions/attic/

  1. Version 0.0.2 (2026-05-30)

    Document initial browser implementation test results.

    et
  2. Version 0.0.1 (2026-05-30)

    Initial ProtoXEP submission.

    et

Appendix I: Bib(La)TeX Entry

@report{tie2026jingle-rtt-sync,
  title = {Jingle Synchronized Real-Time Text},
  author = {Tie, Edward},
  type = {XEP},
  number = {xxxx},
  version = {0.0.2},
  institution = {XMPP Standards Foundation},
  url = {https://xmpp.org/extensions/xep-xxxx.html},
  date = {2026-05-30/2026-05-30},
}
