Jingle Synchronized Real-Time Text

This specification defines a Jingle application extension for negotiating real-time text as part of the same conversational session as audio and video.

This XMPP Extension Protocol is copyright © 1999 – 2024 by the XMPP Standards Foundation (XSF).

Permission is hereby granted, free of charge, to any person obtaining a copy of this specification (the "Specification"), to make use of the Specification without restriction, including without limitation the rights to implement the Specification in a software program, deploy the Specification in a network service, and copy, modify, merge, publish, translate, distribute, sublicense, or sell copies of the Specification, and to permit persons to whom the Specification is furnished to do so, subject to the condition that the foregoing copyright notice and this permission notice shall be included in all copies or substantial portions of the Specification. Unless separate permission is granted, modified works that are redistributed shall not contain misleading information regarding the authors, title, number, or publisher of the Specification, and shall not claim endorsement of the modified works by the authors, any organization or project to which the authors belong, or the XMPP Standards Foundation.

NOTE WELL: This Specification is provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE.

In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall the XMPP Standards Foundation or any author of this Specification be liable for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising from, out of, or in connection with the Specification or the implementation, deployment, or other use of the Specification (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if the XMPP Standards Foundation or such author has been advised of the possibility of such damages.

This XMPP Extension Protocol has been contributed in full conformance with the XSF's Intellectual Property Rights Policy (a copy of which can be found at <https://xmpp.org/about/xsf/ipr-policy> or obtained by writing to XMPP Standards Foundation, P.O. Box 787, Parker, CO 80134 USA).

Number: xxxx
Status: ProtoXEP
Type: Standards Track
SIG: Standards
Approving body: Council
Dependencies: XEP-0166, XEP-0167, XEP-0176, XEP-0301, RFC 4103, RFC 8865
Short name: jingle-rtt-sync
Tags: jingle, rtt, accessibility, webrtc
Author: Edward Tie <info@tiedragon.com>

Version 0.0.2 (2026-05-30, et): Document initial browser implementation test results.

Version 0.0.1 (2026-05-30, et): Initial ProtoXEP submission.

Real-time text is already defined for XMPP by In-Band Real Time Text (XEP-0301) <https://xmpp.org/extensions/xep-0301.html>. Jingle is already used to negotiate real-time audio and video sessions, most commonly using Jingle RTP Sessions (XEP-0167) <https://xmpp.org/extensions/xep-0167.html> and Jingle ICE-UDP Transport Method (XEP-0176) <https://xmpp.org/extensions/xep-0176.html>. However, when a client establishes a Jingle audio-video call and sends real-time text as ordinary XMPP messages outside the Jingle session, the user experience can look like one conversation while the protocol state is split into two unrelated paths.

This specification defines a way to negotiate real-time text as a Jingle content in the same session as audio and video. The text content can be human-typed RTT, captions, ASR output, interpreter text, translation text or transcript text. The goal is Total Conversation: audio, video and text presented as one conversational unit.

The motivating implementation problem is simple: a call can exist, text can exist, and yet the text might not be part of the negotiated Jingle session. In that case the receiver cannot reliably treat the text as synchronized conversational media.

This specification is designed to meet the following requirements.

Enable a Jingle initiator to offer real-time text in the same session as audio and video.
Enable a responder to accept or reject real-time text independently from audio and video.
Define a first-class Jingle content for text, for example with content name text or rtt.
Allow endpoints to identify the text purpose, source and language.
Allow endpoints to indicate whether the text is synchronized to a media clock, a session clock, the call session only, or not synchronized.
Allow fallback to In-Band Real Time Text (XEP-0301) when synchronized Jingle text is not supported.
Prevent clients from silently presenting fallback RTT as synchronized text.

Implementations can support different levels without falsely claiming full synchronization.

Level	Name	Minimum capability	User-visible promise
0	XEP-0301 fallback	Ordinary in-band RTT outside Jingle	Live text, not media synchronized
1	Jingle co-session text	Text is negotiated by the same Jingle session but does not share a media clock	Belongs to the call, limited synchronization
2	Session-clock text	Text has timestamps relative to a shared call or session clock	Call-synchronized text
3	Media-clock text	RTP/T.140 or equivalent media-clock timing with audio/video correlation	Strict synchronized Total Conversation

An implementation MUST NOT advertise a higher level than it can actually deliver. In particular, a WebRTC data channel that is merely opened during a call is Level 1 unless it can demonstrate a shared session clock or media clock.
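
The rule above can be sketched as a small capability check. This is only an illustration: the `TextCapabilities` class, its field names and the two helper functions are hypothetical and not defined by this specification; the level numbers are the ones in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextCapabilities:
    """Hypothetical capability flags; the names are illustrative only."""
    negotiated_in_jingle: bool   # text is a content of the same Jingle session
    shared_session_clock: bool   # timestamps relative to a shared call/session clock
    media_clock_alignment: bool  # RTP/T.140 or equivalent media-clock timing


def max_level(caps: TextCapabilities) -> int:
    """Return the highest level (0-3) that the endpoint can actually deliver."""
    if not caps.negotiated_in_jingle:
        return 0  # ordinary XEP-0301 RTT outside Jingle
    if caps.media_clock_alignment:
        return 3
    if caps.shared_session_clock:
        return 2
    return 1  # co-session text, e.g. a data channel merely opened during a call


def may_advertise(claimed: int, caps: TextCapabilities) -> bool:
    """True if the claimed level does not exceed what can be delivered."""
    return 0 <= claimed <= max_level(caps)
```

With these assumptions, a data channel that shares no clock with the call is capped at Level 1, so `may_advertise(3, ...)` is false for it.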

RTT: Real-Time Text, transmitted while it is being typed or created.
Total Conversation: A conversation containing simultaneous audio, video and real-time text.
Jingle content: A named component inside a Jingle session, such as audio, video or text.
Conversation group: A set of Jingle contents intended to be presented as one synchronized conversational unit.

An initiator offers audio, video and text contents in one Jingle session. The receiver accepts all three contents and presents them as a single Total Conversation.

content audio -> RTP audio
content video -> RTP video or signing
content text -> RTP T.140 or WebRTC datachannel T.140

A participant starts an audio-video call and later adds captions, ASR or typed text by sending a Jingle content-add action for the text content.

If the peer does not support this specification, a client can fall back to In-Band Real Time Text (XEP-0301). The fallback MUST be visible to the user when synchronized text is required.

A Total Conversation call SHOULD contain three Jingle contents: audio, video and text.

The text content is not an ordinary XMPP message stream. It is part of the Jingle session and is described by this extension.
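
As a rough sketch of that shape, the following builds the skeleton of a Jingle session-initiate with three contents. The `urn:xmpp:jingle:1` namespace and the `action`, `initiator`, `sid`, `creator` and `name` attributes come from XEP-0166; the function name, the content names and the sample values are assumptions, and the description and transport children are deliberately left out.

```python
import xml.etree.ElementTree as ET

JINGLE_NS = "urn:xmpp:jingle:1"  # Jingle namespace from XEP-0166


def build_offer_skeleton(sid: str, initiator: str) -> ET.Element:
    """Build a <jingle/> element with audio, video and text contents.

    Description and transport children are omitted; only the grouping of
    the three contents inside one session is shown.
    """
    jingle = ET.Element(
        f"{{{JINGLE_NS}}}jingle",
        {"action": "session-initiate", "initiator": initiator, "sid": sid},
    )
    for name in ("audio", "video", "text"):  # content names are illustrative
        ET.SubElement(
            jingle,
            f"{{{JINGLE_NS}}}content",
            {"creator": "initiator", "name": name},
        )
    return jingle


offer = build_offer_skeleton("call-1", "alice@example.org/desktop")
content_names = [child.get("name") for child in offer]
```

The point of the sketch is that the text content shares the `sid` of the audio and video contents instead of travelling as separate message stanzas.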

The binding key is the Jingle sid plus the content name and the sync-group. A client MUST NOT infer synchronization only from the peer JID, because a user can have multiple simultaneous sessions, devices or fallback chat streams with the same peer.
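
A minimal sketch of such a binding, assuming a receiver keeps an in-memory table of negotiated text streams. `BindingKey` and `TextRouter` are hypothetical names used only for illustration.

```python
from typing import Dict, List, NamedTuple


class BindingKey(NamedTuple):
    """The three values that identify a synchronized text stream."""
    sid: str           # Jingle session id
    content_name: str  # Jingle content name, e.g. "text"
    sync_group: str    # value of the sync-group attribute


class TextRouter:
    """Routes incoming text only to streams that were actually negotiated."""

    def __init__(self) -> None:
        self._streams: Dict[BindingKey, List[str]] = {}

    def register(self, key: BindingKey) -> None:
        self._streams.setdefault(key, [])

    def deliver(self, key: BindingKey, text: str) -> bool:
        """Return False when no negotiated stream matches the key."""
        stream = self._streams.get(key)
        if stream is None:
            return False
        stream.append(text)
        return True

    def texts(self, key: BindingKey) -> List[str]:
        return list(self._streams.get(key, []))
```

Because the peer JID is not part of the key, two simultaneous sessions with the same peer stay separate, and text for an unknown session is refused rather than guessed into a call.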

An entity supporting this specification MUST advertise the following feature:

<feature var='urn:xmpp:jingle:apps:rtt-sync:0'/>

If the entity supports RTP/T.140, it SHOULD advertise:

<feature var='urn:xmpp:jingle:apps:rtt-sync:rtp-t140:0'/>

If the entity supports WebRTC datachannel T.140, it SHOULD advertise:

<feature var='urn:xmpp:jingle:apps:rtt-sync:dc-t140:0'/>

If the entity supports fallback to In-Band Real Time Text (XEP-0301), it SHOULD also advertise the normal XEP-0301 feature.
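
An initiator could use the discovered features to pick a text path, roughly as sketched below. The three rtt-sync feature names are the ones this document requests; `urn:xmpp:rtt:0` is the XEP-0301 namespace. The function name and the returned labels are hypothetical, and preferring RTP/T.140 over the data channel is an assumption based on this document calling RTP/T.140 the preferred profile for strict synchronization.

```python
from typing import Set

RTT_SYNC = "urn:xmpp:jingle:apps:rtt-sync:0"
RTP_T140 = "urn:xmpp:jingle:apps:rtt-sync:rtp-t140:0"
DC_T140 = "urn:xmpp:jingle:apps:rtt-sync:dc-t140:0"
XEP_0301 = "urn:xmpp:rtt:0"  # In-Band Real Time Text


def choose_text_path(peer_features: Set[str]) -> str:
    """Pick a text path from the peer's service discovery features."""
    if RTT_SYNC in peer_features:
        if RTP_T140 in peer_features:
            return "rtp-t140"
        if DC_T140 in peer_features:
            return "dc-t140"
    if XEP_0301 in peer_features:
        return "xep-0301-fallback"  # must not be presented as synchronized
    return "none"
```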

This specification defines an rtt-sync element qualified by the urn:xmpp:jingle:apps:rtt-sync:0 namespace.

Attribute	Required	Values	Meaning
role	yes	conversation, caption, transcript, translation, interpreter	Purpose of the text stream
source	no	human, asr, captioner, interpreter, translation, system	Origin of the text
lang	no	BCP 47 language tag	Language of the text
sync-group	yes	token	Group shared by audio, video and text contents
sync-reference	no	content name	Content this text is synchronized with, usually audio
sync-mode	yes	media-clock, session-clock, co-session, none	Synchronization model
max-skew	no	milliseconds	Maximum target presentation difference
finality	no	partial, final, mixed	Whether text can change
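
The attribute table can be turned into a small builder that rejects missing or unknown values. This is a sketch under assumptions: the function name is hypothetical, unknown attribute names are rejected by choice, and `max-skew` is assumed to be a non-negative integer number of milliseconds.

```python
import xml.etree.ElementTree as ET
from typing import Dict

RTT_SYNC_NS = "urn:xmpp:jingle:apps:rtt-sync:0"

REQUIRED = ("role", "sync-group", "sync-mode")
KNOWN = {"role", "source", "lang", "sync-group", "sync-reference",
         "sync-mode", "max-skew", "finality"}
ALLOWED_VALUES = {
    "role": {"conversation", "caption", "transcript", "translation", "interpreter"},
    "source": {"human", "asr", "captioner", "interpreter", "translation", "system"},
    "sync-mode": {"media-clock", "session-clock", "co-session", "none"},
    "finality": {"partial", "final", "mixed"},
}


def build_rtt_sync(attrs: Dict[str, str]) -> ET.Element:
    """Validate attributes against the table and return an rtt-sync element."""
    for name in REQUIRED:
        if not attrs.get(name):
            raise ValueError(f"missing required attribute: {name}")
    for name, value in attrs.items():
        if name not in KNOWN:
            raise ValueError(f"unknown attribute: {name}")
        if name in ALLOWED_VALUES and value not in ALLOWED_VALUES[name]:
            raise ValueError(f"invalid value for {name}: {value}")
    if "max-skew" in attrs and not attrs["max-skew"].isdigit():
        raise ValueError("max-skew must be a whole number of milliseconds")
    return ET.Element(f"{{{RTT_SYNC_NS}}}rtt-sync", dict(attrs))


element = build_rtt_sync({
    "role": "caption",
    "source": "asr",
    "lang": "en",
    "sync-group": "tc1",
    "sync-reference": "audio",
    "sync-mode": "co-session",
    "max-skew": "500",
    "finality": "mixed",
})
```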


The RTP/T.140 profile is the preferred profile when strict synchronization with audio and video is required. The initiator offers a Jingle RTP content with media='text' and payload types for t140 and optionally red.
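
A sketch of such a description element follows. The `urn:xmpp:jingle:apps:rtp:1` namespace and the `payload-type` attributes come from XEP-0167, and the 1000 Hz clock rate for t140 and red comes from RFC 4103. The payload type ids 98 and 100 are arbitrary dynamic values chosen for illustration, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

RTP_NS = "urn:xmpp:jingle:apps:rtp:1"  # Jingle RTP Sessions (XEP-0167)


def build_text_description(include_red: bool = True) -> ET.Element:
    """Build an RTP description with media='text' and T.140 payload types."""
    description = ET.Element(f"{{{RTP_NS}}}description", {"media": "text"})
    # Payload type ids are dynamic; 98 and 100 are illustrative values only.
    payloads = [("98", "t140")]
    if include_red:
        payloads.append(("100", "red"))  # redundancy, optional
    for pt_id, name in payloads:
        ET.SubElement(
            description,
            f"{{{RTP_NS}}}payload-type",
            {"id": pt_id, "name": name, "clockrate": "1000"},
        )
    return description


text_description = build_text_description()
payload_names = [pt.get("name") for pt in text_description]
```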


When sync-mode='media-clock' is negotiated, endpoints SHOULD use the same RTCP CNAME for audio, video and text RTP streams belonging to the same endpoint. Receivers SHOULD use RTP/RTCP timing to align text with audio or video where possible. If timing information is unavailable, the receiver MAY fall back to session arrival time and SHOULD indicate reduced synchronization quality.
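
One common way to perform this alignment is to map each stream's RTP timestamps to the sender's wallclock using the most recent RTCP sender report for that stream. The sketch below assumes the receiver already has, per stream, the RTP timestamp and the wallclock time (in seconds) from the last sender report; all function and parameter names are hypothetical.

```python
def rtp_to_sender_time(rtp_ts: int, sr_rtp_ts: int,
                       sr_wallclock_s: float, clock_rate: int) -> float:
    """Map an RTP timestamp to sender wallclock seconds via a sender report."""
    # Signed 32-bit difference so that timestamp wraparound is handled.
    diff = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 0x100000000
    return sr_wallclock_s + diff / clock_rate


def skew_ms(text_time_s: float, reference_time_s: float) -> float:
    """Positive when the text is later than the reference media."""
    return (text_time_s - reference_time_s) * 1000.0


def within_max_skew(skew: float, max_skew_ms: int) -> bool:
    """Compare a measured skew with the negotiated max-skew value."""
    return abs(skew) <= max_skew_ms


# Text at 1000 Hz and audio at 48000 Hz, both half a second after their
# sender reports, which carry the same wallclock time.
text_time = rtp_to_sender_time(5500, 5000, 100.0, 1000)
audio_time = rtp_to_sender_time(120000, 96000, 100.0, 48000)
```

Sharing one RTCP CNAME across the audio, video and text streams is what lets a receiver treat these wallclock values as coming from the same sender clock.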

The datachannel profile supports browser/WebRTC deployments using T.140 over a reliable, ordered data channel. This profile is useful when a WebRTC implementation naturally uses data channels for RTT. However, data channels do not automatically share the RTP media clock, so the synchronization mode MUST be declared carefully.

Use sync-mode='co-session' when the text is part of the same call but not strictly media-clock synchronized.
Use sync-mode='session-clock' when the implementation provides a common session clock.
Use sync-mode='media-clock' only if the implementation can provide reliable media-clock alignment.


The exact Jingle mapping for WebRTC data channel negotiation should be aligned with the relevant Jingle data channel signalling specification. This document does not attempt to replace that signalling.

If the responder does not support urn:xmpp:jingle:apps:rtt-sync:0, the initiator MAY fall back to In-Band Real Time Text (XEP-0301). Fallback MUST be explicit in the user interface when synchronization is required.


Fallback is a state transition, not just a transport choice. If a Jingle text content is rejected but audio and video are accepted, the call MAY continue without synchronized text. If fallback RTT is started for the same conversation, it SHOULD be bound to the Jingle sid and shown as fallback rather than synchronized captions.
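
Treating fallback as a state can be sketched with a small enumeration. The state names and both functions are hypothetical; the sketch only encodes the outcomes described above: synchronized text negotiated, fallback RTT active, or no text in the call.

```python
from enum import Enum


class TextState(Enum):
    SYNCHRONIZED = "synchronized text negotiated"
    FALLBACK = "live text fallback active"
    UNAVAILABLE = "text unavailable in the call"


def resolve_text_state(text_content_accepted: bool,
                       fallback_rtt_active: bool) -> TextState:
    """Derive the text state of a call from the negotiation outcome."""
    if text_content_accepted:
        return TextState.SYNCHRONIZED
    if fallback_rtt_active:
        return TextState.FALLBACK
    return TextState.UNAVAILABLE


def must_show_fallback_notice(state: TextState, sync_required: bool) -> bool:
    """Fallback has to be explicit in the UI when synchronization is required."""
    return sync_required and state is TextState.FALLBACK
```

Keeping the state separate from the transport means a rejected text content with accepted audio and video still yields a well-defined call, with the text state reported as fallback or unavailable.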

A sender that offers synchronized RTT MUST include an rtt-sync element.
A sender MUST identify whether the stream is conversation text, caption text, transcript text, interpreter text or translation text.
A sender SHOULD include a language tag when known.
A sender MUST NOT label ASR text as human captioning.
A sender MUST route Jingle text for the negotiated content through the negotiated Jingle transport, not through an unrelated ordinary chat message path.

A receiver MUST treat a Jingle synchronized RTT content as part of the call, not as normal chat.
A receiver SHOULD use the negotiated sync-mode to determine presentation.
A receiver MUST bind incoming synchronized text to the Jingle sid and content name before presenting it as part of a call.
A receiver SHOULD detect duplicate text received through both Jingle text and XEP-0301 fallback and avoid showing it twice.
A receiver SHOULD expose diagnostics when RTT is present in chat but absent from the Jingle session.
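
This document does not say how duplicates are detected. One possible heuristic, sketched below, suppresses text that arrives on the other path for the same Jingle sid within a short window. The class name, the path labels, the whitespace normalization and the five-second window are all assumptions, and the sketch presumes fallback RTT has already been bound to the Jingle sid.

```python
from typing import Dict, Tuple


class DuplicateFilter:
    """Suppress text seen on both the Jingle path and the XEP-0301 path."""

    def __init__(self, window_s: float = 5.0) -> None:
        self._window_s = window_s
        # (sid, normalized text) -> (path, arrival time in seconds)
        self._last: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def should_display(self, sid: str, text: str, path: str, now: float) -> bool:
        key = (sid, " ".join(text.split()))  # collapse whitespace differences
        previous = self._last.get(key)
        if previous is not None:
            prev_path, prev_time = previous
            # Same text from the other path shortly after: treat as duplicate.
            if prev_path != path and now - prev_time <= self._window_s:
                return False
        self._last[key] = (path, now)
        return True
```

Text repeated on the same path is still shown, so a user who deliberately types the same line twice is not silenced.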

A user interface SHOULD distinguish at least these cases: live text, live captions, AI captions, human captions, translation and unsynchronized fallback.
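
A possible mapping from the negotiated attributes to these labels is sketched below. The function name, the precedence of the checks and the treatment of an unknown caption source as plain "live captions" are assumptions, not requirements of this document.

```python
from typing import Optional


def ui_label(role: str, source: Optional[str], sync_mode: str,
             fallback: bool = False) -> str:
    """Choose a user-visible label from role, source and sync-mode."""
    if fallback or sync_mode == "none":
        return "unsynchronized fallback"
    if role == "translation" or source == "translation":
        return "translation"
    if role == "caption":
        if source == "asr":
            return "AI captions"  # never shown as human captions
        if source in ("human", "captioner"):
            return "human captions"
        return "live captions"  # source not declared
    return "live text"
```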

During call setup, a client SHOULD expose whether synchronized text was negotiated, whether live text fallback is active or whether text is unavailable in the call.

This specification is specifically motivated by accessibility and Total Conversation use cases. A deaf or hard-of-hearing user MUST be able to distinguish between typed text, human captions, AI or ASR captions and translated text where this information is known.

A client SHOULD visibly indicate late captions, uncertain ASR captions or unsynchronized fallback text. A client SHOULD allow users to prefer synchronized captions over lowest-latency captions, or lowest-latency captions over strict synchronization.

Text content MUST support Unicode. Language tags SHOULD use BCP 47. Clients SHOULD support multiple simultaneous text streams where translation or interpreter text is provided in addition to original captions.

Synchronized RTT and captions can contain highly sensitive conversation content. Implementations SHOULD use end-to-end encrypted signalling and encrypted media where available.

For RTP/T.140, implementations SHOULD use SRTP or an equivalent encrypted RTP transport, authenticate the sender of the text stream and protect against injection of false captions. Implementations SHOULD prevent downgrade attacks from synchronized RTT to unsynchronized fallback without user indication.

Clients SHOULD avoid misrepresenting AI captions as human or verified text.

Real-time text can reveal text before the sender considers it final. Captions can reveal speech content to captioning, relay or ASR services. A client SHOULD obtain user consent before sending typed RTT and before sending audio to ASR or captioning services.

A client SHOULD NOT store partial captions or partial RTT as a final transcript unless the user has enabled this. A client SHOULD indicate when a third-party captioning, ASR, relay or interpreting service is active.

This document makes no direct IANA request unless future revisions define new SDP attributes or new media types. The RTP/T.140 profile uses existing text/t140 and text/red media formats.

This specification requests registration of the following namespace:

urn:xmpp:jingle:apps:rtt-sync:0

The following service discovery features are requested:

urn:xmpp:jingle:apps:rtt-sync:0
urn:xmpp:jingle:apps:rtt-sync:rtp-t140:0
urn:xmpp:jingle:apps:rtt-sync:dc-t140:0

This document does not replace In-Band Real Time Text (XEP-0301). XEP-0301 remains appropriate for chat-oriented real-time text and as a fallback. The distinction is that this specification binds text to a Jingle session when an implementation needs Total Conversation semantics.

RTP/T.140 is the preferred strict synchronization profile. WebRTC datachannel T.140 is useful for browser deployments, but MUST NOT be described as media-clock synchronized unless the implementation can provide the required timing relationship.

An experimental browser implementation has tested the WebRTC datachannel profile at Level 1. Two browser sessions negotiated one Jingle audio-video session plus a text content using urn:xmpp:jingle:apps:rtt-sync:0, opened a reliable ordered data channel labelled rtt, exchanged live RTT updates, and delivered final text bound to the Jingle session. The client presented the call as live text synchronized with the call session.

The same implementation retained In-Band Real Time Text (XEP-0301) fallback for peers that do not negotiate the Jingle text content, so ordinary live text remains available without being presented as synchronized call media.

The following schema is an initial sketch.


Should this be a new Jingle application format or an extension to Jingle RTP Sessions (XEP-0167)?
Should RTP/T.140 be mandatory-to-implement for strict synchronization?
Which existing Jingle datachannel signalling elements should be used for the WebRTC datachannel profile?
Should emergency-service profiles have stricter requirements?
Should multiparty RTT support be included here or deferred to a separate specification?