Delivering Conference Information to Jingle Participants (Coin) (XEP-0298)  defines a way for XMPP agents to establish and participate in tightly coupled conference calls. Such conference calls would typically involve a number of regular participants that establish direct one-to-one sessions with a single entity, often referred to as a focus agent. Focus agents are generally responsible for making sure that media sent from one call participant would be distributed to all others so that everyone would effectively hear or see everyone else. In other words they often act as media mixers.
Depending on the mixing technology used by media mixers, they may require significant bandwidth, processing resources or both. It is hence common for mixers to be hosted on dedicated servers that can provide such resources. They are then made reachable as rendez-vous points and conference call participants are required to call in, in order to join a conference call. This requires a certain amount of pre-call configuration to be completed by the service maintainers in order to create conference rooms and grant proper access to the expected participants. The authorization credentials are then often relayed to the participants in preparation of the call by other means, such as IM or mail.
In certain situations, such pre-call preparations are inconvenient and it is important for users to be able to establish ad-hoc conference calls. One way to achieve this is for user agents themselves to act as focus agents and media mixers. Everyone else just calls the user at the focus agent, who then can decide whether to accept or reject the calls as they arrive. This works particularly well for audio only calls as the amount of bandwidth and processing resources that they require is generally within reach for end-user devices.
The situation is quite different for video calls. Media decoding and especially encoding require considerably more resources with video than they do with audio. Today, encoding a single video flow with an acceptable quality is often the maximum that can be expected from an end-user device. The advantages that come with Moore's law will likely be insufficient to improve this, given the massive shift toward mobile devices and the ever-increasing user expectations toward video quality.
Therefore, this specification (COLIBRI) aims to provide a means for user agents to interact with conference mixers. Such interaction allows user agents to allocate mixing channels, indicate what conferences they should be attached to, what integers the various payload types map to, etc. Using COLIBRI would hence allow any user agent to organize conference calls and act as a signalling focus by outsourcing the actual media mixing to a dedicated mixer.
The extension defined herein is designed to meet the following requirements:
This section provides a friendly introduction to COLIBRI.
In essence, the goal of COLIBRI is to provide focus agents with a way of using remote mixers as if as they were available locally. The most important part of that is the possibility to allocate ports on the mixer interfaces and then use these ports when establishing Jingle sessions with the various participants.
Every participant in the conference call is assigned one port for RTP data and one for RTCP. An RTP/RTCP port couple is called a channel. Each participant would use one channel per media type. That is, a client participating with audio only would get one channel, while another one that joins with both audio and video would get two: one for audio and one for video.
Channels are used for streams from the bridge to participants and from participants to the bridge. Typically a channel would contain one stream from a participant to a bridge, for example their webcam or desktop, and one or more streams in the opposite direction (e.g. webcam or desktop streams from other participants to the one using this channel). This is not a requirement though and a channel can certainly be used for transportation of multiple streams in both directions in cases where one bridge is connected to another.
Typically channels would be created by the entity controlling a conference call. This could either be a conferencing server or a smart client capable of handling conferences. We would refer to both of these as the "focus". In either case, the important part is that the focus terminates all signalling. It is a signalling endpoint and it is responsible for all aspects of call signalling including offer/answer.
In other words, when setting up a conference, a focus would first allocate the necessary channels, then directly initiate sessions (invite) other participants into the call. Only, when sending the invitations to these participants, the focus would use the transport information (addresses and ports) that it would have received from the COLIBRI bridge, rather than its own.
The most important thing about setting up a conference is the creation of channels for every participant. Conference setup is not the only chance an organiser gets to declare all participants but typically when a conference call is setup it is because there are at least some number of known participants and there would be no point in delaying channel creation for them.
The following example shows how Romeo creates an audio/video conference at a bridge, requesting that three participant channels be created.
Notice how the 'initiator' channel above is set to true. The setting determines ICE and DTLS/SRTP behaviour for the bridge. In this specific case, 'initiator' being set to 'true' Romeo is requesting that the bridge behave as the initiator of the session which means that it would try to be the controlling ICE agent and also assume the 'dtls-actpass' role for DTLS/SRTP negotiation. A value of 'false' would have meant that the bridge would behave as the controlled ICE agent and assume the 'dtls-active' role.
When sending its result back, the bridge confirms creation of the requested channels and it also delivers transport information that would be necessary for participants to transport media to the bridge. These would most often include ICE candidates, ufrag and pwd parameters, and DTLS fingerprints.
Note that ICE is not mandatory for use and COLIBRI bridges can just as well perform Hosted NAT Traversal using latching and a RAW-UDP transport.
The above "result" also contains the following elements of interest for every channel:
- an ID that is necessary for any further modification from that the focus wants to set on a channel. [FIXME: clients should be able to specify these id-s so as not to rely on ordering to identify channels and get complexes thinking they are SDP parsers]
- an rtp-level-relay-type attribute with possible values of 'mixer' and 'translator' indicating how the bridge is going to deliver data on a specific channel [FIXME: this would definitely need to be specifiable from the client].
- mixer channels would also include ssrc-s for that channels in question. This is particularly necessary when SSRC-s need to be announced to participants (because people never learned how RTP works and are afraid from anything that wasn't explicitly announced with an Offer/Answer exchange). Generally such announcements would be possible by simply propagating SSRCs that other participants announce. In a mixed flow however the SSRC would belong to the mixer (or COLIBRI bridge) so it would need to be known in advance. attribute
- the initiator value is echoed
- expire describes how many seconds the bridge will keep the channel open without media activity
Channel updates can happen for various reasons. The following examples illustrate two of them:
- specifying payload types. While payload types in RTP are sometimes static (e.g. for older codecs such as G.711), this is not always the case for more recent types, which need to be assigned dynamically during session establishment. The tricky part here is that dynamic means dynamic so every participant in a conference call may end up expecting different payload types. As a result, a COLIBRI bridge SHOULD know about everyone's expectations, which is why channels are updated with payload types. Note that if a bridge does see unknown payload types it MUST still relay them to other participants as they might have used some other mechanism to make sure they know what they mean.
- DTLS/SRTP fingerprints.
Note that while the result in this case is essentially an acknowledgement, it still carries a full representation of the bridge.
Essentially that information is the transport description from the bridge.
ICE candidates are another reason why a focus might want to update a channel. Earlier examples indicated how conference setup could be completed without providing any transport information whatsoever. Whenever that is the case, such information would need to be provided through channel modification.
If an entity supports COLIBRI, it SHOULD advertise that fact by returning a feature of "http://jitsi.org/protocol/colibri" in response to a Service Discovery (XEP-0030)  information request.
In order for an application to determine whether an entity supports this protocol, where possible it SHOULD use the dynamic, presence-based profile of service discovery defined in Entity Capabilities (XEP-0115) . However, if an application has not received entity capabilities information from an entity, it SHOULD use explicit service discovery instead.
Jitsi's participation in this specification is funded by the NLnet Foundation.
This document in other formats: XML PDF
This XMPP Extension Protocol is copyright © 1999 – 2018 by the XMPP Standards Foundation (XSF).
Permission is hereby granted, free of charge, to any person obtaining a copy of this specification (the "Specification"), to make use of the Specification without restriction, including without limitation the rights to implement the Specification in a software program, deploy the Specification in a network service, and copy, modify, merge, publish, translate, distribute, sublicense, or sell copies of the Specification, and to permit persons to whom the Specification is furnished to do so, subject to the condition that the foregoing copyright notice and this permission notice shall be included in all copies or substantial portions of the Specification. Unless separate permission is granted, modified works that are redistributed shall not contain misleading information regarding the authors, title, number, or publisher of the Specification, and shall not claim endorsement of the modified works by the authors, any organization or project to which the authors belong, or the XMPP Standards Foundation.
## NOTE WELL: This Specification is provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. ##
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall the XMPP Standards Foundation or any author of this Specification be liable for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising from, out of, or in connection with the Specification or the implementation, deployment, or other use of the Specification (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if the XMPP Standards Foundation or such author has been advised of the possibility of such damages.
This XMPP Extension Protocol has been contributed in full conformance with the XSF's Intellectual Property Rights Policy (a copy of which can be found at <https://xmpp.org/about/xsf/ipr-policy> or obtained by writing to XMPP Standards Foundation, P.O. Box 787, Parker, CO 80134 USA).
The Extensible Messaging and Presence Protocol (XMPP) is defined in the XMPP Core (RFC 6120) and XMPP IM (RFC 6121) specifications contributed by the XMPP Standards Foundation to the Internet Standards Process, which is managed by the Internet Engineering Task Force in accordance with RFC 2026. Any protocol defined in this document has been developed outside the Internet Standards Process and is to be understood as an extension to XMPP rather than as an evolution, development, or modification of XMPP itself.
There exists a special venue for discussion related to the technology described in this document: the <email@example.com> mailing list.
The primary venue for discussion of XMPP Extension Protocols is the <firstname.lastname@example.org> discussion list.
Discussion on other xmpp.org discussion lists might also be appropriate; see <http://xmpp.org/about/discuss.shtml> for a complete list.
Errata can be sent to <email@example.com>.
The following requirements keywords as used in this document are to be interpreted as described in RFC 2119: "MUST", "SHALL", "REQUIRED"; "MUST NOT", "SHALL NOT"; "SHOULD", "RECOMMENDED"; "SHOULD NOT", "NOT RECOMMENDED"; "MAY", "OPTIONAL".
Note: Older versions of this specification might be available at http://xmpp.org/extensions/attic/
Initial published version approved by the XMPP Council.