JEP-0106: JID Escaping

This JEP specifies a mechanism that enables the display of Jabber Identifiers (JIDs) with characters disallowed by the Nodeprep profile of stringprep.


NOTICE: This JEP is currently within Last Call or under consideration by the Jabber Council for advancement to the next stage in the JSF standards process. For further details, visit <http://www.jabber.org/council/queue.shtml>.


JEP Information

Status: Proposed
Type: Standards Track
Number: 0106
Version: 0.6
Last Updated: 2005-05-06
JIG: Standards JIG
Approving Body: Jabber Council
Dependencies: XMPP Core, JEP-0030
Supersedes: None
Superseded By: None
Short Name: jid\20escaping

Author Information

Joe Hildebrand

Email: jhildebrand@jabber.com
JID: hildjj@jabber.org

Peter Saint-Andre

Email: stpeter@jabber.org
JID: stpeter@jabber.org

Legal Notice

This Jabber Enhancement Proposal is copyright 1999 - 2005 by the Jabber Software Foundation (JSF) and is in full conformance with the JSF's Intellectual Property Rights Policy <http://www.jabber.org/jsf/ipr-policy.shtml>. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, v1.0 or later (the latest version is presently available at <http://www.opencontent.org/openpub/>).

Discussion Venue

The preferred venue for discussion of this document is the Standards-JIG discussion list: <http://mail.jabber.org/mailman/listinfo/standards-jig>.

Relation to XMPP

The Extensible Messaging and Presence Protocol (XMPP) is defined in the XMPP Core (RFC 3920) and XMPP IM (RFC 3921) specifications contributed by the Jabber Software Foundation to the Internet Standards Process, which is managed by the Internet Engineering Task Force in accordance with RFC 2026. Any protocols defined in this JEP have been developed outside the Internet Standards Process and are to be understood as extensions to XMPP rather than as an evolution, development, or modification of XMPP itself.

Conformance Terms

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.


Table of Contents

1. Introduction
2. Requirements
3. Discovery
4. Transformations
4.1. Concepts
4.2. Encoding Transformation
4.3. Decoding Transformation
5. Business Rules
5.1. Native Processing
5.2. Address Transformation Algorithm
5.3. Exceptions
5.4. JID Escaping vs. Older Methods
6. Implementation Notes
6.1. Email Addresses
6.2. SIP Addresses
6.3. IM and Presence Addresses
6.4. IMPS Addresses
6.5. LDAP Distinguished Names
7. Security Considerations
8. IANA Considerations
9. Jabber Registrar Considerations
9.1. Service Discovery Features
Notes
Revision History


1. Introduction

XMPP Core [1] defines the Nodeprep profile of stringprep (RFC 3454 [2]), which specifies that the following nine Unicode code points are disallowed in the node identifier portion of a Jabber Identifier (hereafter we refer to these as "the disallowed characters"):

  1. U+0020 (space)
  2. U+0022 (")
  3. U+0026 (&)
  4. U+0027 (')
  5. U+002F (/)
  6. U+003A (:)
  7. U+003C (<)
  8. U+003E (>)
  9. U+0040 (@)

This restriction is an inconvenience for users who have one or more of the disallowed characters in their desired usernames, particularly in the case of the ' character, which is common in names like O'Hara and D'Artagnan. The restriction is a positive hardship if existing email addresses are mapped to JIDs, since some of the disallowed characters are allowed in the username portion of an email address (specifically, the characters & ' / as described in Sections 3.2.4 and 3.2.5 of RFC 2822 [3]).

If the & character had not been in the list of disallowed characters, then normal XML escaping conventions (as specified in XML 1.0 [4]) could have been used, with the result that D'Artagnan (for example) could have been rendered as D&apos;artagnan [sic]. Since there are good reasons for each of the disallowed characters, another escaping mechanism is needed.

It might have been desirable to use percent-encoding (e.g., %27 for the ' character) as specified in Section 2.1 of RFC 3986 [5]. However, that approach was rejected since the % character is an often-used character in existing JIDs (e.g., to replace the @ character in gateway addresses) and the resulting ambiguity would have caused misdelivered or undeliverable messages. Therefore, a new mechanism is described herein to escape only the disallowed characters and only in the node identifier portion of JIDs.

2. Requirements

This JEP addresses the following requirements:

  1. The escaping mechanism shall apply to the node identifier portion of a JID only, and MUST NOT be applied to domain identifiers or resource identifiers.
  2. Escaped JIDs MUST conform to the definition of a Jabber ID as specified in RFC 3920, including the Nodeprep profile of stringprep. In particular this means that even after passing through Nodeprep, the JID MUST be valid, with the result that Unicode look-alikes like U+02BC (Modifier Letter Apostrophe) MUST NOT be used.
  3. It MUST NOT be possible for clients to use this escaping mechanism to avoid the goal of stringprep; namely, that JIDs that look alike should have the same character representation after being processed by stringprep. Therefore, this mechanism MUST NOT be applied to any characters other than the disallowed characters.
  4. Existing JIDs that include portions of the escaping mechanism MUST continue to be valid.
  5. The escaping mechanism MUST NOT break commonly deployed Jabber/XMPP software implementations such as servers, components, gateways, and clients.
  6. The escaping mechanism SHOULD NOT place undue strain upon server implementations; implementations or deployments that do not need to unescape SHOULD be able to ignore the escaping mechanism.

3. Discovery

If an entity needs to discover whether another entity supports JID escaping, it MUST send a disco#info request to the other entity as specified in Service Discovery [6].

Example 1. Client requests features

<iq type='get'
    from='porthos@musketeers.bourbon.gov/gate'
    to='irc.shakespeare.lit'
    id='info1'>
  <query xmlns='http://jabber.org/protocol/disco#info'/>
</iq>
  

If the queried entity supports JID escaping, it MUST return a jid\20escaping [sic] feature in its reply.

Example 2. Service responds with features

<iq type='result'
    to='porthos@musketeers.bourbon.gov/gate'
    from='irc.shakespeare.lit'
    id='info1'>
  <query xmlns='http://jabber.org/protocol/disco#info'>
...
    <feature var='jid\20escaping'/>
  </query>
</iq>
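
As an illustration only, a client could inspect the disco#info reply along the following lines. This is a minimal sketch rather than normative text: the helper name supports_jid_escaping is hypothetical, the stanza is parsed from a standalone string (a real client would receive it from its stream parser), and the abbreviated reply used at the end is made up for the example.

```python
import xml.etree.ElementTree as ET

DISCO_INFO_NS = "http://jabber.org/protocol/disco#info"
JID_ESCAPING_FEATURE = "jid\\20escaping"  # feature name defined by this JEP


def supports_jid_escaping(iq_xml: str) -> bool:
    """Return True if a disco#info result advertises the JID escaping feature.

    Hypothetical helper: anything other than a well-formed result that lists
    the feature is treated as "support not discovered".
    """
    iq = ET.fromstring(iq_xml)
    if iq.get("type") != "result":
        return False
    query = iq.find(f"{{{DISCO_INFO_NS}}}query")
    if query is None:
        return False
    features = query.findall(f"{{{DISCO_INFO_NS}}}feature")
    return any(f.get("var") == JID_ESCAPING_FEATURE for f in features)


# Made-up, abbreviated reply for demonstration purposes.
reply = (
    "<iq type='result' from='irc.shakespeare.lit' "
    "to='porthos@musketeers.bourbon.gov/gate' id='info1'>"
    "<query xmlns='http://jabber.org/protocol/disco#info'>"
    "<feature var='jid\\20escaping'/>"
    "</query></iq>"
)
print(supports_jid_escaping(reply))
```

The feature name contains a literal backslash, so it has to be written as "jid\\20escaping" in a Python string literal.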
  

4. Transformations

4.1 Concepts

This JEP specifies encoding each disallowed character as \hexhex -- where "hexhex" is the hexadecimal value of the Unicode code point in question, ignoring the leading "00" in the code point (e.g., 27 for the ' character, resulting in an encoding of \27). (Note: This escaping method is quite similar to that used for disallowed characters in LDAP distinguished names, as specified in RFC 2253 [7].) Full encoding and decoding transformations for all nine disallowed characters are provided in the following sections. In addition, encoding and decoding transformations are shown for the \ character in case it needs to be "double-escaped" when it occurs in a non-XMPP address as part of a string that corresponds to one of the other encoded characters.

Note: All transformations are exactly as specified below. CASE IS SIGNIFICANT. Lowercase was selected since Nodeprep will case fold to lowercase for US-ASCII characters such as A, C, E, and F.

4.2 Encoding Transformation

The encoding transformations are defined in the following table. Typically, encoding is performed only by a client that is processing information provided by a human user in unescaped form, or by a gateway to some external system (e.g., email or LDAP) that needs to generate a JID.

Table 1: Mapping from Unescaped to Encoded Characters

Unescaped Character Encoded Character
<space> \20
" \22
& \26
' \27
/ \2f
: \3a
< \3c
> \3e
@ \40
\ \5c

Example 3. JID Encoding: Porthos starts a chat, typing into his client the JID d'artagnan@musketeers.bourbon.gov:

<message 
    from='porthos@musketeers.bourbon.gov/gate'
    to='d\27artagnan@musketeers.bourbon.gov'
    type='chat'>
  <body>And do you always forget your eyes when you run?</body>
</message>
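
The mapping for the nine disallowed characters in Table 1 can be sketched as a simple character translation. The function names below are hypothetical, and the sketch makes two simplifying assumptions: the input is a bare JID typed by a user, with the last @ character separating the node identifier from the domain, and the backslash row of Table 1 (needed only for the "double-escaping" case) is deliberately left out.

```python
# Table 1, restricted to the nine disallowed characters.
# Lowercase hex digits are used because case is significant.
ENCODE_TABLE = str.maketrans({
    " ": "\\20",
    '"': "\\22",
    "&": "\\26",
    "'": "\\27",
    "/": "\\2f",
    ":": "\\3a",
    "<": "\\3c",
    ">": "\\3e",
    "@": "\\40",
})


def encode_node(node: str) -> str:
    """Encode the disallowed characters of a node identifier (hypothetical helper)."""
    return node.translate(ENCODE_TABLE)


def encode_typed_jid(typed: str) -> str:
    """Encode only the node portion of a bare JID typed by a user.

    Assumption: the last "@" separates node and domain; the domain is left
    untouched because the mechanism applies to node identifiers only.
    """
    node, at, domain = typed.rpartition("@")
    if not at:
        return typed  # no node identifier present, nothing to encode
    return encode_node(node) + "@" + domain


print(encode_typed_jid("d'artagnan@musketeers.bourbon.gov"))
# prints d\27artagnan@musketeers.bourbon.gov
```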
    

4.3 Decoding Transformation

The decoding transformations are defined in the following table. Typically, decoding is performed only by a client that wants to display JIDs containing encoded characters to a human user, or by a gateway to some external system (e.g., email or LDAP) that needs to generate identifiers for foreign systems.

Table 2: Mapping from Encoded to Decoded Characters

Encoded Character Decoded Character
\20 <space>
\22 "
\26 &
\27 '
\2f /
\3a :
\3c <
\3e >
\40 @
\5c \

Example 4. JID Decoding: D'Artagnan the elder sends SMTP mail through a gateway:

<message 
    from='d\27artagnan@gascon.fr/elder'
    to='tréville%musketeers.bourbon.gov@smtp.example.com'>
  <body>I recommend my son to you.</body>
</message>
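
A decoder for Table 2 can be sketched with a single regular-expression pass. The name decode_node is hypothetical. The sketch assumes its input is a node identifier in wire form, and it produces a string for display or for a foreign system only.

```python
import re

# The ten encoded forms of Table 2; only lowercase hex digits are recognized.
_ENCODED = re.compile(r"\\(20|22|26|27|2f|3a|3c|3e|40|5c)")


def decode_node(node: str) -> str:
    """Decode a node identifier for presentation (hypothetical helper).

    The substitution is a single left-to-right pass, so a backslash produced
    by decoding "\\5c" is never combined with the characters that follow it.
    """
    return _ENCODED.sub(lambda m: chr(int(m.group(1), 16)), node)


print(decode_node("d\\27artagnan"))
# prints d'artagnan
```

Because only the listed sequences match, any other backslash sequence passes through unchanged.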
    

5. Business Rules

5.1 Native Processing

The following processing rules apply to native XMPP implementations:

  1. A client SHOULD render an encoded character as its decoded equivalent when presenting it to a human user.
  2. A server MAY decode an encoded character for communication with external systems (e.g. LDAP), but only after the Nodeprep profile of stringprep has been applied.
  3. The decoding transformation MUST be NFKC-safe -- i.e., it MUST conform to Unicode normalization form KC (see Appendix B.3 of RFC 3454).
  4. An entity MUST NOT include the unescaped or decoded version of an encoded character over the wire in any XML stanzas sent to another entity.
  5. An entity MUST NOT use the unescaped or decoded version of an encoded character when comparing two JIDs.
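
One way to keep the display form and the wire form apart, as the rules above require, is to store only the escaped string and derive the decoded string on demand. The following is a sketch under stated assumptions: the class name NodeId is hypothetical, and Nodeprep processing is assumed to have happened elsewhere.

```python
import re
from dataclasses import dataclass

_ENCODED = re.compile(r"\\(20|22|26|27|2f|3a|3c|3e|40|5c)")


@dataclass(frozen=True)
class NodeId:
    """A node identifier held in its escaped (wire) form.

    Hypothetical type: equality and hashing are generated from `wire` alone,
    so comparisons never involve the decoded form.
    """

    wire: str

    @property
    def display(self) -> str:
        # For presentation to a human user only; never sent over the wire.
        return _ENCODED.sub(lambda m: chr(int(m.group(1), 16)), self.wire)


a = NodeId("\\5cfoo")
b = NodeId("\\foo")
print(a.display == b.display)  # True: both are shown as \foo
print(a == b)                  # False: the wire forms differ
```

The two nodes in the usage lines look identical once decoded although they are different addresses, which is the kind of confusion that comparing decoded forms would introduce.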

5.2 Address Transformation Algorithm

When transforming a non-XMPP address into an XMPP address, an implementation MUST adhere to the following process:

  1. The original address MUST first be properly decoded (e.g., according to the rules in RFC 3986) before it is transformed into a JID.
  2. Any instances of strings that correspond to encodings of the disallowed characters (e.g., the string "\27") in the original address MUST be "double-escaped" by converting the backslash character to the string "\5c".
  3. The URI scheme component MUST be removed.
  4. All disallowed characters in the original address MUST be properly encoded in the resulting JID (as described above).

While the fourth step should be clear from the foregoing text and the third step is necessary since XMPP addresses are not URIs, the meaning of the first and second steps may not be obvious.

Regarding step one, many non-XMPP messaging systems use URIs to identify addresses (examples include the mailto:, sip:, sips:, im:, pres:, and wv: URI schemes) or otherwise encode an identifier (e.g., an LDAP distinguished name). Before transforming an address or identifier into a JID, it MUST first be decoded according to the rules specified for that type of address or identifier in order to ensure that the proper characters are transformed.

Regarding step two, it is possible for some non-XMPP addresses to contain strings that correspond to JID-escaped characters (e.g., "\27"). Consider a Wireless Village address of <wv:\3and\2is\5@example.com> -- if that address were directly converted into a JID, the resulting XMPP address would be \3and\2is\5@example.com, which could be construed as :nd\2is\5@example.com if JID escaping logic is applied. Therefore the leading \ character MUST be converted to the string "\5c" during the transformation, leading to a JID of \5c3and\2is\5@example.com (which would be presented to a human user as \3and\2is\5@example.com). Escaping of the backslash character before two hexhex characters MUST NOT be performed if the string is "\5c", only if the string corresponds to the encoded representation of one of the disallowed characters.
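
The four steps can be sketched as follows for simple URIs of the form scheme:user@domain. This is an illustration under stated assumptions, not a complete implementation: the name uri_to_jid is hypothetical, percent-decoding is assumed to be the right decoding for the source scheme, headers and parameters are not handled, and the scheme is split off before the double-escaping step so that only the node identifier is touched.

```python
import re
from urllib.parse import unquote

# A backslash that starts the encoded form of one of the nine disallowed
# characters. "\5c" is intentionally absent, as required by step two.
_NEEDS_DOUBLE_ESCAPE = re.compile(r"\\(?=20|22|26|27|2f|3a|3c|3e|40)")

_ENCODE_TABLE = str.maketrans({
    " ": "\\20", '"': "\\22", "&": "\\26", "'": "\\27", "/": "\\2f",
    ":": "\\3a", "<": "\\3c", ">": "\\3e", "@": "\\40",
})


def uri_to_jid(uri: str) -> str:
    """Transform a simple scheme:user@domain URI into a bare JID (hypothetical helper)."""
    decoded = unquote(uri)                       # step 1: decode the original address
    _scheme, _, address = decoded.partition(":")  # step 3: drop the URI scheme
    node, at, domain = address.rpartition("@")
    if not at:
        return address                           # no node identifier to transform
    node = _NEEDS_DOUBLE_ESCAPE.sub(lambda m: "\\5c", node)  # step 2: double-escape
    node = node.translate(_ENCODE_TABLE)         # step 4: encode disallowed characters
    return node + "@" + domain


print(uri_to_jid("wv:\\3and\\2is\\5@example.com"))
# prints \5c3and\2is\5@example.com
```

Double-escaping has to run before the disallowed characters are encoded. In the opposite order, the backslashes introduced by the encoding step would themselves be rewritten to "\5c".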

5.3 Exceptions

In order to maintain as much backward compatibility as possible, partial escape sequences and escape sequences corresponding to characters not on the list of disallowed characters MUST be ignored.

Example 5. Partial escape sequence

\2plus\2is\4 is not modified by encoding or decoding transformations.

Example 6. Invalid escape sequence 1

foo\bar is not modified (to fooºr) by encoding or decoding transformations.

Example 7. Invalid escape sequence 2

foob\41r is not modified (to foobAr) by encoding or decoding transformations.

5.4 JID Escaping vs. Older Methods

When a client attempts to communicate with another entity through a gateway, it needs to know which encoding mechanism to use. A client MUST assume that the gateway does not support the JID escaping mechanism unless it explicitly discovers support for the jid\20escaping [sic] feature via Service Discovery as shown above. If there are any errors in the service discovery exchange or if support for JID escaping is not discovered, the client SHOULD proceed as follows:

  1. If the gateway supports the 'jabber:iq:gateway' protocol (as specified in Gateway Interaction [8]), use that protocol.
  2. If the gateway does not support the 'jabber:iq:gateway' protocol, use customary escaping mechanisms (such as transformation of the @ character to the % character).
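
The fallback order described above might be expressed as follows. This sketch rests on assumptions that are not stated in this document: support for 'jabber:iq:gateway' is taken to be learned from the same list of service discovery features, a failed exchange is represented by None, and the function names and returned labels are hypothetical.

```python
from typing import Iterable, Optional

JID_ESCAPING = "jid\\20escaping"
IQ_GATEWAY = "jabber:iq:gateway"


def choose_addressing(features: Optional[Iterable[str]]) -> str:
    """Pick an addressing method for a gateway (hypothetical helper).

    `features` holds the feature names from the disco#info reply, or None
    if the exchange failed; a failure is treated like an empty list.
    """
    advertised = set(features or ())
    if JID_ESCAPING in advertised:
        return "jid-escaping"
    if IQ_GATEWAY in advertised:
        return "iq-gateway"
    return "customary"


def customary_gateway_jid(foreign_address: str, gateway: str) -> str:
    """Customary transformation: replace "@" with "%" and append the gateway."""
    return foreign_address.replace("@", "%") + "@" + gateway


print(choose_addressing(None))
# prints customary
print(customary_gateway_jid("tréville@musketeers.bourbon.gov", "smtp.example.com"))
# prints tréville%musketeers.bourbon.gov@smtp.example.com
```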

6. Implementation Notes

In order to assist implementors, this section describes specific mappings between JIDs and addresses or identifiers used in the following standardized protocols:

6.1 Email Addresses

The address format for an Internet mailbox is specified in RFC 2822. The identifier of interest in this context is the "addr-spec" address and more particularly the "dot-atom" rule specified in Section 3.2.4, i.e., the email address shorn of angle brackets, display names, comments, quoted strings, and the like. Because some deployments of XMPP messaging systems may want to re-use existing email addresses as JIDs, it is helpful to define how to transform an email address into a JID.

In general, it is straightforward to transform an email address (i.e., a "dot-atom") into a JID, since traditional email addresses allow US-ASCII characters only rather than the nearly full range of Unicode code points allowed in a JID. [9] However, there are three characters allowed in the local-part of an email address that are not allowed in the node identifier portion of a JID: namely, the characters & ' / as described in Sections 3.2.4 and 3.2.5 of RFC 2822. In order to transform these characters, a compliant implementation MUST use the methods specified herein.

Example 8. An Email Address Containing JID-Disallowed Characters

here's_a_wild_&_/cr%zy/_address@example.com
    

Example 9. The Transformed JID

here\27s_a_wild_\26_\2fcr%zy\2f_address@example.com
    

Example 10. The JID as Presented to a User

here's_a_wild_&_/cr%zy/_address@example.com
    

(Note: Because the backslash character is forbidden in the "dot-atom" construction, an email address should not contain a string that corresponds to one of the encoded characters specified in the Transformations section of this document; therefore, no such examples are shown; see below under IMPS Addresses.)

An email address may also exist in the form of a mailto: URI as specified in RFC 2368 [10]. Before transforming a mailto: URI into a JID, it MUST be URL-decoded and all headers MUST be removed, leaving a mailbox identifier, as shown in the following example.

Example 11. A mailto: URI Containing JID-Disallowed Characters

mailto:here%27s_a_wild_%26_%2Fcr%zy%2F_address@example.com?subject=that%20is%20crazy%21
    

Example 12. The Resulting Mailbox

here's_a_wild_&_/cr%zy/_address@example.com
    

Example 13. The Transformed JID

here\27s_a_wild_\26_\2fcr%zy\2f_address@example.com
    

Example 14. The JID as Presented to a User

here's_a_wild_&_/cr%zy/_address@example.com
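
Reducing a mailto: URI to a mailbox, as in the examples above, could be sketched as shown below. The name mailto_to_mailbox is hypothetical. The sketch splits off the headers before URL-decoding so that an encoded ? character in the local part cannot be mistaken for the start of the headers; this ordering is a design choice of the example, and the resulting mailbox still has to be encoded as described in the Transformations section.

```python
from urllib.parse import unquote, urlsplit


def mailto_to_mailbox(uri: str) -> str:
    """Strip the scheme and headers from a mailto: URI and URL-decode the rest.

    Hypothetical helper; it returns the mailbox only and does not perform
    JID escaping.
    """
    parts = urlsplit(uri)
    if parts.scheme != "mailto":
        raise ValueError("not a mailto: URI")
    # parts.query holds the headers (e.g. subject=...) and is discarded.
    return unquote(parts.path)


uri = (
    "mailto:here%27s_a_wild_%26_%2Fcr%zy%2F_address@example.com"
    "?subject=that%20is%20crazy%21"
)
print(mailto_to_mailbox(uri))
# prints here's_a_wild_&_/cr%zy/_address@example.com
```

The string "%zy" is not a valid percent-encoded octet, so the decoder used here leaves it as it is, which matches the examples.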
    

6.2 SIP Addresses

As specified in RFC 3261 [11], a SIP address (i.e., a sip: or sips: URI) can be quite complex if URI parameters or headers are included. However, a basic SIP address (the combination of the optional "userinfo" and required "hostport" constructions) is essentially similar to an email address (e.g., the same characters & ' / allowed in an email address but disallowed in an XMPP node identifier are also allowed in a basic SIP address).

Example 15. A Basic sip: URI Containing JID-Disallowed Characters

sip:here%27s_a_wild_%26_%2Fcr%zy%2F_address@example.com
    

Example 16. The URL-Decoded Address

here's_a_wild_&_/cr%zy/_address@example.com
    

Example 17. The Transformed JID

here\27s_a_wild_\26_\2fcr%zy\2f_address@example.com
    

Example 18. The JID as Presented to a User

here's_a_wild_&_/cr%zy/_address@example.com
    

6.3 IM and Presence Addresses

The im: and pres: URI schemes are specified in RFC 3860 [12] and RFC 3859 [13] respectively. With the exception of headers, an im: or pres: URI is simply a mailbox (as specified in RFC 2822) prepended with the im: or pres: scheme. Thus a basic IM or PRES address (not including optional headers) is essentially similar to an email address (e.g., the same characters & ' / allowed in an email address but disallowed in an XMPP node identifier are also allowed in a basic IM or PRES address).

Example 19. A Basic im: URI Containing JID-Disallowed Characters

im:here%27s_a_wild_%26_%2Fcr%zy%2F_address@example.com
    

Example 20. The URL-Decoded Address

here's_a_wild_&_/cr%zy/_address@example.com
    

Example 21. The Transformed JID

here\27s_a_wild_\26_\2fcr%zy\2f_address@example.com
    

Example 22. The JID as Presented to a User

here's_a_wild_&_/cr%zy/_address@example.com
    

6.4 IMPS Addresses

The Instant Messaging and Presence Service (IMPS) protocol was originally defined by the Wireless Village consortium and is now maintained by the Open Mobile Alliance (OMA) [14]. An IMPS address is formatted as a wv: URI, as specified in WV Client-Server Protocol v1.1 [15]. A basic address (not including a private resource) is of the form <wv:user-id@domain> and an address with a private resource is of the form <wv:user-id/resource@domain>.

The "User-ID" construction is either a mobile phone number (beginning with "+1" for international numbers and a digit for national numbers) or an "Internet-Identity". An "Internet-Identity" may contain any US-ASCII character other than / @ + SP TAB and thus may include the following characters that are disallowed in the node identifier portion of a JID: " & ' / : < > (which characters MUST be escaped when transforming an IMPS address into a JID). However, some of those characters are also reserved in URI syntax (namely the & ' / characters) so those characters will be found in encoded form within a wv: URI.

Example 23. A Basic wv: URI Containing JID-Disallowed Characters

wv:here%27s_a_wild_%26_%2Fcr%zy%2F_address_for%3A%3Cwv%3E%28%22IMPS%22%29@example.com
    

Example 24. The URL-Decoded Address

here's_a_wild_&_/cr%zy/_address_for:<wv>("IMPS")@example.com
    

Example 25. The Transformed JID

here\27s_a_wild_\26_\2fcr%zy\2f_address_for\3a\3cwv\3e(\22IMPS\22)@example.com
    

Example 26. The JID as Presented to a User

here's_a_wild_&_/cr%zy/_address_for:<wv>("IMPS")@example.com
    

Unlike the foregoing address types, IMPS addresses are allowed to contain backslashes. This implies that it is possible for an IMPS address to contain a string that corresponds to one of the encoded character representations for code points that are disallowed in XMPP node identifiers. An example would be the IMPS address <wv:\3and\2is\5@example.com>, where the string "\3a" could be interpreted as the : character if that IMPS address is directly converted into a JID. Therefore, the leading \ character MUST be transformed to "\5c" in order to avoid possible ambiguity. Thus the transformed JID would be <\5c3and\2is\5@example.com>, which would be presented to a user as <\3and\2is\5@example.com>.

If an IMPS address contains a private resource, a gateway between XMPP and IMPS should process the resource and append it to the end of the JID; however, such gateway behavior is out of scope for this JEP.

6.5 LDAP Distinguished Names

Within the Lightweight Directory Access Protocol (see RFC 2251 [16]), a "distinguished name" (DN) is a hierarchically-organized string representation that uniquely identifies a user, system, or organization. It is possible that some messaging systems use LDAP distinguished names to identify entities that can communicate using the system (e.g., this is reputed to be the case for certain releases of the Lotus Sametime system sold by IBM), and in any case it may be helpful to transform an LDAP distinguished name into an XMPP address for identification or addressing purposes.

As previously mentioned, a UTF-8 string representation of LDAP distinguished names is specified in RFC 2253. This representation specifies that the characters , + " \ < > ; are to be escaped with the backslash character (e.g., the string "\," would be used to escape the , character) and that any other non-US-ASCII characters are to be escaped using a string of the form "\xx".

The following example shows a distinguished name (and transformations thereof) for a person whose common name is "D'Artagnan Saint-André" and who is associated with an organization called "Example & Company, Inc." whose domain name is "example.com":

Example 27. A Distinguished Name

CN=D'Artagnan Saint-André,O=Example & Company, Inc.,DC=example,DC=com
    

Example 28. UTF-8 Representation of Distinguished Name

CN=D'Artagnan Saint-Andr\E9,O=Example & Company\, Inc.,DC=example,DC=com
    

This example assumes that the specified user is identified with a gateway running at st.example.com (note that the backslash escaping the , character in the organization name is removed during the transformation).

Example 29. The Transformed JID

CN=D\27Artagnan\20Saint-Andr\E9,O=Example\20\26\20Company,\20Inc.,DC=example,DC=com@st.example.com
    

Example 30. The JID as Presented to a User

CN=D'Artagnan Saint-André,O=Example & Company, Inc.,DC=example,DC=com@st.example.com
    

Naturally, a more intelligent gateway could use the Domain Components to construct a more readable JID, such as <D\27Artagnan\20Saint-André@example.com>; however, such gateway behavior is out of scope for this JEP.

7. Security Considerations

An entity that performs JID escaping MUST NOT compare unescaped/decoded versions of JIDs; otherwise, messages and other information could be directed to an entity other than the intended recipient.

An entity that transforms a non-XMPP address into a JID MUST follow the algorithm specified in the Address Transformation Algorithm section of this document; otherwise, messages and other information could be directed to an entity other than the intended recipient.

8. IANA Considerations

This JEP requires no interaction with the Internet Assigned Numbers Authority (IANA) [17].

9. Jabber Registrar Considerations

9.1 Service Discovery Features

The Jabber Registrar [18] shall include the jid\20escaping [sic] feature in its registry of service discovery features.


Notes

1. RFC 3920: Extensible Messaging and Presence Protocol (XMPP): Core <http://www.ietf.org/rfc/rfc3920.txt>.

2. RFC 3454: Preparation of Internationalized Strings (stringprep) <http://www.ietf.org/rfc/rfc3454.txt>.

3. RFC 2822: Internet Message Format <http://www.ietf.org/rfc/rfc2822.txt>.

4. Extensible Markup Language (XML) 1.0 (Third Edition) <http://www.w3.org/TR/REC-xml/>.

5. RFC 3986: Uniform Resource Identifiers (URI): Generic Syntax <http://www.ietf.org/rfc/rfc3986.txt>.

6. JEP-0030: Service Discovery <http://www.jabber.org/jeps/jep-0030.html>.

7. RFC 2253: Lightweight Directory Access Protocol (v3): UTF-8 String Representation of Distinguished Names <http://www.ietf.org/rfc/rfc2253.txt>.

8. JEP-0100: Gateway Interaction <http://www.jabber.org/jeps/jep-0100.html>.

9. This specification does not cover recent efforts to define internationalized email addresses.

10. RFC 2368: The mailto URL scheme <http://www.ietf.org/rfc/rfc2368.txt>.

11. RFC 3261: Session Initiation Protocol (SIP) <http://www.ietf.org/rfc/rfc3261.txt>.

12. RFC 3860: Common Profile for Instant Messaging (CPIM) <http://www.ietf.org/rfc/rfc3860.txt>.

13. RFC 3859: Common Profile for Presence (CPP) <http://www.ietf.org/rfc/rfc3859.txt>.

14. The Open Mobile Alliance is the focal point for the development of mobile service enabler specifications, which support the creation of interoperable end-to-end mobile services. For further information, see <http://www.openmobilealliance.org/>.

15. Wireless Village Client-Server Protocol v1.1 <http://www.openmobilealliance.org/tech/affiliates/wv/wvindex.html>.

16. RFC 2251: Lightweight Directory Access Protocol (v3) <http://www.ietf.org/rfc/rfc2251.txt>.

17. The Internet Assigned Numbers Authority (IANA) is the central coordinator for the assignment of unique parameter values for Internet protocols, such as port numbers and URI schemes. For further information, see <http://www.iana.org/>.

18. The Jabber Registrar maintains a list of reserved Jabber protocol namespaces as well as registries of parameters used in the context of protocols approved by the Jabber Software Foundation. For further information, see <http://www.jabber.org/registrar/>.


Revision History

Version 0.6 (2005-05-06)

Changed format from #xx; to \xx per list discussion; added extensive implementation notes. (psa)

Version 0.5 (2005-04-21)

Changed to U+00xx format for code points; added references to various RFCs; corrected terminology; cleaned up text and flow. (psa)

Version 0.4 (2005-04-04)

Corrected several small textual errors and ambiguities; slightly reorganized textual flow. (psa)

Version 0.3 (2005-03-16)

Clarified relationship between JID escaping and traditional client proxy gateway behavior; fixed several small errors. (psa)

Version 0.2 (2003-10-21)

Editorial cleanup; added security considerations. (psa)

Version 0.1 (2003-07-21)

Initial version. (jjh)

