This is an old revision of this page, as edited by Mjb (talk | contribs) at 10:16, 2 May 2005 (→Character data: phrasing; link "stateful"). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
Percent-encoding, also known as URL encoding, is a mechanism for encoding information in a Uniform Resource Identifier (URI) under certain circumstances. It is also used in the preparation of data of the "application/x-www-form-urlencoded" media type, which is often used in email messages and in the submission of HTML form data.
Percent-encoding in a URI
Types of URI characters
The characters in a URI, regardless of how they might be encoded, are taken from a set of unreserved characters for general use, and a smaller set of reserved characters that sometimes have special meaning in certain contexts. These sets and the circumstances under which certain characters from the reserved set have special meaning have changed slightly with each revision of specifications that govern URIs and URI schemes.
Unreserved characters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9
- _ . ~
Reserved characters:
! * ' ( ) ; : @ & = + $ , / ? % # [ ]
No other characters are allowed in a URI.
Percent-encoding reserved characters
When a character from the reserved set (a "reserved character") has special meaning (a "reserved purpose") in a certain context, and a URI scheme says that it is necessary to use that character for some other purpose, then the character must be percent-encoded. Percent-encoding a reserved character involves converting the character to its corresponding value in ASCII and then representing that value as a pair of hexadecimal digits. The digits, preceded by a percent sign ("%"), are then used in the URI in place of the reserved character.
For example, the reserved character "/", if used in the "path" component of a URI, has the special meaning of being a delimiter between path segments. If, according to a given URI scheme, "/" needs to be in a path segment, then the three characters "%2F" or "%2f" must be used in the segment instead of a raw "/".
Reserved characters that have no reserved purpose in a particular context may also be percent-encoded, but are not semantically different from those that are not percent-encoded.
For example, in the "query" component of a URI, "/" is still considered a reserved character, but it normally has no reserved purpose, unless a particular URI scheme says otherwise. When it has no reserved purpose, the character does not need to be percent-encoded.
URIs that differ only by the percent-encoding of reserved characters are not normally considered equivalent (denoting the same resource), unless it can be determined that the reserved characters in question have no reserved purpose. This determination is dependent upon the rules established for reserved characters by individual URI schemes.
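The encoding step described in this section (ASCII value, two hexadecimal digits, leading percent sign) can be sketched in Python as follows. This is a minimal illustration, not a complete URI library: the function name percent_encode_char is hypothetical, and it handles a single ASCII character only.

```python
def percent_encode_char(ch: str) -> str:
    """Return the percent-encoded form of one ASCII character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    value = ord(ch)  # the character's value in ASCII
    if value > 0x7F:
        raise ValueError("not an ASCII character")
    # A percent sign followed by the value as two hexadecimal digits.
    return "%{:02X}".format(value)


# A literal slash inside a path segment must not look like a delimiter.
segment = "a/b".replace("/", percent_encode_char("/"))
print(segment)  # a%2Fb
```

The sketch emits uppercase hexadecimal digits; as noted above, "%2f" would be an acceptable spelling of the same sequence.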
Percent-encoding unreserved characters
Characters from the unreserved set can be percent-encoded in the same way as reserved characters. That is, if a scheme calls for an unreserved character to be used in a URI, either the raw character or its percent-encoded equivalent may be used interchangeably.
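Because the two forms are interchangeable, a URI consumer can replace percent-encoded unreserved characters with the raw characters before comparing URIs. The following Python sketch shows one way to do that; the names UNRESERVED and decode_unreserved are hypothetical, and percent-encoded reserved characters are deliberately left untouched.

```python
import re

# Unreserved set as listed in this article: letters, digits, "-", "_", ".", "~".
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_.~"
)

# A percent sign followed by exactly two hexadecimal digits.
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Replace percent-encoded unreserved characters with raw characters."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        # Leave every other escape (reserved characters, other bytes) alone.
        return ch if ch in UNRESERVED else match.group(0)
    return _PCT.sub(repl, uri)


print(decode_unreserved("/%7Euser/%41%2Fb"))  # /~user/A%2Fb
```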
URIs that differ only by the percent-encoding of unreserved characters are always considered equivalent. For example, URI consumers should never treat "%41" differently than "A" or "%7E" differently than "~". URI producers are discouraged from percent-encoding unreserved characters, however.
Percent-encoding arbitrary data
Most URI schemes involve the representation of arbitrary data, such as an IP address or file system path, as components of a URI. URI scheme specifications should, but often do not, provide an explicit, clear mapping between URI characters and all possible data values being represented by those characters.
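As a minimal sketch of what an explicit mapping for raw bytes can look like, the following Python function writes bytes that correspond to unreserved characters as raw characters and percent-encodes every other byte. The names UNRESERVED_BYTES and percent_encode_bytes are hypothetical, chosen for illustration.

```python
# Byte values of the unreserved characters listed in this article.
UNRESERVED_BYTES = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-_.~"
)


def percent_encode_bytes(data: bytes) -> str:
    """Map arbitrary 8-bit bytes to URI characters."""
    out = []
    for byte in data:  # iterating over bytes yields integers 0..255
        if byte in UNRESERVED_BYTES:
            out.append(chr(byte))  # raw unreserved character
        else:
            out.append("%{:02X}".format(byte))  # percent-encoded byte
    return "".join(out)


print(percent_encode_bytes(b"\xffA"))  # %FFA
```

Python's standard library provides urllib.parse.quote_from_bytes, which, when called with safe="", is expected to behave similarly on recent Python versions.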
Binary data
Since the publication of RFC 1738 in 1994, it has been specified that schemes that provide for the representation of binary data in a URI must divide the data into 8-bit bytes and percent-encode each byte in the same manner as above, being careful to use raw unreserved characters, rather than percent-encoded sequences, where possible. For example, byte value FF (hexadecimal) should be represented by "%FF", but byte value 41 (hexadecimal) should be represented by "A", not "%41".
Character data
The procedure for percent-encoding binary data has often been extrapolated, sometimes inappropriately or without being fully specified, to apply to character-based data. In the World Wide Web's formative years, when dealing with data characters in the ASCII repertoire and using their corresponding bytes in ASCII as the basis for determining percent-encoded sequences, this practice was relatively harmless; it was just assumed that characters and bytes mapped one-to-one and were interchangeable. However, the need to represent characters outside the ASCII range quickly grew, and URI schemes and protocols often failed to provide standard rules for preparing character data for inclusion in a URI. Consequently, web applications began using different multi-byte, stateful, and other non-ASCII-compatible encodings as the basis for percent-encoding, leading to ambiguities and difficulty interpreting URIs reliably.
For example, many URI schemes and protocols based on RFCs 1738 and 2396 presume that the data characters will be converted to bytes according to some unspecified character encoding before being represented in a URI by unreserved characters or percent-encoded bytes. If the scheme does not allow the URI to provide a hint as to what encoding was used, or if the encoding conflicts with the use of ASCII to percent-encode reserved and unreserved characters, then the URI cannot be reliably interpreted. Some schemes fail to account for encoding at all, and instead just suggest that data characters map directly to URI characters, which leaves it up to implementations to decide whether and how to percent-encode data characters that are in neither the reserved nor unreserved sets.
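The ambiguity can be illustrated with Python's urllib.parse module, whose quote and unquote functions accept an encoding argument. The choice of the character U+00E9 and of the two encodings is an assumption made for this example only.

```python
from urllib.parse import quote, unquote

text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, outside the ASCII range

# The same character yields different percent-encoded sequences
# depending on which character encoding produced the bytes.
as_latin1 = quote(text, safe="", encoding="iso-8859-1")
as_utf8 = quote(text, safe="", encoding="utf-8")
print(as_latin1)  # %E9
print(as_utf8)    # %C3%A9

# A consumer that assumes the wrong encoding recovers different text:
# two characters (U+00C3, U+00A9) instead of the original one.
misread = unquote(as_utf8, encoding="iso-8859-1")
print(ascii(misread))  # '\xc3\xa9'
```

Nothing in the URI itself indicates which of the two sequences was intended, which is the interpretation problem described above.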
Current standard
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and must convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
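The rule can be sketched in Python as follows: unreserved characters pass through unchanged, and every other character is converted to its UTF-8 bytes, each of which is percent-encoded. The names UNRESERVED and percent_encode_text are hypothetical, and the sketch does not model any scheme-specific handling of reserved characters.

```python
# Unreserved set as listed in this article: letters, digits, "-", "_", ".", "~".
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_.~"
)


def percent_encode_text(text: str) -> str:
    """Encode character data: raw unreserved characters, UTF-8 otherwise."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)  # represented without translation
        else:
            # Convert the character to UTF-8 bytes, then encode each byte.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


print(percent_encode_text("caf\u00e9 A~"))  # caf%C3%A9%20A~
```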
Not addressed by the current specification is what to do with encoded character data. For example, in computers, character data manifests in encoded form, at some level, and thus could be treated as either binary data or as character data when being mapped to URI characters. Presumably, it is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice, few, if any, actually do.
Percent-encoding in application/x-www-form-urlencoded data
External links
The following specifications all discuss and define reserved characters, unreserved characters, and percent-encoding, in some form or other:
- RFC 3986, the current generic URI syntax specification.
- RFC 2396 (obsolete) and RFC 2732 together comprised the previous version of the generic URI syntax specification.
- RFC 1738 (mostly obsolete) and RFC 1808 (obsolete), which define URLs.
- RFC 1630 (obsolete), the first generic URI syntax specification.