Random Notes

Book impressions, technical notes, and the like

The rule that spaces in a URI become '+'

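A search URL of the form foo+bar is far easier to read than one of the form foo%20bar. From a specification standpoint, though, converting everything uniformly to %XX would be more consistent.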
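To make the two forms concrete, here is a minimal sketch using Python's standard urllib.parse module. The search term is only an illustration.

```python
from urllib.parse import quote, quote_plus

term = "foo bar"  # illustrative search term

# Generic percent-encoding: the space becomes %20.
percent_form = quote(term)

# Form-style encoding: the space becomes '+'.
plus_form = quote_plus(term)

print(percent_form)  # foo%20bar
print(plus_form)     # foo+bar
```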

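So I looked briefly into what the specifications actually say.

As the excerpts below show, the W3C Recommendation for HTML forms and the CGI specification both appear to state it explicitly.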

This is the default content type. Forms submitted with this content type must be encoded as follows:

Control names and values are escaped. Space characters are replaced by `+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by `=' and name/value pairs are separated from each other by `&'.

Forms in HTML documents

Form data is a stream of name=value pairs separated by the & character. Each name=value pair is URL encoded, i.e. spaces are changed into plusses and some characters are encoded into hexadecimal.
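The form encoding described in the two excerpts above can be sketched with urllib.parse.urlencode. The control names and values here are hypothetical, and this only shows Python's take on the format; it is not claimed to follow the HTML 4 text to the letter (for example, it does not normalize line breaks to CR LF).

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical form controls, listed in document order.
fields = [("q", "foo bar"), ("op", "1+1=2")]

# urlencode uses quote_plus by default: spaces become '+',
# while a literal '+' or '=' inside a value is percent-encoded.
body = urlencode(fields)
print(body)  # q=foo+bar&op=1%2B1%3D2

# Decoding reverses both steps and preserves the pair order.
decoded = parse_qsl(body)
print(decoded)  # [('q', 'foo bar'), ('op', '1+1=2')]
```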

The current URI RFC does not seem to say anything about this in particular, but an early RFC does.

Within the query string, the plus sign is reserved as shorthand
notation for a space. Therefore, real plus signs must be encoded.
This method was used to make query URIs easier to pass in systems
which did not allow spaces.

RFC 1630 - Universal Resource Identifiers in WWW: A Unifying Syntax for the Expression of Names and Addresses of Objects on the Network as used in the World-Wide Web
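The consequence of reserving the plus sign ("real plus signs must be encoded") is easy to see in a small sketch. It again assumes Python's urllib.parse, where unquote does plain percent-decoding and unquote_plus additionally maps '+' to a space.

```python
from urllib.parse import quote_plus, unquote, unquote_plus

# Plain percent-decoding leaves '+' alone.
plain = unquote("foo+bar")            # 'foo+bar'

# Form-style decoding treats '+' as shorthand for a space.
form_style = unquote_plus("foo+bar")  # 'foo bar'

# A real plus sign therefore has to be sent as %2B ...
encoded = quote_plus("C++")           # 'C%2B%2B'
round_trip = unquote_plus(encoded)    # 'C++'

# ... otherwise it is read back as spaces.
lost = unquote_plus("C++")            # 'C' followed by two spaces
```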

That said, the current RFC does note that, within the query string, it is sometimes better to leave certain characters un-percent-encoded.

However, as query components
are often used to carry identifying information in the form of
"key=value" pairs and one frequently used value is a reference to
another URI, it is sometimes better for usability to avoid percent-
encoding those characters.

RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
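As a rough illustration of that usability point, the sketch below encodes a URI carried as a query value in two ways. The key name and the example.com URI are made up, and the choice of "/", "?" and ":" as characters to leave alone is my assumption based on what the RFC 3986 grammar allows in a query component.

```python
from urllib.parse import quote

# Hypothetical URI to be carried as a query value.
target = "http://example.com/a?b"

# Percent-encode everything that is not unreserved.
strict = quote(target, safe="")

# Leave ':', '/' and '?' as they are, which keeps the value readable.
relaxed = quote(target, safe="/?:")

print("next=" + strict)   # next=http%3A%2F%2Fexample.com%2Fa%3Fb
print("next=" + relaxed)  # next=http://example.com/a?b
```

Characters such as '&' and '=' inside the value would still need encoding, since they delimit the key=value pairs.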