Standards, Environments, Macros, Character Sets, and miscellany
                                                              iconv_unicode(7)



NAME
       iconv_unicode - codeset conversion for Unicode

DESCRIPTION
       The  table below lists the names and descriptions of the supported Uni‐
       code encodings or encoding  schemes  (byte  serializations  of  Unicode
       encoding  forms)  that  can be used as fromcode or tocode parameters to
       iconv(1), iconv_open(3C), and cconv_open(3C). There  are  also  aliases
       such as FSS-UTF, UTF8, and so on.


       Available  iconv  and cconv conversions in the current system including
       aliases and optional variant levels can  be  obtained  by  running  the
       iconv -l command as described in the iconv(1) manual page.


       For  additional information on the mappings between canonical names and
       supported aliases with optional variant levels, refer to  the  alias(5)
       manual page and also the /usr/lib/iconv/alias file.


       Encoding Form          Description
       ---------------------  ------------------------------------------------
       UTF-8                  Multibyte sequences of 1-4 bytes per character
       UTF-16                 Represented in one 16-bit entity for
                              U+0000-U+D7FF and U+E000-U+FFFF, and two 16-bit
                              entities for U+10000-U+10FFFF. Uses the
                              platform's default byte ordering and includes
                              the Byte Order Mark (BOM). See below for a
                              description of the BOM.
       UTF-16-INTERNAL        UTF-16, without BOM
       UTF-16BE               UTF-16 in the big-endian byte ordering,
                              without BOM
       UTF-16-BIG-ENDIAN      UTF-16 in the big-endian byte ordering,
                              including BOM
       UTF-16LE               UTF-16 in the little-endian byte ordering,
                              without BOM
       UTF-16-LITTLE-ENDIAN   UTF-16 in the little-endian byte ordering,
                              including BOM
       UTF-16-SWAPPED         UTF-16 with endianness opposite to that of the
                              local platform, without BOM
       UTF-32                 Represented in a 32-bit entity in the platform's
                              default byte ordering; includes the BOM
       UTF-32-INTERNAL        UTF-32, without BOM
       UTF-32BE               UTF-32 in the big-endian byte ordering,
                              without BOM
       UTF-32-BIG-ENDIAN      UTF-32 in the big-endian byte ordering,
                              including BOM
       UTF-32-SWAPPED         UTF-32 with endianness opposite to that of the
                              local platform, without BOM
       UTF-32LE               UTF-32 in the little-endian byte ordering,
                              without BOM
       UTF-32-LITTLE-ENDIAN   UTF-32 in the little-endian byte ordering,
                              including BOM
       UCS-2                  Represented in a 16-bit entity for
                              U+0000-U+D7FF and U+E000-U+FFFF in the
                              platform's default byte ordering; includes the
                              BOM
       UCS-2-INTERNAL         UCS-2, without BOM
       UCS-2BE                UCS-2 in the big-endian byte ordering,
                              without BOM
       UCS-2-BIG-ENDIAN       UCS-2 in the big-endian byte ordering,
                              including BOM
       UCS-2LE                UCS-2 in the little-endian byte ordering,
                              without BOM
       UCS-2-LITTLE-ENDIAN    UCS-2 in the little-endian byte ordering,
                              including BOM
       UCS-2-SWAPPED          UCS-2 with endianness opposite to that of the
                              local platform, without BOM
       UCS-4                  Represented in a 32-bit entity in the platform's
                              default byte ordering; includes the BOM
       UCS-4-INTERNAL         UCS-4, without BOM
       UCS-4BE                UCS-4 in the big-endian byte ordering,
                              without BOM
       UCS-4-BIG-ENDIAN       UCS-4 in the big-endian byte ordering,
                              including BOM
       UCS-4LE                UCS-4 in the little-endian byte ordering,
                              without BOM
       UCS-4-LITTLE-ENDIAN    UCS-4 in the little-endian byte ordering,
                              including BOM
       UCS-4-SWAPPED          UCS-4 with endianness opposite to that of the
                              local platform, without BOM
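
       The UTF-16 entry distinguishes code points that fit in one 16-bit
       entity from those that need two. As a rough illustration (not the
       iconv implementation), the Python sketch below computes the 16-bit
       values for a single code point using the standard surrogate-pair
       arithmetic. The function name utf16_units is hypothetical.

```python
def utf16_units(code_point):
    """Return the list of 16-bit values that represent one code point in UTF-16."""
    # U+D800-U+DFFF are reserved for surrogates and are not characters.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % code_point)
    # U+0000-U+D7FF and U+E000-U+FFFF fit in a single 16-bit entity.
    if code_point <= 0xFFFF:
        return [code_point]
    # U+10000-U+10FFFF are split into a high and a low surrogate.
    offset = code_point - 0x10000
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]
```

       For instance, utf16_units(0x1F600) yields the pair 0xD83D, 0xDE00,
       which is then serialized in whichever byte ordering the chosen
       encoding scheme prescribes.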



       UCS, or Universal Character Set, refers to the ISO/IEC 10646 family of
       standards, whose character set is identical to that of Unicode.


       The Byte Order Mark, also known as the BOM (U+FEFF), is a special
       character at the beginning of a file or character stream that denotes
       the byte order of the subsequent characters. UCS-2, UTF-16, UTF-32,
       and UCS-4 files and character streams usually start with a BOM
       character to indicate the byte ordering used in the file or character
       stream.
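
       To make the role of the BOM concrete, the following Python sketch
       inspects the leading bytes of a buffer and reports the byte ordering
       they indicate. It is only an illustration: the names BOMS and
       sniff_bom are hypothetical, and the byte patterns come from the
       constants in Python's codecs module rather than from iconv.

```python
import codecs

# Longer signatures are tested first because the little-endian 32-bit BOM
# (FF FE 00 00) begins with the little-endian 16-bit BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32", "big"),
    (codecs.BOM_UTF32_LE, "UTF-32", "little"),
    (codecs.BOM_UTF16_BE, "UTF-16", "big"),
    (codecs.BOM_UTF16_LE, "UTF-16", "little"),
]

def sniff_bom(data):
    """Return (form, byte order, BOM length) or None if no BOM is found."""
    for bom, form, order in BOMS:
        if data.startswith(bom):
            return form, order, len(bom)
    return None
```

       A 16-bit little-endian stream whose first character after the BOM
       is U+0000 is indistinguishable from a 32-bit little-endian BOM, so
       a caller that already knows the encoding form should check only the
       signatures for that form.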


       UTF-8 to UTF-8 conversion simply moves bytes from the input buffer to
       the output buffer without doing any conversion. During the move, the
       bytes are checked for illegal characters to screen out any potentially
       harmful character bytes. Any such illegal character causes the
       conversion to fail.
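
       The effect of such a validating copy can be approximated in Python,
       whose strict UTF-8 decoder rejects overlong forms, surrogate code
       points, and truncated sequences. This is a sketch of the idea only;
       copy_utf8 is a hypothetical name, and the checks performed by the
       iconv module may differ in detail.

```python
def copy_utf8(src):
    """Return a copy of src, failing if it is not well-formed UTF-8."""
    # Strict decoding raises UnicodeDecodeError on the first illegal byte
    # sequence; the decoded text itself is discarded.
    src.decode("utf-8", errors="strict")
    # The bytes are passed through unchanged once they have been validated.
    return bytes(src)
```

       With this sketch, copy_utf8(b"\xc0\xaf") fails because C0 AF is an
       overlong encoding of "/", a classic example of a byte sequence used
       to slip past naive filters.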


       UTF-7, a legacy 7-bit Unicode Transformation Format, is only  supported
       by iconv conversions to and from UTF-8, UCS-2 and UCS-4.


       UTF-EBCDIC,  a  legacy EBCDIC-compatible variant of UTF-8, is only sup‐
       ported by iconv conversions to and from UTF-8.

NOTES
       iconv also supports conversion between Unicode encodings and many
       different codesets. The list of such codesets includes, for example,
       the ISO 8859 character sets, EBCDIC code pages, EUC (Extended UNIX
       Code) and ISO 2022 encodings for Chinese, Japanese, and Korean, and
       many others (see iconv_extra(7), iconv_ja(7), iconv_ko(7),
       iconv_zh(7), iconv_zh_HK(7), and iconv_zh_TW(7)).


       If a source character code value cannot be mapped to a valid character
       in the target codeset, it is considered either an illegal or a
       non-identical character. In the absence of explicit information about
       the source character code value, iconv code conversions use the
       following rules to determine whether a character is illegal or
       non-identical:


       If the source character code value is not within a range defined by the
       source codeset standard, it is considered an illegal character. If
       the source character code value is within the range(s) defined by the
       standard, it is considered non-identical, even if the source character
       code value maps to an undefined or a reserved location within the
       valid range. A non-identical character maps to ? (0x3f in
       ASCII-compatible codesets) if the target codeset is a non-Unicode
       codeset, or to the Unicode replacement character (U+FFFD) if the
       target codeset is a Unicode codeset.
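
       The decision rule above can be sketched in Python for a toy
       single-byte codeset. Everything here is hypothetical and chosen
       for illustration: the function convert_char, the valid_range and
       table parameters, and the use of ValueError to signal an illegal
       character are assumptions, not the iconv interface.

```python
REPLACEMENT_UNICODE = "\uFFFD"  # used when the target is a Unicode codeset
REPLACEMENT_OTHER = "?"         # 0x3f in ASCII-compatible codesets

def convert_char(code, valid_range, table, target_is_unicode):
    """Convert one source code value, applying the illegal/non-identical rule."""
    # Outside the range defined by the source codeset standard: illegal.
    if code not in valid_range:
        raise ValueError("illegal character: %#x" % code)
    # Inside the range and mapped: an ordinary conversion.
    if code in table:
        return table[code]
    # Inside the range but undefined or reserved: non-identical.
    return REPLACEMENT_UNICODE if target_is_unicode else REPLACEMENT_OTHER
```

       For a toy codeset with valid_range = range(0x100) and
       table = {0x41: "A"}, the value 0x81 is non-identical and converts
       to U+FFFD or ?, while 0x100 is rejected as illegal.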


       When the BOM is present as the first character in an encoding that
       supports it, it determines how the following Unicode character
       sequences are interpreted. If U+FEFF is not the first character in
       such an encoding, or appears in a Unicode encoding that does not
       support the BOM, it is interpreted as a Zero Width No-Break Space
       (ZWNBSP) character and does not affect the byte ordering used to
       interpret the Unicode characters.


       When the target codeset is one of UCS-2, UTF-16, UTF-32, UCS-4,
       UCS-2-BIG-ENDIAN, UCS-2-LITTLE-ENDIAN, UTF-16-BIG-ENDIAN,
       UTF-16-LITTLE-ENDIAN, UCS-4-BIG-ENDIAN, UCS-4-LITTLE-ENDIAN,
       UTF-32-BIG-ENDIAN, and UTF-32-LITTLE-ENDIAN, expect a BOM character
       at the beginning of the iconv code conversion output buffer.
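
       The shape of such output can be imitated in Python, purely as an
       illustration. The sketch below uses Python's own BOM-less codecs
       utf-16-be and utf-16-le and prepends the BOM explicitly, which is
       what the UTF-16-BIG-ENDIAN and UTF-16-LITTLE-ENDIAN rows of the
       table describe. The function name encode_utf16_with_bom is
       hypothetical, and Python codec names are not Solaris codeset names.

```python
import codecs
import sys

def encode_utf16_with_bom(text, byteorder=sys.byteorder):
    """Encode text as UTF-16 in the given byte order, with a leading BOM."""
    if byteorder == "big":
        # Big-endian BOM bytes FE FF, then big-endian 16-bit entities.
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    # Little-endian BOM bytes FF FE, then little-endian 16-bit entities.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```

       Leaving byteorder at its default mimics the plain UTF-16 row, which
       uses the platform's default byte ordering.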


       When the source codeset is UCS-2, UTF-16, UTF-32, or UCS-4 and no BOM
       is present as the first input character, the iconv code conversion
       assumes that the input byte stream uses the byte ordering of the
       current system.
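
       The input-side rules in these notes can be summarized by a small
       Python sketch for UTF-16: a leading BOM selects the byte ordering,
       a missing BOM falls back to the byte ordering of the current
       system, and U+FEFF anywhere else is kept as an ordinary character.
       The function name decode_utf16 is hypothetical, and the sketch
       relies on Python's codecs rather than on iconv.

```python
import codecs
import sys

def decode_utf16(data):
    """Decode UTF-16 bytes, honoring a leading BOM if there is one."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM as the first character: assume the current system's byte order.
    native = "utf-16-be" if sys.byteorder == "big" else "utf-16-le"
    return data.decode(native)
```

       Because only the first two bytes are examined, a later FE FF pair
       in a big-endian stream decodes to U+FEFF (ZWNBSP) and has no effect
       on how the remaining bytes are read.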

EXAMPLES
       Example 1 The iconv Library Module Filename



       In the conversion library, /usr/lib/iconv (see iconv(3C)), the  library
       module  filename  is composed of two symbolic elements separated by the
       percent sign (%). The first symbol specifies the source  codeset,  i.e.
       the  codeset  that  is being converted; the second symbol specifies the
       target codeset, i.e. the codeset to which the first one is  being  con‐
       verted.



       For  example,  the  library  module filename to convert from the legacy
       UTF-7 codeset to the UTF-8 codeset is UTF-7%UTF-8.so.

       Example 2 The cconv Library Module Filename



       For some conversions, iconv(3C) makes a call to the cconv(3C)
       interfaces to perform the conversion. The cconv conversion modules
       are binary tables with a .bt suffix, generated by geniconvtbl(1) and
       placed in the same /usr/lib/iconv library. The cconv library module
       filename is composed of the symbolic elements for the source and
       target codesets separated by the plus sign (+). The cconv conversion
       is typically performed in two steps, with UTF-32 as the intermediate
       encoding.



       For example, the cconv library module filename to convert from the
       Japanese EUC codeset to the UTF-32 codeset is eucJP+UTF-32.bt.
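
       The naming schemes in Examples 1 and 2 can be expressed as two
       small Python helpers. They only build the expected path strings
       from the codeset names as given and do not check whether a module
       exists or resolve aliases. The names ICONV_DIR, iconv_module, and
       cconv_table are hypothetical.

```python
ICONV_DIR = "/usr/lib/iconv"  # conversion library directory named in this page

def iconv_module(fromcode, tocode):
    """Path of the iconv module: source and target joined by a percent sign."""
    return f"{ICONV_DIR}/{fromcode}%{tocode}.so"

def cconv_table(fromcode, tocode):
    """Path of the cconv binary table: source and target joined by a plus sign."""
    return f"{ICONV_DIR}/{fromcode}+{tocode}.bt"
```

       Here iconv_module("UTF-7", "UTF-8") gives
       /usr/lib/iconv/UTF-7%UTF-8.so and cconv_table("eucJP", "UTF-32")
       gives /usr/lib/iconv/eucJP+UTF-32.bt, matching the two examples.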

FILES
       /usr/lib/iconv/*.so

           iconv conversion modules


       /usr/lib/iconv/*.bt

           cconv  code  conversion  binary tables for iconv(1), cconv(3C), and
           iconv(3C)


       /usr/lib/iconv/geniconvtbl/binarytables/*.bt

           geniconvtbl conversion binary tables


       /usr/lib/iconv/alias

           Alias table file of codeset names


SEE ALSO
       geniconvtbl(1), iconv(1), cconv(3C),  cconv_close(3C),  cconv_open(3C),
       cconvctl(3C), iconv(3C), iconv_close(3C), iconv_open(3C), iconvctl(3C),
       alias(5),    geniconvtbl-cconv(5),     iconv_extra(7),     iconv_ja(7),
       iconv_ko(7), iconv_zh(7), iconv_zh_HK(7), iconv_zh_TW(7)


       The  Unicode Consortium. The Unicode Standard, Version 6.2.0, (Mountain
       View, CA: The Unicode Consortium, 2012. ISBN 978-1-936213-07-8)


       Yergeau, F., UTF-8, a transformation format of Unicode and  ISO  10646,
       RFC 2044, Alis Technologies, October 1996.


       Ohta,  M.,  Character Sets ISO-10646 and ISO-10646-J-1, RFC 1815, Tokyo
       Institute of Technology, July 1995.


       Simonson, K., Character Mnemonics & Character Sets, RFC 1345,  Rationel
       Almen Planlaegning, June 1992.


       Goldsmith,  D., and M. Davis, UTF-7 - A Mail-Safe Transformation Format
       of Unicode, RFC 1642, Taligent, Inc., July 1994.



Oracle Solaris 11.4               11 May 2021                 iconv_unicode(7)