PostgreSQL: Convert Windows-1252 ISO-8859-1-superset characters to UTF-8

From FVue
Jump to: navigation, search

Problem

Characterset Windows-1252 has - compared to ISO-8859-1 - special displayable characters rather than control characters in the 0x80 to 0x9F range. When a Windows-1252 text is converted to Unicode with the source characterset specified as "ISO-8859-1", these special characters will not be converted to Unicode.

How can I convert these special characters to their Unicode equivalent? If I try this code underneath, I get an error:

database# UPDATE table SET text = REPLACE(text, '\x80', '€');
WARNING:  nonstandard use of escape in a string literal 
LINE 2:     text = REPLACE(text, '\x91', '€')
                                 ^
HINT:  Use the escape string syntax for escapes, e.g., E'\r\n'.
ERROR:  invalid byte sequence for encoding "UTF8": 0x91
HINT:  This error can also happen if the byte sequence does not match the encoding 
expected by the server, which is controlled by "client_encoding".

Environment

  • PostgreSQL 8.4, 9.3

Solution

Here's an UPDATE statement for the entire 0x80 to 0x9F range, replace mytable and myfield with yours:

UPDATE mytable
SET myfield =
    REPLACE( REPLACE( REPLACE( REPLACE(
    REPLACE( REPLACE( REPLACE( REPLACE(
    REPLACE( REPLACE( REPLACE( REPLACE(
    REPLACE( REPLACE( REPLACE( REPLACE(
    REPLACE( REPLACE( REPLACE( REPLACE(
    REPLACE( REPLACE( REPLACE( REPLACE(
    REPLACE( REPLACE( REPLACE( REPLACE(
    REPLACE( REPLACE( REPLACE( REPLACE(myfield,
    U&'\0080', '€'), U&'\0081', '' ), U&'\0082', '‚'), U&'\0083', 'ƒ'), 
    U&'\0084', '„'), U&'\0085', '…'), U&'\0086', '†'), U&'\0087', '‡'), 
    U&'\0088', 'ˆ'), U&'\0089', '‰'), U&'\008a', 'Š'), U&'\008b', '‹'), 
    U&'\008c', 'Œ'), U&'\008d', '' ), U&'\008e', 'Ž'), U&'\008f', '' ), 
    U&'\0090', '' ), U&'\0091', '‘'), U&'\0092', '’'), U&'\0093', '“'), 
    U&'\0094', '”'), U&'\0095', '•'), U&'\0096', '–'), U&'\0097', '—'), 
    U&'\0098', '˜'), U&'\0099', '™'), U&'\009a', 'š'), U&'\009b', '›'), 
    U&'\009c', 'œ'), U&'\009d', '' ), U&'\009e', 'ž'), U&'\009f', 'Ÿ');

If you get this error:

ERROR:  unsafe use of string constant with Unicode escapes
DETAIL:  String constants with Unicode escapes cannot be used when standard_conforming_strings is off.

First do this:

SET standard_conforming_strings=ON;

See also

http://en.wikipedia.org/wiki/Windows-1252

Comments

blog comments powered by Disqus