
Conversion of Unicode to ASCII NIGHTMARE


24 Comments; published Wed, 02 Jan 2008 20:41:00 GMT

Hi

I am reading from an Oracle database using cx_Oracle. I am writing to a
SQLite database using apsw.

The Oracle database is returning UTF-8 characters for European item
names, i.e. special characters from an ASCII perspective.

I get the following error:

> SQLiteCur.execute(sql, row)

> UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 12: ordinal not in range(128)

I have googled for several days now and still can't get it to encode to
ASCII.

I encode the SQL as follows:

sql = "insert into %s values %s" % (SQLiteTable, paramstr)

sql.encode('ascii', 'ignore')

I then encode each of the row values returned from Oracle like this:

row = map(encodestr, row)

SQLiteCur.execute(sql, row)

where encodestr is as follows:

import types
import unicodedata

def encodestr(item):
    if isinstance(item, types.StringTypes):
        # normalize, then drop anything ASCII cannot represent
        return unicodedata.normalize('NFKD', unicode(item, 'utf-8', 'ignore')).encode('ASCII', 'ignore')
    else:
        return item

I have tried a thousand similar functions to the above, and
permutations of the above from various Google searches. But I still get
the above exception on the line:

SQLiteCur.execute(sql, row)

and the exception is related to the data in one field.

In the end I resorted to using Oracle's CONVERT function in the SQL
statement, but I would like to understand why this is happening and why
it's so hard to convert the string in Python. I have read many
complaints about this from other people, some of whom have written
custom stripping routines. I haven't tried a custom routine yet, because
I think it should be possible in Python.

Thanks,

All Comments


  • 24 Comments
    • ChaosKCW wrote:

      > Hi

      > I am reading from an Oracle database using cx_Oracle. I am writing to a
      > SQLite database using apsw.

      > The Oracle database is returning UTF-8 characters for European item
      > names, i.e. special characters from an ASCII perspective.

      And does cx_Oracle return those as Unicode objects or as plain strings

      containing UTF-8 byte sequences? It's very important to distinguish

      between these two cases, and I don't have any experience with cx_Oracle

      to be able to give advice here.

      > I get the following error:

      > > SQLiteCur.execute(sql, row)

      > > UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 12: ordinal not in range(128)

      It looks like you may have Unicode objects that you're presenting to

      sqlite. In any case, with earlier versions of pysqlite that I've used,

      you need to connect with a special unicode_results parameter, although

      later versions should work with Unicode objects without special

      configuration. See here for a thread (in which I seem to have

      participated, coincidentally):

      http://mail.python.org/pipermail/py...une/107526.html

      > I have googled for several days now and still can't get it to encode to
      > ASCII.

      This is a tough thing to find out - whilst previous searches did

      uncover some discussions about it, I just tried and failed to find the

      enlightening documents - and I certainly didn't see many references to

      it on the official pysqlite site.

      Paul

      #1; Wed, 02 Jan 2008 20:42:00 GMT
    • ChaosKCW wrote:

      > Hi

      > I am reading from an Oracle database using cx_Oracle. I am writing to a
      > SQLite database using apsw.

      > The Oracle database is returning UTF-8 characters for European item
      > names, i.e. special characters from an ASCII perspective.

      I'm not sure that you are using those terms correctly. From your description

      below, it seems that your data is being returned from the Oracle database as

      unicode strings rather than regular strings containing UTF-8 encoded data. These

      European characters are not "special characters from an ASCII perspective;" they

      simply aren't characters in the ASCII character set at all.

      > I get the following error:

      >> SQLiteCur.execute(sql, row)

      >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 12: ordinal not in range(128)

      > I have googled for several days now and still can't get it to encode to
      > ASCII.

      Don't. You can't. Those characters don't exist in the ASCII character set.

      SQLite 3.0 deals with UTF-8 encoded SQL statements, though.

      http://www.sqlite.org/version3.html

      > I encode the SQL as follows:

      > sql = "insert into %s values %s" % (SQLiteTable, paramstr)

      > sql.encode('ascii', 'ignore')

      The .encode() method returns a new value; it does not change an object in place.

      sql = sql.encode('utf-8')
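
      A quick illustration (Python 2):

      sql = u'caf\xe9'
      sql.encode('utf-8')        # returns a new byte string; the result is discarded here
      sql = sql.encode('utf-8')  # rebinding is needed to keep the encoded bytes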

      --

      Robert Kern

      robert.kern.python.questionfor.info.gmail.com

      "I have come to believe that the whole world is an enigma, a harmless enigma

      that is made terrible by our own mad attempt to interpret it as though it had

      an underlying truth."

      -- Umberto Eco

      #2; Wed, 02 Jan 2008 20:43:00 GMT
    • > Don't. You can't. Those characters don't exist in the ASCII character set.

      > SQLite 3.0 deals with UTF-8 encoded SQL statements, though.

      That is not entirely correct - one can, if losing information is OK. The OP's
      code normalizes the text to NFKD: an umlaut like ä is decomposed into a
      two-character sequence basically saying "a with two dots on top". With
      'ignore' specified as the errors parameter to the encoder, the result should
      be the letter a.
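
      A minimal Python 2 sketch of that decomposition, assuming the input is
      already a unicode object:

      import unicodedata

      u = u'\xe4'                               # LATIN SMALL LETTER A WITH DIAERESIS
      nfkd = unicodedata.normalize('NFKD', u)   # u'a' plus U+0308 COMBINING DIAERESIS
      print nfkd.encode('ascii', 'ignore')      # the combining mark is dropped: prints 'a'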

      Regards,

      Diez

      #3; Wed, 02 Jan 2008 20:44:00 GMT
    • Oh, and it occurs to me, as I seem to have mentioned a document about

      PgSQL rather than pysqlite (although they both have the same principal

      developer), that you might need to investigate the client_encoding

      parameter when setting up your connection. The following message gives

      some information (but not much):

      http://groups.google.com/group/comp...27fa9866c9b7b5f

      Sadly, I can't find the information about getting result values as

      Unicode objects, but I believe it involves some kind of SQL comment

      that you send to the database system which actually tells pysqlite to

      change its behaviour.

      Paul

      #4; Wed, 02 Jan 2008 20:45:00 GMT
    • "Paul Boddie" <paul.python.questionfor.info.boddie.org.uk> wrote in message news:1144081137.137744.253790.python.questionfor.info.i39g2000cwa.googlegr oups.com...

      > It looks like you may have Unicode objects that you're presenting to

      > sqlite. In any case, with earlier versions of pysqlite that I've used,

      > you need to connect with a special unicode_results parameter,

      He is using apsw. apsw correctly handles unicode. In fact it won't

      accept a str with bytes >127 as they will be an unknown encoding and

      SQLite only uses Unicode internally. It does have a blob type

      using buffer for situations where binary data needs to be stored.

      pysqlite's mishandling of Unicode is one of the things that drove

      me to writing apsw in the first place.

      Roger

      #5; Wed, 02 Jan 2008 20:46:00 GMT
    • Hi

      Thanks for all the posts. I am still digesting it all but here are my

      initial comments.

      >Don't. You can't. Those characters don't exist in the ASCII character set.

      >SQLite 3.0 deals with UTF-8 encoded SQL statements, though.

      >http://www.sqlite.org/version3.html

      As mentioned by the next poster, there is a way: it's supposed to be encoded
      with the 'ignore' option. Thus you lose data, but that's just dandy with
      me. As for SQLite supporting Unicode, it probably does, but something
      on the Python side (probably in apsw) converts it to ASCII at some
      point before it's handed to SQLite.

      >The .encode() method returns a new value; it does not change an object inplace.

      > sql = sql.encode('utf-8')

      Ah yes, big mistake on my part :-/

      >He is using apsw. apsw correctly handles unicode. In fact it won't

      >accept a str with bytes >127 as they will be an unknown encoding and

      >SQLite only uses Unicode internally. It does have a blob type

      >using buffer for situations where binary data needs to be stored.

      >pysqlite's mishandling of Unicode is one of the things that drove

      >me to writing apsw in the first place.

      OK, if SQLite uses Unicode internally, why do you need to ignore
      everything greater than 127? The ASCII table (the 256-character one) fits into
      Unicode just fine as far as I recall? Or did I miss the boat here?

      Thanks,

      #6; Wed, 02 Jan 2008 20:47:00 GMT
    • "Thus you lose data, but thats just dandy with

      me.": Please reconsider this attitude, before you perpetrate a nonsense

      or even a disaster.

      Wrt your last para:

      1. Roger didn't say "ignore" -- he said "won't accept" (a major

      difference).

      2. The ASCII code comprises 128 characters, *NOT* 256.

      3. What Roger means is: given a Python 8-bit string and no other

      information, you don't have a clue what the encoding is. Most codes of

      interest these days have the ASCII code (or a mild perversion thereof)

      in the first 128 positions, but it's anyone's guess what the writer of

      the string had in mind with the next 128.
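
      A short Python 2 illustration of point 3, decoding the byte from the OP's
      traceback under two arbitrarily chosen encodings:

      s = '\xdc'
      print repr(s.decode('iso-8859-1'))  # u'\xdc'   -- LATIN CAPITAL LETTER U WITH DIAERESIS
      print repr(s.decode('cp1251'))      # u'\u042c' -- CYRILLIC CAPITAL LETTER SOFT SIGN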

      #7; Wed, 02 Jan 2008 20:48:00 GMT
    • Roger Binns wrote:

      > "Paul Boddie" <paul.python.questionfor.info.boddie.org.uk> wrote in message news:1144081137.137744.253790.python.questionfor.info.i39g2000cwa.googlegr oups.com...

      >>It looks like you may have Unicode objects that you're presenting to

      >>sqlite. In any case, with earlier versions of pysqlite that I've used,

      >>you need to connect with a special unicode_results parameter,

      > He is using apsw. apsw correctly handles unicode. In fact it won't

      > accept a str with bytes >127 as they will be an unknown encoding and

      > SQLite only uses Unicode internally. It does have a blob type

      > using buffer for situations where binary data needs to be stored.

      > pysqlite's mishandling of Unicode is one of the things that drove

      > me to writing apsw in the first place.

      Ah, I misread the OP's traceback.

      Okay, the OP is getting regular strings, which are probably encoded in

      ISO-8859-1 if I had to guess, from the Oracle DB. He is trying to pass them in

      to SQLiteCur.execute() which tries to make a unicode string from the input:

      In [1]: unicode('\xdc')
      ---------------------------------------------------------------------------
      exceptions.UnicodeDecodeError             Traceback (most recent call last)

      /Users/kern/<ipython console>

      UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 0: ordinal not in range(128)

      *Now*, my advice to the OP is to figure out the encoding of the strings that are

      being returned from Oracle. As I said, ISO-8859-1 is probably a good guess.

      Then, he would *decode* the string to a unicode string using the encoding. E.g.:

      row = row.decode('iso-8859-1')

      Then everything should be peachy. I hope.
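
      Applied to the OP's loop, a minimal sketch (decoderow is a hypothetical
      helper, and iso-8859-1 is only a guess at the real encoding):

      def decoderow(row, encoding='iso-8859-1'):
          # decode each byte-string field to unicode; leave numbers, dates, etc. alone
          decoded = []
          for field in row:
              if isinstance(field, str):
                  field = field.decode(encoding)
              decoded.append(field)
          return decoded

      SQLiteCur.execute(sql, decoderow(row))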

      --

      Robert Kern

      robert.kern.python.questionfor.info.gmail.com

      "I have come to believe that the whole world is an enigma, a harmless enigma

      that is made terrible by our own mad attempt to interpret it as though it had

      an underlying truth."

      -- Umberto Eco

      #8; Wed, 02 Jan 2008 20:49:00 GMT
    • Robert Kern wrote:

      > Roger Binns wrote:

      > > "Paul Boddie" <paul.python.questionfor.info.boddie.org.uk> wrote

      > >>It looks like you may have Unicode objects that you're presenting to

      > >>sqlite. In any case, with earlier versions of pysqlite that I've used,

      > >>you need to connect with a special unicode_results parameter,

      Note that I've since mentioned client_encoding which seems to matter

      for pysqlite 1.x.

      > > He is using apsw. apsw correctly handles unicode. In fact it won't

      > > accept a str with bytes >127 as they will be an unknown encoding and

      > > SQLite only uses Unicode internally. It does have a blob type

      > > using buffer for situations where binary data needs to be stored.

      > > pysqlite's mishandling of Unicode is one of the things that drove

      > > me to writing apsw in the first place.

      For pysqlite 2.x, it appears that Unicode objects can be handed

      straight to the API methods, and I'd be interested to hear about your

      problems with pysqlite, Unicode and what actually made you write apsw

      instead.

      > Ah, I misread the OP's traceback.

      > Okay, the OP is getting regular strings, which are probably encoded in

      > ISO-8859-1 if I had to guess, from the Oracle DB. He is trying to pass them in

      > to SQLiteCur.execute() which tries to make a unicode string from the input:

      [...]

      There's an Oracle environment variable that appears to make a

      difference: NLS_CHARSET, perhaps - it's been a while since I've had to

      deal with Oracle, and I'm not looking for another adventure into

      Oracle's hideous documentation to find out.

      > *Now*, my advice to the OP is to figure out the encoding of the strings that are

      > being returned from Oracle. As I said, ISO-8859-1 is probably a good guess.

      > Then, he would *decode* the string to a unicode string using the encoding. E.g.:

      > row = row.decode('iso-8859-1')

      > Then everything should be peachy. I hope.

      Yes, just find out what Oracle wants first, then set it all up, noting

      that without looking into the Oracle wrapper being used, I can't

      suggest an easier way.

      Paul

      #9; Wed, 02 Jan 2008 20:50:00 GMT
    • "ChaosKCW" <da.martian.python.questionfor.info.gmail.com> wrote in message news:1144150561.389856.302670.python.questionfor.info.v46g2000cwv.googlegr oups.com...

      > me. As for SQLite supporting Unicode, it probably does,

      No, SQLite *ONLY* supports Unicode. It will *only* accept

      strings in Unicode and only produces strings in Unicode.

      All the functionality built into SQLite such as comparison

      operators operate only on Unicode strings.

      > but something
      > on the Python side (probably in apsw) converts it to ASCII at some
      > point before it's handed to SQLite.

      No. APSW converts it *to* Unicode. SQLite only accepts Unicode

      so a Unicode string has to be supplied. If you supply a non-Unicode

      string then conversion has to happen. APSW asks Python to

      supply the string in Unicode. If Python can't do that (eg

      it doesn't know the encoding) then you get an error.

      I strongly recommend reading this:

      The Absolute Minimum Every Software Developer Absolutely,

      Positively Must Know About Unicode and Character Sets

      http://www.joelonsoftware.com/articles/Unicode.html

      > OK, if SQLite uses Unicode internally, why do you need to ignore
      > everything greater than 127?

      I never said that. I said that a special case is made so that

      if the string you supply only contains ASCII characters (ie <=127)

      then the ASCII string is converted to Unicode. (In fact it is

      valid UTF-8 hence the shortcut).

      > the ASCII table (the 256-character one) fits into
      > Unicode just fine as far as I recall?

      No, ASCII characters have defined Unicode codepoints. The ASCII
      character numbers just happen to be the same as the Unicode
      codepoints. But there are only 128 ASCII characters.

      > Or did I miss the boat here?

      For bytes greater than 127, what character set is used? There

      are hundreds of character sets that define those characters.

      You have to tell the computer which one to use. See the Unicode

      article referenced above.
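
      To illustrate the shortcut and its limit (Python 2):

      u = 'plain ascii'.decode('utf-8')  # pure ASCII bytes are valid UTF-8: succeeds
      u = '\xdc'.decode('utf-8')         # a lone byte > 127 is not a complete UTF-8
                                         # sequence: raises UnicodeDecodeError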

      Roger

      #10; Wed, 02 Jan 2008 20:51:00 GMT
    • Roger Binns wrote:

      > SQLite only accepts Unicode so a Unicode string has to be supplied.

      fact or FUD? let's see:

      import pysqlite2.dbapi2 as DB

      db = DB.connect("test.db")

      cur = db.cursor()

      cur.execute("create table if not exists test (col text)")

      cur.execute("insert into test values (?)", ["this is an ascii string"])

      cur.execute("insert into test values (?)", [u"this is a unicode string"])

      cur.execute("insert into test values (?)", [u"ths s unicde strng"])

      cur.execute("select * from test")

      for row in cur.fetchall():
          print row

      prints

      (u'this is an ascii string',)

      (u'this is a unicode string',)

      (u'th\xefs \xefs \xf6 unic\xf6de str\xefng',)

      which is correct behaviour under Python's Unicode model.

      </F>

      #11; Wed, 02 Jan 2008 20:52:00 GMT
    • "Fredrik Lundh" <fredrik.python.questionfor.info.pythonware.com> wrote in message news:mailman.4102.1144215505.27775.python-list.python.questionfor.info.python.org...

      > Roger Binns wrote:

      >> SQLite only accepts Unicode so a Unicode string has to be supplied.

      > fact or FUD? let's see:

      Note I said SQLite. For APIs that take/give strings, you can either
      supply/get a UTF-8 encoded sequence of bytes, or a two-bytes-per-character
      sequence in host byte order. Any wrapper of SQLite that doesn't do
      Unicode in/out is seriously breaking things.

      I ended up using the UTF-8 versions of the API as Python can't quite

      make its mind up how to represent Unicode strings at the C api level.

      You can have two bytes per char or four, and the handling/production

      of byte order markers isn't that clear either.

      > import pysqlite2.dbapi2 as DB

      pysqlite had several unicode problems in the past. It has since

      been cleaned up as you saw.

      Roger

      #12; Wed, 02 Jan 2008 20:53:00 GMT
    • Roger Binns wrote:

      > > fact or FUD? let's see:

      > Note I said SQLite. For APIs that take/give strings, you can either
      > supply/get a UTF-8 encoded sequence of bytes, or a two-bytes-per-character
      > sequence in host byte order. Any wrapper of SQLite that doesn't do
      > Unicode in/out is seriously breaking things.

      > I ended up using the UTF-8 versions of the API as Python can't quite

      > make its mind up how to represent Unicode strings at the C api level.

      > You can have two bytes per char or four, and the handling/production

      > of byte order markers isn't that clear either.

      sounds like your understanding of Unicode and Python's Unicode system

      is a bit unclear.

      </F>

      #13; Wed, 02 Jan 2008 20:54:00 GMT
    • "Fredrik Lundh" <fredrik.python.questionfor.info.pythonware.com> wrote in message news:mailman.4114.1144239972.27775.python-list.python.questionfor.info.python.org...

      > sounds like your understanding of Unicode and Python's Unicode system

      > is a bit unclear.

      Err, no. Relaying unicode data between two disparate

      C APIs requires being careful and thorough. That means

      paying attention to when conversions happen, byte ordering

      (compile time) and BOMs (run time), and when the API

      documentation isn't thorough, verifying the behaviour

      yourself. That requires a very clear understanding of

      Unicode in order to do the requisite test cases, as well

      as reading what the code does.

      Roger

      #14; Wed, 02 Jan 2008 20:55:00 GMT
    • Roger Binns wrote:

      > "Fredrik Lundh" <fredrik.python.questionfor.info.pythonware.com> wrote in message news:mailman.4102.1144215505.27775.python-list.python.questionfor.info.python.org...

      > > Roger Binns wrote:

      > >> SQLite only accepts Unicode so a Unicode string has to be supplied.

      > > fact or FUD? let's see:

      > Note I said SQLite. For APIs that take/give strings, you can either
      > supply/get a UTF-8 encoded sequence of bytes, or a two-bytes-per-character
      > sequence in host byte order. Any wrapper of SQLite that doesn't do
      > Unicode in/out is seriously breaking things.

      > I ended up using the UTF-8 versions of the API as Python can't quite

      > make its mind up how to represent Unicode strings at the C api level.

      > You can have two bytes per char or four, and the handling/production

      > of byte order markers isn't that clear either.

      I have an impression that handling/production of byte order marks is

      pretty clear: they are produced/consumed only by two codecs: utf-16 and

      utf-8-sig. What is not clear?

      Serge

      #15; Wed, 02 Jan 2008 20:56:00 GMT
    • "Serge Orlov" <Serge.Orlov.python.questionfor.info.gmail.com> wrote in message news:1144295335.353840.322190.python.questionfor.info.i40g2000cwc.googlegr oups.com...

      > I have an impression that handling/production of byte order marks is

      > pretty clear: they are produced/consumed only by two codecs: utf-16 and

      > utf-8-sig. What is not clear?

      Are you talking about the C APIs in Python/SQLite (that is what I

      have been discussing) or the language level?

      At the C level, SQLite doesn't accept BOMs. You have to
      provide UTF-8, or two-bytes-per-character UTF-16 in host
      byte order.

      Roger

      #16; Wed, 02 Jan 2008 20:57:00 GMT
    • Roger Binns wrote:

      > "Serge Orlov" <Serge.Orlov.python.questionfor.info.gmail.com> wrote in message news:1144295335.353840.322190.python.questionfor.info.i40g2000cwc.googlegr oups.com...

      > > I have an impression that handling/production of byte order marks is

      > > pretty clear: they are produced/consumed only by two codecs: utf-16 and

      > > utf-8-sig. What is not clear?

      > Are you talking about the C APIs in Python/SQLite (that is what I

      > have been discussing) or the language level?

      Both. The documentation for PyUnicode_DecodeUTF16 and PyUnicode_EncodeUTF16
      is pretty clear about when a BOM is produced/removed. The only problem is that
      you have to find out host endianness yourself. In Python it's
      sys.byteorder; in C you use a hack like

      unsigned long one = 1;
      endianness = ((*(char *) &one) == 0) ? 1 : -1;

      And then pass endianness to PyUnicode_(De/En)codeUTF16. So I still don't

      see what is unclear about BOM production/handling.
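
      For instance, at the Python level (assuming a little-endian host):

      import sys
      assert sys.byteorder == 'little'
      print repr(u'\xdc'.encode('utf-16'))     # '\xff\xfe\xdc\x00' -- BOM, then data
      print repr(u'\xdc'.encode('utf-16-le'))  # '\xdc\x00' -- explicit order, no BOM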

      > At the C level, SQLite doesn't accept BOMs.

      It would be surprising if it did. Quote from

      <http://www.unicode.org/faq/utf_bom.html>: "Where the data is typed,

      such as a field in a database, a BOM is unnecessary"

      #17; Wed, 02 Jan 2008 20:58:00 GMT
    • Roger Binns wrote:

      > No. APSW converts it *to* Unicode. SQLite only accepts Unicode

      > so a Unicode string has to be supplied. If you supply a non-Unicode

      > string then conversion has to happen. APSW asks Python to

      > supply the string in Unicode. If Python can't do that (eg

      > it doesn't know the encoding) then you get an error.

      If what you say is true, I have to ask why I get a conversion error
      which states it can't convert to ASCII, not that it can't convert to Unicode?

      > > OK, if SQLite uses Unicode internally, why do you need to ignore
      > > everything greater than 127?

      > I never said that. I said that a special case is made so that

      > if the string you supply only contains ASCII characters (ie <=127)

      > then the ASCII string is converted to Unicode. (In fact it is

      > valid UTF-8 hence the shortcut).

      > > the ASCII table (the 256-character one) fits into
      > > Unicode just fine as far as I recall?

      > > No, ASCII characters have defined Unicode codepoints. The ASCII
      > > character numbers just happen to be the same as the Unicode
      > > codepoints. But there are only 128 ASCII characters.

      > > Or did I miss the boat here?

      > For bytes greater than 127, what character set is used? There

      > are hundreds of character sets that define those characters.

      > You have to tell the computer which one to use. See the Unicode

      > article referenced above.

      Yes, I know there are a million "extended" ASCII character sets, which
      happen to be the bane of all existence. Most computers deal in bytes
      natively, and the 7-bit coding still causes problems to this day. But
      since the error I get is a conversion error to ASCII, not from ASCII,
      I am willing to accept loss of information. You can't encode Unicode into
      ASCII without loss of information or multi-character codes. In my mind,
      somewhere inside the cursor.execute function, it converts to ASCII. I
      say this because of the error message received. So I am missing how a
      function which supposedly converts everything to Unicode ends up doing
      an ASCII conversion?

      #18; Wed, 02 Jan 2008 20:59:00 GMT
    • >

      > There's an Oracle environment variable that appears to make a

      > difference: NLS_CHARSET, perhaps - it's been a while since I've had to

      > deal with Oracle, and I'm not looking for another adventure into

      > Oracle's hideous documentation to find out.

      That is an EVIL setting which should not be used. The NLS_CHARSET
      environment variable causes so many headaches that it's not worth playing
      with it at all.

      #19; Wed, 02 Jan 2008 21:00:00 GMT
    • ChaosKCW wrote:

      > Roger Binns wrote:

      > > No. APSW converts it *to* Unicode. SQLite only accepts Unicode

      > > so a Unicode string has to be supplied. If you supply a non-Unicode

      > > string then conversion has to happen. APSW asks Python to

      > > supply the string in Unicode. If Python can't do that (eg

      > > it doesn't know the encoding) then you get an error.

      > If what you say is true, I have to ask why I get a conversion error
      > which states it can't convert to ASCII, not that it can't convert to Unicode?

      You do get an error about conversion to Unicode. Quote from your message:

      > SQLiteCur.execute(sql, row)

      > UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 12: ordinal not in range(128)

      Notice the name of the error: UnicodeDecode, or in other words,
      to-Unicode.

      > So I am missing how a
      > function which supposedly converts everything to Unicode ends up doing
      > an ASCII conversion?

      When Python tries to concatenate a byte string and a unicode string, it
      assumes that the byte string is encoded ASCII and tries to convert from
      encoded ASCII to unicode. It calls the ascii decoder to do the decoding. If
      decoding fails, you see the message from the ascii decoder about the error.
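
      A one-line Python 2 demonstration of that implicit decode:

      u'x' + '\xdc'   # the str operand is run through the ascii decoder first, so this
                      # raises: UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc
                      # in position 0: ordinal not in range(128)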

      Serge

      #20; Wed, 02 Jan 2008 21:01:00 GMT
    • ChaosKCW wrote:

      > > There's an Oracle environment variable that appears to make a

      > > difference: NLS_CHARSET, perhaps - it's been a while since I've had to

      > > deal with Oracle, and I'm not looking for another adventure into

      > > Oracle's hideous documentation to find out.

      > That is an EVIL setting which should not be used. The NLS_CHARSET
      > environment variable causes so many headaches that it's not worth playing
      > with it at all.

      Well, at this very point in time I don't remember the preferred way of

      getting Oracle, the client libraries and the database adapter to agree

      on the character encoding used for communicating data between

      applications and the database system. Nevertheless, what you need to do

      is to make sure that you know which encoding is used so that if you

      either get plain strings (i.e. not Unicode objects) out of the database,

      or if you need to write plain strings to the database, you can provide

      the encoding to the unicode built-in function or to the decode/encode

      methods; this is much better than just stripping out characters that

      can't be represented by ASCII.
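
      For instance (Python 2, with 'iso-8859-1' standing in for whatever encoding
      the connection is actually configured to use):

      raw = '\xdc'                       # byte string read from the database
      text = unicode(raw, 'iso-8859-1')  # or raw.decode('iso-8859-1')
      back = text.encode('iso-8859-1')   # byte string to write back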

      Anyway, despite my objections to digging through Oracle documentation,

      I found the following useful documents: the "Globalization Support"

      index [1], an FAQ about NLS_LANG [2], and a white paper about Unicode

      support in Oracle [3]. It may well be the case that NLS_LANG might help

      you do what you want, but since the database systems I have installed

      (PostgreSQL, sqlite3) seem to "do Unicode" without such horsing around,

      I'm not really able to offer much more advice on this subject.

      Paul

      [1]

      http://www.oracle.com/technology/te...tion/index.html

      [2]

      http://www.oracle.com/technology/te...htdocs/nls_langfaq.htm

      [3]

      http://www.oracle.com/technology/te...icode_10gR2.pdf

      #21; Wed, 02 Jan 2008 21:02:00 GMT
    • > When python tries to concatenate a byte string and a unicode string, it

      > assumes that the byte string is encoded ascii and tries to convert from

      > encoded ascii to unicode. It calls ascii decoder to do the decoding. If

      > decoding fails you see message from ascii decoder about the error.

      > Serge

      OK, I get it now. Sorry for the slowness. I have to say, as a lover of
      Python for its simplicity and clarity, the character set thing has been
      harder to figure out than I would have liked.

      Thanks for all the help.

      #22; Wed, 02 Jan 2008 21:03:00 GMT
    • ChaosKCW wrote:

      > OK, I get it now. Sorry for the slowness. I have to say, as a lover of
      > Python for its simplicity and clarity, the character set thing has been
      > harder to figure out than I would have liked.

      I think there is room for improvement here. In my opinion the message
      is too confusing for newbies. It would be easier for them if there were a
      mini tutorial available about what's going on, with links to other,
      broader tutorials (like a Unicode tutorial). Instead of the generic error

      -------

      UnicodeDecodeError: 'ascii' codec can't decode byte 0xa5 in position 0:

      ordinal not in range(128)

      -------

      It could be like this. Notice the special URL: it points to a
      non-existing :) tutorial about why concatenating byte strings with
      unicode strings can produce a UnicodeDecodeError

      -------

      UnicodeDecodeError: 'ascii' codec can't decode byte 0xa5 in position 0:
      ordinal not in range(128)

      For additional information about this exception see:

      http://docs.python.org/2.4/exceptio...DecodeError+str

      -------

      Here is sample code showing how it could be done:

      --------

      extended_help = {
          ("concat", UnicodeDecodeError, str, unicode):
              "http://docs.python.org/2.4/exceptions/concat+UnicodeDecodeError+str",
          ("concat", UnicodeDecodeError, unicode, str):
              "http://docs.python.org/2.4/exceptions/concat+UnicodeDecodeError+str",
      }

      def get_more_help(error, key):
          if not extended_help.has_key(key):
              return
          error.reason += "\nFor additional information about this exception see:\n "
          error.reason += extended_help[key]

      def concat(s1, s2):
          try:
              return s1 + s2
          except Exception, e:
              key = "concat", e.__class__, type(s1), type(s2)
              get_more_help(e, key)
              raise

      concat(chr(0xA5), unichr(0x5432))

      #23; Wed, 02 Jan 2008 21:04:00 GMT
    • "Serge Orlov" <Serge.Orlov.python.questionfor.info.gmail.com> wrote in message news:1144718544.134817.126720.python.questionfor.info.i39g2000cwa.googlegr oups.com...

      > It can be like this. Notice special url, it is pointing to a

      > non-existing :) tutorial about why concatenating byte strings with

      > unicode strings can produce UnicodeDecodeError

      An alternative is to give an error code that people can use to
      look things up (and that users can report). It also has the advantage
      of being language-neutral. You can see this kind of approach
      on IBM systems (mainframe, AIX, etc.) and Oracle.

      Roger

      #24; Wed, 02 Jan 2008 21:05:00 GMT