Python pretty-printers and non-ASCII strings do not play well together :-(
Paul Pluzhnikov
2008-11-04 19:28:34 UTC
Greetings,

Consider this source:
--- simple.c ---
#include <string.h>

int main()
{
  union U {
    char s[sizeof(int)];
    int x;
  } u, v;
  strcpy(u.s, "abc");
  v.x = 0xABCDEF;
  return 0; // break here
}
--- simple.c ---

--- simple.py ---
def pp_u(val):
    return "<" + str(val['s']) + ">"

gdb.cli_pretty_printers['^union U$'] = pp_u
--- simple.py ---

(gdb) b 11
Breakpoint 1 at 0x40032e: file simple.c, line 11.
(gdb) r

Breakpoint 1, main () at simple.c:11
11 return 0; // break here
(gdb) python execfile('simple.py')
(gdb) print u
$1 = <"abc">

Good so far...
But:

(gdb) print v
$2 = Traceback (most recent call last):
  File "simple.py", line 2, in pp_u
    return "<" + str(val['s']) + ">"
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-3: ordinal not in range(128)
Traceback (most recent call last):
  File "simple.py", line 2, in pp_u
    return "<" + str(val['s']) + ">"
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-3: ordinal not in range(128)
{s = "ïÍ«", x = 11259375}

Not so good :(
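
For reference, the bytes that end up in v.s can be reproduced outside GDB. This is only a sketch, and it assumes a little-endian target with a 4-byte int and an ISO-8859-1 charset (neither is stated above, though both match the output shown):

```python
import struct

# Assumption: little-endian target with sizeof(int) == 4.
raw = struct.pack("<i", 0xABCDEF)       # the storage shared by v.x and v.s

# v.x reads the four bytes back as an integer.
as_int = struct.unpack("<i", raw)[0]

# v.s reads the same bytes as a C string: everything up to the first NUL.
as_text = raw.split(b"\x00", 1)[0].decode("iso-8859-1")

print(repr(raw), as_int, as_text)
```

Under those assumptions the string part is the three bytes 0xef 0xcd 0xab, which is why the plain GDB output shows three non-ASCII characters next to x = 11259375.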

I've attempted to fix this, but my Python-Fu is not yet up to the
task, and I couldn't find any good references on Python/Unicode/C-API.

What are some of the good Python references?

Thanks,
--
Paul Pluzhnikov
Tom Tromey
2008-11-04 19:43:35 UTC
Paul> UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-3: ordinal not in range(128)

Paul> Not so good :(

Yeah.

What should happen here, though? The string contains invalid
characters for its declared (via set target-charset) encoding.

Paul> I've attempted to fix this, but my Python-Fu is not yet up to the
Paul> task, and I couldn't find any good references on Python/Unicode/C-API.

Paul> What are some of the good Python references?

I've been using info pages that I built from the Python sources -- but
only because I prefer using info when possible.

The same stuff is on python.org, e.g.:

http://www.python.org/doc/2.5.2/ext/contents.html

or

http://www.python.org/doc/2.5.2/api/api.html

IME the Python C API docs are spotty. I spend a fair amount of time
looking through Google code search :(

Tom
Daniel Jacobowitz
2008-11-04 19:56:42 UTC
Post by Tom Tromey
What should happen here, though? The string contains invalid
characters for its declared (via set target-charset) encoding.
IMO, what happens in GDB: Convert them to escape codes.
--
Daniel Jacobowitz
CodeSourcery
Paul Pluzhnikov
2008-11-04 19:59:54 UTC
Post by Tom Tromey
Paul> UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-3: ordinal not in range(128)
Paul> Not so good :(
Yeah.
What should happen here, though? The string contains invalid
characters for its declared (via set target-charset) encoding.
As an end-user, I would expect something like

$2 = <"\xef\xcd\xab">

or perhaps the same thing GDB prints without Python:

$2 = <"ïÍ«">
Post by Tom Tromey
Paul> What are some of the good Python references?
http://www.python.org/doc/2.5.2/api/api.html
Yes, I've seen the above, but it didn't have the answers I was
looking for :(

Thanks,
--
Paul Pluzhnikov
Tom Tromey
2008-11-05 00:58:59 UTC
Tom> What should happen here, though? The string contains invalid
Tom> characters for its declared (via set target-charset) encoding.

Paul> As an end-user, I would expect something like
Paul> $2 = <"\xef\xcd\xab">

It occurs to me I am not completely certain where this error
originates. My theory is that it is the call to PyUnicode_Decode in
valpy_str.

If so, then we aren't seeing a value representation problem, which is
what I was worried about. Instead, I think common_val_print is
emitting a string which is not actually valid according to
host_charset. That seems wrong.

We could work around this in valpy_str, I suppose. But I'm curious to
know why this is happening -- why isn't common_val_print printing the
escape sequences itself?

My guess is that the target and host charsets are the same, and
charset.c is passing characters through without checking them for
validity. I didn't debug it, but when I set host-charset to ASCII (my
target-charset is ISO-8859-1), I do see the escapes.
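
The observation above (pass-through when the charsets match, escapes when the host charset is ASCII) can be modelled in a few lines of Python. This is a sketch of the idea only: convert_for_host is a made-up name, it does not reproduce what charset.c actually does, and GDB's real escape format may differ from Python's backslashreplace output.

```python
def convert_for_host(raw, target_charset, host_charset):
    """Model a target-to-host conversion that escapes what the host lacks.

    Hypothetical helper for illustration; not GDB's implementation.
    """
    text = raw.decode(target_charset)
    # Characters the host charset cannot represent become \xNN escapes.
    return text.encode(host_charset, "backslashreplace")


same = convert_for_host(b"\xef\xcd\xab", "iso-8859-1", "iso-8859-1")
ascii_host = convert_for_host(b"\xef\xcd\xab", "iso-8859-1", "ascii")
print(repr(same), repr(ascii_host))
```

With identical charsets the bytes come out untouched; with an ASCII host every byte is replaced by an escape, which matches the behaviour described in the paragraph above.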

Every time I look at this stuff I'm reminded that the gdb charset code
could use a good scrubbing. For example, the default host charset
ought to come from the locale settings. I have a patch to implement
this, but there's no point submitting it since it breaks gdb on
typical Linux systems -- most people use UTF-8 locales, but gdb
doesn't handle UTF-8.

Maybe we should just install a smart Python printer for 'char *' ;-)

Paul> What are some of the good Python references?
Tom> http://www.python.org/doc/2.5.2/api/api.html

Paul> Yes, I've seen the above, but it didn't have the answers I was
Paul> looking for :(

What do you want to know? Both Thiago and I have worked in this area,
maybe one of us knows.

Tom
Paul Pluzhnikov
2008-11-05 01:39:02 UTC
Post by Tom Tromey
Tom> What should happen here, though? The string contains invalid
Tom> characters for its declared (via set target-charset) encoding.
Paul> As an end-user, I would expect something like
Paul> $2 = <"\xef\xcd\xab">
It occurs to me I am not completely certain where this error
originates. My theory is that it is the call to PyUnicode_Decode in
valpy_str.
The 'PyUnicode_Decode()' returns a PyObject, for which
PyUnicode_AsEncodedString() returns NULL.

Here is the trace of this happening:

Breakpoint 1, valpy_str (self=0x2aaaaaae7250) at ../../gdb/python/python-value.c:246
246 char *s = NULL;
(top) n
253 stb = mem_fileopen ();
(top)
254 old_chain = make_cleanup_ui_file_delete (stb);
(top)
256 TRY_CATCH (except, RETURN_MASK_ALL)
(top)
258 common_val_print (((value_object *) self)->value, stb, 0, 0, 0,
(top)
260 s = ui_file_xstrdup (stb, &dummy);
(top)
256 TRY_CATCH (except, RETURN_MASK_ALL)
(top) p s
$4 = 0xb04c90 "\"ïÍ«\""
(top) n
262 GDB_PY_HANDLE_EXCEPTION (except);
(top)
264 do_cleanups (old_chain);
(top)
266 result = PyUnicode_Decode (s, strlen (s), host_charset (), NULL);
(top)
267 xfree (s);
(top) p result
$5 = (PyObject *) 0x2aaaaab71a80
(top) n
269 return result;
(top)
270 }
(top)

### Now return into Python interpreter ###

PyObject_Str (v=<value optimized out>) at ../../Objects/object.c:361
361 if (res == NULL)
(top)
360 res = (*v->ob_type->tp_str)(v);
(top)
361 if (res == NULL)
(top) p res
$6 = (PyObject *) 0x2aaaaab71a80
(top) n
364 if (PyUnicode_Check(res)) {
(top)
366 str = PyUnicode_AsEncodedString(res, NULL, NULL);
(top)
367 Py_DECREF(res);
(top) p str
$7 = (PyObject *) 0x0
Post by Tom Tromey
If so, then we aren't seeing a value representation problem, which is
what I was worried about. Instead, I think common_val_print is
emitting a string which is not actually valid according to
host_charset. That seems wrong.
We could work around this in valpy_str, I suppose. But I'm curious to
know why this is happening -- why isn't common_val_print printing the
escape sequences itself?
I don't see any escape sequences here.
Note that 'raw' GDB doesn't print any escape sequences either,
just raw contents of the buffer.
Post by Tom Tromey
My guess is that the target and host charsets are the same, and
charset.c is passing characters through without checking them for
validity. I didn't debug it, but when I set host-charset to ASCII (my
target-charset is ISO-8859-1), I do see the escapes.
Every time I look at this stuff I'm reminded that the gdb charset code
could use a good scrubbing. For example, the default host charset
ought to come from the locale settings. I have a patch to implement
this, but there's no point submitting it since it breaks gdb on
typical Linux systems -- most people use UTF-8 locales, but gdb
doesn't handle UTF-8.
Maybe we should just install a smart Python printer for 'char *' ;-)
Paul> What are some of the good Python references?
Tom> http://www.python.org/doc/2.5.2/api/api.html
Paul> Yes, I've seen the above, but it didn't have the answers I was
Paul> looking for :(
What do you want to know? Both Thiago and I have worked in this area,
maybe one of us knows.
How to turn raw buffer contents with unprintable characters into something
which will print as "\xef\xcd\xab" :)

Or "what is PyUnicode_AsEncodedString() actually supposed to do?"
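
One way to get the "\xef\xcd\xab" form, assuming the raw bytes of the buffer are already in hand, is to escape at the byte level. The sketch below is plain Python; escape_bytes and pp_bytes are hypothetical names, and how a pretty-printer would obtain the raw bytes from a GDB value is deliberately left out because it depends on the GDB Python API in use.

```python
def escape_bytes(raw):
    """Keep printable ASCII as is and render every other byte as \\xNN."""
    pieces = []
    for byte in bytearray(raw):
        # 0x5c is the backslash itself; escape it too so the output
        # stays unambiguous.
        if 0x20 <= byte < 0x7F and byte != 0x5C:
            pieces.append(chr(byte))
        else:
            pieces.append("\\x%02x" % byte)
    return "".join(pieces)


def pp_bytes(raw):
    """Format raw bytes the way the pp_u printer formats its string."""
    return '<"' + escape_bytes(raw) + '">'


print(pp_bytes(b"abc"))
print(pp_bytes(b"\xef\xcd\xab"))
```

Because the result is pure ASCII, passing it through str() cannot trigger the UnicodeEncodeError seen earlier.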
--
Paul Pluzhnikov
Thiago Jung Bauermann
2008-11-05 12:57:07 UTC
Hi,
Post by Paul Pluzhnikov
258 common_val_print (((value_object *) self)->value, stb, 0, 0, 0,
<snip>
Post by Paul Pluzhnikov
266 result = PyUnicode_Decode (s, strlen (s), host_charset (), NULL);
Just an aside: at first I thought the call to host_charset here was
wrong and should be to target_charset. Then I thought again: it's
right if common_val_print converts the string from target_charset to
host_charset. I think that's the case, but it's hard to follow what GDB
does to print a value. I'll put a comment at the call above explaining
this.

Anyway, what this call is doing is converting the string from GDB's host
charset (probably iso-8859-1 in your case, I think it's GDB's default)
to Unicode. Here, your non-ASCII character isn't a problem because it
exists in ISO-8859-1 and Python knows what to do with it.
Post by Paul Pluzhnikov
364 if (PyUnicode_Check(res)) {
(top)
366 str = PyUnicode_AsEncodedString(res, NULL, NULL);
(top)
367 Py_DECREF(res);
(top) p str
$7 = (PyObject *) 0x0
PyUnicode_AsEncodedString converts a Unicode string to a byte string in
a given charset. Since this call passes NULL as the 'encoding' argument,
Python converts to its default charset which is, unfortunately,
ASCII. Since the Unicode string contains non-ASCII characters, the
conversion fails and a UnicodeEncodeError (a subclass of UnicodeError)
is raised.
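
The two steps described here (a decode that succeeds, then an encode to ASCII that fails) can be replayed in pure Python. This is a sketch under the assumptions made in this message: the host charset is ISO-8859-1 and the default encoding is ASCII. The byte string is the quoted value seen in the trace.

```python
# What common_val_print produced in the trace: a quote, three bytes, a quote.
raw = b'"\xef\xcd\xab"'

# Step 1, the analogue of PyUnicode_Decode(s, len, host_charset(), NULL):
# every byte is a valid ISO-8859-1 character, so this cannot fail.
text = raw.decode("iso-8859-1")

# Step 2, the analogue of PyUnicode_AsEncodedString(res, NULL, NULL)
# with ASCII as the default encoding.
try:
    text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(error)
```

The error reports positions 1-3 because position 0 is the opening double quote, which is plain ASCII.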
Post by Paul Pluzhnikov
Post by Tom Tromey
What do you want to know? Both Thiago and I have worked in this area,
maybe one of us knows.
How to turn raw buffer contents with unprintable characters into something
which will print as "\xef\xcd\xab" :)
Tromey mentioned that if you set host-charset to ASCII, that's what GDB
will do. If I followed correctly what it does to print a value, in
valpy_str the call to common_val_print will convert the string from
target-charset to host-charset (I believe the magic happens in
c_emit_char) and PyUnicode_Decode will receive a pure ASCII string, with
the non-ASCII chars escaped.

What you hit is a shortcoming in Python itself, due to the fact that it
has ASCII as its default charset. I can reproduce the problem in a
Python interpreter:

% python
Python 2.5.2 (r252:60911, Jun 25 2008, 17:58:32)
[GCC 4.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = "á"
>>> print a
á
>>> print str(a)
á
>>> b = u"á"
>>> print b
á
>>> print str(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
>>> print str(b.encode("utf8"))
á

The lesson to learn here is to never use str on a Unicode string. :-/
This is a known limitation of Python. I talked about this issue in:

http://sourceware.org/ml/gdb/2008-07/msg00037.html
--
[]'s
Thiago Jung Bauermann
IBM Linux Technology Center
Doug Evans
2008-11-05 21:52:46 UTC
Post by Tom Tromey
Tom> What should happen here, though? The string contains invalid
Tom> characters for its declared (via set target-charset) encoding.
Paul> As an end-user, I would expect something like
Paul> $2 = <"\xef\xcd\xab">
It occurs to me I am not completely certain where this error
originates. My theory is that it is the call to PyUnicode_Decode in
valpy_str.
If so, then we aren't seeing a value representation problem, which is
what I was worried about. Instead, I think common_val_print is
emitting a string which is not actually valid according to
host_charset. That seems wrong.
We could work around this in valpy_str, I suppose. But I'm curious to
know why this is happening -- why isn't common_val_print printing the
escape sequences itself?
My guess is that the target and host charsets are the same, and
charset.c is passing characters through without checking them for
validity. I didn't debug it, but when I set host-charset to ASCII (my
target-charset is ISO-8859-1), I do see the escapes.
Every time I look at this stuff I'm reminded that the gdb charset code
could use a good scrubbing. For example, the default host charset
ought to come from the locale settings. I have a patch to implement
this, but there's no point submitting it since it breaks gdb on
typical Linux systems -- most people use UTF-8 locales, but gdb
doesn't handle UTF-8.
Maybe we should just install a smart Python printer for 'char *' ;-)
It seems(!) like the right solution is to make gdb unicode-aware. It
might mean going with utf8 internally and only converting at the
boundaries, I don't know.
Thiago Jung Bauermann
2008-11-05 17:56:23 UTC
Post by Paul Pluzhnikov
(gdb) python execfile('simple.py')
I was wondering if it would be useful to have a command in GDB to source
Python scripts, something like "source -p foo.py". But the command above
is short enough that we wouldn't need a separate command. What do you
think?
--
[]'s
Thiago Jung Bauermann
IBM Linux Technology Center
Paul Pluzhnikov
2008-11-05 18:05:06 UTC
On Wed, Nov 5, 2008 at 9:56 AM, Thiago Jung Bauermann
Post by Thiago Jung Bauermann
Post by Paul Pluzhnikov
(gdb) python execfile('simple.py')
I was wondering if it would be useful to have a command in GDB to source
Python scripts, something like "source -p foo.py".
To tell the truth, I've just been using 'source foo.py', where the
first line of 'foo.py' is

python

I think it might be nice for "plain" 'source' to auto-deduce that
this is python source either by looking at ".py" extension, or
by scanning file contents for "obvious" python-ness.
--
Paul Pluzhnikov
Thiago Jung Bauermann
2008-11-05 18:12:39 UTC
Post by Paul Pluzhnikov
On Wed, Nov 5, 2008 at 9:56 AM, Thiago Jung Bauermann
Post by Thiago Jung Bauermann
Post by Paul Pluzhnikov
(gdb) python execfile('simple.py')
I was wondering if it would be useful to have a command in GDB to source
Python scripts, something like "source -p foo.py".
To tell the truth, I've just been using 'source foo.py', where the
first line of 'foo.py' is
python
I think it might be nice for "plain" 'source' to auto-deduce that
this is python source either by looking at ".py" extension, or
by scanning file contents for "obvious" python-ness.
I liked the idea of looking at the extension. Also, another argument
I thought of in favor of using a gdb command directly instead of
python execfile is that it gives you tab completion.
--
[]'s
Thiago Jung Bauermann
IBM Linux Technology Center
Doug Evans
2008-11-05 19:40:20 UTC
On Wed, Nov 5, 2008 at 10:12 AM, Thiago Jung Bauermann
Post by Thiago Jung Bauermann
Post by Paul Pluzhnikov
On Wed, Nov 5, 2008 at 9:56 AM, Thiago Jung Bauermann
Post by Thiago Jung Bauermann
Post by Paul Pluzhnikov
(gdb) python execfile('simple.py')
I was wondering if it would be useful to have a command in GDB to source
Python scripts, something like "source -p foo.py".
To tell the truth, I've just been using 'source foo.py', where the
first line of 'foo.py' is
python
I think it might be nice for "plain" 'source' to auto-deduce that
this is python source either by looking at ".py" extension, or
by scanning file contents for "obvious" python-ness.
I liked the idea of looking at the extension. Also, another argument
I thought of in favor of using a gdb command directly instead of
python execfile is that it gives you tab completion.
The thing I like about using a gdb command directly is that it feels
better from a u/i perspective. Having to use execfile seems way too
clumsy for something as important as finally having a real scripting
language built into gdb.

I vote for using the file extension, with -p (or some such) to handle
situations where the file extension isn't .py.
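
The rule proposed here (treat ".py" files as Python, with -p to force it for other names) is small enough to sketch. Nothing below is GDB code: parse_source_args is a hypothetical helper, and the real "source" command's argument handling is not shown in this thread.

```python
import shlex


def parse_source_args(arg):
    """Return (filename, is_python) for a 'source'-style argument string.

    Hypothetical helper: '-p' forces Python, otherwise the '.py'
    extension decides.
    """
    words = shlex.split(arg)
    force_python = False
    if words and words[0] == "-p":
        force_python = True
        words = words[1:]
    if len(words) != 1:
        raise ValueError("expected exactly one file name")
    filename = words[0]
    return filename, force_python or filename.endswith(".py")


print(parse_source_args("foo.py"))
print(parse_source_args("-p hooks.txt"))
print(parse_source_args("commands.gdb"))
```

Scanning file contents for "obvious" python-ness, the other option mentioned in the thread, is left out of the sketch because it is much harder to make reliable.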
Tom Tromey
2008-11-05 18:13:08 UTC
Tom> (gdb) python execfile('simple.py')

Thiago> I was wondering if it would be useful to have a command in GDB to source
Thiago> Python scripts, something like "source -p foo.py". But the command above
Thiago> is short enough that we wouldn't need a separate command. What do you
Thiago> think?

How about the appended? :-)

This isn't quite as nice as "source -p", but it was simpler to
implement.

A couple ideas I've had for improved commands-in-python:

* Make it possible to rename existing gdb commands, or have Python
objects wrap the underlying command structure.
This would let us write a new "source" (e.g.) in Python that could
delegate to the old "source" when needed.
I have a use for this with "backtrace" too...
For this to work well I think we'd also need to fix the existing gdb
crash involving redefining commands with aliases.

* Put interesting commands into a Python module, like gdb.commands.
Ship a bunch of them with gdb -- but not activated. Then users
could pick the ones they like:

python import gdb.commands.PSource
python import gdb.commands.FancyBacktrace

Tom

import gdb

class PSource(gdb.Command):
    "Read a file and evaluate its contents as Python code."

    def __init__(self):
        super(PSource, self).__init__("psource",
                                      gdb.COMMAND_OBSCURE,
                                      gdb.COMPLETE_FILENAME)

    def invoke(self, arg, from_tty):
        self.dont_repeat()
        execfile(arg)

PSource()
Thiago Jung Bauermann
2008-11-05 18:33:07 UTC
Post by Tom Tromey
Tom> (gdb) python execfile('simple.py')
Thiago> I was wondering if it would be useful to have a command in GDB to source
Thiago> Python scripts, something like "source -p foo.py". But the command above
Thiago> is short enough that we wouldn't need a separate command. What do you
Thiago> think?
How about the appended? :-)
This is great. I'm surprised I didn't think about implementing a command
in Python. :-)
Post by Tom Tromey
* Make it possible to rename existing gdb commands, or have Python
objects wrap the underlying command structure.
This would let us write a new "source" (e.g.) in Python that could
delegate to the old "source" when needed.
I liked this idea.
Post by Tom Tromey
For this to work well I think we'd also need to fix the existing gdb
crash involving redefining commands with aliases.
At least now we have a motivation to fix it. :-)
Post by Tom Tromey
* Put interesting commands into a Python module, like gdb.commands.
Ship a bunch of them with gdb -- but not activated. Then users
python import gdb.commands.PSource
python import gdb.commands.FancyBacktrace
Good idea. But IMHO the PSource command in particular could be enabled
by default.
--
[]'s
Thiago Jung Bauermann
IBM Linux Technology Center
Thiago Jung Bauermann
2008-11-07 14:07:10 UTC
Post by Tom Tromey
self.dont_repeat()
execfile(arg)
Just FYI, this command will not work as is, and it made me scratch my
head for a good while. When you try to source a Python script which
defines a new command, psource fails with:

NameError: global name 'ReverseBacktrace' is not defined

The funny thing is that the same script works when sourced directly with
"python execfile('reverse-backtrace.py')". Due to limitations in the
Python language which I'm not sure I understood yet, you have to change
the execfile call in the invoke method to:

execfile(arg, globals())

Then psource will work as expected. This is apparently a common pitfall
with execfile in Python, and you will find users struggling with it if
you search the web about the problem. I found an explanation in:

http://bytes.com/forum/thread21061.html
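
The pitfall can be reproduced without GDB. The sketch below uses exec on a source string as a stand-in for execfile (which exists only in Python 2), and the body of ReverseBacktrace is invented for illustration. The likely mechanism: when the code runs with a separate locals dictionary, as it does when called from inside invoke, the class name is stored in those locals, while methods look the name up in globals.

```python
SCRIPT = '''
class ReverseBacktrace(object):
    def __init__(self):
        # This name is looked up in the globals of the executed code.
        super(ReverseBacktrace, self).__init__()

ReverseBacktrace()
'''


def run_like_invoke(source, share_namespace):
    """Mimic execfile() being called from inside a method."""
    module_globals = {"__name__": "psource_demo"}
    if share_namespace:
        # Like execfile(arg, globals()): one dict serves as both scopes.
        exec(source, module_globals)
    else:
        # Like execfile(arg) inside a function: separate locals dict,
        # so the class definition never reaches the globals.
        exec(source, module_globals, {})
    return module_globals


def fails_with_name_error(source):
    try:
        run_like_invoke(source, share_namespace=False)
    except NameError:
        return True
    return False


print(fails_with_name_error(SCRIPT))
print("ReverseBacktrace" in run_like_invoke(SCRIPT, share_namespace=True))
```

Passing a single dictionary makes top-level definitions visible to the functions defined in the same file, which is why adding globals() to the execfile call fixes psource.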
--
[]'s
Thiago Jung Bauermann
IBM Linux Technology Center