- 16 Dec, 2024 2 commits
-
-
Kirill Smelkov authored
Since 50b8cb7e (strconv: Move functionality related to UTF8 encode/decode into _golang_str) golang_str and strconv import each other. Before this patch that import was done at py level, at runtime, from outside, to work around the import cycle. As a result, strconv functionality is not available while golang is still only being imported. So far that was not a problem, but when the builtin string types become patched with bstr and ustr it will become one, because string repr starts to be used at import time, which for pybstr is implemented via strconv.quote.

-> Fix this by switching golang and strconv to cimport each other at pyx level. There, similarly to C, the cycle works just ok out of the box.

This also automatically helps performance a bit:

    name                  old time/op  new time/op  delta
    quote[a]              805µs ± 0%   786µs ± 1%   -2.40%   (p=0.016 n=5+4)
    quote[\u03b1]         1.21ms ± 0%  1.12ms ± 0%  -7.47%   (p=0.008 n=5+5)
    quote[\u65e5]         785µs ± 0%   738µs ± 2%   -5.97%   (p=0.016 n=5+4)
    quote[\U0001f64f]     1.04ms ± 0%  0.92ms ± 1%  -11.73%  (p=0.008 n=5+5)
    stdquote              1.18µs ± 0%  1.19µs ± 0%  +0.54%   (p=0.008 n=5+5)
    unquote[a]            1.26ms ± 0%  1.08ms ± 0%  -14.66%  (p=0.008 n=5+5)
    unquote[\u03b1]       911µs ± 1%   797µs ± 0%   -12.55%  (p=0.008 n=5+5)
    unquote[\u65e5]       592µs ± 0%   522µs ± 0%   -11.81%  (p=0.008 n=5+5)
    unquote[\U0001f64f]   3.46ms ± 0%  3.21ms ± 0%  -7.34%   (p=0.008 n=5+5)
    stdunquote            812ns ± 1%   815ns ± 0%   ~        (p=0.183 n=5+5)
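The shape of the fix can be sketched as follows (a simplified illustration with hypothetical declarations, not the actual pygolang sources): each side declares what it needs in a .pxd file and cimports the other, and, as with C headers, the cycle resolves fine at compile time.

```cython
# strconv.pxd  (sketch; real declarations differ)
cdef pyquote(s)

# _golang_str.pyx  (sketch)
from golang cimport strconv    # cimport is resolved at C level,
                               # so the golang <-> strconv cycle just works

cdef _pybstr_repr(self):
    # repr at import time can now use strconv directly
    return 'b(' + strconv.pyquote(self) + ')'
```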
-
Kirill Smelkov authored
So far this is plain code movement with no type annotations added, and internal from-strconv imports still being done via py level. As expected this does not practically help performance yet:

    name                  old time/op  new time/op  delta
    quote[a]              910µs ± 0%   805µs ± 0%   -11.54%  (p=0.008 n=5+5)
    quote[\u03b1]         1.23ms ± 0%  1.21ms ± 0%  -1.24%   (p=0.008 n=5+5)
    quote[\u65e5]         800µs ± 0%   785µs ± 0%   -1.86%   (p=0.016 n=4+5)
    quote[\U0001f64f]     1.06ms ± 1%  1.04ms ± 0%  -1.92%   (p=0.008 n=5+5)
    stdquote              1.17µs ± 0%  1.18µs ± 0%  +0.80%   (p=0.008 n=5+5)
    unquote[a]            1.33ms ± 1%  1.26ms ± 0%  -5.13%   (p=0.008 n=5+5)
    unquote[\u03b1]       952µs ± 2%   911µs ± 1%   -4.25%   (p=0.008 n=5+5)
    unquote[\u65e5]       613µs ± 2%   592µs ± 0%   -3.48%   (p=0.008 n=5+5)
    unquote[\U0001f64f]   3.62ms ± 1%  3.46ms ± 0%  -4.32%   (p=0.008 n=5+5)
    stdunquote            788ns ± 0%   812ns ± 1%   +3.07%   (p=0.016 n=4+5)
-
- 09 Oct, 2022 2 commits
-
-
Kirill Smelkov authored
Teach bstr/ustr to provide repr of themselves: it goes as b(...) and u(...) where the payload is a human-readable repr of the contained data. Human-readable means that non-ASCII printable unicode characters are shown as-is instead of being escaped, for example:

    >>> x = u'αβγ'
    >>> x
    'αβγ'

    >>> y = b(x)
    >>> y
    b('αβγ')                        <-- NOTE not b(b'\xce\xb1\xce\xb2\xce\xb3')

    >>> x.encode('utf-8')
    b'\xce\xb1\xce\xb2\xce\xb3'
-
Kirill Smelkov authored
bstr is becoming the default pygolang string type, and it can be mixed freely with bytes, unicode and ustr. Previously e.g. strconv.quote was checking which kind of string its input was and trying to return a result of the same type. Now this becomes unnecessary, since bstr is intended to be used universally and interoperates with all other string types.
-
- 05 Oct, 2022 1 commit
-
-
Kirill Smelkov authored
We recently moved our custom UTF-8 encoding/decoding routines to Cython. Now we can start taking advantage of the C level to make our own UTF-8 decoder a bit less horribly slow on py2:

    name       old time/op  new time/op  delta
    stddecode  752ns ± 0%   743ns ± 0%   -1.19%   (p=0.000 n=9+10)
    udecode    216µs ± 0%   75µs ± 0%    -65.19%  (p=0.000 n=9+10)
    stdencode  328ns ± 2%   327ns ± 1%   ~        (p=0.252 n=10+9)
    bencode    34.1µs ± 1%  32.1µs ± 1%  -5.92%   (p=0.000 n=10+10)

So it is a ~3x speedup for u(), but still significantly slower compared to std unicode.decode('utf-8'). The only low-hanging fruit here was to make _utf8_decode_rune a bit more prompt, since it sits in the innermost loop. In the future _utf8_decode_surrogateescape might be reworked as well, to avoid constructing the resulting unicode via a py-level list of py-unicode character objects; and similarly for _utf8_encode_surrogateescape.

On py3 the performance of std and u/b decode/encode is approximately the same.

/trusted-by @jerome
/reviewed-on !19
-
- 04 Oct, 2022 2 commits
-
-
Kirill Smelkov authored
The error rune (U+FFFD) is returned by _utf8_decode_rune to indicate an error in decoding. But the error rune itself is a valid unicode codepoint:

    >>> x = u"�"
    >>> x
    u'\ufffd'
    >>> x.encode('utf-8')
    '\xef\xbf\xbd'

This way only (r=_rune_error, size=1) should be treated by the caller as a utf8 decoding error. But e.g. strconv.quote was not careful to also inspect the size, and so was quoting � into just "\xef" instead of "\xef\xbf\xbd". _utf8_decode_surrogateescape was subject to a similar error.

-> Fix it. Without the fix e.g. the added test for strconv.quote fails as

    >       assert quote(tin) == tquoted
    E       assert '"\xef"' == '"�"'
    E         - "\xef"
    E         + "�"

/reviewed-by @jerome
/reviewed-at !18
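The (rune, size) convention can be sketched in pure Python (an illustrative helper, not pygolang's actual implementation): the error rune U+FFFD signals a decoding error only when reported together with size=1; a literal, correctly-encoded U+FFFD comes back with size=3.

```python
_rune_error = 0xFFFD  # U+FFFD, the Unicode replacement character

def utf8_decode_rune(b):
    """Decode the first UTF-8 rune in bytes b -> (rune, size).

    On invalid input returns (_rune_error, 1).  A genuine encoded U+FFFD
    decodes as (_rune_error, 3) and must NOT be treated as an error.
    """
    for size in range(1, min(4, len(b)) + 1):   # UTF-8 runes are 1..4 bytes
        try:
            return ord(b[:size].decode('utf-8')), size
        except UnicodeDecodeError:
            continue
    return _rune_error, 1

# truncated/invalid sequence -> real decoding error: (U+FFFD, size=1)
assert utf8_decode_rune(b'\xef') == (_rune_error, 1)
# encoded U+FFFD itself -> valid rune: (U+FFFD, size=3)
assert utf8_decode_rune(b'\xef\xbf\xbd') == (_rune_error, 3)
```

A caller like quote must therefore check both members of the pair before substituting the escaped form of a single raw byte.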
-
Kirill Smelkov authored
- Move _utf8_decode_rune, _utf8_decode_surrogateescape and _utf8_encode_surrogateescape out from strconv into _golang_str.
- Factor _bstr/_ustr code into pyb/pyu; _bstr/_ustr become plain wrappers over pyb/pyu.
- Work around the emerged golang ↔ strconv dependency with an at-runtime import.

The moved routines belong to the main part of golang string processing -> their home should be in _golang_str.pyx.

/reviewed-by @jerome
/reviewed-at nexedi/pygolang!18
-
- 16 Mar, 2021 1 commit
-
-
Kirill Smelkov authored
Those are quote codes that Go strconv.Quote might produce. And even though Python does not use them when quoting, it too handles those quote codes when decoding:

    In [1]: '\r'
    Out[1]: '\r'

    In [2]: '\a\b\v\f'
    Out[2]: '\x07\x08\x0b\x0c'

https://github.com/python/cpython/blob/2.7.18-0-g8d21aa21f2c/Objects/stringobject.c#L677-L688

-> Teach strconv.unquote + friends to handle them as well.

/reviewed-by @jerome
/reviewed-on !14
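The escape codes in question map to control characters as below. This is only an illustrative sketch of that mapping (a hypothetical helper, not strconv.unquote itself, which also handles \n, \t, \x.. and the other forms):

```python
# C-style escapes that Go's strconv.Quote may emit beyond \n \t etc.
_extra_escapes = {
    'a': '\x07',  # bell
    'b': '\x08',  # backspace
    'v': '\x0b',  # vertical tab
    'f': '\x0c',  # form feed
    'r': '\x0d',  # carriage return
}

def decode_extra_escapes(s):
    """Decode only the \\a \\b \\v \\f \\r escapes in s; pass the rest through."""
    out, i = [], 0
    while i < len(s):
        if s[i] == '\\' and i + 1 < len(s) and s[i + 1] in _extra_escapes:
            out.append(_extra_escapes[s[i + 1]])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)

assert decode_extra_escapes(r'\a\b\v\f\r') == '\x07\x08\x0b\x0c\x0d'
```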
-
- 28 Feb, 2020 2 commits
-
-
Kirill Smelkov authored
On macOS and Windows, Python2 is built with --enable-unicode=ucs2, which makes it use UTF-16 encoding for unicode characters, and so for characters at U+10000 and above it uses surrogate encoding with _2_ unicode points, for example:

    >>> import sys
    >>> sys.maxunicode
    65535                   <-- NOTE indicates UCS2 build
    >>> s = u'\U00012345'
    >>> s
    u'\U00012345'
    >>> s.encode('utf-8')
    '\xf0\x92\x8d\x85'
    >>> len(s)
    2                       <-- NOTE _not_ 1
    >>> s[0]
    u'\ud808'
    >>> s[1]
    u'\udf45'

This leads to e.g. b tests failing for

    # tbytes                tunicode
    (b"\xf0\x90\x8c\xbc",   u'\U0001033c'),    # Valid 4 Octet Sequence '𐌼'

    >       assert b(tunicode) == tbytes
    E       AssertionError: assert '\xed\xa0\x80\xed\xbc\xbc' == '\xf0\x90\x8c\xbc'
    E         - \xed\xa0\x80\xed\xbc\xbc
    E         + \xf0\x90\x8c\xbc

because on a UCS2 python build u'\U0001033c' is represented as 2 unicode points:

    >>> s = u'\U0001033c'
    >>> len(s)
    2
    >>> s[0]
    u'\ud800'
    >>> s[1]
    u'\udf3c'
    >>> s[0].encode('utf-8')
    '\xed\xa0\x80'
    >>> s[1].encode('utf-8')
    '\xed\xbc\xbc'

-> Fix it by detecting UCS2 builds and working around the issue by manually combining such surrogate unicode pairs appropriately.

A reference on the subject: https://matthew-brett.github.io/pydagogue/python_unicode.html#utf-16-ucs2-builds-of-python-and-32-bit-unicode-code-points
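The pair-combining arithmetic can be sketched as follows (an illustrative function with a hypothetical name, following the standard UTF-16 decoding formula):

```python
def combine_surrogate_pair(hi, lo):
    """Recombine a UTF-16 surrogate pair (hi, lo) into the real codepoint."""
    assert 0xD800 <= hi <= 0xDBFF, "hi must be a high surrogate"
    assert 0xDC00 <= lo <= 0xDFFF, "lo must be a low surrogate"
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# on UCS2 builds u'\U0001033c' shows up as the pair u'\ud800' + u'\udf3c':
assert combine_surrogate_pair(0xD800, 0xDF3C) == 0x1033C
```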
-
Kirill Smelkov authored
This is a preparatory step for the next patch, where we'll be fixing strconv for Python2 builds with --enable-unicode=ucs2, under which a unicode character can take _2_ unicode points. In that general case relying on unicode objects to represent runes is not good, because many things generally do not work for U+10000 and above, e.g. ord breaks:

    >>> import sys
    >>> sys.maxunicode
    65535                   <-- NOTE indicates UCS2 build
    >>> s = u'\U00012345'
    >>> s
    u'\U00012345'
    >>> s.encode('utf-8')
    '\xf0\x92\x8d\x85'
    >>> len(s)
    2                       <-- NOTE _not_ 1
    >>> ord(s)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: ord() expected a character, but string of length 2 found

so we switch to representing runes as integers, similarly to what Go does.
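The integer representation sidesteps the problem because an int codepoint is the same object on every build (a small py3 illustration; on UCS2 py2 builds the 1-char string half of the round-trip is exactly what breaks):

```python
# a rune is just an int codepoint, like Go's rune type
r = 0x12345
assert chr(r) == u'\U00012345'     # int -> character (full range on py3)
assert ord(u'\U00012345') == r     # character -> int (this ord breaks on UCS2 py2)
```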
-
- 04 Feb, 2020 1 commit
-
-
Kirill Smelkov authored
With Python3 I've got tired of constantly using .encode() and .decode(); getting an exception if the original argument was unicode on e.g. b.decode(); getting an exception on raw bytes that are invalid UTF-8; not being able to use bytes literals with non-ASCII characters, etc. So instead of this pain provide two functions that make sure an object is either bytes or unicode:

- b converts str/unicode/bytes s to a UTF-8 encoded bytestring.

  Bytes input is preserved as-is: b(bytes_input) == bytes_input. Unicode input is UTF-8 encoded. The encoding always succeeds. b is the reverse operation to u - the following invariant is always true:

      b(u(bytes_input)) == bytes_input

- u converts str/unicode/bytes s to a unicode string.

  Unicode input is preserved as-is: u(unicode_input) == unicode_input. Bytes input is UTF-8 decoded. The decoding always succeeds and input information is not lost: non-valid UTF-8 bytes are decoded into surrogate codes ranging from U+DC80 to U+DCFF. u is the reverse operation to b - the following invariant is always true:

      u(b(unicode_input)) == unicode_input

NOTE: encoding _and_ decoding *never* fail nor lose information. This is achieved by using the 'surrogateescape' error handler on Python3, and by providing a manual fallback that behaves the same way on Python2.

The naming is chosen with the idea that b(something) resembles b"something", and u(something) resembles u"something".

This, even being only a part of the strings solution discussed in [1], should help handle byte- and unicode- strings in a more robust and distraction-free way.

Top-level documentation is TODO.

[1] nexedi/zodbtools!13
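The described semantics can be sketched py3-only via the 'surrogateescape' error handler (a minimal sketch of the idea; pygolang's real b/u additionally provide the equivalent manual fallback for py2):

```python
def b(s):
    """convert str/bytes s to a UTF-8 encoded bytestring"""
    if isinstance(s, bytes):
        return s                                   # bytes preserved as-is
    return s.encode('utf-8', 'surrogateescape')

def u(s):
    """convert str/bytes s to a unicode string"""
    if isinstance(s, str):
        return s                                   # unicode preserved as-is
    return s.decode('utf-8', 'surrogateescape')

# invalid UTF-8 byte 0xff decodes into surrogate U+DCFF instead of raising...
assert u(b'\xff') == '\udcff'
# ...and the invariant b(u(bytes_input)) == bytes_input holds even then
assert b(u(b'hello \xff world')) == b'hello \xff world'
```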
-
- 11 Sep, 2019 1 commit
-
-
Kirill Smelkov authored
In Python, %-formatting needs the % operator, not ",".
-
- 26 Aug, 2019 1 commit
-
-
Kirill Smelkov authored
-
- 14 May, 2019 1 commit
-
-
Kirill Smelkov authored
- We are going to introduce golang.time, and from inside there, without `from __future__ import absolute_import`, it won't be possible to import the needed stdlib time.
- We were already doing `from __future__ import print_function`, but only in some files.

-> It makes sense to apply updated __future__ usage uniformly.
-
- 13 Dec, 2018 2 commits
-
-
Kirill Smelkov authored
These are functions to decode quotation that was produced by strconv.quote().
-
Kirill Smelkov authored
Move the quote implementation from gcompat to strconv. Make quote work on both unicode and bytes string input, and produce output of the same type. Preserve qq (still remaining in gcompat) to always produce str, since that is what is used in prints.
-
- 02 Jul, 2018 1 commit
-
-
Kirill Smelkov authored
This patch made its first step as a way to teach qq to also work on python3. However the differences between str/unicode and in escapes between py2 and py3 quickly popped out, and then it became easier to just handle the whole escaping logic myself. The implementation is based on go123@c0bbd06e, and a byproduct of the manual handling is that we now don't escape printable UTF-8 characters.
-
- 20 Jun, 2018 1 commit
-
-
Kirill Smelkov authored
Not only can we import modules by full path, but now we can also spawn threads/coroutines and exchange data between them with the same primitives and semantics as in Go. The bulk of the new functionality is copied from here: kirr/go123@9e1aa6ab

Original commit description follows:

"""
golang: New _Python_ package to provide Go-like features to Python language

- `go` spawns a lightweight thread.
- `chan` and `select` provide channels with Go semantics.
- `method` allows to define methods separate from a class.
- `gimport` allows to import python modules by full path in a Go workspace.

The focus of the first draft was on usage interface and on correctness, not speed. In particular select should be fully working. If there is a chance, I will maybe try to follow up with a gevent-based implementation in the future.
"""
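The go/chan idea can be roughly approximated with the stdlib alone (this only illustrates the semantics via threads and a bounded queue standing in for a buffered channel; pygolang's real primitives live in the golang package and additionally provide select, unbuffered rendezvous, etc.):

```python
import threading
import queue

def go(f, *args, **kw):
    # spawn f in a background task, like Go's `go f(...)` statement
    t = threading.Thread(target=f, args=args, kwargs=kw, daemon=True)
    t.start()
    return t

ch = queue.Queue(maxsize=1)   # ~ make(chan string, 1)

def worker():
    ch.put('hello')           # ~ ch <- "hello"

go(worker)
assert ch.get() == 'hello'    # ~ <-ch
```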
-