1. 16 Dec, 2024 2 commits
    • Kirill Smelkov's avatar
      golang, strconv: Switch them to cimport each other at pyx level · e5c513bf
      Kirill Smelkov authored
      Since 50b8cb7e (strconv: Move functionality related to UTF8
      encode/decode into _golang_str) both golang_str and strconv import each
      other.
      
      Before this patch that import was done at py level at runtime from
      outside to workaround the import cycle. This results in that strconv
      functionality is not available while golang is only being imported.
      So far it was not a problem, but when builtin string types will become
      patched with bstr and ustr, that will become a problem because string
      repr starts to be used at import time, which for pybstr is implemented
      via strconv.quote .
      
      -> Fix this by switching golang and strconv to cimport each other at pyx
      level. There, similarly to C, the cycle works just ok out of the box.
      
      This also automatically helps performance a bit:
      
          name                 old time/op  new time/op  delta
          quote[a]              805µs ± 0%   786µs ± 1%   -2.40%  (p=0.016 n=5+4)
          quote[\u03b1]        1.21ms ± 0%  1.12ms ± 0%   -7.47%  (p=0.008 n=5+5)
          quote[\u65e5]         785µs ± 0%   738µs ± 2%   -5.97%  (p=0.016 n=5+4)
          quote[\U0001f64f]    1.04ms ± 0%  0.92ms ± 1%  -11.73%  (p=0.008 n=5+5)
          stdquote             1.18µs ± 0%  1.19µs ± 0%   +0.54%  (p=0.008 n=5+5)
          unquote[a]           1.26ms ± 0%  1.08ms ± 0%  -14.66%  (p=0.008 n=5+5)
          unquote[\u03b1]       911µs ± 1%   797µs ± 0%  -12.55%  (p=0.008 n=5+5)
          unquote[\u65e5]       592µs ± 0%   522µs ± 0%  -11.81%  (p=0.008 n=5+5)
          unquote[\U0001f64f]  3.46ms ± 0%  3.21ms ± 0%   -7.34%  (p=0.008 n=5+5)
          stdunquote            812ns ± 1%   815ns ± 0%     ~     (p=0.183 n=5+5)
      e5c513bf
    • Kirill Smelkov's avatar
      strconv: Move it to pyx · 2684dc94
      Kirill Smelkov authored
      So far this is plain code movement with no type annotations added and
      internal from-strconv imports still being done via py level.
      
      As expected this does not help practically for performance yet:
      
          name                 old time/op  new time/op  delta
          quote[a]              910µs ± 0%   805µs ± 0%  -11.54%  (p=0.008 n=5+5)
          quote[\u03b1]        1.23ms ± 0%  1.21ms ± 0%   -1.24%  (p=0.008 n=5+5)
          quote[\u65e5]         800µs ± 0%   785µs ± 0%   -1.86%  (p=0.016 n=4+5)
          quote[\U0001f64f]    1.06ms ± 1%  1.04ms ± 0%   -1.92%  (p=0.008 n=5+5)
          stdquote             1.17µs ± 0%  1.18µs ± 0%   +0.80%  (p=0.008 n=5+5)
          unquote[a]           1.33ms ± 1%  1.26ms ± 0%   -5.13%  (p=0.008 n=5+5)
          unquote[\u03b1]       952µs ± 2%   911µs ± 1%   -4.25%  (p=0.008 n=5+5)
          unquote[\u65e5]       613µs ± 2%   592µs ± 0%   -3.48%  (p=0.008 n=5+5)
          unquote[\U0001f64f]  3.62ms ± 1%  3.46ms ± 0%   -4.32%  (p=0.008 n=5+5)
          stdunquote            788ns ± 0%   812ns ± 1%   +3.07%  (p=0.016 n=4+5)
      2684dc94
  2. 09 Oct, 2022 2 commits
    • Kirill Smelkov's avatar
      golang_str: bstr/ustr repr · 386844d3
      Kirill Smelkov authored
      Teach bstr/ustr to provide repr of themselves: it goes as b(...) and
      u(...) where u stands for human-readable repr of contained data.
      Human-readable means that non-ascii printable unicode characters are
      shown as-is instead of escaping them, for example:
      
          >>> x = u'αβγ'
          >>> x
          'αβγ'
          >>> y = b(x)
          >>> y
          b('αβγ')				<-- NOTE not b(b'\xce\xb1\xce\xb2\xce\xb3')
          >>> x.encode('utf-8')
          b'\xce\xb1\xce\xb2\xce\xb3'
      386844d3
    • Kirill Smelkov's avatar
      strconv, golang_str: Switch quote, unquote and qq to always return bstr · 604a7765
      Kirill Smelkov authored
      bstr is becoming the default pygolang string type. And it can be mixed
      ok with all bytes/unicode and ustr. Previously e.g. strconv.quote was
      checking which kind of type its input was and was trying to return the
      result of the same type. Now this becomes unnecessary since bstr is
      intended to be used universally and interoperable with all other string
      types.
      604a7765
  3. 05 Oct, 2022 1 commit
    • Kirill Smelkov's avatar
      golang_str: Speedup utf-8 decoding a bit on py2 · 9cb7b210
      Kirill Smelkov authored
      We recently moved our custom UTF-8 encoding/decoding routines to Cython.
      Now we can start taking speedup advantage on C level to make our own
      UTF-8 decoder a bit less horribly slow on py2:
      
          name       old time/op  new time/op  delta
          stddecode   752ns ± 0%   743ns ± 0%   -1.19%  (p=0.000 n=9+10)
          udecode     216µs ± 0%    75µs ± 0%  -65.19%  (p=0.000 n=9+10)
          stdencode   328ns ± 2%   327ns ± 1%     ~     (p=0.252 n=10+9)
          bencode    34.1µs ± 1%  32.1µs ± 1%   -5.92%  (p=0.000 n=10+10)
      
      So it is ~ 3x speedup for u(), but still significantly slower compared
      to std unicode.decode('utf-8').
      
      Only low-hanging fruit here to make _utf_decode_rune a bit more prompt,
      since it sits in the most inner loop. In the future
      _utf8_decode_surrogateescape might be reworked as well to avoid
      constructing resulting unicode via py-level list of py-unicode character
      objects. And similarly for _utf8_encode_surrogateescape.
      
      On py3 the performance of std and u/b decode/encode is approximately the same.
      
      /trusted-by @jerome
      /reviewed-on !19
      9cb7b210
  4. 04 Oct, 2022 2 commits
    • Kirill Smelkov's avatar
      golang_str,strconv: Fix decoding of rune-error · 598eb479
      Kirill Smelkov authored
      Error rune (u+fffd) is returned by _utf8_decode_rune to indicate an
      error in decoding. But the error rune itself is valid unicode codepoint:
      
         >>> x = u"�"
         >>> x
         u'\ufffd'
         >>> x.encode('utf-8')
         '\xef\xbf\xbd'
      
      This way only (r=_rune_error, size=1) should be treated by the caller as
      utf8 decoding error.
      
      But e.g. strconv.quote was not careful to also inspect the size, and this way
      was quoting � into just "\xef" instead of "\xef\xbf\xbd".
      _utf8_decode_surrogateescape was also subject to similar error.
      
      -> Fix it.
      
      Without the fix e.g. added test for strconv.quote fails as
      
          >           assert quote(tin) == tquoted
          E           assert '"\xef"' == '"�"'
          E             - "\xef"
          E             + "�"
      
      /reviewed-by @jerome
      /reviewed-at !18
      598eb479
    • Kirill Smelkov's avatar
      strconv: Move functionality related to UTF8 encode/decode into _golang_str · 50b8cb7e
      Kirill Smelkov authored
      - Move _utf8_decode_rune, _utf8_decode_surrogateescape, _utf8_encode_surrogateescape out from strconv into _golang_str
      - Factor _bstr/_ustr code into pyb/pyu. _bstr/_ustr become plain wrappers over pyb/pyu.
      - work-around emerged golang  strconv dependency with at-runtime import.
      
      Moved routines belong to the main part of golang strings processing
      -> their home should be in _golang_str.pyx
      
      /reviewed-by @jerome
      /reviewed-at nexedi/pygolang!18
      50b8cb7e
  5. 16 Mar, 2021 1 commit
  6. 28 Feb, 2020 2 commits
    • Kirill Smelkov's avatar
      strconv: Fix b & friends on macos/windows · 0561926a
      Kirill Smelkov authored
      On macos and windows, Python2 is built with --enable-unicode=ucs2, which
      makes it to use UTF-16 encoding for unicode characters, and so for
      characters higher than U+10000 it uses surrogate encoding with _2_
      unicode points, for example:
      
              >>> import sys
              >>> sys.maxunicode
              65535                       <-- NOTE indicates UCS2 build
              >>> s = u'\U00012345'
              >>> s
              u'\U00012345'
              >>> s.encode('utf-8')
              '\xf0\x92\x8d\x85'
              >>> len(s)
              2                           <-- NOTE _not_ 1
              >>> s[0]
              u'\ud808'
              >>> s[1]
              u'\udf45'
      
      This leads to e.g. b tests failing for
      
          # tbytes                        tunicode
          (b"\xf0\x90\x8c\xbc",           u'\U0001033c'),     # Valid 4 Octet Sequence '𐌼'
      
          >           assert b(tunicode) == tbytes
          E           AssertionError: assert '\xed\xa0\x80\xed\xbc\xbc' == '\xf0\x90\x8c\xbc'
          E             - \xed\xa0\x80\xed\xbc\xbc
          E             + \xf0\x90\x8c\xbc
      
      because on UCS2 python build u'\U0001033c' is represented as 2 unicode
      points:
      
          >>> s = u'\U0001033c'
          >>> len(s)
          2
          >>> s[0]
          u'\ud800'
          >>> s[1]
          u'\udf3c'
          >>> s[0].encode('utf-8')
          '\xed\xa0\x80'
          >>> s[1].encode('utf-8')
          '\xed\xbc\xbc'
      
      -> Fix it by detecting UCS2 build and working around by manually
      combining such surrogate unicode pairs appropriately.
      
      A reference on the subject:
      
      https://matthew-brett.github.io/pydagogue/python_unicode.html#utf-16-ucs2-builds-of-python-and-32-bit-unicode-code-points
      0561926a
    • Kirill Smelkov's avatar
      strconv: Switch _utf8_decode_rune to return rune ordinal instead of unicode character · 5cc679ac
      Kirill Smelkov authored
      This is a preparatory step for the next patch where we'll be fixing
      strconv for Python2 builds with --enable-unicode=ucs2, where a unicode
      character can be taking _2_ unicode points.
      
      In that general case relying on unicode objects to represent runes is
      not good, because many things generally do not work for U+10000 and
      above, e.g. ord breaks:
      
          >>> import sys
          >>> sys.maxunicode
          65535                       <-- NOTE indicates UCS2 build
          >>> s = u'\U00012345'
          >>> s
          u'\U00012345'
          >>> s.encode('utf-8')
          '\xf0\x92\x8d\x85'
          >>> len(s)
          2                           <-- NOTE _not_ 1
          >>> ord(s)
          Traceback (most recent call last):
            File "<stdin>", line 1, in <module>
          TypeError: ord() expected a character, but string of length 2 found
      
      so we switch to represent runes as integer, similarly to what Go does.
      5cc679ac
  7. 04 Feb, 2020 1 commit
    • Kirill Smelkov's avatar
      golang: Provide b, u for strings · bcb95cd5
      Kirill Smelkov authored
      With Python3 I've got tired to constantly use .encode() and .decode();
      getting exception if original argument was unicode on e.g. b.decode();
      getting exception on raw bytes that are invalid UTF-8, not being able to
      use bytes literal with non-ASCII characters, etc.
      
      So instead of this pain provide two functions that make sure an object
      is either bytes or unicode:
      
      - b converts str/unicode/bytes s to UTF-8 encoded bytestring.
      
      	Bytes input is preserved as-is:
      
      	   b(bytes_input) == bytes_input
      
      	Unicode input is UTF-8 encoded. The encoding always succeeds.
      	b is reverse operation to u - the following invariant is always true:
      
      	   b(u(bytes_input)) == bytes_input
      
      - u converts str/unicode/bytes s to unicode string.
      
      	Unicode input is preserved as-is:
      
      	   u(unicode_input) == unicode_input
      
      	Bytes input is UTF-8 decoded. The decoding always succeeds and input
      	information is not lost: non-valid UTF-8 bytes are decoded into
      	surrogate codes ranging from U+DC80 to U+DCFF.
      	u is reverse operation to b - the following invariant is always true:
      
      	   u(b(unicode_input)) == unicode_input
      
      NOTE: encoding _and_ decoding *never* fail nor loose information. This
      is achieved by using 'surrogateescape' error handler on Python3, and
      providing manual fallback that behaves the same way on Python2.
      
      The naming is chosen with the idea so that b(something) resembles
      b"something", and u(something) resembles u"something".
      
      This, even being only a part of strings solution discussed in [1],
      should help handle byte- and unicode- strings in more robust and
      distraction free way.
      
      Top-level documentation is TODO.
      
      [1] nexedi/zodbtools!13
      bcb95cd5
  8. 11 Sep, 2019 1 commit
  9. 26 Aug, 2019 1 commit
  10. 14 May, 2019 1 commit
    • Kirill Smelkov's avatar
      *: __future__ += absolute_imports; Use unified __future__ everywhere · 81dfefa0
      Kirill Smelkov authored
      - we are going to introduce golang.time, and from inside there without
        `from __future__ import absolute_imports` it won't be possible to import
        needed stdlib's time.
      
      - we were already doing `from __future__ import print_function`, but
        only in some files.
      
      -> It makes sense to apply updated __future__ usage uniformly.
      81dfefa0
  11. 13 Dec, 2018 2 commits
  12. 02 Jul, 2018 1 commit
  13. 20 Jun, 2018 1 commit
    • Kirill Smelkov's avatar
      Turn pygopath into full pygolang · afa46cf5
      Kirill Smelkov authored
      Not only we can import modules by full path, but now we can also spawn
      threads/coroutines and exchange data in between them with the same
      primitives and semantic as in Go.
      
      The bulk of new functionality is copied from here:
      
      	kirr/go123@9e1aa6ab
      
      Original commit description follows:
      
      """
      golang: New _Python_ package to provide Go-like features to Python language
      - `go` spawns lightweight thread.
      - `chan` and `select` provide channels with Go semantic.
      - `method` allows to define methods separate from class.
      - `gimport` allows to import python modules by full path in a Go workspace.
      
      The focus of first draft was on usage interface and on correctness, not speed.
      In particular select should be fully working.
      
      If there is a chance I will maybe try to followup with gevent-based
      implementation in the future.
      Hide whitespace changes
      """
      afa46cf5