Update test_unicodedata from v3.14.2 and implement more #7114

Merged
youknowone merged 3 commits into RustPython:main from youknowone:unicode
Feb 14, 2026

Conversation

@youknowone
Member

@youknowone youknowone commented Feb 13, 2026

Summary by CodeRabbit

  • New Features

    • Added methods to query Unicode character properties: combining classes, decomposition mappings, digit and decimal values, numeric values, and normalization status.
  • Bug Fixes

    • Updated error handling for failed Unicode character lookups.

@coderabbitai
Contributor

coderabbitai bot commented Feb 13, 2026

Walkthrough

Expands the Unicode Data API with six new character property methods (combining class, decomposition, normalization, digit/decimal/numeric values), transitions error handling from LookupError to KeyError, and introduces a decomposition type mapping helper. Includes minor formatting adjustments to the time module.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Unicode Data API Expansion (crates/stdlib/src/unicodedata.rs) | Added six new UCD methods: is_normalized(), combining(), decomposition(), digit(), decimal(), and numeric() for querying character properties. Updated imports to include DecompositionType, Number, and NumericType. Changed error handling in lookup paths from LookupError to KeyError. Added decomposition_type_tag() helper function for mapping decomposition types to string tags. |
| Time Module Formatting (crates/vm/src/stdlib/time.rs) | Minor formatting adjustments to tm_from_struct_time function signature and pyobj_to_time_t error return statement; no semantic changes. |
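
For orientation, here is a minimal sketch of how the six methods listed above behave in CPython's `unicodedata` module. It assumes the RustPython implementations match CPython here, which seems to be the intent since the PR targets `test_unicodedata`, but that is not confirmed by this page. The variable names are illustrative only.

```python
import unicodedata

# Canonical combining class: 0 for base characters, non-zero for combining marks.
acute_class = unicodedata.combining("\u0301")  # COMBINING ACUTE ACCENT
base_class = unicodedata.combining("a")

# decimal() is the strictest of the three numeric lookups, digit() also
# accepts characters such as superscripts, and numeric() covers fractions.
arabic_three = unicodedata.decimal("\u0663")   # ARABIC-INDIC DIGIT THREE
superscript_two = unicodedata.digit("\u00b2")  # SUPERSCRIPT TWO
one_half = unicodedata.numeric("\u00bd")       # VULGAR FRACTION ONE HALF

# Each numeric lookup takes an optional default instead of raising ValueError.
no_decimal = unicodedata.decimal("\u00b2", None)

# decomposition() returns a space-separated string of hex code points.
e_acute = unicodedata.decomposition("\u00e9")  # LATIN SMALL LETTER E WITH ACUTE

# is_normalized() takes the normalization form first, then the string.
nfc_precomposed = unicodedata.is_normalized("NFC", "\u00e9")
nfc_decomposed = unicodedata.is_normalized("NFC", "e\u0301")
```

Here `decimal()` on SUPERSCRIPT TWO falls back to the default because that character has a digit value but no decimal value.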

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 A rabbit hops through Unicode's halls,
Combining, decomposing, answering calls,
With digits and decimals in every gleam,
The UCD API's a charming dream!

🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
Check name: Merge Conflict Detection
Status: ⚠️ Warning
Explanation: ❌ Merge conflicts detected (33 files):

⚔️ .gitattributes (content)
⚔️ Cargo.lock (content)
⚔️ Lib/collections/__init__.py (content)
⚔️ Lib/hashlib.py (content)
⚔️ Lib/test/test_class.py (content)
⚔️ Lib/test/test_codecs.py (content)
⚔️ Lib/test/test_collections.py (content)
⚔️ Lib/test/test_descr.py (content)
⚔️ Lib/test/test_hashlib.py (content)
⚔️ Lib/test/test_hmac.py (content)
⚔️ Lib/test/test_inspect/test_inspect.py (content)
⚔️ Lib/test/test_itertools.py (content)
⚔️ Lib/test/test_keywordonlyarg.py (content)
⚔️ Lib/test/test_pickle.py (content)
⚔️ Lib/test/test_positional_only_arg.py (content)
⚔️ Lib/test/test_re.py (content)
⚔️ Lib/test/test_smtplib.py (content)
⚔️ Lib/test/test_struct.py (content)
⚔️ Lib/test/test_ucn.py (content)
⚔️ Lib/test/test_unicodedata.py (content)
⚔️ Lib/test/test_urlparse.py (content)
⚔️ crates/codegen/src/compile.rs (content)
⚔️ crates/stdlib/Cargo.toml (content)
⚔️ crates/stdlib/src/blake2.rs (content)
⚔️ crates/stdlib/src/hashlib.rs (content)
⚔️ crates/stdlib/src/md5.rs (content)
⚔️ crates/stdlib/src/sha1.rs (content)
⚔️ crates/stdlib/src/sha256.rs (content)
⚔️ crates/stdlib/src/sha3.rs (content)
⚔️ crates/stdlib/src/sha512.rs (content)
⚔️ crates/stdlib/src/unicodedata.rs (content)
⚔️ crates/vm/src/builtins/object.rs (content)
⚔️ crates/vm/src/vm/mod.rs (content)

Resolution: These conflicts must be resolved before merging into main. Resolve conflicts locally and push changes to this branch.
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title references updating test_unicodedata and implementing more features, which aligns with the main changes adding new UCD API methods (combining, decimal, decomposition, digit, numeric, is_normalized) and updating error handling. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 92.86% which is sufficient. The required threshold is 80.00%. |

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
crates/stdlib/src/unicodedata.rs (2)

202-213: is_normalized is correct but always performs full normalization.

CPython uses Unicode quick-check properties to short-circuit when possible, avoiding full normalization for already-normalized strings. This implementation always normalizes and compares, which is O(n) allocation + comparison even for already-normalized input. Fine for correctness; optimization can be deferred.
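
The normalize-and-compare strategy this comment describes can be sketched in Python as follows. This is an illustration of the general approach, not the PR's Rust code; `is_normalized_naive` is a hypothetical name introduced here.

```python
import unicodedata


def is_normalized_naive(form: str, text: str) -> bool:
    """Check normalization by building the normalized copy and comparing.

    This always allocates a full normalized string, even when the input is
    already normalized; a quick-check based version could return early.
    """
    return unicodedata.normalize(form, text) == text
```

The result should agree with `unicodedata.is_normalized(form, text)` for every input; only the amount of work differs.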


322-342: Canonical arm is unreachable but harmless.

Line 251 already handles DecompositionType::Canonical before calling this function, so the Canonical => "canonical" arm on line 324 is dead code. It's fine to keep for exhaustiveness, but worth noting.
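
For context on why the canonical case is handled separately: in CPython, the string returned by `unicodedata.decomposition()` has no tag for canonical mappings, while compatibility mappings start with a tag such as `<compat>` or `<fraction>`. A small sketch that splits the two cases, assuming CPython's output format; `split_decomposition` is a hypothetical helper and not part of the PR.

```python
import unicodedata


def split_decomposition(ch: str):
    """Return (tag, code_points); tag is None for a canonical mapping."""
    fields = unicodedata.decomposition(ch).split()
    if fields and fields[0].startswith("<"):
        # Compatibility mapping: the first field is the formatting tag.
        return fields[0].strip("<>"), [int(f, 16) for f in fields[1:]]
    # Canonical mapping, or no decomposition at all (empty list).
    return None, [int(f, 16) for f in fields]
```

For example, U+00E9 yields no tag, U+FB01 (LATIN SMALL LIGATURE FI) yields the `compat` tag, and a plain `a` yields an empty code point list.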


@github-actions
Contributor

Code has been automatically formatted

The code in this PR has been formatted using:

  • cargo fmt --all

Please pull the latest changes before pushing again:
git pull origin unicode

CPython Developers and others added 2 commits February 14, 2026 09:32
Add combining, decomposition, digit, decimal, numeric methods to Ucd.
Change lookup() to raise KeyError instead of LookupError.
Remove expectedFailure markers from 9 passing tests.
Add unicodedata.is_normalized() method.
Rename decomp_chars to chars to fix spell check.
Remove expectedFailure from test_named_unicode_escapes and
test_urlsplit_normalization.
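
The `lookup()` change mentioned above matters to callers that catch the exception. In CPython, `unicodedata.lookup()` raises `KeyError` for an unknown name, and `KeyError` is a subclass of `LookupError`, so existing `except LookupError` handlers keep working. A minimal sketch; `lookup_or_none` is a hypothetical wrapper used only for illustration.

```python
import unicodedata


def lookup_or_none(name: str):
    """Return the character with the given Unicode name, or None if unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # Raised for names that are not in the Unicode Character Database.
        return None
```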
@github-actions
Contributor

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] lib: cpython/Lib/re
[x] lib: cpython/Lib/sre_compile.py
[x] lib: cpython/Lib/sre_constants.py
[x] lib: cpython/Lib/sre_parse.py
[ ] test: cpython/Lib/test/test_re.py (TODO: 17)

dependencies:

  • re

dependent tests: (59 tests)

  • re: test_android test_ast test_asyncio test_binascii test_builtin test_bytes test_ctypes test_dict test_dis test_docxmlrpc test_dtrace test_email test_faulthandler test_filecmp test_fileinput test_format test_fstring test_functools test_future_stmt test_genericalias test_glob test_http_cookiejar test_httplib test_httpservers test_imaplib test_importlib test_ipaddress test_logging test_mailbox test_mmap test_optparse test_pprint test_pydoc test_re test_regrtest test_runpy test_site test_smtplib test_socket test_ssl test_strftime test_strtod test_symtable test_syntax test_sysconfig test_tempfile test_tools test_traceback test_typing test_unittest test_venv test_webbrowser test_winapi test_wsgiref test_xmlrpc test_zipfile test_zipimport test_zoneinfo test_zstd

[ ] test: cpython/Lib/test/test_unicodedata.py (TODO: 9)
[x] test: cpython/Lib/test/test_unicode_file.py
[ ] test: cpython/Lib/test/test_unicode_file_functions.py
[ ] test: cpython/Lib/test/test_unicode_identifiers.py (TODO: 1)
[x] test: cpython/Lib/test/test_ucn.py (TODO: 3)

dependencies:

dependent tests: (no tests depend on unicode)

[x] lib: cpython/Lib/urllib
[x] test: cpython/Lib/test/test_urllib.py
[x] test: cpython/Lib/test/test_urllib2.py
[x] test: cpython/Lib/test/test_urllib2_localnet.py (TODO: 19)
[x] test: cpython/Lib/test/test_urllib2net.py
[x] test: cpython/Lib/test/test_urllibnet.py
[x] test: cpython/Lib/test/test_urlparse.py
[x] test: cpython/Lib/test/test_urllib_response.py
[x] test: cpython/Lib/test/test_robotparser.py

dependencies:

  • urllib

dependent tests: (27 tests)

  • urllib: test_genericalias test_http_cookiejar test_httpservers test_logging test_pathlib test_pydoc test_robotparser test_site test_sqlite3 test_ssl test_ucn test_urllib test_urllib2 test_urllib2_localnet test_urllib2net test_urllib_response test_urllibnet test_urlparse
    • email.utils: test_email test_smtplib
      • smtplib: test_smtpnet
    • http.client: test_docxmlrpc test_hashlib test_unicodedata test_wsgiref test_xmlrpc
    • pydoc: test_enum

Legend:

  • [+] path exists in CPython
  • [x] up-to-date, [ ] outdated

@youknowone youknowone marked this pull request as ready for review February 14, 2026 13:32
@youknowone youknowone merged commit 93d83c1 into RustPython:main Feb 14, 2026
24 of 25 checks passed
@youknowone youknowone deleted the unicode branch February 14, 2026 13:39