
Type Annotations #320

Open
nachomaiz wants to merge 24 commits into Roche:pyfile_dev from nachomaiz:pyfile_typehints

Conversation

@nachomaiz commented Feb 13, 2026

Hi @ofajardo!

This PR aims to fix #299, adding type annotations to all public interface functions and classes.

I based them on the docstrings and how I understand the code is operating with the different parameters and class attributes, but I might have missed something.

I wasn't able to compile the library on this machine; however, I have done a runtime check of the type annotations to make sure everything runs on Python 3.10+.

How it works:

  • pyclasses.py:
    • I've written TypedDict classes for missing ranges and MR sets.
    • Because the instance is not meant to be initialized by the user, I've set the type annotations for optional parameters to the final (non-Optional) type. It might be better to turn it into a dataclass or to add default values of the same type.
  • pyreadstat.py:
    • Created a FileLike protocol with the methods read and seek.
    • Use os.PathLike for flexibility with os.fsencode
    • Added overloads to read functions for the different output format types.
    • Write functions accept any dataframe object supported by narwhals.
    • Chunk- and multi-read functions only accept a PyreadstatReadFunction callable type. Its first argument must be a path/file-like object, and it must return a tuple of data and metadata.
  • pyfunctions.py
    • Use narwhals type vars to signal the return of the same type of dataframe.
  • py.typed: to signal to type checkers that the public interface of the library is type hinted.

While adding type annotations, I noticed a few issues with the docstrings, so I took the liberty of syncing them up with the type annotations.

I also used a formatter for the function signatures as they were getting unwieldy. This changed the formatting of some of the code within the functions, so let me know if you would prefer I revert those.

Looking forward to your feedback.

@ofajardo (Collaborator)

hi @nachomaiz thanks a lot! I am a bit snowed under right now, but will take a look as soon as I get some time.
By looking at what you wrote here, I have two comments:

  • If you like, you can transform the metadata container to a data class if you think it will work better.
  • For the write functions, they should only accept a pandas or a polars dataframe; accepting any dataframe that narwhals would accept is misleading, because only those two are supported.

@jonathon-love please check this PR; it probably addresses the same thing as yours?

@nachomaiz (Author) commented Feb 16, 2026

Thank you for the feedback!

I've now changed metadata_container to a dataclass. I saw that all fields are assigned to in _readstat_parser, so I've removed the Nones and added equivalent, type-compatible default values. datetime fields default to datetime.now, but hopefully that adds only a minimal amount of extra compute.
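For illustration, the dataclass approach described here looks roughly like the following. This is a trimmed, hypothetical subset of fields, not the real container; as the follow-up comments note, the non-None defaults were later reverted:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class MetadataContainer:
    """Illustrative stand-in for pyreadstat's metadata_container."""
    number_rows: int = 0                 # previously None by default
    number_columns: int = 0
    file_label: str = ""
    # mutable/computed defaults need default_factory, not a plain default
    creation_time: datetime = field(default_factory=datetime.now)
    column_names: list[str] = field(default_factory=list)
```

Using default_factory avoids the classic shared-mutable-default pitfall: each instance gets its own list and its own timestamp.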

I also took a quick look at @jonathon-love's PR, and I realized I was missing specific types for PathLike and np.ndarray, so I ported over some of those nicer type definitions. Could I ask for your thoughts as well?

I switched the dataframe types for the write functions to pandas.DataFrame | polars.DataFrame to keep the hints restricted to those libraries only.

Additionally, I've added py.typed to MANIFEST.in and setup.py, as I got reminded by @jonathon-love to include that as well.

@ofajardo (Collaborator)

hi @nachomaiz thanks for your efforts!

I have cloned your fork, compiled it (all ok) and then run the tests.

test_basic.py and test_narwhalified.py fail with 3 errors. The origin is probably that in the metadata_container class, if you look carefully, there are some members like number_rows that were previously None by default and that you are now defining as 0. This raises an inconsistency when using metadataonly, which also breaks read_in_chunks when reading an export file. So, could you please review those members and adapt them to be as they were before?

oh now I see your comment

I saw that all fields are assigned to in _readstat_parser, so I've removed the Nones and added equivalent, type-compatible values. datetime fields default to datetime.now, but hopefully it should be a minimal amount of extra compute.

No, that is not correct: those values are not always set, and therefore the Nones need to stay. Also, do not default datetime fields to datetime.now but to None.

Another one: typing_test.py raises a lot of errors. I am less familiar with mypy so I have not checked what they are about.

Found 22 errors in 4 files (checked 1 source file)

I think you have to get a machine where you can compile pyreadstat and run the tests. Please run test_basic.py and test_narwhalified.py with backend==pandas and backend==polars, as well as test_http_integration.py. BTW, please rename typing_test.py to test_typing.py just to keep the naming pattern.

Last one:
In setup.py there are these two additions:

package_data={"pyreadstat": ["py.typed"]},
include_package_data=True,

I think this might be unnecessary if you included py.typed in the manifest. The issue with package data is that on Windows, when people install Python from the Microsoft Store, it installs the package and the package data into different places (can't remember exactly), and I am not sure whether in that case the IDE will see py.typed (maybe yes?). I had a similar issue in the past when I had to deliver DLL files for Windows, and Python was not able to find them. I think this has to be tested.
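For reference, the MANIFEST.in route mentioned here would be a single line like the following (assuming the package directory is pyreadstat/). Note that MANIFEST.in generally only affects source distributions, so package_data is usually still needed for the marker file to end up in wheels and installed packages:

```
include pyreadstat/py.typed
```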

Otherwise it looks good! =) Speed is also the same as before when I converted the files from pyx to py, so it seems the dataclass change is neutral in terms of performance.

@nachomaiz (Author)

Hello!

Thank you for reviewing and for your feedback!

I'm working on setting up a machine to be able to compile and run tests, will hopefully have it soon.

In the meantime, just wanted to get your thoughts on a couple of the things you mentioned above.

I have now gone through the code a bit more carefully and found the places where the num_rows distinction between 0 and None is made, and I see that it's generally related to POR and XPORT (?) files not having row counts in their metadata, and how that interacts with metadataonly=True and chunk/multiprocessing reads...

What makes it a bit complicated in my view is that if we set num_rows as int | None, any access of num_rows for any other file type will always need to be preceded by:

if meta.number_rows is not None:
    ...

Which may be the easier way to handle things in the end, but it also feels redundant when for many users it would never be None. Would there be any alternatives? Maybe a subclass of metadata_container only for POR and XPORT files, in which number_rows can be None as well? But you're probably more familiar with the code and other potential ways to keep the logic working.
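One hypothetical middle ground (purely illustrative, using a stand-in class rather than the real metadata_container) is a small helper that narrows the Optional once, so the None check lives in a single place:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Meta:
    """Stand-in for the real metadata object."""
    number_rows: int | None = None


def require_number_rows(meta: Meta) -> int:
    """Narrow int | None to int, raising if the file lacked a row count."""
    if meta.number_rows is None:
        raise ValueError("row count not present in the file metadata")
    return meta.number_rows
```

Callers that know their file format always stores a row count can call the helper and get a plain int as far as the type checker is concerned, instead of repeating the if-None guard at every access.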

...

On the typing_tests.pyi file, do note that it's a PYI file, so it's not executable. It exists only to run mypy against, with mypy tests/typing_tests.pyi. I was worried that naming it test_typing.py would make the test runner think it contains actual runtime tests. I suppose that since it's a PYI file I could rename it to test_typing.pyi and it should be ok.

There are a few rows that should error; there should be 5 errors in that file (comments in the file point them out). Mypy also analyzes the other files that the test file imports, so I'm ignoring those files for the purpose of these annotations.

I noticed that there's an import error in the file, so at the moment it doesn't work correctly; I'll fix that soon. I should also mention that both polars and pandas-stubs must be installed for mypy to do the type checks correctly.

I'll fix those few bits and remove the extra setup.py lines in my next batch of commits, but I'd be keen to hear your thoughts on alternatives to setting int | None for all types of files.

@ofajardo (Collaborator)

hi @nachomaiz

Regarding the topic of num_rows being int or None: None signals that it was not possible to recover the information from the metadata, and therefore it is undefined. It is not correct to say that happens only for POR and XPORT files; theoretically it can happen with any file type if the writing application did not write that information. For example, in the case of SPSS SAV files, some applications do not write the number of rows, so it cannot be determined and should stay None (see for example #109).

However, I am not 100% sure what the problem is... this is the way it has been for years and there have been no problems so far. I am also reluctant to change the interface unless it is strictly necessary. So can you please explain a bit more what your concern is? If you mean the user needs to check the possibility that num_rows is None: yes, the user should do that if they want to be strict; no way around that, for the reason explained before.

Please also notice that I would like all the members that were None before to stay as they are, not only num_rows.

@nachomaiz (Author)

Ok! That makes sense. My mistake for assuming things. 😅

Will bring back all the None values, try to run the tests, and push new commits, aiming for later today.

Hopefully that gets it to a good place to merge!

@ofajardo (Collaborator)

hi @nachomaiz thanks!

Regarding the typing tests, please indicate in the comment at the top of the file, where you indicate that it has to be run with mypy, which other packages need to be installed in order to run it.

We need the tests to be executable: they should have assertions, all of which should pass if everything is fine and fail if something is wrong. These tests will then be run in order to make the wheels and are expected to pass, so reveal_type is not enough. So please transform your tests into an executable form and rename the file as suggested before. I have never done this, so I'm not sure what is better; a quick search says you can use either assert_type (would be nice as no extra package is needed, and you could do something similar to test_narwhalified.py) or pytest-mypy-plugins (would require installing extra stuff, but apparently you can write negative tests more easily).
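For what it's worth, the assert_type route could look roughly like this. assert_type landed in typing in Python 3.11, with a backport available in typing_extensions; this sketch assumes one of the two is importable, and the function is a made-up stand-in:

```python
import sys

if sys.version_info >= (3, 11):
    from typing import assert_type
else:  # Python 3.10: use the typing_extensions backport
    from typing_extensions import assert_type


def row_count() -> int:
    """Toy function standing in for a metadata attribute access."""
    return 42


# At runtime assert_type simply returns its argument unchanged;
# a type checker such as mypy additionally verifies the declared type.
value = assert_type(row_count(), int)
```

Because assert_type is a no-op at runtime, such tests pass under plain pytest while still failing under mypy if an annotation regresses.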

@nachomaiz (Author) commented Feb 24, 2026

Hello again!

So I've managed to set up my machine to run tests! And yes, I saw where they were failing. Sorry about that.

I've done the following:

  • Reverted the previously optional values to None by default
  • Added int | None to MRSet["counted_value"] as I saw that it was a valid type in the tests.
  • Created typing test cases to run with pytest-mypy-plugins. This was actually pretty interesting, so thanks for the challenge! 😄

The typing tests are run using the following command (also available at the top of the file):

pytest tests/test_typing.yml --mypy-ini-file=tests/test_mypy_setup.ini --mypy-only-local-stub

The mypy-* flags are needed because mypy checks imported libraries by default (numpy and narwhals), and that brings up tons of noise from typing errors there.

pytest will also pick up these tests if you just run pytest tests for example, so the mypy flags are required there too.

I've made just a few quick type tests on the metadata_container object, but they can be expanded as needed.
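For illustration, a pytest-mypy-plugins case file is a YAML list of cases along these lines (the case name and the revealed type below are assumptions for the example, not the actual test contents):

```yaml
- case: number_rows_may_be_none
  main: |
    import pyreadstat
    df, meta = pyreadstat.read_sav("file.sav", metadataonly=True)
    reveal_type(meta.number_rows)  # N: Revealed type is "Union[builtins.int, None]"
```

Each case runs mypy on the main snippet, and the inline `# N:` comment states the note output mypy is expected to produce on that line.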

Let me know if you see anything else to change!

PS. I didn't go with assert_type because it's not available in Python 3.10.
