PathSpec

pathspec is a utility library for pattern matching of file paths. So far this only includes Git's gitignore pattern matching.

Tutorial

Say you have a "Projects" directory and you want to back up only certain files from it, ignoring others based on a set of patterns:

>>> from pathspec import PathSpec
>>> # The gitignore-style patterns for files to select, but we're including
>>> # instead of ignoring.
>>> spec_text = """
...
... # This is a comment because the line begins with a hash: "#"
...
... # Include several project directories (and all descendants) relative to
... # the current directory. To reference only a directory you must end with a
... # slash: "/"
... /project-a/
... /project-b/
... /project-c/
...
... # Patterns can be negated by prefixing with exclamation mark: "!"
...
... # Ignore temporary files beginning or ending with "~" and ending with
... # ".swp".
... !~*
... !*~
... !*.swp
...
... # These are python projects so ignore compiled python files from
... # testing.
... !*.pyc
...
... # Ignore the build directories but only directly under the project
... # directories.
... !/*/build/
...
... """

The PathSpec class provides an abstraction around pattern implementations, and we want to compile our patterns as "gitignore" patterns. You could call it a wrapper for a list of compiled patterns:

>>> spec = PathSpec.from_lines('gitignore', spec_text.splitlines())

If we wanted to compile the patterns manually, we could use the GitIgnoreBasicPattern class directly. It is used behind the scenes for "gitignore" and converts each pattern to a regular expression:

>>> from pathspec.patterns.gitignore.basic import GitIgnoreBasicPattern
>>> patterns = map(GitIgnoreBasicPattern, spec_text.splitlines())
>>> spec = PathSpec(patterns)

PathSpec.from_lines() is a class method which simplifies that.

If you want to load the patterns from a file, you can pass the file object directly as well:

>>> with open('patterns.list', 'r') as fh:
...     spec = PathSpec.from_lines('gitignore', fh)
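
As a self-contained sketch of loading patterns from a file, the following writes a small pattern file to a temporary directory and passes the open file object to PathSpec.from_lines(). The file name patterns.list comes from the example above; the pattern contents and the sample paths are made up for illustration.

```python
import tempfile
from pathlib import Path

from pathspec import PathSpec

# Hypothetical pattern file contents, for illustration only.
PATTERN_TEXT = "# temporary files\n*.swp\n*~\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    pattern_path = Path(tmp_dir) / 'patterns.list'
    pattern_path.write_text(PATTERN_TEXT)

    # from_lines() takes an iterable of lines, so an open text file works.
    with open(pattern_path, 'r') as fh:
        spec = PathSpec.from_lines('gitignore', fh)

    # Check two sample paths against the loaded patterns.
    swap_matched = spec.match_file('notes.txt.swp')
    source_matched = spec.match_file('notes.txt')

print(swap_matched, source_matched)
```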

You can perform matching on a whole directory tree with:

>>> matches = set(spec.match_tree_files('path/to/directory'))

Or you can perform matching on a specific set of file paths with:

>>> matches = set(spec.match_files(file_paths))

Or check to see if an individual file matches:

>>> is_matched = spec.match_file(file_path)
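
To make the matching calls concrete, here is a minimal sketch that uses a trimmed version of the tutorial patterns. The candidate paths are hypothetical and are written relative to the "Projects" directory.

```python
from pathspec import PathSpec

# A trimmed version of the tutorial patterns.
patterns = [
    '/project-a/',   # select everything under project-a
    '!*.pyc',        # but skip compiled Python files
    '!/*/build/',    # and skip build directories directly under a project
]
spec = PathSpec.from_lines('gitignore', patterns)

# Hypothetical candidate paths, for illustration only.
file_paths = [
    'project-a/src/main.py',
    'project-a/src/main.pyc',
    'project-a/build/output.bin',
    'project-z/src/main.py',
]

# match_files() yields the paths selected by the patterns.
matches = set(spec.match_files(file_paths))

# match_file() checks a single path and returns a boolean.
is_matched = spec.match_file('project-a/src/main.py')

print(sorted(matches), is_matched)
```

Only project-a/src/main.py is selected here: the .pyc file and the build output are negated by later patterns, and project-z is never selected in the first place.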

There are actually two implementations of "gitignore". The basic implementation is used by PathSpec and follows the patterns as documented by gitignore. However, Git's behavior differs from the documented patterns. There are some edge cases; in particular, Git allows including files from excluded directories, which appears to contradict the documentation. GitIgnoreSpec handles these cases to more closely replicate Git's behavior:

>>> from pathspec import GitIgnoreSpec
>>> spec = GitIgnoreSpec.from_lines(spec_text.splitlines())

You do not specify the style of pattern for GitIgnoreSpec because it should always use GitIgnoreSpecPattern internally.

Performance

Running lots of regular expression matches against thousands of files in Python is slow. Alternate regular expression backends can be used to improve performance. PathSpec and GitIgnoreSpec both accept a backend parameter to control the backend. The default is "best" to automatically choose the best available backend. There are currently 3 backends.

The "simple" backend is the built-in default: it uses the re.Pattern objects that Python normally creates. It can be the fastest when there are only 1 or 2 patterns.

The "hyperscan" backend uses the hyperscan library. Hyperscan tends to be at least 2 times faster than "simple", and generally slower than "re2". This can be faster than "re2" under the right conditions with pattern counts of 1-25.

The "re2" backend uses the google-re2 library (not to be confused with the re2 library on PyPI which is unrelated and abandoned). Google's re2 tends to be significantly faster than "simple", and 3 times faster than "hyperscan" at high pattern counts.

PathSpec.match_files(): 6.5k files using CPython 3.13.7 on 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

Patterns  simple ops  hyperscan ops  hyperscan x  re2 ops  re2 x
       1       289.0          166.4         0.58    197.2   0.68
       5       109.4          161.7         1.48    209.1   1.91
      15        48.0          114.8         2.39    180.4   3.76
      25        28.8           57.6         2.00    192.3   6.67
      50        16.4           36.1         2.21    187.5  11.46
     100         9.2           51.4         5.57    188.7  20.42
     150         6.4           57.2         9.00    184.6  29.04

GitIgnoreSpec.match_files(): 6.5k files using CPython 3.13.7 on 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

Patterns  simple ops  hyperscan ops  hyperscan x  re2 ops  re2 x
       1       289.2          172.0         0.59    216.2   0.75
       5       111.1          154.6         1.39    214.8   1.93
      15        49.1          133.5         2.72    208.2   4.24
      25        30.5           55.9         1.84    194.4   6.38
      50        15.9           38.5         2.41    195.5  12.26
     100         8.8           67.7         7.72    199.2  22.71
     150         6.3           61.1         9.67    177.8  28.14

The "x" columns give each backend's ops relative to "simple" at the same pattern count.

FAQ

1. How do I ignore files like .gitignore?

GitIgnoreSpec (and PathSpec) positively match files by default, so with .gitignore-style patterns the matched files are the ones to ignore. To find the files to keep instead, set negate=True to flip the results:

>>> from pathspec import GitIgnoreSpec
>>> spec = GitIgnoreSpec.from_lines([...])
>>> keep_files = set(spec.match_tree_files('path/to/directory', negate=True))
>>> ignore_files = set(spec.match_tree_files('path/to/directory'))

License

pathspec is licensed under the Mozilla Public License Version 2.0. See LICENSE or the FAQ for more information.

In summary, you may use pathspec with any closed or open source project without affecting the license of the larger work so long as you:

  • give credit where credit is due,
  • and release any custom changes made to pathspec.

Source

The source code for pathspec is available from the GitHub repo cpburnz/python-pathspec.

Installation

pathspec is available for install through PyPI:

pip install pathspec
pip install pathspec[hyperscan]
pip install pathspec[google-re2]

pathspec can also be built from source. The build package is required, since the command below invokes it. pathspec can then be built and installed with:

python -m build
pip install dist/pathspec-*-py3-none-any.whl

Documentation

Documentation for pathspec is available on Read the Docs.

Other Languages

The related project pathspec-ruby (by highb) provides a similar library as a Ruby gem.
