pathspec is a utility library for pattern matching of file paths. So far this only includes Git's gitignore pattern matching.
Say you have a "Projects" directory and you want to back up only certain files, ignoring others depending on certain conditions:
>>> from pathspec import PathSpec
>>> # The gitignore-style patterns for files to select, but we're including
>>> # instead of ignoring.
>>> spec_text = """
...
... # This is a comment because the line begins with a hash: "#"
...
... # Include several project directories (and all descendants) relative to
... # the current directory. To reference only a directory you must end with a
... # slash: "/"
... /project-a/
... /project-b/
... /project-c/
...
... # Patterns can be negated by prefixing with exclamation mark: "!"
...
... # Ignore temporary files beginning or ending with "~" and ending with
... # ".swp".
... !~*
... !*~
... !*.swp
...
... # These are python projects so ignore compiled python files from
... # testing.
... !*.pyc
...
... # Ignore the build directories but only directly under the project
... # directories.
... !/*/build/
...
... """
The PathSpec class provides an abstraction around pattern implementations,
and we want to compile our patterns as "gitignore" patterns. You could call it a
wrapper for a list of compiled patterns:
>>> spec = PathSpec.from_lines('gitignore', spec_text.splitlines())
If we want to compile the patterns manually, we can use the GitIgnoreBasicPattern
class directly. It is what "gitignore" uses in the background, and it internally
converts patterns to regular expressions:
>>> from pathspec.patterns.gitignore.basic import GitIgnoreBasicPattern
>>> patterns = map(GitIgnoreBasicPattern, spec_text.splitlines())
>>> spec = PathSpec(patterns)
PathSpec.from_lines() is a class method which simplifies that.
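To give a feel for what "converting a pattern to a regular expression" means, here is a minimal sketch using only the standard library. The helper `glob_to_regex` is hypothetical and is not pathspec's implementation: it assumes a pattern without slashes, handles only `*` and `?`, and ignores negation, `**`, anchoring, and character classes.

```python
import re


def glob_to_regex(glob: str) -> str:
    """Translate a tiny glob subset into a regex string (illustration only)."""
    parts = []
    for char in glob:
        if char == "*":
            # "*" matches any run of characters within one path segment.
            parts.append("[^/]*")
        elif char == "?":
            # "?" matches exactly one non-slash character.
            parts.append("[^/]")
        else:
            parts.append(re.escape(char))
    # A pattern without a slash may match the file name at any depth,
    # so allow an optional leading directory prefix.
    return "^(?:.+/)?" + "".join(parts) + "$"


regex = re.compile(glob_to_regex("*.pyc"))
print(bool(regex.match("module.pyc")))      # True
print(bool(regex.match("pkg/module.pyc")))  # True
print(bool(regex.match("module.pyc.bak")))  # False
```

The real pattern classes cover far more of the gitignore syntax, which is why compiling through PathSpec.from_lines() is the simpler route.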
If you want to load the patterns from a file, you can pass the file object directly as well:
>>> with open('patterns.list', 'r') as fh:
...     spec = PathSpec.from_lines('gitignore', fh)
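As a self-contained sketch of loading patterns from a file, the following writes a small pattern file to a temporary directory and reads it back. The file contents and the paths being checked are made up for illustration.

```python
import os
import tempfile

from pathspec import PathSpec

with tempfile.TemporaryDirectory() as tmp_dir:
    pattern_path = os.path.join(tmp_dir, "patterns.list")

    # Hypothetical pattern file: select Python sources, except tests.
    with open(pattern_path, "w") as fh:
        fh.write("# Select Python sources, except tests.\n*.py\n!test_*.py\n")

    # Pass the open file object directly; each line is one pattern.
    with open(pattern_path, "r") as fh:
        spec = PathSpec.from_lines("gitignore", fh)

print(spec.match_file("app.py"))       # True
print(spec.match_file("test_app.py"))  # False
print(spec.match_file("README.md"))    # False
```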
You can perform matching on a whole directory tree with:
>>> matches = set(spec.match_tree_files('path/to/directory'))
Or you can perform matching on a specific set of file paths with:
>>> matches = set(spec.match_files(file_paths))
Or check to see if an individual file matches:
>>> is_matched = spec.match_file(file_path)
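The matching calls can be tried without touching the file system. This sketch assumes a small subset of the patterns above and a hypothetical list of paths; with PathSpec, the last pattern that matches a file decides the result.

```python
from pathspec import PathSpec

# Hypothetical pattern subset and file list, for illustration only.
patterns = [
    "/project-a/",
    "!*.pyc",
    "!/*/build/",
]
file_paths = [
    "project-a/main.py",
    "project-a/main.pyc",
    "project-a/build/out.o",
    "project-b/main.py",
]

spec = PathSpec.from_lines("gitignore", patterns)

# match_files() yields the paths selected by the patterns.
matches = set(spec.match_files(file_paths))
print(sorted(matches))  # ['project-a/main.py']

# match_file() checks a single path and returns a bool.
is_matched = spec.match_file("project-a/main.py")
print(is_matched)  # True
```

Only `project-a/main.py` is selected: the `.pyc` file and the build output are negated by later patterns, and `project-b` is not included by this subset.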
There are actually two implementations of "gitignore". The basic implementation is
used by PathSpec and follows patterns as documented by gitignore.
However, Git's behavior differs from the documented patterns. There are some
edge cases, and in particular, Git allows including files from excluded
directories, which appears to contradict the documentation. GitIgnoreSpec
handles these cases to more closely replicate Git's behavior:
>>> from pathspec import GitIgnoreSpec
>>> spec = GitIgnoreSpec.from_lines(spec_text.splitlines())
You do not specify the style of pattern for GitIgnoreSpec because it should
always use GitIgnoreSpecPattern internally.
Running lots of regular expression matches against thousands of files in Python
is slow. Alternate regular expression backends can be used to improve
performance. PathSpec and GitIgnoreSpec both accept a backend
parameter to control the backend. The default is "best", which automatically chooses
the best available backend. There are currently 3 backends.
The "simple" backend is the baseline and needs no extra dependencies: it simply
uses the Python re.Pattern objects that are normally created. This can be the
fastest when there are only 1 or 2 patterns.
The "hyperscan" backend uses the hyperscan library. Hyperscan tends to be at least 2 times faster than "simple", and generally slower than "re2". This can be faster than "re2" under the right conditions with pattern counts of 1-25.
The "re2" backend uses the google-re2 library (not to be confused with the re2 library on PyPI which is unrelated and abandoned). Google's re2 tends to be significantly faster than "simple", and 3 times faster than "hyperscan" at high pattern counts.
| Patterns | simple (ops) | hyperscan (ops) | hyperscan (x) | re2 (ops) | re2 (x) |
|---|---|---|---|---|---|
| 1 | 289.0 | 166.4 | 0.58 | 197.2 | 0.68 |
| 5 | 109.4 | 161.7 | 1.48 | 209.1 | 1.91 |
| 15 | 48.0 | 114.8 | 2.39 | 180.4 | 3.76 |
| 25 | 28.8 | 57.6 | 2.00 | 192.3 | 6.67 |
| 50 | 16.4 | 36.1 | 2.21 | 187.5 | 11.46 |
| 100 | 9.2 | 51.4 | 5.57 | 188.7 | 20.42 |
| 150 | 6.4 | 57.2 | 9.00 | 184.6 | 29.04 |
| Patterns | simple (ops) | hyperscan (ops) | hyperscan (x) | re2 (ops) | re2 (x) |
|---|---|---|---|---|---|
| 1 | 289.2 | 172.0 | 0.59 | 216.2 | 0.75 |
| 5 | 111.1 | 154.6 | 1.39 | 214.8 | 1.93 |
| 15 | 49.1 | 133.5 | 2.72 | 208.2 | 4.24 |
| 25 | 30.5 | 55.9 | 1.84 | 194.4 | 6.38 |
| 50 | 15.9 | 38.5 | 2.41 | 195.5 | 12.26 |
| 100 | 8.8 | 67.7 | 7.72 | 199.2 | 22.71 |
| 150 | 6.3 | 61.1 | 9.67 | 177.8 | 28.14 |
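The "x" columns appear to be each backend's ops divided by the "simple" ops for the same pattern count. That reading is an assumption, but it can be checked against the first three rows of the first table with a little arithmetic:

```python
# (patterns, simple ops, hyperscan ops, re2 ops), copied from the first table.
rows = [
    (1, 289.0, 166.4, 197.2),
    (5, 109.4, 161.7, 209.1),
    (15, 48.0, 114.8, 180.4),
]


def speedup(backend_ops: float, simple_ops: float) -> float:
    """Return the backend's throughput relative to "simple", to 2 decimals."""
    return round(backend_ops / simple_ops, 2)


for count, simple, hyperscan, re2 in rows:
    print(count, speedup(hyperscan, simple), speedup(re2, simple))
# 1 0.58 0.68
# 5 1.48 1.91
# 15 2.39 3.76
```

At higher pattern counts the recomputed ratios can differ slightly from the table, presumably because the ops values shown are already rounded.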
GitIgnoreSpec (and PathSpec) positively match files by default, so the matches
are the files selected by the patterns. To find the files to keep, excluding
the matched files the way a .gitignore does, you need to set negate=True to
flip the results:
>>> from pathspec import GitIgnoreSpec
>>> spec = GitIgnoreSpec.from_lines([...])
>>> keep_files = set(spec.match_tree_files('path/to/directory', negate=True))
>>> ignore_files = set(spec.match_tree_files('path/to/directory'))
pathspec is licensed under the Mozilla Public License Version 2.0. See LICENSE or the FAQ for more information.
In summary, you may use pathspec with any closed or open source project without affecting the license of the larger work so long as you:
- give credit where credit is due,
- and release any custom changes made to pathspec.
The source code for pathspec is available from the GitHub repo cpburnz/python-pathspec.
pathspec is available for install through PyPI:
pip install pathspec
pip install pathspec[hyperscan]
pip install pathspec[google-re2]
pathspec can also be built from source. The following packages will be required:
- build (>=0.6.0)
pathspec can then be built and installed with:
python -m build
pip install dist/pathspec-*-py3-none-any.whl
Documentation for pathspec is available on Read the Docs.
The related project pathspec-ruby (by highb) provides a similar library as a Ruby gem.