hermes.commands.harvest.file_exists

Module for the FileExistsHarvestPlugin and its associated models and helpers.

Classes

URL

Basic model of a schema:URL.

MediaObject

Basic model of a schema:MediaObject.

CreativeWork

Basic model of a schema:CreativeWork.

FileExistsHarvestSettings

Settings for file_exists harvester.

FileExistsHarvestPlugin

Harvest plugin that finds and tags files based on patterns.

Functions

_path_matches_pattern(→ bool)

Case-insensitive path matching.

_ls_files(→ List[pathlib.Path])

Get a list of all files by recursively searching the working_directory.

_git_ls_files(→ Optional[List[pathlib.Path]])

Get a list of all files by calling git ls-files in working_directory.

Module Contents

class hermes.commands.harvest.file_exists.URL

Basic model of a schema:URL.

See also: https://schema.org/URL

url: str
classmethod from_path(path: pathlib.Path) → typing_extensions.Self
as_codemeta() → dict
class hermes.commands.harvest.file_exists.MediaObject

Basic model of a schema:MediaObject.

See also: https://schema.org/MediaObject

content_size: str | None
encoding_format: str | None
url: URL
classmethod from_path(path: pathlib.Path) → typing_extensions.Self
as_codemeta() → dict
class hermes.commands.harvest.file_exists.CreativeWork

Basic model of a schema:CreativeWork.

See also: https://schema.org/CreativeWork

name: str
associated_media: MediaObject
keywords: Set[str]
classmethod from_path(path: pathlib.Path, keywords: Iterable[str]) → typing_extensions.Self
as_codemeta() → dict
class hermes.commands.harvest.file_exists.FileExistsHarvestSettings(/, **data: Any)

Bases: pydantic.BaseModel

Settings for file_exists harvester.

enable_git_ls_files: bool = True
keep_untagged_files: bool = False
search_patterns: Dict[str, List[str]]
class hermes.commands.harvest.file_exists.FileExistsHarvestPlugin

Bases: hermes.commands.harvest.base.HermesHarvestPlugin

Harvest plugin that finds and tags files based on patterns.

Files are found either via git ls-files or by a recursive traversal of the working directory. git ls-files is used when it is available; this can be disabled via the enable_git_ls_files setting.

The found files are then tagged based on patterns such as readme.md or licenses/*.txt. Matching of the file paths is implemented using the match function of Python’s Path objects, which means that matching is performed from the end of the path. Search patterns are case-insensitive and use / as the path separator.
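To illustrate what matching from the end of the path means, here is a small sketch using pathlib directly. The example path is made up, and the snippet shows only the standard library behaviour, not the plugin's own code (PurePosixPath is case-sensitive, so the case-insensitive handling is not shown here).

```python
from pathlib import PurePosixPath

# Hypothetical file path, relative to some working directory.
path = PurePosixPath("project/docs/licenses/MIT.txt")

# A relative pattern is compared against the trailing path components.
matches_suffix = path.match("licenses/*.txt")

# A bare file name pattern only has to match the last component.
matches_name = path.match("*.txt")

# A pattern naming a different parent directory does not match.
matches_other = path.match("docs/*.txt")

print(matches_suffix, matches_name, matches_other)
```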

Files are tagged using the name of the file name pattern’s “group” as the keyword. If a file matches multiple patterns, all appropriate keywords are added. Unless keep_untagged_files is enabled, files without any tags are then removed from the file list (removal is the default).

Files that were tagged with readme are added to the data model as a schema:URL using the codemeta:readme property. Files that were tagged with license are added to the data model as a schema:URL using the schema:license property. All files are added to the data model as a schema:CreativeWork using the schema:hasPart property. All file URLs are given using the file: protocol and the absolute path of the file at the time of harvesting.
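The mapping from tags to properties described above could be sketched roughly as follows. The function name sketch_record and the shape of the returned dictionary are assumptions made for illustration; the plugin's real output is produced by the as_codemeta() methods of its models and may look different. Only Path.resolve() and Path.as_uri() are standard library facts here.

```python
from pathlib import Path
from typing import Dict, Set


def sketch_record(files_tags: Dict[Path, Set[str]]) -> dict:
    """Hypothetical helper: map tagged files to the properties named above."""
    record: dict = {"schema:hasPart": []}
    for path in sorted(files_tags):
        tags = files_tags[path]
        # file: URL built from the absolute path at the time of the call.
        url = path.resolve().as_uri()
        # Every file is listed as a part (illustrative structure only).
        record["schema:hasPart"].append(
            {"name": path.name, "keywords": sorted(tags), "url": url}
        )
        if "readme" in tags:
            record.setdefault("codemeta:readme", []).append(url)
        if "license" in tags:
            record.setdefault("schema:license", []).append(url)
    return record


example = sketch_record(
    {Path("README.md"): {"readme"}, Path("LICENSE"): {"license"}}
)
```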

settings_class
base_search_patterns
working_directory: pathlib.Path
settings: FileExistsHarvestSettings
search_patterns: Dict[str, List[str]]
search_pattern_keywords: Dict[str, Set[str]]
search_pattern_list: List[str] = []
__call__(command: hermes.commands.harvest.base.HermesHarvestCommand)

Execute the plugin.

Parameters:

command – The command that triggered this plugin to run.

_find_files() → List[pathlib.Path]

Find files.

If the setting enable_git_ls_files is True, git ls-files is used to find matching files. If it is set to False or getting the list from git fails, the working directory is searched recursively.
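The fallback logic can be sketched as below. This is a minimal illustration, not the plugin's implementation: the function find_files and its two lister parameters are hypothetical stand-ins for the git-based and recursive helpers.

```python
from pathlib import Path
from typing import Callable, List, Optional


def find_files(
    working_directory: Path,
    enable_git_ls_files: bool,
    git_lister: Callable[[Path], Optional[List[Path]]],
    fallback_lister: Callable[[Path], List[Path]],
) -> List[Path]:
    """Prefer the git-based listing, fall back to the recursive search."""
    if enable_git_ls_files:
        files = git_lister(working_directory)
        # None signals that git failed or is not installed.
        if files is not None:
            return files
    return fallback_lister(working_directory)
```

Passing the listers as parameters keeps the sketch self-contained and makes the decision logic easy to exercise without a git repository.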

_tag_files(paths: Iterable[pathlib.Path]) → Dict[pathlib.Path, Set[str]]

Tag file paths based on patterns.

The files are tagged using the “group” names of the search pattern as the keywords.
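A possible shape of this tagging step, assuming search patterns are given as a mapping from group name to a list of patterns (as the search_patterns type suggests). The function tag_files and the example patterns are illustrative only; case-insensitivity is approximated here by lower-casing both sides.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Set


def tag_files(
    paths: Iterable[PurePosixPath],
    search_patterns: Dict[str, List[str]],
) -> Dict[PurePosixPath, Set[str]]:
    """Tag each path with the names of all groups that have a matching pattern."""
    files_tags: Dict[PurePosixPath, Set[str]] = {}
    for path in paths:
        # Lower-case the path so that matching ignores case.
        lowered = PurePosixPath(path.as_posix().lower())
        files_tags[path] = {
            group
            for group, patterns in search_patterns.items()
            if any(lowered.match(pattern.lower()) for pattern in patterns)
        }
    return files_tags


# Hypothetical patterns and paths for demonstration.
patterns = {"readme": ["readme.md"], "license": ["license", "licenses/*.txt"]}
paths = [
    PurePosixPath("README.md"),
    PurePosixPath("licenses/MIT.txt"),
    PurePosixPath("src/main.py"),
]
tagged = tag_files(paths, patterns)
```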

_filter_files(files_tags: Dict[pathlib.Path, Set[str]]) → Dict[pathlib.Path, Set[str]]

Filter out untagged files if required.

If the setting keep_untagged_files is set to True, the filter is not applied.
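As a rough sketch of the filter, assuming the tags are held in a dictionary of path to keyword set (the standalone function filter_files is hypothetical and only mirrors the behaviour described above):

```python
from pathlib import Path
from typing import Dict, Set


def filter_files(
    files_tags: Dict[Path, Set[str]],
    keep_untagged_files: bool = False,
) -> Dict[Path, Set[str]]:
    """Drop entries with an empty tag set unless untagged files are kept."""
    if keep_untagged_files:
        return dict(files_tags)
    return {path: tags for path, tags in files_tags.items() if tags}
```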

hermes.commands.harvest.file_exists._path_matches_pattern(path: pathlib.Path, pattern: str) → bool

Case-insensitive path matching.

Python 3.12 introduces the case_sensitive kwarg to the match function. For older Python versions, we have to implement this behaviour ourselves.
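One way such a compatibility helper might look is sketched below. Lower-casing both the path and the pattern is an assumption about how the fallback could work, not a statement of what the module does; note that it would also alter character classes such as [A-Z] in a pattern.

```python
import sys
from pathlib import PurePath, PurePosixPath


def path_matches_pattern(path: PurePath, pattern: str) -> bool:
    """Match a path against a pattern while ignoring case."""
    if sys.version_info >= (3, 12):
        # The keyword-only argument exists from Python 3.12 onwards.
        return path.match(pattern, case_sensitive=False)
    # Fallback sketch: compare lower-cased path and pattern instead.
    lowered = PurePosixPath(path.as_posix().lower())
    return lowered.match(pattern.lower())
```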

hermes.commands.harvest.file_exists._ls_files(working_directory: pathlib.Path) → List[pathlib.Path]

Get a list of all files by recursively searching the working_directory.

Only regular files (i.e. files which are not directories, pipes, etc.) are returned.
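A recursive listing restricted to regular files can be written with pathlib alone. The sketch below assumes nothing about the module beyond the description above; the function name ls_files is illustrative.

```python
from pathlib import Path
from typing import List


def ls_files(working_directory: Path) -> List[Path]:
    """Return all regular files below working_directory."""
    # rglob("*") walks the tree; is_file() skips directories, pipes, etc.
    return [path for path in working_directory.rglob("*") if path.is_file()]
```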

hermes.commands.harvest.file_exists._git_ls_files(working_directory: pathlib.Path) → List[pathlib.Path] | None

Get a list of all files by calling git ls-files in working_directory.

git ls-files is called with the --cached flag, which lists all files tracked by git. The returned file paths are converted to a list of Path objects. Files that are tracked by git but don’t exist on disk are not returned. If the git command fails or git is not found, None is returned.

The result of this function is cached. Git is only executed once per given working_directory.
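The behaviour described above could be approximated as follows. This is a sketch under assumptions: the use of subprocess.run and functools.lru_cache is a choice made for the example, and the actual module may invoke git and cache results differently.

```python
import subprocess
from functools import lru_cache
from pathlib import Path
from typing import List, Optional


@lru_cache(maxsize=None)
def git_ls_files(working_directory: Path) -> Optional[List[Path]]:
    """List tracked files that exist on disk, or None if git is unusable."""
    try:
        result = subprocess.run(
            ["git", "ls-files", "--cached"],
            cwd=working_directory,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, the directory is unusable, or the command failed.
        return None
    # Note: git may quote paths that contain unusual characters.
    candidates = (working_directory / line for line in result.stdout.splitlines())
    # Tracked files that were deleted from disk are skipped.
    return [path for path in candidates if path.is_file()]
```

Because lru_cache keys on the argument, a second call with the same working_directory returns the stored result without running git again. The cached list is shared between callers, so it should be treated as read-only.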