hermes.commands.harvest.file_exists
===================================

.. py:module:: hermes.commands.harvest.file_exists

.. autoapi-nested-parse::

   Module for the ``FileExistsHarvestPlugin`` and its associated models and helpers.



Classes
-------

.. autoapisummary::

   hermes.commands.harvest.file_exists.URL
   hermes.commands.harvest.file_exists.MediaObject
   hermes.commands.harvest.file_exists.CreativeWork
   hermes.commands.harvest.file_exists.FileExistsHarvestSettings
   hermes.commands.harvest.file_exists.FileExistsHarvestPlugin


Functions
---------

.. autoapisummary::

   hermes.commands.harvest.file_exists._path_matches_pattern
   hermes.commands.harvest.file_exists._ls_files
   hermes.commands.harvest.file_exists._git_ls_files


Module Contents
---------------

.. py:class:: URL

   Basic model of a ``schema:URL``.

   See also: https://schema.org/URL


   .. py:attribute:: url
      :type:  str


   .. py:method:: from_path(path: pathlib.Path) -> typing_extensions.Self
      :classmethod:



   .. py:method:: as_codemeta() -> dict


.. py:class:: MediaObject

   Basic model of a ``schema:MediaObject``.

   See also: https://schema.org/MediaObject


   .. py:attribute:: content_size
      :type:  Optional[str]


   .. py:attribute:: encoding_format
      :type:  Optional[str]


   .. py:attribute:: url
      :type:  URL


   .. py:method:: from_path(path: pathlib.Path) -> typing_extensions.Self
      :classmethod:



   .. py:method:: as_codemeta() -> dict


.. py:class:: CreativeWork

   Basic model of a ``schema:CreativeWork``.

   See also: https://schema.org/CreativeWork


   .. py:attribute:: name
      :type:  str


   .. py:attribute:: associated_media
      :type:  MediaObject


   .. py:attribute:: keywords
      :type:  Set[str]


   .. py:method:: from_path(path: pathlib.Path, keywords: Iterable[str]) -> typing_extensions.Self
      :classmethod:



   .. py:method:: as_codemeta() -> dict


.. py:class:: FileExistsHarvestSettings(/, **data: Any)

   Bases: :py:obj:`pydantic.BaseModel`


   Settings for the ``file_exists`` harvester.


   .. py:attribute:: enable_git_ls_files
      :type:  bool
      :value: True



   .. py:attribute:: keep_untagged_files
      :type:  bool
      :value: False



   .. py:attribute:: search_patterns
      :type:  Dict[str, List[str]]


.. py:class:: FileExistsHarvestPlugin

   Bases: :py:obj:`hermes.commands.harvest.base.HermesHarvestPlugin`


   Harvest plugin that finds and tags files based on patterns.

   Files are searched using ``git ls-files`` or a recursive traversal of the working
   directory. If available, ``git ls-files`` is used. This can be disabled via the
   options.

   The files that were found are then tagged based on patterns such as ``readme.md``
   or ``licenses/*.txt``. Matching of the file paths is implemented using the ``match``
   method of Python's ``Path`` objects. This means that matching is performed from the
   end of the path. Search patterns are case-insensitive and use ``/`` as the path
   separator.
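
   The matching rules above can be illustrated with the standard library alone. This
   is a minimal sketch and not the plugin's own code: the example paths are made up,
   and lowercasing both sides is only one assumed way to obtain case-insensitive
   matching.

   .. code-block:: python

      from pathlib import PurePosixPath

      # A relative pattern is compared against the trailing path components.
      nested_match = PurePosixPath("docs/licenses/mit.txt").match("licenses/*.txt")
      deeper_match = PurePosixPath("licenses/old/mit.txt").match("licenses/*.txt")

      # A bare file name pattern therefore matches at any depth.
      any_depth_match = PurePosixPath("sub/dir/readme.md").match("readme.md")

      # PurePosixPath.match is case-sensitive by default, so this sketch
      # lowercases both the path and the pattern before comparing them.
      path = PurePosixPath("README.md")
      pattern = "readme.md"
      strict_match = path.match(pattern)
      folded_match = PurePosixPath(str(path).lower()).match(pattern.lower())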

   Each file is tagged with the name of the "group" that a matching pattern belongs
   to, and this name is used as the keyword. If a file matches multiple patterns, all
   appropriate keywords are added. Depending on the ``keep_untagged_files`` setting,
   files without any tags are then removed from the file list (this is the default).

   Files that were tagged with ``readme`` are added to the data model as a
   ``schema:URL`` using the ``codemeta:readme`` property. Files that were tagged
   ``license`` are added to the data model as a ``schema:URL`` using the
   ``schema:license`` property. All files are added to the data model as a
   ``schema:CreativeWork`` using the ``schema:hasPart`` property. All file URLs are
   given using the ``file:`` protocol and the absolute path of the file at the time of
   harvesting.
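
   A ``file:`` URL of this kind can be produced with ``pathlib``. The sketch below
   only illustrates the URL format; ``file_url`` is a hypothetical helper, and it is
   an assumption that ``URL.from_path`` works in a similar way.

   .. code-block:: python

      from pathlib import Path


      def file_url(path: Path) -> str:
          """Hypothetical helper: build an absolute ``file:`` URL for a path."""
          # as_uri() requires an absolute path, so resolve against the current
          # working directory first.
          return path.absolute().as_uri()


      url = file_url(Path("README.md"))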


   .. py:attribute:: settings_class


   .. py:attribute:: base_search_patterns


   .. py:attribute:: working_directory
      :type:  pathlib.Path


   .. py:attribute:: settings
      :type:  FileExistsHarvestSettings


   .. py:attribute:: search_patterns
      :type:  Dict[str, List[str]]


   .. py:attribute:: search_pattern_keywords
      :type:  Dict[str, Set[str]]


   .. py:attribute:: search_pattern_list
      :type:  List[str]
      :value: []



   .. py:method:: __call__(command: hermes.commands.harvest.base.HermesHarvestCommand)

      Execute the plugin.

      :param command: The command that triggered this plugin to run.



   .. py:method:: _find_files() -> List[pathlib.Path]

      Find files.

      If the setting ``enable_git_ls_files`` is ``True``, ``git ls-files`` is used to
      find matching files. If it is set to ``False`` or getting the list from git
      fails, the working directory is searched recursively.
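
      The fallback behaviour can be sketched as follows. This is an illustration
      under stated assumptions, not the method itself: ``find_files`` and its two
      callable parameters are hypothetical stand-ins for the plugin's settings and
      for the helper functions documented in this module.

      .. code-block:: python

         from pathlib import Path
         from typing import Callable, List, Optional


         def find_files(
             working_directory: Path,
             enable_git_ls_files: bool,
             git_lister: Callable[[Path], Optional[List[Path]]],
             fallback_lister: Callable[[Path], List[Path]],
         ) -> List[Path]:
             """Hypothetical sketch: prefer git, fall back to a recursive search."""
             if enable_git_ls_files:
                 files = git_lister(working_directory)
                 # None signals that git failed or is not available.
                 if files is not None:
                     return files
             return fallback_lister(working_directory)


         # Toy listers that stand in for the real helpers.
         from_git = find_files(Path("."), True, lambda d: [Path("a")], lambda d: [Path("b")])
         git_failed = find_files(Path("."), True, lambda d: None, lambda d: [Path("b")])
         git_disabled = find_files(Path("."), False, lambda d: [Path("a")], lambda d: [Path("b")])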



   .. py:method:: _tag_files(paths: Iterable[pathlib.Path]) -> Dict[pathlib.Path, Set[str]]

      Tag file paths based on patterns.

      The files are tagged using the "group" names of the search pattern as the
      keywords.
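
      A minimal sketch of this tagging step is shown below. The function name
      ``tag_files``, the example pattern groups, and the lowercasing used for
      case-insensitive matching are all assumptions made for illustration; the
      plugin's actual defaults are not shown on this page.

      .. code-block:: python

         from pathlib import PurePosixPath
         from typing import Dict, Iterable, List, Set


         def tag_files(
             paths: Iterable[PurePosixPath],
             search_patterns: Dict[str, List[str]],
         ) -> Dict[PurePosixPath, Set[str]]:
             """Hypothetical sketch: map each path to the groups whose patterns match."""
             tagged: Dict[PurePosixPath, Set[str]] = {}
             for path in paths:
                 lowered = PurePosixPath(str(path).lower())
                 tagged[path] = {
                     group
                     for group, patterns in search_patterns.items()
                     if any(lowered.match(pattern.lower()) for pattern in patterns)
                 }
             return tagged


         # Made-up pattern groups, keyed by the keyword they assign.
         example_patterns = {
             "readme": ["readme.md"],
             "license": ["license", "licenses/*.txt"],
             "markdown": ["*.md"],
         }
         example_paths = [PurePosixPath("README.md"), PurePosixPath("src/main.py")]
         example_tags = tag_files(example_paths, example_patterns)

      In this sketch, a file that matches patterns from several groups receives all
      of the corresponding keywords, and a file that matches nothing is mapped to an
      empty set.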



   .. py:method:: _filter_files(files_tags: Dict[pathlib.Path, Set[str]]) -> Dict[pathlib.Path, Set[str]]

      Filter out untagged files if required.

      If the setting ``keep_untagged_files`` is set to ``True``, the filter is not
      applied.
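
      The filter can be sketched in a few lines. ``filter_files`` here is a
      hypothetical free function that takes the setting as an argument; the real
      method reads it from the plugin's settings.

      .. code-block:: python

         from pathlib import Path
         from typing import Dict, Set


         def filter_files(
             files_tags: Dict[Path, Set[str]],
             keep_untagged_files: bool,
         ) -> Dict[Path, Set[str]]:
             """Hypothetical sketch: drop files with no tags unless told to keep them."""
             if keep_untagged_files:
                 return dict(files_tags)
             # An empty tag set is falsy, so untagged files are dropped here.
             return {path: tags for path, tags in files_tags.items() if tags}


         tagged = {Path("README.md"): {"readme"}, Path("src/main.py"): set()}
         filtered = filter_files(tagged, keep_untagged_files=False)
         unfiltered = filter_files(tagged, keep_untagged_files=True)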



.. py:function:: _path_matches_pattern(path: pathlib.Path, pattern: str) -> bool

   Case-insensitive path matching.

   Python 3.12 introduces the ``case_sensitive`` kwarg to the ``match`` function. For
   older Python versions, we have to implement this behaviour ourselves.
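
   One possible shape for such a helper is sketched below. The version check uses
   the ``case_sensitive`` keyword where it exists; lowercasing both path and pattern
   on older versions is an assumption about the fallback, not a description of the
   actual implementation.

   .. code-block:: python

      import sys
      from pathlib import PurePath, PurePosixPath


      def path_matches_pattern(path: PurePath, pattern: str) -> bool:
          """Hypothetical sketch of case-insensitive matching from the end of a path."""
          if sys.version_info >= (3, 12):
              return path.match(pattern, case_sensitive=False)
          # Assumed fallback: compare lowercased copies of path and pattern.
          lowered = type(path)(str(path).lower())
          return lowered.match(pattern.lower())


      readme_matches = path_matches_pattern(PurePosixPath("README.md"), "readme.md")
      license_matches = path_matches_pattern(
          PurePosixPath("Docs/LICENSES/MIT.txt"), "licenses/*.txt"
      )
      source_matches = path_matches_pattern(PurePosixPath("src/main.py"), "readme.md")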


.. py:function:: _ls_files(working_directory: pathlib.Path) -> List[pathlib.Path]

   Get a list of all files by recursively searching the ``working_directory``.

   Only regular files (i.e. files which are not directories, pipes, etc.) are returned.
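
   A recursive listing of this kind can be written with ``Path.rglob``. This is a
   sketch under assumptions: ``ls_files`` is a hypothetical name, and whether the
   real helper returns relative or absolute paths is not stated here.

   .. code-block:: python

      import tempfile
      from pathlib import Path
      from typing import List


      def ls_files(working_directory: Path) -> List[Path]:
          """Hypothetical sketch: list regular files below ``working_directory``."""
          # is_file() is False for directories, pipes, sockets and similar entries.
          # Note that it follows symlinks, so links to regular files are included.
          return [path for path in working_directory.rglob("*") if path.is_file()]


      # Small demonstration in a temporary directory.
      with tempfile.TemporaryDirectory() as tmp:
          root = Path(tmp)
          (root / "docs").mkdir()
          (root / "README.md").write_text("demo")
          (root / "docs" / "index.md").write_text("demo")
          found = sorted(p.relative_to(root).as_posix() for p in ls_files(root))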


.. py:function:: _git_ls_files(working_directory: pathlib.Path) -> Optional[List[pathlib.Path]]

   Get a list of all files by calling ``git ls-files`` in ``working_directory``.

   ``git ls-files`` is called with the ``--cached`` flag which lists all files tracked
   by git. The returned file paths are converted to a list of ``Path`` objects. Files
   that are tracked by git but don't exist on disk are not returned. If the git command
   fails or git is not found, ``None`` is returned.

   The result of this function is cached. Git is only executed once per given
   ``working_directory``.
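
   The behaviour described above could be approximated as follows. This is a sketch,
   not the actual function: it uses ``functools.lru_cache`` for the caching, adds
   the ``-z`` flag so that file names are NUL-separated and unquoted, and returns a
   tuple rather than a list so that the cached value cannot be mutated. All three
   choices are assumptions.

   .. code-block:: python

      import subprocess
      from functools import lru_cache
      from pathlib import Path
      from typing import Optional, Tuple


      @lru_cache(maxsize=None)
      def git_ls_files(working_directory: Path) -> Optional[Tuple[Path, ...]]:
          """Hypothetical sketch: list tracked files that exist on disk, or None."""
          try:
              result = subprocess.run(
                  ["git", "ls-files", "--cached", "-z"],
                  cwd=working_directory,
                  capture_output=True,
                  text=True,
                  check=True,
              )
          except (OSError, subprocess.CalledProcessError):
              # git is missing, the directory is unusable, or the command failed.
              return None
          names = [name for name in result.stdout.split("\0") if name]
          paths = (working_directory / name for name in names)
          # Skip files that are tracked by git but missing from the working tree.
          return tuple(path for path in paths if path.exists())


      # A directory that does not exist makes the call fail, so None is returned.
      missing = Path("this-directory-should-not-exist-hermes-demo")
      first = git_ls_files(missing)
      second = git_ls_files(missing)  # served from the cache, git is not run again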


