- Name: gigawork
- Version: 1.4.2
- Home page: https://github.com/cardoeng/gigawork
- Summary: A tool for extracting GitHub Actions workflows
- Upload time: 2024-10-24 07:32:26
- Author: Guillaume Cardoen
- Requires Python: >=3.8
- License: LGPLv3
- Keywords: gha, workflows, dataset, tool
- Requirements: click==8.1.7, gitdb==4.0.10, GitPython==3.1.43, smmap==5.0.1, jsonschema==4.21.1, ruamel.yaml==0.18.6

[![SWH](https://archive.softwareheritage.org/badge/origin/https://github.com/cardoeng/gigawork/)](https://archive.softwareheritage.org/browse/origin/?origin_url=https://github.com/cardoeng/gigawork)
# gigawork

An automated tool, written in Python, for extracting GitHub Actions workflows from Git repositories.
`gigawork` (**Gi**ve me **G**itHub **A**ctions **Work**flows) is primarily designed to be used as a command-line tool.
Given a Git repository, it extracts the different workflows and their versions from the Git history.
The extraction is done by traversing the Git history of the repository, going back in time and following the first-parent rule, until the first commit (or the given reference) is reached.
The workflows are saved in a given directory, and the relevant metadata (see [Usage](#usage)) is written to a given CSV file.
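
To make the first-parent rule concrete, here is a minimal sketch of such a traversal over a toy, in-memory history. It is not `gigawork`'s implementation: the `HISTORY` graph, the `first_parent_walk` function and its `stop_at` parameter are hypothetical names chosen for illustration, and the stop reference is assumed to be exclusive.

```python
from typing import Dict, List, Optional

# Hypothetical toy history: each commit maps to its list of parents,
# first parent first (the ordering Git records for merge commits).
HISTORY: Dict[str, List[str]] = {
    "e": ["d"],
    "d": ["b", "c"],  # merge commit: "b" is the first parent, "c" the merged branch
    "c": ["a"],
    "b": ["a"],
    "a": [],          # root commit
}


def first_parent_walk(
    history: Dict[str, List[str]], start: str, stop_at: Optional[str] = None
) -> List[str]:
    """Return commits from `start` back to the root, following first parents only.

    If `stop_at` is given, the walk ends just before that commit
    (assumption: the stop reference itself is excluded).
    """
    visited: List[str] = []
    current: Optional[str] = start
    while current is not None and current != stop_at:
        visited.append(current)
        parents = history[current]
        # Ignore every parent except the first one.
        current = parents[0] if parents else None
    return visited


print(first_parent_walk(HISTORY, "e"))               # ['e', 'd', 'b', 'a']
print(first_parent_walk(HISTORY, "e", stop_at="b"))  # ['e', 'd']
```

Commit `c` is never visited: it is only reachable through the second parent of the merge commit `d`.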

## Installation

The easiest way to install `gigawork` is from PyPI:
```
pip install gigawork
```

*Note: a recently fixed issue in GitPython (see https://github.com/gitpython-developers/GitPython/pull/1933) might affect the results of gigawork. As PyPI does not seem to allow Git dependencies, you may need to update the dependency yourself via `pip install git+https://github.com/gitpython-developers/GitPython.git`.*

Another easy way to install `gigawork` is via `pip`, directly from this GitHub repository:
```
pip install git+https://github.com/cardoeng/gigawork
```

Alternatively, you can clone this repository and install it locally:
```
git clone https://github.com/cardoeng/gigawork
cd gigawork
pip install .
```

You may wish to install this tool in a virtual environment, for example with the following commands:
```
virtualenv gigawork_venv
source gigawork_venv/bin/activate
pip install gigawork
```

## Usage

After installation, the `gigawork` command-line tool should be available in your shell. Otherwise, please replace `gigawork` with `python -m gigawork`. The explanations below remain valid in both cases.
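
If you call the tool from a script, the fallback described above can be sketched as follows. This is only an illustration: `gigawork_command` is a hypothetical helper, not part of the package, and it uses the current interpreter (`sys.executable`) for the `python -m gigawork` form.

```python
import shutil
import sys
from typing import List


def gigawork_command() -> List[str]:
    """Return the argv prefix to use for invoking gigawork."""
    # Prefer the console script if it is on the PATH.
    if shutil.which("gigawork") is not None:
        return ["gigawork"]
    # Otherwise fall back to the module form.
    return [sys.executable, "-m", "gigawork"]


print(gigawork_command())
```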

You can use `gigawork` with the following arguments:

```
Usage: gigawork [OPTIONS] REPOSITORY

  Extract the GitHub Actions workflow files from a single Git repository
  `REPOSITORY`. The extraction is done by traversing the Git history of the
  repository starting from the reference given to `-r` and going back in time
  respecting the first-parent rule until the first commit (or the reference
  given to `-a`) is reached. The Git repository can be local or distant. In
  the latter case, it will be pulled locally and deleted unless specified
  otherwise. Every extracted workflow file will be stored in the directory
  given to `-w` (or the directory `workflows` if not specified). The metadata
  related to the extracted workflows will be written in the CSV file given to
  `-o`, or in the standard output if not specified. The metadata related to
  the renaming of workflows will be stored in the CSV file given to `-ro`, or
  not stored if not specified.

  Example of usage: gigawork myRepository -n myRepositoryName -s directory -o
  output.csv --no-headers

Options:
  -r, --ref, --branch REF         The most recent commit reference (i.e.,
                                  commit SHA or TAG) to be considered for the
                                  extraction.
  -s, --save-repository DIRECTORY
                                  Save the repository to the given directory
                                  in case `REPOSITORY` was distant.
  -u, --update                    Fetch the repository at the given path.
  -a, --after REF                 Only consider commits after the given commit
                                  reference (i.e., commit SHA or TAG).
  -w, --workflows DIRECTORY       The directory where the extracted GitHub
                                  Actions workflow files will be stored.
  -o, --output FILE               The output CSV file where information
                                  related to the dataset will be stored. By
                                  default, the information will written to the
                                  standard output.
  -sa, --save-auxiliaries         If the information related to the auxiliary
                                  files will be stored. By default, the
                                  information will not be stored.
  -n, --repository-name TEXT      Add a column `repository` to the output file
                                  where each value will be equal to the
                                  provided parameter.
  --no-headers                    Remove the header row from the CSV output
                                  file.
  -h, --help                      Show this message and exit.
```

The CSV file given to `-o` (or that will be written to the standard output by default) will contain the following columns:
- `repository`: the name of the repository if `-n` was specified
- `commit_hash`: the commit SHA of the commit where the workflow file was extracted
- `author_name`: the name of the author of the commit
- `author_email`: the email of the author of the commit
- `committer_name`: the name of the committer of the commit
- `committer_email`: the email of the committer of the commit
- `committed_date`: the committed date of the commit
- `authored_date`: the authored date of the commit
- `file_path`: the path of the workflow file in the repository
- `previous_file_path`: the path of this file before the change
- `file_hash`: the SHA of the workflow file (and therefore its name in the output directory)
- `previous_file_hash`: the name of the related workflow file in the dataset before the change
- `change_type`: the type of change (A for added, M for modified, D for deleted). Note that a renamed file is reported as a modification.
- `valid_yaml`: a boolean indicating whether the file is valid YAML
- `probably_workflow`: a boolean indicating whether the file contains the YAML keys `on` and `jobs`. (Note that this value can be true even if the file is not valid YAML.)
- `valid_workflow`: a boolean indicating whether the file respects the GitHub Actions workflow syntax. A freely available JSON Schema is used for this check. This schema is neither made nor maintained by the authors of this repository. It was originally found at [https://json.schemastore.org/github-workflow.json](https://json.schemastore.org/github-workflow.json).
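
As an illustration of how these columns can be consumed, the sketch below reads a CSV in this layout with Python's standard `csv` module, counts the change types, and lists the non-deleted versions that fail the workflow check. The sample rows are made up, only a subset of the columns is shown, and the sketch assumes that the header row is present and that booleans are serialized as the strings `True` and `False`; check your own output before relying on that.

```python
import csv
import io
from collections import Counter
from typing import List, Tuple

# Hypothetical sample (made-up values, subset of the documented columns).
SAMPLE = """\
commit_hash,file_path,change_type,valid_workflow
c1,.github/workflows/ci.yml,A,True
c2,.github/workflows/ci.yml,M,False
c3,.github/workflows/ci.yml,M,True
c4,.github/workflows/ci.yml,D,False
"""


def summarize(csv_text: str) -> Tuple[Counter, List[Tuple[str, str]]]:
    """Count change types and collect (commit, path) pairs of invalid workflows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    changes: Counter = Counter()
    invalid: List[Tuple[str, str]] = []
    for row in reader:
        changes[row["change_type"]] += 1
        # Assumption: booleans are written as the strings "True"/"False".
        if row["change_type"] != "D" and row["valid_workflow"] != "True":
            invalid.append((row["commit_hash"], row["file_path"]))
    return changes, invalid


changes, invalid = summarize(SAMPLE)
print(dict(changes))  # {'A': 1, 'M': 2, 'D': 1}
print(invalid)        # [('c2', '.github/workflows/ci.yml')]
```

For a real run, replace `io.StringIO(csv_text)` with an open file handle on the file given to `-o`.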

### Examples

As an example, the following command extracts every workflow file from the repository `example_repository` and adds the name `my-example-name` to the output. It also saves various information (such as commit SHA, author name, ...) in `output.csv` (with the header row, as `--no-headers` is not specified). Each workflow file will be saved in the directory `workflows` (which is also the default save directory).

```bash
gigawork example_repository -n my-example-name -o output.csv -w workflows
```

Note that the repository does not have to be cloned beforehand. The tool can fetch it for you and clean up (unless told otherwise) when the work is done. An example is shown below. The GitHub repository `https://github.com/cardoeng/gigawork` will be fetched and saved under the `gigawork` directory, and the `repository` column will be `gigawork_name` in the resulting CSV file. Note that, if `-s gigawork` were not specified, the tool would create a temporary directory and clean it up when it finishes.

```bash
gigawork https://github.com/cardoeng/gigawork -n gigawork_name -s gigawork -o output.csv
```

## License

Distributed under [GNU Lesser General Public License v3](https://github.com/cardoeng/gigawork/blob/master/LICENSE.txt).

            
