# git-history
[![PyPI](https://img.shields.io/pypi/v/git-history.svg)](https://pypi.org/project/git-history/)
[![Changelog](https://img.shields.io/github/v/release/simonw/git-history?include_prereleases&label=changelog)](https://github.com/simonw/git-history/releases)
[![Tests](https://github.com/simonw/git-history/workflows/Test/badge.svg)](https://github.com/simonw/git-history/actions?query=workflow%3ATest)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/git-history/blob/master/LICENSE)
Tools for analyzing Git history using SQLite
For background on this project see [git-history: a tool for analyzing scraped data collected using Git and SQLite](https://simonwillison.net/2021/Dec/7/git-history/)
## Installation
Install this tool using `pip`:

    $ pip install git-history
## Demos
[git-history-demos.datasette.io](http://git-history-demos.datasette.io/) hosts three example databases created using this tool:
- [pge-outages](https://git-history-demos.datasette.io/pge-outages) shows a history of PG&E (the electricity supplier) [outages](https://pgealerts.alerts.pge.com/outagecenter/), using data collected in [simonw/pge-outages](https://github.com/simonw/pge-outages) converted using [pge-outages.sh](https://github.com/simonw/git-history/blob/main/demos/pge-outages.sh)
- [ca-fires](https://git-history-demos.datasette.io/ca-fires) shows a history of fires in California reported on [fire.ca.gov/incidents](https://www.fire.ca.gov/incidents/), from data in [simonw/ca-fires-history](https://github.com/simonw/ca-fires-history) converted using [ca-fires.sh](https://github.com/simonw/git-history/blob/main/demos/ca-fires.sh)
- [sf-bay-511](https://git-history-demos.datasette.io/sf-bay-511) has records of San Francisco Bay Area traffic and transit incident data from [511.org](https://511.org/), collected in [dbreunig/511-events-history](https://github.com/dbreunig/511-events-history) converted using [sf-bay-511.sh](https://github.com/simonw/git-history/blob/main/demos/sf-bay-511.sh)
The demos are deployed using [Datasette](https://datasette.io/) on [Google Cloud Run](https://cloud.google.com/run/) by [this GitHub Actions workflow](https://github.com/simonw/git-history/blob/main/.github/workflows/deploy-demos.yml).
## Usage
This tool can be run against a Git repository that holds a file that contains JSON, CSV/TSV or some other format and which has multiple versions tracked in the Git history. Read [Git scraping: track changes over time by scraping to a Git repository](https://simonwillison.net/2020/Oct/9/git-scraping/) to understand how you might create such a repository.
The `file` command analyzes the history of an individual file within the repository, and generates a SQLite database table that represents the different versions of that file over time.
The file is assumed to contain multiple objects - for example, the results of scraping an electricity outage map or a CSV file full of records.
Say you have a file called `incidents.json` that is a JSON array of objects, with multiple versions of that file recorded in a repository. Each version of that file might look something like this:
```json
[
    {
        "IncidentID": "abc123",
        "Location": "Corner of 4th and Vermont",
        "Type": "fire"
    },
    {
        "IncidentID": "cde448",
        "Location": "555 West Example Drive",
        "Type": "medical"
    }
]
```
Change directory into the Git repository in question and run the following:

    git-history file incidents.db incidents.json
This will create a new SQLite database in the `incidents.db` file with three tables:
- `commits` containing a row for every commit, with a `hash` column, the `commit_at` date and a foreign key to a `namespace`.
- `item` containing a row for every item in every version of the `incidents.json` file - with an extra `_commit` column that is a foreign key back to the `commits` table.
- `namespaces` containing a single row. This allows you to build multiple tables for different files, using the `--namespace` option described below.
The database schema for this example will look like this:
<!-- [[[cog
import cog, json
from git_history import cli
from click.testing import CliRunner
from tests.test_git_history import make_repo
import sqlite_utils
import tempfile, pathlib
tmpdir = pathlib.Path(tempfile.mkdtemp())
db_path = str(tmpdir / "data.db")
make_repo(tmpdir)
runner = CliRunner()
result = runner.invoke(cli.cli, [
"file", db_path, str(tmpdir / "repo" / "incidents.json"), "--repo", str(tmpdir / "repo")
])
cog.out("```sql\n")
cog.out(sqlite_utils.Database(db_path).schema)
cog.out("\n```")
]]] -->
```sql
CREATE TABLE [namespaces] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT
);
CREATE UNIQUE INDEX [idx_namespaces_name]
    ON [namespaces] ([name]);
CREATE TABLE [commits] (
   [id] INTEGER PRIMARY KEY,
   [namespace] INTEGER REFERENCES [namespaces]([id]),
   [hash] TEXT,
   [commit_at] TEXT
);
CREATE UNIQUE INDEX [idx_commits_namespace_hash]
    ON [commits] ([namespace], [hash]);
CREATE TABLE [item] (
   [IncidentID] TEXT,
   [Location] TEXT,
   [Type] TEXT
);
```
<!-- [[[end]]] -->
If you have 10 historic versions of the `incidents.json` file and each one contains 30 incidents, you will end up with 10 * 30 = 300 rows in your `item` table.
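That arithmetic can be sketched with a toy example - plain `sqlite3` against a made-up miniature of the `item` schema, not anything the tool itself runs:

```python
import sqlite3

# Without --id, every item from every version of the file becomes its
# own row in the [item] table: versions x items rows in total.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE [item] ([IncidentID] TEXT, [Location] TEXT, [Type] TEXT)")
versions, incidents_per_version = 10, 30
for version in range(versions):
    for i in range(incidents_per_version):
        db.execute(
            "INSERT INTO item VALUES (?, ?, ?)",
            (f"id{i}", f"Location {i}", "fire"),
        )
total = db.execute("SELECT count(*) FROM item").fetchone()[0]
print(total)  # 300
```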
### Track the history of individual items using IDs
If your objects have a unique identifier - or multiple columns that together form a unique identifier - you can use the `--id` option to de-duplicate and track changes to each of those items over time.
This provides a much more interesting way to apply this tool.
If there is a unique identifier column called `IncidentID` you could run the following:

    git-history file incidents.db incidents.json --id IncidentID
The database schema used here is very different from the one used without the `--id` option.
If you have already imported history, the command will skip any commits that it has seen already and just process new ones. This means that even though an initial import can be slow, subsequent imports should run a lot faster.
This command will create six tables - `commits`, `item`, `item_version`, `columns`, `item_changed` and `namespaces`.
Here's the full schema:
<!-- [[[cog
db_path2 = str(tmpdir / "data2.db")
result = runner.invoke(cli.cli, [
"file", db_path2, str(tmpdir / "repo" / "incidents.json"),
"--repo", str(tmpdir / "repo"),
"--id", "IncidentID"
])
cog.out("```sql\n")
cog.out(sqlite_utils.Database(db_path2).schema)
cog.out("\n```")
]]] -->
```sql
CREATE TABLE [namespaces] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT
);
CREATE UNIQUE INDEX [idx_namespaces_name]
    ON [namespaces] ([name]);
CREATE TABLE [commits] (
   [id] INTEGER PRIMARY KEY,
   [namespace] INTEGER REFERENCES [namespaces]([id]),
   [hash] TEXT,
   [commit_at] TEXT
);
CREATE UNIQUE INDEX [idx_commits_namespace_hash]
    ON [commits] ([namespace], [hash]);
CREATE TABLE [item] (
   [_id] INTEGER PRIMARY KEY,
   [_item_id] TEXT
, [IncidentID] TEXT, [Location] TEXT, [Type] TEXT, [_commit] INTEGER);
CREATE UNIQUE INDEX [idx_item__item_id]
    ON [item] ([_item_id]);
CREATE TABLE [item_version] (
   [_id] INTEGER PRIMARY KEY,
   [_item] INTEGER REFERENCES [item]([_id]),
   [_version] INTEGER,
   [_commit] INTEGER REFERENCES [commits]([id]),
   [IncidentID] TEXT,
   [Location] TEXT,
   [Type] TEXT,
   [_item_full_hash] TEXT
);
CREATE TABLE [columns] (
   [id] INTEGER PRIMARY KEY,
   [namespace] INTEGER REFERENCES [namespaces]([id]),
   [name] TEXT
);
CREATE UNIQUE INDEX [idx_columns_namespace_name]
    ON [columns] ([namespace], [name]);
CREATE TABLE [item_changed] (
   [item_version] INTEGER REFERENCES [item_version]([_id]),
   [column] INTEGER REFERENCES [columns]([id]),
   PRIMARY KEY ([item_version], [column])
);
CREATE VIEW item_version_detail AS select
  commits.commit_at as _commit_at,
  commits.hash as _commit_hash,
  item_version.*,
  (
    select json_group_array(name) from columns
    where id in (
      select column from item_changed
      where item_version = item_version._id
    )
) as _changed_columns
from item_version
  join commits on commits.id = item_version._commit;
CREATE INDEX [idx_item_version__item]
    ON [item_version] ([_item]);
```
<!-- [[[end]]] -->
#### item table
The `item` table will contain the most recent version of each row, de-duplicated by ID, plus the following additional columns:
- `_id` - a numeric integer primary key, used as a foreign key from the `item_version` table.
- `_item_id` - a hash of the values of the columns specified using the `--id` option to the command. This is used for de-duplication when processing new versions.
- `_commit` - a foreign key to the `commits` table, representing the most recent commit to modify this item.
#### item_version table
The `item_version` table will contain a row for each captured version of each item that differs from the previous version, plus the following columns:
- `_id` - a numeric ID for the item version record.
- `_item` - a foreign key to the `item` table.
- `_version` - the numeric version number, starting at 1 and incrementing for each captured version.
- `_commit` - a foreign key to the `commits` table.
- `_item_full_hash` - a hash of this version of the item. This is used internally by the tool to identify items that have changed between commits.
The other columns in this table represent columns in the original data that have changed since the previous version. If the value has not changed, it will be represented by a `null`.
If a value was previously set but has been changed back to `null` it will still be represented as `null` in the `item_version` row. You can identify these using the `item_changed` many-to-many table described below.
You can use the `--full-versions` option to store full copies of the item at each version, rather than just storing the columns that have changed.
#### item_version_detail view
This SQL view joins `item_version` against `commits` to add three further columns: `_commit_at` with the date of the commit, `_commit_hash` with the Git commit hash, and `_changed_columns` with a JSON array of the names of the columns that changed in that version.
#### item_changed
This many-to-many table indicates exactly which columns were changed in an `item_version`.
- `item_version` is a foreign key to a row in the `item_version` table.
- `column` is a foreign key to a row in the `columns` table.
This table will have the largest number of rows, which is why it stores just two integers per row in order to save space.
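Here's a hypothetical miniature of the `columns` and `item_changed` pair, using made-up IDs, showing how the changed column names can be recovered from those two integers - the same `json_group_array()` trick the `item_version_detail` view uses:

```python
import json
import sqlite3

# Miniature of the [columns] / [item_changed] pair: two integers per row
# are enough to record exactly which columns changed in a version.
# The IDs (item_version 7, columns 1 and 2) are invented for illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE [columns] ([id] INTEGER PRIMARY KEY, [name] TEXT);
CREATE TABLE [item_changed] ([item_version] INTEGER, [column] INTEGER);
INSERT INTO columns VALUES (1, 'Location'), (2, 'Type');
INSERT INTO item_changed VALUES (7, 1), (7, 2);
""")
changed = db.execute("""
    SELECT json_group_array(name) FROM columns
    WHERE id IN (SELECT column FROM item_changed WHERE item_version = 7)
""").fetchone()[0]
print(json.loads(changed))  # ['Location', 'Type']
```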
#### columns
The `columns` table stores column names. It is referenced by `item_changed`.
- `id` - an integer ID.
- `name` - the name of the column.
- `namespace` - a foreign key to `namespaces`, used if multiple file histories share the same database.
#### Reserved column names
<!-- [[[cog
from git_history.utils import RESERVED
cog.out("Note that ")
cog.out(", ".join("`{}`".format(r) for r in RESERVED))
cog.out(" are considered reserved column names for the purposes of this tool.")
]]] -->
Note that `_id`, `_item_full_hash`, `_item`, `_item_id`, `_version`, `_commit`, `_item_id`, `_commit_at`, `_commit_hash`, `_changed_columns`, `rowid` are considered reserved column names for the purposes of this tool.
<!-- [[[end]]] -->
If your data contains any of these they will be renamed to add a trailing underscore, for example `_id_`, `_item_`, `_version_`, to avoid clashing with the reserved columns.
If you have a column with a name such as `_commit_` it will be renamed too, adding an additional trailing underscore, so `_commit_` becomes `_commit__` and `_commit__` becomes `_commit___`.
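The renaming rule could be sketched like this - a guess at the behavior described above, not the tool's actual code:

```python
# Hypothetical sketch: a column clashes if it is a reserved name, or a
# reserved name plus trailing underscores, and gains one more underscore.
RESERVED = {
    "_id", "_item", "_item_id", "_version", "_commit", "_item_full_hash",
    "_commit_at", "_commit_hash", "_changed_columns", "rowid",
}

def rename_if_clashing(name):
    # rstrip("_") removes only trailing underscores, so "_commit_" still
    # reduces to the reserved name "_commit".
    if name in RESERVED or name.rstrip("_") in RESERVED:
        return name + "_"
    return name

print(rename_if_clashing("_commit"))   # _commit_
print(rename_if_clashing("_commit_"))  # _commit__
print(rename_if_clashing("Location"))  # Location
```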
### Additional options
- `--repo DIRECTORY` - the path to the Git repository, if it is not the current working directory.
- `--branch TEXT` - the Git branch to analyze - defaults to `main`.
- `--id TEXT` - as described above: pass one or more columns that uniquely identify a record, so that changes to that record can be calculated over time.
- `--full-versions` - instead of recording just the columns that have changed in the `item_version` table, record a full copy of each version of the item.
- `--ignore TEXT` - one or more columns to ignore - they will not be included in the resulting database.
- `--csv` - treat the data as CSV or TSV rather than JSON, and attempt to guess the correct dialect.
- `--dialect` - use a specific CSV dialect. Options are `excel`, `excel-tab` and `unix` - see [the Python CSV documentation](https://docs.python.org/3/library/csv.html#csv.excel) for details.
- `--skip TEXT` - one or more full Git commit hashes that should be skipped. You can use this if some of the data in your revision history is corrupted in a way that prevents this tool from working.
- `--start-at TEXT` - skip commits prior to the specified commit hash.
- `--start-after TEXT` - skip commits up to and including the specified commit hash, then start processing from the following commit.
- `--convert TEXT` - custom Python code for a conversion, described below.
- `--import TEXT` - additional Python modules to import for `--convert`.
- `--ignore-duplicate-ids` - if a single version of a file has the same ID in it more than once, the tool will exit with an error. Use this option to ignore that and instead pick just the first of the duplicates.
- `--namespace TEXT` - use this if you wish to include the history of multiple different files in the same database. The default is `item` but you can set it to something else, which will produce tables with names like `yournamespace` and `yournamespace_version`.
- `--wal` - Enable WAL mode on the created database file. Use this if you plan to run queries against the database while `git-history` is creating it.
- `--silent` - don't show the progress bar.
### CSV and TSV data
If the data in your repository is a CSV or TSV file you can process it by adding the `--csv` option. This will attempt to detect which delimiter is used by the file, so the same option works for both comma- and tab-separated values.

    git-history file trees.db trees.csv --id TreeID
You can also specify the CSV dialect using the `--dialect` option.
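The delimiter detection can be approximated with Python's `csv.Sniffer` - an illustration of the idea, not necessarily the tool's exact implementation:

```python
import csv

# Sniffer guesses the dialect from a sample, which is how one option can
# cover both comma- and tab-separated files. Sample data is made up.
tsv_sample = "TreeID\tSpecies\n1\tOak\n2\tMaple\n"
dialect = csv.Sniffer().sniff(tsv_sample)
print(repr(dialect.delimiter))  # '\t'
```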
### Custom conversions using --convert
If your data is not already either CSV/TSV or a flat JSON array, you can reshape it using the `--convert` option.
The format needed by this tool is an array of dictionaries, as demonstrated by the `incidents.json` example above.
If your data does not fit this shape, you can provide a snippet of Python code that converts the on-disk content of each stored file into a Python list of dictionaries.
For example, if your stored files each look like this:
```json
{
    "incidents": [
        {
            "id": "552",
            "name": "Hawthorne Fire",
            "engines": 3
        },
        {
            "id": "556",
            "name": "Merlin Fire",
            "engines": 1
        }
    ]
}
```
You could use the following Python snippet to convert them to the required format:
```python
json.loads(content)["incidents"]
```
(The `json` module is exposed to your custom function by default.)
You would then run the tool like this:

    git-history file database.db incidents.json \
        --id id \
        --convert 'json.loads(content)["incidents"]'
The `content` variable is always a `bytes` object representing the content of the file at a specific moment in the repository's history.
You can import additional modules using `--import`. This example shows how you could read a CSV file that uses `;` as the delimiter:

    git-history file trees.db ../sf-tree-history/Street_Tree_List.csv \
        --repo ../sf-tree-history \
        --import csv \
        --import io \
        --convert '
        fp = io.StringIO(content.decode("utf-8"))
        return list(csv.DictReader(fp, delimiter=";"))
        ' \
        --id TreeID
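The same conversion can be tried outside the CLI, against made-up sample bytes, to check that it returns a list of dictionaries:

```python
import csv
import io

# `content` arrives as bytes; the snippet must decode it and produce a
# list of dictionaries. The sample data here is invented for illustration.
content = b"TreeID;Species\n1;Oak\n2;Maple\n"
fp = io.StringIO(content.decode("utf-8"))
rows = list(csv.DictReader(fp, delimiter=";"))
print(rows)  # [{'TreeID': '1', 'Species': 'Oak'}, {'TreeID': '2', 'Species': 'Maple'}]
```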
You can import nested modules such as [ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html) using `--import xml.etree.ElementTree`, then refer to them in your function body as `xml.etree.ElementTree`. For example, if your tracked data was in an `items.xml` file that looked like this:
```xml
<items>
    <item id="1" name="One" />
    <item id="2" name="Two" />
    <item id="3" name="Three" />
</items>
```
You could load it using the following `--convert` script:
```
git-history file items.db items.xml --convert '
tree = xml.etree.ElementTree.fromstring(content)
return [el.attrib for el in tree.iter("item")]
' --import xml.etree.ElementTree --id id
```
If your Python code spans more than one line it needs to include a `return` statement.
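A conversion like the XML one above can be checked as plain Python before passing it to `--convert`:

```python
import xml.etree.ElementTree

# Each <item> element's attributes become one dictionary.
# Sample content abbreviated from the items.xml example.
content = b'<items><item id="1" name="One" /><item id="2" name="Two" /></items>'
tree = xml.etree.ElementTree.fromstring(content)
rows = [el.attrib for el in tree.iter("item")]
print(rows)  # [{'id': '1', 'name': 'One'}, {'id': '2', 'name': 'Two'}]
```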
You can also use Python generators in your `--convert` code, for example:

    git-history file stats.db package-stats/stats.json \
        --repo package-stats \
        --convert '
        data = json.loads(content)
        for key, counts in data.items():
            for date, count in counts.items():
                yield {
                    "package": key,
                    "date": date,
                    "count": count
                }
        ' --id package --id date
This conversion function expects data that looks like this:
```json
{
    "airtable-export": {
        "2021-05-18": 66,
        "2021-05-19": 60,
        "2021-05-20": 87
    }
}
```
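To check this, the generator can be run as a standalone function against that sample content:

```python
import json

# The generator conversion from above: nested package -> date -> count
# data is flattened into one dictionary per (package, date) pair.
content = b'{"airtable-export": {"2021-05-18": 66, "2021-05-19": 60, "2021-05-20": 87}}'

def convert(content):
    data = json.loads(content)
    for key, counts in data.items():
        for date, count in counts.items():
            yield {"package": key, "date": date, "count": count}

rows = list(convert(content))
print(rows[0])  # {'package': 'airtable-export', 'date': '2021-05-18', 'count': 66}
```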
## Development
To contribute to this tool, first check out the code. Then create a new virtual environment:

    cd git-history
    python -m venv venv
    source venv/bin/activate
Or if you are using `pipenv`:

    pipenv shell
Now install the dependencies and test dependencies:

    pip install -e '.[test]'
To run the tests:

    pytest
To update the schema examples in this README file:

    cog -r README.md
Raw data
{
"_id": null,
"home_page": "https://github.com/simonw/git-history",
"name": "git-history",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "",
"author": "Simon Willison",
"author_email": "",
"download_url": "https://files.pythonhosted.org/packages/d8/24/ad8bcddc30e1bc9aa1242e242a28ddf836265d099eba5e23c0625ca48928/git-history-0.6.1.tar.gz",
"platform": "",
"description": "# git-history\n\n[![PyPI](https://img.shields.io/pypi/v/git-history.svg)](https://pypi.org/project/git-history/)\n[![Changelog](https://img.shields.io/github/v/release/simonw/git-history?include_prereleases&label=changelog)](https://github.com/simonw/git-history/releases)\n[![Tests](https://github.com/simonw/git-history/workflows/Test/badge.svg)](https://github.com/simonw/git-history/actions?query=workflow%3ATest)\n[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/git-history/blob/master/LICENSE)\n\nTools for analyzing Git history using SQLite\n\nFor background on this project see [git-history: a tool for analyzing scraped data collected using Git and SQLite](https://simonwillison.net/2021/Dec/7/git-history/)\n\n## Installation\n\nInstall this tool using `pip`:\n\n $ pip install git-history\n\n## Demos\n\n[git-history-demos.datasette.io](http://git-history-demos.datasette.io/) hosts three example databases created using this tool:\n\n- [pge-outages](https://git-history-demos.datasette.io/pge-outages) shows a history of PG&E (the electricity supplier) [outages](https://pgealerts.alerts.pge.com/outagecenter/), using data collected in [simonw/pge-outages](https://github.com/simonw/pge-outages) converted using [pge-outages.sh](https://github.com/simonw/git-history/blob/main/demos/pge-outages.sh)\n- [ca-fires](https://git-history-demos.datasette.io/ca-fires) shows a history of fires in California reported on [fire.ca.gov/incidents](https://www.fire.ca.gov/incidents/), from data in [simonw/ca-fires-history](https://github.com/simonw/ca-fires-history) converted using [ca-fires.sh](https://github.com/simonw/git-history/blob/main/demos/ca-fires.sh)\n- [sf-bay-511](https://git-history-demos.datasette.io/sf-bay-511) has records of San Francisco Bay Area traffic and transit incident data from [511.org](https://511.org/), collected in [dbreunig/511-events-history](https://github.com/dbreunig/511-events-history) 
converted using [sf-bay-511.sh](https://github.com/simonw/git-history/blob/main/demos/sf-bay-511.sh)\n\nThe demos are deployed using [Datasette](https://datasette.io/) on [Google Cloud Run](https://cloud.google.com/run/) by [this GitHub Actions workflow](https://github.com/simonw/git-history/blob/main/.github/workflows/deploy-demos.yml).\n\n## Usage\n\nThis tool can be run against a Git repository that holds a file that contains JSON, CSV/TSV or some other format and which has multiple versions tracked in the Git history. Read [Git scraping: track changes over time by scraping to a Git repository](https://simonwillison.net/2020/Oct/9/git-scraping/) to understand how you might create such a repository.\n\nThe `file` command analyzes the history of an individual file within the repository, and generates a SQLite database table that represents the different versions of that file over time.\n\nThe file is assumed to contain multiple objects - for example, the results of scraping an electricity outage map or a CSV file full of records.\n\nAssuming you have a file called `incidents.json` that is a JSON array of objects, with multiple versions of that file recorded in a repository. 
Each version of that file might look something like this:\n\n```json\n[\n {\n \"IncidentID\": \"abc123\",\n \"Location\": \"Corner of 4th and Vermont\",\n \"Type\": \"fire\"\n },\n {\n \"IncidentID\": \"cde448\",\n \"Location\": \"555 West Example Drive\",\n \"Type\": \"medical\"\n }\n]\n```\n\nChange directory into the GitHub repository in question and run the following:\n\n git-history file incidents.db incidents.json\n\nThis will create a new SQLite database in the `incidents.db` file with three tables:\n\n- `commits` containing a row for every commit, with a `hash` column, the `commit_at` date and a foreign key to a `namespace`.\n- `item` containing a row for every item in every version of the `filename.json` file - with an extra `_commit` column that is a foreign key back to the `commit` table.\n- `namespaces` containing a single row. This allows you to build multiple tables for different files, using the `--namespace` option described below.\n\nThe database schema for this example will look like this:\n\n<!-- [[[cog\nimport cog, json\nfrom git_history import cli\nfrom click.testing import CliRunner\nfrom tests.test_git_history import make_repo\nimport sqlite_utils\nimport tempfile, pathlib\ntmpdir = pathlib.Path(tempfile.mkdtemp())\ndb_path = str(tmpdir / \"data.db\")\nmake_repo(tmpdir)\nrunner = CliRunner()\nresult = runner.invoke(cli.cli, [\n \"file\", db_path, str(tmpdir / \"repo\" / \"incidents.json\"), \"--repo\", str(tmpdir / \"repo\")\n])\ncog.out(\"```sql\\n\")\ncog.out(sqlite_utils.Database(db_path).schema)\ncog.out(\"\\n```\")\n]]] -->\n```sql\nCREATE TABLE [namespaces] (\n [id] INTEGER PRIMARY KEY,\n [name] TEXT\n);\nCREATE UNIQUE INDEX [idx_namespaces_name]\n ON [namespaces] ([name]);\nCREATE TABLE [commits] (\n [id] INTEGER PRIMARY KEY,\n [namespace] INTEGER REFERENCES [namespaces]([id]),\n [hash] TEXT,\n [commit_at] TEXT\n);\nCREATE UNIQUE INDEX [idx_commits_namespace_hash]\n ON [commits] ([namespace], [hash]);\nCREATE TABLE [item] (\n 
[IncidentID] TEXT,\n [Location] TEXT,\n [Type] TEXT\n);\n```\n<!-- [[[end]]] -->\n\nIf you have 10 historic versions of the `incidents.json` file and each one contains 30 incidents, you will end up with 10 * 30 = 300 rows in your `item` table.\n\n### Track the history of individual items using IDs\n\nIf your objects have a unique identifier - or multiple columns that together form a unique identifier - you can use the `--id` option to de-duplicate and track changes to each of those items over time.\n\nThis provides a much more interesting way to apply this tool.\n\nIf there is a unique identifier column called `IncidentID` you could run the following:\n\n git-history file incidents.db incidents.json --id IncidentID\n\nThe database schema used here is very different from the one used without the `--id` option.\n\nIf you have already imported history, the command will skip any commits that it has seen already and just process new ones. This means that even though an initial import could be slow subsequent imports should run a lot faster.\n\nThis command will create six tables - `commits`, `item`, `item_version`, `columns`, `item_changed` and `namespaces`.\n\nHere's the full schema:\n\n<!-- [[[cog\ndb_path2 = str(tmpdir / \"data2.db\")\nresult = runner.invoke(cli.cli, [\n \"file\", db_path2, str(tmpdir / \"repo\" / \"incidents.json\"),\n \"--repo\", str(tmpdir / \"repo\"),\n \"--id\", \"IncidentID\"\n])\ncog.out(\"```sql\\n\")\ncog.out(sqlite_utils.Database(db_path2).schema)\ncog.out(\"\\n```\")\n]]] -->\n```sql\nCREATE TABLE [namespaces] (\n [id] INTEGER PRIMARY KEY,\n [name] TEXT\n);\nCREATE UNIQUE INDEX [idx_namespaces_name]\n ON [namespaces] ([name]);\nCREATE TABLE [commits] (\n [id] INTEGER PRIMARY KEY,\n [namespace] INTEGER REFERENCES [namespaces]([id]),\n [hash] TEXT,\n [commit_at] TEXT\n);\nCREATE UNIQUE INDEX [idx_commits_namespace_hash]\n ON [commits] ([namespace], [hash]);\nCREATE TABLE [item] (\n [_id] INTEGER PRIMARY KEY,\n [_item_id] TEXT\n, [IncidentID] 
TEXT, [Location] TEXT, [Type] TEXT, [_commit] INTEGER);\nCREATE UNIQUE INDEX [idx_item__item_id]\n ON [item] ([_item_id]);\nCREATE TABLE [item_version] (\n [_id] INTEGER PRIMARY KEY,\n [_item] INTEGER REFERENCES [item]([_id]),\n [_version] INTEGER,\n [_commit] INTEGER REFERENCES [commits]([id]),\n [IncidentID] TEXT,\n [Location] TEXT,\n [Type] TEXT,\n [_item_full_hash] TEXT\n);\nCREATE TABLE [columns] (\n [id] INTEGER PRIMARY KEY,\n [namespace] INTEGER REFERENCES [namespaces]([id]),\n [name] TEXT\n);\nCREATE UNIQUE INDEX [idx_columns_namespace_name]\n ON [columns] ([namespace], [name]);\nCREATE TABLE [item_changed] (\n [item_version] INTEGER REFERENCES [item_version]([_id]),\n [column] INTEGER REFERENCES [columns]([id]),\n PRIMARY KEY ([item_version], [column])\n);\nCREATE VIEW item_version_detail AS select\n commits.commit_at as _commit_at,\n commits.hash as _commit_hash,\n item_version.*,\n (\n select json_group_array(name) from columns\n where id in (\n select column from item_changed\n where item_version = item_version._id\n )\n) as _changed_columns\nfrom item_version\n join commits on commits.id = item_version._commit;\nCREATE INDEX [idx_item_version__item]\n ON [item_version] ([_item]);\n```\n<!-- [[[end]]] -->\n\n#### item table\n\nThe `item` table will contain the most recent version of each row, de-duplicated by ID, plus the following additional columns:\n\n- `_id` - a numeric integer primary key, used as a foreign key from the `item_version` table.\n- `_item_id` - a hash of the values of the columns specified using the `--id` option to the command. 
This is used for de-duplication when processing new versions.\n- `_commit` - a foreign key to the `commit` table, representing the most recent commit to modify this item.\n\n#### item_version table\n\nThe `item_version` table will contain a row for each captured differing version of that item, plus the following columns:\n\n- `_id` - a numeric ID for the item version record.\n- `_item` - a foreign key to the `item` table.\n- `_version` - the numeric version number, starting at 1 and incrementing for each captured version.\n- `_commit` - a foreign key to the `commit` table.\n- `_item_full_hash` - a hash of this version of the item. This is used internally by the tool to identify items that have changed between commits.\n\nThe other columns in this table represent columns in the original data that have changed since the previous version. If the value has not changed, it will be represented by a `null`.\n\nIf a value was previously set but has been changed back to `null` it will still be represented as `null` in the `item_version` row. You can identify these using the `item_changed` many-to-many table described below.\n\nYou can use the `--full-versions` option to store full copies of the item at each version, rather than just storing the columns that have changed.\n\n#### item_version_detail view\n\nThis SQL view joins `item_version` against `commits` to add three further columns: `_commit_at` with the date of the commit, and `_commit_hash` with the Git commit hash.\n\n#### item_changed\n\nThis many-to-many table indicates exactly which columns were changed in an `item_version`.\n\n- `item_version` is a foreign key to a row in the `item_version` table.\n- `column` is a foreign key to a row in the `columns` table.\n\nThis table with have the largest number of rows, which is why it stores just two integers in order to save space.\n\n#### columns\n\nThe `columns` table stores column names. 
It is referenced by `item_changed`.\n\n- `id` - an integer ID.\n- `name` - the name of the column.\n- `namespace` - a foreign key to `namespaces`, for if multiple file histories are sharing the same database.\n\n#### Reserved column names\n\n<!-- [[[cog\nfrom git_history.utils import RESERVED\ncog.out(\"Note that \")\ncog.out(\", \".join(\"`{}`\".format(r) for r in RESERVED))\ncog.out(\" are considered reserved column names for the purposes of this tool.\")\n]]] -->\nNote that `_id`, `_item_full_hash`, `_item`, `_item_id`, `_version`, `_commit`, `_item_id`, `_commit_at`, `_commit_hash`, `_changed_columns`, `rowid` are considered reserved column names for the purposes of this tool.\n<!-- [[[end]]] -->\n\nIf your data contains any of these they will be renamed to add a trailing underscore, for example `_id_`, `_item_`, `_version_`, to avoid clashing with the reserved columns.\n\nIf you have a column with a name such as `_commit_` it will be renamed too, adding an additional trailing underscore, so `_commit_` becomes `_commit__` and `_commit__` becomes `_commit___`.\n\n### Additional options\n\n- `--repo DIRECTORY` - the path to the Git repository, if it is not the current working directory.\n- `--branch TEXT` - the Git branch to analyze - defaults to `main`.\n- `--id TEXT` - as described above: pass one or more columns that uniquely identify a record, so that changes to that record can be calculated over time.\n- `--full-versions` - instead of recording just the columns that have changed in the `item_version` table record a full copy of each version of theh item.\n- `--ignore TEXT` - one or more columns to ignore - they will not be included in the resulting database.\n- `--csv` - treat the data is CSV or TSV rather than JSON, and attempt to guess the correct dialect\n- `--dialect` - use a spcific CSV dialect. 
Options are `excel`, `excel-tab` and `unix` - see [the Python CSV documentation](https://docs.python.org/3/library/csv.html#csv.excel) for details.\n- `--skip TEXT` - one or more full Git commit hashes that should be skipped. You can use this if some of the data in your revision history is corrupted in a way that prevents this tool from working.\n- `--start-at TEXT` - skip commits prior to the specified commit hash.\n- `--start-after TEXT` - skip commits up to and including the specified commit hash, then start processing from the following commit.\n- `--convert TEXT` - custom Python code for a conversion, described below.\n- `--import TEXT` - additional Python modules to import for `--convert`.\n- `--ignore-duplicate-ids` - if a single version of a file has the same ID in it more than once, the tool will exit with an error. Use this option to ignore this and instead pick just the first of the two duplicates.\n- `--namespace TEXT` - use this if you wish to include the history of multiple different files in the same database. The default is `item` but you can set it to something else, which will produce tables with names like `yournamespace` and `yournamespace_version`.\n- `--wal` - Enable WAL mode on the created database file. Use this if you plan to run queries against the database while `git-history` is creating it.\n- `--silent` - don't show the progress bar.\n\n### CSV and TSV data\n\nIf the data in your repository is a CSV or TSV file you can process it by adding the `--csv` option. 
This will attempt to detect which delimiter is used by the file, so the same option works for both comma- and tab-separated values.\n\n git-history file trees.db trees.csv --id TreeID\n\nYou can also specify the CSV dialect using the `--dialect` option.\n\n### Custom conversions using --convert\n\nIf your data is not already either CSV/TSV or a flat JSON array, you can reshape it using the `--convert` option.\n\nThe format needed by this tool is an array of dictionaries, as demonstrated by the `incidents.json` example above.\n\nIf your data does not fit this shape, you can provide a snippet of Python code to converts the on-disk content of each stored file into a Python list of dictionaries.\n\nFor example, if your stored files each look like this:\n\n```json\n{\n \"incidents\": [\n {\n \"id\": \"552\",\n \"name\": \"Hawthorne Fire\",\n \"engines\": 3\n },\n {\n \"id\": \"556\",\n \"name\": \"Merlin Fire\",\n \"engines\": 1\n }\n ]\n}\n```\nYou could use the following Python snippet to convert them to the required format:\n\n```python\njson.loads(content)[\"incidents\"]\n```\n(The `json` module is exposed to your custom function by default.)\n\nYou would then run the tool like this:\n\n git-history file database.db incidents.json \\\n --id id \\\n --convert 'json.loads(content)[\"incidents\"]'\n\nThe `content` variable is always a `bytes` object representing the content of the file at a specific moment in the repository's history.\n\nYou can import additional modules using `--import`. 
For example, here is how you could read a CSV file that uses `;` as the delimiter:

    git-history file trees.db ../sf-tree-history/Street_Tree_List.csv \
      --repo ../sf-tree-history \
      --import csv \
      --import io \
      --convert '
        fp = io.StringIO(content.decode("utf-8"))
        return list(csv.DictReader(fp, delimiter=";"))
      ' \
      --id TreeID

You can import nested modules such as [ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html) using `--import xml.etree.ElementTree`, then refer to them in your function body as `xml.etree.ElementTree`. For example, if your tracked data was in an `items.xml` file that looked like this:

```xml
<items>
    <item id="1" name="One" />
    <item id="2" name="Two" />
    <item id="3" name="Three" />
</items>
```
You could load it using the following `--convert` script:

    git-history file items.db items.xml --convert '
    tree = xml.etree.ElementTree.fromstring(content)
    return [el.attrib for el in tree.iter("item")]
    ' --import xml.etree.ElementTree --id id

If your Python code spans more than one line it needs to include a `return` statement.

You can also use Python generators in your `--convert` code, for example:

    git-history file stats.db package-stats/stats.json \
      --repo package-stats \
      --convert '
        data = json.loads(content)
        for key, counts in data.items():
            for date, count in counts.items():
                yield {
                    "package": key,
                    "date": date,
                    "count": count
                }
      ' --id package --id date

This conversion function expects data that looks like this:

```json
{
    "airtable-export": {
        "2021-05-18": 66,
        "2021-05-19": 60,
        "2021-05-20": 87
    }
}
```

## Development

To contribute to this tool, first check out the code.
Then create a new virtual environment:

    cd git-history
    python -m venv venv
    source venv/bin/activate

Or if you are using `pipenv`:

    pipenv shell

Now install the dependencies and test dependencies:

    pip install -e '.[test]'

To run the tests:

    pytest

To update the schema examples in this README file:

    cog -r README.md
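As a standalone cross-check of the generator-based conversion shown earlier, this snippet applies the same flattening to the sample `stats.json` document, independently of git-history itself:

```python
import json

# Sample document in the shape shown in the --convert generator example.
content = b'{"airtable-export": {"2021-05-18": 66, "2021-05-19": 60, "2021-05-20": 87}}'

def convert(content):
    # Same logic as the generator: one row per (package, date) pair.
    data = json.loads(content)
    for key, counts in data.items():
        for date, count in counts.items():
            yield {"package": key, "date": date, "count": count}

rows = list(convert(content))
assert len(rows) == 3
assert rows[0] == {"package": "airtable-export", "date": "2021-05-18", "count": 66}
```

Each row carries both compound-ID columns (`package` and `date`), matching the `--id package --id date` options in the invocation above.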