pimdb

Name	pimdb
Version	0.3.0
home_page	https://github.com/roskakori/pimdb
Summary	build a database from IMDb datasets
upload_time	2024-05-14 10:36:43
maintainer	None
docs_url	None
author	Thomas Aglassinger
requires_python	>=3.9
license	BSD
requirements	No requirements were recorded.

# pimdb

Pimdb is a Python package and command-line utility to maintain a local copy of
the essential parts of the
[Internet Movie Database](https://imdb.com) (IMDb) based on the TSV files
available from [IMDb datasets](https://www.imdb.com/interfaces/).

## License

The [IMDb datasets](https://www.imdb.com/interfaces/) are only available for
personal and non-commercial use. For details refer to the previous link.

Pimdb is open source and distributed under the
[BSD license](https://opensource.org/licenses/BSD-3-Clause). The source
code is available from https://github.com/roskakori/pimdb.

## Installation

Pimdb is available from [PyPI](https://pypi.org/project/pimdb/) and can be
installed using:

```bash
pip install pimdb
```

## Quick start

### Downloading datasets

To download the current IMDb datasets to the current folder, run:

```bash
pimdb download all
```

This downloads about 1 GB of data and might take a couple of minutes.

### Transferring datasets into tables

To import the downloaded datasets into a local SQLite database `pimdb.db`
located in the current folder, run:

```bash
pimdb transfer all
```

This will take several hours; on a MacBook Pro M1, it takes about 11 hours.

The resulting database contains one table for each dataset. The table names
are PascalCase variants of the dataset name. For example, the data from the
dataset `title.basics` is stored in the table `TitleBasics`. The column names
in the table match the names from the datasets, for example
`TitleBasics.primaryTitle`. A short description of all the datasets and
columns can be found at the download page for the
[IMDb datasets](https://www.imdb.com/interfaces/).
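
The naming rule can be sketched in a few lines of Python. This is only an
illustration of the rule described above, not pimdb's own code; the function
name `table_name_for` is made up for this example.

```python
def table_name_for(dataset_name: str) -> str:
    """Return the PascalCase table name for a dotted IMDb dataset name."""
    # Split at the dots and upper-case only the first letter of each part,
    # so that "title.basics" becomes "TitleBasics".
    return "".join(part[:1].upper() + part[1:] for part in dataset_name.split("."))


print(table_name_for("title.basics"))  # TitleBasics
print(table_name_for("name.basics"))  # NameBasics
```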

Optionally, you can specify a different database using the `--database` option
with an
[SQLAlchemy engine configuration](https://docs.sqlalchemy.org/en/13/core/engines.html).
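
As a rough sketch, an engine configuration for an SQLite file is a URL that
starts with `sqlite:///` followed by the path to the file. The helper
`sqlite_engine_url`, the example path, and the position of `--database` after
`transfer` are assumptions made for this illustration; check the usage
documentation for the exact command line.

```python
import shlex


def sqlite_engine_url(database_path: str) -> str:
    """Build an SQLAlchemy-style URL for an SQLite database file."""
    # "sqlite:///" plus the path; an absolute POSIX path therefore ends up
    # with four slashes in total.
    return "sqlite:///" + database_path


# Hypothetical target path, chosen only for this example.
url = sqlite_engine_url("/tmp/pimdb/movies.db")

# Assumed argument order; only the --database option itself is documented above.
command = shlex.join(["pimdb", "transfer", "--database", url, "all"])
print(command)
```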

### Querying tables

To query the tables, you can use any database tool that supports SQLite, for
example the freely available and platform-independent community edition of
[DBeaver](https://dbeaver.io/) or the
[command line shell for SQLite](https://sqlite.org/cli.html).

For simple queries you can also use `pimdb` and look at the result as
UTF-8 encoded TSV. For example, here are the details of the top 10 oldest
people alive according to IMDb:

```bash
pimdb query "select * from NameBasics where birthYear is not null and deathYear is null order by birthYear limit 10" >oldest_people_alive.tsv
```

You can also run an SQL statement stored in a file:

```bash
pimdb query --file some.sql
```
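
Because the result is an ordinary SQLite database, the same query can also be
run from Python with the standard `sqlite3` module. The following is a minimal
sketch: the function `oldest_people_alive` is made up for this example, and the
in-memory table with placeholder rows only stands in for
`sqlite3.connect("pimdb.db")` so that the code runs without the real data.

```python
import sqlite3

OLDEST_ALIVE_SQL = """
    select *
    from NameBasics
    where birthYear is not null and deathYear is null
    order by birthYear
    limit ?
"""


def oldest_people_alive(connection: sqlite3.Connection, limit: int = 10) -> list:
    """Return up to `limit` people that have a birth year but no death year."""
    # sqlite3.Row allows access to columns by name, e.g. row["primaryName"].
    connection.row_factory = sqlite3.Row
    return connection.execute(OLDEST_ALIVE_SQL, (limit,)).fetchall()


# Stand-in for the real database: made-up placeholder rows, not IMDb data.
connection = sqlite3.connect(":memory:")
connection.execute(
    "create table NameBasics "
    "(nconst text, primaryName text, birthYear integer, deathYear integer)"
)
connection.executemany(
    "insert into NameBasics values (?, ?, ?, ?)",
    [
        ("nm_placeholder_1", "Placeholder A", 1950, None),
        ("nm_placeholder_2", "Placeholder B", 1920, 1990),
        ("nm_placeholder_3", "Placeholder C", 1930, None),
        ("nm_placeholder_4", "Placeholder D", None, None),
    ],
)

for row in oldest_people_alive(connection):
    print(row["primaryName"], row["birthYear"])
```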

### Building normalized tables

The tables so far are almost verbatim copies of the IMDb datasets, with the
exception that possible duplicate rows have been removed. This data model
already allows you to perform several kinds of queries quite easily and
efficiently.

However, the IMDb datasets do not offer a simple way to query N:M relations.
For example, the column `NameBasics.knownForTitles` contains a comma-separated
list of tconsts like "tt2076794,tt0116514,tt0118577,tt0086491".
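
To see why such a column is awkward to query, consider what it takes to turn
one value into separate rows. The sketch below is an illustration only, not
pimdb's implementation; the function name `split_known_for_titles` and the
choice to start the ordering at 1 are assumptions.

```python
def split_known_for_titles(known_for_titles):
    """Turn a comma-separated tconst list into (ordering, tconst) pairs."""
    # Treat a missing value as an empty list. The IMDb TSV files mark missing
    # values with "\N"; a database column might hold None or "" instead.
    if known_for_titles is None or known_for_titles in ("", "\\N"):
        return []
    # The ordering preserves the position of each tconst within the list.
    return list(enumerate(known_for_titles.split(","), start=1))


pairs = split_known_for_titles("tt2076794,tt0116514,tt0118577,tt0086491")
for ordering, tconst in pairs:
    print(ordering, tconst)
```

Each pair corresponds to one row in a relation table, which is what makes the
relation queryable with a plain join.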

To perform such queries efficiently you can build strictly normalized tables
derived from the dataset tables by running:

```bash
pimdb build
```

This will take some time; on a MacBook Pro M1, it takes about 30 minutes.

If you specified a `--database` for the `transfer` command before, you have to
specify the same value for `build` so that it can find the source data.

The normalized tables generally use snake_case names for both tables and
columns, for example `title_alias.is_original`.

### Querying normalized tables

N:M relations are stored in tables using the naming template `some_to_other`,
for example `name_to_known_for_title`. These relation tables contain only the
numeric IDs of the respective actual data and a numeric column `ordering` that
preserves the sort order of the comma-separated list in the IMDb dataset column.

For example, here is an SQL query to list the titles Alan Smithee is known
for:

```sql
select
    title.primary_title,
    title.start_year
from
    name_to_known_for_title
    join name on
        name.id = name_to_known_for_title.name_id
    join title on
        title.id = name_to_known_for_title.title_id
where
    name.primary_name = 'Alan Smithee'
```
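
To try the query without the real database, the following sketch builds a
tiny in-memory stand-in with Python's `sqlite3` module. The table definitions
contain only the columns mentioned above and are an assumption, not the actual
pimdb schema; the rows are made-up placeholders. The added `order by` uses the
`ordering` column to return the titles in their original list order.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# Simplified stand-in schema with placeholder rows, not real IMDb data.
connection.executescript(
    """
    create table name (id integer primary key, primary_name text);
    create table title (id integer primary key, primary_title text, start_year integer);
    create table name_to_known_for_title (
        name_id integer references name (id),
        title_id integer references title (id),
        ordering integer
    );
    insert into name values (1, 'Alan Smithee'), (2, 'Someone Else');
    insert into title values
        (10, 'Placeholder Title A', 1990),
        (11, 'Placeholder Title B', 1985),
        (12, 'Placeholder Title C', 2001);
    insert into name_to_known_for_title values
        (1, 11, 1), (1, 10, 2), (2, 12, 1);
    """
)

KNOWN_FOR_SQL = """
    select
        title.primary_title,
        title.start_year
    from
        name_to_known_for_title
        join name on
            name.id = name_to_known_for_title.name_id
        join title on
            title.id = name_to_known_for_title.title_id
    where
        name.primary_name = ?
    order by
        name_to_known_for_title.ordering
"""

# Pass the name as a parameter instead of embedding it in the SQL text.
known_for = connection.execute(KNOWN_FOR_SQL, ("Alan Smithee",)).fetchall()
for primary_title, start_year in known_for:
    print(primary_title, start_year)
```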

For more information on which tables are available and how they are related,
read the chapter about the
[pimdb data model](https://pimdb.readthedocs.io/en/latest/datamodel.html).

## Where to go from here

Pimdb's [online documentation](https://pimdb.readthedocs.io/) describes all
aspects in further detail. You might find the following chapters of particular
interest:

* [Usage](https://pimdb.readthedocs.io/en/latest/usage.html): all command-line
  options explained
* [Data model](https://pimdb.readthedocs.io/en/latest/datamodel.html):
  available tables and example SQL queries
* [Contributing](https://pimdb.readthedocs.io/en/latest/contributing.html):
  obtaining the source code and building the project locally
