gtfs-proto


* Name: gtfs-proto
* Version: 0.1.0
* Summary: Library to package and process GTFS feeds in a protobuf format
* Upload time: 2024-01-07 14:02:36
* Requires Python: >=3.9
* License: Copyright (c) 2021, Ilya Zverev Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies. THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
* Keywords: gtfs, transit, feed, gtp, command line
* Requirements: none recorded

# GTFS to Protobuf Packaging

This library / command-line tool introduces a protocol buffers-based format
for packaging GTFS feeds. The reasons for this are:

1. Decrease the size of a feed to 8-10% of the original.
2. Allow for even smaller, easy-to-apply delta files.

The recommended file extension for packaged feeds is _gtp_.

## Differences with GTFS

The main thing missing from the packaged feed is fare information.
It is planned to be added: refer to [this ticket](https://github.com/Zverik/gtfs-proto/issues/1)
to track the implementation progress.

A packaged feed does not mirror the source tables one to one, although
most tables carry over. Here's what has been deliberately skipped from
the official format:

* `stops.txt`: `tts_stop_name`, `stop_url`, `stop_timezone`, `level_id`.
* `routes.txt`: `route_url`, `route_sort_order`.
* `trips.txt`: `block_id`.
* `stop_times.txt`: `stop_sequence`, `shape_dist_traveled`.
* `shapes.txt`: `shape_pt_sequence`, `shape_dist_traveled`.

The ignored feed files are all of `fare*.txt`, `timeframes.txt`, `pathways.txt`,
`levels.txt`, `translations.txt`, `feed_info.txt`, and `attributions.txt`.
Fares are yet to be implemented, and for the other tables it is recommended
to use the original feed.

### Binary Format

**Note that the format is expected to change significantly, including renumbering
of fields, until version 1.0 is published.**

A file starts with a little-endian two-byte size of the header block,
followed by the serialized header message.

The header contains a list of sizes for each block. Blocks follow in the order
listed in the `Block` enum: first identifiers, then strings, then agencies,
and so on.
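
To illustrate the layout described above, here is a minimal sketch that reads the two-byte size prefix and turns a list of block sizes into byte ranges. It does not parse the protobuf header itself: `block_sizes` is assumed to have been taken from the decoded header, blocks are assumed to be stored back to back right after it, and the function names are hypothetical, not part of the library.

```python
import io
import struct
from itertools import accumulate


def read_header_size(stream):
    """Read the little-endian two-byte size that starts the file."""
    raw = stream.read(2)
    if len(raw) != 2:
        raise ValueError('file is too short to contain a header size')
    return struct.unpack('<H', raw)[0]


def block_ranges(header_size, block_sizes):
    """Return (start, end) byte offsets for each block.

    Assumption: blocks follow the header with no padding, in the same
    order as the sizes list.
    """
    first = 2 + header_size  # two bytes of size prefix, then the header
    ends = [first + total for total in accumulate(block_sizes)]
    starts = [first] + ends[:-1]
    return list(zip(starts, ends))


# A made-up file: a 3-byte "header" followed by blocks of 4, 0 and 2 bytes.
data = struct.pack('<H', 3) + b'HDR' + b'aaaa' + b'' + b'cc'
size = read_header_size(io.BytesIO(data))
ranges = block_ranges(size, [4, 0, 2])
```

An empty block simply yields a range whose start equals its end, so the position of each block in the list still matches its place in the enum.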

The same enum is used for keys in the `IdReference` message, which links
generated numeric ids from this packed feed with the original string ids.

If the feed is compressed (marked by a flag in the header), each block is
compressed using the Zstandard algorithm. It proved to be both fast and efficient,
decreasing the size by 70%.

### Location Encoding

Floating-point values are stored inefficiently, hence all longitudes and latitudes
are multiplied by 100000 (10^5) and rounded to integers. This gives roughly
one-meter precision, which is good enough at public transit scales.

In addition, when coordinates come in lists, we store only the difference with the
last coordinate. This applies to both stops (relative to the previous stop) and
shapes: in the latter, coordinates are relative to the previous coordinate, or to
the last one in the previous shape.
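
The encoding above can be sketched in a few lines. This is an illustration only, not the library's code: the `(lon, lat)` ordering, the function names, and the choice of `(0, 0)` as the starting reference are assumptions made for the example.

```python
SCALE = 100_000  # 10^5, as described above


def to_fixed(degrees):
    """Convert a coordinate in degrees to a scaled integer."""
    return round(degrees * SCALE)


def delta_encode(points, last=(0, 0)):
    """Encode (lon, lat) pairs as integer differences with the previous pair.

    `last` is the already scaled point the first difference is taken
    against; passing the final point of the previous shape chains shapes.
    """
    deltas = []
    prev_lon, prev_lat = last
    for lon, lat in points:
        ilon, ilat = to_fixed(lon), to_fixed(lat)
        deltas.append((ilon - prev_lon, ilat - prev_lat))
        prev_lon, prev_lat = ilon, ilat
    return deltas


def delta_decode(deltas, last=(0, 0)):
    """Restore (lon, lat) pairs in degrees from integer differences."""
    points = []
    lon, lat = last
    for dlon, dlat in deltas:
        lon += dlon
        lat += dlat
        points.append((lon / SCALE, lat / SCALE))
    return points


# Made-up shape points a few dozen meters apart.
shape = [(24.75353, 59.43696), (24.75401, 59.43712), (24.75388, 59.43755)]
encoded = delta_encode(shape)
decoded = delta_decode(encoded)
```

Only the first pair is large. Neighbouring points differ by a few dozen units, and small integers take fewer bytes in protocol buffers' variable-length encoding.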

### Routes and Trips

The largest file in every GTFS feed is `stop_times.txt`. Here, it's missing, with
the data spread between routes and trips. The format also adds itineraries:

* An itinerary is a series of stops for a route with the same headsign and shape.
  * Note that there is no specific block for itineraries: they are packaged
    inside the corresponding routes, but they still have unique identifiers.
* A route is the same as in GTFS.
* Trips reference an itinerary for stops, and add departure and arrival times for
  each stop (or start and end time when those are specified with `frequencies.txt`).

So to find a departure time for a given stop, you find the itineraries that contain it,
and from those, the routes and trips. You get the list of departure times from the trip
and add the values up to get the actual time, since only differences with the previous
times are stored, at 5-second granularity.
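
As a rough sketch of that addition step: the function names and the sample values below are made up, and the assumption that the first stored value is an offset from midnight is ours, not confirmed by the format description.

```python
from itertools import accumulate

GRANULARITY = 5  # seconds per stored unit, as described above


def decode_times(stored):
    """Turn stored differences into seconds since midnight.

    Assumption: the first value is an offset from midnight and every
    following value is a difference with the previous time, all in
    5-second units.
    """
    return [units * GRANULARITY for units in accumulate(stored)]


def format_time(seconds):
    """Format seconds since midnight as HH:MM:SS (hours may exceed 23)."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f'{hours:02d}:{minutes:02d}:{secs:02d}'


# Made-up data: 08:00:00 at the first stop, then +90 s and +125 s.
stored_departures = [5760, 18, 25]
departures = [format_time(s) for s in decode_times(stored_departures)]
```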

### Deltas

A delta file looks the same as a regular feed, but the header size has its highest bit
set (`size & 0x8000`, note the unsigned integer). After that, `GtfsDeltaHeader`
follows, which also has version, date, compression fields, and a list of block sizes.
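
A reader could tell the two kinds of files apart along these lines. This is a sketch under an assumption the text does not confirm: that the remaining 15 bits still hold the header size when the flag is set. The function name is hypothetical.

```python
import struct

DELTA_FLAG = 0x8000  # highest bit of the unsigned two-byte size


def parse_size_prefix(raw):
    """Split the two-byte prefix into (is_delta, header_size).

    Assumption: the lower 15 bits hold the actual header size when the
    flag is set; check the library source before relying on this.
    """
    # '<H' is unsigned, so the flag bit does not turn the value negative.
    (value,) = struct.unpack('<H', raw)
    return bool(value & DELTA_FLAG), value & 0x7FFF


regular = parse_size_prefix(struct.pack('<H', 120))
delta = parse_size_prefix(struct.pack('<H', 120 | DELTA_FLAG))
```

Reading the prefix as a signed value (`'<h'`) would make delta files appear to have a negative header size, which is why the unsigned format matters here.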

How the blocks differ is explained in the [proto file](protobuf/gtfs.proto).

## Installation and Usage

Installing is simple:

    pip install gtfs-proto

### Packaging a feed

See a list of commands the tool provides by running it without arguments:

    gtfs_proto
    gtfs_proto pack --help

To package a feed, call:

    gtfs_proto pack gtfs.zip --output city.gtp

In its header, a feed stores the URL of the source zip file and the date on which
the feed was built. You should specify both, although if the date is "today",
you can skip that argument:

    gtfs_proto pack gtfs.zip --url https://mta.org/gtfs/gtfs.zip --date 2024-03-19 -o city.gtp

When setting up a pipeline to package feeds regularly, do specify the previous feed
file to keep identifiers from changing, and to keep delta file sizes to a minimum:

    gtfs_proto pack gtfs.zip --prev city_last.gtp -o city.gtp

### Deltas

A delta, a list of differences between two feeds, is made with this command:

    gtfs_proto delta city_last.gtp city.gtp -o city_delta.gtp

Whether a delta requires a different file extension is yet to be decided.
Technically the format is almost the same, using the same protocol buffers definition.

If you have lost an older feed file and wish to keep users updated even from very
old feeds, you can merge deltas:

    gtfs_proto dmerge city_delta_1-2.gtp city_delta_2-3.gtp -o city_delta_1-3.gtp

It's recommended to avoid merging deltas: store old feeds instead, and produce
delta files from them with the `delta` command.

There is no command for applying deltas: it's up to end users to read the file and
apply it directly to their internal database.

### Information

A packaged feed contains a header and an array of blocks, similar but not exactly mirroring
the original GTFS files. You can see the list, sizes and counts by running:

    gtfs_proto info city.gtp

Any block can be dumped into a series of one-line JSON objects by specifying
the block name:

    gtfs_proto info city.gtp --block stops

Currently the blocks are `ids`, `strings`, `agency`, `calendar`, `shapes`,
`stops`, `routes`, `trips`, `transfers`, `networks`, `areas`, and `fare_links`.

There are two additional "blocks" that print numbers from the header:
`version` and `date`. Use these to simplify automation. For example, this is
how you make a version-named copy of the latest feed:

```sh
cp city-latest.gtp city-$(gtfs_proto info city-latest.gtp -b version).gtp
```

When applicable, you can print just the line for a given identifier,
either the one from the original GTFS feed or the generated numeric one:

    gtfs_proto info city.gtp -p stops --id 45

You can view the contents of a delta file the same way.

## Python Library

Reading GTFS protobuf files is pretty straightforward:

```python
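import gtfs_proto as gtp

# Blocks are read lazily, so keep the file open while using the feed.
with open('city.gtp', 'rb') as f:
    feed = gtp.GtfsProto(f)
    print(f'Feed built on {feed.header.date}')
    for stop in feed.stops:
        # Stop names are stored as indices into the strings block.
        print(f'Stop {stop.stop_id} named "{feed.strings[stop.name]}".')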
```

The `GtfsProto` (and `GtfsDelta`) object reads the file header and lazily provides
all blocks as lists or dicts. To read all blocks at once, pass the `read_now=True`
argument to the constructor.

Parsing shapes and calendar services is not easy, so there are some helper
functions, namely `parse_shape` and `parse_calendar`. The latter returns a list
of `CalendarService` objects with all the dates and day lists unpacked, and an `operates`
method to determine whether the service runs on a given date.

All built-in commands use this library, so refer to, for example,
[delta.py](src/gtfs_proto/delta.py) for an extended usage tutorial.

## Author and License

The format and the code were written by Ilya Zverev. The code is published under the ISC License,
and the format is CC0 or in the public domain, whichever applies in your country.

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "gtfs-proto",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "",
    "keywords": "gtfs,transit,feed,gtp,command line",
    "author": "",
    "author_email": "Ilya Zverev <ilya@zverev.info>",
    "download_url": "https://files.pythonhosted.org/packages/ca/9e/4720c5ca1e93065ddbbaa13f4987c15bf228d7e769e5523a14b1d9f85331/gtfs-proto-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Copyright (c) 2021, Ilya Zverev  Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.  THE SOFTWARE IS PROVIDED \"AS IS\" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. ",
    "summary": "Library to package and process GTFS feeds in a protobuf format",
    "version": "0.1.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/Zverik/gtfs_proto/issues",
        "Homepage": "https://github.com/Zverik/gtfs_proto"
    },
    "split_keywords": [
        "gtfs",
        "transit",
        "feed",
        "gtp",
        "command line"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9be6d2acd4c2f7a5800bbba30d83b72350c007f9d4dc829754a0f6966df23ac8",
                "md5": "d6c132e8cb4012740d430f4dc993d0c9",
                "sha256": "d1165cc593abe308864c789b352b605efb5a73e39675dfaa3a8a91f73696d337"
            },
            "downloads": -1,
            "filename": "gtfs_proto-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d6c132e8cb4012740d430f4dc993d0c9",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 39546,
            "upload_time": "2024-01-07T14:02:34",
            "upload_time_iso_8601": "2024-01-07T14:02:34.078119Z",
            "url": "https://files.pythonhosted.org/packages/9b/e6/d2acd4c2f7a5800bbba30d83b72350c007f9d4dc829754a0f6966df23ac8/gtfs_proto-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ca9e4720c5ca1e93065ddbbaa13f4987c15bf228d7e769e5523a14b1d9f85331",
                "md5": "c94057b5afff7ae11635a2bbe47bd82b",
                "sha256": "f9820b40e6fe4bbaf8b70ae3d887997552a560e31483eaa93b3c6633ba5c5ad2"
            },
            "downloads": -1,
            "filename": "gtfs-proto-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "c94057b5afff7ae11635a2bbe47bd82b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 35922,
            "upload_time": "2024-01-07T14:02:36",
            "upload_time_iso_8601": "2024-01-07T14:02:36.550953Z",
            "url": "https://files.pythonhosted.org/packages/ca/9e/4720c5ca1e93065ddbbaa13f4987c15bf228d7e769e5523a14b1d9f85331/gtfs-proto-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-01-07 14:02:36",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Zverik",
    "github_project": "gtfs_proto",
    "github_not_found": true,
    "lcname": "gtfs-proto"
}
        