market-data-transcoder

Name: market-data-transcoder
Version: 1.0.3
Home page: https://github.com/GoogleCloudPlatform/market-data-transcoder
Summary: Market Data Transcoder
Upload time: 2023-06-19 13:46:37
Author: Google Cloud FSI Solutions
Requires Python: >=3.6
License: Apache License 2.0
Keywords: bigquery, devops, json, automation, schema, trading, avro, binary, transcoding, pubsub, fix, fixprotocol, google-cloud-platform, itch, sbe, simple-binary-encoding, exchanges, marketdata, binaryencoding
Requirements: avro, dpkt, lxml, numpy, fastavro, six, google-cloud-pubsub, google-cloud-bigquery, pyyaml
# ```Google Cloud Datacast Solution```
##  _Ingest high-performance exchange feeds into Google Cloud_

_This is not an official Google product or service_

### Introduction

The Datacast `transcoder` is a schema-driven, message-oriented utility to simplify the lossless ingestion of common high-performance electronic trading data formats to Google Cloud.

Electronic trading venues have specialized data representation and distribution needs. In particular, efficient message representation is a high priority due to the massive volume of transactions a venue processes. Cloud-native APIs often use JSON for message payloads, but the extra bytes required to represent messages using high-context encodings have cost implications in metered computing environments. 

Unlike JSON, YAML, or even CSV, binary-encoded data is low-context and not self-describing -- the instructions for interpreting binary messages must be explicitly provided by producers separately and in advance, and followed by interpreters.
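The size and context trade-off can be seen with a toy message (this is purely illustrative and is not the transcoder's or any venue's actual wire format):

```python
import json
import struct

# Toy order message; field layout below is invented for illustration.
order = {"type": "A", "order_id": 123456789, "price": 150025, "qty": 100}

# High-context encoding: field names travel inside every message.
json_bytes = json.dumps(order).encode("utf-8")

# Low-context encoding: a fixed layout both sides must agree on in advance
# (1-byte type, 8-byte order id, 4-byte price, 4-byte quantity; big-endian).
LAYOUT = ">cqii"
binary_bytes = struct.pack(LAYOUT, b"A", 123456789, 150025, 100)

# Without the shared layout string, the binary form cannot be decoded.
msg_type, order_id, price, qty = struct.unpack(LAYOUT, binary_bytes)
print(len(json_bytes), len(binary_bytes))  # → 65 17
```

The 17-byte binary message carries the same information as the 65-byte JSON message, but only for consumers that hold the layout in advance.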

The architecture of the transcoder relies on several principal abstractions, detailed below:

#### Schema

A schema (also known as a data dictionary) is similar to an API specification, but instead of describing API endpoint contracts, it describes the representative format of binary _messages_ that flow between systems. The closest comparison might be drawn with table definitions supported by SQL Data Definition Language, but these schemas are used for data in-motion as well as data at-rest.

The transcoder's current input schema support is for Simple Binary Encoding (SBE) XML as well as QuickFIX-styled FIX protocol schema representations (also in XML).
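For orientation, an abridged, illustrative SBE XML fragment (hypothetical names and ids, not any particular venue's dictionary) has this general shape:

```xml
<sbe:messageSchema xmlns:sbe="http://fixprotocol.io/2016/sbe"
                   package="example" id="1" version="0" byteOrder="littleEndian">
  <types>
    <composite name="messageHeader">
      <type name="blockLength" primitiveType="uint16"/>
      <type name="templateId"  primitiveType="uint16"/>
      <type name="schemaId"    primitiveType="uint16"/>
      <type name="version"     primitiveType="uint16"/>
    </composite>
  </types>
  <sbe:message name="NewOrderSingle" id="1">
    <field name="orderId" id="1" type="uint64"/>
    <field name="price"   id="2" type="int64"/>
  </sbe:message>
</sbe:messageSchema>
```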

Target schema and data elements are rendered based on the specified `output_type`. With no output type specified, the transcoder defaults to displaying the YAML representation of transcoded messages on the console and performs no persistent schema transformations. For Avro and JSON, the transcoded schema and data files are encapsulated in POSIX files locally. Direct transcoding to BigQuery and Pub/Sub targets is supported, with the transcoded schemas being applied prior to message ingestion or publishing. Terraform configurations for BigQuery and Pub/Sub resources can also be derived from a specified input schema. The Terraform options only render the configurations locally and do not execute Terraform `apply`. The `--create_schemas_only` option transcodes schemas in isolation for the other output types.

The names of the output resources will individually correspond to the names of the message types defined in the input schema. For example, the transcoder will create and use a Pub/Sub topic named "NewOrderSingle" for publishing FIX `NewOrderSingle` messages found in source data. Similarly, if an output type of `bigquery` is selected, the transcoder will create a `NewOrderSingle` table in the dataset specified by `--destination_dataset_id`. By default, Avro and JSON encoded output will be saved to a file named `<message type>` with the respective extensions in a directory specified using the `--output_path` parameter.

#### Message

A message represents a discrete interaction between two systems sharing a schema. Each message will conform to a single _message type_ as defined in the schema. Specific message types can be included or excluded for processing by passing a comma-delimited string of message type names to the `--message_type_exclusions` and `--message_type_inclusions` parameters.


#### Encoding

Encodings describe how the contents of a message payload are represented to systems. Many familiar encodings, such as JSON, YAML or CSV, are self-describing and do not strictly require that applications use a separate schema definition. However, binary encodings such as SBE, Avro and Protocol Buffers require that applications employ the associated schema in order to properly interpret messages.
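Why the schema is mandatory can be shown in a few lines: the same bytes, read with the wrong layout, silently decode into plausible-looking garbage rather than raising an error (a pure-stdlib sketch, unrelated to any specific venue format):

```python
import struct

# Producer writes a uint32 followed by a uint16, little-endian.
payload = struct.pack("<IH", 70000, 25)

# A consumer holding the correct schema recovers the original values...
assert struct.unpack("<IH", payload) == (70000, 25)

# ...while a consumer with the wrong schema gets silent garbage, not an error.
print(struct.unpack("<HHH", payload))  # → (4464, 1, 25)
```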

The transcoder's supported inbound encodings are SBE binary and ASCII-encoded (tag=value) FIX. Outbound encodings for Pub/Sub message payloads can be Avro binary or Avro JSON. Local files can be generated in either Avro or JSON.

The transcoder supports base64 decoding of messages using the `--base64` and `--base64_urlsafe` options.
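The two flags correspond to the two standard base64 alphabets, which Python's stdlib exposes as `base64.b64decode` and `base64.urlsafe_b64decode`; a quick illustration of the difference:

```python
import base64

raw = bytes([0xFB, 0xEF, 0x01])        # bytes whose encodings differ between alphabets
std = base64.b64encode(raw)            # standard alphabet uses '+' and '/'
url = base64.urlsafe_b64encode(raw)    # URL-safe alphabet substitutes '-' and '_'
print(std, url)  # → b'++8B' b'--8B'

# Both decoders recover the same raw bytes from their respective encodings.
assert base64.b64decode(std) == base64.urlsafe_b64decode(url) == raw
```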

#### Transport

A message transport describes the mechanism for transferring messages between systems. This can be data-in-motion, such as an ethernet network, or data-at-rest, such as a file living on a POSIX filesystem or an object residing within cloud storage. Raw message bytes must be unframed from a particular transport, such as length-delimited files or packet capture files.
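Length-prefix unframing can be sketched in a few lines. This is an illustrative sketch only, not the transcoder's implementation; its `prefix_length` and `endian` parameters mirror the two knobs the CLI exposes as `--prefix_length` and `--source_file_endian`:

```python
import io
import struct

def unframe(stream, prefix_length=2, endian="big"):
    """Yield raw message bytes from a length-delimited stream (sketch)."""
    fmt = (">" if endian == "big" else "<") + {1: "B", 2: "H", 4: "I"}[prefix_length]
    while True:
        prefix = stream.read(prefix_length)
        if len(prefix) < prefix_length:
            return  # end of stream
        (length,) = struct.unpack(fmt, prefix)
        yield stream.read(length)

# Two framed messages: a 2-byte big-endian length precedes each payload.
buf = io.BytesIO(b"\x00\x03abc\x00\x02hi")
print(list(unframe(buf)))  # → [b'abc', b'hi']
```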

The transcoder's currently supported inbound message source transports are PCAP files, length-delimited binary files, and newline-delimited ASCII files. Multicast UDP and Pub/Sub inbound transports are on the roadmap.

Outbound transport options are locally stored Avro and JSON POSIX files, and Pub/Sub topics or BigQuery tables. If no `output_type` is specified, the transcoded messages are output to the console encoded in YAML and not persisted automatically. Additionally, Google Cloud resource definitions for specified schemas can be encapsulated in Terraform configurations.

#### Message factory

A message factory takes a message payload read from the input source, determines the associated message type from the schema to apply, and performs any adjustments to the message data prior to transcoding. For example, a message producer may use non-standard SBE headers or metadata that you would like to remove or transform. For standard FIX tag/value input sources, the included `fix` message factory may be used.
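The factory's job can be sketched as follows. Everything here is hypothetical (class name, header length, header layout); the real factories live in the transcoder codebase and this is not their API:

```python
class ExampleMessageFactory:
    """Hypothetical factory: map a template id to a message type and
    strip a non-standard vendor header before transcoding."""

    HEADER_LEN = 4  # assume a 4-byte vendor header precedes each body

    def __init__(self, template_map):
        # template_map: template id -> message type name from the schema
        self.template_map = template_map

    def create(self, payload: bytes):
        # Assume the template id sits in bytes 2..4 of the vendor header.
        template_id = int.from_bytes(payload[2:4], "little")
        message_type = self.template_map[template_id]
        body = payload[self.HEADER_LEN:]  # strip the non-standard header
        return message_type, body

factory = ExampleMessageFactory({1: "NewOrderSingle"})
print(factory.create(b"\x00\x00\x01\x00DATA"))  # → ('NewOrderSingle', b'DATA')
```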

### CLI usage

```
usage: txcode  [-h] [--factory {cme,itch,memx,fix}]
               [--schema_file SCHEMA_FILE] [--source_file SOURCE_FILE]
               [--source_file_encoding SOURCE_FILE_ENCODING]
               --source_file_format_type
               {pcap,length_delimited,line_delimited,cme_binary_packet}
               [--base64 | --base64_urlsafe]
               [--fix_header_tags FIX_HEADER_TAGS]
               [--fix_separator FIX_SEPARATOR]
               [--message_handlers MESSAGE_HANDLERS]
               [--message_skip_bytes MESSAGE_SKIP_BYTES]
               [--prefix_length PREFIX_LENGTH]
               [--message_type_exclusions MESSAGE_TYPE_EXCLUSIONS | --message_type_inclusions MESSAGE_TYPE_INCLUSIONS]
               [--sampling_count SAMPLING_COUNT] [--skip_bytes SKIP_BYTES]
               [--skip_lines SKIP_LINES] [--source_file_endian {big,little}]
               [--output_path OUTPUT_PATH]
               [--output_type {diag,avro,fastavro,bigquery,pubsub,bigquery_terraform,pubsub_terraform,jsonl,length_delimited}]
               [--error_output_path ERROR_OUTPUT_PATH]
               [--lazy_create_resources] [--frame_only] [--stats_only]
               [--create_schemas_only]
               [--destination_project_id DESTINATION_PROJECT_ID]
               [--destination_dataset_id DESTINATION_DATASET_ID]
               [--output_encoding {binary,json}]
               [--create_schema_enforcing_topics | --no-create_schema_enforcing_topics]
               [--continue_on_error]
               [--log {notset,debug,info,warning,error,critical}] [-q] [-v]

Datacast Transcoder process input arguments

options:
  -h, --help            show this help message and exit
  --continue_on_error   Indicates if an exception file should be created, and
                        records continued to be processed upon message level
                        exceptions
  --log {notset,debug,info,warning,error,critical}
                        The default logging level
  -q, --quiet           Suppress message output to console
  -v, --version         show program's version number and exit

Input source arguments:
  --factory {cme,itch,memx,fix}
                        Message factory for decoding
  --schema_file SCHEMA_FILE
                        Path to the schema file
  --source_file SOURCE_FILE
                        Path to the source file
  --source_file_encoding SOURCE_FILE_ENCODING
                        The source file character encoding
  --source_file_format_type {pcap,length_delimited,line_delimited,cme_binary_packet}
                        The source file format
  --base64              Indicates if each individual message extracted from
                        the source is base 64 encoded
  --base64_urlsafe      Indicates if each individual message extracted from
                        the source is base 64 url safe encoded
  --fix_header_tags FIX_HEADER_TAGS
                        Comma delimited list of fix header tags
  --fix_separator FIX_SEPARATOR
                        The unicode int representing the fix message separator
  --message_handlers MESSAGE_HANDLERS
                        Comma delimited list of message handlers in priority
                        order
  --message_skip_bytes MESSAGE_SKIP_BYTES
                        Number of bytes to skip before processing individual
                        messages within a repeated length delimited file
                        message source
  --prefix_length PREFIX_LENGTH
                        How many bytes to use for the length prefix of length-
                        delimited binary sources
  --message_type_exclusions MESSAGE_TYPE_EXCLUSIONS
                        Comma-delimited list of message types to exclude when
                        processing
  --message_type_inclusions MESSAGE_TYPE_INCLUSIONS
                        Comma-delimited list of message types to include when
                        processing
  --sampling_count SAMPLING_COUNT
                        Halt processing after reaching this number of
                        messages. Applied after all Handlers are executed per
                        message
  --skip_bytes SKIP_BYTES
                        Number of bytes to skip before processing the file.
                        Useful for skipping file-level headers
  --skip_lines SKIP_LINES
                        Number of lines to skip before processing the file
  --source_file_endian {big,little}
                        Source file endianness

Output arguments:
  --output_path OUTPUT_PATH
                        Output file path. Defaults to avroOut
  --output_type {diag,avro,fastavro,bigquery,pubsub,bigquery_terraform,pubsub_terraform,jsonl,length_delimited}
                        Output format type
  --error_output_path ERROR_OUTPUT_PATH
                        Error output file path if --continue_on_error flag
                        enabled. Defaults to errorOut
  --lazy_create_resources
                        Flag indicating that output resources for message
                        types should be only created as messages of each type
                        are encountered in the source data. Default behavior
                        is to create resources for each message type before
                        messages are processed. Particularly useful when
                        working with FIX but only processing a limited set of
                        message types in the source data
  --frame_only          Flag indicating that transcoder should only frame
                        messages to an output source
  --stats_only          Flag indicating that transcoder should only report on
                        message type counts without parsing messages further
  --create_schemas_only
                        Flag indicating that transcoder should only create
                        output resource schemas and not output message data

Google Cloud arguments:
  --destination_project_id DESTINATION_PROJECT_ID
                        The Google Cloud project ID for the destination
                        resource

BigQuery arguments:
  --destination_dataset_id DESTINATION_DATASET_ID
                        The BigQuery dataset for the destination. If it does
                        not exist, it will be created

Pub/Sub arguments:
  --output_encoding {binary,json}
                        The encoding of the output
  --create_schema_enforcing_topics, --no-create_schema_enforcing_topics
                        Indicates if Pub/Sub schemas should be created and
                        used to validate messages sent to a topic
```

### Message handlers

`txcode` supports the execution of _message handler_ classes that can
be used to statefully mutate in-flight streams and messages. For example,
`TimestampPullForwardHandler` looks for a `seconds`-styled ITCH
message (which informs the stream of the prevailing epoch second to
apply to subsequent messages) and appends the latest value from
that message to all subsequent messages, until the next `seconds`
message appears. This helps individual messages be persisted with
absolute timestamps that require less context to interpret
(i.e. outbound messages contain more than just "nanoseconds past
midnight" for a timestamp).

Another handler is `SequencerHandler`, which appends a sequence number
to all outbound messages. This is useful when processing bulk messages
in length-delimited storage formats where the IP packet headers
containing the original sequence numbers have been stripped.

`FilterHandler` lets you filter output based upon a specific property
of a message. A common use for this is to filter messages pertaining
only to a particular security identifier or symbol.

Here is a combination of transcoding invocations that can
be used to shard a message universe by trading symbol. First, the mnemonic
trading symbol identifier (`stock`) must be used to find its associated integer
security identifier (`stock_locate`) from the `stock_directory`
message. `stock_locate` is the identifier included in every
relevant message (as opposed to `stock`, which is absent from
certain message types):

```

txcode --source_file 12302019.NASDAQ_ITCH50 --schema_file totalview-itch-50.xml --message_type_inclusions stock_directory --source_file_format_type length_delimited --factory itch --message_handlers FilterHandler:stock=SPY --sampling_count 1

authenticity: P
etp_flag: Y
etp_leverage_factor: null
financial_status_indicator: ' '
inverse_indicator: null
ipo_flag: ' '
issue_classification: Q
issue_subtype: E
luld_reference_price_tier: '1'
market_category: P
round_lot_size: 100
round_lots_only: N
short_sale_threshold_indicator: N
stock: SPY
stock_locate: 7451
timestamp: 11354508113636
tracking_number: 0

INFO:root:Sampled messages: 1
INFO:root:Message type inclusions: ['stock_directory']
INFO:root:Source message count: 7466
INFO:root:Processed message count: 7451
INFO:root:Transcoded message count: 1
INFO:root:Processed schema count: 1
INFO:root:Summary of message counts: {'stock_directory': 7451}
INFO:root:Summary of error message counts: {}
INFO:root:Message rate: 53260.474108 per second
INFO:root:Total runtime in seconds: 0.140179
INFO:root:Total runtime in minutes: 0.002336
```

Taking the value of the field `stock_locate` from the above message
allows us to filter all messages for that field/value combination. In
addition, we can append a sequence number to all transcoded messages
that are output. The below combination returns the original `stock_directory`
message we used to look up the `stock_locate` code, as well as the
next two messages in the stream that have the same value for `stock_locate`:

```

txcode --source_file 12302019.NASDAQ_ITCH50 --schema_file totalview-itch-50.xml --source_file_format_type length_delimited --factory itch --message_handlers FilterHandler:stock_locate=7451,SequencerHandler --sampling_count 3 

authenticity: P
etp_flag: Y
etp_leverage_factor: null
financial_status_indicator: ' '
inverse_indicator: null
ipo_flag: ' '
issue_classification: Q
issue_subtype: E
luld_reference_price_tier: '1'
market_category: P
round_lot_size: 100
round_lots_only: N
sequence_number: 1
short_sale_threshold_indicator: N
stock: SPY
stock_locate: 7451
timestamp: 11354508113636
tracking_number: 0

reason: ''
reserved: ' '
sequence_number: 2
stock: SPY
stock_locate: 7451
timestamp: 11355134575401
tracking_number: 0
trading_state: T

reg_sho_action: '0'
sequence_number: 3
stock: SPY
stock_locate: 7451
timestamp: 11355134599149
tracking_number: 0

INFO:root:Sampled messages: 3
INFO:root:Source message count: 23781
INFO:root:Processed message count: 23781
INFO:root:Transcoded message count: 3
INFO:root:Processed schema count: 21
INFO:root:Summary of message counts: {'system_event': 1, 'stock_directory': 8906, 'stock_trading_action': 7437, 'reg_sho_restriction': 7437, 'market_participant_position': 0, 'mwcb_decline_level': 0, 'ipo_quoting_period_update': 0, 'luld_auction_collar': 0, 'operational_halt': 0, 'add_order_no_attribution': 0, 'add_order_attribution': 0, 'order_executed': 0, 'order_executed_price': 0, 'order_cancelled': 0, 'order_deleted': 0, 'order_replaced': 0, 'trade': 0, 'cross_trade': 0, 'broken_trade': 0, 'net_order_imbalance': 0, 'retail_price_improvement_indicator': 0}
INFO:root:Summary of error message counts: {}
INFO:root:Message rate: 80950.257512 per second
INFO:root:Total runtime in seconds: 0.293773
INFO:root:Total runtime in minutes: 0.004896


```

The syntax for handler specifications is:

```
<Handler1>:<Handler1Parameter>=<Handler1Value>,<Handler2>
```
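A minimal parser for this syntax might look like the sketch below (illustrative only; the transcoder's own parser may differ, and this handles just the single-parameter form documented above):

```python
def parse_handler_spec(spec):
    """Parse a --message_handlers string into (handler, {param: value}) pairs."""
    handlers = []
    for part in spec.split(","):
        name, _, params = part.partition(":")
        kwargs = {}
        if params:
            key, _, value = params.partition("=")
            kwargs[key] = value
        handlers.append((name, kwargs))
    return handlers

print(parse_handler_spec("FilterHandler:stock_locate=7451,SequencerHandler"))
# → [('FilterHandler', {'stock_locate': '7451'}), ('SequencerHandler', {})]
```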

Message handlers are deployed in `transcoder/message/handler/`.

# Installation
If you are a user looking to use the CLI or library without making changes, you can install the Market Data Transcoder from [PyPI](https://pypi.org/project/market-data-transcoder) using pip:
```
pip install market-data-transcoder
```

After the pip installation, you can validate that the transcoder is available by the following command:
```
txcode --help
```

# Developers
If you are looking to extend the functionality of the Market Data Transcoder, clone the repository and install its dependencies:
```
git clone https://github.com/GoogleCloudPlatform/market-data-transcoder.git
cd market-data-transcoder
pip install -r requirements.txt
```

After installing the required dependencies, you can run the transcoder with the following:
```
export PYTHONPATH=`pwd`
python ./transcoder/main.py --help
```

            

'add_order_attribution': 0, 'order_executed': 0, 'order_executed_price': 0, 'order_cancelled': 0, 'order_deleted': 0, 'order_replaced': 0, 'trade': 0, 'cross_trade': 0, 'broken_trade': 0, 'net_order_imbalance': 0, 'retail_price_improvement_indicator': 0}\nINFO:root:Summary of error message counts: {}\nINFO:root:Message rate: 80950.257512 per second\nINFO:root:Total runtime in seconds: 0.293773\nINFO:root:Total runtime in minutes: 0.004896\n\n\n```\n\nThe syntax for handler specifications is:\n\n```\n<Handler1>:<Handler1Parameter>=<Handler1Parameter>,<Handler2>\n```\n\nMessage handlers are deployed in `transcoder/message/handler/`.\n\n# Installation\nIf you are a user looking to use the CLI or library without making changes, you can install the Market Data Transcoder from [PyPI](https://pypi.org/project/market-data-transcoder) using pip:\n```\npip install market-data-transcoder\n```\n\nAfter the pip installation, you can validate that the transcoder is available by the following command:\n```\ntxcode --help\n```\n\n# Developers\nIf you are looking to extend the functionality of the Market Data Transcoder:\n```\ncd market-data-transcoder\npip install -r requirements.txt\n```\n\nAfter installing the required dependencies, you can run the transcoder with the following:\n```\nexport PYTHONPATH=`pwd`\npython ./transcoder/main.py --help\n```\n",
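The stateful "pull-forward" pattern used by `TimestampPullForwardHandler` can be modeled in a few lines. The sketch below is illustrative only — it uses plain dicts and an assumed `handle` method name, not the transcoder's actual handler classes or message objects:

```python
class TimestampPullForward:
    """Illustrative model of a pull-forward handler: remember the epoch
    second announced by the most recent 'seconds' message and stamp it
    onto every subsequent message in the stream."""

    def __init__(self):
        self._second = None  # no 'seconds' message seen yet

    def handle(self, message):
        if message.get('message_type') == 'seconds':
            # Update state from the stream's clock message
            self._second = message['second']
        elif self._second is not None:
            # Pull the last-seen epoch second forward onto this message
            message['second'] = self._second
        return message
```

Because the handler carries state between calls, later messages can be persisted with an absolute epoch second instead of only "nanoseconds past midnight".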
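The handler specification grammar is small enough to illustrate with a parser. This sketch only demonstrates the syntax — it is not the transcoder's implementation, and `parse_handler_spec` is a hypothetical name:

```python
def parse_handler_spec(spec):
    """Parse a spec like 'FilterHandler:stock_locate=7451,SequencerHandler'
    into a list of (handler_name, params) tuples. Commas separate
    handlers; an optional ':key=value' parameterizes a handler."""
    handlers = []
    for part in spec.split(','):
        name, _, param = part.partition(':')
        params = {}
        if param:
            key, _, value = param.partition('=')
            params[key] = value
        handlers.append((name, params))
    return handlers

# A chain of two handlers, the first parameterized:
parse_handler_spec('FilterHandler:stock_locate=7451,SequencerHandler')
# → [('FilterHandler', {'stock_locate': '7451'}), ('SequencerHandler', {})]
```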
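The `length_delimited` source format used in the ITCH examples conventionally prefixes each message with a two-byte big-endian length (the framing used by NASDAQ ITCH capture files). Assuming that framing — verify against your venue's specification — a minimal reader looks like this:

```python
import struct

def frame_length_delimited(data):
    """Yield message payloads from a buffer of length-delimited records,
    where each record is preceded by a 2-byte big-endian length prefix.
    Illustrative sketch; the transcoder's own framing logic may differ."""
    offset = 0
    while offset + 2 <= len(data):
        # Read the unsigned 16-bit big-endian length prefix
        (length,) = struct.unpack_from('>H', data, offset)
        offset += 2
        yield data[offset:offset + length]
        offset += length
```

This is also why `SequencerHandler` is useful for such files: the length prefix carries no sequence number, so ordering metadata must be reintroduced at transcoding time.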
    "bugtrack_url": null,
    "license": "Apache License 2.0",
    "summary": "Market Data Transcoder",
    "version": "1.0.3",
    "project_urls": {
        "Bug Tracker": "https://github.com/GoogleCloudPlatform/market-data-transcoder/issues",
        "Homepage": "https://github.com/GoogleCloudPlatform/market-data-transcoder"
    },
    "split_keywords": [
        "bigquery",
        "devops",
        "json",
        "automation",
        "schema",
        "trading",
        "avro",
        "binary",
        "transcoding",
        "pubsub",
        "fix",
        "fixprotocol",
        "google-cloud-platform",
        "itch",
        "sbe",
        "simple-binary-encoding",
        "exchanges",
        "marketdata",
        "binaryencoding"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bd47472db0db67a18826983c75a2adb8666015b32b0d8b7084d209a3bcd70014",
                "md5": "b4caf146e69543192e696f5fe3702c82",
                "sha256": "e16a326823c0f31894447daeabc2454aec421b150ec95f2ccccbb23aa7638a0f"
            },
            "downloads": -1,
            "filename": "market_data_transcoder-1.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b4caf146e69543192e696f5fe3702c82",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 129495,
            "upload_time": "2023-06-19T13:46:35",
            "upload_time_iso_8601": "2023-06-19T13:46:35.790121Z",
            "url": "https://files.pythonhosted.org/packages/bd/47/472db0db67a18826983c75a2adb8666015b32b0d8b7084d209a3bcd70014/market_data_transcoder-1.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ff208a4a7fede94018ef39875f14f8261b49e359e5d44c9f2fec30333a28a8e5",
                "md5": "73cb05f69d6e0a825ebe5702c1885104",
                "sha256": "561ddadb47ac34773f5d70d46255b8aca5d0911021a45add67546fd0fc3e72fd"
            },
            "downloads": -1,
            "filename": "market-data-transcoder-1.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "73cb05f69d6e0a825ebe5702c1885104",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 67281,
            "upload_time": "2023-06-19T13:46:37",
            "upload_time_iso_8601": "2023-06-19T13:46:37.473363Z",
            "url": "https://files.pythonhosted.org/packages/ff/20/8a4a7fede94018ef39875f14f8261b49e359e5d44c9f2fec30333a28a8e5/market-data-transcoder-1.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-19 13:46:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "GoogleCloudPlatform",
    "github_project": "market-data-transcoder",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "avro",
            "specs": [
                [
                    "==",
                    "1.11.1"
                ]
            ]
        },
        {
            "name": "dpkt",
            "specs": [
                [
                    "==",
                    "1.9.8"
                ]
            ]
        },
        {
            "name": "lxml",
            "specs": [
                [
                    "==",
                    "4.9.2"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "1.24.3"
                ]
            ]
        },
        {
            "name": "fastavro",
            "specs": [
                [
                    "==",
                    "1.7.4"
                ]
            ]
        },
        {
            "name": "six",
            "specs": [
                [
                    ">=",
                    "1.12.0"
                ]
            ]
        },
        {
            "name": "google-cloud-pubsub",
            "specs": [
                [
                    "==",
                    "2.17.1"
                ]
            ]
        },
        {
            "name": "google-cloud-bigquery",
            "specs": [
                [
                    "==",
                    "3.11.0"
                ]
            ]
        },
        {
            "name": "pyyaml",
            "specs": [
                [
                    "==",
                    "6.0"
                ]
            ]
        }
    ],
    "lcname": "market-data-transcoder"
}
        
Elapsed time: 0.09823s