# Ethereum Blockchain Parser

- Name: Ethereum-Blockchain-Parser
- Version: 2.1.5
- Home page: https://github.com/yanjlee/Ethereum_Blockchain_Parser
- Author: yanjlee
- Upload time: 2024-06-01 07:37:49
- Requirements: none recorded

This is a project to parse the Ethereum blockchain from a local geth node. Blockchains make excellent data sets because they contain every transaction ever made on the network. This is valuable data if you want to analyze the network, but Ethereum stores its blockchain as [RLP](https://github.com/ethereum/wiki/wiki/RLP)-encoded binary blobs within a series of LevelDB files, and these are surprisingly difficult to access, even given the available tools. This project instead queries a local node via [JSON-RPC](https://github.com/ethereum/wiki/wiki/JSON-RPC), which returns decoded transaction data, and then moves that data into a MongoDB database.

![Blocks 1 to 120000](.content/1_120000.jpg)


## Usage

### Streaming data

To stream blockchain data for real-time analysis, make sure you have both geth and mongo running and start the process with:

        python3 stream.py

Note that this will automatically backfill your mongo database with blocks that it is missing.
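
The backfill step can be pictured as a set difference between the blocks the node knows about and the blocks already stored. The following is a minimal sketch under that assumption, not the code in `stream.py`; the function `missing_blocks` and its parameters are hypothetical names.

```python
def missing_blocks(stored_numbers, chain_head, first_block=0):
    """Return block numbers in [first_block, chain_head] that are not stored.

    `stored_numbers` is any iterable of block numbers already in the
    database and `chain_head` is the latest block number reported by the
    node. All names here are illustrative, not part of this project.
    """
    stored = set(stored_numbers)
    return [n for n in range(first_block, chain_head + 1) if n not in stored]


print(missing_blocks([0, 1, 2, 4, 7], chain_head=8))
```

A streaming process could run a check like this once at startup and then only request new blocks as they arrive.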

### Backfilling your Mongo database

To get data from the blockchain as it exists now and then stop parsing, run the following scripts, which are located in the `Scripts` directory. Note that at the time of writing, the Ethereum blockchain has about 1.5 million blocks, so this will likely take several hours.

1. Funnel the data from geth to MongoDB:


        python3 preprocess.py

2. Create a series of snapshots of the blockchain through time and for each snapshot, calculate key metrics. Dump the data into a CSV file:


        python3 extract.py


        
## Prerequisites:

Before using this tool to analyze your copy of the blockchain, you need the following things:

### Geth
[Geth](https://github.com/ethereum/go-ethereum/wiki/Geth) is the Go implementation of a full Ethereum node. We will need to run it with the `--rpc` flag in order to request data (newer geth releases call this flag `--http`). **WARNING:** if you run this on a geth client containing an account that has ether in it, make sure you firewall port 8545, or whatever port you run geth RPC on.

A geth instance downloads the blockchain and processes it, saving the blocks as LevelDB files in the specified data directory (`~/.ethereum/chaindata` by default). The geth instance can be queried via RPC with the `eth_getBlockByNumber([block, true])` endpoint (see [here](https://github.com/ethereum/wiki/wiki/JSON-RPC#eth_getblockbynumber)) to get the block numbered `block` (with `true` indicating we want the transactional data included), which returns data of the form:

    {
      number: 1000000,
      timestamp: 1465003569,
      ...
      transactions: [
        {
          blockHash: "0x2052ce710a08094b81b5047ea9df5119773ce4b263a23d86659fa7293251055e",
          blockNumber: 1284937,
          from: "0x1f57f826caf594f7a837d9fc092456870a289365",
          gas: 22050,
          gasPrice: 20000000000,
          hash: "0x654ac26084ee6e40767e8735f38274ef5f594454a4d34cfdd70c93aa95be0c64",
          input: "0x",
          nonce: 6610,
          to: "0xfbb1b73c4f0bda4f67dca266ce6ef42f520fbb98",
          transactionIndex: 27,
          value: 201544820000000000
        }
      ]
    }

Since I am only interested in `number`, `timestamp`, and `transactions` for this application, I have omitted the rest of the data, but there is lots of additional information in the block (explore [here](https://etherchain.org/blocks)), including a few Merkle trees to maintain hashes of state, transactions, and receipts (read [here](https://blog.ethereum.org/2015/11/15/merkling-in-ethereum/)).
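
As a rough illustration of the request described above, the sketch below builds the JSON-RPC payload and decodes the reply. Over raw JSON-RPC, quantities such as `number` and `value` arrive as hex strings rather than the decimal numbers printed above, so they are converted with `int(..., 16)`. The helper names `block_request` and `decode_block` are hypothetical, and the node URL in the comment assumes geth's default RPC port.

```python
def block_request(block_number, request_id=1):
    """Build the JSON-RPC 2.0 payload for eth_getBlockByNumber."""
    return {
        "jsonrpc": "2.0",
        "method": "eth_getBlockByNumber",
        # Block number as a hex string; True asks for full transaction objects.
        "params": [hex(block_number), True],
        "id": request_id,
    }


def decode_block(result):
    """Keep number, timestamp and transactions, converting hex quantities to int."""
    return {
        "number": int(result["number"], 16),
        "timestamp": int(result["timestamp"], 16),
        "transactions": [
            {
                "from": tx["from"],
                "to": tx["to"],  # None for contract-creation transactions
                "value": int(tx["value"], 16),  # still in Wei at this point
            }
            for tx in result["transactions"]
        ],
    }


# To send the request to a local node (assumed to listen on localhost:8545):
#   import requests
#   reply = requests.post("http://localhost:8545", json=block_request(1000000)).json()
#   block = decode_block(reply["result"])
```

Keeping the payload construction and decoding separate from the HTTP call makes both easy to check without a running node.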

Using the `from` and `to` addresses in the `transactions` array, I can map the flow of ether through the network as time progresses. Note that `value` and `gasPrice` are in Wei, where 1 Ether = 10<sup>18</sup> Wei, while `gas` is a number of gas units rather than an amount of Wei. The numbers are converted into Ether automatically with this tool.
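
The unit conversion is simple arithmetic. Here is a minimal sketch using `Decimal` to avoid floating-point rounding; the helper name `wei_to_ether` is hypothetical and the tool itself may convert differently.

```python
from decimal import Decimal

WEI_PER_ETHER = Decimal(10) ** 18


def wei_to_ether(wei):
    """Convert an integer amount of Wei to Ether as a Decimal."""
    return Decimal(wei) / WEI_PER_ETHER


# The value from the sample transaction above.
print(wei_to_ether(201544820000000000))

# gas (a unit count) times gasPrice (Wei per unit) gives an upper bound
# on the fee in Wei, which can then be converted the same way.
print(wei_to_ether(22050 * 20000000000))
```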

### MongoDB

We will use mongo to essentially copy each block served by Geth, preserving its structure. The data outside the scope of this analysis will be omitted. Note that this project also requires pymongo.
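
As a sketch of what copying a block while omitting out-of-scope data might look like, the helpers below reduce a decoded block to a small document and upsert it. The function names and the database name `blockchain` are assumptions for illustration; `replace_one` with `upsert=True` is a standard pymongo collection method.

```python
def block_to_document(block):
    """Reduce a decoded block to the fields used in this analysis."""
    return {
        "number": block["number"],
        "timestamp": block["timestamp"],
        "transactions": [
            {"from": tx["from"], "to": tx["to"], "value": tx["value"]}
            for tx in block["transactions"]
        ],
    }


def store_block(collection, block):
    """Upsert one block, keyed on its number, so reruns do not duplicate it."""
    doc = block_to_document(block)
    collection.replace_one({"number": doc["number"]}, doc, upsert=True)
    return doc


# With pymongo installed and mongod running, this could be used as:
#   from pymongo import MongoClient
#   collection = MongoClient()["blockchain"]["transactions"]
#   store_block(collection, block)
```

One practical point: BSON integers are at most 64 bits, so large amounts in Wei will not fit as integers. Storing values already converted to Ether, as this tool does, sidesteps that limit.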

### graph-tool

[graph-tool](https://graph-tool.skewed.de/) is a Python library with a C++ core that constructs graphs quickly and has a flexible feature set for mapping properties to its edges and vertices. Depending on your system, this may be tricky to install, so be sure to follow their instructions carefully. I recommend you find some way to install it with a package manager because building from source is a pain.

### python3

This was written for python 3.4 with the packages: contractmap, tqdm and requests. Some things will probably break if you try to do this analysis in python 2.


## Workflow

The following outlines the procedure used to turn the data from bytes on the blockchain to data in a CSV file.

### 1. Process the blockchain

Preprocessing is done with the `Crawler` class, which can be found in the `Preprocessing/Crawler` directory. Before instantiating a `Crawler` object, you need to have geth and mongo processes running. Starting a `Crawler()` instance requests the blockchain from geth, processes it, and copies it over to a Mongo collection named `transactions`. Once the data has been copied over, you can close the `Crawler()` instance.

### 2. Take a snapshot of the blockchain

A snapshot of the network (i.e. all of the transactions occurring between two timestamps, or numbered blocks in the blockchain) can be taken with a `TxnGraph()` instance. This class can be found in the `Analysis` directory. Create an instance with:

    snapshot = TxnGraph(a, b)

where `a` is the starting block (int) and `b` is the ending block (int). This will load a directed graph of all Ethereum addresses that made transactions between the two specified blocks. It will also weight vertices by the total amount of ether held at the time that the ending block was mined, and edges by the amount of ether sent in the transaction.
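
To make the graph structure concrete, here is a simplified stand-in that uses plain dictionaries instead of graph-tool, so it runs without that dependency. It is not the `TxnGraph` implementation: the function name is hypothetical, the block range is assumed to be inclusive, and the per-address number is the net flow within the range rather than the full balance described above.

```python
from collections import defaultdict


def build_transaction_graph(blocks, start_block, end_block):
    """Return (edges, net_flow) for transactions in [start_block, end_block].

    edges maps (sender, recipient) -> total ether sent along that edge.
    net_flow maps address -> ether received minus ether sent in the range.
    """
    edges = defaultdict(float)
    net_flow = defaultdict(float)
    for block in blocks:
        if not start_block <= block["number"] <= end_block:
            continue
        for tx in block["transactions"]:
            if tx["to"] is None:  # contract creation has no recipient
                continue
            edges[(tx["from"], tx["to"])] += tx["value"]
            net_flow[tx["from"]] -= tx["value"]
            net_flow[tx["to"]] += tx["value"]
    return dict(edges), dict(net_flow)


blocks = [
    {"number": 1, "transactions": [{"from": "a", "to": "b", "value": 1.0}]},
    {"number": 2, "transactions": [{"from": "a", "to": "b", "value": 2.0},
                                   {"from": "b", "to": "c", "value": 0.5}]},
    {"number": 3, "transactions": [{"from": "c", "to": "a", "value": 4.0}]},
]
edges, net_flow = build_transaction_graph(blocks, 1, 2)
print(edges)
print(net_flow)
```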

To move on to the next snapshot (i.e. forward in time):

    snapshot.extend(c)

where `c` is the number of blocks to proceed.

At each snapshot, the instance will automatically pickle the snapshot and save the state to a local file (disable on instantiation with `save=False`).

#### Drawing an image:

Once `TxnGraph` is created, it will create a graph out of all of the data in the blocks between `a` and `b`. An image can be drawn by calling `TxnGraph.draw()`, and specific dimensions can be passed using `TxnGraph.draw(w=A, h=B)`, where `A` and `B` are ints corresponding to numbers of pixels. By default, the image is saved to the `Analysis/data/snapshots` directory.

#### Saving/Loading State (using pickle)

The `TxnGraph` instance state can be (and automatically is) pickled with `TxnGraph.save()`. The filename is parameterized by the start/end blocks and, by default, the file is saved to the `Analysis/data/pickles` directory. If another instance was pickled with a different set of start/end blocks, it can be reloaded with `TxnGraph.load(a,b)`.
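
The save/load pattern can be sketched with the standard `pickle` module. The filename scheme `<start>_<end>.p` and the helper names below are assumptions for illustration and may not match what `TxnGraph` writes.

```python
import os
import pickle
import tempfile


def snapshot_path(directory, start_block, end_block):
    """Build a filename parameterized by the start/end blocks."""
    return os.path.join(directory, "{}_{}.p".format(start_block, end_block))


def save_snapshot(state, directory, start_block, end_block):
    """Pickle `state` under the path derived from the block range."""
    os.makedirs(directory, exist_ok=True)
    path = snapshot_path(directory, start_block, end_block)
    with open(path, "wb") as handle:
        pickle.dump(state, handle)
    return path


def load_snapshot(directory, start_block, end_block):
    """Load the state previously pickled for this block range."""
    with open(snapshot_path(directory, start_block, end_block), "rb") as handle:
        return pickle.load(handle)


# Round trip in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    save_snapshot({"edges": {("a", "b"): 3.0}}, tmp, 1, 120000)
    print(load_snapshot(tmp, 1, 120000))
```

Because unpickling can execute arbitrary code, only load pickle files that you created yourself.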

### 3. (Optional) Add a lookup table for smart contract transactions

An important consideration when analyzing the Ethereum network is how to treat smart contract addresses. Much ether flows to and from contracts, which you may want to distinguish from simple peer-to-peer transactions. This can be done by loading a `ContractMap` instance. It is recommended you pass the most recent block in the blockchain for `last_block`, as this will find all contracts that were transacted with up to that point in history:

    # If a mongo_client is passed, the ContractMap will scan geth via RPC
    # for new contract addresses starting at "last_block".
    cmap = ContractMap(mongo_client, last_block=90000, filepath="./contracts.p")
    cmap.save()

    # If None is passed for a mongo_client, the ContractMap will automatically
    # load the map of addresses from the pickle file specified in "filepath",
    # ./contracts.p by default.
    cmap = ContractMap()

This will create a hash table of all contract addresses using a `defaultdict` and will save it to a pickle file.
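
The lookup itself is easy to picture. In the sketch below, a `defaultdict(bool)` returns `False` for any address that was never marked as a contract. How `ContractMap` discovers contract addresses is not shown here; the list passed in is assumed to be known already, and `build_contract_table` is a hypothetical name.

```python
from collections import defaultdict


def build_contract_table(contract_addresses):
    """Map each known contract address (lowercased) to True."""
    table = defaultdict(bool)
    for address in contract_addresses:
        table[address.lower()] = True
    return table


table = build_contract_table(["0xFBB1B73C4F0BDA4F67DCA266CE6EF42F520FBB98"])

# Known contract address: True.
print(table["0xfbb1b73c4f0bda4f67dca266ce6ef42f520fbb98"])

# Unknown address: defaultdict(bool) supplies False instead of raising
# KeyError. Note that indexing also inserts the missing key into the table.
print(table["0x1f57f826caf594f7a837d9fc092456870a289365"])
```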

### 4. Aggregate data and analyze

Once a snapshot has been created, initialize an instance of `ParsedBlocks` with a `TxnGraph` instance. This will automatically aggregate the data and save to a local CSV file, which can then be analyzed.
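
To illustrate the aggregation step, the sketch below computes a few summary numbers for one snapshot and writes them as a CSV row. The metrics chosen and the function names are hypothetical, since the columns that `ParsedBlocks` actually produces are not listed here; `edges` is assumed to map `(sender, recipient)` pairs to ether sent.

```python
import csv
import io

FIELDNAMES = ["start_block", "end_block", "num_addresses", "num_edges", "total_ether"]


def snapshot_metrics(start_block, end_block, edges):
    """Summarize one snapshot given its (sender, recipient) -> ether edges."""
    addresses = {addr for pair in edges for addr in pair}
    return {
        "start_block": start_block,
        "end_block": end_block,
        "num_addresses": len(addresses),
        "num_edges": len(edges),
        "total_ether": sum(edges.values()),
    }


def write_metrics_csv(rows, handle):
    """Write one CSV row per snapshot, with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_metrics_csv([snapshot_metrics(1, 2, {("a", "b"): 3.0, ("b", "c"): 0.5})], buffer)
print(buffer.getvalue())
```

Writing to a file instead only requires passing a handle opened with `open(path, "w", newline="")`.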

            
