masnet


Namemasnet JSON
Version 0.3.6 PyPI version JSON
download
home_pagehttps://github.com/metebalci/masnet
Summarytools for Mastodon network analysis
upload_time2022-12-18 18:29:13
maintainer
docs_urlNone
authorMete Balci
requires_python
licenseGPLv3
keywords mastodon
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # masnet

`masnet` includes basic tools for Mastodon network/graph analysis. 

It can be installed by executing `pip install masnet`. 

After installation, following commands can be used: 

- `masnet.download`: downloads peers information from Mastodon instances.

- `masnet.generate`: creates Mastodon graph using the peers information downloaded.

- `masnet.analyze`: performs basic graph/network analysis on Mastodon graph.

There a few common arguments to each command:

- `-d <directory>`: uses the directory instead of the current directory to read and/or write files to. If the directory specified does not exist, it is created. This is important as `masnet.download` creates a lot of files and `masnet.generate` reads them.

- `-v`: enables verbose logging. It should not be normally needed.

- `--debug`: enables debug logging. It is only to be used during development.

Each command provide verbose logging with `-v` argument and debug logging with `--debug` argument.

In the software and documentation, I use both Graph and Network interchangeably, but instead of Vertex and Edge, I try to stick to terms Node and Link.

The input and output file names cannot be customized, as I think the tools will be run on a clean directory (due to large number of files).

# Hardware Requirements

Mastodon graph has over 10K nodes and more than 20M links. It is not a large graph, but also not small. 

The tools in `masnet` have different requirements:

- `masnet.download` is network I/O intensive. It is implemented as an async single-thread app using Python's `asyncio` and async libraries `aiohttp`, `aiodns` and `aiofile`.

- `masnet.generate` is memory intensive. It consumes more than 4GB of memory at its peak. I recommend having more than 4GB of free memory, hence a computer with 8GB of memory should be OK.

- `masnet.analyze` is compute intensive and uses less memory than `masnet.generate`. As the algorithms are optimized with OpenMP in `networkit` package, the more cores you have the faster it will run.

I am running the tools on an Ubuntu Linux virtual machine with 8 cores and 16GB of memory.

# masnet.download

```
Input Files: none

Output Files:
    - <domain>.peers.json files
    - masnet.download.visits
    - masnet.download.errors
    - masnet.download.skips
    - masnet.download.times
```

`masnet.download` connects to a given start domain (given with `-s` argument), downloads its peers (with `api/v1/instance/peers` API call), and then repeats this for each peer until all domains are visited. If start domain is not specified, `mastodon.social` is used.

After downloading the peers information of a domain, it is saved to `<domain>.peers.json`. `peers.json` file is not the same as `instance/peers` API response, it is augmented with the domain name, thus contains a `domain` key with the value of domain name, and a `peers` key with the list of domain names of peers.

For network connections, a timeout in seconds can be specified with `-t` argument. System default timeout is usually too high, I recommend experimenting with lower values such as 30 (seconds).

Because the download process is mostly network I/O, a number of tasks are needed to fetch peers data of different domains. These tasks are implemented as coroutines in Python. The number of tasks can be specified with `-n` argument (default is 100). 

Because there are (many) malicious or irrelevant domains in Mastodon network, it is possible to exclude them from traversal as they are seen in peers. Exclusion is specified as regular expression patterns. The name of the file containing such patterns can given with `-e` argument. If not specified, a default list (`masnet/default_exclusion_patterns`) is used. This default list excludes all private IP spaces, special domain names and known malicious domains at the time of release. It also filters URLs (names containing `/`).

`masnet.download` is a long running process. The execution of `masnet.download` can be terminated with `Ctrl-C`. Since it is a long running process, it might be a good idea to pipe the output to `tee` and save the output to a log file. 

All expected errors are handled gracefully with no stack trace printed to stdout or stderr. If you see any stack trace, that means there is an unexpected error and you can report it as an issue on GitHub.

At the end of an execution (also when terminated by Ctrl-C), in addition to (many) `<domain>.peers.json` filoes, 4 more files are generated:

- `masnet.download.visits`: list of (successfully) visited domains
- `masnet.download.errors`: list of domains where an error is encountered. For each domain, after a space, also the error message is saved.
- `masnet.download.skips`: list of skipped domains due exclusion patterns
- `masnet.download.times`: list of download times (in ms) of each peers.json file

All files other than `<domain>.peers.json` files are only for information. Only `<domain>.peers.json` files are used by `masnet.generate`.

An example run, with 30s timeout, saving files to current directory would be:

```
$ masnet.download -t 30
```

This command will print (to stdout) a status line like:

```
q:000000 a:000001 s:138374 N:011867 L:039653324 e:00126507 s:01990292 to:14706 t:00:39:41
```

The meaning of the fields are:

- q: # of domains in queue (of which peers information will be fetched)
- a: # of async tasks
- s: # of scheduled domains (already downloaded and will be downloaded)

- N: # of domains of which peers information is fetched successfully
- L: # of links observed

- e: # of domains where an error happened during fetch
- s: # of skipped domains (due to exclusion)
- to: number of connections gave timeout error

- t: elapsed time hh:mm:ss

The status line is printed every second and at the end of the execution.

The links observed reported with `L` is higher than what `masnet.generate` will report. This is because `masnet.download` counts links also for probably error returning domains. The number of nodes `N` and number of links observed `L` are only for information, accurate number of nodes and links will be reported by `masnet.generate`.

You might see increasing (much more than N) error e and skipped s numbers. At least at the moment, there are many domains in Mastodon network that are either malicious (there are many randomly named subdomains of some domains) or invalid like a private IP. Hence, there are much more domains in peers than the ones actually working.

`masnet.download` terminates automatically when there are zero elements observed in queue and there is only one task (which is the main program, that means no download tasks) for 5 seconds.

On 2022-12-18, `masnet.download -t 30 -n 100` and `masnet.download -t 30 -n 1000` both completes around 75 minutes with the following outputs:

```
$ masnet.download -t 30 -n 100
....
q:000000 a:000001 s:139500 N:013678 L:047071155 e:00125822 s:02981916 to:4241 t:01:15:26
```

```
$ masnet.download -t 30 -n 1000
...
q:000000 a:000001 s:139483 N:013688 L:047105958 e:00125795 s:02990283 to:3956 t:01:13:15
```

If you enable verbose mode with `-v` argument, after the status line, top 5 parent domains encountered during graph traversal will be displayed like below (parent meaning the top domain name part is removed, for a.b.com, it is b.com). This would help to identify malicious or invalid domains.

# masnet.generate

```
Input Files: <domain>.peers.json files

Output Files: 
    - mastodon.networkit 
    - mastodon.labels
```

`masnet.generate` reads all `<domain>.peers.json` files saved by `masnet.download` and creates a graph and saves it in networkit Binary format to `mastodon.networkit.directed` and `mastodon.networkit.undirected` files. The Mastodon peers network is normally directed, but undirected version is also saved by ignoring the direction of peer relationship. In addition to the graphs, the actual labels (domains) are also save into `mastodon.labels` file. Both of these files are read by `masnet.analyze`. Both directed and undirected networks have same nodes and node ids.

Each `<domain>.peers.json` file (thus a working domain) will be represented by a node in the graph, and it will have connections to its peers as long as the peer also has its `<domain>.peers.json` file. Thus if a domain returns error (for the API call) or skipped/exluded, it is also skipped in the generated network, no such node will exist.

```
$ masnet.generate
masnet v0.2
creating the graph from peers...
N:00000000 L:00000000 e:00:00:00 
N:00000562 L:00000000 e:00:00:01 
... repeats lines like this ...
:00011775 e:23631551 t:00:00:38 
graph created.
labels saved.
saving graphs...
0% 46% 76% 99% 100%
0% 26% 49% 70% 91% 100%
graphs saved.
bye.
```

where `N` means number of nodes, `L` means number of links and `e` means elapsed time.

This is a relatively fast operation, it completes under a minute.

# masnet.analyze

```
Input Files: 
    - mastodon.networkit.directed or mastodon.network.undirected
    - mastodon.labels

Output Files: depends on the analysis
```

`masnet.analyze` performs basic graph/network analysis on Mastodon graph using `networkit` large-scale network analysis toolkit. As the core of `networkit` is written in C++ using OpenMP, it is quite a fast implementation.

The input graph file is selected using `--directed` argument. By default, the undirected graph is used (mastodon.networkit.undirected).

Running `masnet.analyze` with `--card` argument shows the [network card](https://doi.org/10.1007/s41109-022-00514-7) of Mastodon network specified in `mastodon.networkit`.

```
$ masnet.analyze --card
masnet v0.3
node labels loaded.
reading undirected graph into networkit...
graph read in 0.6 seconds.
generating the network card...
----------------------------  --------------------------------
Name                          Mastodon peers network
Kind                          undirected, unweighted
Nodes are                     Mastodon instances
Links are                     Peer relationship
----------------------------  --------------------------------
Number of nodes               13688
Number of links               16010592
Degree*                       2339.362 [0, 12881]
Clustering                    0.754
Connected                     2 components [100.0% in largest]
Component size*               6844.0 [1, 13687]
Diameter                      n/a
Largest component's diameter  4
----------------------------  --------------------------------
Data generating process       masnet.download, masnet.generate
----------------------------  --------------------------------
*: avg [min, max]
```

Each analysis is requested with an argument to `masnet.analyze`. Please check the help with `--help` to see a list. At the moment, the following analysis arguments are supported (there are others mark experimental which I am developing/testing):

- `--card`: shows the network card

# Demo

On [network.bastodon.org](https://network.bastodon.org), the network card of the daily snapshot of Mastodon peer network can be seen.

# Release History

0.3.6:
- doc fix

0.3.5:
- masnet.generate fix

0.3.4:
- masnet.analyze --card fix for directed graphs

0.3.3:
- doc fix

0.3.2:
- masnet.analyze --card file support

0.3.1:
- pypi fix

0.3:
- major rewrite of masnet.download
- some changes on masnet.generate and masnet.analyze

0.2.1:
- terms fixed to Node and Link (rather than Vertex and Edge)
- masnet.analyze improvements
- masnet.downloads fix for termination of download threads

0.2:
- major rewrite of masnet.download
- initial release of masnet.generate
- initial release of masnet.analyze

0.1.1:
- small fixes
    
0.1:
- initial release of masnet.download
            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/metebalci/masnet",
    "name": "masnet",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "Mastodon",
    "author": "Mete Balci",
    "author_email": "metebalci@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/a7/21/ce02f5db6a7d963045d5208157b8506b4ad02f921503525e9772d20e9090/masnet-0.3.6.tar.gz",
    "platform": null,
    "description": "# masnet\n\n`masnet` includes basic tools for Mastodon network/graph analysis. \n\nIt can be installed by executing `pip install masnet`. \n\nAfter installation, following commands can be used: \n\n- `masnet.download`: downloads peers information from Mastodon instances.\n\n- `masnet.generate`: creates Mastodon graph using the peers information downloaded.\n\n- `masnet.analyze`: performs basic graph/network analysis on Mastodon graph.\n\nThere a few common arguments to each command:\n\n- `-d <directory>`: uses the directory instead of the current directory to read and/or write files to. If the directory specified does not exist, it is created. This is important as `masnet.download` creates a lot of files and `masnet.generate` reads them.\n\n- `-v`: enables verbose logging. It should not be normally needed.\n\n- `--debug`: enables debug logging. It is only to be used during development.\n\nEach command provide verbose logging with `-v` argument and debug logging with `--debug` argument.\n\nIn the software and documentation, I use both Graph and Network interchangeably, but instead of Vertex and Edge, I try to stick to terms Node and Link.\n\nThe input and output file names cannot be customized, as I think the tools will be run on a clean directory (due to large number of files).\n\n# Hardware Requirements\n\nMastodon graph has over 10K nodes and more than 20M links. It is not a large graph, but also not small. \n\nThe tools in `masnet` have different requirements:\n\n- `masnet.download` is network I/O intensive. It is implemented as an async single-thread app using Python's `asyncio` and async libraries `aiohttp`, `aiodns` and `aiofile`.\n\n- `masnet.generate` is memory intensive. It consumes more than 4GB of memory at its peak. I recommend having more than 4GB of free memory, hence a computer with 8GB of memory should be OK.\n\n- `masnet.analyze` is compute intensive and uses less memory than `masnet.generate`. As the algorithms are optimized with OpenMP in `networkit` package, the more cores you have the faster it will run.\n\nI am running the tools on an Ubuntu Linux virtual machine with 8 cores and 16GB of memory.\n\n# masnet.download\n\n```\nInput Files: none\n\nOutput Files:\n    - <domain>.peers.json files\n    - masnet.download.visits\n    - masnet.download.errors\n    - masnet.download.skips\n    - masnet.download.times\n```\n\n`masnet.download` connects to a given start domain (given with `-s` argument), downloads its peers (with `api/v1/instance/peers` API call), and then repeats this for each peer until all domains are visited. If start domain is not specified, `mastodon.social` is used.\n\nAfter downloading the peers information of a domain, it is saved to `<domain>.peers.json`. `peers.json` file is not the same as `instance/peers` API response, it is augmented with the domain name, thus contains a `domain` key with the value of domain name, and a `peers` key with the list of domain names of peers.\n\nFor network connections, a timeout in seconds can be specified with `-t` argument. System default timeout is usually too high, I recommend experimenting with lower values such as 30 (seconds).\n\nBecause the download process is mostly network I/O, a number of tasks are needed to fetch peers data of different domains. These tasks are implemented as coroutines in Python. The number of tasks can be specified with `-n` argument (default is 100). \n\nBecause there are (many) malicious or irrelevant domains in Mastodon network, it is possible to exclude them from traversal as they are seen in peers. Exclusion is specified as regular expression patterns. The name of the file containing such patterns can given with `-e` argument. If not specified, a default list (`masnet/default_exclusion_patterns`) is used. This default list excludes all private IP spaces, special domain names and known malicious domains at the time of release. It also filters URLs (names containing `/`).\n\n`masnet.download` is a long running process. The execution of `masnet.download` can be terminated with `Ctrl-C`. Since it is a long running process, it might be a good idea to pipe the output to `tee` and save the output to a log file. \n\nAll expected errors are handled gracefully with no stack trace printed to stdout or stderr. If you see any stack trace, that means there is an unexpected error and you can report it as an issue on GitHub.\n\nAt the end of an execution (also when terminated by Ctrl-C), in addition to (many) `<domain>.peers.json` filoes, 4 more files are generated:\n\n- `masnet.download.visits`: list of (successfully) visited domains\n- `masnet.download.errors`: list of domains where an error is encountered. For each domain, after a space, also the error message is saved.\n- `masnet.download.skips`: list of skipped domains due exclusion patterns\n- `masnet.download.times`: list of download times (in ms) of each peers.json file\n\nAll files other than `<domain>.peers.json` files are only for information. Only `<domain>.peers.json` files are used by `masnet.generate`.\n\nAn example run, with 30s timeout, saving files to current directory would be:\n\n```\n$ masnet.download -t 30\n```\n\nThis command will print (to stdout) a status line like:\n\n```\nq:000000 a:000001 s:138374 N:011867 L:039653324 e:00126507 s:01990292 to:14706 t:00:39:41\n```\n\nThe meaning of the fields are:\n\n- q: # of domains in queue (of which peers information will be fetched)\n- a: # of async tasks\n- s: # of scheduled domains (already downloaded and will be downloaded)\n\n- N: # of domains of which peers information is fetched successfully\n- L: # of links observed\n\n- e: # of domains where an error happened during fetch\n- s: # of skipped domains (due to exclusion)\n- to: number of connections gave timeout error\n\n- t: elapsed time hh:mm:ss\n\nThe status line is printed every second and at the end of the execution.\n\nThe links observed reported with `L` is higher than what `masnet.generate` will report. This is because `masnet.download` counts links also for probably error returning domains. The number of nodes `N` and number of links observed `L` are only for information, accurate number of nodes and links will be reported by `masnet.generate`.\n\nYou might see increasing (much more than N) error e and skipped s numbers. At least at the moment, there are many domains in Mastodon network that are either malicious (there are many randomly named subdomains of some domains) or invalid like a private IP. Hence, there are much more domains in peers than the ones actually working.\n\n`masnet.download` terminates automatically when there are zero elements observed in queue and there is only one task (which is the main program, that means no download tasks) for 5 seconds.\n\nOn 2022-12-18, `masnet.download -t 30 -n 100` and `masnet.download -t 30 -n 1000` both completes around 75 minutes with the following outputs:\n\n```\n$ masnet.download -t 30 -n 100\n....\nq:000000 a:000001 s:139500 N:013678 L:047071155 e:00125822 s:02981916 to:4241 t:01:15:26\n```\n\n```\n$ masnet.download -t 30 -n 1000\n...\nq:000000 a:000001 s:139483 N:013688 L:047105958 e:00125795 s:02990283 to:3956 t:01:13:15\n```\n\nIf you enable verbose mode with `-v` argument, after the status line, top 5 parent domains encountered during graph traversal will be displayed like below (parent meaning the top domain name part is removed, for a.b.com, it is b.com). This would help to identify malicious or invalid domains.\n\n# masnet.generate\n\n```\nInput Files: <domain>.peers.json files\n\nOutput Files: \n    - mastodon.networkit \n    - mastodon.labels\n```\n\n`masnet.generate` reads all `<domain>.peers.json` files saved by `masnet.download` and creates a graph and saves it in networkit Binary format to `mastodon.networkit.directed` and `mastodon.networkit.undirected` files. The Mastodon peers network is normally directed, but undirected version is also saved by ignoring the direction of peer relationship. In addition to the graphs, the actual labels (domains) are also save into `mastodon.labels` file. Both of these files are read by `masnet.analyze`. Both directed and undirected networks have same nodes and node ids.\n\nEach `<domain>.peers.json` file (thus a working domain) will be represented by a node in the graph, and it will have connections to its peers as long as the peer also has its `<domain>.peers.json` file. Thus if a domain returns error (for the API call) or skipped/exluded, it is also skipped in the generated network, no such node will exist.\n\n```\n$ masnet.generate\nmasnet v0.2\ncreating the graph from peers...\nN:00000000 L:00000000 e:00:00:00 \nN:00000562 L:00000000 e:00:00:01 \n... repeats lines like this ...\n:00011775 e:23631551 t:00:00:38 \ngraph created.\nlabels saved.\nsaving graphs...\n0% 46% 76% 99% 100%\n0% 26% 49% 70% 91% 100%\ngraphs saved.\nbye.\n```\n\nwhere `N` means number of nodes, `L` means number of links and `e` means elapsed time.\n\nThis is a relatively fast operation, it completes under a minute.\n\n# masnet.analyze\n\n```\nInput Files: \n    - mastodon.networkit.directed or mastodon.network.undirected\n    - mastodon.labels\n\nOutput Files: depends on the analysis\n```\n\n`masnet.analyze` performs basic graph/network analysis on Mastodon graph using `networkit` large-scale network analysis toolkit. As the core of `networkit` is written in C++ using OpenMP, it is quite a fast implementation.\n\nThe input graph file is selected using `--directed` argument. By default, the undirected graph is used (mastodon.networkit.undirected).\n\nRunning `masnet.analyze` with `--card` argument shows the [network card](https://doi.org/10.1007/s41109-022-00514-7) of Mastodon network specified in `mastodon.networkit`.\n\n```\n$ masnet.analyze --card\nmasnet v0.3\nnode labels loaded.\nreading undirected graph into networkit...\ngraph read in 0.6 seconds.\ngenerating the network card...\n----------------------------  --------------------------------\nName                          Mastodon peers network\nKind                          undirected, unweighted\nNodes are                     Mastodon instances\nLinks are                     Peer relationship\n----------------------------  --------------------------------\nNumber of nodes               13688\nNumber of links               16010592\nDegree*                       2339.362 [0, 12881]\nClustering                    0.754\nConnected                     2 components [100.0% in largest]\nComponent size*               6844.0 [1, 13687]\nDiameter                      n/a\nLargest component's diameter  4\n----------------------------  --------------------------------\nData generating process       masnet.download, masnet.generate\n----------------------------  --------------------------------\n*: avg [min, max]\n```\n\nEach analysis is requested with an argument to `masnet.analyze`. Please check the help with `--help` to see a list. At the moment, the following analysis arguments are supported (there are others mark experimental which I am developing/testing):\n\n- `--card`: shows the network card\n\n# Demo\n\nOn [network.bastodon.org](https://network.bastodon.org), the network card of the daily snapshot of Mastodon peer network can be seen.\n\n# Release History\n\n0.3.6:\n- doc fix\n\n0.3.5:\n- masnet.generate fix\n\n0.3.4:\n- masnet.analyze --card fix for directed graphs\n\n0.3.3:\n- doc fix\n\n0.3.2:\n- masnet.analyze --card file support\n\n0.3.1:\n- pypi fix\n\n0.3:\n- major rewrite of masnet.download\n- some changes on masnet.generate and masnet.analyze\n\n0.2.1:\n- terms fixed to Node and Link (rather than Vertex and Edge)\n- masnet.analyze improvements\n- masnet.downloads fix for termination of download threads\n\n0.2:\n- major rewrite of masnet.download\n- initial release of masnet.generate\n- initial release of masnet.analyze\n\n0.1.1:\n- small fixes\n    \n0.1:\n- initial release of masnet.download",
    "bugtrack_url": null,
    "license": "GPLv3",
    "summary": "tools for Mastodon network analysis",
    "version": "0.3.6",
    "split_keywords": [
        "mastodon"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "e15e6bc1d99fcbb51acb8f5d0d737750",
                "sha256": "2ee56bb901cc539825354090cc5ee84063fb651e665ad92621d434e28837a57e"
            },
            "downloads": -1,
            "filename": "masnet-0.3.6.tar.gz",
            "has_sig": false,
            "md5_digest": "e15e6bc1d99fcbb51acb8f5d0d737750",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 18914,
            "upload_time": "2022-12-18T18:29:13",
            "upload_time_iso_8601": "2022-12-18T18:29:13.236700Z",
            "url": "https://files.pythonhosted.org/packages/a7/21/ce02f5db6a7d963045d5208157b8506b4ad02f921503525e9772d20e9090/masnet-0.3.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-18 18:29:13",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "metebalci",
    "github_project": "masnet",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "masnet"
}
        
Elapsed time: 0.05937s