triplea 0.0.5

- Home page: https://github.com/EhsanBitaraf/triple-a
- Summary: Article Analysis Assistant
- Upload time: 2024-02-13 05:28:00
- Author: EhsanBitaraf
- Requires Python: >=3.10,<4.0
- License: Apache-2.0
- Keywords: graph, semantic-scholar, citation-graph
# triple-a
*Article Analysis Assistant*

This program builds a network of article references and links authors and keywords; such a network is usually called a "[**Citation Graph**](https://en.wikipedia.org/wiki/Citation_graph)".

There are various software and online systems for this, a brief review of which can be found [here](docs/related-work.md).

This tool gives you the power to create a graph of articles and analyze it. It is designed as a **CLI** (command-line interface), and you can also use it as a Python library.

[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

[![commits](https://badgen.net/github/commits/EhsanBitaraf/triple-a/main)](https://github.com/EhsanBitaraf/triple-a/commits/main?icon=github&color=green)
[![GitHub Last commit](https://img.shields.io/github/last-commit/EhsanBitaraf/triple-a)](https://github.com/EhsanBitaraf/triple-a/main)
![Open Issue](https://img.shields.io/github/issues-raw/EhsanBitaraf/triple-a)

![Repo Size](https://img.shields.io/github/repo-size/EhsanBitaraf/triple-a)
![GitHub code size in bytes](https://img.shields.io/github/languages/code-size/EhsanBitaraf/triple-a)
![Downloads](https://img.shields.io/github/downloads/EhsanBitaraf/triple-a/total)

[![GitHub tag](https://img.shields.io/github/tag/EhsanBitaraf/triple-a.svg)](https://GitHub.com/EhsanBitaraf/triple-a/tags/)
![Release](https://img.shields.io/github/release/EhsanBitaraf/triple-a)
![Release](https://img.shields.io/github/release-date/EhsanBitaraf/triple-a)

<!-- ![PyPI - Wheel](https://img.shields.io/pypi/EhsanBitaraf/triple-a) -->

[![PyPI version](https://badge.fury.io/py/triplea.svg)](https://badge.fury.io/py/triplea)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/triplea)

![Build and push images](https://github.com/EhsanBitaraf/triple-a/workflows/push%20docker%20image/badge.svg)

![Testing](https://github.com/EhsanBitaraf/triple-a/actions/workflows/test-poetry-action.yml/badge.svg)

![Code Quality](https://github.com/EhsanBitaraf/triple-a/actions/workflows/python-flake.yml/badge.svg)





<!-- test this :

https://badge.fury.io/for/py/Triple-a -->

<!-- [![GitHub commits](https://img.shields.io/github/commits-since/EhsanBitaraf/triple-a/v1.0.0.svg)](https://github.com/EhsanBitaraf/triple-a/commit/master) -->



- [🎮 Main Features](#---main-features)
- [How to use](#how-to-use)
  * [Setup](#setup)
  * [Functional Use](#functional-use)
    + [Training NER for Article Title](#training-ner-for-article-title)
  * [Command Line (CLI) Use](#command-line--cli--use)
    + [Get and save a list of article identifiers based on a search term](#get-and-save-a-list-of-article-identifiers-based-on-a-search-term)
    + [Move core pipeline state](#move-core-pipeline-state)
    + [Run custom pipeline](#run-custom-pipeline)
      - [NER Article Title](#ner-article-title)
      - [Country-based Co-authorship](#country-based-co-authorship)
      - [Extract Triple from Abstract](#extract-triple-from-abstract)
      - [Extract Topic from Abstract](#extract-topic-from-abstract)
    + [Import Single Reference File](#import-single-reference-file)
    + [Export graph](#export-graph)
    + [Visualizing Graph](#visualizing-graph)
    + [Analysis Graph](#analysis-graph)
    + [Work with Article Repository](#work-with-article-repository)
    + [Configuration](#configuration)
- [Testing](#testing)
- [Dependencies](#dependencies)
- [Use case](#use-case)
  * [Bio Bank](#bio-bank)
  * [Registry of Breast Cancer](#registry-of-breast-cancer)
  * [EHR](#ehr)
- [Graph Visualization](#graph-visualization)
- [Graph Analysis](#graph-analysis)
- [Knowledge Extraction](#knowledge-extraction)
- [Related Article](#related-article)
- [Code Quality](#code-quality)
- [Citation](#citation)
- [License](#license)





# 🎮 Main Features
- Single paper analysis
- Configurable citation-crawling depth for metadata fetching
- Network Analysis (Per Node/Overall Graph)
- Import bibliography file
- Use for [Bibliometric Analysis](https://researchguides.uic.edu/bibliometrics)



# How to use 

## From Source

### Setup

Clone repository:
```shell
git clone https://github.com/EhsanBitaraf/triple-a.git
```

or 

```shell
git clone git@github.com:EhsanBitaraf/triple-a.git
```

Create a virtual environment:
```shell
python -m venv venv
```

Activate the virtual environment:

*Windows*
```shell
$ .\venv\Scripts\activate
```

*Linux*
```shell
$ source venv/bin/activate
```

Install poetry:
```shell
pip install poetry
```

Install dependencies:
```shell
poetry install
```

Run the CLI:
```shell
poetry run python triplea/cli/aaa.py 
```

### Functional Use

Get a list of PMIDs matching a search term and store them in state 0:
```python
term = '("Electronic Health Records"[Mesh]) AND ("National"[Title/Abstract]) AND Iran'
get_article_list_all_store_to_kg_rep(term)
```

Move articles forward from state 1:
```python
move_state_forward(1)
```

Get a list of PMIDs and save them to a file for debugging:
```python
import json

# either a complex or a simple search term
data = get_article_list_from_pubmed(1, 10, '("Electronic Health Records"[Mesh]) AND ("National"[Title/Abstract])')
data = get_article_list_from_pubmed(1, 10, '"Electronic Health Records"')
data1 = json.dumps(data, indent=4)
with open("sample1.json", "w") as outfile:
    outfile.write(data1)
```

Open the previously saved file for debugging:
```python
import json

with open("sample1.json") as f:
    data = json.load(f)
```

Get one article from the knowledge graph and save it to a file:
```python
import json

data = get_article_by_pmid('32434767')
data = json.dumps(data, indent=4)
with open("one-article.json", "w") as outfile:
    outfile.write(data)
```

Save titles for annotation:
```python
la = get_article_by_state(2)
with open("article-title.txt", "w", encoding="utf-8") as file:
    for a in la:
        try:
            article = Article(**a.copy())
        except Exception:
            continue  # skip records that fail validation
        file.write(article.Title + "\n")
```

#### Training NER for Article Title

You can use NLP (Natural Language Processing) methods to extract information from the structure of the article and add it to your graph. For example, you can extract NER (Named-Entity Recognition) terms from the title of an article and add them to the graph. [Here's how to create a custom NER](docs/training-ner.md).
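As a rough illustration of the idea, here is how entities could be pulled from a title with spaCy (one of this project's NLP dependencies). The `en_core_web_sm` model name is an assumption in this sketch; a custom-trained model from the guide above would be loaded the same way:

```python
# Extract named entities from an article title with spaCy.
# Assumption: the small English model is installed
# (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Iranian Integrated Care Electronic Health Record.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```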



### Command Line (CLI) Use

By using the following command, you can see the list of available commands. Each command has its own `help`.

```shell
python .\triplea\cli\aaa.py  --help
```

output:

![](docs/assets/img/aaa-help.png)


#### Get and save a list of article identifiers based on a search term

Get a list of article identifiers (such as PMIDs) based on a search term and save them into the knowledge repository in the first state (0):

Use this command:
```shell
python .\triplea\cli\aaa.py search --searchterm [searchterm]
```

Even the PMID itself can be used in the search term.
```shell
python .\triplea\cli\aaa.py search --searchterm 36467335
```

output:

![](docs/assets/img/aaa-search.png)

#### Move core pipeline state
Preparing articles for graph extraction involves several steps arranged in a pipeline. Each step is identified by a number stored in the article's state value. The following table describes the state numbers:

*List of state number*

|State|Short Description|Description|
|-----|-----------------|-----------|
|0    |article identifier saved|At this stage, the article object stored in the data bank has only one identifier, such as a PMID or DOI|
|1    |article details saved (JSON form)|Metadata related to the article is stored in the `OriginalArticle` field from the `SourceBank`, but has not been parsed yet|
|2    |parse details info|The contents of the `OriginalArticle` field are parsed and placed in the fields of the Article object|
|3    |Get Citation      ||
|4    |Get Full Text     |At this stage, articles that are open access and whose full text can be retrieved are fetched and added to the bank|
|5    |Convert full text to string||
|-1   |Error             |An error happened moving from state 1 to 2|
|-2   |Error             |An error happened moving from state 2 to 3|

There are two ways to run the pipeline. In the first, you give the number of an existing state, and all articles in that state move forward one state.
In the second, you give a final state number, and each article below that state keeps moving until it reaches the final state you specified.
The first is executed with the `next` command, the second with the `go` command.
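Conceptually, the two commands behave like this (a rough Python sketch, assuming `move_state_forward(n)` from the Functional Use section advances every article currently in state n by one state):

```python
# `next`: advance the articles in one state by a single step.
def run_next(current_state: int) -> None:
    move_state_forward(current_state)

# `go`: keep advancing, state by state, until articles reach end_state.
def run_go(end_state: int) -> None:
    for state in range(end_state):
        move_state_forward(state)
```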

This command moves articles from the current state to the next state:
```shell
python .\triplea\cli\aaa.py next --state [current state]
```

For example, move all articles in state 0 to state 1:
```shell
python .\triplea\cli\aaa.py next --state 0
```
output:

![](docs/assets/img/aaa-next.png)


`go` command:
```shell
python .\triplea\cli\aaa.py go --end [last state]
```

```shell
python .\triplea\cli\aaa.py go --end 3
```

output:

![](docs/assets/img/aaa-go.png)


#### Run custom pipeline
Apart from the core pipeline used to prepare articles, customized pipelines can also be run. Custom pipelines may extract knowledge from text or perform other NLP processing. These pipelines can form a new graph of their own, separate from the citation graph or combined with it.


List of custom pipelines:

|Action|Tag Name|Description|Prerequisite|
|------|--------|-----------|------------|
|Triple extraction from article abstract      |FlagExtractKG        ||At least core state 2|
|Topic extraction from article abstract       |FlagExtractTopic     ||At least core state 2|
|Convert affiliation text to structured data  |FlagAffiliationMining|A simple way to parse affiliation text|At least core state 2|
|Convert affiliation text to structured data  |FlagAffiliationMining_Titipata|Uses the [Titipat Achakulvisut repo](https://github.com/titipata/affiliation_parser) to parse affiliation text|At least core state 2|
|Text embedding of the abstract, sent to SciGenius|FlagEmbedding        ||At least core state 2|
|Title and abstract review by LLM             |FlagShortReviewByLLM ||At least core state 2|

##### NER Article Title
You can try the NER method for extracting the major topic of an article's title with the following command. This command is independent and used for testing; its result is not stored in the Arepo.

```shell
python .\triplea\cli\ner.py --title "The Iranian Integrated Care Electronic Health Record."
```

##### Country-based Co-authorship
A country-based co-authorship network refers to a network of collaborative relationships between researchers from different countries who have co-authored academic papers together. It represents the connections and collaborations that exist among researchers across national boundaries.

By studying a country-based co-authorship network, researchers can gain insights into international collaborations, identify emerging research trends, foster interdisciplinary cooperation, and facilitate policy decisions related to research funding, academic mobility, and scientific development at a global scale.

There are several software tools available that can help you produce country-based co-authorship networks. Here are a few popular options:

[VOSviewer](https://www.vosviewer.com/): VOSviewer is a widely used software tool for constructing and visualizing co-authorship networks. It offers various clustering and visualization techniques and allows you to analyze and explore the network based on different attributes, including country affiliation.

[Sci2 Tool](https://sci2.cns.iu.edu/user/index.php): The Science of Science (Sci2) Tool is a Java-based software package (in [GitHub](https://github.com/CIShell)) that supports the analysis and visualization of scientific networks. It offers a variety of functionalities for constructing and analyzing co-authorship networks, including country-based analysis. It allows users to perform data preprocessing, network analysis, and visualization within a single integrated environment.



To convert affiliations into a hierarchical structure of country, city, and center, you can use the following command:

```shell
python .\triplea\cli\aaa.py pipeline -n FlagAffiliationMining
```


##### Extract Triple from Abstract

```shell
python .\triplea\cli\aaa.py pipeline --name FlagExtractKG
```



##### Extract Topic from Abstract

```shell
python .\triplea\cli\aaa.py pipeline --name FlagExtractTopic
```

An example of working with these functions in `Jupyter` is given [here](./jupyter_lab/selection-sampling.ipynb); the resulting graph, drawn with the VOSviewer program, is shown below:

![](./docs/assets/img/topic-graph-biobank.png)

#### Import Data

##### Import Single Reference File
Supported file types are `.bib`, `.enw`, and `.ris`:

```shell
python .\triplea\cli\importbib.py "C:\...\bc.ris"
```

output:

![](docs/assets/img/import-output.png)


##### Import Triplea Format

```sh
python .\triplea\cli\aaa.py import --help
```


```sh
python .\triplea\cli\aaa.py import --type triplea --format json --bar True "C:\BibliometricAnalysis.json"
```


#### Export Data
Various data exports can be created from the article repository. These outputs are used to create raw datasets.

|Type|Format|
|-|-|
|triplea|json, csv, csvs\*|
|rayyan|csv|
|RefMan\*|ris|

\* Not yet implemented.


For help with the `export` command:
```sh
python .\triplea\cli\aaa.py export --help
```

For example, the following command limits the export to 100 articles and saves them in Triplea JSON format as `test_export.json`:
```sh
python .\triplea\cli\aaa.py export --type triplea --format json --limit 100 --output "test_export.json"
```
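A quick way to sanity-check the exported file (a sketch, assuming the Triplea JSON export is a JSON array of article records):

```python
# Count the records in an exported Triplea JSON file.
# Assumption: the export is a JSON array of article objects.
import json

with open("test_export.json", encoding="utf-8") as f:
    articles = json.load(f)

print(f"{len(articles)} article(s) exported")
```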


Export all articles (no limit) in Triplea JSON format:
```sh
python .\triplea\cli\aaa.py export --type triplea --format json --output "test_export.json"
```

Export Triplea CSV format:
```sh
python .\triplea\cli\aaa.py export --type triplea --format csv --output "test_export.csv"
```


Export Triplea `csvs` format:
```sh
python .\triplea\cli\aaa.py export --type triplea --format csvs --output "export.csv"
```


Export for Rayyan CSV format:
```sh
python .\triplea\cli\aaa.py export --type rayyan --format csv --output "test_export.csv"
```

#### Export Graph

For detailed help:
```sh
python .\triplea\cli\aaa.py export_graph --help
```


Making a graph in `graphml` format and saving it to the file `test.graphml`:
```shell
python .\triplea\cli\aaa.py export_graph -g gen-all -f graphml -o .\triplea\test
```

Making a graph in `gexf` format and saving it to `C:\Users\Dr bitaraf\Documents\graph\article.gexf`. This graph contains articles, authors, affiliations, and the relations between them:
```shell
python .\triplea\cli\aaa.py export_graph -g article-author-affiliation -f gexf -o "C:\Users\Dr bitaraf\Documents\graph\article"
```

Making a graph in `graphdict` format and saving it to `C:\Users\Dr bitaraf\Documents\graph\article.json`. This graph contains articles, references, article citations, and the relations between them:
```shell
python .\triplea\cli\aaa.py export_graph -g article-reference -g article-cited -f graphdict -o "C:\Users\Dr bitaraf\Documents\graph\article.json"
```

Making a graph in `graphml` format and saving it to `C:\graph-repo\country-authorship.graphml`. This graph contains articles, countries, and the relations between them:
```shell
python .\triplea\cli\aaa.py export_graph -g country-authorship -f graphml -o "C:\graph-repo\country-authorship"
```


Types of graph generators that can be used in the `-g` parameter:

|Name|Description|
|----|-----------|
|store|It considers all the nodes and edges that are stored in the database|
|gen-all|It considers all possible nodes and edges|
|article-topic|It considers article and topic as nodes and edges between them|
|article-author-affiliation|It considers article, author and affiliation as nodes and edges between them|
|article-keyword|It considers article and keyword as nodes and edges between them|
|article-reference|It considers article and reference as nodes and edges between them|
|article-cited|It considers article and cited as nodes and edges between them|
|country-authorship||

Types of graph file format that can be used in the `-f` parameter:
|Name|Description|
|----|-----------|
|graphdict|This format is a customized format for citation graphs in the form of a Python dictionary.|
|graphjson||
|gson||
|gpickle|Write graph in Python pickle format. Pickles are a serialized byte stream of a Python object|
|graphml|The GraphML file format uses .graphml extension and is XML structured. It supports attributes for nodes and edges, hierarchical graphs and benefits from a flexible architecture.|
|gexf|GEXF (Graph Exchange XML Format) is an XML-based file format for storing a single undirected or directed graph.|
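Exports in standard formats can be consumed by other tools. For example, a `graphml` file can be loaded with [NetworkX](https://networkx.org/) (already a dependency of this project) for further analysis; a minimal sketch:

```python
# Load an exported GraphML file and report its size.
import networkx as nx

G = nx.read_graphml("test.graphml")
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```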

#### Visualizing Graph
Several visualizers are used to display graphs in this program. These include:

[Alchemy.js](https://graphalchemist.github.io/Alchemy/#/) : Alchemy.js is a graph drawing application built almost entirely in d3.

[InteractiveGraph](https://github.com/grapheco/InteractiveGraph) : InteractiveGraph provides a web-based interactive visualization and analysis framework for large graph data, which may come from a GSON file.

[netwulf](https://github.com/benmaier/netwulf) : Interactive visualization of networks based on Ulf Aslak's d3 web app.


Visualize the article-reference and article-cited graphs:
```shell
python .\triplea\cli\aaa.py visualize -g article-reference -g article-cited -p 8001
```


Visualize the full graph (`gen-all`):
```shell
python .\triplea\cli\aaa.py visualize -g gen-all -p 8001
```


output:

![](docs/assets//img/gen-all-graph.png)


Visualize the article-topic and article-keyword graphs:
```shell
python .\triplea\cli\aaa.py visualize -g article-topic -g article-keyword -p 8001
```

output:

![](docs/assets/img/graph-alchemy.png)


Visualize a file

A file containing an extracted graph can be visualized in different formats with the following command:
```sh
python .\triplea\cli\aaa.py visualize_file --format graphdict "graph.json"
```

#### Analysis Graph


The `analysis info` command calculates specific metrics for the entire graph, including:
- Graph type
- SCC (strongly connected components)
- WCC (weakly connected components)
- Reciprocity
- Graph nodes
- Graph edges
- Average degree
- Density
- Transitivity
- Max path length
- Average clustering coefficient
- Degree assortativity coefficient

```sh
python .\triplea\cli\aaa.py analysis -g gen-all -c info
```

output:

![](docs/assets/img/aaa-analysis-info.png)
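Several of these metrics can also be reproduced directly with NetworkX on an exported graph; a sketch, assuming a `graphml` export as above:

```python
import networkx as nx

G = nx.read_graphml("test.graphml")
# Simple undirected copy: clustering metrics are defined on
# simple undirected graphs.
UG = nx.Graph(G)

print("Nodes:", G.number_of_nodes())
print("Edges:", G.number_of_edges())
print("Density:", nx.density(G))
print("Transitivity:", nx.transitivity(UG))
print("Average clustering:", nx.average_clustering(UG))
print("Degree assortativity:", nx.degree_assortativity_coefficient(G))
```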




The `sdc` command creates a graph with all possible nodes and edges, then calculates and lists the sorted [degree centrality](https://bookdown.org/omarlizardo/_main/4-2-degree-centrality.html) of each node:
```sh
python .\triplea\cli\aaa.py analysis -g gen-all -c sdc
```

output:

![](docs/assets/img/aaa-analysis-sdc.png)
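The same ranking can be computed directly with NetworkX; a sketch, again assuming a `graphml` export:

```python
import networkx as nx

# Degree centrality per node, sorted in descending order.
G = nx.read_graphml("test.graphml")
centrality = nx.degree_centrality(G)
top = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)
for node, score in top[:10]:
    print(node, round(score, 4))
```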


#### Work with Article Repository
Article Repository (Arepo) is a database that stores the information of articles and graphs. Different databases can be used; we have used the following backends here:

- [TinyDB](https://github.com/msiemens/tinydb) - TinyDB is a lightweight document oriented database

- [MongoDB](https://www.mongodb.com/) - MongoDB is a source-available cross-platform document-oriented database program
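Because TinyDB stores the repository as a plain JSON file, it can also be inspected directly; a minimal sketch, assuming the default `articledata.json` file name from the configuration section below:

```python
# Open the TinyDB-backed Arepo and count its records.
from tinydb import TinyDB

db = TinyDB("articledata.json")
print(len(db), "record(s) in the article repository")
```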


To get general information about the articles, nodes, and edges in the database, use the following command:
```shell
python .\triplea\cli\aaa.py arepo -c info
```

output:
```shell
Number of article in article repository is 122
0 Node(s) in article repository.
0 Edge(s) in article repository.
122 article(s) in state 3.
```



Get article data by PMID:
```sh
python .\triplea\cli\aaa.py arepo -pmid 31398071
```

output:
```
Title   : Association between MRI background parenchymal enhancement and lymphovascular invasion and estrogen receptor status in invasive breast cancer.
Journal : The British journal of radiology
DOI     : 10.1259/bjr.20190417
PMID    : 31398071
PMC     : PMC6849688
State   : 3
Authors : Jun Li, Yin Mo, Bo He, Qian Gao, Chunyan Luo, Chao Peng, Wei Zhao, Yun Ma, Ying Yang, 
Keywords: Adult, Aged, Breast Neoplasms, Female, Humans, Lymphatic Metastasis, Magnetic Resonance Imaging, Menopause, Middle Aged, Neoplasm Invasiveness, Receptors, Estrogen, Retrospective Studies, Young Adult,
```

Get article data by PMID and save it to an `article.json` file:
```sh
python .\triplea\cli\aaa.py arepo -pmid 31398071 -o article.json
```

An equivalent command:
```sh
python .\triplea\cli\aaa.py export_article --idtype pmid --id 31398071 --format json --output "article.json"
```

#### Configuration

For detailed help:
```shell
python .\triplea\cli\aaa.py config --help
```

Get environment variable:
```shell
 python .\triplea\cli\aaa.py config -c info
```

Set new environment variable:
```shell
python .\triplea\cli\aaa.py config -c update
```

Below is a summary of important environment variables in this project:
|Environment Variables     |Description|Default Value|
|--------------------------|-----------|-------------|
|TRIPLEA_DB_TYPE           |The type of database used in the project. The database layer is separate and you can use different backends; currently `MongoDB` and `TinyDB` are supported. TinyDB suits small-scale work, MongoDB large-scale work|TinyDB|
|AAA_TINYDB_FILENAME       |File name of TinyDB|articledata.json|
|AAA_MONGODB_CONNECTION_URL|[Standard Connection String Format](https://www.mongodb.com/docs/manual/reference/connection-string/#std-label-connections-standard-connection-string-format) For MongoDB|mongodb://user:pass@127.0.0.1:27017/|
|AAA_MONGODB_DB_NAME       |Name of MongoDB Collection|articledata|
|AAA_TPS_LIMIT             |Transaction Per Second Limitation|1|
|AAA_PROXY_HTTP            |An HTTP proxy is a server that acts as an intermediary between a client and PubMed server. When a client sends a request to a server through an HTTP proxy, the proxy intercepts the request and forwards it to the server on behalf of the client. Similarly, when the server responds, the proxy intercepts the response and forwards it back to the client.||
|AAA_PROXY_HTTPS           |HTTPS Proxy|| 
|AAA_REFF_CRAWLER_DEEP     |Crawling depth for article references|1|
|AAA_CITED_CRAWLER_DEEP    |Crawling depth for citing articles|1|

## From Package

It is recommended to create a Python virtual environment before installing:
```sh
$ python -m venv venv
```

```sh
$ .\venv\Scripts\activate
```

Install Package with pip:
```sh
$ pip install triplea
```

Set environment variables via a `.env` file:
```
TRIPLEA_DB_TYPE = TinyDB
AAA_TINYDB_FILENAME = articledata.json
AAA_MONGODB_CONNECTION_URL = mongodb://localhost:27017/
AAA_MONGODB_DB_NAME = articledata
AAA_TPS_LIMIT = 1
AAA_PROXY_HTTP = 
AAA_PROXY_HTTPS = 
AAA_REFF_CRAWLER_DEEP = 1
AAA_CITED_CRAWLER_DEEP = 1
AAA_TOPIC_EXTRACT_ENDPOINT=http://localhost:8001/api/v1/topic/
AAA_CLIENT_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/109.0"
```
If you do not create the mentioned file, the default values will be used, which are:
```
TRIPLEA_DB_TYPE = TinyDB
AAA_TINYDB_FILENAME = default-tiny-db.json
AAA_TPS_LIMIT = 1
AAA_REFF_CRAWLER_DEEP = 1
AAA_CITED_CRAWLER_DEEP = 1
AAA_TOPIC_EXTRACT_ENDPOINT=http://localhost:8001/api/v1/topic/
AAA_CLIENT_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/109.0"
```

Run CLI:
```sh
$ aaa --help
```

output:
```sh
Usage: aaa [OPTIONS] COMMAND [ARGS]...

Options:
  -v, --version
  --help         Show this message and exit.

Commands:
  analysis        Analysis Graph.
  config          Configuration additional setting.
  export          Export article repository in specific format.
  export_article  Export Article by identifier.
  export_graph    Export Graph.
  export_llm      Export preTrain LLM.
  go              Moves the articles state in the Arepo until end state.
  import          import article from specific file format to article...
  importbib       import article from .bib , .enw , .ris file format.
  ner             Single NER with custom model.
  next            Moves the articles state in the Arepo from the current...
  pipeline        Run Custom PipeLine in arepo.
  search          Search query from PubMed and store to Arepo.
```

*Note*: The visualization functions are only available in the source version.

# Testing

```sh
poetry run pytest
```

or run the test directory explicitly:

```sh
poetry run pytest tests/
```

With coverage:

```sh
poetry run pytest --cov
```


PMIDs used for bibliometric unit-test checks: 37283018 and 35970485.

# Dependencies



For graph analysis:

[networkx](https://networkx.org/)


For NLP:

[PyTextRank](https://derwen.ai/docs/ptr/)

[transformers](https://huggingface.co/docs/transformers/index) 

[spaCy](https://spacy.io/)

For data storage:
 
[TinyDB](https://tinydb.readthedocs.io/en/latest/)

[py2neo](https://github.com/py2neo-org/py2neo)

[pymongo](https://github.com/mongodb/mongo-python-driver)

For visualization of networks:

[netwulf](https://github.com/benmaier/netwulf)

[Alchemy.js](https://graphalchemist.github.io/Alchemy/#/)

[InteractiveGraph](https://github.com/grapheco/InteractiveGraph)

For CLI:

[click](https://click.palletsprojects.com/en/8.1.x/)


For packaging and dependency management: 

[Poetry](https://python-poetry.org/docs/basic-usage/)



# Use case
With this tool, you can create datasets in different formats; here are some examples.


## Breast Cancer

Pubmed Query:
```
"breast neoplasms"[MeSH Terms] OR ("breast"[All Fields] AND "neoplasms"[All Fields]) OR "breast neoplasms"[All Fields] OR ("breast"[All Fields] AND "cancer"[All Fields]) OR "breast cancer"[All Fields]
```

`495,012` results

Configuration:
```
AAA_MONGODB_DB_NAME = bcarticledata
AAA_REFF_CRAWLER_DEEP = 0
AAA_CITED_CRAWLER_DEEP = 0
```

`EDirect` was used.

Search with this command:

```
python .\triplea\cli\aaa.py search --searchterm r'"breast neoplasms"[MeSH Terms] OR ("breast"[All Fields] AND "neoplasms"[All Fields]) OR "breast neoplasms"[All Fields] OR ("breast"[All Fields] AND "cancer"[All Fields]) OR "breast cancer"[All Fields]'
```

If the `--searchterm` argument is too complex, run the command without it:
```
python .\triplea\cli\aaa.py search
```

and provide a filter:
```
{
    "mindate" : "2022/01/01",
    "maxdate" : "2022/12/30"
}
```

Get info on all downloaded articles:
```shell
python .\triplea\cli\aaa.py arepo -c info
```

output:
```shell
Number of article in article repository is 30914
0 Node(s) in article repository.
0 Edge(s) in article repository.
30914 article(s) in state 0.
```

Run the core pipeline to the next state:
```shell
python .\triplea\cli\aaa.py next --state 0
```

then parse the articles:
```shell
python .\triplea\cli\aaa.py next --state 1
```

Triple extraction is a type of custom pipeline; you can run it like this:
```shell
python .\triplea\cli\aaa.py pipeline --name FlagExtractKG
```

## Bio Bank

Pubmed Query:
```
"Biological Specimen Banks"[Mesh] OR BioBanking OR biobank OR dataBank OR "Bio Banking" OR "bio bank"
```

`39,023` results 

Search with this command:

```shell
python .\triplea\cli\aaa.py search --searchterm "\"Biological Specimen Banks\"[Mesh] OR BioBanking OR biobank OR dataBank OR \"Bio Banking\" OR \"bio bank\" "
```

The query returned 39,023 results as of `2023/01/02`, but the search failed with:

```
"ERROR":"Search Backend failed: Exception:\n\'retstart\' cannot be larger than 9998. For PubMed, ESearch can only retrieve the first 9,999 records matching the query. To obtain more than 9,999 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved. For details see https://www.ncbi.nlm.nih.gov/books/NBK25499/"
```

Because this query had more than 10,000 results, the following guidance from the NCBI documentation applies:

> To retrieve more than 10,000 UIDs from databases other than PubMed, submit multiple esearch requests while incrementing the value of retstart (see Application 3). For PubMed, ESearch can only retrieve the first 10,000 records matching the query. To obtain more than 10,000 PubMed records, consider using `EDirect`, which contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved.

This limit is hard-coded in the `PARAMS` of the `get_article_list_from_pubmed` method.
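For queries under the cap, the paging itself is straightforward. A sketch of incrementing `retstart` against the ESearch endpoint (the same URL used in the Registry of Breast Cancer example below), using only the standard library; the function name and page size are illustrative:

```python
# Page through PubMed ESearch results by incrementing retstart.
# Note the limitation quoted above: for PubMed, retstart cannot exceed
# ~9,998, so this only covers queries with fewer than 10,000 results;
# beyond that, EDirect is needed.
import json
import urllib.parse
import urllib.request

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_ids(term: str, page_size: int = 1000) -> list[str]:
    ids: list[str] = []
    retstart = 0
    while retstart <= 9998:
        params = urllib.parse.urlencode({
            "db": "pubmed", "term": term, "retmode": "json",
            "retstart": retstart, "retmax": page_size,
        })
        with urllib.request.urlopen(f"{BASE}?{params}") as resp:
            result = json.load(resp)["esearchresult"]
        batch = result.get("idlist", [])
        ids.extend(batch)
        if not batch or retstart + page_size >= int(result["count"]):
            break
        retstart += page_size
    return ids
```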

This query was added later:
```
"bio-banking"[Title/Abstract] OR "bio-bank"[Title/Abstract] OR "data-bank"[Title/Abstract]
```

`9,012` results

```shell
python .\triplea\cli\aaa.py search --searchterm " \"bio-banking\"[Title/Abstract] OR \"bio-bank\"[Title/Abstract] OR \"data-bank\"[Title/Abstract] "
```

After running this, get the info:
```
Number of article in article repository is 47735
```

Export `graphml` format:
```shell
python .\triplea\cli\aaa.py export_graph -g article-reference -g article-keyword  -f graphml -o .\triplea\datasets\biobank.graphml
```

## Registry of Breast Cancer

Keyword Checking:
```
"Breast Neoplasms"[Mesh]  
"Breast Cancer"[Title]
"Breast Neoplasms"[Title]  
"Breast Neoplasms"[Other Term]
"Breast Cancer"[Other Term]
"Registries"[Mesh]
"Database Management Systems"[Mesh]
"Information Systems"[MeSH Major Topic]
"Registries"[Other Term]
"Information Storage and Retrieval"[MeSH Major Topic]
"Registry"[Title]
"National Program of Cancer Registries"[Mesh]
"Registries"[MeSH Major Topic]
"Information Science"[Mesh]
"Data Management"[Mesh]
```

Final Pubmed Query:
```
("Breast Neoplasms"[Mesh] OR "Breast Cancer"[Title] OR "Breast Neoplasms"[Title] OR "Breast Neoplasms"[Other Term] OR "Breast Cancer"[Other Term]) AND ("Registries"[MeSH Major Topic] OR "Database Management Systems"[MeSH Major Topic] OR "Information Systems"[MeSH Major Topic] OR "Registry"[Other Term] OR "Registry"[Title] OR "Information Storage and Retrieval"[MeSH Major Topic])
```

URL:
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=("Breast+Neoplasms"[Mesh]+OR+"Breast+Cancer"[Title]+OR+"Breast+Neoplasms"[Title]+OR+"Breast+Neoplasms"[Other+Term]+OR+"Breast+Cancer"[Other+Term])+AND+("Registries"[MeSH+Major+Topic]+OR+"Database+Management+Systems"[MeSH+Major+Topic]+OR+"Information+Systems"[MeSH+Major+Topic]+OR+"Registry"[Other+Term]+OR+"Registry"[Title]+OR+"Information+Storage+and+Retrieval"[MeSH+Major+Topic])&retmode=json&retstart=1&retmax=10
```


You can download this network, with the relationships between articles and keywords, in `graphdict` format from [**here**](datasets/bcancer-graphdict.json). After some manipulation, the same graph is available in `graphml` format from [**here**](datasets/bcancer.graphml).

## EHR
It is not yet complete.

# Graph Visualization 
Various tools have been developed to visualize graphs. We have done a [brief review](docs/graph-visualization.md) and selected a few tools to use in this program.

# Graph Analysis
In this project, we used one of the most powerful libraries for graph analysis. Using [NetworkX](https://networkx.org/), we generated many indicators to check a citation graph. Some materials in this regard are given [here](docs/graph-analysis.md). You can use other libraries as well.


# Knowledge Extraction
In the architecture of this software, the structure of the article is stored in the database, and this structure includes the article's abstract. For this reason, it is possible to perform NLP processes such as keyword extraction, topic extraction, etc.; see [knowledge extraction](docs/knowledge-extraction.md).


# Related Article
This topic is very interesting from a research point of view, so I have collected the articles I found interesting [here](docs/article.md).



# Code Quality
We used the flake8 and black libraries to improve code quality.
More information can be found [here](docs/code-quality.md).

---

# Citation

If you use `Triple A` for your scientific work, consider citing us! We're published in [IEEE](https://ieeexplore.ieee.org/document/10139229).

```bibtex
@INPROCEEDINGS{10139229,
  author={Jafarpour, Maryam and Bitaraf, Ehsan and Moeini, Ali and Nahvijou, Azin},
  booktitle={2023 9th International Conference on Web Research (ICWR)}, 
  title={Triple A (AAA): a Tool to Analyze Scientific Literature Metadata with Complex Network Parameters}, 
  year={2023},
  volume={},
  number={},
  pages={342-345},
  doi={10.1109/ICWR57742.2023.10139229}}
```

[![DOI:10.1109/ICWR57742.2023.10139229](https://zenodo.org/badge/doi/10.1109/ICWR57742.2023.10139229.svg)](https://doi.org/10.1109/ICWR57742.2023.10139229)



---

# License

TripleA is available under the [Apache License](LICENSE).





            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/EhsanBitaraf/triple-a",
    "name": "triplea",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.10,<4.0",
    "maintainer_email": "",
    "keywords": "graph,semantic-scholar,citation-graph",
    "author": "EhsanBitaraf",
    "author_email": "bitaraf.e@iums.ac.ir",
    "download_url": "https://files.pythonhosted.org/packages/03/e3/a675f27b09c85c1df0cbf36f7e68fe4124bfcc8d8c2973689130636f847e/triplea-0.0.5.tar.gz",
    "platform": null,
    "description": "# triple-a\n*Article Analysis Assistant*\n\nThis program somehow creates a network of article references and provides a connection between authors and keywords, these things are usually called \"[**Citation Graph**](https://en.wikipedia.org/wiki/Citation_graph)\".\n\nThere are various software and online systems for this, a brief review of which can be found [here](docs/related-work.md).\n\nThis tool gives you the power to create a graph of articles and analyze it. This tool is designed as a **CLI** (command-line interface) and you can use it as a Python library.\n\n[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)\n[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n\n[![commits](https://badgen.net/github/commits/EhsanBitaraf/triple-a/main)](https://github.com/EhsanBitaraf/triple-a/commits/main?icon=github&color=green)\n[![GitHub Last commit](https://img.shields.io/github/last-commit/EhsanBitaraf/triple-a)](https://github.com/EhsanBitaraf/triple-a/main)\n![Open Issue](https://img.shields.io/github/issues-raw/EhsanBitaraf/triple-a)\n\n![Repo Size](https://img.shields.io/github/repo-size/EhsanBitaraf/triple-a)\n![GitHub code size in bytes](https://img.shields.io/github/languages/code-size/EhsanBitaraf/triple-a)\n![Downloads](https://img.shields.io/github/downloads/EhsanBitaraf/triple-a/total)\n\n[![GitHub tag](https://img.shields.io/github/tag/EhsanBitaraf/triple-a.svg)](https://GitHub.com/EhsanBitaraf/triple-a/tags/)\n![Release](https://img.shields.io/github/release/EhsanBitaraf/triple-a)\n![Release](https://img.shields.io/github/release-date/EhsanBitaraf/triple-a)\n\n<!-- ![PyPI - Wheel](https://img.shields.io/pypi/EhsanBitaraf/triple-a) -->\n\n[![PyPI version](https://badge.fury.io/py/triplea.svg)](https://badge.fury.io/py/triplea)\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/triplea)\n\n![Build and push images](https://github.com/EhsanBitaraf/triple-a/workflows/push%20docker%20image/badge.svg)\n\n![Testing](https://github.com/EhsanBitaraf/triple-a/actions/workflows/test-poetry-action.yml/badge.svg)\n\n![Code Quality](https://github.com/EhsanBitaraf/triple-a/actions/workflows/python-flake.yml/badge.svg)\n\n\n\n\n\n<!-- test this :\n\nhttps://badge.fury.io/for/py/Triple-a -->\n\n<!-- [![GitHub commits](https://img.shields.io/github/commits-since/EhsanBitaraf/triple-a/v1.0.0.svg)](https://github.com/EhsanBitaraf/triple-a/commit/master) -->\n\n\n\n- [\ud83c\udfae Main Features](#---main-features)\n- [How to use](#how-to-use)\n  * [Setup](#setup)\n  * [Functional Use](#functional-use)\n    + [Training NER for Article Title](#training-ner-for-article-title)\n  * [Command Line (CLI) Use](#command-line--cli--use)\n    + [Get and Save list of article identifier base on search term](#get-and-save-list-of-article-identifier-base-on-search-term)\n    + [Move core pipeline state](#move-core-pipeline-state)\n    + [Run custom pipeline](#run-custom-pipeline)\n      - [NER Article Title](#ner-article-title)\n      - [Country-based Co-authorship](#country-based-co-authorship)\n      - [Extract Triple from Abstract](#extract-triple-from-abstract)\n      - [Extract Topic from Abstract](#extract-topic-from-abstract)\n    + [Import Single Reference File](#import-single-reference-file)\n    + [Export graph](#export-graph)\n    + [Visualizing Graph](#visualizing-graph)\n    + [Analysis Graph](#analysis-graph)\n    + [Work with Article 
Repository](#work-with-article-repository)\n    + [Configuration](#configuration)\n- [Testing](#testing)\n- [Dependencies](#dependencies)\n- [Use case](#use-case)\n  * [Bio Bank](#bio-bank)\n  * [Registry of Breast Cancer](#registry-of-breast-cancer)\n  * [EHR](#ehr)\n- [Graph Visualization](#graph-visualization)\n- [Graph Analysis](#graph-analysis)\n- [Knowledge Extraction](#knowledge-extraction)\n- [Related Article](#related-article)\n- [Code Quality](#code-quality)\n- [Citation](#citation)\n- [License](#license)\n\n\n\n\n\n# \ud83c\udfae Main Features\n- Single paper analysis\n- Dynamic citations deep definition for meta data fetch\n- Network Analysis (Per Node/Overall Graph)\n- Import bibliography file\n- Use for [Bibliometric Analysis](https://researchguides.uic.edu/bibliometrics)\n\n\n\n# How to use \n\n## From Source\n\n### Setup\n\nClone repository:\n```shell\ngit clone https://github.com/EhsanBitaraf/triple-a.git\n```\n\nor \n\n```shell\ngit clone git@github.com:EhsanBitaraf/triple-a.git\n```\n\nCreate environment variable:\n```shell\npython -m venv venv\n```\n\nActivate environment variable:\n\n*Windows*\n```shell\n$ .\\venv\\Scripts\\activate\n```\n\n*Linux*\n```shell\n$ source venv/bin/activate\n```\n\nInstall poetry:\n```shell\npip install poetry\n```\n\nInstal dependences:\n```shell\npoetry install\n```\n\nrun cli:\n```shell\npoetry run python triplea/cli/aaa.py \n```\n\n### Functional Use\n\nget list of PMID in state 0\n```python\nterm = '(\"Electronic Health Records\"[Mesh]) AND (\"National\"[Title/Abstract]) AND Iran'\nget_article_list_all_store_to_kg_rep(term)\n```\n\nmove from state 1\n```python\nmove_state_forward(1)\n```\n\nget list of PMID in state 0 and save to file for debugginf use\n```python\n    data = get_article_list_from_pubmed(1, 10,'(\"Electronic Health Records\"[Mesh]) AND (\"National\"[Title/Abstract])')\n    data = get_article_list_from_pubmed(1, 10,'\"Electronic Health Records\"')\n    data1= json.dumps(data, indent=4)\n    with open(\"sample1.json\", \"w\") as outfile:\n        outfile.write(data1)\n```\n\nopen before file for debugging use\n```python\n    f = open('sample1.json')\n    data = json.load(f)\n    f.close()\n```\n\nget one article from kg and save to file\n```python\n    data = get_article_by_pmid('32434767')\n    data= json.dumps(data, indent=4)\n    with open(\"one-article.json\", \"w\") as outfile:\n        outfile.write(data)\n```\n\nSave Title for Annotation\n```python\n    file =  open(\"article-title.txt\", \"w\", encoding=\"utf-8\")\n    la = get_article_by_state(2)\n    for a in la:\n        try:\n            article = Article(**a.copy())\n        except:\n            pass\n        file.write(article.Title  + \"\\n\")\n```\n\n#### Training NER for Article Title\n\nYou can use NLP(Natural Language Processing) methods to extract information from the structure of the article and add it to your graph. For example, you can extract NER(Named-entity recognition) words from the title of the article and add to the graph. [Here's how to create a custom NER](docs/training-ner.md).\n\n\n\n### Command Line (CLI) Use\n\nBy using the following command, you can see the command completion `help`. 
Each command has a separate `help`.\n\n```shell\npython .\\triplea\\cli\\aaa.py  --help\n```\n\noutput:\n\n![](docs/assets/img/aaa-help.png)\n\n\n#### Get and Save list of article identifier base on search term\n\nGet list of article identifier like PMID base on search term and save into knowledge repository in first state (0):\n\nuse this command:\n```shell\npython .\\triplea\\cli\\aaa.py search --searchterm [searchterm]\n```\n\nEven the PMID itself can be used in the search term.\n```shell\npython .\\triplea\\cli\\aaa.py search --searchterm 36467335\n```\n\noutput:\n\n![](docs/assets/img/aaa-search.png)\n\n#### Move core pipeline state\nThe preparation of the article for extracting the graph has different steps that are placed in a pipeline. Each step is identified by a number in the state value. The following table describes the state number:\n\n*List of state number*\n\n|State|Short Description|Description|\n|-----|-----------------|-----------|\n|0    |article identifier saved|At this stage, the article object stored in the data bank has only one identifier, such as the PMID or DOI identifier|\n|1    |article details article info saved (json Form)|Metadata related to the article is stored in the `OriginalArticle` field from the `SourceBank`, but it has not been parsed yet|\n|2    |parse details info|The contents of the `OriginalArticle` field are parsed and placed in the fields of the Article object.|\n|3    |Get Citation      ||\n|4    |Get Full Text     |At this stage, the articles that are open access and it is possible to get their full text are taken and added to the bank|\n|5    |Convert full text to string     ||\n|-1   |Error             |if error happend in move state 1 to 2|\n|-2   |Error             |if error happend in move state 2 to 3|\n\nThere are two ways to run a pipeline. In the first method, we give the number of the existing state and all the articles in this state move forward one state.\nIn another method, we give the final state number and each article under that state starts to move until it reaches the final state number that we specified.\nThe first can be executed with the `next` command and the second with the `go` command.\n\nWith this command move from current state to the next state\n```shell\npython .\\triplea\\cli\\aaa.py next --state [current state]\n```\n\nfor example move all article in state 0 to 1:\n```shell\npython .\\triplea\\cli\\aaa.py next --state 0\n```\noutput:\n\n![](docs/assets/img/aaa-next.png)\n\n\n`go` command:\n```shell\npython .\\triplea\\cli\\aaa.py go --end [last state]\n```\n\n```shell\npython .\\triplea\\cli\\aaa.py go --end 3\n```\n\noutput:\n\n![](docs/assets/img/aaa-go.png)\n\n\n#### Run custom pipeline\nApart from the core pipelines that should be used to prepare articles, customized pipelines can also be used. Custom pipelines may be implemented to extract knowledge from texts and NLP processing. 
These pipelines themselves can form a new graph other than the citation graph or in combination with it.\n\n\nList of Custom Pipeline\n\n|Action|Tag Name|Description|Prerequisite|\n|------|--------|-----------|------------|\n|Triple extraction from article abstract      |FlagExtractKG        ||At least core state 2|\n|Topic extraction from article abstract       |FlagExtractTopic     ||At least core state 2|\n|Convert Affiliation text to structural data  |FlagAffiliationMining|This is simple way for parse Affiliation text |At least core state 2|\n|Convert Affiliation text to structural data  |FlagAffiliationMining_Titipata|use [Titipat Achakulvisut Repo](https://github.com/titipata/affiliation_parser) for parsing Affiliation text|At least core state 2|\n|Text embedding abstract and send to SciGenius|FlagEmbedding        ||At least core state 2|\n|Title and Abstract Review by LLM             |FlagShortReviewByLLM ||At least core state 2|\n\n##### NER Article Title\nYou can try the NER method to extract the major topic of the article's title by using the following command. This command is independent and is used for testing and is not stored in the Arepo.\n\n```shell\npython .\\triplea\\cli\\ner.py --title \"The Iranian Integrated Care Electronic Health Record.\"\n```\n\n##### Country-based Co-authorship\nA country-based co-authorship network refers to a network of collaborative relationships between researchers from different countries who have co-authored academic papers together. It represents the connections and collaborations that exist among researchers across national boundaries.\n\nBy studying a country-based co-authorship network, researchers can gain insights into international collaborations, identify emerging research trends, foster interdisciplinary cooperation, and facilitate policy decisions related to research funding, academic mobility, and scientific development at a global scale.\n\nThere are several software tools available that can help you produce country-based co-authorship networks. Here are a few popular options:\n\n[VOSviewer](https://www.vosviewer.com/): VOSviewer is a widely used software tool for constructing and visualizing co-authorship networks. It offers various clustering and visualization techniques and allows you to analyze and explore the network based on different attributes, including country affiliation.\n\n[Sci2 Tool](https://sci2.cns.iu.edu/user/index.php): The Science of Science (Sci2) Tool is a Java-based software package (in [GitHub](https://github.com/CIShell)) that supports the analysis and visualization of scientific networks. It offers a variety of functionalities for constructing and analyzing co-authorship networks, including country-based analysis. It allows users to perform data preprocessing, network analysis, and visualization within a single integrated environment.\n\n\n\nTo convert affiliation into a hierarchical structure of country, city and centers, you can use the following command:\n\n```shell\npython .\\triplea\\cli\\aaa.py pipeline -n FlagAffiliationMining\n```\n\n\n##### Extract Triple from Abstract\n\n```shell\npython .\\triplea\\cli\\aaa.py pipeline --name FlagExtractKG\n```\n\n\n\n##### Extract Topic from Abstract\n\n```shell\npython .\\triplea\\cli\\aaa.py pipeline --name FlagExtractTopic\n```\n\nAn example of working with the functions of this part using `Jupyter` is given in [here](./jupyter_lab/selection-sampling.ipynb). 
which is finally drawn using VOSviewer program as below:\n\n![](./docs/assets/img/topic-graph-biobank.png)\n\n#### Import Data\n\n##### Import Single Reference File\nImport file type is `.bib` , `.enw` , `.ris`\n\n```shell\npython .\\triplea\\cli\\importbib.py \"C:\\...\\bc.ris\"\n```\n\noutput:\n\n![](docs/assets/img/import-output.png)\n\n\n##### Import Triplea Format\n\n```sh\npython .\\triplea\\cli\\aaa.py import --help\n```\n\n\n```sh\npython .\\triplea\\cli\\aaa.py import --type triplea --format json --bar True \"C:\\BibliometricAnalysis.json\"\n```\n\n\n#### Export Data\nVarious data export can be created from the article repository. These outputs are used to create raw datasets.\n\n|Type|Format|\n|-|-|\n|triplea|json, csv , *csvs*|\n|rayyan|csv|\n|RefMan*|ris|\n\n\n* It has not yet been implemented.\n\n\nFor guidance from the export command, you can act like this:\n```sh\npython .\\triplea\\cli\\aaa.py export --help\n```\n\nFor Example :\n\n\n\n\nThe export is limited to 100 samples, and the resulting exported articles are saved in the file Triple Json format named \"test_export.json\".\n```sh\npython .\\triplea\\cli\\aaa.py export --type triplea --format json --limit 100 --output \"test_export.json\"\n```\n\n\n```sh\npython .\\triplea\\cli\\aaa.py export --type triplea --format json --output \"test_export.json\"\n```\n\nExport Triplea CSV format:\n```sh\npython .\\triplea\\cli\\aaa.py export --type triplea --format csv --output \"test_export.csv\"\n```\n\n\n```sh\npython .\\triplea\\cli\\aaa.py export --type triplea --format csvs --output \"export.csv\"\n```\n\n\nExport for Rayyan CSV format:\n```sh\npython .\\triplea\\cli\\aaa.py export --type rayyan --format csv --output \"test_export.csv\"\n```\n\n#### Export Graph\n\nfor details information:\n```sh\npython .\\triplea\\cli\\aaa.py export_graph --help\n```\n\n\nMaking a graph with the `graphml` format and saving it in a file `test.graphml`\n```shell\npython .\\triplea\\cli\\aaa.py export_graph -g gen-all -f graphml -o .\\triplea\\test\n```\n\nMaking a graph with the `gexf` format and saving it in a file `C:\\Users\\Dr bitaraf\\Documents\\graph\\article.gexf`.This graph contains article, author, affiliation and relation between them:\n```shell\npython .\\triplea\\cli\\aaa.py export_graph -g article-author-affiliation -f gexf -o \"C:\\Users\\Dr bitaraf\\Documents\\graph\\article\"\n```\n\nMaking a graph with the `graphdict` format and saving it in a file `C:\\Users\\Dr bitaraf\\Documents\\graph\\article.json`.This graph contains article, Reference, article cite and relation between them:\n```shell\npython .\\triplea\\cli\\aaa.py export_graph -g article-reference -g article-cited -f graphdict -o \"C:\\Users\\Dr bitaraf\\Documents\\graph\\article.json\"\n```\n\nMaking a graph with the `graphml` format and saving it in a file `C:\\graph-repo\\country-authorship.jgraphmlson`.This graph contains article, country, and relation between them:\n```shell\npython .\\triplea\\cli\\aaa.py export_graph -g country-authorship -f graphml -o \"C:\\graph-repo\\country-authorship\"\n```\n\n\nTypes of graph generators that can be used in the `-g` parameter:\n\n|Name|Description|\n|----|-----------|\n|store|It considers all the nodes and edges that are stored in the database|\n|gen-all|It considers all possible nodes and edges|\n|article-topic|It considers article and topic as nodes and edges between them|\n|article-author-affiliation|It considers article, author and affiliation as nodes and edges between them|\n|article-keyword|It considers article and 
keyword as nodes and edges between them|\n|article-reference|It considers article and reference as nodes and edges between them|\n|article-cited|It considers article and cited as nodes and edges between them|\n|country-authorship||\n\nTypes of graph file format that can be used in the `-f` parameter:\n|Name|Description|\n|----|-----------|\n|graphdict|This format is a customized format for citation graphs in the form of a Python dictionary.|\n|graphjson||\n|gson||\n|gpickle|Write graph in Python pickle format. Pickles are a serialized byte stream of a Python object|\n|graphml|The GraphML file format uses .graphml extension and is XML structured. It supports attributes for nodes and edges, hierarchical graphs and benefits from a flexible architecture.|\n|gexf|GEXF (Graph Exchange XML Format) is an XML-based file format for storing a single undirected or directed graph.|\n\n#### Visualizing Graph\nSeveral visualizator are used to display graphs in this program. These include:\n\n[Alchemy.js](https://graphalchemist.github.io/Alchemy/#/) : Alchemy.js is a graph drawing application built almost entirely in d3.\n\n[interactivegaraph](https://github.com/grapheco/InteractiveGraph) : InteractiveGraph provides a web-based interactive visualization and analysis framework for large graph data, which may come from a GSON file\n\n[netwulf](https://github.com/benmaier/netwulf) : Interactive visualization of networks based on Ulf Aslak's d3 web app.\n\n\n```shell\npython .\\triplea\\cli\\aaa.py visualize -g article-reference -g article-cited -p 8001\n```\n\n\n```shell\npython .\\triplea\\cli\\aaa.py visualize -g gen-all -p 8001\n```\n\n\noutput:\n\n![](docs/assets//img/gen-all-graph.png)\n\n\n```shell\npython .\\triplea\\cli\\aaa.py visualize -g article-topic -g article-keyword -p 8001\n```\n\noutput:\n\n![](docs/assets/img/graph-alchemy.png)\n\n\nVisulaize File\n\nA file related to the extracted graph can be visualized in different formats with the following command:\n```sh\npython .\\triplea\\cli\\aaa.py visualize_file --format graphdict \"graph.json\"\n```\n\n#### Analysis Graph\n\n\n`analysis info` command calculates specific metrics for the entire graph. These metrics include the following:\n- Graph Type: \n- SCC: \n- WCC: \n- Reciprocity : \n- Graph Nodes: \n- Graph Edges: \n- Graph Average Degree : \n- Graph Density : \n- Graph Transitivity : \n- Graph max path length : \n- Graph Average Clustering Coefficient : \n- Graph Degree Assortativity Coefficient : \n\n```\npython .\\triplea\\cli\\aaa.py analysis -g gen-all -c info\n```\n\noutput:\n\n![](docs/assets/img/aaa-analysis-info.png)\n\n\n\n\nCreates a graph with all possible nodes and edges and calculates and lists the sorted [degree centrality](https://bookdown.org/omarlizardo/_main/4-2-degree-centrality.html) for each node.\n```\npython .\\triplea\\cli\\aaa.py analysis -g gen-all -c sdc\n```\n\noutput:\n\n![](docs/assets/img/aaa-analysis-sdc.png)\n\n\n#### Work with Article Repository\nArticle Repository (Arepo) is a database that stores the information of articles and graphs. Different databases can be used. 
We have used the following information banks here:\n\n- [TinyDB](https://github.com/msiemens/tinydb) - TinyDB is a lightweight document oriented database\n\n- [MongoDB](https://www.mongodb.com/) - MongoDB is a source-available cross-platform document-oriented database program\n\n\nTo get general information about the articles, nodes and egdes in the database, use the following command.\n```shell\npython .\\triplea\\cli\\aaa.py arepo -c info\n```\n\noutput:\n```shell\nNumber of article in article repository is 122\n0 Node(s) in article repository.\n0 Edge(s) in article repository.\n122 article(s) in state 3.\n```\n\n\n\nGet article data by PMID\n```sh\npython .\\triplea\\cli\\aaa.py arepo -pmid 31398071\n```\n\noutput:\n```\nTitle   : Association between MRI background parenchymal enhancement and lymphovascular invasion and estrogen receptor status in invasive breast cancer.\nJournal : The British journal of radiology\nDOI     : 10.1259/bjr.20190417\nPMID    : 31398071\nPMC     : PMC6849688\nState   : 3\nAuthors : Jun Li, Yin Mo, Bo He, Qian Gao, Chunyan Luo, Chao Peng, Wei Zhao, Yun Ma, Ying Yang, \nKeywords: Adult, Aged, Breast Neoplasms, Female, Humans, Lymphatic Metastasis, Magnetic Resonance Imaging, Menopause, Middle Aged, Neoplasm Invasiveness, Receptors, Estrogen, Retrospective Studies, Young Adult,\n```\n\nGet article data by PMID and save to `article.json` file.\n```sh\npython .\\triplea\\cli\\aaa.py arepo -pmid 31398071 -o article.json\n```\n\nanother command fo this:\n```sh\npython .\\triplea\\cli\\aaa.py export_article --idtype pmid --id 31398071 --format json --output \"article.json\"\n```\n\n#### Configuration\n\nFor details information:\n```shell\npython .\\triplea\\cli\\aaa.py config --help\n```\n\nGet environment variable:\n```shell\n python .\\triplea\\cli\\aaa.py config -c info\n```\n\nSet new environment variable:\n```shell\npython .\\triplea\\cli\\aaa.py config -c update\n```\n\nBelow is a summary of important environment variables in this project:\n|Environment Variables     |Description|Default Value|\n|--------------------------|-----------|-------------|\n|TRIPLEA_DB_TYPE           |The type of database to be used in the project. The database layer is separate and you can use different databases, currently it supports `MongoDB` and `TinyDB` databases. TinyDB can be used for small scope and Mango can be used for large scope|TinyDB|\n|AAA_TINYDB_FILENAME       |File name of TinyDB|articledata.json|\n|AAA_MONGODB_CONNECTION_URL|[Standard Connection String Format](https://www.mongodb.com/docs/manual/reference/connection-string/#std-label-connections-standard-connection-string-format) For MongoDB|mongodb://user:pass@127.0.0.1:27017/|\n|AAA_MONGODB_DB_NAME       |Name of MongoDB Collection|articledata|\n|AAA_TPS_LIMIT             |Transaction Per Second Limitation|1|\n|AAA_PROXY_HTTP            |An HTTP proxy is a server that acts as an intermediary between a client and PubMed server. When a client sends a request to a server through an HTTP proxy, the proxy intercepts the request and forwards it to the server on behalf of the client. 
Similarly, when the server responds, the proxy intercepts the response and forwards it back to the client.||\n|AAA_PROXY_HTTPS           |HTTPS Proxy|| \n|AAA_REFF_CRAWLER_DEEP     ||1|\n|AAA_CITED_CRAWLER_DEEP    ||1|\n\n## From Package\n\nYou can create a python virtual environment before installing and it is recommended that you do so.\n```sh\n$ python -m venv venv\n```\n\n```sh\n$ .\\venv\\Scripts\\activate\n```\n\nInstall Package with pip:\n```sh\n$ pip install triplea\n```\n\nCreate environment variable by `.env` file:\n```\nTRIPLEA_DB_TYPE = TinyDB\nAAA_TINYDB_FILENAME = articledata.json\nAAA_MONGODB_CONNECTION_URL = mongodb://localhost:27017/\nAAA_MONGODB_DB_NAME = articledata\nAAA_TPS_LIMIT = 1\nAAA_PROXY_HTTP = \nAAA_PROXY_HTTPS = \nAAA_REFF_CRAWLER_DEEP = 1\nAAA_CITED_CRAWLER_DEEP = 1\nAAA_TOPIC_EXTRACT_ENDPOINT=http://localhost:8001/api/v1/topic/\nAAA_CLIENT_AGENT=\"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/109.0\"\n```\nIf you do not create the mentioned file, the default values will be used, which are:\n```\nTRIPLEA_DB_TYPE = TinyDB\nAAA_TINYDB_FILENAME = default-tiny-db.json\nAAA_TPS_LIMIT = 1\nAAA_REFF_CRAWLER_DEEP = 1\nAAA_CITED_CRAWLER_DEEP = 1\nAAA_TOPIC_EXTRACT_ENDPOINT=http://localhost:8001/api/v1/topic/\nAAA_CLIENT_AGENT=\"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/109.0\"\n```\n\nRun CLI:\n```sh\n$ aaa --help\n```\n\noutput:\n```sh\nUsage: aaa [OPTIONS] COMMAND [ARGS]...\n\nOptions:\n  -v, --version\n  --help         Show this message and exit.\n\nCommands:\n  analysis        Analysis Graph.\n  config          Configuration additional setting.\n  export          Export article repository in specific format.\n  export_article  Export Article by identifier.\n  export_graph    Export Graph.\n  export_llm      Export preTrain LLM.\n  go              Moves the articles state in the Arepo until end state.\n  import          import article from specific file format to article...\n  importbib       import article from .bib , .enw , .ris file format.\n  ner             Single NER with custom model.\n  next            Moves the articles state in the Arepo from the current...\n  pipeline        Run Custom PipeLine in arepo.\n  search          Search query from PubMed and store to Arepo.\n```\n\n*Note*: The visualization function is only available in the source version\n\n# Testing\n\n```sh\npoetry run pytest\n```\npoetry run pytest tests/\n```sh\npoetry run pytest --cov\n```\n\n\nFor unit test check :\n\nbibilometric:\n\n37283018\n\n35970485\n\n# Dependencies\n\n\n\nFor graph analysis:\n\n[networkx](https://networkx.org/)\n\n\nFor NLP:\n\n[PyTextRank](https://derwen.ai/docs/ptr/)\n\n[transformers](https://huggingface.co/docs/transformers/index) \n\n[spaCy](https://spacy.io/)\n\nFor data storage:\n \n[TinyDB](https://tinydb.readthedocs.io/en/latest/)\n\n[py2neo](https://github.com/py2neo-org/py2neo)\n\n[pymongo](https://github.com/mongodb/mongo-python-driver)\n\nFor visualization of networks:\n\n[netwulf](https://github.com/benmaier/netwulf)\n\n[Alchemy.js](https://graphalchemist.github.io/Alchemy/#/)\n\n[InteractiveGraph](https://github.com/grapheco/InteractiveGraph)\n\nFor CLI:\n\n[click](https://click.palletsprojects.com/en/8.1.x/)\n\n\nFor packaging and dependency management: \n\n[Poetry](https://python-poetry.org/docs/basic-usage/)\n\n\n\n# Use case\nWith this tool, you can create datasets in different formats, here are examples of these datasets.\n\n\n## Breast Cancer\n\nPubmed Query:\n```\n\"breast 
neoplasms\"[MeSH Terms] OR (\"breast\"[All Fields] AND \"neoplasms\"[All Fields]) OR \"breast neoplasms\"[All Fields] OR (\"breast\"[All Fields] AND \"cancer\"[All Fields]) OR \"breast cancer\"[All Fields]\n```\n\n`495,012` results\n\nConfiguration:\n```\nAAA_MONGODB_DB_NAME = bcarticledata\nAAA_REFF_CRAWLER_DEEP = 0\nAAA_CITED_CRAWLER_DEEP = 0\n```\n\n`EDirect` used.\n\nSearch with this command:\n\n```\npython .\\triplea\\cli\\aaa.py search --searchterm r'\"breast neoplasms\"[MeSH Terms] OR (\"breast\"[All Fields] AND \"neoplasms\"[All Fields]) OR \"breast neoplasms\"[All Fields] OR (\"breast\"[All Fields] AND \"cancer\"[All Fields]) OR \"breast cancer\"[All Fields]'\n```\n\nif --searchterm argument is too complex use this:\n```\npython .\\triplea\\cli\\aaa.py search\n```\n\nby Filter :\n```\n{\n    \"mindate\" : \"2022/01/01\",\n    \"maxdate\" : \"2022/12/30\"\n}\n```\n\nGet info of all downloaded article:\n```shell\npython .\\triplea\\cli\\aaa.py arepo -c info\n```\n\noutput:\n```shell\nNumber of article in article repository is 30914\n0 Node(s) in article repository.\n0 Edge(s) in article repository.\n30914 article(s) in state 0.\n```\n\nRun Core pipeline to next status\n```shell\npython .\\triplea\\cli\\aaa.py next --state 0\n```\n\nthen parsing article:\n```shell\npython .\\triplea\\cli\\aaa.py next --state 1\n```\n\nExtract Triple is type of custom pipeline. you can run this:\n```shell\npython .\\triplea\\cli\\aaa.py pipeline --name FlagExtractKG\n```\n\n## Bio Bank\n\nPubmed Query:\n```\n\"Biological Specimen Banks\"[Mesh] OR BioBanking OR biobank OR dataBank OR \"Bio Banking\" OR \"bio bank\"\n```\n\n`39,023` results \n\nSearch with this command:\n\n```shell\npython .\\triplea\\cli\\aaa.py search --searchterm \"\\\"Biological Specimen Banks\\\"[Mesh] OR BioBanking OR biobank OR dataBank OR \\\"Bio Banking\\\" OR \\\"bio bank\\\" \"\n```\n\nGet 39,023 result until `2023/01/02`\n\n```\n\"ERROR\":\"Search Backend failed: Exception:\\n\\'retstart\\' cannot be larger than 9998. For PubMed, ESearch can only retrieve the first 9,999 records matching the query. To obtain more than 9,999 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved. For details see https://www.ncbi.nlm.nih.gov/books/NBK25499/\"\n```\n\nThis query had more than 10,000 results, and as a result, the following text was used:\n\nTo retrieve more than 10,000 UIDs from databases other than PubMed, submit multiple esearch requests while incrementing the value of retstart (see Application 3). For PubMed, ESearch can only retrieve the first 10,000 records matching the query. To obtain more than 10,000 PubMed records, consider using `<EDirect>` that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved.\n\nThis is hard code in `get_article_list_from_pubmed` methods in `PARAMS`.\n\nThis Query Added lately:\n```\n\"bio-banking\"[Title/Abstract] OR \"bio-bank\"[Title/Abstract] OR \"data-bank\"[Title/Abstract]\n```\n\n`9,012` results\n\n```shell\npython .\\triplea\\cli\\aaa.py search --searchterm \" \\\"bio-banking\\\"[Title/Abstract] OR \\\"bio-bank\\\"[Title/Abstract] OR \\\"data-bank\\\"[Title/Abstract] \"\n```\n\nafter run this. 
This query was added later:
```
"bio-banking"[Title/Abstract] OR "bio-bank"[Title/Abstract] OR "data-bank"[Title/Abstract]
```

`9,012` results

```shell
python .\triplea\cli\aaa.py search --searchterm " \"bio-banking\"[Title/Abstract] OR \"bio-bank\"[Title/Abstract] OR \"data-bank\"[Title/Abstract] "
```

After running this, get the repository info:
```
Number of article in article repository is 47735
```

Export in `graphml` format:
```shell
python .\triplea\cli\aaa.py export_graph -g article-reference -g article-keyword  -f graphml -o .\triplea\datasets\biobank.graphml
```

## Registry of Breast Cancer

Keyword checking:
```
"Breast Neoplasms"[Mesh]
"Breast Cancer"[Title]
"Breast Neoplasms"[Title]
"Breast Neoplasms"[Other Term]
"Breast Cancer"[Other Term]
"Registries"[Mesh]
"Database Management Systems"[Mesh]
"Information Systems"[MeSH Major Topic]
"Registries"[Other Term]
"Information Storage and Retrieval"[MeSH Major Topic]
"Registry"[Title]
"National Program of Cancer Registries"[Mesh]
"Registries"[MeSH Major Topic]
"Information Science"[Mesh]
"Data Management"[Mesh]
```

Final PubMed query:
```
("Breast Neoplasms"[Mesh] OR "Breast Cancer"[Title] OR "Breast Neoplasms"[Title] OR "Breast Neoplasms"[Other Term] OR "Breast Cancer"[Other Term]) AND ("Registries"[MeSH Major Topic] OR "Database Management Systems"[MeSH Major Topic] OR "Information Systems"[MeSH Major Topic] OR "Registry"[Other Term] OR "Registry"[Title] OR "Information Storage and Retrieval"[MeSH Major Topic])
```

URL:
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=("Breast+Neoplasms"[Mesh]+OR+"Breast+Cancer"[Title]+OR+"Breast+Neoplasms"[Title]+OR+"Breast+Neoplasms"[Other+Term]+OR+"Breast+Cancer"[Other+Term])+AND+("Registries"[MeSH+Major+Topic]+OR+"Database+Management+Systems"[MeSH+Major+Topic]+OR+"Information+Systems"[MeSH+Major+Topic]+OR+"Registry"[Other+Term]+OR+"Registry"[Title]+OR+"Information+Storage+and+Retrieval"[MeSH+Major+Topic])&retmode=json&retstart=1&retmax=10
```

You can download the resulting network of article-keyword relationships in `graphdict` format from [**here**](datasets/bcancer-graphdict.json). After some manipulation, you can also download this graph in `graphml` format from [**here**](datasets/bcancer.graphml).

## EHR
It is not yet complete.

# Graph Visualization
Various tools have been developed to visualize graphs. We have done a [brief review](docs/graph-visualization.md) and selected a few tools to use in this program.

# Graph Analysis
In this project, we used one of the most powerful libraries for graph analysis. Using [NetworkX](https://networkx.org/), we generated many indicators to check a citation graph. Some materials in this regard are given [here](docs/graph-analysis.md). You can use other libraries as well.
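As a pointer to what such indicators look like in code, here is a minimal NetworkX sketch that loads an exported `graphml` file and computes a few standard centrality measures. The file path matches the export command above; the particular metrics are just common examples, not a fixed list used by triplea.

```python
# Minimal sketch: load an exported citation graph and compute a few
# standard NetworkX indicators. The metrics are illustrative examples.
import networkx as nx

G = nx.read_graphml(r".\triplea\datasets\biobank.graphml")

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

degree = nx.degree_centrality(G)  # normalized degree per node
pagerank = nx.pagerank(G)         # influence within the citation network

# Top five nodes by PageRank
for node, score in sorted(pagerank.items(), key=lambda kv: -kv[1])[:5]:
    print(node, round(score, 5))
```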
# Knowledge Extraction
In the architecture of this software, the structure of each article is stored in the database, and this structure also contains the article's abstract. For this reason, it is possible to perform NLP processes such as keyword extraction, topic extraction, etc., which can be completed in the future ([more](docs/knowledge-extraction.md)).

# Related Article
This topic is very interesting from a research point of view, so I have collected the articles that I found interesting [here](docs/article.md).

# Code Quality
We used the flake8 and black libraries to increase code quality.
More information can be found [here](docs/code-quality.md).

---

# Citation

If you use `Triple A` for your scientific work, consider citing us! We are published in [IEEE](https://ieeexplore.ieee.org/document/10139229).

```bibtex
@INPROCEEDINGS{10139229,
  author={Jafarpour, Maryam and Bitaraf, Ehsan and Moeini, Ali and Nahvijou, Azin},
  booktitle={2023 9th International Conference on Web Research (ICWR)},
  title={Triple A (AAA): a Tool to Analyze Scientific Literature Metadata with Complex Network Parameters},
  year={2023},
  volume={},
  number={},
  pages={342-345},
  doi={10.1109/ICWR57742.2023.10139229}}
```

[![DOI:10.1109/ICWR57742.2023.10139229](https://zenodo.org/badge/doi/10.1109/ICWR57742.2023.10139229.svg)](https://doi.org/10.1109/ICWR57742.2023.10139229)

---

# License

TripleA is available under the [Apache License](LICENSE).
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Article Analysis Assistant",
    "version": "0.0.5",
    "project_urls": {
        "Homepage": "https://github.com/EhsanBitaraf/triple-a",
        "Repository": "https://github.com/EhsanBitaraf/triple-a"
    },
    "split_keywords": [
        "graph",
        "semantic-scholar",
        "citation-graph"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e602b37657caff4983a49c6d73eaa9d777939206d836f3e285740f7e376ef4a1",
                "md5": "a75180ed8cce77ee9ceadbe0ac58921d",
                "sha256": "1e8709da924b4c0d330aaacba4ea2c8cf1d62a670530391b701b61cd2232a280"
            },
            "downloads": -1,
            "filename": "triplea-0.0.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a75180ed8cce77ee9ceadbe0ac58921d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10,<4.0",
            "size": 11478210,
            "upload_time": "2024-02-13T05:27:57",
            "upload_time_iso_8601": "2024-02-13T05:27:57.487295Z",
            "url": "https://files.pythonhosted.org/packages/e6/02/b37657caff4983a49c6d73eaa9d777939206d836f3e285740f7e376ef4a1/triplea-0.0.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "03e3a675f27b09c85c1df0cbf36f7e68fe4124bfcc8d8c2973689130636f847e",
                "md5": "f5a7ba5972b32f11f2ff794bd5258cc8",
                "sha256": "f244e7a3b8261041749c58854eda1e326d7de6846f25c331e81a8b445e92dc0b"
            },
            "downloads": -1,
            "filename": "triplea-0.0.5.tar.gz",
            "has_sig": false,
            "md5_digest": "f5a7ba5972b32f11f2ff794bd5258cc8",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10,<4.0",
            "size": 11441853,
            "upload_time": "2024-02-13T05:28:00",
            "upload_time_iso_8601": "2024-02-13T05:28:00.334101Z",
            "url": "https://files.pythonhosted.org/packages/03/e3/a675f27b09c85c1df0cbf36f7e68fe4124bfcc8d8c2973689130636f847e/triplea-0.0.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-13 05:28:00",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "EhsanBitaraf",
    "github_project": "triple-a",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "triplea"
}
        
Elapsed time: 0.22871s