graphscope


- Name: graphscope
- Version: 0.27.0
- Home page: https://github.com/alibaba/GraphScope
- Upload time: 2024-03-29 15:16:25
- Author: GraphScope Team, Damo Academy
- License: Apache License 2.0
- Keywords: graphscope, graph computations

<h1 align="center">
    <img src="https://graphscope.io/assets/images/graphscope-logo.svg" width="400" alt="graphscope-logo">
</h1>
<p align="center">
    A One-Stop Large-Scale Graph Computing System from Alibaba
</p>

[![GraphScope CI](https://github.com/alibaba/GraphScope/actions/workflows/local-ci.yml/badge.svg)](https://github.com/alibaba/GraphScope/actions/workflows/local-ci.yml)
[![Coverage](https://codecov.io/gh/alibaba/GraphScope/branch/main/graph/badge.svg)](https://codecov.io/gh/alibaba/GraphScope)
[![Playground](https://shields.io/badge/JupyterLab-Try%20GraphScope%20Now!-F37626?logo=jupyter)](https://try.graphscope.app)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alibaba/GraphScope)
[![Artifact HUB](https://img.shields.io/endpoint?url=https://artifacthub.io/badge/repository/graphscope)](https://artifacthub.io/packages/helm/graphscope/graphscope)
[![Docs-en](https://shields.io/badge/Docs-English-blue?logo=Read%20The%20Docs)](https://graphscope.io/docs)
[![FAQ-en](https://img.shields.io/badge/-FAQ-blue?logo=Read%20The%20Docs)](https://graphscope.io/docs/frequently_asked_questions.html)
[![Docs-zh](https://shields.io/badge/Docs-%E4%B8%AD%E6%96%87-blue?logo=Read%20The%20Docs)](https://graphscope.io/docs/zh/)
[![FAQ-zh](https://img.shields.io/badge/-FAQ%E4%B8%AD%E6%96%87-blue?logo=Read%20The%20Docs)](https://graphscope.io/docs/zh/frequently_asked_questions.html)
[![README-zh](https://shields.io/badge/README-%E4%B8%AD%E6%96%87-blue)](README-zh.md)
[![ACM DL](https://img.shields.io/badge/ACM%20DL-10.14778%2F3476311.3476369-blue)](https://dl.acm.org/doi/10.14778/3476311.3476369)

🎉 See our ongoing [GraphScope Flex](https://github.com/alibaba/GraphScope/tree/main/flex): a LEGO-inspired, modular, and user-friendly GraphScope evolution. 🎉

GraphScope is a unified distributed graph computing platform that provides a one-stop environment for performing diverse graph operations on a cluster of computers through a user-friendly Python interface. GraphScope makes multi-staged processing of large-scale graph data on compute clusters simple by combining several important pieces of Alibaba technology: [GRAPE](https://github.com/alibaba/libgrape-lite), [MaxGraph](interactive_engine/), and [Graph-Learn](https://github.com/alibaba/graph-learn) (GL) for analytics, interactive, and graph neural network (GNN) computation, respectively, and the [Vineyard](https://github.com/v6d-io/v6d) store that offers efficient in-memory data transfers.

Visit our website at [graphscope.io](https://graphscope.io) to learn more.

## Latest News
- [05/02/2024] 🎉 GraphScope Flex [paper](https://arxiv.org/abs/2312.12107) was accepted by [SIGMOD 2024](https://2024.sigmod.org/) Industry Track. See you in 🇨🇱!
- [19/12/2023] 📑 A paper introducing GraphScope Flex was released on [arXiv.org](https://arxiv.org/abs/2312.12107).
- [20/07/2023] 🏆 GraphScope achieved record-breaking results on the [LDBC Social Network Benchmark Interactive workload](https://ldbcouncil.org/benchmarks/snb-interactive/), with a 2.45× higher throughput on SF300 than the previous record holder! 🏆
- [04/07/2023] 🚀 GraphScope Flex tech preview released with [v0.23.0](https://github.com/alibaba/GraphScope/releases/tag/v0.23.0).
  
## Table of Contents

- [Getting Started](#getting-started)
  - [Installation for Standalone Mode](#installation-for-standalone-mode)
- [Demo: Node Classification on Citation Network](#demo-node-classification-on-citation-network)
  - [Loading a graph](#loading-a-graph)
  - [Interactive query](#interactive-query)
  - [Graph analytics](#graph-analytics)
  - [Graph neural networks (GNNs)](#graph-neural-networks-gnns)
- [Graph Processing on Kubernetes](#processing-large-graph-on-kubernetes-cluster)
  - [Creating a session](#creating-a-session)
  - [Loading graphs and graph computation](#loading-a-graph-and-processing-computation-tasks)
  - [Closing the session](#closing-the-session)
- [Development](#development)
  - [Building from source](#building-on-local)
  - [Building Docker images](#building-docker-images)
  - [Building the client library](#building-client-library)
  - [Testing](#testing)
- [Documentation](#documentation)
- [License](#license)
- [Publications](#publications)
- [Joining our Community!](#contributing)

## Getting Started

We provide a [Playground](https://try.graphscope.app) with a managed JupyterLab. [Try GraphScope](https://try.graphscope.app) straight away in your browser!

GraphScope supports running in standalone mode or on clusters managed by [Kubernetes](https://kubernetes.io/) within containers. To get started quickly,
let's begin with the standalone mode.


### Installation for Standalone Mode

The pre-compiled GraphScope package is distributed as a Python package and can be installed with `pip`.

```bash
pip3 install graphscope
```

Note that `graphscope` requires `Python` >= `3.8` and `pip` >= `19.3`. The package is built for and tested on the most popular Linux (Ubuntu 20.04+ / CentOS 7+) and macOS 12+ (Intel/Apple silicon) distributions. Windows users may want to install Ubuntu on WSL2 to use this package.
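Before installing, it can help to confirm that the local interpreter and `pip` meet the minimums stated above. The following is a minimal sketch using only the standard library; the helper names `parse_version` and `meets_minimum` are illustrative and not part of GraphScope.

```python
import sys
from importlib import metadata


def parse_version(text):
    """Turn '23.0.1' into (23, 0, 1); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(found, required):
    """Compare two version tuples, e.g. (3, 10) >= (3, 8)."""
    return tuple(found) >= tuple(required)


python_ok = meets_minimum(sys.version_info[:2], (3, 8))
try:
    pip_ok = meets_minimum(parse_version(metadata.version("pip")), (19, 3))
except metadata.PackageNotFoundError:
    pip_ok = False  # pip is not installed in this environment

print(f"Python >= 3.8: {python_ok}, pip >= 19.3: {pip_ok}")
```

If either check prints `False`, upgrade the interpreter or run `pip3 install --upgrade pip` before installing the package.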

Next, we will walk you through a concrete example to illustrate how GraphScope can be used by data scientists to effectively analyze large graphs.


## Demo: Node Classification on Citation Network

[`ogbn-mag`](https://ogb.stanford.edu/docs/nodeprop/#ogbn-mag) is a heterogeneous network composed of a subset of the Microsoft Academic Graph. It contains four types of entities (i.e., papers, authors, institutions, and fields of study), as well as four types of directed relations, each connecting two entities.

Given the heterogeneous `ogbn-mag` data, the task is to predict the class of each paper. Node classification can identify papers in multiple venues, which represent different groups of scientific work on different topics. We apply both the attribute and structural information to classify papers. In the graph, each paper node contains a 128-dimensional word2vec vector representing its content, which is obtained by averaging the embeddings of words in its title and abstract. The embeddings of individual words are pre-trained. The structural information is computed on-the-fly.

### Loading a graph

GraphScope models graph data as a property graph, in which the edges and vertices are labeled and may have many properties.
Taking `ogbn-mag` as an example, the figure below shows the model of the property graph.

<div align="center">
    <img src="https://graphscope.io/docs/_images/sample_pg.png" width="600" alt="sample-of-property-graph" />
</div>

This graph has four kinds of vertices, labeled `paper`, `author`, `institution`, and `field_of_study`. Four kinds of edges connect them; each kind of edge has a label and specifies the vertex labels for its two ends. For example, `cites` edges connect two vertices labeled `paper`, while `writes` edges require the source vertex to be labeled `author` and the destination vertex to be labeled `paper`. All vertices and edges may have properties; for example, `paper` vertices have properties such as features, publication year, and subject label.
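To make the endpoint-label rule concrete, here is a minimal plain-Python sketch of such a schema check. It does not use GraphScope; `EDGE_SCHEMA`, `check_edge`, and the toy vertex IDs are hypothetical names chosen for illustration, and only the two edge kinds named above are modeled.

```python
# Edge label -> (required source vertex label, required destination vertex label).
# Only the two edge kinds named in the text are listed here.
EDGE_SCHEMA = {
    "cites": ("paper", "paper"),
    "writes": ("author", "paper"),
}


def check_edge(vertex_labels, src, label, dst):
    """Return True if the edge respects the endpoint labels required by its kind."""
    expected_src, expected_dst = EDGE_SCHEMA[label]
    return vertex_labels[src] == expected_src and vertex_labels[dst] == expected_dst


# Hypothetical toy vertices: id -> label, plus a property map per vertex.
vertex_labels = {"a1": "author", "p1": "paper", "p2": "paper"}
vertex_props = {"a1": {}, "p1": {"year": 2015}, "p2": {"year": 2018}}

print(check_edge(vertex_labels, "a1", "writes", "p1"))  # True
print(check_edge(vertex_labels, "p1", "cites", "p2"))   # True
print(check_edge(vertex_labels, "p1", "writes", "a1"))  # False: source must be an author
```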

To load this graph into GraphScope with our retrieval module, use the following code:

```python
import graphscope
from graphscope.dataset import load_ogbn_mag

g = load_ogbn_mag()
```

We provide a set of functions to load graph datasets from [ogb](https://ogb.stanford.edu/docs/dataset_overview/) and [snap](https://snap.stanford.edu/data/index.html) for convenience. Please find all the available graphs [here](https://github.com/alibaba/GraphScope/tree/docs/python/graphscope/dataset). If you want to use your own graph data, please refer to [this doc](https://graphscope.io/docs/loading_graph.html) to load vertices and edges by labels.


### Interactive query

Interactive queries allow users to directly explore, examine, and present graph data in an *exploratory* manner in order to locate specific or in-depth information in a timely manner.
GraphScope adopts a high-level language called [Gremlin](http://tinkerpop.apache.org/) for graph traversal, and provides [efficient execution](interactive_engine/benchmark/) at scale.

In this example, we use graph traversal to count the number of papers two given authors have co-authored. To simplify the query, we assume the authors can be uniquely identified by ID `2` and `4307`, respectively.

```python
# get the endpoint for submitting Gremlin queries on graph g.
interactive = graphscope.gremlin(g)

# count the number of papers two authors (with id 2 and 4307) have co-authored
papers = interactive.execute("g.V().has('author', 'id', 2).out('writes').where(__.in('writes').has('id', 4307)).count()").one()
```
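For readers unfamiliar with Gremlin, the traversal above starts at one author, follows `writes` edges to papers, and keeps only the papers that also have an incoming `writes` edge from the other author. A minimal plain-Python sketch of the same logic, on a hypothetical toy edge list and assuming no duplicate edges, looks like this:

```python
from collections import defaultdict

# Hypothetical toy `writes` edges: (author id, paper id).
writes = [(2, "p1"), (2, "p2"), (2, "p3"), (4307, "p2"), (4307, "p3"), (7, "p1")]


def coauthored_count(writes_edges, author_a, author_b):
    """Count the papers that both authors have a `writes` edge to."""
    papers_by_author = defaultdict(set)
    for author, paper in writes_edges:
        papers_by_author[author].add(paper)
    return len(papers_by_author[author_a] & papers_by_author[author_b])


print(coauthored_count(writes, 2, 4307))  # 2
```

The set intersection plays the role of the `where(__.in('writes').has('id', 4307))` filter; on a real graph the interactive engine evaluates the traversal in a distributed fashion rather than materializing these sets on the client.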

### Graph analytics

Graph analytics is widely used in the real world. Many algorithms, such as community detection, paths and connectivity, and centrality, have proven to be very useful in various businesses.
GraphScope ships with a set of [built-in algorithms](https://graphscope.io/docs/analytics_engine.html#built-in-algorithms), which enable users to analyze their graph data easily.

Continuing our example, below we first derive a subgraph by extracting publications within a specific time range out of the entire graph (using Gremlin!), and then run k-core decomposition and triangle counting to generate the structural features of each paper node.

Please note that many algorithms may only work on *homogeneous* graphs. Therefore, to evaluate these algorithms over a property graph, we first need to project it into a simple graph.

```python
# extract a subgraph of publications within a time range
sub_graph = interactive.subgraph("g.V().has('year', gte(2014).and(lte(2020))).outE('cites')")

# project the property subgraph to a simple graph.
simple_g = sub_graph.project(vertices={"paper": []}, edges={"cites": []})

ret1 = graphscope.k_core(simple_g, k=5)
ret2 = graphscope.triangles(simple_g)

# add the results as new columns to the citation graph
sub_graph = sub_graph.add_column(ret1, {"kcore": "r"})
sub_graph = sub_graph.add_column(ret2, {"tc": "r"})
```
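To illustrate what the two structural features capture, the following sketch computes the vertices of a k-core and a per-vertex triangle count on a tiny toy graph in plain Python. It is not GraphScope's implementation and makes no claim about the exact output format of `graphscope.k_core` or `graphscope.triangles`; it also assumes, for simplicity, that edges are treated as undirected. All function names are illustrative.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map: vertex -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def k_core_vertices(adj, k):
    """Repeatedly drop vertices with fewer than k remaining neighbours."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if len(adj[v] & alive) < k:
                alive.discard(v)
                changed = True
    return alive


def triangles_per_vertex(adj):
    """For each vertex, count the pairs of its neighbours that are adjacent."""
    return {
        v: sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
        for v, nbrs in adj.items()
    }


# Toy graph: one triangle (1, 2, 3) with a tail 3 - 4 - 5.
adj = build_adjacency([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])
print(k_core_vertices(adj, 2))    # {1, 2, 3}
print(triangles_per_vertex(adj))  # {1: 1, 2: 1, 3: 1, 4: 0, 5: 0}
```

Papers that sit in a dense core or participate in many triangles are embedded in tightly connected citation clusters, which is why these values are useful as extra input features.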

In addition, users can write their own algorithms in GraphScope.
Currently, GraphScope supports user-defined algorithms
written in the Pregel model and the PIE model.

### Graph neural networks (GNNs)

Graph neural networks (GNNs) combine the strengths of both graph analytics and machine learning. GNN algorithms can compress both structural and attribute information in a graph into low-dimensional embedding vectors on each node. These embeddings can be further fed into downstream machine learning tasks.

In our example, we train a GCN model to classify the nodes (papers) into 349 categories,
each of which represents a venue (e.g., pre-print or conference).
To achieve this, we first launch a learning engine and build a graph with the features
produced in the previous step.

```python

# define the features for learning
paper_features = [f"feat_{i}" for i in range(128)]

paper_features.append("kcore")
paper_features.append("tc")

# launch a learning engine.
lg = graphscope.graphlearn(sub_graph, nodes=[("paper", paper_features)],
                  edges=[("paper", "cites", "paper")],
                  gen_labels=[
                      ("train", "paper", 100, (0, 75)),
                      ("val", "paper", 100, (75, 85)),
                      ("test", "paper", 100, (85, 100))
                  ])
```
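The `gen_labels` tuples above divide the `paper` nodes into training, validation, and test sets. The sketch below shows one plausible reading of those tuples and is an assumption, not a statement of how the learning engine works internally: each tuple is taken to mean `(name, node_type, n_buckets, (start, end))`, and a node is assigned by the bucket `node_id % n_buckets`. The function `assign_split` is hypothetical.

```python
# ASSUMPTION: (name, node_type, n_buckets, (start, end)) selects the nodes whose
# bucket index falls in the half-open range [start, end). The bucketing rule
# (node_id % n_buckets) is illustrative and may differ from the real engine.
SPLITS = [
    ("train", "paper", 100, (0, 75)),
    ("val", "paper", 100, (75, 85)),
    ("test", "paper", 100, (85, 100)),
]


def assign_split(node_id, splits):
    """Return the name of the split whose bucket range contains this node."""
    for name, _node_type, n_buckets, (start, end) in splits:
        if start <= node_id % n_buckets < end:
            return name
    return None


counts = {}
for node_id in range(1000):
    name = assign_split(node_id, SPLITS)
    counts[name] = counts.get(name, 0) + 1

print(counts)  # {'train': 750, 'val': 100, 'test': 150}
```

Under this reading the three ranges cover all 100 buckets without overlap, giving a 75% / 10% / 15% split.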

Then we define the training process, and run it.

```python
# Note: TensorFlow is used here as the NN backend to train the GNN model,
# so please install TensorFlow first.
try:
    # https://www.tensorflow.org/guide/migrate
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()
except ImportError:
    import tensorflow as tf

import graphscope.learning
from graphscope.learning.examples import EgoGraphSAGE
from graphscope.learning.examples import EgoSAGESupervisedDataLoader
from graphscope.learning.examples.tf.trainer import LocalTrainer

# supervised GCN.
def train_gcn(graph, node_type, edge_type, class_num, features_num,
              hops_num=2, nbrs_num=[25, 10], epochs=2,
              hidden_dim=256, in_drop_rate=0.5, learning_rate=0.01,
):
    graphscope.learning.reset_default_tf_graph()

    dimensions = [features_num] + [hidden_dim] * (hops_num - 1) + [class_num]
    model = EgoGraphSAGE(dimensions, act_func=tf.nn.relu, dropout=in_drop_rate)

    # prepare train dataset
    train_data = EgoSAGESupervisedDataLoader(
        graph, graphscope.learning.Mask.TRAIN,
        node_type=node_type, edge_type=edge_type, nbrs_num=nbrs_num, hops_num=hops_num,
    )
    train_embedding = model.forward(train_data.src_ego)
    train_labels = train_data.src_ego.src.labels
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=train_labels, logits=train_embedding,
        )
    )
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

    # prepare test dataset
    test_data = EgoSAGESupervisedDataLoader(
        graph, graphscope.learning.Mask.TEST,
        node_type=node_type, edge_type=edge_type, nbrs_num=nbrs_num, hops_num=hops_num,
    )
    test_embedding = model.forward(test_data.src_ego)
    test_labels = test_data.src_ego.src.labels
    test_indices = tf.math.argmax(test_embedding, 1, output_type=tf.int32)
    test_acc = tf.div(
        tf.reduce_sum(tf.cast(tf.math.equal(test_indices, test_labels), tf.float32)),
        tf.cast(tf.shape(test_labels)[0], tf.float32),
    )

    # train and test
    trainer = LocalTrainer()
    trainer.train(train_data.iterator, loss, optimizer, epochs=epochs)
    trainer.test(test_data.iterator, test_acc)

train_gcn(lg, node_type="paper", edge_type="cites",
          class_num=349,  # output dimension
          features_num=130,  # input dimension, 128 + kcore + triangle count
)
```
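Two small pieces of arithmetic in `train_gcn` are worth spelling out: how the layer dimensions are derived from the hyperparameters, and what the test accuracy expression computes. The following plain-Python sketch reproduces both without TensorFlow; the helper names `layer_dimensions` and `accuracy` are illustrative only.

```python
def layer_dimensions(features_num, hidden_dim, hops_num, class_num):
    """Same expression as in train_gcn: input, (hops_num - 1) hidden layers, output."""
    return [features_num] + [hidden_dim] * (hops_num - 1) + [class_num]


def accuracy(logits, labels):
    """Fraction of rows whose arg-max index equals the label."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


# Values used in the call above: 130 input features, 349 classes, 2 hops.
print(layer_dimensions(130, 256, 2, 349))  # [130, 256, 349]
print(layer_dimensions(130, 256, 3, 349))  # [130, 256, 256, 349]

# Toy logits for three nodes and two classes: two of three predictions are right.
print(accuracy([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], [1, 0, 0]))
```

With the default `hops_num=2`, the model therefore has one hidden layer of width 256 between the 130-dimensional input and the 349-way output.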

A notebook with the entire process is available [here](https://colab.research.google.com/github/alibaba/GraphScope/blob/main/tutorials/1_node_classification_on_citation.ipynb); you may try it out yourself.


## Processing Large Graph on Kubernetes Cluster

GraphScope is designed for processing large graphs, which usually do not fit in the memory of a single machine.
With [Vineyard](https://github.com/v6d-io/v6d) as the distributed in-memory data manager, GraphScope supports running on a cluster managed by Kubernetes (k8s).

To continue this tutorial, please ensure that you have a k8s-managed cluster and know the credentials for the cluster
(e.g., the address of the k8s API server, usually stored in a `~/.kube/config` file).
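A quick way to confirm that cluster credentials are present is to look for the kubeconfig file. The sketch below uses only the standard library and assumes the usual Kubernetes convention that the `KUBECONFIG` environment variable, when set, lists one or more config paths and otherwise `~/.kube/config` is used. The helper name `kubeconfig_candidates` is hypothetical.

```python
import os
from pathlib import Path


def kubeconfig_candidates(environ=None):
    """Return the kubeconfig paths to check, honouring KUBECONFIG if it is set."""
    environ = os.environ if environ is None else environ
    raw = environ.get("KUBECONFIG", "")
    paths = [Path(p) for p in raw.split(os.pathsep) if p]
    return paths or [Path.home() / ".kube" / "config"]


for path in kubeconfig_candidates():
    print(path, "found" if path.is_file() else "missing")
```

This only checks that a file exists; it does not validate the credentials or contact the API server.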

Alternatively, you can set up a local k8s cluster for testing with [Kind](https://kind.sigs.k8s.io/).
To install and deploy Kind, refer to its [Quick Start](https://kind.sigs.k8s.io/docs/user/quick-start/).

If you did not install the `graphscope` package in the above step, you can install a subset of the whole package with client functions only.

```bash
pip3 install graphscope-client
```

Next, let's revisit the example by running on a cluster instead.

<div align="center">
    <img src="https://graphscope.io/docs/_images/how-it-works.png" width="600" alt="how-it-works" />
</div>

The figure shows the flow of execution in cluster mode. When users run code in the Python client, it will:

- *Step 1*. Create a session or workspace in GraphScope.
- *Step 2 - Step 5*. Load a graph, then query, analyze, and run learning tasks on it via the Python interface. These steps are the same as in local mode, so users can process huge graphs in a distributed setting just as they would analyze a small graph on a single machine. (Note that `graphscope.gremlin` and `graphscope.graphlearn` need to be changed to `sess.gremlin` and `sess.graphlearn`, respectively, where `sess` is the name of the `Session` instance the user created.)
- *Step 6*. Close the session.


### Creating a session

To use GraphScope in a distributed setting, we need to establish a session in a Python interpreter.

For convenience, we provide several demo datasets and an option, `with_dataset`, to mount them in the GraphScope cluster. The datasets will be mounted to `/dataset` in the pods. If you want to use your own data on a k8s cluster, please refer to [this](docs/deployment.rst).

```python
import graphscope

sess = graphscope.session(with_dataset=True)
```

On macOS, the session needs to be established with the `LoadBalancer` service type (the default is `NodePort`).

```python
import graphscope

sess = graphscope.session(with_dataset=True, k8s_service_type="LoadBalancer")
```

A session tries to launch a `coordinator`, which is the entry point for the back-end engines. The coordinator manages a cluster of resources (k8s pods) and the interactive, analytical, and learning engines that run on them. Each pod in the cluster hosts a vineyard instance that serves the distributed in-memory data.


### Loading a graph and processing computation tasks

Similar to the standalone mode, we can still use the functions to load a graph easily.

```python
from graphscope.dataset import load_ogbn_mag

# Note that we have mounted the demo datasets to /dataset.
# There are several datasets, including ogbn_mag_small.
# Users can attach to the engine container and explore the directory.
g = load_ogbn_mag(sess, "/dataset/ogbn_mag_small/")
```

Here, `g` is loaded in parallel via vineyard and stored in the vineyard instances of the cluster managed by the session.

Next, we can conduct graph queries with Gremlin, invoke various graph algorithms, or run graph-based neural network tasks as we did in the standalone mode.
We do not repeat the code here, but a `.ipynb` notebook processing the classification task on k8s is available on the [Playground](https://try.graphscope.app/).

### Closing the session

One additional step in the distributed setting is closing the session. We close the session after processing all graph tasks.

```python
sess.close()
```

This operation notifies the backend engines and vineyard
to safely unload graphs and their applications.
Then, the coordinator releases all the resources it acquired in the k8s cluster.
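Because closing the session is what releases cluster resources, it is worth making sure `close()` runs even when a task fails. The sketch below shows the general pattern with `contextlib.closing` from the standard library. `DemoSession` and `run_tasks` are hypothetical stand-ins, not GraphScope classes; the same pattern should apply to any object that exposes a `close()` method.

```python
from contextlib import closing


class DemoSession:
    """Hypothetical stand-in for a session object; not part of GraphScope."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def run_tasks(session):
    """Simulate a graph task that fails part-way through."""
    raise RuntimeError("simulated failure during graph processing")


session = DemoSession()
try:
    # closing() calls session.close() on exit, whether or not an error occurred.
    with closing(session):
        run_tasks(session)
except RuntimeError as exc:
    print(f"task failed: {exc}")

print(session.closed)  # True
```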

Please note that we have not hardened this release for production use and it lacks important security features such as authentication and encryption, and therefore **it is NOT recommended for production use (yet)!**

## Development

### Building on local

To build the GraphScope Python package and the engine binaries, you need to install some dependencies and build tools.

```bash
python3 gsctl.py install-deps dev

# Add the --cn argument to speed up the download if you are in China.
python3 gsctl.py install-deps dev --cn
```

Then you can build GraphScope with pre-configured `make` commands.

```bash
# build the whole GraphScope package, including the Python package and engine binaries.
sudo make install

# or build individual engine components
# make interactive
# make analytical
# make learning
```

### Building Docker images

GraphScope ships with a [Dockerfile](k8s/dockerfiles/graphscope-dev.Dockerfile) that can build Docker images for release. The images are built on a `builder` image with all dependencies installed, and the results are copied to
a `runtime-base` image. To build images with the latest version of GraphScope, go to the `k8s/internal` directory under the root directory and run this command.

```bash
# by default, the built image is tagged as graphscope/graphscope:SHORTSHA
# cd k8s
make graphscope
```

### Building client library

The GraphScope Python interface is separate from the engines image.
If you are developing the Python client and not modifying the protobuf files, the engines
image does not need to be rebuilt.

You may want to re-install the Python client locally.

```bash
make client
```

Note that the learning engine client has C/C++ extension modules, and setting up the build
environment is a bit tedious. By default, the locally built client library does not include
support for the learning engine. If you want to build the client library with the learning engine
enabled, please refer to [Build Python Wheels](https://graphscope.io/docs/developer_guide.html#build-python-wheels).

### Testing

To verify the correctness of the features you develop, your code changes should pass our tests.

You can run the whole test suite with the following command:

```bash
make test
```


## Documentation

Documentation can be generated using Sphinx. Users can build the documentation using:

```bash
# build the docs
make graphscope-docs

# open a local preview
open docs/_build/latest/html/index.html
```

The latest version of online documentation can be found at https://graphscope.io/docs


## License

GraphScope is released under [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). Please note that third-party libraries may not have the same license as GraphScope.


## Publications

- Wenfei Fan, Tao He, Longbin Lai, Xue Li, Yong Li, Zhao Li, Zhengping Qian, Chao Tian, Lei Wang, Jingbo Xu, Youyang Yao, Qiang Yin, Wenyuan Yu, Jingren Zhou, Diwen Zhu, Rong Zhu. [GraphScope: A Unified Engine For Big Graph Processing](http://vldb.org/pvldb/vol14/p2879-qian.pdf). The 47th International Conference on Very Large Data Bases (VLDB), industry, 2021.
- Jingbo Xu, Zhanning Bai, Wenfei Fan, Longbin Lai, Xue Li, Zhao Li, Zhengping Qian, Lei Wang, Yanyan Wang, Wenyuan Yu, Jingren Zhou. [GraphScope: A One-Stop Large Graph Processing System](http://vldb.org/pvldb/vol14/p2703-xu.pdf). The 47th International Conference on Very Large Data Bases (VLDB), demo, 2021.

If you use this software, please cite our paper using the following metadata:

```bibtex
@article{fan2021graphscope,
  title={GraphScope: a unified engine for big graph processing},
  author={Fan, Wenfei and He, Tao and Lai, Longbin and Li, Xue and Li, Yong and Li, Zhao and Qian, Zhengping and Tian, Chao and Wang, Lei and Xu, Jingbo and others},
  journal={Proceedings of the VLDB Endowment},
  volume={14},
  number={12},
  pages={2879--2892},
  year={2021},
  publisher={VLDB Endowment}
}
```

## Contributing

Any contributions you make are **greatly appreciated**!
- Join the [Slack channel](http://slack.graphscope.io) for discussion.
- Please report bugs by submitting a GitHub issue.
- Please submit contributions using pull requests.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/alibaba/GraphScope",
    "name": "graphscope",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "GraphScope, Graph Computations",
    "author": "GraphScope Team, Damo Academy",
    "author_email": "graphscope@alibaba-inc.com",
    "download_url": null,
    "platform": null,
    "description": "<h1 align=\"center\">\n    <img src=\"https://graphscope.io/assets/images/graphscope-logo.svg\" width=\"400\" alt=\"graphscope-logo\">\n</h1>\n<p align=\"center\">\n    A One-Stop Large-Scale Graph Computing System from Alibaba\n</p>\n\n[![GraphScope CI](https://github.com/alibaba/GraphScope/actions/workflows/local-ci.yml/badge.svg)](https://github.com/alibaba/GraphScope/actions/workflows/local-ci.yml)\n[![Coverage](https://codecov.io/gh/alibaba/GraphScope/branch/main/graph/badge.svg)](https://codecov.io/gh/alibaba/GraphScope)\n[![Playground](https://shields.io/badge/JupyterLab-Try%20GraphScope%20Now!-F37626?logo=jupyter)](https://try.graphscope.app)\n[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alibaba/GraphScope)\n[![Artifact HUB](https://img.shields.io/endpoint?url=https://artifacthub.io/badge/repository/graphscope)](https://artifacthub.io/packages/helm/graphscope/graphscope)\n[![Docs-en](https://shields.io/badge/Docs-English-blue?logo=Read%20The%20Docs)](https://graphscope.io/docs)\n[![FAQ-en](https://img.shields.io/badge/-FAQ-blue?logo=Read%20The%20Docs)](https://graphscope.io/docs/frequently_asked_questions.html)\n[![Docs-zh](https://shields.io/badge/Docs-%E4%B8%AD%E6%96%87-blue?logo=Read%20The%20Docs)](https://graphscope.io/docs/zh/)\n[![FAQ-zh](https://img.shields.io/badge/-FAQ%E4%B8%AD%E6%96%87-blue?logo=Read%20The%20Docs)](https://graphscope.io/docs/zh/frequently_asked_questions.html)\n[![README-zh](https://shields.io/badge/README-%E4%B8%AD%E6%96%87-blue)](README-zh.md)\n[![ACM DL](https://img.shields.io/badge/ACM%20DL-10.14778%2F3476311.3476369-blue)](https://dl.acm.org/doi/10.14778/3476311.3476369)\n\n\ud83c\udf89 See our ongoing [GraphScope Flex](https://github.com/alibaba/GraphScope/tree/main/flex): a LEGO-inspired, modular, and user-friendly GraphScope evolution. 
\ud83c\udf89\n\nGraphScope is a unified distributed graph computing platform that provides a one-stop environment for performing diverse graph operations on a cluster of computers through a user-friendly Python interface. GraphScope makes multi-staged processing of large-scale graph data on compute clusters simply by combining several important pieces of Alibaba technology: including [GRAPE](https://github.com/alibaba/libgrape-lite), [MaxGraph](interactive_engine/), and [Graph-Learn](https://github.com/alibaba/graph-learn) (GL) for analytics, interactive, and graph neural networks (GNN) computation, respectively, and the [Vineyard](https://github.com/v6d-io/v6d) store that offers efficient in-memory data transfers.\n\nVisit our website at [graphscope.io](https://graphscope.io) to learn more.\n\n## Latest News\n- [05/02/2024] \ud83c\udf89 GraphScope Flex [paper](https://arxiv.org/abs/2312.12107) was accepted by [SIGMOD 2024](https://2024.sigmod.org/) Industry Track. See you in \ud83c\udde8\ud83c\uddf1!\n- [19/12/2023] \ud83d\udcd1 A paper introducing GraphScope Flex released on [arXiv.org](https://arxiv.org/abs/2312.12107).\n- [20/07/2023] \ud83c\udfc6 GraphScope achieved record-breaking results on the [LDBC Social Network Benchmark Interactive workload](https://ldbcouncil.org/benchmarks/snb-interactive/), with a 2.45\u00d7 higher throughput on SF300 than the previous record holder! 
\ud83c\udfc6\n- [04/07/2023] \ud83d\ude80 GraphScope Flex tech preview released with [v0.23.0](https://github.com/alibaba/GraphScope/releases/tag/v0.23.0).\n  \n## Table of Contents\n\n- [Getting Started](#getting-started)\n  - [Installation for Standalone Mode](#installation-for-standalone-mode)\n- [Demo: Node Classification on Citation Network](#demo-node-classification-on-citation-network)\n  - [Loading a graph](#loading-a-graph)\n  - [Interactive query](#interactive-query)\n  - [Graph analytics](#graph-analytics)\n  - [Graph neural networks (GNNs)](#graph-neural-networks-gnns)\n- [Graph Processing on Kubernetes](#processing-large-graph-on-kubernetes-cluster)\n  - [Creating a session](#creating-a-session)\n  - [Loading graphs and graph computation](#loading-a-graph-and-processing-computation-tasks)\n  - [Closing the session](#closing-the-session)\n- [Development](#development)\n  - [Building from source](#building-on-local)\n  - [Building Docker images](#building-docker-images)\n  - [Building the client library](#building-client-library)\n  - [Testing](#testing)\n- [Documentation](#documentation)\n- [License](#license)\n- [Publications](#publications)\n- [Joining our Community!](#contributing)\n\n## Getting Started\n\nWe provide a [Playground](https://try.graphscope.app) with a managed JupyterLab. [Try GraphScope](https://try.graphscope.app) straight away in your browser!\n\nGraphScope supports running in standalone mode or on clusters managed by [Kubernetes](https://kubernetes.io/) within containers. For quickly getting started,\nlet's begin with the standalone mode.\n\n\n### Installation for Standalone Mode\n\nGraphScope pre-compiled package is distributed as a python package and can be easily installed with `pip`.\n\n```bash\npip3 install graphscope\n```\n\nNote that `graphscope` requires `Python` >= `3.8` and `pip` >= `19.3`. 
The package is built for and tested on the most popular Linux (Ubuntu 20.04+ / CentOS 7+) and macOS 12+ (Intel/Apple silicon) distributions. For Windows users, you may want to install Ubuntu on WSL2 to use this package.\n\nNext, we will walk you through a concrete example to illustrate how GraphScope can be used by data scientists to effectively analyze large graphs.\n\n\n## Demo: Node Classification on Citation Network\n\n[`ogbn-mag`](https://ogb.stanford.edu/docs/nodeprop/#ogbn-mag) is a heterogeneous network composed of a subset of the Microsoft Academic Graph. It contains 4 types of entities(i.e., papers, authors, institutions, and fields of study), as well as four types of directed relations connecting two entities.\n\nGiven the heterogeneous `ogbn-mag` data, the task is to predict the class of each paper. Node classification can identify papers in multiple venues, which represent different groups of scientific work on different topics. We apply both the attribute and structural information to classify papers. In the graph, each paper node contains a 128-dimensional word2vec vector representing its content, which is obtained by averaging the embeddings of words in its title and abstract. The embeddings of individual words are pre-trained. The structural information is computed on-the-fly.\n\n### Loading a graph\n\nGraphScope models graph data as property graph, in which the edges/vertices are labeled and have many properties.\nTaking `ogbn-mag` as example, the figure below shows the model of the property graph.\n\n<div align=\"center\">\n    <img src=\"https://graphscope.io/docs/_images/sample_pg.png\" width=\"600\" alt=\"sample-of-property-graph\" />\n</div>\n\nThis graph has four kinds of vertices, labeled as `paper`, `author`, `institution` and `field_of_study`. There are four kinds of edges connecting them, each kind of edges has a label and specifies the vertex labels for its two ends. For example, `cites` edges connect two vertices labeled `paper`. 
Another example is `writes`, it requires the source vertex is labeled `author` and the destination is a `paper` vertex. All the vertices and edges may have properties. e.g., `paper`  vertices have properties like features, publish year, subject label, etc.\n\nTo load this graph to GraphScope with our retrieval module, please use these code:\n\n```python\nimport graphscope\nfrom graphscope.dataset import load_ogbn_mag\n\ng = load_ogbn_mag()\n```\n\nWe provide a set of functions to load graph datasets from [ogb](https://ogb.stanford.edu/docs/dataset_overview/) and [snap](https://snap.stanford.edu/data/index.html) for convenience. Please find all the available graphs [here](https://github.com/alibaba/GraphScope/tree/docs/python/graphscope/dataset). If you want to use your own graph data, please refer [this doc](https://graphscope.io/docs/loading_graph.html) to load vertices and edges by labels.\n\n\n### Interactive query\n\nInteractive queries allow users to directly explore, examine, and present graph data in an *exploratory* manner in order to locate specific or in-depth information in time.\nGraphScope adopts a high-level language called [Gremlin](http://tinkerpop.apache.org/) for graph traversal, and provides [efficient execution](interactive_engine/benchmark/) at scale.\n\nIn this example, we use graph traversal to count the number of papers two given authors have co-authored. To simplify the query, we assume the authors can be uniquely identified by ID `2` and `4307`, respectively.\n\n```python\n# get the endpoint for submitting Gremlin queries on graph g.\ninteractive = graphscope.gremlin(g)\n\n# count the number of papers two authors (with id 2 and 4307) have co-authored\npapers = interactive.execute(\"g.V().has('author', 'id', 2).out('writes').where(__.in('writes').has('id', 4307)).count()\").one()\n```\n\n### Graph analytics\n\nGraph analytics is widely used in real world. 
Many algorithms, such as community detection, paths and connectivity, and centrality, have proven very useful in various businesses.\nGraphScope ships with a set of [built-in algorithms](https://graphscope.io/docs/analytics_engine.html#built-in-algorithms) that enable users to easily analyze their graph data.\n\nContinuing our example, below we first derive a subgraph by extracting publications within a specific time range from the entire graph (using Gremlin!), and then run k-core decomposition and triangle counting to generate the structural features of each paper node.\n\nPlease note that many algorithms may only work on *homogeneous* graphs, and therefore, to evaluate these algorithms over a property graph, we need to project it into a simple graph first.\n\n```python\n# extract a subgraph of publications within a time range\nsub_graph = interactive.subgraph(\"g.V().has('year', gte(2014).and(lte(2020))).outE('cites')\")\n\n# project the property graph to a simple graph.\nsimple_g = sub_graph.project(vertices={\"paper\": []}, edges={\"cites\": []})\n\nret1 = graphscope.k_core(simple_g, k=5)\nret2 = graphscope.triangles(simple_g)\n\n# add the results as new columns to the citation graph\nsub_graph = sub_graph.add_column(ret1, {\"kcore\": \"r\"})\nsub_graph = sub_graph.add_column(ret2, {\"tc\": \"r\"})\n```\n\nIn addition, users can write their own algorithms in GraphScope.\nCurrently, GraphScope supports writing custom algorithms\nin the Pregel model and the PIE model.\n\n### Graph neural networks (GNNs)\n\nGraph neural networks (GNNs) combine the strengths of both graph analytics and machine learning. GNN algorithms can compress both structural and attribute information in a graph into low-dimensional embedding vectors on each node. These embeddings can be further fed into downstream machine learning tasks.\n\nIn our example, we train a GCN model to classify the nodes (papers) into 349 categories,\neach of which represents a venue (e.g. 
pre-print and conference).\nTo achieve this, first we launch a learning engine and build a graph with features\nfollowing the last step.\n\n```python\n\n# define the features for learning\npaper_features = [f\"feat_{i}\" for i in range(128)]\n\npaper_features.append(\"kcore\")\npaper_features.append(\"tc\")\n\n# launch a learning engine.\nlg = graphscope.graphlearn(sub_graph, nodes=[(\"paper\", paper_features)],\n                  edges=[(\"paper\", \"cites\", \"paper\")],\n                  gen_labels=[\n                      (\"train\", \"paper\", 100, (0, 75)),\n                      (\"val\", \"paper\", 100, (75, 85)),\n                      (\"test\", \"paper\", 100, (85, 100))\n                  ])\n```\n\nThen we define the training process, and run it.\n\n```python\n# Note: Here we use tensorflow as NN backend to train GNN model. so please\n# install tensorflow.\ntry:\n    # https://www.tensorflow.org/guide/migrate\n    import tensorflow.compat.v1 as tf\n    tf.disable_v2_behavior()\nexcept ImportError:\n    import tensorflow as tf\n\nimport graphscope.learning\nfrom graphscope.learning.examples import EgoGraphSAGE\nfrom graphscope.learning.examples import EgoSAGESupervisedDataLoader\nfrom graphscope.learning.examples.tf.trainer import LocalTrainer\n\n# supervised GCN.\ndef train_gcn(graph, node_type, edge_type, class_num, features_num,\n              hops_num=2, nbrs_num=[25, 10], epochs=2,\n              hidden_dim=256, in_drop_rate=0.5, learning_rate=0.01,\n):\n    graphscope.learning.reset_default_tf_graph()\n\n    dimensions = [features_num] + [hidden_dim] * (hops_num - 1) + [class_num]\n    model = EgoGraphSAGE(dimensions, act_func=tf.nn.relu, dropout=in_drop_rate)\n\n    # prepare train dataset\n    train_data = EgoSAGESupervisedDataLoader(\n        graph, graphscope.learning.Mask.TRAIN,\n        node_type=node_type, edge_type=edge_type, nbrs_num=nbrs_num, hops_num=hops_num,\n    )\n    train_embedding = model.forward(train_data.src_ego)\n    
train_labels = train_data.src_ego.src.labels\n    loss = tf.reduce_mean(\n        tf.nn.sparse_softmax_cross_entropy_with_logits(\n            labels=train_labels, logits=train_embedding,\n        )\n    )\n    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)\n\n    # prepare test dataset\n    test_data = EgoSAGESupervisedDataLoader(\n        graph, graphscope.learning.Mask.TEST,\n        node_type=node_type, edge_type=edge_type, nbrs_num=nbrs_num, hops_num=hops_num,\n    )\n    test_embedding = model.forward(test_data.src_ego)\n    test_labels = test_data.src_ego.src.labels\n    test_indices = tf.math.argmax(test_embedding, 1, output_type=tf.int32)\n    test_acc = tf.div(\n        tf.reduce_sum(tf.cast(tf.math.equal(test_indices, test_labels), tf.float32)),\n        tf.cast(tf.shape(test_labels)[0], tf.float32),\n    )\n\n    # train and test\n    trainer = LocalTrainer()\n    trainer.train(train_data.iterator, loss, optimizer, epochs=epochs)\n    trainer.test(test_data.iterator, test_acc)\n\ntrain_gcn(lg, node_type=\"paper\", edge_type=\"cites\",\n          class_num=349,  # output dimension\n          features_num=130,  # input dimension, 128 + kcore + triangle count\n)\n```\n\nA notebook with the entire process is available [here](https://colab.research.google.com/github/alibaba/GraphScope/blob/main/tutorials/1_node_classification_on_citation.ipynb); you may try it out yourself.\n\n\n## Processing Large Graph on Kubernetes Cluster\n\nGraphScope is designed for processing large graphs, which are usually hard to fit in the memory of a single machine.\nWith [Vineyard](https://github.com/v6d-io/v6d) as the distributed in-memory data manager, GraphScope supports running on a cluster managed by Kubernetes (k8s).\n\nTo continue this tutorial, please ensure that you have a k8s-managed cluster and know the credentials for the cluster\n(e.g., the address of the k8s API server, usually stored in a `~/.kube/config` file).\n\nAlternatively, you can set up a 
local k8s cluster for testing with [Kind](https://kind.sigs.k8s.io/).\nYou can install and deploy Kind by following its [Quick Start](https://kind.sigs.k8s.io/docs/user/quick-start/).\n\nIf you did not install the `graphscope` package in the above step, you can install a subset of the whole package with client functions only.\n\n```bash\npip3 install graphscope-client\n```\n\nNext, let's revisit the example by running on a cluster instead.\n\n<div align=\"center\">\n    <img src=\"https://graphscope.io/docs/_images/how-it-works.png\" width=\"600\" alt=\"how-it-works\" />\n</div>\n\nThe figure shows the flow of execution in the cluster mode. When users run code in the Python client, it will:\n\n- *Step 1*. Create a session or workspace in GraphScope.\n- *Step 2 - Step 5*. Load a graph, then query, analyze, and run learning tasks on this graph via the Python interface. These steps are the same as in local mode, so users can process huge graphs in a distributed setting just like analyzing a small graph on a single machine. (Note that `graphscope.gremlin` and `graphscope.graphlearn` need to be changed to `sess.gremlin` and `sess.graphlearn`, respectively, where `sess` is the name of the `Session` instance the user created.)\n- *Step 6*. Close the session.\n\n\n### Creating a session\n\nTo use GraphScope in a distributed setting, we need to establish a session in a Python interpreter.\n\nFor convenience, we provide several demo datasets, and an option `with_dataset` to mount the dataset in the graphscope cluster. The datasets will be mounted to `/dataset` in the pods. 
If you want to use your own data on a k8s cluster, please refer to [this](docs/deployment.rst).\n\n```python\nimport graphscope\n\nsess = graphscope.session(with_dataset=True)\n```\n\nOn macOS, the session needs to be established with the `LoadBalancer` service type (the default is `NodePort`).\n\n```python\nimport graphscope\n\nsess = graphscope.session(with_dataset=True, k8s_service_type=\"LoadBalancer\")\n```\n\nA session tries to launch a `coordinator`, which is the entry point for the back-end engines. The coordinator manages a cluster of resources (k8s pods), and the interactive/analytical/learning engines run on them. For each pod in the cluster, there is a vineyard instance serving distributed data in memory.\n\n\n### Loading a graph and processing computation tasks\n\nSimilar to the standalone mode, we can still use the functions to load a graph easily.\n\n```python\nfrom graphscope.dataset import load_ogbn_mag\n\n# Note that the demo datasets are mounted at /dataset.\n# Several datasets are available, including ogbn_mag_small;\n# users can attach to the engine container and explore the directory.\ng = load_ogbn_mag(sess, \"/dataset/ogbn_mag_small/\")\n```\n\nHere, `g` is loaded in parallel via vineyard and stored in the vineyard instances of the cluster managed by the session.\n\nNext, we can conduct graph queries with Gremlin, invoke various graph algorithms, or run graph-based neural network tasks as we did in the standalone mode.\nWe do not repeat code here, but a `.ipynb` processing the classification task on k8s is available on the [Playground](https://try.graphscope.app/).\n\n### Closing the session\n\nThe one additional step in distributed mode is closing the session. 
We close the session after processing all graph tasks.\n\n```python\nsess.close()\n```\n\nThis operation will notify the backend engines and vineyard\nto safely unload graphs and their applications.\nThen, the coordinator will release all the allocated resources in the k8s cluster.\n\nPlease note that we have not hardened this release for production use and it lacks important security features such as authentication and encryption, and therefore **it is NOT recommended for production use (yet)!**\n\n## Development\n\n### Building on local\n\nTo build the graphscope Python package and the engine binaries, you need to install some dependencies and build tools.\n\n```bash\npython3 gsctl.py install-deps dev\n\n# Add the --cn argument to speed up the download if you are in China.\npython3 gsctl.py install-deps dev --cn\n```\n\nThen you can build GraphScope with pre-configured `make` commands.\n\n```bash\n# build the whole graphscope package, including the Python package and engine binaries.\nsudo make install\n\n# or make the engine components\n# make interactive\n# make analytical\n# make learning\n```\n\n### Building Docker images\n\nGraphScope ships with a [Dockerfile](k8s/dockerfiles/graphscope-dev.Dockerfile) that can build Docker images for release. The images are built on a `builder` image with all dependencies installed, then copied to\na `runtime-base` image. 
To build images with the latest version of GraphScope, go to the `k8s/internal` directory under the root directory and run this command.\n\n```bash\n# by default, the built image is tagged as graphscope/graphscope:SHORTSHA\n# cd k8s\nmake graphscope\n```\n\n### Building client library\n\nThe GraphScope Python interface is separate from the engines image.\nIf you are developing the Python client and not modifying the protobuf files, the engines\nimage does not need to be rebuilt.\n\nYou may want to re-install the Python client locally.\n\n```bash\nmake client\n```\n\nNote that the learning engine client has C/C++ extension modules, and setting up the build\nenvironment is a bit tedious. By default, the locally built client library doesn't include\nsupport for the learning engine. If you want to build the client library with the learning engine\nenabled, please refer to [Build Python Wheels](https://graphscope.io/docs/developer_guide.html#build-python-wheels).\n\n### Testing\n\nTo verify the correctness of your developed features, your code changes should pass our tests.\n\nYou may run the whole test suite with the command:\n\n```bash\nmake test\n```\n\n\n## Documentation\n\nDocumentation can be generated using Sphinx. Users can build the documentation using:\n\n```bash\n# build the docs\nmake graphscope-docs\n\n# open a local preview\nopen docs/_build/latest/html/index.html\n```\n\nThe latest version of the online documentation can be found at https://graphscope.io/docs\n\n\n## License\n\nGraphScope is released under [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). Please note that third-party libraries may not have the same license as GraphScope.\n\n\n## Publications\n\n- Wenfei Fan, Tao He, Longbin Lai, Xue Li, Yong Li, Zhao Li, Zhengping Qian, Chao Tian, Lei Wang, Jingbo Xu, Youyang Yao, Qiang Yin, Wenyuan Yu, Jingren Zhou, Diwen Zhu, Rong Zhu. [GraphScope: A Unified Engine For Big Graph Processing](http://vldb.org/pvldb/vol14/p2879-qian.pdf). 
The 47th International Conference on Very Large Data Bases (VLDB), industry, 2021.\n- Jingbo Xu, Zhanning Bai, Wenfei Fan, Longbin Lai, Xue Li, Zhao Li, Zhengping Qian, Lei Wang, Yanyan Wang, Wenyuan Yu, Jingren Zhou. [GraphScope: A One-Stop Large Graph Processing System](http://vldb.org/pvldb/vol14/p2703-xu.pdf). The 47th International Conference on Very Large Data Bases (VLDB), demo, 2021.\n\nIf you use this software, please cite our paper using the following metadata:\n\n```bibtex\n@article{fan2021graphscope,\n  title={GraphScope: a unified engine for big graph processing},\n  author={Fan, Wenfei and He, Tao and Lai, Longbin and Li, Xue and Li, Yong and Li, Zhao and Qian, Zhengping and Tian, Chao and Wang, Lei and Xu, Jingbo and others},\n  journal={Proceedings of the VLDB Endowment},\n  volume={14},\n  number={12},\n  pages={2879--2892},\n  year={2021},\n  publisher={VLDB Endowment}\n}\n```\n\n## Contributing\n\nAny contributions you make are **greatly appreciated**!\n- Join the [Slack channel](http://slack.graphscope.io) for discussion.\n- Please report bugs by submitting a GitHub issue.\n- Please submit contributions using pull requests.\n",
    "bugtrack_url": null,
    "license": "Apache License 2.0",
    "summary": null,
    "version": "0.27.0",
    "project_urls": {
        "Homepage": "https://github.com/alibaba/GraphScope"
    },
    "split_keywords": [
        "graphscope",
        " graph computations"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "39d646f861cb52b69aca958ed937e797f0121ae0f5415c2446d740be7b463eb5",
                "md5": "e6337a5079b08849f0fa3c70d960605e",
                "sha256": "4166195dd247f92e78e22d436070839dcb860a2036231849e36b1e908ed916c0"
            },
            "downloads": -1,
            "filename": "graphscope-0.27.0-py2.py3-none-macosx_12_0_x86_64.whl",
            "has_sig": false,
            "md5_digest": "e6337a5079b08849f0fa3c70d960605e",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": null,
            "size": 211334275,
            "upload_time": "2024-03-29T15:16:25",
            "upload_time_iso_8601": "2024-03-29T15:16:25.966685Z",
            "url": "https://files.pythonhosted.org/packages/39/d6/46f861cb52b69aca958ed937e797f0121ae0f5415c2446d740be7b463eb5/graphscope-0.27.0-py2.py3-none-macosx_12_0_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9aeebade668096acf44b1045b28c457c114c9ab304c6d12c0b8ced439929c73a",
                "md5": "de0facf52162347ca8aa2b969e25f43c",
                "sha256": "bb5883dce060f2f78ab1789ca861e26677d8967c4488c3447ace594ff11b956c"
            },
            "downloads": -1,
            "filename": "graphscope-0.27.0-py2.py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "de0facf52162347ca8aa2b969e25f43c",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": null,
            "size": 195829104,
            "upload_time": "2024-03-29T14:22:29",
            "upload_time_iso_8601": "2024-03-29T14:22:29.549379Z",
            "url": "https://files.pythonhosted.org/packages/9a/ee/bade668096acf44b1045b28c457c114c9ab304c6d12c0b8ced439929c73a/graphscope-0.27.0-py2.py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-29 15:16:25",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "alibaba",
    "github_project": "GraphScope",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "graphscope"
}
        