findopendata

* Name: findopendata
* Version: 1.0.5
* Home page: https://github.com/findopendata/findopendata
* Summary: A search engine for Open Data.
* Upload time: 2023-02-08 08:10:52
* Author: Eric Zhu
* Requires Python: >=3.6
* Keywords: open-data, search-engine
* Requirements: en_core_web_sm

# Find Open Data

[![Build Status](https://travis-ci.org/findopendata/findopendata.svg?branch=master)](https://travis-ci.org/findopendata/findopendata)

![Screenshot](screencapture.gif)

Table of Contents:
1. [Introduction](#introduction)
2. [System Overview](#system-overview)
3. [Development Guide](#development-guide)
4. [Cloud Storage Systems](#cloud-storage-systems)
5. [Crawler Guide](#crawler-guide)

## Introduction

This is the source code repository for [findopendata.com](https://findopendata.com).
The project goal is to make a search engine for Open Data with rich 
features beyond simple keyword search. The current search methods are:

* Keyword search based on metadata
* Similar dataset search based on metadata similarity
* Joinable table search based on content (i.e., data values) similarity using an LSH index

Next steps:

 * Unionable/similar table search based on content similarity
 * Time and location-based search based on extracted timestamps and geotags
 * Dataset versioning
 * API for external data science tools (e.g., Jupyter Notebook, Plot.ly)

**This is a work in progress.**


## System Overview

The Find Open Data system has the following components:

1. **Frontend**: a React app, located in `frontend`.
2. **API Server**: a Flask web server, located in `apiserver`.
3. **LSH Server**: a Go web server, located in `lshserver`.
4. **Crawler**: a set of [Celery](https://docs.celeryproject.org/en/latest/userguide/tasks.html) tasks, located in `findopendata`. 

The Frontend, the API Server, and the LSH Server can be 
deployed to 
[Google App Engine](https://cloud.google.com/appengine/docs/).

We also use two external storage systems for persistence:

1. A PostgreSQL database for storing the dataset registry, metadata, and sketches for content-based search.
2. A cloud-based storage system for storing dataset files, currently supporting Google Cloud Storage and Azure Blob Storage. Local file system storage is also available.

![System Overview](system_overview.png)

## Development Guide

To develop locally, you need the following:

* PostgreSQL 9.6 or above
* RabbitMQ

#### 1. Install PostgreSQL

[PostgreSQL](https://www.postgresql.org/download/) 
(version 9.6 or above) is used by the crawler to register and save the
summaries of crawled datasets. It is also used by the API Server as the 
database backend.
If you are using Cloud SQL Postgres, you need to download 
[Cloud SQL Proxy](https://cloud.google.com/sql/docs/postgres/connect-admin-proxy#install)
and make it executable.

Once the PostgreSQL server is running, create a database and
use the SQL scripts in `sql` to create tables:
```
psql -f sql/create_crawler_tables.sql
psql -f sql/create_metadata_tables.sql
psql -f sql/create_sketch_tables.sql
```
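Note that `psql` connects to your default database unless you name one. The following is a minimal Python sketch that builds the same three commands for an explicit database. The database name `findopendata` and the helper names are assumptions made for illustration; `-d` and `-f` are standard `psql` flags.

```python
import subprocess

# The three schema scripts named in this guide, in the order listed.
SQL_SCRIPTS = [
    "sql/create_crawler_tables.sql",
    "sql/create_metadata_tables.sql",
    "sql/create_sketch_tables.sql",
]


def build_psql_commands(dbname):
    """Return one psql argv list per schema script for the given database."""
    return [["psql", "-d", dbname, "-f", script] for script in SQL_SCRIPTS]


def create_tables(dbname):
    """Run each script in order, stopping at the first failure."""
    for command in build_psql_commands(dbname):
        subprocess.run(command, check=True)


# "findopendata" is a hypothetical database name; substitute your own.
for command in build_psql_commands("findopendata"):
    print(" ".join(command))
```

Calling `create_tables("findopendata")` from the repository root would then run the scripts against that database.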

#### 2. Install RabbitMQ

[RabbitMQ](https://www.rabbitmq.com/download.html) 
is required to manage and queue crawl tasks.
On Mac OS X you can [install it using Homebrew](https://www.rabbitmq.com/install-homebrew.html).

Start the RabbitMQ server once the installation finishes.

#### 3. Python Environment

It is recommended to use virtualenv for Python development and dependencies:
```
virtualenv -p python3 .venv
source .venv/bin/activate # .venv\Scripts\activate on Windows
```

`python-snappy` requires `libsnappy`. On Ubuntu you can 
install it with `sudo apt-get install libsnappy-dev`.
On Mac OS X use `brew install snappy`.
On Windows, instead of the `python-snappy` binary on PyPI, use the 
unofficial binary maintained by UC Irvine 
([download here](https://www.lfd.uci.edu/~gohlke/pythonlibs/)),
and install it directly, for example (Python 3.7, amd64):
```
pip install python_snappy-0.5.4-cp37-cp37m-win_amd64.whl
```

Finally, install this package and other dependencies:
```
pip install -e .
```

#### 4. Configuration File

Create a `configs.yaml` by copying `configs-example.yaml`, then complete the
fields related to PostgreSQL and storage.

If you plan to store all datasets on your local file system,
you can skip the `gcp` and `azure` sections and only complete 
the `local` section, and make sure the `storage.provider` is 
set to `local`.

For cloud-based storage systems, see 
[Cloud Storage Systems](#cloud-storage-systems).
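A quick sanity check on the `storage.provider` setting can catch typos before the crawler starts. The sketch below assumes `configs.yaml` has already been parsed into a nested dict in which `storage.provider` means a `provider` key under `storage`. That layout, and the helper name, are assumptions; check `configs-example.yaml` for the actual structure.

```python
# Storage providers named in this guide.
SUPPORTED_PROVIDERS = ("local", "gcp", "azure")


def storage_provider(config):
    """Return the configured storage provider, or raise ValueError.

    `config` is the parsed content of configs.yaml as a nested dict
    (assumed layout: config["storage"]["provider"]).
    """
    provider = config.get("storage", {}).get("provider")
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"storage.provider must be one of {SUPPORTED_PROVIDERS}, got {provider!r}"
        )
    return provider


# Hypothetical parsed config; only the storage.provider key comes from this guide.
example = {"storage": {"provider": "local"}}
print(storage_provider(example))
```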

## Cloud Storage Systems

Currently we support using 
[Google Cloud Storage](https://cloud.google.com/storage/) and 
[Azure Blob Storage](https://azure.microsoft.com/en-us/services/storage/blobs/) 
as the dataset storage system.

To use Google Cloud Storage, you need:
* A Google Cloud project with Cloud Storage enabled, and a bucket created.
* A Google Cloud service account key file (JSON formatted) with read and write access to the Cloud Storage bucket.
* Set `storage.provider` to `gcp` in `configs.yaml`.

To use Azure Blob Storage, you need:
* An Azure storage account enabled, and a blob storage container created.
* A connection string to access the storage account.
* Set `storage.provider` to `azure` in `configs.yaml`.

## Crawler Guide

The crawler has a set of [Celery](http://www.celeryproject.org/) tasks that 
run in parallel.
It uses the RabbitMQ server to manage and queue the tasks.

### Setup Crawler

#### Data Sources (CKAN and Socrata APIs)

The crawler uses PostgreSQL to maintain all data sources.
CKAN sources are maintained in the table `findopendata.ckan_apis`.
Socrata Discovery APIs are maintained in the table 
`findopendata.socrata_discovery_apis`.
The SQL script `sql/create_crawler_tables.sql` has already created some 
initial sources for you.

To show the CKAN APIs currently available to the crawler and whether they
are enabled:
```sql
SELECT * FROM findopendata.ckan_apis;
```

To add a new CKAN API and enable it:
```sql
INSERT INTO findopendata.ckan_apis (endpoint, name, region, enabled) VALUES
('catalog.data.gov', 'US Open Data', 'United States', true);
```
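When adding sources from a script rather than by hand, bound parameters avoid quoting mistakes. The sketch below runs the same `INSERT` and a follow-up query against an in-memory SQLite database standing in for PostgreSQL, so it runs without a server. The column types and helper names are assumptions; only the table and column names come from the statements above. With a PostgreSQL driver such as psycopg2, the placeholder would be `%s` rather than `?`.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here; attaching a second
# in-memory database named "findopendata" lets the schema-qualified
# table name from this guide work unchanged.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS findopendata")
# Column types are assumptions; the real schema is in sql/create_crawler_tables.sql.
conn.execute(
    "CREATE TABLE findopendata.ckan_apis ("
    "endpoint TEXT, name TEXT, region TEXT, enabled BOOLEAN)"
)


def add_ckan_api(conn, endpoint, name, region, enabled=True):
    """Insert one CKAN source using bound parameters."""
    conn.execute(
        "INSERT INTO findopendata.ckan_apis (endpoint, name, region, enabled) "
        "VALUES (?, ?, ?, ?)",
        (endpoint, name, region, enabled),
    )


def enabled_endpoints(conn):
    """Return the endpoints of all enabled CKAN sources, sorted."""
    rows = conn.execute(
        "SELECT endpoint FROM findopendata.ckan_apis WHERE enabled ORDER BY endpoint"
    )
    return [row[0] for row in rows]


add_ckan_api(conn, "catalog.data.gov", "US Open Data", "United States")
print(enabled_endpoints(conn))
```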

#### Socrata App Tokens

Add your [Socrata app tokens](https://dev.socrata.com/docs/app-tokens.html) 
to the table `findopendata.socrata_app_tokens`.
The app tokens are required for harvesting datasets from Socrata APIs.

For example:
```sql
INSERT INTO findopendata.socrata_app_tokens (token) VALUES ('<your app token>');
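For context on how such a token is typically used: Socrata APIs accept an app token in the `X-App-Token` request header. The sketch below only builds a request and does not send it. The URL, the token, and the helper name are placeholders, and how the crawler itself attaches the token is not shown in this guide.

```python
import urllib.request


def socrata_request(url, app_token):
    """Build (but do not send) a GET request carrying a Socrata app token."""
    return urllib.request.Request(url, headers={"X-App-Token": app_token})


# Placeholder URL and token, for illustration only.
request = socrata_request("https://example.com/api/catalog/v1", "<your app token>")
print(request.full_url)
```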
```

### Run Crawler

[Celery workers](https://docs.celeryproject.org/en/latest/userguide/workers.html) 
are processes that fetch crawler tasks from RabbitMQ and execute them.
The worker processes must be started before starting any tasks.

For example:
```
celery -A findopendata worker -l info -Ofair
```

On Windows there are some issues with using the prefork process pool.
Use `gevent` instead:
```
celery -A findopendata worker -l info -Ofair -P gevent
```
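If you launch workers from a script shared across platforms, the pool choice can be made automatically. This is a minimal sketch that reproduces the two commands above; the helper name `worker_command` is hypothetical.

```python
import platform


def worker_command(system=None):
    """Return the Celery worker argv for the given OS name.

    `system` defaults to platform.system(), e.g. "Windows", "Linux", "Darwin".
    """
    if system is None:
        system = platform.system()
    command = ["celery", "-A", "findopendata", "worker", "-l", "info", "-Ofair"]
    if system == "Windows":
        # The prefork pool has issues on Windows, so use gevent there.
        command += ["-P", "gevent"]
    return command


print(" ".join(worker_command()))
```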

#### Harvest Datasets

Run `harvest_datasets.py` to start data harvesting tasks that download 
datasets from various data sources. Downloaded datasets will be stored in
a Google Cloud Storage bucket (set in `configs.yaml`), and registered in 
the Postgres tables 
`findopendata.ckan_packages` and `findopendata.socrata_resources`.

#### Generate Metadata

Run `generate_metadata.py` to start metadata generation tasks for 
downloaded and registered datasets in 
`findopendata.ckan_packages` and `findopendata.socrata_resources`
tables.

It generates metadata by extracting titles, descriptions, etc., and 
annotates them with entities for enrichment.
The metadata is stored in the table `findopendata.packages`, which is 
also used by the API server to serve the frontend.

#### Sketch Dataset Content

Run `sketch_dataset_content.py` to start tasks for creating 
sketches (e.g., 
[MinHash](https://github.com/ekzhu/datasketch),
samples, data types, etc.) of dataset
content (i.e., data values, columns, and records).
The sketches will be used for content-based search such as
finding joinable tables.
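To illustrate the idea behind a MinHash sketch, here is a self-contained Python example that estimates the Jaccard similarity of two columns' value sets. It is not the project's implementation (the link above points to the `datasketch` library). The hash construction, the number of hash functions, and the sample columns are all assumptions chosen for illustration.

```python
import hashlib


def _hash(value, seed):
    """Deterministic 64-bit hash of a string value under a given seed."""
    digest = hashlib.blake2b(
        value.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(values, num_perm=128):
    """Return the MinHash signature (one minimum per seeded hash) of a set."""
    distinct = set(values)
    if not distinct:
        raise ValueError("cannot sketch an empty column")
    return [min(_hash(v, seed) for v in distinct) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical column values from two tables.
cities_a = ["Toronto", "Ottawa", "Montreal", "Vancouver", "Calgary"]
cities_b = ["Toronto", "Ottawa", "Montreal", "Halifax", "Winnipeg"]
sig_a = minhash_signature(cities_a)
sig_b = minhash_signature(cities_b)
print(f"estimated Jaccard: {estimate_jaccard(sig_a, sig_b):.2f}")
```

The exact Jaccard similarity of these two columns is 3/7 (three shared values out of seven distinct ones); the estimate tends toward it as `num_perm` grows. Because each signature has a fixed length, columns can be compared without reading their full contents again.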
