# Semantic Scholar Dataset API Wrapper
A Python wrapper for the Semantic Scholar Dataset API that provides easy access to academic papers, citations, and related data.
## Description
This library provides a simple interface to interact with the Semantic Scholar Dataset API, allowing you to:
- Access various academic datasets (papers, citations, authors, etc.)
- Download dataset releases
- Get diffs between releases
- Manage large dataset downloads efficiently
## Installation
```bash
pip install semanticscholar-datasetapi
```
## Requirements
- Python 3.7+
- requests
## Basic Usage
```python
from semanticscholar_datasetapi import SemanticScholarDataset
import os
# Initialize the client with your API key
api_key = os.getenv("SEMANTIC_SCHOLAR_API_KEY")
client = SemanticScholarDataset(api_key=api_key)
# List available datasets
datasets = client.get_available_datasets()
print(datasets)
# List all available releases
releases = client.get_available_releases()
print(releases)
# Download latest release of a specific dataset
client.download_latest_release(datasetname="papers", save_dir="downloads")
# Download diffs between two releases
client.download_diffs(
    start_release_id="2024-12-31",
    end_release_id="latest",
    datasetname="papers",
    save_dir="diffs",
)
```
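Once a download finishes, the files in `save_dir` can be inspected with the standard library. The sketch below assumes each downloaded file is gzip-compressed text with one JSON object per line; the `.json.gz` suffix suggests this, but this README does not state it. `iter_records` and `iter_directory` are hypothetical helpers, not part of this package.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_records(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a gzip file."""
    # Assumption: the file holds one JSON document per line (JSON Lines).
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def iter_directory(save_dir: str, pattern: str = "*.json.gz") -> Iterator[Dict[str, Any]]:
    """Stream records from every matching file in save_dir, in file-name order."""
    for path in sorted(Path(save_dir).glob(pattern)):
        yield from iter_records(path)
```

Because both helpers are generators, records are streamed one at a time and a large file never has to fit in memory.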
## Available Datasets
The API provides access to the following datasets:
- abstracts
- authors
- citations
- embeddings-specter_v1
- embeddings-specter_v2
- paper-ids
- papers
- publication-venues
- s2orc
- tldrs
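A dataset name can be checked locally before any request is made. This is a minimal sketch: `KNOWN_DATASETS` is copied from the list above and `check_dataset_name` is a hypothetical helper, not a function of this package. The live API may add or remove datasets, so `get_available_datasets()` remains the authoritative source.

```python
from typing import FrozenSet

# Names copied from the README list; treat this as a snapshot, not a guarantee.
KNOWN_DATASETS: FrozenSet[str] = frozenset(
    {
        "abstracts",
        "authors",
        "citations",
        "embeddings-specter_v1",
        "embeddings-specter_v2",
        "paper-ids",
        "papers",
        "publication-venues",
        "s2orc",
        "tldrs",
    }
)


def check_dataset_name(name: str) -> str:
    """Return the name unchanged if it is known, otherwise raise ValueError."""
    if name not in KNOWN_DATASETS:
        options = ", ".join(sorted(KNOWN_DATASETS))
        raise ValueError(f"Unknown dataset {name!r}; expected one of: {options}")
    return name
```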
## API Reference
### Main Methods
#### `SemanticScholarDataset(api_key: Optional[str] = None)`
Initialize the API client with an optional API key.
- `api_key`: API key for accessing the Semantic Scholar Dataset API. Required for most operations.
#### `get_available_releases() -> list`
Get a list of all available dataset releases.
#### `get_available_datasets() -> list`
Get a list of all available datasets.
#### `get_download_urls_from_release(datasetname: Optional[str] = None, release_id: str = "latest") -> Dict[str, Any]`
Get download URLs for a specific release of a dataset.
- `datasetname`: Name of the dataset to get URLs for
- `release_id`: ID of the release (defaults to "latest")
#### `get_download_urls_from_diffs(start_release_id: Optional[str], end_release_id: str = "latest", datasetname: Optional[str]) -> Dict[str, Any]`
Get download URLs for differences between two releases.
- `start_release_id`: Starting release ID
- `end_release_id`: Ending release ID (defaults to "latest")
- `datasetname`: Name of the dataset to get diff URLs for
#### `download_latest_release(datasetname: Optional[str] = None, save_dir: Optional[str] = None, range: Optional[range] = None) -> None`
Download the latest release of a specific dataset.
- `datasetname`: Name of the dataset to download
- `save_dir`: Directory to save downloaded files (defaults to current directory)
- `download_range`: Optional range of indices to download from the list of files
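
The range argument is described as a set of indices into the list of files. One plausible reading, sketched below, is that only the files at those positions are downloaded. `select_files` is a hypothetical stand-in used for illustration; how the library treats out-of-bounds indices is not documented here, so skipping them is an assumption.

```python
from typing import List, Optional, Sequence


def select_files(urls: Sequence[str], download_range: Optional[range] = None) -> List[str]:
    """Return the URLs whose positions fall inside download_range (all if None)."""
    if download_range is None:
        return list(urls)
    # Assumption: indices past the end of the list are skipped, not an error.
    return [urls[i] for i in download_range if 0 <= i < len(urls)]


urls = ["file-0", "file-1", "file-2", "file-3"]
first_two = select_files(urls, range(0, 2))
```

Splitting a large dataset into consecutive ranges such as `range(0, 10)` and `range(10, 20)` would let the download proceed in batches.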
#### `download_past_release(release_id: str, datasetname: Optional[str] = None, save_dir: Optional[str] = None, range: Optional[range] = None) -> None`
Download a specific past release of a dataset.
- `release_id`: ID of the release to download
- `datasetname`: Name of the dataset to download
- `save_dir`: Directory to save downloaded files (defaults to current directory)
- `download_range`: Optional range of indices to download from the list of files
#### `download_diffs(start_release_id: str, end_release_id: str, datasetname: Optional[str] = None, save_dir: Optional[str] = None) -> None`
Download the differences between two releases of a dataset.
- `start_release_id`: Starting release ID
- `end_release_id`: Ending release ID
- `datasetname`: Name of the dataset to download diffs for
- `save_dir`: Directory to save downloaded files (defaults to current directory)
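
A diff consists of update files and delete files. The sketch below shows one way such a diff could be applied to a local copy of a dataset, assuming updates are upserts and deletes remove records, both matched on a primary key. The key name `corpusid` and the helper `apply_diff` are assumptions made for illustration; check the key of the dataset you are working with.

```python
from typing import Any, Dict, Iterable

Record = Dict[str, Any]


def apply_diff(
    table: Dict[Any, Record],
    updates: Iterable[Record],
    deletes: Iterable[Record],
    key: str = "corpusid",  # assumed primary key; varies by dataset
) -> Dict[Any, Record]:
    """Return a new table with updates upserted and deletes removed."""
    result = dict(table)  # shallow copy so the caller's table is untouched
    for record in updates:
        result[record[key]] = record  # insert new rows, overwrite existing ones
    for record in deletes:
        result.pop(record[key], None)  # ignore keys that are already absent
    return result
```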
### Error Handling
The library handles the following error conditions:
- Invalid dataset names
- Missing API keys
- Network errors
- Invalid release IDs
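
Long downloads may still fail on transient network problems. The wrapper below is a generic retry sketch, not part of this package. It retries on `OSError`, which covers the built-in `ConnectionError` and `TimeoutError` as well as the exceptions raised by `requests`; whether this library lets those exceptions propagate or wraps them in its own types is an assumption you should verify.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
) -> T:
    """Call action(), retrying on the listed exception types with a growing delay."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay_seconds * attempt)  # wait longer after each failure
    raise AssertionError("unreachable")
```

A download call could then be wrapped as `with_retries(lambda: client.download_latest_release(datasetname="papers", save_dir="downloads"))`.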
### File Naming
Downloaded files follow these naming patterns:
- Latest release: `{datasetname}_latest_{index}.json.gz`
- Past release: `{datasetname}_{release_id}_{index}.json.gz`
- Diffs:
  - Updates: `{datasetname}_{from_release}_{to_release}_update_{index}.json.gz`
  - Deletes: `{datasetname}_{from_release}_{to_release}_delete_{index}.json.gz`
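
The patterns above can be expressed as two small functions, which is handy for locating files after a download. Both function names are hypothetical and only mirror the documented patterns. Whether `{index}` starts at 0 or 1 is not stated in this README.

```python
from typing import Optional


def release_filename(datasetname: str, index: int, release_id: Optional[str] = None) -> str:
    """Name of file `index` in a release; release_id=None means the latest release."""
    label = "latest" if release_id is None else release_id
    return f"{datasetname}_{label}_{index}.json.gz"


def diff_filename(
    datasetname: str, from_release: str, to_release: str, kind: str, index: int
) -> str:
    """Name of an update or delete file in a diff between two releases."""
    if kind not in ("update", "delete"):
        raise ValueError("kind must be 'update' or 'delete'")
    return f"{datasetname}_{from_release}_{to_release}_{kind}_{index}.json.gz"
```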
## Environment Variables
- `SEMANTIC_SCHOLAR_API_KEY`: Your API key for the Semantic Scholar Dataset API
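
Since `os.getenv` returns `None` when the variable is unset, it can help to fail early with a clear message before creating the client. `load_api_key` below is a hypothetical helper sketched for that purpose; it is not provided by this package.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from SEMANTIC_SCHOLAR_API_KEY or raise a clear error."""
    source = os.environ if env is None else env
    key = source.get("SEMANTIC_SCHOLAR_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set SEMANTIC_SCHOLAR_API_KEY before creating the client")
    return key
```

The returned value can be passed directly as `SemanticScholarDataset(api_key=load_api_key())`.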
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Acknowledgments
- Semantic Scholar for providing the Dataset API
- The academic community for maintaining and contributing to the datasets
Raw data
{
"_id": null,
"home_page": "https://github.com/k1000dai/semanticscholar-datasetapi",
"name": "semanticscholar-datasetapi",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "semantic scholar, dataset, academic papers, citations, research, api",
"author": "Kohei Sendai",
"author_email": "your.email@example.com",
"download_url": "https://files.pythonhosted.org/packages/7c/7b/f9330fa576028da50199ec8ff400be965801bf9d760c8c509185c1b0c7fc/semanticscholar_datasetapi-0.1.2.tar.gz",
"platform": null,
"description": "# Semantic Scholar Dataset API Wrapper\n\nA Python wrapper for the Semantic Scholar Dataset API that provides easy access to academic papers, citations, and related data.\n\n## Description\n\nThis library provides a simple interface to interact with the Semantic Scholar Dataset API, allowing you to:\n- Access various academic datasets (papers, citations, authors, etc.)\n- Download dataset releases\n- Get diffs between releases\n- Manage large dataset downloads efficiently\n\n## Installation\n\n```bash\npip install semanticscholar-datasetapi\n```\n\n## Requirements\n\n- Python 3.7+\n- requests\n\n## Basic Usage\n\n```python\nfrom semanticscholar_datasetapi import SemanticScholarDataset\nimport os\n\n# Initialize the client with your API key\napi_key = os.getenv(\"SEMANTIC_SCHOLAR_API_KEY\")\nclient = SemanticScholarDataset(api_key=api_key)\n\n# List available datasets\ndatasets = client.get_available_datasets()\nprint(datasets)\n\n# Get latest release information\nreleases = client.get_available_releases()\nprint(releases)\n\n# Download latest release of a specific dataset\nclient.download_latest_release(datasetname=\"papers\", save_dir=\"downloads\")\n\n# Get diffs between releases\nclient.download_diffs(\n start_release_id=\"2024-12-31\",\n end_release_id=\"latest\",\n datasetname=\"papers\",\n save_dir=\"diffs\"\n)\n```\n\n## Available Datasets\n\nThe API provides access to the following datasets:\n- abstracts\n- authors\n- citations\n- embeddings-specter_v1\n- embeddings-specter_v2\n- paper-ids\n- papers\n- publication-venues\n- s2orc\n- tldrs\n\n## API Reference\n\n### Main Methods\n\n#### `SemanticScholarDataset(api_key: Optional[str] = None)`\nInitialize the API client with an optional API key.\n\n- `api_key`: API key for accessing the Semantic Scholar Dataset API. 
Required for most operations.\n\n#### `get_available_releases() -> list`\nGet a list of all available dataset releases.\n\n#### `get_available_datasets() -> list`\nGet a list of all available datasets.\n\n#### `get_download_urls_from_release(datasetname: Optional[str] = None, release_id: str = \"latest\") -> Dict[str, Any]`\nGet download URLs for a specific release of a dataset.\n\n- `datasetname`: Name of the dataset to get URLs for\n- `release_id`: ID of the release (defaults to \"latest\")\n\n#### `get_download_urls_from_diffs(start_release_id: Optional[str], end_release_id: str = \"latest\", datasetname: Optional[str]) -> Dict[str, Any]`\nGet download URLs for differences between two releases.\n\n- `start_release_id`: Starting release ID\n- `end_release_id`: Ending release ID (defaults to \"latest\")\n- `datasetname`: Name of the dataset to get diff URLs for\n\n#### `download_latest_release(datasetname: Optional[str] = None, save_dir: Optional[str] = None, range: Optional[range] = None) -> None`\nDownload the latest release of a specific dataset.\n\n- `datasetname`: Name of the dataset to download\n- `save_dir`: Directory to save downloaded files (defaults to current directory)\n- `download_range`: Optional range of indices to download from the list of files\n\n#### `download_past_release(release_id: str, datasetname: Optional[str] = None, save_dir: Optional[str] = None, range: Optional[range] = None) -> None`\nDownload a specific past release of a dataset.\n\n- `release_id`: ID of the release to download\n- `datasetname`: Name of the dataset to download\n- `save_dir`: Directory to save downloaded files (defaults to current directory)\n- `download_range`: Optional range of indices to download from the list of files\n\n#### `download_diffs(start_release_id: str, end_release_id: str, datasetname: Optional[str] = None, save_dir: Optional[str] = None) -> None`\nDownload the differences between two releases of a dataset.\n\n- `start_release_id`: Starting release 
ID\n- `end_release_id`: Ending release ID\n- `datasetname`: Name of the dataset to download diffs for\n- `save_dir`: Directory to save downloaded files (defaults to current directory)\n\n### Error Handling\n\nThe library includes comprehensive error handling for:\n- Invalid dataset names\n- Missing API keys\n- Network errors\n- Invalid release IDs\n\n### File Naming\n\nDownloaded files follow these naming patterns:\n- Latest release: `{datasetname}_latest_{index}.json.gz`\n- Past release: `{datasetname}_{release_id}_{index}.json.gz`\n- Diffs: \n - Updates: `{datasetname}_{from_release}_{to_release}_update_{index}.json.gz`\n - Deletes: `{datasetname}_{from_release}_{to_release}_delete_{index}.json.gz`\n\n## Environment Variables\n\n- `SEMANTIC_SCHOLAR_API_KEY`: Your API key for the Semantic Scholar Dataset API\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## Acknowledgments\n\n- Semantic Scholar for providing the Dataset API\n- The academic community for maintaining and contributing to the datasets\n",
"bugtrack_url": null,
"license": null,
"summary": "A Python wrapper for the Semantic Scholar Dataset API that provides easy access to academic papers, citations, and related data",
"version": "0.1.2",
"project_urls": {
"Bug Tracker": "https://github.com/k1000dai/semanticscholar-datasetapi/issues",
"Documentation": "https://github.com/k1000dai/semanticscholar-datasetapi#readme",
"Homepage": "https://github.com/k1000dai/semanticscholar-datasetapi",
"Source Code": "https://github.com/k1000dai/semanticscholar-datasetapi"
},
"split_keywords": [
"semantic scholar",
" dataset",
" academic papers",
" citations",
" research",
" api"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "6fe985ee3f2042f25b0841c3b5c51ee427ca85b4ff114a785ed34615497fd7d4",
"md5": "a9f05db8b107155d696f2643667dee3c",
"sha256": "72446c7dc2369e85281ded7bdc6ff960c6e0a7fbe976a8a77b8b3d5b02b87758"
},
"downloads": -1,
"filename": "semanticscholar_datasetapi-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a9f05db8b107155d696f2643667dee3c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 6991,
"upload_time": "2025-02-19T08:07:37",
"upload_time_iso_8601": "2025-02-19T08:07:37.581655Z",
"url": "https://files.pythonhosted.org/packages/6f/e9/85ee3f2042f25b0841c3b5c51ee427ca85b4ff114a785ed34615497fd7d4/semanticscholar_datasetapi-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "7c7bf9330fa576028da50199ec8ff400be965801bf9d760c8c509185c1b0c7fc",
"md5": "5b30f410969ff6893dd3fc67b7ec0573",
"sha256": "b221f2af3596c9074b7a5cd3989fef10509559278b0e7bcf31d3902bf42511f7"
},
"downloads": -1,
"filename": "semanticscholar_datasetapi-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "5b30f410969ff6893dd3fc67b7ec0573",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 6160,
"upload_time": "2025-02-19T08:07:39",
"upload_time_iso_8601": "2025-02-19T08:07:39.341305Z",
"url": "https://files.pythonhosted.org/packages/7c/7b/f9330fa576028da50199ec8ff400be965801bf9d760c8c509185c1b0c7fc/semanticscholar_datasetapi-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-19 08:07:39",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "k1000dai",
"github_project": "semanticscholar-datasetapi",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "semanticscholar-datasetapi"
}