small-web-dataset

| Field           | Value |
|-----------------|-------|
| Name            | small-web-dataset |
| Version         | 0.0.2 |
| Home page       | https://github.com/fgiasson/small-web-dataset |
| Summary         | Process all the RSS and Atom feeds from the Small Web feeds list, validate them, generate statistics and eventually more. |
| Upload time     | 2023-09-20 18:28:54 |
| Author          | Frederick Giasson |
| Requires Python | >=3.10 |
| License         | GNU GPLv3 |
| Keywords        | nbdev, jupyter, notebook, python |

# small-web-dataset

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

The Small Web Dataset is a command line tool that generates a dataset
by aggregating all the data from the [Kagi Small Web
index](https://github.com/kagisearch/smallweb/blob/main/smallweb.txt).

What is the Small Web? The Small Web is the web of independent websites
that are not part of the big tech platforms. Here are some references
about the concept
\[[1](https://neustadt.fr/essays/the-small-web/)\]\[[2](https://benhoyt.com/writings/the-small-web-is-beautiful/)\]\[[3](https://smallweb.page/why)\]\[[4](https://ar.al/2020/08/07/what-is-the-small-web/)\]\[[5](https://news.ycombinator.com/item?id=29768197)\].

This tool and the dataset it creates serve several purposes:

1.  help analyze the Kagi Small Web index, to detect and eventually
    remove sites that don't comply with the index's policy
2.  create a dataset of all the sites that compose the index. This
    dataset is a very specialized subset of websites created and
    maintained by independent people, mostly old-school bloggers. It can
    be used for specialized ML training, for example to train a
    classifier that distinguishes Small Web sites from Big Web sites.

## Install

To install the command line tool, run:

``` sh
git clone https://github.com/fgiasson/small-web-dataset.git
cd small-web-dataset

make build
make install-local-build
```

This clones the repository, builds the command line tool, and installs
it in your local Python environment.

## Configure

Make the following environment variables available in your
environment:

| Variable     | Description                                                                  |
|--------------|------------------------------------------------------------------------------|
| `FEEDS_PATH` | The path where you want to save all the feeds on your local file system      |
| `DB_PATH`    | The path where you want to save the SQLite dataset on your local file system |
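
For example, in a Unix-like shell you could export them before running
the tool. The paths below are placeholders, not defaults shipped with
the tool:

``` sh
# Placeholder locations; point these anywhere on your file system.
export FEEDS_PATH=~/small-web/feeds
export DB_PATH=~/small-web/db
```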

## How to use

You can make sure that the command line tool is installed, and check
which version is installed, by running:

``` sh
small-web-dataset version
```

You can get the help documentation by running:

``` sh
small-web-dataset --help
```

You can check the tool's configuration options in the current
environment by running:

``` sh
small-web-dataset config
```

To create the dataset, run:

``` sh
small-web-dataset sync-feeds
```

This command will do three things:

1.  it will download all the RSS and Atom feeds from the Kagi Small Web
    index into the `FEEDS_PATH` folder
2.  it will read all the local feed files and import them into a local
    SQLite database in the `DB_PATH` folder (see the sketch after this
    list)
3.  it will infer the core language of each feed from the language used
    to write its articles, and add this information to the database
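
Once the sync completes, you can peek at the resulting SQLite database
to verify the import. This is a minimal sketch assuming the database
file lives directly under `DB_PATH`; the file and table names below are
placeholders, so adjust them to whatever the tool actually creates:

``` sh
# List the tables the tool created (file name is a placeholder).
sqlite3 "$DB_PATH/feeds.db" ".tables"

# Hypothetical sanity check: count rows in one of the listed tables.
sqlite3 "$DB_PATH/feeds.db" "SELECT count(*) FROM feeds;"
```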

Optionally, if you already have a local cache of the feeds and only
want to update or recreate the database, specify the `DDMMYYYY` folder
of the feeds you want to process:

``` sh
small-web-dataset sync-feeds 18092023
```

            
