pymeilisearch-scraper 0.2.0

- Summary: Documentation scraper for PyMeilisearch
- Requires Python: >=3.8
- Uploaded: 2023-09-25 14:20:37
- Source: https://github.com/ansys/pymeilisearch-scraper
# pymeilisearch-scraper

**An Ansys fork of meilisearch/docs-scraper**

This repository has been forked from
[meilisearch/docs-scraper](https://github.com/meilisearch/docs-scraper) and
incorporates several enhancements to facilitate usage with Python and Sphinx
documentation scraping.

It is used by [pymeilisearch](https://github.com/ansys/pymeilisearch) when
scraping online and local documentation pages.

Additions in this fork:

- Installable via `pip` (see the install command after this list)
- Ignores trailing `" #"` at the end of headers in Sphinx documentation
- Provides a `__main__.py` so the scraper can be run as a Python module
- Includes the desired CNAME when scraping local pages
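
To install this fork from PyPI (the package name comes from the PyPI metadata at the top of this page):

```bash
pip install pymeilisearch-scraper
```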

```
$ python -m scraper -h
usage: __main__.py [-h] [--meilisearch-host-url MEILISEARCH_HOST_URL]
                   [--meilisearch-api-key MEILISEARCH_API_KEY]
                   config_file

Scrape documentation.

positional arguments:
  config_file           The path to the configuration file.

options:
  -h, --help            show this help message and exit
  --meilisearch-host-url MEILISEARCH_HOST_URL
                        The URL to the meilisearch host
  --meilisearch-api-key MEILISEARCH_API_KEY
                        The URL to the meilisearch host
```
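
For example, a full invocation against a local Meilisearch instance (the host URL and key match the example used later in this README; `config.json` is a hypothetical file name):

```bash
python -m scraper \
    --meilisearch-host-url http://localhost:7700 \
    --meilisearch-api-key myMasterKey \
    config.json
```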

Original documentation follows:

---

<p align="center">
  <img src="https://raw.githubusercontent.com/meilisearch/integration-guides/main/assets/logos/logo.svg" alt="Meilisearch" width="200" height="200" />
</p>

<h1 align="center">docs-scraper</h1>

<h4 align="center">
  <a href="https://github.com/meilisearch/meilisearch">Meilisearch</a> |
  <a href="https://www.meilisearch.com/pricing?utm_campaign=oss&utm_source=integration&utm_medium=docs-scraper">Meilisearch Cloud</a> |
  <a href="https://www.meilisearch.com/docs">Documentation</a> |
  <a href="https://discord.meilisearch.com">Discord</a> |
  <a href="https://roadmap.meilisearch.com/tabs/1-under-consideration">Roadmap</a> |
  <a href="https://www.meilisearch.com">Website</a> |
  <a href="https://www.meilisearch.com/docs/faq">FAQ</a>
</h4>

<p align="center">
  <a href="https://github.com/meilisearch/docs-scraper/actions"><img src="https://github.com/meilisearch/docs-scraper/workflows/Tests/badge.svg" alt="GitHub Workflow Status"></a>
  <a href="https://github.com/meilisearch/docs-scraper/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-MIT-informational" alt="License"></a>
  <a href="https://ms-bors.herokuapp.com/repositories/44"><img src="https://bors.tech/images/badge_small.svg" alt="Bors enabled"></a>
</p>

**docs-scraper** is a scraper for your documentation website that indexes the scraped content into a **Meilisearch** instance.

**Meilisearch** is an open-source search engine. [Discover what Meilisearch is!](https://github.com/meilisearch/meilisearch)

This scraper is used in production and runs on the [Meilisearch documentation](https://www.meilisearch.com/docs) on each deployment.

💡 If you already have your own scraper but you still want to use Meilisearch and our [front-end tools](#-and-for-the-front-end-search-bar), check out [this discussion](https://github.com/meilisearch/docs-searchbar.js/issues/40).

## Table of Contents <!-- omit in TOC -->

- [⚡ Supercharge your Meilisearch experience](#-supercharge-your-meilisearch-experience)
- [⚙️ Usage](#️-usage)
  - [Run your Meilisearch Instance](#run-your-meilisearch-instance)
  - [Set your Config File](#set-your-config-file)
  - [Run the Scraper](#run-the-scraper)
- [🖌 And for the front-end search bar?](#-and-for-the-front-end-search-bar)
- [🛠 More Configurations](#-more-configurations)
  - [More About the Selectors](#more-about-the-selectors)
  - [All the Config File Settings](#all-the-config-file-settings)
    - [`index_uid`](#index_uid)
    - [`start_urls`](#start_urls)
    - [`stop_urls` (optional)](#stop_urls-optional)
    - [`selectors_key` (optional)](#selectors_key-optional)
    - [`scrape_start_urls` (optional)](#scrape_start_urls-optional)
    - [`sitemap_urls` (optional)](#sitemap_urls-optional)
    - [`sitemap_alternate_links` (optional)](#sitemap_alternate_links-optional)
    - [`selectors_exclude` (optional)](#selectors_exclude-optional)
    - [`custom_settings` (optional)](#custom_settings-optional)
    - [`min_indexed_level` (optional)](#min_indexed_level-optional)
    - [`only_content_level` (optional)](#only_content_level-optional)
    - [`js_render` (optional)](#js_render-optional)
    - [`js_wait` (optional)](#js_wait-optional)
    - [`allowed_domains` (optional)](#allowed_domains-optional)
  - [Authentication](#authentication)
  - [Installing Chrome Headless](#installing-chrome-headless)
- [🤖 Compatibility with Meilisearch](#-compatibility-with-meilisearch)
- [⚙️ Development Workflow and Contributing](#️-development-workflow-and-contributing)
- [Credits](#credits)


## ⚡ Supercharge your Meilisearch experience

Say goodbye to server deployment and manual updates with [Meilisearch Cloud](https://www.meilisearch.com/pricing?utm_campaign=oss&utm_source=integration&utm_medium=docs-scraper). No credit card required.

## ⚙️ Usage

Here are the 3 steps to use `docs-scraper`:

1. [Run a Meilisearch instance](#run-your-meilisearch-instance)
2. [Set your config file](#set-your-config-file)
3. [Run the scraper](#run-the-scraper)

### Run your Meilisearch Instance

Your documentation content needs to be scraped and pushed into a Meilisearch instance.

You can install and run Meilisearch on your machine using `curl`.

```bash
curl -L https://install.meilisearch.com | sh
./meilisearch --master-key=myMasterKey
```

There are [other ways to install Meilisearch](https://www.meilisearch.com/docs/learn/getting_started/installation).

The host URL and the API key you will provide in the next steps correspond to the credentials of this Meilisearch instance.
In the example above, the host URL is `http://localhost:7700` and the API key is `myMasterKey`.

_Meilisearch is open-source and can run either on your server or on any cloud provider. Here is a tutorial to [run Meilisearch in production](https://www.meilisearch.com/docs/learn/cookbooks/running-production/)._


### Set your Config File

The scraper tool needs a config file to know which content you want to scrape. This is done by providing **selectors** (for example, an HTML tag, id, or class). The config file is passed as an argument and can be named whatever you want.

Here is an example of a basic config file:

```json
{
  "index_uid": "docs",
  "start_urls": ["https://www.example.com/doc/"],
  "sitemap_urls": ["https://www.example.com/sitemap.xml"],
  "stop_urls": [],
  "selectors": {
    "lvl0": {
      "selector": ".docs-lvl0",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": {
      "selector": ".docs-lvl1",
      "global": true,
      "default_value": "Chapter"
    },
    "lvl2": ".docs-content .docs-lvl2",
    "lvl3": ".docs-content .docs-lvl3",
    "lvl4": ".docs-content .docs-lvl4",
    "lvl5": ".docs-content .docs-lvl5",
    "lvl6": ".docs-content .docs-lvl6",
    "text": ".docs-content p, .docs-content li"
  }
}
```
The `index_uid` field is the index identifier in your Meilisearch instance in which your website content is stored. The scraping tool will create a new index if it does not exist.

The `docs-content` class (the `.` means this is a class) is the main container of the textual content in this example. Most of the time, this tag is a `<main>` or an `<article>` HTML element.

`lvlX` selectors should target the standard heading tags (`h1`, `h2`, `h3`, and so on). You can also use static classes. Set a unique `id` or `name` attribute on these elements.

Every searchable `lvl` element outside this main documentation container (for instance, in a sidebar) must use a `global` selector. It is then picked up globally and injected into every record built from your page.
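
For instance, markup like the following (illustrative only, reusing the class names from the example config above) would be captured by those selectors:

```html
<!-- Illustrative page structure; class names come from the example config -->
<nav class="docs-lvl0">Documentation</nav>   <!-- global lvl0, outside the content container -->
<aside class="docs-lvl1">Chapter</aside>     <!-- global lvl1 -->
<main class="docs-content">
  <h2 class="docs-lvl2">Installation</h2>
  <h3 class="docs-lvl3">From source</h3>
  <p>Paragraphs and list items in here are captured by the "text" selector.</p>
</main>
```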

You can also check out the [config file](https://github.com/meilisearch/documentation/blob/main/docs-scraper.config.json) we use in production for our own documentation site.<br>

💡 _To better understand the selectors, go to [this section](#more-about-the-selectors)._

🔨 _There are many other fields you can set in the config file that allow you to adapt the scraper to your need. Check out [this section](#all-the-config-file-settings)._

### Run the Scraper

#### From Source Code <!-- omit in TOC -->

This project supports Python 3.8 and above.

The [`pipenv` command](https://pipenv.readthedocs.io/en/latest/install/#installing-pipenv) must be installed.

Set both environment variables `MEILISEARCH_HOST_URL` and `MEILISEARCH_API_KEY`.<br>
Following on from the example in the [first step](#run-your-meilisearch-instance), they are respectively `http://localhost:7700` and `myMasterKey`.
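
For example, on Linux or macOS:

```bash
# Values taken from the local Meilisearch instance started in the first step
export MEILISEARCH_HOST_URL=http://localhost:7700
export MEILISEARCH_API_KEY=myMasterKey
```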

Then, run:
```bash
pipenv install
pipenv run ./docs_scraper <path-to-your-config-file>
```

`<path-to-your-config-file>` should be the path of your configuration file defined at the [previous step](#set-your-config-file).

#### With Docker <!-- omit in TOC -->

```bash
docker run -t --rm \
    -e MEILISEARCH_HOST_URL=<your-meilisearch-host-url> \
    -e MEILISEARCH_API_KEY=<your-meilisearch-api-key> \
    -v <absolute-path-to-your-config-file>:/docs-scraper/<path-to-your-config-file> \
    getmeili/docs-scraper:latest pipenv run ./docs_scraper <path-to-your-config-file>
```

`<absolute-path-to-your-config-file>` should be the absolute path of your configuration file defined at the [previous step](#set-your-config-file).

⚠️ If you run Meilisearch locally, you must add the `--network=host` option to this Docker command.

#### In a GitHub Action <!-- omit in TOC -->

To run after your deployment job:

```yml
run-scraper:
    needs: <your-deployment-job>
    runs-on: ubuntu-18.04
    steps:
    - uses: actions/checkout@master
    - name: Run scraper
      env:
        HOST_URL: ${{ secrets.MEILISEARCH_HOST_URL }}
        API_KEY: ${{ secrets.MEILISEARCH_API_KEY }}
        CONFIG_FILE_PATH: <path-to-your-config-file>
      run: |
        docker run -t --rm \
          -e MEILISEARCH_HOST_URL=$HOST_URL \
          -e MEILISEARCH_API_KEY=$API_KEY \
          -v $CONFIG_FILE_PATH:/docs-scraper/<path-to-your-config-file> \
          getmeili/docs-scraper:latest pipenv run ./docs_scraper <path-to-your-config-file>
```

⚠️ We do not recommend using the `latest` image in production. Use the [release tags](https://github.com/meilisearch/docs-scraper/releases) instead.

Here is the [GitHub Action file](https://github.com/meilisearch/documentation/blob/master/.github/workflows/gh-pages-scraping.yml) we use in production for the Meilisearch documentation.

#### About the API Key <!-- omit in TOC -->

The API key you must provide should have the permissions to add documents into your Meilisearch instance.<br>
In a production environment, we recommend providing the private key instead of the master key, as it is safer and it has enough permissions to perform such requests.

_More about [Meilisearch authentication](https://www.meilisearch.com/docs/learn/security/master_api_keys)._
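
As a rough sketch, such a key can be created through the Meilisearch `/keys` endpoint. The set of `actions` shown below is illustrative; check the Meilisearch API reference for the exact permissions your setup needs:

```bash
# Sketch: create a key scoped to the "docs" index from the example config,
# using the master key from the earlier example. Verify the fields against
# your Meilisearch version before relying on this.
curl -X POST 'http://localhost:7700/keys' \
  -H 'Authorization: Bearer myMasterKey' \
  -H 'Content-Type: application/json' \
  --data '{
    "description": "docs-scraper key",
    "actions": ["documents.add", "documents.delete", "indexes.create", "settings.update", "tasks.get"],
    "indexes": ["docs"],
    "expiresAt": null
  }'
```

The `key` value returned in the response is what you would pass as `MEILISEARCH_API_KEY`.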

## 🖌 And for the front-end search bar?

After having scraped your documentation, you might need a search bar to improve your user experience!

For the front-end part:
- If your website is a VuePress application, check out the [vuepress-plugin-meilisearch](https://github.com/meilisearch/vuepress-plugin-meilisearch) repository.
- For all kinds of documentation, check out the [docs-searchbar.js](https://github.com/meilisearch/docs-searchbar.js) library.

**Both of these libraries provide a front-end search bar perfectly adapted for documentation.**

![docs-searchbar-demo](assets/docs-searchbar-demo.gif)

## 🛠 More Configurations

### More About the Selectors

#### Bases <!-- omit in TOC -->

Very simply, selectors are needed to tell the scraper "I want to get the content in this HTML tag".<br>
This HTML tag is a **selector**.

A selector can be:

- a class (e.g. `.main-content`)
- an id (e.g. `#main-article`)
- an HTML tag (e.g. `h1`)

With a more concrete example:

```json
"lvl0": {
    "selector": ".navbar-nav .active",
    "global": true,
    "default_value": "Documentation"
},
```

`.navbar-nav .active` means "take the content in the class `active` that is itself in the class `navbar-nav`".

`global: true` means you want the same `lvl0` (so, the same main title) for all the contents extracted from the same page.

`"default_value": "Documentation"` will be the displayed value if no content in `.navbar-nav .active` was found.

NB: You can set the `global` and `default_value` attributes for every selector level (`lvlX`) and not only for the `lvl0`.

#### The Levels <!-- omit in TOC -->

You can notice different levels of selectors (0 to 6 maximum) in the config file. They correspond to different levels of titles, and will be displayed this way:

![selectors-display](assets/selectors-display.png)

Your data will be displayed with a main title (`lvl0`), sub-titles (`lvl1`), sub-sub-titles (`lvl2`) and so on...

### All the Config File Settings

#### `index_uid`

The `index_uid` field is the index identifier in your Meilisearch instance in which your website content is stored. The scraping tool will create a new index if it does not exist.

```json
{
  "index_uid": "example"
}
```

#### `start_urls`

This array contains the list of URLs that will be used to start scraping your website.<br>
The scraper will recursively follow any links (`<a>` tags) from those pages. It will not follow links that are on another domain.

```json
{
  "start_urls": ["https://www.example.com/docs"]
}
```
##### Using Page Rank <!-- omit in TOC -->

This parameter gives more weight to some pages and boosts the records built from them.<br>
Pages with a higher `page_rank` will be returned before pages with a lower `page_rank`.

```json
{
  "start_urls": [
    {
      "url": "http://www.example.com/docs/concepts/",
      "page_rank": 5
    },
    {
      "url": "http://www.example.com/docs/contributors/",
      "page_rank": 1
    }
  ]
}
```

In this example, records built from the Concepts page will be ranked higher than results extracted from the Contributors page.

#### `stop_urls` (optional)

The scraper will not follow links that match `stop_urls`.

```json
{
  "start_urls": ["https://www.example.com/docs"],
  "stop_urls": ["https://www.example.com/about-us"]
}
```

#### `selectors_key` (optional)

This allows you to use custom selectors per page.

If the markup of your website is so different from one page to another that you can't have generic selectors, you can namespace your selectors and specify which set of selectors should be applied to specific pages.

```json
{
  "start_urls": [
    "http://www.example.com/docs/",
    {
      "url": "http://www.example.com/docs/concepts/",
      "selectors_key": "concepts"
    },
    {
      "url": "http://www.example.com/docs/contributors/",
      "selectors_key": "contributors"
    }
  ],
  "selectors": {
    "default": {
      "lvl0": ".main h1",
      "lvl1": ".main h2",
      "lvl2": ".main h3",
      "lvl3": ".main h4",
      "lvl4": ".main h5",
      "text": ".main p"
    },
    "concepts": {
      "lvl0": ".header h2",
      "lvl1": ".main h1.title",
      "lvl2": ".main h2.title",
      "lvl3": ".main h3.title",
      "lvl4": ".main h5.title",
      "text": ".main p"
    },
    "contributors": {
      "lvl0": ".main h1",
      "lvl1": ".contributors .name",
      "lvl2": ".contributors .title",
      "text": ".contributors .description"
    }
  }
}
```

Here, all documentation pages will use the selectors defined in `selectors.default`, while pages under `./concepts` will use `selectors.concepts` and those under `./contributors` will use `selectors.contributors`.

#### `scrape_start_urls` (optional)

By default, the scraper will extract content from the pages defined in `start_urls`. If your `start_urls` pages have no valuable content, or merely duplicate other pages, set this option to `false`.

```json
{
  "scrape_start_urls": false
}
```

#### `sitemap_urls` (optional)

You can pass an array of URLs pointing to your sitemap file(s). If this value is set, the scraper will try to read URLs from your sitemap(s).

```json
{
  "sitemap_urls": ["http://www.example.com/docs/sitemap.xml"]
}
```

#### `sitemap_alternate_links` (optional)

Sitemaps can contain alternative links for URLs. Those are other versions of the same page, in a different language, or with a different URL. By default, docs-scraper ignores those URLs.

Set this to `true` if you want those other versions to be scraped as well.

```json
{
  "sitemap_urls": ["http://www.example.com/docs/sitemap.xml"],
  "sitemap_alternate_links": true
}
```

With the above configuration and the `sitemap.xml` below, both `http://www.example.com/docs/` and `http://www.example.com/de/` will be scraped.

```html
<url>
  <loc>http://www.example.com/docs/</loc>
  <xhtml:link rel="alternate" hreflang="de" href="http://www.example.com/de/"/>
</url>
```

#### `selectors_exclude` (optional)

This expects an array of CSS selectors. Any element matching one of those selectors will be removed from the page before any data is extracted from it.

This can be used to remove a table of contents, a sidebar, or a footer, to make other selectors easier to write.

```json
{
  "selectors_exclude": [".footer", "ul.deprecated"]
}
```

#### `custom_settings` (optional)

This field can be used to add Meilisearch settings.

##### Example:
```json
"custom_settings": {
    "synonyms": {
      "static site generator": [
        "ssg"
      ],
      "ssg": [
        "static site generator"
      ]
    },
    "stopWords": ["of", "the"],
    "filterableAttributes": ["genres", "type"]
  }
```

Learn more about `filterableAttributes`, `synonyms`, `stop-words` and all available settings in the [Meilisearch documentation](https://meilisearch.com/docs/reference/api/settings#settings-object).


#### `min_indexed_level` (optional)

The default value is 0. By increasing it, you can skip records that don't have enough `lvlX` fields filled in. For example, with `min_indexed_level: 2`, the scraper only indexes records that have at least `lvl0`, `lvl1`, and `lvl2` set.

This is useful when pages of your documentation share the same `lvl0` and `lvl1`, for example. In that case, you don't want to index all of those shared records; you only want to keep the content that differs across pages.

```json
{
  "min_indexed_level": 2
}
```

#### `only_content_level` (optional)

When `only_content_level` is set to `true`, the scraper won't create records for the `lvlX` selectors.<br>
If used, `min_indexed_level` is ignored.

```json
{
  "only_content_level": true
}
```

#### `js_render` (optional)

When `js_render` is set to `true`, the scraper will use ChromeDriver. This is needed for pages rendered with JavaScript, for example pages generated with React or Vue, or applications running in development mode (with `autoreload` or `watch`).

After installing ChromeDriver, provide the path to the binary via the `CHROMEDRIVER_PATH` environment variable (the default value is `/usr/bin/chromedriver`).

The default value of `js_render` is `false`.

```json
{
  "js_render": true
}
```

#### `js_wait` (optional)

This setting can be used when `js_render` is set to `true` and the pages need time to fully load. `js_wait` takes an integer that specifies the number of seconds the scraper should wait for the page to load.

```json
{
  "js_render": true,
  "js_wait": 1
}
```

#### `allowed_domains` (optional)

This setting specifies the domains that the scraper is allowed to access. In most cases, `allowed_domains` is set automatically from `start_urls` and `stop_urls`. When scraping a domain that contains a port, for example `http://localhost:8080`, the domain must be added to the configuration manually.

```json
{
  "allowed_domains": ["localhost"]
}
```

### Authentication

__WARNING:__ Please be aware that the scraper will send authentication headers to every scraped site, so use `allowed_domains` to adjust the scope accordingly!

#### Basic HTTP <!-- omit in TOC -->

Basic HTTP authentication is supported by setting these environment variables:
- `DOCS_SCRAPER_BASICAUTH_USERNAME`
- `DOCS_SCRAPER_BASICAUTH_PASSWORD`
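
For example (placeholder credentials):

```bash
# The scraper reads these variables from the environment at startup
export DOCS_SCRAPER_BASICAUTH_USERNAME=docs-user
export DOCS_SCRAPER_BASICAUTH_PASSWORD=s3cr3t
```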

#### Cloudflare Access: Identity and Access Management <!-- omit in TOC -->

If you need to scrape sites protected by Cloudflare Access, you have to set the appropriate HTTP headers.

Values for these headers are taken from env variables `CF_ACCESS_CLIENT_ID` and `CF_ACCESS_CLIENT_SECRET`.

For Google Cloud Identity-Aware Proxy, specify these environment variables:
- `IAP_AUTH_CLIENT_ID`: the [client ID of the application](https://console.cloud.google.com/apis/credentials) you are connecting to
- `IAP_AUTH_SERVICE_ACCOUNT_JSON`: a service account key generated under [Service accounts](https://console.cloud.google.com/iam-admin/serviceaccounts) -> Create key -> JSON

#### Keycloak Access: Identity and Access Management <!-- omit in TOC -->

If you need to scrape a site protected by [Keycloak](https://github.com/keycloak/keycloak) (Gatekeeper), you have to provide a valid access token.

If you set the environment variables `KC_URL`, `KC_REALM`, `KC_CLIENT_ID`, and `KC_CLIENT_SECRET`, the scraper authenticates itself against Keycloak using the _Client Credentials Grant_ and adds the resulting access token as the `Authorization` HTTP header to each scraping request.

### Installing Chrome Headless

Websites that need JavaScript for rendering are passed through ChromeDriver.<br>
[Download the version](http://chromedriver.chromium.org/downloads) suited to your OS and then set the environment variable `CHROMEDRIVER_PATH`.
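
For example, assuming the binary sits at the default location mentioned earlier:

```bash
# Adjust the path if you installed ChromeDriver elsewhere
export CHROMEDRIVER_PATH=/usr/bin/chromedriver
```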

## 🤖 Compatibility with Meilisearch

This package guarantees compatibility with [version v1.x of Meilisearch](https://github.com/meilisearch/meilisearch/releases/latest), but some features may not be present. Please check the [issues](https://github.com/meilisearch/docs-scraper/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22+label%3Aenhancement) for more info.

## ⚙️ Development Workflow and Contributing

Any new contribution is more than welcome in this project!

If you want to know more about the development workflow or want to contribute, please visit our [contributing guidelines](/CONTRIBUTING.md) for detailed instructions!

## Credits

Based on [Algolia's docsearch scraper repository](https://github.com/algolia/docsearch-scraper) from [this commit](https://github.com/algolia/docsearch-scraper/commit/aab0888989b3f7a4f534979f0148f471b7c435ee).<br>
Because this repository will diverge significantly from the original one, we don't maintain it as an official fork.

<hr>

**Meilisearch** provides and maintains many **SDKs and Integration tools** like this one. We want to provide everyone with an **amazing search experience for any kind of project**. If you want to contribute, make suggestions, or just know what's going on right now, visit us in the [integration-guides](https://github.com/meilisearch/integration-guides) repository.


            
