| Field | Value |
| --- | --- |
| Name | researches |
| Version | 0.3 |
| Summary | Researches is a Google search scraper. |
| upload_time | 2024-08-11 09:18:53 |
| home_page | None |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.9 |
| license | None |
| keywords | google, search, scraper, research, llm |
| download | [researches-0.3.tar.gz](https://files.pythonhosted.org/packages/4d/30/87ac96745bf387b55926601d11a29f184257ffbad3979a38d63cb30deab9/researches-0.3.tar.gz) |
| VCS | [github.com/AWeirdDev/researches](https://github.com/AWeirdDev/researches) |
| bugtrack_url | [github.com/AWeirdDev/researches/issues](https://github.com/AWeirdDev/researches/issues) |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# researches
Researches is a vanilla<sup>1</sup> Google scraper. Minimal requirements.
```python
search("Who invented papers?")
```
<sub><sup>1</sup> In context, this refers to raw/unformatted data and contents. `researches` does not clean them up for you, and the output is not guaranteed to be 100% human-readable. However, feeding it to LLMs may work well, as most of them use subword tokenizers.</sub>
## Requirements
- A decent computer
- Python ≥ 3.9
- `httpx` – HTTP connections.
- `selectolax` – The HTML parser.
## Usage
Just start searching right away. Don't worry, Gemini won't hurt you (also [gemini](https://preview.redd.it/l-gemini-lmao-v0-6a6q0pl4ac2d1.png?auto=webp&s=31cd6b33329d895501d727e6346153bc2a3ea1d6)).
```python
search(
    "US to Japan",  # query
    hl="en",        # language
    ua=None,        # custom user agent, or ours by default
    **kwargs        # kwargs to pass to httpx (optional)
) -> Result
```
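Since the extra kwargs are passed through to `httpx`, request options like a timeout can presumably be set the same way. A quick sketch, assuming `search` is exported from the top-level `researches` package:
```python
from researches import search  # assuming a top-level export

# hl sets the result language; timeout is forwarded to httpx (assumption).
result = search("US to Japan", hl="en", timeout=10.0)
print(f"Got {len(result.web)} web results")
```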
For people who love async, we've also got you covered:
```python
await asearch(
    "US to Japan",  # query
    hl="en",        # language
    ua=None,        # custom user agent, or ours by default
    **kwargs        # kwargs to pass to httpx (optional)
) -> Result
```
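A runnable sketch of the async flavor, again assuming `asearch` is a top-level export:
```python
import asyncio

from researches import asearch  # assuming a top-level export

async def main() -> None:
    result = await asearch("US to Japan", hl="en")
    for page in result.web[:3]:
        print(page.title, "->", page.url)

asyncio.run(main())
```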
So, what does the `Result` class have to offer? At a glance:
```haskell
result.snippet?
  ⤷ .text: str
  ⤷ .name: str?

result.aside?
  ⤷ .text: str

result.weather?
  ⤷ .c: str
  ⤷ .f: str
  ⤷ .precipitation: str
  ⤷ .humidty: str
  ⤷ .wind_metric: str
  ⤷ .wind_imperial: str
  ⤷ .description: str
  ⤷ .forecast: PartialWeatherForReport[]
      ⤷ .weekday: str
      ⤷ .high_c: str
      ⤷ .low_c: str
      ⤷ .high_f: str
      ⤷ .low_f: str

result.web: Web[]
  ⤷ .title: str
  ⤷ .url: str
  ⤷ .text: str

result.flights: Flight[]
  ⤷ .title: str
  ⤷ .description: str
  ⤷ .duration: str
  ⤷ .price: str

result.lyrics?
  ⤷ .text: str
  ⤷ .is_partial: bool
```
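Fields marked with `?` can be `None`, so check them before use. A minimal sketch of consuming a `Result`, assuming the attribute names listed above:
```python
from researches import search  # assuming a top-level export

result = search("weather in Tokyo")

# Blocks marked with "?" above may be None depending on the query.
if result.weather:
    print(result.weather.c, result.weather.description)

if result.snippet:
    print("Snippet:", result.snippet.text)

# result.web is a plain list and is always present.
for page in result.web:
    print("-", page.title)
```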
## Background
Data comes in different shapes and sizes, and Google plays that game extremely well. It even randomizes CSS class names, which makes the markup hard to scrape reliably.
Google sucks, but it's the knowledge base we all need. For instance, there are these types of result pages:
- **Links** – What made Google, "Google." Or, `&udm=14`.
- **Rich blocks** – Rich blocks that introduce persons, places and more.
- **Weather** – Weather forecast.
- **Wikipedia (aside)** – Wikipedia text.
- **Flights** – Flights.
...and many more. (Contribute!)
Scraper APIs out there are hella expensive, and ain't no way I'm paying or entering their free tier. So, I made my own that's perfect for extracting data with LLMs.
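As the footnote above hints, the raw text can be dropped into an LLM prompt more or less as-is. A hypothetical helper (the function name and formatting are my own) that flattens web results into one context block:
```python
from researches import search  # assuming a top-level export

def to_context(query: str, limit: int = 5) -> str:
    """Flatten the top web results into a plain-text block for an LLM prompt."""
    result = search(query)
    chunks = [
        f"{page.title}\n{page.url}\n{page.text}"
        for page in result.web[:limit]
    ]
    return "\n\n".join(chunks)

print(to_context("Who invented papers?"))
```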
<br />
<br />
<div align="center">
<img src="https://github.com/user-attachments/assets/0c2e29fd-ea9b-4078-b210-b966a8dfc976" width="800" />
</div>