forumscraper

- Name: forumscraper
- Version: 0.1.7 (PyPI)
- Summary: A forum scraper library
- Upload time: 2024-10-03 17:11:34
- Author email: Dominik Stanisław Suchora <suchora.dominik7@gmail.com>
- Requires Python: >=3.8
- License: GPLv3
- Keywords: text-processing, scraper, forums, phpbb, smf, xmb, invision, xenforo
- Requirements: requests, reliq
# forumscraper

forumscraper aims to be a universal, automatic and extensive scraper for forums.

# Installation

    pip install forumscraper

# Supported forums

- Invision Power Board (only 4.x version)
- PhpBB (currently excluding 1.x version)
- Simple Machines Forum
- XenForo
- XMB
- Hacker News (has aggressive protection; should be used with cookies from a logged-in account)
- StackExchange

# Output examples

Output examples are created by the `create-format-examples` script and stored in the [examples](https://github.com/TUVIMEN/forumscraper/tree/master/examples) directory, where they are grouped by scraper and version. Files are in `json` format.

# Usage

## CLI

### General

Download any kind of supported forum from `URL`s into `DIR`, creating json files for threads named by their ids, and likewise for users but prefixed with `m-`, e.g. `24` `29` `m-89` `m-125`.

    forumscraper --directory DIR URL1 URL2 URL3

The above behaviour is the default `--names id`; it can be changed with `--names hash`, which names files by the sha256 sum of their source urls.

    forumscraper --names hash --directory DIR URL
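
As an illustration of the two naming schemes (a sketch only; `id_name` and `hash_name` are hypothetical helpers, not part of forumscraper's API):

```python
import hashlib

def id_name(resource_id, user=False):
    # id naming: thread files are named by their id, user files get an "m-" prefix
    return f"m-{resource_id}" if user else str(resource_id)

def hash_name(url):
    # hash naming: sha256 hex digest of the source url
    return hashlib.sha256(url.encode()).hexdigest()

print(id_name(24))             # -> 24
print(id_name(89, user=True))  # -> m-89
print(hash_name("https://example.com/t/24"))
```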

By default, if the files to be created are found and are not empty, forumscraper exits without overwriting them. This can be changed using the `--force` option.

forumscraper outputs logging information to `stdout` (can be changed with `--log FILE`) and information about failures to `stderr` (can be changed with `--failed FILE`).

Failures are generally ignored, but setting the `--pedantic` flag stops execution if any failure is encountered.

Download `URL`s into `DIR` using `8` threads and log failures into `failures.txt` (note that this option is specified before the `--directory` option; otherwise, if a relative path were used, the file would be created inside the specified directory)

    forumscraper --failures failures.txt --threads 8 --directory DIR URL1 URL2 URL3

Download `URL`s with different scrapers

    forumscraper URL1 smf URL2 URL3 .thread URL4 xenforo2.forum URL5 URL6

Types of scrapers can be defined in between `URL`s; all following `URL`s are assigned to the preceding type, i.e. URL1 to the default, URL2 and URL3 to smf, and so on.
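
The assignment rule can be sketched as follows (a simplified model, not forumscraper's actual argument parser):

```python
def group_args(args, default="all.guess"):
    # assign each URL to the most recently seen type token;
    # anything that doesn't look like a url is treated as a type
    groups = []
    current = default
    for arg in args:
        if arg.startswith(("http://", "https://")):
            groups.append((current, arg))
        else:
            current = arg  # a type token like "smf" or ".thread"
    return groups

args = ["https://a.example/1", "smf", "https://b.example/2",
        "https://b.example/3", ".thread", "https://c.example/4"]
print(group_args(args))
```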

Type consists of `scraper_name` followed by `.` and `function_name`.

`scraper_name` can be: `all`, `invision`, `phpbb`, `hackernews`, `stackexchange`, `smf`, `smf1`, `smf2`, `xenforo`, `xenforo1`, `xenforo2` or `xmb`. Of these, `all`, `xenforo` and `smf` are instances of the identification class, meaning they have to download each `URL` to identify its type, which may cause redownloading of existing content when many `URL`s are passed as arguments: all resources extracted from a type identified once are assumed to have that same type, but passing thousands of thread `URL`s as arguments will always download them before scraping. `smf1`, `smf2`, `xenforo1` and `xenforo2` are simply scrapers with an assumed version.

`function_name` can be: `guess`, `findroot`, `thread`, `user`, `forum`, `tag` or `board` (`board` being the main page of the forum, where subforums are listed). `guess` guesses the type from the `URL` alone, `findroot` finds the main page of the forum from any link on the site (useful for downloading the whole forum starting from random urls); the other names are self-explanatory.

`all`, `xenforo` and `smf` also have an `identify` function that identifies the site type.

`findroot` and `identify` write results to the file specified by the `--output` option (by default `stdout`), which exists specifically for these functions. `findroot` returns the url of the board and the url from which it was found, separated by `\t`. `identify` returns the name of the scraper and the url from which it was identified, separated by `\t`.

The default type is `all.guess`, and it is so effective that the only reason not to use it is to avoid redownloading when running the same command many times, which is caused by the identification process when using `--names id`.

Types can also be shortened e.g. `.` is equivalent to `all.guess`, `.thread` is equivalent to `all.thread` and `xenforo` is equivalent to `xenforo.guess`.
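
The shortening rule can be sketched as (an illustration; `expand_type` is a hypothetical helper):

```python
def expand_type(t, default_scraper="all", default_function="guess"):
    # "." -> "all.guess", ".thread" -> "all.thread", "xenforo" -> "xenforo.guess"
    scraper, _, function = t.partition(".")
    return f"{scraper or default_scraper}.{function or default_function}"

print(expand_type("."))              # -> all.guess
print(expand_type(".thread"))        # -> all.thread
print(expand_type("xenforo"))        # -> xenforo.guess
print(expand_type("xenforo2.forum")) # -> xenforo2.forum
```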

Get version

    forumscraper --version

Get some help (you might discover that many options have single-letter abbreviations)

    forumscraper --help

### Request options

Download `URL` waiting `0.8` seconds plus a random wait of up to `400` milliseconds before each request

    forumscraper --wait 0.8 --wait-random 400 URL

Download `URL` using `5` retries and waiting `120` seconds between them

    forumscraper --retries 5 --retry-wait 120 URL

By default, when a non-fatal failure is encountered (e.g. status code 301, as opposed to 404), forumscraper retries 3 times, waiting 60 seconds before each attempt. Setting `--retries 0` disables retries, which is a valid (if not better) approach, assuming the `--failures` option is handled correctly.

Download `URL` ignoring ssl errors with timeout set to `60` seconds and custom user-agent

    forumscraper --insecure --timeout 60 --user-agent 'why are we still here?' URL

`--proxies DICT`, `--headers DICT` and `--cookies DICT` (where `DICT` is a python stringified dictionary) are passed directly to the requests library.
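
Since `DICT` is a python stringified dictionary, a value like the one below parses cleanly with `ast.literal_eval` (shown only to illustrate the expected format; how forumscraper parses it internally is an assumption):

```python
import ast

# the string you would pass to --proxies on the command line
dict_arg = "{'http': 'http://127.0.0.1:8080', 'https': 'http://127.0.0.1:8080'}"

# literal_eval safely parses python literals without executing arbitrary code
proxies = ast.literal_eval(dict_arg)
print(proxies["http"])  # -> http://127.0.0.1:8080
```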

### Settings

`--nothreads` doesn't download threads unless the url passed is itself a thread.

`--users` downloads users.

`--reactions` downloads reactions.

`--boards` creates board files.

`--tags` creates tag files.

`--forums` creates forum files.

`--compression ALGO` compresses created files with `ALGO`, which can be `none`, `gzip`, `bzip2` or `lzma`.

`--only-urls-forums` writes found forum urls to `output`, doesn't scrape.

`--only-urls-threads` writes found thread urls to `output`, doesn't scrape.

`--thread-pages-max NUM` and `--pages-max NUM` set max number of pages traversed in each thread and forum respectively.

`--pages-max-depth NUM` sets recursion limit for forums.

`--pages-forums-max NUM` limits number of forums that are processed from every page in forum.

`--pages-threads-max NUM` limits number of threads that are processed from every page in forum.

Combining some of the above you get:

    forumscraper --thread-pages-max 1 --pages-max 1 --pages-forums-max 1 --pages-threads-max 1 URL1 URL2 URL3

which downloads only one page of one thread from one forum found from every `URL`, which is very useful for debugging.

## Library

### Code

```python
import os
import sys
import forumscraper

ex = forumscraper.Extractor(timeout=90)

thread = ex.guess('https://xenforo.com/community/threads/forum-data-breach.180995/',output=forumscraper.Outputs.data|forumscraper.Outputs.threads,timeout=60,retries=0) #automatically identify forum and type of page and save results
thread['data']['threads'][0] #access the result
thread['data']['users'] #found users are also saved into an array

forum = ex.get_forum('https://xenforo.com/community/forums/off-topic.7/',output=forumscraper.Outputs.data|forumscraper.Outputs.urls|forumscraper.Outputs.threads,retries=0) #get list of all threads and urls from forum
forum['data']['threads'] #access the results
forum['urls']['threads'] #list of urls to found threads
forum['urls']['forums'] #list of urls to found forums

threads = ex.smf.get_forum('https://www.simplemachines.org/community/index.php?board=1.0',output=forumscraper.Outputs.only_urls_threads) #gather only urls to threads without scraping data
threads['urls']['threads']
threads['urls']['forums'] #is also created

forums = ex.smf.get_board('https://www.simplemachines.org/community/index.php',output=forumscraper.Outputs.only_urls_forums) #only get a list of urls to all forums
forums['urls']['forums']
forums['urls']['boards']
forums['urls']['tags'] #tags and boards are also gathered

ex.smf.get_thread('https://www.simplemachines.org/community/index.php?topic=578496.0',output=forumscraper.Outputs.only_urls_forums) #returns None

os.mkdir('xenforo')
os.chdir('xenforo')

xen = forumscraper.xenforo2(timeout=30,retries=3,retry_wait=10,wait=0.4,wait_random=400,max_workers=8,output=forumscraper.Outputs.write_by_id|forumscraper.Outputs.threads)
#specifies global config, writes output in files by their id (beginning with m- in case of users) in current directory
#ex.xenforo.v2 is an initialized instance of forumscraper.xenforo2 with the same settings as ex
#output by default is set to forumscraper.Outputs.write_by_id|forumscraper.Outputs.threads anyway

failures = []
files = xen.guess('https://xenforo.com/community/',logger=sys.stdout,failed=failures, undisturbed=True)
#failed=failures causes all failed requests to be saved in the failures list (or file)

for i in failures: #try to download failed one last time
    x = i.split(' ')
    if len(x) == 4 and x[1] == 'failed':
        xen.get_thread(x[0],state=files) #append results

files['files']['threads']
files['files']['users'] #lists of created files

#the above uses a scraper that is an instance of ForumExtractor
#with an instance of ForumExtractorIdentify, the page has to be downloaded and identified before checking if the files already exist based on url. Because of that, any getter of this class returns results with a 'scraper' field pointing to the identified scraper type, and further requests should be done through that object.

xen = forumscraper.xenforo2(timeout=30,retries=3,retry_wait=10,wait=0.4,wait_random=400,max_workers=8,output=forumscraper.Outputs.write_by_hash|forumscraper.Outputs.threads,undisturbed=True)
#specifies global config, writes output in files by sha256 hash of their url in current directory
#ex.xenforo is also an initialized forumscraper.xenforo

failures = []
files = xen.guess('https://xenforo.com/community/',logger=sys.stdout,failed=failures)
scraper = files['scraper'] #identified ForumExtractor instance

for i in failures: #try to download failed one last time
    x = i.split(' ')
    if len(x) == 4 and x[1] == 'failed':
        scraper.get_thread(x[0],state=files) #use of already identified class

os.chdir('..')
```

### Scrapers

forumscraper defines:

    invision
    phpbb
    smf1
    smf2
    xenforo1
    xenforo2
    xmb
    hackernews
    stackexchange

scrapers that are instances of `ForumExtractor` class and also:

    Extractor
    smf
    xenforo

that are instances of `ForumExtractorIdentify`.

Instances of `ForumExtractorIdentify` identify and pass requests to the `ForumExtractor` instances inside them. This means that content from the first link is downloaded regardless of whether files with finished work already exist. (So running the `get_thread` method on failures using these scrapers will cause needless redownloading, unless `forumscraper.Outputs.write_by_hash` is used.)

`Extractor` scraper has `invision`, `phpbb`, `smf`, `xenforo`, `xmb`, `hackernews`, `stackexchange` fields that are already initialized scrapers of declared type.

`xenforo` and `smf` have `v1` and `v2` fields that are already initialized scrapers of declared versions.

Initialization of scrapers allows specifying settings as `**kwargs`; these are kept for requests made from those scrapers.

All scrapers have the following methods:

    guess
    findroot
    get_thread
    get_user
    get_forum
    get_tag
    get_board

`ForumExtractorIdentify` scrapers additionally have `identify` method.

These methods take a url as argument, optionally already downloaded html (as `str`, `bytes` or `reliq`) and a state that allows appending output to previous results, as well as the same kind of settings used at initialization of the class, e.g.

```python
import requests
import forumscraper

ex = forumscraper.Extractor(headers={"Referer":"https://xenforo.com/community/"},timeout=20)
state = ex.guess('https://xenforo.com/community/threads/selling-and-buying-second-hand-licenses.131205/',timeout=90)

html = requests.get('https://xenforo.com/community/threads/is-it-possible-to-set-up-three-websites-with-a-second-hand-xenforo-license.222507/').text
ex.guess('https://xenforo.com/community/threads/is-it-possible-to-set-up-three-websites-with-a-second-hand-xenforo-license.222507/',html,state,timeout=40)
```

The `guess` method identifies, based on the url alone, what kind of page is being passed and calls the appropriate method, so the other methods are needed mostly for exceptional cases.

For most cases, using `Extractor` and `guess` is preferred since they work really well. The only exceptions are when a site has irregular urls so that `guess` doesn't work, or when you make a lot of calls to the same site with `output=forumscraper.Outputs.write_by_id`, e.g. trying to scrape failed urls.

The `guess` method creates a `scraper-method` field in the output that points to the function used.

Methods called from instances of `ForumExtractorIdentify` do the same, but also create a `scraper` field pointing to the `ForumExtractor` instance used. This makes it possible to avoid redownloading for identification on every call.

```python
failures = []
results = ex.guess('https://www.simplemachines.org/community/index.php',output=forumscraper.Outputs.urls|forumscraper.Outputs.data,failed=failures,undisturbed=True)
#results['scraper-method'] points to ex.smf.v2.get_board

scraper = results['scraper'] #points to ex.smf.v2

for i in failures: #try to download failed one last time
    x = i.split(' ')
    if len(x) == 4 and x[1] == 'failed':
        scraper.get_thread(x[0],state=results) #save results in 'results'
```

`identify` and `findroot` methods ignore state, even though they can take it as argument.

The `findroot` method returns `None` on failure, or the url of the root of the site (i.e. the board) reached from any link on the site; this is very useful when you have random urls and want to automatically download whole forums.

The `identify` method returns `None` on failure, or an initialized `ForumExtractor` that can scrape the given url.

The get functions and `guess` return `None` in case of failure, or a `dict` defined as

    {
        'data': {
            'boards': [],
            'tags': [],
            'forums': [],
            'threads': [],
            'users': []
        },
        'urls': {
            'threads': [],
            'users': [],
            'reactions': [],
            'forums': [],
            'tags': [],
            'boards': []
        },
        'files': {
            'boards': [],
            'tags': [],
            'forums': [],
            'threads': [],
            'users': []
        },
        'visited': set(),
        'scraper': None,
        'scraper-method': None,
    }

Where `data` field contains resulting dictionaries of data.

`urls` field contains found urls of specific type.

`files` field contains created files with results.

`visited` field contains every url visited by the scraper, which will refuse to visit them again; see the `force` setting for more info.

### Settings

At initialization of scrapers and when calling `get_` methods you can specify the same settings.

`output=forumscraper.Outputs.write_by_id|forumscraper.Outputs.urls|forumscraper.Outputs.threads` changes the behaviour of the scraper and the results returned. It takes flags from `forumscraper.Outputs`:

 - `write_by_id` - write results in json in files named by their id (beginning with `m-` in case of users) e.g. `21` `29` `m-24` `m-281`
 - `write_by_hash` - write results in json in files named by sha256 hash of their source url
 - `only_urls_threads` - do not scrape, just get urls to threads and things above them
 - `only_urls_forums` - ignore everything logging only urls to found forums, tags and boards
 - `urls`  - save url from which resources were scraped
 - `data` - save results in python dictionary
 - `threads` - scrape threads
 - `users` - scrape users
 - `reactions` - scrape reactions in threads
 - `boards` - scrape boards
 - `forums` - scrape forums
 - `tags` - scrape tags
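
The flags combine with `|`, which suggests a bitmask; below is a simplified stand-in model using `enum.Flag` (this is not the real `forumscraper.Outputs` definition, only an illustration of how the combinations behave):

```python
from enum import Flag, auto

class Outputs(Flag):
    # hypothetical subset of the flags listed above
    write_by_id = auto()
    write_by_hash = auto()
    urls = auto()
    data = auto()
    threads = auto()
    users = auto()
    reactions = auto()

# combine flags with | just as in the forumscraper examples
out = Outputs.write_by_id | Outputs.threads
print(bool(out & Outputs.threads))  # -> True
print(bool(out & Outputs.users))    # -> False
```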

Disabling `users` and `reactions` greatly speeds up getting `xenforo` and `invision` threads.

`boards`, `forums` and `tags` create files with names beginning with `b-`, `f-` and `t-` respectively, followed by the sha256 hash of the source url. These options may be useful for getting basic information about threads without downloading them.

`logger=None` and `failed=None` can be set to a list or a file to which information will be logged.

`logger` logs only urls that are downloaded.

`failed` logs failures in format:

```
RESOURCE_URL failed STATUS_CODE FAILED_URL
RESOURCE_URL failed completely STATUS_CODE FAILED_URL
```

A resource fails completely only because of its `STATUS_CODE`, e.g. `404`.
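
Based on the format above, a failure line can be parsed roughly like this (a sketch; `parse_failure` is a hypothetical helper, consistent with the `split(' ')` checks used in the library examples):

```python
def parse_failure(line):
    # "RESOURCE_URL failed STATUS_CODE FAILED_URL"            -> retryable
    # "RESOURCE_URL failed completely STATUS_CODE FAILED_URL" -> fatal
    x = line.split(' ')
    if len(x) == 4 and x[1] == 'failed':
        return {'resource': x[0], 'status': x[2], 'url': x[3], 'fatal': False}
    if len(x) == 5 and x[1] == 'failed' and x[2] == 'completely':
        return {'resource': x[0], 'status': x[3], 'url': x[4], 'fatal': True}
    return None  # not a failure line

f = parse_failure('https://a.example/t/1 failed 503 https://a.example/t/1?page=2')
print(f['fatal'])  # -> False
```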

`undisturbed=False` if set, scraper doesn't care about standard errors.

`pedantic=False` if set, the scraper fails on errors in scraping resources related to the one currently scraped, e.g. if getting users or reactions fails.

`force=False` if set, the scraper overwrites files, but will still refuse to scrape urls found in the `visited` field of the state; if you are passing state between functions and want to redownload them, set it to an empty set, e.g. `state['visited'] = set()`, before every function call.
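
A minimal illustration of clearing `visited` on a state dict between calls (the state layout here is abbreviated):

```python
# state as returned by a previous get_ call (abbreviated sketch)
state = {'visited': {'https://a.example/t/1'}, 'data': {'threads': []}}

# even with force=True the scraper skips urls already in state['visited'];
# reset it to an empty set to allow a genuine redownload on the next call
state['visited'] = set()
print(len(state['visited']))  # -> 0
```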

`max_workers=1` sets the number of threads used for scraping.

`compress_func=None` sets the compression function called when writing files; the function should accept data as `bytes` for its first argument, e.g. `gzip.compress`.
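
Any callable with that shape works; for example, wrapping `gzip.compress` (a sketch):

```python
import gzip

def compress_func(data: bytes) -> bytes:
    # the expected interface: raw bytes in, compressed bytes out
    return gzip.compress(data)

payload = b'{"threads": []}'
blob = compress_func(payload)
print(gzip.decompress(blob) == payload)  # -> True
```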

`verify=True` if set to `False` ignore ssl errors.

`timeout=120` request timeout.

`proxies={}` requests library proxies dictionary.

`headers={}` requests library headers dictionary.

`cookies={}` requests library cookies dictionary.

`user_agent=None` custom user-agent.

`wait=0` waiting time for each request.

`wait_random=0` random additional waiting time of up to the specified number of milliseconds.

`retries=3` number of retries attempted in case of failure.

`retry_wait=60` waiting time between retries.

`thread_pages_max=0` if greater than `0`, limits the number of pages traversed in threads.

`pages_max=0` limits number of pages traversed in each forum, tag or board.

`pages_max_depth=0` sets recursion limit for forums, tags and boards.

`pages_forums_max=0` limits number of forums that are processed from every page in forum or board.

`pages_threads_max=0` limits number of threads that are processed from every page in forum or tag.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "forumscraper",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "text-processing, scraper, forums, phpbb, smf, xmb, invision, xenforo",
    "author": null,
    "author_email": "Dominik Stanis\u0142aw Suchora <suchora.dominik7@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/ce/4b/0acd4c73020ae9b848751b23a9783bd8a798bc00ddce28e832bcc57f2525/forumscraper-0.1.7.tar.gz",
    "platform": null,
    "description": "# forumscraper\n\nforumscraper aims to be an universal, automatic and extensive scraper for forums.\n\n# Installation\n\n    pip install forumscraper\n\n# Supported forums\n\n- Invision Power Board (only 4.x version)\n- PhpBB (currently excluding 1.x version)\n- Simple Machines Forum\n- XenForo\n- XMB\n- Hacker News (has aggressive protection, should be used with cookies from logged in account)\n- StackExchange\n\n# Output examples\n\nAre created by `create-format-examples` script and contained in [examples](https://github.com/TUVIMEN/forumscraper/tree/master/examples) directory where they're grouped based on scraper and version. Files are in `json` format.\n\n# Usage\n\n## CLI\n\n### General\n\nDownload any kind of supported forums from `URL`s into `DIR`, creating json files for threads named by their id's, same for users but beginning with `m-` e.g. `24` `29` `m-89` `m-125`.\n\n    forumscraper --directory DIR URL1 URL2 URL3\n\nAbove behaviour is set by default `--names id`, and can be changed with `--names hash` which names files by sha256 sum of their source urls.\n\n    forumscraper --names hash --directory DIR URL\n\nBy default if files to be created are found and are not empty, function exits not overwriting them. 
This can be changed using `--force` option.\n\nforumscraper output logging information to `stdout` (can be changed with `--log FILE`) and information about failures to `stderr` (can be changed with `--failed FILE`)\n\nFailures are generally ignored but setting `--pedantic` flag stops the execution if any failure is encountered.\n\nDownload `URL`s into `DIR` using `8` threads and log failures into `failures.txt` (note that this option is specified before the `--directory` option, otherwise it would create it in the specified directory if relative path is used)\n\n    forumscraper  --failures failures.txt --threads 8 --directory DIR URL1 URL2 URL3\n\nDownload `URL`s with different scrapers\n\n    forumscraper URL1 smf URL2 URL3 .thread URL4 xenforo2.forum URL5 URL6\n\nType of scrapers can be defined inbetween `URL`s, where all following `URL`s are assigned to previous type i.e. URL1 to default, URL2 and URL3 to smf and so on.\n\nType consists of `scraper_name` followed by `.` and `function_name`.\n\n`scraper_name` can be: `all`, `invision`, `phpbb`, `hackernews`, `stackexchange`, `smf`, `smf1`, `smf2`, `xenforo`, `xenforo1`, `xenforo2`, `xmb` where `all`, `xenforo` and `smf` are instances of identification class meaning that they have to download the `URL` to identify its type which may cause redownloading of existing content if many `URL`s are passed as arguments i.e. all resources extracted from once identified type are assumed to have the same type, but passing thousands of thread `URL`s as arguments will always download them before scraping. `smf1`, `smf2`, `xenforo1`, `xenforo2` are just scrapers with assumed version.\n\n`function_name` can be: `guess`, `findroot`, `thread`, `user`, `forum`, `tag`, `board` (`board` being the main page of the forum where subforums are listed). 
`guess` guesses the other types based on the `URL`s alone, `findroot` find the main page of forum from any link on site (useful for downloading the whole forum from random urls), other names are self explainatory.\n\n`all`, `xenforo` and `smf` have also `identify` function that identifies site type.\n\n`findroot` and `identify` write results to file specified by the `--output` (by default set to `stdout`) option which is made specifically for these functions. `findroot` return url to board and url from which it was found, separated by `\\t`. `identify` return name of scraper and url from which it was identified, separated by `\\t`.\n\nDefault type is set to `all.guess` and it is so efective that the only reason to not use it is to avoid redownloading from running the same command many times which is caused by identification process when using `--names id`.\n\nTypes can also be shortened e.g. `.` is equivalent to `all.guess`, `.thread` is equivalent to `all.thread` and `xenforo` is equivalent to `xenforo.guess`.\n\nGet version\n\n    forumscraper --version\n\nGet some help (you might discover that many options are abbreviated to single letter)\n\n    forumscraper --help\n\n### Request options\n\nDownload `URL` with waiting `0.8` seconds and randomly waiting up to `400` miliseconds for each request\n\n    forumscraper --wait 0.8 --wait-random 400 URL\n\nDownload `URL` using `5` retries and waiting `120` seconds between them\n\n    forumscraper --retries 5 --retry-wait 120 URL\n\nBy default when encountered a non fatal failure (e.g. 
status code 301 and not 404) forumscraper tries 3 times waiting 60 seconds before the next attempt, setting `--retries 0` would disable retries and it's a valid (if not better) method assuming that one handles the `--failures` option correctly.\n\nDownload `URL` ignoring ssl errors with timeout set to `60` seconds and custom user-agent\n\n    forumscraper --insecure --timeout 60 --user-agent 'why are we still here?'\n\n`--proxies DICT`, `--headers DICT` and `--cookies DICT` (where `DICT` is python stringified dictionary) are directly passed to requests library.\n\n### Settings\n\n`--nothreads` doesn't download threads unless url passed is a thread.\n\n`--users` download users.\n\n`--reactions` download reactions.\n\n`--boards` creates board files.\n\n`--tags` creates tags files.\n\n`--forums` creates forums files.\n\n`--compression ALGO` compresses created files with `ALGO`, that can be `none`, `gzip`, `bzip2`, `lzma`.\n\n`--only-urls-forums` writes found forum urls to `output`, doesn't scrape.\n\n`--only-urls-threads` writes found thread urls to `output`, doesn't scrape.\n\n`--thread-pages-max NUM` and `--pages-max NUM` set max number of pages traversed in each thread and forum respectively.\n\n`--pages-max-depth NUM` sets recursion limit for forums.\n\n`--pages-forums-max NUM` limits number of forums that are processed from every page in forum.\n\n`--pages-threads-max NUM` limits number of threads that are processed from every page in forum.\n\nCombining some of the above you get:\n\n    forumscraper --thread-pages-max 1 --pages-max 1 --pages-forums-max 1 --pages-threads-max 1 URL1 URL2 URL3\n\nwhich downloads only one page in one thread from one forum found from every `URL` which is very useful for debugging.\n\n## Library\n\n### Code\n\n```python\nimport os\nimport sys\nimport forumscraper\n\nex = forumscraper.Extractor(timeout=90)\n\nthread = 
ex.guess('https://xenforo.com/community/threads/forum-data-breach.180995/',output=forumscraper.Outputs.data|forumscraper.Outputs.threads,timeout=60,retries=0) #automatically identify forum and type of page and save results\nthread['data']['threads'][0] #access the result\nthread['data']['users'] #found users are also saved into an array\n\nforum = ex.get_forum('https://xenforo.com/community/forums/off-topic.7/',output=forumscraper.Outputs.data|forumscraper.Outputs.urls|forumscraper.Outputs.threads,retries=0)  #get list of all threads and  urls from forum\nforum['data']['threads'] #access the results\nforums['urls']['threads'] #list of urls to found threads\nforums['urls']['forums'] #list of urls to found forums\n\nthreads = ex.smf.get_forum('https://www.simplemachines.org/community/index.php?board=1.0',output=forumscraper.Outputs.only_urls_threads) #gather only urls to threads without scraping data\nthreads['urls']['threads']\nthreads['urls']['forums'] #is also created\n\nforums = ex.smf.get_board('https://www.simplemachines.org/community/index.php',output=forumscraper.Outputs.only_urls_forums) #only get a list of urls to all forums\nthreads['urls']['forums']\nthreads['urls']['boards']\nthreads['urls']['tags'] #tags and boards are also gathered\n\nex.smf.get_thread('https://www.simplemachines.org/community/index.php?topic=578496.0',output=forumscraper.Outputs.only_urls_forums) #returns None\n\nos.mkdir('xenforo')\nos.chdir('xenforo')\n\nxen = forumscraper.xenforo2(timeout=30,retries=3,retry_wait=10,wait=0.4,random_wait=400,max_workers=8,output=forumscraper.Outputs.write_by_id|forumscraper.Outputs.threads)\n#specifies global config, writes output in files by their id (beginning with m- in case of users) in current directory\n#ex.xenforo.v2 is an initialized instance of forumscraper.xenforo2 with the same settings as ex\n#output by default is set to forumscraper.Outputs.write_by_id|forumscraper.Outputs.threads anyway\n\nfailures = []\nfiles = 
xen.guess('https://xenforo.com/community/',logger=sys.stdout,failed=failures, undisturbed=True)\n#failed=failures writes all the failed requests to be saved in failures array or file\n\nfor i in failures: #try to download failed one last time\n    x = i.split(' ')\n    if len(x) == 4 and x[1] == 'failed':\n        xen.get_thread(x[0],state=files) #append results\n\nfiles['files']['threads']\nfiles['files']['users'] #lists of created files\n\n#the above uses scraper that is an instance of ForumExtractor\n#if the instance of ForumExtractorIdentify before checking if the files already exist based on url the page has to be downloaded to be indentified. Because of that any getters from this class returns results with 'scraper' field pointing to the indentified scraper type and further requests should be done through that object.\n\nxen = forumscraper.xenforo2(timeout=30,retries=3,retry_wait=10,wait=0.4,random_wait=400,max_workers=8,output=forumscraper.Outputs.write_by_hash|forumscraper.Outputs.threads,undisturbed=True)\n#specifies global config, writes output in files by sha256 hash of their url in current directory\n#ex.xenforo is also an initialized forumscraper.xenforo\n\nfailures = []\nfiles = xen.guess('https://xenforo.com/community/',logger=sys.stdout,failed=failures)\nscraper = files['scraper'] #identified ForumScraper instance\n\nfor i in failures: #try to download failed one last time\n    x = i.split(' ')\n    if len(x) == 4 and x[1] == 'failed':\n        scraper.get_thread(x[0],state=files) #use of already identified class\n\nos.chdir('..')\n```\n\n### Scrapers\n\nforumscraper defines:\n\n    invision\n    phpbb\n    smf1\n    smf2\n    xenforo1\n    xenforo2\n    xmb\n    hackernews\n    stackexchange\n\nscrapers that are instances of `ForumExtractor` class and also:\n\n    Extractor\n    smf\n    xenforo\n\nthat are instances of `ForumExtractorIdentify`.\n\nInstances of `ForumExtractorIdentify` identify and pass requests to `ForumExtractor` instances in 
them. This means that content from the first link is downloaded regardless if files with finished work exist. (So running `get_thread` method on failures using these scrapers will cause needless redownloading, unless `forumscraper.Outputs.write_by_hash` is used)\n\n`Extractor` scraper has `invision`, `phpbb`, `smf`, `xenforo`, `xmb`, `hackernews`, `stackexchange` fields that are already initialized scrapers of declared type.\n\n`xenforo` and `smf` have `v1` and `v2` fields that are already initialized scrapers of declared versions.\n\nInitialization of scrapers allows to specify `**kwargs` as settings that are kept for requests made from these scrapers.\n\nAll scrapers have the following methods:\n\n    guess\n    findroot\n    get_thread\n    get_user\n    get_forum\n    get_tag\n    get_board\n\n`ForumExtractorIdentify` scrapers additionally have `identify` method.\n\nwhich take as argument url, optionally already downloaded html either as `str`, `bytes` or `reliq` and state which allows to append output to previous results, and the same type of settings used on initialization of class, e.g.\n\n```python\n    ex = forumscraper.Extractor(headers={\"Referer\":\"https://xenforo.com/community/\"},timeout=20)\n    state = ex.guess('https://xenforo.com/community/threads/selling-and-buying-second-hand-licenses.131205/',timeout=90)\n\n    html = requests.get('https://xenforo.com/community/threads/is-it-possible-to-set-up-three-websites-with-a-second-hand-xenforo-license.222507/').text\n    ex.guess('https://xenforo.com/community/threads/is-it-possible-to-set-up-three-websites-with-a-second-hand-xenforo-license.222507/',html,state,timeout=40)\n```\n`guess` method identifies based only on the url what kind of page is being passed and calls other methods so other methods are needed mostly for exceptions.\n\nFor most cases using `Extractor` and `guess` is preferred since they work really well. 
The only exceptions are if site has irregular urls so that `guess` doesn't work, or if you make a lot of calls to the same site with `output=forumscraper.Outputs.write_by_id` e.g. trying to scraper failed urls.\n\n`guess` method creates `scraper-method` field in output that is pointing to function used.\n\nMethods called from instances of `ForumExtractorIdentify` do the same, but also create `scraper` field pointing to instance of `ForumExtractor` used. This allows to circumvent the need of redownloading for each call just for identification.\n\n```python\nfailures = []\nresults = ex.guess('https://www.simplemachines.org/community/index.php',output=forumscraper.Outputs.urls|forumscraper.Outputs.data,failed=failures,undisturbed=True)\n#results['scraper-method'] points to ex.smf.v2.get_board\n\nscraper = results['scraper'] #points to ex.smf.v2\n\nfor i in failures: #try to download failed one last time\n    x = i.split(' ')\n    if len(x) == 4 and x[1] == 'failed':\n        scraper.get_thread(x[0],state=results) #save results in 'results'\n```\n\n`identify` and `findroot` methods ignore state, even though they can take it as argument.\n\n`findroot` method returns `None` on failure or url to the root of the site (i.e. 
board) from any link on the site. This is very useful when you have some random urls and want to automatically download the whole forum.

The `identify` method returns `None` on failure, or an initialized `ForumExtractor` that can scrape the given url.

The get functions and `guess` return `None` in case of failure, or a `dict` defined as

    {
        'data': {
            'boards': [],
            'tags': [],
            'forums': [],
            'threads': [],
            'users': []
        },
        'urls': {
            'threads': [],
            'users': [],
            'reactions': [],
            'forums': [],
            'tags': [],
            'boards': []
        },
        'files': {
            'boards': [],
            'tags': [],
            'forums': [],
            'threads': [],
            'users': []
        },
        'visited': set(),
        'scraper': None,
        'scraper-method': None,
    }

The `data` field contains the resulting dictionaries of data.

The `urls` field contains found urls of each type.

The `files` field contains the files created with results.

The `visited` field contains every url visited by the scraper, which will refuse to visit them again; see the `force` setting for more info.

### Settings

At initialization of scrapers and in calls to `get_` methods you can specify the same settings.

`output=forumscraper.Outputs.write_by_id|forumscraper.Outputs.urls|forumscraper.Outputs.threads` changes the behaviour of the scraper and the results returned.
It takes flags from `forumscraper.Outputs`:

 - `write_by_id` - write results in json in files named by their id (beginning with `m-` in the case of users), e.g. `21` `29` `m-24` `m-281`
 - `write_by_hash` - write results in json in files named by the sha256 hash of their source url
 - `only_urls_threads` - do not scrape, just get urls to threads and things above them
 - `only_urls_forums` - ignore everything, logging only urls to found forums, tags and boards
 - `urls` - save the url from which each resource was scraped
 - `data` - save results in a python dictionary
 - `threads` - scrape threads
 - `users` - scrape users
 - `reactions` - scrape reactions in threads
 - `boards` - scrape boards
 - `forums` - scrape forums
 - `tags` - scrape tags

Disabling `users` and `reactions` greatly speeds up getting `xenforo` and `invision` threads.

`boards`, `forums` and `tags` create files with names beginning with `b-`, `f-` and `t-` respectively, followed by the sha256 hash of the source url. These options may be useful for getting basic information about threads without downloading them.

`logger=None` and `failed=None` can be set to a list or file to which information will be logged.

`logger` logs only urls that are downloaded.

`failed` logs failures in the format:

```
RESOURCE_URL failed STATUS_CODE FAILED_URL
RESOURCE_URL failed completely STATUS_CODE FAILED_URL
```

A resource fails completely only because of a `STATUS_CODE`, e.g. `404`.

`undisturbed=False` if set, the scraper doesn't care about standard errors.

`pedantic=False` if set, the scraper fails because of errors in scraping resources related to the one currently scraped, e.g. if getting users or reactions fails.

`force=False` if set, the scraper overwrites files, but will still refuse to scrape urls found in the `visited` field of state; if you are passing state between functions and want to redownload them, you will have to reset it to an empty set, e.g.
`state['visited'] = set()` before every function call.

`max_workers=1` sets the number of threads used for scraping.

`compress_func=None` sets a compression function that will be called when writing to files; the function should accept data in `bytes` as its first argument, e.g. `gzip.compress`.

`verify=True` if set to `False`, ignore ssl errors.

`timeout=120` request timeout.

`proxies={}` requests library proxies dictionary.

`headers={}` requests library headers dictionary.

`cookies={}` requests library cookies dictionary.

`user_agent=None` custom user-agent.

`wait=0` waiting time for each request.

`wait_random=0` random waiting time of up to the specified number of milliseconds.

`retries=3` number of retries attempted in case of failure.

`retry_wait=60` waiting time between retries.

`thread_pages_max=0` if greater than `0`, limits the number of pages traversed in threads.

`pages_max=0` limits the number of pages traversed in each forum, tag or board.

`pages_max_depth=0` sets the recursion limit for forums, tags and boards.

`pages_forums_max=0` limits the number of forums that are processed from every page in a forum or board.

`pages_threads_max=0` limits the number of threads that are processed from every page in a forum or tag.
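To illustrate `compress_func`: any callable taking `bytes` as its first argument works, and `gzip.compress` from the standard library qualifies. A minimal sketch of the write-and-read-back roundtrip such a setting implies (the `record` data is made up for illustration):

```python
import gzip
import json

# compress_func must accept bytes as its first argument; gzip.compress fits.
record = {"id": 21, "title": "example thread"}  # made-up result data
raw = json.dumps(record).encode()
compressed = gzip.compress(raw)          # what would be written to disk
restored = json.loads(gzip.decompress(compressed))  # reading such a file back
assert restored == record
```

Passing `compress_func=gzip.compress` at scraper initialization then applies this compression to every result file the scraper writes.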