mediacloud-metadata

Name	mediacloud-metadata
Version	1.4.1
home_page	None
Summary	Media Cloud news article metadata extraction
upload_time	2024-12-22 02:53:27
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	None
keywords
VCS
bugtrack_url
requirements	No requirements were recorded.

Media Cloud Metadata Extractor
==============================

This package extracts the domain, title, publication date, text content, and language from the URL or HTML text of an
online news story. The extraction methods come from the larger [Media Cloud project](https://mediacloud.org)
and build on numerous third-party libraries. The extracted metadata includes:

* the original URL of publication
* a normalized URL useful for de-duplication
* the canonical domain published on
* the date of publication
* the primary language used in the article text
* the title of the article
* a normalized title useful for de-duplication
* the text content of the news article
* the name of the library used to extract the article content

Other frequently reused methods and configuration related to the Media Cloud service also live in this package.
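
To illustrate what "a normalized URL useful for de-duplication" can mean, here is a minimal sketch using only the standard library. It is not the package's own normalization logic; the function name `normalize_url_sketch` and the specific rules (ignoring scheme, `www.`, port, fragment, and `utm_*` parameters) are assumptions chosen for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query-parameter prefixes treated as tracking noise.
TRACKING_PREFIXES = ("utm_",)


def normalize_url_sketch(url: str) -> str:
    """Reduce a URL to a comparable form (illustrative rules only)."""
    parts = urlsplit(url.strip())
    # Lowercase the host and drop a leading "www."; the port is discarded.
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop tracking parameters and sort the rest so ordering does not matter.
    query_pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(query_pairs))
    # Treat "/story/" and "/story" as the same path.
    path = parts.path.rstrip("/")
    # Use one fixed scheme so http and https variants compare equal,
    # and drop the fragment entirely.
    return urlunsplit(("http", host, path, query, ""))
```

Two URLs that differ only in scheme, `www.` prefix, tracking parameters, or fragment then map to the same string, which can serve as a de-duplication key.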


Installation
------------

`pip install mediacloud-metadata`

Usage
-----

If you pass in a URL, it will follow redirects and fetch the HTML for you.

```python
from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path")
```

You can also pass in HTML you already have on hand. In this case it is still useful to pass in the URL,
because some of the metadata extraction relies on it.

```python
from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path",
                   html_text="<html><head><title>my webpage ... </html>")
```
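
When the HTML is already saved on disk, the call above can be wrapped in a small helper. This is a sketch, assuming `extract` accepts the `url` and `html_text` keyword arguments shown above; `extract_from_saved_html` and its `extract_fn` parameter are hypothetical names, not part of the package.

```python
from pathlib import Path
from typing import Any, Callable


def extract_from_saved_html(path: str, url: str,
                            extract_fn: Callable[..., Any]) -> Any:
    """Read HTML from a file and hand it, with its URL, to an extractor.

    extract_fn is expected to behave like mcmetadata.extract, i.e. to accept
    `url` and `html_text` keyword arguments (an assumption based on the
    usage shown above).
    """
    html_text = Path(path).read_text(encoding="utf-8")
    return extract_fn(url=url, html_text=html_text)


# Example usage (requires the package to be installed):
# from mcmetadata import extract
# metadata = extract_from_saved_html("story.html",
#                                    "https://my.awesome.news/story-path",
#                                    extract)
```

Passing the extractor in as an argument keeps the helper easy to test without network access or the real library.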

Development
-----------

If you are interested in adding code to this module, first clone the GitHub repository.

### Installing

* `flit install`
* `pre-commit install`

### Testing

`pytest`

### Distributing a New Version

1. Run `pytest` to make sure all the tests pass
2. Update the version number in `pyproject.toml`
3. Make a brief note in `CHANGELOG.md` about what changed
4. Commit the changes
5. Tag the commit with a semantic version number - `v*.*.*`
6. Push the repo to GitHub
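
Before pushing, it can help to confirm that the tag follows the `v*.*.*` pattern and agrees with the version in `pyproject.toml`. The check below is a minimal sketch; `tag_matches_version` is a hypothetical helper, not part of this repository, and it only handles plain `major.minor.patch` versions.

```python
import re

# Matches tags such as "v1.4.1": a "v" followed by three dot-separated integers.
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def tag_matches_version(tag: str, version: str) -> bool:
    """Return True if `tag` has the form v*.*.* and equals "v" + version."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        return False
    return ".".join(match.groups()) == version
```

For example, `tag_matches_version("v1.4.1", "1.4.1")` is true, while a tag missing the leading `v` or carrying a different number is rejected.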

#### Test Cache

Tests are run against fixtures by default. This can be changed by passing `--use-cache=False` when running tests.
When adding new tests, re-run `scripts/get-test-web-content.py`.
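
For readers unfamiliar with custom pytest flags, the sketch below shows one way an option like `--use-cache` could be registered in a `conftest.py` and read back as a boolean. It is an assumption about how such a flag might be wired, not a copy of this repository's test configuration; `_to_bool` and `use_cache_enabled` are hypothetical helper names.

```python
def _to_bool(value) -> bool:
    """Interpret command-line strings such as "False" or "0" as False."""
    return str(value).strip().lower() not in ("false", "0", "no")


def pytest_addoption(parser):
    """pytest hook: register a --use-cache option that defaults to enabled."""
    parser.addoption(
        "--use-cache",
        action="store",
        default="True",
        help="Run tests against cached fixtures (pass False to fetch live content)",
    )


def use_cache_enabled(config) -> bool:
    """Read the option from a pytest config object and convert it to a bool."""
    return _to_bool(config.getoption("--use-cache"))
```

With a setup like this, running `pytest --use-cache=False` would make `use_cache_enabled` return `False`, and tests could then choose between fixture files and live requests.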


Contributors
------------

Created as part of the Media Cloud Project. Contributors include:
* Rahul Bhargava (Media Cloud, Northeastern University)
* Paige Gulley (Media Cloud)
* Phil Budne (Media Cloud)
* Vangelis Banos (Internet Archive)

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "mediacloud-metadata",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": null,
    "author_email": "Rahul Bhargava <rahul@mediacloud.org>",
    "download_url": "https://files.pythonhosted.org/packages/0d/a3/ea561111a68191145f4343e0d5a321eca395556d6ed8c6d0e03615d063e1/mediacloud_metadata-1.4.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Media Cloud news article metadata extraction",
    "version": "1.4.1",
    "project_urls": {
        "Bug Tracker": "https://github.com/mediacloud/meta-extractor/issues",
        "Homepage": "https://mediacloud.org",
        "Source Code": "https://github.com/mediacloud/meta-extractor"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a72e5b7b521fdddbae4b61d48bb1431d3afc8847ef19dd5285036b3d72663d15",
                "md5": "d685bfcdfc02a12651e985fb0b74140f",
                "sha256": "c8537ffe9a3e29851e234b12b0619a203161a3cfb6052af499dc6b1a6577af97"
            },
            "downloads": -1,
            "filename": "mediacloud_metadata-1.4.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d685bfcdfc02a12651e985fb0b74140f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 8830376,
            "upload_time": "2024-12-22T02:53:24",
            "upload_time_iso_8601": "2024-12-22T02:53:24.391658Z",
            "url": "https://files.pythonhosted.org/packages/a7/2e/5b7b521fdddbae4b61d48bb1431d3afc8847ef19dd5285036b3d72663d15/mediacloud_metadata-1.4.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0da3ea561111a68191145f4343e0d5a321eca395556d6ed8c6d0e03615d063e1",
                "md5": "498c26869e0ff86e80f30a7f5078fca8",
                "sha256": "f4498a0f3e50e10427bac1b3a5165cfea68e0aab8c2c0387ed16518ec60ed93e"
            },
            "downloads": -1,
            "filename": "mediacloud_metadata-1.4.1.tar.gz",
            "has_sig": false,
            "md5_digest": "498c26869e0ff86e80f30a7f5078fca8",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 8728867,
            "upload_time": "2024-12-22T02:53:27",
            "upload_time_iso_8601": "2024-12-22T02:53:27.569737Z",
            "url": "https://files.pythonhosted.org/packages/0d/a3/ea561111a68191145f4343e0d5a321eca395556d6ed8c6d0e03615d063e1/mediacloud_metadata-1.4.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-22 02:53:27",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mediacloud",
    "github_project": "meta-extractor",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "mediacloud-metadata"
}
