cellar-extractor


- Name: cellar-extractor
- Version: 1.0.62
- Summary: Library for extracting cellar data
- Upload time: 2023-10-18 14:47:50
- Author: LawTech Lab
- License: MIT
- Keywords: cellar, extractor
- Requirements: xmltodict, requests, bs4, lxml

## Cellar extractor
This library contains functions to get cellar case law data from EUR-Lex, and to filter and analyse the citations in the results.

## Version
Python 3.9

## Contributors

<!-- readme: contributors,gijsvd -start -->
<table>
<tr>
    <td align="center">
        <a href="https://github.com/pranavnbapat">
            <img src="https://avatars.githubusercontent.com/u/7271334?v=4" width="100;" alt="pranavnbapat"/>
            <br />
            <sub><b>Pranav Bapat</b></sub>
        </a>
    </td>
    <td align="center">
        <a href="https://github.com/Cloud956">
            <img src="https://avatars.githubusercontent.com/u/24865274?v=4" width="100;" alt="Cloud956"/>
            <br />
            <sub><b>Piotr Lewandowski</b></sub>
        </a>
    </td>
    <td align="center">
        <a href="https://github.com/shashankmc">
            <img src="https://avatars.githubusercontent.com/u/3445114?v=4" width="100;" alt="shashankmc"/>
            <br />
            <sub><b>shashankmc</b></sub>
        </a>
    </td>
    <td align="center">
        <a href="https://github.com/gijsvd">
            <img src="https://avatars.githubusercontent.com/u/31765316?v=4" width="100;" alt="gijsvd"/>
            <br />
            <sub><b>gijsvd</b></sub>
        </a>
    </td>
</tr>
</table>
<!-- readme: contributors,gijsvd -end -->

## How to install?
<code>pip install cellar-extractor</code>

## What are the functions?
<ol>
    <li><code>get_cellar</code></li>
    Gets all the ECLI data from the EUR-Lex SPARQL endpoint and returns it in CSV or JSON format, either in-memory or as a saved file.
    <br>
    <li><code>get_cellar_extra</code></li>
    Gets all the ECLI data from the EUR-Lex SPARQL endpoint and, on top of that, scrapes the EUR-Lex website to acquire 
    the full text, keywords, case law directory code and EuroVoc identifiers. Users who have a EUR-Lex account with access to the EUR-Lex webservices can also 
    pass their webservices login credentials to the method, in order to extract data about the works citing a work and the works 
    cited by it. The full text is returned as a JSON file, the rest of the data as a CSV. Both can be returned in-memory or saved as files.
    <li><code>get_nodes_and_edges_lists</code></li>
    Gets two list objects, one for the nodes and one for the edges of the citations within the passed dataframe.
    Allows the creation of a network graph of the citations. Can only be returned in-memory.
    <li><code>filter_subject_matter</code></li>
    Returns a dataframe containing only the cases whose subject matter column contains a given phrase.
    <br>
</ol>
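
The filtering idea behind <code>filter_subject_matter</code> can be illustrated without the library. The following is a minimal sketch only, not the library's implementation: it uses a list of dictionaries in place of a DataFrame, the column name `subject_matter` and the helper `filter_rows_by_phrase` are assumptions, and the ECLI values are made-up sample data.

```python
# Stand-in for a DataFrame: a list of row dictionaries.
# The column name "subject_matter" is an assumption made for this sketch.
def filter_rows_by_phrase(rows, phrase, column="subject_matter"):
    """Keep the rows whose `column` contains `phrase`, ignoring case."""
    needle = phrase.lower()
    kept = []
    for row in rows:
        value = row.get(column) or ""  # treat a missing or None subject as empty
        if needle in str(value).lower():
            kept.append(row)
    return kept


# Made-up sample rows, for illustration only.
cases = [
    {"ecli": "ECLI:EU:C:2022:1", "subject_matter": "Competition; State aid"},
    {"ecli": "ECLI:EU:C:2022:2", "subject_matter": "Free movement of goods"},
    {"ecli": "ECLI:EU:C:2022:3", "subject_matter": None},
]

matches = filter_rows_by_phrase(cases, "state AID")
print([row["ecli"] for row in matches])
```

Lower-casing both the phrase and the cell value is what makes the match case insensitive; rows without a subject are simply skipped rather than raising an error.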

## What are the parameters?
<ol>
    <li><code>get_cellar</code></li>
    <strong>Parameters:</strong>
    <ul>
        <li><strong>max_ecli: int, optional, default 100</strong></li>
        Maximum number of ECLIs to retrieve.
        <li><strong>sd: date, optional, default '2022-05-01'</strong></li>
        Start of the last-modification date range (yyyy-mm-dd).
        <li><strong>ed: date, optional, default current date</strong></li>
        End of the last-modification date range (yyyy-mm-dd).
        <li><strong>save_file: ['y', 'n'],optional, default 'y'</strong></li>
        Save data in a data folder, or return in-memory.
        <li><strong>file_format: ['csv', 'json'],optional, default 'csv'</strong></li>
        Returns the data as a JSON/dictionary, or as a CSV/Pandas Dataframe object.
    </ul>
    <li><code>get_cellar_extra</code></li>
    <ul> 
        <li><strong>max_ecli: int, optional, default 100</strong></li>
        Maximum number of ECLIs to retrieve.
        <li><strong>sd: date, optional, default '2022-05-01'</strong></li>
        Start of the last-modification date range (yyyy-mm-dd).
        <li><strong>ed: date, optional, default current date</strong></li>
        End of the last-modification date range (yyyy-mm-dd).
        <li><strong>save_file: ['y', 'n'],optional, default 'y'</strong></li>
        If 'y', saves the full text of cases as a JSON file and the rest of the data as a CSV file.
        If 'n', returns the full text as a dictionary and the rest of the data as a Pandas DataFrame object.
        <li><strong>threads: int ,optional, default 10</strong></li>
        Extracting the additional data takes a lot of time. Multi-threading can cut this time down, but
        even so the method may take a couple of minutes for a couple of hundred cases. A maximum
        of 10 threads is recommended, as this method may also affect the device's internet connection.
        <li><strong>username: string, optional, default empty string</strong></li>
        The username to the eurlex webservices.
        <li><strong>password: string, optional, default empty string</strong></li>
        The password to the eurlex webservices.
        <br>
    </ul>
    <li><code>get_nodes_and_edges_lists</code></li>
    <ul>
        <li><strong>df: DataFrame object, required, default None</strong></li>
        DataFrame of cellar metadata acquired from the get_cellar_extra method with eurlex webservice credentials passed.
        This method will only work on dataframes with citations data.
        <li><strong>only_local: boolean, optional, default False</strong></li>
        Flag for nodes and edges generation. If set to True, the network created will only include nodes and edges between 
        cases exclusively inside the given dataframe.
    </ul>
    <li><code>filter_subject_matter</code></li>
    <ul>
        <li><strong>df: DataFrame object, required, default None</strong></li>
        DataFrame of cellar metadata acquired from any of the cellar extraction methods listed above.
        <li><strong>phrase: string, required, default None</strong></li>
        The phrase which has to be present in the subject matter of cases. Case insensitive.
    </ul>
</ol>
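
To make the effect of <code>only_local</code> concrete, here is a minimal sketch of building node and edge lists from citation data. It is an illustration under stated assumptions, not the library's code: the input is a plain dictionary mapping each case in the dataset to the cases it cites, and the helper name `build_nodes_and_edges` and the case identifiers are hypothetical.

```python
def build_nodes_and_edges(citations, only_local=False):
    """Build node and edge lists from a {citing_case: [cited_cases]} mapping.

    With only_local=True, edges to cases outside the mapping are dropped,
    so the network only links cases that are themselves in the dataset.
    """
    local_ids = set(citations)
    nodes = set(local_ids)
    edges = []
    for source, targets in citations.items():
        for target in targets:
            if only_local and target not in local_ids:
                continue  # skip citations that point outside the dataset
            nodes.add(target)
            edges.append((source, target))
    return sorted(nodes), edges


# Hypothetical citation data: case "X" is cited but is not part of the dataset.
citations = {"A": ["B", "X"], "B": ["A"], "C": []}

all_nodes, all_edges = build_nodes_and_edges(citations)
local_nodes, local_edges = build_nodes_and_edges(citations, only_local=True)
print(all_nodes, all_edges)
print(local_nodes, local_edges)
```

With the flag off, the external case "X" appears as a node and as the target of an edge; with the flag on, both disappear while the isolated case "C" is still kept as a node.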


## Examples
```
import cellar_extractor as cell

# Examples that save the results to files:
cell.get_cellar(save_file='y', max_ecli=200, sd='2022-01-01', file_format='csv')
cell.get_cellar_extra(max_ecli=100, sd='2022-01-01', threads=10)

# Examples that return the results in memory:
df = cell.get_cellar(save_file='n', file_format='csv', sd='2022-01-01', max_ecli=1000)
# get_cellar_extra returns the metadata DataFrame and the full-text data
df, full_text = cell.get_cellar_extra(save_file='n', max_ecli=100, sd='2022-01-01', threads=10)
```
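
The examples above do not cover the citation and filtering functions. The sketch below chains them using only the parameter names documented in this README. Two points are assumptions rather than documented behaviour: that `get_nodes_and_edges_lists` returns the nodes list first and the edges list second, and that the functions accept these parameters as keyword arguments. The wrapper `build_filtered_network` is a hypothetical helper, and running it requires valid EUR-Lex webservices credentials and network access.

```python
def build_filtered_network(username, password, phrase, max_ecli=100):
    """Fetch cases with citation data, filter by subject, build a local network."""
    # Imported here so that defining the helper does not require the package.
    import cellar_extractor as cell

    # Citation data is only extracted when webservices credentials are passed.
    df, full_text = cell.get_cellar_extra(
        save_file='n',
        max_ecli=max_ecli,
        sd='2022-01-01',
        threads=10,
        username=username,
        password=password,
    )

    # Keep only the cases whose subject matter contains the phrase.
    filtered = cell.filter_subject_matter(df=filtered_input(df), phrase=phrase)

    # Assumed return order: nodes first, then edges.
    nodes, edges = cell.get_nodes_and_edges_lists(df=filtered, only_local=True)
    return nodes, edges, full_text


def filtered_input(df):
    """Hypothetical hook for any pre-processing; returns the DataFrame unchanged."""
    return df
```

A call such as `build_filtered_network('my_user', 'my_password', 'competition')` would then yield the node list, the edge list and the full-text data for the matching cases.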


## License
[![License: Apache 2.0](https://img.shields.io/github/license/maastrichtlawtech/extraction_libraries)](https://opensource.org/licenses/Apache-2.0)

Previously under the [MIT License](https://opensource.org/licenses/MIT), as of 28/10/2022 this work is licensed under the [Apache License, Version 2.0](https://opensource.org/licenses/Apache-2.0).
```
Apache License, Version 2.0

Copyright (c) 2022 Maastricht Law & Tech Lab

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    
    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "cellar-extractor",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "cellar,extractor",
    "author": "LawTech Lab",
    "author_email": "p.lewandowski@student.maastrichtuniversity.nl",
    "download_url": "https://files.pythonhosted.org/packages/25/e3/5574e8db740a0982a35747f7e483c40a8e96adbc338eeedf2f4631c69e0f/cellar_extractor-1.0.62.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Library for extracting cellar data",
    "version": "1.0.62",
    "project_urls": {
        "Bug Tracker": "https://github.com/maastrichtlawtech/extraction_libraries",
        "Build Source": "https://github.com/maastrichtlawtech/extraction_libraries"
    },
    "split_keywords": [
        "cellar",
        "extractor"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "25e35574e8db740a0982a35747f7e483c40a8e96adbc338eeedf2f4631c69e0f",
                "md5": "dd6e5de286fd33f9c75bebe6d0e58738",
                "sha256": "7c8c007c2ee9b7e486df9aab8daec9119c92e58a68534dee62afded207f9f229"
            },
            "downloads": -1,
            "filename": "cellar_extractor-1.0.62.tar.gz",
            "has_sig": false,
            "md5_digest": "dd6e5de286fd33f9c75bebe6d0e58738",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 21898,
            "upload_time": "2023-10-18T14:47:50",
            "upload_time_iso_8601": "2023-10-18T14:47:50.077834Z",
            "url": "https://files.pythonhosted.org/packages/25/e3/5574e8db740a0982a35747f7e483c40a8e96adbc338eeedf2f4631c69e0f/cellar_extractor-1.0.62.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-10-18 14:47:50",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "maastrichtlawtech",
    "github_project": "extraction_libraries",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "xmltodict",
            "specs": []
        },
        {
            "name": "requests",
            "specs": []
        },
        {
            "name": "bs4",
            "specs": []
        },
        {
            "name": "lxml",
            "specs": []
        }
    ],
    "lcname": "cellar-extractor"
}
        