etudier

Name	etudier
Version	0.2.0
home_page	https://github.com/edsu/etudier
Summary	Collect a citation graph from Google Scholar
upload_time	2023-01-04 10:13:15
maintainer
docs_url	None
author	Ed Summers
requires_python	>=3

![Étudier in Action](figure.gif)

*étudier* is a small Python program that uses [Selenium], [requests-html] and
[networkx] to drive a *non-headless* browser to collect a citation graph around
a particular [Google Scholar] citation or set of search results. The resulting
network is written out as [GEXF] and [GraphML] files as well as an HTML file
that includes a [D3] network visualization (pictured above).

If you are wondering why it uses a non-headless browser, it's because Google is
[quite protective] of this data and will routinely ask you to solve a captcha
(identifying street signs, cars, etc. in photos) to prove you are not a bot.
*étudier* lets you complete these captcha tasks when they occur and then
continues on its way collecting data. You need a visible browser window to
interact with in order to do your part.

Install
-------

You'll need to install [ChromeDriver] before doing anything else. If you use
Homebrew on macOS this is as easy as:

    brew install --cask chromedriver

Then you'll want to install [Python 3] and:

    pip3 install etudier

Run
---

To use *étudier* you first need to navigate to a page on Google Scholar that you
are interested in. For example, here is the page of citations that reference
Sherry Ortner's [Theory in Anthropology since the Sixties]. Then you start
*étudier* up pointed at that page:

    % etudier 'https://scholar.google.com/scholar?start=0&hl=en&as_sdt=20000005&sciodt=0,21&cites=17950649785549691519&scipsc='

If you are interested in starting with keyword search results in Google Scholar
you can do that too. For example, here is the URL for searching for "cscw memory",
which you might use if you were interested in papers that talk about the CSCW
conference and memory:

    % etudier 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C21&q=cscw+memory&btnG='

Note: it's important to quote the URL so that the shell doesn't interpret the
ampersands as an attempt to background the process.
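
The same quoting concern applies if you assemble the command line from a Python script. The following is a minimal sketch using the standard library's `shlex.quote`; the `build_command` helper is hypothetical and is not part of *étudier*:

```python
import shlex


def build_command(url, pages=None, depth=None):
    """Return a shell-safe etudier command line (hypothetical helper)."""
    parts = ["etudier"]
    if pages is not None:
        parts += ["--pages", str(pages)]
    if depth is not None:
        parts += ["--depth", str(depth)]
    # The URL contains '?' and '&', so shlex.quote wraps it in single quotes.
    parts.append(shlex.quote(url))
    return " ".join(parts)


url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C21&q=cscw+memory&btnG="
print(build_command(url, pages=2))
```

If you start the program with `subprocess.run` and pass the arguments as a list (without `shell=True`), no shell is involved and the URL needs no quoting at all.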

### --pages

By default *étudier* will collect the 10 citations on that page and then look at
the top 10 citations that reference each one. So you will end up with no more
than 100 citations being collected (10 on each page * 10 citations).

If you would like to get more than one page of results use the `--pages` option.
For example, this would result in no more than 400 (20 * 20) results being
collected:

    % etudier --pages 2 'https://scholar.google.com/scholar?start=0&hl=en&as_sdt=20000005&sciodt=0,21&cites=17950649785549691519&scipsc='

### --depth

And finally, if you would like to look at the citations of the citations, use the
`--depth` option:

    % etudier --depth 2 'https://scholar.google.com/scholar?start=0&hl=en&as_sdt=20000005&sciodt=0,21&cites=17950649785549691519&scipsc='

This will collect the initial set of 10 citations, the top 10 citations for
each, and then the top 10 citations of each of those, so no more than 1000
citations (10 * 10 * 10). The real total will be lower than this upper bound
because there is certain to be some cross-citation duplication.
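
The upper bounds quoted for `--pages` and `--depth` follow one pattern, which can be sketched as a small calculation. This is only an illustration that reproduces the figures given in this README; it assumes 10 results per Google Scholar page and a default depth of 1, and the `max_citations` function is hypothetical rather than part of *étudier*:

```python
def max_citations(pages=1, depth=1, per_page=10):
    """Upper bound on collected citations, matching the README's figures.

    Assumes 10 results per Scholar page and a default depth of 1.
    """
    results_per_level = per_page * pages
    # One factor for the starting page(s), plus one per level of depth.
    return results_per_level ** (depth + 1)


print(max_citations())          # defaults: 100
print(max_citations(pages=2))   # the --pages 2 example: 400
print(max_citations(depth=2))   # the --depth 2 example: 1000
```

Because the bound grows as a power of the depth, and Google may interrupt with a captcha along the way, it is worth starting with small values.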

### --output

By default `output.gexf`, `output.graphml` and `output.html` files will be
written to the current working directory, but you can use the `--output`
option to change the prefix that is used. The output files will contain
rudimentary metadata collected from Google Scholar including:

- *id* - the cluster identifier assigned by Google
- *url* - the url for the publication
- *title* - the title of the publication
- *authors* - a comma separated list of the publication authors
- *year* - the year of publication
- *cited-by* - the number of other publications that cite the publication
- *cited-by-url* - a Google Scholar URL for the list of citing publications
- *modularity* - the modularity value obtained from community detection
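
Since GraphML is plain XML, the node metadata listed above can be read back with the Python standard library alone. The following is a minimal sketch under stated assumptions: the sample document is made up for illustration, the real output may differ in detail, and `read_nodes` is a hypothetical helper, not part of *étudier*:

```python
import xml.etree.ElementTree as ET

NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Made-up sample in GraphML form; real output files will differ in detail.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="title" attr.type="string"/>
  <key id="d1" for="node" attr.name="year" attr.type="string"/>
  <graph edgedefault="directed">
    <node id="a"><data key="d0">Example paper A</data><data key="d1">1984</data></node>
    <node id="b"><data key="d0">Example paper B</data><data key="d1">1990</data></node>
    <edge source="b" target="a"/>
  </graph>
</graphml>
"""


def read_nodes(graphml_text):
    """Map each node id to a dict of its named attributes."""
    root = ET.fromstring(graphml_text)
    # <key> elements translate short data keys (d0, d1, ...) to attribute names.
    names = {
        key.get("id"): key.get("attr.name")
        for key in root.findall("g:key", NS)
        if key.get("for") == "node"
    }
    nodes = {}
    for node in root.findall("g:graph/g:node", NS):
        nodes[node.get("id")] = {
            names[data.get("key")]: data.text
            for data in node.findall("g:data", NS)
            if data.get("key") in names
        }
    return nodes


for node_id, attrs in read_nodes(SAMPLE).items():
    print(node_id, attrs)
```

To try it on a real file, pass in the contents of `output.graphml`. If you already have networkx installed, its `read_graphml` function loads the same file as a graph object.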

Features of HTML/D3 output
--------------------------

- Node color shows its citation group
- Node size shows how many times it has been cited
- Click a node to open its source website
- Draggable nodes
- Zoom and pan
- Double-click to center a node
- Resizable window
- Text labels
- Hover to highlight the 1st-order neighborhood
- Click and hold a node to fade its surroundings

[Theory in Anthropology since the Sixties]: https://scholar.google.com/scholar?hl=en&as_sdt=20000005&sciodt=0,21&cites=17950649785549691519&scipsc=
[Google Scholar]: https://scholar.google.com
[Selenium]: https://docs.seleniumhq.org/
[requests-html]: http://html.python-requests.org/
[quite protective]: https://www.quora.com/Are-there-technological-or-logistical-challenges-that-explain-why-Google-does-not-have-an-official-API-for-Google-Scholar
[GEXF]: https://gephi.org/
[GraphML]: https://networkx.org/documentation/stable/reference/readwrite/graphml.html
[networkx]: https://networkx.github.io/
[D3]: https://d3js.org/
[Python 3]: https://www.python.org/downloads/
[ChromeDriver]: https://sites.google.com/a/chromium.org/chromedriver/

Raw data

{
    "_id": null,
    "home_page": "https://github.com/edsu/etudier",
    "name": "etudier",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3",
    "maintainer_email": "",
    "keywords": "",
    "author": "Ed Summers",
    "author_email": "ehs@pobox.com",
    "download_url": "https://files.pythonhosted.org/packages/55/ac/37983a814ca0346be96a3fe1d53f51dee9d0bb9cf48abdca42458b78bb34/etudier-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Collect a citation graph from Google Scholar",
    "version": "0.2.0",
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5d6646e9caa4d560abf762585f8d04770996c74a7cc794cf86102f1da9452639",
                "md5": "a638dd1f4152e1fc01d3e9d5f2412d9b",
                "sha256": "d01e7df5e7f05a55278ee8401b1f1ac18b7548160b5d1d3393f91063acc7b4d2"
            },
            "downloads": -1,
            "filename": "etudier-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a638dd1f4152e1fc01d3e9d5f2412d9b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3",
            "size": 9666,
            "upload_time": "2023-01-04T10:13:14",
            "upload_time_iso_8601": "2023-01-04T10:13:14.884873Z",
            "url": "https://files.pythonhosted.org/packages/5d/66/46e9caa4d560abf762585f8d04770996c74a7cc794cf86102f1da9452639/etudier-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "55ac37983a814ca0346be96a3fe1d53f51dee9d0bb9cf48abdca42458b78bb34",
                "md5": "9147a47bd00b942d4e7e0b95d261fc5e",
                "sha256": "6ee4c4b09a889b8bd6cb9bc6fb0abca174ecfa83d00f7b88419a7740c844d0d8"
            },
            "downloads": -1,
            "filename": "etudier-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "9147a47bd00b942d4e7e0b95d261fc5e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3",
            "size": 11205,
            "upload_time": "2023-01-04T10:13:15",
            "upload_time_iso_8601": "2023-01-04T10:13:15.896767Z",
            "url": "https://files.pythonhosted.org/packages/55/ac/37983a814ca0346be96a3fe1d53f51dee9d0bb9cf48abdca42458b78bb34/etudier-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-01-04 10:13:15",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "edsu",
    "github_project": "etudier",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "etudier"
}
