longitudinal-trends

Name: longitudinal-trends
Version: 0.1.8
Summary: Generate long-term longitudinal Google Trends Data
Author: Mohammad Saleh Ahsan Sakir, Taeyong Park
Homepage: https://github.com/Mohammad-sakir/longitudinalTrends
License: MIT
Requires Python: >=3.7
Keywords: longitudinal-trends, longitudinal, data, python, google-trends-api, google-trends, search-data
Upload time: 2023-05-22 17:41:41
# longitudinal_trends

## Introduction

This is a Python library for downloading cross-sectional and time-series Google Trends data and converting it into longitudinal data.

Although Google Trends provides cross-sectional and time-series search data, longitudinal Google Trends data are not readily available, and several practical issues make it difficult for researchers to generate such data themselves. First, Google Trends provides normalized counts from zero to 100. As a result, combining different regions' time-series Google Trends data does not produce the desired longitudinal data, and for the same reason, neither does combining cross-sectional Google Trends data over time. Second, Google Trends restricts data formats and timelines. For instance, you cannot collect daily data for two years: Google Trends automatically returns weekly data if your requested timeline is longer than 269 days, and monthly data if it is longer than 269 weeks, even if you want weekly data.
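
To see the first issue concretely, here is a toy example (illustrative numbers only) of how independent 0-100 normalization erases the level difference between two periods:

```python
# Toy illustration: two periods with very different absolute search volumes
# look identical after each is normalized to a 0-100 scale on its own.
period_1_raw = [10, 20, 40]    # hypothetical true volumes for period 1
period_2_raw = [40, 80, 160]   # four times the volume of period 1

def normalize(xs):
    return [round(100 * x / max(xs)) for x in xs]

print(normalize(period_1_raw))  # [25, 50, 100]
print(normalize(period_2_raw))  # [25, 50, 100] -- the 4x gap is gone
```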

The longitudinal_trends library resolves these issues and allows researchers to generate longitudinal Google Trends data.

This library is built on top of another library, `pytrends`, which has a few dependencies of its own. As long as the Google Trends API, `pytrends`, and all their dependencies work, `longitudinal_trends` will also work!

***Note**: This library is still in the planning phase and not fully tested.*

## Table of contents

* Installation
* Requirements
* Initiate `longitudinal_trends`
* Methods
  * WARNING
  * `cross_section`
  * `time_series`
  * `concat_time_series`
  * `convert_cross_section`
  * `all_in_one_method`
* Caveats
* Credits
* Disclaimer

## Installation

`pip install longitudinal-trends`

## Requirements

`pip install -r requirements.txt`

## Initiate longitudinal_trends

```python
from longitudinal_trends import RequestTrends
import datetime as dt

# '/m/0ddwt' is the topic ID for Insomnia as a 'Disorder' topic
day_data = RequestTrends(
    keyword='Insomnia', topic='/m/0ddwt', folder_name='insomnia_save',
    start_date=dt.datetime(2021, 11, 1), end_date=dt.datetime(2022, 10, 24),
    data_format='daily',
)
```

The initiator call initializes `pytrends`, which in turn initializes the Google Trends API. During initiation, two folders are created automatically:

1. a parent folder, named by the user (`folder_name`), and
2. a subfolder corresponding to the `data_format`.

So all daily data is stored under a 'daily' folder, weekly data under a 'weekly' folder, and so on.

**Parameters**

- `keyword`
  - The keyword used for collecting Google Trends data
- `topic`
  - The topic ID for the keyword, if a topic should be queried instead of a plain search term.
    - For example, '/m/0ddwt' returns Google Trends data for Insomnia as a 'Disorder' topic.
      - **NOTE**: URLs use certain codes for special characters. For example, `%20` = white space, `%2F` = / (forward slash), etc.
    - If the topic and keyword are the same, the data returned is for the Google Trends search term rather than any particular topic. So `keyword='Insomnia', topic='Insomnia'` returns Google Trends data for Insomnia as a search term (see the example after this list).
- `folder_name`
  - Name of the folder created to save all the data
- `start_date`
  - Date to start from
- `end_date`
  - Date to end at
- `data_format`
  - Time basis of the query
  - Choose exactly one from the list: ['daily', 'weekly', 'monthly']
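
For example, initializing in search-term mode (keyword equal to topic) might look like the following, where `insomnia_term_save` is a hypothetical folder name:

```python
from longitudinal_trends import RequestTrends
import datetime as dt

# keyword == topic queries 'Insomnia' as a plain search term rather than the
# Disorder topic; 'insomnia_term_save' is a hypothetical folder name.
term_data = RequestTrends(
    keyword='Insomnia', topic='Insomnia', folder_name='insomnia_term_save',
    start_date=dt.datetime(2021, 11, 1), end_date=dt.datetime(2022, 10, 24),
    data_format='weekly',
)
```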

## Methods

### WARNING

Please make sure to run the methods in the following sequence:

- `cross_section`
- `time_series`
- `concat_time_series`
- `convert_cross_section`

We have noticed some unusual behaviors when the methods are not run in this sequence. `concat_time_series` depends on `time_series`, and `convert_cross_section` depends on all three. We have also noticed that if `time_series` is run before `cross_section`, the output sometimes gets influenced by the `time_series` parameters. We are troubleshooting the issue; until then, please follow the sequence to obtain the expected result.
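
Concretely, with the `day_data` object from the initialization example, the safe sequence is:

```python
# Run the four methods in the documented order.
day_data.cross_section(geo='US', resolution='REGION')
day_data.time_series(reference_geo='US-AL')
day_data.concat_time_series(reference_geo='US-AL', zero_replace=0.1)
day_data.convert_cross_section(reference_geo='US-AL', zero_replace=0.1)
```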

### cross_section

```python
day_data.cross_section(geo='US', resolution="REGION")
```

This method collects cross-sectional data for the given keyword and timeline by calling pytrends' `interest_by_region()` method. The data is automatically saved under → 'folder_name'/'data_format'/by_region. Each file holds the Google Trends index of every country/state within the requested geography for one day/week/month, and its filename indicates the date of that period along with the day/week/month number.

For more information on pytrends `interest_by_region()` method, [check here](https://pypi.org/project/pytrends/#interest-by-region).

**PS**: *This method takes a long time to finish running. For example, it takes around 5 hours to collect 350 days of daily data, mainly because of the Google Trends API rate limit and waiting for it to reset.*

**Parameters**

- `geo`
  - Country/region to collect data from. If left empty, data is collected worldwide at the country level (see the example after this list).
- `resolution`
  - 'COUNTRY' returns country-level data
  - 'REGION' returns region-level data
  - 'CITY' returns city-level data
  - Defaults to 'COUNTRY'
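
So, assuming the defaults behave as described above, a worldwide country-level collection is simply:

```python
# No geo and no resolution: worldwide data at country level (the defaults).
day_data.cross_section()
```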

### time_series

```python
day_data.time_series(reference_geo='US-AL')
```

This method collects data over time by calling pytrends' `interest_over_time()` method. For time-series Google Trends data, Google by default returns weekly data when the span between the start and end dates exceeds 269 days, and monthly data when it exceeds 269 weeks. To work around this, the method collects the daily/weekly data in chunks shorter than 270 days/weeks. The collected data is saved under → 'folder_name'/'data_format'/over_time/'reference_geo'.
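
A minimal sketch of the chunking idea (not the library's exact code; `split_window` is a hypothetical helper, and for illustration we assume consecutive chunks overlap by one day so they can later be chained together):

```python
import datetime as dt

# Split the request window into pieces shorter than 270 days so Google
# Trends keeps returning daily data; consecutive chunks share one boundary
# day (an assumption made for the later rescaling step).
def split_window(start, end, max_days=269):
    chunks, chunk_start = [], start
    while chunk_start < end:
        chunk_end = min(chunk_start + dt.timedelta(days=max_days - 1), end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end            # overlap by one day
    return chunks

print(split_window(dt.date(2021, 11, 1), dt.date(2022, 10, 24)))
```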

For more information on pytrends `interest_over_time()` method, [check here](https://pypi.org/project/pytrends/#interest-over-time).

**Parameters**

- `reference_geo`
  - Country/state/city used as the reference point for rescaling the data in a later step (see the folder sketch after this list)
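
With the example initialization above, the folder layout after running `cross_section` and `time_series` would look roughly like this (a sketch inferred from the save paths described in these two sections):

```
insomnia_save/
└── daily/
    ├── by_region/        # cross_section output, one file per day/week/month
    └── over_time/
        └── US-AL/        # time_series chunks for the reference geo
```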

### concat_time_series

```python
day_data.concat_time_series(reference_geo='US-AL', zero_replace=0.1)
```

This method concatenates the time-series data collected by the `time_series()` method. Because the chunks collected in `time_series` are each normalized independently of one another, they need to be rescaled onto a common scale to obtain a correct index for the whole time period. This method concatenates the `time_series` data across all periods and returns the combined, rescaled `time_series` data for the reference timeline. This rescaled `time_series` data is used in the next method to rescale the `cross_section` data.
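
A minimal sketch of the rescaling idea, assuming consecutive chunks share a boundary data point (`rescale_chunks` is a hypothetical illustration, not the library's exact implementation):

```python
# Chain independently normalized chunks by the ratio of their values at the
# shared boundary point, replacing zeroes first so the ratio is defined.
def rescale_chunks(chunks, zero_replace=0.1):
    # chunks: lists of 0-100 values; chunk i's last point and chunk i+1's
    # first point are assumed to refer to the same date.
    chunks = [[v if v != 0 else zero_replace for v in c] for c in chunks]
    combined = list(chunks[0])
    for nxt in chunks[1:]:
        factor = combined[-1] / nxt[0]   # align scales at the boundary point
        combined.extend(v * factor for v in nxt[1:])
    return combined

print(rescale_chunks([[25, 50, 100], [100, 80, 40]]))
# -> [25, 50, 100, 80.0, 40.0]
```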

**Parameters**

- `reference_geo`
  - The same `geo` code used when collecting the `time_series` data. If the time_series data for that geo was not collected beforehand, or the file does not exist, this will throw an error. Defaults to 'US'.
- `zero_replace`
  - When data from different time periods are rescaled, the last/first data point of a period is sometimes zero, in which case the calculation either throws an error or turns every single data point into zero. To avoid that, zeroes are replaced with a small, insignificant number so the calculation can proceed.

### convert_cross_section

```python
day_data.convert_cross_section(reference_geo='US-AL', zero_replace=0.1)
```

This final method rescales the cross-sectional data based on the concatenated time-series data, finally providing an accurate Google Trends index for each region/country/city over the provided time period.
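
One plausible reading of this step, as a hedged sketch (the library's exact formula may differ; `rescale_snapshot` and the dict layout are hypothetical): each day's snapshot is normalized within that day, so scaling it so that the reference geo matches that day's value in the concatenated reference series puts all snapshots on one longitudinal scale.

```python
# Hypothetical sketch, not the library's code: rescale one day's normalized
# cross-section snapshot using that day's value in the concatenated
# reference time series.
def rescale_snapshot(snapshot, reference_geo, reference_value, zero_replace=0.1):
    # snapshot: {geo: 0-100 index} for a single day (hypothetical layout)
    ref = snapshot.get(reference_geo, 0) or zero_replace
    return {geo: value * reference_value / ref for geo, value in snapshot.items()}

day = {'US-AL': 50, 'US-NY': 100, 'US-CA': 80}
print(rescale_snapshot(day, 'US-AL', reference_value=25.0))
# -> {'US-AL': 25.0, 'US-NY': 50.0, 'US-CA': 40.0}
```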

**Parameters**

- `reference_geo`
  - Same as `reference_geo` in `concat_time_series()`. If any other value is used, the result will not be accurate.
- `zero_replace`
  - Same as `zero_replace` in `concat_time_series()`. Using the same value is highly recommended to avoid inconsistent results.

### all_in_one_method

```python
day_data.all_in_one_method(geo='US', reference_geo='US-AL', zero_replace=0.1)
```

This last method combines all of the methods above and executes them in the correct sequence. It collects the cross_section and time_series data, concatenates the time_series data, and finally rescales the cross-section data, all in one go. All the intermediate files are kept for cross-reference.

Note that the order of the first two methods, `cross_section()` and `time_series()`, does not matter since they are independent. However, the latter two depend on the first two: `concat_time_series()` depends on `time_series()`, and `convert_cross_section()` depends on both `concat_time_series()` and `cross_section()`.

**Parameters**

- `geo`
  - Same as `geo` from `cross_section()`
- `reference_geo`
  - Same as `reference_geo` from `time_series()` and `concat_time_series()`
- `zero_replace`
  - Same as `zero_replace` from `concat_time_series()` and `convert_cross_section()`

## Caveats

This is not an Official or Supported API.

`longitudinal_trends` is built on top of `pytrends`, and `pytrends` uses the Google Trends API to collect trends data, so we have no control over the accuracy or quality of the trends data. During tests we observed that the same inputs (keyword, topic, data_format, timeline) sometimes produced slightly different outputs.

`zero_replace` is used to avoid division errors. But when `zero_replace` is a very small number and the dataset contains many zeroes, the final output will contain very large numbers. There is no specific rule or recommendation for choosing `zero_replace`; finding a good value is a matter of trial and error, as the toy example below illustrates.
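
A toy illustration, using a hypothetical boundary value of 50:

```python
# A chaining ratio computed against a replaced zero grows without bound
# as zero_replace shrinks.
boundary_value = 50
for zero_replace in (0.1, 0.01, 0.001):
    print(zero_replace, boundary_value / zero_replace)
# 0.1 -> 500.0, 0.01 -> 5000.0, 0.001 -> 50000.0
```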

On that note, if the search term is not very popular, the resulting dataset will contain many zeroes, which heavily affects the final outcome.

## Credits

- `pytrends` library
  - https://github.com/GeneralMills/pytrends/tree/0d6113a3920e7576d4b3459132b5d37fb7ab9bfb

## Disclaimer

This publication was made possible by the generous support of the Qatar Foundation through Carnegie Mellon University in Qatar's Seed Research program. The statements made herein are solely the responsibility of the authors.

            
