<div style="display: flex; align-items: center;">
<img src="https://raw.githubusercontent.com/harmonize-tools/socio4health/main/docs/source/_static/image.png" alt="image info" height="100" width="100" style="margin-right: 20px;"/>
<a href="https://www.harmonize-tools.org/">
<img src="https://harmonize-tools.github.io/harmonize-logo.png" height="139" alt="Harmonize logo"/>
</a>
</div>
<!-- badges: start -->
[Lifecycle: experimental](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[License](https://github.com/harmonize-tools/socio4health/blob/main/LICENSE.md/)
[Contributors](https://github.com/harmonize-tools/socio4health/graphs/contributors)

<!-- badges: end -->
## Overview
<p style="font-family: Arial, sans-serif; font-size: 14px;">
socio4health is an extract, transform, load (ETL) and AI-assisted query and visualization (AI QV) tool that simplifies collecting and merging data 📊 from multiple sources into a unified relational database structure. It focuses on sociodemographic and census datasets from Colombia, Brazil, and Peru.
</p>
- Retrieve data from online sources through web scraping, as well as from local files.
- Read a range of data formats, including `.csv`, `.xlsx`, `.xls`, `.txt`, `.sav`, and compressed files.
- Consolidate extracted data into a pandas DataFrame.
- Consolidate transformed data into a cohesive relational database.
- Run precise queries and apply transformations to meet specific criteria.
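
The consolidation idea in the list above can be pictured with a small standalone sketch. It is not socio4health's own implementation: it uses only the Python standard library (`csv`, `sqlite3`) rather than pandas, and the sample rows, the `COLUMN_MAP` mapping, and the helpers `read_rows` and `consolidate` are made up for illustration.

```python
import csv
import io
import sqlite3

# Made-up extracts from two sources that name the same fields differently.
SOURCE_A = "dept,pop\nAntioquia,100\nBoyaca,40\n"
SOURCE_B = "state;population\nAcre;30\nBahia;70\n"

# Hypothetical mapping from each source's column names to a shared schema.
COLUMN_MAP = {"dept": "region", "state": "region", "pop": "population"}


def read_rows(text, delimiter, country):
    """Parse delimited text and rename its columns to the shared schema."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    for row in reader:
        renamed = {COLUMN_MAP.get(key, key): value for key, value in row.items()}
        yield country, renamed["region"], int(renamed["population"])


def consolidate(connection):
    """Load both sources into a single relational table."""
    connection.execute(
        "CREATE TABLE census (country TEXT, region TEXT, population INTEGER)"
    )
    rows = list(read_rows(SOURCE_A, ",", "Colombia"))
    rows += list(read_rows(SOURCE_B, ";", "Brazil"))
    connection.executemany("INSERT INTO census VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
consolidate(connection)

# Query the unified table: total population per country.
totals = dict(
    connection.execute(
        "SELECT country, SUM(population) FROM census GROUP BY country"
    ).fetchall()
)
print(totals)
```

Once both sources share one schema, a single SQL query can span them, which is the point of consolidating into a relational structure.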
## Dependencies
<table>
<tr>
<td align="center">
<a href="https://pandas.pydata.org/" target="_blank">
<img src="https://avatars.githubusercontent.com/u/21206976?s=280&v=4" height="50" alt="pandas logo">
</a>
</td>
<td align="left">
<strong>Pandas</strong><br>
Pandas is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool.<br>
</td>
</tr>
<tr>
<td align="center">
<a href="https://numpy.org/" target="_blank">
<img src="https://avatars.githubusercontent.com/u/288276?s=48&v=4" height="50" alt="numpy logo">
</a>
</td>
<td align="left">
<strong>NumPy</strong><br>
The fundamental package for scientific computing with Python.<br>
</td>
</tr>
<tr>
<td align="center">
<a href="https://scrapy.org/" target="_blank">
<img src="https://avatars.githubusercontent.com/u/733635?s=48&v=4" height="50" alt="scrapy logo">
</a>
</td>
<td align="left">
<strong>Scrapy</strong><br>
Framework for extracting the data you need from websites.<br>
</td>
</tr>
</table>
- <a href="https://openpyxl.readthedocs.io/en/stable/">openpyxl</a>
- <a href="https://py7zr.readthedocs.io/en/latest/">py7zr</a>
- <a href="https://pypi.org/project/pyreadstat/">pyreadstat</a>
- <a href="https://tqdm.github.io/">tqdm</a>
- <a href="https://requests.readthedocs.io/en/latest/">requests</a>
## Installation
**socio4health** can be installed via pip from [PyPI](https://pypi.org/project/socio4health/).
```bash
# Install using pip
pip install socio4health
```
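
To confirm that the installation succeeded, the installed version can be read back from Python. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the Python range checked is the one listed in the package metadata (`>=3.10,<4`).

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# The package metadata requires Python >=3.10 and <4.
python_ok = (3, 10) <= sys.version_info[:2] < (4, 0)
print("Python version supported:", python_ok)

found = installed_version("socio4health")
print(found if found else "socio4health is not installed")
```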
## How to Use It
To use the socio4health package, follow these steps:
1. Import the package in your Python script:
```python
from socio4health import Extractor
from socio4health import Harmonizer
```
2. Create an instance of the `Extractor` class:
```python
extractor = Extractor()
```
3. Extract data from an online source into a list of data information, then create a `Harmonizer` instance:
```python
url = 'https://www.example.com'  # starting page to scrape
depth = 0                        # crawl depth passed to the extractor
ext = 'csv'                      # file extension to look for
list_datainfo = extractor.s4h_extract(url=url, depth=depth, ext=ext)
harmonizer = Harmonizer()
```
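
How socio4health interprets `depth` and `ext` internally is not described in this README. As a rough illustration only, the sketch below assumes that `ext` selects links by file extension and that a depth of 0 means only the given page is scanned. It uses the standard library instead of Scrapy, makes no network request, and the names `LinkCollector`, `find_files`, and `PAGE` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def find_files(html, base_url, ext):
    """Return absolute URLs of links whose path ends with the given extension."""
    collector = LinkCollector()
    collector.feed(html)
    suffix = "." + ext.lower().lstrip(".")
    absolute = (urljoin(base_url, link) for link in collector.links)
    return [url for url in absolute if urlparse(url).path.lower().endswith(suffix)]


# Hypothetical page content standing in for a downloaded page.
PAGE = """
<a href="/data/census_2018.csv">Census 2018</a>
<a href="survey.XLSX">Survey</a>
<a href="https://www.example.com/files/households.csv?download=1">Households</a>
<a href="/about">About</a>
"""

csv_links = find_files(PAGE, "https://www.example.com/index.html", "csv")
print(csv_links)
```

Matching on the URL path rather than the full URL keeps links with query strings, such as `households.csv?download=1`, from being missed.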
## Resources
<details>
<summary>
Package Website
</summary>
The [socio4health website](https://harmonize-tools.github.io/socio4health/) includes an **API reference**, a **user guide**, and **examples**. The site mainly covers the release version, but you can also find documentation for the latest development version.
</details>
<details>
<summary>
Organisation Website
</summary>
[Harmonize](https://www.harmonize-tools.org/) is an international project that develops cost-effective and reproducible digital tools for stakeholders in Latin America and the Caribbean (LAC) affected by a changing climate. These stakeholders include cities, small islands, highlands, and the Amazon rainforest.
The project consists of resources and [tools](https://harmonize-tools.github.io/) developed in conjunction with teams from Brazil, Colombia, the Dominican Republic, Peru, and Spain.
</details>
## Organizations
<table>
<tr>
<td align="center">
<a href="https://www.bsc.es/" target="_blank">
<img src="https://imgs.search.brave.com/t_FUOTCQZmDh3ddbVSX1LgHYq4mzCxvVA8U_YHywMTc/rs:fit:500:0:0/g:ce/aHR0cHM6Ly9zb21t/YS5lcy93cC1jb250/ZW50L3VwbG9hZHMv/MjAyMi8wNC9CU0Mt/Ymx1ZS1zbWFsbC5q/cGc" height="64" alt="bsc logo">
</a>
</td>
<td align="center">
<a href="https://uniandes.edu.co/" target="_blank">
<img src="https://raw.githubusercontent.com/harmonize-tools/socio4health/refs/heads/main/docs/img/uniandes.png" height="64" alt="uniandes logo">
</a>
</td>
</tr>
</table>
## Authors / Contact information
For questions or feedback, contact the authors and contributors listed below.
</br>
</br>
<a href="https://github.com/dirreno">
<img src="https://avatars.githubusercontent.com/u/39099417?v=4" style="width: 50px; height: auto;" />
</a>
<span style="display: flex; align-items: center; margin-left: 10px;">
<strong>Diego Irreño</strong> (developer)
</span>
</br>
<a href="https://github.com/Ersebreck">
<img src="https://avatars.githubusercontent.com/u/81669194?v=4" style="width: 50px; height: auto;" />
</a>
<span style="display: flex; align-items: center; margin-left: 10px;">
<strong>Erick Lozano</strong> (developer)
</span>
</br>
<a href="https://github.com/Juanmontenegro99">
<img src="https://avatars.githubusercontent.com/u/60274234?v=4" style="width: 50px; height: auto;" />
</a>
<span style="display: flex; align-items: center; margin-left: 10px;">
<strong>Juan Montenegro</strong> (developer)
</span>
</br>
<a href="https://github.com/ingridvmoras">
<img src="https://avatars.githubusercontent.com/u/91691844?s=400&u=945efa0d09fcc25d1e592d2a9fddb984fdc6ceea&v=4" style="width: 50px; height: auto;" />
</a>
<span style="display: flex; align-items: center; margin-left: 10px;">
<strong>Ingrid Mora</strong> (documentation)
</span>
Raw data
{
"_id": null,
"home_page": "https://github.com/harmonize-tools/socio4health",
"name": "socio4health",
"maintainer": null,
"docs_url": null,
"requires_python": "<4,>=3.10",
"maintainer_email": null,
"keywords": "extract transform load etl scraping relational census sociodemographic colombia brazil",
"author": "Erick Lozano, Diego Irre\u00f1o, Juan Montenegro, Ingrid Mora",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/3d/31/aceca6ab68e0ad000d0c773ff0ebfc91da727423e6c6fb0b05480a3f63da/socio4health-0.1.7.tar.gz",
"platform": null,
"description": "<div style=\"display: flex; align-items: center;\">\r\n <img src=\"https://raw.githubusercontent.com/harmonize-tools/socio4health/main/docs/source/_static/image.png\" alt=\"image info\" height=\"100\" width=\"100\" style=\"margin-right: 20px;\"/>\r\n <a href=\"https://www.harmonize-tools.org/\">\r\n <img src=\"https://harmonize-tools.github.io/harmonize-logo.png\" height=\"139\" alt=\"socio4health logo\"/>\r\n </a>\r\n</div>\r\n<!-- badges: start -->\r\n\r\n[](https://lifecycle.r-lib.org/articles/stages.html#experimental)\r\n[](https://github.com/harmonize-tools/socio4health/blob/main/LICENSE.md/)\r\n[](https://github.com/harmonize-tools/socio4health/graphs/contributors)\r\n\r\n<!-- badges: end -->\r\n\r\n## Overview\r\n<p style=\"font-family: Arial, sans-serif; font-size: 14px;\">\r\n Package socio4health is an extraction, transformation, loading (ETL), and AI-assisted query and visualization (AI QV) tool designed to simplify the intricate process of collecting and merging data \ud83d\udcca from multiple sources, focusing on sociodemographic and census datasets from Colombia, Brazil, and Peru, into a unified relational database structure.\r\n</p>\r\n\r\n- Seamlessly retrieve data from online data sources through web scraping, as well as from local files.\r\n- Support for various data formats, including `.csv`, `.xlsx`, `.xls`, `.txt`, `.sav`, and compressed files, ensuring versatility in sourcing information.\r\n- Consolidating extracted data into a pandas DataFrame.\r\n- Consolidating transformed data into a cohesive relational database.\r\n- Conduct precise queries and apply transformations to meet specific criteria.\r\n\r\n\r\n\r\n## Dependencies\r\n\r\n<table>\r\n <tr>\r\n <td align=\"center\">\r\n <a href=\"https://pandas.pydata.org/\" target=\"_blank\">\r\n <img src=\"https://avatars.githubusercontent.com/u/21206976?s=280&v=4\" height=\"50\" alt=\"pandas logo\">\r\n </a>\r\n </td>\r\n <td align=\"left\">\r\n <strong>Pandas</strong><br>\r\n 
Pandas is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool.<br>\r\n </td>\r\n </tr>\r\n <tr>\r\n <td align=\"center\">\r\n <a href=\"https://numpy.org/\" target=\"_blank\">\r\n <img src=\"https://avatars.githubusercontent.com/u/288276?s=48&v=4\" height=\"50\" alt=\"numpy logo\">\r\n </a>\r\n </td>\r\n <td align=\"left\">\r\n <strong>Numpy</strong><br>\r\n The fundamental package for scientific computing with Python.<br>\r\n </td>\r\n </tr>\r\n <tr>\r\n <td align=\"center\">\r\n <a href=\"https://scrapy.org/\" target=\"_blank\">\r\n <img src=\"https://avatars.githubusercontent.com/u/733635?s=48&v=4\" height=\"50\" alt=\"scrapy logo\">\r\n </a>\r\n </td>\r\n <td align=\"left\">\r\n <strong>Scrapy</strong><br>\r\n Framework for extracting the data you need from websites.<br>\r\n </td>\r\n </tr>\r\n</table>\r\n\r\n- <a href=\"https://openpyxl.readthedocs.io/en/stable/\">openpyxl</a>\r\n- <a href=\"https://py7zr.readthedocs.io/en/latest/\">py7zr</a>\r\n- <a href=\"https://pypi.org/project/pyreadstat/\">pyreadstat</a>\r\n- <a href=\"https://tqdm.github.io/\">tqdm</a>\r\n- <a href=\"https://requests.readthedocs.io/en/latest/\">requests</a>\r\n\r\n## Installation\r\n\r\n**socio4health** can be installed via pip from [PyPI](https://pypi.org/project/socio4health/).\r\n\r\n```python\r\n# Install using pip\r\npip install socio4health\r\n```\r\n\r\n## How to Use it\r\n\r\nTo use the socio4health package, follow these steps:\r\n\r\n1. Import the package in your Python script:\r\n\r\n ```python\r\n from socio4health import Extractor()\r\n from socio4health import Harmonizer\r\n \r\n ```\r\n2. Create an instance of the `Extractor` class:\r\n\r\n ```python\r\n extractor = Extractor()\r\n ```\r\n\r\n3. 
Extract data from online sources and create a list of data information:\r\n\r\n ```python\r\n url = 'https://www.example.com'\r\n depth = 0\r\n ext = 'csv'\r\n list_datainfo = extractor.s4h_extract(url=url, depth=depth, ext=ext)\r\n harmonizer = Harmonizer()\r\n ```\r\n\r\n## Resources\r\n\r\n<details>\r\n<summary>\r\nPackage Website\r\n</summary>\r\n\r\nThe [socio4health website](https://harmonize-tools.github.io/socio4health/) package website includes **API reference**, **user guide**, and **examples**. The site mainly concerns the release version, but you can also find documentation for the latest development version.\r\n\r\n</details>\r\n<details>\r\n<summary>\r\nOrganisation Website\r\n</summary>\r\n\r\n[Harmonize](https://www.harmonize-tools.org/) is an international project that develops cost-effective and reproducible digital tools for stakeholders in Latin America and the Caribbean (LAC) affected by a changing climate. These stakeholders include cities, small islands, highlands, and the Amazon rainforest.\r\n\r\nThe project consists of resources and [tools](https://harmonize-tools.github.io/) developed in conjunction with different teams from Brazil, Colombia, Dominican Republic, Peru, and Spain.\r\n\r\n</details>\r\n\r\n## Organizations\r\n\r\n<table>\r\n <tr>\r\n <td align=\"center\">\r\n <a href=\"https://www.bsc.es/\" target=\"_blank\">\r\n <img src=\"https://imgs.search.brave.com/t_FUOTCQZmDh3ddbVSX1LgHYq4mzCxvVA8U_YHywMTc/rs:fit:500:0:0/g:ce/aHR0cHM6Ly9zb21t/YS5lcy93cC1jb250/ZW50L3VwbG9hZHMv/MjAyMi8wNC9CU0Mt/Ymx1ZS1zbWFsbC5q/cGc\" height=\"64\" alt=\"bsc logo\">\r\n </a>\r\n </td>\r\n <td align=\"center\">\r\n <a href=\"https://uniandes.edu.co/\" target=\"_blank\">\r\n <img src=\"https://raw.githubusercontent.com/harmonize-tools/socio4health/refs/heads/main/docs/img/uniandes.png\" height=\"64\" alt=\"uniandes logo\">\r\n </a>\r\n </td>\r\n </tr>\r\n</table>\r\n\r\n\r\n## Authors / Contact information\r\n\r\nHere is the contact information of 
authors/contributors in case users have questions or feedback.\r\n</br>\r\n</br>\r\n<a href=\"https://github.com/dirreno\">\r\n <img src=\"https://avatars.githubusercontent.com/u/39099417?v=4\" style=\"width: 50px; height: auto;\" />\r\n</a>\r\n<span style=\"display: flex; align-items: center; margin-left: 10px;\">\r\n <strong>Diego Irre\u00f1o</strong> (developer)\r\n</span>\r\n</br>\r\n<a href=\"https://github.com/Ersebreck\">\r\n <img src=\"https://avatars.githubusercontent.com/u/81669194?v=4\" style=\"width: 50px; height: auto;\" />\r\n</a>\r\n<span style=\"display: flex; align-items: center; margin-left: 10px;\">\r\n <strong>Erick Lozano</strong> (developer)\r\n</span>\r\n</br>\r\n<a href=\"https://github.com/Juanmontenegro99\">\r\n <img src=\"https://avatars.githubusercontent.com/u/60274234?v=4\" style=\"width: 50px; height: auto;\" />\r\n</a>\r\n<span style=\"display: flex; align-items: center; margin-left: 10px;\">\r\n <strong>Juan Montenegro</strong> (developer)\r\n</span>\r\n</br>\r\n<a href=\"https://github.com/ingridvmoras\">\r\n <img src=\"https://avatars.githubusercontent.com/u/91691844?s=400&u=945efa0d09fcc25d1e592d2a9fddb984fdc6ceea&v=4\" style=\"width: 50px; height: auto;\" />\r\n</a>\r\n<span style=\"display: flex; align-items: center; margin-left: 10px;\">\r\n <strong>Ingrid Mora</strong> (documentation)\r\n</span>\r\n",
"bugtrack_url": null,
"license": null,
"summary": "Socio4health is a Python package for gathering and consolidating socio-demographic data.",
"version": "0.1.7",
"project_urls": {
"Bug Reports": "https://github.com/harmonize-tools/socio4health/issues",
"Homepage": "https://github.com/harmonize-tools/socio4health",
"Source": "https://github.com/harmonize-tools/socio4health/"
},
"split_keywords": [
"extract",
"transform",
"load",
"etl",
"scraping",
"relational",
"census",
"sociodemographic",
"colombia",
"brazil"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "a3eb5aa2d938d587a187d0fc2e8dc0a04d8635dc430bcb5977dc6313683d6766",
"md5": "9fdcd97876fe2d57b3ba91d067648cb8",
"sha256": "1287f20577d1c3c40640706aae334cd8c1f40f0b659ae3894c1c7df6901f76f9"
},
"downloads": -1,
"filename": "socio4health-0.1.7-py3-none-any.whl",
"has_sig": false,
"md5_digest": "9fdcd97876fe2d57b3ba91d067648cb8",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4,>=3.10",
"size": 30808,
"upload_time": "2025-09-15T21:16:58",
"upload_time_iso_8601": "2025-09-15T21:16:58.285825Z",
"url": "https://files.pythonhosted.org/packages/a3/eb/5aa2d938d587a187d0fc2e8dc0a04d8635dc430bcb5977dc6313683d6766/socio4health-0.1.7-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "3d31aceca6ab68e0ad000d0c773ff0ebfc91da727423e6c6fb0b05480a3f63da",
"md5": "1715793243918154ccd712f6251185c4",
"sha256": "17ca886b191d580ac18c4517cd7c0100f3de3fa1062465e234ebb99d263881f7"
},
"downloads": -1,
"filename": "socio4health-0.1.7.tar.gz",
"has_sig": false,
"md5_digest": "1715793243918154ccd712f6251185c4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4,>=3.10",
"size": 36185,
"upload_time": "2025-09-15T21:16:59",
"upload_time_iso_8601": "2025-09-15T21:16:59.588037Z",
"url": "https://files.pythonhosted.org/packages/3d/31/aceca6ab68e0ad000d0c773ff0ebfc91da727423e6c6fb0b05480a3f63da/socio4health-0.1.7.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-15 21:16:59",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "harmonize-tools",
"github_project": "socio4health",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "requests",
"specs": [
[
"~=",
"2.31.0"
]
]
},
{
"name": "Scrapy",
"specs": [
[
"~=",
"2.11.1"
]
]
},
{
"name": "tqdm",
"specs": [
[
"~=",
"4.66.1"
]
]
},
{
"name": "pyreadstat",
"specs": [
[
"~=",
"1.2.6"
]
]
},
{
"name": "py7zr",
"specs": [
[
"~=",
"0.20.8"
]
]
},
{
"name": "pandas",
"specs": []
},
{
"name": "openpyxl",
"specs": [
[
"~=",
"3.1.2"
]
]
},
{
"name": "matplotlib",
"specs": []
},
{
"name": "numpy",
"specs": []
},
{
"name": "dask",
"specs": []
},
{
"name": "appdirs",
"specs": []
},
{
"name": "pyarrow",
"specs": []
},
{
"name": "deep_translator",
"specs": []
},
{
"name": "transformers",
"specs": []
},
{
"name": "torch",
"specs": []
},
{
"name": "pytest",
"specs": []
},
{
"name": "geopandas",
"specs": []
}
],
"lcname": "socio4health"
}
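
The `requirements` array above stores each dependency as a name plus a list of `[operator, version]` pairs. As a small sketch of how that structure maps back to pip-style requirement strings, the code below copies three entries from the metadata; the helper name `to_requirement` is hypothetical.

```python
import json

# Three entries copied from the "requirements" array in the metadata above.
METADATA = json.loads("""
{
  "requirements": [
    {"name": "requests", "specs": [["~=", "2.31.0"]]},
    {"name": "Scrapy", "specs": [["~=", "2.11.1"]]},
    {"name": "pandas", "specs": []}
  ]
}
""")


def to_requirement(entry):
    """Join a package name with its version specifiers, if any."""
    specifiers = ",".join(operator + ver for operator, ver in entry["specs"])
    return entry["name"] + specifiers


lines = [to_requirement(entry) for entry in METADATA["requirements"]]
print("\n".join(lines))
```

The `~=` operator is the "compatible release" specifier: `requests~=2.31.0` accepts any `2.31.x` release at or above `2.31.0`. Entries with an empty `specs` list, such as pandas, are unpinned.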