# ScienceLinker: Enriching survey data with data from knowledge bases – and more
## Abstract
Data enrichment is a central part of many research proposals and of the daily work of data analysts.
Merging multiple datasets, especially from different domains, allows for new analyses and research insights.
Enrichment is achieved by interlinking corresponding data items of two datasets and adding the additional data from the second dataset to the first.
While the general procedure tends to be recurrent, further challenges arise in practice.
Most of the time, researchers require fully transparent and documented workflows in which each step, e.g. the identification of compatible item types or of comparison keys, can be inspected and fully controlled.
However, understanding and accessing new knowledge bases requires time and can be cumbersome.
In addition, the data quality of lesser-known datasets still has to be assessed for sufficiency.
ScienceLinker is a toolset that bundles functionality for enriching primarily social science survey data, but also other datasets.
The ScienceLinker Python module integrates with your data analysis setup and gives you full control of your analyses.
It allows for data linking and enrichment with large established KBs like DBpedia, Wikidata, GeoNames, or other LOD knowledge bases.
Furthermore, the module provides the means to design a step-by-step transparent enrichment workflow using methods ranging from linking via direct field comparisons up to text analysis for more complex scenarios.
Additionally, it supports the retrieval of Twitter microposts to gather complementary material for investigation. This includes the quick identification of relevant microposts for a given topic and first analysis steps, such as filtering by language or location.
## Explorative Approach
ScienceLinker offers users from diverse domains who have computational expertise an entry point to data linking,
even if they are initially unaware or skeptical of such methods and data sources. It provides a versatile
set of tools for analyzing data and extracting various types of information, which forms the foundation
for data linking. By combining complementary methods, ScienceLinker lets users quickly explore
new methods and integrate multiple data sources, and gauge the potential benefits for their own work even when the results are not
perfect. An all-in-one Python package, well-documented processes, and detailed processing logs help users navigate the
intricacies of data linking.
## ScienceLinker Overview
ScienceLinker is optimized for survey data, but its methods can be applied to various kinds of data.
1. Tabular data: Data that is or can be structured in a table, including linked data. Rows or columns are used as input and can contain short texts or numbers. This data can be linked with KB resources by comparing the values of designated columns with property values of the resources. The identification of compatible resource types is also supported.
2. Textual data: Longer texts that are part of research data, e.g. free-text answers of respondents or comments in web forums. Named Entity Recognition and topic modelling can be used to analyse larger datasets automatically in a structured and fast way.
3. Content data: A keyword defining a topic of interest, especially a social science concept, can be used to compile a set of microposts that serves as additional material for investigation.
<div align="center">
<img src="http://sciencelinker.git.gesis.org/docs/_images/functions_overview.jpg" width="500"/>
</div>
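
The direct field comparison described for tabular data above can be sketched in plain Python. This is a minimal illustration, not the ScienceLinker API: the function names `normalize` and `link_rows`, the `kb_uri` key, and the sample rows are all assumptions made for this example.

```python
import unicodedata


def normalize(value: str) -> str:
    """Lower-case, strip accents and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.lower().split())


def link_rows(rows, key_column, kb_resources, kb_property):
    """Attach the URI of the matching KB resource (or None) to each local row."""
    # Index the KB resources by their normalized property value.
    index = {normalize(res[kb_property]): res for res in kb_resources}
    linked = []
    for row in rows:
        match = index.get(normalize(row[key_column]))
        linked.append({**row, "kb_uri": match["uri"] if match else None})
    return linked


# Illustrative data only: two survey rows and two KB resources.
survey_rows = [
    {"respondent": 1, "country": "  GERMANY "},
    {"respondent": 2, "country": "Atlantis"},
]
kb_resources = [
    {"uri": "http://dbpedia.org/resource/Germany", "label": "Germany"},
    {"uri": "http://dbpedia.org/resource/France", "label": "France"},
]

linked_rows = link_rows(survey_rows, "country", kb_resources, "label")
```

Rows without a match keep `kb_uri` set to `None`, so unmatched values stay visible and can be inspected in a later step of the workflow.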
## Documentation
Please find the documentation for our package and methods at [http://sciencelinker.git.gesis.org/docs/](http://sciencelinker.git.gesis.org/docs/index.html).
## SPARQL Lookup in Knowledge Graphs
Knowledge graphs (KGs) are a source of additional information that can be used to enrich a local dataset. A possible scenario: Given a list of
country names or codes from survey data, ScienceLinker can be used to find resources in a KG that are of a given type (e.g. country)
and have a label that matches a country name from the given list. Such resources carry additional information such as
“Population density”, “Area” or “GNP” that can enrich the local dataset. [More](http://sciencelinker.git.gesis.org/docs/kg_coverpage.html)
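
To make the country scenario concrete, the following sketch builds a SPARQL query that looks up resources of a given type by label. It is not ScienceLinker's own code: the function name `build_country_query` is hypothetical, and the choice of DBpedia's `dbo:Country` class and `dbo:areaTotal` property is an assumption for illustration. The query is only constructed here, not sent to an endpoint.

```python
def build_country_query(names, lang="en"):
    """Build a SPARQL query for dbo:Country resources with matching labels."""
    # Escape backslashes and double quotes so names are valid SPARQL literals.
    literals = " ".join(
        '"{}"@{}'.format(n.replace("\\", "\\\\").replace('"', '\\"'), lang)
        for n in names
    )
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?country ?label ?area WHERE {{
  VALUES ?label {{ {literals} }}
  ?country a dbo:Country ;
           rdfs:label ?label .
  OPTIONAL {{ ?country dbo:areaTotal ?area . }}
}}
""".strip()


# Country names as they might appear in a survey column (illustrative).
query = build_country_query(["Germany", "France"])
```

The `VALUES` clause restricts `?label` to the names from the local dataset, and `OPTIONAL` keeps a country in the result even if the KG has no area value for it.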
## GeoNames Lookup
ScienceLinker incorporates the functionality to retrieve longitude and latitude information for a given set of location names,
such as cities and countries, by utilizing the GeoNames web services.
This feature proves valuable in computing proximities between places, facilitating linking operations and enabling further
analyses based on geographical relationships. [More](http://sciencelinker.git.gesis.org/docs/geo_names.html)
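
Once latitude and longitude are available, the proximity between two places can be computed with the haversine formula. The sketch below is a generic implementation under the assumption of a spherical Earth; the function name `haversine_km` and the approximate city coordinates are chosen for this example and are not part of ScienceLinker.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates (latitude, longitude) for illustration.
berlin = (52.52, 13.405)
paris = (48.8566, 2.3522)

distance_km = haversine_km(*berlin, *paris)
```

A distance threshold on such values is one simple way to decide whether two location records refer to nearby places.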
## Named Entity Recognition (NER)
The integration of NER within ScienceLinker enables the extraction of structured information from unstructured text. By utilizing
the DBpedia URLs associated with recognized entities, it becomes possible to establish links to additional data within the DBpedia Knowledge Graph,
enhancing the given dataset that contains text. [More](http://sciencelinker.git.gesis.org/docs/ner_dbpspot.html)
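
How recognized entities become links can be sketched as follows. The example assumes an annotation result shaped like DBpedia Spotlight's JSON output (a `Resources` list whose entries carry `@URI`, `@surfaceForm` and `@similarityScore`); these key names should be checked against the service documentation. The function name `extract_entity_links`, the threshold, and the hand-written sample are assumptions for illustration, and no web service is called.

```python
def extract_entity_links(annotation, min_score=0.8):
    """Return (surface form, DBpedia URI) pairs above a similarity threshold."""
    links = []
    for resource in annotation.get("Resources", []):
        # Scores are assumed to arrive as strings, so convert before comparing.
        score = float(resource["@similarityScore"])
        if score >= min_score:
            links.append((resource["@surfaceForm"], resource["@URI"]))
    return links


# Hand-written sample in the assumed response shape; the scores are made up.
sample_annotation = {
    "@text": "Angela Merkel visited Paris.",
    "Resources": [
        {
            "@URI": "http://dbpedia.org/resource/Angela_Merkel",
            "@surfaceForm": "Angela Merkel",
            "@similarityScore": "0.99",
        },
        {
            "@URI": "http://dbpedia.org/resource/Paris",
            "@surfaceForm": "Paris",
            "@similarityScore": "0.75",
        },
    ],
}

entity_links = extract_entity_links(sample_annotation)
```

The resulting URIs can then serve as keys for retrieving further properties of each entity from the knowledge graph.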
## Topic Modelling
Topic modelling is a valuable technique used to extract topics from a collection of documents. It involves representing topics as
sets of words, utilizing the co-occurrence probabilities of words within the document set. One popular method for topic modelling
is [Latent Dirichlet Allocation (LDA)](https://papers.neurips.cc/paper/2010/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf).
The application of LDA within ScienceLinker enables the analysis of longer texts, such as online discussions, chats, articles,
and various other forms of written content. By identifying the topics present in a document, it becomes possible to establish
connections between documents that share similar topics. [More](http://sciencelinker.git.gesis.org/docs/lda_gensim.html)
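
Connecting documents that share similar topics can be illustrated with per-document topic distributions, such as those an LDA model produces. The sketch below compares distributions with the Hellinger distance, which is a choice made for this example rather than a documented ScienceLinker method; the function names, the threshold, and the distributions are hypothetical.

```python
import math
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions (0 to 1)."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def link_similar_documents(topic_distributions, max_distance=0.3):
    """Return (doc_a, doc_b, distance) for every pair within max_distance."""
    links = []
    for (id_a, p), (id_b, q) in combinations(topic_distributions.items(), 2):
        distance = hellinger(p, q)
        if distance <= max_distance:
            links.append((id_a, id_b, round(distance, 3)))
    return links


# Made-up topic distributions over three topics for three documents.
doc_topics = {
    "doc1": [0.8, 0.1, 0.1],
    "doc2": [0.7, 0.2, 0.1],
    "doc3": [0.1, 0.1, 0.8],
}

links = link_similar_documents(doc_topics)
```

With these values only `doc1` and `doc2` are linked, since both are dominated by the first topic while `doc3` is dominated by the third.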
## Micropost Retrieval
Online discourse contains latent information about the attitudes and opinions of individuals on current and past topics, expressed
in a more natural context than traditional surveys. Analyzing online discourse provides an opportunity to gain additional insights into
respondents’ behavior, complementing findings from conventional surveys. Simultaneously, it enables the examination of the attitudes of
a diverse range of social media users towards specific subjects.
Such insights can be extracted from the discourse using Natural Language Processing techniques, such as sentiment analysis and stance detection.
Before that, however, the user faces several challenges, such as identifying the microposts relevant to a given topic and filtering them by language or location. [More](http://sciencelinker.git.gesis.org/docs/mpr_coverpage.html)
## Contact
* Dr. Benjamin Zapilko: benjamin.zapilko@gesis.org
* Felix Bensmann: felix.bensmann@gesis.org
## Funding
This project is funded by the Deutsche Forschungsgemeinschaft (DFG) under grant no. [404417453](https://gepris.dfg.de/gepris/projekt/404417453).
## Third Party Efforts
Our software project is built upon the collaborative efforts of various third-party resources, including libraries, web services, and data sources.
We rely on the invaluable contributions of [DBpedia Spotlight](https://www.dbpedia-spotlight.org/), [GeoNames](https://www.geonames.org/about.html),
and [Gensim](https://radimrehurek.com/gensim/), which generously provide their offerings free of charge, only requesting proper attribution, a
condition we are delighted to honor. Moreover, we extend our gratitude to the multitude of projects, akin to [DBpedia](https://www.dbpedia.org/)
and [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page), for their unwavering commitment to sharing data freely on the
semantic web. While we wholeheartedly embrace these open resources, it is essential to acknowledge that each data provider may have individual
terms and conditions governing the use of their data, and we encourage our users to adhere to these guidelines with respect and appreciation
for the valuable data made accessible to us.