strwythura 1.4.1

  * Home page: <https://github.com/DerwenAI/strwythura>
  * Summary: Construct a knowledge graph from unstructured data sources, organized by results from entity resolution, implementing an enhanced GraphRAG approach, and also implementing an ontology pipeline plus context engineering for optimizing AI application outcomes within a specific domain.
  * Upload time: 2025-08-30 21:47:45
  * Author: Paco Nathan
  * Requires Python: <3.12,>=3.11
  * License: MIT
  * Keywords: baml, context engineering, entity-linking, entity-resolution, graph algorithms, graphrag, interactive visualization, knowledge-graph, named-entity-recognition, nlp, ontology, ontology-pipeline, relation-extraction, semantic-expansion, semantic-layer, semantic-random-walk, textgraphs, unstructured-data, vector-database
# Strwythura

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.16934079.svg)](https://doi.org/10.5281/zenodo.16934079)
![Licence](https://img.shields.io/github/license/DerwenAI/strwythura)
![Repo size](https://img.shields.io/github/repo-size/DerwenAI/strwythura)
[![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/)
![GitHub commit activity](https://img.shields.io/github/commit-activity/w/DerwenAI/strwythura?style=plastic)


The **Strwythura** library/tutorial is based on a presentation about GraphRAG for
[GraphGeeks](https://graphgeeks.org/) on 2024-08-14.

<details>
  <summary><h2>Overview</h2></summary>

How to construct a _knowledge graph_ (KG) from unstructured data
sources using _state of the art_ (SOTA) models for _named entity
recognition_ (NER), then implement an enhanced _GraphRAG_ approach,
and curate semantics for optimizing AI app outcomes within a
specific domain.

  * videos: <https://youtu.be/B6_NfvQL-BE>, <https://senzing.com/gph-graph-rag-llm-knowledge-graphs/>
  * slides: <https://derwen.ai/s/2njz#1>

Motivation for this tutorial comes from the stark fact that the
term "GraphRAG" means many things, based on multiple conflicting
definitions. Several popular implementations reveal a relatively 
cursory understanding of either _natural language processing_ (NLP)
or graph algorithms, plus a _vendor bias_ toward their own query language.

See this article for more details and history:
["Unbundling the Graph in GraphRAG"](https://www.oreilly.com/radar/unbundling-the-graph-in-graphrag/).

Instead of delegating KG construction to a _large language model_
(LLM), this tutorial shows the use of sophisticated NLP pipelines
based on `spaCy`, `GLiNER`, _TextRank_, and related libraries.
Results are better/faster/cheaper, plus this provides more control
and oversight for _intentional arrangement_ of the KG. Then for
downstream usage in a question/answer chat bot, an enhanced GraphRAG
approach leverages graph algorithms (e.g., _semantic random walk_)
to optimize retrieval of text chunks which ultimately get presented
to an LLM for _summarization_ to produce responses.

For more detailed discussions, see:

  * enhanced GraphRAG: ["GraphRAG to enhance LLM-based apps"](https://derwen.ai/s/hm7h#3)
  * ontology pipeline: ["Intentional Arrangement"](https://jessicatalisman.substack.com/) by **Jessica Talisman**
  * `spaCy`: <https://spacy.io/>
  * `GLiNER`: <https://huggingface.co/urchade/gliner_base>
  * _TextRank_: <https://www.derwen.ai/docs/ptr/explain_algo/>

Some key issues regarding KG construction with LLMs which don't get
addressed much by the graph community and AI community in general:

  1. LLMs tend to mangle cross-domain semantics when used for building graphs; see _Mai2024_ referenced in the "GraphRAG to enhance LLM-based apps" talk above.
  2. You need to introduce a _semantic layer_ for representing the domain context, which follows more of a _neurosymbolic AI_ approach.
  3. Almost all LLMs perform _question rewriting_ in ways which cannot be disabled, even when the `temperature` parameter is set to zero; this leads to relative degrees of "hallucinated questions" for which there are no clear workarounds.
  4. Any _model_ used for prediction introduces reasoning based on _generalization_, even more so when the model uses a _loss function_ for training; this tends to be the point where KG structure and semantics turn into crap; see the "Let's talk about ..." articles linked below.
  5. The approach outlined here is faster and less expensive, and produces better results than if you'd delegated KG construction to an LLM.

Of course, YMMV.

This approach leverages _neurosymbolic AI_ methods, combining
practices from:

  * _natural language processing_
  * _graph data science_
  * _entity resolution_
  * _ontology pipeline_
  * _context engineering_
  * _human-in-the-loop_

Overall, this illustrates a reference implementation for
_entity-resolved retrieval-augmented generation_ (ER-RAG).
</details>


## Usage in applications

This runs with Python 3.11, though the range of versions may be
extended soon.

To `pip` install from [PyPI](https://pypi.org/project/strwythura/):

```bash
python3 -m pip install strwythura
python3 -m spacy download en_core_web_md
```
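
As a quick sanity check that both the package and the `spaCy` model are available (a minimal sketch; the module name is assumed to match the package name, and the sample sentence is just illustrative):

```python
import strwythura  # assumed module name, matching the package name
import spacy

# raises OSError if the model download above did not complete
nlp = spacy.load("en_core_web_md")
doc = nlp("Eating processed red meat frequently may raise the risk of dementia.")
print([(ent.text, ent.label_) for ent in doc.ents])
```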

Then to integrate this library within an application:

  1. Copy settings in `config.toml` into a custom configuration file.
  2. Subclass `DomainContext` to extend it for the use case.
  3. Define semantics in `domain.ttl` for the domain context.
  4. Run entity resolution to merge the structured datasets.
  5. Run `Ollama`, with the Gemma3 LLM already downloaded as described below.
  6. Instantiate new `DomainContext`, `Strwythura`, `VisHTML`, and `GraphRAG` objects or their subclassed extensions.
  7. ...
  8. Profit

Follow the patterns in the `build.py` and `errag.py` example scripts.
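
As a rough sketch of that flow (the class names come from the steps above; the import path, constructor arguments, and method calls here are assumptions, so treat `build.py` and `errag.py` as the authoritative reference):

```python
from strwythura import DomainContext, GraphRAG, Strwythura, VisHTML  # assumed import path

class MedicalNewsContext(DomainContext):
    """Hypothetical subclass extending the domain context for a use case."""
    # override hooks here to customize entity labels, semantics, etc.

# hypothetical wiring -- the real constructor signatures live in build.py / errag.py
domain = MedicalNewsContext()
builder = Strwythura(domain)      # construct the KG and related assets
viz = VisHTML(builder)            # interactive graph visualization
rag = GraphRAG(builder)           # enhanced GraphRAG for question/answer chat
```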

If you're working with documents in a language other than English,
well first that's absolutely fantastic, though next you need to:

  * Update model settings in the `config.toml` file.
  * Change the `spaCy` model downloaded here.
  * Also change the language tags used in `domain.ttl` as needed.


## Set up for demo or development

This library uses [`poetry`](https://python-poetry.org/docs/) for
package management, so first install it, then run:

```bash
poetry update
poetry run python3 -m spacy download en_core_web_md
```


<details>
  <summary><h2>Demo Part 1: Entity Resolution</h2></summary>

Run _entity resolution_ (ER) to produce entities and relations from
_structured data sources_, which tend to be more reliable than those
extracted from unstructured content.

What does this ER step buy us?  ER allows us to merge multiple
structured data sets, even without consistent _foreign keys_ being
available, producing an overlay of entities and relations among them
-- which is useful as a "backbone" for constructing a KG. Moreover,
when judgements are being made from the KG about people or
organizations, ER provides accountability for the merge decisions.

This is especially important in the public sector, healthcare, banking,
and insurance -- i.e., in use cases where you might need to "send flowers"
when automated judgements go wrong.  For example, someone gets denied
a loan, has a medical insurance claim blocked, gets a tax audit, has
their voter registration voided, becomes the subject of an arrest
warrant, and so on.  In other words, people and organizations tend to
take legal actions when someone else causes them harm. You'll want an
audit trail of decisions based on evidence, when your software systems
are making these kinds of judgements.

For the domain context in this tutorial, say we have two hypothetical
datasets which provide business directory listings:

  * `sz_er/acme_biz.json` -- "ACME Business Directory"
  * `sz_er/corp_home.json` -- "Corporates Home UK"

Plus we have slices from datasets which provide listings about
researchers and scientific authors:

  * `sz_er/orcid.json` -- [ORCID](https://orcid.org/)
  * `sz_er/scopus.json` -- [Scopus](https://www.elsevier.com/products/scopus/data)

These four datasets can be merged using ER, with the results being a
domain-specific _thesaurus_ that generates graph elements: entities,
relations, properties. We'll blend this into our _semantic layer_ used
for organizing the KG later.


The following steps are optional, since these ER results have already
been pre-computed and provided in the `sz_er/export.json` file.
If you'd like to run [Senzing](https://senzing.com/docs/quickstart/)
to reproduce these ER results, use the following steps -- otherwise
continue to Part 2 of this tutorial.

The Senzing SDK runs in Python or Java, though ER can also be run in
batch mode using a container from DockerHub:

```bash
docker pull senzing/demo-senzing
```

Once this container is available, run:

```bash
docker run -it --rm --volume ./sz_er:/tmp/data senzing/demo-senzing
```

This brings up a Linux command line prompt `I have no name!`, and the
local subdirectory `sz_er` will be mapped to the `/tmp/data` directory.
Type the following commands for batch ER into the command line prompt.

First, set up the Senzing configuration for merging these datasets:

```bash
G2ConfigTool.py
```

Within the configuration tool, register the names of the data sources
being used:

```
addDataSource ACME_BIZ
addDataSource CORP_HOME
addDataSource ORCID
addDataSource SCOPUS
save
exit
```

Load each file and run ER on its data records:

```bash
G2Loader.py -f /tmp/data/acme_biz.json
G2Loader.py -f /tmp/data/corp_home.json
G2Loader.py -f /tmp/data/orcid.json
G2Loader.py -f /tmp/data/scopus.json
```

Export the ER results to the `sz_er/export.json` file, then exit the
container:

```bash
G2Export.py -F JSON -o /tmp/data/export.json
exit
```

This will later get parsed to produce the `sz_er/er.ttl` file (RDF in
"Turtle" format) during the next part of the demo to augment the
_semantic layer_.

</details>


<details>
  <summary><h2>Demo Part 2: Build Assets</h2></summary>

Given as input:

  * `domain.ttl` -- semantics for the domain context
  * `sz_er/export.json` -- a domain-specific thesaurus based on ER
  * a list of structured datasets used in ER
  * a list of URLs from which to scrape content

The `domain.ttl` file provides a basis for iterating with an _ontology
pipeline_ process, to represent the semantics for the given domain.
It specifies metadata in terms of _vocabulary_, _taxonomy_, and
_thesaurus_ -- to use in representing the core entities and relations
in the KG.
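
To get oriented, the Turtle files can be inspected with `rdflib` (a sketch only: `rdflib` is not necessarily what this library uses internally, the SKOS query is an assumption, and `sz_er/er.ttl` exists only after the ER export has been processed):

```python
import rdflib
from rdflib.namespace import RDF, SKOS

g = rdflib.Graph()
g.parse("domain.ttl", format="turtle")     # domain semantics
g.parse("sz_er/er.ttl", format="turtle")   # ER-derived thesaurus, produced later in the build

# quick look at how many SKOS concepts the semantic layer defines (assumes SKOS is used)
concepts = list(g.subjects(RDF.type, SKOS.Concept))
print(len(g), "triples,", len(concepts), "skos:Concept subjects")
```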

The `curate.py` script described below then introduces the
_human-in-the-loop_ part of this process, where you can review
entities extracted from documents. Based on this analysis, decide
where to refine the domain context, so that more of what gets
extracted from _unstructured data sources_ can be _classified_,
_connected_, and linked into the KG.  Overall, this
process distills elements of the _lexical graph_, linking them with
elements from the _data graph_, to produce a more abstracted (i.e.,
less noisy) _semantic layer_ as the resulting KG.

Meanwhile, let's get started. The `build.py` script scrapes text
sources and constructs a _knowledge graph_ plus _entity embeddings_,
with nodes linked to chunks in a _vector store_:

```bash
poetry run python3 build.py
```

Demo data used in this case includes articles about the linkage
between eating _processed red meat_ frequently and the risks of
_dementia_ later in life, based on long-term studies.

The approach in this tutorial iterates through multiple steps to
produce the assets needed for GraphRAG downstream:

  1. Scrape each URL using `requests` and `BeautifulSoup`
  2. Split the text into _chunks_
  3. Build _vector embeddings_ for each chunk, in `LanceDB`
  4. Parse each text chunk using `spaCy`, iterating per sentence
  5. Extract _entities_ from each sentence using `GLiNER`
  6. Build a _lexical graph_ from the parse trees in `NetworkX`
  7. Run a _textrank_ algorithm to rank important entities
  8. Build an embedding model for entities using `gensim.Word2Vec`
  9. Generate an interactive visualization using `PyVis`

Note: processing may take a few extra minutes the first time it runs
since `PyTorch` must download a large (~2GB) file.
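
A condensed sketch of the scrape-and-extract steps listed above (the library names come from that list, while the chunk size, entity labels, and URL here are placeholders rather than the settings used in `build.py`):

```python
import requests
from bs4 import BeautifulSoup
import spacy
from gliner import GLiNER

# 1-2. scrape one URL and split the text into naive fixed-width chunks
url = "https://example.com/article"  # placeholder
html = requests.get(url, timeout=30).text
text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]

# 4-5. parse each chunk per sentence, then extract entities with GLiNER
nlp = spacy.load("en_core_web_md")
gliner = GLiNER.from_pretrained("urchade/gliner_base")
labels = ["person", "organization", "food", "disease"]  # hypothetical label set

for chunk in chunks:
    for sent in nlp(chunk).sents:
        for ent in gliner.predict_entities(sent.text, labels):
            print(ent["label"], "->", ent["text"])
```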

The assets get serialized into these files:

  * `data/lancedb` -- vector database tables in `LanceDB`
  * `data/kg.json` -- serialization of `NetworkX` graph
  * `data/sem.csv` -- entity semantics from `curate.py`
  * `data/entity.w2v` -- entity embeddings in `Gensim`
  * `data/url_cache.sqlite` -- URL cache in `SQLite`
  * `kg.html` -- interactive graph visualization in `PyVis`
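
Once `build.py` finishes, those assets can be loaded back for downstream work (a sketch under assumptions: the exact `NetworkX` JSON layout and the entity-embedding format may differ, so check the build code if these calls fail):

```python
import json
import lancedb
import networkx as nx
from gensim.models import Word2Vec

# knowledge graph -- assuming node-link JSON serialization
with open("data/kg.json", encoding="utf-8") as fp:
    kg = nx.node_link_graph(json.load(fp))

# entity embeddings trained with gensim.Word2Vec
w2v = Word2Vec.load("data/entity.w2v")

# vector store of text chunk embeddings
db = lancedb.connect("data/lancedb")

print(kg.number_of_nodes(), "nodes,", kg.number_of_edges(), "edges,", db.table_names(), "tables")
```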
</details>

<details>
  <summary><h2>Demo Part 3: Enhanced GraphRAG chat bot</h2></summary>

A good downstream use case for exploring a newly constructed KG is
GraphRAG, used for grounding the responses by an LLM in a
question/answer chat.

This implementation uses `BAML` <https://docs.boundaryml.com/home>
and leverages the KG using _semantic random walks_.
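
To illustrate the intuition behind a _semantic random walk_, here is a generic weighted walk over a `NetworkX` graph (a sketch of the concept, not the algorithm implemented in this library):

```python
import random
import networkx as nx

def semantic_random_walk(graph: nx.Graph, start, steps: int = 10) -> list:
    """Walk outward from a seed node, biasing each hop by edge weight."""
    node, visited = start, [start]
    for _ in range(steps):
        nbrs = list(graph.neighbors(node))
        if not nbrs:
            break
        weights = [graph[node][n].get("weight", 1.0) for n in nbrs]
        node = random.choices(nbrs, weights=weights, k=1)[0]
        visited.append(node)
    return visited

# seed walks from entities matched in the question, then retrieve the text
# chunks linked to the visited nodes as context for the LLM to summarize
```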

To set up, first download/install `Ollama` <https://ollama.com/>
and pull the Gemma3 model <https://huggingface.co/google/gemma-3-12b-it>:

```bash
ollama pull gemma3:12b
```
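
Before starting the chat example, it may help to confirm that `Ollama` is serving and the model has been pulled (a sketch assuming the default local endpoint; adjust the host or port if your setup differs):

```python
import requests

# Ollama's default local REST endpoint lists the locally available models
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
models = [m["name"] for m in resp.json().get("models", [])]

assert any(name.startswith("gemma3") for name in models), "run: ollama pull gemma3:12b"
print("available models:", models)
```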

Then run the `errag.py` script for an interactive GraphRAG example:

```bash
poetry run python3 errag.py
```
</details>

<details>
  <summary><h2>Demo Part 4: Curating an Ontology Pipeline</h2></summary>

This code uses a _semantic layer_ -- in other words, a "backbone" for
the KG -- to organize the entities and relations which get abstracted
from the lexical graph.

For now, run the `curate.py` script to generate a view of the ranked
NER results, serialized as the `data/sem.csv` file.  This can be
viewed in a spreadsheet to understand how to iterate on the semantic
definitions for more effective graph organization in the domain of the
scraped documents.

```bash
poetry run python3 curate.py
```
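
The same review can also be done programmatically (a sketch assuming `pandas`; the column names in `data/sem.csv` are not documented here, so inspect the header before filtering or sorting):

```python
import pandas as pd

# load the ranked NER results produced by curate.py
sem = pd.read_csv("data/sem.csv")

print(sem.columns.tolist())  # inspect the actual column names first
print(sem.head(20))          # review the top-ranked extracted entities
```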
</details>


<details>
  <summary><h2>Unbundling GraphRAG</h2></summary>

**Objective:**

Construct a _knowledge graph_ (KG) using open source libraries where
deep learning models provide narrowly-focused _point solutions_ to
generate components for a graph: nodes, edges, properties.

These steps define a generalized process, where this tutorial picks up
at the _lexical graph_, without the _entity linking_ (EL) part yet:

**Semantic layer:**

  1. Load any semantics for domain context from pre-defined controlled vocabularies, taxonomies, thesauri, ontologies, etc., directly into the KG.

**Data graph:**

  1. Load the structured data sources or updates into a data graph.
  2. Perform _entity resolution_ (ER) on PII extracted from the data graph.
  3. Blend the ER results into the semantic layer as a "backbone" for structuring the KG.

**Lexical graph:**

  1. Parse the text chunks, using lemmatization to normalize token spans.
  2. Construct a lexical graph from parse trees, e.g., using a textgraph algorithm.
  3. Analyze _named entity recognition_ (NER) to extract candidate entities from noun phrase spans.
  4. Analyze _relation extraction_ (RE) to extract relations between pairwise entities.
  5. Perform _entity linking_ (EL) leveraging the ER results.
  6. Promote the extracted entities and relations up to the semantic layer.
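
As a toy version of the lexical-graph steps above: lemmatize noun-phrase spans, connect spans that co-occur within a sentence, and rank them with PageRank (a sketch of the general textgraph idea, not the exact algorithm used by this library):

```python
import itertools
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_md")
graph = nx.Graph()

text = "Frequent consumption of processed red meat may raise the risk of dementia."

for sent in nlp(text).sents:
    # lemmatize each noun-phrase span to normalize candidate entities
    spans = [" ".join(tok.lemma_.lower() for tok in chunk) for chunk in sent.noun_chunks]
    for a, b in itertools.combinations(spans, 2):
        # sentence-level co-occurrence adds (or strengthens) an edge
        prior = graph.get_edge_data(a, b, default={}).get("weight", 0.0)
        graph.add_edge(a, b, weight=prior + 1.0)

# a textrank-style ranking of candidate entities
for node, rank in sorted(nx.pagerank(graph, weight="weight").items(), key=lambda kv: -kv[1]):
    print(f"{rank:.3f}  {node}")
```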

Of course many vendors suggest using a _large language model_ (LLM) as
a _one-size-fits-all_ (OSFA) "black box" approach for extracting
entities and generating an entire graph **automagically**.

However, the business process of _resolution_ -- for both entities and
relations -- requires _judgements_. If the entities getting resolved
are low-risk, low-effort in nature, then yeah knock yourself out. If
the entities represent _people_ or _organizations_, these have agency
and may take actions when misrepresented in applications which have
consequences.

Whenever judgements get delegated to _model-based_ approaches,
_generalization_ becomes part of the reasoning employed.  When the
technology within the model is based on _loss functions_, then
generalization becomes dominant -- regardless of any marketing claims
about "AI reasoning" made by tech firms.

Fortunately, decisions can be made _without models_, even in AI
applications. Shock, horror!!! Please, say it isn't so!?! Brace
yourselves, using models is a thing, but not the only thing.  For more
detailed discussion, see:

  * Part 1: Let's talk about "Today's AI" <https://www.linkedin.com/pulse/lets-talk-todays-ai-paco-nathan-co60c/>
  * Part 2: Let's talk about "Resolution" <https://www.linkedin.com/pulse/lets-talk-resolution-paco-nathan-ryjhc/>

Also keep in mind that black box approaches don't work especially well
for regulated environments, where audits, explanations, evidence, data
provenance, etc., are required.

Moreover, KGs used in mission-critical apps, such as investigations,
generally require periodic data updates, so construction isn't a
one-step process. By producing a KG based on the approach sketched
above, updates can be handled more effectively.  Any downstream use
cases, such as AI applications, also benefit from improved quality of
semantics and representation.
</details>


<details>
  <summary><h2>FAQ</h2></summary>

<dl>
<dt>Q:</dt><dd>"Have you tried this with <code>langextract</code> yet?"</dd>
<dt>A:</dt><dd>"I'll take <code>How does an instructor know a student ignored the README?</code> from the <a href="https://en.wiktionary.org/wiki/fuck_around_and_find_out" target="_blank"><em>What is FAFO?</em></a> category, for $200 ... but yes of course, it's an interesting package, which builds on other interesting work used here. Except that key parts of it miss the point entirely, in ways that only a hyperscaler could possibly fuck up so badly."</dd>
</dl>

<dl>
<dt>Q:</dt><dd>"What the hell is the name of this repo about?"</dd>
<dt>A:</dt><dd>"As you may have noticed, many open source projects published in this GitHub organization are named in a beautiful language Gymraeg, which English speakers call 'Welsh', where this word <a href="https://translate.google.com/details?sl=cy&tl=en&text=strwythura&op=translate" target="_blank"><code>strwythura</code></a> translates as the verb <strong>structure</strong> in English."</dd>
</dl>

<dl>
<dt>Q:</dt><dd>"Why aren't you using an LLM to build the graph instead?"</dd>
<dt>A:</dt><dd>"I promise to visit you in jail."</dd>
</dl>

<dl>
<dt>Q:</dt><dd>"Um, yeah, like, didn't Karpathy say to use <em>vibe coding</em>, or something? #justsayin"
<dt>A:</dt><dd>"<a href="https://effinbirds.com/" target="_blank">Piss the eff off</a> tech bro. Srsly, like yesterday -- you're embarrassing our entire industry with your overly exuberant ignorance."</dd>
</dl>

</details>


<details>
  <summary>Developer Notes</summary>

After each `BAML` release update, some committer needs to regenerate
its Python client source:

```bash
poetry run baml-cli generate --from strwythura/baml_src
```
</details>


<details>
  <summary>Experimental: Relation Extraction evaluation</summary>

Current Python libraries for _relation extraction_ (RE) are
probably best characterized as "experimental research projects".

Their tokenization approaches tend to make the mistake of "throwing
the baby out with the bath water" by not leveraging other available
information, e.g., what we have in the _textgraph_ representation of
the parsed documents. Also, they tend to ignore the semantic
constraints of the domain context, while computationally boiling
the ocean.

RE libraries which have been evaluated:

  * `GLiREL`: <https://github.com/jackboyla/GLiREL>
  * `ReLIK`: <https://github.com/SapienzaNLP/relik>
  * `OpenNRE`: <https://github.com/thunlp/OpenNRE>
  * `mREBEL`: <https://github.com/Babelscape/rebel>

This project had used `GLiREL`, although its results were quite sparse.
RE will be replaced by `BAML` or `DSPy` workflows in the near future.

There is some experimental code which illustrates `OpenNRE` evaluation.
Use the `archive/nre.sh` script to load OpenNRE pre-trained models
before running the `archive/opennre.ipynb` notebook.

This may not work in many environments, depending on how well the
`OpenNRE` library is being maintained.
</details>


<details>
  <summary>Experimental: Tutorial notebooks</summary>

A collection of Jupyter notebooks was used to prototype code. These
notebooks help illustrate important intermediate steps within these workflows:

```bash
.venv/bin/jupyter-lab
```

  * `archive/construct.ipynb` -- detailed KG construction using a lexical graph
  * `archive/chunk.ipynb` -- simple example of how to scrape and chunk text
  * `archive/vector.ipynb` -- query LanceDB table for text chunk embeddings (after running `build.py`)
  * `archive/embed.ipynb` -- query the entity embedding model (after running `build.py`)

These are now archived, though kept available for study.
</details>


<details>
  <summary>License and Copyright</summary>

The source code for **Strwythura**, plus its logo, documentation, and
examples, is available under an [MIT license](https://spdx.org/licenses/MIT.html),
which is succinct and simplifies use in commercial applications.

All materials herein are Copyright © 2024-2025 Senzing, Inc.
</details>


<details>
  <summary>Kudos and Attribution</summary>

Please use the following BibTeX entry for citing **Strwythura** if you use
it in your research or software.
Citations are helpful for the continued development and maintenance of
this library.

```bibtex
@software{strwythura,
  author = {Paco Nathan},
  title = {{Strwythura: construct a knowledge graph from unstructured data sources, organized by results from entity resolution, implementing an enhanced GraphRAG approach, and also implementing an ontology pipeline plus context engineering for optimizing AI application outcomes within a specific domain}},
  year = 2024,
  publisher = {Senzing},
  doi = {10.5281/zenodo.16934079},
  url = {https://github.com/DerwenAI/strwythura}
}
```

Kudos to 
[@louisguitton](https://github.com/louisguitton), 
[@cj2001](https://github.com/cj2001),
[@prrao87](https://github.com/prrao87), 
[@hellovai](https://github.com/hellovai),
[@docktermj](https://github.com/docktermj), 
[@jbutcher21](https://github.com/jbutcher21),  
and the kind folks at [GraphGeeks](https://graphgeeks.org/) for their support.
</details>


## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=derwenai/strwythura&type=Date)](https://star-history.com/#derwenai/strwythura&Date)

            
