strwythura 1.3.0

  * Home page: <https://github.com/DerwenAI/strwythura>
  * Summary: Construct a knowledge graph from unstructured data sources, organized by results from entity resolution, implementing an enhanced GraphRAG approach, and also implementing an ontology pipeline plus context engineering for optimizing AI application outcomes within a specific domain.
  * Author: Paco Nathan
  * Requires Python: >=3.11, <3.12
  * License: MIT
  * Uploaded: 2025-08-28 01:09:45
  * Keywords: baml, context engineering, entity-linking, entity-resolution, graph algorithms, graphrag, interactive visualization, knowledge-graph, named-entity-recognition, nlp, ontology, ontology-pipeline, relation-extraction, semantic-expansion, semantic-layer, semantic-random-walk, textgraphs, unstructured-data, vector-database
# Strwythura

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.16934079.svg)](https://doi.org/10.5281/zenodo.16934079)
![Licence](https://img.shields.io/github/license/DerwenAI/strwythura)
![Repo size](https://img.shields.io/github/repo-size/DerwenAI/strwythura)
[![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/)
![GitHub commit activity](https://img.shields.io/github/commit-activity/w/DerwenAI/strwythura?style=plastic)


**Strwythura** is a library and tutorial based on a presentation about GraphRAG for
[GraphGeeks](https://graphgeeks.org/) on 2024-08-14.

<details>
  <summary><h2>Overview</h2></summary>

How to construct a _knowledge graph_ (KG) from unstructured data
sources using _state of the art_ (SOTA) models for _named entity
recognition_ (NER), then implement an enhanced _GraphRAG_ approach,
and curate semantics for optimizing AI app outcomes within a
specific domain.

  * videos: <https://youtu.be/B6_NfvQL-BE>, <https://senzing.com/gph-graph-rag-llm-knowledge-graphs/>
  * slides: <https://derwen.ai/s/2njz#1>

Motivation for this tutorial comes from the stark fact that the
term "GraphRAG" means many things, based on multiple conflicting
definitions. Several popular implementations reveal a relatively
cursory understanding of either _natural language processing_ (NLP)
or graph algorithms, plus a _vendor bias_ toward their own query language.

See this article for more details and history:
["Unbundling the Graph in GraphRAG"](https://www.oreilly.com/radar/unbundling-the-graph-in-graphrag/).

Instead of delegating KG construction to a _large language model_
(LLM), this tutorial shows the use of sophisticated NLP pipelines
based on `spaCy`, `GLiNER`, _TextRank_, and related libraries.
Results are better/faster/cheaper, plus this provides more control
and oversight for _intentional arrangement_ of the KG. Then for
downstream usage in a question/answer chat bot, an enhanced GraphRAG
approach leverages graph algorithms (e.g., _semantic random walk_)
to optimize retrieval of text chunks which ultimately get presented
to an LLM for _summarization_ to produce responses.

For more detailed discussions, see:

  * enhanced GraphRAG: ["GraphRAG to enhance LLM-based apps"](https://derwen.ai/s/hm7h#3)
  * ontology pipeline: ["Intentional Arrangement"](https://jessicatalisman.substack.com/) by **Jessica Talisman**
  * `spaCy`: <https://spacy.io/>
  * `GLiNER`: <https://huggingface.co/urchade/gliner_base>
  * _TextRank_: <https://www.derwen.ai/docs/ptr/explain_algo/>

Some key issues regarding KG construction with LLMs which don't get
addressed much by the graph community and AI community in general:

  1. LLMs tend to mangle cross-domain semantics when used for building graphs; see _Mai2024_ referenced in the "GraphRAG to enhance LLM-based apps" talk above.
  2. You need to introduce a _semantic layer_ for representing the domain context, which follows more of a _neurosymbolic AI_ approach.
  3. Nearly all LLMs perform _question rewriting_ in ways which cannot be disabled, even when the `temperature` parameter is set to zero; this leads to varying degrees of "hallucinated questions" for which there are no clear workarounds.
  4. Any _model_ used for prediction introduces reasoning based on _generalization_, even more so when the model uses a _loss function_ for training; this tends to be the point where KG structure and semantics turn into crap; see the "Let's talk about ..." articles linked below.
  5. The approach outlined here is faster and less expensive, and produces better results than if you'd delegated KG construction to an LLM.

Of course, YMMV.

This approach leverages _neurosymbolic AI_ methods, combining
practices from:

  * _natural language processing_
  * _graph data science_
  * _ontology pipeline_
  * _context engineering_
  * _human-in-the-loop_

Overall, this illustrates a reference implementation for
_entity-resolved retrieval-augmented generation_ (ER-RAG).
</details>


## Usage in applications

This runs with Python 3.11, though the range of versions may be
extended soon.

To `pip` install from [PyPi](https://pypi.org/project/strwythura/):

```bash
python3 -m pip install strwythura
python3 -m spacy download en_core_web_md
```

Then to integrate this library within an application:

  1. Install and run `Ollama`, with the Gemma3 LLM already downloaded, as described below.
  2. Copy settings in `config.toml` into a custom configuration file.
  3. Define semantics in `domain.ttl` for the domain context of the use case.
  4. Instantiate new `Strwythura` and `GraphRAG` objects.

Follow the example patterns in `build.py` and `errag.py` respectively.
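For example, a minimal integration sketch might look like the following. The class names `Strwythura` and `GraphRAG` come from the steps above, but the import path, constructor arguments, and method calls shown here are assumptions for illustration -- check `build.py` and `errag.py` for the actual signatures:

```python
# hypothetical integration sketch -- the constructor arguments and
# method names below are assumptions; see build.py and errag.py
from strwythura import Strwythura, GraphRAG

# assumption: objects get configured from a custom TOML file
# plus the domain semantics defined in Turtle
kg = Strwythura(config = "my_config.toml", domain = "domain.ttl")
kg.build()  # hypothetical: scrape sources, construct the KG and embeddings

rag = GraphRAG(kg)  # hypothetical: wrap the KG for GraphRAG retrieval
response = rag.ask("What links processed red meat and dementia?")
print(response)
```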

If you're working with documents in a language other than English,
well first that's absolutely fantastic. Next, you need to:

  * Update model settings in the `config.toml` file.
  * Change the `spaCy` model downloaded here.
  * Also change the language tags used in `domain.ttl` as needed.


## Set up for demo or development

This library uses [`poetry`](https://python-poetry.org/docs/) for
package management, which you need to install first. Then run:

```bash
poetry update
poetry run python3 -m spacy download en_core_web_md
```


<details>
  <summary><h2>Demo Part 1: Build Assets</h2></summary>

Given as input:

  * a list of URLs from which to scrape content
  * `domain.ttl` -- semantics for the domain context

The `domain.ttl` file provides a basis for iterating with an _ontology
pipeline_ process, to represent the semantics for the given domain.
It specifies metadata in terms of _vocabulary_, _taxonomy_, and
_thesaurus_ -- to use in representing the core entities and relations
in the KG.

The `curate.py` script described below will then introduce the
_human-in-the-loop_ part of this process, where you can review
entities extracted from documents. Based on this analysis, decide
where to refine the domain context to be able to _extract_,
_classify_, and _connect_ more of what gets extracted from
_unstructured data sources_ and linked into the KG.  Overall, this
process distills elements of the _lexical graph_, linking them with
elements from the _data graph_, to produce a more abstracted (i.e.,
less noisy) _semantic layer_ as the resulting KG.

Meanwhile, let's get started. The `build.py` script scrapes text
sources and constructs a _knowledge graph_ plus _entity embeddings_,
with nodes linked to chunks in a _vector store_:

```bash
poetry run python3 build.py
```

Demo data used in this case includes articles about the linkage
between eating _processed red meat_ frequently and the risks of
_dementia_ later in life, based on long-term studies.

The approach in this tutorial iterates through multiple steps to
produce the assets needed for GraphRAG downstream:

  1. Scrape each URL using `requests` and `BeautifulSoup`
  2. Split the text into _chunks_
  3. Build _vector embeddings_ for each chunk, in `LanceDB`
  4. Parse each text chunk using `spaCy`, iterating per sentence
  5. Extract _entities_ from each sentence using `GLiNER`
  6. Build a _lexical graph_ from the parse trees in `NetworkX`
  7. Run a _textrank_ algorithm to rank important entities
  8. Build an embedding model for entities using `gensim.Word2Vec`
  9. Generate an interactive visualization using `PyVis`

Note: processing may take a few extra minutes the first time it runs
since `PyTorch` must download a large (~2GB) file.
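As a rough sketch of steps 1, 4, and 5 above -- not the exact code in `build.py` -- here is how the scraping and entity extraction pieces fit together; the URL, entity labels, and threshold are placeholders:

```python
import requests
from bs4 import BeautifulSoup
import spacy
from gliner import GLiNER

nlp = spacy.load("en_core_web_md")
ner = GLiNER.from_pretrained("urchade/gliner_base")

# step 1: scrape one URL and strip its markup
html = requests.get("https://example.com/article", timeout = 30).text
text = BeautifulSoup(html, "html.parser").get_text(separator = " ")

# steps 4-5: parse with spaCy, iterating per sentence,
# then extract candidate entities with GLiNER
labels = [ "person", "organization", "food", "medical condition" ]

for sent in nlp(text).sents:
    for ent in ner.predict_entities(sent.text, labels, threshold = 0.5):
        print(ent["text"], ent["label"], round(ent["score"], 3))
```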

The assets get serialized into these files:

  * `data/lancedb` -- vector database tables in `LanceDB`
  * `data/kg.json` -- serialization of `NetworkX` graph
  * `data/sem.csv` -- entity semantics from `curate.py`
  * `data/entity.w2v` -- entity embeddings in `Gensim`
  * `data/url_cache.sqlite` -- URL cache in `SQLite`
  * `kg.html` -- interactive graph visualization in `PyVis`
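Once `build.py` completes, a couple of these assets can be inspected directly from Python; a minimal sketch, assuming the `NetworkX` serialization uses the node-link JSON format and that the queried entity exists in the embedding vocabulary:

```python
import json
import networkx as nx
from gensim.models import Word2Vec

# reload the serialized knowledge graph (assumes node-link format)
with open("data/kg.json", encoding = "utf-8") as fp:
    kg = nx.node_link_graph(json.load(fp))

print(kg.number_of_nodes(), "nodes,", kg.number_of_edges(), "edges")

# reload the entity embeddings, then query for nearest neighbors;
# "dementia" is a placeholder key -- use any entity in the vocabulary
w2v = Word2Vec.load("data/entity.w2v")
print(w2v.wv.most_similar("dementia", topn = 5))
```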
</details>

<details>
  <summary><h2>Demo Part 2: Enhanced GraphRAG chat bot</h2></summary>

A good downstream use case for exploring a newly constructed KG is
GraphRAG, used for grounding the responses by an LLM in a
question/answer chat.

This implementation uses `BAML` <https://docs.boundaryml.com/home>
and leverages the KG using _semantic random walks_.

To set up, first download and install `Ollama` <https://ollama.com/>,
then pull the Gemma3 model <https://huggingface.co/google/gemma-3-12b-it>:

```bash
ollama pull gemma3:12b
```

Then run the `errag.py` script for an interactive GraphRAG example:

```bash
poetry run python3 errag.py
```
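To give a feel for the retrieval side, here is a toy illustration of a random walk over a `NetworkX` graph, biased by edge weights. This is not the project's actual _semantic random walk_ implementation, which also leverages the semantic layer; it merely shows the general mechanism:

```python
import random
import networkx as nx

def weighted_random_walk (
    graph: nx.Graph,
    start: str,
    steps: int = 10,
    ) -> list:
    """Toy sketch: traverse the graph, preferring heavier-weighted edges."""
    walk: list = [ start ]
    node = start

    for _ in range(steps):
        neighbors = list(graph.neighbors(node))

        if not neighbors:
            break

        weights = [ graph[node][nbr].get("weight", 1.0) for nbr in neighbors ]
        node = random.choices(neighbors, weights = weights, k = 1)[0]
        walk.append(node)

    return walk
```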
</details>

<details>
  <summary><h2>Demo Part 3: Curating an Ontology Pipeline</h2></summary>

This code uses a _semantic layer_ -- in other words, a "backbone" for
the KG -- to organize the entities and relations which get abstracted
from the lexical graph.

If you had previously run _entity resolution_ from _structured data
sources_, which tend to be more reliable than unstructured content,
this approach could integrate those results as well.

For now, run the `curate.py` script to generate a view of the ranked
NER results, serialized as the `data/sem.csv` file.  This can be
viewed in a spreadsheet to understand how to iterate on the semantic
definitions for more effective graph organization in the domain of the
scraped documents.

```bash
poetry run python3 curate.py
```
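For a quick look, the CSV can be loaded with `pandas`; the sort column named below is hypothetical, so inspect the actual header row in `data/sem.csv` first:

```python
import pandas as pd

df = pd.read_csv("data/sem.csv")
print(df.columns.tolist())  # check the actual column names

# hypothetical: sort by a rank column to surface the most central entities
#df.sort_values("rank", ascending = False).head(20)
```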
</details>


<details>
  <summary><h2>Unbundling GraphRAG</h2></summary>

**Objective:**

Construct a _knowledge graph_ (KG) using open source libraries where
deep learning models provide narrowly-focused _point solutions_ to
generate components for a graph: nodes, edges, properties.

These steps define a generalized process; this tutorial picks up
at the _lexical graph_ stage, without the _entity linking_ (EL) part yet:

**Semantic layer:**

  1. Load any semantics for domain context from pre-defined controlled vocabularies, taxonomies, thesauri, ontologies, etc., directly into the KG.
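For instance, loading pre-defined Turtle semantics into an in-memory graph is straightforward with `rdflib`; a minimal sketch, not necessarily how this codebase loads `domain.ttl`:

```python
from rdflib import Graph

sem = Graph()
sem.parse("domain.ttl", format = "turtle")
print(len(sem), "triples loaded")
```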

**Data graph:**

  1. Load the structured data sources or updates into a data graph.
  2. Perform _entity resolution_ (ER) on PII extracted from the data graph.
  3. Use ER results to generate a semantic layer as a "backbone" for the KG.

**Lexical graph:**

  1. Parse the text chunks, using lemmatization to normalize token spans.
  2. Construct a lexical graph from parse trees, e.g., using a textgraph algorithm.
  3. Run _named entity recognition_ (NER) to extract candidate entities from noun phrase spans.
  4. Run _relation extraction_ (RE) to identify relations between pairwise entities.
  5. Perform _entity linking_ (EL) leveraging the ER results.
  6. Promote the extracted entities and relations up to the semantic layer.
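To illustrate steps 1 and 3 in spirit, lemmatization in `spaCy` provides normalized keys for lexical graph nodes; a sketch assuming noun chunks as the candidate spans, which is simpler than the full textgraph construction used here:

```python
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_md")
lex = nx.Graph()

doc = nlp("Long-term studies linked processed red meats with higher dementia risks.")

for chunk in doc.noun_chunks:
    # lemmatize to normalize token spans,
    # e.g., "meats" and "meat" map to one node
    key = " ".join(tok.lemma_.lower() for tok in chunk if not tok.is_stop)

    if key:
        lex.add_node(key, span = chunk.text)

print(list(lex.nodes(data = True)))
```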

Of course many vendors suggest using a _large language model_ (LLM) as
a _one-size-fits-all_ (OSFA) "black box" approach for extracting
entities and generating an entire graph **automagically**.

However, the business process of _resolution_ -- for both entities and
relations -- requires _judgements_. If the entities getting resolved
are low-risk and low-effort in nature, then yeah, knock yourself out.
But if the entities represent _people_ or _organizations_, these have
agency and may take consequential actions when misrepresented in
applications.

Whenever judgements get delegated to _model-based_ approaches,
_generalization_ becomes the form of reasoning employed. When the
technology within the model is based on _loss functions_, then
generalization becomes dominant -- regardless of any marketing claims
about "AI reasoning" made by tech firms.

Fortunately, decisions can be made _without models_, even in AI
applications. Shock, horror!!! Please, say it isn't so!?! Brace
yourselves, using models is a thing, but not the only thing.  For more
detailed discussion, see:

  * Part 1: Let's talk about "Today's AI" <https://www.linkedin.com/pulse/lets-talk-todays-ai-paco-nathan-co60c/>
  * Part 2: Let's talk about "Resolution" <https://www.linkedin.com/pulse/lets-talk-resolution-paco-nathan-ryjhc/>

Also keep in mind that black box approaches don't work especially well
for regulated environments, where audits, explanations, evidence, data
provenance, etc., are required.

Moreover, KGs used in mission-critical apps, such as investigations,
generally require periodic data updates, so construction isn't a
one-step process. By producing a KG based on the approach sketched
above, updates can be handled more effectively.  Any downstream use
cases, such as AI applications, also benefit from improved quality of
semantics and representation.
</details>


<details>
  <summary><h2>FAQ</h2></summary>

<dl>
<dt>Q:</dt><dd>"Have you tried this with <code>langextract</code> yet?"</dd>
<dt>A:</dt><dd>"I'll take <code>How does an instructor know a student ignored the README?</code> from the <a href="https://en.wiktionary.org/wiki/fuck_around_and_find_out" target="_blank"><em>What is FAFO?</em></a> category, for $200 ... but yes of course, it's an interesting package, which builds on other interesting work used here. Except that key parts of it miss the point entirely, in ways that only a hyperscaler could possibly fuck up so badly."</dd>
</dl>

<dl>
<dt>Q:</dt><dd>"What the hell is the name of this repo about?"</dd>
<dt>A:</dt><dd>"As you may have noticed, many open source projects published in this GitHub organization are named in a beautiful language Gymraeg, which English speakers call 'Welsh', where this word <a href="https://translate.google.com/details?sl=cy&tl=en&text=strwythura&op=translate" target="_blank"><code>strwythura</code></a> translates as the verb <strong>structure</strong> in English."</dd>
</dl>

<dl>
<dt>Q:</dt><dd>"Why aren't you using an LLM to build the graph instead?"</dd>
<dt>A:</dt><dd>"I promise to visit you in jail."</dd>
</dl>

<dl>
<dt>Q:</dt><dd>"Um, yeah, like, didn't Karpathy say to use <em>vibe coding</em>, or something? #justsayin"
<dt>A:</dt><dd>"<a href="https://effinbirds.com/" target="_blank">Piss the eff off</a> tech bro. Srsly, like yesterday -- you're embarrassing our entire industry with your overly exuberant ignorance."</dd>
</dl>

</details>


<details>
  <summary>Developer Notes</summary>

After each `BAML` release update, some committer needs to regenerate
its Python client source:

```bash
poetry run baml-cli generate --from strwythura/baml_src
```
</details>


<details>
  <summary>Experimental: Relation Extraction evaluation</summary>

Current Python libraries for _relation extraction_ (RE) are
probably best characterized as "experimental research projects".

Their tokenization approaches tend to make the mistake of "throwing
the baby out with the bath water" by not leveraging other available
information, e.g., what we have in the _textgraph_ representation of
the parsed documents. Also, they tend to ignore the semantic
constraints of the domain context, while computationally boiling
the ocean.

RE libraries which have been evaluated:

  * `GLiREL`: <https://github.com/jackboyla/GLiREL>
  * `ReLIK`: <https://github.com/SapienzaNLP/relik>
  * `OpenNRE`: <https://github.com/thunlp/OpenNRE>
  * `mREBEL`: <https://github.com/Babelscape/rebel>

This project previously used `GLiREL`, although its results were quite sparse.
RE will be replaced by `BAML` or `DSPy` workflows in the near future.

There is some experimental code which illustrates `OpenNRE` evaluation.
Use the `archive/nre.sh` script to load OpenNRE pre-trained models
before running the `archive/opennre.ipynb` notebook.

This may not work in many environments, depending on how well the
`OpenNRE` library is being maintained.
</details>


<details>
  <summary>Experimental: Tutorial notebooks</summary>

<p>
A collection of Jupyter notebooks was used to prototype code. These
help illustrate important intermediate steps within these workflows:
</p>

```bash
.venv/bin/jupyter-lab
```

<ul>
<li>Part 1: <code>archive/construct.ipynb</code> -- detailed KG construction using a lexical graph</li>
<li>Part 2: <code>archive/chunk.ipynb</code> -- simple example of how to scrape and chunk text</li>
<li>Part 3: <code>archive/vector.ipynb</code> -- query LanceDB table for text chunk embeddings (after running <code>build.py</code>)</li>
<li>Part 4: <code>archive/embed.ipynb</code> -- query the entity embedding model (after running <code>build.py</code>)</li>
</ul>
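For example, the vector store built in Part 1 can be queried roughly as follows; the table name and the embedded query vector are assumptions, so see `archive/vector.ipynb` for the actual usage:

```python
import lancedb

db = lancedb.connect("data/lancedb")
print(db.table_names())  # inspect which tables build.py created

# hypothetical: open a chunk table, then run a vector search
#tbl = db.open_table("chunk")
#results = tbl.search(query_embedding).limit(5).to_pandas()
```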

<p>
These are now archived, though kept available for study.
</p>
</details>


<details>
  <summary>License and Copyright</summary>

Source code for **Strwythura**, plus its logo, documentation, and
examples, is released under an [MIT license](https://spdx.org/licenses/MIT.html),
which is succinct and simplifies use in commercial applications.

All materials herein are Copyright © 2024-2025 Senzing, Inc.
</details>


<details>
  <summary>Kudos and Attribution</summary>

Please use the following BibTeX entry for citing **Strwythura** if you use
it in your research or software.
Citations are helpful for the continued development and maintenance of
this library.

```bibtex
@software{strwythura,
  author = {Paco Nathan},
  title = {{Strwythura: construct a knowledge graph from unstructured data sources, organized by results from entity resolution, implementing an enhanced GraphRAG approach, and also implementing an ontology pipeline plus context engineering for optimizing AI application outcomes within a specific domain}},
  year = 2024,
  publisher = {Senzing},
  doi = {10.5281/zenodo.16934079},
  url = {https://github.com/DerwenAI/strwythura}
}
```

Kudos to 
[@louisguitton](https://github.com/louisguitton), 
[@cj2001](https://github.com/cj2001),
[@prrao87](https://github.com/prrao87), 
[@hellovai](https://github.com/hellovai),
[@docktermj](https://github.com/docktermj), 
[@jbutcher21](https://github.com/jbutcher21),  
and the kind folks at [GraphGeeks](https://graphgeeks.org/) for their support.
</details>


## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=derwenai/strwythura&type=Date)](https://star-history.com/#derwenai/strwythura&Date)

            
