| Field | Value |
| --- | --- |
| Name | iab-mapper |
| Version | 0.3.1 |
| Summary | Local IAB Content Taxonomy 2.x -> 3.0 mapper with vectors, SCD, OpenRTB/VAST exporters. |
| upload_time | 2025-09-10 15:43:13 |
| author | Mixpeek |
| requires_python | >=3.9 |
| license | MIT |
| keywords | iab, taxonomy, content, openrtb, ctv, classification |
| requirements | No requirements were recorded. |

<p align="center">
<img src="assets/header.png" alt="IAB Taxonomy Mapper" width="900" />
</p>
# IAB Content Taxonomy Mapper (Local CLI)
<p align="center">
<a href="https://pypi.org/project/iab-mapper/"><img alt="PyPI" src="https://img.shields.io/pypi/v/iab-mapper.svg"></a>
<a href="https://github.com/mixpeek/iab-mapper/actions"><img alt="CI" src="https://github.com/mixpeek/iab-mapper/actions/workflows/ci.yml/badge.svg"></a>
<a href="https://github.com/mixpeek/iab-mapper/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/license-MIT-blue.svg"></a>
</p>
Map **IAB Content Taxonomy 2.x** labels/codes to **IAB 3.0** locally with a deterministic → fuzzy → (optional) semantic pipeline.
Outputs are **IAB‑3.0–compatible IDs** for OpenRTB/VAST, with optional **vector attributes** (Channel, Type, Format, Language, Source, Environment) and **SCD** awareness.
> Local-first by default. No external APIs are required; LLM re‑rank is optional.
---
## 📚 Table of Contents
- [✨ Features](#-features)
- [🔎 Why migrate to IAB 3.0?](#-why-migrate-to-iab-30)
- [🧠 How it works](#-how-it-works)
- [🔧 Install](#-install)
- [🚀 Quick Start](#-quick-start)
- [🐍 Python API](#-python-api-alternative-to-cli)
- [📥 Input Formats](#-input-formats)
- [📤 Output Formats](#-output-formats)
- [⚙️ Useful Flags](#️-useful-flags)
- [🧩 Vectors](#-vectors-orthogonal-attributes)
- [✅ IAB 3.0 Conformance Notes](#-iab-30-conformance-notes)
- [📎 Official IAB References](#-official-iab-references)
- [🧯 Troubleshooting](#-troubleshooting)
- [📦 Example Commands](#-example-commands)
- [📜 License](#-license)
---
### Update catalogs (fetch latest from IAB)
Use the bundled fetcher to sync to the latest Content Taxonomy files from the official IAB GitHub repository. It will locate the latest 2.x and 3.x datasets and normalize them into this tool’s schemas.
```bash
# via Python script (direct)
python scripts/update_catalogs.py
# or via CLI command
mixpeek-iab-mapper update-catalogs --exact3 "3.1" --exact2 "2.2"
# Optional: use a GitHub token to raise rate limits
# export GITHUB_TOKEN=ghp_...
```
Outputs:
- `iab_mapper/data/iab_2x.json` → `[{"code","label"}]`
- `iab_mapper/data/iab_3x.json` → `[{"id","label","path":[],"scd":bool}]`
Replace or extend `synonyms_*.json` and `vectors_*.json` as needed for your org.
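
Before swapping in your own catalogs, it can help to check that they match the two shapes listed above. The following is a minimal sketch only: `validate_catalogs` is a hypothetical helper, not part of the package, and it assumes that codes, IDs, and labels are strings.

```python
def validate_catalogs(rows_2x, rows_3x):
    """Return a list of problems; an empty list means the shapes look right."""
    problems = []
    for i, row in enumerate(rows_2x):
        for key in ("code", "label"):
            if not isinstance(row.get(key), str):
                problems.append(f"2.x row {i}: '{key}' missing or not a string")
    for i, row in enumerate(rows_3x):
        for key in ("id", "label"):
            if not isinstance(row.get(key), str):
                problems.append(f"3.x row {i}: '{key}' missing or not a string")
        if not isinstance(row.get("path"), list):
            problems.append(f"3.x row {i}: 'path' must be a list")
        if not isinstance(row.get("scd"), bool):
            problems.append(f"3.x row {i}: 'scd' must be true or false")
    return problems


# Illustrative rows; the 3.x IDs here are made up for the example.
good_2x = [{"code": "1-4", "label": "Sports"}]
good_3x = [{"id": "t-1", "label": "Sports", "path": ["Sports"], "scd": False}]
bad_3x = [{"id": "t-2", "label": "Cooking", "path": "Food & Drink > Cooking"}]

print(validate_catalogs(good_2x, good_3x))
print(validate_catalogs(good_2x, bad_3x))
```

In practice the rows would come from `json.load` on `iab_2x.json` and `iab_3x.json`.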
---
## ✨ Features
- Deterministic alias/exact matching → fuzzy string matching → **optional local embeddings** (Sentence-Transformers) for near-misses
- Emits **IAB 3.0 IDs** (not just labels) and configurable **`cattax`** for OpenRTB conformance
- Multi-category output per input; **vector attributes** support
- **SCD (Sensitive Content) flag** visibility and optional exclusion (`--drop-scd`)
- Exports **CSV or JSON**; includes **OpenRTB** and **VAST CONTENTCAT** helpers
- Local-only, reproducible, versioned catalogs
---
## 🔎 Why migrate to IAB 3.0?
- 3.0 introduces clearer separation of primary topic “aboutness” vs. orthogonal vectors (e.g., news vs. opinion, formats, channels).
- Better support for CTV/video, podcasts, games, and app stores.
- Non‑backwards compatible in areas like News/Opinion and entertainment genres; careful migration is required.
This tool makes migration practical: it emits valid 3.0 IDs and helps curate edge cases with overrides, synonyms, thresholds, and audit outputs.
---
## 🧠 How it works
1) Normalize text and apply alias/exact matches via synonyms.
2) Fuzzy retrieval (rapidfuzz | TF‑IDF | BM25) with configurable thresholds.
3) Optional semantic augmentation with local embeddings (Sentence‑Transformers or TF‑IDF KNN).
4) Optional local LLM re‑ranking (Ollama) for ordering only.
5) Assemble outputs: topic IDs + vector IDs → OpenRTB `content.cat` with configurable `cattax`.
6) SCD flags are surfaced and can be excluded with `--drop-scd`.
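
The first two steps can be sketched as a small cascade. This is an illustration under stated assumptions, not the package's implementation: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for rapidfuzz, and the catalog, synonym table, IDs, and `map_label` helper are all made up for the example.

```python
from difflib import SequenceMatcher

# Tiny illustrative catalog and alias table; IDs and labels are hypothetical.
CATALOG = {"t-1": "Sports", "t-2": "Food & Drink", "t-3": "Cooking"}
SYNONYMS = {"cuisine": "t-2"}


def normalize(text):
    """Lowercase and collapse whitespace so comparisons are stable."""
    return " ".join(text.lower().split())


def map_label(label, fuzzy_cut=0.92):
    """Return (id, source, score) or None, trying exact matches before fuzzy."""
    key = normalize(label)
    # Step 1: alias and exact matches are deterministic and win outright.
    if key in SYNONYMS:
        return SYNONYMS[key], "exact", 1.0
    for node_id, node_label in CATALOG.items():
        if normalize(node_label) == key:
            return node_id, "exact", 1.0
    # Step 2: fuzzy fallback; keep the best candidate only if it clears the cut.
    best_id, best_score = None, 0.0
    for node_id, node_label in CATALOG.items():
        score = SequenceMatcher(None, key, normalize(node_label)).ratio()
        if score > best_score:
            best_id, best_score = node_id, score
    if best_id is not None and best_score >= fuzzy_cut:
        return best_id, "fuzzy", best_score
    return None


print(map_label("  SPORTS "))
print(map_label("Sport", fuzzy_cut=0.8))
print(map_label("Sport"))
```

With the default cut of 0.92, "Sport" stays unmapped because its similarity to "sports" is about 0.91; lowering the cut to 0.8 lets the fuzzy match through. That is the trade-off the threshold controls.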
---
## 🔧 Install
### From PyPI (recommended)
```bash
pip install iab-mapper
```
### 1) Clone / unpack
```bash
unzip iab-mapper.zip && cd iab-mapper
```
### 2) Python env & install
```bash
python -m venv .venv && source .venv/bin/activate
pip install -e .
# Optional (enable local embeddings / KNN search)
pip install -e ".[emb]"
```
> If you need fully offline installs, pre-bundle the Sentence-Transformers model in your image/host and point to it via `--emb-model` (local path).
---
## 📁 Project Layout
```
iab-mapper/
  pyproject.toml
  sample_2x_codes.csv
  iab_mapper/
    __init__.py
    cli.py
    pipeline.py
    matching.py
    normalize.py
    embeddings.py
    io_utils.py
    data/
      iab_2x.json
      iab_3x.json
      synonyms_2x.json
      synonyms_3x.json
      vectors_channel.json
      vectors_type.json
      vectors_format.json
      vectors_language.json
      vectors_source.json
      vectors_environment.json
```
Replace the stub `data/*.json` with your **full IAB catalogs** (include `id`, `label`, `path`, and `scd` on 3.0 nodes).
---
## 🚀 Quick Start
```bash
# map the sample CSV using fuzzy matching only
mixpeek-iab-mapper sample_2x_codes.csv -o mapped.json
# enable local embeddings (improves recall on free-text labels)
mixpeek-iab-mapper sample_2x_codes.csv -o mapped.json --use-embeddings
```
For each input row, the output contains:
- `out_ids` → **IAB 3.0 IDs** (topics + any vector IDs)
- `openrtb` → `{"content":{"cat":[...],"cattax":"<enum>"}}` (configurable via `--cattax`)
- `vast_contentcat` → `"id1","id2",...`
- Topic confidences, sources (`"exact"`/`"fuzzy"`/`"embed"`/`"override"`), SCD flags, and chosen vectors.
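
A minimal sketch of how these export fields relate to one another, assuming the topic and vector IDs are already known. `build_exports` is a hypothetical helper, not part of the package, and the order-preserving de-duplication is an assumption rather than documented behavior.

```python
def build_exports(topic_ids, vector_ids, cattax="2"):
    """Combine topic and vector IDs into the three export shapes."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    out_ids = list(dict.fromkeys([*topic_ids, *vector_ids]))
    return {
        "out_ids": out_ids,
        "openrtb": {"content": {"cat": out_ids, "cattax": cattax}},
        # VAST CONTENTCAT value: each ID quoted, joined by commas.
        "vast_contentcat": ",".join(f'"{i}"' for i in out_ids),
    }


exports = build_exports(["3-5-2"], ["1026", "1068"])
print(exports["openrtb"])
print(exports["vast_contentcat"])
```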
---
## 🐍 Python API (alternative to CLI)
Install:
```bash
pip install iab-mapper
```
Basic usage:
```python
from pathlib import Path
from iab_mapper.pipeline import Mapper, MapConfig
import iab_mapper as pkg
# Use packaged stub catalogs or point data_dir to your own
data_dir = Path(pkg.__file__).parent / "data"
cfg = MapConfig(
    fuzzy_method="bm25",   # rapidfuzz|tfidf|bm25
    fuzzy_cut=0.92,
    use_embeddings=False,  # set True and choose emb_model to enable
    max_topics=3,
    drop_scd=False,
    cattax="2",            # OpenRTB content.cattax enum
    overrides_path=None,   # path to JSON overrides if desired
)
mapper = Mapper(cfg, str(data_dir))
# Single record with optional vectors
rec = {
    "code": "2-12",
    "label": "Food & Drink",
    "channel": "editorial",
    "type": "article",
    "format": "video",
    "language": "en",
    "source": "professional",
    "environment": "ctv",
}
out = mapper.map_record(rec)
print(out["out_ids"]) # topic + vector IDs
print(out["openrtb"]) # {"content": {"cat": [...], "cattax": "2"}}
print(out["vast_contentcat"]) # "id1","id2",...
# Or just map topics
topics = mapper.map_topics("Cooking how-to")
# Batch over a list of dicts
rows = [rec, {"label": "Sports"}]
mapped = [mapper.map_record(r) for r in rows]
```
Enable local embeddings (optional):
```python
cfg = MapConfig(fuzzy_method="rapidfuzz", use_embeddings=True, emb_model="tfidf", emb_cut=0.8)
mapper = Mapper(cfg, str(data_dir))
out = mapper.map_record({"label": "Cooking how-to"})
```
Use overrides (force mapping before matching):
```python
cfg = MapConfig(overrides_path="overrides.json") # [{"code":"1-4","label":null,"ids":["2-3-18"]}]
mapper = Mapper(cfg, str(data_dir))
```
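
For illustration, an overrides file in the shape shown in the comment above could be written and consulted like this. It is a sketch under assumptions: `find_override` is a hypothetical helper, and matching on `code` first and then on `label` is a guess at reasonable semantics, not the package's documented lookup order.

```python
import json
import tempfile
from pathlib import Path

# Same entry shape as the inline comment: code, label (may be null), forced ids.
overrides = [
    {"code": "1-4", "label": None, "ids": ["2-3-18"]},
]


def find_override(record, entries):
    """Return the forced IDs for a record, or None when no entry applies."""
    for entry in entries:
        if entry.get("code") and entry["code"] == record.get("code"):
            return entry["ids"]
        if entry.get("label") and entry["label"] == record.get("label"):
            return entry["ids"]
    return None


# Write the file to a temporary directory, then read it back as the tool would.
path = Path(tempfile.mkdtemp()) / "overrides.json"
path.write_text(json.dumps(overrides, indent=2), encoding="utf-8")
loaded = json.loads(path.read_text(encoding="utf-8"))

print(find_override({"code": "1-4", "label": "Sports"}, loaded))
print(find_override({"label": "Cooking how-to"}, loaded))
```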
---
## 📥 Input Formats
### CSV
- Required columns: `label`
- Optional columns: `code` (2.x), `channel`, `type`, `format`, `language`, `source`, `environment`
Example:
```csv
code,label,channel,type,format,language,source,environment
1-4,Sports,editorial,article,video,en,professional,ctv
, Cooking how-to ,editorial,article,video,en,professional,web
```
### JSON
- List of objects with the same fields as CSV.
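
Since both formats carry the same fields, a CSV can be turned into the JSON-style list of objects with the standard library. The sketch below is a hypothetical pre-flight step, not part of the package; trimming whitespace and dropping empty optional fields are assumptions made for the example.

```python
import csv
import io

FIELDS = ("code", "label", "channel", "type", "format",
          "language", "source", "environment")

SAMPLE = """code,label,channel,type,format,language,source,environment
1-4,Sports,editorial,article,video,en,professional,ctv
, Cooking how-to ,editorial,article,video,en,professional,web
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, enforcing the required label column."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {k: (raw.get(k) or "").strip() for k in FIELDS}
        if not row["label"]:
            raise ValueError("every row needs a non-empty 'label'")
        # Keep only the fields that actually carry a value.
        rows.append({k: v for k, v in row.items() if v})
    return rows


rows = read_rows(SAMPLE)
print(rows)
```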
---
## 📤 Output Formats
### CSV
- Includes compact JSON strings for complex fields (e.g., `topic_ids`, `openrtb`).
### JSON
- List of records. Example snippet:
```json
{
  "in_code": "2-12",
  "in_label": "Food & Drink",
  "out_ids": ["3-5-2", "1026", "1068"],
  "out_labels": ["Food & Drink > Cooking"],
  "topic_ids": ["3-5-2"],
  "topic_confidence": [0.89],
  "topic_sources": ["fuzzy"],
  "topic_scd": [false],
  "vectors": {"channel":"editorial","type":"article","format":"video","language":"en","source":"professional","environment":"ctv"},
  "cattax": "2",
  "openrtb": {"content":{"cat":["3-5-2","1026","1068"],"cattax":"2"}},
  "vast_contentcat": "\"3-5-2\",\"1026\",\"1068\""
}
```
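
Records in this shape are easy to post-process. As a sketch, the hypothetical helper below flags rows that deserve a manual look, either because no topic was found or because a topic's confidence falls under a threshold you choose; the 0.9 default is an arbitrary example value, not a recommendation from the package.

```python
def needs_review(record, min_confidence=0.9):
    """True when a row has no topics or any topic is below the threshold."""
    scores = record.get("topic_confidence", [])
    return not record.get("topic_ids") or any(s < min_confidence for s in scores)


# Abbreviated copy of the example record above.
record = {
    "in_label": "Food & Drink",
    "topic_ids": ["3-5-2"],
    "topic_confidence": [0.89],
    "topic_sources": ["fuzzy"],
}

print(needs_review(record))
print(needs_review(record, min_confidence=0.8))
```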
---
## ⚙️ Useful Flags
```bash
# thresholds and options
--fuzzy-cut 0.92              # 0..1 (higher = stricter)
--use-embeddings              # enable local embeddings
--emb-model all-MiniLM-L6-v2  # Sentence-Transformers model name or local path
--emb-cut 0.80                # cosine similarity cut
--max-topics 3                # max topic categories per row
--drop-scd                    # exclude SCD nodes from results
--cattax 2                    # set OpenRTB content.cattax enum for Content Taxonomy
--overrides overrides.json    # JSON overrides applied before matching
--unmapped-out misses.json    # write rows with no topic_ids to file
```
---
## 🧩 Vectors (Orthogonal Attributes)
Pass vector values as columns in your CSV (or as the same fields in JSON input):
- **Channel** (`vectors_channel.json`): e.g., `editorial`, `ugc`
- **Type** (`vectors_type.json`): e.g., `article`, `podcast`, `livestream`
- **Format** (`vectors_format.json`): e.g., `video`, `text`, `audio`
- **Language** (`vectors_language.json`): e.g., `en`, `es`, `de`
- **Source** (`vectors_source.json`): e.g., `professional`, `brand`, `news`
- **Environment** (`vectors_environment.json`): e.g., `ctv`, `web`, `app`
Each value maps to a **stable IAB 3.0 ID** that is appended to the `cat` array.
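
Conceptually, the vector step is a per-field lookup from value to ID. The sketch below assumes each catalog is a flat value-to-ID mapping, which may not match the real layout of the `vectors_*.json` files; the IDs and the `vector_ids` helper are made up for illustration.

```python
# Hypothetical in-memory vector catalogs; real IDs come from vectors_*.json.
VECTOR_CATALOGS = {
    "channel": {"editorial": "v-101", "ugc": "v-102"},
    "environment": {"ctv": "v-201", "web": "v-202"},
}


def vector_ids(record):
    """Collect the IDs for every vector field whose value is in a catalog."""
    ids = []
    for field, table in VECTOR_CATALOGS.items():
        value = (record.get(field) or "").strip().lower()
        if value in table:
            ids.append(table[value])
    return ids


print(vector_ids({"label": "Sports", "channel": "Editorial", "environment": "ctv"}))
```

Unknown values are skipped here rather than raising, so a typo in a vector column does not block the topic mapping.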
---
## ✅ IAB 3.0 Conformance Notes
- Emits **IDs** for `content.cat` and sets **`"cattax":"<enum>"`**.
- Supports **multiple categories per content** (topic IDs + vectors).
- **Strict ID validation**: only IDs present in your 3.0 catalog are emitted.
- **SCD-aware**: show SCD flags and optionally exclude (`--drop-scd`).
> This tool is **not affiliated with IAB**. It is an independent utility for compatibility with IAB Content Taxonomy.
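
Strict ID validation and SCD exclusion can be pictured as a filter over candidate IDs. This is a sketch against the `iab_3x.json` node shape (`id`, `label`, `path`, `scd`), not the package's own code; `filter_ids` and the catalog entries are hypothetical.

```python
# Two made-up 3.0 nodes in the documented shape.
CATALOG_3X = [
    {"id": "t-1", "label": "Sports", "path": ["Sports"], "scd": False},
    {"id": "t-9", "label": "Example Sensitive Topic",
     "path": ["Example Sensitive Topic"], "scd": True},
]


def filter_ids(candidate_ids, catalog, drop_scd=False):
    """Keep only IDs present in the catalog, optionally dropping SCD nodes."""
    by_id = {node["id"]: node for node in catalog}
    kept = []
    for cid in candidate_ids:
        node = by_id.get(cid)
        if node is None:
            continue  # unknown ID: never emitted
        if drop_scd and node["scd"]:
            continue  # sensitive node excluded on request
        kept.append(cid)
    return kept


print(filter_ids(["t-1", "t-9", "bogus"], CATALOG_3X))
print(filter_ids(["t-1", "t-9", "bogus"], CATALOG_3X, drop_scd=True))
```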
---
## 📎 Official IAB References
- Content Taxonomy 3.0 Implementation Guide (PDF): https://iabtechlab.com/wp-content/uploads/2021/09/Implementation-Guide-Content-Taxonomy-3-0-pc-Sept2021.pdf
- IAB Tech Lab Content Taxonomy page: https://iabtechlab.com/standards/content-taxonomy/
- Implementation guidance (historic mappings and migration notes):
  - https://github.com/InteractiveAdvertisingBureau/Taxonomies/blob/develop/implementation.md#content-21-to-ad-product-20-taxonomy-mapping-implementation-guidance
  - https://github.com/InteractiveAdvertisingBureau/Taxonomies/blob/develop/Taxonomy%20Mappings/Ad%20Product%202.0%20to%20Content%202.1.tsv
  - https://github.com/katieshell/Taxonomies/blob/main/implementation.md#implementation-guidance-for-content-1--content-2-mapping
  - https://github.com/katieshell/Taxonomies/blob/main/implementation.md#migrating-from-content-taxonomy-10
---
## 🔬 Evaluation (recommended)
Create a small gold set for your domain and run periodic checks:
```bash
# (pseudo) compare mapped.json to gold.json for accuracy & unmapped rates
python scripts/eval.py mapped.json gold.json
```
Gate releases on accuracy deltas so behavior stays stable for audits.
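
Since the command above is pseudocode, here is one way such a check could look. It is a sketch under assumptions: the gold file format (a list of objects with `in_label` and acceptable `topic_ids`) is invented for this example, and "accuracy" is defined as the top-ranked mapped topic appearing in the gold set.

```python
def evaluate(mapped, gold):
    """Compare mapped rows to a gold set keyed by input label."""
    gold_by_label = {g["in_label"]: set(g["topic_ids"]) for g in gold}
    scored = hits = unmapped = 0
    for row in mapped:
        expected = gold_by_label.get(row["in_label"])
        if expected is None:
            continue  # row not covered by the gold set
        scored += 1
        predicted = row.get("topic_ids") or []
        if not predicted:
            unmapped += 1
        elif predicted[0] in expected:
            hits += 1
    return {
        "scored": scored,
        "accuracy": hits / scored if scored else 0.0,
        "unmapped_rate": unmapped / scored if scored else 0.0,
    }


# Tiny in-memory stand-ins for mapped.json and gold.json.
mapped = [
    {"in_label": "Food & Drink", "topic_ids": ["3-5-2"]},
    {"in_label": "Sports", "topic_ids": []},
]
gold = [
    {"in_label": "Food & Drink", "topic_ids": ["3-5-2"]},
    {"in_label": "Sports", "topic_ids": ["t-1"]},
]

print(evaluate(mapped, gold))
```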
---
## 🛠️ Updating Catalogs
Replace the stub JSONs in `iab_mapper/data/` with your official datasets:
- `iab_2x.json` → include `code`, `label`
- `iab_3x.json` → include `id`, `label`, `path[]`, `scd`
- `synonyms_*.json` → org-specific aliases
- `vectors_*.json` → official vector catalogs mapping values to stable 3.0 IDs
Commit with a version bump and note `taxonomy_version` in your release notes.
---
## 🧯 Troubleshooting
- **No matches:** lower `--fuzzy-cut` or enable `--use-embeddings`.
- **Weird matches:** raise thresholds; add synonyms into `synonyms_*.json`.
- **Offline:** pre-bundle the Sentence-Transformers model; set `--emb-model` to a local folder path.
- **CSV issues:** ensure UTF-8 and header row (`label` required).
- **Unmapped:** inspect `--unmapped-out` and add overrides/synonyms as needed.
---
## 📦 Example Commands
```bash
# Strict fuzzy only
mixpeek-iab-mapper sample_2x_codes.csv -o mapped.csv --fuzzy-cut 0.95
# Embeddings on, drop SCD, max 2 topics, custom cattax, collect unmapped
mixpeek-iab-mapper sample_2x_codes.csv -o mapped.json --use-embeddings --drop-scd --max-topics 2 --cattax 2 --unmapped-out misses.json
```
---
## 📜 License
MIT. See [LICENSE](LICENSE).
Include IAB attribution in your deployed UI/footer:
> “IAB is a registered trademark of the Interactive Advertising Bureau. This tool is an independent utility built by Mixpeek for interoperability with IAB Content Taxonomy standards.”
Raw data
{
"_id": null,
"home_page": null,
"name": "iab-mapper",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "iab, taxonomy, content, openrtb, ctv, classification",
"author": "Mixpeek",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/48/b7/f1796e31dc977e2f6c899895f87e403144931ece6cbe16cbc317b060b2e4/iab_mapper-0.3.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Local IAB Content Taxonomy 2.x -> 3.0 mapper with vectors, SCD, OpenRTB/VAST exporters.",
"version": "0.3.1",
"project_urls": {
"Homepage": "https://github.com/mixpeek/iab-mapper",
"Issues": "https://github.com/mixpeek/iab-mapper/issues",
"Repository": "https://github.com/mixpeek/iab-mapper"
},
"split_keywords": [
"iab",
" taxonomy",
" content",
" openrtb",
" ctv",
" classification"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "821ec38633e6714d4552da6b8a291a238b1e4e16cb9e76fb113c0cd02a7e594c",
"md5": "b2ccb187a48841ecfa16d24191467728",
"sha256": "9f16024158ff75f19571ccc9b27f9dd265a9cad2d7483d9a93633bd88cf8931a"
},
"downloads": -1,
"filename": "iab_mapper-0.3.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b2ccb187a48841ecfa16d24191467728",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 44233,
"upload_time": "2025-09-10T15:43:12",
"upload_time_iso_8601": "2025-09-10T15:43:12.172954Z",
"url": "https://files.pythonhosted.org/packages/82/1e/c38633e6714d4552da6b8a291a238b1e4e16cb9e76fb113c0cd02a7e594c/iab_mapper-0.3.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "48b7f1796e31dc977e2f6c899895f87e403144931ece6cbe16cbc317b060b2e4",
"md5": "435dc347eb8ab1dc81bc4c4bc7776e4b",
"sha256": "e68d467b5527c2e42d2272168a6fa617b4f6f7dacc49f9bad12a316f828a2041"
},
"downloads": -1,
"filename": "iab_mapper-0.3.1.tar.gz",
"has_sig": false,
"md5_digest": "435dc347eb8ab1dc81bc4c4bc7776e4b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 48229,
"upload_time": "2025-09-10T15:43:13",
"upload_time_iso_8601": "2025-09-10T15:43:13.216240Z",
"url": "https://files.pythonhosted.org/packages/48/b7/f1796e31dc977e2f6c899895f87e403144931ece6cbe16cbc317b060b2e4/iab_mapper-0.3.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-10 15:43:13",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "mixpeek",
"github_project": "iab-mapper",
"github_not_found": true,
"lcname": "iab-mapper"
}