DOI: [10.5281/zenodo.15596191](https://doi.org/10.5281/zenodo.15596191)
# Journal Digital Corpus
The **Journal Digital Corpus** is a curated, timestamped transcription corpus
derived from Swedish historical newsreels. It combines speech-to-text
transcriptions and intertitle OCR to enable scalable and searchable analysis of
early-to-mid 20th-century audiovisual media.
The SF Veckorevy newsreels, screened weekly across Sweden for over five
decades, form one of the most extensive audiovisual records of 20th-century
Swedish life. Yet their research potential has remained largely untapped due to
barriers to access and analysis. The Journal Digital Corpus offers the first
comprehensive transcription of both speech and intertitles from this material.
This corpus is the result of two purpose-built libraries:
- **[SweScribe](https://github.com/Modern36/swescribe)** – an ASR pipeline
developed for transcription of speech in historical Swedish newsreels.
- **[stum](https://github.com/Modern36/stum)** – an OCR tool for detecting and
transcribing intertitles in silent film footage.
<!-- numbers --> The corpus consists of 2,225,334 words transcribed from 204 hours of speech across 2,544 videos and 302,312 words from 49,107 intertitles from 4,327 videos. <!-- numbers -->
The primary files used for this project are publicly available on
[Filmarkivet.se](https://www.filmarkivet.se/), a web
resource containing curated parts of Swedish film archives.
## Installation
Clone the repository, `cd` into the directory, and run:
`python -m pip install -e .`
Alternatively, install the released package from PyPI:
`python -m pip install journal_digital`
## 2025-06-04
Created with `SweScribe==v0.1.0` and `stum==v.0.2.0` on `2025-06-04` without
manual editing.
## Files
- `/name_year.tsv`: Pairings of filename and publication year, based on metadata
from [The Swedish Media Database (SMDB)](https://smdb.kb.se/).
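The column layout of `name_year.tsv` is not specified beyond "pairings of filename and publication year". The following is a minimal sketch for loading it, assuming two tab-separated columns (filename first, year second) and an optional header row. The function name `load_name_year` is hypothetical and not part of the package.

```python
import csv
from pathlib import Path


def load_name_year(tsv_path):
    """Return {filename: year} from a two-column, tab-separated file.

    Assumption: column 1 is the filename and column 2 the year. Rows whose
    second column is not an integer (for example a header row) are skipped.
    """
    mapping = {}
    with Path(tsv_path).open(newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # blank or malformed line
            name, year = row[0].strip(), row[1].strip()
            if year.isdigit():
                mapping[name] = int(year)
    return mapping
```

If the actual file uses a different column order or additional columns, adjust the indices accordingly.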
```
/corpus
├── /intertitle
│   ├── /collection_1
│   ├── /collection_2
│   └── /collection_3
│       ├── /1920
│       │   ├── video_1.srt
│       │   ├── video_2.srt
│       │   └── video_3.srt
│       ├── /1921
│       │   ├── video_1.srt
│       │   ├── video_2.srt
│       │   └── video_3.srt
│       └── /1922
│           ├── video_1.srt
│           ├── video_2.srt
│           └── video_3.srt
├── /speech
│   ├── /collection_1
│   ├── /collection_2
│   └── /collection_3
│       ├── /1920
│       │   ├── video_1.srt
│       │   ├── video_2.srt
│       │   └── video_3.srt
│       ├── /1921
│       │   ├── video_1.srt
│       │   ├── video_2.srt
│       │   └── video_3.srt
│       └── /1922
│           ├── video_1.srt
│           ├── video_2.srt
│           └── video_3.srt
```
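The layout above (`corpus/<kind>/<collection>/<year>/<video>.srt`) can be traversed with the standard library alone. The sketch below is illustrative and not part of the package: `parse_srt` and `iter_srt_files` are hypothetical helper names, and the parser assumes conventional SRT timing lines of the form `HH:MM:SS,mmm --> HH:MM:SS,mmm`.

```python
import re
from pathlib import Path

# Conventional SRT timing line, e.g. "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(hours, minutes, seconds, millis):
    """Convert the four captured timestamp fields to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_srt(text):
    """Parse SRT text into a list of (start_seconds, end_seconds, text) tuples."""
    cues = []
    normalized = text.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for index, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                fields = match.groups()
                start, end = _seconds(*fields[:4]), _seconds(*fields[4:])
                # Everything after the timing line is the cue text.
                body = " ".join(part.strip() for part in lines[index + 1:] if part.strip())
                cues.append((start, end, body))
                break
    return cues


def iter_srt_files(corpus_root, kind):
    """Yield (collection, year, path) for each .srt file under corpus_root/kind.

    `kind` is "speech" or "intertitle", following the directory tree above.
    """
    for path in sorted(Path(corpus_root, kind).glob("*/*/*.srt")):
        yield path.parent.parent.name, path.parent.name, path
```

A typical use would be to loop over `iter_srt_files("corpus", "speech")` and pass `path.read_text(encoding="utf-8")` to `parse_srt`. The UTF-8 encoding is also an assumption.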
### Development Setup
`python -m pip install '.[dev]'`
`pre-commit install`
Add the path to your videos as `JOURNAL_DIGITALROOT` in `.env`.
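How the project itself loads `.env` is not stated here. As a rough sketch, a `KEY=VALUE` entry such as `JOURNAL_DIGITALROOT=/path/to/videos` could be read with the standard library as follows. The helper name `read_env_value` and the simple `KEY=VALUE` format are assumptions for illustration.

```python
import os
from pathlib import Path


def read_env_value(key, env_file=".env"):
    """Return `key` from the process environment, else from a KEY=VALUE line in env_file.

    Returns None when the key is found in neither place.
    """
    if key in os.environ:
        return os.environ[key]
    path = Path(env_file)
    if not path.is_file():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Drop optional surrounding quotes.
            return value.strip().strip("\"'")
    return None
```

For example, `read_env_value("JOURNAL_DIGITALROOT")` would return the configured video directory, or `None` if it has not been set.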
## Research Context and Licensing
### Modern Times 1936
The Journal Digital Corpus was developed for the
[Modern Times 1936](https://modernatider1936.se/en/) research
[project at Lund University](https://portal.research.lu.se/sv/projects/modern-times-1936-2),
Sweden. The project investigates what software "sees," "hears," and "perceives"
when pattern recognition technologies such as 'AI' are applied to
media-historical sources. The project is
[funded by Riksbankens Jubileumsfond](https://www.rj.se/bidrag/2021/moderna-tider-1936/).
### License
The Journal Digital Corpus is licensed under the [CC-BY-NC 4.0](./LICENSE)
International license.
## References
```bibtex
@article{bain2022whisperx,
title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
journal={INTERSPEECH 2023},
year={2023}
}
```
```bibtex
@inproceedings{malmsten2022hearing,
title={Hearing voices at the national library : a speech corpus and acoustic model for the Swedish language},
author={Malmsten, Martin and Haffenden, Chris and B{\"o}rjeson, Love},
booktitle={Proceeding of Fonetik 2022 : Speech, Music and Hearing Quarterly Progress and Status Report, TMH-QPSR},
volume={3},
year={2022}
}
```
```bibtex
@inproceedings{zhou2017east,
title={East: an efficient and accurate scene text detector},
author={Zhou, Xinyu and Yao, Cong and Wen, He and Wang, Yuzhi and Zhou, Shuchang and He, Weiran and Liang, Jiajun},
booktitle={Proceedings of the IEEE conference on Computer Vision and Pattern Recognition},
pages={5551--5560},
year={2017}
}
```