deepparse


- Name: deepparse
- Version: 0.9.13
- Home page: https://deepparse.org/
- Summary: A library for parsing multinational street addresses using deep learning.
- Upload time: 2024-09-12 22:13:00
- Author: Marouane Yassine, David Beauchemin
- Requires Python: >=3.8
- License: LGPLv3
            <div align="center">
<img src="https://raw.githubusercontent.com/GRAAL-Research/deepparse/main/docs/source/_static/logos/deepparse.png" width="220" height="91"/>


[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/deepparse)](https://pypi.org/project/deepparse)
[![PyPI Status](https://badge.fury.io/py/deepparse.svg)](https://badge.fury.io/py/deepparse)
[![PyPI Status](https://pepy.tech/badge/deepparse)](https://pepy.tech/project/deepparse)
[![Downloads](https://pepy.tech/badge/deepparse/month)](https://pepy.tech/project/deepparse)

[![Formatting](https://github.com/GRAAL-Research/deepparse/actions/workflows/formatting.yml/badge.svg?branch=stable)](https://github.com/GRAAL-Research/deepparse/actions/workflows/formatting.yml)
[![Linting](https://github.com/GRAAL-Research/deepparse/actions/workflows/linting.yml/badge.svg?branch=stable)](https://github.com/GRAAL-Research/deepparse/actions/workflows/linting.yml)
[![Tests](https://github.com/GRAAL-Research/deepparse/actions/workflows/tests.yml/badge.svg?branch=stable)](https://github.com/GRAAL-Research/deepparse/actions/workflows/tests.yml)
[![Docs](https://github.com/GRAAL-Research/deepparse/actions/workflows/docs.yml/badge.svg?branch=stable)](https://github.com/GRAAL-Research/deepparse/actions/workflows/docs.yml)

[![codecov](https://codecov.io/gh/GRAAL-Research/deepparse/branch/main/graph/badge.svg)](https://codecov.io/gh/GRAAL-Research/deepparse)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/62464699ff0740d0b8064227c4274b98)](https://www.codacy.com/gh/GRAAL-Research/deepparse/dashboard?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=GRAAL-Research/deepparse&amp;utm_campaign=Badge_Grade)
<a href="https://github.com/psf/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>

[![pr welcome](https://img.shields.io/badge/PR-Welcome-%23FF8300.svg?)](https://img.shields.io/badge/PR-Welcome-%23FF8300.svg?)
[![License: LGPL v3](https://img.shields.io/badge/License-LGPL%20v3-blue.svg)](http://www.gnu.org/licenses/lgpl-3.0)
[![DOI](https://zenodo.org/badge/276474742.svg)](https://zenodo.org/badge/latestdoi/276474742)

[![Download](https://img.shields.io/badge/Download%20Dataset-blue?style=for-the-badge&logo=download)](https://github.com/GRAAL-Research/deepparse-address-data)

[![Rate on Openbase](https://badges.openbase.com/python/rating/deepparse.svg)](https://openbase.com/python/deepparse?utm_source=embedded&utm_medium=badge&utm_campaign=rate-badge)
</div>

## Here is Deepparse.

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning.

Use deepparse to

- parse multinational addresses using one of our pretrained models, with or without an attention mechanism,
- parse addresses directly from the command line without writing any code,
- parse addresses with our out-of-the-box FastAPI parser,
- retrain our pretrained models on new data to improve parsing on specific country address patterns,
- retrain our pretrained models with new prediction tags easily,
- retrain our pretrained models with or without freezing some layers,
- train a new Seq2Seq address parsing model easily using a new model configuration.

Read the documentation at [deepparse.org](https://deepparse.org).

Deepparse is compatible with the __latest version of PyTorch__ and __Python >= 3.8__.

### Countries and Results

We evaluate our models on two forms of address data:

- **clean data** which refers to addresses containing elements from four categories, namely a street name, a
  municipality, a province and a postal code,
- **incomplete data** which is made up of addresses missing at least one category amongst the aforementioned ones.
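
The distinction between the two forms can be sketched in a few lines of Python. This is only an illustration of the definition above, not the evaluation code used for the reported results: it assumes an address is available as a list of `(token, tag)` pairs, uses the default tag names listed later in this README, and the sample pairs are hand-written rather than real model output.

```python
# The four categories named above, written with deepparse's default tag names.
REQUIRED_TAGS = {"StreetName", "Municipality", "Province", "PostalCode"}


def is_clean(tagged_tokens):
    """Return True if every required category appears at least once."""
    present = {tag for _, tag in tagged_tokens}
    return REQUIRED_TAGS <= present


# Hand-written (token, tag) pairs for illustration, not real model output.
complete = [
    ("350", "StreetNumber"),
    ("rue des Lilas", "StreetName"),
    ("Ouest", "Orientation"),
    ("Québec", "Municipality"),
    ("Québec", "Province"),
    ("G1L 1B6", "PostalCode"),
]
# Dropping the postal code makes the address "incomplete".
missing_postal_code = complete[:-1]

print(is_clean(complete))
print(is_clean(missing_postal_code))
```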

You can get our dataset [here](https://github.com/GRAAL-Research/deepparse-address-data).

#### Clean Data

The following table presents the accuracy (using clean data) on the 20 countries we used during training for both our
models. Attention mechanisms improve performance by around 0.5% for all countries.

| Country        |   FastText (%) |   BPEmb (%) | Country     |   FastText (%) |   BPEmb (%) |
|:---------------|---------------:|------------:|:------------|---------------:|------------:|
| Norway         |          99.06 |       98.3  | Austria     |          99.21 |       97.82 |
| Italy          |          99.65 |       98.93 | Mexico      |          99.49 |       98.9  |
| United Kingdom |          99.58 |       97.62 | Switzerland |          98.9  |       98.38 |
| Germany        |          99.72 |       99.4  | Denmark     |          99.71 |       99.55 |
| France         |          99.6  |       98.18 | Brazil      |          99.31 |       97.69 |
| Netherlands    |          99.47 |       99.54 | Australia   |          99.68 |       98.44 |
| Poland         |          99.64 |       99.52 | Czechia     |          99.48 |       99.03 |
| United States  |          99.56 |       97.69 | Canada      |          99.76 |       99.03 |
| South Korea    |          99.97 |       99.99 | Russia      |          98.9  |       96.97 |
| Spain          |          99.73 |       99.4  | Finland     |          99.77 |       99.76 |

We have also made a zero-shot evaluation of our models using clean data from 41 other countries; the results are shown
in the next table.

| Country      |   FastText (%) |   BPEmb (%) | Country       |   FastText (%) |   BPEmb (%) |
|:-------------|---------------:|------------:|:--------------|---------------:|------------:|
| Latvia       |          89.29 |       68.31 | Faroe Islands |          71.22 |       64.74 |
| Colombia     |          85.96 |       68.09 | Singapore     |          86.03 |       67.19 |
| Réunion      |          84.3  |       78.65 | Indonesia     |          62.38 |       63.04 |
| Japan        |          36.26 |       34.97 | Portugal      |          93.09 |       72.01 |
| Algeria      |          86.32 |       70.59 | Belgium       |          93.14 |       86.06 |
| Malaysia     |          83.14 |       89.64 | Ukraine       |          93.34 |       89.42 |
| Estonia      |          87.62 |       70.08 | Bangladesh    |          72.28 |       65.63 |
| Slovenia     |          89.01 |       83.96 | Hungary       |          51.52 |       37.87 |
| Bermuda      |          83.19 |       59.16 | Romania       |          90.04 |       82.9  |
| Philippines  |          63.91 |       57.36 | Belarus       |          93.25 |       78.59 |
| Bosnia       |          88.54 |       67.46 | Moldova       |          89.22 |       57.48 |
| Lithuania    |          93.28 |       69.97 | Paraguay      |          96.02 |       87.07 |
| Croatia      |          95.8  |       81.76 | Argentina     |          81.68 |       71.2  |
| Ireland      |          80.16 |       54.44 | Kazakhstan    |          89.04 |       76.13 |
| Greece       |          87.08 |       38.95 | Bulgaria      |          91.16 |       65.76 |
| Serbia       |          92.87 |       76.79 | New Caledonia |          94.45 |       94.46 |
| Sweden       |          73.13 |       86.85 | Venezuela     |          79.23 |       70.88 |
| New Zealand  |          91.25 |       75.57 | Iceland       |          83.7  |       77.09 |
| India        |          70.3  |       63.68 | Uzbekistan    |          85.85 |       70.1  |
| Cyprus       |          89.64 |       89.47 | Slovakia      |          78.34 |       68.96 |
| South Africa |          95.68 |       74.82 |               |                |             |

Moreover, we also tested whether using an attention mechanism further improves zero-shot performance on
those countries; the results are shown in the next table.

| Country       |   FastText (%) |   FastTextAtt (%) |   BPEmb (%) |   BPEmbAtt (%) | Country       |   FastText (%) |   FastTextAtt (%) |   BPEmb (%) |   BPEmbAtt (%) |
|:--------------|---------------:|------------------:|------------:|---------------:|:--------------|---------------:|------------------:|------------:|---------------:|
| Ireland       |          80.16 |             89.11 |       54.44 |          81.84 | Serbia        |          92.87 |             95.88 |       76.79 |           91.4 |
| Uzbekistan    |          85.85 |             87.24 |       70.1  |          76.71 | Ukraine       |          93.34 |             94.58 |       89.42 |          92.65 |
| South Africa  |          95.68 |             97.25 |       74.82 |          97.95 | Paraguay      |          96.02 |             97.08 |       87.07 |          97.36 |
| Greece        |          87.08 |             86.04 |       38.95 |          58.79 | Algeria       |          86.32 |              87.3 |       70.59 |          84.56 |
| Belarus       |          93.25 |             97.4  |       78.59 |          97.49 | Sweden        |          73.13 |             89.24 |       86.85 |          93.53 |
| Portugal      |          93.09 |             94.92 |       72.01 |          93.76 | Hungary       |          51.52 |             51.08 |       37.87 |          24.48 |
| Iceland       |          83.7  |             96.54 |       77.09 |          96.63 | Colombia      |          85.96 |             90.08 |       68.09 |          88.52 |
| Latvia        |          89.29 |             93.14 |       68.31 |          73.79 | Malaysia      |          83.14 |             74.62 |       89.64 |          91.14 |
| Bosnia        |          88.54 |             87.27 |       67.46 |          89.02 | India         |           70.3 |             75.31 |       63.68 |          80.56 |
| Réunion       |          84.3  |             97.74 |       78.65 |          94.27 | Croatia       |           95.8 |             95.32 |       81.76 |          85.99 |
| Estonia       |          87.62 |             88.2  |       70.08 |          77.32 | New Caledonia |          94.45 |             99.61 |       94.46 |          99.77 |
| Japan         |          36.26 |             46.91 |       34.97 |          49.48 | New Zealand   |          91.25 |                97 |       75.57 |           95.7 |
| Singapore     |          86.03 |             89.92 |       67.19 |          88.17 | Romania       |          90.04 |             95.38 |        82.9 |          93.41 |
| Bangladesh    |          72.28 |             78.21 |       65.63 |          77.09 | Slovakia      |          78.34 |             82.29 |       68.96 |             96 |
| Argentina     |          81.68 |             88.59 |       71.2  |          86.8  | Kazakhstan    |          89.04 |             92.37 |       76.13 |          96.08 |
| Venezuela     |          79.23 |             95.47 |       70.88 |          96.38 | Indonesia     |          62.38 |             66.87 |       63.04 |          71.17 |
| Bulgaria      |          91.16 |             91.73 |       65.76 |          93.28 | Cyprus        |          89.64 |             97.44 |       89.47 |          98.01 |
| Bermuda       |          83.19 |             93.25 |       59.16 |          93.8  | Moldova       |          89.22 |             92.07 |       57.48 |          89.08 |
| Slovenia      |          89.01 |             95.08 |       83.96 |          96.73 | Lithuania     |          93.28 |             87.74 |       69.97 |          78.67 |
| Philippines   |          63.91 |             81.94 |       57.36 |          83.42 | Belgium       |          93.14 |             90.72 |       86.06 |          89.85 |
| Faroe Islands |          71.22 |             73.23 |       64.74 |          85.39 |               |                |                   |             |                |
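
To read the table above more easily, one can compute the change brought by attention for each embedding model. The sketch below does this for four rows copied from the table; the helper name `attention_gain` is ours, and the result is a difference in percentage points. It suggests that attention helps in most cases but not all (Hungary and Lithuania include drops).

```python
# (FastText, FastTextAtt, BPEmb, BPEmbAtt) accuracies copied from the table above.
rows = {
    "Ireland": (80.16, 89.11, 54.44, 81.84),
    "Sweden": (73.13, 89.24, 86.85, 93.53),
    "Hungary": (51.52, 51.08, 37.87, 24.48),
    "Lithuania": (93.28, 87.74, 69.97, 78.67),
}


def attention_gain(row):
    """Return the (FastText, BPEmb) change in percentage points when attention is added."""
    fasttext, fasttext_att, bpemb, bpemb_att = row
    return round(fasttext_att - fasttext, 2), round(bpemb_att - bpemb, 2)


gains = {country: attention_gain(row) for country, row in rows.items()}
for country, (fasttext_gain, bpemb_gain) in gains.items():
    print(f"{country}: FastText {fasttext_gain:+.2f}, BPEmb {bpemb_gain:+.2f}")
```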

#### Incomplete Data

The following table presents the accuracy on the 20 countries we used during training for both our models but for
incomplete data. We didn't test on the other 41 countries since we did not train on them and therefore do not expect to
achieve an interesting performance. Attention mechanisms improve performance by around 0.5% for all countries.

| Country        |   FastText (%) |   BPEmb (%) | Country     |   FastText (%) |   BPEmb (%) |
|:---------------|---------------:|------------:|:------------|---------------:|------------:|
| Norway         |          99.52 |       99.75 | Austria     |          99.55 |       98.94 |
| Italy          |          99.16 |       98.88 | Mexico      |          97.24 |       95.93 |
| United Kingdom |          97.85 |       95.2  | Switzerland |          99.2  |       99.47 |
| Germany        |          99.41 |       99.38 | Denmark     |          97.86 |       97.9  |
| France         |          99.51 |       98.49 | Brazil      |          98.96 |       97.12 |
| Netherlands    |          98.74 |       99.46 | Australia   |          99.34 |       98.7  |
| Poland         |          99.43 |       99.41 | Czechia     |          98.78 |       98.88 |
| United States  |          98.49 |       96.5  | Canada      |          98.96 |       96.98 |
| South Korea    |          91.1  |       99.89 | Russia      |          97.18 |       96.01 |
| Spain          |          99.07 |       98.35 | Finland     |          99.04 |       99.52 |

## Getting Started:

```python
from deepparse.parser import AddressParser
from deepparse.dataset_container import CSVDatasetContainer

address_parser = AddressParser(model_type="bpemb", device=0)

# you can parse one address
parsed_address = address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6")

# or multiple addresses
parsed_address = address_parser(
    [
        "350 rue des Lilas Ouest Québec Québec G1L 1B6",
        "350 rue des Lilas Ouest Québec Québec G1L 1B6",
    ]
)

# or multinational addresses
# Canada, US, Germany, UK and South Korea
parsed_address = address_parser(
    [
        "350 rue des Lilas Ouest Québec Québec G1L 1B6",
        "777 Brockton Avenue, Abington MA 2351",
        "Ansgarstr. 4, Wallenhorst, 49134",
        "221 B Baker Street",
        "서울특별시 종로구 사직로3길 23",
    ]
)

# you can also get the probability of the predicted tags
parsed_address = address_parser(
    "350 rue des Lilas Ouest Québec Québec G1L 1B6", with_prob=True
)

# Print the parsed address
print(parsed_address)

# or use one of our dataset containers
addresses_to_parse = CSVDatasetContainer(
    "./a_path.csv", column_names=["address_column_name"], is_training_container=False
)
address_parser(addresses_to_parse)
```
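
If you do not yet have a CSV file for the dataset container, the following sketch writes one using only the standard library. The column name `address_column_name` comes from the example above; the temporary directory and the two sample addresses are placeholders, and we assume the header row must match the name passed in `column_names`.

```python
import csv
import os
import tempfile

addresses = [
    "350 rue des Lilas Ouest Québec Québec G1L 1B6",
    "777 Brockton Avenue, Abington MA 2351",
]

# Placeholder location; the example above uses "./a_path.csv".
csv_path = os.path.join(tempfile.mkdtemp(), "a_path.csv")

with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    # Header row, assumed to match the name given in column_names.
    writer.writerow(["address_column_name"])
    # One address per row; csv.writer quotes values that contain commas.
    writer.writerows([address] for address in addresses)

# Read the file back to check that each address survived as a single field.
with open(csv_path, newline="", encoding="utf-8") as csv_file:
    rows = list(csv.DictReader(csv_file))

print(rows[0]["address_column_name"])
```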

The default prediction tags are the following:

- `"StreetNumber"`: for the street number,
- `"StreetName"`: for the name of the street,
- `"Unit"`: for the unit (such as apartment),
- `"Municipality"`: for the municipality,
- `"Province"`: for the province or local region,
- `"PostalCode"`: for the postal code,
- `"Orientation"`: for the street orientation (e.g. west, east),
- `"GeneralDelivery"`: for other delivery information.
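
As an illustration of how these tags can be consumed downstream, the sketch below groups `(token, tag)` pairs into one value per tag. It is plain Python and does not call deepparse: the input pairs are hand-written for illustration, and the helper `group_by_tag` is hypothetical rather than part of the library.

```python
DEFAULT_TAGS = [
    "StreetNumber",
    "StreetName",
    "Unit",
    "Municipality",
    "Province",
    "PostalCode",
    "Orientation",
    "GeneralDelivery",
]


def group_by_tag(tagged_tokens):
    """Join the tokens of each tag with spaces; tags that never appear stay None."""
    grouped = {tag: None for tag in DEFAULT_TAGS}
    for token, tag in tagged_tokens:
        previous = grouped.get(tag)
        grouped[tag] = token if previous is None else f"{previous} {token}"
    return grouped


# Hand-written pairs for illustration, not real model output.
tagged = [
    ("350", "StreetNumber"),
    ("rue", "StreetName"),
    ("des", "StreetName"),
    ("Lilas", "StreetName"),
    ("Ouest", "Orientation"),
    ("Québec", "Municipality"),
    ("Québec", "Province"),
    ("G1L", "PostalCode"),
    ("1B6", "PostalCode"),
]

grouped = group_by_tag(tagged)
print(grouped["StreetName"])
print(grouped["PostalCode"])
```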

### Parse Addresses From the Command Line

You can also use our CLI to parse addresses:

```sh
parse <parsing_model> <dataset_path> <export_file_name>
```

### Parse Addresses Using Your Own Retrained Model

> See [here](https://github.com/GRAAL-Research/deepparse/blob/main/examples/retrained_model_parsing.py) for a complete
> example.

```python
from deepparse.parser import AddressParser

address_parser = AddressParser(
    model_type="bpemb",
    device=0,
    path_to_retrained_model="path/to/retrained/bpemb/model.p",
)

address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6")
```

### Parse Address With Our Out-Of-The-Box API

We also offer an out-of-the-box REST API to parse addresses using FastAPI.

#### Installation:

First, ensure that you have Docker Engine and Docker Compose installed on your machine.
If not, you can install them by following the documentation below, in this order:

1. [Docker Engine](https://docs.docker.com/engine/install/)
2. [Docker Compose](https://docs.docker.com/compose/install/linux/#install-using-the-repository)



Once you have Docker Engine and Docker Compose installed, you can run the following command to start the FastAPI application:

```sh
docker compose up app
```

#### Sentry

Also, you can monitor your application usage with [Sentry](https://sentry.io) by setting the environment variable
`SENTRY_DSN` to your Sentry project's DSN. There is an example of the `.env` file in the project's root named
`.env_example`. You can copy it using the following command:

```sh
cp .env_example .env
```

#### Request Examples

Once the application is up and running and port `8000` is exposed on your localhost, you can send a request with one
of the following methods:

##### cURL POST request

```sh
curl -X POST --location "http://127.0.0.1:8000/parse/bpemb-attention" --http1.1 \
    -H "Host: 127.0.0.1:8000" \
    -H "Content-Type: application/json" \
    -d "[
          {\"raw\": \"350 rue des Lilas Ouest Quebec city Quebec G1L 1B6\"},
          {\"raw\": \"2325 Rue de l'Université, Québec, QC G1V 0A6\"}
        ]"
```

##### Python POST request

```python
import requests

url = "http://localhost:8000/parse/bpemb"
addresses = [
    {"raw": "350 rue des Lilas Ouest Quebec city Quebec G1L 1B6"},
    {"raw": "2325 Rue de l'Université, Québec, QC G1V 0A6"},
]

response = requests.post(url, json=addresses)
parsed_addresses = response.json()
print(parsed_addresses)
```
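
If the `requests` package is not available, the same POST request can be sketched with the standard library. The helper `build_parse_request` is hypothetical; it only assembles the request, using the `/parse/<model>` route and the `{"raw": ...}` payload shape shown in the examples above. Sending it is left commented out because it needs the application to be running.

```python
import json
import urllib.request


def build_parse_request(addresses, model="bpemb", host="http://localhost:8000"):
    """Build (but do not send) a POST request for the parsing endpoint."""
    payload = [{"raw": address} for address in addresses]
    return urllib.request.Request(
        url=f"{host}/parse/{model}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_parse_request(
    [
        "350 rue des Lilas Ouest Quebec city Quebec G1L 1B6",
        "2325 Rue de l'Université, Québec, QC G1V 0A6",
    ]
)
print(request.full_url)

# Sending the request requires the application to be up:
# with urllib.request.urlopen(request) as response:
#     parsed_addresses = json.load(response)
```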



### Retrain a Model

> See [here](https://github.com/GRAAL-Research/deepparse/blob/main/examples/fine_tuning.py) for a complete example
> using Pickle
> and [here](https://github.com/GRAAL-Research/deepparse/blob/main/examples/fine_tuning_with_csv_dataset.py)
> for a complete example using CSV.

```python
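# `training_container` is a dataset container holding your training data
# (see the complete examples linked above for how to build one).
# We will retrain the fasttext version of our pretrained model.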
address_parser = AddressParser(model_type="fasttext", device=0)

address_parser.retrain(training_container, train_ratio=0.8, epochs=5, batch_size=8)
```
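
To make the `train_ratio` argument concrete, here is a minimal sketch of an 80/20 split in plain Python. It only illustrates the proportion: we assume the remaining 20% serves as validation data, and deepparse's own splitting logic may differ (for example by shuffling or sampling at random). The helper `split_by_ratio` is hypothetical.

```python
def split_by_ratio(samples, train_ratio):
    """Split a list into a training part and a held-out part by position."""
    n_train = int(len(samples) * train_ratio)
    return samples[:n_train], samples[n_train:]


# Ten dummy samples stand in for training addresses.
samples = list(range(10))
train_samples, valid_samples = split_by_ratio(samples, train_ratio=0.8)

print(len(train_samples), len(valid_samples))
```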

One can also freeze some layers to speed up the training using the `layers_to_freeze` parameter.

```python
address_parser.retrain(
    training_container,
    train_ratio=0.8,
    epochs=5,
    batch_size=8,
    layers_to_freeze="seq2seq",
)
```

You can also give a specific name to the retrained model. This name will be used as the model name (for print and
class name) when reloading it.

```python
address_parser.retrain(
    training_container,
    train_ratio=0.8,
    epochs=5,
    batch_size=8,
    name_of_the_retrain_parser="MyNewParser",
)
```

### Retrain a Model With an Attention Mechanism

> See [here](https://github.com/GRAAL-Research/deepparse/blob/main/examples/retrain_attention_model.py) for a complete
> example.

```python
# We will retrain the fasttext attention version of our pretrained model.
address_parser = AddressParser(
    model_type="fasttext", device=0, attention_mechanism=True
)

address_parser.retrain(training_container, train_ratio=0.8, epochs=5, batch_size=8)
```

### Retrain a Model With New Tags

> See [here](https://github.com/GRAAL-Research/deepparse/blob/main/examples/retrain_with_new_prediction_tags.py) for a
> complete example.

```python
address_components = {"ATag": 0, "AnotherTag": 1, "EOS": 2}
address_parser.retrain(
    training_container,
    train_ratio=0.8,
    epochs=1,
    batch_size=128,
    prediction_tags=address_components,
)
```
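
The tag dictionary maps each tag name to an integer index. When you have many tags, it can be built from a list instead of being typed by hand. The helper below is hypothetical and follows the pattern of the example above, where `"EOS"` takes the last index; that placement is inferred from the example rather than from a documented requirement.

```python
def build_prediction_tags(tag_names):
    """Map each tag name to its position, then append the "EOS" tag last."""
    tags = {name: index for index, name in enumerate(tag_names)}
    tags["EOS"] = len(tags)
    return tags


address_components = build_prediction_tags(["ATag", "AnotherTag"])
print(address_components)
```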

### Retrain a Seq2Seq Model From Scratch

> See [here](https://github.com/GRAAL-Research/deepparse/blob/main/examples/retrain_with_new_seq2seq_params.py) for
> a complete example.

```python
seq2seq_params = {"encoder_hidden_size": 512, "decoder_hidden_size": 512}
address_parser.retrain(
    training_container,
    train_ratio=0.8,
    epochs=1,
    batch_size=128,
    seq2seq_params=seq2seq_params,
)
```

### Download Our Models

Deepparse downloads models automatically when you use them, but you can also pre-download them. Here are the URLs to download our pretrained models directly:

- [FastText](https://graal.ift.ulaval.ca/public/deepparse/fasttext.ckpt),
- [FastTextAttention](https://graal.ift.ulaval.ca/public/deepparse/fasttext_attention.ckpt),
- [BPEmb](https://graal.ift.ulaval.ca/public/deepparse/bpemb.ckpt),
- [BPEmbAttention](https://graal.ift.ulaval.ca/public/deepparse/bpemb_attention.ckpt),
- [FastText Light](https://graal.ift.ulaval.ca/public/deepparse/fasttext.magnitude.gz) (
  using [Magnitude Light](https://github.com/davebulaval/magnitude-light)).

Or you can use our CLI to download our pretrained models directly using:

```sh
download_model <model_name>
```

Starting with version 0.9.8, the weights are also released with the GitHub release notes, available [here](https://github.com/GRAAL-Research/deepparse/releases).
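
For scripted setups, the checkpoint URLs listed above can also be fetched with the standard library. In this sketch the dictionary keys are labels we chose for illustration (they are not necessarily the names accepted by `download_model`), and the `download` helper is hypothetical. It is defined but not called, since the files are large.

```python
import urllib.request

BASE_URL = "https://graal.ift.ulaval.ca/public/deepparse"

# Illustrative labels mapped to the checkpoint file names from the list above.
MODEL_FILES = {
    "fasttext": "fasttext.ckpt",
    "fasttext_attention": "fasttext_attention.ckpt",
    "bpemb": "bpemb.ckpt",
    "bpemb_attention": "bpemb_attention.ckpt",
}


def model_url(name):
    """Return the download URL for one of the labels in MODEL_FILES."""
    return f"{BASE_URL}/{MODEL_FILES[name]}"


def download(name, destination):
    """Download a checkpoint to `destination` (performs a network request)."""
    urllib.request.urlretrieve(model_url(name), destination)


print(model_url("bpemb"))
```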

------------------

## Installation

Before installing deepparse, you must have the latest version of [PyTorch](https://pytorch.org/) in your environment.

- **Install the stable version of Deepparse:**

```sh
pip install deepparse
```

- **Install the stable version of Deepparse with the app extra dependencies:**

```sh
pip install deepparse[app]  # for bash terminal
pip install 'deepparse[app]' # for ZSH terminal
```

- **Install the stable version of Deepparse with all extra dependencies:**

```sh
pip install deepparse[all]  # for bash terminal
pip install 'deepparse[all]' # for ZSH terminal
```

- **Install the latest development version of Deepparse:**

```sh
pip install -U git+https://github.com/GRAAL-Research/deepparse.git@dev
```

------------------

## Cite

Use the following for the article:

```
@misc{yassine2020leveraging,
    title={{Leveraging Subword Embeddings for Multinational Address Parsing}},
    author={Marouane Yassine and David Beauchemin and François Laviolette and Luc Lamontagne},
    year={2020},
    eprint={2006.16152},
    archivePrefix={arXiv}
}
```

and this one for the package:

```
@misc{deepparse,
    author = {Marouane Yassine and David Beauchemin},
    title  = {{Deepparse: A State-Of-The-Art Deep Learning Multinational Addresses Parser}},
    year   = {2020},
    note   = {\url{https://deepparse.org}}
}
```

------------------

## Contributing to Deepparse

We welcome user input, whether it is regarding bugs found in the library or feature proposals! Make sure to have a
look at our [contributing guidelines](https://github.com/GRAAL-Research/deepparse/blob/main/.github/CONTRIBUTING.md)
for more details on this matter.

## License

Deepparse is LGPLv3 licensed, as found in
the [LICENSE file](https://github.com/GRAAL-Research/deepparse/blob/main/LICENSE).

------------------

            

Raw data

            {
    "_id": null,
    "home_page": "https://deepparse.org/",
    "name": "deepparse",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": "Marouane Yassine, David Beauchemin",
    "author_email": "marouane.yassine.1@ulaval.ca, david.beauchemin.5@ulaval.ca",
    "download_url": "https://files.pythonhosted.org/packages/6a/74/e3fde05da4a4a472a86a99b85ae852281a22528d8fedb9f1568c185104e1/deepparse-0.9.13.tar.gz",
    "platform": null,
    "description": "<div align=\"center\">\n<img src=\"https://raw.githubusercontent.com/GRAAL-Research/deepparse/main/docs/source/_static/logos/deepparse.png\" width=\"220\" height=\"91\"/>\n\n\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/deepparse)](https://pypi.org/project/deepparse)\n[![PyPI Status](https://badge.fury.io/py/deepparse.svg)](https://badge.fury.io/py/deepparse)\n[![PyPI Status](https://pepy.tech/badge/deepparse)](https://pepy.tech/project/deepparse)\n[![Downloads](https://pepy.tech/badge/deepparse/month)](https://pepy.tech/project/deepparse)\n\n[![Formatting](https://github.com/GRAAL-Research/deepparse/actions/workflows/formatting.yml/badge.svg?branch=stable)](https://github.com/GRAAL-Research/deepparse/actions/workflows/formatting.yml)\n[![Linting](https://github.com/GRAAL-Research/deepparse/actions/workflows/linting.yml/badge.svg?branch=stable)](https://github.com/GRAAL-Research/deepparse/actions/workflows/linting.yml)\n[![Tests](https://github.com/GRAAL-Research/deepparse/actions/workflows/tests.yml/badge.svg?branch=stable)](https://github.com/GRAAL-Research/deepparse/actions/workflows/tests.yml)\n[![Docs](https://github.com/GRAAL-Research/deepparse/actions/workflows/docs.yml/badge.svg?branch=stable)](https://github.com/GRAAL-Research/deepparse/actions/workflows/docs.yml)\n\n[![codecov](https://codecov.io/gh/GRAAL-Research/deepparse/branch/main/graph/badge.svg)](https://codecov.io/gh/GRAAL-Research/deepparse)\n[![Codacy Badge](https://app.codacy.com/project/badge/Grade/62464699ff0740d0b8064227c4274b98)](https://www.codacy.com/gh/GRAAL-Research/deepparse/dashboard?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=GRAAL-Research/deepparse&amp;utm_campaign=Badge_Grade)\n<a href=\"https://github.com/psf/black\"><img alt=\"Code style: black\" src=\"https://img.shields.io/badge/code%20style-black-000000.svg\"></a>\n\n[![pr 
welcome](https://img.shields.io/badge/PR-Welcome-%23FF8300.svg?)](https://img.shields.io/badge/PR-Welcome-%23FF8300.svg?)\n[![License: LGPL v3](https://img.shields.io/badge/License-LGPL%20v3-blue.svg)](http://www.gnu.org/licenses/lgpl-3.0)\n[![DOI](https://zenodo.org/badge/276474742.svg)](https://zenodo.org/badge/latestdoi/276474742)\n\n[![Download](https://img.shields.io/badge/Download%20Dataset-blue?style=for-the-badge&logo=download)](https://github.com/GRAAL-Research/deepparse-address-data)\n\n[![Rate on Openbase](https://badges.openbase.com/python/rating/deepparse.svg)](https://openbase.com/python/deepparse?utm_source=embedded&utm_medium=badge&utm_campaign=rate-badge)\n</div>\n\n## Here is Deepparse.\n\nDeepparse is a state-of-the-art library for parsing multinational street addresses using deep learning.\n\nUse deepparse to\n\n- parse multinational address using one of our pretrained models with or without attention mechanism,\n- parse addresses directly from the command line without code to write,\n- parse addresses with our out-of-the-box FastAPI parser,\n- retrain our pretrained models on new data to improve parsing on specific country address patterns,\n- retrain our pretrained models with new prediction tags easily,\n- retrain our pretrained models with or without freezing some layers,\n- train a new Seq2Seq addresses parsing models easily using a new model configuration.\n\nRead the documentation at [deepparse.org](https://deepparse.org).\n\nDeepparse is compatible with the __latest version of PyTorch__ and  __Python >= 3.8__.\n\n### Countries and Results\n\nWe evaluate our models on two forms of address data\n\n- **clean data** which refers to addresses containing elements from four categories, namely a street name, a\n  municipality, a province and a postal code,\n- **incomplete data** which is made up of addresses missing at least one category amongst the aforementioned ones.\n\nYou can get our dataset 
[here](https://github.com/GRAAL-Research/deepparse-address-data).\n\n#### Clean Data\n\nThe following table presents the accuracy (using clean data) on the 20 countries we used during training for both our\nmodels. Attention mechanisms improve performance by around 0.5% for all countries.\n\n| Country        |   FastText (%) |   BPEmb (%) | Country     |   FastText (%) |   BPEmb (%) |\n|:---------------|---------------:|------------:|:------------|---------------:|------------:|\n| Norway         |          99.06 |       98.3  | Austria     |          99.21 |       97.82 |\n| Italy          |          99.65 |       98.93 | Mexico      |          99.49 |       98.9  |\n| United Kingdom |          99.58 |       97.62 | Switzerland |          98.9  |       98.38 |\n| Germany        |          99.72 |       99.4  | Denmark     |          99.71 |       99.55 |\n| France         |          99.6  |       98.18 | Brazil      |          99.31 |       97.69 |\n| Netherlands    |          99.47 |       99.54 | Australia   |          99.68 |       98.44 |\n| Poland         |          99.64 |       99.52 | Czechia     |          99.48 |       99.03 |\n| United States  |          99.56 |       97.69 | Canada      |          99.76 |       99.03 |\n| South Korea    |          99.97 |       99.99 | Russia      |          98.9  |       96.97 |\n| Spain          |          99.73 |       99.4  | Finland     |          99.77 |       99.76 |\n\nWe have also made a zero-shot evaluation of our models using clean data from 41 other countries; the results are shown\nin the next table.\n\n| Country      |   FastText (%) |   BPEmb (%) | Country       |   FastText (%) |   BPEmb (%) |\n|:-------------|---------------:|------------:|:--------------|---------------:|------------:|\n| Latvia       |          89.29 |       68.31 | Faroe Islands |          71.22 |       64.74 |\n| Colombia     |          85.96 |       68.09 | Singapore     |          86.03 |       67.19 |\n| R\u00e9union      |      
    84.3  |       78.65 | Indonesia     |          62.38 |       63.04 |\n| Japan        |          36.26 |       34.97 | Portugal      |          93.09 |       72.01 |\n| Algeria      |          86.32 |       70.59 | Belgium       |          93.14 |       86.06 |\n| Malaysia     |          83.14 |       89.64 | Ukraine       |          93.34 |       89.42 |\n| Estonia      |          87.62 |       70.08 | Bangladesh    |          72.28 |       65.63 |\n| Slovenia     |          89.01 |       83.96 | Hungary       |          51.52 |       37.87 |\n| Bermuda      |          83.19 |       59.16 | Romania       |          90.04 |       82.9  |\n| Philippines  |          63.91 |       57.36 | Belarus       |          93.25 |       78.59 |\n| Bosnia       |          88.54 |       67.46 | Moldova       |          89.22 |       57.48 |\n| Lithuania    |          93.28 |       69.97 | Paraguay      |          96.02 |       87.07 |\n| Croatia      |          95.8  |       81.76 | Argentina     |          81.68 |       71.2  |\n| Ireland      |          80.16 |       54.44 | Kazakhstan    |          89.04 |       76.13 |\n| Greece       |          87.08 |       38.95 | Bulgaria      |          91.16 |       65.76 |\n| Serbia       |          92.87 |       76.79 | New Caledonia |          94.45 |       94.46 |\n| Sweden       |          73.13 |       86.85 | Venezuela     |          79.23 |       70.88 |\n| New Zealand  |          91.25 |       75.57 | Iceland       |          83.7  |       77.09 |\n| India        |          70.3  |       63.68 | Uzbekistan    |          85.85 |       70.1  |\n| Cyprus       |          89.64 |       89.47 | Slovakia      |          78.34 |       68.96 |\n| South Africa |          95.68 |       74.82 |\n\nMoreover, we also tested the performance when using attention mechanism to further improve zero-shot performance on\nthose countries; the result are shown in the next table.\n\n| Country       |   FastText (%) |   FastTextAtt (%) |   BPEmb 
(%) |   BPEmbAtt (%) | Country       |   FastText (%) |   FastTextAtt (%) |   BPEmb (%) |   BPEmbAtt (%) |\n|:--------------|---------------:|------------------:|------------:|---------------:|:--------------|---------------:|------------------:|------------:|---------------:|\n| Ireland       |          80.16 |             89.11 |       54.44 |          81.84 | Serbia        |          92.87 |             95.88 |       76.79 |           91.4 |\n| Uzbekistan    |          85.85 |             87.24 |       70.1  |          76.71 | Ukraine       |          93.34 |             94.58 |       89.42 |          92.65 |\n| South Africa  |          95.68 |             97.25 |       74.82 |          97.95 | Paraguay      |          96.02 |             97.08 |       87.07 |          97.36 |\n| Greece        |          87.08 |             86.04 |       38.95 |          58.79 | Algeria       |          86.32 |              87.3 |       70.59 |          84.56 |\n| Belarus       |          93.25 |             97.4  |       78.59 |          97.49 | Sweden        |          73.13 |             89.24 |       86.85 |          93.53 |\n| Portugal      |          93.09 |             94.92 |       72.01 |          93.76 | Hungary       |          51.52 |             51.08 |       37.87 |          24.48 |\n| Iceland       |          83.7  |             96.54 |       77.09 |          96.63 | Colombia      |          85.96 |             90.08 |       68.09 |          88.52 |\n| Latvia        |          89.29 |             93.14 |       68.31 |          73.79 | Malaysia      |          83.14 |             74.62 |       89.64 |          91.14 |\n| Bosnia        |          88.54 |             87.27 |       67.46 |          89.02 | India         |           70.3 |             75.31 |       63.68 |          80.56 |\n| R\u00e9union       |          84.3  |             97.74 |       78.65 |          94.27 | Croatia       |           95.8 |             95.32 |       81.76 |          85.99 |\n| 
Estonia       |          87.62 |             88.2  |       70.08 |          77.32 | New Caledonia |          94.45 |             99.61 |       94.46 |          99.77 |\n| Japan         |          36.26 |             46.91 |       34.97 |          49.48 | New Zealand   |          91.25 |                97 |       75.57 |           95.7 |\n| Singapore     |          86.03 |             89.92 |       67.19 |          88.17 | Romania       |          90.04 |             95.38 |        82.9 |          93.41 |\n| Bangladesh    |          72.28 |             78.21 |       65.63 |          77.09 | Slovakia      |          78.34 |             82.29 |       68.96 |             96 |\n| Argentina     |          81.68 |             88.59 |       71.2  |          86.8  | Kazakhstan    |          89.04 |             92.37 |       76.13 |          96.08 |\n| Venezuela     |          79.23 |             95.47 |       70.88 |          96.38 | Indonesia     |          62.38 |             66.87 |       63.04 |          71.17 |\n| Bulgaria      |          91.16 |             91.73 |       65.76 |          93.28 | Cyprus        |          89.64 |             97.44 |       89.47 |          98.01 |\n| Bermuda       |          83.19 |             93.25 |       59.16 |          93.8  | Moldova       |          89.22 |             92.07 |       57.48 |          89.08 |\n| Slovenia      |          89.01 |             95.08 |       83.96 |          96.73 | Lithuania     |          93.28 |             87.74 |       69.97 |          78.67 |\n| Philippines   |          63.91 |             81.94 |       57.36 |          83.42 | Belgium       |          93.14 |             90.72 |       86.06 |          89.85 |\n| Faroe Islands |          71.22 |             73.23 |       64.74 |          85.39 |               |                |                   |             |                |\n\n#### Incomplete Data\n\nThe following table presents the accuracy on the 20 countries we used during training for both 
our models but for\nincomplete data. We didn't test on the other 41 countries since we did not train on them and therefore do not expect to\nachieve an interesting performance. Attention mechanisms improve performance by around 0.5% for all countries.\n\n| Country        |   FastText (%) |   BPEmb (%) | Country     |   FastText (%) |   BPEmb (%) |\n|:---------------|---------------:|------------:|:------------|---------------:|------------:|\n| Norway         |          99.52 |       99.75 | Austria     |          99.55 |       98.94 |\n| Italy          |          99.16 |       98.88 | Mexico      |          97.24 |       95.93 |\n| United Kingdom |          97.85 |       95.2  | Switzerland |          99.2  |       99.47 |\n| Germany        |          99.41 |       99.38 | Denmark     |          97.86 |       97.9  |\n| France         |          99.51 |       98.49 | Brazil      |          98.96 |       97.12 |\n| Netherlands    |          98.74 |       99.46 | Australia   |          99.34 |       98.7  |\n| Poland         |          99.43 |       99.41 | Czechia     |          98.78 |       98.88 |\n| United States  |          98.49 |       96.5  | Canada      |          98.96 |       96.98 |\n| South Korea    |          91.1  |       99.89 | Russia      |          97.18 |       96.01 |\n| Spain          |          99.07 |       98.35 | Finland     |          99.04 |       99.52 |\n\n## Getting Started:\n\n```python\nfrom deepparse.parser import AddressParser\nfrom deepparse.dataset_container import CSVDatasetContainer\n\naddress_parser = AddressParser(model_type=\"bpemb\", device=0)\n\n# you can parse one address\nparsed_address = address_parser(\"350 rue des Lilas Ouest Qu\u00e9bec Qu\u00e9bec G1L 1B6\")\n\n# or multiple addresses\nparsed_address = address_parser(\n    [\n        \"350 rue des Lilas Ouest Qu\u00e9bec Qu\u00e9bec G1L 1B6\",\n        \"350 rue des Lilas Ouest Qu\u00e9bec Qu\u00e9bec G1L 1B6\",\n    ]\n)\n\n# or multinational addresses\n# Canada, 
US, Germany, UK and South Korea\nparsed_address = address_parser(\n    [\n        \"350 rue des Lilas Ouest Qu\u00e9bec Qu\u00e9bec G1L 1B6\",\n        \"777 Brockton Avenue, Abington MA 2351\",\n        \"Ansgarstr. 4, Wallenhorst, 49134\",\n        \"221 B Baker Street\",\n        \"\uc11c\uc6b8\ud2b9\ubcc4\uc2dc \uc885\ub85c\uad6c \uc0ac\uc9c1\ub85c3\uae38 23\",\n    ]\n)\n\n# you can also get the probability of the predicted tags\nparsed_address = address_parser(\n    \"350 rue des Lilas Ouest Qu\u00e9bec Qu\u00e9bec G1L 1B6\", with_prob=True\n)\n\n# Print the parsed address\nprint(parsed_address)\n\n# or use one of our dataset containers\naddresses_to_parse = CSVDatasetContainer(\n    \"./a_path.csv\", column_names=[\"address_column_name\"], is_training_container=False\n)\naddress_parser(addresses_to_parse)\n```\n\nThe default prediction tags are the following:\n\n- `\"StreetNumber\"`: for the street number,\n- `\"StreetName\"`: for the name of the street,\n- `\"Unit\"`: for the unit (such as apartment),\n- `\"Municipality\"`: for the municipality,\n- `\"Province\"`: for the province or local region,\n- `\"PostalCode\"`: for the postal code,\n- `\"Orientation\"`: for the street orientation (e.g. 
west, east),\n- `\"GeneralDelivery\"`: for other delivery information.\n\n### Parse Addresses From the Command Line\n\nYou can also use our CLI to parse addresses using:\n\n```sh\nparse <parsing_model> <dataset_path> <export_file_name>\n```\n\n### Parse Addresses Using Your Own Retrained Model\n\n> See [here](https://github.com/GRAAL-Research/deepparse/blob/main/examples/retrained_model_parsing.py) for a complete\n> example.\n\n```python\naddress_parser = AddressParser(\n    model_type=\"bpemb\",\n    device=0,\n    path_to_retrained_model=\"path/to/retrained/bpemb/model.p\",\n)\n\naddress_parser(\"350 rue des Lilas Ouest Qu\u00e9bec Qu\u00e9bec G1L 1B6\")\n```\n\n### Parse Address With Our Out-Of-The-Box API\n\nWe also offer an out-of-the-box REST API, built with FastAPI, to parse addresses.\n\n#### Installation:\n\nFirst, ensure that you have Docker Engine and Docker Compose installed on your machine.\nIf not, you can install them by following these guides, in order:\n\n1. [Docker Engine](https://docs.docker.com/engine/install/)\n2. [Docker Compose](https://docs.docker.com/compose/install/linux/#install-using-the-repository)\n\n\n\nOnce you have Docker Engine and Docker Compose installed, you can run the following command to start the FastAPI application:\n\n```sh\ndocker compose up app\n```\n\n#### Sentry\n\nYou can also monitor your application usage with [Sentry](https://sentry.io) by setting the environment variable `SENTRY_DSN` to your Sentry project's\nDSN. There is an example of the `.env` file in the project's root named `.env_example`. 
You can copy it using the following command:\n\n```sh\ncp .env_example .env\n```\n\n#### Request Examples\n\nOnce the application is up and running and port `8000` is exposed on your localhost, you can send a request with one\nof the following methods:\n\n##### cURL POST request\n\n```sh\ncurl -X POST --location \"http://127.0.0.1:8000/parse/bpemb-attention\" --http1.1 \\\n    -H \"Host: 127.0.0.1:8000\" \\\n    -H \"Content-Type: application/json\" \\\n    -d \"[\n          {\\\"raw\\\": \\\"350 rue des Lilas Ouest Quebec city Quebec G1L 1B6\\\"},\n          {\\\"raw\\\": \\\"2325 Rue de l'Universit\u00e9, Qu\u00e9bec, QC G1V 0A6\\\"}\n        ]\"\n```\n\n##### Python POST request\n\n```python\nimport requests\n\nurl = 'http://localhost:8000/parse/bpemb'\naddresses = [\n    {\"raw\": \"350 rue des Lilas Ouest Quebec city Quebec G1L 1B6\"},\n    {\"raw\": \"2325 Rue de l'Universit\u00e9, Qu\u00e9bec, QC G1V 0A6\"}\n    ]\n\nresponse = requests.post(url, json=addresses)\nparsed_addresses = response.json()\nprint(parsed_addresses)\n```\n\n\n\n### Retrain a Model\n\n> See [here](https://github.com/GRAAL-Research/deepparse/blob/main/examples/fine_tuning.py) for a complete example\n> using Pickle\n> and [here](https://github.com/GRAAL-Research/deepparse/blob/main/examples/fine_tuning_with_csv_dataset.py)\n> for a complete example using CSV.\n\n```python\n# We will retrain the fasttext version of our pretrained model.\naddress_parser = AddressParser(model_type=\"fasttext\", device=0)\n\naddress_parser.retrain(training_container, train_ratio=0.8, epochs=5, batch_size=8)\n```\n\nOne can also freeze some layers to speed up the training using the ``layers_to_freeze`` parameter.\n\n```python\naddress_parser.retrain(\n    training_container,\n    train_ratio=0.8,\n    epochs=5,\n    batch_size=8,\n    layers_to_freeze=\"seq2seq\",\n)\n```\n\nYou can also give a specific name to the retrained model. 
This name will be used as the model name (for printing and as the\nclass name) when reloading it.\n\n```python\naddress_parser.retrain(\n    training_container,\n    train_ratio=0.8,\n    epochs=5,\n    batch_size=8,\n    name_of_the_retrain_parser=\"MyNewParser\",\n)\n```\n\n### Retrain a Model With an Attention Mechanism\n\n> See [here](https://github.com/GRAAL-Research/deepparse/blob/main/examples/retrain_attention_model.py) for a complete\n> example.\n\n```python\n# We will retrain the fasttext version of our pretrained model.\naddress_parser = AddressParser(\n    model_type=\"fasttext\", device=0, attention_mechanism=True\n)\n\naddress_parser.retrain(training_container, train_ratio=0.8, epochs=5, batch_size=8)\n```\n\n### Retrain a Model With New Tags\n\n> See [here](https://github.com/GRAAL-Research/deepparse/blob/main/examples/retrain_with_new_prediction_tags.py) for a\n> complete example.\n\n```python\naddress_components = {\"ATag\": 0, \"AnotherTag\": 1, \"EOS\": 2}\naddress_parser.retrain(\n    training_container,\n    train_ratio=0.8,\n    epochs=1,\n    batch_size=128,\n    prediction_tags=address_components,\n)\n```\n\n### Retrain a Seq2Seq Model From Scratch\n\n> See [here](https://github.com/GRAAL-Research/deepparse/blob/main/examples/retrain_with_new_seq2seq_params.py) for\n> a complete example.\n\n```python\nseq2seq_params = {\"encoder_hidden_size\": 512, \"decoder_hidden_size\": 512}\naddress_parser.retrain(\n    training_container,\n    train_ratio=0.8,\n    epochs=1,\n    batch_size=128,\n    seq2seq_params=seq2seq_params,\n)\n```\n\n### Download Our Models\n\nDeepparse handles model downloads when you use it, but you can also pre-download our models. 
Here are the URLs to download our pretrained models directly:\n\n- [FastText](https://graal.ift.ulaval.ca/public/deepparse/fasttext.ckpt),\n- [FastTextAttention](https://graal.ift.ulaval.ca/public/deepparse/fasttext_attention.ckpt),\n- [BPEmb](https://graal.ift.ulaval.ca/public/deepparse/bpemb.ckpt),\n- [BPEmbAttention](https://graal.ift.ulaval.ca/public/deepparse/bpemb_attention.ckpt),\n- [FastText Light](https://graal.ift.ulaval.ca/public/deepparse/fasttext.magnitude.gz) (\n  using [Magnitude Light](https://github.com/davebulaval/magnitude-light)).\n\nOr you can use our CLI to download our pretrained models directly using:\n\n```sh\ndownload_model <model_name>\n```\n\nStarting with version 0.9.8, we also release the weights with the GitHub release notes, available [here](https://github.com/GRAAL-Research/deepparse/releases).\n\n------------------\n\n## Installation\n\nBefore installing deepparse, you must have the latest version of [PyTorch](https://pytorch.org/) in your environment.\n\n- **Install the stable version of Deepparse:**\n\n```sh\npip install deepparse\n```\n\n- **Install the stable version of Deepparse with the app extra dependencies:**\n\n```sh\npip install deepparse[app]  # for bash terminal\npip install 'deepparse[app]' # for ZSH terminal\n```\n\n- **Install the stable version of Deepparse with all extra dependencies:**\n\n```sh\npip install deepparse[all]  # for bash terminal\npip install 'deepparse[all]' # for ZSH terminal\n```\n\n- **Install the latest development version of Deepparse:**\n\n```sh\npip install -U git+https://github.com/GRAAL-Research/deepparse.git@dev\n```\n\n------------------\n\n## Cite\n\nUse the following for the article:\n\n```\n@misc{yassine2020leveraging,\n    title={{Leveraging Subword Embeddings for Multinational Address Parsing}},\n    author={Marouane Yassine and David Beauchemin and Fran\u00e7ois Laviolette and Luc Lamontagne},\n    year={2020},\n    eprint={2006.16152},\n    archivePrefix={arXiv}\n}\n```\n\nand this 
one for the package:\n\n```\n@misc{deepparse,\n    author = {Marouane Yassine and David Beauchemin},\n    title  = {{Deepparse: A State-Of-The-Art Deep Learning Multinational Addresses Parser}},\n    year   = {2020},\n    note   = {\\url{https://deepparse.org}}\n}\n```\n\n------------------\n\n## Contributing to Deepparse\n\nWe welcome user input, whether it is regarding bugs found in the library or feature proposals! Make sure to have a\nlook at our [contributing guidelines](https://github.com/GRAAL-Research/deepparse/blob/main/.github/CONTRIBUTING.md)\nfor more details on this matter.\n\n## License\n\nDeepparse is LGPLv3 licensed, as found in\nthe [LICENSE file](https://github.com/GRAAL-Research/deepparse/blob/main/LICENSE).\n\n------------------\n",
    "bugtrack_url": null,
    "license": "LGPLv3",
    "summary": "A library for parsing multinational street addresses using deep learning.",
    "version": "0.9.13",
    "project_urls": {
        "Download": "https://github.com/GRAAL-Research/deepparse/archive/v0.9.13.zip",
        "Homepage": "https://deepparse.org/"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "11def34e256068da513f2681e6102c15c144604f4de54dd0aa0e35cf3a7d5ac4",
                "md5": "be3c835b8e60a34aca50684fe3497fd8",
                "sha256": "4a62ada5659dffc64e8c51c5b0cb4e2d007f1233de4433b6fcd853b125296049"
            },
            "downloads": -1,
            "filename": "deepparse-0.9.13-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "be3c835b8e60a34aca50684fe3497fd8",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 225087,
            "upload_time": "2024-09-12T22:12:59",
            "upload_time_iso_8601": "2024-09-12T22:12:59.043199Z",
            "url": "https://files.pythonhosted.org/packages/11/de/f34e256068da513f2681e6102c15c144604f4de54dd0aa0e35cf3a7d5ac4/deepparse-0.9.13-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6a74e3fde05da4a4a472a86a99b85ae852281a22528d8fedb9f1568c185104e1",
                "md5": "f5b152e7159681f8d273b4d32fe58915",
                "sha256": "04542c64870e7893893f3a2dac82ad486622485459dc30353c30cc3bc7496669"
            },
            "downloads": -1,
            "filename": "deepparse-0.9.13.tar.gz",
            "has_sig": false,
            "md5_digest": "f5b152e7159681f8d273b4d32fe58915",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 150553,
            "upload_time": "2024-09-12T22:13:00",
            "upload_time_iso_8601": "2024-09-12T22:13:00.950143Z",
            "url": "https://files.pythonhosted.org/packages/6a/74/e3fde05da4a4a472a86a99b85ae852281a22528d8fedb9f1568c185104e1/deepparse-0.9.13.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-12 22:13:00",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "GRAAL-Research",
    "github_project": "deepparse",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "requirements": [],
    "lcname": "deepparse"
}
        