.. image:: https://img.shields.io/pypi/v/advertools.svg
:target: https://pypi.python.org/pypi/advertools
.. image:: https://readthedocs.org/projects/advertools/badge/?version=latest
:target: https://advertools.readthedocs.io/en/latest/?badge=latest
:alt: Documentation Status
.. image:: https://static.pepy.tech/badge/advertools
:target: http://pepy.tech/project/advertools
**Announcing** `Data Science with Python for SEO course <https://bit.ly/dsseo-course>`_: a cohort-based, interactive course with live coding.
``advertools``: productivity & analysis tools to scale your online marketing
============================================================================
| A digital marketer is a data scientist.
| Your job is to manage, manipulate, visualize, communicate, understand,
and make decisions based on data.
You might be doing basic stuff, like copying and pasting text in
spreadsheets, you might be running large-scale automated platforms with
sophisticated algorithms, or somewhere in between. In any case your job
is all about working with data.
As a data scientist you don't spend most of your time producing cool
visualizations or finding great insights. The majority of your time is spent
wrangling with URLs, figuring out how to stitch together two tables, hoping
that the dates won't break without you knowing, or trying to generate the
next 124,538 keywords for an upcoming campaign by the end of the week!
``advertools`` is a Python package that can hopefully make that part of your job a little easier.
Installation
------------
.. code:: bash
python3 -m pip install advertools
Philosophy/approach
-------------------
It's very easy to learn how to use advertools. There are two main reasons for that.
First, it is essentially a set of independent functions that you can easily learn and
use. There are no special data structures, or additional learning that you need. With
basic Python, and an understanding of the tasks that these functions help with, you
should be able to pick it up fairly easily. In other words, if you know how to use an
Excel formula, you can easily use any advertools function.
The second reason is that `advertools` follows the UNIX philosophy in its design and
approach. Here is one of the various summaries of the UNIX philosophy by Doug McIlroy:
Write programs that do one thing and do it well. Write programs to work together.
Write programs to handle text streams, because that is a universal interface.
Let's see how advertools follows that:
**Do one thing and do it well:** Each function in advertools aims for that. There is a
function that just extracts hashtags from a text list, another one to crawl websites,
one to test which URLs are blocked by robots.txt files, and one for downloading XML
sitemaps. Although they are designed to work together as a full pipeline, they can be
run independently in whichever combination or sequence you want.
**Write programs to work together:** Independence does not mean they are unrelated. The
workflows are designed to aid the online marketing practitioner in various steps for
understanding websites, SEO analysis, creating SEM campaigns and others.
**Programs to handle text streams because that is a universal interface:** In Data
Science the most used data structure that can be considered “universal” is the
DataFrame. So, most functions return either a DataFrame or a file that can be read into
one. Once you have it, you have the full power of all other tools like pandas for
further manipulating the data, Plotly for visualization, or any machine learning
library that can more easily handle tabular data.
This way it is kept modular as well as flexible and integrated.
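As a rough illustration of this modular style, here is a minimal sketch in plain Python. It is not the advertools implementation: the names ``find_hashtags`` and ``count_items`` are hypothetical, and the hashtag pattern is deliberately simplified. Each function does one small job on simple data, so the two can be used alone or chained.

```python
import re
from collections import Counter

# Simplified pattern for illustration only; real hashtag rules are more involved.
HASHTAG = re.compile(r"#\w+")


def find_hashtags(text_list):
    """Hypothetical helper: return one list of hashtags per input text."""
    return [HASHTAG.findall(text) for text in text_list]


def count_items(nested):
    """Hypothetical helper: flatten a list of lists and count items, ignoring case."""
    return Counter(item.lower() for items in nested for item in items)


posts = ["Launching today #SEO #python", "No tags here", "More #seo tips"]
hashtags = find_hashtags(posts)          # step 1: extract
top = count_items(hashtags).most_common(1)  # step 2: summarize
print(hashtags)
print(top)
```

Because the intermediate result is just a list of lists, either step can be swapped out or reused on its own.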
As a next step, most of these functions are being converted to no-code
`interactive apps <https://adver.tools>`_ for non-coders, taking them to the next
level.
SEM Campaigns
-------------
The most important thing to achieve in SEM is a proper mapping between the
three main elements of a search campaign:
**Keywords** (the intention) -> **Ads** (your promise) -> **Landing Pages** (your delivery of the promise)
Once you have this done, you can focus on management and analysis. More importantly,
once you know that you can set this up in an easy way, you know you can focus
on more strategic issues. In practical terms you need two main tables to get started:
* Keywords: You can `generate keywords <https://advertools.readthedocs.io/en/master/advertools.kw_generate.html>`_ (note I didn't say research) with the
`kw_generate` function.
* Ads: There are two approaches that you can use:
* Bottom-up: You can create text ads for a large number of products by simple
replacement of product names, and providing a placeholder in case your text
is too long. Check out the `ad_create <https://advertools.readthedocs.io/en/master/advertools.ad_create.html>`_ function for more details.
* Top-down: Sometimes you have a long description text that you want to split
into headlines, descriptions and whatever slots you want to split them into.
`ad_from_string <https://advertools.readthedocs.io/en/master/advertools.ad_from_string.html>`_
helps you accomplish that.
* Tutorials and additional resources
* Get started with `Data Science for Digital Marketing and SEO/SEM <https://www.oncrawl.com/technical-seo/data-science-seo-digital-marketing-guide-beginners/>`_
* `Setting a full SEM campaign <https://www.datacamp.com/community/tutorials/sem-data-science>`_ for DataCamp's website tutorial
* Project to practice `generating SEM keywords with Python <https://www.datacamp.com/projects/400>`_ on DataCamp
* `Setting up SEM campaigns on a large scale <https://www.semrush.com/blog/setting-up-search-engine-marketing-campaigns-on-large-scale/>`_ tutorial on SEMrush
* Visual `tool to generate keywords <https://www.dashboardom.com/advertools>`_ online based on the `kw_generate` function
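The ideas behind the two tables can be sketched in plain Python. This is only an illustration under stated assumptions, not how ``kw_generate`` or ``ad_create`` are implemented: the helper names ``generate_keywords`` and ``fill_template``, their parameters, and the 30-character limit are all hypothetical.

```python
from itertools import product


def generate_keywords(products, words):
    """Hypothetical helper: combine every product with every word, in both orders."""
    rows = []
    for prod, word in product(products, words):
        for keyword in (f"{prod} {word}", f"{word} {prod}"):
            rows.append({"ad_group": prod.title(), "keyword": keyword})
    return rows


def fill_template(template, replacements, fallback, max_len=30):
    """Hypothetical helper: fill the template, using the fallback if the ad is too long."""
    ads = []
    for item in replacements:
        candidate = template.format(item)
        if len(candidate) > max_len:
            candidate = template.format(fallback)
        ads.append(candidate)
    return ads


keywords = generate_keywords(["toyota", "bmw"], ["buy", "price"])
ads = fill_template(
    "Buy {} Online",
    ["Toyota", "Mercedes-Benz S-Class Cabriolet"],
    fallback="Cars",
)
print(len(keywords), keywords[0])
print(ads)
```

Two products and two words already yield eight keyword rows, which is why generating (rather than researching) keywords scales well to large product lists.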
SEO
---
SEO is probably the most comprehensive online marketing area, being both
technical (crawling, indexing, rendering, redirects, etc.) and non-technical
(content creation, link building, outreach, etc.). Here are some tools that can
help with your SEO:
* `SEO crawler: <https://advertools.readthedocs.io/en/master/advertools.spider.html>`_
A generic SEO crawler that can be customized, built with Scrapy, & with several
features:
* Standard SEO elements extracted by default (title, header tags, body text,
status code, response and request headers, etc.)
* CSS and XPath selectors: You probably have more specific needs in mind, so
you can easily pass any selectors to be extracted in addition to the
standard elements being extracted
* Custom settings: full access to Scrapy's settings, allowing you to better
control the crawling behavior (set custom headers, user agent, stop spider
after x pages, seconds, megabytes, save crawl logs, run jobs at intervals
where you can stop and resume your crawls, which is ideal for large crawls
or for continuous monitoring, and many more options)
* Following links: option to only crawl a set of specified pages or to follow
and discover all pages through links
* `robots.txt downloader <https://advertools.readthedocs.io/en/master/advertools.sitemaps.html#advertools.sitemaps.robotstxt_to_df>`_
A simple downloader of robots.txt files in a DataFrame format, so you can
keep track of changes across crawls if any, and check the rules, sitemaps,
etc.
* `XML Sitemaps downloader / parser <https://advertools.readthedocs.io/en/master/advertools.sitemaps.html>`_
An essential part of any SEO analysis is to check XML sitemaps. This is a
simple function with which you can download one or more sitemaps (by
providing the URL for a robots.txt file, a sitemap file, or a sitemap index).
* `SERP importer and parser for Google & YouTube <https://advertools.readthedocs.io/en/master/advertools.serp.html>`_
Connect to Google's API and get the search data you want. Multiple search
parameters supported, all in one function call, and all results returned in a
DataFrame
* Tutorials and additional resources
* A visual tool built with the ``serp_goog`` function to get `SERP rankings on Google <https://www.dashboardom.com/google-serp>`_
* A tutorial on `analyzing SERPs on a large scale with Python <https://www.semrush.com/blog/analyzing-search-engine-results-pages/>`_ on SEMrush
* `SERP datasets on Kaggle <https://www.kaggle.com/eliasdabbas/datasets?search=engine>`_ for practicing on different industries and use cases
* `SERP notebooks on Kaggle <https://www.kaggle.com/eliasdabbas/notebooks?sortBy=voteCount&group=everyone&pageSize=20&userId=484496&tagIds=1220>`_
some examples on how you might tackle such data
* `Content Analysis with XML Sitemaps and Python <https://www.semrush.com/blog/content-analysis-xml-sitemaps-python/>`_
* XML dataset examples: `news sites <https://www.kaggle.com/eliasdabbas/news-sitemaps>`_, `Turkish news sites <https://www.kaggle.com/eliasdabbas/turk-haber-sitelerinin-site-haritalari>`_,
`Bloomberg news <https://www.kaggle.com/eliasdabbas/bloomberg-business-articles-urls>`_
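To make the "sitemap to table" idea concrete, here is a minimal sketch that uses only the standard library on an inline sitemap string. It downloads nothing and is not the advertools implementation; the function name ``sitemap_rows`` and the example URLs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny inline sitemap, used instead of downloading a real one.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-01-01</lastmod></url>
  <url><loc>https://example.com/blog</loc></url>
</urlset>"""


def sitemap_rows(xml_text):
    """Hypothetical helper: return one dict per <url> entry, keyed by tag name."""
    root = ET.fromstring(xml_text)
    rows = []
    for url in root:
        # Tags look like "{namespace}loc"; keep only the local name.
        rows.append({child.tag.split("}")[-1]: child.text for child in url})
    return rows


rows = sitemap_rows(SITEMAP_XML)
print(rows)
```

A list of dictionaries like this can be passed directly to ``pandas.DataFrame``, where missing tags (such as ``lastmod`` in the second entry) become empty values.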
Text & Content Analysis (for SEO & Social Media)
------------------------------------------------
URLs, page titles, tweets, video descriptions, comments, and hashtags are some
examples of the types of text we deal with. ``advertools`` provides a few
options for text analysis:
* `Word frequency <https://advertools.readthedocs.io/en/master/advertools.word_frequency.html>`_
Counting words in a text list is one of the most basic and important tasks in
text mining. What is also important is counting those words by taking into
consideration their relative weights in the dataset. ``word_frequency`` does
just that.
* `URL Analysis <https://advertools.readthedocs.io/en/master/advertools.urlytics.html>`_
We all have to handle many thousands of URLs in reports, crawls, social media
extracts, XML sitemaps and so on. ``url_to_df`` converts your URLs into
easily readable DataFrames.
* `Emoji <https://advertools.readthedocs.io/en/master/advertools.emoji.html>`_
Emoji are produced with one click, extremely expressive, highly diverse (3k+ emoji),
and very popular, so it's important to capture what people are trying to communicate
with them. You can extract emoji and get their names, groups, and sub-groups.
The full emoji database is also available for convenience, as well
as an ``emoji_search`` function in case you want some ideas for your next
social media post or any other kind of communication.
* `extract_ functions <https://advertools.readthedocs.io/en/master/advertools.extract.html>`_
The text that we deal with contains many elements and entities that have
their own special meaning and usage. There is a group of convenience
functions to help in extracting and getting basic statistics about structured
entities in text: emoji, hashtags, mentions, currency, numbers, URLs, questions
and more. You can also provide a special regex for your own needs.
* `Stopwords <https://advertools.readthedocs.io/en/master/advertools.stopwords.html>`_
A list of stopwords in forty different languages to help in text analysis.
* Tutorial on DataCamp for creating the ``word_frequency`` function and
explaining the importance of the difference between `absolute and weighted word frequency <https://www.datacamp.com/community/tutorials/absolute-weighted-word-frequency>`_
* `Text Analysis for Online Marketers <https://www.semrush.com/blog/text-analysis-for-online-marketers/>`_
An introductory article on SEMrush
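The difference between absolute and weighted word frequency can be shown with a small sketch. It is written in plain Python and is not the ``word_frequency`` implementation; the function name ``weighted_word_counts`` and the sample titles and view counts are made up for illustration.

```python
from collections import Counter


def weighted_word_counts(text_list, num_list):
    """Hypothetical helper: count words by occurrence and by an attached number."""
    abs_freq, wtd_freq = Counter(), Counter()
    for text, num in zip(text_list, num_list):
        for word in text.lower().split():
            abs_freq[word] += 1    # how often the word appears
            wtd_freq[word] += num  # how much "weight" (e.g. views) it carries
    return abs_freq, wtd_freq


titles = ["python tutorial", "python tips", "python news", "excel tutorial"]
views = [100, 50, 30, 1000]
abs_freq, wtd_freq = weighted_word_counts(titles, views)
print(abs_freq.most_common(1))
print(wtd_freq.most_common(1))
```

Here "python" is the most frequent word, but "tutorial" carries the most views, which is the kind of gap a weighted count is meant to expose.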
Social Media
------------
In addition to the text analysis techniques provided, you can also connect to
the Twitter and YouTube data APIs. The main benefits of using ``advertools``
for this:
* Handles pagination and request limits: typically every API has a limited
number of results that it returns. You have to handle pagination when you
need more than the limit per request, which you typically do. This is handled
by default
* DataFrame results: APIs send you back data in formats that need to be
parsed and cleaned so you can more easily start your analysis. This is also
handled automatically
* Multiple requests: in YouTube's case you might want to request data for the
same query across several countries, languages, channels, etc. You can
specify them all in one request and get the product of all the requests in
one response
* Tutorials and additional resources
* A visual tool to `check what is trending on Twitter <https://www.dashboardom.com/trending-twitter>`_ for all available locations
* A `Twitter data analysis dashboard <https://www.dashboardom.com/twitterdash>`_ with many options
* How to use the `Twitter data API with Python <https://www.kaggle.com/eliasdabbas/twitter-in-a-dataframe>`_
* `Extracting entities from social media posts <https://www.kaggle.com/eliasdabbas/extract-entities-from-social-media-posts>`_ tutorial on Kaggle
* `Analyzing 131k tweets <https://www.kaggle.com/eliasdabbas/extract-entities-from-social-media-posts>`_ by European Football clubs tutorial on Kaggle
* An overview of the `YouTube data API with Python <https://www.kaggle.com/eliasdabbas/youtube-data-api>`_
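The "multiple requests" idea amounts to taking the product of all supplied parameter values. The following sketch shows one way this could work; it makes no API calls, and the function name ``expand_params`` and the parameter names ``country`` and ``language`` are hypothetical rather than taken from advertools or any API.

```python
from itertools import product


def expand_params(**params):
    """Hypothetical helper: turn {name: value-or-list} into one dict per combination."""
    names = list(params)
    value_lists = [
        value if isinstance(value, (list, tuple)) else [value]
        for value in params.values()
    ]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# One query, two countries, two languages -> four separate requests.
requests_to_make = expand_params(q="seo", country=["us", "de"], language=["en", "de"])
for request in requests_to_make:
    print(request)
```

Each resulting dictionary describes a single request; the responses could then be combined into one table with a column per parameter, so you can tell which combination produced which rows.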
Conventions
-----------
Function names mostly start with the object you are working on, so you can use
autocomplete to discover other options:
| ``kw_``: for keywords-related functions
| ``ad_``: for ad-related functions
| ``url_``: URL tracking and generation
| ``extract_``: for extracting entities from social media posts (mentions, hashtags, emoji, etc.)
| ``emoji_``: emoji related functions and objects
| ``twitter``: a module for querying the Twitter API and getting results in a DataFrame
| ``youtube``: a module for querying the YouTube Data API and getting results in a DataFrame
| ``crawlytics``: a module for analyzing crawl data (compare, links, redirects, and more)
| ``serp_``: get search engine results pages in a DataFrame, currently available: Google and YouTube
| ``crawl``: a function you will probably use a lot if you do SEO
| ``*_to_df``: a set of convenience functions for converting to DataFrames
(log files, XML sitemaps, robots.txt files, and lists of URLs)
=======================
Change Log - advertools
=======================
0.16.1 (2024-08-19)
-------------------
* Fixed
- Ensure meta crawl data included in URLs crawled by following links.
0.16.0 (2024-08-18)
-------------------
* Added
- Enable the ``meta`` parameter for the crawl function for: arbitrary metadata,
custom request headers, and 3rd party plugins like playwright.
* Changed
- Raise an error when supplying a custom log format without supplying fields.
0.15.1 (2024-07-16)
-------------------
* Fixed
- Make file path for ``emoji_df`` relative to advertools ``__path__``.
- Allow the extension ``.jsonl`` for crawling.
0.15.0 (2024-07-15)
-------------------
* Added
- Enable supplying request headers in ``sitemap_to_df``, contributed by `@joejoinerr <https://github.com/joejoinerr>`_
- New function ``crawlytics.compare`` for comparing two crawls.
- New function ``crawlytics.running_crawls`` for getting data on currently running crawl jobs (\*NIX only for now).
- New parameter ``date_format`` to ``logs_to_df`` for custom date formats.
* Changed
- Removed the `relatedSite` parameter from ``serp_goog`` - deprecated.
- Update emoji regex and functionality to v15.1.
* Fixed
- Use int64 instead of int for YouTube count columns, contributed by `@DanielP77 <https://github.com/DanielP77>`_
0.14.4 (2024-07-13)
-------------------
* Fixed
- Use ``pd.NA`` instead of ``np.nan`` for empty values in ``url_to_df``.
0.14.3 (2024-06-27)
-------------------
* Changed
- Use a different XPath expression for `body_text` while crawling.
0.14.2 (2024-02-24)
-------------------
* Changed
- Allow ``sitemap_to_df`` to work on offline sitemaps.
0.14.1 (2024-02-21)
-------------------
* Fixed
- Preserve the order of supplied URLs in the output of ``url_to_df``.
0.14.0 (2024-02-18)
-------------------
* Added
- New module ``crawlytics`` for analyzing crawl DataFrames. Includes functions to
analyze crawl DataFrames (``images``, ``redirects``, and ``links``), as well as
functions to handle large files (``jl_to_parquet``, ``jl_subset``, ``parquet_columns``).
- New ``encoding`` option for ``logs_to_df``.
- Option to save the output of ``url_to_df`` to a parquet file.
* Changed
- Remove requirement to delete existing log output and error files if they exist.
The function will now overwrite them if they do.
- Autothrottling is enabled by default in ``crawl_headers`` to minimize being blocked.
* Fixed
- Always get absolute path for img src while crawling.
- Handle NA src attributes when extracting images.
- Change fillna(method="ffill") to ffill for ``url_to_df``.
0.13.5 (2023-08-22)
-------------------
* Added
- Initial experimental functionality for ``crawl_images``.
* Changed
- Enable autothrottling by default for ``crawl_headers``.
0.13.4 (2023-07-26)
-------------------
* Fixed
- Make img attributes consistent in length, and support all attributes.
0.13.3 (2023-06-27)
-------------------
* Changed
- Allow optional trailing space in log files (contributed by @andypayne)
* Fixed
- Replace newlines with spaces while parsing JSON-LD which was causing
errors in some cases.
0.13.2 (2022-09-30)
-------------------
* Added
- Crawling recipe for how to use the ``DEFAULT_REQUEST_HEADERS`` to change
the default headers.
* Changed
- Split long lists of URLs while crawling regardless of the ``follow_links``
parameter
* Fixed
- Clarify that while authenticating for Twitter only ``app_key`` and
``app_secret`` are required, with the option to provide ``oauth_token``
and ``oauth_token_secret`` if/when needed.
0.13.1 (2022-05-11)
-------------------
* Added
- Command line interface with most functions
- Make documentation interactive for most pages using ``thebe-sphinx``
* Changed
- Use `np.nan` wherever there are missing values in ``url_to_df``
* Fixed
- Don't remove double quotes from etags when downloading XML sitemaps
- Replace instances of the deprecated ``pd.DataFrame.append`` with
``pd.concat``.
- Replace empty values with np.nan for the size column in ``logs_to_df``
0.13.0 (2022-02-10)
-------------------
* Added
- New function ``crawl_headers``: A crawler that only makes `HEAD` requests
to a known list of URLs.
- New function ``reverse_dns_lookup``: A way to get host information for a
large list of IP addresses concurrently.
- New options for crawling: `exclude_url_params`, `include_url_params`,
`exclude_url_regex`, and `include_url_regex` for controlling which links to
follow while crawling.
* Fixed
- Any ``custom_settings`` options given to the ``crawl`` function that were
defined using a dictionary can now be set without issues. There was an
issue if those options were not strings.
* Changed
- The `skip_url_params` option was removed and replaced with the more
versatile ``exclude_url_params``, which accepts either ``True`` or a list
of URL parameters to exclude while following links.
0.12.3 (2021-11-27)
-------------------
* Fixed
- Crawler stops when provided with bad URLs in list mode.
0.12.0,1,2 (2021-11-27)
-----------------------
* Added
- New function ``logs_to_df``: Convert a log file of any non-JSON format
into a pandas DataFrame and save it to a `parquet` file. This also
compresses the file to a much smaller size.
- Crawler extracts all available ``img`` attributes: 'alt', 'crossorigin',
'height', 'ismap', 'loading', 'longdesc', 'referrerpolicy', 'sizes',
'src', 'srcset', 'usemap', and 'width' (excluding global HTML attributes
like ``style`` and ``draggable``).
- New parameter for the ``crawl`` function ``skip_url_params``: Defaults to
False, consistent with previous behavior, with the ability to not
follow/crawl links containing any URL parameters.
- New column for ``url_to_df`` "last_dir": Extract the value in the last
directory for each of the URLs.
* Changed
- Query parameter columns in ``url_to_df`` DataFrame are now sorted by how
full the columns are (the percentage of values that are not `NA`)
0.11.1 (2021-04-09)
-------------------
* Added
- The `nofollow` attribute for nav, header, and footer links.
* Fixed
- Timeout error while downloading robots.txt files.
- Make extracting nav, header, and footer links consistent with all links.
0.11.0 (2021-03-31)
-------------------
* Added
- New parameter `recursive` for ``sitemap_to_df`` to control whether or not
to get all sub sitemaps (default), or to only get the current
(sitemapindex) one.
- New columns for ``sitemap_to_df``: ``sitemap_size_mb``
(1 MB = 1,024x1,024 bytes), and ``sitemap_last_modified`` and ``etag``
(if available).
- Option to request multiple robots.txt files with ``robotstxt_to_df``.
- Option to save downloaded robots DataFrame(s) to a file with
``robotstxt_to_df`` using the new parameter ``output_file``.
- Two new columns for ``robotstxt_to_df``: ``robotstxt_last_modified`` and
``etag`` (if available).
- Raise `ValueError` in ``crawl`` if ``css_selectors`` or
``xpath_selectors`` contain any of the default crawl column headers
- New XPath code recipes for custom extraction.
- New function ``crawllogs_to_df`` which converts crawl logs to a DataFrame
provided they were saved while using the ``crawl`` function.
- New columns in ``crawl``: `viewport`, `charset`, all `h` headings
(whichever is available), nav, header and footer links and text, if
available.
- Crawl errors don't stop crawling anymore, and the error message is
included in the output file under a new `errors` and/or `jsonld_errors`
column(s).
- In case of having JSON-LD errors, errors are reported in their respective
column, and the remainder of the page is scraped.
* Changed
- Removed column prefix `resp_meta_` from columns containing it
- Redirect URLs and reasons are separated by '@@' for consistency with
other multiple-value columns
- Links extracted while crawling are not unique any more (all links are
extracted).
- Emoji data updated with v13.1.
- Heading tags are scraped even if they are empty, e.g. <h2></h2>.
- Default user agent for crawling is now advertools/VERSION.
* Fixed
- Handle sitemap index files that contain links to themselves, with an
error message included in the final DataFrame
- Error in robots.txt files caused by comments preceded by whitespace
- Zipped robots.txt files causing a parsing issue
- Crawl issues on some Linux systems when providing a long list of URLs
* Removed
- Columns from the ``crawl`` output: `url_redirected_to`, `links_fragment`
0.10.7 (2020-09-18)
-------------------
* Added
- New function ``knowledge_graph`` for querying Google's API
- Faster ``sitemap_to_df`` with threads
- New parameter `max_workers` for ``sitemap_to_df`` to determine how fast
it could go
- New parameter `capitalize_adgroups` for ``kw_generate`` to determine
whether or not to keep ad groups as is, or set them to title case (the
default)
* Fixed
- Remove restrictions on the number of URLs provided to ``crawl``,
assuming `follow_links` is set to `False` (list mode)
- JSON-LD issue breaking crawls when it's invalid (now skipped)
* Removed
- Deprecate the ``youtube.guide_categories_list`` (no longer supported by
the API)
0.10.6 (2020-06-30)
-------------------
* Added
- JSON-LD support in crawling. If available on a page, JSON-LD items will
have special columns, and multiple JSON-LD snippets will be numbered for
easy filtering
* Changed
- Stricter parsing for rel attributes, making sure they are in link
elements as well
- Date column names for ``robotstxt_to_df`` and ``sitemap_to_df`` unified
as "download_date"
- Numbering OG, Twitter, and JSON-LD where multiple elements are present in
the same page, follows a unified approach: no numbering for the first
element, and numbers start with "1" from the second element on. "element",
"element_1", "element_2" etc.
0.10.5 (2020-06-14)
-------------------
* Added
- New features for the ``crawl`` function:
* Extract canonical tags if available
* Extract alternate `href` and `hreflang` tags if available
* Open Graph data "og:title", "og:type", "og:image", etc.
* Twitter cards data "twitter:site", "twitter:title", etc.
* Fixed
- Minor fixes to ``robotstxt_to_df``:
* Allow whitespace in fields
* Allow case-insensitive fields
* Changed
- ``crawl`` now only supports `output_file` with the extension ".jl"
- ``word_frequency`` drops `wtd_freq` and `rel_value` columns if `num_list`
is not provided
0.10.4 (2020-06-07)
-------------------
* Added
- New function ``url_to_df``, splitting URLs into their components and to a
DataFrame
- Slight speed up for ``robotstxt_test``
0.10.3 (2020-06-03)
-------------------
* Added
- New function ``robotstxt_test``, testing URLs and whether they can be
fetched by certain user-agents
* Changed
- Documentation main page relayout, grouping of topics, & sidebar captions
- Various documentation clarifications and new tests
0.10.2 (2020-05-25)
-------------------
* Added
- User-Agent info to requests getting sitemaps and robotstxt files
- CSS/XPath selectors support for the crawl function
- Support for custom spider settings with a new parameter ``custom_settings``
* Fixed
- Update changed supported search operators and values for CSE
0.10.1 (2020-05-23)
-------------------
* Changed
- Links are better handled, and new output columns are available:
``links_url``, ``links_text``, ``links_fragment``, ``links_nofollow``
- ``body_text`` extraction is improved by containing <p>, <li>, and <span>
elements
0.10.0 (2020-05-21)
-------------------
* Added
- New function ``crawl`` for crawling and parsing websites
- New function ``robotstxt_to_df`` downloading robots.txt files into
DataFrames
0.9.1 (2020-05-19)
------------------
* Added
- Ability to specify robots.txt file for ``sitemap_to_df``
- Ability to retrieve any kind of sitemap (news, video, or images)
- Errors column to the returned DataFrame if any errors occur
- A new ``sitemap_downloaded`` column showing datetime of getting the
sitemap
* Fixed
- Logging issue causing ``sitemap_to_df`` to log the same action twice
- Issue preventing URLs not ending with xml or gz from being retrieved
- Correct sitemap URL showing in the ``sitemap`` column
0.9.0 (2020-04-03)
------------------
* Added
- New function ``sitemap_to_df`` imports an XML sitemap into a
``DataFrame``
0.8.1 (2020-02-08)
------------------
* Changed
- Column `query_time` is now named `queryTime` in the `youtube` functions
- Handle json_normalize import from pandas based on pandas version
0.8.0 (2020-02-02)
------------------
* Added
- New module `youtube` connecting to all GET requests in API
- `extract_numbers` new function
- `emoji_search` new function
- `emoji_df` new variable containing all emoji as a DataFrame
* Changed
- Emoji database updated to v13.0
- `serp_goog` with expanded `pagemap` and metadata
* Fixed
- `serp_goog` errors, some parameters not appearing in result
df
- `extract_numbers` issue when providing dash as a separator
in the middle
0.7.3 (2019-04-17)
------------------
* Added
- New function `extract_exclamations` very similar to
`extract_questions`
- New function `extract_urls`, also counts top domains and
top TLDs
- New keys to `extract_emoji`; `top_emoji_categories`
& `top_emoji_sub_categories`
- Groups and sub-groups to `emoji db`
0.7.2 (2019-03-29)
------------------
* Changed
- Emoji regex updated
- Simpler extraction of Spanish `questions`
0.7.1 (2019-03-26)
------------------
* Fixed
- Missing __init__ imports.
0.7.0 (2019-03-26)
------------------
* Added
- New `extract_` functions:
* Generic `extract` used by all others, and takes
arbitrary regex to extract text.
* `extract_questions` to get question mark statistics, as
well as the text of questions asked.
* `extract_currency` shows text that has currency symbols in it, as
well as surrounding text.
* `extract_intense_words` gets statistics about, and extracts, words with
any character repeated three or more times, indicating an intense
feeling (+ve or -ve).
- New function `word_tokenize`:
* Used by `word_frequency` to get tokens of
1,2,3-word phrases (or more).
* Split a list of text into tokens of a specified number of words each.
- New stop-words from the ``spaCy`` package:
**current:** Arabic, Azerbaijani, Danish, Dutch, English, Finnish,
French, German, Greek, Hungarian, Italian, Kazakh, Nepali, Norwegian,
Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
**new:** Bengali, Catalan, Chinese, Croatian, Hebrew, Hindi, Indonesian,
Irish, Japanese, Persian, Polish, Sinhala, Tagalog, Tamil, Tatar, Telugu,
Thai, Ukrainian, Urdu, Vietnamese
* Changed
- `word_frequency` takes new parameters:
* `regex` defaults to words, but can be changed to anything, for example '\S+'
to split words and keep punctuation.
* `sep` no longer used as an option, the above `regex` can
be used instead
* `num_list` now optional, and defaults to counts of 1 each if not
provided. Useful for counting `abs_freq` only if data not
available.
* `phrase_len` the number of words in each split token. Defaults
to 1 and can be set to 2 or higher. This helps in analyzing phrases
as opposed to words.
- Parameters supplied to `serp_goog` appear at the beginning
of the result df
- `serp_youtube` now contains `nextPageToken` to make
paginating requests easier
0.6.0 (2019-02-11)
------------------
* New function
- `extract_words` to extract an arbitrary set of words
* Minor updates
- `ad_from_string` slots argument reflects new text
ad lengths
- `hashtag` regex improved
0.5.3 (2019-01-31)
------------------
* Fix minor bugs
- Handle Twitter search queries with 0 results in final request
0.5.2 (2018-12-01)
------------------
* Fix minor bugs
- Properly handle requests for >50 items (`serp_youtube`)
- Rewrite test for _dict_product
- Fix issue with string printing error msg
0.5.1 (2018-11-06)
------------------
* Fix minor bugs
- _dict_product implemented with lists
- Missing keys in some YouTube responses
0.5.0 (2018-11-04)
------------------
* New function `serp_youtube`
- Query YouTube API for videos, channels, or playlists
- Multiple queries (product of parameters) in one function call
- Response looping and merging handled, one DataFrame
* `serp_goog` return Google's original error messages
* twitter responses with entities, get the entities extracted, each in a
separate column
0.4.1 (2018-10-13)
------------------
* New function `serp_goog` (based on Google CSE)
- Query Google search and get the result in a DataFrame
- Make multiple queries / requests in one function call
- All responses merged in one DataFrame
* twitter.get_place_trends results are ranked by town and country
0.4.0 (2018-10-08)
------------------
* New Twitter module based on twython
- Wraps 20+ functions for getting Twitter API data
- Gets data in a pandas DataFrame
- Handles looping over requests higher than the defaults
* Tested on Python 3.7
0.3.0 (2018-08-14)
------------------
* Search engine marketing cheat sheet.
* New set of extract\_ functions with summary stats for each:
* extract_hashtags
* extract_mentions
* extract_emoji
* Tests and bug fixes
0.2.0 (2018-07-06)
------------------
* New set of kw_<match-type> functions.
* Full testing and coverage.
0.1.0 (2018-07-02)
------------------
* First release on PyPI.
* Functions available:
- ad_create: create a text ad by placing words in placeholders
- ad_from_string: split a long string into shorter strings that fit into
given slots
- kw_generate: generate keywords from lists of products and words
- url_utm_ga: generate a UTM-tagged URL for Google Analytics tracking
- word_frequency: measure the absolute and weighted frequency of words in
a collection of documents
Raw data
{
"_id": null,
"home_page": "https://github.com/eliasdabbas/advertools",
"name": "advertools",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "advertising marketing search-engine-optimization adwords seo sem bingads keyword-research",
"author": "Elias Dabbas",
"author_email": "eliasdabbas@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/08/47/4caf569af29ceb4d395eb615bfc5578f90a6c7910971187fabfe302f14f9/advertools-0.16.1.tar.gz",
"platform": null,
"description": ".. image:: https://img.shields.io/pypi/v/advertools.svg\n :target: https://pypi.python.org/pypi/advertools\n\n.. image:: https://readthedocs.org/projects/advertools/badge/?version=latest\n :target: https://advertools.readthedocs.io/en/latest/?badge=latest\n :alt: Documentation Status\n\n.. image:: https://static.pepy.tech/badge/advertools\n :target: http://pepy.tech/project/advertools \n\n\n**Announcing** `Data Science with Python for SEO course <https://bit.ly/dsseo-course>`_: Cohort based course, interactive, live-coding.\n\n\n``advertools``: productivity & analysis tools to scale your online marketing\n============================================================================\n\n| A digital marketer is a data scientist.\n| Your job is to manage, manipulate, visualize, communicate, understand,\n and make decisions based on data.\n\nYou might be doing basic stuff, like copying and pasting text on spread\nsheets, you might be running large scale automated platforms with\nsophisticated algorithms, or somewhere in between. In any case your job\nis all about working with data.\n\nAs a data scientist you don't spend most of your time producing cool\nvisualizations or finding great insights. The majority of your time is spent\nwrangling with URLs, figuring out how to stitch together two tables, hoping\nthat the dates, won't break, without you knowing, or trying to generate the\nnext 124,538 keywords for an upcoming campaign, by the end of the week!\n\n``advertools`` is a Python package that can hopefully make that part of your job a little easier.\n\nInstallation\n------------\n\n.. code:: bash\n\n python3 -m pip install advertools\n\n\nPhilosophy/approach\n-------------------\n\nIt's very easy to learn how to use advertools. There are two main reasons for that.\n\nFirst, it is essentially a set of independent functions that you can easily learn and \nuse. There are no special data structures, or additional learning that you need. 
With \nbasic Python, and an understanding of the tasks that these functions help with, you \nshould be able to pick it up fairly easily. In other words, if you know how to use an \nExcel formula, you can easily use any advertools function.\n\nThe second reason is that `advertools` follows the UNIX philosophy in its design and \napproach. Here is one of the various summaries of the UNIX philosophy by Doug McIlroy: \n\n Write programs that do one thing and do it well. Write programs to work together. \n Write programs to handle text streams, because that is a universal interface.\n\nLet's see how advertools follows that:\n\n**Do one thing and do it well:** Each function in advertools aims for that. There is a \nfunction that just extracts hashtags from a text list, another one to crawl websites, \none to test which URLs are blocked by robots.txt files, and one for downloading XML \nsitemaps. Although they are designed to work together as a full pipeline, they can be \nrun independently in whichever combination or sequence you want.\n\n**Write programs to work together:** Independence does not mean they are unrelated. The \nworkflows are designed to aid the online marketing practitioner in various steps for \nunderstanding websites, SEO analysis, creating SEM campaigns and others.\n\n**Programs to handle text streams because that is a universal interface:** In Data \nScience the most used data structure that can be considered \u201cuniversal\u201d is the \nDataFrame. So, most functions return either a DataFrame or a file that can be read into \none. Once you have it, you have the full power of all other tools like pandas for \nfurther manipulating the data, Plotly for visualization, or any machine learning \nlibrary that can more easily handle tabular data.\n\nThis way it is kept modular as well as flexible and integrated. 
\nAs a next step most of these functions are being converted to no-code\n`interactive apps <https://adver.tools>`_ for non-coders, and taking them to the next \nlevel.\n\n\nSEM Campaigns\n-------------\nThe most important thing to achieve in SEM is a proper mapping between the\nthree main elements of a search campaign\n\n**Keywords** (the intention) -> **Ads** (your promise) -> **Landing Pages** (your delivery of the promise)\nOnce you have this done, you can focus on management and analysis. More importantly,\nonce you know that you can set this up in an easy way, you know you can focus\non more strategic issues. In practical terms you need two main tables to get started:\n\n* Keywords: You can `generate keywords <https://advertools.readthedocs.io/en/master/advertools.kw_generate.html>`_ (note I didn't say research) with the\n `kw_generate` function.\n\n* Ads: There are two approaches that you can use:\n\n * Bottom-up: You can create text ads for a large number of products by simple\n replacement of product names, and providing a placeholder in case your text\n is too long. 
Check out the `ad_create <https://advertools.readthedocs.io/en/master/advertools.ad_create.html>`_ function for more details.\n * Top-down: Sometimes you have a long description text that you want to split\n into headlines, descriptions and whatever slots you want to split them into.\n `ad_from_string <https://advertools.readthedocs.io/en/master/advertools.ad_from_string.html>`_\n helps you accomplish that.\n\n* Tutorials and additional resources\n\n * Get started with `Data Science for Digital Marketing and SEO/SEM <https://www.oncrawl.com/technical-seo/data-science-seo-digital-marketing-guide-beginners/>`_\n * `Setting a full SEM campaign <https://www.datacamp.com/community/tutorials/sem-data-science>`_ for DataCamp's website tutorial\n * Project to practice `generating SEM keywords with Python <https://www.datacamp.com/projects/400>`_ on DataCamp\n * `Setting up SEM campaigns on a large scale <https://www.semrush.com/blog/setting-up-search-engine-marketing-campaigns-on-large-scale/>`_ tutorial on SEMrush\n * Visual `tool to generate keywords <https://www.dashboardom.com/advertools>`_ online based on the `kw_generate` function\n\n\nSEO\n---\nProbably the most comprehensive online marketing area that is both technical\n(crawling, indexing, rendering, redirects, etc.) and non-technical (content\ncreation, link building, outreach, etc.). 
Here are some tools that can help\nwith your SEO\n\n* `SEO crawler: <https://advertools.readthedocs.io/en/master/advertools.spider.html>`_\n A generic SEO crawler that can be customized, built with Scrapy, & with several\n features:\n\n * Standard SEO elements extracted by default (title, header tags, body text,\n status code, response and request headers, etc.)\n * CSS and XPath selectors: You probably have more specific needs in mind, so\n you can easily pass any selectors to be extracted in addition to the\n standard elements being extracted\n * Custom settings: full access to Scrapy's settings, allowing you to better\n control the crawling behavior (set custom headers, user agent, stop spider\n after x pages, seconds, megabytes, save crawl logs, run jobs at intervals\n where you can stop and resume your crawls, which is ideal for large crawls\n or for continuous monitoring, and many more options)\n * Following links: option to only crawl a set of specified pages or to follow\n and discover all pages through links\n\n* `robots.txt downloader <https://advertools.readthedocs.io/en/master/advertools.sitemaps.html#advertools.sitemaps.robotstxt_to_df>`_\n A simple downloader of robots.txt files in a DataFrame format, so you can\n keep track of changes across crawls if any, and check the rules, sitemaps,\n etc.\n* `XML Sitemaps downloader / parser <https://advertools.readthedocs.io/en/master/advertools.sitemaps.html>`_\n An essential part of any SEO analysis is to check XML sitemaps. This is a\n simple function with which you can download one or more sitemaps (by\n providing the URL for a robots.txt file, a sitemap file, or a sitemap index\n* `SERP importer and parser for Google & YouTube <https://advertools.readthedocs.io/en/master/advertools.serp.html>`_\n Connect to Google's API and get the search data you want. 
Multiple search\n parameters supported, all in one function call, and all results returned in a\n DataFrame\n\n* Tutorials and additional resources\n\n * A visual tool built with the ``serp_goog`` function to get `SERP rankings on Google <https://www.dashboardom.com/google-serp>`_\n * A tutorial on `analyzing SERPs on a large scale with Python <https://www.semrush.com/blog/analyzing-search-engine-results-pages/>`_ on SEMrush\n * `SERP datasets on Kaggle <https://www.kaggle.com/eliasdabbas/datasets?search=engine>`_ for practicing on different industries and use cases\n * `SERP notebooks on Kaggle <https://www.kaggle.com/eliasdabbas/notebooks?sortBy=voteCount&group=everyone&pageSize=20&userId=484496&tagIds=1220>`_\n some examples on how you might tackle such data\n * `Content Analysis with XML Sitemaps and Python <https://www.semrush.com/blog/content-analysis-xml-sitemaps-python/>`_\n * XML dataset examples: `news sites <https://www.kaggle.com/eliasdabbas/news-sitemaps>`_, `Turkish news sites <https://www.kaggle.com/eliasdabbas/turk-haber-sitelerinin-site-haritalari>`_,\n `Bloomberg news <https://www.kaggle.com/eliasdabbas/bloomberg-business-articles-urls>`_\n\n\nText & Content Analysis (for SEO & Social Media)\n------------------------------------------------\n\nURLs, page titles, tweets, video descriptions, comments, hashtags are some\nexamples of the types of text we deal with. ``advertools`` provides a few\noptions for text analysis\n\n\n* `Word frequency <https://advertools.readthedocs.io/en/master/advertools.word_frequency.html>`_\n Counting words in a text list is one of the most basic and important tasks in\n text mining. What is also important is counting those words by taking in\n consideration their relative weights in the dataset. 
``word_frequency`` does\n just that.\n* `URL Analysis <https://advertools.readthedocs.io/en/master/advertools.urlytics.html>`_\n We all have to handle many thousands of URLs in reports, crawls, social media\n extracts, XML sitemaps and so on. ``url_to_df`` converts your URLs into\n easily readable DataFrames.\n\n* `Emoji <https://advertools.readthedocs.io/en/master/advertools.emoji.html>`_\n Produced with one click, extremely expressive, highly diverse (3k+ emoji),\n and very popular, it's important to capture what people are trying to communicate\n with emoji. You can extract emoji and get their names, groups, and\n sub-groups. The full emoji database is also available for convenience, as well\n as an ``emoji_search`` function in case you want some ideas for your next\n social media or any kind of communication\n* `extract_ functions <https://advertools.readthedocs.io/en/master/advertools.extract.html>`_\n The text that we deal with contains many elements and entities that have\n their own special meaning and usage. There is a group of convenience\n functions to help in extracting and getting basic statistics about structured\n entities in text: emoji, hashtags, mentions, currency, numbers, URLs, questions\n and more. 
You can also provide a special regex for your own needs.\n* `Stopwords <https://advertools.readthedocs.io/en/master/advertools.stopwords.html>`_\n A list of stopwords in forty different languages to help in text analysis.\n* Tutorial on DataCamp for creating the ``word_frequency`` function and\n explaining the importance of the difference between `absolute and weighted word frequency <https://www.datacamp.com/community/tutorials/absolute-weighted-word-frequency>`_\n* `Text Analysis for Online Marketers <https://www.semrush.com/blog/text-analysis-for-online-marketers/>`_\n An introductory article on SEMrush\n\nSocial Media\n------------\n\nIn addition to the text analysis techniques provided, you can also connect to\nthe Twitter and YouTube data APIs. The main benefits of using ``advertools``\nfor this:\n\n* Handles pagination and request limits: typically every API has a limited\n number of results that it returns. You have to handle pagination when you\n need more than the limit per request, which you typically do. This is handled\n by default\n* DataFrame results: APIs send you back data in formats that need to be\n parsed and cleaned so you can more easily start your analysis. This is also\n handled automatically\n* Multiple requests: in YouTube's case you might want to request data for the\n same query across several countries, languages, channels, etc. 
You can\n specify them all in one request and get the product of all the requests in\n one response\n\n* Tutorials and additional resources\n\n* A visual tool to `check what is trending on Twitter <https://www.dashboardom.com/trending-twitter>`_ for all available locations\n* A `Twitter data analysis dashboard <https://www.dashboardom.com/twitterdash>`_ with many options\n* How to use the `Twitter data API with Python <https://www.kaggle.com/eliasdabbas/twitter-in-a-dataframe>`_\n* `Extracting entities from social media posts <https://www.kaggle.com/eliasdabbas/extract-entities-from-social-media-posts>`_ tutorial on Kaggle\n* `Analyzing 131k tweets <https://www.kaggle.com/eliasdabbas/extract-entities-from-social-media-posts>`_ by European Football clubs tutorial on Kaggle\n* An overview of the `YouTube data API with Python <https://www.kaggle.com/eliasdabbas/youtube-data-api>`_\n\n\nConventions\n-----------\n\nFunction names mostly start with the object you are working on, so you can use\nautocomplete to discover other options:\n\n| ``kw_``: for keywords-related functions\n| ``ad_``: for ad-related functions\n| ``url_``: URL tracking and generation\n| ``extract_``: for extracting entities from social media posts (mentions, hashtags, emoji, etc.)\n| ``emoji_``: emoji related functions and objects\n| ``twitter``: a module for querying the Twitter API and getting results in a DataFrame\n| ``youtube``: a module for querying the YouTube Data API and getting results in a DataFrame\n| ``crawlytics``: a module for analyzing crawl data (compare, links, redirects, and more)\n| ``serp_``: get search engine results pages in a DataFrame, currently available: Google and YouTube\n| ``crawl``: a function you will probably use a lot if you do SEO\n| ``*_to_df``: a set of convenience functions for converting to DataFrames\n (log files, XML sitemaps, robots.txt files, and lists of URLs)\n\n\n=======================\nChange Log - advertools\n=======================\n\n0.16.1 
(2024-08-19)\n-------------------\n\n* Fixed\n - Ensure meta crawl data included in URLs crawled by following links.\n\n0.16.0 (2024-08-18)\n-------------------\n\n* Added\n - Enable the ``meta`` parameter for the crawl function for: arbitrary metadata,\n custom request headers, and 3rd party plugins like playwright.\n* Changed\n - Raise an error when supplying a custom log format with supplying fields.\n\n0.15.1 (2024-07-16)\n-------------------\n\n* Fixed\n - Make file path for ``emoji_df`` relative to advertools ``__path__``.\n - Allow the extension ``.jsonl`` for crawling.\n\n0.15.0 (2024-07-15)\n-------------------\n\n* Added\n - Enable supplying request headers in ``sitemap_to_df``, contributed by `@joejoinerr <https://github.com/joejoinerr>`_\n - New function ``crawlytics.compare`` for comparing two crawls.\n - New function ``crawlytics.running_crawls`` for getting data on currently running crawl jobs (\\*NIX only for now).\n - New parameter ``date_format`` to ``logs_to_df`` for custom date formats.\n\n* Changed\n - Removed the `relatedSite` parameter from ``serp_goog`` - deprecated.\n - Update emoji regex and functionality to v15.1.\n\n* Fixed\n - Use int64 instead of int for YouTube count columns, contributed by `@DanielP77 <https://github.com/DanielP77>`_\n\n0.14.4 (2024-07-13)\n-------------------\n\n* Fixed\n - Use ``pd.NA`` instead of ``np.nan`` for empty values in ``url_to_df``.\n\n0.14.3 (2024-06-27)\n-------------------\n\n* Changed\n - Use a different XPath expression for `body_text` while crawling.\n\n0.14.2 (2024-02-24)\n-------------------\n\n* Changed\n - Allow ``sitemap_to_df`` to work on offline sitemaps.\n\n0.14.1 (2024-02-21)\n-------------------\n\n* Fixed\n - Preserve the order of supplied URLs in the output of ``url_to_df``.\n\n0.14.0 (2024-02-18)\n-------------------\n\n* Added\n - New module ``crawlytics`` for analyzing crawl DataFrames. 
Includes functions to\n analyze crawl DataFrames (``images``, ``redirects``, and ``links``), as well as\n functions to handle large files (``jl_to_parquet``, ``jl_subset``, ``parquet_columns``).\n - New ``encoding`` option for ``logs_to_df``.\n - Option to save the output of ``url_to_df`` to a parquet file.\n\n* Changed\n - Remove requirement to delete existing log output and error files if they exist.\n The function will now overwrite them if they do.\n - Autothrottling is enabled by default in ``crawl_headers`` to minimize being blocked.\n\n* Fixed\n - Always get absolute path for img src while crawling.\n - Handle NA src attributes when extracting images.\n - Change fillna(method=\"ffill\") to ffill for ``url_to_df``.\n\n0.13.5 (2023-08-22)\n-------------------\n\n* Added\n - Initial experimental functionality for ``crawl_images``.\n\n* Changed\n - Enable autothrottling by default for ``crawl_headers``.\n\n0.13.4 (2023-07-26)\n-------------------\n\n* Fixed\n - Make img attributes consistent in length, and support all attributes.\n\n0.13.3 (2023-06-27)\n-------------------\n\n* Changed\n - Allow optional trailing space in log files (contributed by @andypayne)\n\n* Fixed\n - Replace newlines with spaces while parsing JSON-LD which was causing \n errors in some cases.\n\n\n0.13.2 (2022-09-30)\n-------------------\n\n* Added\n - Crawling recipe for how to use the ``DEFAULT_REQUEST_HEADERS`` to change\n the default headers.\n\n* Changed\n - Split long lists of URL while crawling regardless of the ``follow_links``\n parameter\n\n* Fixed\n - Clarify that while authenticating for Twitter only ``app_key`` and \n ``app_secret`` are required, with the option to provide ``oauth_token``\n and ``oauth_token_secret`` if/when needed.\n\n\n0.13.1 (2022-05-11)\n-------------------\n\n* Added\n - Command line interface with most functions\n - Make documentation interactive for most pages using ``thebe-sphinx``\n\n* Changed\n - Use `np.nan` wherever there are missing values in 
``url_to_df``\n\n* Fixed\n - Don't remove double quotes from etags when downloading XML sitemaps\n - Replace instances of ``pd.DataFrame.append``, which is deprecated, with\n ``pd.concat``.\n - Replace empty values with np.nan for the size column in ``logs_to_df``\n\n\n0.13.0 (2022-02-10)\n-------------------\n\n* Added\n - New function ``crawl_headers``: A crawler that only makes `HEAD` requests\n to a known list of URLs.\n - New function ``reverse_dns_lookup``: A way to get host information for a\n large list of IP addresses concurrently.\n - New options for crawling: `exclude_url_params`, `include_url_params`,\n `exclude_url_regex`, and `include_url_regex` for controlling which links to\n follow while crawling.\n\n* Fixed\n - Any ``custom_settings`` options given to the ``crawl`` function that were\n defined using a dictionary can now be set without issues. There was an\n issue if those options were not strings.\n\n* Changed\n - The `skip_url_params` option was removed and replaced with the more\n versatile ``exclude_url_params``, which accepts either ``True`` or a list\n of URL parameters to exclude while following links.\n\n0.12.3 (2021-11-27)\n-------------------\n\n* Fixed\n - Crawler stops when provided with bad URLs in list mode.\n\n0.12.0,1,2 (2021-11-27)\n-----------------------\n\n* Added\n - New function ``logs_to_df``: Convert a log file of any non-JSON format\n into a pandas DataFrame and save it to a `parquet` file. 
This also\n compresses the file to a much smaller size.\n - Crawler extracts all available ``img`` attributes: 'alt', 'crossorigin',\n 'height', 'ismap', 'loading', 'longdesc', 'referrerpolicy', 'sizes',\n 'src', 'srcset', 'usemap', and 'width' (excluding global HTML attributes\n like ``style`` and ``draggable``).\n - New parameter for the ``crawl`` function ``skip_url_params``: Defaults to\n False, consistent with previous behavior, with the ability to not\n follow/crawl links containing any URL parameters.\n - New column for ``url_to_df`` \"last_dir\": Extract the value in the last\n directory for each of the URLs.\n\n* Changed\n - Query parameter columns in ``url_to_df`` DataFrame are now sorted by how\n full the columns are (the percentage of values that are not `NA`)\n \n0.11.1 (2021-04-09)\n-------------------\n\n* Added\n - The `nofollow` attribute for nav, header, and footer links.\n\n* Fixed\n - Timeout error while downloading robots.txt files.\n - Make extracting nav, header, and footer links consistent with all links.\n\n0.11.0 (2021-03-31)\n-------------------\n\n* Added\n - New parameter `recursive` for ``sitemap_to_df`` to control whether or not\n to get all sub sitemaps (default), or to only get the current\n (sitemapindex) one.\n - New columns for ``sitemap_to_df``: ``sitemap_size_mb``\n (1 MB = 1,024x1,024 bytes), and ``sitemap_last_modified`` and ``etag``\n (if available).\n - Option to request multiple robots.txt files with ``robotstxt_to_df``.\n - Option to save downloaded robots DataFrame(s) to a file with\n ``robotstxt_to_df`` using the new parameter ``output_file``.\n - Two new columns for ``robotstxt_to_df``: ``robotstxt_last_modified`` and\n ``etag`` (if available).\n - Raise `ValueError` in ``crawl`` if ``css_selectors`` or\n ``xpath_selectors`` contain any of the default crawl column headers\n - New XPath code recipes for custom extraction.\n - New function ``crawllogs_to_df`` which converts crawl logs to a DataFrame\n provided they were 
saved while using the ``crawl`` function.\n - New columns in ``crawl``: `viewport`, `charset`, all `h` headings\n (whichever is available), nav, header and footer links and text, if\n available.\n - Crawl errors don't stop crawling anymore, and the error message is\n included in the output file under a new `errors` and/or `jsonld_errors`\n column(s).\n - In case of having JSON-LD errors, errors are reported in their respective\n column, and the remainder of the page is scraped.\n\n* Changed\n - Removed column prefix `resp_meta_` from columns containing it\n - Redirect URLs and reasons are separated by '@@' for consistency with\n other multiple-value columns\n - Links extracted while crawling are not unique any more (all links are\n extracted).\n - Emoji data updated with v13.1.\n - Heading tags are scraped even if they are empty, e.g. <h2></h2>.\n - Default user agent for crawling is now advertools/VERSION.\n\n* Fixed\n - Handle sitemap index files that contain links to themselves, with an\n error message included in the final DataFrame\n - Error in robots.txt files caused by comments preceded by whitespace\n - Zipped robots.txt files causing a parsing issue\n - Crawl issues on some Linux systems when providing a long list of URLs\n\n* Removed\n - Columns from the ``crawl`` output: `url_redirected_to`, `links_fragment`\n\n\n0.10.7 (2020-09-18)\n-------------------\n\n* Added\n - New function ``knowledge_graph`` for querying Google's API\n - Faster ``sitemap_to_df`` with threads\n - New parameter `max_workers` for ``sitemap_to_df`` to determine how fast\n it could go\n - New parameter `capitalize_adgroups` for ``kw_generate`` to determine\n whether or not to keep ad groups as is, or set them to title case (the\n default)\n\n* Fixed\n - Remove restrictions on the number of URLs provided to ``crawl``,\n assuming `follow_links` is set to `False` (list mode)\n - JSON-LD issue breaking crawls when it's invalid (now skipped)\n\n* Removed\n - Deprecate the 
``youtube.guide_categories_list`` (no longer supported by\n the API)\n\n0.10.6 (2020-06-30)\n-------------------\n\n* Added\n - JSON-LD support in crawling. If available on a page, JSON-LD items will\n have special columns, and multiple JSON-LD snippets will be numbered for\n easy filtering\n* Changed\n - Stricter parsing for rel attributes, making sure they are in link\n elements as well\n - Date column names for ``robotstxt_to_df`` and ``sitemap_to_df`` unified\n as \"download_date\"\n - Numbering OG, Twitter, and JSON-LD where multiple elements are present in\n the same page, follows a unified approach: no numbering for the first\n element, and numbers start with \"1\" from the second element on. \"element\",\n \"element_1\", \"element_2\" etc.\n\n0.10.5 (2020-06-14)\n-------------------\n\n* Added\n - New features for the ``crawl`` function:\n * Extract canonical tags if available\n * Extract alternate `href` and `hreflang` tags if available\n * Open Graph data \"og:title\", \"og:type\", \"og:image\", etc.\n * Twitter cards data \"twitter:site\", \"twitter:title\", etc.\n\n* Fixed\n - Minor fixes to ``robotstxt_to_df``:\n * Allow whitespace in fields\n * Allow case-insensitive fields\n\n* Changed\n - ``crawl`` now only supports `output_file` with the extension \".jl\"\n - ``word_frequency`` drops `wtd_freq` and `rel_value` columns if `num_list`\n is not provided\n\n0.10.4 (2020-06-07)\n-------------------\n\n* Added\n - New function ``url_to_df``, splitting URLs into their components and to a\n DataFrame\n - Slight speed up for ``robotstxt_test``\n\n0.10.3 (2020-06-03)\n-------------------\n\n* Added\n - New function ``robotstxt_test``, testing URLs and whether they can be\n fetched by certain user-agents\n\n* Changed\n - Documentation main page relayout, grouping of topics, & sidebar captions\n - Various documentation clarifications and new tests\n\n0.10.2 (2020-05-25)\n-------------------\n\n* Added\n - User-Agent info to requests getting sitemaps and 
robotstxt files\n - CSS/XPath selectors support for the crawl function\n - Support for custom spider settings with a new parameter ``custom_settings``\n\n* Fixed\n - Update changed supported search operators and values for CSE\n\n0.10.1 (2020-05-23)\n-------------------\n\n* Changed\n - Links are better handled, and new output columns are available:\n ``links_url``, ``links_text``, ``links_fragment``, ``links_nofollow``\n - ``body_text`` extraction is improved by including <p>, <li>, and <span>\n elements\n\n0.10.0 (2020-05-21)\n-------------------\n\n* Added\n - New function ``crawl`` for crawling and parsing websites\n - New function ``robotstxt_to_df`` downloading robots.txt files into\n DataFrames\n\n0.9.1 (2020-05-19)\n------------------\n\n* Added\n - Ability to specify robots.txt file for ``sitemap_to_df``\n - Ability to retrieve any kind of sitemap (news, video, or images)\n - Errors column to the returned DataFrame if any errors occur\n - A new ``sitemap_downloaded`` column showing datetime of getting the\n sitemap\n\n* Fixed\n - Logging issue causing ``sitemap_to_df`` to log the same action twice\n - Issue preventing URLs not ending with xml or gz from being retrieved\n - Correct sitemap URL showing in the ``sitemap`` column\n\n0.9.0 (2020-04-03)\n------------------\n\n* Added\n - New function ``sitemap_to_df`` imports an XML sitemap into a\n ``DataFrame``\n\n0.8.1 (2020-02-08)\n------------------\n\n* Changed\n - Column `query_time` is now named `queryTime` in the `youtube` functions\n - Handle json_normalize import from pandas based on pandas version\n\n0.8.0 (2020-02-02)\n------------------\n\n* Added\n - New module `youtube` connecting to all GET requests in API\n - `extract_numbers` new function\n - `emoji_search` new function\n - `emoji_df` new variable containing all emoji as a DataFrame\n\n* Changed\n - Emoji database updated to v13.0\n - `serp_goog` with expanded `pagemap` and metadata\n\n* Fixed\n - `serp_goog` errors, some parameters not 
appearing in result\n df\n - `extract_numbers` issue when providing dash as a separator\n in the middle\n\n0.7.3 (2019-04-17)\n------------------\n\n* Added\n - New function `extract_exclamations` very similar to\n `extract_questions`\n - New function `extract_urls`, also counts top domains and\n top TLDs\n - New keys to `extract_emoji`; `top_emoji_categories`\n & `top_emoji_sub_categories`\n - Groups and sub-groups to `emoji db`\n\n0.7.2 (2019-03-29)\n------------------\n\n* Changed\n - Emoji regex updated\n - Simpler extraction of Spanish `questions`\n\n0.7.1 (2019-03-26)\n------------------\n\n* Fixed\n - Missing __init__ imports.\n\n\n0.7.0 (2019-03-26)\n------------------\n\n* Added\n - New `extract_` functions:\n\n * Generic `extract` used by all others, and takes\n arbitrary regex to extract text.\n * `extract_questions` to get question mark statistics, as\n well as the text of questions asked.\n * `extract_currency` shows text that has currency symbols in it, as\n well as surrounding text.\n * `extract_intense_words` gets statistics about, and extract words with\n any character repeated three or more times, indicating an intense\n feeling (+ve or -ve).\n\n - New function `word_tokenize`:\n \n * Used by `word_frequency` to get tokens of\n 1,2,3-word phrases (or more).\n * Split a list of text into tokens of a specified number of words each.\n\n - New stop-words from the ``spaCy`` package:\n\n **current:** Arabic, Azerbaijani, Danish, Dutch, English, Finnish,\n French, German, Greek, Hungarian, Italian, Kazakh, Nepali, Norwegian,\n Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.\n\n **new:** Bengali, Catalan, Chinese, Croatian, Hebrew, Hindi, Indonesian,\n Irish, Japanese, Persian, Polish, Sinhala, Tagalog, Tamil, Tatar, Telugu,\n Thai, Ukrainian, Urdu, Vietnamese\n\n* Changed\n - `word_frequency` takes new parameters:\n * `regex` defaults to words, but can be changed to anything '\\S+'\n to split words and keep punctuation for example.\n\n * `sep` 
no longer used as an option, the above `regex` can\n be used instead\n\n * `num_list` now optional, and defaults to counts of 1 each if not\n provided. Useful for counting `abs_freq` only if data not\n available.\n\n * `phrase_len` the number of words in each split token. Defaults\n to 1 and can be set to 2 or higher. This helps in analyzing phrases\n as opposed to words.\n\n - Parameters supplied to `serp_goog` appear at the beginning\n of the result df\n - `serp_youtube` now contains `nextPageToken` to make\n paginating requests easier\n\n0.6.0 (2019-02-11)\n------------------\n\n* New function\n - `extract_words` to extract an arbitrary set of words\n* Minor updates\n - `ad_from_string` slots argument reflects new text\n ad lengths\n - `hashtag` regex improved\n\n0.5.3 (2019-01-31)\n------------------\n\n* Fix minor bugs\n - Handle Twitter search queries with 0 results in final request\n\n0.5.2 (2018-12-01)\n------------------\n\n* Fix minor bugs\n - Properly handle requests for >50 items (`serp_youtube`)\n - Rewrite test for _dict_product\n - Fix issue with string printing error msg\n\n0.5.1 (2018-11-06)\n------------------\n\n* Fix minor bugs\n - _dict_product implemented with lists\n - Missing keys in some YouTube responses\n\n0.5.0 (2018-11-04)\n------------------\n\n* New function `serp_youtube`\n - Query YouTube API for videos, channels, or playlists\n - Multiple queries (product of parameters) in one function call\n - Response looping and merging handled, one DataFrame \n* `serp_goog` returns Google's original error messages\n* twitter responses with entities, get the entities extracted, each in a\n separate column\n\n\n0.4.1 (2018-10-13)\n------------------\n\n* New function `serp_goog` (based on Google CSE)\n - Query Google search and get the result in a DataFrame\n - Make multiple queries / requests in one function call\n - All responses merged in one DataFrame\n* twitter.get_place_trends results are ranked by town and country\n\n0.4.0 
(2018-10-08)\n------------------\n\n* New Twitter module based on twython\n - Wraps 20+ functions for getting Twitter API data\n - Gets data in a pandas DataFrame\n - Handles looping over requests higher than the defaults\n* Tested on Python 3.7\n\n0.3.0 (2018-08-14)\n------------------\n\n* Search engine marketing cheat sheet.\n* New set of extract\\_ functions with summary stats for each:\n * extract_hashtags\n * extract_mentions\n * extract_emoji\n* Tests and bug fixes\n\n0.2.0 (2018-07-06)\n------------------\n\n* New set of kw_<match-type> functions.\n* Full testing and coverage. \n\n0.1.0 (2018-07-02)\n------------------\n\n* First release on PyPI.\n* Functions available:\n - ad_create: create a text ad by placing words in placeholders\n - ad_from_string: split a long string into shorter strings that fit into\n given slots\n - kw_generate: generate keywords from lists of products and words\n - url_utm_ga: generate a UTM-tagged URL for Google Analytics tracking\n - word_frequency: measure the absolute and weighted frequency of words in\n a collection of documents\n",
"bugtrack_url": null,
"license": "MIT license",
"summary": "Productivity and analysis tools for online marketing",
"version": "0.16.1",
"project_urls": {
"Homepage": "https://github.com/eliasdabbas/advertools"
},
"split_keywords": [
"advertising",
"marketing",
"search-engine-optimization",
"adwords",
"seo",
"sem",
"bingads",
"keyword-research"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "10b98e1f59a72c96d9e0eb3d70c0219bc6acf3d86ee6c73d709797ddffc0a920",
"md5": "f7908608b52a603541cd5f04cdf5b2ca",
"sha256": "0cdbedc0f39ca094a926746056f4b7cb4c7d04b7e3b7e5a6dd320d6394f20ddb"
},
"downloads": -1,
"filename": "advertools-0.16.1-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "f7908608b52a603541cd5f04cdf5b2ca",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 252324,
"upload_time": "2024-08-19T10:22:05",
"upload_time_iso_8601": "2024-08-19T10:22:05.498823Z",
"url": "https://files.pythonhosted.org/packages/10/b9/8e1f59a72c96d9e0eb3d70c0219bc6acf3d86ee6c73d709797ddffc0a920/advertools-0.16.1-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "08474caf569af29ceb4d395eb615bfc5578f90a6c7910971187fabfe302f14f9",
"md5": "adad73ef502b5137978d4ec38812172d",
"sha256": "e3614f852c2e9b76c0464bc818f967b1a5498c34211206fc6b5acae5842d51a1"
},
"downloads": -1,
"filename": "advertools-0.16.1.tar.gz",
"has_sig": false,
"md5_digest": "adad73ef502b5137978d4ec38812172d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 550874,
"upload_time": "2024-08-19T10:22:08",
"upload_time_iso_8601": "2024-08-19T10:22:08.083148Z",
"url": "https://files.pythonhosted.org/packages/08/47/4caf569af29ceb4d395eb615bfc5578f90a6c7910971187fabfe302f14f9/advertools-0.16.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-19 10:22:08",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "eliasdabbas",
"github_project": "advertools",
"travis_ci": true,
"coveralls": false,
"github_actions": true,
"tox": true,
"lcname": "advertools"
}