c2xg

Name	c2xg
Version	2.2
home_page	http://www.c2xg.io
Summary	Construction Grammars for Natural Language Processing and Computational Linguistics
upload_time	2024-03-14 01:27:55
maintainer
docs_url	None
author	Jonathan Dunn
requires_python	>=3.7
license	LGPL 3.0
keywords	grammar induction syntax cxg unsupervised learning natural language processing computational linguistics construction grammar cognitive linguistics usage-based grammar
VCS
bugtrack_url
requirements	No requirements were recorded.

# c2xg 2.0
------------

Computational Construction Grammar, or *c2xg*, is a Python package for learning and working with construction grammars. 

Why CxG? Constructions are grammatical entities that support a straightforward quantification of linguistic structure.

This package currently supports 18 languages: English (eng), Arabic (ara), Danish (dan), German (deu), Greek (ell), Farsi (fas), Finnish (fin), French (fra), Hindi (hin), Indonesian (ind), Italian (ita), Dutch (nld), Polish (pol), Portuguese (por), Russian (rus), Spanish (spa), Swedish (swe), Turkish (tur).

### Further Documentation
-----------------------

More detailed linguistic documentation is available in the draft book *Computational Construction Grammar: A Usage-Based Approach* at https://www.jdunn.name/cxg

Usage examples in a working environment are available at https://doi.org/10.24433/CO.9944630.v1

Detailed descriptions of each pre-trained grammar are available at https://doi.org/10.17605/OSF.IO/SA6R3

### Installation
--------------

To install the full package and its dependencies with *pip*, use:

	pip install git+https://github.com/jonathandunn/c2xg.git

### Download Models
--------------------

A number of pre-trained grammar models are available for download and use with C2xG.

Note: The *download_model()* function must be used to install pre-trained grammars.

	from c2xg import download_model

Models can be downloaded as follows: 

	download_model(model = False, data_dir = None, out_dir = None)

These parameters are as follows:

	model (str)	Name of a pre-trained grammar model or its shortcut
	data_dir (str)	Main data directory, creates 'data' in current directory if none given
	out_dir (str)	Output data directory, creates 'OUT' in main data directory if none given

For example, to download the model pre-trained on an English blogs corpus, use:

	download_model(model = "BL", data_dir = "CxG_data")

Models have been pre-trained for each of the languages above, with additional English models trained on separate corpora. To see the models available, please view the following lists:

<details><summary>View: General Models</summary>
	
	"ara": "cxg_multi_v02.ara.1000k_words.model.zip",
	"dan": "cxg_multi_v02.dan.1000k_words.model.zip",
	"deu": "cxg_multi_v02.deu.1000k_words.model.zip",
	"ell": "cxg_multi_v02.ell.1000k_words.model.zip",
	"eng": "cxg_multi_v02.eng.1000k_words.model.zip",
	"fas": "cxg_multi_v02.fas.1000k_words.model.zip",
	"fin": "cxg_multi_v02.fin.1000k_words.model.zip",
	"fra": "cxg_multi_v02.fra.1000k_words.model.zip",
	"hin": "cxg_multi_v02.hin.1000k_words.model.zip",
	"ind": "cxg_multi_v02.ind.1000k_words.model.zip",
	"ita": "cxg_multi_v02.ita.1000k_words.model.zip",
	"nld": "cxg_multi_v02.nld.1000k_words.model.zip",
	"pol": "cxg_multi_v02.pol.1000k_words.model.zip",
	"por": "cxg_multi_v02.por.1000k_words.model.zip",
	"rus": "cxg_multi_v02.rus.1000k_words.model.zip",
 	"spa": "cxg_multi_v02.spa.1000k_words.model.zip",
  	"swe": "cxg_multi_v02.swe.1000k_words.model.zip",
   	"tur": "cxg_multi_v02.tur.1000k_words.model.zip",

</details>

<details><summary>View: English Corpus Models</summary>

	"BL": "cxg_corpus_blogs_final_v2.eng.1000k_words.model.zip",
	"NC": "cxg_corpus_comments_final_v2.eng.1000k_words.model.zip",
	"EU": "cxg_corpus_eu_final_v2.eng.1000k_words.model.zip",
	"PG": "cxg_corpus_pg_final_v2.eng.1000k_words.model.zip",
	"PR": "cxg_corpus_reviews_final_v2.eng.1000k_words.model.zip",
	"OS": "cxg_corpus_subs_final_v2.eng.1000k_words.model.zip",
	"TW": "cxg_corpus_tw_final_v2.eng.1000k_words.model.zip",
	"WK": "cxg_corpus_wiki_final_v2.eng.1000k_words.model.zip",

</details>
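
The shortcut lists above can be read as a simple lookup table. The sketch below rebuilds that table from the entries shown and resolves a shortcut to its file name. The helper `resolve_model` is hypothetical and only illustrates the mapping; it is not part of c2xg, and the package may resolve names differently.

```python
# Shortcut-to-file mapping rebuilt from the two lists above.
LANGUAGES = ["ara", "dan", "deu", "ell", "eng", "fas", "fin", "fra", "hin",
             "ind", "ita", "nld", "pol", "por", "rus", "spa", "swe", "tur"]
GENERAL_MODELS = {
    lang: f"cxg_multi_v02.{lang}.1000k_words.model.zip" for lang in LANGUAGES
}

CORPUS_NAMES = {"BL": "blogs", "NC": "comments", "EU": "eu", "PG": "pg",
                "PR": "reviews", "OS": "subs", "TW": "tw", "WK": "wiki"}
ENGLISH_CORPUS_MODELS = {
    key: f"cxg_corpus_{name}_final_v2.eng.1000k_words.model.zip"
    for key, name in CORPUS_NAMES.items()
}


def resolve_model(name):
    """Hypothetical helper: map a shortcut to its file name.

    Names that are not shortcuts are returned unchanged, on the assumption
    that they are already model file names.
    """
    shortcuts = {**GENERAL_MODELS, **ENGLISH_CORPUS_MODELS}
    return shortcuts.get(name, name)


print(resolve_model("BL"))   # cxg_corpus_blogs_final_v2.eng.1000k_words.model.zip
print(resolve_model("deu"))  # cxg_multi_v02.deu.1000k_words.model.zip
```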

### Usage: Initialising
---------------------

To use C2xG, it must first be initialised with the following command.

_note:_ This process may take a few minutes depending on your machine. 

	from c2xg import C2xG
	CxG = C2xG(model = "BL") 

Initialisation accepts the following parameters:

	model (str)			Pre-trained grammar file name in the out directory, or corresponding shortcut
						See "Download Models" and "Further Documentation" for more
	data_dir (str)			Working directory, creates 'data' in current directory if none given
	in_dir (str)			Input directory name, creates 'IN' in 'data_dir' if none given
	out_dir (str)			Output directory name, creates 'OUT' in 'data_dir' if none given
	language (str)			Language for file names, default 'N/A'
	nickname (str) 			Nickname for file names, default 'cxg'
	max_sentence_length (int) 	Cutoff length for loading a given sentence, 50 by default
	normalization (bool)		Normalize frequency by ngram type and frequency strata, True by default
	max_words (bool) 		Limit the number of words when reading input data, False by default
	cbow_file (str)			Name of cbow file to load or create
	sg_file (str) 			Name of skip-gram file to load or create

For example, to initialise an instance of C2xG with the English Wiki corpus in the folder "CxG_data", use:

	CxG_wiki = C2xG(model = "WK", data_dir = "CxG_data")
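
The directory defaults described above imply a small folder layout: a working directory containing 'IN' and 'OUT'. The sketch below reproduces that layout with `pathlib`. The function `make_data_dirs` is a hypothetical stand-in, not c2xg code, and it assumes that `in_dir` and `out_dir` are names relative to `data_dir`.

```python
import tempfile
from pathlib import Path


def make_data_dirs(data_dir=None, in_dir=None, out_dir=None):
    """Hypothetical helper mirroring the documented defaults (not c2xg code)."""
    # Fall back to 'data' in the current directory when no data_dir is given.
    root = Path(data_dir) if data_dir else Path.cwd() / "data"
    # Fall back to 'IN' and 'OUT' inside the working directory.
    in_path = root / (in_dir or "IN")
    out_path = root / (out_dir or "OUT")
    for path in (in_path, out_path):
        path.mkdir(parents=True, exist_ok=True)
    return in_path, out_path


# Demonstrate inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    in_path, out_path = make_data_dirs(Path(tmp) / "CxG_data")
    print(in_path.name, out_path.name)  # IN OUT
```

Under this reading, input corpora would be placed in `CxG_data/IN` and downloaded or learned models would appear in `CxG_data/OUT`.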

### Usage: Grammar Parsing
---------------------------

***CxG.parse()***

The *parse()* function takes a text, file name, or list of file names and returns a sparse matrix with construction frequencies for each line in the text. 

	CxG.parse(self, input, input_type = "files", mode = "syn", third_order = False)

Which accepts the following parameters: 

	input (str or list of str)	A filename or list of filenames to be parsed, sourced from 'in' directory
	input_type (str)		"files" if input contains filenames or "lines" if input contains data
	mode (str, default "syn")	Type(s) of constructions to be parsed ("lex", "syn", "full", or "all")
	third_order (bool)		Whether third-order constructions are used, False by default

For example, to take a text file in the 'in' directory and parse it for lexical constructions, use:

	parse_lex = CxG.parse(input = "my_sentences.txt", input_type = "files", mode = "lex")

***CxG.parse_types()***

The *parse_types()* function takes a text, file name, or list of file names and returns a sparse matrix with construction type frequencies over all inputs. 

	CxG.parse_types(self, input, input_type = "files", mode = "syn", third_order = False)

Which takes the following parameters:

	input (str or list of str)	A filename or list of filenames to be parsed, sourced from 'in' directory
	input_type (str)		"files" if input contains filenames or "lines" if input contains data
	mode (str, default "syn")	Type(s) of representations to be parsed ("lex", "syn", "full", or "all")
	third_order (bool)		Whether third-order constructions are used, False by default

For example, to take a text file in the 'in' directory and parse it for all construction types, use:

	parse_all_types = CxG.parse_types(input = "my_sentences.txt", input_type = "files", mode = "all")
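
The difference between the two functions can be pictured with toy data: *parse()* reports frequencies per line, while *parse_types()* reports frequencies over all inputs. The sketch below uses plain dictionaries instead of sparse matrices, with invented construction IDs and counts. It assumes that the aggregate view is the sum of the per-line counts, which may not match how the package computes it.

```python
from collections import Counter

# Toy per-line construction counts: one dict per input line, keyed by
# construction ID. IDs and counts are invented for illustration.
per_line = [
    {0: 2, 3: 1},   # line 1
    {3: 1},         # line 2
    {0: 1, 1: 4},   # line 3
]

# Summing over lines gives one frequency per construction type.
totals = Counter()
for row in per_line:
    totals.update(row)

print(dict(sorted(totals.items())))  # {0: 3, 1: 4, 3: 2}
```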

### Usage: Grammar Analysis
----------------------------

***CxG.get_type_token_ratio()***

The *get_type_token_ratio()* method takes a text, file name, or list of file names and returns the following: the type and token counts for all inputs, and the ratio thereof.

	get_type_token_ratio(self, input_data, input_type, mode = "syn", third_order = False)

Which takes the following parameters:

	input_data (str or list of str)	A filename or list of filenames to be parsed, sourced from 'in' directory
	input_type (str)		"files" if input contains filenames or "lines" if input contains data
	mode (str, default "syn")	Type(s) of representations to be parsed ("lex", "syn", "full", or "all")
	third_order (bool)		Whether third-order constructions are used, False by default

For example, to take a text file in the 'in' directory and obtain the lexical construction type/token counts and type-token ratio, use:

	get_ratio = CxG.get_type_token_ratio(input_data = "my_sentences.txt", input_type = "files", mode = "lex")
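
As a reminder of what the three returned values mean, here is a minimal sketch of a type-token ratio over construction frequencies. It uses the conventional definition (types are constructions that occur at least once, tokens are the sum of all occurrences). The function name and the toy counts are assumptions for illustration and are not taken from c2xg.

```python
def type_token_ratio(frequencies):
    """Return (types, tokens, ratio) for a mapping of construction ID -> count."""
    # A type is any construction observed at least once.
    types = sum(1 for count in frequencies.values() if count > 0)
    # Tokens are the total number of construction occurrences.
    tokens = sum(frequencies.values())
    ratio = types / tokens if tokens else 0.0
    return types, tokens, ratio


# Toy frequencies: construction 7 is in the grammar but never occurs.
counts = {0: 3, 1: 4, 3: 2, 7: 0}
types, tokens, ratio = type_token_ratio(counts)
print(types, tokens, round(ratio, 3))  # 3 9 0.333
```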

***CxG.get_association()***

The *get_association()* function returns a dataframe with association measures for word pairs in the input data. This dataframe includes the words in the pair, their Delta-P scores in both left and right directions, the difference in scores, the maximum score, and the frequency within the data. 

_note:_ For more on these measures, see https://arxiv.org/abs/2104.01297

	get_association(self, freq_threshold = 1, normalization = True, grammar_type = "full", lex_only = False, data = False)

Which takes the following parameters:

	freq_threshold (int) 		Only consider bigrams above this frequency threshold, 1 by default
	normalization (bool) 		Normalize frequency by ngram type and frequency strata, True by default
	grammar_type (str)		Suffix for pickle file name for file containing discounts, default "full"
	lex_only (bool)			Limit n-grams examined to lexical entries only, False by default
	data (str or list of str)	A filename or list of filenames to be parsed, sourced from 'in' directory

For example, to examine word pairs that occur at least ten times, use:

	delta_association = CxG.get_association(data = "my_sentences.txt", freq_threshold = 10)
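
To make the Delta-P columns more concrete, the sketch below computes the measure in both directions for one word pair from raw bigram counts. It follows the textbook definition, P(outcome | cue) - P(outcome | no cue), and leaves out the normalization that the package applies. The function names, the toy sentence, and the labels "forward" and "backward" are assumptions for illustration.

```python
from collections import Counter


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def delta_p(bigrams, left, right):
    """Return (forward, backward) Delta-P for the pair (left, right)."""
    total = sum(bigrams.values())
    a = bigrams.get((left, right), 0)               # left followed by right
    b = sum(n for (w1, w2), n in bigrams.items()
            if w1 == left and w2 != right)          # left followed by another word
    c = sum(n for (w1, w2), n in bigrams.items()
            if w1 != left and w2 == right)          # another word followed by right
    d = total - a - b - c                           # neither word in position
    # Forward: how much does seeing `left` raise the chance of `right`?
    forward = safe_div(a, a + b) - safe_div(c, c + d)
    # Backward: how much does seeing `right` raise the chance of `left`?
    backward = safe_div(a, a + c) - safe_div(b, b + d)
    return forward, backward


tokens = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(tokens, tokens[1:]))
forward, backward = delta_p(bigrams, "the", "cat")
print(round(forward, 3), round(backward, 3))  # 0.667 0.833
```

The two directions differ because "the" is sometimes followed by other words, while "cat" is always preceded by "the" in this toy data.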

### Usage: Grammar Exploration
-------------------------------

***CxG.print_constructions()***

The *print_constructions()* function prints and returns a list of constructions of the selected type, with their IDs, from the initialised model. It also writes the list to the file "temp.txt" in the 'out' directory.

	print_constructions(self, mode="lex")

Which takes the following parameters:

	mode (str, default "lex")	Type(s) of representations to be examined ("lex", "syn", "full", or "all")

For example, to print all constructions in the loaded model, use:

	CxG_wiki = C2xG(model = "WK", data_dir = "CxG_data") # initialise model, as above
	all_wiki_constructions = CxG_wiki.print_constructions(mode = "all")

***CxG.print_examples()***

The *print_examples()* function creates the file "temp.txt" in the 'out' directory. The file lists the constructions of the selected type from the initialised model (or from a given grammar), together with their IDs and examples drawn from the selected data.

	print_examples(self, grammar, input_file, n = 50, output = False, send_back = False)

Which takes the following parameters:

	grammar (str or CxG.grammar)	Type of grammar to examine ("lex", "syn", "full", "all")
						Alternatively, grammars can be specified with:
						C2xG.{type}_grammar.loc[:,"Chunk"].values
	input_file (str)		A filename or list of filenames to be parsed, sourced from 'in' directory
	n (int)				Limit examples per construction, 50 by default
	output (bool)			Print examples in console, False by default
	send_back (bool)		Return examples as variable, False by default

For example, to print and return 10 examples of each syntactic construction in the chosen data, use:

	syn_examples = CxG.print_examples(grammar = "syn", input_file = "my_sentences.txt", n = 10,
					output = True, send_back = True)

### Usage: Learning
----------------------------

***CxG.learn()***

The *learn()* function creates and returns new grammar models like those obtained using the *download_model()* function, using a given data input. This function returns three separate dataframes for lexical, syntactic, and full grammars.

_note:_ This function is likely to take some time, especially with more learning/forgetting rounds. 

	learn(self, input_data, npmi_threshold = 0.75, starting_index = 0, min_count = None, 
		max_vocab = None, cbow_range = False, sg_range = False, get_examples = True, 
		increments = 50000, learning_rounds = 20, forgetting_rounds = 40, cluster_only = False)

Which takes the following parameters:

	input_data (str or list of str) 	A filename or list of filenames to be parsed, sourced from 'in' directory
	npmi_threshold (float)		Normalised pointwise mutual information threshold value, 0.75 by default. 
						For use with 'gensim.Phrases', for more information see:
						https://radimrehurek.com/gensim/models/phrases.html
	starting_index (int)		Index in input to begin learning, if not the beginning, 0 by default
	min_count (int)			Minimum ngram token count to maintain. If none, derived from 'max_words' during initialisation
	max_vocab (int)			Maximum vocabulary size, no maximum by default
	cbow_range (int)		Maximum cbow clusters, 250 by default
	sg_range (int) 			Maximum skip-gram clusters, 2500 by default
	get_examples (bool)		If true, also run 'get_examples'. Use 'help(C2xG.get_examples)' for more.
	increments (int)		Defines both the number of words to discard and where to stop, 50000 by default
	learning_rounds (int) 		Number of learning rounds to build/refine vocabulary, 20 by default
	forgetting_rounds (int) 	Number of forgetting rounds to prune vocabulary, 40 by default
	cluster_only (bool) 		Only use clusters from embedding models, False by default
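
To give a feel for the scale of *npmi_threshold*, the sketch below computes normalised pointwise mutual information for one word pair from raw counts, using the standard formula ln(p(a,b) / (p(a) p(b))) / -ln(p(a,b)). This is a standalone illustration with invented counts. It does not call Gensim or c2xg, and the function name is an assumption.

```python
import math


def npmi(count_a, count_b, count_ab, total):
    """Normalised PMI in [-1, 1] for a word pair, from raw counts."""
    if count_ab == 0:
        return -1.0          # the pair never occurs together
    if count_ab == total:
        return 1.0           # the pair is the only thing that occurs
    p_a = count_a / total
    p_b = count_b / total
    p_ab = count_ab / total
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


# Invented counts: the two words co-occur in 30 of 10,000 positions.
score = npmi(count_a=50, count_b=40, count_ab=30, total=10000)
print(round(score, 2))   # 0.86
print(score >= 0.75)     # True
```

Scores near 1 indicate words that almost always occur together, and scores near 0 indicate independence, so a pair scoring 0.86 would pass the default threshold of 0.75.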

Each learning fold consists of three tasks: (i) estimating association values from background data, which requires a large amount of data (e.g., 20 files); (ii) extracting candidate constructions, which requires a moderate amount of data (e.g., 5 files); and (iii) evaluating potential grammars against a test set, which requires a small amount of data (e.g., 1 file or 10 million words).

The freq_threshold is used to control the number of potential constructions to consider. It can be set at 20. The turn limit controls how far the search process can go. It can be set at 10.

For example, to create a simple model with only two learning rounds, eight forgetting rounds, a cbow range of 50, and a skip-gram range of 500, while also getting a list of examples, use:

	lex_gram, syn_gram, full_gram = CxG.learn(input_data = "my_sentences.txt", get_examples = True,
							learning_rounds = 2, forgetting_rounds = 8, 
							cbow_range = 50, sg_range = 500)

***Example of using CxG.learn()***

Below is a simple example that shows the code necessary to learn a new grammar with an existing set of embeddings. In this example, the input corpus is the file "corpus.blogs.gz" and it is contained in DATA > IN. The embedding files are passed using the *cbow_file* and *sg_file* parameters. Their expected location is in DATA > OUT. These are the default locations; it is also possible to specify different locations when initializing the C2xG class.

    import c2xg

    cxg = c2xg.C2xG(data_dir = "DATA", 
                    cbow_file = "training_corpus.01.cbow.bin",
                    sg_file = "training_corpus.01.sg.bin",
                    )

    cxg.learn(input_data = "corpus.blogs.gz", max_vocab = 10000)

This example will first cluster the sg and cbow embeddings to form the categories needed to formulate slot constraints. It will then proceed to learning the grammar itself.

***CxG.learn_embeddings()***

The *learn_embeddings()* function creates new cbow and skip-gram embeddings using input data. The _learn()_ function will do this automatically by default, but this function generates them in isolation. 

For quality assurance, it is best when possible to learn and examine the embeddings directly, for instance using Gensim.

_note:_ Embeddings are stored in the class as 'self.cbow_model' and 'self.sg_model'.

	learn_embeddings(self, input_data, name="embeddings")

Which takes the following parameters:

	input_data (str or list of str)	A filename or list of filenames to be parsed, sourced from 'in' directory
	name (str) 			The nickname to use when saving models, 'embeddings' by default.

For example, to learn embeddings and save them with the nickname "new_embeddings", use:

	CxG.learn_embeddings(input_data = "my_sentences.txt", name = "new_embeddings")

Raw data

            {
    "_id": null,
    "home_page": "http://www.c2xg.io",
    "name": "c2xg",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "grammar induction,syntax,cxg,unsupervised learning,natural language processing,computational linguistics,construction grammar,cognitive linguistics,usage-based grammar",
    "author": "Jonathan Dunn",
    "author_email": "jedunn@illinois.edu",
    "download_url": "",
    "platform": null,
    "bugtrack_url": null,
    "license": "LGPL 3.0",
    "summary": "Construction Grammars for Natural Language Processing and Computational Linguistics",
    "version": "2.2",
    "project_urls": {
        "Homepage": "http://www.c2xg.io"
    },
    "split_keywords": [
        "grammar induction",
        "syntax",
        "cxg",
        "unsupervised learning",
        "natural language processing",
        "computational linguistics",
        "construction grammar",
        "cognitive linguistics",
        "usage-based grammar"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bfeca6158c333afd8d2d0b02e3eb82d94776bbc56b6e2a2848caa05d8350d09a",
                "md5": "3c1245d7129421c6015356d9ac11289b",
                "sha256": "12cdb608516e24ae71351c4edbf940b82ab30b8b1ed278ff7265f4164bf3cfd7"
            },
            "downloads": -1,
            "filename": "c2xg-2.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3c1245d7129421c6015356d9ac11289b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 58331,
            "upload_time": "2024-03-14T01:27:55",
            "upload_time_iso_8601": "2024-03-14T01:27:55.643383Z",
            "url": "https://files.pythonhosted.org/packages/bf/ec/a6158c333afd8d2d0b02e3eb82d94776bbc56b6e2a2848caa05d8350d09a/c2xg-2.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-14 01:27:55",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "c2xg"
}
