API reference

xtas.core

Core functionality. Contains the configuration and the (singleton) Celery “app” instance.

xtas.core.configure(config, import_error='raise', unknown_key='raise')

Configure xtas.

Parameters:

config : dict

Dict with keys CELERY, ELASTICSEARCH and EXTRA_MODULES will be used to configure the xtas Celery app.

config.CELERY will be passed to Celery’s config_from_object with the flag force=True.

ELASTICSEARCH should be a list of dicts with at least the key ‘host’. These are passed to the Elasticsearch constructor (from the official client) unchanged.

EXTRA_MODULES should be a list of module names to load.

If CELERY or ELASTICSEARCH is not supplied, the default configuration for that component is restored. Extra modules, however, are never unloaded.

import_error : string

Action to take when one of the EXTRA_MODULES cannot be imported. Either “log”, “raise” or “ignore”.

unknown_key : string

Action to take when a key in config other than the ones listed above is encountered (keys whose names start with an underscore are skipped). Either “log”, “raise” or “ignore”.
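
For illustration, a configure call might look like the following sketch; the broker URL and the extra module name are placeholders, not xtas defaults:

    from xtas.core import configure

    config = {
        "CELERY": {
            "BROKER_URL": "amqp://localhost",  # illustrative Celery setting
        },
        "ELASTICSEARCH": [{"host": "localhost", "port": 9200}],
        "EXTRA_MODULES": ["mypackage.custom_tasks"],  # hypothetical module
        "_comment": "keys starting with an underscore are skipped",
    }

    configure(config, import_error="log", unknown_key="raise")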

xtas.tasks.single

Single-document tasks.

These process one document per function call (in Python) or REST call (via the web server, /run or /run_es). Most single-document tasks take a document as their first argument. In the Python interface this may either be a string or the result from xtas.tasks.es.es_document, a reference to a document in an Elasticsearch store.
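
For example (a sketch: the index, type, id and field below are made up, and the .delay call assumes a running Celery worker and a reachable Elasticsearch store):

    from xtas.tasks.es import es_document
    from xtas.tasks.single import tokenize

    # Apply a task to a plain string, synchronously and in-process:
    tokens = tokenize("This is a test.")

    # Or to a reference to a field of a stored Elasticsearch document:
    doc = es_document("blog", "post", 1, "text")
    tokens = tokenize.delay(doc).get()  # executed by a worker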

(task) xtas.tasks.single.alpino(doc, output='raw')

Wrapper around the Alpino (dependency) parser for Dutch.

Expects an environment variable ALPINO_HOME to point at the Alpino installation dir.

The script uses the ‘dependencies’ end_hook to generate lemmata and the dependency structure.

Parameters:

output : string

If ‘raw’, returns the raw output from Alpino itself. If ‘saf’, returns a SAF dictionary.

References

Alpino homepage.

(task) xtas.tasks.single.corenlp(doc, output='raw')

Wrapper around the Stanford CoreNLP parser.

Expects $CORENLP_HOME to point to the CoreNLP installation dir.

If run with all annotators, it requires around 3 GB of memory, and it will keep the CoreNLP process in memory indefinitely.

Tested with CoreNLP 2014-01-04 (see http://nlp.stanford.edu/software/corenlp.shtml).

Parameters:

output : string

If ‘raw’, returns the raw output lines from CoreNLP. If ‘saf’, returns a SAF dictionary.

(task) xtas.tasks.single.corenlp_lemmatize(doc, output='raw')

Wrapper around the Stanford CoreNLP lemmatizer.

Expects $CORENLP_HOME to point to the CoreNLP installation dir.

Tested with CoreNLP 2014-01-04.

Parameters:

output : string

If ‘raw’, returns the raw output lines from CoreNLP. If ‘saf’, returns a SAF dictionary.

(task) xtas.tasks.single.dbpedia_spotlight(doc, lang='en', conf=0.5, supp=0, api_url=None)

Run text through a DBpedia Spotlight instance.

Calls the DBpedia Spotlight instance to perform entity linking and returns the names/links it has found.

See http://spotlight.dbpedia.org/ for details. This task uses pyspotlight, a Python client for DBpedia Spotlight: https://github.com/aolieman/pyspotlight
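
A usage sketch; assumes a Spotlight instance is reachable, and the api_url shown is a placeholder:

    from xtas.tasks.single import dbpedia_spotlight

    entities = dbpedia_spotlight("Berlin is the capital of Germany.",
                                 lang="en", conf=0.5, supp=0,
                                 api_url="http://localhost:2222/rest")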

(task) xtas.tasks.single.frog(doc, output='raw')

Wrapper around the Frog lemmatizer/POS tagger/NER/dependency parser.

Expects Frog to be running in server mode, listening on localhost:${XTAS_FROG_PORT}, or port 9887 if the environment variable XTAS_FROG_PORT is not set. Frog is not started for you.

Currently, this module has only been tested with all Frog modules active except for the NER and the parser.

The following line starts Frog in the correct way:

frog -S ${XTAS_FROG_PORT:-9887}

Parameters:

output : string

If ‘raw’, returns the raw output lines from Frog itself. If ‘tokens’, returns dictionaries for the tokens. If ‘saf’, returns a SAF dictionary.

See also

nlner_conll
simple NER tagger for Dutch.

References

Frog homepage

(task) xtas.tasks.single.guess_language(doc, output='best')

Guess the language of a document.

This function applies a statistical method to determine the language of a document. Depending on the output argument, it may either return a single language code, or a ranking of languages that a document may be written in, sorted by probability.

Uses the langid library.

Parameters:

doc : document

output : string

Either “best” to get a pair (code, prob) giving the two-letter code of the most probable language and its probability, or “rank” for a list of such pairs for all languages in the model.
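
A usage sketch (the probability in the comment is illustrative):

    from xtas.tasks.single import guess_language

    code, prob = guess_language("Dit is een Nederlandse zin.")
    # e.g. ('nl', 0.99)

    ranking = guess_language("Dit is een Nederlandse zin.", output="rank")
    # (code, prob) pairs for all languages in the model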

(task) xtas.tasks.single.morphy(doc)

Lemmatize tokens using morphy, WordNet’s lemmatizer.

Finds the morphological root of all words in doc, which is assumed to be written in English.

Returns:

lemmas : list

List of lemmas.

See also

stem_snowball
simpler approach to lemmatization (stemming).
(task) xtas.tasks.single.movie_review_polarity(doc)

Movie review polarity classifier.

Determines whether the film review doc is positive or negative. Might be applicable to other types of document as well, but uses a statistical model trained on a corpus of user reviews of movies, all in English.

Returns:

p : float

The probability that the movie review doc is positive.

See also

movie_review_emotions
per-sentence fine-grained sentiment tagger
sentiwords_tag
more generic sentiment expression tagger
(task) xtas.tasks.single.pos_tag(tokens, model='nltk')

Perform part-of-speech (POS) tagging for English.

Currently only does English using the default model in NLTK.

Expects a list of tokens.
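
A sketch of typical use, chaining tokenize into pos_tag; the tags come from NLTK's default model:

    from xtas.tasks.single import pos_tag, tokenize

    tokens = tokenize("The quick brown fox jumps over the lazy dog.")
    tagged = pos_tag(tokens)  # (token, tag) pairs, e.g. ('fox', 'NN')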

(task) xtas.tasks.single.semafor(saf)

Wrapper around the Semafor semantic parser.

Expects Semafor to be running in server mode, listening on ${SEMAFOR_HOST}:${SEMAFOR_PORT} (default localhost:9888). It also expects $CORENLP_HOME to point to the CoreNLP installation dir.

Input is expected to be a ‘SAF’ dictionary with trees and tokens. Output is a SAF dictionary with a frames attribute added.
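
A usage sketch, feeding the SAF output of corenlp into semafor; assumes CoreNLP and a Semafor server are set up as described above:

    from xtas.tasks.single import corenlp, semafor

    saf = corenlp("John gave Mary a book.", output="saf")
    saf = semafor(saf)  # same SAF dictionary, with frames added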

(task) xtas.tasks.single.semanticize(doc, lang='en')

Run text through the UvA semanticizer.

Calls the UvA semanticizer webservice to perform entity linking and returns the names/links it has found.

See http://semanticize.uva.nl/doc/ for details.

(task) xtas.tasks.single.sentiwords_tag(doc, output='bag')

Tag doc with SentiWords polarity priors.

Performs left-to-right, longest-match annotation of token spans with polarities from SentiWords.

Uses no part-of-speech information; when a span has multiple possible taggings in SentiWords, the mean of their polarities is used.

Parameters:

doc : document or list of strings

output : string, optional

Output format. Either “bag” for a histogram (dict) of annotated token span frequencies, or “tokens” for a mixed list of strings and (list of strings, polarity) pairs.

See also

movie_review_emotions
per-sentence fine-grained sentiment tagger
movie_review_polarity
figure out if a movie review is positive or negative

References

M. Guerini, L. Gatti and M. Turchi (2013). “Sentiment analysis: How to derive prior polarities from SentiWordNet”. Proc. EMNLP, pp. 1259-1269.

(task) xtas.tasks.single.stanford_ner_tag(doc, output='tokens')

Named entity recognizer using Stanford NER.

English-language name detection and classification.

Currently only supports the model ‘english.all.3class.distsim.crf.ser.gz’.

Parameters:

doc : document

Either a single string or a handle on a document in the ES store. Tokenization and sentence splitting will be done by Stanford NER using its own rules.

output : string, optional

Output format. “tokens” gives a list of (token, nerclass) pairs, similar to the IO format but without the “I-”. “names” returns a list of (name, class) pairs; since Stanford NER does not distinguish between start and continuation of name spans, the reconstruction of full names is heuristic.

Returns:

tagged : list of list of pair of string

For each sentence, a list of (word, tag) pairs.

See also

nlner_conll
NER tagger for Dutch.
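
A usage sketch (the entity classes in the comment are illustrative):

    from xtas.tasks.single import stanford_ner_tag

    names = stanford_ner_tag("Olivia visited Amsterdam.", output="names")
    # e.g. [('Olivia', 'PERSON'), ('Amsterdam', 'LOCATION')]
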
(task) xtas.tasks.single.stem_snowball(doc, language)

Stem words in doc using the Snowball stemmer.

Set the parameter language to a language code such as “de”, “en”, “nl”, or the special string “porter” to get Porter’s classic stemming algorithm for English.

See also

morphy
smarter approach to stemming (lemmatization), but only for English.
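
A usage sketch:

    from xtas.tasks.single import stem_snowball

    stems = stem_snowball("katten en honden liepen", "nl")  # Dutch stemmer
    stems = stem_snowball("cats were running", "porter")    # Porter stemmer
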
(task) xtas.tasks.single.tokenize(doc)

Tokenize text.

Uses the NLTK function word_tokenize.

(task) xtas.tasks.single.untokenize(tokens)

Undo tokenization.

Simply concatenates the given tokens with spaces in between. Useful after tokenization and filtering.

Returns:

doc : string
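
A sketch of the tokenize-filter-untokenize pattern mentioned above:

    from xtas.tasks.single import tokenize, untokenize

    tokens = tokenize("Hello, world!")
    words = [t for t in tokens if t.isalpha()]  # drop punctuation tokens
    untokenize(words)  # 'Hello world'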

xtas.tasks.cluster

Clustering and topic modelling tasks.

These tasks process batches of documents, denoted as lists of strings.

(task) xtas.tasks.cluster.big_kmeans(docs, k, batch_size=1000, n_features=1048576, single_pass=True)

k-means for very large sets of documents.

See kmeans for documentation. This function differs in that it does not compute tf-idf or LSA, and that it fetches the documents in a streaming fashion, so they need not all be held in memory. It does not do random restarts.

If the option single_pass is set to False, the documents are visited twice: once to fit a k-means model, once to determine their label in this model.

(task) xtas.tasks.cluster.kmeans(docs, k, lsa=None)

Run k-means clustering on a set of documents.

Uses scikit-learn to tokenize documents, compute tf-idf weights, perform (optional) LSA transformation, and cluster.

Parameters:

docs : list of strings

Untokenized documents.

k : integer

Number of clusters.

lsa : integer, optional

Whether to perform latent semantic analysis before k-means, and if so, with how many components/topics.

Returns:

clusters : sequence of sequence of documents

The input documents, grouped by cluster. The order of clusters and the order of documents within clusters is unspecified.
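
A usage sketch on toy documents:

    from xtas.tasks.cluster import kmeans

    docs = ["cats purr", "dogs bark", "cats and dogs", "stocks fell sharply"]
    clusters = kmeans(docs, k=2)         # tf-idf + k-means
    clusters = kmeans(docs, k=2, lsa=2)  # with a 2-component LSA step first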

(task) xtas.tasks.cluster.lda(docs, k)

Latent Dirichlet allocation topic model.

Uses scikit-learn’s TfidfVectorizer and LatentDirichletAllocation.

Parameters:

k : integer

Number of topics.

(task) xtas.tasks.cluster.lsa(docs, k, random_state=None)

Latent semantic analysis.

Parameters:

docs : list of strings

Untokenized documents.

k : integer

Number of topics.

random_state : integer, optional

Random number seed, for reproducibility of results.

Returns:

model : list of list of (string, float)

The k components of the LSA model, represented as lists of (term, weight) pairs.
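
A usage sketch on toy documents, with a fixed seed for reproducibility:

    from xtas.tasks.cluster import lsa

    docs = ["cats purr", "dogs bark", "cats chase dogs", "stocks fell"]
    model = lsa(docs, k=2, random_state=42)
    for component in model:
        print(component[:3])  # a few (term, weight) pairs per component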

(task) xtas.tasks.cluster.parsimonious_wordcloud(docs, w=0.5, k=10)

Fit a parsimonious language model to terms in docs.