Core functionality. Contains the configuration and the (singleton) Celery “app” instance.

xtas.core.configure(config, import_error='raise', unknown_key='raise')
    Configure xtas.

    Parameters:
        config : dict
        import_error : string
        unknown_key : string
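The import_error and unknown_key arguments name error-handling policies. A minimal sketch of how an unknown_key policy might behave; the known_keys values and the non-'raise' behaviour here are assumptions for illustration, not xtas's actual implementation:

```python
import warnings

def check_config(config, known_keys, unknown_key='raise'):
    """Sketch of an unknown_key policy: 'raise' rejects unexpected
    top-level keys, anything else only warns. known_keys and the
    non-'raise' behaviour are assumptions, not xtas's actual logic."""
    for key in config:
        if key not in known_keys:
            if unknown_key == 'raise':
                raise ValueError('unknown config key: %r' % key)
            warnings.warn('unknown config key: %r' % key)
    return config
```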
Single-document tasks.

These process one document per function call (in Python) or REST call (via the web server, /run or /run_es). Most single-document tasks take a document as their first argument. In the Python interface this may either be a string or the result from xtas.tasks.es.es_document, a reference to a document in an Elasticsearch store.
xtas.tasks.single.alpino(doc, output='raw')
    Wrapper around the Alpino (dependency) parser for Dutch.

    Expects the environment variable ALPINO_HOME to point at the Alpino
    installation directory. The script uses the ‘dependencies’ end_hook to
    generate lemmata and the dependency structure.

    Parameters:
        output : string
xtas.tasks.single.corenlp(doc, output='raw')
    Wrapper around the Stanford CoreNLP parser.

    Expects $CORENLP_HOME to point to the CoreNLP installation directory.
    If run with all annotators, it requires around 3 GB of memory, and it
    will keep the process in memory indefinitely.

    Tested with CoreNLP 2014-01-04 (see http://nlp.stanford.edu/software/corenlp.shtml).

    Parameters:
        output : string
xtas.tasks.single.corenlp_lemmatize(doc, output='raw')
    Wrapper around the Stanford CoreNLP lemmatizer.

    Expects $CORENLP_HOME to point to the CoreNLP installation directory.
    Tested with CoreNLP 2014-01-04.

    Parameters:
        output : string
xtas.tasks.single.dbpedia_spotlight(doc, lang='en', conf=0.5, supp=0, api_url=None)
    Run text through a DBpedia Spotlight instance.

    Calls the DBpedia Spotlight instance to perform entity linking and returns
    the names/links it has found. See http://spotlight.dbpedia.org/ for
    details. This task uses pyspotlight, a Python client for DBpedia Spotlight:
    https://github.com/aolieman/pyspotlight
xtas.tasks.single.frog(doc, output='raw')
    Wrapper around the Frog lemmatizer/POS tagger/NER/dependency parser.

    Expects Frog to be running in server mode, listening on
    localhost:${XTAS_FROG_PORT}, or port 9887 if the environment variable
    XTAS_FROG_PORT is not set. It is not started for you.

    Currently, the module is only tested with all Frog modules active except
    for the NER and the parser.

    The following line starts Frog in the correct way:

        frog -S ${XTAS_FROG_PORT:-9887}

    Parameters:
        output : string
    See also: nlner_conll
xtas.tasks.single.guess_language(doc, output='best')
    Guess the language of a document.

    This function applies a statistical method to determine the language of a
    document. Depending on the output argument, it may either return a single
    language code, or a ranking of languages that the document may be written
    in, sorted by probability. Uses the langid library.

    Parameters:
        doc : document
        output : string
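A sketch of what the output argument controls, assuming the alternative to 'best' is the full ranking; the option name 'rank' and the probabilities below are illustrative, not taken from xtas:

```python
def shape_output(ranking, output='best'):
    """Return either the single most probable language code or the whole
    ranking, mirroring guess_language's output switch. The (code,
    probability) pairs come from the caller; values are illustrative."""
    ranking = sorted(ranking, key=lambda pair: pair[1], reverse=True)
    if output == 'best':
        return ranking[0][0]   # just the top language code
    return ranking             # full ranking, most probable first
```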
xtas.tasks.single.morphy(doc)
    Lemmatize tokens using morphy, WordNet’s lemmatizer.

    Finds the morphological root of all words in doc, which is assumed to be
    written in English.

    Returns:
        lemmas : list

    See also: stem_snowball
xtas.tasks.single.movie_review_polarity(doc)
    Movie review polarity classifier.

    Determines whether the film review doc is positive or negative. Might be
    applicable to other types of document as well, but uses a statistical
    model trained on a corpus of user reviews of movies, all in English.

    Returns:
        p : float

    See also: movie_review_emotions, sentiwords_tag
xtas.tasks.single.pos_tag(tokens, model='nltk')
    Perform part-of-speech (POS) tagging.

    Expects a list of tokens. Currently only supports English, using the
    default model in NLTK.
xtas.tasks.single.semafor(saf)
    Wrapper around the Semafor semantic parser.

    Expects Semafor running in server mode, listening on
    ${SEMAFOR_HOST}:${SEMAFOR_PORT} (defaults to localhost:9888). It also
    expects $CORENLP_HOME to point to the CoreNLP installation directory.

    Input is expected to be a ‘SAF’ dictionary with trees and tokens. Output
    is a SAF dictionary with a frames attribute added.
xtas.tasks.single.semanticize(doc, lang='en')
    Run text through the UvA semanticizer.

    Calls the UvA semanticizer webservice to perform entity linking and
    returns the names/links it has found. See http://semanticize.uva.nl/doc/
    for details.

xtas.tasks.single.sentiwords_tag(doc, output='bag')
    Tag doc with SentiWords polarity priors.

    Performs left-to-right, longest-match annotation of token spans with
    polarities from SentiWords. Uses no part-of-speech information; when a
    span has multiple possible taggings in SentiWords, the mean is returned.

    Parameters:
        doc : document or list of strings
        output : string, optional

    References:
        M. Guerini, L. Gatti and M. Turchi (2013). “Sentiment analysis: How to
        derive prior polarities from SentiWordNet”. Proc. EMNLP, pp. 1259-1269.
    See also: movie_review_emotions, movie_review_polarity
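The left-to-right, longest-match strategy can be sketched with a toy lexicon standing in for SentiWords; the entries and polarity values below are made up for illustration:

```python
def tag_spans(tokens, lexicon):
    """Left-to-right, longest-match tagging of token spans against a
    prior-polarity lexicon: at each position, the longest span present
    in the lexicon wins. The lexicon is a toy stand-in for SentiWords."""
    longest = max(len(key.split()) for key in lexicon)
    out, i = [], 0
    while i < len(tokens):
        # try the longest span first, then shrink
        for n in range(min(longest, len(tokens) - i), 0, -1):
            span = ' '.join(tokens[i:i + n])
            if span in lexicon:
                out.append((span, lexicon[span]))
                i += n
                break
        else:
            out.append((tokens[i], None))   # no lexicon entry covers it
            i += 1
    return out
```

Because matching is greedy from the left, "not bad" is tagged as one span with its own prior rather than as "not" followed by "bad".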
xtas.tasks.single.stanford_ner_tag(doc, output='tokens')
    Named-entity recognizer using Stanford NER.

    English-language name detection and classification. Currently only
    supports the model ‘english.all.3class.distsim.crf.ser.gz’.

    Parameters:
        doc : document
        output : string, optional

    Returns:
        tagged : list of list of pair of string

    See also: nlner_conll
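Token-level output of the form (token, tag) can be collapsed into entity-level names by grouping consecutive tokens that share the same non-'O' tag. A sketch of that grouping; the tag names are illustrative, and this is not xtas's own post-processing code:

```python
def group_names(tagged):
    """Collapse runs of tokens sharing a non-'O' entity tag into
    (name, tag) pairs. Tags like 'PERSON'/'LOCATION' are illustrative."""
    names, run, cur = [], [], 'O'
    for tok, tag in tagged + [('', 'O')]:   # sentinel flushes the last run
        if tag == cur and tag != 'O':
            run.append(tok)
        else:
            if run:
                names.append((' '.join(run), cur))
            run, cur = ([tok], tag) if tag != 'O' else ([], 'O')
    return names
```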
xtas.tasks.single.stem_snowball(doc, language)
    Stem words in doc using the Snowball stemmer.

    Set the parameter language to a language code such as “de”, “en”, “nl”,
    or the special string “porter” to get Porter’s classic stemming algorithm
    for English.

    See also: morphy
xtas.tasks.single.tokenize(doc)
    Tokenize text.

    Uses the NLTK function word_tokenize.
xtas.tasks.single.untokenize(tokens)
    Undo tokenization.

    Simply concatenates the given tokens with spaces in between. Useful after
    tokenization and filtering.

    Returns:
        doc : string
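The behaviour is simple enough to state as code; a typical use is rebuilding a string after tokenizing and filtering (the stopword filter below is just an example):

```python
def untokenize(tokens):
    # Join the tokens with single spaces, as described above.
    return ' '.join(tokens)

# Typical use: drop unwanted tokens after tokenization, then rebuild a string.
tokens = ['the', 'quick', 'brown', 'fox']
filtered = [t for t in tokens if t != 'the']
doc = untokenize(filtered)   # 'quick brown fox'
```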
Clustering and topic modelling tasks.

These tasks process batches of documents, denoted as lists of strings.

xtas.tasks.cluster.big_kmeans(docs, k, batch_size=1000, n_features=1048576, single_pass=True)
    k-means for very large sets of documents.

    See kmeans for documentation. Differs from that function in that it does
    not compute tf-idf or LSA, and fetches the documents in a streaming
    fashion, so they don’t need to be held in memory. It does not do random
    restarts.

    If single_pass is set to False, the documents are visited twice: once to
    fit a k-means model, once to determine their label in this model.
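The large n_features default (1048576 = 2**20) points at the hashing trick: tokens are counted straight into a fixed-width vector, so no vocabulary needs to be held in memory. A stdlib-only sketch of that idea, not the scikit-learn implementation xtas actually uses:

```python
def hashed_counts(tokens, n_features=16):
    """Count tokens into a fixed-width vector via the hashing trick.
    Distinct tokens may collide in a bucket, but collisions are rare
    when n_features is large (hence the 2**20 default above)."""
    vec = [0] * n_features
    for tok in tokens:
        vec[hash(tok) % n_features] += 1
    return vec
```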
xtas.tasks.cluster.kmeans(docs, k, lsa=None)
    Run k-means clustering on a set of documents.

    Uses scikit-learn to tokenize documents, compute tf-idf weights, perform
    (optional) LSA transformation, and cluster.

    Parameters:
        docs : list of strings
        k : integer
        lsa : integer, optional

    Returns:
        clusters : sequence of sequence of documents
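The clustering step itself is ordinary Lloyd's iteration. A self-contained toy version on numeric vectors, for intuition only; xtas delegates the real work, tf-idf weighting included, to scikit-learn:

```python
import random

def kmeans_lloyd(points, k, n_iter=20, seed=0):
    """Plain Lloyd's iteration: alternate between assigning points to
    their nearest center and moving centers to their cluster means.
    A toy stand-in for the scikit-learn clustering xtas uses."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        # assignment step: nearest center by squared Euclidean distance
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # update step: move each non-empty center to its cluster mean
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = [sum(xs) / len(cl) for xs in zip(*cl)]
    # final labelling against the converged centers
    labels = []
    for p in points:
        d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        labels.append(d.index(min(d)))
    return labels
```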
xtas.tasks.cluster.lda(docs, k)
    Latent Dirichlet allocation topic model.

    Uses scikit-learn’s TfidfVectorizer and LatentDirichletAllocation.

    Parameters:
        docs : list of strings
        k : integer
xtas.tasks.cluster.lsa(docs, k, random_state=None)
    Latent semantic analysis.

    Parameters:
        docs : list of strings
        k : integer
        random_state : integer, optional

    Returns:
        model : list of list of (string, float)
xtas.tasks.cluster.parsimonious_wordcloud(docs, w=0.5, k=10)
    Fit a parsimonious language model to terms in docs.