% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/collocationAnalysis.R
\name{collocationAnalysis,KorAPConnection-method}
\alias{collocationAnalysis,KorAPConnection-method}
\alias{collocationAnalysis}
\title{Collocation analysis}
\usage{
\S4method{collocationAnalysis}{KorAPConnection}(
  kco,
  node,
  vc = "",
  lemmatizeNodeQuery = FALSE,
  minOccur = 5,
  leftContextSize = 5,
  rightContextSize = 5,
  topCollocatesLimit = 200,
  searchHitsSampleLimit = 20000,
  ignoreCollocateCase = FALSE,
  withinSpan = ifelse(exactFrequencies, "base/s=s", ""),
  exactFrequencies = TRUE,
  stopwords = RKorAPClient::synsemanticStopwords(),
  seed = 7,
  expand = length(vc) != length(node),
  maxRecurse = 0,
  addExamples = FALSE,
  thresholdScore = "logDice",
  threshold = 2,
  localStopwords = c(),
  collocateFilterRegex = "^[:alnum:]+-?[:alnum:]*$",
  ...
)
}
\arguments{
\item{kco}{\code{\link[=KorAPConnection]{KorAPConnection()}} object (obtained e.g. from \code{new("KorAPConnection")})}

\item{node}{target word}

\item{vc}{string describing the virtual corpus in which the query should be performed. An empty string (default) means the whole corpus, as far as it is accessible under the applicable licenses.}

\item{lemmatizeNodeQuery}{if TRUE, node query will be lemmatized, i.e. \verb{x -> [tt/l=x]}}

\item{minOccur}{minimum absolute number of observed co-occurrences to consider a collocate candidate}

\item{leftContextSize}{size of the left context window}

\item{rightContextSize}{size of the right context window}

\item{topCollocatesLimit}{limit analysis to the n most frequent collocates in the search hits sample}

\item{searchHitsSampleLimit}{limit the size of the search hits sample}

\item{ignoreCollocateCase}{logical, set to TRUE if collocate case should be ignored}

\item{withinSpan}{KorAP span specification for collocations to be searched within}

\item{exactFrequencies}{if FALSE, extrapolate observed co-occurrence frequencies from frequencies in search hits sample, otherwise retrieve exact co-occurrence frequencies}

\item{stopwords}{vector of stopwords not to be considered as collocates}

\item{seed}{seed for the random order in which result pages are collected}

\item{expand}{if TRUE, \code{node} and \code{vc} parameters are expanded to all of their combinations}

\item{maxRecurse}{apply collocation analysis recursively \code{maxRecurse} times}

\item{addExamples}{If TRUE, examples for instances of collocations will be added in a column \code{example}. This makes a difference in particular if \code{node} is given as a lemma query.}

\item{thresholdScore}{association score function (see \code{\link{association-score-functions}}) to use for computing the threshold that is applied for recursive collocation analysis calls}

\item{threshold}{minimum value of \code{thresholdScore} function call to apply collocation analysis recursively}

\item{localStopwords}{vector of stopwords that will not be considered as collocates in the current function call, but that will not be passed to recursive calls}

\item{collocateFilterRegex}{allow only collocates matching the regular expression}

\item{...}{further arguments passed to \code{\link[=collocationScoreQuery]{collocationScoreQuery()}}}
}
\value{
Tibble with top collocates, association scores, corresponding URLs for web user interface queries, etc.
}
\description{
\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#experimental}{\figure{lifecycle-experimental.svg}{options: alt='[Experimental]'}}}{\strong{[Experimental]}}

Performs a collocation analysis for the given node (or query)
in the given virtual corpus.
}
\details{
The collocation analysis is currently implemented on the client side, as some of the
functionality is not yet provided by the KorAP backend. Mainly for this reason
it is very slow (from several minutes up to hours), but in return it is very flexible.
You can, for example, perform the analysis in arbitrary virtual corpora, use complex node queries,
and look for expression-internal collocates using the focus function (see examples and demo).

To increase speed at the cost of accuracy and possible false negatives,
you can decrease \code{searchHitsSampleLimit} and/or \code{topCollocatesLimit} and/or set \code{exactFrequencies} to FALSE.

Note that the analysis currently does not use the tokenization provided by the backend, i.e. by the corpus itself,
but an ad hoc one. This can also lead to false negatives and to frequencies that differ from the corresponding ones
obtained via the web user interface.
}
\examples{
\dontrun{

# Find top collocates of "Packung" inside and outside the sports domain.
new("KorAPConnection", verbose = TRUE) \%>\%
  collocationAnalysis("Packung", vc=c("textClass=sport", "textClass!=sport"),
                      leftContextSize=1, rightContextSize=1, topCollocatesLimit=20) \%>\%
  dplyr::filter(logDice >= 5)
}

\dontrun{

# Identify the most prominent light verb construction with "in ... setzen".
# Note that, currently, using the focus function rules out exactFrequencies.
new("KorAPConnection", verbose = TRUE) \%>\%
  collocationAnalysis("focus(in [tt/p=NN] {[tt/l=setzen]})",
                      leftContextSize=1, rightContextSize=0, exactFrequencies=FALSE, topCollocatesLimit=20)
}

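\dontrun{

# Hedged sketch of the speed/accuracy trade-off described in the details section:
# a smaller search hits sample, fewer collocate candidates and extrapolated
# frequencies should shorten the run time, at the risk of false negatives.
# The concrete limits below are illustrative assumptions, not recommended values.
new("KorAPConnection", verbose = TRUE) \%>\%
  collocationAnalysis("Packung", searchHitsSampleLimit=1000,
                      topCollocatesLimit=20, exactFrequencies=FALSE)
}
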
}
\seealso{
Other collocation analysis functions:
\code{\link{association-score-functions}},
\code{\link{collocationScoreQuery,KorAPConnection-method}},
\code{\link{synsemanticStopwords}()}
}
\concept{collocation analysis functions}