Update corpus mapping rules
Change-Id: I445798472f4494ab29db796fa19ad2b09339b1ac
diff --git a/.gitignore b/.gitignore
index 6268128..0027fa7 100644
--- a/.gitignore
+++ b/.gitignore
@@ -17,5 +17,6 @@
*.tar
!/cmd/termmapper/
!/cmd/koralmapper/
+/cmd/koralmapper/wiki2mapping
*-plan.md
overview.md
\ No newline at end of file
diff --git a/MAPPING.md b/MAPPING.md
new file mode 100644
index 0000000..3815431
--- /dev/null
+++ b/MAPPING.md
@@ -0,0 +1,194 @@
+# Mapping File Reference
+
+This document describes the syntax and guidelines for writing mapping files for Koral-Mapper. For general project information, installation, and API documentation, see [README.md](README.md).
+
+## Mapping File Format
+
+A mapping file defines a single mapping list with an ID, optional foundry/layer defaults, and a list of mapping rules:
+
+```yaml
+id: mapping-list-id
+foundryA: source-foundry
+layerA: source-layer
+foundryB: target-foundry
+layerB: target-layer
+mappings:
+ - "[pattern1] <> [replacement1]"
+ - "[pattern2] <> [replacement2]"
+```
+
+Mapping files can also be embedded inside a main configuration file under the `lists:` key (see [README.md](README.md) for the configuration file format).
+
+Koral-Mapper supports two mapping types: **annotation** (the default) and **corpus**.
+
+## Annotation Mapping Rules (type: annotation)
+
+Annotation mapping rules rewrite `koral:token` / `koral:term` / `koral:termGroup` structures in query JSON and annotation spans in response snippets.
+
+Each rule consists of two patterns separated by `<>`. The patterns can be:
+- Simple terms: `[key]`, `[layer=key]`, `[foundry/*=key]`, `[foundry/layer=key]`, or `[foundry/layer=key:value]`
+- Complex terms with AND/OR relations: `[term1 & term2]`, `[term1 | term2]`, or `[term1 | (term2 & term3)]`
+
+Example mapping file:
+
+```yaml
+id: stts-upos
+desc: Mapping between STTS and Universal Dependencies part-of-speech tags
+foundryA: opennlp
+layerA: p
+foundryB: upos
+layerB: p
+mappings:
+ - "[ADJA] <> [ADJ]"
+ - "[ADJD] <> [ADJ & Variant=Short]"
+ - "[ART] <> [DET & PronType=Art]"
+ - "[PIAT] <> [DET & (PronType=Ind | PronType=Neg | PronType=Tot)]"
+```
+
+### Foundry and Layer Precedence
+
+Koral-Mapper follows a strict precedence hierarchy when determining which foundry and layer values to use during mapping transformations:
+
+1. **Mapping rule foundry/layer** (highest priority)
+ - Explicit foundry/layer specifications in mapping rule patterns
+ - Example: `[opennlp/p=DT]` has explicit foundry "opennlp" and layer "p"
+
+2. **Passed overwrite foundry/layer** (medium priority)
+ - Values provided via query parameters (`foundryA`, `foundryB`, `layerA`, `layerB`)
+ - Applied only when mapping rules don't have explicit foundry/layer
+
+3. **Mapping list foundry/layer** (lowest priority)
+ - Default values from the mapping list configuration
+ - Used as fallback when neither mapping rules nor query parameters specify values
+
+## Corpus Mapping Rules (type: corpus)
+
+Corpus mapping rules use `key=value <> key=value` syntax for rewriting `koral:doc` / `koral:docGroup` structures in the `corpus`/`collection` section of a KoralQuery request, and enriching `fields` arrays in responses.
+
+### Rule Syntax
+
+#### Simple fields
+
+```yaml
+mappings:
+ - "textClass=novel <> genre=fiction"
+```
+
+The left side is "side A" and the right side is "side B". With `dir=atob`, the query matcher rewrites A-side matches to B-side replacements. With `dir=btoa`, the direction is reversed.
+
+#### Match types and value types
+
+Rules can specify match operators and value types:
+
+```yaml
+mappings:
+ - "pubDate=2020:geq <> yearFrom=2020:geq" # match type (eq, ne, geq, leq, contains, excludes)
+ - "pubDate=2020-01#date <> year=2020#string" # value type (string, regex, date)
+ - "textClass=wissenschaft.*#regex <> genre=science" # regex matching
+```
+
+When a rule specifies a match type (e.g. `:geq`), it only matches nodes with that exact match type. When no match type is specified, the rule matches any match type and preserves the original.
+
+#### Group rules (AND / OR)
+
+Rules can use AND (`&`) and OR (`|`) groups on either side:
+
+```yaml
+mappings:
+ # Single field → AND group
+ - "textClass=novel <> (genre=fiction & type=book)"
+ # AND group on side B → single field (with dir=btoa; matches AND docGroups via subset matching)
+ - "genre=fiction <> (textClass=kultur & textClass=musik)"
+ # OR group → single field (matches individual docs or OR docGroups)
+ - "(genre=fiction | genre=novel) <> textClass=belletristik"
+ # Complex: OR-of-AND on B-side
+ - "Entertainment <> ((kultur & musik) | (kultur & film))"
+```
+
+Key points for groups:
+- **AND patterns** match any AND group containing at least the pattern's operands (subset). Extra operands are preserved.
+- **OR patterns** match a single leaf if any operand matches, or an OR group structurally (exact operand count).
+- Groups on **both sides** are supported.
+
+#### Bare values with `fieldA` / `fieldB`
+
+When `fieldA` / `fieldB` are set in the mapping list header, values without a `key=` prefix are shorthand. The field name is filled in from the header:
+
+```yaml
+id: satek-wiki-dereko
+type: corpus
+fieldA: wikiCat
+fieldB: textClass
+mappings:
+ # Equivalent to "wikiCat=Entertainment <> textClass=kultur"
+ - "Entertainment <> kultur"
+ # Groups work too
+ - "Entertainment <> (kultur & musik)"
+```
+
+### Matching Semantics
+
+#### Query rewriting — iterative rule application
+
+Corpus rules are applied **iteratively**: each rule is applied to the **entire tree** in order, and subsequent rules see the **transformed result** of all previous rules. This means multiple rules can transform successive AST states, just like the annotation matcher.
+
+For each rule, the matcher tries matching at the current node first. If no match is found and the node is a `koral:docGroup` / `koral:fieldGroup`, the rule recurses into operands.
+
+#### OR pattern matching
+
+OR patterns like `(a | b)` match in two ways:
+
+- **Leaf nodes** (`koral:doc` / `koral:field`): An OR pattern matches if **any operand** matches the leaf. For example, the pattern `(Entertainment | Culture)` matches a single `koral:doc` with value `Entertainment`.
+- **Group nodes** (`koral:docGroup` / `koral:fieldGroup`): Structural matching — the node must be an OR group with **exactly** the same operands (commutative, exact count).
+
+#### AND pattern matching (subset)
+
+AND patterns like `(a & b)` use **subset matching**: the node must be an AND `koral:docGroup` / `koral:fieldGroup` containing **at least** all pattern operands. Extra operands beyond the pattern are preserved alongside the replacement.
+
+For example, with the rule `genre=fiction <> (textClass=kultur & textClass=musik)`, `dir=btoa`, and the input `AND(textClass=kultur, textClass=musik, pubDate=2020)`, the AND pattern matches two of the three operands (subset match). The result is `AND(genre=fiction, pubDate=2020)`: the replacement plus the preserved extra operand.
+
+If all operands match (no extras), the group is replaced entirely by the replacement node.
+
+#### Response enrichment
+
+For response field enrichment, the matching rules work as follows:
+
+- **Pattern matching**: Field patterns match directly. OR group patterns match a single response field if **any operand** matches. AND group patterns **cannot** match a single field and are skipped.
+- **Replacement collection**: AND group replacements are **flattened** — all operands become individual `koral:field` entries. OR group replacements are **skipped** because response fields are flat key/value entries and OR semantics (one-of) cannot be represented.
+
+Examples:
+- `(a | b) <> (c & d)` — when field `a` is in the response, both `c` and `d` are added.
+- `(a | b) <> (c | d)` — when field `a` is in the response, nothing is added (OR replacement skipped).
+- `a <> (c & d)` — when field `a` is in the response, both `c` and `d` are added.
+- `a <> c` — when field `a` is in the response, `c` is added.
+
+(Supported `@type` aliases: `koral:field` for `koral:doc`, `koral:fieldGroup` for `koral:docGroup`.)
+
+### Rule Ordering Strategy
+
+Rules should be ordered from **most specific to most general** (by total leaf count across both sides, descending). Because rules are applied iteratively, more specific rules should appear first to transform the AST before more general rules get a chance to match. Generated mapping files typically contain complementary rule types such as:
+
+1. **Aggregated rules** with OR-of-AND groups — match exact complex structures
+2. **Individual group rules** with AND patterns — match individual `koral:docGroup` nodes (subset matching)
+
+### Iterative Application and Rule Chaining
+
+Because rules are applied iteratively (each rule sees the result of previous rules), you can chain transformations:
+
+```yaml
+mappings:
+ - "textClass=novel <> genre=fiction"
+ - "genre=fiction <> category=lit"
+```
+
+With this configuration and `dir=atob`, an input `textClass=novel` is first rewritten to `genre=fiction` by rule 1, then to `category=lit` by rule 2.
+
+This also means that for bidirectional mappings, you often need complementary rules that handle decomposed groups:
+
+```yaml
+mappings:
+ # Forward: source category → OR-of-AND target categories (for AtoB)
+ - "Entertainment <> ((kultur & musik) | (kultur & film))"
+ # Reverse AND: multiple source categories ← AND group (for BtoA with AND input)
+ - "(Entertainment | Culture) <> (kultur & film)"
+```
diff --git a/README.md b/README.md
index 7086fd0..7680a5c 100644
--- a/README.md
+++ b/README.md
@@ -98,44 +98,14 @@
These values are applied during configuration parsing. When using only individual mapping files (`-m` flags), default values are used unless overridden by command line arguments.
+### Mapping Rules
-### Corpus mapping rules (type: corpus)
+Koral-Mapper supports two types of mapping rules:
-Corpus mapping rules use `key=value <> key=value` syntax for rewriting `koral:doc` / `koral:docGroup` structures in the `corpus`/`collection` section of a KoralQuery request, and enriching `fields` arrays in responses.
+- **Annotation mappings** (default): Rewrite `koral:token` / `koral:term` structures in queries and annotation spans in responses
+- **Corpus mappings** (`type: corpus`): Rewrite `koral:doc` / `koral:docGroup` structures in corpus/collection queries and enrich response fields
-- Simple fields: `textClass=novel <> genre=fiction`
-- Match types: `pubDate=2020:geq <> yearFrom=2020:geq` (eq, ne, geq, leq, contains, excludes)
-- Value types: `pubDate=2020-01#date <> year=2020#string` (string, regex, date)
-- Regex matching: `textClass=wissenschaft.*#regex <> genre=science`
-- Groups (AND/OR): `(textClass=novel & pubDate=2020) <> genre=fiction`
-
-(Supported `@type` aliases: `koral:field` for `koral:doc`, `koral:fieldGroup` for `koral:docGroup`).
-
-### Annotation mapping rules (type: annotation)
-
-Each annotation mapping rule consists of two patterns separated by `<>`. The patterns can be:
-- Simple terms (e.g. `[key]`, `[layer=key]`, `[foundry/*=key]`, `[foundry/layer=key]` or `[foundry/layer=key:value]`)
-- Complex terms with AND/OR relations: `[term1 & term2]` or `[term1 | term2]` or `[term1 | (term2 & term3)]`
-
-### Foundry and Layer Precedence for term mapping
-
-Koral-Mapper follows a strict precedence hierarchy when determining which foundry and layer values to use during mapping transformations. This ensures predictable behavior when combining mapping rules with runtime overrides.
-
-#### Precedence Rules
-
-The precedence hierarchy is applied separately for foundry and layer values:
-
-1. **Mapping rule foundry/layer** (highest priority)
- - Explicit foundry/layer specifications in mapping rule patterns
- - Example: `[opennlp/p=DT]` has explicit foundry "opennlp" and layer "p"
-
-2. **Passed overwrite foundry/layer** (medium priority)
- - Values provided via query parameters (`foundryA`, `foundryB`, `layerA`, `layerB`)
- - Applied only when mapping rules don't have explicit foundry/layer
-
-3. **Mapping list foundry/layer** (lowest priority)
- - Default values from the mapping list configuration
- - Used as fallback when neither mapping rules nor query parameters specify values
+For detailed mapping rule syntax, examples, and guidelines on writing mapping files, see [MAPPING.md](MAPPING.md).
## API Endpoints
@@ -245,12 +215,15 @@
The SDK script and server data-attribute in the HTML are determined by the configuration file's `sdk` and `server` values, with fallback to default endpoints if not specified.
-## Supported mappings
+## Supported Mappings
### `mappings/stts-upos.yaml`
-Mapping between [STTS and UD part-of-spech tags](https://universaldependencies.org/tagset-conversion/de-stts-uposf.html).
+Mapping between [STTS and UD part-of-speech tags](https://universaldependencies.org/tagset-conversion/de-stts-uposf.html).
+### `mappings/wiki-dereko.yaml`
+
+Corpus mapping between wiki categories and DeReKo text classes.
## Progress
diff --git a/config/config.go b/config/config.go
index 73e5697..a4f9f77 100644
--- a/config/config.go
+++ b/config/config.go
@@ -32,6 +32,8 @@
LayerA string `yaml:"layerA,omitempty"`
FoundryB string `yaml:"foundryB,omitempty"`
LayerB string `yaml:"layerB,omitempty"`
+ FieldA string `yaml:"fieldA,omitempty"`
+ FieldB string `yaml:"fieldB,omitempty"`
Mappings []MappingRule `yaml:"mappings"`
}
@@ -41,8 +43,12 @@
}
// ParseCorpusMappings parses all mapping rules as corpus rules.
+// Bare values (without key=) are always allowed and receive the default
+// field name from the mapping list header (FieldA/FieldB) when set.
func (list *MappingList) ParseCorpusMappings() ([]*parser.CorpusMappingResult, error) {
corpusParser := parser.NewCorpusParser()
+ corpusParser.AllowBareValues = true
+
results := make([]*parser.CorpusMappingResult, len(list.Mappings))
for i, rule := range list.Mappings {
if rule == "" {
@@ -52,11 +58,33 @@
if err != nil {
return nil, fmt.Errorf("failed to parse corpus mapping rule %d in list '%s': %w", i, list.ID, err)
}
+
+ if list.FieldA != "" {
+ applyDefaultCorpusKey(result.Upper, list.FieldA)
+ }
+ if list.FieldB != "" {
+ applyDefaultCorpusKey(result.Lower, list.FieldB)
+ }
+
results[i] = result
}
return results, nil
}
+// applyDefaultCorpusKey recursively fills in empty keys on CorpusField nodes.
+func applyDefaultCorpusKey(node parser.CorpusNode, defaultKey string) {
+ switch n := node.(type) {
+ case *parser.CorpusField:
+ if n.Key == "" {
+ n.Key = defaultKey
+ }
+ case *parser.CorpusGroup:
+ for _, op := range n.Operands {
+ applyDefaultCorpusKey(op, defaultKey)
+ }
+ }
+}
+
// MappingConfig represents the root configuration containing multiple mapping lists
type MappingConfig struct {
SDK string `yaml:"sdk,omitempty"`
diff --git a/config/config_test.go b/config/config_test.go
index 3a7ad32..4ba8913 100644
--- a/config/config_test.go
+++ b/config/config_test.go
@@ -6,6 +6,7 @@
"testing"
"github.com/KorAP/Koral-Mapper/ast"
+ "github.com/KorAP/Koral-Mapper/parser"
"github.com/rs/zerolog/log"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
@@ -949,3 +950,33 @@
assert.Error(t, err)
assert.Contains(t, err.Error(), "failed to parse corpus mapping rule")
}
+
+func TestParseCorpusMappingsWithFieldAFieldB(t *testing.T) {
+ list := &MappingList{
+ ID: "test-keyed",
+ Type: "corpus",
+ FieldA: "wikiCat",
+ FieldB: "textClass",
+ Mappings: []MappingRule{
+ "Entertainment <> ((kultur & musik) | (kultur & film))",
+ },
+ }
+
+ results, err := list.ParseCorpusMappings()
+ require.NoError(t, err)
+ require.Len(t, results, 1)
+
+ upper := results[0].Upper.(*parser.CorpusField)
+ assert.Equal(t, "wikiCat", upper.Key)
+ assert.Equal(t, "Entertainment", upper.Value)
+
+ group := results[0].Lower.(*parser.CorpusGroup)
+ assert.Equal(t, "or", group.Operation)
+ require.Len(t, group.Operands, 2)
+
+ and1 := group.Operands[0].(*parser.CorpusGroup)
+ assert.Equal(t, "textClass", and1.Operands[0].(*parser.CorpusField).Key)
+ assert.Equal(t, "kultur", and1.Operands[0].(*parser.CorpusField).Value)
+ assert.Equal(t, "textClass", and1.Operands[1].(*parser.CorpusField).Key)
+ assert.Equal(t, "musik", and1.Operands[1].(*parser.CorpusField).Value)
+}
diff --git a/mapper/corpus.go b/mapper/corpus.go
index d45dc92..f5800ba 100644
--- a/mapper/corpus.go
+++ b/mapper/corpus.go
@@ -7,6 +7,8 @@
)
// applyCorpusQueryMappings processes corpus/collection section with corpus rules.
+// Rules are applied iteratively: each rule is applied to the entire tree,
+// and subsequent rules see the transformed result.
func (m *Mapper) applyCorpusQueryMappings(mappingID string, opts MappingOptions, jsonData any) (any, error) {
rules := m.parsedCorpusRules[mappingID]
@@ -15,7 +17,6 @@
return jsonData, nil
}
- // Find corpus or collection attribute
corpusKey := ""
if _, exists := jsonMap["corpus"]; exists {
corpusKey = "corpus"
@@ -33,61 +34,63 @@
}
result := shallowCopyMap(jsonMap)
- rewritten := m.rewriteCorpusNode(corpusData, rules, opts)
- result[corpusKey] = rewritten
+
+ var current any = corpusData
+ for _, rule := range rules {
+ current = m.applyCorpusRule(current, rule, opts)
+ }
+ result[corpusKey] = current
return result, nil
}
-// rewriteCorpusNode recursively walks a corpus tree and applies matching rules.
-func (m *Mapper) rewriteCorpusNode(node map[string]any, rules []*parser.CorpusMappingResult, opts MappingOptions) any {
- atType, _ := node["@type"].(string)
+// applyCorpusRule applies a single corpus mapping rule to a node tree.
+// It matches at the current level first, then recurses into operands
+// if no match is found.
+func (m *Mapper) applyCorpusRule(nodeAny any, rule *parser.CorpusMappingResult, opts MappingOptions) any {
+ node, ok := nodeAny.(map[string]any)
+ if !ok {
+ return nodeAny
+ }
- switch atType {
- case "koral:doc", "koral:field":
- return m.rewriteCorpusDoc(node, rules, opts)
- case "koral:docGroup", "koral:fieldGroup":
- return m.rewriteCorpusDocGroup(node, rules, opts)
- case "koral:docGroupRef":
- return node
- default:
+ atType, _ := node["@type"].(string)
+ if atType == "koral:docGroupRef" {
return node
}
-}
-// rewriteCorpusDoc attempts to match a koral:doc node against rules and replace it.
-func (m *Mapper) rewriteCorpusDoc(node map[string]any, rules []*parser.CorpusMappingResult, opts MappingOptions) any {
- for _, rule := range rules {
- var pattern, replacement parser.CorpusNode
- if opts.Direction == AtoB {
- pattern, replacement = rule.Upper, rule.Lower
- } else {
- pattern, replacement = rule.Lower, rule.Upper
- }
+ var pattern, replacement parser.CorpusNode
+ if opts.Direction == AtoB {
+ pattern, replacement = rule.Upper, rule.Lower
+ } else {
+ pattern, replacement = rule.Lower, rule.Upper
+ }
- patternField, ok := pattern.(*parser.CorpusField)
- if !ok {
- continue
- }
-
- if !matchCorpusField(patternField, node) {
- continue
+ if matchCorpusNode(pattern, node) {
+ // AND subset match: node has more operands than pattern
+ if pg, ok := pattern.(*parser.CorpusGroup); ok && pg.Operation == "and" {
+ operandsRaw, _ := node["operands"].([]any)
+ if operandsRaw != nil && len(operandsRaw) > len(pg.Operands) {
+ return m.buildSubsetANDReplacement(node, pg.Operands, replacement, opts)
+ }
}
replaced := buildReplacementFromNode(replacement, node)
-
if opts.AddRewrites {
addCorpusRewrite(replaced, node)
}
-
return replaced
}
+ // No match at this level; recurse into operands if it's a group
+ if atType == "koral:docGroup" || atType == "koral:fieldGroup" {
+ return m.applyCorpusRuleToOperands(node, rule, opts)
+ }
+
return node
}
-// rewriteCorpusDocGroup recursively rewrites operands of a koral:docGroup.
-func (m *Mapper) rewriteCorpusDocGroup(node map[string]any, rules []*parser.CorpusMappingResult, opts MappingOptions) any {
+// applyCorpusRuleToOperands recursively applies a single rule to operands of a docGroup.
+func (m *Mapper) applyCorpusRuleToOperands(node map[string]any, rule *parser.CorpusMappingResult, opts MappingOptions) any {
result := shallowCopyMap(node)
operandsRaw, ok := node["operands"].([]any)
@@ -97,18 +100,171 @@
newOperands := make([]any, len(operandsRaw))
for i, opRaw := range operandsRaw {
- opMap, ok := opRaw.(map[string]any)
- if !ok {
- newOperands[i] = opRaw
- continue
- }
- newOperands[i] = m.rewriteCorpusNode(opMap, rules, opts)
+ newOperands[i] = m.applyCorpusRule(opRaw, rule, opts)
}
result["operands"] = newOperands
return result
}
+// buildSubsetANDReplacement handles AND patterns that match a subset of a
+// group's operands. The matched operands are replaced and unmatched ones
+// are preserved alongside the replacement.
+func (m *Mapper) buildSubsetANDReplacement(node map[string]any, patternOps []parser.CorpusNode, replacement parser.CorpusNode, opts MappingOptions) any {
+ operandsRaw, _ := node["operands"].([]any)
+
+ used := make([]bool, len(operandsRaw))
+ for _, patOp := range patternOps {
+ for j, docOpRaw := range operandsRaw {
+ if used[j] {
+ continue
+ }
+ docOp, ok := docOpRaw.(map[string]any)
+ if !ok {
+ continue
+ }
+ if matchCorpusNode(patOp, docOp) {
+ used[j] = true
+ break
+ }
+ }
+ }
+
+ var remaining []any
+ for j, docOpRaw := range operandsRaw {
+ if !used[j] {
+ remaining = append(remaining, docOpRaw)
+ }
+ }
+
+ replacementNode := buildReplacementFromNode(replacement, node)
+ newOperands := append([]any{replacementNode}, remaining...)
+
+ if len(newOperands) == 1 {
+ result := newOperands[0]
+ if opts.AddRewrites {
+ if resultMap, ok := result.(map[string]any); ok {
+ addCorpusRewrite(resultMap, node)
+ }
+ }
+ return result
+ }
+
+ result := shallowCopyMap(node)
+ result["operands"] = newOperands
+
+ if opts.AddRewrites {
+ addCorpusRewrite(result, node)
+ }
+
+ return result
+}
+
+// matchCorpusNode checks if a JSON node matches a CorpusNode pattern.
+// For CorpusField patterns, the node must be a koral:doc/koral:field.
+// CorpusGroup patterns are delegated to matchCorpusGroupNode (OR patterns
+// also match a single leaf; AND patterns use subset matching).
+func matchCorpusNode(pattern parser.CorpusNode, node map[string]any) bool {
+ switch p := pattern.(type) {
+ case *parser.CorpusField:
+ atType, _ := node["@type"].(string)
+ if atType != "koral:doc" && atType != "koral:field" {
+ return false
+ }
+ return matchCorpusField(p, node)
+ case *parser.CorpusGroup:
+ return matchCorpusGroupNode(p, node)
+ }
+ return false
+}
+
+// matchCorpusGroupNode checks if a JSON node matches a CorpusGroup pattern.
+//
+// OR patterns: for leaf nodes (doc/field), any operand matching suffices.
+// For group nodes, structural matching requires an OR docGroup/fieldGroup
+// with exactly matching operands (commutative, exact count).
+//
+// AND patterns: the node must be a docGroup/fieldGroup with AND operation
+// and all pattern operands must be found (subset matching — the node may
+// have additional operands beyond those in the pattern).
+func matchCorpusGroupNode(pattern *parser.CorpusGroup, node map[string]any) bool {
+ atType, _ := node["@type"].(string)
+
+ if pattern.Operation == "or" {
+ // Leaf nodes: any-operand matching
+ if atType == "koral:doc" || atType == "koral:field" {
+ for _, op := range pattern.Operands {
+ if matchCorpusNode(op, node) {
+ return true
+ }
+ }
+ return false
+ }
+ // Group nodes: structural matching (exact operand count)
+ if atType != "koral:docGroup" && atType != "koral:fieldGroup" {
+ return false
+ }
+ operation, _ := node["operation"].(string)
+ if operation != "operation:or" {
+ return false
+ }
+ return matchGroupOperands(pattern.Operands, node, true)
+ }
+
+ // AND patterns: subset matching
+ if atType != "koral:docGroup" && atType != "koral:fieldGroup" {
+ return false
+ }
+ operation, _ := node["operation"].(string)
+ if operation != "operation:and" {
+ return false
+ }
+ return matchGroupOperands(pattern.Operands, node, false)
+}
+
+// matchGroupOperands checks if a docGroup's operands match a pattern's
+// operands using commutative set matching. When exactCount is true, the
+// operand counts must be equal; otherwise subset matching is used (the
+// node may have more operands than the pattern).
+func matchGroupOperands(patternOps []parser.CorpusNode, node map[string]any, exactCount bool) bool {
+ operandsRaw, ok := node["operands"].([]any)
+ if !ok {
+ return false
+ }
+ if exactCount {
+ if len(operandsRaw) != len(patternOps) {
+ return false
+ }
+ } else {
+ if len(operandsRaw) < len(patternOps) {
+ return false
+ }
+ }
+
+ used := make([]bool, len(operandsRaw))
+ for _, patOp := range patternOps {
+ found := false
+ for j, docOpRaw := range operandsRaw {
+ if used[j] {
+ continue
+ }
+ docOp, ok := docOpRaw.(map[string]any)
+ if !ok {
+ continue
+ }
+ if matchCorpusNode(patOp, docOp) {
+ used[j] = true
+ found = true
+ break
+ }
+ }
+ if !found {
+ return false
+ }
+ }
+ return true
+}
+
// matchCorpusField checks if a koral:doc JSON node matches a CorpusField pattern.
func matchCorpusField(pattern *parser.CorpusField, doc map[string]any) bool {
docKey, _ := doc["key"].(string)
@@ -153,8 +309,14 @@
func buildReplacementFromNode(replacement parser.CorpusNode, originalDoc map[string]any) any {
switch r := replacement.(type) {
case *parser.CorpusField:
+ // Determine @type: use the original's type for doc/field, default to koral:doc
+ atType := "koral:doc"
+ if origType, _ := originalDoc["@type"].(string); origType == "koral:doc" || origType == "koral:field" {
+ atType = origType
+ }
+
result := map[string]any{
- "@type": originalDoc["@type"],
+ "@type": atType,
"key": r.Key,
"value": r.Value,
}
@@ -196,6 +358,15 @@
return
}
+ origAtType, _ := original["@type"].(string)
+
+ // If the original was a group, store the whole structure as the rewrite original
+ if origAtType == "koral:docGroup" || origAtType == "koral:fieldGroup" {
+ rewrite := newRewriteEntry("", original)
+ replacedMap["rewrites"] = []any{rewrite}
+ return
+ }
+
origKey, _ := original["key"].(string)
newKey, _ := replacedMap["key"].(string)
@@ -275,6 +446,8 @@
}
// matchSingleValue checks a single key+value pair against all rules and returns mapped field entries.
+// Supports field patterns (direct match) and OR group patterns (any operand match).
+// AND group patterns cannot match a single field and are skipped.
func (m *Mapper) matchSingleValue(key, value string, rules []*parser.CorpusMappingResult, opts MappingOptions) []any {
var results []any
@@ -291,12 +464,7 @@
pattern, replacement = rule.Lower, rule.Upper
}
- patternField, ok := pattern.(*parser.CorpusField)
- if !ok {
- continue
- }
-
- if !matchCorpusField(patternField, pseudoDoc) {
+ if !matchCorpusFieldPattern(pattern, pseudoDoc) {
continue
}
@@ -306,7 +474,29 @@
return results
}
-// collectReplacementFields flattens a replacement CorpusNode into individual mapped field entries.
+// matchCorpusFieldPattern checks if a single response field matches a pattern.
+// Field patterns match directly. OR group patterns match if any operand matches.
+// AND group patterns cannot match a single field.
+func matchCorpusFieldPattern(pattern parser.CorpusNode, doc map[string]any) bool {
+ switch p := pattern.(type) {
+ case *parser.CorpusField:
+ return matchCorpusField(p, doc)
+ case *parser.CorpusGroup:
+ if p.Operation == "or" {
+ for _, op := range p.Operands {
+ if matchCorpusFieldPattern(op, doc) {
+ return true
+ }
+ }
+ }
+ }
+ return false
+}
+
+// collectReplacementFields flattens a replacement CorpusNode into individual
+// mapped field entries. OR groups are skipped because response fields are flat
+// key/value entries and OR semantics (one-of) cannot be represented. AND groups
+// are flattened — all operands become individual fields.
func collectReplacementFields(node parser.CorpusNode) []any {
var results []any
@@ -326,6 +516,9 @@
results = append(results, entry)
case *parser.CorpusGroup:
+ if n.Operation == "or" {
+ return nil
+ }
for _, op := range n.Operands {
results = append(results, collectReplacementFields(op)...)
}
diff --git a/mapper/corpus_test.go b/mapper/corpus_test.go
index 4f607e7..df2ae85 100644
--- a/mapper/corpus_test.go
+++ b/mapper/corpus_test.go
@@ -684,3 +684,694 @@
assert.Error(t, err)
assert.Contains(t, err.Error(), "not found")
}
+
+// --- Group pattern matching tests ---
+
+func TestCorpusQueryANDGroupPatternMatchBtoA(t *testing.T) {
+ m := newCorpusMapper(t, "genre=fiction <> (textClass=kultur & textClass=musik)")
+
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:and",
+ "operands": []any{
+ map[string]any{
+ "@type": "koral:doc",
+ "key": "textClass",
+ "value": "kultur",
+ "match": "match:eq",
+ },
+ map[string]any{
+ "@type": "koral:doc",
+ "key": "textClass",
+ "value": "musik",
+ "match": "match:eq",
+ },
+ },
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: BtoA}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "koral:doc", corpus["@type"])
+ assert.Equal(t, "genre", corpus["key"])
+ assert.Equal(t, "fiction", corpus["value"])
+}
+
+func TestCorpusQueryANDGroupPatternCommutative(t *testing.T) {
+ m := newCorpusMapper(t, "genre=fiction <> (textClass=kultur & textClass=musik)")
+
+ // Operands in reversed order — should still match
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:and",
+ "operands": []any{
+ map[string]any{
+ "@type": "koral:doc",
+ "key": "textClass",
+ "value": "musik",
+ },
+ map[string]any{
+ "@type": "koral:doc",
+ "key": "textClass",
+ "value": "kultur",
+ },
+ },
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: BtoA}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "genre", corpus["key"])
+ assert.Equal(t, "fiction", corpus["value"])
+}
+
+func TestCorpusQueryANDGroupPatternNoMatchWrongOp(t *testing.T) {
+ m := newCorpusMapper(t, "genre=fiction <> (textClass=kultur & textClass=musik)")
+
+ // OR operation doesn't match AND pattern
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:or",
+ "operands": []any{
+ map[string]any{
+ "@type": "koral:doc",
+ "key": "textClass",
+ "value": "kultur",
+ },
+ map[string]any{
+ "@type": "koral:doc",
+ "key": "textClass",
+ "value": "musik",
+ },
+ },
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: BtoA}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "koral:docGroup", corpus["@type"])
+}
+
+func TestCorpusQueryANDGroupPatternSubsetMatch(t *testing.T) {
+ m := newCorpusMapper(t, "genre=fiction <> (textClass=kultur & textClass=musik)")
+
+ // Three operands: the pattern matches two of them (subset match);
+ // the unmatched operand is preserved alongside the replacement.
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:and",
+ "operands": []any{
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "kultur"},
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "musik"},
+ map[string]any{"@type": "koral:doc", "key": "pubDate", "value": "2020"},
+ },
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: BtoA}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "koral:docGroup", corpus["@type"])
+ assert.Equal(t, "operation:and", corpus["operation"])
+
+ operands := corpus["operands"].([]any)
+ require.Len(t, operands, 2)
+
+ // First operand is the replacement
+ first := operands[0].(map[string]any)
+ assert.Equal(t, "genre", first["key"])
+ assert.Equal(t, "fiction", first["value"])
+
+ // Second operand is the preserved unmatched operand
+ second := operands[1].(map[string]any)
+ assert.Equal(t, "pubDate", second["key"])
+ assert.Equal(t, "2020", second["value"])
+}
+
+func TestCorpusQueryORGroupPatternExactMatch(t *testing.T) {
+ m := newCorpusMapper(t, "(genre=fiction | genre=novel) <> textClass=belletristik")
+
+ // Exact OR structure matches
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:or",
+ "operands": []any{
+ map[string]any{"@type": "koral:doc", "key": "genre", "value": "fiction"},
+ map[string]any{"@type": "koral:doc", "key": "genre", "value": "novel"},
+ },
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: AtoB}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "koral:doc", corpus["@type"])
+ assert.Equal(t, "textClass", corpus["key"])
+ assert.Equal(t, "belletristik", corpus["value"])
+}
+
+func TestCorpusQueryORGroupPatternSingleOperandMatch(t *testing.T) {
+ m := newCorpusMapper(t, "(genre=fiction | genre=novel) <> textClass=belletristik")
+
+ // Single doc matches OR group pattern when any operand matches.
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:doc",
+ "key": "genre",
+ "value": "fiction",
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: AtoB}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "textClass", corpus["key"])
+ assert.Equal(t, "belletristik", corpus["value"])
+}
+
+func TestCorpusQueryORGroupPatternNoMatchWrongValue(t *testing.T) {
+ m := newCorpusMapper(t, "(genre=fiction | genre=novel) <> textClass=belletristik")
+
+ // Single doc with value not in OR pattern should not match.
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:doc",
+ "key": "genre",
+ "value": "science",
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: AtoB}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "genre", corpus["key"])
+ assert.Equal(t, "science", corpus["value"])
+}
+
+func TestCorpusQueryComplexityOrdering(t *testing.T) {
+ // Rules ordered by complexity (most specific first).
+ // Group patterns use structural matching: AND matches AND groups,
+ // OR matches OR groups. The forward rule's OR B-side does NOT match
+ // individual AND groups in BtoA, so reverse rules handle those.
+ m := newCorpusMapper(t,
+ // Forward: Entertainment → OR-of-ANDs (complex B-side, for AtoB)
+ "genre=Entertainment <> ((textClass=kultur & textClass=musik) | (textClass=kultur & textClass=film))",
+ // Reverse aggregated: (Entertainment | Culture) → AND (for BtoA with (k&f))
+ "(genre=Entertainment | genre=Culture) <> (textClass=kultur & textClass=film)",
+ // Reverse individual: Entertainment → AND (for BtoA with (k&m))
+ "genre=Entertainment <> (textClass=kultur & textClass=musik)",
+ )
+
+ // AtoB: first rule matches (simple field pattern on A-side)
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:doc",
+ "key": "genre",
+ "value": "Entertainment",
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: AtoB}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "koral:docGroup", corpus["@type"])
+ assert.Equal(t, "operation:or", corpus["operation"])
+ operands := corpus["operands"].([]any)
+ require.Len(t, operands, 2)
+
+ // BtoA with AND group (kultur & film): forward rule's OR B-side doesn't
+ // match AND structurally, so the reverse aggregated rule matches
+ input2 := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:and",
+ "operands": []any{
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "kultur"},
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "film"},
+ },
+ },
+ }
+ result2, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: BtoA}, input2)
+ require.NoError(t, err)
+
+ corpus2 := result2.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "koral:docGroup", corpus2["@type"])
+ assert.Equal(t, "operation:or", corpus2["operation"])
+ ops := corpus2["operands"].([]any)
+ require.Len(t, ops, 2)
+ assert.Equal(t, "Entertainment", ops[0].(map[string]any)["value"])
+ assert.Equal(t, "Culture", ops[1].(map[string]any)["value"])
+
+ // BtoA with AND group (kultur & musik): reverse individual rule matches
+ input3 := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:and",
+ "operands": []any{
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "kultur"},
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "musik"},
+ },
+ },
+ }
+ result3, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: BtoA}, input3)
+ require.NoError(t, err)
+
+ corpus3 := result3.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "koral:doc", corpus3["@type"])
+ assert.Equal(t, "genre", corpus3["key"])
+ assert.Equal(t, "Entertainment", corpus3["value"])
+}
+
+func TestCorpusQueryGroupToFieldReplacementRewrite(t *testing.T) {
+ m := newCorpusMapper(t, "genre=fiction <> (textClass=kultur & textClass=musik)")
+
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:and",
+ "operands": []any{
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "kultur"},
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "musik"},
+ },
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: BtoA, AddRewrites: true}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "genre", corpus["key"])
+ assert.Equal(t, "fiction", corpus["value"])
+
+ rewrites, ok := corpus["rewrites"].([]any)
+ require.True(t, ok)
+ require.Len(t, rewrites, 1)
+
+ rewrite := rewrites[0].(map[string]any)
+ assert.Equal(t, "koral:rewrite", rewrite["@type"])
+ // Original was a group, so the whole structure is stored
+ original, ok := rewrite["original"].(map[string]any)
+ require.True(t, ok)
+ assert.Equal(t, "koral:docGroup", original["@type"])
+}
+
+func TestCorpusQueryNestedGroupPatternMatch(t *testing.T) {
+ // Nested: OR of AND groups
+ m := newCorpusMapper(t, "genre=fiction <> ((textClass=kultur & textClass=musik) | (textClass=kultur & textClass=film))")
+
+ // BtoA: the OR-of-AND pattern matches an exact OR-of-AND docGroup
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:or",
+ "operands": []any{
+ map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:and",
+ "operands": []any{
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "kultur"},
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "musik"},
+ },
+ },
+ map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:and",
+ "operands": []any{
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "kultur"},
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "film"},
+ },
+ },
+ },
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: BtoA}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "koral:doc", corpus["@type"])
+ assert.Equal(t, "genre", corpus["key"])
+ assert.Equal(t, "fiction", corpus["value"])
+}
+
+func TestCorpusQueryGroupPatternRecursionFallthrough(t *testing.T) {
+ // Group pattern doesn't match the outer group, so we recurse into operands
+ m := newCorpusMapper(t, "genre=fiction <> (textClass=kultur & textClass=musik)")
+
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:or",
+ "operands": []any{
+ // This inner AND group matches the rule's B-side pattern
+ map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:and",
+ "operands": []any{
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "kultur"},
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "musik"},
+ },
+ },
+ // This stays unchanged
+ map[string]any{
+ "@type": "koral:doc",
+ "key": "author",
+ "value": "Fontane",
+ },
+ },
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: BtoA}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "koral:docGroup", corpus["@type"])
+ assert.Equal(t, "operation:or", corpus["operation"])
+
+ operands := corpus["operands"].([]any)
+ require.Len(t, operands, 2)
+
+ // First operand was replaced
+ first := operands[0].(map[string]any)
+ assert.Equal(t, "genre", first["key"])
+ assert.Equal(t, "fiction", first["value"])
+
+ // Second operand unchanged
+ second := operands[1].(map[string]any)
+ assert.Equal(t, "author", second["key"])
+ assert.Equal(t, "Fontane", second["value"])
+}
+
+func TestCorpusQueryFieldGroupAliasWithGroupPattern(t *testing.T) {
+ m := newCorpusMapper(t, "genre=fiction <> (textClass=kultur & textClass=musik)")
+
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:fieldGroup",
+ "operation": "operation:and",
+ "operands": []any{
+ map[string]any{"@type": "koral:field", "key": "textClass", "value": "kultur"},
+ map[string]any{"@type": "koral:field", "key": "textClass", "value": "musik"},
+ },
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: BtoA}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "genre", corpus["key"])
+ assert.Equal(t, "fiction", corpus["value"])
+}
+
+func TestCorpusQueryComplexPatternAndComplexReplacementAtoB(t *testing.T) {
+ m := newCorpusMapper(t, "(genre=fiction & region=de) <> ((textClass=kultur & textClass=film) | textClass=kultur.film)")
+
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:and",
+ "operands": []any{
+ map[string]any{"@type": "koral:doc", "key": "genre", "value": "fiction"},
+ map[string]any{"@type": "koral:doc", "key": "region", "value": "de"},
+ },
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: AtoB}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "koral:docGroup", corpus["@type"])
+ assert.Equal(t, "operation:or", corpus["operation"])
+
+ orOps := corpus["operands"].([]any)
+ require.Len(t, orOps, 2)
+
+ andGroup := orOps[0].(map[string]any)
+ assert.Equal(t, "koral:docGroup", andGroup["@type"])
+ assert.Equal(t, "operation:and", andGroup["operation"])
+ andOps := andGroup["operands"].([]any)
+ require.Len(t, andOps, 2)
+ assert.Equal(t, "textClass", andOps[0].(map[string]any)["key"])
+ assert.Equal(t, "kultur", andOps[0].(map[string]any)["value"])
+ assert.Equal(t, "textClass", andOps[1].(map[string]any)["key"])
+ assert.Equal(t, "film", andOps[1].(map[string]any)["value"])
+
+ dotValue := orOps[1].(map[string]any)
+ assert.Equal(t, "koral:doc", dotValue["@type"])
+ assert.Equal(t, "textClass", dotValue["key"])
+ assert.Equal(t, "kultur.film", dotValue["value"])
+}
+
+func TestCorpusQueryComplexPatternAndComplexReplacementBtoA(t *testing.T) {
+ m := newCorpusMapper(t, "(genre=fiction & region=de) <> ((textClass=kultur & textClass=film) | textClass=kultur.film)")
+
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:or",
+ "operands": []any{
+ map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:and",
+ "operands": []any{
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "kultur"},
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "film"},
+ },
+ },
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "kultur.film"},
+ },
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: BtoA}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "koral:docGroup", corpus["@type"])
+ assert.Equal(t, "operation:and", corpus["operation"])
+ andOps := corpus["operands"].([]any)
+ require.Len(t, andOps, 2)
+
+ left := andOps[0].(map[string]any)
+ right := andOps[1].(map[string]any)
+ keys := []string{left["key"].(string), right["key"].(string)}
+ values := []string{left["value"].(string), right["value"].(string)}
+ assert.ElementsMatch(t, []string{"genre", "region"}, keys)
+ assert.ElementsMatch(t, []string{"fiction", "de"}, values)
+}
+
+// --- Iterative rule application tests ---
+
+func TestCorpusQueryIterativeRuleApplication(t *testing.T) {
+ // Two rules applied to the same tree: each should fire on a different operand.
+ m := newCorpusMapper(t,
+ "textClass=novel <> genre=fiction",
+ "textClass=science <> genre=nonfiction",
+ )
+
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:and",
+ "operands": []any{
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "novel"},
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "science"},
+ },
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: AtoB}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ operands := corpus["operands"].([]any)
+ require.Len(t, operands, 2)
+ assert.Equal(t, "fiction", operands[0].(map[string]any)["value"])
+ assert.Equal(t, "nonfiction", operands[1].(map[string]any)["value"])
+}
+
+func TestCorpusQueryIterativeSuccessiveTransform(t *testing.T) {
+ // Rule 1 transforms a field, rule 2 transforms the result of rule 1.
+ m := newCorpusMapper(t,
+ "textClass=novel <> genre=fiction",
+ "genre=fiction <> category=lit",
+ )
+
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:doc",
+ "key": "textClass",
+ "value": "novel",
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: AtoB}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "category", corpus["key"])
+ assert.Equal(t, "lit", corpus["value"])
+}
+
+// --- AND subset matching tests ---
+
+func TestCorpusQueryANDSubsetMatchGroupReplacement(t *testing.T) {
+ // AND pattern with group replacement on subset match
+ m := newCorpusMapper(t, "genre=fiction <> (textClass=kultur & textClass=musik)")
+
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:docGroup",
+ "operation": "operation:and",
+ "operands": []any{
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "kultur"},
+ map[string]any{"@type": "koral:doc", "key": "textClass", "value": "musik"},
+ map[string]any{"@type": "koral:doc", "key": "author", "value": "Goethe"},
+ },
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: BtoA}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "koral:docGroup", corpus["@type"])
+ assert.Equal(t, "operation:and", corpus["operation"])
+
+ operands := corpus["operands"].([]any)
+ require.Len(t, operands, 2)
+
+ assert.Equal(t, "genre", operands[0].(map[string]any)["key"])
+ assert.Equal(t, "fiction", operands[0].(map[string]any)["value"])
+ assert.Equal(t, "author", operands[1].(map[string]any)["key"])
+ assert.Equal(t, "Goethe", operands[1].(map[string]any)["value"])
+}
+
+// --- OR any-operand matching tests ---
+
+func TestCorpusQueryORPatternMatchesBothOperands(t *testing.T) {
+ m := newCorpusMapper(t, "(genre=fiction | genre=novel) <> textClass=belletristik")
+
+ // Both "fiction" and "novel" should match the OR pattern
+ for _, val := range []string{"fiction", "novel"} {
+ input := map[string]any{
+ "corpus": map[string]any{
+ "@type": "koral:doc",
+ "key": "genre",
+ "value": val,
+ },
+ }
+ result, err := m.ApplyQueryMappings("corpus-test", MappingOptions{Direction: AtoB}, input)
+ require.NoError(t, err)
+
+ corpus := result.(map[string]any)["corpus"].(map[string]any)
+ assert.Equal(t, "textClass", corpus["key"], "value %s should match", val)
+ assert.Equal(t, "belletristik", corpus["value"])
+ }
+}
+
+// --- Response-side OR pattern and replacement tests ---
+
+func TestCorpusResponseORPatternMatchesSingleField(t *testing.T) {
+ m := newCorpusMapper(t, "(textClass=novel | textClass=fiction) <> (genre=lit & type=book)")
+
+ input := map[string]any{
+ "fields": []any{
+ map[string]any{
+ "@type": "koral:field",
+ "key": "textClass",
+ "value": "novel",
+ "type": "type:string",
+ },
+ },
+ }
+ result, err := m.ApplyResponseMappings("corpus-test", MappingOptions{Direction: AtoB}, input)
+ require.NoError(t, err)
+
+ fields := result.(map[string]any)["fields"].([]any)
+ require.Len(t, fields, 3)
+
+ mapped1 := fields[1].(map[string]any)
+ assert.Equal(t, "genre", mapped1["key"])
+ assert.Equal(t, "lit", mapped1["value"])
+
+ mapped2 := fields[2].(map[string]any)
+ assert.Equal(t, "type", mapped2["key"])
+ assert.Equal(t, "book", mapped2["value"])
+}
+
+func TestCorpusResponseORReplacementSkipped(t *testing.T) {
+ m := newCorpusMapper(t, "textClass=novel <> (genre=fiction | genre=novel)")
+
+ input := map[string]any{
+ "fields": []any{
+ map[string]any{
+ "@type": "koral:field",
+ "key": "textClass",
+ "value": "novel",
+ "type": "type:string",
+ },
+ },
+ }
+ result, err := m.ApplyResponseMappings("corpus-test", MappingOptions{Direction: AtoB}, input)
+ require.NoError(t, err)
+
+ fields := result.(map[string]any)["fields"].([]any)
+ require.Len(t, fields, 1, "OR replacement should be skipped in response")
+}
+
+func TestCorpusResponseORPatternANDReplacementBothFields(t *testing.T) {
+ // Rule: (a | b) <> (c & d)
+ // When "a" is in response, both "c" and "d" should be added.
+ m := newCorpusMapper(t, "(textClass=a | textClass=b) <> (genre=c & genre=d)")
+
+ input := map[string]any{
+ "fields": []any{
+ map[string]any{
+ "@type": "koral:field",
+ "key": "textClass",
+ "value": "a",
+ "type": "type:string",
+ },
+ },
+ }
+ result, err := m.ApplyResponseMappings("corpus-test", MappingOptions{Direction: AtoB}, input)
+ require.NoError(t, err)
+
+ fields := result.(map[string]any)["fields"].([]any)
+ require.Len(t, fields, 3)
+
+ mapped1 := fields[1].(map[string]any)
+ assert.Equal(t, "genre", mapped1["key"])
+ assert.Equal(t, "c", mapped1["value"])
+ assert.Equal(t, true, mapped1["mapped"])
+
+ mapped2 := fields[2].(map[string]any)
+ assert.Equal(t, "genre", mapped2["key"])
+ assert.Equal(t, "d", mapped2["value"])
+ assert.Equal(t, true, mapped2["mapped"])
+}
+
+func TestCorpusResponseORPatternORReplacementSkipped(t *testing.T) {
+ // Rule: (a | b) <> (c | d)
+ // When "a" is in response, nothing should be added (OR replacement skipped).
+ m := newCorpusMapper(t, "(textClass=a | textClass=b) <> (genre=c | genre=d)")
+
+ input := map[string]any{
+ "fields": []any{
+ map[string]any{
+ "@type": "koral:field",
+ "key": "textClass",
+ "value": "a",
+ "type": "type:string",
+ },
+ },
+ }
+ result, err := m.ApplyResponseMappings("corpus-test", MappingOptions{Direction: AtoB}, input)
+ require.NoError(t, err)
+
+ fields := result.(map[string]any)["fields"].([]any)
+ require.Len(t, fields, 1, "OR replacement should be skipped")
+}
diff --git a/mappings/wiki-dereko.yaml b/mappings/wiki-dereko.yaml
new file mode 100644
index 0000000..83b4815
--- /dev/null
+++ b/mappings/wiki-dereko.yaml
@@ -0,0 +1,98 @@
+id: wiki-dereko
+type: corpus
+desc: Mapping between Wikipedia and DeReKo textClass categories
+fieldA: wikiCat
+fieldB: textClass
+mappings:
+ # Academic_disciplines (7171 files): wissenschaft.populaerwissenschaft=30.6%, kultur.literatur=24.2%
+ - "Academic_disciplines <> ((wissenschaft & populaerwissenschaft) | (kultur & literatur))"
+ # Communication (2568 files): technik-industrie.edv-elektronik=31.0%, wissenschaft.populaerwissenschaft=25.1%
+ - "Communication <> ((technik-industrie & edv-elektronik) | (wissenschaft & populaerwissenschaft))"
+ # Concepts (2646 files): freizeit-unterhaltung.reisen=23.9%, wissenschaft.populaerwissenschaft=21.7%
+ - "Concepts <> ((freizeit-unterhaltung & reisen) | (wissenschaft & populaerwissenschaft))"
+ # Entities (13372 files): wissenschaft.populaerwissenschaft=30.4%, technik-industrie.edv-elektronik=21.9%
+ - "Entities <> ((wissenschaft & populaerwissenschaft) | (technik-industrie & edv-elektronik))"
+ # Food_drink (4456 files): freizeit-unterhaltung.reisen=37.9%, wissenschaft.populaerwissenschaft=23.2%
+ - "Food_drink <> ((freizeit-unterhaltung & reisen) | (wissenschaft & populaerwissenschaft))"
+ # Health (1952 files): wissenschaft.populaerwissenschaft=30.0%, gesundheit-ernaehrung.gesundheit=28.7%
+ - "Health <> ((wissenschaft & populaerwissenschaft) | (gesundheit-ernaehrung & gesundheit))"
+ # History (59767 files): freizeit-unterhaltung.reisen=22.6%, kultur.literatur=20.6%
+ - "History <> ((freizeit-unterhaltung & reisen) | (kultur & literatur))"
+ # Mathematics (10035 files): wissenschaft.populaerwissenschaft=54.1%, kultur.literatur=24.1%
+ - "Mathematics <> ((wissenschaft & populaerwissenschaft) | (kultur & literatur))"
+ # Philosophy (10961 files): kultur.literatur=51.8%, wissenschaft.populaerwissenschaft=27.3%
+ - "Philosophy <> ((kultur & literatur) | (wissenschaft & populaerwissenschaft))"
+ # Religion (7875 files): staat-gesellschaft.kirche=37.1%, freizeit-unterhaltung.reisen=31.0%
+ - "Religion <> ((staat-gesellschaft & kirche) | (freizeit-unterhaltung & reisen))"
+ # Science (31185 files): wissenschaft.populaerwissenschaft=42.3%, kultur.literatur=21.8%
+ - "Science <> ((wissenschaft & populaerwissenschaft) | (kultur & literatur))"
+ # Sports (49599 files): sport.vermischtes=37.5%, sport.fussball=28.8%
+ - "Sports <> ((sport & vermischtes) | (sport & fussball))"
+ # Time (1023 files): wissenschaft.populaerwissenschaft=29.8%, kultur.film=27.2%
+ - "Time <> ((wissenschaft & populaerwissenschaft) | (kultur & film))"
+ # Culture (21448 files): freizeit-unterhaltung.reisen=37.1%
+ - "Culture <> (freizeit-unterhaltung & reisen)"
+ # Economy (13283 files): wissenschaft.populaerwissenschaft=20.0%
+ - "Economy <> (wissenschaft & populaerwissenschaft)"
+ # Education (7560 files): staat-gesellschaft.bildung=37.1%
+ - "Education <> (staat-gesellschaft & bildung)"
+ # Energy (1642 files): wissenschaft.populaerwissenschaft=41.8%
+ - "Energy <> (wissenschaft & populaerwissenschaft)"
+ # Engineering (10318 files): wissenschaft.populaerwissenschaft=33.6%
+ - "Engineering <> (wissenschaft & populaerwissenschaft)"
+ # Entertainment (1033 files): kultur.musik=34.2%
+ - "Entertainment <> (kultur & musik)"
+ # Geography (18082 files): freizeit-unterhaltung.reisen=68.4%
+ - "Geography <> (freizeit-unterhaltung & reisen)"
+ # Government (6594 files): politik.ausland=45.8%
+ - "Government <> (politik & ausland)"
+ # Human_behavior (10855 files): politik.ausland=35.7%
+ - "Human_behavior <> (politik & ausland)"
+ # Humanities (6897 files): kultur.literatur=31.8%
+ - "Humanities <> (kultur & literatur)"
+ # Information (1480 files): technik-industrie.edv-elektronik=55.0%
+ - "Information <> (technik-industrie & edv-elektronik)"
+ # Internet (1694 files): technik-industrie.edv-elektronik=67.7%
+ - "Internet <> (technik-industrie & edv-elektronik)"
+ # Knowledge (4832 files): wissenschaft.populaerwissenschaft=59.5%
+ - "Knowledge <> (wissenschaft & populaerwissenschaft)"
+ # Language (60359 files): kultur.literatur=64.5%
+ - "Language <> (kultur & literatur)"
+ # Law (26387 files): politik.ausland=42.5%
+ - "Law <> (politik & ausland)"
+ # Life (3117 files): politik.ausland=21.5%
+ - "Life <> (politik & ausland)"
+ # Lists (22019 files): freizeit-unterhaltung.reisen=21.0%
+ - "Lists <> (freizeit-unterhaltung & reisen)"
+ # Mass_media (21707 files): kultur.film=32.1%
+ - "Mass_media <> (kultur & film)"
+ # Military (27580 files): politik.ausland=32.0%
+ - "Military <> (politik & ausland)"
+ # Nature (5573 files): freizeit-unterhaltung.reisen=42.6%
+ - "Nature <> (freizeit-unterhaltung & reisen)"
+ # Politics (9887 files): politik.ausland=48.7%
+ - "Politics <> (politik & ausland)"
+ # Society (12187 files): wissenschaft.populaerwissenschaft=49.4%
+ - "Society <> (wissenschaft & populaerwissenschaft)"
+ # Technology (11385 files): wissenschaft.populaerwissenschaft=44.0%
+ - "Technology <> (wissenschaft & populaerwissenschaft)"
+ # Universe (1967 files): wissenschaft.populaerwissenschaft=41.1%
+ - "Universe <> (wissenschaft & populaerwissenschaft)"
+ # Reverse (BtoA): AND group freizeit-unterhaltung & reisen → History | Geography | Culture
+ - "(History | Geography | Culture) <> (freizeit-unterhaltung & reisen)"
+ # Reverse (BtoA): AND group technik-industrie & edv-elektronik → Entities | Internet
+ - "(Entities | Internet) <> (technik-industrie & edv-elektronik)"
+ # Reverse (BtoA): AND group kultur & literatur → Language | History
+ - "(Language | History) <> (kultur & literatur)"
+ # Reverse (BtoA): AND group politik & ausland → Law | Military
+ - "(Law | Military) <> (politik & ausland)"
+ # Health (1952 files): gesundheit-ernaehrung.gesundheit=28.7%
+ - "Health <> (gesundheit-ernaehrung & gesundheit)"
+ # Religion (7875 files): staat-gesellschaft.kirche=37.1%
+ - "Religion <> (staat-gesellschaft & kirche)"
+ # Science (31185 files): wissenschaft.populaerwissenschaft=42.3%
+ - "Science <> (wissenschaft & populaerwissenschaft)"
+ # Sports (49599 files): sport.fussball=28.8%
+ - "Sports <> (sport & fussball)"
+ # Sports (49599 files): sport.vermischtes=37.5%
+ - "Sports <> (sport & vermischtes)"
diff --git a/parser/corpus_parser.go b/parser/corpus_parser.go
index b4411ce..9debf3a 100644
--- a/parser/corpus_parser.go
+++ b/parser/corpus_parser.go
@@ -82,7 +82,12 @@
}
// CorpusParser parses corpus mapping rules.
-type CorpusParser struct{}
+type CorpusParser struct {
+ // AllowBareValues enables parsing values without a key= prefix.
+ // The resulting CorpusField has an empty Key, which the caller fills in
+ // from the mapping list header (fieldA/fieldB in the YAML file).
+ AllowBareValues bool
+}
func NewCorpusParser() *CorpusParser {
return &CorpusParser{}
@@ -174,12 +179,16 @@
}
// parseField parses a single field expression: key=value[:match][#type].
+// When AllowBareValues is true, also accepts bare values without key=.
func (p *CorpusParser) parseField(input string) (*CorpusField, error) {
input = strings.TrimSpace(input)
eqIdx := strings.Index(input, "=")
if eqIdx == -1 {
- return nil, fmt.Errorf("invalid field expression: missing '=' in %q", input)
+ if !p.AllowBareValues {
+ return nil, fmt.Errorf("invalid field expression: missing '=' in %q", input)
+ }
+ return p.parseBareValue(input)
}
key := strings.TrimSpace(input[:eqIdx])
@@ -217,6 +226,36 @@
return field, nil
}
+// parseBareValue parses a value without a key= prefix.
+// The Key is left empty and should be filled from the mapping list header.
+func (p *CorpusParser) parseBareValue(input string) (*CorpusField, error) {
+ if input == "" {
+ return nil, fmt.Errorf("invalid field expression: empty bare value")
+ }
+
+ field := &CorpusField{}
+
+ if hashIdx := strings.LastIndex(input, "#"); hashIdx != -1 {
+ field.Type = strings.TrimSpace(input[hashIdx+1:])
+ input = input[:hashIdx]
+ }
+
+ if colonIdx := strings.LastIndex(input, ":"); colonIdx != -1 {
+ candidate := strings.TrimSpace(input[colonIdx+1:])
+ if validMatchTypes[candidate] {
+ field.Match = candidate
+ input = input[:colonIdx]
+ }
+ }
+
+ field.Value = strings.TrimSpace(input)
+ if field.Value == "" {
+ return nil, fmt.Errorf("invalid field expression: empty bare value")
+ }
+
+ return field, nil
+}
+
// findMatchingParen finds the index of the closing parenthesis matching the
// opening parenthesis at position 0.
func findMatchingParen(input string) int {
diff --git a/parser/corpus_parser_test.go b/parser/corpus_parser_test.go
index a62d0fb..7c94345 100644
--- a/parser/corpus_parser_test.go
+++ b/parser/corpus_parser_test.go
@@ -283,3 +283,74 @@
c.Operands[0].(*CorpusField).Key = "changed"
assert.NotEqual(t, g.Operands[0].(*CorpusField).Key, "changed")
}
+
+func TestCorpusParserBareValue(t *testing.T) {
+ p := &CorpusParser{AllowBareValues: true}
+ result, err := p.ParseMapping("Entertainment <> (technik-industrie & edv-elektronik)")
+ require.NoError(t, err)
+
+ upper, ok := result.Upper.(*CorpusField)
+ require.True(t, ok)
+ assert.Equal(t, "", upper.Key)
+ assert.Equal(t, "Entertainment", upper.Value)
+
+ group, ok := result.Lower.(*CorpusGroup)
+ require.True(t, ok)
+ assert.Equal(t, "and", group.Operation)
+ require.Len(t, group.Operands, 2)
+
+ f1 := group.Operands[0].(*CorpusField)
+ assert.Equal(t, "", f1.Key)
+ assert.Equal(t, "technik-industrie", f1.Value)
+
+ f2 := group.Operands[1].(*CorpusField)
+ assert.Equal(t, "", f2.Key)
+ assert.Equal(t, "edv-elektronik", f2.Value)
+}
+
+func TestCorpusParserBareValueORGroup(t *testing.T) {
+ p := &CorpusParser{AllowBareValues: true}
+ result, err := p.ParseMapping("Entertainment <> ((kultur & musik) | (kultur & film))")
+ require.NoError(t, err)
+
+ upper, ok := result.Upper.(*CorpusField)
+ require.True(t, ok)
+ assert.Equal(t, "Entertainment", upper.Value)
+
+ group, ok := result.Lower.(*CorpusGroup)
+ require.True(t, ok)
+ assert.Equal(t, "or", group.Operation)
+ require.Len(t, group.Operands, 2)
+
+ and1, ok := group.Operands[0].(*CorpusGroup)
+ require.True(t, ok)
+ assert.Equal(t, "and", and1.Operation)
+ assert.Equal(t, "kultur", and1.Operands[0].(*CorpusField).Value)
+ assert.Equal(t, "musik", and1.Operands[1].(*CorpusField).Value)
+
+ and2, ok := group.Operands[1].(*CorpusGroup)
+ require.True(t, ok)
+ assert.Equal(t, "and", and2.Operation)
+ assert.Equal(t, "kultur", and2.Operands[0].(*CorpusField).Value)
+ assert.Equal(t, "film", and2.Operands[1].(*CorpusField).Value)
+}
+
+func TestCorpusParserBareValueDisabledByDefault(t *testing.T) {
+ p := NewCorpusParser()
+ _, err := p.ParseMapping("Entertainment <> genre=fiction")
+ assert.Error(t, err, "bare values should fail without AllowBareValues")
+}
+
+func TestCorpusParserBareValueMixedWithKeyed(t *testing.T) {
+ p := &CorpusParser{AllowBareValues: true}
+ result, err := p.ParseMapping("Entertainment <> genre=fiction")
+ require.NoError(t, err)
+
+ upper := result.Upper.(*CorpusField)
+ assert.Equal(t, "", upper.Key)
+ assert.Equal(t, "Entertainment", upper.Value)
+
+ lower := result.Lower.(*CorpusField)
+ assert.Equal(t, "genre", lower.Key)
+ assert.Equal(t, "fiction", lower.Value)
+}