Text Search in QLever
QLever allows the combination of SPARQL and full-text search in a collection of text records. The text records can be the literals from the RDF data, or additional text that is not part of the RDF data, or both.
The additional text can contain mentions of entities from the RDF data. A SPARQL+Text query can then specify co-occurrence of words from the text with entities from the RDF data. This is a powerful concept. See below for the input format and an example query.
Building a combined RDF and text index
The qlever script makes it very easy to build a text index and to start a
server using that text index. Just download the script using pip install qlever
or from https://github.com/ad-freiburg/qlever-control and follow the
instructions from qlever --help. As a quick summary, there are the following
commands and options related to SPARQL+Text:
qlever index --text-index [OPTIONS] ...
qlever add-text-index [OPTIONS]
qlever start --use-text-index [OPTIONS] ...
Note that qlever index builds both the RDF index and the text index in one
go, while qlever add-text-index adds a text index to an already existing RDF
index. Use qlever <command> --help to get detailed information about the
options for each of these commands.
Format of the text input files
This section describes the format of the input files for additional text that is not part of the RDF data. These text records may contain mentions of entities from the RDF data. This information can be passed to QLever via two files, a so-called "wordsfile" and a so-called "docsfile". The wordsfile specifies which words and entities occur in which text record. The docsfile is just the text record that is returned in query results.
The wordsfile is a tab-separated file with one line per word occurrence, in the following format:
Here is an example excerpt for the text record In 1928, Fleming discovered
penicillin, assuming that the id of the text record is 17and that the
scientist and the drug are annotated with the IRIs of the corresponding
entities in the RDF data. Note that the IRI can be syntactically completely
different from the words used to refer to that entity in the text.
In 0 17 1
1928 0 17 1
<Alexander_Fleming> 1 17 1
Fleming 0 17 1
discovered 0 17 1
penicillin 0 17 1
<Penicillin> 1 17 1
The docsfile is a tab-separated file with one line per text record, in the following format:
For example, for the sentence above:
Simple Text Search
QLever supports a simple text search using the special predicates ql:contains-entity and ql:contains-word. Additionally, a version with more configuration options is available: see Advanced Text Search.
Example SPARQL+Text queries
Here is an example query on Wikidata, which returns astronauts that are mentioned in a sentence from the English Wikipedia that also contains the word "moon" and the prefix "walk", ordered by the number of matching sentences.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX ql: <http://qlever.cs.uni-freiburg.de/builtin-functions/>
SELECT ?astronaut ?astronaut_name (COUNT(?text) AS ?count) (SAMPLE(?text) AS ?example_text) WHERE {
?astronaut wdt:P106 wd:Q11631 .
?astronaut rdfs:label ?astronaut_name .
FILTER (LANG(?astronaut_name) = "en")
?text ql:contains-entity ?astronaut .
?text ql:contains-word "moon walk*" .
}
GROUP BY ?astronaut ?astronaut_name
ORDER BY DESC(?count)
Advanced Text Search
QLever also provides a Text Search SERVICE which provides additional control over text search compared to ql:contains-word and ql:contains-entity. It allows binding the score and prefix match of the searches to self defined variables. This feature is accessed using the SERVICE keyword and the service IRI <https://qlever.cs.uni-freiburg.de/textSearch/>.
Full Syntax
The complete structure of a Text Search with every predicate used is as follows:
PREFIX textSearch: <https://qlever.cs.uni-freiburg.de/textSearch/>
SELECT * WHERE {
SERVICE textSearch: {
?t textSearch:contains [
textSearch:word "prefix*" ;
textSearch:prefix-match ?prefix_match ;
textSearch:score ?prefix_score
] .
?t textSearch:contains [
textSearch:entity ?entity ;
textSearch:score ?entity_score
] .
}
}
Parameters
textSearch:contains: Connects a text variable to a configuration for text search. One text variable can be connected to multiple configurations leading to multiple searches on the same text where the searches are later joined. One config shouldn't be linked to multiple text variables.
textSearch:word: Defines the word or prefix used in the text search. Doesn't support multiple words separated by spaces like ql:word does. It is instead intended to use multiple searches.
textSearch:entity: Defines the entity used in text search. Can be a literal or IRI for fixed entities or a variable to get matches for multiple entities. An entity search on a text variable needs at least one word search on the same variable.
textSearch:prefix-match (optional): Binds prefix completion to a variable that can be used outside of the SERVICE. If not specified the column will be omitted. Should only be used in a word search configuration where the word is a prefix.
textSearch:score (optional): Binds score to a variable that can be used outside of the SERVICE. If not specified the column will be omitted.
A configuration is either a word search configuration specified with textSearch:word or an entity search configuration with textSearch:entity.
Example queries
Example 1: Simple word search
A simple word search:
PREFIX textSearch: <https://qlever.cs.uni-freiburg.de/textSearch/>
SELECT * WHERE {
SERVICE textSearch: {
?t textSearch:contains [ textSearch:word "word" ] .
}
}
The output columns are: ?t
Example 2: Prefix search
A simple prefix search where the prefix variable is bound:
PREFIX textSearch: <https://qlever.cs.uni-freiburg.de/textSearch/>
SELECT * WHERE {
SERVICE textSearch: {
?t textSearch:contains [ textSearch:word "prefix*" ; textSearch:prefix-match ?prefix_match ] .
}
}
The output columns are: ?t, ?prefix_match
Example 3: Score word search with filter
A word search where the score variable is bound and later used to filter the result:
PREFIX textSearch: <https://qlever.cs.uni-freiburg.de/textSearch/>
SELECT * WHERE {
SERVICE textSearch: {
?t textSearch:contains [ textSearch:word "word" ; textSearch:score ?score ] .
}
FILTER(?score > 1)
}
The output columns are: ?t, ?score
Example 4: Simple entity search
A simple entity search, where an entity is contained in a text together with a word:
PREFIX textSearch: <https://qlever.cs.uni-freiburg.de/textSearch/>
SELECT * WHERE {
SERVICE textSearch: {
?t textSearch:contains [ textSearch:word "word" ] .
?t textSearch:contains [ textSearch:entity ?e ] .
}
}
The output columns are: ?t, ?e
Example 5: Multiple word and or prefix search
A text search with multiple word and prefix searches:
PREFIX textSearch: <https://qlever.cs.uni-freiburg.de/textSearch/>
SELECT * WHERE {
SERVICE textSearch: {
?t textSearch:contains [ textSearch:word "word" ] .
?t textSearch:contains [ textSearch:word "prefix*" ] .
?t textSearch:contains [ textSearch:word "term" ] .
}
}
The output columns are: ?t
Error Handling
The Text Search feature will throw errors in the following scenarios:
- No Basic Graph Pattern: If the inner syntax of the
SERVICEisn't only triples, an error will be raised. - Faulty Triples: If predicates aren't used with subjects and objects of correct type, an error will be raised. E.g.
textSearch:wordwith a variable as object. - Multiple Occurrences of Predicates: If predicates occur multiple times in one configuration, an error will be raised.
- Config linked to multiple Text Variables: If manually defining a configuration variable and linking that to different text variables, an error will be raised. It is good practice to use the [ ] syntax and thus not manually defining the configuration variables.
- Missing necessary Predicates: If predicate
textSearch:containsis missing inside theSERVICE, an error will be raised. If a configuration doesn't contain exactly one occurrence of eithertextSearch:wordortextSearch:entity, an error will be raised. - Word and Entity Search in one Configuration: If one configuration contains both predicates
textSearch:wordandtextSearch:entity, an error will be raised. - Entity Search on a Text without Word Search: If a configuration with
textSearch:entityis connected to a text variable that is not connected to a configuration withtextSearch:word, an error will be raised. - Empty Word Search: If the object of
textSearch:wordis an empty literal, an error will be raised.