Lately I've been using Virtuoso to run some SPARQL queries.
Here is my quick setup.
I also provide a custom configuration file (for machines with more memory), a setup for working with a RAM disk (for fast read-only data), a GitHub gist, and instructions for loading data.
Update
The new Docker image is at openlink/virtuoso-opensource-7.
The code snippets have been updated accordingly.
Setup Docker container for Virtuoso
The people at OpenLink provide a Docker image for the open-source version of their software.
So we will pull that, then prepare a folder for our data (so that we do not lose the database if we kill the container) and a folder for the data to be imported.
I also provide a customized virtuoso.ini file.
docker pull openlink/virtuoso-opensource-7:latest
mkdir -p database
cp virtuoso.ini.example database/virtuoso.ini
mkdir -p import
We run the container with the database and import folders mounted as volumes; here the container is named vos. Note that I do not use --rm, so that I can restart the container if I want. You can add --rm, and then the container will be removed automatically when it stops.
docker run --name vos -d \
--volume `pwd`/database:/database \
-v `pwd`/import:/import \
-t -p 1111:1111 -p 8890:8890 -i openlink/virtuoso-opensource-7:latest
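Virtuoso can take a little while to start, especially with a large database. The sketch below polls the SPARQL endpoint on the published port 8890 until it answers. The function name wait_for_virtuoso and the default of 30 attempts are my own choices for illustration, and it assumes curl is installed on the host.

```shell
# Hypothetical helper: poll the SPARQL endpoint until it responds.
# Arguments: endpoint URL (default http://localhost:8890/sparql)
#            and maximum number of attempts (default 30, an arbitrary choice).
wait_for_virtuoso() {
  url="${1:-http://localhost:8890/sparql}"
  tries="${2:-30}"
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    # --fail makes curl return non-zero on HTTP errors
    if curl --silent --fail --output /dev/null "$url"; then
      echo "Virtuoso answered after $attempt attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep 1
  done
  echo "Virtuoso did not answer after $tries attempts" >&2
  return 1
}
```

Call wait_for_virtuoso right after docker run; if it fails, docker logs vos usually shows what went wrong.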
The setup above requires a custom virtuoso.ini file (provided here).
My main edits come from the need to query a large dataset and to process large result sets.
More information on the parameters can be found in the official documentation.
My edits below are for a machine with ~64GB of RAM and may not be optimal in general, so YMMV.
- Allow access to the /import folder, where we put the files to be imported
DirsAllowed = ., /opt/virtuoso-opensource/vad, /import
- Change the memory size thresholds: uncomment the following two lines, and comment out the corresponding two default lines below them (comment with ;)
NumberOfBuffers = 4000000
MaxDirtyBuffers = 3000000
;NumberOfBuffers = 10000
;MaxDirtyBuffers = 6000
A few lines earlier you may also want to change
MaxQueryMem = 4G ; memory allocated to query processor
VectorSize = 2000 ; initial parallel query vector (array of query operations) size
MaxVectorSize = 20000000 ; query vector size threshold.
- Longer keep-alive for large queries
KeepAliveTimeout = 30
- Allow for larger result sets
ResultSetMaxRows = 50000
MaxQueryCostEstimationTime = 0 ; in seconds
MaxQueryExecutionTime = 600 ; in seconds
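To adapt the buffer settings to a machine with a different amount of RAM, a rough calculation can help. The sketch below assumes a rule of thumb that is not taken from this post: give Virtuoso about 2/3 of the RAM, count roughly 8 KB per buffer, and set MaxDirtyBuffers to about 3/4 of NumberOfBuffers. The function name virtuoso_buffers is hypothetical, and the result should be treated as an upper estimate and checked against the official documentation. The values I use above are lower than what this yields for 64GB, which leaves headroom for other processes.

```shell
# Hypothetical helper: estimate buffer settings from the RAM size in GB.
# Assumptions (rule of thumb, not from the original setup):
#   - about 2/3 of RAM goes to Virtuoso buffers
#   - each buffer takes roughly 8192 bytes
#   - MaxDirtyBuffers is about 3/4 of NumberOfBuffers
virtuoso_buffers() {
  ram_gb="$1"
  buffers=$(( ram_gb * 1024 * 1024 * 1024 * 2 / 3 / 8192 ))
  dirty=$(( buffers * 3 / 4 ))
  printf 'NumberOfBuffers = %s\nMaxDirtyBuffers = %s\n' "$buffers" "$dirty"
}

# Example: print an estimate for a 16GB machine
virtuoso_buffers 16
```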
A Gist
The contents of this README and of the ini file can be found on this GitHub gist.
Add a comment there if you have any feedback.
To use a RAM disk (8GB in this example)
This setup is intended for READ-ONLY use, to get faster query performance. All edits will be lost, since the contents of a RAM disk do not survive an unmount or a reboot.
sudo mkdir -p /media/ramdisk1
sudo mount -t tmpfs -o size=8192M tmpfs /media/ramdisk1
docker run --name vos-ram7 -d \
--volume /media/ramdisk1/database:/database \
-v `pwd`/import:/import \
-t -p 1111:1111 -p 8890:8890 -i openlink/virtuoso-opensource-7:latest
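The commands above do not show how the database files get onto the RAM disk. One possible way, assuming the database folder from the earlier setup already holds the loaded data and the vos container has been stopped, is to copy that folder to the RAM disk before starting vos-ram7. The function name stage_database is hypothetical.

```shell
# Hypothetical helper: copy an existing database folder onto the RAM disk.
# Assumes the source container is stopped, so the files are in a consistent state.
stage_database() {
  src="$1"   # e.g. ./database
  dst="$2"   # e.g. /media/ramdisk1/database
  if [ ! -d "$src" ]; then
    echo "source folder $src does not exist" >&2
    return 1
  fi
  mkdir -p "$dst"
  # "$src"/. copies the folder contents, including hidden files
  cp -a "$src"/. "$dst"/
}
```

For the paths used in this post, that would be stage_database ./database /media/ramdisk1/database. Make sure the RAM disk is large enough for the database files.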
Run the CLI
docker exec -it vos isql 1111
Create graphs
SPARQL create GRAPH <http://www.purl.com/test/my_graph>;
Import data
delete from DB.DBA.load_list;
ld_dir ('/import', 'my_file.ttl', 'http://www.purl.com/test/my_graph');
rdf_loader_run ();
checkpoint;
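After the loader has run, it can be useful to check whether every file was actually loaded. The sketch below runs a query on the DB.DBA.load_list table from the host shell, using the same docker exec form as elsewhere in this post. In that table, ll_state is 0 for files not yet loaded, 1 for files being loaded, and 2 for files that are done, while ll_error holds the error message, if any. The function name show_load_status is hypothetical.

```shell
# Hypothetical helper: list the bulk loader status for each registered file.
# Argument: container name (default vos, as in the setup above).
show_load_status() {
  container="${1:-vos}"
  docker exec -i "$container" isql 1111 \
    exec="SELECT ll_file, ll_state, ll_error FROM DB.DBA.load_list;"
}
```

A file with ll_state 2 and an empty ll_error was loaded without problems.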
Use the CLI to run a script
Assuming you have a script called import.isql in the local import folder, e.g., containing the ld_dir and rdf_loader_run commands above, you can run the following to execute that script.
docker exec -it vos isql 1111 exec="LOAD /import/import.isql"
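As a sketch, such a script can be created from the host shell with a here-document. The content simply repeats the import commands shown earlier, with the same example file name and graph; adjust both to your data.

```shell
# Write the import commands from the "Import data" step into import/import.isql.
# my_file.ttl and the graph IRI are the example values used earlier in this post.
mkdir -p import
cat > import/import.isql <<'EOF'
DELETE FROM DB.DBA.load_list;
ld_dir ('/import', 'my_file.ttl', 'http://www.purl.com/test/my_graph');
rdf_loader_run ();
checkpoint;
EOF
```

Since the local import folder is mounted at /import in the container, the script is then visible to isql as /import/import.isql.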
Warning
The loader in the new Docker image at openlink/virtuoso-opensource-7 has checkpoints disabled, and you risk not being able to access your data after you stop and restart the container (see this GitHub issue for details).
So remember to run the checkpoint; command after loading, as explained in point 6 of the official guide.
Check existing graphs
SPARQL
SELECT DISTINCT ?g
WHERE
{
  GRAPH ?g { ?s ?p ?t }
};