Lately I've been using Virtuoso to run some SPARQL queries. Here is my quick setup.

I also provide a custom configuration file (for machines with more memory), a setup for working with a RAM disk (for fast read-only data), a GitHub gist, and instructions for loading data.

Update

The new docker image is at openlink/virtuoso-opensource-7. The code snippets have been updated accordingly.

Setup Docker container for Virtuoso

The people at OpenLink provide a Docker image for the open-source version of their software. We pull it, then prepare a folder for the database (so that killing the container does not lose it) and a folder for the data to be imported. I also provide a customized virtuoso.ini file.

docker pull openlink/virtuoso-opensource-7:latest


mkdir -p database
cp virtuoso.ini.example database/virtuoso.ini

mkdir -p import

We run the container with the database and import folders mounted as volumes; here the container is named vos. Note that I do not use --rm, so that I can restart the container if I want. If you add --rm, the container is removed automatically when it stops.

docker run --name vos -d \
           -v "$(pwd)/database":/database \
           -v "$(pwd)/import":/import \
           -t -p 1111:1111 -p 8890:8890 -i openlink/virtuoso-opensource-7:latest
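
To check that the server actually came up, something like the following sketch can poll the SPARQL endpoint. It assumes curl is installed on the host and that the endpoint is reachable at http://localhost:8890/sparql through the port mapping above. The function name wait_for_sparql is only an illustrative helper, not part of Virtuoso or Docker.

```shell
# Poll the SPARQL endpoint until it answers or the timeout (in seconds) expires.
# Assumption: the default URL matches the 8890 port mapping used in this post.
wait_for_sparql() {
  url="${1:-http://localhost:8890/sparql}"
  timeout="${2:-60}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -s: silent, -f: treat HTTP errors as failures, -o: discard the body
    if curl -sf -o /dev/null --max-time 2 "$url"; then
      echo "Virtuoso is answering at $url"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "No answer from $url after about ${timeout} attempts" >&2
  return 1
}
```

For example, `wait_for_sparql http://localhost:8890/sparql 120` keeps trying for roughly two minutes. If it gives up, `docker logs vos` shows what the server printed at startup.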

The commands above require a custom virtuoso.ini file (provided here). My main edits come from the need to query a large dataset and to process large result sets. More information on the parameters can be found in the official documentation.

My edits below are for a machine with ~64GB of RAM, and may not be optimal in general, so YMMV.

  1. Allow access to the /import folder, where we put the files to be imported

    DirsAllowed       = ., /opt/virtuoso-opensource/vad, /import
    
  2. Change the memory size thresholds: uncomment the following two lines, and comment out the two corresponding default lines (comments start with ;)

    NumberOfBuffers  = 4000000
    MaxDirtyBuffers  = 3000000
    
    ;NumberOfBuffers = 10000
    ;MaxDirtyBuffers = 6000
    

    A few lines earlier, you may also want to change

    MaxQueryMem    = 4G       ; memory allocated to query processor
    VectorSize     = 2000     ; initial parallel query vector (array of query operations) size
    MaxVectorSize  = 20000000 ; query vector size threshold.
    
  3. Longer keep-alive for large queries

    KeepAliveTimeout    = 30
    
  4. Allow for larger result sets

    ResultSetMaxRows            = 50000
    MaxQueryCostEstimationTime  = 0  ; in seconds
    MaxQueryExecutionTime       = 600   ; in seconds
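
For machines with a different amount of RAM, the two buffer settings can be estimated instead of copied. The sketch below uses a commonly quoted rule of thumb: about two thirds of RAM for buffers, roughly 8000 bytes per buffer, and MaxDirtyBuffers at three quarters of NumberOfBuffers. Treat both the rule and the helper name suggest_buffers as assumptions, and check the result against the comments in your own virtuoso.ini.

```shell
# Estimate buffer settings for a machine with the given amount of RAM in GB.
# Assumed rule of thumb (not taken from this post):
#   NumberOfBuffers = RAM in bytes * 0.66 / 8000
#   MaxDirtyBuffers = 3/4 of NumberOfBuffers
suggest_buffers() {
  ram_gb="$1"
  ram_bytes=$((ram_gb * 1024 * 1024 * 1024))
  buffers=$((ram_bytes * 66 / 100 / 8000))
  dirty=$((buffers * 3 / 4))
  echo "NumberOfBuffers  = $buffers"
  echo "MaxDirtyBuffers  = $dirty"
}
```

Calling `suggest_buffers 64` gives values somewhat higher than the ones I use above, which leaves less memory for everything else on the machine. Rounding down is the safer direction.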
    

A Gist

The contents of this readme and of the ini file can be found in this GitHub gist.

Add a comment there if you have any feedback.

Use a RAM disk (8 GB in this example)

This setup is meant for read-only use, to get faster queries. The RAM disk is volatile, so all edits will be lost when it is unmounted or the machine reboots.

sudo mkdir -p /media/ramdisk1
sudo mount -t tmpfs -o size=8192M tmpfs /media/ramdisk1

docker run --name vos-ram7 -d \
           -v /media/ramdisk1/database:/database \
           -v "$(pwd)/import":/import \
           -t -p 1111:1111 -p 8890:8890 -i openlink/virtuoso-opensource-7:latest
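
The docker run above mounts /media/ramdisk1/database, but a freshly mounted tmpfs is empty. If the goal is to serve an already loaded database from memory, one possible approach is to copy the on-disk database folder into the RAM disk before starting the container. The sketch below assumes the folder layout used in this post, and copy_db_to_ramdisk is a hypothetical helper name. It is safest to stop the vos container first (docker stop vos) so the files are not changing during the copy.

```shell
# Copy an on-disk Virtuoso database folder into the RAM disk.
# Assumption: both default paths are the ones used in this post.
copy_db_to_ramdisk() {
  src="${1:-$(pwd)/database}"
  dst="${2:-/media/ramdisk1/database}"
  if [ ! -d "$src" ]; then
    echo "Source folder $src not found" >&2
    return 1
  fi
  mkdir -p "$dst"
  # -a keeps permissions and timestamps; "/." copies the folder contents,
  # including hidden files, rather than the folder itself
  cp -a "$src"/. "$dst"/
}
```

With no arguments it copies ./database to /media/ramdisk1/database. The tmpfs size (8 GB here) has to be large enough to hold the database files.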

Run the CLI

docker exec -it vos isql 1111

Create graphs

SPARQL CREATE GRAPH <http://www.purl.com/test/my_graph>;

Import data

delete from DB.DBA.load_list;
ld_dir ('/import', 'my_file.ttl', 'http://www.purl.com/test/my_graph');
rdf_loader_run ();
checkpoint;
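
The same statements can also be generated from the shell and piped into isql, which is handy when loading several files. This is only a sketch: make_load_script is a hypothetical helper, it assumes the container is named vos as above, and the final SELECT on DB.DBA.load_list is my addition to show the load state and any error per file.

```shell
# Print the isql statements that bulk-load the files matching a mask into one graph.
# Arguments: folder inside the container, file mask, target graph IRI.
make_load_script() {
  dir="$1"
  mask="$2"
  graph="$3"
  cat <<EOF
DELETE FROM DB.DBA.load_list;
ld_dir ('$dir', '$mask', '$graph');
rdf_loader_run ();
checkpoint;
SELECT ll_file, ll_state, ll_error FROM DB.DBA.load_list;
EOF
}

# Usage (needs the running "vos" container):
#   make_load_script /import my_file.ttl http://www.purl.com/test/my_graph \
#     | docker exec -i vos isql 1111
```

In the output of the last statement, an ll_state of 2 means the file was processed, and ll_error is empty when the load succeeded.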

Warning

The loader in the new Docker image at openlink/virtuoso-opensource-7 has checkpoints disabled, so you risk not being able to access your data after you stop and restart the container (see this GitHub issue for details). So remember to run the checkpoint; command after loading, as explained in point 6 of the official guide.

Check existing graphs

SPARQL
SELECT DISTINCT ?g
WHERE
{
  GRAPH ?g {?s ?p ?t}
}
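
The same check can be run from the host without opening isql, by sending the query to the HTTP SPARQL endpoint. A minimal sketch, assuming curl is installed and the endpoint is exposed on port 8890 as in the docker run command; list_graphs is just an illustrative name.

```shell
# List the named graphs by querying the SPARQL endpoint over HTTP.
# Assumption: the default URL matches the 8890 port mapping used in this post.
list_graphs() {
  endpoint="${1:-http://localhost:8890/sparql}"
  # -G sends the url-encoded query as a GET parameter; results come back as CSV
  curl -sf --max-time 30 -G "$endpoint" \
    -H 'Accept: text/csv' \
    --data-urlencode 'query=SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } }'
}
```

Running `list_graphs` prints one graph IRI per line after a header row. On a large store this query can be slow, since it touches every graph.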