This year I was lucky enough to participate in some of the major conferences in the areas of databases, data management, the semantic web, and information retrieval, namely (and in chronological order): ESWC'19, SIGMOD'19, SIGIR'19, and VLDB'19.
I've listened to great presentations and I've been exposed to exciting problems, research topics, and ideas.
There are brilliant trip reports about these conferences: I warmly invite you to read those as well (some are linked below) and to dig into the respective conference proceedings. Here I share my experience, point to some of the works that most resonated with me, and reference the work I presented at each venue.
I was there presenting the early results of a collaboration with the BONSAI organization in the effort to build «An Open Source Dataset and Ontology for Product Footprinting»: a truly interdisciplinary effort with the goal of making the science of life-cycle assessment more transparent and more reproducible. Incidentally, this work was awarded the best poster.
Interesting tidbits: the power of Knowledge Graphs...
It should not surprise anyone that the words “knowledge” and “graphs” appeared multiple times on the large screens of the Hotel Bernardin.
One of the presentations that put knowledge graphs under the spotlight was the keynote by Peter Haase. It was an unconventional keynote in a sense, since, instead of a deck of slides, the talk was a knowledge graph itself, explored live! It really showed how natural it is to organize and explore knowledge as entities connected by facts, and how this allows connecting multiple domains as well as heterogeneous sources of information.
Once information is accessible in the form of a knowledge graph, a lot of interesting possibilities arise. You can enhance dialogue systems, provide more interesting and personalized recommendations, better understand vague user queries, or better search for information in an enterprise data lake.
Another very interesting topic relates to the unique possibility, offered by RDF and SPARQL, of answering queries over information that is not materialized in your knowledge graph, but derived from its structure and annotations.
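As a toy illustration of the idea (all names here are hypothetical, and this is plain Python rather than an actual triplestore), consider RDFS-style type entailment: the answer «rex is an Animal» is never stated explicitly in the data, it is derived by following the class hierarchy:

```python
# Toy illustration of RDFS-style entailment (not any specific system):
# the explicit triples only state rex's direct type and the class
# hierarchy; the answer "rex is an Animal" is derived, not materialized.

subclass_of = {          # hypothetical tiny ontology (rdfs:subClassOf)
    "Dog": "Mammal",
    "Mammal": "Animal",
}
direct_type = {"rex": "Dog"}  # hypothetical instance data (rdf:type)

def entailed_types(entity):
    """All classes of `entity`, following rdfs:subClassOf transitively."""
    types = []
    cls = direct_type.get(entity)
    while cls is not None:
        types.append(cls)
        cls = subclass_of.get(cls)
    return types

print(entailed_types("rex"))  # ['Dog', 'Mammal', 'Animal']
```

In SPARQL 1.1 the same derivation can be expressed declaratively with a property path such as `ex:rex a/rdfs:subClassOf* ?cls`, without ever materializing the inferred triples.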
Different techniques and tools were also presented to retrieve textual evidence of facts contained in a knowledge graph, enrich a knowledge graph through relation extraction (and you can do that with the help of a human, employing an example-based approach), build a fully decentralized p2p repository of semantic data, embed reasoning capabilities in mobile applications, and visualize and edit ontologies.
... and more
In general, I've found a community deeply involved in having an impact on real-world use cases (the In-Use track was very interesting), as well as in openly sharing results and resources (there is actually a Resources Track!).
Only a few weeks after ESWC, I visited Amsterdam for the International Conference on Management of Data (SIGMOD).
I was there with my colleagues (and partners in crime) Davide Mottin, Themis Palpanas, and Yannis Velegrakis to present an updated and much-extended version of our tutorial on exploratory search, whose contents are based on our recent book «Data Exploration using Example-based Methods».
The venue, the organization, and the conference as a whole were outstanding! Not to mention the best conference badges I've ever seen.
The conference is a leading venue when it comes to databases and data management, and had a massive audience of more than 1000 people. I cannot hope my insights will do justice to the wide array of important topics covered; I invite you once more to look at the conference proceedings and at the other trip report, again by Paul Groth, while below I just highlight a few things.
Graphs, graphs, graphs...
There were three sessions about graphs, and graphs also appeared in other contexts. Some papers and demos I took note of are (in no particular order):
- PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs, by Wei et al.
- Experimental Analysis of Streaming Algorithms for Graph Partitioning, by Pacaci and Özsu
- Fractal: A General-Purpose Graph Pattern Mining System, by Dias et al.
- Interactive Graph Search, by Tao et al.
- Optimizing Declarative Graph Queries at Large Scale, by Zhang et al.
- Efficient Subgraph Matching: Harmonizing Dynamic Programming, Adaptive Matching Order, and Failing Set Together, by Han et al.
- CECI: Compact Embedding Cluster Index for Scalable Subgraph Matching, by Bhattarai et al.
- Large Scale Graph Mining with G-Miner, by Chen et al.
- NeMeSys - A Showcase of Data Oriented Near Memory Graph Processing, by Krause et al.
- NAVIGATE: Explainable Visual Graph Exploration by Examples, by Namaki et al.
As a side note, I particularly enjoyed the style of the presentation of «Interactive Graph Search»: it embodied the idea that a paper presentation should convey the important information about the research while leaving you with an honest desire to read the paper for more details.
In general, graphs are really ubiquitous and they continue to be a prolific field of study (read below also about the VLDB keynote by Tamer Özsu). Systems that tackle some of the important challenges of graph data management are presented every year in all the relevant venues. I'm wondering, though, whether we - as a community - could do more to have these papers translate into solutions adopted in practice. This is a well-known issue in many areas, but now that graph data management is in the spotlight, I feel we have a unique opportunity to have real impact.
On the general topic of graphs in the real world, the 12th Technical User Community (TUC) meeting of the LDBC council took place co-located with SIGMOD. The program shows quite a packed schedule, with members from both industry and academia discussing graph database systems, graph benchmarking, and query languages. It was great and inspiring!
Among the various presentations, my personal highlight was the talk by Vasileios Trigonakis (Oracle) on experiences and limitations in evaluating their distributed graph query engine with LDBC. My personal takeaway is that we need to work on micro-benchmarks, both to complement existing benchmarks and to enable an in-depth understanding of the performance of graph database systems. This topic will come back later in my notes about this year's VLDB as well.
Information Discovery and what is Interesting
At one of the industry sessions, there were two papers about the automatic discovery of interesting insights. The first, «Quick Insight: Interesting pattern discovery» by Justin Ding et al. (Microsoft), defines a collection of interesting patterns, which they call insights: for example, a rising trend, or an outlier among a set of otherwise similar data points. They then devised a systematic mining framework to discover such patterns efficiently and integrated it into one of their products. Among other things, they also devised methods to skip patterns that are trivial (e.g., a linear correlation between two values connected by a functional dependency).
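To make the outlier flavour of insight concrete, here is a minimal sketch (my own toy example with made-up data, not the paper's actual algorithm): a data point counts as an outlier insight when it sits far from the median of its otherwise similar peers, measured in median-absolute-deviation units so the outlier itself does not mask the test:

```python
import statistics

def outlier_insights(points, k=3.0):
    """Flag points far from the median, using the median absolute
    deviation (MAD), which is robust to the outlier we are hunting."""
    values = list(points.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [name for name, v in points.items()
            if mad > 0 and abs(v - med) > k * mad]

# Hypothetical monthly sales per region: one region clearly dominates.
sales = {"North": 102, "South": 98, "East": 105, "West": 310}
print(outlier_insights(sales))  # ['West']
```

A real system would, of course, also rank such candidate insights by significance and filter the trivial ones, as the paper describes.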
The second paper, by Flip Korn et al., presented by Cong Yu (Google), was about «Automatically Generating Interesting Facts from Wikipedia Tables». They use this method to augment entity knowledge panels, i.e., those small panels of information shown when we perform an entity-centric search, with fun facts. For instance, if the user searches for the movie Frozen, the knowledge panel, which provides facts like the release date and the director, could also show that «Frozen is the highest-grossing animated film of all time». Such facts are extracted from superlative tables in Wikipedia, e.g., the table of the top 10 highest-grossing animated movies.
The general premise of both methods is that interestingness can be more easily defined as something unconventional or special (an outlier, dominance in some ranking, etc.). This is understandable: it caters to a universal definition of interestingness, which makes it a safer bet when it comes to fun facts.
Yet, this does not tap into the user's information need and intention beyond what is explicit in the query. When dealing with data exploration, identifying the user's intention can help propose exploratory directions that allow the user to find more relevant information and better understand the information available. For instance, Frozen is based on a story by Hans Christian Andersen, like at least 32 other movies; this could be a non-trivial dimension worth exploring (compared to suggesting movies with the same director), even though it doesn't represent any exceptionality.
We presented a special version of our tutorial, titled «Example-driven Search: a New Frontier for Exploratory Search» at the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval in Paris.
Our goal was to do our part in fostering collaborations on the topic of exploratory search at the intersection of information retrieval, data management, and data mining. We focused on exploratory analysis techniques based on examples that can be easily applied to improve or extend tools and systems for advanced information retrieval applications, with both structured and unstructured data.
While in the database community «Query by Example» by Zloof (1975) is the seminal work, in the IR community querying by example documents has its roots in Rocchio's work on «Relevance Feedback in Information Retrieval» (1971), and in the general idea of query-by-content, which later found extensive success in image retrieval, for example.
As a side note, in parallel to our tutorial, another one was taking place, on Learning to Rank in Theory and Practice: From Gradient Boosting to Neural Networks and Unbiased Learning. I didn't manage to attend it, but I heard it was great and I'm planning to study these slides soon.
Also, the welcome reception inside the Muséum d'Histoire Naturelle, with the museum all to ourselves and the effects of the storm simulator, was really something else!
Interactive Search, Knowledge Graphs, and Explainable recommendations
The opening keynote of the conference was by Bruce Croft and was on the importance of interaction for information retrieval. My main takeaways from the keynote are:
- Interaction is necessary for effective information access.
- Traditional Search puts the burden on the user (in specifying what they are looking for).
- To overcome this limitation, we require a system that explicitly models interaction and user intent.
- That will allow personalized browsing and guided assistance, especially for exploratory search.
Given that I was there to talk about exploratory search, much of the content of the keynote resonated deeply with me. Bruce also described iterative search as a process where the system can show some examples and ask what looks relevant and what does not. Moreover, he highlighted exploratory search as a dynamic, ongoing process that has to be modeled as a whole. In particular, he argued, the system should explicitly model the history of the search, and should be able to ask clarifying questions when not confident in the answer. Referencing the seminal work by Marchionini (2006), he stressed that, to support knowledge understanding, the focus should move beyond “one-shot retrieval” and toward “intent-aware response retrieval”.
Knowledge Graphs also had significant representation at SIGIR. Some of the works that caught my attention were:
- Network Embedding and Change Modeling in Dynamic Heterogeneous Networks, by Bial et al.
- Embedding Edge-attributed Relational Hierarchies, by Chen and Quirk.
- M-HIN: Complex Embeddings for Heterogeneous Information Networks via Metagraphs, by Fang et al.
- A Scalable Virtual Document-Based Keyword Search System for RDF Datasets, by Dosso and Silvello.
- Personal Knowledge Base Construction from Text-based Lifelogs, by Yen et al.
- ENT Rank: Retrieving Entities for Topical Information Needs through Entity-Neighbor-Text Relations, by Dietz.
I'm really fascinated by the idea of personal knowledge bases (and personal knowledge graphs), and I think embedding methods for KGs are still in the early stages and missing quite a lot of the potential a KG has to offer. If you are interested in these topics, and have read so far, we should definitely talk!
On a completely different topic, I found the work by Balog and colleagues on «Transparent, Scrutable and Explainable User Models for Personalized Recommendation» extremely compelling. They demonstrate how a set-based recommendation technique, which is simple to understand, allows the user model (that is, the reasoning behind the recommendation) to be explicitly presented to users in natural language. For example, their approach produces recommendation explanations like “You don't like movies that are tagged as adventure, unless they are tagged as thriller, such as Twister.” This, in turn, enables explainable recommendations and allows the user to provide feedback to the system (to improve the model). Moreover, all this comes without any significant loss in the quality of the recommendations!
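The following is a deliberately tiny sketch of the set-based idea (my own illustration with a hypothetical catalogue, not Balog et al.'s actual model): because the user model is just sets of liked and disliked tags derived from the user's ratings, the model itself can be rendered directly as a natural-language explanation:

```python
# Toy, set-based, scrutable user model (inspired by, not reproducing,
# Balog et al.): liked tags come straight from positively rated items,
# so the explanation *is* the model.

items = {   # hypothetical catalogue: item -> tags
    "Twister": {"thriller", "disaster"},
    "Up":      {"adventure", "family"},
    "Jaws":    {"thriller", "animal"},
    "Cars":    {"family", "racing"},
}

def user_model(ratings):
    """Build liked/disliked tag sets from +1/-1 item ratings."""
    liked, disliked = set(), set()
    for item, score in ratings.items():
        (liked if score > 0 else disliked).update(items[item])
    return liked, disliked - liked

def recommend(ratings):
    liked, disliked = user_model(ratings)
    recs = [i for i, tags in items.items()
            if i not in ratings and tags & liked and not tags & disliked]
    explanation = (f"You like items tagged {sorted(liked)} "
                   f"and dislike items tagged {sorted(disliked)}.")
    return recs, explanation

recs, why = recommend({"Twister": 1, "Up": -1})
print(recs)  # ['Jaws']
print(why)
```

Because the model is a pair of tag sets, a user who disagrees with the explanation can scrutinize and correct it directly, which is exactly the feedback loop the paper argues for.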
I think this approach of building models that are explainable by construction, instead of trying to concoct post-hoc explanations of why some involved neural network hallucinated some recommendation, is a highly promising direction.
Last, but not least, I also had the opportunity to cross the ocean and join the other premier venue for databases and data management, the 45th International Conference on Very Large Data Bases (VLDB).
I was there with my colleague Martin Brugnara to present the work we did together with Yannis Velegrakis on benchmarking graph databases. In this work, we present the first micro-benchmarking framework for assessing the functionalities of existing graph database systems, and we provide a comprehensive study of those systems to understand their capabilities and limitations.
We strongly believe this work to be particularly timely: as graph database systems reach maturity and widespread use, we can start discussing in detail their architectures, alongside the advantages and drawbacks of the various implementation alternatives. The presentation at the LDBC event at SIGMOD I discussed above, and the keynote by Tamer Özsu at this VLDB, strongly reinforced this belief in me.
Keynotes: Self-driving Databases, Graph Data Management Systems, and getting rid of data
Before moving to the opening keynote of the main conference, let me just leave a pointer here to the keynote by Andy Pavlo at the “AI for Databases” workshop. Andy presented his experiences with the challenges of self-driving databases. As usual, thanks to his unconventional style, it was quite an enjoyable and informative talk. Not surprisingly, moving towards a data management system that adapts in near-real-time to changes in workload and data at scale requires overcoming quite a few obstacles. I warmly invite you to watch the talk recording at the link above and let Andy explain the details better than I could.
The opening keynote at the main conference, instead, was on the open problems in graph processing (slides). Prof. Özsu opened with the multiple disciplines and research areas across which research on graphs is fragmented: in particular, knowledge graphs and the semantic web, graph DBMSs, and graph analytics systems. His keynote, as well as his research expertise, covered a great deal of topics, including RDF Engines, Graph DBMSs, Graph Analytics Systems, and Dynamic & Streaming Graphs. His recent work, presented at the past edition of VLDB, was exactly about the fact that graphs are everywhere, and not just in social networks: product, web, financial, infrastructure, and knowledge graphs as well. This wide range of domains is matched by a corpus of methods and approaches of comparable size. He also addressed the eternal debate between scale-out and scale-up for graph management solutions, arguing that the gigantic size of today's graph datasets, along with the rich amount of non-trivial information they store, can only be met by scale-out strategies. This argument, as well as the opposing view, have been expressed quite eloquently across two issues of the IEEE Internet Computing journal, in the article «Scale Up or Scale Out for Graph Processing?» and the corresponding response article.
Speaking of RDF systems, Özsu's analysis suggests that a 1-to-1 mapping to the relational model (the single-table approach, for instance) is not ideal. The open problems in this area comprise how to scale out, what the best storage architecture is, full implementation of the SPARQL query language, the computational cost of entailment in the Semantic Web, ensuring data quality and efficient RDF data cleaning, handling streaming RDF data, and the new challenges of having RDF data management embedded in IoT devices. Moreover, he suggested that the DB community could get more involved with the Semantic Web community (and vice versa) in the study and development of performant RDF management systems (this reminds me of Ruben Verborgh's opinion on the overlooked 20% of engineering effort in the Semantic Web). I, for one, am highly interested in doing my part in this effort!
Speaking of graph data management systems (GDBMSs), the main difference from the Semantic Web stack is the property graph model, where properties are directly associated with edges and nodes. These systems are highly optimized for online workloads (OLTP), and their workloads are usually more skewed towards traversal queries (e.g., paths and reachability), as well as subgraph search, as in RDF triplestores. The open issues he highlighted concerned current graph query processing techniques. In particular, if I understood correctly, his view is that the strategy of processing structure and data separately is likely to provide sub-optimal performance. He also stressed how the poor locality of graph workloads renders traditional caching techniques less effective. Moreover, as for triplestores, it is unclear which storage system works best. He also highlighted how there is too much focus on homogeneous graphs (graphs with a single edge type and possibly no attributes). Finally, he called for more work on benchmarking, and in particular micro-benchmarking, in order to understand how each component of a GDBMS works and performs under different circumstances. It goes without saying that I was quite happy to point him to our work on the graph database micro-benchmark.
I am much less acquainted with the literature on graph analytics and graph streaming, so this part of the talk was quite instructive, especially the limitations of applying MapReduce directly, and the differences between the Bulk Synchronous Parallel (BSP) and Gather-Apply-Scatter (GAS) paradigms. He also pointed out quite a gap in the exploration of the design space beyond these two paradigms. As for open issues, he pointed to OLAP-style processing on graphs, integration into data science workflows, and the current need to support ML workloads over graphs. The topic of streaming graphs seems to have received more attention only recently, and he provided a compelling distinction between dynamic graphs and streaming graphs: in the first case we want to keep the whole picture up to date, while in the second we keep a window of changes (insertions/deletions) and reason only within that window.
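To fix ideas on the BSP side, here is a minimal vertex-centric sketch in the spirit of Pregel (my own toy example, not taken from the talk), computing connected components: in every superstep each vertex reads its neighbours' labels from the previous superstep and adopts the minimum, with a global barrier in between, until all labels stabilize:

```python
# Minimal sketch of the Bulk Synchronous Parallel (BSP) vertex-centric
# model, shown on connected-components label propagation. All labels
# are read from the previous superstep, mimicking the BSP barrier.

def connected_components(adj):
    labels = {v: v for v in adj}            # superstep 0: own vertex id
    changed = True
    while changed:                          # one loop = one superstep
        changed = False
        new_labels = {}
        for v, neighbours in adj.items():
            # "messages" are the neighbours' labels from last superstep
            best = min([labels[v]] + [labels[u] for u in neighbours])
            new_labels[v] = best
            changed |= best != labels[v]
        labels = new_labels                 # barrier: publish new state
    return labels

# Hypothetical undirected graph as adjacency lists: two components.
edges = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
print(connected_components(edges))  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

In the GAS decomposition the same per-vertex update would be split into gather (collect neighbour labels), apply (take the minimum), and scatter (notify neighbours) phases, which is what enables edge-level parallelism in systems built on that paradigm.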
The third keynote that I want to highlight was by Tova Milo, on Getting Rid of Data (slides). The idea is simple to state: we are producing too much data! Not only is it already infeasible to store all of it, but most of it is of no use, so we had better keep only the most useful parts (for some definition of useful).
The idea seems outrageous, but Prof. Milo painted an extremely convincing picture. She argued that not all data is equally important, and distinguished between
- critical data, which cannot be lost at any time,
- important data, which we need but can live without (it can perhaps be recomputed at some cost), and
- potentially important data, which may be important, but not right now.
Given that, if we do not pick ourselves which data to forfeit, the circumstances will decide for us, she moved on to the challenges we need to face: in particular, the balance between size and importance, the question of which data is easy to summarize and how to summarize it, and, finally, how to automatically generate a data disposal policy. Particularly interesting was the idea of an automated exploration agent (deep reinforcement learning?) that would move around and try to discover data that can be important and data that can be disposed of.
Other interesting tidbits: Data lakes, Timely Dataflow, and Data Exploration
The other topics that attracted my attention at this conference were:
- The tutorial on data lakes.
- A work by Lai et al. that studies how to perform distributed subgraph matching on Timely Dataflow.
- Optimization for Active Learning-based Interactive Database Exploration, by Huang et al.
- A vision paper on Exploring Change: A New Dimension of Data Analytics, by Bleifuß et al.
- Example-Driven Query Intent Discovery: Abductive Reasoning using Semantic Similarity, by Fariha and Meliou.
- The poster of the VLDBJ survey on Summarizing Semantic Graphs.
- As well as a very interesting demo on A Modular Framework for Analytical Exploration of RDF Graphs.
To summarize this quite intense conference season, I'm happy to report on the ubiquity of Knowledge Graphs. They are adopted in multiple forms to enhance systems and algorithms in many tasks. In my view, KGs are the perfect tool to build those intent-aware data exploration systems that can help us find our route in this immense sea of data. At the same time, KGs, being graphs, require efficient graph data management systems. I have the feeling that such systems will arise from the intersection of triplestores and property graphs.
Participating in such important events was a privilege, but I cannot avoid thinking of the number of planes I took and of the impact they had. As highlighted by many sources, among which the SIGPLAN initiative on Climate Change: «Air travel is a significant source of greenhouse gas emissions, which in turn are a significant contributor to climate change».
I've already read or heard somewhere the idea that some of the major conferences should join forces, and maybe organize to be in the same place at almost the same time, or even take place only once every other year. I'm not sure what the best solution would be, but I'll certainly start being more mindful of the issue.