We share here the Freebase ExQ Data Dump: a cleaned version of the triplets from the Freebase knowledge graph, from which metadata and most extraneous relationships have been removed. The dump is shared in a machine-friendly format.

The extracted graph, after cleaning, is a directed unweighted multigraph containing 72,407,365 nodes and 306,733,220 edges, with 4,335 distinct edge labels.

The complete dump, "Freebase Triples", can be found at developers.google.com/freebase. It has not been updated since the shutdown of the project.

Freebase Data Dumps are provided free of charge for any purpose by Google. They are distributed, like Freebase itself, under the Creative Commons Attribution (aka CC-BY) license, and their use is subject to the Freebase Terms of Service.

The Freebase ExQ Data Dump (this repository) is distributed under the same license; see below for citing this work. This dataset has been used in the Exemplar Query project.

Reference this dataset

This dataset is called The Freebase ExQ Data Dump. If you use this dataset, generate a subsample, or test on this dataset, please use the following reference and link to https://people.cs.aau.dk/~matteo/exemplar.html.

Davide Mottin, Matteo Lissandrini, Yannis Velegrakis, Themis Palpanas. "Exemplar Queries: A New Way of Searching." The VLDB Journal (2016) 25: 741--765.


@article{Mottin:2016:EQN:3016770.3016789,
 author = {Mottin, Davide and Lissandrini, Matteo and Velegrakis, Yannis and Palpanas, Themis},
 title = {Exemplar Queries: A New Way of Searching},
 journal = {The VLDB Journal},
 issue_date = {December  2016},
 volume = {25},
 number = {6},
 month = dec,
 year = {2016},
 issn = {1066-8888},
 pages = {741--765},
 numpages = {25},
 url = {https://doi.org/10.1007/s00778-016-0429-2},
 doi = {10.1007/s00778-016-0429-2},
 acmid = {3016789},
 publisher = {Springer-Verlag New York, Inc.},
 address = {Secaucus, NJ, USA},
 keywords = {Exemplar query, Knowledge base, Knowledge graph, Query answering},
}

Content

The dump consists of the following files:

  • freebase-sout.graph (2GB): edge triplets (ordered by the source id)

    • each line is a space-separated triplet source dest label, representing a single edge
    • edges are sorted by source, and thus a scan in order will give all the outgoing edges of a node
    • source and dest are long integers derived from the Freebase mid
  • freebase-labels.tsv: list of TAB separated 4-tuples, each of which contains:

    • Label ID (Long),
    • Number of edges with that label,
    • Freebase official edge label,
    • tentative human readable label
  • freebase-nodes-in-out-name.tsv (802MB): list of TAB separated 4-tuples, each of which contains:

    • Node ID (Long)
    • Node InDegree (could be approximate)
    • Node OutDegree (could be approximate)
    • tentative human readable label
  • freebase-topics.tsv: list of TAB separated values, each line contains:

    • topic name : defined as the first fragment of the edge label
    • topic frequency : number of edges belonging to this topic; note that >141 million edges belong to type instances (like isA relationships)
  • org-subsample (34MB): a subsample of Freebase for a selection of domains:

    • freebase-org-subsample-sout.graph contains a portion of 4.3M edges from freebase-sout.graph
    • selected_labels.tsv lists a portion of freebase-labels.tsv: only edges with labels in this list appear in the subsample
  • directory scripts: contains

    • mid2long converts Freebase mids to long values, e.g., from /m/0gwsd6y to 89546883877148
    • long2mid converts Freebase long ids to mids
    • extract_domain.py extracts subgraphs of Freebase given a topic name from freebase-topics.tsv. Requires Python 2.7, and networkx if you wish to keep only the largest connected component
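
Because freebase-sout.graph is sorted by source, the outgoing edges of each node can be collected in a single pass without loading the whole graph. The following is a minimal sketch of such a scan, not the authors' own tooling. The class and method names (SoutGraphScan, parseLine, forEachNode) are hypothetical, the ids in main are made up, and splitting on any whitespace is an assumption that is slightly more tolerant than the single space described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiConsumer;

final class SoutGraphScan {

    /** One edge: source node id, destination node id, edge label id. */
    static final class Edge {
        final long source, dest, label;
        Edge(long source, long dest, long label) {
            this.source = source;
            this.dest = dest;
            this.label = label;
        }
    }

    /** Parses one "source dest label" line. */
    static Edge parseLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + line);
        }
        return new Edge(Long.parseLong(parts[0]),
                        Long.parseLong(parts[1]),
                        Long.parseLong(parts[2]));
    }

    /**
     * Scans the lines once and passes each source node, together with all of
     * its outgoing edges, to the visitor. This is only correct if the lines
     * are sorted by source, as stated for freebase-sout.graph.
     */
    static void forEachNode(Iterator<String> lines,
                            BiConsumer<Long, List<Edge>> visitor) {
        List<Edge> group = new ArrayList<>();
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.trim().isEmpty()) {
                continue; // tolerate blank lines
            }
            Edge e = parseLine(line);
            // A new source id closes the group of the previous node.
            if (!group.isEmpty() && group.get(0).source != e.source) {
                visitor.accept(group.get(0).source, group);
                group = new ArrayList<>();
            }
            group.add(e);
        }
        if (!group.isEmpty()) {
            visitor.accept(group.get(0).source, group);
        }
    }

    public static void main(String[] args) {
        // Made-up ids for illustration; they are not taken from the dump.
        List<String> sample = Arrays.asList("1 2 10", "1 3 11", "4 1 10");
        forEachNode(sample.iterator(), (src, edges) ->
            System.out.println(src + " has " + edges.size() + " outgoing edge(s)"));
    }
}
```

For the real file, the iterator could come from java.nio.file.Files.lines(path).iterator(), so that only one node's edges are held in memory at a time.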

You can download all of the files, or only some of them; they are stored on Google Drive.

The org-subsample Subsample

The entire graph is cumbersome to process in many applications, especially for testing purposes. We therefore generated a relatively small subsample of the graph, containing about 4.3 million edges from the entire graph, with a total of 424 edge labels.

The subsample was generated from the following topics:

  • business,
  • finance,
  • geography,
  • government,
  • military, and
  • organization.
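
The rule stated for selected_labels.tsv (only edges whose label is in the list appear in the subsample) can be illustrated with a small filter over the edge lines. This is a sketch under assumptions, not the procedure the authors used: the class and method names are hypothetical, the label ids in main are made up, and whether further pruning was applied to the real subsample is not stated here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

final class LabelSubsample {

    /**
     * Keeps the "source dest label" lines whose label id (third field)
     * is contained in selectedLabels. Malformed lines are dropped.
     */
    static List<String> filterByLabel(Iterator<String> edgeLines,
                                      Set<Long> selectedLabels) {
        List<String> kept = new ArrayList<>();
        while (edgeLines.hasNext()) {
            String line = edgeLines.next();
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 3
                    && selectedLabels.contains(Long.parseLong(parts[2]))) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up edges and label ids for illustration only.
        List<String> edges = Arrays.asList("1 2 10", "1 3 11", "4 1 10");
        Set<Long> selected = new HashSet<>(Arrays.asList(10L));
        for (String line : filterByLabel(edges.iterator(), selected)) {
            System.out.println(line);
        }
    }
}
```

In practice the set of label ids would be read from the first column of selected_labels.tsv.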

Information about Node IDs

If you want to understand what a node represents, search the file freebase-nodes-in-out-name.tsv for the corresponding node id. If that is not enough, you can grep the official data dump for the node's mid value, after removing the first slash and replacing the second with a dot. For example, for node 89546883877148, convert the id to the mid /m/0gwsd6y, rewrite it as m.0gwsd6y, and grep the dump for that string (use zgrep if the dump is compressed).
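
The two lookups described above can be sketched in a few lines of Java. The names NodeLookup, findLabel and toDumpKey are hypothetical, and the sample row in main is invented; only the column layout (id, in-degree, out-degree, label) and the mid rewriting rule come from this page.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Optional;

final class NodeLookup {

    /**
     * Scans TAB-separated rows (id, in-degree, out-degree, label) and
     * returns the label of the row whose first column equals nodeId.
     */
    static Optional<String> findLabel(Iterator<String> tsvLines, long nodeId) {
        String wanted = Long.toString(nodeId);
        while (tsvLines.hasNext()) {
            // Limit -1 keeps a trailing empty label column.
            String[] cols = tsvLines.next().split("\t", -1);
            if (cols.length == 4 && cols[0].equals(wanted)) {
                return Optional.of(cols[3]);
            }
        }
        return Optional.empty();
    }

    /** Rewrites a mid such as /m/0gwsd6y into the form m.0gwsd6y. */
    static String toDumpKey(String mid) {
        if (!mid.startsWith("/m/")) {
            throw new IllegalArgumentException("Not a mid: " + mid);
        }
        // Drop the first slash and replace the second one with a dot.
        return "m." + mid.substring(3);
    }

    public static void main(String[] args) {
        // Invented row: the degrees and the label are placeholders.
        Iterator<String> rows =
            Arrays.asList("89546883877148\t3\t5\tplaceholder label").iterator();
        System.out.println(findLabel(rows, 89546883877148L).orElse("(not found)"));
        System.out.println(toDumpKey("/m/0gwsd6y"));
    }
}
```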

Mids have been converted into long numbers using the following code:


  /**
    * Convert a mid into a long. A mid is "/m/" followed by lower-case
    * letters, digits and _; once upper-cased, each of these characters has
    * a two-digit ASCII code. The codes are concatenated, last character
    * first, and the resulting digit string is parsed as a decimal long.
    *
    * ** NOTE ** Engineering version
    * @param mid The original Freebase mid
    * @return the converted number
    * @throws NullPointerException
    * @throws IndexOutOfBoundsException
    */
    long convertMidToLong(String mid)
       throws NullPointerException, IndexOutOfBoundsException {
       // Keep only the part after the last '/', e.g. "0GWSD6Y" for /m/0gwsd6y
       String id = mid.substring(mid.lastIndexOf('/') + 1).toUpperCase();
       long retval;
       String number = "";
       for (int i = 0; i < id.length(); i++) {
           // Prepend the decimal ASCII code of each character
           number = (int)id.charAt(i) + number;
       }
       retval = Long.parseLong(number);
       return retval;
    }

Given a long value one can obtain the Freebase mid with the following code


  /**
   * Opposite of convertMidToLong
   * @param decimal the long id of a node
   * @return the corresponding Freebase mid
   * @throws NullPointerException
   * @throws IndexOutOfBoundsException
   */
  String convertLongToMid(long decimal)
      throws NullPointerException, IndexOutOfBoundsException {

      String mid = "";
      String decimalString = decimal + "";
      // Each pair of digits is the ASCII code of one character;
      // prepending restores the original character order
      for (int i = 0; i < decimalString.length(); i+= 2) {
          mid = (char)Integer.parseInt(decimalString.substring(i, i + 2)) + mid;
      }
      return "/m/" + mid.toLowerCase();
  }
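
As a worked check of the encoding, the mid /m/0gwsd6y becomes "0GWSD6Y" once upper-cased, with ASCII codes 48, 71, 87, 83, 68, 54 and 89. Written from the last character to the first, they give 89546883877148, the node id used as an example on this page. The class below is a compact restatement of the same scheme for experimentation. The name MidCodec, the use of StringBuilder and the explicit Locale.ROOT are choices made for this sketch and are not part of the original scripts.

```java
import java.util.Locale;

final class MidCodec {

    /** Encodes a mid such as /m/0gwsd6y as a long. */
    static long midToLong(String mid) {
        String id = mid.substring(mid.lastIndexOf('/') + 1)
                       .toUpperCase(Locale.ROOT);
        StringBuilder digits = new StringBuilder();
        // Walk from the last character to the first, appending each
        // two-digit ASCII code.
        for (int i = id.length() - 1; i >= 0; i--) {
            digits.append((int) id.charAt(i));
        }
        return Long.parseLong(digits.toString());
    }

    /** Decodes a long produced by midToLong back into a mid. */
    static String longToMid(long value) {
        String digits = Long.toString(value);
        StringBuilder id = new StringBuilder();
        // Read the two-digit codes from the last pair to the first.
        for (int i = digits.length() - 2; i >= 0; i -= 2) {
            id.append((char) Integer.parseInt(digits.substring(i, i + 2)));
        }
        return "/m/" + id.toString().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        long encoded = midToLong("/m/0gwsd6y");
        System.out.println(encoded);
        System.out.println(longToMid(encoded));
    }
}
```

Since a long holds at most 19 decimal digits, this scheme can safely encode ids of up to nine characters after the /m/ prefix; longer ids would need a wider type.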

Cleaning Criteria

The following metadata relationships in Freebase are omitted:

  • DOMAIN /type/domain
  • TOPIC /type/type
  • ENTITY /common/topic
  • PROPERTY /type/property

Media and contextual information that is not of interest in the knowledge graph is also omitted.

For type relationships, we keep only isA, and not the reverse hasInstance.


 /**
  * Patterns to skip:
  * any line of the tsv dump matching one of the following patterns is removed
  */
 String SKIP_PATTERNS = ".*\\t/user.*|"
     + ".*\\t/freebase/(?!domain_category).*|"
     + ".*/usergroup/.*|"
     + ".*/permission/.*|"
     + ".*\\t/community/.*\\t.*|"
     + ".*\\t/type/object/type\\t.*|"
     + ".*\\t/type/domain/.*\\t.*|"
     + ".*\\t/type/property/(?!expected_type|reverse_property)\\b.*|"
     + ".*\\t/type/(user|content|attribution|extension|link|namespace|permission|reflect|em|karen|cfs|media).*|"
     + ".*\\t/common/(?!document|topic)\\b.*|"
     + ".*\\t/common/document/(?!source_uri)\\b.*|"
     + ".*\\t/common/topic/(description|image|webpage|properties|weblink|notable_for|article).*|"
     + ".*\\t/type/type/(?!domain|instance)\\b.*|"
     + ".*\\t/dataworld/.*\\t.*|"
     + ".*\\t/base/.*\\t.*"
     ;
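
To show how such a pattern acts on individual lines, here is a minimal sketch that uses only three of the alternatives listed above and assumes the pattern is applied to each line as a whole (full-line matching). The class name SkipFilter, the method keep and the sample lines are hypothetical; the lines only mimic a TAB-separated subject, predicate, object layout.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class SkipFilter {

    // Abbreviated to three alternatives of the full pattern, for brevity.
    static final Pattern SKIP = Pattern.compile(
          ".*\\t/user.*|"
        + ".*\\t/common/topic/(description|image|webpage|properties|weblink|notable_for|article).*|"
        + ".*\\t/base/.*\\t.*");

    /** Returns true if the line survives the cleaning step. */
    static boolean keep(String line) {
        // matches() requires the whole line to match one alternative.
        return !SKIP.matcher(line).matches();
    }

    public static void main(String[] args) {
        // Invented lines for illustration only.
        List<String> lines = Arrays.asList(
            "/m/0abc\t/people/person/nationality\t/m/0def",
            "/m/0abc\t/user/someone/prop\t/m/0def",
            "/m/0abc\t/common/topic/description\tsome text");
        for (String line : lines) {
            System.out.println((keep(line) ? "keep  " : "skip  ") + line);
        }
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regular expression for every line, which matters when the filter runs over hundreds of millions of lines.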

Other dumps

If you are looking for other dumps, see Freebase Easy at freebase-easy.cs.uni-freiburg.de. It contains a snapshot of the Freebase data that has been enriched with transitive closures, but also largely simplified (and pruned).

Feedback

If you have any feedback or suggestions, such as edges to add or remove, labels for nodes and edges, or domains to include, please feel free to contact Matteo Lissandrini.