We share here The Freebase ExQ Data Dump: a cleaned version of the triplets from the Freebase knowledge graph, from which metadata and most extraneous relationships have been removed. The dump is shared in a machine-friendly format.
The extracted graph, after cleaning, is a directed unweighted multigraph containing 72,407,365 nodes and 306,733,220 edges with 4335 distinct edge labels.
The complete dump "Freebase Triples" can be found at developers.google.com/freebase; it has not been updated since the shutdown of the project.
Freebase Data Dumps are provided free of charge for any purpose by Google. They are distributed, like Freebase itself, under the Creative Commons Attribution (CC-BY) license, and their use is subject to the Freebase Terms of Service.
The Freebase ExQ Data Dump (this repository) is distributed under the same license; see below for citing this work. This dataset has been used in the Exemplar Query project.
Reference this dataset
This dataset is called The Freebase ExQ Data Dump. If you use this dataset, generate a subsample, or test on this dataset, please use the following reference and link to https://people.cs.aau.dk/~matteo/exemplar.html.
Davide Mottin, Matteo Lissandrini, Yannis Velegrakis, Themis Palpanas. "Exemplar Queries: A New Way of Searching." The VLDB Journal (2016) 25: 741--765.
@article{Mottin:2016:EQN:3016770.3016789,
author = {Mottin, Davide and Lissandrini, Matteo and Velegrakis, Yannis and Palpanas, Themis},
title = {Exemplar Queries: A New Way of Searching},
journal = {The VLDB Journal},
issue_date = {December 2016},
volume = {25},
number = {6},
month = dec,
year = {2016},
issn = {1066-8888},
pages = {741--765},
numpages = {25},
url = {https://doi.org/10.1007/s00778-016-0429-2},
doi = {10.1007/s00778-016-0429-2},
acmid = {3016789},
publisher = {Springer-Verlag New York, Inc.},
address = {Secaucus, NJ, USA},
keywords = {Exemplar query, Knowledge base, Knowledge graph, Query answering},
}
Content
The dump consists of the following files:
- `freebase-sout.graph` (2GB): edge triplets (ordered by the source id)
  - each line is a space-separated triplet `source dest label`, representing a single edge
  - edges are sorted by source, so a scan in order yields all the outgoing edges of a node consecutively
  - `source` and `dest` are long integers derived from the Freebase `mid`
- `freebase-labels.tsv`: list of TAB-separated 4-tuples, each of which contains:
  - Label ID (Long),
  - Number of edges with that label,
  - Freebase official edge label,
  - tentative human-readable label
- `freebase-nodes-in-out-name.tsv` (802MB): list of TAB-separated 4-tuples, each of which contains:
  - Node ID (Long)
  - Node InDegree (could be approximate)
  - Node OutDegree (could be approximate)
  - tentative human-readable label
- `freebase-topics.tsv`: list of TAB-separated values, each line contains:
  - topic name: defined as the first fragment of the edge label
  - topic frequency: number of edges belonging to this topic; note that more than 141 million edges belong to type instances (like `isA` relationships)
- `org-subsample` (34MB): a subsample of Freebase for a selection of domains:
  - `freebase-org-subsample-sout.graph` contains a portion of 4.3M edges from `freebase-sout.graph`
  - `selected_labels.tsv` lists a portion of `freebase-labels.tsv`: only edges with labels in this list appear in the subsample
- directory `scripts`: contains
  - `mid2long` converts Freebase mids to long values, e.g., from `/m/0gwsd6y` to `89546883877148`
  - `long2mid` converts Freebase long ids to mids
  - `extract_domain.py` extracts subgraphs of Freebase given a topic name from `freebase-topics.tsv`. Requires `python 2.7`, and `networkx` if you wish to keep only the largest connected component
You can download all of the files or only part of them; they are stored on Google Drive.
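Because the edge file is sorted by source, the outgoing edges of each node can be collected in a single pass without loading the whole graph. The following is a minimal sketch in Python, not part of the repository: the function names and the sample lines are made up for illustration, and it assumes that the third column holds the numeric Label ID from the labels file.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

Edge = Tuple[int, int, int]  # (source, dest, label)


def parse_edges(lines: Iterable[str]) -> Iterator[Edge]:
    """Parse space-separated `source dest label` lines into integer triplets."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        source, dest, label = line.split()
        yield int(source), int(dest), int(label)


def out_edges_by_source(
    lines: Iterable[str],
) -> Iterator[Tuple[int, List[Tuple[int, int]]]]:
    """Yield (source, [(dest, label), ...]) in one pass over the input.

    Relies on the input being sorted by source, so that all the edges
    leaving a node sit on consecutive lines.
    """
    for source, group in groupby(parse_edges(lines), key=lambda edge: edge[0]):
        yield source, [(dest, label) for _, dest, label in group]


# Made-up sample lines standing in for the contents of the real file.
sample = [
    "10 20 1",
    "10 30 2",
    "10 20 2",  # parallel edge with a different label (the graph is a multigraph)
    "40 10 1",
]
adjacency = dict(out_edges_by_source(sample))
print(adjacency[10])
```

To run this on the real data, an open file handle can be passed in place of `sample`; materializing the result with `dict(...)` is only practical for small inputs such as the subsample.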
The org-subsample Subsample
The entire graph is cumbersome to process in many applications, especially for testing purposes. We generated a relatively small subsample of the graph, containing about 4.3 million edges from the entire graph, with a total of 424 edge labels.
We generated a subsample from the following topics:
- business,
- finance,
- geography,
- government,
- military, and
- organization.
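A topic-based selection like this one can be sketched as follows. This is only an illustration of the idea, not the procedure used to build the subsample (the `extract_domain.py` script is the reference for that). It assumes that official edge labels are slash-separated, such as `/business/employer/employees`, and that the label column of the edge file holds the Label ID; the function names, IDs, and counts are made up.

```python
from typing import Iterable, List, Set, Tuple


def topic_of(edge_label: str) -> str:
    """Return the first fragment of an edge label, e.g. '/business/...' -> 'business'."""
    return edge_label.strip("/").split("/")[0]


def label_ids_for_topics(label_rows: Iterable[str], topics: Set[str]) -> Set[int]:
    """Collect the IDs of the labels whose topic is in `topics`.

    Each row is assumed to be: label id, edge count, official label, readable label.
    """
    keep = set()
    for row in label_rows:
        label_id, _count, official_label, _readable = row.rstrip("\n").split("\t", 3)
        if topic_of(official_label) in topics:
            keep.add(int(label_id))
    return keep


def filter_edges(edge_lines: Iterable[str], keep_ids: Set[int]) -> List[Tuple[int, int, int]]:
    """Keep only the edges whose label ID is in `keep_ids`."""
    kept = []
    for line in edge_lines:
        source, dest, label = (int(field) for field in line.split())
        if label in keep_ids:
            kept.append((source, dest, label))
    return kept


# Made-up rows and edges for illustration only.
LABEL_ROWS = [
    "1\t3\t/business/employer/employees\temployees",
    "2\t5\t/people/person/nationality\tnationality",
    "3\t2\t/organization/organization/founders\tfounders",
]
EDGE_LINES = ["10 20 1", "10 30 2", "40 10 3"]

keep_ids = label_ids_for_topics(LABEL_ROWS, {"business", "organization"})
subsample = filter_edges(EDGE_LINES, keep_ids)
print(subsample)
```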
Information about Node IDs
If you want to understand what a node represents, search the file `freebase-nodes-in-out-name.tsv` for the corresponding node id.
If the search doesn't satisfy you, you can use `grep` on the official data dump to search for its mid value (removing the first slash, and replacing the second with a dot).
So, if you care about node `89546883877148` and want to search the official dump, convert it to the mid `/m/0gwsd6y`, replace the characters to obtain `m.0gwsd6y`, and `grep` (`zgrep` on a compressed file) the dump.
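As an alternative to a manual search, the node file can be scanned programmatically. The sketch below is an assumption-laden illustration: `find_node` is a hypothetical helper, the sample rows (degrees and names) are invented, and the file is assumed to have no header line and exactly the four TAB-separated columns listed above.

```python
import csv
import io
from typing import Iterable, Optional, Tuple


def find_node(rows: Iterable[str], node_id: int) -> Optional[Tuple[int, int, str]]:
    """Return (in_degree, out_degree, name) for `node_id`, or None if absent.

    This is a linear scan; for many lookups an index would be preferable.
    """
    reader = csv.reader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 4 and row[0] == str(node_id):
            return int(row[1]), int(row[2]), row[3]
    return None


# Invented rows standing in for freebase-nodes-in-out-name.tsv.
SAMPLE = (
    "89546883877148\t12\t7\tExample node\n"
    "48546883877148\t1\t0\tAnother node\n"
)
result = find_node(io.StringIO(SAMPLE), 89546883877148)
print(result)
```

On the real file, an open file handle can be passed instead of the `io.StringIO` wrapper.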
Mids have been converted into long numbers using the following code:
/**
 * Convert a mid into a long. The identifier after the last '/' is made of
 * lower-case letters, digits and '_'; it is upper-cased so that every
 * character has a two-digit ASCII code, and the codes are concatenated
 * in reverse character order to form a decimal number.
 *
 * ** NOTE ** Engineering version
 * @param mid The original Freebase mid
 * @return the converted number
 * @throws NullPointerException if mid is null
 * @throws NumberFormatException if the result does not fit in a long
 */
long convertMidToLong(String mid)
        throws NullPointerException, NumberFormatException {
    String id = mid.substring(mid.lastIndexOf('/') + 1).toUpperCase();
    StringBuilder number = new StringBuilder();
    for (int i = 0; i < id.length(); i++) {
        // Prepend the ASCII code, so the last character ends up first
        number.insert(0, (int) id.charAt(i));
    }
    return Long.parseLong(number.toString());
}
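Given a long value, one can obtain the Freebase mid with the following code:
/**
 * Opposite of convertMidToLong: read the decimal digits two at a time,
 * turn each pair back into its ASCII character, and reverse the order.
 * @param decimal the long id of a node
 * @return the corresponding Freebase mid
 * @throws IndexOutOfBoundsException if the number of digits is odd
 */
String convertLongToMid(long decimal)
        throws IndexOutOfBoundsException {
    String decimalString = Long.toString(decimal);
    StringBuilder mid = new StringBuilder();
    for (int i = 0; i < decimalString.length(); i += 2) {
        // Each pair of digits is one ASCII code; prepend to undo the reversal
        mid.insert(0, (char) Integer.parseInt(decimalString.substring(i, i + 2)));
    }
    return "/m/" + mid.toString().toLowerCase();
}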
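For readers working in Python, the two Java routines above can be ported roughly as follows. This is a sketch rather than the code of the `mid2long` and `long2mid` scripts; the function names, including the `mid_to_dump_key` helper that produces the dotted form used when grepping the official dump, are hypothetical.

```python
def mid_to_long(mid: str) -> int:
    """Concatenate the two-digit ASCII codes of the upper-cased id, last character first."""
    ident = mid[mid.rfind("/") + 1:].upper()
    return int("".join(str(ord(ch)) for ch in reversed(ident)))


def long_to_mid(value: int) -> str:
    """Inverse of mid_to_long: decode digit pairs, then restore the original order."""
    digits = str(value)
    chars = [chr(int(digits[i:i + 2])) for i in range(0, len(digits), 2)]
    return "/m/" + "".join(reversed(chars)).lower()


def mid_to_dump_key(mid: str) -> str:
    """Drop the first slash and replace the second with a dot: '/m/x' -> 'm.x'."""
    return mid.lstrip("/").replace("/", ".")


node_id = mid_to_long("/m/0gwsd6y")
print(node_id, long_to_mid(node_id), mid_to_dump_key("/m/0gwsd6y"))
```

The round trip works because upper-case letters, digits, and `_` all have ASCII codes between 48 and 95, so every character occupies exactly two decimal digits. Python integers do not overflow, whereas the Java version is limited to identifiers whose encoding fits in a `long`.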
Cleaning Criteria
- Metadata relationships in Freebase; these relationships are omitted:
  - DOMAIN: /type/domain
  - TOPIC: /type/type
  - ENTITY: /common/topic
  - PROPERTY: /type/property
- Media and contextual information not interesting in the knowledge graph
- For type relationships we keep only the isA, and not the reverse hasInstance
/**
* Patterns to skip
* removes the line from the tsv dump matching the following patterns
*/
String SKIP_PATTERNS = ".*\\t/user.*|"
+ ".*\\t/freebase/(?!domain_category).*|"
+ ".*/usergroup/.*|"
+ ".*/permission/.*|"
+ ".*\\t/community/.*\\t.*|"
+ ".*\\t/type/object/type\\t.*|"
+ ".*\\t/type/domain/.*\\t.*|"
+ ".*\\t/type/property/(?!expected_type|reverse_property)\\b.*|"
+ ".*\\t/type/(user|content|attribution|extension|link|namespace|permission|reflect|em|karen|cfs|media).*|"
+ ".*\\t/common/(?!document|topic)\\b.*|"
+ ".*\\t/common/document/(?!source_uri)\\b.*|"
+ ".*\\t/common/topic/(description|image|webpage|properties|weblink|notable_for|article).*|"
+ ".*\\t/type/type/(?!domain|instance)\\b.*|"
+ ".*\\t/dataworld/.*\\t.*|"
+ ".*\\t/base/.*\\t.*"
;
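To see how such patterns act on individual lines, the sketch below applies three of them in Python. It is an illustration under assumptions: the patterns are copied from the list above with Java string escaping removed, whole-line matching is assumed (as with Java's `String.matches`), and the sample lines are invented and may not reflect the exact layout of the original TSV dump.

```python
import re

# Three of the patterns listed above; `\t` is interpreted by the regex engine as a TAB.
SKIP_SUBSET = re.compile(
    r".*\t/user.*|"
    r".*\t/type/object/type\t.*|"
    r".*\t/common/topic/(description|image|webpage|properties|weblink|notable_for|article).*"
)


def should_skip(line: str) -> bool:
    """True if the whole line matches one of the skip patterns."""
    return SKIP_SUBSET.fullmatch(line) is not None


# Invented lines for illustration only.
lines = [
    "/m/0gwsd6y\t/user/someone/default_domain/thing\t/m/0abc",
    "/m/0gwsd6y\t/common/topic/description\tsome text",
    "/m/0gwsd6y\t/people/person/nationality\t/m/09c7w0",
    "/m/0gwsd6y\t/type/object/type\t/people/person",
]
kept = [line for line in lines if not should_skip(line)]
print(kept)
```

Only the line carrying a regular relationship survives; the other three are dropped as user data, descriptive content, and type metadata respectively.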
Other dumps
If you are looking for other dumps, see Freebase Easy at freebase-easy.cs.uni-freiburg.de: a snapshot of the Freebase data dump that has been enriched with transitive closures, but also largely simplified (and pruned).
Feedback
If you have any feedback or suggestions, such as edges to add or remove, labels for nodes and edges, or domains to include, please feel free to contact Matteo Lissandrini.