How we processed Wikidata dumps over a weekend using only shell scripting

Debanjan Chaudhuri
4 min read · Apr 24, 2021

Processing huge datasets can be daunting, especially when the data does not fit into memory. Many recent frameworks rely on the MapReduce paradigm, where the data is distributed across different machines and processed separately. However, such techniques require adding more hardware to your pipeline. Before the advent of big data technologies, we used to process such large amounts of data with shell or Perl scripting. In this article, we introduce the simple shell scripting techniques we used to process the Wikidata dump, which is roughly 500 GB uncompressed, over a weekend. The best part: although the approach is disk I/O intensive, you don't need any additional hardware.

Before starting to talk about processing Wikidata, one should probably explain what the heck Wikidata is.

Wikidata, according to this definition, is a collaborative knowledge graph used as a common source of open data. If you are reading this article, you probably already know what a knowledge graph is, but for the sake of completeness I would like to say a few lines about knowledge graphs as well. Without getting into the nitty-gritty: knowledge graphs are graph-structured data models that store information about different entities and the relations between them. For example, consider the diagram below, where the different entities related to a famous painting are connected in a graph-like structure.

Snippet from a Knowledge Graph

Wikidata also contains entities and the different relations (or properties, in Wikidata terminology) they have with other entities. The entire Wikidata dump is available here. Wikidata contains around 90 million entities, so you can imagine how large the dump is. The compressed file is around 90 GB (tar.gz). We tried extracting the file at least 3–4 times, but every time we ran out of space while extracting it. We did have a machine with 256 GB of RAM, but imagine processing around 500 GB of data on that. You could always do batched disk reads and process the data that way, but that would mean extracting it first, which we couldn't do since the machine only had around 400 GB of disk space left for us.

Faced with these constraints, and wanting to retain my coolness, I had to do something different to process the data. I thought, hey! why not use shell scripting? To understand the scripting parts, let's first have a look at how the data is organized.

The data is organized as follows:

  1. Each entity's information is stored on a single line of the dump as a JSON object.
  2. The JSON structure looks like this (truncated): {"type":"item","id":"Q24","labels":{"en":{"language":"en","value":"Jack Bauer"},

We needed to create an inverted index from the entity id (Q24 in the example above), the English entity label, and the English description, which for the same entity looks like this: {"en":{"language":"en","value":"character from the television series 24"}
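Concretely, for the entity above we wanted one record per entity along these lines (tab-separated; the exact output format here is just for illustration):

Q24	Jack Bauer	character from the television series 24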

In order to extract the entity ids, we used the following grep pattern:

grep -oP '(?<=id":).*?(?="labels")' <filename>

To get the English labels and descriptions, we used this grep pattern:

grep -oP '(?<="en":{"language":"en","value":").*?(?="},")' <filename>

This extracts both pieces of information (label and description) on separate lines.

As mentioned previously, we didn't want to extract the files because of the enormous space requirement. What is the best part about using shell scripting? You can process a file without extracting it. You can do this in other languages as well, but then you would also need to load the data into memory first, which we didn't want.

How do you grep from a compressed (tar.gz) file? Just use zgrep instead of grep. For instance, to get the labels and descriptions we used:

zgrep -oP '(?<="en":{"language":"en","value":").*?(?="},")' <filename>

To do the post-processing while creating our index, we needed to retain the line numbers. That means adding one more option to grep to print the line number as well, so the final commands become:

zgrep -oPn '(?<=id":).*?(?="labels")' <filename>  # extract ids
zgrep -oPn '(?<="en":{"language":"en","value":").*?(?="},")' <filename> # extract labels and descriptions

Finally, we used a simple Python script here to pull the id, label and description into a separate file, which we later used for indexing.
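That script isn't reproduced here, but a minimal sketch of the post-processing step could look like the following. It assumes the two zgrep outputs above were redirected into files called ids.txt and labels_descriptions.txt (placeholder names) and joins the matches by line number into one tab-separated file:

# Minimal sketch of the post-processing step (placeholder file names, untested
# against the full dump). It expects lines like  1234:"Q24",  in ids.txt and
# 1234:Jack Bauer / 1234:character from ...  in labels_descriptions.txt.
from collections import defaultdict

# Collect the English label/description matches, keyed by line number.
texts = defaultdict(list)
with open("labels_descriptions.txt", encoding="utf-8") as f:
    for row in f:
        lineno, _, value = row.rstrip("\n").partition(":")
        texts[lineno].append(value)

# Walk the id matches and emit one tab-separated record per entity.
with open("ids.txt", encoding="utf-8") as ids, \
        open("wikidata_index.tsv", "w", encoding="utf-8") as out:
    for row in ids:
        lineno, _, raw_id = row.rstrip("\n").partition(":")
        entity_id = raw_id.strip('", ')        # grep captured "Q24", -> Q24
        matches = texts.get(lineno, [])
        label = matches[0] if matches else ""
        description = matches[1] if len(matches) > 1 else ""
        out.write(f"{entity_id}\t{label}\t{description}\n")

Note that this sketch keeps all label/description matches in memory, which is fine on a machine with plenty of RAM; a streaming merge by line number would be the leaner option.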

The script could be highly optimized, which we didn't do, partly out of laziness and partly because we only needed to run it once.

There are definitely issues with our approach:

  1. Some labels were not extracted properly; we ignored these during post-processing.
  2. Some entities don't have labels, and these came out blank.
  3. The process is disk I/O intensive and hence slow; the label extraction took around a day and the id extraction a bit longer.
  4. The process does not use multiple threads and is therefore slower, although we did run the id and label extractions as two separate processes.
  5. A probable way to make the whole process faster is to split the file and run the extraction on each split (see the sketch below).
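To illustrate point 5: rather than physically splitting the compressed dump, one hypothetical way to sketch the same idea in Python (not something we ran) is to stream the file once and fan batches of lines out to a process pool, using regexes roughly equivalent to the grep patterns above. The file name, batch size and regexes are all assumptions:

# Hypothetical sketch of point 5: parallelise the extraction without physically
# splitting the dump. The file name, batch size and regexes are assumptions.
import gzip
import multiprocessing as mp
import re

ID_RE = re.compile(r'"id":"(Q\d+)"')
EN_RE = re.compile(r'"en":\{"language":"en","value":"(.*?)"\},')

def extract(batch):
    # batch is a list of (line_number, line) pairs; return one row per line
    rows = []
    for lineno, line in batch:
        id_match = ID_RE.search(line)
        texts = EN_RE.findall(line)        # usually [label, description]
        rows.append((lineno,
                     id_match.group(1) if id_match else "",
                     texts[0] if texts else "",
                     texts[1] if len(texts) > 1 else ""))
    return rows

def batches(path, size=10000):
    batch = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            batch.append((lineno, line))
            if len(batch) == size:
                yield batch
                batch = []
    if batch:
        yield batch

if __name__ == "__main__":
    with mp.Pool() as pool, open("wikidata_index.tsv", "w", encoding="utf-8") as out:
        for rows in pool.imap(extract, batches("latest-all.json.gz")):
            for _, entity_id, label, description in rows:
                out.write(f"{entity_id}\t{label}\t{description}\n")

The decompression is still a single stream, so this only parallelises the regex matching; whether it helps depends on whether the run is CPU-bound or disk-bound.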

Thank you for reading! I hope this is useful for people working with the Wikidata knowledge graph.

Take care!

