Wikipedia Translate Crawler

A Wikipedia crawler that follows hypertext links from an English starting page and finds the worst-translated pages around it.


What it is

This crawler searches Wikipedia for all pages related to a topic in one language, the source language (say, English). For each related page, it checks how good the corresponding page is in another language, the target language (say, French).

For example, if you know a lot about computer science and want to improve the French Wikipedia pages on the subject, you can use the script to find out which related pages are poorly translated and should be treated as a priority.

Basically, this script is meant to be used when you want to contribute to Wikipedia by translating pages.

How to run

Considering the example above, you can run the following:

git clone https://github.com/Relex12/Wikipedia-Translate-Crawler.git
cd Wikipedia-Translate-Crawler
./crawler.sh Computer_Science fr

Script behavior

The script first checks the Internet connection, the options, and the existence of both the source and translated pages (i.e. the page you give as an argument and its translated version), then creates a workspace named after the source page.

The output is written to a sorted CSV file whose first column is the page's translation score, followed by the name of the source page, the URL of the translated page, and additional information about the quality tags of the translated page.

The same CSV data is also printed to stdout, colored according to the score and quality tags of the translated page.
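The output format described above can be sketched in Python (the crawler itself is a shell script, so this is only an illustration). The column order follows the description; the sort direction (worst score first), the color thresholds, the sample rows, and the "stub" quality tag are all assumptions made up for the example.

```python
import csv
import io
import sys

# ANSI color codes. The thresholds used below are hypothetical.
RED, YELLOW, GREEN, RESET = "\033[31m", "\033[33m", "\033[32m", "\033[0m"


def color_for(score):
    """Pick a color from the score (illustrative thresholds only)."""
    if score >= 10:
        return RED
    if score >= 5:
        return YELLOW
    return GREEN


def sort_rows(rows):
    """Sort rows by score, highest first (assumed to mean worst first)."""
    return sorted(rows, key=lambda row: row[0], reverse=True)


def write_csv(rows, out):
    """Write rows as: score, source page, translated URL, quality tags."""
    writer = csv.writer(out)
    for score, page, url, tags in rows:
        writer.writerow([f"{score:.2f}", page, url, tags])


def print_colored(rows, out=sys.stdout):
    """Print the same rows, one colored line per page."""
    for score, page, url, tags in rows:
        out.write(f"{color_for(score)}{score:.2f},{page},{url},{tags}{RESET}\n")


# Made-up sample rows, not real crawler output.
rows = sort_rows([
    (1.8, "Algorithm", "https://fr.wikipedia.org/wiki/Algorithme", ""),
    (12.4, "Compiler", "https://fr.wikipedia.org/wiki/Compilateur", "stub"),
])
buffer = io.StringIO()
write_csv(rows, buffer)
print_colored(rows)
```

Keeping the sorting, the CSV writing, and the coloring in separate functions means the file and the terminal output always share the same row order.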

The score is calculated according to this pseudo-code:

score = 0
for i in [<a>, <img>, <h2>, <h3>]
	score = score + N_src(i)/( N_trg(i)+1 )

where N_src(i) and N_trg(i) are the number of occurrences of tag i in the source page and the translated page, respectively. The +1 in the denominator avoids a division by zero when the tag is absent from the translated page.
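As a minimal sketch, the pseudo-code above could be written in Python with the standard library's HTML parser. The names `TagCounter`, `count_tags` and `translation_score` are hypothetical, and the real script may count tags differently (for example with text tools on the raw HTML).

```python
from html.parser import HTMLParser

# Tags taken from the pseudo-code: links, images, and section headings.
TAGS = ("a", "img", "h2", "h3")


class TagCounter(HTMLParser):
    """Count opening tags of interest in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {tag: 0 for tag in TAGS}

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1


def count_tags(html):
    """Return a dict mapping each tag in TAGS to its number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def translation_score(source_html, target_html):
    """Sum N_src(tag) / (N_trg(tag) + 1) over all tags in TAGS."""
    src = count_tags(source_html)
    trg = count_tags(target_html)
    return sum(src[tag] / (trg[tag] + 1) for tag in TAGS)


source = '<h2>A</h2><a href="x">x</a><a href="y">y</a><img src="i.png"><h3>B</h3>'
target = '<h2>A</h2><a href="x">x</a>'
print(translation_score(source, target))  # 2/2 + 1/1 + 1/2 + 1/1 = 3.5
```

Under this formula the score grows when the source page has many more links, images, and headings than the translated page, which is consistent with treating high-scoring pages as translation priorities.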

CLI arguments

Usage: ./crawler PAGE [TARGET_LANGUAGE=fr] [DEPTH=2] [SOURCE_LANGUAGE=en]
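To illustrate how the positional arguments and their defaults fit together, here is a hedged Python sketch using `argparse`. It only mirrors the usage line above; the function name `parse_args` is hypothetical and the actual script parses its arguments in shell.

```python
import argparse


def parse_args(argv=None):
    """Parse PAGE [TARGET_LANGUAGE=fr] [DEPTH=2] [SOURCE_LANGUAGE=en]."""
    parser = argparse.ArgumentParser(prog="crawler")
    parser.add_argument("page", help="name of the source page")
    # Optional positionals are filled from left to right; missing ones
    # fall back to the defaults shown in the usage line.
    parser.add_argument("target_language", nargs="?", default="fr")
    parser.add_argument("depth", nargs="?", type=int, default=2)
    parser.add_argument("source_language", nargs="?", default="en")
    return parser.parse_args(argv)


# Equivalent to the earlier example: ./crawler.sh Computer_Science fr
args = parse_args(["Computer_Science", "fr"])
print(args.page, args.target_language, args.depth, args.source_language)
```

Because the arguments are positional, changing the source language requires giving the target language and depth as well.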

Known issues and risky behaviors

License

This is a small project. The code is freely available to the GitHub community under the permissive MIT License.