Following my post on Character encodings and Unicode, it is now time to talk about i18n with GNU gettext. We will look at i18n and l10n in general and then talk about how gettext can make our lives as programmers much easier.
i18n and l10n - Internationalization and Localization
Due to the length of Internationalization and Localization you can just write i18n and l10n, which are “numeronyms”: number-based words formed by taking the first and last character of a word and putting the number of letters between these two characters in the middle. Internationalization starts with an i, ends with an n, and has 18 letters in between, resulting in i18n.
For us software developers these terms mean adapting our code to be locale agnostic. If you create a UI and hard-code all strings, users won’t be able to change the language. Aside from normal translations, i18n and l10n also encompass formatting rules for numbers, date and time, and currency, as well as things like text layout: some languages read left to right, others right to left, and instead of reading horizontally there are also cultures where you read vertically.
All of this might seem overwhelming, but in reality you will likely never have to deal with all of it. For open-source projects it’s often enough to have everything in English and maybe provide a way to load translations.
GNU gettext and libintl
i18n and l10n are all nice and good, but how should we programmers design our software to support these concepts? This is where `gettext` and `libintl` come into play.
In 1995 the GNU project released GNU gettext into the world. The package offers an integrated set of tools as well as the `libintl` runtime library for dealing with translations. We will take a look at the tools `xgettext`, `msginit`, `msgmerge` and `msgfmt`, at how to use the `gettext` library in our code, and at the process of creating translations.
CMake Setup
In order to use `libintl` in our code we need to get it from somewhere. In this example we will use CMake and vcpkg.
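A minimal `CMakeLists.txt` could look like this (project and target names are placeholders; `Intl` is CMake’s bundled find module for `libintl`):

```cmake
cmake_minimum_required(VERSION 3.21)
project(hello LANGUAGES CXX)

# Intl is CMake's find module for libintl (the gettext runtime)
find_package(Intl REQUIRED)

add_executable(hello main.cpp)
target_link_libraries(hello PRIVATE Intl::Intl)
```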
This is a very basic `CMakeLists.txt` file containing one dependency, which we will get with vcpkg using `vcpkg install gettext[tools]`. The `tools` feature is very important so we get the programs required for our setup.
Code Setup
This is the most basic C++ program possible. In the code we hard-coded the string `Hello World!`, and now we want to provide translations. The `libintl` runtime library has exactly what we need:
The `gettext` function from the `libintl.h` header will now look for a translation of the string `Hello World!` for the current locale at runtime. If it does not find a translation it will just use `Hello World!`, which is very nice since we only have to wrap all the strings in `gettext()`. If `gettext` is too long you can create a macro:
This macro is very commonly used in projects that use GNU gettext and is even included in frameworks like GTK.
So where does `gettext()` look for my translations? You will have to specify that with `bindtextdomain`:
Now at runtime, if the user has the `de` locale set, `gettext()` will look at `locales/de/LC_MESSAGES/my-domain.mo` for a translation. Let’s break this path down:

- `locales`: the folder specified in `bindtextdomain`
- `de`: the current locale
- `LC_MESSAGES`: the category name of the translation
- `my-domain.mo`: the binary message catalog containing all translations. The file name is what you specified in `bindtextdomain` as well as `textdomain`
There are a few things that need explaining. The library will first try to find an exact match for the current locale, but if that is not possible it will look for similar locales. As an example, if the user has `de_DE` but you only provided `de`, the library will first look for `de_DE` and, when it doesn’t find it, fall back to the more general locale `de`.
The `LC_MESSAGES` part of the path is the category name of the translation we are looking for. `LC` stands for locale category, and there are various others like `LC_CTYPE`, `LC_NUMERIC`, `LC_TIME` and `LC_MONETARY`, which all specify how to handle various things like numbers, dates and currency. For our purposes we only focus on messages: raw strings or texts we want to translate. `gettext` will always use `LC_MESSAGES` as its category. There are other functions, like `dcgettext`, that let you specify which category to look in, but for this post we won’t cover those.
The `.mo` file is the message catalog, which has to be generated, so let’s look at how that works next.
Initial Project Setup
In our very complex example we have marked the `Hello World!` string as a translatable string. We can now use `xgettext` to extract these marked strings:
This produces a `.pot` file, a Portable Object Template. It can look like this:
```
# SOME DESCRIPTIVE TITLE.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"POT-Creation-Date: 2022-05-05 14:47+0200\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"Language: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: 8bit\n"

#: main.cpp:10
msgid "Hello World!"
msgstr ""
```
At the bottom you can find our `Hello World!` string, which comes from `main.cpp` at line 10. This file can now be used to create `.po` (Portable Object) files using `msginit`:
Every language you want to support gets its own `.po` file. The `.pot` file is just a template used to create the `.po` files. The `.po` file looks almost the same:
```
msgid ""
msgstr ""
"POT-Creation-Date: 2022-05-05 14:47+0200\n"
"PO-Revision-Date: 2022-05-04 18:41+0200\n"
"Last-Translator: <EMAIL@ADDRESS>\n"
"Language-Team: German\n"
"Language: de\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CP1252\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"X-Generator: Poedit 3.0.1\n"

#: main.cpp:10
msgid "Hello World!"
msgstr "Hallo Welt!"
```
I already went ahead and translated `Hello World!` to `Hallo Welt!` using Poedit, but there are other tools like KBabel, Gtranslator, PO Mode and more which you can use to edit the `.po` files. The ecosystem is very mature, and platforms like Transifex also support it.
Now, with your fully or partially translated `.po` file in hand, we will use `msgfmt` to create our final output file:
You now have a `.mo` file that can be loaded at runtime. But what happens when you change your code? How do you update your `.pot`, `.po` and `.mo` files when there are new, removed or changed strings in your code?
Source changed, what to do?
What we have looked at so far is the initial setup phase: this is what you do when you have no previous `.pot` or `.po` files and generate them for the first time. If you already have them and the code changed, you need to run `xgettext` again:
The `.pot` file can be overwritten as you please since it only holds generated content. The `.po` files are more important, since you don’t want to redo all your translations. For this reason we use `msgmerge` to update the `.po` files with the new template:
The tool takes the current `.po` file and the new `.pot` file as input and spits out an updated `.po` file that keeps your existing translations as long as they are still used.
Automating and integrating with CMake
If you found all of this extremely tedious to do by hand, you are not alone. Of course we can automate generating and updating all the required files, as well as copying the `.mo` files to our output directory at build time, using a CMake script. To get started, grab a copy of this script or add the repository as a submodule. In your `CMakeLists.txt` file we need to add some lines:
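These lines are a sketch; the variable names match the macros used later in the code, but the concrete values are placeholders:

```cmake
set(GETTEXT_DOMAIN "my-domain")     # name of the .mo catalog
set(GETTEXT_OUTPUT_DIR "locales")   # top-level translations folder
set(GETTEXT_LANGUAGES de fr)        # languages you want to support

# expose the domain and folder to the C++ code as predefined macros
add_compile_definitions(
    GETTEXT_DOMAIN="${GETTEXT_DOMAIN}"
    GETTEXT_OUTPUT_DIR="${GETTEXT_OUTPUT_DIR}")
```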
First we specify some variables and add `GETTEXT_DOMAIN` and `GETTEXT_OUTPUT_DIR` as predefined macros so we can use them in our code. I suggest you change the domain name and the list of languages you want to use.
Next we want to add the previously downloaded script:
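The call looks roughly like the following; the function name and argument values here are placeholders — use whatever the downloaded script actually exports (the option names match the parameters described below):

```cmake
# function name is a placeholder for the one provided by the script
gettext_create_translations(
    DOMAIN ${GETTEXT_DOMAIN}
    TARGET_NAME translations
    SOURCES ${MY_SOURCES}
    POTFILE_DESTINATION po
    XGETTEXT_ARGS
        --package-name=my-app
        --package-version=1.0
        --msgid-bugs-address=bugs@example.com
        --copyright-holder=me
    LANGUAGES ${GETTEXT_LANGUAGES}
    BUILD_DESTINATION ${CMAKE_BINARY_DIR}/${GETTEXT_OUTPUT_DIR}
    ALL
)
```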
There is a lot to configure, so let’s go through it all:

- `DOMAIN`: set the domain name
- `TARGET_NAME`: set the CMake target name
- `SOURCES`: set a list of all source files that `xgettext` should look through; I suggest creating a variable that holds all your sources
- `POTFILE_DESTINATION`: the directory where the `.pot` file goes
- `XGETTEXT_ARGS`: most of the extra arguments supplied are just for flavor, like specifying the package name and version as well as the address where you can report bugs and the copyright holder
- `LANGUAGES`: set the list of languages we want to support
- `BUILD_DESTINATION`: the top-level output folder for the generated `.mo` files
- `ALL`: adds the custom CMake target to the default build target so that it runs every time
And that’s it. Doing a CMake configure will generate the `.pot` and `.po` files, and you now have some new targets that will generate the `.mo` files.
Complete Code Example
Woah, what happened to our 12 lines of code? Things are sadly not as easy as you would want them to be. The main function is still the same, and we still have our marked string `Hello World!`, but now there is a new function: `setup_i18n`. This function takes a `std::string_view` as an argument and sets the current locale to something new. In this case I want to change the locale to `de` so my German translation for `Hello World!` can be loaded. In your actual code you’d want some language option the user can change, which would then call this function with the new locale.
The thing that makes this messy is of course the difference between systems. On a POSIX-compliant system you can just call `setlocale(LC_MESSAGES, "de")`, but this doesn’t work at all on Windows. On Windows you need to use `SetThreadLocale`, which requires a locale ID that you can get from a name like `de` using `LocaleNameToLCID`. The `_configthreadlocale(_DISABLE_PER_THREAD_LOCALE)` call is important because `SetThreadLocale` only affects the current thread (who would have thought?), so we want to disable this behavior and change the locale of this and all future threads.
The code after that uses the new predefined macros we added in our CMake file, and I also added a call to the very useful `bind_textdomain_codeset` function. I suggest reading my post on Character encodings and Unicode, where I explain the mess that is code pages and Unicode. With this function call `gettext` will always return a UTF-8 string. If you don’t want or need that you can remove the call, but for frameworks like GTK it is required, as they accept UTF-8 only.
Alternatives
GNU gettext has been around for over 30 years; it is battle tested, has great support and a mature ecosystem. However, if this is not for you or you can’t use it, there are a few alternatives available:
- DIY: always possible but highly discouraged
- Qt: of course the Qt ecosystem has its own way of doing translations
- ICU: International Components for Unicode offers ICU4C; however, it is not easy to use at all
- Boost.Locale: uses ICU as a backend
- POSIX: `catgets` was created back in 1987, before GNU gettext
- Win32: `LoadString` lets you load string resources by ID