Following my post on Character encodings and Unicode it is now time to talk about i18n with GNU gettext. We will look at i18n and l10n in general and then talk about how gettext can make our lives as programmers much easier.
i18n and l10n - Internationalization and Localization
Due to the length of Internationalization and Localization you can just write i18n and l10n, which are “numeronyms”: number-based words formed by taking the first and last character of a word and putting the number of letters between these two characters in the middle. Internationalization starts with an i, ends with an n and has 18 letters in-between, resulting in i18n.
For us software developers these terms mean adapting our code to be locale agnostic. If you create a UI and hard-code all strings then the users won’t be able to change the language. Aside from plain translations, i18n and l10n also encompass formatting rules for numbers, date and time, and currency, as well as things like text layout. Some languages read left to right, others right to left. Instead of reading horizontally, there are also cultures where you read vertically.
All of this might seem overwhelming and in reality you will likely never have to deal with this. For Open-Source projects it’s often enough to just have everything in English and maybe provide a way to load translations.
GNU gettext and libintl
i18n and l10n is all nice and good, but how should we programmers design our software to support these concepts? This is where GNU gettext and its libintl runtime library come into play.
In 1995 the GNU project released GNU gettext into the world. The package offers an integrated set of tools as well as the libintl runtime library for dealing with translations. We will take a look at the tools xgettext, msginit, msgmerge and msgfmt, how to use the gettext library in our code and what the process of creating translations is.
This is a very basic CMakeLists.txt file containing one dependency, which we will get with vcpkg using vcpkg install gettext[tools]. The tools feature is very important so we also get the programs required for our setup.
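A minimal sketch of such a file, using CMake's bundled FindIntl module (project and target names are placeholders):

```cmake
cmake_minimum_required(VERSION 3.21)
project(hello-gettext CXX)  # placeholder project name

# FindIntl locates the libintl runtime library installed via vcpkg.
find_package(Intl REQUIRED)

add_executable(hello-gettext main.cpp)
target_link_libraries(hello-gettext PRIVATE Intl::Intl)
```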
This is the most basic C++ program possible. In the code we hard-coded the string
Hello World!, and now we want to provide translations. The
libintl runtime library has exactly what we need:
The gettext function from the libintl.h header will now look for a translation of the string Hello World! for the current locale at runtime. If it does not find a translation it will just use Hello World!, which is very nice since we only have to wrap all the strings in gettext() calls. If writing gettext everywhere is too long for you, you can create a macro:
This macro is very commonly used in projects that use GNU gettext and even included in frameworks like GTK.
So where does gettext() look for my translations? You have to specify that yourself with the bindtextdomain and textdomain functions.
Now at runtime if the user has the
de locale set,
gettext() will look at
locales/de/LC_MESSAGES/my-domain.mo for a translation. Let’s break this path down:
- locales: the folder specified in bindtextdomain
- de: the current locale
- LC_MESSAGES: the category name of the translation
- my-domain.mo: the binary message catalog containing all translations; the file name is what you specified in bindtextdomain as well as textdomain
There are a few things that need explaining. The library will first try to find an exact match for the current locale, but if that is not possible it will look for similar locales. As an example, if the user has de_DE but you only provided de, the library will first look for de_DE and, when it doesn’t find it, fall back to the more general de. The LC_MESSAGES part of the path is the category name of the translation we are looking for.
LC stands for locale category and there are various others like LC_NUMERIC, LC_TIME and LC_MONETARY, which specify how to handle things like numbers, dates and currency. For our purposes we only focus on messages: raw strings or texts we want to translate.
gettext will always use LC_MESSAGES as its category. There are other functions that let you specify which category you want to look for, like dcgettext, but for this post we won’t look at those. The .mo file is the message catalog, which has to be generated. So let’s look at how that works next.
Initial Project Setup
In our very complex example we have marked the
Hello World! string as a translatable string. We can now use
xgettext to extract these marked strings:
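Assuming the strings were marked with the _ macro and live in main.cpp, the invocation might look like this (file and domain names are examples):

```shell
# Extract all strings marked with _() into a Portable Object Template.
xgettext --keyword=_ --from-code=UTF-8 --output=my-domain.pot main.cpp
```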
The resulting .pot file is a Portable Object Template file. It can look like this:
```
# SOME DESCRIPTIVE TITLE.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"POT-Creation-Date: 2022-05-05 14:47+0200\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"Language: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: 8bit\n"

#: main.cpp:10
msgid "Hello World!"
msgstr ""
```
At the bottom you can find our
Hello World! string which comes from
main.cpp at line
10. This file can now be used to create .po (Portable Object) files using msginit:
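A possible msginit call to create a German translation file (file names are examples):

```shell
# Derive a de.po for German from the template.
msginit --input=my-domain.pot --locale=de --output-file=de.po
```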
Every language you want to support gets its own .po file. The .pot file is just a template used to create the .po files. The .po file looks almost the same:
```
msgid ""
msgstr ""
"POT-Creation-Date: 2022-05-05 14:47+0200\n"
"PO-Revision-Date: 2022-05-04 18:41+0200\n"
"Last-Translator: <EMAIL@ADDRESS>\n"
"Language-Team: German\n"
"Language: de\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CP1252\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"X-Generator: Poedit 3.0.1\n"

#: main.cpp:10
msgid "Hello World!"
msgstr "Hallo Welt!"
```
I already went ahead and translated
Hello World! to
Hallo Welt! using Poedit, but there are other tools like KBabel, Gtranslator, PO Mode and more which you can use to edit the .po files. The ecosystem is very mature and platforms like Transifex also support it.
Now with your fully or partially translated
.po file in hand we will use
msgfmt to create our final output file:
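For the German example file, that step might look like this (paths match the lookup path described earlier):

```shell
# Compile the .po into the binary catalog that gettext() loads at runtime.
msgfmt --output-file=locales/de/LC_MESSAGES/my-domain.mo de.po
```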
You now have a
.mo file that can be loaded at runtime. But what happens when you change your code? How do you update your
.mo files when there are new, removed or changed strings in your code?
Source changed, what to do?
What we looked at so far is the initial setup phase. This is what you do when you have no previous
.po files and generate them for the first time. If you already have them and the code changed, you need to run xgettext again. The .pot file can be overwritten as you please since it only holds generated content. The
.po files are more important since you don’t want to re-do all translations. For this reason we use
msgmerge to update the
.po files with the new template:
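Assuming the same example file names as before, an update cycle could be:

```shell
# Re-extract the template, then merge it into the existing translation,
# keeping translations that are still used.
xgettext --keyword=_ --output=my-domain.pot main.cpp
msgmerge --update de.po my-domain.pot
```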
The tool takes the current
.po file and the new
.pot file as input and spits out an updated
.po file that keeps your existing translations as long as they are still used.
Automating and integrating with CMake
If you found all of this extremely tedious to do by hand then you are not alone. Of course we can automate the generating and updating for all the required files as well as copying the
.mo files to our output at build time using a CMake script.
To get started, grab a copy of this script or add the repository as a submodule. In your
CMakeLists.txt file we need to add some lines:
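A sketch of what those lines might be; every name except GETTEXT_OUTPUT_DIR is my own suggestion:

```cmake
set(GETTEXT_DOMAIN "my-domain")   # the text domain used by the code
set(GETTEXT_LANGUAGES de fr)      # languages you want to support
set(GETTEXT_OUTPUT_DIR "locales") # where the .mo files end up

# Expose the output directory to the C++ code as a predefined macro.
add_compile_definitions(GETTEXT_OUTPUT_DIR="${GETTEXT_OUTPUT_DIR}")
```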
First we specify some variables and add
GETTEXT_OUTPUT_DIR as a predefined macro so we can use it in our code. I suggest you change the domain name and the list of languages you want to use.
Next we want to add the previously downloaded script:
There is lots to configure so let’s go through it all:
- DOMAIN: set the domain name
- TARGET_NAME: set the CMake target name
- SOURCES: set a list of all source files that xgettext should look through; I suggest creating some variable that holds all your sources
- POTFILE_DESTINATION: the directory where the generated .pot file is placed
- XGETTEXT_ARGS: most of the extra arguments supplied are just for flavor, like specifying the package name and version as well as the address where you can report bugs and the copyright holder
- LANGUAGES: set the list of languages we want to support
- BUILD_DESTINATION: the top-level output folder for the generated .mo files
- ALL: adds the custom CMake target to the default build target so that it will be run every time
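Assuming the script exposes a single function taking the parameters above (the function name and variable names here are hypothetical), the call could look like:

```cmake
add_gettext_translations(
    DOMAIN ${GETTEXT_DOMAIN}
    TARGET_NAME translations
    SOURCES ${MY_SOURCES}
    POTFILE_DESTINATION po
    XGETTEXT_ARGS
        --package-name=${PROJECT_NAME}
        --package-version=1.0
        --copyright-holder=YourName
    LANGUAGES ${GETTEXT_LANGUAGES}
    BUILD_DESTINATION ${GETTEXT_OUTPUT_DIR}
    ALL
)
```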
And that’s it. Doing a CMake configure will generate the .po files, and you now have some new targets that will generate the .mo files.
Complete Code Example
Woah, what happened to our 12 lines of code? Things are sadly not as easy as you want them to be. The main function is still the same, we have our marked string Hello World!, but now there is a new function for switching the locale.
This new function takes a
std::string_view as an argument and sets the current locale to something new. In this case I want to change the locale to
de so my German translation for
Hello World! can be loaded. In your actual code you’d want to have some language option the user can change which would call this function with the new locale.
The thing that makes this messy is of course the difference between systems. On a POSIX-compliant system you can just call setlocale(LC_MESSAGES, "de"), but this doesn’t work at all on Windows.
On Windows you need to use SetThreadLocale, which requires a locale ID that you can get from a locale name like de (for example via LocaleNameToLCID). The _configthreadlocale(_DISABLE_PER_THREAD_LOCALE) call is important because SetThreadLocale only affects the current thread, who would have thought, so we want to disable this behavior and change the locale of this and all future threads.
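Put together, such a locale-switching helper might look like the following sketch (the function name set_locale is mine; the post's actual code may differ):

```cpp
#include <clocale>
#include <string>
#include <string_view>
#ifdef _WIN32
#include <locale.h>  // _configthreadlocale
#include <windows.h> // SetThreadLocale, LocaleNameToLCID
#endif

// Hypothetical helper: switch the message locale for the whole process.
void set_locale(std::string_view locale)
{
#ifdef _WIN32
    // Make locale changes apply to this and all future threads instead
    // of only the current one.
    _configthreadlocale(_DISABLE_PER_THREAD_LOCALE);
    // SetThreadLocale wants a locale ID, not a name like "de".
    std::wstring wide(locale.begin(), locale.end());
    SetThreadLocale(LocaleNameToLCID(wide.c_str(), 0));
#else
    // On POSIX systems a plain setlocale call is enough.
    setlocale(LC_MESSAGES, std::string(locale).c_str());
#endif
}
```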
The code after that uses the new predefined macros we added in our CMake file, and I also added a call to the very useful bind_textdomain_codeset function. I suggest reading my Character encodings and Unicode post where I explain the mess that is code pages and Unicode. With this function call gettext will always return a UTF-8 string. If you don’t want or need that you can remove this call, but for frameworks like GTK it is required as they accept UTF-8 only.
GNU gettext has been around for over 30 years, it is battle tested, has great support and a matured ecosystem. However, if this is not for you or if you can’t use it there are a few alternatives available:
- DIY: always possible but highly discouraged
- Qt: of course the Qt ecosystem has its own way of doing translations
- ICU: International Components for Unicode has ICU4C, however it is not easy to use at all
- Boost.Locale: uses ICU as a backend
- POSIX: catgets, which was created back in 1987, before GNU gettext
- Win32: LoadString, which lets you load string resources by ID