Following my post on Character encodings and Unicode, it is now time to talk about i18n with GNU gettext. We will look at i18n and l10n in general and then talk about how gettext can make our lives as programmers very easy.

i18n and l10n - Internationalization and Localization

Because Internationalization and Localization are such long words, you can just write i18n and l10n. These are “numeronyms”: number-based words formed by taking the first and last character of a word and putting the number of letters between them in the middle. Internationalization starts with an i, ends with an n and has 18 letters in-between, resulting in i18n.

For us software developers these terms mean adapting our code to be locale agnostic. If you create a UI and hard-code all strings then users won’t be able to change the language. Aside from plain translations, i18n and l10n also encompass formatting rules for numbers, date and time, currency and things like text layout. Some languages read left to right, others right to left, and in some cultures text is read vertically instead of horizontally.
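
To make the formatting point concrete, here is a minimal sketch using only the standard library; the locale name de_DE.UTF-8 is an assumption (the locale must be installed on the system, and Windows spells locale names differently):

#include <iostream>
#include <locale>

int main() {
  const double price = 1234.56;

  // The default "C" locale uses a period as the decimal separator
  // and no thousands separator: 1234.56
  std::cout.imbue(std::locale::classic());
  std::cout << price << '\n';

  // A German locale swaps the separators: 1.234,56
  // std::locale("de_DE.UTF-8") throws std::runtime_error if the
  // locale is not installed on the system.
  std::cout.imbue(std::locale("de_DE.UTF-8"));
  std::cout << price << '\n';

  return 0;
}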

All of this might seem overwhelming, and in reality you will likely never have to deal with all of it. For open-source projects it’s often enough to just have everything in English and maybe provide a way to load translations.

GNU gettext and libintl

i18n and l10n are all nice and good, but how should we programmers design our software to support these concepts? This is where gettext and libintl come into play.

In 1995 the GNU Project released GNU gettext into the world. The package offers an integrated set of tools as well as the libintl runtime library for dealing with translations. We will take a look at the tools xgettext, msginit, msgmerge and msgfmt, how to use the gettext library in our code and what the process of creating translations looks like.

CMake Setup

In order for us to use libintl in our code we need to get it from somewhere. In this example we will use CMake and vcpkg.

cmake_minimum_required(VERSION 3.8)

project (
 "CppInternationalization"
 VERSION 1.0.0
 LANGUAGES CXX
)

# Using C++20

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)

add_executable(${PROJECT_NAME} "main.cpp")

# add libintl

find_package(Intl REQUIRED)
target_link_libraries(${PROJECT_NAME} PUBLIC ${Intl_LIBRARIES})
target_include_directories(${PROJECT_NAME} PUBLIC ${Intl_INCLUDE_DIRS})

This is a very basic CMakeLists.txt file containing one dependency, which we will get with vcpkg using vcpkg install gettext[tools]. The tools feature is important so that we also get the programs required for our setup.

Code Setup

#include <iostream>
#include <cstdlib>

int main() {
  std::cout << "Hello World!" << std::endl;
  return EXIT_SUCCESS;
}

This is the most basic C++ program possible. In the code we hard-coded the string Hello World!, and now we want to provide translations. The libintl runtime library has exactly what we need:

#include <iostream>
#include <cstdlib>
#include <libintl.h>

int main() {
  std::cout << gettext("Hello World!") << std::endl;
  return EXIT_SUCCESS;
}

The gettext function from the libintl.h header will now look for a translation of the string Hello World! for the current locale at runtime. If it does not find one it will just return Hello World!, which is very convenient since all we have to do is wrap our strings in gettext(). If gettext is too long for you, you can create a macro:

#include <iostream>
#include <cstdlib>
#include <libintl.h>

#define _(String) gettext(String)

int main() {
  std::cout << _("Hello World!") << std::endl;
  return EXIT_SUCCESS;
}

This macro is very commonly used in projects that use GNU gettext and is even included in frameworks like GTK.

So where does gettext() look for my translations? You will have to specify that with bindtextdomain:

#include <iostream>
#include <cstdlib>
#include <libintl.h>

#define _(String) gettext(String)

int main() {
  bindtextdomain("my-domain", "locales");
  textdomain("my-domain");

  std::cout << _("Hello World!") << std::endl;
  return EXIT_SUCCESS;
}

Now at runtime if the user has the de locale set, gettext() will look at locales/de/LC_MESSAGES/my-domain.mo for a translation. Let’s break this path down:

  • locales: the folder specified in bindtextdomain
  • de: the current locale
  • LC_MESSAGES: the category name of the translation
  • my-domain.mo: the binary message catalog containing all translations. The file name is what you specified in bindtextdomain as well as textdomain

There are a few things that need explaining. The library will first try to find an exact match for the current locale, but if that is not possible it will look for similar locales. As an example, if the user has de_DE but you only provided de, the library will first look for de_DE and, when it doesn’t find it, fall back to the more general locale de.

The LC_MESSAGES part of the path is the category name of the translation we are looking for. LC stands for locale category and there are various others like LC_CTYPE, LC_NUMERIC, LC_TIME and LC_MONETARY, which specify how to handle things like character classification, numbers, dates and currency. For our purposes we only care about messages, the raw strings or texts we want to translate. gettext will always use LC_MESSAGES as its category. There are other functions, like dcgettext, that let you specify the category to look in, but we won’t cover those in this post.

The .mo file is the message catalog which has to be generated. So let’s look at how that works next.

Initial Project Setup

In our very complex example we have marked the Hello World! string as a translatable string. We can now use xgettext to extract these marked strings:

xgettext main.cpp --keyword="_" --output="locales/my-domain.pot"

This .pot file is a Portable Object Template file. It can look like this:

# SOME DESCRIPTIVE TITLE.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"POT-Creation-Date: 2022-05-05 14:47+0200\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"Language: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: 8bit\n"

#: main.cpp:10
msgid "Hello World!"
msgstr ""

At the bottom you can find our Hello World! string, which comes from main.cpp at line 10. This file can now be used to create .po (Portable Object) files using msginit:

msginit --input="locales/my-domain.pot" --output-file="locales/de/my-domain.po" --locale="de"

Every language you want to support gets its own .po file; the .pot file is just the template the .po files are created from. The .po file looks almost the same:

msgid ""
msgstr ""
"POT-Creation-Date: 2022-05-05 14:47+0200\n"
"PO-Revision-Date: 2022-05-04 18:41+0200\n"
"Last-Translator: <EMAIL@ADDRESS>\n"
"Language-Team: German\n"
"Language: de\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CP1252\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"X-Generator: Poedit 3.0.1\n"

#: main.cpp:10
msgid "Hello World!"
msgstr "Hallo Welt!"

I already went ahead and translated Hello World! to Hallo Welt! using Poedit, but there are other tools like KBabel, Gtranslator and PO Mode which you can use to edit the .po files. The ecosystem is very mature and platforms like Transifex also support the format.

Now with your fully or partially translated .po file in hand we will use msgfmt to create our final output file:

msgfmt "locales/de/my-domain.po" --output-file="locales/de/my-domain.mo"

You now have a .mo file that can be loaded at runtime. But what happens when you change your code? How do you update your .pot, .po and .mo files when there are new, removed or changed strings in your code?

Source changed, what to do?

What we have looked at so far is the initial setup phase. This is what you do when you have no previous .pot or .po files and are generating them for the first time. If you already have them and the code changed, you need to run xgettext again:

xgettext main.cpp --keyword="_" --output="locales/my-domain.pot"

The .pot file can be overwritten as you please since it only holds generated content. The .po files are more important since you don’t want to re-do all translations. For this reason we use msgmerge to update the .po files with the new template:

msgmerge "locales/de/my-domain.po" "locales/my-domain.pot" --output-file="locales/de/my-domain.po"

The tool takes the current .po file and the new .pot file as input and spits out an updated .po file that keeps your existing translations as long as they are still used.

Automating and integrating with CMake

If you found all of this extremely tedious to do by hand, you are not alone. Of course we can automate generating and updating all the required files, as well as copying the .mo files to our output directory at build time, using a CMake script.

To get started, grab a copy of this script or add the repository as a submodule. Then we need to add some lines to our CMakeLists.txt file:


# setup gettext

set(GETTEXT_DOMAIN "my-domain")
set(GETTEXT_TARGET "gettext-target")
set(GETTEXT_OUTPUT_DIR "locales")
set(GETTEXT_LANGUAGES "en" "de")

target_compile_definitions(${PROJECT_NAME} PUBLIC "GETTEXT_DOMAIN=\"${GETTEXT_DOMAIN}\"")
target_compile_definitions(${PROJECT_NAME} PUBLIC "GETTEXT_OUTPUT_DIR=\"${GETTEXT_OUTPUT_DIR}\"")

First we specify some variables and add GETTEXT_DOMAIN and GETTEXT_OUTPUT_DIR as predefined macros so we can use them in our code. I suggest you change the domain name and the list of languages to whatever you want to use.

Next we want to add the previously downloaded script:

include("extern/gettext-cmake/Gettext_helpers.cmake")
CONFIGURE_GETTEXT(
 DOMAIN ${GETTEXT_DOMAIN}
 TARGET_NAME ${GETTEXT_TARGET}
 SOURCES "main.cpp"
 POTFILE_DESTINATION ${GETTEXT_OUTPUT_DIR}
 XGETTEXT_ARGS
  "--keyword=_"
  "--add-comments=TRANSLATORS:"
  "--package-name=${PROJECT_NAME}"
  "--package-version=${PROJECT_VERSION}"
  "--msgid-bugs-address=<https://github.com/erri120/${PROJECT_NAME}/issues>"
  "--copyright-holder=erri120"
 LANGUAGES ${GETTEXT_LANGUAGES}
 BUILD_DESTINATION $<TARGET_FILE_DIR:${PROJECT_NAME}>/${GETTEXT_OUTPUT_DIR}
 ALL
)

There is a lot to configure, so let’s go through it all:

  • DOMAIN: set the domain name
  • TARGET_NAME: set the CMake target name
  • SOURCES: set the list of source files that xgettext should look through; I suggest creating a variable that holds all your sources
  • POTFILE_DESTINATION: the directory where the .pot file goes
  • XGETTEXT_ARGS: extra arguments passed to xgettext; most of the ones supplied here are just for flavor, like the package name and version, the address where bugs can be reported and the copyright holder
  • LANGUAGES: set the list of languages we want to support
  • BUILD_DESTINATION: the top-level output folder for the generated .mo files
  • ALL: adds the custom CMake target to the default build target so that it runs on every build

And that’s it. Doing a CMake configure will generate the .pot and .po files, and you now have some new targets that will generate the .mo files.

Complete Code Example

#include <iostream>
#include <cstdlib>
#include <clocale>
#include <string>
#include <string_view>
#include <libintl.h>

#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#endif

#define _(STRING) gettext(STRING)

static void setup_i18n(const std::string_view locale) {
#ifdef _WIN32
  // LocaleNameToLCID requires a LPCWSTR so we need to convert from char to wchar_t
  const auto wStringSize = MultiByteToWideChar(CP_UTF8, 0, locale.data(), static_cast<int>(locale.length()), nullptr, 0);
  std::wstring localeName;
  localeName.resize(wStringSize);
  MultiByteToWideChar(CP_UTF8, 0, locale.data(), static_cast<int>(locale.length()), localeName.data(), wStringSize);

  _configthreadlocale(_DISABLE_PER_THREAD_LOCALE);
  const auto localeId = LocaleNameToLCID(localeName.c_str(), LOCALE_ALLOW_NEUTRAL_NAMES);
  SetThreadLocale(localeId);
#else
  setlocale(LC_MESSAGES, locale.data());
#endif

  bindtextdomain(GETTEXT_DOMAIN, GETTEXT_OUTPUT_DIR);
  bind_textdomain_codeset(GETTEXT_DOMAIN, "UTF-8");
  textdomain(GETTEXT_DOMAIN);
}

int main() {
  setup_i18n("de");

  std::cout << _("Hello World!") << std::endl;

  return EXIT_SUCCESS;
}

Woah, what happened to our 12 lines of code? Things are sadly not as easy as we would like them to be. The main function is still the same and we still have our marked string Hello World!, but now there is a new function: setup_i18n.

This new function takes a std::string_view as an argument and sets the current locale. In this case I want to change the locale to de so my German translation of Hello World! gets loaded. In your actual code you’d want to have some language option the user can change, which would then call this function with the new locale.
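
As a rough sketch of how such an option could end up calling setup_i18n, the following variation of main reads the locale from the LANGUAGE environment variable and falls back to en; the variable name and the fallback are assumptions, and a real application might take the value from a settings file or a UI dropdown instead. It reuses setup_i18n and the _ macro from the example above:

int main() {
  // Hypothetical: take the locale from the LANGUAGE environment
  // variable, falling back to "en" when it is not set.
  const char* requestedLocale = std::getenv("LANGUAGE");
  setup_i18n(requestedLocale != nullptr ? requestedLocale : "en");

  std::cout << _("Hello World!") << std::endl;
  return EXIT_SUCCESS;
}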

The thing that makes this messy is of course the difference between systems. On a POSIX-compliant system you can just call setlocale(LC_MESSAGES, "de"), but this doesn’t work at all on Windows.

On Windows you need to use SetThreadLocale, which requires a locale ID that you can get from a name like de using LocaleNameToLCID. The _configthreadlocale(_DISABLE_PER_THREAD_LOCALE) call is important because SetThreadLocale only affects the current thread (who would have thought), so we disable per-thread locales and change the locale for this and all future threads.

The code after that uses the new predefined macros we added in our CMake file, and I also added a call to the very useful bind_textdomain_codeset function. I suggest reading my Character encodings and Unicode post, where I explain the mess that is code pages and Unicode. With this call gettext will always return a UTF-8 encoded string. If you don’t want or need that you can remove the call, but frameworks like GTK require it as they only accept UTF-8.

Alternatives

GNU gettext has been around since 1995; it is battle tested, has great support and a mature ecosystem. However, if it is not for you or you can’t use it, there are a few alternatives available:

  • DIY: always possible but highly discouraged
  • Qt: the Qt ecosystem of course has its own way of doing translations
  • ICU: International Components for Unicode offers ICU4C, but it is not easy to use at all
  • Boost.Locale: uses ICU as a backend
  • POSIX: catgets, which was created back in 1987, before GNU gettext
  • Win32: LoadString, which lets you load string resources by ID