Excellent quality Text-To-Speech with Linux (French / Français)

What device?

Any Linux-supported device that is able to run Android: x86, ARM, MIPS.

Under which conditions?

On Debian Linux Wheezy (or Android; ARM preferred).

What to expect?

An excellent-quality Text-To-Speech (TTS from here on) engine for any textual input, in French (English would do too, but English is far better served already), without an Internet connection.

What were the pitfalls?

The usual solutions have very poor quality, when they work at all.

 

1) Analysis of existing solutions

For anyone interested in a TTS solution running on Linux, a little bit of Google time will return:

  1. eSpeak
  2. MaryTTS
  3. Festival & Flite
  4. PicoTTS

eSpeak is a synthesis solution that generates “speech” from a physical vocal-tract model (larynx, tongue, etc.). The speech is terrible: completely robotic, barely understandable, not acceptable. However, eSpeak provides a very useful text-to-phoneme conversion.
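For example, running “espeak -v fr -q -x "Bonjour tout le monde"” prints the French phoneme mnemonics without producing any sound (-q mutes audio, -x dumps phonemes; check the man page of your eSpeak version, as options may differ).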

MaryTTS is the most promising solution (the speech quality is much better than eSpeak's, but still very far from a commercial solution), but its Java client/server architecture will limit the casual hobbyist. Technically, this project is the one that gave the best speech of all the open-source engines I've found so far.

Festival and Flite are not worth any criticism; they are so bad in every aspect that one should not even try them.

PicoTTS is an “open <crappy> source” solution from when Google, in their infinite wisdom, decided to have an open-source option for their text-to-speech technology. The quality is not that bad; however, the source code is “take it as it is”: you get a completely fixed source that is almost unmodifiable, since the engine code is just a “player” for a voice database that is neither documented nor described.
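For a quick quality check, Debian ships the engine in the libttspico-utils package; something like “pico2wave -l fr-FR -w test.wav "Bonjour tout le monde"” followed by playing test.wav gives a good idea (package and option names from the Debian builds I've used; verify on your distribution).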

Later Android revisions changed the engine to something better (in fact, PicoTTS was originally written by SVOX, which in turn was bought by Nuance. Nuance is well known for aggressively swallowing its competitors to remove any competition from the market).

2) Other options

If you check the available TTS vendors, you'll find very, very good engines and voices from Cereproc, SVOX, NeoSpeech and Voxygen (my four best choices).

It's very important to understand the amount of effort required to build a (perfect) voice from scratch; if you don't, you won't understand the next paragraphs.

Making a voice requires very consistent reading of a text (no emotion, as pleasant as possible) for numerous hours (12 h is a minimum).

Then you must segment each phoneme (software can “pre-split” for you, but the result is usually bad) and let a computer build a model of your voice. Typically two models are built: one is called “unit selection”, the other a “hidden Markov model”. The former is used for common words and the sound is perfect; the latter is used for unknown words and the sound is a little worse.

I've contacted them and asked for quotations. They were all very friendly, and if I had to start a professional solution, I would really bootstrap from their work; the prices they asked for are not illogical compared to the work achieved.

Cereproc has a REST- and SOAP-based solution for TTS that's very easy to use (2 h of work to get it running). Their price is not that high, as long as you don't intend to read a whole book with it. However, you need an Internet connection to use it, as with all other cloud services.
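To give an idea of how simple such a cloud TTS client is, here is a minimal libcurl sketch. The URL, parameters and voice name below are placeholders I made up for illustration, NOT Cereproc's actual API; their documentation has the real endpoint and authentication details.

/* build: gcc tts_client.c -lcurl */
#include <stdio.h>
#include <curl/curl.h>

/* Write callback: dump whatever audio the server returns into a file */
static size_t save_audio(void *data, size_t size, size_t nmemb, void *userdata)
{
    return fwrite(data, size, nmemb, (FILE *)userdata);
}

int main(void)
{
    FILE *out = fopen("speech.wav", "wb");
    CURL *curl = curl_easy_init();
    if (!out || !curl) return 1;

    /* Placeholder endpoint and parameters, for illustration only */
    curl_easy_setopt(curl, CURLOPT_URL, "https://tts.example.com/synthesize");
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, "voice=suzanne&text=Bonjour tout le monde");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, save_audio);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);

    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));

    curl_easy_cleanup(curl);
    fclose(out);
    return rc != CURLE_OK;
}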

However, I'm no professional, so I had to use my own brain to solve my problem.

 

3) Chosen solution

Anyone using an Android phone will have seen the number of voices available for almost nothing (the engine itself is always free on Android; why is that?).

I wondered whether it was possible to use these voices on a usual Linux system. You first need to purchase them (usually less than $4 per voice) and download them to your Android phone.

Then the hacking starts…

3.1) Backup of the voice engine and data

It's usual for Android applications to be in a locked state that prevents you, the content owner, from backing up the application for analysis. It's meant to prevent copying, but like most DRM, it's useless, because any DRM must provide a way to read the content (otherwise it's just called “noise”). Typically, you'll have to find a rooted phone/tablet (they are legion), navigate to the application folder and save it to an SD card for later analysis.
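With adb and root, the dance looks something like “adb shell su -c "cp /data/app/com.vendor.tts-1.apk /sdcard/"” followed by “adb pull /sdcard/com.vendor.tts-1.apk” (the package name is hypothetical here, and the exact path under /data/app varies with the Android version).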

3.2) Disassembly of an application

There are numerous guides about Android APK disassembly. An APK is a ZIP file, and the JAR inside is DEX-encoded, so you need a dex2jar tool to get back a usual JAR file.

Then you can use JD to decompile the code back to Java source (minus the variable names).
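In practice the whole round trip is three commands: “unzip engine.apk -d engine/”, then “d2j-dex2jar.sh engine/classes.dex”, then open the resulting classes-dex2jar.jar in JD-GUI (file names as produced by the dex2jar releases I've used; the APK name is whatever you saved in step 3.1).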

Voice synthesis vendors usually build their engine in C and bind it to Java via JNI. The engine is delivered as a dynamic library (a .so file), and some Java glue code links the Android API to their internal API.
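As an illustration of what that glue looks like (the package and method names here are invented for the example), a Java declaration such as “private native int native_create(String config);” in a class com.vendor.tts.Engine shows up in the .so as an exported C symbol following the JNI naming scheme:

#include <jni.h>

/* JNI name mangling: Java_<package with dots as '_'>_<class>_<method>,
   with any '_' inside the method name escaped as "_1" */
JNIEXPORT jint JNICALL
Java_com_vendor_tts_Engine_native_1create(JNIEnv *env, jobject self, jstring config);

Spotting those Java_* symbols in the objdump output tells you which exported functions the Java side actually calls, and in what order.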

This library is compiled for the ARM platform (most of the time; I've yet to see a TTS working on an x86 Android platform).

If the application requires Android 2.3 (Gingerbread), the .so file will have to run on ARMv5 (otherwise ARMv7 might be the minimum supported version).

3.3) Using the engine.so file

Typically, the .so file cannot be used on an ARMv5 Linux system directly, because:

  1. Android's dynamic linker is completely different from glibc's linker
  2. Android does not use glibc but a minimal libc (called Bionic)
  3. You don't have the headers for the library file

In order to solve 1 and 2, you might use a project called libhybris.

libhybris emulates Android's linker and the specificities of the Bionic C library.

To solve 3, you'll need to write a minimal header yourself, based on the decompiled Java source you got in step 3.2 (and a bit of “objdump -tTC engine.so”).

Typically, in C, you don't need to know the layout of a structure as long as it's an opaque container and you only deal with pointers to it. You'll write code like this:

struct MyStruct;
struct MyStruct * engine_create(); // If engine_create is an exported symbol from the library and the JNI wrapper seems to call it this way
int get_sample_rate(struct MyStruct *, int channel);  // etc...
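As a sketch of how such a header gets used: on a plain glibc system you would resolve the symbols with dlopen/dlsym as below, while with libhybris you go through its replacement loader instead, since the stock dynamic linker cannot load a Bionic .so. The symbol names are the guessed ones from above.

/* build: gcc main.c -ldl */
#include <stdio.h>
#include <dlfcn.h>

struct MyStruct;  /* opaque engine handle, as above */

int main(void)
{
    void *lib = dlopen("./libengine.so", RTLD_NOW);
    if (!lib) { fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

    /* Resolve the entry points we guessed from the JNI wrapper */
    struct MyStruct *(*engine_create)(void) =
        (struct MyStruct *(*)(void))dlsym(lib, "engine_create");
    int (*get_sample_rate)(struct MyStruct *, int) =
        (int (*)(struct MyStruct *, int))dlsym(lib, "get_sample_rate");
    if (!engine_create || !get_sample_rate) {
        fprintf(stderr, "missing symbol\n");
        return 1;
    }

    struct MyStruct *engine = engine_create();
    printf("sample rate: %d\n", get_sample_rate(engine, 0));
    dlclose(lib);
    return 0;
}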

3.4) Let’s put that together

So, you have written a minimal header. You'll need to write a bridge from this header to the engine.so file you got from the Android package; for this, you'll need the Android NDK.

You might want to follow those steps.

If you do everything right, you’ll get something that compiles and links (quite hard by itself).

Then you run it, and … it fails…

You’ll then have to understand why, where, how.

The most likely reasons are a bad header declaration, a link step that does not find the expected files in the right place, etc. GDB is your friend, but understanding what is going on here is not for the faint of heart; build libhybris with debug symbols and it'll be much clearer.
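On the glibc side, running the binary with “LD_DEBUG=libs,symbols ./test 2> linker.log” makes the dynamic linker narrate its search and resolution steps, which helps locate the first thing that goes wrong (this only traces glibc's own linker, not what libhybris does internally).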

I have not succeeded in getting the engine.so to work with this solution (even after fixing all the dynamic linking errors and the Bionic C library functions). It seems the engine's functions do not behave as described by their JNI wrapper, or they call something Android-specific that libhybris does not “fix” yet.

3.5) Fallback solution

Since patching a library this way is tricky by itself and (in my case) does not work, I decided to call the library directly on Android itself. I have a BeagleBone Black at home, so installing and running Android is very easy and does not cost much more than the price of the BBB (electricity does not count; the BBB is always on in my system).

I've written a minimal wrapper around the library (the same one I was using with the libhybris solution), built with the NDK using this Makefile:

Makefile:

AR = arm-linux-androideabi-ar
AS = arm-linux-androideabi-as
CC = arm-linux-androideabi-gcc
CXX = arm-linux-androideabi-c++
LD = arm-linux-androideabi-ld

NDK_KIT = /home/user/android-ndk-r9d/
PLATF_KIT = platforms/android-9/

ARM_INC = $(NDK_KIT)/$(PLATF_KIT)/arch-arm/usr/include
ARM_LIB = $(NDK_KIT)/$(PLATF_KIT)/arch-arm/usr/lib
# libgcc.a is linked by its full path (it is not a -l style library name)
LIBGCC = $(NDK_KIT)/toolchains/arm-linux-androideabi-4.6/prebuilt/linux-x86_64/lib/gcc/arm-linux-androideabi/4.6/libgcc.a

OBJS = main.o
EXES = test

# Link against Android's /system/bin/linker and Bionic libc, not glibc.
# crtbegin_dynamic.o must come first and crtend_android.o last.
# (Recipe lines must start with a TAB character.)
test: main.o
	$(LD) --dynamic-linker /system/bin/linker -nostdlib \
	-rpath /system/lib -rpath $(ARM_LIB) -rpath-link $(ARM_LIB) \
	$(ARM_LIB)/crtbegin_dynamic.o main.o \
	-L$(ARM_LIB) -lc -L. -lengine $(LIBGCC) \
	$(ARM_LIB)/crtend_android.o -o $@

main.o: main.c
	$(CC) -I $(ARM_INC) -g -c main.c

clean:
	rm -f $(OBJS) $(EXES)

I then ran the produced binary on the Android system… and it worked!
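(For the record, deploying the test binary is just “adb push test /data/local/tmp/”, “adb shell chmod 755 /data/local/tmp/test”, then “adb shell /data/local/tmp/test”.)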

Then I wrapped it in a TCP server, and as far as I've tested it, I now have a TTS engine with excellent voice quality, in my language, that works on Linux.
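The server side is nothing fancy. Here is a minimal sketch of the idea, reusing the hypothetical engine_create() from section 3.3 plus an invented engine_synthesize() standing in for whatever synthesis entry point the real engine exports (my actual wrapper differs):

#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

struct MyStruct;                     /* opaque handle, as in section 3.3 */
struct MyStruct *engine_create(void);
/* Invented for the sketch: fills 'pcm' with 16-bit samples, returns bytes */
int engine_synthesize(struct MyStruct *, const char *text, short *pcm, int max);

int main(void)
{
    struct MyStruct *engine = engine_create();
    static short pcm[1 << 20];

    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(4242);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 1);

    for (;;) {                       /* one line of text in, raw PCM out */
        int cli = accept(srv, NULL, NULL);
        char text[1024] = { 0 };
        if (read(cli, text, sizeof text - 1) > 0) {
            int bytes = engine_synthesize(engine, text, pcm, sizeof pcm);
            if (bytes > 0)
                write(cli, pcm, bytes);
        }
        close(cli);
    }
}

A client can then be as simple as “echo "Bonjour" | nc bbb 4242 | aplay -f S16_LE -r 22050” (the sample rate depends on the engine and voice).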

 

 

 


18 thoughts on “Excellent quality Text-To-Speech with Linux (French / Français)”

  1. Gaetan Lord says:

    Hi

    I would really love to make this work on my home automation system. I speak French and I have yet to find a better solution than Cepstral on Linux.

    Could I have access to an MP3/WAV file demonstrating the result?

    I haven't really worked with Android (compiling, hacking, etc.). Should I expect a lot of trouble implementing your solution on a BeagleBone?

    • Actually, it depends on how long you want to hack. If you are in a hurry, go to cereproc.com and contact them. They have a very good quality French voice (“Suzanne”) for 70€, IIRC (developer kit).
      It's available for x86/x64 Linux and also for ARMv5/v7 (with a little fix to link with it).

      I'm running it on a BBB too, which I use for my home automation system. The quality is EXACTLY as good as what you get in their website demo.
      The only downside with Cereproc is that their voices are huge, so the engine takes time and memory to start (although they have a way to limit that, at the cost of voice quality).

      BTW, I'm no longer running Android: I build a binary with the Android RDK as static, then run the binary on a Debian Linux on my BBB (so I don't wait for Android to boot / pay the Android JVM memory cost on my system).

      Otherwise, you can still do what this post describes, that is, hack an Android-based TTS and build it either for Android or for a usual Linux.

      You can contact me directly if you need to chez laposte.net, mail point pourri

  2. Ok_dac says:

    Dear sweethack,
    You did quite a feat hacking this Linux text-to-speech. I am very interested in using it, if I can employ an STT such as wit.ai with Linux. Do you plan to use some speech-to-text to create new voices?
    Thanks for your answer.

    • STT and TTS are very different processes. Usually, to build an STT engine, you need a *lot* of transcribed spoken content from *numerous* different speakers. That's what VoxForge is trying to gather (they actually did it for English, but for French they are still far from it).
      For TTS, you need a bunch of spoken content from a *single* speaker. You can't use the output of the latter to feed the former (because a copy of a copy is never on par with the original, and you'd need a bunch of engines).

      But all is not lost. I have the text-output part working, so now I'm focused on the input part, like you.
      I've tried to capture the speech recognition engine that Google uses in Android for offline speech recognition, but so far the most important part is missing (it's not open source, and reverse-engineering Google's Java code to figure out how to load their French model into the open-source engine is not going to work for me).

      There's also Kaldi, which is getting a lot of praise right now, so until I get concrete results with any solution, I'm staying completely open and testing.

  3. Ok_dac says:

    Dear sweethack,
    Thanks a lot for your answer.
    I've understood I could mail you using point pourri laposte
    If you agree, I will try to explain my goal that way.
    Good luck

  4. Great article!
    I'm currently working on what you did (i.e. porting an Android engine .so lib to Linux), but I can't see how you could “build a static binary” with a shared object library… this is not possible 😦
    Could you explain more and give me some help?
    Thanks!

    • You can't make a .a from a .so. What I did was build the Android RDK as a static lib (not the TTS's .so lib). I never succeeded in getting the .so to work on Linux (the best I achieved was not crashing, so I knew the structure layouts I had defined were working). From there, I used the BBB to start Android and run my app on it (without libhybris; it was not required anymore, since the system was Android-based). I needed to link with the Android RDK to get this to work. Yet the BBB is too slow for the system to feel “transparent” or jitter-free, so you get some latency between the text entered and the speech generated.

      If you want good advice (given the amount of time I've spent on this and the various experiments/tests I've run): my current system is the best I've come up with. It's based on x86 hardware (a QNAP NAS, but you can use whatever “low power” server you probably already have running in your home) and Voxygen's SAPI voices (using the sapi4linux code). Speech generation is instantaneous, the quality is as good as or better than the Android stuff, and it's native (there is a speech-dispatcher plugin, so any application can use the engine to generate speech).

      Cereproc is also a good solution if you want to use it from a single application you're developing (since they give you an ELF lib and some headers for ~70€, IIRC).

      • Thank you for your answer!
        I ran into the same kinds of trouble as you certainly did, trying a LOT to compile in every way I could to make something work, using libhybris and so on…
        I'm well aware of all the Bionic linking problems :(

        Right now my backup solution is an old Android smartphone with a custom TCP server app waiting for text data and using the embedded TTS engine (with whatever voices I can install: Voxygen, Cereproc; they are indeed cheaper on Android than on Windows!!). Easy to code, easy to change voices, and almost lagless.

        Thanks for your advice about the x86 method; I read about it in your other article! However, I'm stuck with a Raspberry environment, and my next try will be on a Pi 3B, running Linux and Android side by side with a chroot, doing the same thing I do with my phone, but totally embedded in the Pi.

        I had hoped to make a “simple” Debian/Linux compatible app after reading your previous comment saying you managed to do it, but even when using a statically compiled Android RDK (by the way, what do you call RDK? I know about the SDK for Android Studio and the NDK for native libraries, but RDK doesn't ring any bell…) you couldn't make it, is that right?

      • I haven't followed libhybris's improvements since I worked on this, but clearly, I could not get something using glibc and its ld.so dynamic linker to work (libhybris did its job for symbol mapping, but I guess it failed to replicate a 1:1 behaviour). It did not crash, but did not work either (it failed to produce any synthesized speech). Since going further would have required reverse-engineering the TTS engine, I stopped there.

        I figured that spending days to succeed with this instead of paying 39€ for the SAPI voice wasn't worth the trouble.

        That being said, Voxygen voices are also available on x86, and as such you can run them in a VM (this is probably the easiest solution if you really want to use the Android code).

        As a general guide, and I'm assuming you're doing a kind of “home automation project with speech synthesis” like I did: a TCP server doing speech generation is too slow for a good feeling of latency (from my personal experience).

        I made a mistake with “RDK”; I meant the NDK.

        I'm also working on an STT solution, but so far I haven't had any real success.

      • Gaetan Lord says:

        Hi

        I've followed this thread from the beginning, and see that you have something that works fine with Voxygen and sapi4linux.

        I tried to get sapi4linux from the website polip.com but it seems down. Do you think this is temporary?

        I'd like to give it a try one more time.

  5. @Sweethack:
    That's exactly my 'project': trying to add some kind of (French!) STT/TTS to an already working custom Pi-based home automation device.
    Right now it's just beta testing, and as you said, because I'm not satisfied with any of the STT engines I've tested…

    The best result I had was using Julius (https://github.com/julius-speech/julius) with a custom French HMM acoustic model I compiled using various VoxForge (http://www.voxforge.org/fr) data and tutorials (and some more French audio corpora found on the net).
    Pros:
    + Light on resources: very quick response time on an old 1st generation Pi (with a moderate dictionary)
    + Tons of settings to play with to tune recognition accuracy
    + C/C++ API for an optimized/no-latency integration
    + Very accurate recognition in a quiet environment
    Cons:
    – Only recognizes words/sentences from a preset dictionary/grammar
    – Even with tons of parameters to play with, I couldn't make it work in a 24/7 noisy environment (TV, people speaking in the house: too many false positives to be stable)

    As for speech transcription, I must say I'm pretty amazed by the results of Google's engine (“OK Google” on Android). Too bad there is a bit of delay when using it.

    @Gaetan Lord:
    The sapi4linux website was up when I last checked a week ago; I hope this is just a temporary issue.

    • Concerning STT, I was about to write an article about it. Right now, I've ruled out Julius for the same reasons you spotted. Simon does not work either (and it's a pain in terms of dependencies). I'm currently trying Kaldi, as it seems to be the best open-source engine in town. There's also wavenet from Google (which claims to be the best you could expect), but you're on your own with it. The issue with the last two is that they require a CUDA-compatible board to perform well (Kaldi is very good with its DNN engine; otherwise it's almost 1:1 with Julius with its HMM engine). To get CUDA to work, you need an NVIDIA GPU (either via a Jetson TK1 or TX1, or better, via an MXM module with a GTX 1080 on x86).

      I've also tried to decompile the offline SR engine on Google's Android system, but the code for the models is obfuscated, and it's very hard to understand what's going on (not even speaking about the legal aspects of using it). As you said, the latency of the offline SR is not acceptable.

      Anyway, I'd be interested if you could post a link to a “tutorial” or “step by step guide” on how you built your acoustic model. I've followed the tutorials on the net, but I'm not 100% sure I did it correctly, and since recognition accuracy is so dependent on the model, the more material we have to compare, the better.

      • As for Android “offline” STT, I found it as bad as Microsoft's “grammar based” .NET STT API that we can access with Windows 7 or Windows 10 IoT Core… it felt like a very bad Julius because of the lack of recognition parameters.

        I've yet to look more into Kaldi… I've just compiled it, but the DNN engine is way too resource-hungry for my “embedded” project right now.

        To compile my HMM acoustic model I mostly followed these guides:
        http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/tutorial
        http://www.voxforge.org/home/dev/acousticmodels/linux/adapt/htkjulius
        VoxForge hasn't compiled a French HMM model yet, because they are waiting for hundreds of hours of data, but we can download the already uploaded data to make our own model.
        My biggest mistake was not writing down what I had to do to successfully compile the 3.4.1 and 3.2.1 versions of the HTK tools, as I ran into a lot of trouble doing so (especially for 3.2.1).

        Here you'll find what I made to compile these HMM models: https://mega.nz/#!5B5TWAhb!K3AI3mnGi_76SQFod0hZtlrRdwRRBDsR8RMNCkK5iXg
        It's split into 3 folders, matching the 3 steps needed:

        1) To compile a “global” acoustic model, I transformed thousands of .wav files (downloaded from VoxForge, the MaryTTS GitHub and other locations I don't remember) into .MFC files and created the prompts list and lexicon. You'll find these files in the “HTK_MasterFiles” folder. Just use the “Compile_AcousticModel.sh” script and you should find the output in the “HTK_Result” folder.
        Note that I used the 3.4.1 version of the HTK tools, so make sure you do not miss any dependencies, as I'm not re-compiling them.
        To add new sentences, just add them to the prompts file and add the corresponding .MFC files (and fill in the lexicon if needed).

        2) To compile a custom acoustic model, you have to record yourself and provide a custom vocabulary/grammar as well as prompts and a lexicon. You'll find a quick example of mine in the “HTK_MasterFiles” folder. It is better to use words already present in the global acoustic model, to be sure you won't get any “triphone does not exist” error.
        There again, just use the “Compile_AcousticModel.sh” script and you should find the output in the “HTK_Result” folder.

        3) Finally, to adapt the global acoustic model with your custom one, and make it waaay more efficient, just run the “Adapt_AcousticModel.sh” script (it seems that the more you recorded yourself in step 2, the better the results are). It will use the necessary files from the previous 2 folders, and you should find the output in the “HTK_Result” folder; use these acoustic model files with Julius. Note that for this step you need to use the 3.2.1 version of the HTK tools.
