Setting up a bot in French, confused by accents

Hi,

First of all, I’m quite new here. So I fear the following problem is really trivial.

I’m trying to set up an offline bot for the French language. I chose v12.1.1, installed Duckling, and installed two language sets (fr + en) with 100 dimensions. I run everything in Kubernetes (I plan to share the related YAML files if people are interested).

I created some intents, mostly with accents (hard to avoid accents in French :slight_smile: ). I also set up a flow to catch all unidentified intents. When I ask a question exactly as written in the intent, with accents, the bot falls back. When I replace the accented letters with unaccented ones, the bot identifies the intent and selects the right flow.

For example:

  • “mon compte est désactivé” fails :-1:
  • “mon compte est desactive” succeeds :+1:

While experimenting, I saw some strange behavior: for a given sentence, the debugger showed the word “créer” split into two tokens, “cr” and “er”.

Note: I created a pattern entity like “[a-z0-9]+” to capture logins.

As I understand it, internationalization is heavily used, so I’m quite sure I misconfigured something. But I do not understand what.

Hi,

I can assure you it is possible to write working bots in French. I did just that last week. Here’s what you can do to make your bot work.

  1. Use the 300-dimension language models. The 25- and 100-dimension ones are really not as good.

  2. It really seems to me that your bot’s language is actually English. I often observe similar behavior myself when tokenizing French input with an English model’s tokenizer.

To change your bot’s language, go to your bot’s configuration (the small gear wheel) in the admin panel, then select the correct language for your bot.

To build bots that support multiple languages, you’ll have to enable Botpress Pro.

Hope this information helps!

François

Thanks.

I limited my language server to French, but I’m still using only 100 dimensions. Currently, the tokenizer seems to be confused.

One point I’m not sure about: which component is responsible for tokenization, the Duckling server or the language server?

Oops… I’ve just realized that the odd tokenization comes from one of my entities, the one that tries to capture logins.

Ignore me. :blush:

Please do :slight_smile: !

–> The language server handles tokenization and computes word embeddings from the resulting tokens.
–> Duckling is used to extract some of the system entities.

Note: I’m pretty sure tokenization is done properly. What you see in the debugger aren’t tokens but, as you realized, your extracted entities.
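
For anyone who hits the same thing, here is a minimal Python sketch of why a pattern like “[a-z0-9]+” produces “cr” and “er” from “créer”. It is not Botpress code: it assumes the pattern entity behaves like an ordinary unanchored regular expression search, which is an assumption on my part, and the helper name `pattern_matches` is made up for this illustration.

```python
import re

# Same character class as the login pattern entity from the question.
# "é" (U+00E9) is outside a-z, so it ends the current match.
LOGIN_PATTERN = re.compile(r"[a-z0-9]+")


def pattern_matches(text: str) -> list[str]:
    """Return every substring matched by the login pattern (hypothetical helper)."""
    return LOGIN_PATTERN.findall(text)


print(pattern_matches("cr\u00e9er"))                # ['cr', 'er']
print(pattern_matches("mon compte est desactive"))  # ['mon', 'compte', 'est', 'desactive']
```

Because the pattern has no anchors or other constraints, it also matches every plain lowercase word in a sentence, so an entity like this probably needs a narrower pattern to pick out only the login.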