Botpress dropping messages

Hi,

We’re having an issue where Botpress does not seem to respond to some of the messages received by our server.
The server receives roughly 1,500 events an hour and fairly often fails to respond fully: it sends some of the messages in a flow, but not always all of them.
When the message rate is between 100 and 300 an hour, there are no issues.


According to the data in this answer, Botpress should be able to handle many thousands of messages per second, but that is not what we are seeing.
We can tell that the webhooks from Facebook Messenger are received by the server, but we can also tell that the Botpress server is not attempting to send responses.

How might we find out what the issue is, and how might we solve it?

This issue also presents itself when we attempt to batch send a proactive message to 100+ users at the same time: the message is only sent to a small portion of the users.

Any help would be appreciated

What version of Botpress are you using?

Is it possible that some users are ending up in a flow error?

This issue is very important indeed; we are really eager to know if anyone has a reply for this case.

@EFF
We’re using 12.07. It is certainly possible that users have ended in errors, but the errors do not always match the users who have had dropped messages. If user A experiences an error, would that impact user B?

It certainly was a performance issue, which seemed to be related to a problem in our Postgres setup causing delays when reading the state etc. from the DB. The fact that the messages were dropped, and dropped for this reason, is not shown in any way in the logs.

@EFF
This issue doesn’t seem as resolved as I thought.
I have multiple times attempted to send a batch of proactive events, and the number that are actually processed is often as low as 50% when sending batches over 100 in size. We have a user base of roughly 100,000 and growing, and there are often times when we need to proactively reach those users, so this is obviously an untenable situation.

Can you please share how you’re actually sending a batch of messages, with as much detail as possible (i.e. pieces of code, details on where the code is running, the list of SDKs/APIs you are using, etc.)? I’ll try to help you as soon as I get more details.

In a custom module I have:

api.ts

import * as sdk from 'botpress/sdk'
const _ = require('lodash');

export default async (bp: typeof sdk) => {
  router.post('/flow', async (req, res) => {
    // Since the bot ID is required to access your module,
    const botId = req.params.botId

    let { flowName, startNodeName, users } = req.body;
    if (!flowName || !startNodeName) return res.status(500).send({ error: 'Bad Request', message: 'The flow provided is invalid', statusCode: 500 })
    let invalidUsers = _.some(users, user => (!user || !user.userId))
    if (invalidUsers) return res.status(500).send({ error: 'Bad Request', message: 'The users provided are invalid', statusCode: 500 })

    const config = await bp.config.getModuleConfigForBot('quiz', botId)

    let events = _.map(users, user => {
      const senderId = user.userId;

      let event = bp.IO.Event({
        botId,
        type: 'manual-flow-trigger',
        channel: 'messenger',
        direction: 'incoming',
        target: senderId,
        payload: { user, flowName, startNodeName }
      })
      
      return event;
    })

    sendBatchEvents(bp, events, 1, 200);

    res.sendStatus(200)
  })
}

const sendBatchEvents = (bp, events, batchSize, batchWait) => {
  console.log('events:', events.length)
  let batches = _.chunk(events, batchSize);
  _.each(batches, (batch, i) => {
    console.log('Sending batch', i)
    setTimeout(() => {
      _.each(batch, (event, j) => {
        console.log('Sending event to', event.target, 'event', j, 'of batch', i);
        bp.events.sendEvent(event);
      })
    }, i * batchWait)
  })
}

And then in before_incoming_middleware

const _ = require('lodash');

const jumpToFlow = async () => {
  try {
    switch (event.type) {
      case 'manual-flow-trigger':
        await manualFlow()
        break;
    }
  } catch (e) {
    console.error('Error jumpToFlow quiz', e)
  }
}

const manualFlow = async () => {
  const sessionId = bp.dialog.createId(event)
  event.state.temp = {
    tag: 'GAME_EVENT',
    ...event.state.temp,
    ...event.payload.user.data
  }

  await bp.dialog.jumpTo(sessionId, event, `${event.payload.flowName}.flow.json`, event.payload.startNodeName)
}

return jumpToFlow();

I have omitted irrelevant code from both these files.

The nodes in the flows can call our own backends in some actions to retrieve certain data regarding the user, which is then stored in temp.

The bot is run entirely through the FB Messenger channel, and we also have the HITL module enabled.

Thanks for sharing your code, it seems fine.

How did you arrive at the 50% figure when sending batches over 100 in size? In other words, how do you track the messages?

We run a custom version of channel-messenger so that we can tag messages so they can be sent outside of the 24 hour window on messenger.

As part of this, I edited sendTextMessage in messenger.ts

async sendTextMessage(senderId: string, message: string, tag: string) {
    const body: any = {
      recipient: {
        id: senderId
      },
      message
    };

    if (tag) {
      body.messaging_type = 'MESSAGE_TAG';
      body.tag = tag;
    }

    const { quizAPIEndpoint } = await this.bp.config.getModuleConfigForBot('quiz', this.botId);

    try {
      axios.post(quizAPIEndpoint + 'tracker/event/create', { userId: senderId, eventName: 'messageSendAttempt', data: { body } });
    } catch (e) {
      console.log('Failed to track messageSendAttempt', e);
    }

    try {
      let response = await this._callEndpoint('/messages', body)
      axios.post(quizAPIEndpoint + 'tracker/event/create', { userId: senderId, eventName: 'messageSent', data: { body } });
      await this.markUnblocked(senderId);
    } catch (e) {      
      let err = (e.response && e.response.data && e.response.data.error) ? e.response && e.response.data && e.response.data.error : e.message
      axios.post(quizAPIEndpoint + 'tracker/event/create', { userId: senderId, eventName: 'messageFailed', data: { err } });
    }

  }

We send a tracking event to our own backend for every message we attempt to send. I can see that the number of messageSendAttempt events in the backend is lower than the size of the batch. We also don’t see any "Failed to track messageSendAttempt" messages logged.

Thank you for the details. I have never experienced dropped messages in Botpress so far (delays yes, but no drops), so I suspect there’s something wrong with the tracking method.

I see that you send an http request to your backend to track every message sent without awaiting your request (I believe that’s on purpose as you probably don’t want to slow down BP). You say that Failed to track messageSendAttempt is never logged but since you don’t await, it’s possible that if an error or a timeout happens on your backend and the execution of this function is already done, it’s not going to log.
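To illustrate the point about the missing await, here is a minimal sketch. It does not use axios or Botpress; failingPost is a hypothetical stand-in for the tracker request that always rejects. The try/catch never sees the rejection because the promise is not awaited inside it, while a .catch handler attached to the promise does.

```typescript
// Hypothetical stand-in for the un-awaited axios.post call to the tracker;
// it always rejects, the way a backend error or timeout would.
const failingPost = (): Promise<void> =>
  Promise.reject(new Error('tracker unavailable'))

async function demo(): Promise<string[]> {
  const log: string[] = []
  let pending: Promise<void> = Promise.resolve()

  try {
    // Not awaited: the rejection surfaces later, outside this try block.
    pending = failingPost()
  } catch (e) {
    log.push('caught by try/catch') // never reached
  }

  // A .catch handler on the promise itself does see the failure.
  await pending.catch(() => {
    log.push('caught by .catch')
  })
  return log
}

demo().then(log => console.log(log))
```

If the tracking calls should stay fire-and-forget, attaching a .catch to each axios.post would keep them from slowing down sending while still logging failures.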

I suggest you set the BP_DEBUG_IO environment variable and inspect the logs; you’ll see how many events are being processed (in and out). That’s pretty much a local integer being incremented, so it adds very little overhead, unlike the HTTP call in your case.

You could also implement the same thing in your module: declare a counter, then simply increment and log it for every message.
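A counter like that could look roughly like the sketch below. EventCounter and the labels are hypothetical names, not part of the Botpress SDK; the idea is only to compare how many events were queued with how many reached the send function, without any network call.

```typescript
// Hypothetical helper, not part of the Botpress SDK: an in-memory tally per label.
class EventCounter {
  private counts = new Map<string, number>()

  // Increment the counter for a label and return the new value.
  increment(label: string): number {
    const next = (this.counts.get(label) || 0) + 1
    this.counts.set(label, next)
    return next
  }

  // Current value for a label (0 if it was never incremented).
  get(label: string): number {
    return this.counts.get(label) || 0
  }
}

const counter = new EventCounter()
// For example, call this right before bp.events.sendEvent(event) ...
console.log('queued', counter.increment('queued'))
// ... and this at the top of sendTextMessage.
console.log('sendAttempt', counter.increment('sendAttempt'))
```

If the two totals diverge after a batch, the events are being lost somewhere between the two points; if they match, the tracking backend is the more likely culprit.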

I’m very curious to see what’s happening, keep me posted.

Fair point about not awaiting messageSendAttempt.
However, the tracking was added as a check because we were already losing messages. We would launch proactive flows for users and might end up not receiving the messages from the flows (when we would expect to), or only receiving some of the messages from the flows (and not always the first ones).

I’ll test with BP_DEBUG_IO in the next week.

I know it’s only for tracking purposes; what I’m saying is that I would be very surprised if Botpress were dropping as much as 50% of your events.

As for your flow, I’m curious as well. Would you be willing to share your bot archive in private? I could try to help in my spare time.

Hi Jesal !

I’m sharing this here because I think it could benefit the community. When you send an event with the Botpress SDK using bp.events.sendEvent(event), please try to await the call. That should not make a difference, but there was an issue in the Botpress SDK type definition: prior to Botpress 12.1.5 (not released yet), sendEvent was not marked as async, although it is.

Let me know if this fixes your issue + please try to run with BP_DEBUG_IO

cheers :v:

We’ll give this a try, however I’m unsure how this will work/help with sendBatchEvents as that uses a timeout to batch out the events
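One possible way to combine awaiting with batching is sketched below, under a few assumptions: the send function is passed in (in the module it would presumably be something like event => bp.events.sendEvent(event)), and sendBatchEventsAwaited and delay are hypothetical names. Instead of scheduling every batch up front with setTimeout, the loop awaits each batch and then pauses before the next one, so failures can be counted rather than lost.

```typescript
type SendFn<E> = (event: E) => Promise<void>

// Promise-based pause, used in place of scheduling batches with setTimeout.
const delay = (ms: number): Promise<void> =>
  new Promise<void>(resolve => setTimeout(resolve, ms))

async function sendBatchEventsAwaited<E>(
  send: SendFn<E>,
  events: E[],
  batchSize: number,
  batchWait: number
): Promise<{ sent: number; failed: number }> {
  const size = Math.max(1, batchSize) // guard against a zero or negative batch size
  let sent = 0
  let failed = 0

  for (let i = 0; i < events.length; i += size) {
    const batch = events.slice(i, i + size)
    // Await every send in the batch; count failures instead of letting them vanish.
    await Promise.all(
      batch.map(async event => {
        try {
          await send(event)
          sent++
        } catch (e) {
          failed++
        }
      })
    )
    // Pause between batches, but not after the last one.
    if (i + size < events.length) await delay(batchWait)
  }
  return { sent, failed }
}
```

The route handler could still respond immediately and let this run in the background, logging the returned totals once it finishes.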