HumanFirst Studio was built in order to manage and continuously improve the training data of large conversational assistants, identifying valuable training data from existing sources that are often available but hard to tap into without proper tooling.
In this article we’ll see how to use available datasets, or your own, to create a Botpress bot from scratch without having to come up with every single training phrase. We’ll also see how to use our command-line tool, hf, to integrate Botpress with Studio in a Git-oriented workflow.
Note: What you’ll learn in this article can also be applied to the continuous improvement of deployed Botpress projects.
You will need a HumanFirst Studio account to go through this tutorial. You can create a free account here to get started.
Install the HumanFirst CLI tool
Download one of our precompiled binaries at: https://github.com/zia-ai/humanfirst/releases/tag/cli-0.0.4
Choose the binary for your operating system.
For Linux, do:
You can then log in to your Studio account from the command line:
You should then see something like this, indicating you have logged in properly.
Next, download the latest Botpress archive from the Botpress downloads area.
Start the server using the provided binary.
You can now navigate to http://localhost:3000/ and create the admin user to begin using Botpress.
Starting a new Botpress project
Once you have logged in, click the Create Bot button on the top right of the screen, then select New Bot. Let's pick an example that's already within their templates. Call it smalltalk and select Small Talk in the bot templates dropdown, a bit further down. Click Create Bot to confirm.
This bot is fairly simple, and most of the logic lies within the Q&A section.
Importing your new Botpress project into Studio
Now that you have a bot, create a workspace into which you’ll import your data. (This is essentially a labeled container that holds your intents and lets you manage and improve them.)
Note: Since the commands are run from the Botpress root folder, we have to specify the bot id that you chose in the Create Bot dialog. If you didn't name your bot smalltalk, you'll have to edit the command accordingly.
Note: We use --clear to erase the workspace's contents so that it reflects exactly what you have in your repository. It's not necessary the first time, but it's a good way to bring in changes that someone else committed to the repository.
http://studio.humanfirst.ai/ will now show your newly created workspace along with the intents imported from the Botpress project.
Adding more data
We’ll add some phrases to the existing intents. We can use publicly available datasets to search for training phrases that fit. Since the intents added when the Botpress bot was created are pretty generic, there is a good chance we'll find relevant matches.
In Studio, click on the Data sources menu item on the left, then click the Use one of our data sets button to add existing conversations to your project. There are many choices available, but for this tutorial pick the STAR dataset, which contains goal-oriented conversations for different tasks. If you have existing data, either from human-to-human conversations or a list of unclassified utterances, this is where you would import it into your workspace.
Augmenting existing intents
Now that we have some unlabeled data to work with, we can expand the currently defined intents.
In the Labeled data section, you'll find the list of imported intents. Activating one will bring up the list of its associated training examples. Click the Get Suggestions button and some suggestions will be provided from the dataset you added in the previous step. You can then accept training examples that make sense. The None of these look good button rejects the remaining elements.
Note: Recommendations work by looking at all of the workspace’s training data and returning matching examples from your data sources. When you reject suggestions, we maintain a list of phrases that are internally tagged as “not part of that intent”. This list is used to improve suggestions. You can think of it as an ephemeral binary classifier that helps narrow down your search until you have enough relevant examples.
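To make the idea of rejections narrowing the search more concrete, here is a minimal Python sketch. It is not how Studio is implemented: the function names (tokens, jaccard, best_match, suggest) are hypothetical, and simple word overlap stands in for whatever similarity model Studio actually uses. A candidate is only suggested when it looks more like the accepted phrases than like the rejected ones.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a real sentence embedding."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity in [0, 1]."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(phrase: str, examples: list[str]) -> float:
    """Highest similarity between `phrase` and any example (0.0 if none)."""
    p = tokens(phrase)
    return max((jaccard(p, tokens(e)) for e in examples), default=0.0)


def suggest(accepted: list[str], rejected: list[str],
            unlabeled: list[str], limit: int = 5) -> list[str]:
    """Rank unlabeled phrases by closeness to accepted minus closeness to rejected."""
    scored = []
    for phrase in unlabeled:
        score = best_match(phrase, accepted) - best_match(phrase, rejected)
        if score > 0:  # keep only phrases that lean towards the intent
            scored.append((score, phrase))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [phrase for _, phrase in scored[:limit]]


accepted = ["I want to book a hotel room"]
unlabeled = [
    "Can I book a hotel room for tonight",
    "I want to book a flight to Paris",
    "What is the weather",
]
print(suggest(accepted, [], unlabeled))
print(suggest(accepted, ["I want to book a flight"], unlabeled))
```

In this toy run, the flight phrase is suggested while nothing has been rejected, and it drops out once a similar flight phrase is on the rejected list.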
Discovering new intents
Next, let’s take a look at the Unlabeled data section. This is where all utterances that haven't been assigned to an intent are located.
You’ll see a list of unlabeled utterances sourced from your data sources. Since you’ve already added some demo data, there should be plenty to work with. The search bar at the top is a full-text search that lets you find things the old-fashioned way. Try it first by searching for hotel; several intents can be created from the results.
One of the initial matches is "Hi, I am looking for the rating of a hotel." Go ahead and select it. You'll notice that a new option is available right under the selection: Show similar suggestions. This button uses semantic search to look for similar phrases in the corpus. It's a good idea to mix these two techniques: full-text search gives you keyword-based results, while semantic search works from the meaning of the utterance and can return relevant matches that don't share its keywords.
Select a few examples where the user clearly asks for a hotel with a specific rating. Notice that the button is clickable again. Clicking it will look for results similar to all selected items.
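The difference between the two search styles can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the corpus, the hand-made three-dimensional vectors that stand in for real sentence embeddings, and the function names full_text_search and show_similar. It does not reflect Studio's internals. The point is only that a keyword match requires the word hotel to appear, while a similarity search over vectors can surface a phrase that never mentions it.

```python
import math

# Toy corpus: each utterance maps to a hand-made vector standing in for a
# real sentence embedding (dimensions loosely: lodging, rating, transport).
CORPUS = {
    "Hi, I am looking for the rating of a hotel.": (0.9, 0.9, 0.1),
    "Find me a place to stay with four stars": (0.8, 0.8, 0.2),
    "Is the hotel close to the airport?": (0.9, 0.1, 0.6),
    "I need a taxi to the station": (0.0, 0.0, 0.9),
}


def full_text_search(query: str) -> list[str]:
    """Keyword search: utterances that contain the query string."""
    q = query.lower()
    return [u for u in CORPUS if q in u.lower()]


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def show_similar(selected: list[str], limit: int = 2) -> list[str]:
    """Rank unselected utterances by similarity to the mean of the selection."""
    vectors = [CORPUS[u] for u in selected]
    centroid = tuple(sum(dim) / len(vectors) for dim in zip(*vectors))
    others = [u for u in CORPUS if u not in selected]
    others.sort(key=lambda u: cosine(CORPUS[u], centroid), reverse=True)
    return others[:limit]


print(full_text_search("hotel"))
print(show_similar(["Hi, I am looking for the rating of a hotel."], limit=1))
```

Averaging the selected vectors is one simple way to search for results similar to several selected items at once; other strategies are possible.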
Tip: You can shift+click to select a range without clicking on each of them separately.
Once you have enough elements, click the Label selected data button on the left, and click + Create here to create a new intent. Let's name it hotel_request_rating and click the Create and edit button.
Here are a few intents you may want to create:
- Book an appointment
- Reserve a hotel
- Reserve a hotel with a specific rating (see if you can make this one a child intent of the reserve a hotel one)
While working on your project, you may decide that some intents should be merged or broken down into more specific intents. In the Labeled data section, where you can view the list of training phrases for an intent, you'll notice a checkbox next to each phrase. Clicking it will automatically sort the rest of the list by similarity to the selected phrases. You can then select the similar phrases and move them using the left column, as we did with unlabeled utterances in the previous step.
Back to Botpress
We can export our changes back to the Botpress project using the command line.
You can now go back to Botpress and see your updated data.
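If you prefer to check the result from a script before opening Botpress, a small Python sketch like the following can count the training phrases per intent. It rests on an assumption about the on-disk layout: one JSON file per intent in an intents folder inside the bot's directory, each with an "utterances" object keyed by language code. The function name count_utterances and the example path are hypothetical, so adjust them to match your installation.

```python
import json
from pathlib import Path


def count_utterances(bot_dir: str, language: str = "en") -> dict[str, int]:
    """Map each intent file name to its number of utterances for `language`.

    Assumes one JSON file per intent under <bot_dir>/intents, each holding an
    "utterances" object keyed by language code. Adjust if your layout differs.
    """
    counts = {}
    for path in sorted(Path(bot_dir, "intents").glob("*.json")):
        intent = json.loads(path.read_text(encoding="utf-8"))
        utterances = intent.get("utterances", {}).get(language, [])
        counts[path.stem] = len(utterances)
    return counts


if __name__ == "__main__":
    # Hypothetical location of the smalltalk bot, relative to the Botpress root.
    for name, total in count_utterances("data/bots/smalltalk").items():
        print(f"{name}: {total}")
```

Running it before and after the export gives a quick view of which intents gained training phrases.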