IBM Watson corpus

If you are working with IBM WATSON, the supercomputer that defeated human contestants on the US TV show Jeopardy! in 2011, these tips may help you create a reliable corpus in the now cloud-based Watson Analytics environment. Whether you are using WATSON to ask natural language questions, visualize data patterns or find insights, your corpus is the cornerstone of your application.

1. Define the kind of interactions your users require.

Are your users looking for facts, procedures or opinions? The better you understand your users' information needs, the better WATSON will respond to them. This knowledge may also help you choose the most suitable documents for your corpus, thus saving training time and improving precision.

2. Check the document structure and language.

Once you have identified the most suitable sources for your type of interactions, check the document structure. A document can have amazing data, but if it is poorly written or its internal structure is disorganized (no headings or sections, convoluted paragraphs, slang), it will be hard to read not only for WATSON but for the user as well.
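A quick structural check can be scripted before any document goes into the corpus. The following is only a minimal sketch in plain Python, not a WATSON feature: the function name `structure_report`, the heading heuristic and the 150-word paragraph limit are all assumptions chosen for illustration.

```python
import re


def structure_report(text, max_paragraph_words=150):
    """Return simple structural signals for a plain-text document."""
    # Paragraphs are blocks of text separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Heuristic (an assumption): a heading is a short single-line block
    # that does not end with sentence punctuation.
    headings = [
        p for p in paragraphs
        if "\n" not in p and len(p.split()) <= 10 and p[-1] not in ".?!:;,"
    ]
    # Very long paragraphs are flagged as candidates for splitting.
    long_paragraphs = [
        p for p in paragraphs if len(p.split()) > max_paragraph_words
    ]
    return {
        "paragraphs": len(paragraphs),
        "headings": len(headings),
        "long_paragraphs": len(long_paragraphs),
    }
```

A document that reports zero headings or several long paragraphs is probably worth reformatting before ingestion.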

3. Clean and reformat.

Yes, it is a lot of work, but it is totally worth it. Think about this step the same way you would think about having a neatly organized pantry or closet: you will find everything you need when you need it. It will save you time and effort. Eliminate noise, and add headers, tags, punctuation or metadata to improve the document structure. Remember: titles, paragraphs and punctuation have real use! If you can’t do this for the whole corpus, do it at least for a core group of critical data.
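Part of this cleanup can be automated. The sketch below uses only the Python standard library and assumes the noise consists of leftover HTML tags and irregular whitespace; the helper names `clean_text` and `to_record`, and the record layout with `title`, `tags` and `body`, are hypothetical and not a format WATSON requires.

```python
import re
import unicodedata


def clean_text(raw):
    """Remove common noise from a raw document body."""
    # Normalize compatibility characters such as non-breaking spaces.
    text = unicodedata.normalize("NFKC", raw)
    text = re.sub(r"<[^>]+>", " ", text)    # drop leftover HTML tags
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" *\n *", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()


def to_record(title, raw, tags):
    """Attach a title and de-duplicated tags to the cleaned body."""
    return {
        "title": title.strip(),
        "tags": sorted(set(tags)),
        "body": clean_text(raw),
    }
```

For example, `to_record("Cherry pie", raw_page, ["recipe", "dessert"])` would return a dictionary with a tidy body and sorted tags, ready for whatever ingestion format you use.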

4. Train WATSON on at least three model questions.

When training WATSON, use three basic question formats (short, medium, long). This can greatly improve your precision and reduce training time too. E.g.
How to make a cherry pie
How do I make a cherry pie?
How can I make a delicious fast cherry pie?
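To keep track of which answers already have all three phrasings, a small check over the training pairs can help. This is a sketch under stated assumptions: the training data is taken to be a list of `(question, answer_id)` tuples, and `missing_variants` and the answer ids are hypothetical names, not part of any WATSON tooling.

```python
from collections import defaultdict


def missing_variants(training_pairs, minimum=3):
    """Return answer ids with fewer than `minimum` distinct question phrasings."""
    by_answer = defaultdict(set)
    for question, answer_id in training_pairs:
        # Ignore case and a trailing question mark when comparing phrasings.
        by_answer[answer_id].add(question.strip().lower().rstrip("?"))
    return sorted(a for a, qs in by_answer.items() if len(qs) < minimum)


pairs = [
    ("How to make a cherry pie", "cherry-pie"),
    ("How do I make a cherry pie?", "cherry-pie"),
    ("How can I make a delicious fast cherry pie?", "cherry-pie"),
    ("How to make an apple pie", "apple-pie"),
]
print(missing_variants(pairs))  # ['apple-pie']
```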

5. Perform a baseline test.

Before jumping into frantic training, perform a baseline test: test retrieval precision without any training. Check the questions and answers and learn from your mistakes. Once you identify a pattern, you will be able to fix it promptly and, more importantly, avoid it in further ingestions. If your corpus is nicely curated, chances are you will get high recall and precision with little initial training.
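The baseline numbers can be computed with a few lines of Python once you have the untrained results. The sketch below assumes you have exported, for each test question, the list of document ids returned and a hand-made list of the ids that should have been returned; the names `precision_recall`, `baseline`, `results` and `gold` are illustrative assumptions.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a single question."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def baseline(results, gold):
    """Average precision and recall over every question in `gold`."""
    if not gold:
        return (0.0, 0.0)
    # A question with no returned documents counts as an empty result list.
    scores = [precision_recall(results.get(q, []), gold[q]) for q in gold]
    n = len(scores)
    return (sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n)
```

Recording these two averages before any training gives you a reference point, so later training rounds can be judged against it.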