Unstructured Text Data Associations of Online Content in Public Domain
by Promod George, on May 20, 2019 4:56:50 PM
Estimated reading time: 3 mins
The Internet is rife with reviews, articles, blogs, news, and other Big Data related to a product or person. This content is useful from a market research or intelligence analytics perspective. However, the question remains whether the content can be grouped logically for further analysis. A solution powered by artificial intelligence (AI) algorithms can address this business problem. It connects users to the right data set based on their interests.
How does it work?
The solution is built using AI algorithms and the word2vec implementation in Gensim, a Python library.
- It commences with periodic streaming of raw articles, entities, or content, using cron jobs or Spark Streaming, into high-end text-indexing and NoSQL data stores such as Solr, MongoDB, or Elasticsearch.
- Next, the solution performs basic cleansing and staging of the data with semi-manual intervention, so that information is not lost in the process. This step produces first-stage data, which includes ‘tagged words’ or ‘important extracted words’ derived with the help of subject matter experts or tools such as spaCy or the RAKE algorithm.
- The solution then creates an ensemble of models by randomizing the word2vec parameters, so that an input document with ‘n’ keywords yields consistent values. Next, the normalized sum of the keyword vectors is stored in a tuple.
- Subsequently, the tuples are processed using a ‘similarity method’ to find a ‘similarity score’. The solution uses a combination of cosine, Euclidean, and Manhattan distance, giving the user a choice of the measure that best suits a particular requirement.
- Depending on the best value of this ‘score’, the target article, email, entity, or content is plotted together with the other closely associated data sets. Throughout, a distance matrix is derived and maintained as a reference for relating other incoming content, which speeds up processing.
- Unsupervised clustering gives a general landscape of the complete content data set.
- The distance matrix is used to find other product- or person-related content that has context-specific and content-specific similarity.
- A user interface facilitates labelling 20-30% of the existing data. When a new review or piece of content arrives, its neighbours within a threshold distance are found in the vector space. Subsequently, the solution determines the new content’s label by looking at the labels of those neighbours.
- By using associative patterns and comparing the distance matrix of a few neighbours across repeated transactions, a flag is set depending on the use case at hand.
- ‘Price value’ or ‘importance determination’ is done depending on the distribution concentration or sparsity in a determined range of the target data point.
- The distance matrix helps to perform a type of recommendation that takes into account a more detailed ranking.
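The vector and distance steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the small `WORD_VECTORS` table is a hypothetical stand-in for vectors from a trained word2vec model, and the function names are invented for this example rather than taken from the solution itself.

```python
import math

# Hypothetical 3-dimensional word vectors standing in for a trained
# word2vec model; the values are made up for illustration.
WORD_VECTORS = {
    "battery": [0.9, 0.1, 0.0],
    "charger": [0.8, 0.2, 0.1],
    "screen": [0.1, 0.9, 0.2],
    "display": [0.2, 0.8, 0.1],
}


def document_vector(keywords, vectors=WORD_VECTORS):
    """Sum the vectors of the known keywords and scale to unit length."""
    known = [vectors[word] for word in keywords if word in vectors]
    if not known:
        raise ValueError("none of the keywords has a vector")
    summed = [sum(components) for components in zip(*known)]
    norm = math.sqrt(sum(x * x for x in summed))
    if norm == 0.0:
        raise ValueError("keyword vectors sum to zero")
    return tuple(x / norm for x in summed)


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.dist(a, b)  # requires Python 3.8+


def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


# The caller picks whichever measure suits the requirement.
DISTANCES = {
    "cosine": cosine_distance,
    "euclidean": euclidean_distance,
    "manhattan": manhattan_distance,
}


def distance_matrix(doc_vectors, metric="cosine"):
    """Pairwise distances between document vectors for the chosen metric."""
    distance = DISTANCES[metric]
    return [[distance(a, b) for b in doc_vectors] for a in doc_vectors]


docs = [
    document_vector(["battery", "charger"]),
    document_vector(["screen", "display"]),
    document_vector(["battery"]),
]
matrix = distance_matrix(docs, metric="cosine")
```

In this toy data, the first and third documents share the keyword "battery", so their distance in `matrix` is smaller than the distance between the first and second documents, whichever of the three measures is chosen.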
After ascertaining labels, the framework is used for many use cases as mentioned above.
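As a rough illustration of how a label might be ascertained for new content from its labelled neighbours, the sketch below takes a majority vote among neighbours within a threshold distance. The majority vote, the use of Euclidean distance, and the name `label_from_neighbours` are assumptions made for this example; the steps above only state that the neighbours' labels are consulted.

```python
import math
from collections import Counter


def label_from_neighbours(new_vector, labelled, threshold):
    """Return the most common label among neighbours within `threshold`.

    `labelled` is a list of (vector, label) pairs, for example the
    20-30% of content labelled by hand. Returns None when no labelled
    item lies within the threshold distance.
    """
    votes = Counter(
        label
        for vector, label in labelled
        if math.dist(new_vector, vector) <= threshold  # Python 3.8+
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical 2-dimensional vectors and labels, for illustration only.
labelled = [
    ((0.0, 0.0), "complaint"),
    ((0.1, 0.0), "complaint"),
    ((1.0, 1.0), "praise"),
]
nearby = label_from_neighbours((0.05, 0.0), labelled, threshold=0.5)
isolated = label_from_neighbours((5.0, 5.0), labelled, threshold=0.5)
```

Returning `None` for isolated content leaves room for the human review that the solution relies on during its initial period.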
For a corpus of, say, one lakh (100,000) emails, the typical turnaround time is about 2-3 seconds for a response and about 3-4 seconds for each model creation. The time increases correspondingly for every 30% increase in the content corpus. The solution reduces turnaround time by up to 50% compared to contemporary methods. A more intuitive language model is obtained by leveraging the word2vec model and the language parser of Solr. However, it has to be used under human supervision for an initial period, until stable F1 scores are achieved.
Today, AI is increasingly used in the business world, especially in market research and intelligence analytics. However, in scenarios comprising image and text analytics, human intervention is irreplaceable.
The approach and design brought forth by this solution help take steps towards an AI approach to solving many problems related to ‘unstructured text data associations’ in the public domain.
How far it replaces human intervention is a question that still lingers in many minds. With a strong balance of AI technology and human participation, the AI-powered solution helps solve many mysteries in the very dynamic World Wide Web.