Learning Text Analytics using Google Trends!

Tanaya Badve
11 min read · Feb 18, 2021

Text Analytics, also known as text mining, is a branch of machine learning that deals with textual data, processing it to extract meaningful insights. Text data comes with its own set of algorithms, i.e. Natural Language Processing techniques applied to the data for various purposes.

Applications of text analytics can be found everywhere in everyday life; the most prominent example is the data generated by social media platforms like Facebook, Twitter, etc.

To explain text analytics, the route it follows, and how to handle messy data, I have taken Google Trends as a case study, since it gives a good picture of what the world is searching for.

What exactly is Google Trends?

  • Google Trends is an analysis tool that shows the frequency of searched words or content in the form of graphical waves, interpreting how a term has been searched on a weekly, monthly or yearly basis. It basically plots normalized scores between 0 and 100.
  • It analyzes top search queries googled across regions and languages.
  • It has multiple features that let us refine the searched term: by region (countries), by time period (hours, weeks, months or even years), by the subject category the term belongs to, and by the type of search it came from, such as an image search, a web search or a song searched on YouTube.
  • It has another feature that lets users compare and analyze two terms.
  • It plots the chart on the basis of ‘Interest over time’, where the numbers on the 0–100 scale represent search interest relative to the highest point on the chart for the given region and time. A value of 50 means the term is half as popular as at its peak, and 0 means there is not enough data for the term.

This covers the overall functionality of Google Trends. What follows is the text analytics pipeline behind it, identifying each stage that leads to better search query results:

Data Cleaning Pipelining:

To apply Natural Language Processing techniques and obtain optimized search charts, it is essential that the data be thoroughly cleaned, since analyzing textual data is a bit trickier than analyzing quantitative data.

The overall data cleaning pipeline that Trends would be using looks like this (a minimal code sketch follows the list):

  • Raw Data (Strings)
  • Sentence segmentation
  • Tokenization
  • Parts-of-speech tagging
  • Entity detection
  • Semantic processing to extract relations.
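
To make the pipeline concrete, here is a minimal sketch using NLTK; the sample text is made up and the choice of NLTK is mine, since the pipeline Trends actually runs is not public.

```python
# A minimal sketch of the cleaning pipeline described above, using NLTK.
import nltk

# One-time downloads of the models NLTK needs for each stage.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("maxent_ne_chunker")
nltk.download("words")

raw = "Google Trends shows how often a term is searched. Interest peaked in March."

sentences = nltk.sent_tokenize(raw)                    # sentence segmentation
tokens = [nltk.word_tokenize(s) for s in sentences]    # tokenization
tagged = [nltk.pos_tag(t) for t in tokens]             # parts-of-speech tagging
entities = [nltk.ne_chunk(t) for t in tagged]          # entity detection

print(tagged[0][:4])   # e.g. [('Google', 'NNP'), ('Trends', 'NNP'), ...]
```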

Data Preprocessing:

  • Taking-out data
  • Putting-in data

Let’s see what these exactly mean…

  1. Taking out data means converting uppercase to lowercase, removing punctuation and stopwords, and stemming words down to their basic form.
  2. Putting in data means adding POS tags to get a clearer meaning of the sentence, along with syntactic parses, entity tags and lexical chains (a sketch of the taking-out step follows).
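
As a rough sketch of the “taking-out” step, the snippet below lowercases a made-up query, strips punctuation and stopwords, and stems what remains with NLTK; it is only an illustration of the list above, not Trends’ actual preprocessing.

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt")
nltk.download("stopwords")

query = "Why are the COVID-19 cases rising in Europe?"

lowered = query.lower()                                               # uppercase -> lowercase
no_punct = lowered.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
tokens = nltk.word_tokenize(no_punct)
kept = [t for t in tokens if t not in stopwords.words("english")]     # remove stopwords
stems = [PorterStemmer().stem(t) for t in kept]                       # stem to basic form
print(stems)   # e.g. ['covid19', 'case', 'rise', 'europ']
```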

Tokenizing:

  • Tokenization splits the text so we can select the tokens we intend to work on. One problem with tokenization is language: a tokenizer might not handle words from other languages like German, French, Chinese, etc.
  • It cannot always handle hyphenation, acronyms, numbers, etc.

Normalization:

  • Normalization also handles ‘accent differences’, like Bon’amour.
  • It handles lowercase and uppercase letters.
  • It handles acronyms where tokenizing fails, like searching ‘UK’ versus ‘U.K.’
  1. Checking spellings: slight spelling mistakes should still return the output the user intended, as if the spelling had been correct.
  2. For example, when I searched a misspelled version of ‘spices’ it did not throw any results on Google Trends, but when I wrote ‘spices’ correctly the results popped up. (A minimal normalization sketch follows.)
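
A minimal normalization sketch, assuming plain Python’s unicodedata module for accent stripping and a simple dot-removal rule for acronyms; the terms are illustrative.

```python
import unicodedata

def normalize(term: str) -> str:
    term = term.casefold()                              # handle upper/lower case
    decomposed = unicodedata.normalize("NFKD", term)    # split accents from letters
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return no_accents.replace(".", "")                  # 'u.k.' and 'uk' collapse together

print(normalize("Bon'Amour"), normalize("café"), normalize("U.K."), normalize("UK"))
# bon'amour cafe uk uk
```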

Stemming & lemmatization: bringing words to their root form gives better data, letting the analysis capture a word together with all its attached variants.

But the problem with stemming is that it does not handle syntactic differences between words, like ‘suicide’ used as a noun versus as a verb, which are meant differently.

PorterStemmer: handles suffixes and plurals.

A lemmatizer like WordNetLemmatizer handles the cons of stemming, such as plurals, and finds better roots than a stemmer, but it still cannot handle syntactic parsing on its own.

POS-tagging: an in-depth grammatical analysis of each token that produces meaningful syntactic results, and which can be combined with the lemmatizer.
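
The snippet below contrasts NLTK’s PorterStemmer and WordNetLemmatizer and shows how a POS tag changes the lemma; the example words are mine and only illustrate the trade-offs described above.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                     # 'studi'  (crude root)
print(lemmatizer.lemmatize("studies", pos="n"))    # 'study'  (dictionary root)
print(lemmatizer.lemmatize("searching", pos="v"))  # 'search' (needs the verb tag)
print(lemmatizer.lemmatize("searching"))           # 'searching' (defaults to noun)
```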

Parsing Syntax & extracting entities

With syntactic structure, one gets a clearer understanding of the data taken forward for analysis, compared to treating it as disconnected bits and pieces.

The NLTK library helps us define a context-free grammar in order to find the conceptual entities behind the words.

Entity extraction forms the semantic side of language processing, i.e. you track the original entity across its various forms, like ‘U.S.A > USA’.

Trends’ backend data is cleaned while simultaneously handling all the forms a user might put in as input.
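
As a sketch of grammar-driven entity extraction, the snippet below defines a tiny noun-phrase grammar with NLTK’s RegexpParser and pulls chunks out of a made-up sentence; it stands in for the much richer grammars a real system would use.

```python
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

grammar = "NP: {<DT>?<JJ>*<NNP|NN>+}"          # a tiny noun-phrase grammar
chunker = nltk.RegexpParser(grammar)

tokens = nltk.word_tokenize("Google Trends tracks searches from the United States")
tree = chunker.parse(nltk.pos_tag(tokens))
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
# e.g. "Google Trends" and "the United States"
```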

Frequency-based Encoding:

How does frequency help Trends produce its results?

  • It creates results from data based on the opinions of a large group of people, gathered via integrated surveys.
  • As its search is based on region-wise analysis, it reflects cultural trends in different parts of the world.
  • Survey-based data can help us extract agreement and disagreement among the population about a particular trend, like opinions related to COVID-19 being the hype.
  • Trends uses frequency-based charts to represent the corpus.
  • The frequency-based charts that Trends shows after a term is searched are based on how frequently that term is searched within the given time span, in a particular region, in a particular language, etc.

Frequency Representation:

Overview of frequency representation:

  • Wordclouds

Simple frequency representation of chunks of words belonging to the same corpus.

  • Using Frequencies from Corpora:
  1. Search-term corpora: Google Trends collects its data from historical search data from Google’s search engine and other browsers or websites.
  2. Chart-based corpora, which are used in Trends.
  3. N-grams of the corpus are important for finding term frequencies, providing more explicit information, for example:

Unigram : UCD

Bigrams : University College

Trigrams : University College Dublin

(N-grams combined with POS-tags are more effective.)

  1. The relative frequency of an n-gram is its count in a given time period divided by the total number of words in the corpus.

Suppose, for example, that COVID as a unigram appeared 8673675635 times in 2020 across different posts, websites, surveys, etc.

Different ways of calculating term frequencies in a corpus include TF-IDF vectorization and computing document-term frequencies.
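
A small sketch of these frequency ideas, assuming scikit-learn’s CountVectorizer and TfidfVectorizer and three made-up “documents” that stand in for search logs from different weeks:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "covid cases rising covid vaccine",
    "university college dublin open day",
    "covid vaccine side effects",
]

# Relative frequency of the unigram 'covid': its count divided by total words.
all_words = " ".join(docs).split()
print(all_words.count("covid") / len(all_words))

counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)  # unigrams + bigrams
print(counts.toarray().shape)                                     # documents x n-grams

tfidf = TfidfVectorizer().fit_transform(docs)                     # frequency weighted by rarity
print(tfidf.shape)
```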

Frequencies in brief:

  • Weighted TF-IDF (Term Frequency-Inverse Document Frequency): tries to weigh a term within a chunk of text about a particular topic.

For example, when I searched the term ‘suicide’ it returned queries related to Ronnie McNutt. Trends must be using data in which Ronnie McNutt is strongly tied to the topic of suicide, or many people searched for Ronnie McNutt’s suicide on Google and other browsers, and it is trending because it happened earlier this month.

  • The intelligence of this weighting technique is that it scores words by their rarity in other documents, i.e. frequent in one document but rare in the others.
  • Repetitiveness is taken into account but does not equal importance: a term might be repetitive yet unimportant, and vice versa.

So in order to check a term’s relatedness to the targeted goal, we can calculate the log-likelihood of the terms. Something similar plays out in Trends: although many people have died by suicide, the log-likelihood of ‘Ronnie McNutt’ co-occurring with the query is higher, so it is given as the result even if it is not the most important association.

Pointwise Mutual Information (PMI)

PMI measures the association between terms, depending on how often one occurs alongside another, forming collocations.

A combined search for the term ‘suicide’ depends on the terms closely associated with it; apart from Ronnie McNutt, it also shows results about ‘COVID suicide rates’.
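
A sketch of collocation scoring with NLTK, covering both the PMI measure described here and the log-likelihood ratio mentioned above; the toy corpus is invented.

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("punkt")

text = ("covid suicide rates rose sharply " * 3 +
        "ronnie mcnutt suicide video spread online " * 3)
tokens = nltk.word_tokenize(text)

finder = BigramCollocationFinder.from_words(tokens)
measures = BigramAssocMeasures()
print(finder.nbest(measures.pmi, 3))               # pairs with the highest PMI
print(finder.nbest(measures.likelihood_ratio, 3))  # pairs with the highest log-likelihood
```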

Entropy

Normalized entropy, i.e. entropy divided by the document length, is useful here.

This is how ‘filtering’ works: classifying the searched term under the topic filters available on Trends uses this principle.
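
A minimal sketch of normalized entropy, following the description above (entropy divided by the document length); the tokens are made up.

```python
import math
from collections import Counter

def normalized_entropy(tokens):
    counts = Counter(tokens)
    total = len(tokens)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / total   # normalize by document length, as described above

print(normalized_entropy("covid vaccine covid cases covid".split()))
```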

Similarity

Featural & structural similarity are the two approaches Trends is based on:

Featural: Jaccard coefficient, Cosine Similarity

Structural: Analogy

Jaccard Coefficient measures similarity whereas Jaccard Distance measures dissimilarity between corpora.
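
A tiny illustration of the Jaccard coefficient and distance over two hypothetical sets of query terms:

```python
def jaccard_coefficient(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

q1 = {"obama", "president", "usa"}
q2 = {"obama", "michelle", "usa"}

similarity = jaccard_coefficient(q1, q2)
distance = 1 - similarity                 # Jaccard distance measures dissimilarity
print(similarity, distance)               # 0.5 0.5
```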

Locality Sensitive Hashing: built on Jaccard similarity scores from a sample set, it uses a hash function to group similar terms together while streaming the data.

Using TF-IDF to vectorize terms into a matrix and calculating similarity with Euclidean distance or cosine similarity gives a Vector Space Model (VSM), which is predominantly used to rank terms according to their likely relevance by computing the similarity between terms and queries, as in Trends.
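
A sketch of that vector space idea, assuming scikit-learn: TF-IDF vectors plus cosine similarity rank a handful of made-up stored queries against a new search.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stored_queries = [
    "barack obama president",
    "michelle obama book",
    "obama city japan",
    "graphite pencil drawing",
]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(stored_queries)

search = vectorizer.transform(["obama president"])
scores = cosine_similarity(search, matrix).ravel()
for query, score in sorted(zip(stored_queries, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {query}")      # most relevant stored query first
```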

Trends also relies on transformational similarity: searching ‘Aobama’ and ‘Obama’ gives almost the same queries; although ‘Aobama’ is misspelled, the term gets transformed and the results are still produced.

Clustering

K-means clustering : Text Classification

  • Trends collects data about a particular topic or search based on the time period, region, etc., tracking the rising or falling number of times the term is searched over a period. For a particular span of time a term may be highlighted or at its peak, and then see a downfall; this is what Trends demonstrates.
  • It computes how similar a term’s group is to other related groups.
  • It tries to extract features, discover trends in the sentences of the data, and then label each trend.
  • It aligns the topic of the term to that trend so that the term belongs to that group.
  • It calculates a weight for each aligned topic and the terms within it.
  • It generates a prediction model that assigns terms to the related topic, surfaces those queries when the term is searched, and discards the ones that weigh less (a clustering sketch follows this list).
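
A minimal clustering sketch, assuming scikit-learn’s KMeans over TF-IDF vectors of made-up search phrases; it only illustrates the grouping step, not Trends’ actual model.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

phrases = [
    "covid vaccine news", "covid cases today", "covid travel rules",
    "obama speech", "obama book tour", "michelle obama interview",
]
X = TfidfVectorizer().fit_transform(phrases)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for label, phrase in sorted(zip(labels, phrases)):
    print(label, phrase)   # phrases grouped by cluster id
```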

Temporal Trends: Text Mining

  • Topics are discovered by extracting keywords whose temporal profile matches a certain pattern, which in turn reveals trends for a known topic.
  • It applies the concept of time-series analysis to discover trends over a time span.
  • The same is performed by Trends in its frequency-over-time charts (a minimal sketch follows).
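
A small time-series sketch, assuming pandas and invented weekly counts, to show the kind of rise-and-fall profile being described:

```python
import pandas as pd

# Invented weekly search counts for a single term.
counts = pd.Series(
    [5, 8, 20, 55, 90, 100, 70, 40, 25, 12],
    index=pd.date_range("2020-03-01", periods=10, freq="W"),
)

monthly = counts.resample("MS").mean()       # interest aggregated per month
rolling = counts.rolling(window=3).mean()    # smoothed weekly trend
print(monthly)
print(rolling)
```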

Problems associated with stages of text analytics

For most people, data preprocessing means cleaning raw data into usable, goal-achieving data that gives optimized output, but that is not always the case. It actually depends on what one wants to harness from the data, which means that maximal cleaning can lead to not-so-good results. Examples of such scenarios and issues, in the context of Google Trends, are as follows:

  1. Stopwords Removal:

The research-related or large survey data that Trends uses for its results contains stopwords like ‘the’, ‘and’, ‘are’, etc. Removing stopwords cleans your data, but it can sometimes change the actual meaning of the sentence it was framed in, leading to spurious results.
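
A small demonstration of that caveat, using NLTK’s default English stopword list on a made-up sentence: dropping stopwords removes the negation and flips the meaning.

```python
import nltk
from nltk.corpus import stopwords

nltk.download("punkt")
nltk.download("stopwords")

sentence = "the vaccine is not safe for children under five"
tokens = nltk.word_tokenize(sentence)
kept = [t for t in tokens if t not in stopwords.words("english")]
print(kept)   # e.g. ['vaccine', 'safe', 'children', 'five'] -- 'not' is gone
```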

2. Does lemmatization work?

As discussed above, stemming and lemmatization are important steps in the data cleaning pipeline, and Trends must be getting a lot of data that belongs to the real world rather than the dictionary.

  • It does remove affixes from the searched term, but only if the word is in the dictionary. Words that people use in everyday language, the so-called ‘chatting language’, are not taken into consideration.
  • The lemmatizer does not separate two seemingly attached words like ‘what’sup’, and it also ignores punctuation.
  • Also, lemmatization is clearly better than stemming, as it finds better word roots than a stemmer.
  • But both the stemmer and the lemmatizer fail when they cannot recognize differences in part of speech, i.e. ‘suicide’ as a verb versus ‘suicide’ as a noun.

3. POS (Parts-of-speech) tagging:

Trends certainly wants to use the most useful and informative data to produce convincing and truthful outcomes. So to separate useful data from the rest, supervised machine learning algorithms train classifiers on labeled data, classifying it into certain topics so as to provide the desired results when a term from a given list is searched.

Basically, POS-tag unigrams and bigrams affect the performance of algorithms such as the Naive Bayes classifier. So within the data-cleaning pipeline, POS-tagging can improve the performance of classifiers learned from wordcloud-forming terms or bag-of-words features, though not always.

Also, POS-tag bigrams give more accurate results as compared to POS-tag unigrams because bigrams show better learning performance.

For example, suppose the data looks like:

Sentence: ‘Sushant Singh Rajput committed suicide in Mumbai,India’

POS-tags: Proper Noun | Proper Noun | Proper Noun | Verb | Noun | Preposition | Proper Noun | Proper Noun

  4. Excluding terms or words that are hardly searched:

Google provides explanations of searched terms, such as how and what they relate to within the specific time period. So if certain terms are hardly ever searched, Google excludes them from its term-explanation dictionary or provides hardly any relevant information about them.

Example: when I searched the term ‘Graphite’ it showed results like ‘iphone graphite max pro’, which is not the output I intended or expected.

  5. Excluding terms that are repeatedly requested within a specific time span by a user:

If a user tended to search for a particular term multiple times within a given time frame, but there is no trace of the same user searching the same term after a certain period, then the results for the searched term vary completely or the term itself is excluded.

Example: if I search ‘Obama’ now, the related topics that appear are swine flu, Ebola, etc., whereas 5 years ago they would have been President of the United States, etc.

  6. Spell-check and handling same-meaning, different-form terms are important:

  • When I search ‘Obama’, the related results shown include ‘Obama as a city in Japan’, ‘Michelle Obama’, ‘Barack Obama’, etc.
  • Also, even though I searched ‘obma’, which is misspelled, I could still find related queries like ‘Michelle Obama’ and ‘Donald Trump’.
  • Spell check matters here: if the algorithm had searched only for the correct spelling of ‘Obama’, it would not have given me related queries when I misspelled the term.

  7. Punctuation may or may not vary the outcome:

  • Just as terms get misspelled, different people have different writing styles; their ways of spelling terms or using punctuation differ, which can change the results to a certain extent.
  • For example, when I searched ‘Aobama’ it stated that there isn’t enough data to show for this search, when I actually meant ‘Obama’.

Ethical issues

The ethical issues of text analytics, in the context of Google Trends, can be identified on the following grounds:

Data gathered

Data Source:

The sources from which Google gathers data mainly include web-crawled responses reflecting user behaviour and surveys conducted to collect data. Other ways data is collected include routine management information systems, surveys and research activities. Data gathered from non-trusted sources can lead to results built on fake survey data that give wrong information.

Example: the sources from which Trends brings in its data for analysis must be trustworthy, because the majority of the world’s population looks to Google for answers. A single piece of misleading data, like ‘Death of Indian actor Sushant Singh Rajput’ driving trends about suicide, guided by misleading information from the media, could influence people wrongly about the actor’s life and hurt the sentiments of those close to him.

Data confidentiality & security:

Use of secondary data can potentially harm individual subjects, affecting a person’s sentiments, beliefs or property, or raising issues of data return.

  1. Misleading data results:

As discussed, the search volumes of any search engine work in an ad-hoc manner. Only after evaluation would we know whether Google Trends’ voluminous backend data about suicide actually corresponds to real suicide cases or related data for estimating the suicides happening in a particular region of the world. Misleading data could lead to spurious associations between search volumes and the suicide rate, and such harsh topics could harm the psychology of users, because search-related results may reflect suicide-related behaviour or ideation rather than actual suicides.

Happy Analysing peeps!
