Data Collection for Presenting Social Media Evidence for Litigation

7829 words (31 pages) Dissertation

13th Dec 2019

Tags: Law, Social Media


Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NursingAnswers.net.

Voluble: Insights for Litigation

Contents

Overview

Data

Product Reviews

Forums

Promoted Tweets

Public Information

Sampling

Exporting

Location Methodology

Supplemental Sources

Webhose

Google

BoardReader

Source Comparison

Logo recognition

Reach and Impressions

Machine Learning

BrightView Algorithm

Testing and Validity

Sentiment Analysis

Emotion Analysis

Case Studies

Yeti vs. Boss and Ozark

Armstrong vs. US Postal Service

Other Uses

Purchase intent

Identifying Influencers

Comparing Brands

Other Uses

Quality Assurance

Improvements

Generic Terms

Location

Boolean

Retweets and Reposts

Filters

Cleaning the Data

Opportunities for Proprietary Tools

Future Research

Machine learning

Overview

Crimson Hexagon is a data provider and social media listening platform that analyzes online textual content. The platform provides access to data that is proactively indexed from online social networks. Through visual tools such as word clouds and topic wheels, prominent conversation topics can be analyzed to decipher how consumers are talking about brands. These conversations can be analyzed over time to see how they change and are shaped by external events. Social media can be used to show consumer opinions, awareness, and confusion, as well as brand reach and marketing efforts, all of which can be useful evidence in commercial litigation. When presenting social media evidence for litigation, it is important to understand the entirety of the data being presented and the methodology that was used to collect it.

Data

Crimson Hexagon software inspects publicly available social media platforms, as well as blogs, news, and consumer reviews. Crimson Hexagon indexes and stores all of its data, allowing access to a library of historical and present-day content consisting of over 1 trillion posts to date. The indexed social media data goes back to mid-May of 2008.[1] Boolean search logic is required to identify the appropriate data to be analyzed from the data universe.

  • Facebook: The Facebook search option searches all currently archived content in the Facebook library. Specific Facebook pages can be targeted; a targeted page includes a year of historical data and is added to the content library for future searches. Every day Crimson Hexagon adds over 20 million Facebook posts to its library.
  • Twitter: The Twitter Firehose allows access to data from all public tweets starting July 2010. Additionally, content is available from Gardenhose beginning July 2009. Historical data from targeted accounts extends to December 2013.
  • Tumblr: Data from Tumblr includes all public info beginning in January of 2015. Crimson Hexagon adds over 90 million new posts per day.
  • Instagram: Instagram content begins in January 2014. For requested content, Crimson Hexagon can only provide historical data going back 21 days from the request.
  • Google Plus: Google+ data has been pooled and collected into the content library starting April 2013. Their historical data for targeted accounts extends back one year from the request date.
  • Blogs: Content is pulled from public blogs starting June 2008, which have been categorized as observations and opinion pieces featured generally on smaller, independently owned sites. Comments can be included or excluded in searches. Approximately 1.1 million blog posts are added each day.
  • Forums: Sites that consist of thread-based discussion are indexed and included in the forum library based on the subject and replies. This content dates back to October 2008 with over 11 million new posts being added to the library every day.
  • Reviews: Product-based review sites are indexed and included, along with comments on review posts.
  • News: Online articles from formal news organizations and their comments are archived with content beginning in June 2008. Approximately 250,000 articles and comments are added each day.
  • YouTube: The YouTube library consists of video description content and video comments with 1,100,000 new posts every day. The titles of videos can be searched by using a title operator.

Product Reviews
Product reviews are collected using Boolean search terms that are found within the review text. Unless a review specifically states the product name or one of the included search terms, it will not be captured in the data pull. This problem can be mitigated by using the title operator in the Boolean search, so that all reviews falling under the selected title are included in the data.

A simple Boolean search for Yeti Tumblers captured approximately 2,900 product reviews from Amazon. While this is a substantial number of reviews for a single product, the top-rated Yeti Tumbler listed on Amazon had 2,300 reviews alone[2]. Updating the Boolean search terms to include the title operator increased the number of Amazon product reviews to over 3,100.

Forums
Just as product reviews can be targeted with the title operator, forums can be specified using a site operator. The site specifier targets specific forums based on their URL. Forums can also be added via whitelists or excluded via blacklists. While Crimson Hexagon indexes many forums, those that are not yet available, along with other sites, can be requested for addition to its collection of content. Forums, however, can take up to four weeks to be added to the library.

Promoted Tweets

Promoted Tweets are paid advertisements on Twitter[3]. Crimson collects all public Twitter data, which includes promoted tweets. Within the platform there is no distinction between a regular tweet and a promoted tweet. Retweets of promoted tweets are collected in social account monitors as well as Buzz and Opinion Monitors. While not all tweets from the brand are paid advertisements, one way to detect these paid advertisements is through the author specifier.

Public Information

When it comes to litigation, evidence must be publicly available. It is important to note that acquired social media posts come only from public content sources. These data providers do not have access to sources that are marked as private in their robots.txt or HTML tags, or that have been marked as private profiles by users. Additionally, “dark posts” that have been hidden or are unpublished to some users are not included in Crimson’s content, as they are not considered publicly available to all users.[4] Since Crimson Hexagon stores all of its data, posts that have since been removed from user sites will still appear within the platform. Upon exporting the data, however, deleted posts will not be compiled in the export.

Sampling

Sampling is necessary for the overall functionality of the social platform. If all data points were analyzed, run-times would be too long for real-time results. Since certain conversation topics produce a large number of posts, sampling is often necessary to measure the conversation. Random samples of the conversation are used for most analysis tools within the software package. Random sampling is an accepted and accurate method to measure large quantities of data, and the technique is often used by market researchers[5]. Volume is the only tool that is never based on the sample. Sampling is performed when a monitor has more than 10,000 posts within a day. In this case, a random sample is collected that is proportionate to each data source. Individual posts that are not randomly selected will not be directly analyzed; instead, the analyzed data is extrapolated to be representative of the entire conversation. Due to this sampling technique, however, only the sampled posts are available to export.
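The proportional sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Crimson Hexagon's actual implementation: the function name, the tuple layout of a post, and the rounding rule are all hypothetical.

```python
import random
from collections import defaultdict

def proportional_sample(posts, sample_size, seed=0):
    """Draw a random sample whose source mix mirrors the full set of posts.

    posts: list of (source, text) tuples (hypothetical layout).
    sample_size: target number of posts, e.g. the 10,000-post threshold.
    """
    if len(posts) <= sample_size:
        return list(posts)  # at or below the threshold: nothing to sample

    rng = random.Random(seed)
    by_source = defaultdict(list)
    for post in posts:
        by_source[post[0]].append(post)

    sample = []
    for group in by_source.values():
        # Allocate slots in proportion to this source's share of all posts.
        quota = round(sample_size * len(group) / len(posts))
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample
```

Because each quota is rounded independently, the sample can differ from the target size by a few posts when the shares do not divide evenly. A real system would need a rule for distributing that remainder.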

Exporting

With the ability to run an unlimited number of queries, monitors can be refined and rerun in order to isolate the most accurate conversations prior to exporting the data. Crimson Hexagon allows for a bulk export of the posts collected within a search. Each exported post includes metadata such as the time posted, URL, author, and location, along with the post content. Bulk exporting allows additional scripts to be run and analytical tools to be employed outside of the Crimson Hexagon system that better fit the needs of litigation.

However, there are some restrictions when exporting the data. The exported posts are a random sample of the total number of posts within the date range. At most, one can export 10,000 random posts[6]. When dealing with data over large periods of time, there are often more than 10,000 posts. To download more data, the timeframe can be broken into pieces, and each piece can be downloaded separately and combined to form the aggregate data. Where there are fewer than 10,000 posts, no sampling is performed and the data is exported in full.
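One way to plan the piecewise export is to group consecutive days into windows that each stay within the export cap. The sketch below assumes daily post volumes are already known (for example, from the volume tool); the function name and data layout are hypothetical.

```python
EXPORT_LIMIT = 10_000  # per-export cap described in the text

def split_into_windows(daily_counts, limit=EXPORT_LIMIT):
    """Group consecutive days into windows whose post totals stay within limit.

    daily_counts: list of (day, post_count) pairs in chronological order.
    Returns a list of (start_day, end_day, total_posts) tuples.
    """
    windows = []
    start = end = None
    total = 0
    for day, count in daily_counts:
        # Close the current window if adding this day would exceed the cap.
        if start is not None and total + count > limit:
            windows.append((start, end, total))
            start, total = None, 0
        if start is None:
            start = day
        end = day
        total += count
    if start is not None:
        windows.append((start, end, total))
    return windows
```

A single day that exceeds the cap ends up in its own window and would still be sampled on export, so such days should be flagged when the combined data set is described.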

Due to Twitter’s data guidelines, only 50,000 tweets can be exported per user within a 24-hour period. Once the tweet threshold has been reached, exported posts will include the post URL, but not the actual content. The content can be obtained by accessing each link online or programmatically through Twitter’s API. Another way to mitigate this restriction is with additional user profiles. Since each user has their own 50,000 tweet limit, another user could export the additional data.

Location Methodology

Locations are assigned to social media posts based on the metadata that is available. About 1% of posts are geotagged, meaning they carry the geographic coordinates of where they were posted from a mobile phone. For the other 99% of posts, which are not geotagged, contextual information is used to estimate a location.

The location field in a user’s profile is a predictor of location for non-geotagged posts. Attributes such as time zones and languages are also used to determine a user’s country, region, and city in order to locate where a user is posting from. Crimson’s algorithm can use these attributes to estimate and assign a location, but if the algorithm cannot find other similar posts, it does not assign a location and the post is labeled as “location unknown.” For non-geotagged posts, the algorithm can locate 90% of posts to a country of origin, 70% of posts to a state or province, and 50% of posts to a city or county within the state[7]. Location data is not available for Tumblr, YouTube, Google+, and reviews.
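The rates quoted above can be combined into a rough estimate of how many posts in a data pull will carry a usable location. The sketch assumes that geotagged posts can be located at every level and that the quoted rates apply uniformly to the pull; neither assumption is stated in the source, and the function name is hypothetical.

```python
def expected_located(total_posts, geotag_rate=0.01,
                     country_rate=0.90, state_rate=0.70, city_rate=0.50):
    """Estimate how many posts can be located at each geographic level."""
    geotagged = total_posts * geotag_rate
    inferred = total_posts - geotagged  # posts that rely on contextual estimation
    return {
        # Assumption: geotagged posts count as located at every level.
        "country": round(geotagged + inferred * country_rate),
        "state": round(geotagged + inferred * state_rate),
        "city": round(geotagged + inferred * city_rate),
    }
```

For a pull of 100,000 posts from sources that support location, this works out to roughly 90,100 posts located to a country, 70,300 to a state or province, and 50,500 to a city or county.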

Supplemental Sources

When it comes to social media, blogs, and forums, there are multiple search engines and options for gathering data, such as BoardReader, Google, and Webhose. These search engines provide results specifically targeting blogs and forums. Each search engine has its own positive and negative aspects, but when providing evidence for litigation, it is important to understand the entirety of the consumer conversation. It is therefore important to supplement Crimson Hexagon with additional sources where needed.

Webhose
The Webhose search engine indexes a wide variety of news media, blogs, and online discussion from message boards and comments. The platform uses a standard Boolean search system that pulls historical data going back up to 30 days. If older data is needed, additional fees apply per retroactive month.[8] Additionally, Webhose provides metadata alongside each post, including the language, country, content source, and performance score. The performance score measures a post’s social impact by comparing its number of shares to an appropriate benchmark based on the search.

Google
Similar to how Google provides filters for news or videos, Google also used to provide a filter for forums. The forum filter has since been removed, but one can still approximate it by adding the following URL operator before the search terms: inurl:forum. While the operator offers a quick solution, Google only indexes public forums that have not indicated privacy restrictions within their robots.txt or HTML tags.[9]

BoardReader
BoardReader is a popular forum and message board search engine. It is often used as a resource for community research on consumer conversations and beliefs. With BoardReader, one can search for content within a forum or based on the forum’s topic. However, the platform offers few filter options and, in particular, does not allow results to be filtered by location. One of the positive aspects of BoardReader is its API, which is easily integrated into various platforms such as IBM BigInsights.[10]

Source Comparison
When the supplemental search engines were compared to the blog and forum results for Crimson Hexagon, Crimson Hexagon either met or exceeded the number of relevant results returned, as seen in the table below. A simple query was run on all search engines for “Reebok” in the last 30 days in English, and then a more complex query was run for “Reebok AND RBX.”

Comparison         Query “Reebok”           Query “Reebok AND RBX”
BoardReader        1,692                    1
Webhose            3,712 (2,904 in US)      2
Google             4,370 (2,130 in US)      0
Crimson Hexagon    4,545 (2,972 in US)      2

Table 1: Search Engine Comparisons using keywords “Reebok” and “Reebok AND RBX”

Each search engine presents its results differently, but total volume is easily comparable. As mentioned previously, BoardReader does not allow for filtering by location, so the number of posts from the US cannot be determined. Crimson Hexagon produced the most posts for “Reebok”, as well as the most posts from the United States. While Google produced the second most posts overall, Webhose produced the second most posts within the United States, which more closely matched the results of the Crimson Hexagon search. With the more specific search of “Reebok AND RBX,” Crimson Hexagon produced two posts, which covered the one post located by BoardReader as well as both of the posts located by Webhose. Upon further research, it was discovered that Crimson Hexagon uses Webhose as one of its own tools to acquire data.

Logo recognition

Logo recognition was introduced to the Crimson platform in 2016. With over 1.8 billion photos being shared on social media daily and only 15% of reference brands within them, the logo recognition feature helps capture unspoken data[11]. This tool helps measure the true volume of brand mentions, both textual and visual. Select logo data on Twitter can be accessed starting in May 2015, with more consistent data starting in December of that year. Each monitor can track one logo, so multiple monitors need to be set up for comparison data. The platform currently supports 500 logos, which are recognized with 98% precision. Accuracy is affected by how often a specific logo appears in social media posts, the uniqueness of the logo icon, and the uniqueness of the logo font.

Reach and Impressions

As part of Crimson Hexagon’s analytics, impressions and reach are calculated for posts. Impressions measure how many times a specific post has been displayed, regardless of whether the post has been clicked. Essentially, an impression is counted for every opportunity the post could have been read, and it is possible for one person to accrue multiple impressions on the same post. Reach is the total number of people who have received impressions. If one person reads a post multiple times, the impressions will increase, but the reach will not.

Total potential impressions are measured as the number of times that a tweet could have been read if every follower of every author read every tweet about the topic. When a user tweets, the tweet generates a potential impression count equal to that user’s number of followers. If someone retweets the original tweet, this creates additional potential impressions equal to the retweeter’s number of followers. This process continues for every user who retweets the original post.
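The distinction between impressions, reach, and potential impressions can be illustrated with a short sketch. The data layouts (a log of views and a list of tweets with follower counts) are hypothetical and are used only to make the definitions concrete.

```python
def impressions_and_reach(views):
    """views: list of (viewer, post_id) pairs, one per time a post was displayed."""
    impressions = len(views)                       # every display counts
    reach = len({viewer for viewer, _ in views})   # each person counts once
    return impressions, reach

def potential_impressions(tweets):
    """tweets: the original tweet plus its retweets, each a dict with 'followers'.

    Every tweet or retweet contributes its author's follower count.
    """
    return sum(tweet["followers"] for tweet in tweets)
```

For example, if one person sees a post twice and another sees it once, the post has three impressions and a reach of two. An original tweet from an author with 1,000 followers that is retweeted by accounts with 200 and 50 followers has 1,250 potential impressions.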

Machine Learning

Through the use of natural language processing, computer algorithms can be used to help sort and categorize the data. Natural language processing is a form of artificial intelligence used to understand how people speak and communicate. Through automated solutions and natural language processing applications, social media data can be categorized into sentiments and emotions. For more complex text analytics, machine learning approaches can be used to analyze multiple variables at once. Machine learning requires training posts in order to identify patterns in the conversation to categorize posts according to topics.

BrightView Algorithm

The BrightView Algorithm is Crimson Hexagon’s machine learning tool that powers Opinion Monitors. The algorithm is based on aggregate analysis to allow flexibility and accuracy. An individual post does not always fit perfectly into one category, but often encompasses several categories to some extent. Aggregate analysis allows BrightView to take this into account. Unlike sentiment analysis, each post will be assigned proportions of sentiment[12]. While the algorithm can look at larger conversation patterns, the post list focuses on classifying individual social posts. In this case, each post will only appear in one category.

Crimson Hexagon monitors examine the entirety of each discussion, which is collectively known as the conversation. The algorithm divides all of the relevant text into subsections, known as assertions. As a result, individual posts or tweets are not categorized; rather, the assertion is the unit of measure. If a portion of a post fits into one category while the rest fits better into a separate category, the algorithm will divide the post accordingly. Due to this form of categorization, the categories are reported as the percent of assertions out of the entire conversation resulting from the Boolean search terms. [13]

The BrightView Algorithm must be trained to analyze the conversation based on relevant topics. Each category requires at least 10 posts, but more training data will lead to more accurate results. When the monitor is run, the algorithm examines the word patterns derived from the previous monitor training. The patterns are applied to every social media post, blog comment, product review etc. that was captured from the search terms within the date guidelines. While other systems tend to count keywords, Crimson Hexagon’s algorithm artificially learns what the user is looking for through trained posts and is able to produce a relevant output for conversation categories.

Testing and Validity

The machine-learning algorithm can be tested by comparing its results to those of a human coder. An accurate algorithm will produce data that tells the same story as the hand-coded data. Crimson Hexagon’s BrightView algorithm achieves this, with results that are highly and positively correlated with hand coding at 92%.[14]
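A comparison of this kind can be sketched by correlating the category proportions produced by the algorithm with those produced by a human coder. The source does not say how the 92% figure was calculated; the Pearson correlation coefficient below is an assumption chosen for illustration, and the function name is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)
```

In practice, `xs` would hold the share of the conversation the algorithm assigned to each category and `ys` the share assigned by the human coder for the same categories. A value close to 1 indicates that the two tell the same story.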

Sentiment Analysis

Sentiment analysis is based upon pre-defined sentiment categories. By comparing posts to hand-labelled training posts, the sentiment analysis algorithm can detect positive, negative, or neutral posts. The training set consists of over 500,000 hand-labelled posts, which are used to measure the frequency distribution of each word and emoticon across the sentiment categories[15]. Based on these frequency distributions, a model is constructed to analyze each new post. Each post will be assigned a single sentiment.
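The source does not describe the model beyond its use of word and emoticon frequency distributions. As an illustration only, the sketch below swaps in a naive Bayes classifier with add-one smoothing, a common technique built on the same kind of per-category frequency counts. All function names and the whitespace tokenization are assumptions.

```python
from collections import Counter, defaultdict
from math import log

def train(labelled_posts):
    """labelled_posts: list of (text, label). Returns per-label token and post counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled_posts:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Assign the single most probable label to a new post."""
    vocab = {word for counts in word_counts.values() for word in counts}
    total_posts = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = log(label_counts[label] / total_posts)  # prior for this label
        for token in text.lower().split():
            # Add-one smoothing keeps unseen tokens from zeroing out the score.
            score += log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Emoticons such as ":)" are treated as ordinary tokens here, which mirrors the idea of measuring their distribution across sentiment categories alongside words.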

Emotion Analysis

The emotion analysis within Crimson Hexagon provides context to the data. The analysis utilizes Paul Ekman’s six basic human emotions: anger, fear, disgust, joy, surprise, and sadness. This allows the data to be analyzed beyond basic positive, negative, or neutral sentiment. By training on over 2 million tweets, Crimson was able to teach the algorithm to detect emotion based on linguistic structure, word patterns, and phrases[16]. Additionally, opinion monitors can be trained to recognize sarcasm through the BrightView algorithm.

Case Studies

Yeti vs. Boss and Ozark

One of the many uses of social media listening is to analyze conversation volumes over time to spot trends and compare conversations. In the case of Yeti Tumblers vs. Boss and a separate case of Yeti Tumblers vs. Ozark by Walmart, Yeti sued for trade dress infringement of its product configuration and design. For these cases, the goal was to use Crimson Hexagon to perform a volume analysis, find illustrative comments, and provide analysis of the overall tumbler industry. The volume analysis consisted of determining which brands generated the most social conversation and the trends within those conversations. The second objective was to find illustrative comments showing consumer confusion and secondary meaning. By locating examples of consumers calling an Ozark Tumbler a Yeti or commenting on Boss Tumblers’ Yeti design, Yeti could provide evidence of organic consumer opinions. Lastly, the objective was to determine how Yeti has shaped the tumbler industry by comparing conversations prior to its launch in 2014 with the conversation after the launch, in order to provide evidence of widespread consumer awareness.

Armstrong vs. US Postal Service

The Armstrong vs. United States Postal Service case took a more qualitative approach to evidence, as opposed to a quantitative volume analysis. The goal was to compile social media posts that would make compelling trial exhibits on behalf of the Department of Justice. Additionally, data was collected on the size and length of the conversation surrounding Armstrong’s scandal, compared to other athletes who have also had drug scandals, such as Alex Rodriguez and Maria Sharapova. Through this analysis, Voluble was able to locate quality posts showing that people are still talking about Armstrong’s drug use and doing so in association with the United States Postal Service. The analysis included sentiment to distinguish between users who were posting negatively about his drug use and those who were posting positively about his race wins and charity. By comparing the Armstrong conversation to that of other athletes, Voluble was able to show that his conversation was not only much greater, but also lasting much longer. Additionally, Armstrong has become the poster boy of sport drug scandals, and the conversation involving Armstrong peaks when other athletes’ scandals break.

Other Uses

Purchase intent

Conversations on social platforms can provide a window into consumer opinion to view consumer behavior. By studying the volume of tweets surrounding a product, insight can be gained on purchasing behavior. The consumer decision journey starts with interest and desire, moving on to evaluation of choices and options. Intent declares the future purchase, conversion is reached at the point of purchase and advocacy begins after the purchase when the consumer is now promoting it to others[17].

Consumer Decision Step    Boolean String
Interest                  (“wish I owned” OR “wish I had” OR wishlist OR ((look OR looks) AND (great OR amazing)))
Evaluation                (“should I buy” OR “has any bought” OR “am thinking about getting” OR “choose between”)
Intent                    (ordering OR “picking up” OR “going to buy” OR “can’t wait to get” OR “gonna buy”)
Conversion                (“just bought” OR “pre-ordered” OR preordered OR purchased OR “got a new” OR ordered)
Advocacy                  (“highly recommend” OR “happy customer” OR “glad I bought” OR “love my new” OR “I recommend”)

Table 2: Boolean Strings for Consumer Decision Steps
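The Boolean strings in Table 2 can be approximated outside the platform with simple phrase matching, for example when reviewing exported posts. The sketch below is an illustration only: it uses Python regular expressions rather than Crimson Hexagon's query engine, and the names `STEP_PHRASES`, `has_term`, and `decision_steps` are hypothetical.

```python
import re

# Quoted phrases and single terms from Table 2, lowercased for matching.
STEP_PHRASES = {
    "Interest": ["wish i owned", "wish i had", "wishlist"],
    "Evaluation": ["should i buy", "has any bought",
                   "am thinking about getting", "choose between"],
    "Intent": ["ordering", "picking up", "going to buy",
               "can't wait to get", "gonna buy"],
    "Conversion": ["just bought", "pre-ordered", "preordered",
                   "purchased", "got a new", "ordered"],
    "Advocacy": ["highly recommend", "happy customer", "glad i bought",
                 "love my new", "i recommend"],
}

def has_term(text, term):
    """True if term appears in text as a whole word or phrase."""
    return re.search(r"\b" + re.escape(term) + r"\b", text) is not None

def decision_steps(post):
    """Return every consumer decision step whose terms appear in the post."""
    text = post.lower().replace("\u2019", "'")  # normalize curly apostrophes
    steps = [step for step, phrases in STEP_PHRASES.items()
             if any(has_term(text, phrase) for phrase in phrases)]
    # Nested clause from the Interest string: (look OR looks) AND (great OR amazing).
    looks = has_term(text, "look") or has_term(text, "looks")
    praise = has_term(text, "great") or has_term(text, "amazing")
    if looks and praise and "Interest" not in steps:
        steps.insert(0, "Interest")
    return steps
```

A post can match more than one step, so the function returns a list. A post such as "Just bought a Yeti and I highly recommend it" matches both Conversion and Advocacy.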

Through Crimson Hexagon’s BrightView Algorithm and refined Boolean strings, posts can be categorized into the different consumer decision steps. This allows Voluble to gauge where a client’s consumers are within their decision journey and to gain a better understanding of the consumer conversation. This data can help support claims of lost sales.

Identifying Influencers

Crimson Hexagon can be used to identify influencers and to examine the influencer behind a specific post. The authors tab lists influential Twitter authors based on Klout score. Authors such as The New York Times, The Boston Globe, and other trustworthy news sources tend to have high Klout scores and therefore show up at the top of the list. These sources, though, are often not the authors spreading the organic conversation. By adding an author influence sub-filter to eliminate those with the highest Klout scores, one can identify authors that often talk about the brand naturally. Regional influencers can be identified as well by setting a location sub-filter. Influencer data can be exported for up to 50 authors at a time and includes additional metadata on the number of posts, following, and followers.

In the Twitter tab, the top retweets, mentions, and hashtags are listed regardless of Klout score and influence. These top attributes help identify authors, posts, and hashtags that have gone viral. Influence can be measured outside of Twitter as well by creating a monitor with non-Twitter sources such as blogs, forums, reviews, and comments. The top sites view identifies the sites that post the most about a brand. This data shows where people are talking about the brand the most, but does not provide a quantitative Klout score.

Comparing Brands

Custom segments can be set up with Crimson Hexagon to analyze the followers of each brand. The segment produces the number of people who follow each brand. The followers of hashtags and mentions can be analyzed going back 30 days to see what they are talking about. Demographic data for each segment can be analyzed to identify the gender proportion of the followers, the geographical location of the audience, and the common interests of the target market. Likewise, custom segments can be created to measure how many people are talking about the brand by mentioning its handle.

A brand monitor can be created with all mentions of the brand names and hashtags, using the author specifier to remove posts from the brand itself so that only organic conversation is captured. By excluding link shares through the removal of posts containing http and https, spam and e-commerce posts can be removed to display the remaining organic conversation. The volume of the total posts about the brand can then be measured against that of the other brand in order to compare consumer awareness.

Other Uses

Crimson Hexagon’s social data can also be used for a variety of other litigation purposes. Social media examples can be located to support claims of tarnishment and passing off. Claims of genericide can also be supported by producing consumer conversations that show how consumers use specific terms organically. Additionally, this data can be used to provide analysis for a brand or an event to measure consumer and public reaction. By analyzing the conversation surrounding a product category or industry, markets can be defined, as well as the trends within them and the social demographic of the market. This analysis helps with antitrust cases and class certifications by making sure that the right consumer is being represented in the case.

Quality Assurance

To ensure that the Crimson Hexagon monitors are capturing the relevant conversation universe, it is important to put quality control measures in place. Before monitor queries are created, background research is essential. It is important to understand how consumers are talking about a brand, the products that consumers are using, the social platforms they are communicating on, as well as how and where the brands are advertising themselves and how they interact with social platforms. Additionally, it is important to be aware of other companies or brands that may have a similar branding or company name in order to ensure that they are not captured within the conversation. Having a comprehensive understanding of the online conversation allows for a more comprehensive, yet targeted query that excludes false positives.

Boolean strings should contain all possible spellings of keywords, including various tenses, misspellings, and abbreviations. By checking the Word Cloud, Word Cluster, and Topic Waves, one can ensure that the most popular terms in the conversation are relevant and that no known topics are missing. Within the post list, spam, excessive duplicates, and irrelevant posts can be found. By locating key phrases, these posts can be removed by adjusting the Boolean code. The authors tab shows the most prolific authors. Twitter handles that have an unrealistic number of posts are often bots. By selecting an author’s name, their posts can be analyzed to determine whether they should be added to the blacklist and excluded from all future searches, or temporarily excluded using an author exclusion in the Boolean terms.

Improvements

While the Crimson Hexagon platform was chosen over other providers for its more comprehensive data and analytic tools, it still has its limitations. Crimson Hexagon’s content library has over a trillion posts, but certain content sources are not indexed and coverage of current platforms is not comprehensive. Furthermore, adding content sources takes a few weeks to implement, and historical data is limited. Creating monitors is a time-consuming process, especially when Boolean codes are being continually refined to narrow down the conversation. As refinements are made, previous codes and deleted queries are not recorded. While Excel can be used to further clean data, two databases then need to be managed.

Generic Terms

In the case of Yeti vs. Boss, the issue of generic terms proved relevant. While the word “yeti” has multiple uses including Yeti Coolers, Yeti Cycles, Yeti Airlines, and yeti as a mythical creature, these uses could be isolated and removed from the conversation through keyword exclusions. Boss Tumblers, however, proved to be more difficult. Not only can the word “boss” be used in multiple contexts in general, it was also used in multiple contexts within the tumbler conversation itself. The tumbler conversation included consumers talking about tumblers inscribed with slogans such as “Big Boss”, “Boss Lady”, “#1 Boss” etc. Additionally, Yeti tumblers proved to be popular gifts to and from bosses. A more targeted approach was required to remove these conversations without removing relevant posts. Since the exported conversation for Boss Tumblers amounted to approximately 1,800 posts, a manual review of each post was feasible to ensure relevancy. A manual review of all conversations is not always practical, and larger conversations therefore require sampling and additional content filtering methods.

Location

The location filter is important when measuring and analyzing the conversation within a particular area. In terms of litigation, location is extremely important when presenting evidence. Evidence from outside of the United States is not particularly useful for cases being tried within the United States legal system. However, not every post can be geotagged or assigned an inferred location. These posts are excluded and will not appear whenever a location filter is applied. It would be useful to be able to include these unlocated posts when filtering via a “blank” or “location unknown” option.

Boolean

By simply improving the Boolean functionality and keeping the code organized, the pooled data can be greatly improved. Traditional Boolean operators could be improved by integrating conditional statements, but Crimson Hexagon does not support them. Crimson Hexagon supports Boolean operators such as “AND”, “OR”, “NOT”, proximity, wildcards and specifiers such as “author:”, “title:”, “itemreview:”, “topleveldomain”, and “URL:” amongst others. Additionally, the specifiers are not always accurate in including or excluding only specific countries.

Detailed queries offer more conservative results. While these complicated queries eliminate false positives, they are also complicated to present to the court. These intricacies within the strings of code are hard to explain to an expert and difficult to defend. Shorter and simpler queries are easier to understand and provide more results, but also more false positives.

Retweets and Reposts

Crimson Hexagon does not differentiate between original posts and reposts or retweets. In certain cases, the inclusion of duplicated posts is relevant to the conversation volume; in others, only the original posts should be included within the analysis. Certain retweets and reposts can be removed with a keyword exclusion of “RT”, but this does not exclude all of the duplicated posts and may eliminate relevant posts as well.
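On exported data, a narrower rule than a bare “RT” keyword can reduce both problems. The sketch below is an illustration only: it assumes posts are plain text strings, treats a post as a retweet only when it begins with the conventional “RT @user” prefix, and drops repeated text after normalizing case and whitespace. The function names are hypothetical, and reposts that do not follow this convention would still need other handling.

```python
import re

# Conventional retweet prefix at the start of a post, e.g. "RT @fan: ..."
RETWEET_PREFIX = re.compile(r"^\s*RT\s+@\w+")


def is_retweet(text):
    """True only when the post starts with 'RT @user', not when 'RT' appears elsewhere."""
    return bool(RETWEET_PREFIX.match(text))


def normalize(text):
    """Lowercase and collapse whitespace so near-identical copies compare equal."""
    return " ".join(text.lower().split())


def originals_only(posts):
    """Drop retweets, then drop repeated copies of the same text."""
    seen = set()
    kept = []
    for text in posts:
        if is_retweet(text):
            continue
        key = normalize(text)
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept


# Hypothetical posts
posts = [
    "RT @fan: Love my new tumbler",
    "Love my new tumbler",
    "Stopped on RT 66 and bought a tumbler",
    "love my new  tumbler",
]
originals = originals_only(posts)
```

The third post contains “RT” but is kept, because the pattern is anchored to the start of the post and requires an “@” mention.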

Filters

Filters can be used to further narrow down monitor conversations. Refining a query with the Crimson Hexagon filters is much faster than resetting the entire monitor with refined Boolean code. However, Crimson Hexagon does not support multiple levels of filters and sub-filters. Searches can be made within filters, but these searches cannot be saved. Additionally, each filter has a limit of 500 characters. This imposed limit forces the user to update the entire monitor in order to continue filtering.

Cleaning the Data

Due to the limitations of Crimson Hexagon’s tools, the data is cleaned in Excel. Within Excel, the main goals are to remove non-US and non-English posts, as they are irrelevant for US litigation cases; to remove spam, bot, and giveaway posts, as they do not represent organic consumer conversations; and to remove false positives, as they are irrelevant to the conversation. Filters are used to eliminate posts that originated outside of the United States and non-English comments. Bots and spam are eliminated based on a variety of characteristics. Authors who tweeted above a given threshold are removed, as are authors of highly duplicated tweets, since both are characteristics of bots. Patterns in the use of certain phrases, hashtags, and characters can also be used to detect spam within the conversation, which can then be filtered out as well.
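The two bot heuristics described above (authors who post above a threshold, and authors of highly duplicated tweets) can be sketched in a few lines. This is an illustration, not the procedure actually used in Excel: it assumes the export has been reduced to `(author, text)` pairs, and the threshold values in the example are arbitrary placeholders that would need to be chosen and justified per dataset.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so near-identical copies compare equal."""
    return " ".join(text.lower().split())


def flag_suspect_authors(posts, max_posts_per_author, max_duplicate_count):
    """Return authors who exceed the posting threshold or posted highly duplicated text.

    posts: list of (author, text) pairs.
    """
    posts_per_author = Counter(author for author, _ in posts)
    copies_per_text = Counter(normalize(text) for _, text in posts)

    suspects = set()
    for author, text in posts:
        if posts_per_author[author] > max_posts_per_author:
            suspects.add(author)
        if copies_per_text[normalize(text)] > max_duplicate_count:
            suspects.add(author)
    return suspects


# Hypothetical posts and placeholder thresholds
posts = [
    ("a", "Win a free tumbler now"),
    ("b", "Win a free tumbler now"),
    ("c", "win a free  tumbler now"),
    ("d", "My tumbler keeps ice all day"),
    ("e", "post 1"), ("e", "post 2"), ("e", "post 3"), ("e", "post 4"),
]
suspects = flag_suspect_authors(posts, max_posts_per_author=3, max_duplicate_count=2)
```

Flagged authors are candidates for review rather than automatic deletion, since a highly active or widely quoted account is not necessarily a bot.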

While we will never claim 100% accuracy of the data, statistical sampling allows us to be confident in the data within a specified error rate. Claiming perfectly cleaned data would require a manual review of each individual post, which is not always practical or an efficient use of time. Additionally, a claim of perfect data provides an easy means of rebuttal for the opposing side if a single post is found to be incorrect.
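As a worked illustration of what a specified error rate can mean in practice, the sketch below uses the standard normal-approximation formulas for a proportion, with a finite population correction for the sample size. The 95% confidence level (z = 1.96) and the worst-case proportion p = 0.5 are conventional assumptions made here for illustration and are not stated in the source methodology.

```python
import math


def required_sample_size(population, margin, z=1.96, p=0.5):
    """Posts to review so the estimated rate is within +/- margin.

    Uses n0 = z^2 * p * (1 - p) / margin^2 with a finite population correction.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def margin_of_error(sample_size, errors_found, z=1.96):
    """Half-width of the confidence interval around the observed error rate."""
    p = errors_found / sample_size
    return z * math.sqrt(p * (1 - p) / sample_size)


# A conversation of about 1,800 posts, reviewed to within +/- 5 percentage points
n = required_sample_size(population=1800, margin=0.05)

# Hypothetical review result: 20 irrelevant posts found in a sample of 400
moe = margin_of_error(sample_size=400, errors_found=20)
```

Under these assumptions, a conversation of roughly 1,800 posts would need a review of 317 randomly selected posts for a margin of five percentage points, rather than all 1,800.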

As data is cleaned, posts can be triaged through automation or sampling. Posts can be sorted into categories including advertisements, e-commerce, and consumer conversations. Relevant consumer conversations can be further identified based on known keywords and phrases relating to secondary meaning, likelihood of confusion, and dilution. The categorization of these posts can be completed through manual review and coding. Depending on the complexity of the task, a Mechanical Turk model can be employed to categorize numerous posts in a short time frame. Individual posts would be categorized by multiple reviewers, and any discrepancies in categorization would be compiled for an expert to review and make the final categorization.
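The multi-reviewer step can be sketched as follows. This is a minimal illustration, assuming reviewer labels have already been collected into a dictionary keyed by a hypothetical post ID; posts with unanimous labels are accepted, and every other post is queued for the expert, as described above.

```python
def triage(labels_by_post):
    """Split posts into unanimously labeled ones and ones needing expert review.

    labels_by_post: {post_id: [label from each reviewer]}
    Returns (agreed, disputed), where agreed maps post_id -> label and
    disputed lists the post_ids on which the reviewers disagreed.
    """
    agreed = {}
    disputed = []
    for post_id, labels in labels_by_post.items():
        if len(set(labels)) == 1:
            agreed[post_id] = labels[0]
        else:
            disputed.append(post_id)
    return agreed, disputed


# Hypothetical reviewer labels
labels_by_post = {
    101: ["advertisement", "advertisement", "advertisement"],
    102: ["consumer conversation", "e-commerce", "consumer conversation"],
    103: ["e-commerce", "e-commerce", "e-commerce"],
}
agreed, disputed = triage(labels_by_post)
```

Requiring unanimity is the most conservative rule; a majority vote would send fewer posts to the expert but would hide disagreements that opposing counsel might later raise.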

Opportunities for Proprietary Tools

Supplemental tools can be used to further automate the cleaning of the data, but tools cannot replace the social and cultural research on the brands that the queries are based on. Human review will be needed at every step, as the tools will not be able to determine the keywords, hashtags, slogans, product names, and colloquialisms associated with the brand conversations. Tools can be used to import and merge data, exclude international posts, and generate a table of the authors with the most posts and duplicated tweets. While these tasks can easily be done manually, automated scripts can complete them in a fraction of the time while maintaining a detailed record of the methodology used to clean the data.

Additional tools that are under development are focused on triage, content analysis, and implementing breadcrumb trails for replication. By generating rules and conditional statements to filter out spam and bots, tools can be implemented to better sift through the data. These scripts will allow for basic cleanup, logging of any filters, deletions, or searches, and a consistent approach between datasets. When providing data for litigation, it is important to be able to replicate the results, as well as to ensure that the data is not biased to favor the plaintiff or the defendant. The tools will allow the methodology to be replicated and will standardize our approach when working with large datasets.
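A breadcrumb trail of this kind can be as simple as recording every filter as it is applied. The sketch below is one possible shape for such a log, not a description of the tools under development; the `CleaningLog` class, its field names, and the example post fields are all hypothetical.

```python
import json


class CleaningLog:
    """Applies filters to a list of posts and records each step for replication."""

    def __init__(self):
        self.steps = []

    def apply(self, posts, description, keep):
        """Keep the posts for which keep(post) is true and log the row counts."""
        kept = [post for post in posts if keep(post)]
        self.steps.append({
            "step": len(self.steps) + 1,
            "description": description,
            "rows_in": len(posts),
            "rows_out": len(kept),
        })
        return kept

    def to_json(self):
        """Serialize the trail so it can be filed alongside the dataset."""
        return json.dumps(self.steps, indent=2)


# Hypothetical exported posts
raw_posts = [
    {"id": 1, "country": "USA", "lang": "en"},
    {"id": 2, "country": "GBR", "lang": "en"},
    {"id": 3, "country": "USA", "lang": "es"},
]

log = CleaningLog()
cleaned = log.apply(raw_posts, "keep US posts", lambda p: p["country"] == "USA")
cleaned = log.apply(cleaned, "keep English posts", lambda p: p["lang"] == "en")
trail = log.to_json()
```

Because each step records how many rows went in and came out, the same sequence can be rerun on the original export and checked against the logged counts.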

By augmenting the Crimson Hexagon database, new content sources can be added through the APIs, as well as additional context and metadata that give the current data more meaning. Additional platform data can also be acquired through each platform’s API. For example, this data can be used to acquire a brand’s Twitter followers and cross-reference them with the followers of its competitors. This comparison indicates whether the two brands are targeting and talking to the same consumers or marketing to completely separate industry segments.
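Once the follower lists have been retrieved, the cross-reference itself is a set comparison. The sketch below assumes the follower IDs for each brand have already been downloaded (the retrieval step through the platform API is not shown) and uses the Jaccard index as one possible overlap measure; the choice of measure and the sample IDs are assumptions made for illustration.

```python
def follower_overlap(followers_a, followers_b):
    """Return the shared followers and the Jaccard index of two follower lists."""
    a, b = set(followers_a), set(followers_b)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Hypothetical follower IDs for two brands
brand_followers = [101, 102, 103, 104]
competitor_followers = [103, 104, 105, 106, 107, 108]

shared, jaccard = follower_overlap(brand_followers, competitor_followers)
```

A Jaccard index near 1 suggests the brands are speaking to largely the same audience, while a value near 0 suggests separate segments; what counts as a meaningful overlap would need to be argued for the specific case.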

The final stage of the prospective tools is to create human-defined machine learning tools that automatically generate rules based on defined features, with those rules then reviewed by humans. It is important that while machine learning is being implemented, experts in litigation are guiding the entire process.

Future Research

Machine learning

There are two types of machine learning: unsupervised and supervised. In unsupervised learning, the computer clusters or groups similar posts together without labeled examples. Supervised learning is the classification of posts based on posts that have been labeled as examples. Voluble is looking into supervised machine learning that can classify posts as evidence supporting specific litigation claims. This can be done through Crimson Hexagon’s API, either by uploading data into the Crimson Hexagon platform or by downloading data from Crimson Hexagon and feeding it into external software.
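To make the supervised approach concrete, the sketch below trains a small multinomial naive Bayes classifier on labeled example posts and then classifies new ones. Naive Bayes is used here only because it is compact enough to show in full; it is not the BrightView algorithm and is not presented as the model Voluble is building. The example posts and the labels “confusion” and “advertisement” are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(examples):
    """Count labels and words per label from (text, label) training examples."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total_examples)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled examples
examples = [
    ("is this tumbler made by yeti", "confusion"),
    ("thought my boss tumbler was a yeti", "confusion"),
    ("buy one get one free tumblers today", "advertisement"),
    ("free shipping on all tumblers today", "advertisement"),
]
model = train(examples)
first = classify(model, "is that a yeti or a knockoff")
second = classify(model, "free tumblers today")
```

With only four training posts the result is a toy, which underlines the point that a supervised model is only as good as the labeled examples it is given.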

Through Crimson Hexagon’s API, examples of social media posts can be uploaded into its data sets. This allows Voluble to take advantage of the natural language processor within the BrightView algorithm by training the model with social media posts used in litigation. The ability to manipulate external data within the Crimson Hexagon platform allows Voluble to better tailor the algorithm to its litigation needs.

While Crimson Hexagon’s BrightView algorithm allows for the classification of posts into categories, Voluble is researching how to build their own machine learning model. While Voluble can create their own model, the model is only as good as the data used to train it. Therefore, Voluble is focused on synthesizing a collection of social media posts that have withstood the scrutiny of litigation for specific claims and using these posts to train the model. The model can be strengthened over time, as additional training data is made available by the courts.


[1] Crimson Hexagon (n.d.). Social data you can work with. Retrieved April 25, 2017, from http://www.crimsonhexagon.com/

[2] Amazon. (n.d.). YETI Rambler 20 oz Stainless Steel Vacuum Insulated Tumbler with Lid (Stainless Steel): Amazon Launchpad. Retrieved April, 2017, from https://www.amazon.com/dp/B00JP9AJC6

[3] Twitter. (n.d.). What are Promoted Tweets? Retrieved April, 2017, from https://business.twitter.com/en/help/overview/what-are-promoted-tweets.html

[4] Crimson Hexagon. (2017, January 27). Retrieved April, 2017, from https://help.crimsonhexagon.com/hc/en-us/articles/213262743-Posts-Does-Crimson-Hexagon-Index-Dark-Posts-

[5] Crimson Hexagon. (2014, December 5). Sampling Overview. Retrieved April, 2017, from https://help.crimsonhexagon.com/hc/en-us/articles/203523995-Sampling-Overview

[6] Crimson Hexagon. (2017, March 2). Exports Regular and Bulk. Retrieved April, 2017, from https://help.crimsonhexagon.com/hc/en-us/articles/202774129-Exports-Regular-Bulk

[7] Crimson Hexagon. (2017, January 25). Location Methodology. Retrieved April, 2017, from https://help.crimsonhexagon.com/hc/en-us/articles/203952525-Location-Methodology

[8] Webhose. (n.d.). Tap Into Data Feeds. Retrieved April, 2017, from https://webhose.io/

[9] Google Developers. (n.d.). Robots meta tag and X-Robots. Retrieved April, 2017, from https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag

[10] IBM. (n.d.). BoardReader Sample Apps. Retrieved April, 2017, from https://www.ibm.com/support/knowledgecenter/en/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.dev.doc/doc/c_sample_apps_boardreader.html

[11] Crimson Hexagon (n.d.). Social data you can work with. Retrieved April 25, 2017, from http://www.crimsonhexagon.com/

[12] Crimson Hexagon. (2017, March 15). Aggregate Analysis. Retrieved April, 2017, from https://help.crimsonhexagon.com/hc/en-us/articles/205004735-Aggregate-Analysis-Overview

[13] Hitlin, P. (2015, April 01). Methodology: How Crimson Hexagon Works. Retrieved April, 2017, from http://www.journalism.org/2015/04/01/methodology-crimson-hexagon/

[14] Crimson Hexagon. (2017, February 13). Introduction to BrightView™ Algorithm and Validation Methodology. Retrieved March, 2017, from http://pages.crimsonhexagon.com/WC2015-04-21-VID-IntroductiontoBrightViewAlgorithmandValidationMethodology_VideoPage.html

[15] Crimson Hexagon. (2017, February 14). Sentiment Analysis Overview. Retrieved April, 2017, from https://help.crimsonhexagon.com/hc/en-us/articles/203523885-Sentiment-Analysis-Overview

[16] Crimson Hexagon. (2017, March 30). Emotion Analysis Overview. Retrieved April, 2017, from https://help.crimsonhexagon.com/hc/en-us/articles/211129163-Emotion-Analysis-Overview

[17] Crimson Hexagon. (2016, September 15). Purchase Intent Workshop Video. Retrieved April, 2017, from https://help.crimsonhexagon.com/hc/en-us/articles/212775766-Workshop-Purchase-Intent-Video-
