SHARI – An Integration of Tools to Visualize the Story of the Day
Abstract
Tools such as Google News and Flipboard exist to convey daily news, but what about the past? In this paper, we describe how to combine several existing tools with web archive holdings to perform news analysis and visualization of the “biggest story” for a given date. StoryGraph clusters news articles together to identify a common news story. Hypercane leverages ArchiveNow to store URLs produced by StoryGraph in web archives. Hypercane analyzes these URLs to identify the most common terms, entities, and highest quality images for social media storytelling. Raintale then takes the output of these tools to produce a visualization of the news story for a given day. We name this process SHARI (StoryGraph Hypercane ArchiveNow Raintale Integration).
1 Introduction

URL:http://storygraph.cs.odu.edu/graphs/polar-media-consensus-graph/#cursor=0&hist=1440&t=2020-03-23T00:09:10

URL:https://oduwsdl.github.io/dsa-puddles/stories/shari/2020/03/23/storygraph_biggest_story_2020-03-23/


Tools such as Google News and Flipboard exist to convey daily news, but what about the news of the past? We have combined StoryGraph (http://storygraph.cs.odu.edu) with tools from the Dark and Stormy Archives Toolkit (https://oduwsdl.github.io/dsa/software.html) to produce the StoryGraph Hypercane ArchiveNow Raintale Integration (SHARI) process. These tools represent disparate research efforts in news analysis, corpus summarization, web archiving, and visualization. The integration produces a summary of the “biggest story” for a given date. SHARI combines the following components from Old Dominion University’s Web Science and Digital Libraries Research Group (https://ws-dl.cs.odu.edu):
- StoryGraph: a platform that downloads RSS feeds and analyzes the linked articles to cluster news stories (12) – http://storygraph.cs.odu.edu/
- Hypercane: a framework for intelligently sampling and analyzing documents from web archive collections (5) – https://oduwsdl.github.io/hypercane
- ArchiveNow: a library developed by Aturban et al. (2) that submits live web URI-Rs to web archives to create URI-Ms – https://github.com/oduwsdl/archivenow
- Raintale: a MementoEmbed (3) client that creates stories from a sample of mementos – https://oduwsdl.github.io/raintale
Nwala et al. (10, 11) have focused on finding seeds within search engine result pages (SERPs), social media stories, and news feeds. As part of this research, Nwala et al. also developed StoryGraph (12), a service that saves RSS feeds from 17 news sources (Table 1 in Appendix A) every ten minutes. With these RSS feeds, StoryGraph analyzes the lexical connections between articles across feeds to generate JSON output, which drives a graph visualization. Figure 1 displays some of this JSON output for March 23, 2020. StoryGraph then visualizes this output, as shown in Figure 2.
Collections on specific topics exist at various web archives (7). AlNoamany et al. (1) introduced how to use social media storytelling to summarize web archive collections. Klein et al. (8) have built collections from web archives by conducting focused crawls. Jones developed Hypercane (5) to intelligently sample mementos from larger collections. Jones also developed Raintale (4) for generating social media stories to summarize groups of mementos, providing visualizations that employ familiar techniques, like cards, that require no training for most users to understand.
The JSON data structure from Figure 1 provides all information gathered but is difficult for humans to understand at a glance. The graph shown in Figure 2 provides an overview of the JSON through favicons and edges, but a user requires some training to fully comprehend what it represents. Figure 3 displays the largest connected component from this graph visualized via the SHARI process. Through images, text snippets, titles, cards, domain names, favicons, and other content, the SHARI output allows the viewer to intuitively understand that the biggest news story for this date consists of different reactions to the growing COVID-19 pandemic.
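To make the notion of a "largest connected component" concrete, the following is a minimal sketch, not StoryGraph's actual implementation. It assumes the graph is available as a list of undirected edges between article identifiers; the function name `largest_component` and the sample edges are hypothetical.

```python
from collections import defaultdict, deque


def largest_component(edges):
    """Return the largest connected component of an undirected graph as a set."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    best = set()
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical edges: each pair stands for two articles judged to be similar.
edges = [
    ("vox/a", "cnn/b"), ("cnn/b", "fox/c"), ("fox/c", "hill/d"),
    ("nyt/e", "wapo/f"),
]
biggest_story = largest_component(edges)
```

In this toy graph the four linked articles form the biggest story, while the two-article cluster is a smaller, separate story.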
2 The SHARI process
The StoryGraph Hypercane ArchiveNow Raintale Integration (SHARI) process (6) automatically creates stories summarizing news for a day. Figure 4 details what each tool contributes to the story. Figure 5 shows the steps of the SHARI process.








1. With the StoryGraph Toolkit, we query the StoryGraph service for the URI-Rs belonging to the biggest story of the day.
2. Hypercane converts these URI-Rs to URI-Ms by first querying the LANL Memento Aggregator (https://timetravel.mementoweb.org) via the Memento Protocol (13) for a corresponding URI-M. For each URI-R that does not have a memento, Hypercane creates one by calling ArchiveNow (2) (Figure 6).
3. Hypercane runs the mementos through spaCy (https://spacy.io/) to generate a list of named entities, sorted by frequency (Figure 7).
4. Hypercane runs the mementos through sumgram (9) and generates a list of sumgrams, sorted by frequency (Figure 8).
5. Hypercane scores all of the mementos’ embedded images. Images that article authors reference in HTML META tags are favored first, followed by MementoEmbed (3) score, then pixel size, color count, the ratio of width to height, and finally position on the page (Figure 9).
6. Hypercane runs the mementos through newspaper3k (https://newspaper.readthedocs.io/en/latest/) to extract each article’s publication date and orders the URI-Ms by that date (Figure 10).
7. Hypercane consolidates the entities, terms, image scores, and ordered URI-Ms into a JSON file containing the structured data for the summary. During this step, Hypercane uses the highest-scoring image as the striking image for the summary (Figure 11). In Figure 4, the highest-ranking image shows the UK Prime Minister addressing his country about the COVID-19 pandemic.
8. Raintale renders the output as Jekyll HTML based on the contents of this JSON file, a template file, and information on each memento provided by MementoEmbed (Figure 11).
9. The SHARI script publishes the summary story to GitHub Pages for distribution. Figure 13 shows the output of our dsa_tweeter bot, which announces the story after publication through the @StormyArchives Twitter account.
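The logic of steps 2, 5, and 6 can be illustrated with a short sketch. This is not Hypercane's code: the function names, dictionary keys, and sample values are hypothetical, the archive lookup and submission are passed in as plain callables rather than real calls to the Memento Aggregator or ArchiveNow, and only the first four image criteria are modeled, under the assumption that a larger value is better for each.

```python
from datetime import datetime


def to_urims(urirs, find_memento, archive):
    """Step 2: reuse an existing memento when one is found, else create one."""
    urims = []
    for urir in urirs:
        urim = find_memento(urir)   # stand-in for a Memento aggregator lookup
        if urim is None:
            urim = archive(urir)    # stand-in for submitting the URI-R to an archive
        urims.append(urim)
    return urims


def rank_images(images):
    """Step 5: rank best-first, comparing criteria in the order the text lists them."""
    return sorted(
        images,
        key=lambda im: (
            im["in_meta"],               # referenced in an HTML META tag
            im["embed_score"],           # stand-in for the MementoEmbed score
            im["width"] * im["height"],  # pixel size
            im["colors"],                # color count
        ),
        reverse=True,
    )


def order_by_date(articles):
    """Step 6: order URI-Ms by each article's publication date."""
    return [a["urim"] for a in sorted(articles, key=lambda a: a["published"])]


# Hypothetical inputs for illustration only.
known = {"https://example.com/a": "https://archive.example/2020/https://example.com/a"}
urims = to_urims(
    ["https://example.com/a", "https://example.com/b"],
    find_memento=known.get,
    archive=lambda urir: "https://archive.example/new/" + urir,
)

images = [
    {"url": "big.png", "in_meta": False, "embed_score": 0.9,
     "width": 1200, "height": 800, "colors": 5000},
    {"url": "lead.jpg", "in_meta": True, "embed_score": 0.4,
     "width": 600, "height": 400, "colors": 3000},
    {"url": "lead2.jpg", "in_meta": True, "embed_score": 0.7,
     "width": 300, "height": 200, "colors": 100},
]
striking = rank_images(images)[0]["url"]

ordered = order_by_date([
    {"urim": "m2", "published": datetime(2020, 3, 23, 9, 0)},
    {"urim": "m1", "published": datetime(2020, 3, 22, 18, 30)},
])
```

Sorting on a tuple key gives the "favored first, followed by" behavior: a later criterion only matters when all earlier ones tie, so a small image referenced in a META tag still outranks a large one that is not.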
3 Discussion
StoryGraph is a valuable resource with additional unrealized potential. We can create stories not only for today or yesterday but for any date back to August 8, 2017, when Nwala launched StoryGraph. Figures 14, 15, and 16 show how the world has evolved each year on StoryGraph’s launch date. In Figure 14, the biggest news story was that of North Korea threatening other nations with nuclear weapons. One year later, in Figure 15, the biggest news story is the results of several United States Congressional and gubernatorial primaries. Two years after StoryGraph’s launch, Figure 16 shows that the biggest news story is the aftermath of the 2019 shootings in El Paso and Dayton.

URL: https://oduwsdl.github.io/dsa-puddles/stories/shari/2017/08/08/storygraph_biggest_story_2017-08-08/

URL: https://oduwsdl.github.io/dsa-puddles/stories/shari/2018/08/08/storygraph_biggest_story_2018-08-08/

URL: https://oduwsdl.github.io/dsa-puddles/stories/shari/2019/08/08/storygraph_biggest_story_2019-08-08/
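The story URLs listed here share a date-based pattern. The sketch below builds such a URL for a given date; the pattern is inferred only from the URLs shown in this paper, so whether a story actually exists at the resulting address for any other date is an assumption. The helper name `shari_story_url` is hypothetical.

```python
from datetime import date

BASE = "https://oduwsdl.github.io/dsa-puddles/stories/shari"


def shari_story_url(day):
    """Build a story URL following the pattern of the URLs listed in this paper."""
    return f"{BASE}/{day:%Y/%m/%d}/storygraph_biggest_story_{day.isoformat()}/"


# StoryGraph's launch date and its first two anniversaries.
launch = date(2017, 8, 8)
anniversaries = [
    shari_story_url(launch.replace(year=launch.year + n)) for n in range(3)
]
```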
4 Summary and Future Work
SHARI produces a familiar yet novel method of viewing news for a given day. SHARI can create stories for today, yesterday, and back to StoryGraph’s creation on August 8, 2017. It is different from other storytelling services like Wakelet (https://wakelet.com/) because SHARI is entirely automated. The stories produced by SHARI are different from services like Google News (https://news.google.com/) or Flipboard (https://flipboard.com/) because those tools focus on current events and personalized topics. Because StoryGraph samples content from multiple sides of the political spectrum, the SHARI process can provide a visualization of articles not tied to one interest area or even a single side’s terminology. This process works because each component is loosely coupled, has high cohesion, has explicit interfaces, and engages in information hiding. Each command passes data in the expected format to the next.
We are also exploring how to improve striking image selection for stories. One could also use SHARI to compare how the same story is told in different venues. For instance, one could ask StoryGraph to include only left-leaning sources and produce a SHARI story, and then do the same for only the right-leaning sources. With both stories, one could compare the striking images and sumgrams that SHARI produces. We are investigating how to produce and render other news stories for a given day and for any given period of time. Finally, we are examining how best to visualize significant events that span substantial periods of time, like the entire COVID-19 news story. Though StoryGraph is an existing service that gathers current news, we also want to apply its algorithm directly to mementos and tell the news stories of past events like the Hurricane Katrina disaster.
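As one possible way to carry out the comparison of sumgrams between two stories, the sketch below splits two term lists into shared and side-specific terms. It is an illustration only, not an existing SHARI feature; the function name `compare_terms` is hypothetical and the terms are placeholders rather than real output.

```python
def compare_terms(left_terms, right_terms):
    """Split two term lists into shared terms and terms unique to each side."""
    left, right = set(left_terms), set(right_terms)
    return {
        "shared": sorted(left & right),
        "left_only": sorted(left - right),
        "right_only": sorted(right - left),
    }


# Placeholder terms standing in for sumgrams from two hypothetical stories.
result = compare_terms(
    ["term a", "term b", "term c"],
    ["term b", "term c", "term d"],
)
```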
5 Acknowledgements
This work was supported in part by the Institute of Museum and Library Services (LG-71-15-0077-15).
References
- (1) AlNoamany, Y., Weigle, M. C., and Nelson, M. L. Generating Stories From Archived Collections. In WebSci 2017 (Troy, New York, USA, 2017), pp. 309–318. http://doi.org/10.1145/3091478.3091508.
- (2) Aturban, M., Kelly, M., Alam, S., Berlin, J. A., Nelson, M. L., and Weigle, M. C. ArchiveNow: Simplified, Extensible, Multi-Archive Preservation. In JCDL 2018 (Fort Worth, Texas, USA, 2018), pp. 321–322. https://doi.org/10.1145/3197026.3203880.
- (3) Jones, S. M. A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages. https://ws-dl.blogspot.com/2018/08/2018-08-01-preview-of-mementoembed.html, 2018.
- (4) Jones, S. M. Raintale – A Storytelling Tool For Web Archives. https://ws-dl.blogspot.com/2019/07/2019-07-11-raintale-storytelling-tool.html, 2019.
- (5) Jones, S. M. Hypercane Part 1: Intelligent Sampling of Web Archive Collections. https://ws-dl.blogspot.com/2020/06/2020-06-03-hypercane-part-1-intelligent.html, 2020.
- (6) Jones, S. M. SHARI: StoryGraph Hypercane ArchiveNow Raintale Integration – Combining WS-DL Tools For Current Events Storytelling. https://ws-dl.blogspot.com/2020/04/2020-04-01-shari-storygraph-hypercane.html, 2020.
- (7) Jones, S. M., Nwala, A., Weigle, M. C., and Nelson, M. L. The Many Shapes of Archive-It. In iPres 2018 (Boston, Massachusetts, USA, 2018), pp. 1–10. https://doi.org/10.17605/OSF.IO/EV42P.
- (8) Klein, M., Balakireva, L., and Van de Sompel, H. Focused crawl of web archives to build event collections. In WebSci 2018 (Amsterdam, Netherlands, 2018), pp. 333–342. https://doi.org/10.1145/3201064.3201085.
- (9) Nwala, A. C. Introducing sumgram, a tool for generating the most frequent conjoined ngrams. https://ws-dl.blogspot.com/2019/09/2019-09-09-introducing-sumgram-tool-for.html, 2019.
- (10) Nwala, A. C., Weigle, M. C., and Nelson, M. L. Bootstrapping Web Archive Collections from Social Media. In Hypertext 2018 (Baltimore, Maryland, USA, 2018), pp. 64–72. http://doi.org/10.1145/3209542.3209560.
- (11) Nwala, A. C., Weigle, M. C., and Nelson, M. L. Scraping SERPs for Archival Seeds: It Matters When You Start. In JCDL 2018 (Fort Worth, Texas, USA, 2018), pp. 263–272. http://doi.org/10.1145/3197026.3197056.
- (12) Nwala, A. C., Weigle, M. C., and Nelson, M. L. 365 Dots in 2019: Quantifying Attention of News Sources. Tech. Rep. arXiv:2003.09989, 2020. https://arxiv.org/abs/2003.09989.
- (13) Van de Sompel, H., Nelson, M., and Sanderson, R. RFC 7089 - HTTP Framework for Time-Based Access to Resource States – Memento. https://tools.ietf.org/html/rfc7089, Dec. 2013.
6 Appendix A: StoryGraph News Sources
News Source | Feed URL | US Political Polarity |
---|---|---|
Politicus USA | http://www.politicususa.com/feed | Left |
Vox | https://www.vox.com/rss/index.xml | Left |
Huffington Post | http://www.huffingtonpost.com/section/front-page/feed | Left |
MSNBC | http://www.msnbc.com/feeds/latest | Left |
New York Times | http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml | Left |
Washington Post | http://feeds.washingtonpost.com/rss/politics | Left |
CNN | http://rss.cnn.com/rss/cnn_topstories.rss | Center |
Politico | http://www.politico.com/rss/politics.xml | Center |
ABC News | http://abcnews.go.com/abcnews/topstories | Center |
The Hill | http://thehill.com/rss/syndicator/19109 | Center |
Real Clear Politics | http://feeds.feedburner.com/realclearpolitics/qlMj | Center |
Washington Examiner | http://www.washingtonexaminer.com/rss/news | Right |
Fox News | http://feeds.foxnews.com/foxnews/latest | Right |
Daily Caller | http://feeds.feedburner.com/dailycaller | Right |
Conservative Tribune | http://conservativetribune.com/feed/ | Right |
Breitbart | http://feeds.feedburner.com/breitbart | Right |
The Gateway Pundit | http://www.thegatewaypundit.com/feed/ | Right |