Tracking YouTube videos across the web - First Draft

This recipe is the second in our Digital Investigations Recipe Series, a collaboration between the Public Data Lab, Digital Methods Initiative, Open Intelligence Lab and First Draft. It is designed to lift the lid on advanced social media analysis.

Introduction

This recipe details how you can track content banned from YouTube as it moves to other spaces online. During the Covid-19 pandemic, conspiratorial documentaries such as “Plandemic” and videos like “America’s Frontline Doctors” gathered millions of views before being taken down. The attention such material attracts, along with a plethora of other misinformation, makes it necessary to trace and identify misinformation outside the curated spaces of popular platforms.

Even if these videos or links no longer exist, the online spaces you identify using this recipe are likely to share other kinds of misinformation. They may also host repackaged versions of banned content in different video formats. A better understanding of the web ecology of deplatformed videos will not only strengthen the monitoring of this content, it will also enrich your online investigations by letting you track the spread of URLs hosting misinformation across the web.

In this recipe, we will surface websites, pages and platforms where “Plandemic” and other Judy Mikovits-related videos have appeared. It’s important to note that while we will be focusing on content that has largely been banned across the major platforms, you can use this same recipe to track the spread of live URLs across the web. This is very useful for journalists trying to judge whether a specific video or article has reached the tipping point and where it has spread online.

Examples

1. Rogers, R. (2020) “Deplatforming: Following extreme Internet celebrities to Telegram and alternative social media,” European Journal of Communication. SAGE Publications Ltd, p. 0267323120922066. doi: 10.1177/0267323120922066. Video format.

Ingredients

1. 4CAT’s YouTube URL metadata. 4CAT is a dashboard-like tool that scrapes data from Reddit, 4chan, 8chan, 8kun, Breitbart, Instagram, Telegram and Tumblr. It hosts a suite of natural language processing and other statistical tools designed to facilitate the study of social media content and misinformation. The tool is free and there are detailed instructions on how you can install 4CAT on your own computer or server.

2. DMI’s Google Search Engine Scraper.

3. A spreadsheet editor (e.g., Numbers, Excel, LibreOffice Calc or Google Sheets). 

Recipe

1. On 4CAT, choose a dataset and query. Specify a date range if needed. 

2. Open your dataset (4CAT will notify you of its completion in the top-right corner). Go to YouTube URL metadata and adjust the subsequent parameters if needed. If not, leave them as they are.

3. In your resulting CSV, you may find that some of the videos you obtained were deleted or “deplatformed” from YouTube. You can find these videos in the column deleted_or_failed. Filter your spreadsheet so only videos with TRUE appear in the column. These are all the videos that no longer exist on YouTube. 

4. Create a new column (you can give it whatever name you want), copy the links from the column referenced_urls into it and add a comma to the end of each URL.

5. Copy the video URLs and paste them into the Google Search Engine Scraper. The Google Search Engine Scraper only works in Firefox, so make sure you have Firefox installed as well as the DMI Firefox ToolBar extension.

6. After running the Google Search Engine Scraper, click on Output at the top of the page and select Text. This will open a new tab with your data. Click on Save Page As and give it a name. 

7. Open a new Google Sheet, click on Import and select the text file you just saved. Its name should end in .txt. Google will show you an Import file popup. Click on Import data.

8. The text data has now been converted into a CSV-readable format that shows the top domains on which these videos were either referenced or accessible, and the dates on which they appeared on those domains. You may, for example, find links to BitChute and other fringe media or even more mainstream platforms, such as Facebook and Instagram, where these videos, or versions of these videos, may still be hosted.

9. In a new spreadsheet, create two columns: (A) the first domain in which your videos have appeared (YouTube in this case); (B) the second domain in which said videos have appeared, taken from the column article url. To make the domains easily readable, create a new column and extract the root domain from the longer URL using the following Google Sheets formula:

=TRIM(REGEXEXTRACT(REGEXREPLACE(REGEXREPLACE(C2,"https?://",""),"^(w{3}\.)?","")&"/","([^/?]+)"))

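C2 in this case refers to the first data cell of the article url column; adjust the reference if that column sits elsewhere in your sheet. Make sure the formula uses straight quotation marks, as curly quotes will cause an error.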

10. Paste your results into RAWGraphs, select the alluvial diagram and place your source and root domain columns in the Steps entry.
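If your dataset is large, steps 3 and 4 can also be scripted rather than done by hand in a spreadsheet. The following is a minimal sketch in Python, not part of the original recipe. It relies on the column names deleted_or_failed and referenced_urls mentioned above; the function name, the sample rows and the video IDs are hypothetical, and it assumes that a cell in referenced_urls may hold several comma-separated URLs.

```python
import csv
import io


def deleted_video_urls(csv_text,
                       flag_column="deleted_or_failed",
                       url_column="referenced_urls"):
    """Return the URLs of rows flagged TRUE, without duplicates, in file order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    urls = []
    for row in reader:
        # Step 3: keep only rows where the flag column reads TRUE.
        flag = (row.get(flag_column) or "").strip().upper()
        if flag != "TRUE":
            continue
        # Assumption: a cell may contain more than one URL, comma-separated.
        for url in (row.get(url_column) or "").split(","):
            url = url.strip()
            if url and url not in urls:
                urls.append(url)
    return urls


# Hypothetical sample standing in for a 4CAT export.
sample = """id,referenced_urls,deleted_or_failed
1,https://www.youtube.com/watch?v=aaa,TRUE
2,https://www.youtube.com/watch?v=bbb,FALSE
3,https://www.youtube.com/watch?v=ccc,true
"""

urls = deleted_video_urls(sample)

# Step 4: add a comma to the end of each URL, one URL per line.
query_block = "\n".join(url + "," for url in urls)
print(query_block)
```

To run this on a real export, read the file's contents into a string first (for example with open(path, encoding="utf-8").read()) and pass that string to the function. The resulting block of text can then be pasted into the scraper in step 5.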

 

You can start to see not only the top platforms through which these deplatformed videos were shared and circulated, such as YouTube, Facebook and Twitter, but also some of the more fringe platforms and websites, such as BitChute, qanon.news, lockdownsceptics.org, thedonald.win or republicanbriefs.org, where these URLs found a home and which are likely home to other similar content.
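The domain extraction in step 9 and the source-to-domain pairs that feed the alluvial diagram can also be sketched in Python. This is an illustrative equivalent of the Google Sheets formula, not the authors' own tooling: it strips the scheme, drops a leading "www." and keeps everything before the first slash or question mark. The function names and the example URLs are hypothetical. Like the formula, it keeps subdomains (m.facebook.com stays as it is), so it returns a cleaned hostname rather than a strict registered domain.

```python
import re
from collections import Counter


def root_domain(url):
    """Mirror the spreadsheet formula from step 9."""
    stripped = re.sub(r"https?://", "", url.strip())   # drop http:// or https://
    stripped = re.sub(r"^(w{3}\.)?", "", stripped)      # drop a leading "www."
    match = re.search(r"[^/?]+", stripped + "/")        # text before first / or ?
    return match.group(0) if match else ""


def domain_flows(article_urls, source="youtube.com"):
    """Count (source, destination domain) pairs for an alluvial diagram."""
    return Counter((source, root_domain(url)) for url in article_urls)


# Hypothetical article URLs standing in for the scraper output.
article_urls = [
    "https://www.bitchute.com/video/abc/",
    "https://www.bitchute.com/video/def/",
    "http://m.facebook.com/story.php?id=1",
]

flows = domain_flows(article_urls)

# Print rows in a source,target,count layout that can be pasted into a sheet.
print("source,root_domain,count")
for (source, target), count in flows.most_common():
    print(f"{source},{target},{count}")
```

Counting the pairs before charting also gives you a quick ranking of which domains most often host or reference the videos you started from.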

Remember, these recipes are not static. There are many ways to use parts of this recipe for different purposes, whether for monitoring, research or investigations. For example, instead of pasting deplatformed YouTube URLs into the Google Search Engine Scraper in step 5, you can add BitChute or Banned.video URLs that you have identified to track the spread of particularly misleading or problematic videos across the web. You may also add YouTube URLs that have not been banned yet. Doing this may lead you to particular Facebook pages or websites that are driving the dissemination of this content.

Emillie de Keulenaar is a PhD researcher at the University of Amsterdam’s Open Intelligence Lab and Simon Fraser University’s Digital Democracies group. She has previously researched with the UN’s Innovation Cell, the Dutch digital humanities cluster CLARIAH, the European Time Machine consortium and the Clingendael Institute. Her interests lie in the role of deep disagreements in producing misinformation, as well as in the history of moderating online hate speech and other problematic information.