Introduction
It’s no secret that, for better or worse, much of the news we rely on gets aggregated through social media networks, and in the world of cyber security that mostly means Twitter. Companies large and small use it to publish press releases, security teams use it to announce their latest findings, and newly disclosed bugs and CVEs are often discovered there first. CISA (the US Cybersecurity and Infrastructure Security Agency) even announces new public advisories through its official Twitter account.
However, Twitter can be unreliable and ephemeral as platforms go. Important posts from minor accounts can easily go unnoticed, much to the dismay of anyone trying to get a complete picture of the security landscape. Don’t get me wrong, it’s far better than visiting 30+ different web pages every time you want to check the latest InfoSec releases, but Twitter isn’t without its fair share of problems. For starters, retaining data from posts that have been deleted is difficult, unlike with news websites that are easily captured by the Wayback Machine. Additionally, it’s useful to look beyond the surface-level data of any given post (content, poster, et cetera) and compare it with the overall trend (e.g. not just what one or two posts are saying, but what the whole community is saying about a topic). Twitter doesn’t natively provide this capability, and it really isn’t supposed to; it’s a social media site, not a data analysis tool. But what if we combined the ubiquity of Twitter and other news feeds with the data analysis tools we all know and love? That’s the purpose of this post and its associated code repository: https://github.com/benjaminhays/news2data/tree/main.
/home/bhays/news2data$ tree .
.
|-- LICENSE
|-- README.md
|-- docker-compose.yml
|-- feeds
| |-- news.py
| `-- twitter.py
|-- ingestion
| `-- elasticsearch.py
`-- main.py
2 directories, 7 files
Methodology
I chose Elasticsearch/Kibana as the data analysis tool for this project, mostly out of personal preference and prior experience with it. The code can easily be modified to support any other database or tool, and it was deliberately designed that way. I also explicitly chose Python as the implementation language, primarily because performance isn’t a major concern and I wanted to avoid writing more code than necessary and overcomplicating the whole ordeal; the program could be ported fairly easily to any language or library of choice. Lastly, I avoided the official Twitter API because of its monetary cost, in favor of RSS feed data from third-party instances (see: Nitter). This way I sidestep both the dubiously legal and ethically questionable practice of scraping and the stringent fees associated with the paid Twitter API. I specifically chose RSS for its reliability, and because it doesn’t put any undue load on Nitter’s (or Twitter’s) servers: my program behaves identically to an RSS feed reader, just with an alternative payload that uploads the “news” data to Elasticsearch.
Implementation
At runtime, the code first parses the associated .env file containing all the relevant configuration. Here’s the default .env file distributed with the repo:
ES_PORT=9200 # Change this
ES_HOST=127.0.0.1 # Change this
# Data feeds
TWITTER_SOURCES=CISACyber,NISTcyber,SonicWallAlerts,Unit42_Intel
RSS_SOURCES=https://www.bleepingcomputer.com/feed/,https://www.theregister.com/security/headlines.atom,https://thehackernews.com/feeds/posts/default,https://www.cisa.gov/cybersecurity-advisories/all.xml
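For reference, here’s a minimal sketch of how that configuration could be loaded. It uses python-dotenv and mirrors the variable names above, but the helper itself is illustrative rather than the repo’s exact code:

# config sketch -- illustrative only, not the repository's exact implementation
import os
from dotenv import load_dotenv  # pip install python-dotenv

def load_config(path: str = ".env") -> dict:
    """Read the .env file and split the comma-separated feed lists."""
    load_dotenv(path)
    return {
        "es_host": os.getenv("ES_HOST", "127.0.0.1"),
        "es_port": int(os.getenv("ES_PORT", "9200")),
        "twitter_sources": [s for s in os.getenv("TWITTER_SOURCES", "").split(",") if s],
        "rss_sources": [s for s in os.getenv("RSS_SOURCES", "").split(",") if s],
    }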
Then it downloads all of the feeds (from Twitter or elsewhere) and parses and normalizes them into one unified JSON-serialized file. This is done using the python-requests library, among others. Some minor adjustments had to be made to avoid bot detection, but once a proper User-Agent and browser session were established, no other complications arose.
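To give a rough idea of that fetch-and-normalize step, here’s a sketch using requests plus feedparser. The Nitter base URL, User-Agent string, and normalized field names are assumptions for the example, not necessarily what feeds/twitter.py does:

# feed-fetching sketch -- assumes feedparser and a public Nitter instance
# exposing per-account RSS; field names are illustrative.
import requests
import feedparser

HEADERS = {"User-Agent": "Mozilla/5.0 (news2data example)"}  # avoid naive bot blocks
NITTER_BASE = "https://nitter.net"  # hypothetical instance

def fetch_feed(url: str, session: requests.Session) -> list[dict]:
    """Download one RSS/Atom feed and normalize its entries to a common schema."""
    resp = session.get(url, headers=HEADERS, timeout=15)
    resp.raise_for_status()
    parsed = feedparser.parse(resp.content)
    return [
        {
            "title": entry.get("title", ""),
            "link": entry.get("link", ""),
            "published": entry.get("published", ""),
            "summary": entry.get("summary", ""),
            "source": url,
        }
        for entry in parsed.entries
    ]

def fetch_all(twitter_accounts: list[str], rss_urls: list[str]) -> list[dict]:
    """Twitter accounts become Nitter RSS URLs; everything else is fetched as-is."""
    session = requests.Session()
    urls = [f"{NITTER_BASE}/{acct}/rss" for acct in twitter_accounts] + rss_urls
    events = []
    for url in urls:
        events.extend(fetch_feed(url, session))
    return events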
This JSON file is then transformed to meet the requirements of the final destination (Elasticsearch, by default). This includes removing duplicates, dealing with destination-side AAA, and adding metadata fields like _id and date. This way, events that came from, say, https://www.bleepingcomputer.com/ or CISA’s Twitter look identical to the data analyzer (other than a small source field, of course).
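A sketch of what that transform might look like follows. The hashing scheme and field names here are my own choices for illustration; the repo’s actual transform may differ:

# transform sketch -- deduplication via a content-derived _id, plus the
# date metadata described above; illustrative only.
import hashlib
from datetime import datetime, timezone

def to_es_actions(events: list[dict], index: str = "news") -> list[dict]:
    """Turn normalized feed entries into Elasticsearch bulk actions,
    using a hash of the link as _id so duplicate links collapse into one document."""
    actions = []
    for event in events:
        doc_id = hashlib.sha256(event["link"].encode("utf-8")).hexdigest()
        actions.append(
            {
                "_index": index,
                "_id": doc_id,  # re-runs overwrite rather than duplicate
                "_source": {
                    **event,
                    "date": event.get("published")
                            or datetime.now(timezone.utc).isoformat(),
                },
            }
        )
    return actions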
Finally, the transformed data is sent over the wire to the target destination, and all is well. At that point the script exits, having accomplished its goals.
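That last hop can be done with the official elasticsearch Python client’s bulk helper; a minimal sketch, with connection details as placeholders rather than the repo’s exact ingestion code:

# ingestion sketch -- sends the prepared actions to Elasticsearch in one bulk call.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

def upload(actions: list[dict], host: str, port: int) -> None:
    """Open a client against the configured host/port and bulk-index the actions."""
    es = Elasticsearch(f"http://{host}:{port}")  # add auth/TLS here for a real deployment
    success, errors = bulk(es, actions, raise_on_error=False)
    print(f"Indexed {success} documents, {len(errors)} errors")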
As an additional side note, I’ve written and included a docker-compose.yml file in the code repository to assist with development and testing. A simple docker-compose up -d should do the trick, and the default .env variables reflect this instance’s configuration. It lacks secure authentication and reasonable security defaults for the sake of developmental ease, however, so don’t even think about running it in production or exposing the ports beyond loopback. That said, this code was never really designed to be run in production anyway; it’s just a fun side project.
Conclusion
A lot was learned over the course of this project, particularly regarding Python’s scriptable interface to Elasticsearch. That was by far the most intriguing part, although the data collected was also quite fun to explore. I think a similar tool, with far more polish in its execution, would be a real service to anyone with a vested interest in analyzing social media content. Cybercrime investigators, OSINT specialists, and large corporations whose digital reputation can mean millions of dollars in stock price losses would all benefit from paying more attention to social media data. Additionally, future effort could focus on analyzing the collected data for a specific purpose or use case: sentiment analysis, AI-content detection, and other forms of NLP seem like interesting avenues of research on the data this tool collects. That said, it has certainly been done before, and likely far more professionally than my quickly attempted implementations.