Embracing the USA PATRIOT Act & FinCEN
November 15, 2023 (re-published on July 17, 2024)
Generative AI has shown great potential in the cyber security industry, but systems such as OpenAI’s ChatGPT require us to send potentially sensitive information to a 3rd party's system. I have been searching for a way to avoid this and have come across a tool you may be interested in: GPT4All, a locally installed GPT client. In this discussion, I will delve into my experiences with this tool, highlighting its capabilities and the challenges I've encountered.
GPT4All is a product from nomic.ai [1] and is one of the fastest growing projects in GitHub history [2]. It is a large language model (LLM) AI which runs locally on commodity hardware; in my case, a MacBook Pro M1 with 16GB of memory. GPT4All allows us to ask questions, summarize documents, request content creation, and write code, while all the work happens on our local system and no content is submitted to 3rd parties’ servers and systems. As computer security professionals, we can imagine why we would want to keep some information private while leveraging the power of Gen AI.
Let’s look at a concrete example. I will show how you can download the vast majority of threat intelligence reporting (open source) and we will then use GPT4All to run some queries across that data locally.
To begin, we need to obtain the reporting, which is accessible through the APT Notes GitHub repository [3]. From there we are pointed to a Python script [4] that can pull the reporting for us. I’ll leave running and tweaking the Python as an exercise for the reader.
Once we have downloaded and run the Python script, we need somewhere to store the reporting. I stored mine in a local directory, where the script sorted the reporting by year as shown below (Figure 1 & Figure 2):
Figure 1
Figure 2
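For readers who want a starting point before tweaking the real script, here is a minimal sketch of the "download and sort by year" idea. It is not the script from the APT Notes repository. The record fields ("Filename", "Link", "Year"), the example.com URLs, and the helper names are all assumptions made for illustration, so check them against the actual index before relying on them.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def plan_downloads(records, root):
    """Map each index record to (url, destination path), one folder per year.

    The "Filename", "Link" and "Year" keys are assumed field names.
    """
    plan = []
    for rec in records:
        year = str(rec.get("Year", "unknown"))
        # Fall back to the last URL path segment if no filename is given.
        name = rec.get("Filename") or Path(urlparse(rec["Link"]).path).name
        if not name.lower().endswith(".pdf"):
            name += ".pdf"
        plan.append((rec["Link"], Path(root) / year / name))
    return plan


def fetch(plan):
    """Download anything not already on disk (needs network access)."""
    for url, dest in plan:
        dest.parent.mkdir(parents=True, exist_ok=True)
        if not dest.exists():
            urlretrieve(url, str(dest))


# Hypothetical sample records; a real run would load them from the index file.
sample = [
    {"Filename": "report-a", "Link": "https://example.com/a.pdf", "Year": "2014"},
    {"Filename": "report-b.pdf", "Link": "https://example.com/b.pdf", "Year": "2023"},
]
plan = plan_downloads(sample, "ThreatIntel")
for url, dest in plan:
    print(url, "->", dest)
# fetch(plan) would perform the downloads; it is left uncalled here.
```

Separating the planning step from the download step makes it easy to review where every file will land before any network traffic happens.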
Now we have a nice corpus of threat reporting available to us locally, and we can run queries without tipping our hand in any way to an adversary. This tactic could also be used by people who need to travel to secure or sensitive locations, or where network availability is limited. As of early October 2023, we would have access to approximately 670 reports (Figure 3).
Figure 3
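If you want to confirm the size of your own corpus, a short script can count the reports per year folder. This is a sketch that assumes the layout described above (one subdirectory per year, reports stored as .pdf files). The demo runs against a throwaway directory, so the numbers it prints are not the real corpus counts.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_reports(root):
    """Count PDF files under root, keyed by the name of their parent folder."""
    counts = Counter()
    for pdf in Path(root).rglob("*.pdf"):
        counts[pdf.parent.name] += 1
    return counts


# Demo on a temporary directory with made-up, empty files.
with tempfile.TemporaryDirectory() as tmp:
    for year, n in (("2014", 2), ("2023", 1)):
        folder = Path(tmp) / year
        folder.mkdir()
        for i in range(n):
            (folder / f"report{i}.pdf").touch()
    demo_counts = count_reports(tmp)

for year in sorted(demo_counts):
    print(year, demo_counts[year])
print("total", sum(demo_counts.values()))
```

To use it on a real corpus, call `count_reports` with the path to your own reporting directory.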
Let’s see how we can then use this data. First, we must point GPT4All to our local repository. We will create a link inside GPT4All to the repository directory called “ThreatIntel”. (Figure 4)
Figure 4
Once we have set up our LocalDocs plugin, we can use the engine to scan our collected threat intelligence first. Since we have the “Show reference” box checked, our answers will refer to the source file.
One last item we should examine is what language model we use.
I chose to use Falcon as it’s licensed for commercial use and seems to be a reasonable compromise on speed and resource usage. (Figure 5)
Now that we have set up the system, let’s try out some tests. First, we need to make sure to select the LocalDocs collection shown below (Figure 6). It is important to check this collection so the LLM will search our local documents first.
As we query, the results will look like what we see below, but for readability and conciseness, I have copied out the transcript and pasted that below (Figure 7). Note the references listed at the bottom of the response. This is how we can see what local documents the response was pulled from.
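To make the idea of "search local documents first, then cite them" more concrete, here is a deliberately naive sketch. It is not how GPT4All's LocalDocs works internally, which almost certainly differs. It only illustrates the general pattern of picking the most relevant local files, placing them in the prompt, and keeping their names as references. The file names and text are hypothetical placeholders.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_sources(query, documents, k=2):
    """Rank documents by how many occurrences of query words they contain."""
    terms = set(tokenize(query))
    scored = []
    for name, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[t] for t in terms)
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties broken by file name for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:k]]


def build_prompt(query, documents, k=2):
    """Return a context-augmented prompt plus the list of cited sources."""
    sources = top_sources(query, documents, k)
    context = "\n".join(f"[{name}] {documents[name]}" for name in sources)
    return f"Context:\n{context}\n\nQuestion: {query}", sources


# Hypothetical placeholder documents, not real report contents.
docs = {
    "2014/report-a.txt": "The actor used spear phishing emails with malicious attachments.",
    "2019/report-b.txt": "This report covers ransomware and backup strategy.",
    "2023/report-c.txt": "Phishing was observed, followed by a downloader.",
}
prompt, sources = build_prompt("What phishing emails were used?", docs)
print(prompt)
print("References:", sources)
```

Real retrieval systems typically use more robust matching than raw word counts, but the shape of the result is the same: an answer grounded in a few local files, with those files listed as references.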
Let’s see how GPT4All does on some queries against the data we have. Note that the prompt/response is copied out of the UI and pasted here for ease of reading.
Prompt: Tell me about Iran based threat actors.
Response: Iranian based threat actors are known for their cyber espionage activities, which include stealing confidential data and conducting surveillance on political figures and organizations. They also use social engineering techniques to trick individuals into revealing sensitive information. Some of the most notorious Iranian threat actors include APT34 (also known as OilRig), APT35 (also known as Charming Kitten), and APT36 (also known as Barium). These groups have been linked to cyber-attacks on various organizations, including political parties, media outlets, and human rights activists.
Prompt: Tell me about NSO.
Response: NSO (Nucleon Software) is a cyber-security company based in Herzliya, Israel. It specializes in developing mobile surveillance software for law enforcement agencies and government organizations. NSO's products include Pegasus, which allows clients to remotely access the data stored on an individual's phone, including text messages, emails, photos, and location data. The company has faced criticism for its products being used by authoritarian regimes to target political dissidents and human rights activists. In 2019, it was reported that NSO's products were being used by Saudi Arabia to monitor the communications of activists and journalists in the country.
Prompt: What tactics does the group OilRig use and how can I defend against it?
Response: The group OilRig uses a variety of tactics to gain access to systems and steal sensitive data. Some of the most common techniques include spear-phishing emails, which are crafted to appear legitimate but contain malicious attachments or links that download malware onto the victim's machine. Another tactic used by OilRig is exploiting vulnerabilities in software applications, such as Adobe Flash Player or Microsoft Office, to gain access to systems.
To defend against these tactics, organizations should implement a multi-layered defense strategy that includes regular security updates for all software and hardware, implementing firewalls and intrusion detection/prevention systems, using strong passwords and encryption, and training employees on how to recognize and report suspicious emails or attachments. It is also recommended to stay up-to-date with the latest threat intelligence and actively monitor systems for signs of compromise.
Prompt: Wwhat malware does oilrig use after it spear-phishes a target?
Response: The OilRig group uses a variety of malware, including Smoaler, which is a downloader used to install other malicious software on the victim's machine. Smoaler can be delivered through spear-phishing emails or by exploiting vulnerabilities in software applications. Once installed, Smoaler can download and execute additional malware on the victim's machine.
It is worth noting that Smoaler is not malware but rather an intermediate infector used to deliver other malicious software. The OilRig group has been known to use a variety of techniques to evade detection and analysis, including obfuscation and anti-analysis techniques.
Prompt: Describe the Smoaler malware's capabilities.
Response: Smoaler is a downloader used by the OilRig group to install other malicious software on the victim's machine. It can be delivered through spear-phishing emails or by exploiting vulnerabilities in software applications. Once installed, Smoaler can download and execute additional malware on the victim's machine.
Smoaler has several capabilities that make it a powerful tool for cybercrime groups:
1. Ability to evade detection: Smoaler is designed to evade detection by security products such as firewalls and intrusion detection/prevention systems. It can also hide its malicious activity from the victim's machine, making it difficult to detect and remove.
2. Ability to download additional malware: Once Smoaler has been installed on a victim's machine, it can download and execute other malicious software without requiring any further action from the user. This makes it an ideal tool for cybercrime groups looking to install more advanced forms of malware on their victims' machines.
3. Ability to steal sensitive data: Smoaler can be used to steal sensitive data from a victim's machine, including login credentials, financial information, and other confidential data. This makes it an attractive target for cybercriminals who want to use the stolen data for financial gain or other malicious purposes.
Overall, Smoaler is a powerful tool that cybercrime groups can use to evade detection, download additional malware, and steal sensitive data from their victims' machines.
Prompt: What are the victims of the oilrig threat group?
Response: The OilRig threat group targets individuals and organizations in various industries, including finance, government, military, and media. The group's primary goal is to gather sensitive information from these targets for financial gain or other malicious purposes.
Victims of the OilRig threat group include individuals who are involved in political activism, human rights activists, journalists, and opposition figures in various countries around the world. The group has also been known to target military personnel and government officials in some countries.
The OilRig threat group uses a variety of techniques to gather information from its targets, including spear-phishing emails, social engineering tactics, and malware infections. Once it gains access to a victim's machine, the group can download additional malware or steal sensitive data for financial gain or other malicious purposes.
The results seem to be a reasonable summary of some of the data inside the files, which suggests the tool can be used to analyze and summarize large quantities of information. The length of responses and the number of files queried can be set inside the options, but using the defaults is strongly suggested, which is what I have done for this analysis.
There were, however, some challenges in running this analysis. The main issue is that I have had inconsistent success getting GPT4All to search my local directories first; I suspect this may be related to macOS security controls on access to my Documents folder. Sometimes it takes shutting down the chat and restarting before the local directories are searched first. Additionally, you can tell the system is running on less powerful hardware than ChatGPT, as some of the responses take a bit longer and have a limited length. There are settings to adjust the performance, which I have used only sparingly at this point. (Figure 8)
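One way to rule out ordinary file permission problems is to check whether every file and folder in the corpus is readable. The sketch below does that from Python. Note the limitation: it tests what the Python process can read, so if macOS applies its folder privacy controls per application, a clean result here says nothing definitive about GPT4All itself. The helper name is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def unreadable_entries(root):
    """Return paths under root that the current process cannot read."""
    root = Path(root)
    # A directory needs both read and execute (search) permission to be listed.
    if not os.access(root, os.R_OK | os.X_OK):
        return [root]
    problems = []
    for path in root.rglob("*"):
        mode = os.R_OK | os.X_OK if path.is_dir() else os.R_OK
        if not os.access(path, mode):
            problems.append(path)
    return problems


# Demo on a temporary directory containing one readable placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    year_dir = Path(tmp) / "2023"
    year_dir.mkdir()
    (year_dir / "report.pdf").write_bytes(b"placeholder")
    demo_problems = unreadable_entries(tmp)

print("unreadable:", demo_problems)
```

An empty list means basic file permissions are not the cause, which narrows the search to application-level settings.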
The ability to use a local AI to search across local data without any data spillage to a 3rd party is an attractive feature of the GPT4All tool. While it does suffer from a few bugs related to consistently hooking our local data directory, it is already useful for searching and generating responses across a moderately large corpus of information. For specialty applications, this could be a nice budget entry to an isolated LLM for small and medium businesses with security concerns about their data. I will continue to test this tool for a cyber threat intelligence use case. I would love to hear in the comments if others find this useful as well.