VISaR: A Code Scanning Solution for Data Platform Engineers
VISaR (Vulnerability Identification, Scanning and Reporting) is an open-source software application designed for data platform engineers. Keep your data platforms safe and protect your data.
Read time: 7 minutes
Summary:
What is VISaR?
VISaR (Vulnerability Identification, Scanning and Reporting) is an open-source tool designed to automatically scan code repositories for vulnerabilities. VISaR generates detailed CSV reports summarizing potential issues so engineers and developers can make informed decisions regarding code security.
Who is VISaR for?
VISaR is designed for data engineers and software developers. It allows the user to scan code repositories and generate vulnerability reports with only basic Python knowledge.
How to try VISaR?
Check out VISaR today on our GitHub.
1. Introduction
The recent Open Source Security and Risk Analysis (OSSRA) report published by Black Duck found that 86% of codebases had open-source software vulnerabilities and an alarming 81% had high- or critical-risk vulnerabilities. A major contributor to these statistics was the number of codebases containing components more than four years out of date, which increases security risk. In the current landscape of increasingly complex data platforms, production environments and open-source tech stacks, it is more critical than ever to evaluate software vulnerabilities before integration. We also need to reassess our security posture periodically, because the packages we rely on become outdated and versions change for compatibility reasons.
To address this problem, we have developed a Python-based solution, VISaR (Vulnerability Identification, Scanning and Reporting), to scan code repositories for vulnerabilities and generate detailed reports summarizing potential issues. This solution leverages best-in-class open-source components: the OSSF Scorecard for vulnerability identification and the OSV Database for vulnerability information, with the results compiled into the output CSV file. This CSV includes key details such as vulnerability IDs, descriptions and severity levels. This output empowers both software developers and data engineers to make informed decisions about the software they integrate with their systems.
Whether you’re securing production code, managing an open-source data platform, or evaluating AI-generated code, VISaR is designed to be accessible, modular and practical. In this article we will explore the motivation behind this project, the technical details of how it works, and how you can start using VISaR today to evaluate your own data platforms and systems.
2. Use Cases
In today’s fast-paced work environments, data teams are often trying to do more with less. There is more risk than ever before for security issues to make it into production systems. We still need to ensure the solutions we deliver are secure. Whether you’re deploying code to production or experimenting with open-source software projects on your own PC, you (likely) want to ensure you’re not exposing yourself to unnecessary risk. Here are three main users for VISaR:
Data Platform Engineers:
Quickly evaluate software before integrating it with your data platform.
As a senior data engineer, I was responsible for maintaining our internal data platform and proactively safeguarding sensitive data. Before approving an open-source tool for use, I would scan the code and review vulnerabilities so I could make an informed decision about approval for the target system.
Software Engineers:
Assess your own code for vulnerabilities before it reaches production systems.
As a lead developer, I was responsible for maintaining the integrity and security of our production environments (both on-prem and in Microsoft Azure). By scanning code contributed from different teams using this tool, I could catch vulnerabilities before they made it to production or were approved for operational use.
Independent Developers:
Verify code generated by AI assistants or community contributions.
With the astronomical rise of generative-AI tools and coding assistants, more people are producing code than ever before. This pipeline enables anyone (whether a hobbyist or independent developer) to assess their own code for vulnerabilities. Whether you’re evaluating AI-generated scripts or your own code before pushing to an open-source project, this tool helps you reduce the risk your code carries.
If any of these use cases sound relevant, keep reading to find out how it all works.
3. Technical Details
VISaR is designed to be accessible, requiring only basic Python knowledge to run it. We have chosen open-source components and made sure it runs on standard hardware (no GPU required) to minimize barriers to entry.
3.1 Technical Overview
We’re not going to go into all of the details here, but in this section we will give a high-level overview of how VISaR works.
The user provides the URL of a GitHub code repository to VISaR, which then automatically performs a code scan to identify vulnerabilities, sends HTTP requests to obtain additional vulnerability information, and writes the key information to a CSV file.
When VISaR is run, a six-stage process takes the input URL to the output CSV file:
1. OSSF Scorecard scans the repository and generates a summary file.
2. A second OSSF Scorecard scan generates a file of known vulnerabilities (saved temporarily).
3. A list of vulnerability IDs is harvested from the temporary data file.
4. Vulnerability IDs are sent to the OSV API with a request for the vulnerability severity and description.
5. Key vulnerability information is extracted from the JSON payload.
6. The vulnerability IDs, severity, and plain-text summary are compiled into a structured CSV file.
The numbered steps correspond to the numbers on the architecture diagram shown in Figure 1. This modular design ensures that the pipeline can easily integrate alternative tools or output file formats, making it versatile and future-proof.
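To make the later stages concrete, here is a minimal Python sketch of harvesting IDs, looking them up, and writing the CSV. It is an illustration under stated assumptions, not VISaR's actual code: the function names, the ID pattern, the CSV column names, and the shape of the scan text are all hypothetical, and the lookup is passed in as a function so the sketch runs without network access.

```python
import csv
import io
import re
from typing import Callable, TextIO

# Assumed pattern for a few common ID families (GHSA, CVE, PYSEC); hypothetical.
VULN_ID_PATTERN = re.compile(
    r"\b(GHSA(?:-[0-9a-z]{4}){3}|CVE-\d{4}-\d{4,}|PYSEC-\d{4}-\d+)\b"
)


def harvest_ids(scan_text: str) -> list[str]:
    """Collect unique vulnerability IDs from scan output, in first-seen order."""
    seen: dict[str, None] = {}
    for vuln_id in VULN_ID_PATTERN.findall(scan_text):
        seen.setdefault(vuln_id, None)
    return list(seen)


def build_rows(ids: list[str], lookup: Callable[[str], dict]) -> list[dict]:
    """Look up each ID and keep only the fields the report needs."""
    rows = []
    for vuln_id in ids:
        record = lookup(vuln_id)
        rows.append(
            {
                "id": vuln_id,
                "severity": record.get("severity", "UNKNOWN"),
                "summary": record.get("summary", ""),
            }
        )
    return rows


def write_report(rows: list[dict], handle: TextIO) -> None:
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=["id", "severity", "summary"])
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Made-up scan text and a stub lookup, for demonstration only.
    demo_text = "Warn: Project is vulnerable to: GHSA-aaaa-bbbb-cccc"
    demo_rows = build_rows(
        harvest_ids(demo_text),
        lambda _id: {"severity": "HIGH", "summary": "placeholder"},
    )
    buffer = io.StringIO()
    write_report(demo_rows, buffer)
    print(buffer.getvalue())
```

Passing the lookup in as a parameter keeps each stage testable on its own, which fits the modular design described above.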
An example output is shown in Figure 2. This vulnerability report would be analyzed by your security officer and technical staff (software and data engineers) to decide whether the vulnerabilities fall within the risk tolerance or whether additional measures need to be taken.
3.2 Design Choices
This pipeline has been intentionally designed with accessibility and modularity in mind. Here we touch on our design choices.
Language and Dependencies:
The entire codebase is written in Python 3.12 and all functionality relies on widely used open-source tools. Running VISaR requires Python, Docker Desktop, a GitHub account (to create an authentication token), and internet connectivity.
Core Tools:
The OSSF Scorecard (developed by the Open Source Security Foundation) scans repositories to identify vulnerabilities. The Open Source Vulnerability Database (OSV, developed by Google) provides detailed information for each vulnerability. These are best-in-class components for assessing software security.
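For readers curious about what an OSV lookup involves, the sketch below uses only the standard library. The route `https://api.osv.dev/v1/vulns/{id}` is OSV's public lookup endpoint; the function names and the choice of fields are ours, not VISaR's. Where the severity sits in the payload varies by data source, so the fallback order here is an assumption.

```python
import json
import urllib.request
from urllib.parse import quote

OSV_VULN_URL = "https://api.osv.dev/v1/vulns/{vuln_id}"


def osv_url(vuln_id: str) -> str:
    """Build the lookup URL for a single vulnerability ID."""
    return OSV_VULN_URL.format(vuln_id=quote(vuln_id, safe=""))


def fetch_vulnerability(vuln_id: str, timeout: float = 10.0) -> dict:
    """Fetch one OSV record as a dict (requires internet access)."""
    with urllib.request.urlopen(osv_url(vuln_id), timeout=timeout) as response:
        return json.load(response)


def extract_fields(payload: dict) -> dict:
    """Reduce an OSV record to ID, severity, and a plain-text summary.

    Assumed fallback order: a severity label under database_specific,
    then the first entry of the top-level severity list, then UNKNOWN.
    """
    label = (payload.get("database_specific") or {}).get("severity")
    if not label:
        scores = payload.get("severity") or []
        label = scores[0].get("score", "UNKNOWN") if scores else "UNKNOWN"
    return {
        "id": payload.get("id", ""),
        "severity": label,
        "summary": payload.get("summary") or payload.get("details", ""),
    }
```

A call such as `extract_fields(fetch_vulnerability(some_id))` would then produce one row's worth of data for the report.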
Output:
Results are compiled into a CSV file, which is easily viewable in tools like Excel or Google Sheets or can be ingested into tools like Python for further analysis.
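As an illustration of the Python route, the report can be read with the built-in csv module and summarized by severity. The column names (`id`, `severity`, `summary`) and the sample rows are assumptions made for this sketch; check them against the header of your own report.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_by_severity(handle: TextIO) -> Counter:
    """Count report rows per severity label; blank cells become UNKNOWN."""
    reader = csv.DictReader(handle)
    return Counter(
        ((row.get("severity") or "").strip() or "UNKNOWN").upper()
        for row in reader
    )


# Made-up rows standing in for a real report file.
SAMPLE_REPORT = (
    "id,severity,summary\n"
    "DEMO-0001,high,example\n"
    "DEMO-0002,,example\n"
    "DEMO-0003,HIGH,example\n"
)

counts = count_by_severity(io.StringIO(SAMPLE_REPORT))
print(counts.most_common())  # [('HIGH', 2), ('UNKNOWN', 1)]
```

For a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `count_by_severity`.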
Modularity:
The pipeline’s design allows users to swap out components (e.g. alternative scanning tools or vulnerability databases) with minimal effort. This ensures compatibility even if the preferred tools within the software security community change.
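One common way to get this kind of swappability in Python is to code against small interfaces. The sketch below uses `typing.Protocol`; the class and method names are hypothetical and are not taken from the VISaR codebase.

```python
from typing import Protocol


class Scanner(Protocol):
    """Anything that can turn a repository URL into vulnerability IDs."""

    def scan(self, repo_url: str) -> list[str]: ...


class VulnerabilityDatabase(Protocol):
    """Anything that can describe a vulnerability ID."""

    def lookup(self, vuln_id: str) -> dict[str, str]: ...


def run_pipeline(
    repo_url: str, scanner: Scanner, database: VulnerabilityDatabase
) -> list[dict[str, str]]:
    """Combine any scanner with any database; neither is hard-coded."""
    return [
        {"id": vuln_id, **database.lookup(vuln_id)}
        for vuln_id in scanner.scan(repo_url)
    ]


class StubScanner:
    """Stand-in scanner returning fixed, made-up IDs."""

    def scan(self, repo_url: str) -> list[str]:
        return ["DEMO-0001", "DEMO-0002"]


class StubDatabase:
    """Stand-in database returning placeholder details."""

    def lookup(self, vuln_id: str) -> dict[str, str]:
        return {"severity": "UNKNOWN", "summary": f"placeholder for {vuln_id}"}
```

Replacing a component then means writing another class with the same method and passing it to `run_pipeline`; the stubs above are also convenient in unit tests.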
Coding Standards:
The code follows the PEP 8 style guide, and we aim for close to 100% test coverage (with tests implemented using unittest).
Project Structure:
VISaR uses a standard src/ structure with separate directories for source code (a helpers/ package is set up with a collection of modules, each called from the entry point, main.py), code tests, log files, and data files.
Logging:
Run details are captured in a .log file found within the logs/ directory. If a run fails, this is where you start your troubleshooting.
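A file-based logger of this kind can be set up with Python's standard logging module. The sketch below is one possible arrangement, not VISaR's own configuration: the logger name, file name, and message format are assumptions.

```python
import logging
from pathlib import Path


def configure_logging(log_dir: str = "logs", name: str = "visar") -> logging.Logger:
    """Return a logger that appends to <log_dir>/<name>.log."""
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    # Attach the file handler only once, even if this is called repeatedly.
    if not logger.handlers:
        handler = logging.FileHandler(directory / f"{name}.log", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling `configure_logging().info("scan started")` would add a timestamped line to the log file, which is the kind of trail that helps when a run fails.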
3.3 Using the code
We have tried to make VISaR easy to set up and use. The steps to get up and running are given here:
Prerequisites. To use VISaR, ensure you have the following:
Python 3.8 or higher
Docker Desktop and the most recent OSSF Scorecard Docker image (instructions available here)
A classic GitHub auth token with the scope set to public_repo (Settings > Developer settings > Personal access tokens > Tokens (classic))
1. Clone the Repository. Obtain a copy of VISaR from this repo.
2. Create a .env file in the root directory and populate it with your GitHub token.
GITHUB_AUTH_TOKEN = "<github-auth-token>"
3. Run the setup PowerShell script from the root directory. This creates a virtual environment, installs dependencies, and activates the environment.
./scripts/setup.ps1
4. From the root directory, run the test suite.
python -m unittest discover -s tests
5. Run VISaR. From the root directory, move into the src folder and run the application.
cd src/
python main.py <full-github-repo>
After the initial setup, running VISaR only requires activating the virtual environment and then running it (step 5). If you make changes to the code, we suggest re-running the test suite (step 4) to validate that functionality hasn’t been compromised.
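To show how the token and the Docker image fit together, here is a hedged sketch that reads a `.env` file and assembles a Scorecard container command. The image name `gcr.io/openssf/scorecard:stable` and the `--repo`, `--format`, `--show-details` and `--checks` flags come from the Scorecard project's documentation; the helper names are ours, and we are assuming (not asserting) that VISaR invokes Scorecard in a similar way.

```python
from pathlib import Path
from typing import Optional


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY = "value" lines; a minimal stand-in for a dotenv loader."""
    values: dict[str, str] = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def scorecard_command(repo_url: str, checks: Optional[list[str]] = None) -> list[str]:
    """Build a docker command that runs Scorecard against one repository.

    "-e GITHUB_AUTH_TOKEN" without a value forwards the variable from the
    calling environment, so the token never appears in the argument list.
    """
    command = [
        "docker", "run", "--rm",
        "-e", "GITHUB_AUTH_TOKEN",
        "gcr.io/openssf/scorecard:stable",
        f"--repo={repo_url}",
        "--format=json",
        "--show-details",
    ]
    if checks:
        command.append("--checks=" + ",".join(checks))
    return command


# Example invocation (needs Docker and a valid token, so it is left commented out):
# import os, subprocess
# env = {**os.environ, **read_env_file()}
# result = subprocess.run(
#     scorecard_command("https://github.com/ossf/scorecard", ["Vulnerabilities"]),
#     env=env, capture_output=True, text=True, check=True,
# )
```

Building the command as a list rather than a single string avoids shell quoting problems with unusual repository URLs.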
If you do find any bugs please report them on the VISaR Issues page so we can address them and/or include notes for frequently encountered issues.
4. Summary
Proactively mitigating risks in our software is not optional, as vulnerabilities can jeopardize production environments and data platforms. This is especially true in the era of generative AI, as machine-produced code is on the rise and must be reviewed. VISaR offers an automated, accessible solution to generate vulnerability assessments of code repositories by using open-source components. We hope VISaR empowers engineers and developers to make informed decisions when choosing their tech stack and deciding whether to integrate code with their own systems.
The real value of this tool lies in its practicality: the output CSV file is easy for technical teams and independent developers to generate, and it provides evidence that security officers and other subject matter experts can use to act on identified vulnerabilities.
5. Roadmap
This article documents the first iteration of VISaR (v1.0.0). We are working on the following features for the next release:
We are developing a low-code user interface (UI) to make it even more user-friendly. This will let the user run code scans from the UI and have a dashboard view to check previous runs.
We are adding batch mode so users can supply a list of code repositories to scan. One major drawback as of now is that it only takes one URL as an input.
We are integrating Google Gemini into the UI so users can chat with an LLM to help understand vulnerabilities that fall outside their own knowledge.
If you try the tool, I would love to hear from you! Share your feedback, report bugs, or contribute to the GitHub repository.
Thanks for reading! Feel free to follow us on LinkedIn here and here. See you next time!