
com.societegenerale:github-crawler-parent
With the current move to microservices, it's not rare that a team that previously had a couple of repositories now has several dozen. Keeping a minimum of consistency between the repositories becomes a challenge, which may cause risks:
These are all simple questions that sometimes take hours to answer, always with the risk of missing a repository in the analysis, making the answer inaccurate.
Github crawler aims at automating the information gathering by crawling an organization's repositories through the GitHub API. Even if your organization has hundreds of repositories, Github crawler will be able to report very useful information in a few seconds!
Github crawler is a Spring Boot command-line application. It is written in Java and Kotlin, the target being to move as much as possible to Kotlin.
Following a simple configuration, it will use the GitHub API starting from a given organization level, then, for each public repository, will look for patterns in specified files.
You can easily exclude repositories from the analysis and configure the files and patterns you're interested in. If you have several types of repositories (front-end, back-end, and config repositories, for instance), you can have separate configuration files so that the information retrieved is relevant to each scope of analysis.
Several output types are available in this package. Below is a sample configuration:
```yaml
# the base GitHub URL for your GitHub Enterprise instance to crawl
gitHub.url: https://my.githubEnterprise
# or if it's github.com...
# gitHub.url: https://api.github.com

# the name of the GitHub organization to crawl. To fetch the repositories, the crawler will hit
# https://${gitHub.url}/api/v3/orgs/${organizationName}/repos
organizationName: MyOrganization

# repositories matching one of the configured regexps will be excluded
repositoriesToExclude:
  # exclude the ones that start with "financing-platform-" and end with "-run"
  - "^financing-platform-.*-run$"
  # exclude the ones that DON'T start with "financing-platform-"
  - "^(?!financing-platform-.*$).*"

# do you want the excluded repositories to be written in the output? (default is false)
# even if they won't have any indicators attached, it can be useful to output excluded repositories,
# especially at the beginning, to make sure you're not missing any
publishExcludedRepositories: true

# by default, we'll crawl only the repositories' default branch. But in some cases, you may want to crawl all branches
crawlAllBranches: true

# default output is console - it will be configured automatically if no output is defined
# the crawler takes a list of outputs, so you can configure several
output:
  file:
    # we'll output one repository branch per line, in a file named ${filenamePrefix}_yyyyMMdd_hhmmss.txt
    filenamePrefix: "orgaCheckupOutput"
  http:
    # we'll POST one repository branch individually to ${targetUrl}
    targetUrl: "http://someElasticSearchServer:9201/technologymap/MyOrganization"
  ciDroidReadyFile:
    # this should be an indicator defined in the indicatorsToFetchByFile section. It can be a comma-separated list if there are several to output
    indicatorsToOutput: "dockerFilePath"

# list the files to crawl, and the patterns to look for in each file
indicatorsToFetchByFile:
  # filename - the crawler will do a GET https://${gitHub.url}/raw/${organizationName}/${repositoryName}/${branchName}/$(unknown)
  # use the "[....]" syntax to escape the dot in the file name (the configuration can't be parsed otherwise, as "." is a meaningful character in yaml files)
  "[pom.xml]":
    # name of the indicator that will be reported for that repository in the output
    - name: spring_boot_starter_parent_version
      # name of the method to find the value in the file, pointing to one of the implementation classes of FileContentParser
      method: findDependencyVersionInXml
      # the parameters to the method, specific to each method type
      params:
        # findDependencyVersionInXml needs an artifactId as a parameter: it will find the version for that Maven artifact by doing a SAX parsing, even if the version is a ${variable} defined in the <properties> section
        artifactId: spring-boot-starter-parent
    - name: spring_boot_dependencies_version
      method: findDependencyVersionInXml
      params:
        artifactId: spring-boot-dependencies
  # another file to parse...
  Jenkinsfile:
    - name: build_helper_package
      method: findFirstValueWithRegexpCapture
      params:
        # findFirstValueWithRegexpCapture needs a pattern as a parameter. The pattern needs to contain a capture group (see https://regexone.com/lesson/capturing_groups)
        # the first match will be returned as the value for this indicator
        pattern: ".*com\\.a\\.given\\.package\\.([a-z]*)\\.BuildHelpers.*"
```
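As an aside, the behavior of a capture-group pattern like the one above is easy to verify with plain `java.util.regex` (the sample input line below is made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupDemo {

    public static void main(String[] args) {
        // same kind of pattern as the Jenkinsfile indicator above: group 1 captures the sub-package name
        Pattern pattern = Pattern.compile(".*com\\.a\\.given\\.package\\.([a-z]*)\\.BuildHelpers.*");

        // hypothetical line that could appear in a Jenkinsfile
        String line = "import com.a.given.package.gradle.BuildHelpers";

        Matcher matcher = pattern.matcher(line);
        if (matcher.matches()) {
            // the first capture group is what the crawler would report as the indicator value
            System.out.println(matcher.group(1)); // prints "gradle"
        }
    }
}
```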
## Configuration on repository side
While the global configuration is defined along with the github crawler, it can be overridden at the repository level.
Repository-level config is stored in a **.githubCrawler** file, at the root of the repository on the default branch.
- **Exclusion**
If a repository should be excluded, we can define it in the repository itself. If .githubCrawler contains:
```yaml
excluded: true
```
Then the crawler will consider the repository as excluded, even if it doesn't match any of the exclusion patterns in the crawler config.
Sometimes, the file we're interested in parsing is not in a standard location like the root of the repository - this is typically the case for Dockerfile.
What we can do in this case is declare the file in the crawler config, and override its path in the repository config with the redirectTo attribute:
```yaml
filesToParse:
  - name: Dockerfile
    redirectTo: routing/Dockerfile
```
With the above config, when the crawler tries to fetch Dockerfile at the root of the repository, it will actually parse routing/Dockerfile.
It may happen that you want to "tag" some repositories, to be able to filter on them easily when browsing the results. This is made possible by adding the below config in the .githubCrawler file:
```yaml
tags:
  - "someTag"
  - "someOtherTag"
```
In the output, this information will be attached to all the repositories for which it has been configured.
When running the crawler with the above config and using the HTTP output to push indicators into ElasticSearch, this is the kind of data you'll get:

(when there's no value, it means the file was not found. When the value is "not found", it means the file exists, but the value was not found in it)
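For illustration, a single document pushed to ElasticSearch could look roughly like the below (field names and values are made up for the example - they are not the crawler's exact schema):

```json
{
  "repositoryName": "financing-platform-deal",
  "branchName": "master",
  "tags": ["someTag"],
  "groups": ["myOwn"],
  "indicators": {
    "spring_boot_starter_parent_version": "1.5.9.RELEASE",
    "build_helper_package": "not found"
  }
}
```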

Once you have this data, you can quickly build any dashboard you want - like here, with the split of spring-boot-starter-parent versions across our services:

At build time, we produce 2 jars :
We leverage Spring Boot profiles to manage several configurations. Since we consider that each profile represents a logical grouping of repositories, the Spring profile(s) will be copied into a "groups" attribute for each repository in the output.
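For instance, following standard Spring Boot conventions for profile-specific property files, a dedicated config could sit next to the default one (the backEndRepos profile name and its content are hypothetical):

```yaml
# application-backEndRepos.yml - loaded when running with --spring.profiles.active=backEndRepos
# every repository in this crawl will then get "backEndRepos" in its "groups" attribute in the output
organizationName: MyOrganization
repositoriesToExclude:
  # hypothetical pattern: keep only the back-end repositories in this scope of analysis
  - "^(?!backend-.*$).*"
```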
Assuming you have a property file as defined above, all you have to do is fetch Github crawler from Maven Central and execute it with the property file(s) that you need.
Have a look at the very simple script below - it should work, and output will be available according to your configuration:
```bash
#!/usr/bin/env bash
crawlerVersion="1.0.0"
# -O saves the download under the given file name (-P would only set the target directory)
wget -O github-crawler-exec.jar http://repo1.maven.org/maven2/com/societegenerale/github-crawler/${crawlerVersion}/github-crawler-${crawlerVersion}-exec.jar --no-check-certificate
$JAVA_HOME/bin/java -jar github-crawler-exec.jar --spring.config.location=./ --spring.profiles.active=myOwn
```
The above script assumes that you have property file(s) in the same directory as the script itself (--spring.config.location=./), and that one of them declares a myOwn Spring Boot profile.
A starter project is available, allowing you to create your own GitHub crawler application leveraging everything that exists in the package. This is the perfect way to test your own output or parser class on your side... before maybe contributing it back to the project? ;-)
A simple example is available here : https://github.com/vincent-fuchs/my-custom-github-crawler/
```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
// plus the GitHubCrawler import from the github-crawler package

@SpringBootApplication
public class PersonalGitHubCrawlerApplication implements CommandLineRunner {

    @Autowired
    private GitHubCrawler crawler;

    public static void main(String[] args) {
        SpringApplication.run(PersonalGitHubCrawlerApplication.class, args);
    }

    @Override
    public void run(String... strings) throws Exception {
        crawler.crawl();
    }
}
```
We follow a strict test-driven strategy for the implementation. Contributions are welcome, but you'll need to submit decent tests along with your changes for them to be accepted. Browse the existing tests to get an idea of the level of testing expected.