GitHub crawler
Why can it be useful?
With the current move to microservices, it is common for a team that previously had a couple of repositories to now have several dozen.
Keeping a minimum of consistency across these repositories becomes a challenge, and the lack of it creates risks:
- have we updated all our repositories so that they use the latest Docker image?
- have we set up the proper security config in all our repositories?
- which versions of library X are we using across our repositories?
- are we using a library that we are not supposed to use anymore?
- do we use hardcoded strings in unexpected places in our code?
- which team owns a given repository?
These are all simple questions that sometimes take hours to answer, always with the risk of missing one repository in the analysis and making the answer inaccurate.
GitHub crawler aims to automate this information gathering by crawling an organization's repositories through the GitHub API. Even if your organization has hundreds of repositories, GitHub crawler can report very useful information in a few seconds!
Getting started
If you want to provide your own configuration without any code customisation, you can simply:
- download the latest github-crawler-starter-exec jar from Maven
- place your config file (say application.yml) next to the jar (see below)
- run from the command line:
java -jar github-crawler-exec.jar --spring.config.location=./
(more examples are available in the sections below, i.e. how to run from an IDE and how to extend GitHub crawler, and in this repository)
How does it work?
GitHub crawler is a Spring Boot command line application, written in Kotlin.
Following a simple configuration, it uses the GitHub API, starting from a given organization; then, for each repository, it looks for patterns in the specified files or performs other actions.
You can easily exclude repositories from the analysis and configure the files and patterns you're interested in. If you have several types of repositories (front-end, back-end, config repositories for instance), you can have separate configuration files so that the information retrieved is relevant to each scope of analysis.
Several output types are available in this package:
- console is the default and will be used if no output is configured
- a simple "raw" file output
- HTTP output, which enables you to POST the results to an endpoint like ElasticSearch, for easy analysis in Kibana
- some specific "CI-droid oriented" outputs, to easily "pipe" the crawler output to CI-droid
Configuration on crawler side
The configuration below shows how outputs, indicators and actions are configured under the github-crawler prefix.
crawler:
  source-control:
    type: "GITHUB"
    url: https://my.githubEnterprise/api/v3
    apiToken: "YOUR_TOKEN"
    organizationName: MyOrganization
    crawlUsersRepoInsteadOfOrgasRepos: false
  repositoriesToExclude:
    - "^financing-platform-.*-run$"
    - "^(?!financing-platform-.*$).*"
  repositoriesToInclude:
    - "^financing-platform-.*-service$"
  publishExcludedRepositories: true
  crawlAllBranches: true
  crawl-in-parallel: true
  outputs:
    file:
      filenamePrefix: "orgaCheckupOutput"
    http:
      targetUrl: "http://someElasticSearchServer:9201/technologymap/MyOrganization"
  indicatorsToFetchByFile:
    "[pom.xml]":
      - name: spring_boot_starter_parent_version
        type: findDependencyVersionInXml
        params:
          artifactId: spring-boot-starter-parent
      - name: spring_boot_dependencies_version
        type: findDependencyVersionInXml
        params:
          artifactId: spring-boot-dependencies
    Dockerfile:
      - name: docker_image_used
        type: findFirstValueWithRegexpCapture
        params:
          pattern: ".*\\/(.*)\\s?"
    "[src/main/resources/application.yml]":
      - name: spring_application_name
        type: findPropertyValueInYamlFile
        params:
          propertyName: "spring.application.name"
  misc-repository-tasks:
    - name: "nbOfMetricsInPomXml"
      type: "countHitsOnRepoSearch"
      params:
        queryString: "q=metrics+extension:xml"
    - name: "pathsWhere_ConsulCatalogWatch_IsFound"
      type: "pathsForHitsOnRepoSearch"
      params:
        queryString: "q=ConsulCatalogWatch"
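To make the two repositoriesToExclude patterns easier to read, here is a minimal sketch of how they classify repository names. The class and method names are hypothetical, and it assumes that a repository is excluded as soon as its full name matches one of the patterns; how the crawler combines these with repositoriesToInclude is not covered here.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the crawler's API: it only illustrates
// what the two repositoriesToExclude patterns from the configuration match.
final class ExclusionPatternDemo {

    // Same regular expressions as in the configuration.
    static final List<Pattern> EXCLUSIONS = List.of(
            // Names that start with "financing-platform-" and end with "-run".
            Pattern.compile("^financing-platform-.*-run$"),
            // Negative lookahead: any name that does NOT start with "financing-platform-".
            Pattern.compile("^(?!financing-platform-.*$).*"));

    // Assumption: a repository is excluded when its full name matches any pattern.
    static boolean isExcluded(String repositoryName) {
        return EXCLUSIONS.stream().anyMatch(p -> p.matcher(repositoryName).matches());
    }

    public static void main(String[] args) {
        for (String name : List.of("financing-platform-billing-service",
                                   "financing-platform-billing-run",
                                   "some-other-repo")) {
            System.out.println(name + " excluded? " + isExcluded(name));
        }
    }
}
```

Under these assumptions, only financing-platform-billing-service is kept: the first pattern removes the "-run" repositories, and the second removes everything outside the financing-platform- family.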
Configuration on repository side
While the global configuration is defined along with GitHub crawler, it can be overridden at the repository level.
Repository-level config is stored in a .githubCrawler file, at the root of the repository, in the default branch.
If a repository should be excluded, this can be declared in the repository itself. If .githubCrawler contains:
excluded: true
then the crawler will consider the repository as excluded, even if it doesn't match any of the exclusion patterns in the crawler config.
- Redirecting to a specific file to parse
Sometimes, the file we're interested in parsing is not in a standard location like the root of the repository.
In this case, we can define the file in the crawler config, and override its path in the repository config with the redirectTo attribute, here for a Dockerfile:
filesToParse:
  - name: Dockerfile
    redirectTo: routing/Dockerfile
With the above config, when the crawler tries to fetch Dockerfile at the root of the repository, it will actually parse routing/Dockerfile.
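The redirection rule can be pictured as a simple lookup. The sketch below is only an illustration: the class, the map and the method are hypothetical names, and the real crawler reads the overrides from the .githubCrawler file rather than from a hardcoded map.

```java
import java.util.Map;

// Hypothetical sketch: names are illustrative, not the crawler's own.
final class RedirectDemo {

    // Repository-level overrides, as they would be read from filesToParse:
    // file name declared in the crawler config -> redirectTo path.
    static final Map<String, String> REDIRECTS = Map.of("Dockerfile", "routing/Dockerfile");

    // Returns the path to fetch: the redirect target if one is declared,
    // otherwise the configured file name itself.
    static String resolvePath(String configuredFileName) {
        return REDIRECTS.getOrDefault(configuredFileName, configuredFileName);
    }

    public static void main(String[] args) {
        System.out.println(resolvePath("Dockerfile")); // routing/Dockerfile
        System.out.println(resolvePath("pom.xml"));    // pom.xml
    }
}
```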
You may want to "tag" some repos, to be able to filter on them easily when browsing the results.
GitHub provides "topics", which are very easy to edit and are similar to "tags".
GitHub crawler goes through each repository and attaches the topics, as tags, to the output of every repository for which topics have been configured.
GitLab support
Basic support for GitLab is available! It all boils down to implementing a GitLab-specific version of the RemoteSourceControl interface.
Your config would look like:
crawler:
  source-control:
    type: "GITLAB"
    url: https://gitlab.com/api/v4/
    # your GitLab personal access token
    apiToken: "5yL4_Y9hyC_YX9urZN_G"
    # your GitLab "group"
    organizationName: myJavaProjects
Not all methods defined in the RemoteSourceControl interface may have been implemented for GitLab: a NotImplementedError is thrown in that case. If you need them, you can implement them in RemoteGitLabImpl (and contribute them back through a pull request?).
Similarly, we may have added methods to the interface for some of our GitLab-specific use cases: in that case, these methods may not have been implemented in the GitHub version of the interface.
Overriding config at repository level for GitLab
The same rules apply as for GitHub, but in a file named .gitlabCrawler.
Azure DevOps support
Just like for GitLab, there's basic support for Azure DevOps!
crawler:
  source-control:
    type: "AZURE_DEVOPS"
    apiToken: "abcedfr6rwqwzslqhvfmdpuo5amfyv25a"
    # no need to define the URL, since it can only be a hosted service
    # in Azure DevOps, repositories are in a project, within an organization. We mention both of them, separated by a '#':
    # the crawler will pick the repositories from this project
    organization-name: "myOrg#myProject"
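The "organization#project" convention can be illustrated with a few lines of code. This is a hypothetical sketch of how such a value could be split, not the crawler's actual implementation; the class and method names are made up for the example.

```java
// Hypothetical helper showing the "<organization>#<project>" convention.
final class AzureOrganizationNameDemo {

    // Splits the configured value on the first '#':
    // index 0 is the organization, index 1 is the project.
    static String[] split(String organizationName) {
        int separator = organizationName.indexOf('#');
        if (separator < 0) {
            throw new IllegalArgumentException("expected <organization>#<project>");
        }
        return new String[] {
            organizationName.substring(0, separator),
            organizationName.substring(separator + 1)
        };
    }

    public static void main(String[] args) {
        String[] parts = split("myOrg#myProject");
        System.out.println("organization=" + parts[0] + ", project=" + parts[1]);
    }
}
```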
BitBucket support
Since v2.2.0, there's also support for BitBucket!
crawler:
  source-control:
    type: "BITBUCKET"
    url: YOUR_BITBUCKET_URL
    organizationName: myProject
    apiToken: "abcedfr6rwqwzslqhvfmdpuo5amfyv25a"
File content parsers
Some parsers are provided here. As of v1.1.1, the parser types available out of the box are:
See the javadoc in each class for details.
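As an illustration of what a parser such as findFirstValueWithRegexpCapture could do with the docker_image_used pattern from the crawler configuration, here is a minimal sketch. It is not the crawler's own code: the class name, the line-by-line scan and the use of capture group 1 are assumptions made for the example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a "first value with regexp capture" parser.
final class FirstRegexCaptureDemo {

    // Returns capture group 1 of the first line in which the pattern is found, if any.
    // Assumption: the regex declares at least one capture group.
    static Optional<String> firstCapture(String fileContent, String regex) {
        Pattern pattern = Pattern.compile(regex);
        for (String line : fileContent.split("\\R")) {
            Matcher matcher = pattern.matcher(line);
            if (matcher.find()) {
                return Optional.ofNullable(matcher.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up Dockerfile content, for illustration only.
        String dockerfile = String.join("\n",
                "# base image",
                "FROM registry.example.com/team/openjdk:8-jre",
                "COPY target/app.jar /app.jar");
        // Same pattern as docker_image_used, i.e. .*\/(.*)\s? once the YAML escaping is removed.
        System.out.println(firstCapture(dockerfile, ".*\\/(.*)\\s?").orElse("not found"));
    }
}
```

With this sample content, the greedy leading `.*` runs up to the last `/` of the FROM line, so the captured value is the image name and tag, `openjdk:8-jre`.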
Miscellaneous tasks to perform
We sometimes need information about a repository that is not found in the files it contains: in that case, we need to perform a "task" on each repository. As of v1.1.0, these are the task types available out of the box:
See the javadoc in each class for details.
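To give an idea of what a queryString such as "q=metrics+extension:xml" stands for, the sketch below builds a request URL for one repository. It assumes that the search tasks rely on GitHub's code search endpoint (GET /search/code) with a repo: qualifier, which this document does not confirm; the helper and the repository name my-repo are hypothetical.

```java
// Hypothetical helper: how the crawler actually builds its request is an assumption.
final class RepoSearchUrlDemo {

    // Appends a repo: qualifier so that the search is limited to a single repository.
    static String codeSearchUrl(String apiBaseUrl, String queryString,
                                String organization, String repository) {
        return apiBaseUrl + "/search/code?" + queryString
                + "+repo:" + organization + "/" + repository;
    }

    public static void main(String[] args) {
        System.out.println(codeSearchUrl(
                "https://my.githubEnterprise/api/v3",
                "q=metrics+extension:xml",
                "MyOrganization",
                "my-repo"));
    }
}
```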
Outputs
Default outputs are available in this package.
Each of them can be enabled at startup time through configuration. Have a look at GitHubCrawlerOutputConfig to see which property activates which output: we use Spring's @ConditionalOnProperty to decide which output to instantiate, depending on what is configured under github-crawler.outputs.
As of v1.1.0, there are 2 "general purpose" outputs available:
There are 3 "specific purpose" outputs available (see the javadoc for more info):
The default output is ConsoleOutput.
Example using HTTP output, pointing to ElasticSearch with Kibana on top
When running the crawler with HTTP output to push indicators to ElasticSearch, this is the kind of data you'll get:
- different values for the same indicator, fetched with the findFirstValueWithRegexpCapture parser:
- different values for the same indicator, fetched with the findDependencyVersionInXml parser:
(when there's no value, it means the file was not found; when the value is "not found", it means the file exists, but the value was not found in it)
- when using the crawlAllBranches: true property, the branch name is shown:
Once you have this data, you can quickly build any dashboard you want, like here, with the split of spring-boot-starter-parent versions across our services:
Packaging
At build time, we produce several jars:
- a starter-exec jar, bigger because it is self-contained. If you don't need to extend it, just take this jar and run it from the command line with your config
- much smaller regular jars (following Spring Boot recommendations) that contain just the compiled code: these are the jars you need to declare as dependencies if you want to extend GitHub crawler on your side.
Running the crawler from your IDE
We leverage Spring Boot profiles to manage several configurations. Since we consider that each profile represents a logical grouping of repositories, the Spring profile(s) are copied to a "groups" attribute for each repository in the output.
Assuming you have a property file as defined above, all you need to do in your IDE is:
- check out this repository
- create your own property file in src/main/resources, and name it application-myOwn.yml: myOwn is the Spring Boot profile you'll use
- run GitHubCrawlerApplication, passing myOwn as the profile
Extending the crawler (and contributing to it?)
A starter project is available, allowing you to create your own GitHub crawler application, leveraging everything that exists in the library.
This is the perfect way to test your own output or parser class on your side... before maybe contributing it back to the project? ;-)
A simple example is available here: https://github.com/vincent-fuchs/my-custom-github-crawler/
- import the gitHubCrawler starter as a dependency in your project
- create a Spring Boot starter class, and inject the GitHubCrawler instantiated by the starter's autoconfig:
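import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
// plus the import for GitHubCrawler, which comes from the starter dependency

@SpringBootApplication
public class PersonalGitHubCrawlerApplication implements CommandLineRunner {

    // Instantiated by the starter's autoconfiguration, then injected here
    @Autowired
    private GitHubCrawler crawler;

    public static void main(String[] args) {
        SpringApplication.run(PersonalGitHubCrawlerApplication.class, args);
    }

    // Called by Spring Boot once the application has started
    @Override
    public void run(String... strings) throws Exception {
        crawler.crawl();
    }
}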
- add your own config or classes, the Spring Boot way: if you add your own classes implementing the recognized interfaces for output or parsing, then Spring Boot will use them! See here or here for examples
- see the javadoc in FileContentParser, RepoTaskToPerform and GitHubCrawlerOutput, which are the main extension points.