hyperscan-java
hyperscan is a high-performance multiple regex matching library.
It uses hybrid automata techniques to allow simultaneous matching of large numbers (up to tens of thousands) of regular expressions and for the matching of regular expressions across streams of data.
This project is a third-party wrapper for hyperscan that lets developers integrate it into their Java (JVM) based projects.
Because the latest hyperscan release is now under a proprietary license and ARM support has never been integrated, this project utilizes the vectorscan fork.
Add it to your project
This project is available on maven central.
The version number consists of two parts (e.g. 5.4.11-3.0.0).
The first part specifies the vectorscan version (5.4.11); the second part is the version of this library, which follows semantic versioning (3.0.0).
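If a build script or diagnostic needs the two parts separately, the version string can be split at the first hyphen. The following is a minimal sketch; `ArtifactVersion` is a hypothetical helper written for illustration and is not part of this library.

```java
// Hypothetical helper (not part of hyperscan-java): splits a version such as
// "5.4.11-3.0.0" into the vectorscan version and the wrapper version.
final class ArtifactVersion {
    final String vectorscan; // part before the first hyphen, e.g. "5.4.11"
    final String library;    // part after the first hyphen, e.g. "3.0.0"

    private ArtifactVersion(String vectorscan, String library) {
        this.vectorscan = vectorscan;
        this.library = library;
    }

    static ArtifactVersion parse(String version) {
        int dash = version.indexOf('-');
        // Both parts must be non-empty
        if (dash <= 0 || dash == version.length() - 1) {
            throw new IllegalArgumentException("Expected <vectorscan>-<library>: " + version);
        }
        return new ArtifactVersion(version.substring(0, dash), version.substring(dash + 1));
    }

    public static void main(String[] args) {
        ArtifactVersion v = parse("5.4.11-3.0.0");
        System.out.println("vectorscan " + v.vectorscan + ", wrapper " + v.library);
    }
}
```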
Maven
<dependency>
<groupId>com.gliwka.hyperscan</groupId>
<artifactId>hyperscan</artifactId>
<version>5.4.11-3.0.0</version>
</dependency>
Gradle
implementation group: 'com.gliwka.hyperscan', name: 'hyperscan', version: '5.4.11-3.0.0'
sbt
libraryDependencies += "com.gliwka.hyperscan" % "hyperscan" % "5.4.11-3.0.0"
Usage
If you want to utilize the whole power of the Java Regex API / full PCRE syntax
and are fine with sacrificing some performance, use the PatternFilter.
It takes a large list of java.util.regex.Pattern instances and uses hyperscan
to filter it down to the few Patterns that have a high probability of matching.
You can then use the regular Java API to confirm those matches. This is similar to
chimera, only using the standard Java API instead of libpcre.
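The two-stage idea (a cheap filter first, then confirmation with the regular Java API) can be illustrated with the JDK alone. The sketch below is not how PatternFilter works internally: the literal-substring check is a hypothetical stand-in for the hyperscan stage, and `TwoStageMatcher` is an illustrative name that does not exist in this library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// JDK-only illustration of "prefilter, then confirm".
// The literal check is a hypothetical stand-in for the hyperscan stage;
// it is NOT how PatternFilter picks its candidates.
final class TwoStageMatcher {
    private final List<Pattern> patterns = new ArrayList<>();
    private final List<String> literals = new ArrayList<>();

    // 'literal' must occur (ignoring case) in every input the pattern can match.
    void add(Pattern pattern, String literal) {
        patterns.add(pattern);
        literals.add(literal.toLowerCase(Locale.ROOT));
    }

    // Stage 1: cheap filter. It may keep patterns that will not match, but,
    // provided the literals are chosen correctly, never drops one that could.
    List<Pattern> candidates(String input) {
        String lower = input.toLowerCase(Locale.ROOT);
        List<Pattern> result = new ArrayList<>();
        for (int i = 0; i < patterns.size(); i++) {
            if (lower.contains(literals.get(i))) {
                result.add(patterns.get(i));
            }
        }
        return result;
    }

    // Stage 2: confirm candidates with java.util.regex and read capture group 1.
    List<String> firstGroups(String input) {
        List<String> groups = new ArrayList<>();
        for (Pattern p : candidates(input)) {
            Matcher m = p.matcher(input);
            while (m.find()) {
                groups.add(m.group(1));
            }
        }
        return groups;
    }
}
```

The second stage only runs the expensive, full-featured matcher on the few surviving patterns, which is why capture groups remain available even though the filtering stage does not support them.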
If you need the highest performance, you should use the hyperscan API directly.
Be aware that only a subset of the PCRE syntax is supported.
Missing features include, for example, backreferences, capture groups and backtracking verbs.
The matching behaviour is also a little bit different; see the semantics chapter of the hyperscan docs.
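One of the differences described in the hyperscan docs is that every match is reported, rather than a single greedy one: scanning foo.*bar against fooxyzbarbar yields matches ending after fooxyzbar and after fooxyzbarbar. The sketch below emulates that behaviour with java.util.regex only, to make the contrast with Matcher.find() concrete. `EndOffsetDemo` is a hypothetical class, the emulation is quadratic and purely illustrative, and it does not call hyperscan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// JDK-only emulation of "report every end offset"; it does not use hyperscan.
final class EndOffsetDemo {
    // Collects every offset at which some match of 'regex' ends.
    static List<Integer> allEndOffsets(String regex, String input) {
        // Lazy prefix lets the match start anywhere; matches() forces it to
        // end exactly at the end of the current region.
        Pattern p = Pattern.compile("(?s:.*?)(?:" + regex + ")");
        Matcher m = p.matcher(input);
        List<Integer> ends = new ArrayList<>();
        for (int end = 1; end <= input.length(); end++) {
            m.region(0, end);
            if (m.matches()) {
                ends.add(end);
            }
        }
        return ends;
    }

    // What the regular Java API reports: successive non-overlapping matches.
    static List<Integer> javaFindEndOffsets(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> ends = new ArrayList<>();
        while (m.find()) {
            ends.add(m.end());
        }
        return ends;
    }

    public static void main(String[] args) {
        System.out.println(allEndOffsets("foo.*bar", "fooxyzbarbar"));      // [9, 12]
        System.out.println(javaFindEndOffsets("foo.*bar", "fooxyzbarbar")); // [12]
    }
}
```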
Examples
Use of the PatternFilter
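import static java.util.Arrays.asList;
...
// Patterns are compiled with the regular Java API, capture groups included
List<Pattern> patterns = asList(
    Pattern.compile("The number is ([0-9]+)", Pattern.CASE_INSENSITIVE),
    Pattern.compile("The color is (blue|red|orange)")
);
PatternFilter filter = new PatternFilter(patterns);
// Only matchers for patterns that will probably match are returned
List<Matcher> matchers = filter.filter("The number is 7 the NUMber is 27");
// Confirm the candidates with the regular Java regex API
for (Matcher matcher : matchers) {
    while (matcher.find()) {
        System.out.println(matcher.group(1));
    }
}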
Direct use of hyperscan
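import com.gliwka.hyperscan.wrapper.*;
...
// Collect all expressions that should be compiled into one database
LinkedList<Expression> expressions = new LinkedList<Expression>();
// First argument is the pattern, second the expression flag(s)
expressions.add(new Expression("[0-9]{5}", EnumSet.of(ExpressionFlag.SOM_LEFTMOST)));
expressions.add(new Expression("Test", ExpressionFlag.CASELESS));
// Database and Scanner are closed via try-with-resources
try (Database db = Database.compile(expressions)) {
    try (Scanner scanner = new Scanner()) {
        // Allocate scratch space for the database before scanning
        scanner.allocScratch(db);
        List<Match> matches = scanner.scan(db, "12345 test string");
    }
    // Save the compiled database to a file for later use
    try (OutputStream out = new FileOutputStream("db")) {
        db.save(out);
    }
    // Load the saved database back in
    try (InputStream in = new FileInputStream("db");
         Database loadedDb = Database.load(in)) {
        // use loadedDb like db above
    }
}
catch (CompileErrorException ce) {
    // Thrown when an expression fails to compile; the offending expression can be retrieved
    Expression failedExpression = ce.getFailedExpression();
}
catch (IOException ie) {
    // Saving or loading the database failed
}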
Native libraries
This library ships with pre-compiled vectorscan binaries for Linux (glibc >=2.17) and macOS for x86_64 and arm64 CPUs.
Windows is no longer supported (last supported version is 5.4.0-2.0.0) due to vectorscan dropping Windows support.
You can find the repository with the native libraries here
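An application may want to check early whether it is running on one of the platforms listed above. The following is a rough sketch based on the standard os.name and os.arch system properties; `PlatformCheck` is a hypothetical helper that is not part of this library, the accepted property values are assumptions about common JVMs, and the glibc version is not checked.

```java
import java.util.Locale;

// Hypothetical helper (not part of hyperscan-java): checks the JVM's reported
// OS and CPU architecture against the platforms listed in this README.
final class PlatformCheck {
    static boolean isListedPlatform(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);
        // Assumption: JVMs report "Linux" and "Mac OS X" for these systems
        boolean osOk = os.contains("linux") || os.contains("mac");
        // Assumption: x86_64 shows up as "x86_64" or "amd64", arm64 as "aarch64" or "arm64"
        boolean archOk = arch.equals("x86_64") || arch.equals("amd64")
                || arch.equals("aarch64") || arch.equals("arm64");
        return osOk && archOk;
    }

    public static void main(String[] args) {
        boolean ok = isListedPlatform(System.getProperty("os.name"), System.getProperty("os.arch"));
        System.out.println(ok ? "Platform is listed" : "Platform is not listed");
    }
}
```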
Documentation
The developer reference explains vectorscan.
The javadoc is located here.
Changelog
See here.
Contributing
Feel free to raise issues or submit a pull request.
Credits
Shoutout to @eliaslevy, @krzysztofzienkiewicz, @swapnilnawale, @mmimica and @Jiar for all the great contributions.
Thanks to Intel for open-sourcing hyperscan and @VectorCamp for actively maintaining the fork!
License
BSD 3-Clause License