
ocr-click-plugin
An Appium plugin that uses OCR (Optical Character Recognition) to find and click text elements on mobile device screens. This plugin leverages Tesseract.js for text recognition, Sharp for image enhancement, and Google Cloud Vertex AI for intelligent screen analysis.
```bash
# Clone the repository
git clone <your-repo-url>
cd ocr-click-plugin

# Install dependencies
npm install

# Build the plugin
npm run build

# Install plugin to Appium
npm run install-plugin
```
To enable the Vertex AI integration, set the following environment variables:

```bash
export GOOGLE_PROJECT_ID="your-project-id"
export GOOGLE_LOCATION="us-central1"          # or your preferred location
export GOOGLE_MODEL="gemini-1.5-flash"        # or gemini-1.5-pro
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```
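Before running tests, it can help to verify that these credentials actually grant access to the chosen model. The snippet below is a standalone sanity check, assuming the `@google-cloud/vertexai` Node client; the file name `vertex-check.mjs` is only an example and is not part of the plugin:

```javascript
// vertex-check.mjs – hypothetical standalone sanity check, not part of the plugin
import { VertexAI } from '@google-cloud/vertexai';

const vertexAI = new VertexAI({
  project: process.env.GOOGLE_PROJECT_ID,
  location: process.env.GOOGLE_LOCATION || 'us-central1',
});

const model = vertexAI.getGenerativeModel({
  model: process.env.GOOGLE_MODEL || 'gemini-1.5-flash',
});

// A trivial text-only request is enough to confirm project, location, and model access
const result = await model.generateContent('Reply with the single word: ok');
console.log(JSON.stringify(result.response.candidates?.[0]?.content, null, 2));
```

If this fails with an authentication or permission error, see the troubleshooting section below.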
```bash
# Run development server (uninstall, build, install, and start server)
npm run dev

# Or run individual commands
npm run build
npm run reinstall-plugin
npm run run-server
```
Find and click text elements using OCR.
`POST /session/{sessionId}/appium/plugin/textclick`

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Yes | - | Text to search for and click |
| index | number | No | 0 | Index of the match to click (if multiple matches are found) |
Response:
```json
{
  "success": true,
  "message": "Clicked on text 'Login' at index 0",
  "totalMatches": 2,
  "confidence": 87.5,
  "imageEnhanced": true
}
```
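The response fields can be used to make tests fail loudly instead of silently tapping the wrong element. A short WebdriverIO-style sketch, assuming a `driver` created as in the usage examples further down:

```javascript
// Click "Login" and assert on the documented response fields
const result = await driver.execute('mobile: textclick', { text: 'Login', index: 0 });

if (!result.success) {
  throw new Error(`OCR click failed: ${result.message}`);
}
console.log(`Matches found: ${result.totalMatches}, confidence: ${result.confidence}%`);
```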
Check if text is present on screen without clicking.
`POST /session/{sessionId}/appium/plugin/checktext`

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to search for |
Response:
```json
{
  "success": true,
  "isPresent": true,
  "totalMatches": 1,
  "searchText": "Submit",
  "matches": [
    {
      "text": "Submit",
      "confidence": 92.3,
      "coordinates": { "x": 200, "y": 400 },
      "bbox": { "x0": 150, "y0": 380, "x1": 250, "y1": 420 }
    }
  ],
  "imageEnhanced": true,
  "message": "Text 'Submit' found with 1 match(es)"
}
```
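Because `checktext` never taps the screen, it is handy for synchronization. The helper below is an illustrative sketch built only on the documented response fields; the function name, timeout, and polling interval are arbitrary choices, not part of the plugin:

```javascript
// Illustrative helper: poll "mobile: checktext" until the text appears or a timeout elapses
async function waitForText(driver, text, { timeoutMs = 15000, intervalMs = 1000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const result = await driver.execute('mobile: checktext', { text });
    if (result.isPresent) {
      return result; // contains matches, coordinates, and confidence as shown above
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Text "${text}" not found within ${timeoutMs} ms`);
}

// Usage: const match = await waitForText(driver, 'Submit');
```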
Analyze screen content using Google Cloud Vertex AI.
`POST /session/{sessionId}/appium/plugin/askllm`

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| instruction | string | Yes | Natural language instruction for AI analysis |
Response:
```json
{
  "success": true,
  "instruction": "What buttons are visible on this screen?",
  "response": {
    "candidates": [
      {
        "content": {
          "parts": [
            {
              "text": "I can see several buttons on this screen: 'Login', 'Sign Up', 'Forgot Password', and 'Help'. The Login button appears to be the primary action button."
            }
          ]
        }
      }
    ]
  },
  "message": "AI analysis completed successfully"
}
```
```javascript
// JavaScript/TypeScript (WebdriverIO)
import { remote } from 'webdriverio';

const driver = await remote(capabilities); // capabilities/options for your device

// Click text using the mobile command
await driver.execute('mobile: textclick', { text: 'Login', index: 0 });

// Check if text exists
const result = await driver.execute('mobile: checktext', { text: 'Welcome' });
console.log(result.isPresent); // true/false

// AI screen analysis
const aiResult = await driver.execute('mobile: askllm', {
  instruction: 'What are the main actions a user can take on this screen?'
});
console.log(aiResult.response.candidates[0].content.parts[0].text);
```
```java
import io.appium.java_client.android.AndroidDriver;
import org.openqa.selenium.remote.DesiredCapabilities;

import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class OCRClickExample {
    public static void main(String[] args) throws Exception {
        URL serverUrl = new URL("http://localhost:4723/wd/hub");
        DesiredCapabilities capabilities = new DesiredCapabilities();
        // Set platformName, deviceName, etc. as required for your device

        AndroidDriver driver = new AndroidDriver(serverUrl, capabilities);

        // Click text
        Map<String, Object> clickParams = new HashMap<>();
        clickParams.put("text", "Submit");
        clickParams.put("index", 0);
        Object result = driver.executeScript("mobile: textclick", clickParams);

        // Check text presence
        Map<String, Object> checkParams = new HashMap<>();
        checkParams.put("text", "Error");
        Object checkResult = driver.executeScript("mobile: checktext", checkParams);

        // AI analysis
        Map<String, Object> aiParams = new HashMap<>();
        aiParams.put("instruction", "Describe the layout and main elements of this screen");
        Object aiResult = driver.executeScript("mobile: askllm", aiParams);
        System.out.println("AI Response: " + aiResult);
    }
}
```
```python
from appium import webdriver

# capabilities: dict of Appium capabilities (platformName, deviceName, ...)
driver = webdriver.Remote('http://localhost:4723/wd/hub', capabilities)

# Click text
result = driver.execute_script('mobile: textclick', {'text': 'Login'})
print(f"Click result: {result}")

# Check text
check_result = driver.execute_script('mobile: checktext', {'text': 'Welcome'})
print(f"Text present: {check_result['isPresent']}")

# AI analysis
ai_result = driver.execute_script('mobile: askllm', {
    'instruction': 'What form fields are visible and what information do they require?'
})
print(f"AI Analysis: {ai_result['response']['candidates'][0]['content']['parts'][0]['text']}")
```
```bash
# Text click
curl -X POST http://localhost:4723/wd/hub/session/{sessionId}/appium/plugin/textclick \
  -H "Content-Type: application/json" \
  -d '{"text": "Sign Up", "index": 0}'

# Text check
curl -X POST http://localhost:4723/wd/hub/session/{sessionId}/appium/plugin/checktext \
  -H "Content-Type: application/json" \
  -d '{"text": "Error Message"}'

# AI analysis
curl -X POST http://localhost:4723/wd/hub/session/{sessionId}/appium/plugin/askllm \
  -H "Content-Type: application/json" \
  -d '{"instruction": "What are the key UI elements and their purposes on this screen?"}'
```
The `askllm` API enables powerful screen analysis capabilities:
```javascript
// Screen understanding
await driver.execute('mobile: askllm', {
  instruction: 'Describe the main purpose of this screen and its key components'
});

// Button discovery
await driver.execute('mobile: askllm', {
  instruction: 'List all clickable buttons and their likely functions'
});

// Form analysis
await driver.execute('mobile: askllm', {
  instruction: 'What form fields are present and what type of information do they expect?'
});

// Error detection
await driver.execute('mobile: askllm', {
  instruction: 'Are there any error messages or warnings visible on this screen?'
});

// Navigation guidance
await driver.execute('mobile: askllm', {
  instruction: 'How would a user navigate to the settings page from this screen?'
});
```
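The two commands can also be chained: ask the model which label to interact with, then pass that label to `textclick`. The sketch below is illustrative; the prompt wording and the assumption that the model replies with a bare label are not guaranteed, so parse the free-text response defensively:

```javascript
// Ask the model for the label of the primary action, then click it via OCR
const analysis = await driver.execute('mobile: askllm', {
  instruction: 'Reply with only the exact label of the primary action button on this screen'
});

const label = analysis.response.candidates[0].content.parts[0].text.trim();

const click = await driver.execute('mobile: textclick', { text: label });
if (!click.success) {
  console.warn(`Could not click "${label}": ${click.message}`);
}
```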
```bash
# Google Cloud Configuration
GOOGLE_PROJECT_ID=your-gcp-project-id
GOOGLE_LOCATION=us-central1
GOOGLE_MODEL=gemini-1.5-flash
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
# Alternative: Use gcloud CLI authentication
# gcloud auth application-default login

# OCR Configuration
OCR_CONFIDENCE_THRESHOLD=60
OCR_LANGUAGE=eng

# Image Processing
ENABLE_IMAGE_ENHANCEMENT=true
SHARP_IGNORE_GLOBAL_LIBVIPS=1
```
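How these variables are consumed is internal to the plugin; conceptually, the OCR-related settings reduce to something like the following (a sketch using the documented defaults, not the plugin's actual source):

```javascript
// Illustrative only – how the documented OCR settings might map to defaults in code
const ocrConfig = {
  confidenceThreshold: Number(process.env.OCR_CONFIDENCE_THRESHOLD || 60),
  language: process.env.OCR_LANGUAGE || 'eng',
  enhanceImage: (process.env.ENABLE_IMAGE_ENHANCEMENT || 'true') === 'true',
};
```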
The plugin uses an optimized Tesseract configuration:

```javascript
const TESSERACT_CONFIG = {
  lang: 'eng',
  tessedit_char_whitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?-_@#$%^&*()',
  tessedit_pageseg_mode: '6', // Uniform text block
  preserve_interword_spaces: '1',
  // ... other optimizations
};
```
The default minimum confidence threshold is 60%. Words below this confidence are filtered out:

```javascript
const MIN_CONFIDENCE_THRESHOLD = 60;
```
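To illustrate how such a threshold is typically applied, the sketch below runs tesseract.js with a similar configuration and drops low-confidence words. It assumes the tesseract.js v5 worker API and a `screenshotBuffer` obtained elsewhere; it is not the plugin's internal code:

```javascript
import { createWorker } from 'tesseract.js';

const MIN_CONFIDENCE_THRESHOLD = 60;

// Assumes the tesseract.js v5 worker API; screenshotBuffer is a PNG Buffer obtained elsewhere
const worker = await createWorker('eng');
await worker.setParameters({
  tessedit_char_whitelist:
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?-_@#$%^&*()',
  preserve_interword_spaces: '1',
});

// Request block-level output so per-word confidences and bounding boxes are available
const { data } = await worker.recognize(screenshotBuffer, {}, { blocks: true });

const words = (data.blocks || [])
  .flatMap((block) => block.paragraphs)
  .flatMap((paragraph) => paragraph.lines)
  .flatMap((line) => line.words);

// Keep only words OCR is reasonably confident about
const confidentWords = words.filter((word) => word.confidence >= MIN_CONFIDENCE_THRESHOLD);

await worker.terminate();
```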
The plugin applies several image processing steps to the screenshot before running OCR.
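The exact steps are internal to the plugin, but a typical Sharp preprocessing pass for OCR looks roughly like this (grayscale conversion, contrast normalization, and sharpening are assumptions for illustration, not a description of the plugin's pipeline):

```javascript
import sharp from 'sharp';

// Illustrative enhancement pass; the plugin's actual steps may differ
async function enhanceForOcr(screenshotBuffer) {
  return sharp(screenshotBuffer)
    .grayscale()  // remove colour noise
    .normalize()  // stretch contrast across the full range
    .sharpen()    // emphasise glyph edges
    .toBuffer();
}
```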
Authentication Error:

```bash
# Set up application default credentials
gcloud auth application-default login

# Or use a service account
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```
API Not Enabled:

```bash
# Enable the Vertex AI API
gcloud services enable aiplatform.googleapis.com
```
Model Not Available: try different model names:

- `gemini-1.5-flash` (faster, cheaper)
- `gemini-1.5-pro` (more capable)
- `gemini-1.0-pro-vision` (legacy)

If you encounter Sharp compilation errors during installation, especially with Node.js v24+:
```bash
# Method 1: Use environment variable
SHARP_IGNORE_GLOBAL_LIBVIPS=1 npm install ocr-click-plugin

# Method 2: Install Sharp separately first
SHARP_IGNORE_GLOBAL_LIBVIPS=1 npm install --include=optional sharp
npm install ocr-click-plugin

# Method 3: For Appium plugin installation
SHARP_IGNORE_GLOBAL_LIBVIPS=1 appium plugin install ocr-click-plugin
```
Lower `MIN_CONFIDENCE_THRESHOLD` if text is not being detected.

```
ocr-click-plugin/
├── src/
│   └── index.ts          # Main plugin implementation
├── dist/                 # Compiled JavaScript
├── package.json          # Dependencies and scripts
├── tsconfig.json         # TypeScript configuration
└── README.md             # This file
```
```bash
npm run build
npm test
```

```bash
npm run dev               # Full development workflow
npm run build             # Compile TypeScript
npm run install-plugin    # Install to Appium
npm run reinstall-plugin  # Uninstall and reinstall
npm run run-server        # Start Appium server
npm run uninstall         # Remove from Appium
```
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push the branch (`git push origin feature/amazing-feature`)

This project is licensed under the ISC License - see the LICENSE file for details.