cloud-tpu-diagnostics
This library provides tools to monitor, debug, and profile jobs running on Cloud TPU. To learn about Cloud TPU, refer to the full documentation.
This module dumps Python traces when a fault such as a segmentation fault, a floating-point exception, or an illegal operation exception occurs in the program. It also periodically collects stack traces to help debug a program running on Cloud TPU that is stuck or hung.
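The package's internals are not described here, but Python's standard library offers a comparable mechanism for the fault case. The sketch below uses `faulthandler` to write the tracebacks of all threads to a file when the interpreter receives SIGSEGV, SIGFPE, SIGABRT, SIGBUS, or SIGILL. The helper names `enable_fault_traces` and `dump_traces_now` are hypothetical, and this is not claimed to be how cloud-tpu-diagnostics is implemented.

```python
import faulthandler
import os
import tempfile


def enable_fault_traces(path):
    """Hypothetical helper: dump all threads' tracebacks to `path` on fatal signals.

    faulthandler writes directly to the file descriptor, so the returned file
    object must stay open for as long as the handler is enabled.
    """
    trace_file = open(path, "w")
    faulthandler.enable(file=trace_file, all_threads=True)
    return trace_file


def dump_traces_now(trace_file):
    """Hypothetical helper: write the current tracebacks without waiting for a fault."""
    faulthandler.dump_traceback(file=trace_file, all_threads=True)
    trace_file.flush()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "fault_traces.txt")
    f = enable_fault_traces(path)
    dump_traces_now(f)  # same format a real fault would produce
    faulthandler.disable()
    f.close()
    with open(path) as fh:
        print(fh.read())
```

A real segmentation fault would produce the same kind of traceback in the file just before the process dies, which is why the file has to be opened up front rather than at fault time.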
To install the package, run the following command on the TPU VM:
pip install cloud-tpu-diagnostics
To use this package, first import the required modules:
from cloud_tpu_diagnostics import diagnostic
from cloud_tpu_diagnostics.configuration import debug_configuration
from cloud_tpu_diagnostics.configuration import diagnostic_configuration
from cloud_tpu_diagnostics.configuration import stack_trace_configuration
Then, create a configuration object for stack traces. The module collects stack traces only when the collect_stack_trace parameter is set to True. The following scenarios are currently supported:
stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=False)
This configuration disables stack trace collection, so no traces are collected in the event of a fault or process hang.
stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=True,
    stack_trace_to_cloud=False)
If there is a fault or process hang, this configuration prints the stack traces to the console (stderr).
stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=True,
    stack_trace_to_cloud=True)
This configuration temporarily collects stack traces in the /tmp/debugging directory on the TPU host if there is a fault or process hang. Additionally, the traces collected in TPU host memory are uploaded to Google Cloud Logging, which makes problems easier to troubleshoot and fix. You can view the traces in Logs Explorer using the following query:
logName="projects/<project_name>/logs/tpu.googleapis.com%2Fruntime_monitor"
jsonPayload.verb="stacktraceanalyzer"
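If you query the traces for several projects, it can help to generate the filter rather than edit it by hand. The sketch below rebuilds the query above for a given project; `build_trace_filter` is a hypothetical helper, and `my-project` is a placeholder for your own project. It shows why the log name contains `%2F`: log IDs are URL-encoded in Cloud Logging, so the `/` in `tpu.googleapis.com/runtime_monitor` is escaped. Clauses on separate lines of a Logging query are combined with AND.

```python
from urllib.parse import quote


def build_trace_filter(project_name):
    """Hypothetical helper: build the Logs Explorer query for stack trace logs."""
    # safe="" makes quote() escape "/" as "%2F", matching the encoded log ID.
    log_id = quote("tpu.googleapis.com/runtime_monitor", safe="")
    return (
        f'logName="projects/{project_name}/logs/{log_id}"\n'
        'jsonPayload.verb="stacktraceanalyzer"'
    )


if __name__ == "__main__":
    print(build_trace_filter("my-project"))
```

The resulting string can be pasted into Logs Explorer as-is.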
By default, stack traces are collected every 10 minutes. To change the interval between two stack trace collection events, set the stack_trace_interval_seconds parameter:
stack_trace_config = stack_trace_configuration.StackTraceConfig(
    collect_stack_trace=True,
    stack_trace_to_cloud=True,
    stack_trace_interval_seconds=300)
This configuration collects stack traces every 5 minutes (300 seconds) and uploads them to Cloud Logging.
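The effect of an interval can be illustrated with the standard library: `faulthandler.dump_traceback_later(timeout, repeat=True, file=...)` starts a watchdog thread that writes every thread's traceback each `timeout` seconds until it is cancelled. This is an illustrative stand-in only; the helper `start_periodic_traces` is hypothetical, and the package's own collection mechanism is not documented here.

```python
import faulthandler
import sys
import tempfile
import time


def start_periodic_traces(interval_seconds=600, file=None):
    """Hypothetical helper: dump all threads' tracebacks every `interval_seconds`.

    600 seconds mirrors the documented 10-minute default. `file` must have a
    real file descriptor; it defaults to stderr. Stop with
    faulthandler.cancel_dump_traceback_later().
    """
    file = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(interval_seconds, repeat=True, file=file)


if __name__ == "__main__":
    with tempfile.TemporaryFile(mode="w+") as out:
        start_periodic_traces(interval_seconds=0.2, file=out)
        time.sleep(0.7)  # long enough for several dumps
        faulthandler.cancel_dump_traceback_later()
        out.seek(0)
        print(out.read())
```

A periodic dump is what makes hangs debuggable: if the job stops making progress, consecutive dumps keep showing the same stuck frames.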
Then, create a debug configuration object that holds the stack trace configuration:
debug_config = debug_configuration.DebugConfig(
    stack_trace_config=stack_trace_config)
Then, create a diagnostic configuration object that holds the debug configuration:
diagnostic_config = diagnostic_configuration.DiagnosticConfig(
    debug_config=debug_config)
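To make the nesting and the documented scenarios easier to see at a glance, the sketch below models them with plain dataclasses. `StackTraceSettings`, `DebugSettings`, `DiagnosticSettings`, and `trace_destination` are hypothetical stand-ins, not the package's classes; their field names mirror the documented parameters, the 600-second default reflects the documented 10-minute interval, and the other defaults are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StackTraceSettings:
    """Hypothetical stand-in mirroring the documented StackTraceConfig parameters."""
    collect_stack_trace: bool = False  # assumed default, for illustration only
    stack_trace_to_cloud: bool = False  # assumed default, for illustration only
    stack_trace_interval_seconds: int = 600  # documented default: 10 minutes


@dataclass
class DebugSettings:
    """Hypothetical stand-in: the debug config wraps the stack trace config."""
    stack_trace_config: StackTraceSettings = field(default_factory=StackTraceSettings)


@dataclass
class DiagnosticSettings:
    """Hypothetical stand-in: the diagnostic config wraps the debug config."""
    debug_config: DebugSettings = field(default_factory=DebugSettings)


def trace_destination(config):
    """Map a nested config to the trace behavior described in this README."""
    st = config.debug_config.stack_trace_config
    if not st.collect_stack_trace:
        return "disabled"
    if not st.stack_trace_to_cloud:
        return "stderr"
    return "/tmp/debugging + Cloud Logging"


if __name__ == "__main__":
    cfg = DiagnosticSettings(DebugSettings(StackTraceSettings(
        collect_stack_trace=True,
        stack_trace_to_cloud=True,
        stack_trace_interval_seconds=300)))
    print(trace_destination(cfg))
```

Only the innermost object decides where traces go; the outer layers exist so that other diagnostic features can be added alongside stack traces.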
Finally, call the diagnose() method in a with statement, and place the statements for which you want to collect stack traces inside the context manager:
with diagnostic.diagnose(diagnostic_config):
    run_job(...)
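The with-statement pattern guarantees that collection stops when the wrapped code finishes, even if it raises. The sketch below shows that pattern with `contextlib.contextmanager` and `faulthandler`; `diagnose_sketch` and `run_job` are hypothetical names, and this is an illustration of the context-manager technique, not the package's `diagnose()` implementation.

```python
import contextlib
import faulthandler
import sys
import tempfile


@contextlib.contextmanager
def diagnose_sketch(interval_seconds=600, file=None):
    """Hypothetical stand-in for a diagnose()-style context manager.

    On entry: dump tracebacks on fatal signals and every `interval_seconds`.
    On exit (normal or via exception): cancel the periodic dump and disable
    the fault handler again if it was off before entry.
    """
    file = sys.stderr if file is None else file
    was_enabled = faulthandler.is_enabled()
    faulthandler.enable(file=file, all_threads=True)
    faulthandler.dump_traceback_later(interval_seconds, repeat=True, file=file)
    try:
        yield
    finally:
        faulthandler.cancel_dump_traceback_later()
        if not was_enabled:
            faulthandler.disable()


def run_job():
    """Hypothetical placeholder for the workload being diagnosed."""
    return sum(range(10))


if __name__ == "__main__":
    with tempfile.TemporaryFile(mode="w+") as log:
        with diagnose_sketch(interval_seconds=300, file=log):
            print(run_job())
```

Putting the cleanup in a `finally` block is the design choice that matters here: a job that crashes with a Python exception still leaves no watchdog thread running afterwards.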