{"id":8962,"date":"2024-08-24T11:49:27","date_gmt":"2024-08-24T11:49:27","guid":{"rendered":"https:\/\/www.infinitivehost.com\/knowledge-base\/?p=8962"},"modified":"2024-08-29T07:09:04","modified_gmt":"2024-08-29T07:09:04","slug":"how-to-identify-faulty-gpu-slot-in-server-with-ubuntu-command","status":"publish","type":"post","link":"https:\/\/www.infinitivehost.com\/knowledge-base\/how-to-identify-faulty-gpu-slot-in-server-with-ubuntu-command\/","title":{"rendered":"How to Identify Faulty GPU Slot in Server with Ubuntu Command"},"content":{"rendered":"<p>To identify a faulty GPU slot in a server running Ubuntu, you can combine several system commands and utilities. Here&#8217;s a step-by-step guide to help you through the process:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1. <strong>Check GPU Information with <code>lspci<\/code><\/strong><\/h3>\n\n\n\n<p>The <code>lspci<\/code> command lists all PCI devices, including GPUs. Server GPUs often appear as a <code>3D controller<\/code> rather than a VGA device, so match both:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>lspci | grep -Ei 'vga|3d|display'<\/code><\/pre>\n\n\n\n<p>This lists VGA-compatible devices, 3D controllers, and display controllers. Note the PCI bus address at the start of each line (for example, <code>3b:00.0<\/code>) and the GPU name. A GPU that is physically installed but missing from this list may point to a problem with the card or its slot.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">2. <strong>Check GPU Status with <code>nvidia-smi<\/code> (for NVIDIA GPUs)<\/strong><\/h3>\n\n\n\n<p>If you have NVIDIA GPUs, the <code>nvidia-smi<\/code> command reports their status in detail:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>nvidia-smi<\/code><\/pre>\n\n\n\n<p>This shows GPU utilization, memory usage, temperature, and the <code>Bus-Id<\/code> of each GPU. Compare the <code>Bus-Id<\/code> values with the addresses from <code>lspci<\/code>: a GPU that <code>lspci<\/code> lists but <code>nvidia-smi<\/code> does not was detected on the bus but not initialized by the driver.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3. <strong>Examine System Logs<\/strong><\/h3>\n\n\n\n<p>System logs can provide information about hardware errors. 
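You can check system logs for GPU-related messages:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sudo dmesg | grep -i gpu\nsudo dmesg | grep -i error\nsudo dmesg | grep -iE 'nvrm|xid'<\/code><\/pre>\n\n\n\n<p>These commands search the kernel ring buffer for GPU-related or error-related messages; on recent Ubuntu releases, reading it may require <code>sudo<\/code>. The last command matches messages from the NVIDIA driver, which logs them under the <code>NVRM<\/code> prefix and reports GPU faults as <code>Xid<\/code> errors together with the PCI address of the affected GPU.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4. <strong>Use <code>lshw<\/code> to List Hardware Information<\/strong><\/h3>\n\n\n\n<p>The <code>lshw<\/code> command provides detailed information about the hardware in your system. To get details about the GPU, use:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sudo lshw -C display<\/code><\/pre>\n\n\n\n<p>This lists the display adapters (GPUs) in your system, including the bus address of each one and the driver bound to it. A device marked <code>UNCLAIMED<\/code> has no driver attached.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5. <strong>Check GPU Utilization and Performance<\/strong><\/h3>\n\n\n\n<p>If you suspect a GPU is faulty because of performance issues, you can monitor it with <code>watch<\/code> and <code>nvidia-smi<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>watch -n 1 nvidia-smi<\/code><\/pre>\n\n\n\n<p>This command refreshes the <code>nvidia-smi<\/code> output every second, so you can watch utilization, temperature, and memory usage in real time and notice when a GPU drops out of the list.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">6. <strong>Check for Hardware Issues<\/strong><\/h3>\n\n\n\n<p>Note that the <code>smartctl<\/code> utility reports SMART health data for storage devices (SSDs and HDDs) and does not cover GPUs. To check the PCIe link of a suspect slot, run <code>sudo lspci -vv -s<\/code> followed by the GPU&#8217;s bus address and compare <code>LnkCap<\/code> (what the device supports) with <code>LnkSta<\/code> (what was negotiated). A link negotiated at a lower width than the card supports can indicate a seating or slot problem; link speed may legitimately drop while the GPU is idle. For NVIDIA GPUs, <code>nvidia-smi -q<\/code> also reports ECC error counts on cards that support ECC.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">7. 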
<strong>Review GPU-Specific Tools<\/strong><\/h3>\n\n\n\n<p>For AMD GPUs, use tools like <code>radeontop<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sudo apt install radeontop\nradeontop<\/code><\/pre>\n\n\n\n<p>For Intel GPUs, you might use <code>intel_gpu_top<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sudo apt install intel-gpu-tools\nsudo intel_gpu_top<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">8. <strong>Test GPU Functionality<\/strong><\/h3>\n\n\n\n<p>Stress tests and benchmarks can help expose a faulty GPU. Benchmarks built on <code>CUDA<\/code> (for NVIDIA) or <code>OpenCL<\/code> can be useful. To separate a bad card from a bad slot, move the suspect GPU to a known-good slot and see whether the fault follows the card or stays with the slot.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Summary<\/h3>\n\n\n\n<p>To diagnose a faulty GPU in a server running Ubuntu:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Use <code>lspci<\/code> to list PCI devices and confirm that every GPU is detected.<\/li>\n\n\n\n<li>Use <code>nvidia-smi<\/code> for NVIDIA GPUs or <code>lshw -C display<\/code> for general GPU information.<\/li>\n\n\n\n<li>Check system logs with <code>dmesg<\/code>.<\/li>\n\n\n\n<li>Monitor GPU performance with <code>watch<\/code> and <code>nvidia-smi<\/code>.<\/li>\n\n\n\n<li>Use GPU-specific tools and stress tests if needed.<\/li>\n<\/ol>\n\n\n\n<p>By combining these commands and tools, you should be able to identify and diagnose issues with your GPU.<\/p>\n\n\n\n<p><strong>Conclusion<\/strong><\/p>\n\n\n\n<p>Finding a faulty GPU slot in the <a href=\"https:\/\/www.infinitivehost.com\/gpu-dedicated-server\"><mark style=\"background-color:#8ed1fc\" class=\"has-inline-color has-black-color\"><strong>best GPU dedicated server<\/strong><\/mark><\/a> running Ubuntu comes down to combining a few system utilities and commands. 
The steps above give you a complete guide to working through the problem.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>To identify a faulty GPU slot in a server running Ubuntu, you can use a combination of system commands and utilities. Here&#8217;s a step-by-step guide to help you through the process: 1. Check GPU Information with lspci The lspci command lists all PCI devices, including GPUs. To identify the GPU devices, you can [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[202],"tags":[],"class_list":["post-8962","post","type-post","status-publish","format-standard","hentry","category-gpu-server"],"_links":{"self":[{"href":"https:\/\/www.infinitivehost.com\/knowledge-base\/wp-json\/wp\/v2\/posts\/8962","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.infinitivehost.com\/knowledge-base\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.infinitivehost.com\/knowledge-base\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.infinitivehost.com\/knowledge-base\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.infinitivehost.com\/knowledge-base\/wp-json\/wp\/v2\/comments?post=8962"}],"version-history":[{"count":2,"href":"https:\/\/www.infinitivehost.com\/knowledge-base\/wp-json\/wp\/v2\/posts\/8962\/revisions"}],"predecessor-version":[{"id":8997,"href":"https:\/\/www.infinitivehost.com\/knowledge-base\/wp-json\/wp\/v2\/posts\/8962\/revisions\/8997"}],"wp:attachment":[{"href":"https:\/\/www.infinitivehost.com\/knowledge-base\/wp-json\/wp\/v2\/media?parent=8962"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.infinitivehost.com\/knowledge-base\/wp-json\/wp\/v2\/categories?post=8962"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\
/\/www.infinitivehost.com\/knowledge-base\/wp-json\/wp\/v2\/tags?post=8962"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}