{"id":94,"date":"2023-04-14T19:49:27","date_gmt":"2023-04-15T03:49:27","guid":{"rendered":"https:\/\/www.dumpsterfirecomputing.com\/?p=94"},"modified":"2023-04-15T07:39:53","modified_gmt":"2023-04-15T15:39:53","slug":"troubleshooting-asr-and-vss","status":"publish","type":"post","link":"https:\/\/www.dumpsterfirecomputing.com\/?p=94","title":{"rendered":"Troubleshooting &#8211; ASR and VSS"},"content":{"rendered":"\n<p>I was once asked in an interview years ago the following question: &#8220;User calls and says their printer isn&#8217;t working.  How would you troubleshoot this?&#8221;<\/p>\n\n\n\n<p>It&#8217;s a silly question at face value, but I believe it was intended to be vague in order to have a conversation around the troubleshooting thought process.  Off the top of my head it&#8217;s easy to come up with questions intended to collect data that can help narrow down the issue.  Local printer or network printer? One user or all users? When was the last time you could print to it? Does it have paper? Is it on? Etc. etc.<\/p>\n\n\n\n<p>One of the reasons I wanted to start this blog was to document my own troubleshooting processes and perhaps give others the tools necessary to start asking questions and digging deeper into issues, especially when I&#8217;ve spent time researching a topic to find a solution.  Some people go one layer deep when trying to find a solution and then get lost.  But just because it&#8217;s dark doesn&#8217;t mean you&#8217;re stuck: there&#8217;s always another path to take &#8211; you just need to find it.<\/p>\n\n\n\n<p>I&#8217;ll use a specific example and go a few layers deep to talk through how it was solved.  I worked with a colleague to really flesh out some of the solutions.<\/p>\n\n\n\n<p>The example we&#8217;ll use relates to Azure Site Recovery being deployed to a couple hundred servers.  
Out of that batch, a large number show warnings &#8211; this is an example:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image.png\" alt=\"\" class=\"wp-image-95\" width=\"526\" height=\"405\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image.png 1013w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-300x231.png 300w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-768x592.png 768w\" sizes=\"auto, (max-width: 526px) 100vw, 526px\" \/><\/figure>\n\n\n\n<p>Azure lets you drill down a little bit further and get more information on that specific error:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1019\" height=\"653\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-12.png\" alt=\"\" class=\"wp-image-117\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-12.png 1019w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-12-300x192.png 300w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-12-768x492.png 768w\" sizes=\"auto, (max-width: 1019px) 100vw, 1019px\" \/><\/figure>\n\n\n\n<p>In this example, we have a server where replication is running successfully but there is no app-consistent recovery point.  Microsoft gives us some possible causes and recommendations.  So&#8230; how would you troubleshoot this?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Tier 1 &#8211; The Basics &#8211; &#8220;Is it plugged in?&#8221;<\/h2>\n\n\n\n<p>Troubleshooting is part art, part science. It&#8217;s a repeatable process &#8211; gather data, form a hypothesis, act, check.  
In nearly all cases, there&#8217;s no obvious right or wrong answer for beginning the troubleshooting process.  Some starting points make more sense than others.  For instance, I wouldn&#8217;t jump to restarting the server first.  With servers, a restart is the one thing you want to avoid at all costs.<\/p>\n\n\n\n<p>Who hasn&#8217;t seen this meme in their career?<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"540\" height=\"294\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-1.png\" alt=\"\" class=\"wp-image-97\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-1.png 540w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-1-300x163.png 300w\" sizes=\"auto, (max-width: 540px) 100vw, 540px\" \/><\/figure>\n\n\n\n<p>Don&#8217;t.  Reboot.  Servers.<\/p>\n\n\n\n<p>In this case, we did start with Azure Site Recovery.  We disabled replication, re-enabled replication and waited to see what happened.  There was no harm in that: it didn&#8217;t require a reboot and was fairly quick to do.  We wanted to rule out the possibility that it was the ASR agent that was somehow not cooperating.  But alas, same result.<\/p>\n\n\n\n<p>This is the first layer down the rabbit hole where I&#8217;ll sometimes see people stop.  The old joke &#8220;I&#8217;ve tried nothing and I&#8217;m out of ideas&#8221; comes to mind.  But in troubleshooting, we need more data in order to make an informed hypothesis.  For this situation we need to get onto the server and start looking around.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Tier 2 &#8211; &#8220;Data, data, data.  I cannot make bricks without clay!&#8221;<\/h2>\n\n\n\n<p>The most important thing when troubleshooting a problem is data.  The more data you have around the symptoms, the environment, the scope, the scale, etc. the better.  
When troubleshooting a problem on a Windows server, the first place you should always look is the Event Log.  In this instance, we&#8217;re looking for either Azure Site Recovery or VSS errors.  And here we&#8217;ve found one:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"833\" height=\"484\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-4.png\" alt=\"\" class=\"wp-image-101\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-4.png 833w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-4-300x174.png 300w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-4-768x446.png 768w\" sizes=\"auto, (max-width: 833px) 100vw, 833px\" \/><\/figure>\n\n\n\n<p>I&#8217;m going to pause briefly here and talk about the Volume Shadow Copy Service (VSS).  Third-party applications use it to make snapshots for replication purposes or, more commonly, backup.  Microsoft has some fantastic, in-depth documentation that you should be aware of if you want to get a deeper understanding of how it works.<\/p>\n\n\n\n<p><a href=\"https:\/\/learn.microsoft.com\/en-us\/windows\/win32\/vss\/overview-of-processing-a-backup-under-vss\">Overview of Processing a Backup Under VSS &#8211; Win32 apps | Microsoft Learn<\/a><\/p>\n\n\n\n<p>Working left to right, you have a Requestor that talks to Backup Components.  The Backup Components talk to Writers, and the Writers talk to back-end Providers.  The error above speaks very specifically about a provider not having some components registered.  
This is where we roll our sleeves up.<\/p>\n\n\n\n<p>If you&#8217;re not familiar, Windows has a built-in tool called <a rel=\"noreferrer noopener\" href=\"https:\/\/learn.microsoft.com\/en-us\/windows-server\/administration\/windows-commands\/vssadmin\" target=\"_blank\">vssadmin<\/a> which lets you collect some potentially useful information.  By running <code>vssadmin list providers<\/code> on the server we get three items:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"859\" height=\"350\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-5.png\" alt=\"\" class=\"wp-image-102\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-5.png 859w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-5-300x122.png 300w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-5-768x313.png 768w\" sizes=\"auto, (max-width: 859px) 100vw, 859px\" \/><\/figure>\n\n\n\n<p>From doing some digging online, we find that the second and third providers are native, built-in providers.  The Microsoft File Share Shadow Copy provider looks to have been introduced in Windows Server 2012 (according to this older <a rel=\"noreferrer noopener\" href=\"https:\/\/techcommunity.microsoft.com\/t5\/storage-at-microsoft\/vss-for-smb-file-shares\/ba-p\/425726#:~:text=The%20File%20Share%20Shadow%20Copy%20Provider%20is%20invoked,copy%20request%20to%20File%20Share%20Shadow%20Copy%20Agents.\" target=\"_blank\">Microsoft blog post<\/a>).<\/p>\n\n\n\n<p>In this case, I&#8217;m left with the <code>Hyper-V IC Software Shadow Copy Provider<\/code>.  As of the writing of this post, I&#8217;ve no idea where this provider comes from.  I&#8217;ve seen this provider on multiple servers with issues, and have verified that no other backup product has been installed.  
But the provider ID in the output above matches the provider ID in the Event Log error, and we know it isn&#8217;t used.  The next step was to remove that provider and test.  To do that, we need to venture into the registry.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"559\" height=\"536\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-6.png\" alt=\"\" class=\"wp-image-103\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-6.png 559w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-6-300x288.png 300w\" sizes=\"auto, (max-width: 559px) 100vw, 559px\" \/><\/figure>\n\n\n\n<p>The registry can be a scary place.  It&#8217;s very easy to cause significant damage to your system if you don&#8217;t know what you&#8217;re looking at.  For VSS providers, luckily it&#8217;s pretty straightforward.  You can see the location above under the HKEY_LOCAL_MACHINE hive.  In this case, I right-clicked the provider key with the matching GUID and exported it (saving it to the desktop) as a backup, then removed the key.  Next, run <code>net stop vss &amp;&amp; net start vss<\/code> in order to restart the VSS Service and test.<\/p>\n\n\n\n<p>In a few cases, this worked.  In others, it didn&#8217;t.  In those that didn&#8217;t, we needed to keep collecting data.<\/p>\n\n\n\n<p>In one oddball case, we found VSS errors relating to a specific SQL Writer.  
If you run <code>vssadmin list writers<\/code> you&#8217;ll get a list of all the registered VSS Writers on the system:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"691\" height=\"317\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-7.png\" alt=\"\" class=\"wp-image-104\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-7.png 691w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-7-300x138.png 300w\" sizes=\"auto, (max-width: 691px) 100vw, 691px\" \/><\/figure>\n\n\n\n<p>This shows three writers (there are usually about 10 or so in a default installation).  The middle writer is clearly a problem here, as the State shows as &#8220;Failed&#8221; and Last Error shows as &#8220;Non-retryable error&#8221;.  As it turns out, this server has an application called &#8220;VSS Writer for Server 2016&#8221; installed &#8211; so this is almost certainly interfering with the operation of VSS.<\/p>\n\n\n\n<p>But let&#8217;s continue down another tunnel in the rabbit hole&#8230;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Tier 3 &#8211; &#8220;Knee deep in the dead&#8221;<\/h2>\n\n\n\n<p>When poking around, you may come across something that just doesn&#8217;t make sense.  
On a number of other servers we investigated, we found an interesting registry key in the VSS providers which was completely empty:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"495\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-8-1024x495.png\" alt=\"\" class=\"wp-image-106\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-8-1024x495.png 1024w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-8-300x145.png 300w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-8-768x371.png 768w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-8.png 1031w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>This key, <code>3eb85257-e0b0-4788-a080-d39aa6a515af<\/code>, was seen on a number of machines that had vague VSS errors in their event logs.  The only thing we could guess was that an older version of the Azure Site Recovery agent had at some point been removed and some artifacts were left behind.  But this was speculation.  On one test server, we found that removing that empty key and restarting the VSS service cleared up some VSS errors, and actually caused ASR app-consistent recovery points to be taken properly.<\/p>\n\n\n\n<p>To see precisely how many servers were affected, I wrote the following script.  
The output was a CSV file that I could then filter to get the exact list of servers.<\/p>\n\n\n\n<p><code># Server list is a CSV with a Name column<br>$allServers = Import-Csv $pathToServerList<br>$array = @()<br>foreach ($server in $allServers) {<br>   try {<br>      # -ErrorAction Stop turns connection failures into catchable errors<br>      $array += Invoke-Command -ComputerName $server.Name -ErrorAction Stop -ScriptBlock {<br>         Get-ChildItem -Path 'HKLM:\\SYSTEM\\CurrentControlSet\\Services\\VSS\\Providers'<br>      }<br>   } catch {<br>      Write-Warning \"Could not query $($server.Name): $_\"<br>   }<br>}<br># One line per provider key: key name, value names, source server<br>foreach ($item in $array) {<br>   \"$($item.Name),$($item.Property),$($item.PSComputerName)\" | Out-File $pathToOutputFile -Append<br>}<\/code><\/p>\n\n\n\n<p>You&#8217;ll need to define <code>$pathToServerList<\/code> and <code>$pathToOutputFile<\/code> first, but this gave me a list of all servers and all the providers.  I could then sort \/ filter on the empty one we identified above, and then with the following code snippet was able to remove the key and restart the VSS service cleanly.<\/p>\n\n\n\n<p><code>$allServers = Import-Csv $pathToServerList<br>foreach ($server in $allServers) {<br>   try {<br>      Invoke-Command -ComputerName $server.Name -ErrorAction Stop -ScriptBlock {<br>         # Remove the empty provider key, then restart VSS so the change is picked up<br>         Remove-Item -Path 'HKLM:\\SYSTEM\\CurrentControlSet\\Services\\VSS\\Providers\\{3eb85257-e0b0-4788-a080-d39aa6a515af}' -Confirm:$false -ErrorAction Stop<br>         Restart-Service -Name VSS<br>      }<br>   } catch {<br>      Write-Warning \"Failed on $($server.Name): $_\"<br>   }<br>}<\/code><\/p>\n\n\n\n<p>Don&#8217;t judge my code, remember, I&#8217;m <a rel=\"noreferrer noopener\" href=\"https:\/\/ironscripter.us\/factions\/\" target=\"_blank\">#BattleFaction<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Tier 4 &#8211; &#8220;Where we&#8217;re going, we don&#8217;t need roads&#8230;&#8221;<\/h2>\n\n\n\n<p>There are some seriously deep holes, and one of the things about troubleshooting in general is that you want to be careful not to follow any one hole too deeply &#8211; you&#8217;ll just end up chasing something that doesn&#8217;t truly matter in solving the issue at hand.  
Case in point&#8230;<\/p>\n\n\n\n<p>While troubleshooting the above, I found an Azure Site Recovery log at <code>C:\\Program Files (x86)\\Microsoft Azure Site Recovery\\agent\\Application Data\\ApplicationPolicyLogs\\vacp.log<\/code> that contained the following error.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"78\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-11-1024x78.png\" alt=\"\" class=\"wp-image-112\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-11-1024x78.png 1024w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-11-300x23.png 300w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-11-768x59.png 768w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-11-1536x117.png 1536w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/04\/image-11.png 1585w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>In the log you see a couple of highlighted errors.  Searching online for this error, I landed on the protocol definition for MS-FSRVP (File Server Remote VSS Protocol):<\/p>\n\n\n\n<p><a href=\"https:\/\/learn.microsoft.com\/en-us\/openspecs\/windows_protocols\/ms-fsrvp\/dae107ec-8198-4778-a950-faa7edad125b\">[MS-FSRVP]: File Server Remote VSS Protocol | Microsoft Learn<\/a><\/p>\n\n\n\n<p>This was an interesting reference document, but not exactly helpful for the troubleshooting at hand.  It can be easy to follow the trail and get lost, so at some point in the research efforts it&#8217;s helpful to ask whether you&#8217;ve gone too far.  
You may learn something (which is always a plus), but it may not be useful in solving the immediate problem at hand.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">End of the trail<\/h2>\n\n\n\n<p>If you&#8217;ve made it this far, kudos to you.  Troubleshooting is a skill that I honestly see being in short supply.  Error messages will pop up and people will have no clue what to do next.  It&#8217;s a skill that everyone should have.  Be curious.  Figure out how to solve a problem, and endeavor to learn new things along the way.  It will only make you a more valuable resource!<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I was once asked in an interview years ago the following question: &#8220;User calls and says their printer isn&#8217;t working. How would you troubleshoot this?&#8221; It&#8217;s a silly question at [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[24,25],"tags":[27,28],"class_list":["post-94","post","type-post","status-publish","format-standard","hentry","category-troubleshooting","category-vss","tag-troubleshooting","tag-vss"],"_links":{"self":[{"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=\/wp\/v2\/posts\/94","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=94"}],"version-history":[{"count":11,"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=\/wp\/v2\/posts\/94\/revisions"}],"predecessor-version":[{"i
d":118,"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=\/wp\/v2\/posts\/94\/revisions\/118"}],"wp:attachment":[{"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=94"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=94"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=94"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}