VidiFlow Troubleshooting [GL OG]

Name Issue with the Manual Deletion and Recreation of a Service in Service Fabric

A service in Service Fabric can be restarted by deleting and recreating its instance in Service Fabric Explorer. Note, however, that not all service template names are identical to the names of the running services in VidiFlow.

Solution: Double-check the name when recreating the service. Ideally, copy the service name from the deletion confirmation dialog and paste it when recreating the service.

Deleting a Service from Service Fabric Takes Longer Than Expected

Deleting a service from inside Service Fabric will result in that service no longer being displayed in Service Fabric. However, in rare circumstances the service may still be running. Recreating the service in Service Fabric Explorer during this period could mean that the existing service is reused.

Solution: Ensure the service is actually deleted by checking the Task Manager on the host machine where the service was running before.

Workflows Cancelled in VidiFlow Monitoring Remain in Camunda Queue

Solution: Use the following PowerShell script to delete cancelled workflows from Camunda. Replace KUBERNETESENDPOINT and the workflow ID where required:

PowerShell Script

Clear-Host

# Replace KUBERNETESENDPOINT and the process definition ID as required.
$baseUri = "http://KUBERNETESENDPOINT:31080/engine-rest/process-instance"
$listUri = $baseUri + "/?processDefinitionId=WF_FileIngest:29:262dac29-1366-11e9-8326-fa5a5ad76dd7"

# List all process instances of this workflow definition.
$workflowinstances = Invoke-RestMethod -Uri $listUri
"Startcount: " + $workflowinstances.Count

foreach ($instance in $workflowinstances)
{
    "Deleting instance " + $instance.id

    # Cancel the activity first (replace the activityId if required), skipping listeners and I/O mappings.
    $result1 = Invoke-RestMethod -Uri ($baseUri + "/" + $instance.id + "/modification") -Method Post -ContentType "application/json" -Body '{"skipCustomListeners": true,"skipIoMappings": true,"instructions": [{"type": "cancel","activityId": "Task_0a7mr03"}]}'
    $result1

    # Then delete the process instance itself.
    $result = Invoke-RestMethod -Uri ($baseUri + "/" + $instance.id) -Method Delete
    $result
}

# Query again to verify that the instances are gone.
$workflowinstances = Invoke-RestMethod -Uri $listUri
"Endcount: " + $workflowinstances.Count

ConfigPortal API is Slow and Services Fail to Start / Retrieve their Configuration

Work through the following steps when services fail to start, or when the ConfigPortal WebAPI (Swagger) is extremely slow to respond:

1. Garbage collection in the Git repository: To improve performance, a regular cron job in k8s performs garbage collection on the configuration Git repository (the command it executes is shown below). In some circumstances, the job may fail to finish or run into issues. In this case, try running the following command manually on the machine where the GitRepo is located (git must be installed):

Git Bash

git -C '\\PATH\TO\GITREPO' gc

If this fails, try running the garbage collection with the additional options --aggressive --prune=now, which sometimes helps to resolve the issue.

2. Empty the CachedData table in ConfigPortal’s Cache DB: truncate table CachedData

3. Restart ConfigPortal API and Notification service afterwards

CreateItem in Vidispine Takes a Long Time

By default, Vidispine waits 5 minutes before it registers a file as closed (no longer growing). This timeout is the closeLimit parameter in Vidispine. It can be modified by sending a PUT to {{k8surl}}:{{vidispinePort}}/API/storage/STORAGE-VX-{{ID}}/metadata/closeLimit with a plain-text body of 1 (for 1 minute). Ensure that the correct storage ID is used in the URL.
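The request described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the host, port and storage ID are hypothetical placeholders, build_close_limit_request is a made-up helper name, and the authentication required by the installation is not shown. The sketch builds the request without sending it, so it can be inspected first.

```python
import urllib.request


def build_close_limit_request(base_url, storage_id, minutes):
    """Build (but do not send) the PUT request that sets closeLimit on a storage."""
    url = "{}/API/storage/{}/metadata/closeLimit".format(
        base_url.rstrip("/"), storage_id
    )
    return urllib.request.Request(
        url,
        data=str(minutes).encode("utf-8"),  # plain-text body, e.g. "1"
        headers={"Content-Type": "text/plain"},
        method="PUT",
    )


# Hypothetical values: replace with the real k8s URL, Vidispine port and storage ID.
request = build_close_limit_request("http://k8s.example.local:31060", "STORAGE-VX-1", 1)
print(request.get_method(), request.full_url)
```

To actually apply the change, the request would be passed to urllib.request.urlopen after adding whatever credentials the Vidispine API expects in the given environment.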

Changing Service Fabric Configuration on a Running Local Cluster

To edit configuration details of an existing cluster, such as the number of ports available to clients of the cluster, take the following steps:

1. Duplicate the cluster configuration JSON and give the copy a distinct name.
2. Modify the copied cluster JSON and give it a new version number at the top of the file. Example: change 2.0.0 to 2.1.0.
3. Execute TestConfiguration.ps1 with two parameters:

-ClusterConfigFilePath "path to edited json"
-OldClusterConfigFilePath "path to original json"

4. Execute the following command in PowerShell: Connect-ServiceFabricCluster

5. This should return "True" and a few connection and gateway details.

6. Then execute Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath "path to edited json".
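The first two steps (duplicating the JSON and raising its version number) can be scripted. The following is a minimal sketch, assuming the version is stored as a three-part string under a top-level key named clusterConfigurationVersion; the key name is passed as a parameter in case the file uses a different one, and both function names are hypothetical.

```python
import json


def bump_minor(version):
    """Return the version with the minor part incremented, e.g. 2.0.0 -> 2.1.0."""
    major, minor, _patch = (int(part) for part in version.split("."))
    return "{}.{}.0".format(major, minor + 1)


def make_upgraded_copy(source_path, target_path,
                       version_key="clusterConfigurationVersion"):
    """Write a copy of the cluster JSON to target_path with a bumped version.

    The original file is left untouched so that it can still be passed to
    TestConfiguration.ps1 as the old configuration.
    """
    with open(source_path, encoding="utf-8") as handle:
        config = json.load(handle)
    config[version_key] = bump_minor(config[version_key])
    with open(target_path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
    return config[version_key]
```

The actual configuration change (for example the port range) still has to be made in the copied file before running TestConfiguration.ps1.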

ConfigPortal WebAPI has Issues Writing to GitRepository

If the ConfigPortal WebAPI fails when writing a .git file, this may be because some file systems (for example Isilon) do not allow the creation of hidden files.

Solution: Check with the system administrator for different user permissions. Alternatively, use a standard SMB share for GitRepository.

VidiFlow Installation Fails with Setup Error MSI 1603

ERROR

ERROR: Execution of deployment step with name: 'Deploy ______ Service' has failed.

ERROR: [________ deployment failed with MSI error code 1603].

Solution: If these errors occur, check the log file of the service's installer. By default, the log file is located in the VidiFlow installation folder/drop under \Deployment\Scripts\logs.

Every Service in Service Fabric Displayed as Red

Solution: The most common point of failure for widespread errors is the Authentication Service, which relies on the database. Check whether the database is running and reachable with the VidiFlow user. Alternatively, check the Authentication log file for more specific errors.

VidiFlow Web Frontends Remain Empty / Show Only White

Solution: Ensure that the exact fully-qualified domain name, specified when setting up the system, is used. If that is the case, check if the service is operational.

Kubernetes Nodes Stuck in "Registering" State

Solution: Ensure that the ports required for Kubernetes communication are open in the Linux firewall.

Staging in ConfigPortal Fails with "System name is more than 1"

For staging, both the system name and its GUID must be identical across environments. If the names are set manually in multiple environments, without using staging, the names may consist of the same string but have different GUIDs.

Solution: Ideally, the system name should be set automatically through staging. If a system name is already set, retrieve the definition via the CP API (Swagger) in the first environment to get the exact system name and GUID, and update it in the new environment via the CP API as well. Alternatively, delete it in the target environment and retrigger staging to set it again automatically.

MediaPortal Connector Does not Sync Item to MediaPortal

The MediaPortal Connector will at times filter out items, preventing these from synchronizing to MediaPortal. There are multiple reasons why this may occur:

No proxy shape tag on a video item
V3_hidden flag set to “true”
MediaType / V3_ExpectedMediaType not “video”
Vidispine title metadata is empty (this applies only to releases prior to MediaPortal 19.1).

Camunda Engine Reports "<var> was found more than once in the input"

When sending HTTP POST to:

EXAMPLE

'http://KUBERNETESENDPOINT:31080/engine-rest/external-task/ffbbe30f-af92-11e9-94d9-8ab94683cec2/failure'. Content: '{"workerId":"platformWorkerId","errorMessage":"Agent taskEvent ffbbe30f-af92-11e9-94d9-8ab94683cec2 failed: The data contract type 'Platform.Vidispine.TranscodeFile.Contracts.TranscodeProxyInternalTask' cannot be deserialized because the data member 'ItemId' was found more than once in the input.","errorDetail":"Agent taskEvent ffbbe30f-af92-11e9-94d9-8ab94683cec2 failed: The data contract type 'Platform.Vidispine.TranscodeFile.Contracts.TranscodeProxyInternalTask' cannot be deserialized because the data member 'ItemId' was found more than once in the input.","retries":0,"retryTimeout":0}'

Solution: This may occur if the Camunda workflow engine parses the workflow incorrectly. Try to identify the offending action and remove it in the workflow, add it again, and then test the new workflow version.
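To help identify the offending action, it can be useful to check an exported payload for repeated member names. The sketch below assumes that the task input is available as JSON text (the serialization format actually used by the agent is not stated here) and uses a hypothetical helper name; it reports every key that occurs more than once within the same JSON object.

```python
import json


def find_duplicate_keys(json_text):
    """Return the names of keys that occur more than once within one JSON object."""
    duplicates = []

    def collect(pairs):
        # Called by the parser for every JSON object with its (key, value) pairs.
        seen = set()
        for key, _value in pairs:
            if key in seen:
                duplicates.append(key)
            seen.add(key)
        return dict(pairs)

    json.loads(json_text, object_pairs_hook=collect)
    return duplicates


# Hypothetical payload in which ItemId was mapped twice.
print(find_duplicate_keys('{"ItemId": "VX-1", "Title": "a", "ItemId": "VX-2"}'))
```

A plain json.loads call would silently keep the last value of a repeated key, which is why the pairs hook is used here.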

RabbitMQ Error While Waiting for Mnesia Tables on Startup

On startup, RabbitMQ may log an error while waiting for Mnesia tables and then perform retries.

EXAMPLE

"Waiting for Mnesia tables for 30000 ms, x retries left".

Solution: This issue occurs when RabbitMQ has problems synchronizing with other running instances, or when it previously ran with more instances than it does now.

The simplest approach to solving this issue is to throw away RabbitMQ queue data and allow it to perform a fresh resync.

First, scale down RabbitMQ deployment to 0 instances.
Then, on each node in which the local rabbitmq-data folder resides, switch to the directory “/vpmsdata/rabbitmq-data/data/”.
Then, delete the data inside that folder with "rm -rf *".
Finally, scale up again and check to see if the error is no longer present in the logs.

RabbitMQ or Other Stateful Services with Local Data do not Recognize the Mount

In some cases when modifying the local mount directory, RabbitMQ or other stateful services with local mountpoints will not start.

Example error message:

"mount failed: exit status 32 Mounting command: mount Mounting arguments: -o bind /vpmsdata/rabbitmq-data/data /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~local-volume/pv-7ttjp Output: mount: mount(2) failed: No such file or directory)

Solution: This mostly happens on clusters set up with Rancher. It may help to restart the kubelet pods on the affected machines to recognize the correct local mount point. For Rancher / RKE installations: make sure that the bind mounts are set for kubelet. Follow the setup instructions for details.

Kubernetes Reports Error "PLEG is not healthy"

PLEG (Pod Lifecycle Event Generator) is a very basic health check which runs docker ps in the background on every node and updates a timestamp on completion. If this timestamp is not updated within a set interval (by default 3 minutes), the health check fails and Kubernetes may not behave as expected.

Solution: This is most often caused by performance issues in the container environment, or in rare cases, the VM itself. Try removing unused containers with the following command on every node:

docker container rm $(docker container ls -aq)

The node where the PLEG has failed will likely be stuck in the process of removing containers. If that's the case, a reboot may help, then run the command again afterwards.

Long Directory Filenames Cannot be Deleted in Windows

Even with long filename support enabled in Windows, it is possible that Service Fabric applications create files and folder structures that are too long for Windows to handle.

Solution: A possible solution involving the use of robocopy is described in the SuperUser thread referenced below and works for deleting obscure Service Fabric Logstash folders:

https://superuser.com/a/1048242/403461

VidiFlow Installation Reports Error “settings.xml already exists”

Installing VidiFlow into Service Fabric fails with the error "settings.xml for CamundaBroker (or other agent) already exists on the host".

Solution: In PowerShell:

Connect-ServiceFabricCluster

Get-ServiceFabricImageStoreContent => shows all packages in the image store of Service Fabric

Remove-ServiceFabricApplicationPackage -ApplicationPackagePathInImageStore [Package Name] => removes all versions of the package

Then rerun the deployment.

High Resource Usage on DB "VPMS_ProcessEngine"

The Prometheus plugin in Camunda can cause high resource usage because it continuously writes to and reads from the RU_METER_LOG table in the VPMS_Platform_ProcessEngine database. This may also cause storage usage to grow quickly.

Solution: Check whether old entries are properly cleaned up from the table.

It is also possible to disable the reporting (not recommended): in the Camunda-Config configmap, look for the Prometheus plugin and comment it out:

XML

<plugin>
  <class>io.digitalstate.camunda.prometheus.PrometheusProcessEnginePlugin</class>
  <properties>
    <property name="port">9199</property>
    <property name="camundaReportingIntervalInSeconds">15</property>
    <property name="collectorYmlFilePath">/camunda-prometheus/prometheus-metrics.yml</property>
  </properties>
</plugin>

Large Number of Thumbnails: Thumbnail Hierarchy

By default, thumbnails in VS / VidiFlow are stored as folders, one for each item, in a flat directory. However, once a few million assets have been ingested, and depending on your storage solution, this may result in issues.

Solution: Vidispine offers a system property called "thumbnailHierarchy", which when set automatically splits folder IDs into subfolders:

https://apidoc.vidispine.com/latest/system/thumbnails.html#using-a-tree-structure-for-thumbnails

This can be changed to a hierarchy level of 10000. The following script allows modifying this in a running system, by moving existing keyframe folders into the same storage structure.

Call for Thumbnail Hierarchy Value

PUT /API/configuration/properties
Body:
<ConfigurationPropertyDocument xmlns="http://xml.vidispine.com/schema/vidispine">
<key>thumbnailHierarchy</key>
<value>10000</value>
</ConfigurationPropertyDocument>

Python script for moving the existing files retroactively (execute in main thumbnail folder):

PY

#!/usr/bin/env python3
"""Move existing thumbnail folders into the thumbnailHierarchy=10000 layout.

Execute in the main thumbnail folder. Run it only once: a second run would
treat the newly created parent folders as item folders.
"""
import os
import shutil

TEMP_DIR = "temp"

# Stage the new layout under TEMP_DIR so that new parent folders (e.g. VX-1)
# cannot collide with item folders of the same name that are still in place.
for name in os.listdir("."):
    parts = name.split("-")
    if not os.path.isdir(name) or len(parts) != 2 or not parts[1].isdigit():
        continue
    prefix, number = parts
    if len(number) >= 5:
        # e.g. VX-12345 -> VX-1/2345
        parent, leaf = prefix + "-" + number[:-4], number[-4:]
    else:
        # e.g. VX-123 -> VX-0/0123
        parent, leaf = prefix + "-0", number.zfill(4)
    os.makedirs(os.path.join(TEMP_DIR, parent), exist_ok=True)
    shutil.move(name, os.path.join(TEMP_DIR, parent, leaf))

# Move the staged parent folders back into the main thumbnail folder.
if os.path.isdir(TEMP_DIR):
    for parent in os.listdir(TEMP_DIR):
        shutil.move(os.path.join(TEMP_DIR, parent), ".")
    shutil.rmtree(TEMP_DIR)
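Before moving a large number of folders, it may be worth previewing where each folder would end up. The following sketch reproduces only the name mapping used by the script above, without touching the file system; hierarchy_path is a hypothetical helper name, and the VX- folder names are sample values.

```python
def hierarchy_path(folder_name):
    """Return the relative target path for an item folder.

    Returns None if the name does not look like PREFIX-NUMBER.
    """
    parts = folder_name.split("-")
    if len(parts) != 2 or not parts[1].isdigit():
        return None
    prefix, number = parts
    if len(number) >= 5:
        # Everything except the last four digits names the parent folder.
        return prefix + "-" + number[:-4] + "/" + number[-4:]
    # Short IDs go to parent 0 and are padded to four digits.
    return prefix + "-0/" + number.zfill(4)


for sample in ["VX-7", "VX-123", "VX-12345", "VX-1234567", "temp"]:
    print(sample, "->", hierarchy_path(sample))
```

Feeding the output of os.listdir for the thumbnail folder through this function gives a dry run of the migration.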

DNS Resolution Broken in k8s Cluster (May Happen After Kubespray Upgrade)

Double-check that the kubelet config has the right IP for the DNS server.

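First, check that the CoreDNS pods are running with this command:

kubectl get pods --namespace=kube-system -l k8s-app=kube-dns

The DNS IP configured in kubelet is normally the cluster IP of the corresponding DNS service, which can be queried with:

kubectl get svc --namespace=kube-system -l k8s-app=kube-dns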

Then, on each node where kubelet is running, check in /etc/kubernetes/kubelet-config.yaml that the IP listed under clusterDNS is the same. If it is not, change it to the correct IP from above and restart kubelet. Repeat this on each kubelet node.

Then restart all running pods in the cluster.

For additional information, this doc from the Kubernetes organisation is very helpful: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
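The clusterDNS comparison described above can be automated per node. The sketch below is a deliberately simple line-based reader that assumes clusterDNS is written as a block list (one "- IP" entry per line), as in the sample text; it does not handle other YAML notations, and both function names as well as the sample IPs are hypothetical.

```python
def read_cluster_dns(config_text):
    """Return the IPs listed under the clusterDNS key of a kubelet config."""
    addresses = []
    in_block = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("clusterDNS:"):
            in_block = True
            continue
        if in_block:
            if stripped.startswith("- "):
                addresses.append(stripped[2:].strip().strip("'\""))
            elif stripped:
                break  # the next key ends the clusterDNS list
    return addresses


def dns_matches(config_text, expected_ip):
    """Check whether the expected DNS IP is configured for kubelet."""
    return expected_ip in read_cluster_dns(config_text)


# Sample content; on a node, read /etc/kubernetes/kubelet-config.yaml instead.
sample = (
    "apiVersion: kubelet.config.k8s.io/v1beta1\n"
    "clusterDNS:\n"
    "- 10.233.0.3\n"
    "clusterDomain: cluster.local\n"
)
print(read_cluster_dns(sample), dns_matches(sample, "10.233.0.3"))
```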