In this blog post, I record several problems that I ran into while coding. I hope the content is helpful.
Install the NVIDIA Driver, CUDA, and cuDNN on Ubuntu
- Official installation tutorial: link
- Use the command line: Tutorial
- Search for PPAs that can be used to install packages via apt-get: link

Following the steps in this tutorial:
# add ppa and find suitable nvidia-driver
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
apt-cache search nvidia-driver
# sudo apt-get install nvidia-driver-version
sudo apt-get install nvidia-440

# another way to install nvidia driver
ubuntu-drivers devices
sudo ubuntu-drivers autoinstall

# add keys for your specific ubuntu version, be careful about ubuntu1x04
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
# if error message:
# using the command:
# wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
# sudo apt-key add 7fa2af80.pub
sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/cuda.list'
sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/cuda_learn.list'
sudo apt-get update
sudo apt install cuda-10-1
sudo apt install libcudnn7

# add the following lines to ~/.bashrc
# set PATH for cuda installation
if [ -d "/usr/local/cuda/bin/" ]; then
    export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
fi

reboot -i

# check settings
nvcc --version
/sbin/ldconfig -N -v $(sed 's/:/ /' <<< $LD_LIBRARY_PATH) 2>/dev/null | grep libcudnn
nvidia-smi
- Reboot the system
Then the NVIDIA driver will be installed smoothly. After that, we should install CUDA.
Check the CUDA version: cat /usr/local/cuda/version.txt.
If we want to download the packages ourselves, we should first check the website to find the right version for our system. The best way to install is to follow the official guide.
There is also a helpful tutorial for installing CUDA and cuDNN on Ubuntu.
Debian CUDA install
Recently, I needed to install CUDA 11.1 and the NVIDIA driver on Debian 11. Here is what I learned from the experience.
The key point is that, from my testing, the CUDA packages (from the Debian 10 repository used below) work smoothly on Debian 11.
The quicker method is sudo apt install nvidia-cuda-toolkit. However, it only installs the latest version; in my case that was CUDA 11.2, which PyTorch did not yet support.
I then tried several methods from the internet, but the working solution is essentially the same as the previous one.
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt install nvidia-driver
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/debian10/x86_64/7fa2af80.pub
sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/cuda/repos/debian10/x86_64 /" > /etc/apt/sources.list.d/cuda.list'
sudo apt-get update
sudo apt install cuda-11-1
if [ -d "/usr/local/cuda/bin/" ]; then
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
fi
reboot -i
Then just enjoy the new world.
Conda Usage
1. Search available Python versions
conda search "^python$"
2. Create or remove a conda environment
conda remove --name myenv --all
conda create -n env python==3.6
TensorFlow Usage Problem Records
1. TensorFlow does not use the GPU
In this case, I had actually installed CUDA and the NVIDIA driver beforehand, so the cause was that I had not added cuda/bin and the related libraries to .bashrc.
Adding the following lines to the .bashrc file solves the problem.
export PATH=/usr/local/cuda-10.2/bin/:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Besides, here is a useful tutorial for using TensorFlow on a GPU.
TensorFlow 2.1 does not work with CUDA 10.2 because some required libraries are missing, so I reinstalled CUDA 10.1 for tensorflow2.1-gpu.
Test TensorFlow GPU support:
import tensorflow as tf
tf.test.is_gpu_available(cuda_only=False,min_cuda_compute_capability=None)
2. Install TensorRT for the TensorFlow 2 GPU version
Official tutorial: TensorRT is not a must for tensorflow-gpu; it is just useful for speeding up inference.
3. Change CUDA Version
ll /usr/local/cuda
# re-point the cuda symlink at the desired version (usage: ln -sfn <cuda dir> <link name>)
sudo ln -sfn /usr/local/cuda-7.5 /usr/local/cuda
4. Cannot visit tensorflow.org
official website
In this case, the following steps work on Mac; on Linux, edit the corresponding hosts file.
- Edit /private/etc/hosts
- Add 64.233.188.121 www.tensorflow.org
5. TensorFlow GPU Allocation Problem
By default, TensorFlow allocates all GPU memory for efficiency. There are two ways to restrict the memory it uses, shown in the code below. More details are available at the link.
tf.config.experimental.set_memory_growth:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
tf.config.experimental.set_virtual_device_configuration:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
6. PyTorch: set the GPU in Jupyter
import torch

print(torch.cuda.device_count())  # list visible GPUs
device = torch.device("cuda:6" if torch.cuda.is_available() else "cpu")  # set the GPU for the device
model_ft = torch.nn.DataParallel(model_ft, device_ids=[0])  # set the GPUs for the model
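After picking the device, the model and each batch also have to be moved onto it. A minimal sketch; inputs and labels are hypothetical placeholder names for a batch, only model_ft and device come from the snippet above.
model_ft = model_ft.to(device)                           # move the model parameters to the chosen GPU
inputs, labels = inputs.to(device), labels.to(device)    # move the current batch before the forward pass
outputs = model_ft(inputs)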
7. Data Loader Problem
raw_train_X = next(iter(train_dataloader))[0].numpy() # (100000, 3, 64, 64)
raw_train_Y = next(iter(train_dataloader))[1].numpy() # (100000, )
The code above is wrong: each call to next(iter(train_dataloader)) creates a new iterator and draws a new batch, so X and Y no longer correspond to each other. A corrected version is sketched below.
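A minimal fix, using the same train_dataloader as above: draw the batch once and unpack both tensors from that single draw so they stay aligned.
batch_X, batch_Y = next(iter(train_dataloader))  # one draw, so images and labels come from the same batch
raw_train_X = batch_X.numpy()
raw_train_Y = batch_Y.numpy()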
8. Check label counts
import numpy as np

unique, counts = np.unique(sy_train, return_counts=True)
print(counts)
9. Allocation problem
Warning: tensorflow/core/framework/allocator.cc:101] Allocation of X exceeds 10% of system memory
Solution: the main problem is that the batch_size is too big. Sometimes the problem lies in related settings, such as the shuffle buffer in tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(). A sketch of both adjustments is shown below.
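A minimal sketch, reusing x_train and y_train from above; the buffer size and batch size are just illustrative values, not tuned numbers.
import tensorflow as tf

train_ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .shuffle(buffer_size=1024)   # keep the shuffle buffer modest instead of buffering everything
            .batch(32))                  # a smaller batch_size keeps each allocation small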
10. How to use @tf.function
[Needs digging into]
When I wanted to modify the gradients in a custom train_step function in TensorFlow 2, I found that without @tf.function on train_step the accuracy grew very slowly. After adding @tf.function, training behaves normally, regardless of the optimizer and data type. The related code is here.
@tf.function
def fix_train_step(model, images, labels, all_mask):
    with tf.GradientTape() as tape:
        predictions = model(images)
        loss = loss_object(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    for i in range(len(all_mask)):
        gradients[i] = gradients[i] * all_mask[i]
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss(loss)
    train_accuracy(labels, predictions)
11. Problem after suspending the machine
After suspending Ubuntu, I hit the error failed call to cuInit: CUDA_ERROR_UNKNOWN and could not use the GPU. Rebooting can probably solve the problem; in my case, reinstalling the NVIDIA driver (version 440) solved it.
Related link
12. Save Model Problem in TensorFlow 2
When I try to save a Keras subclassed model, I hit a "no bounded node" error as soon as I use tf.keras.models.Model to get intermediate layers' outputs. So in TensorFlow 2, a suitable way to save and load a model is as follows.
# save model weights
model.save_weights(MODEL_FILEPATH + 'weight.h5')
model.load_weights(MODEL_FILEPATH + "weight.h5")

# save the whole model
tf.keras.models.save_model(model, MODEL_FILEPATH)
model = tf.keras.models.load_model(MODEL_FILEPATH)
13. Check whether a GPU is available in TensorFlow:
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
tf.config.list_physical_devices('GPU')
The TensorFlow version compatibility details.
Other Problems
1. Install cuDNN on Debian
The steps for installing cuDNN on Debian differ from those on Ubuntu, and the paths of the CUDA-related packages are not the same either.
Step 1: Download cuDNN from the NVIDIA official website.
Step 2: Add the related libraries to the path /usr/lib/x86_64-linux-gnu/. A good way to find the path is find . -name libcublas.so.10.
Step 3: Check that cuDNN is installed correctly, for example with the sketch below.
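One way to check, assuming TensorFlow (or another cuDNN-backed framework) is already installed in the environment: if cuDNN is found and loadable, the GPU shows up and no "Could not load dynamic library 'libcudnn...'" warning appears at import time.
import tensorflow as tf

print(tf.test.is_built_with_cuda())            # True if this TensorFlow build links against CUDA
print(tf.config.list_physical_devices('GPU'))  # non-empty list if the driver/CUDA/cuDNN stack works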
2. Use matplotlib to draw 3D gradient descent pictures
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import *
from mpl_toolkits import mplot3d  # used for drawing 3D figures

# gradients (derivatives) of the objective function
def gradJ1(theta):
    return 4*theta

def gradJ2(theta):
    return 2*theta

# the objective function
def f(x, y):
    return 2*x**2 + y**2

def ff(x, y):
    return 2*np.power(x, 2) + np.power(y, 2)

def train(lr, epoch, theta1, theta2, up, dirc):
    t1 = [theta1]
    t2 = [theta2]
    for i in range(epoch):
        gradient = gradJ1(theta1)
        theta1 = theta1 - lr*gradient
        t1.append(theta1)
        gradient = gradJ2(theta2)
        theta2 = theta2 - lr*gradient
        t2.append(theta2)
    plt.figure(figsize=(20, 10))  # set the canvas size
    x = np.linspace(-3, 3, 30)
    y = np.linspace(-3, 3, 30)
    X, Y = np.meshgrid(x, y)
    Z = f(X, Y)
    ax = plt.axes(projection='3d')
    print(t1, t2, ff(t1, t2))
    # ax.scatter(t1, t2, ff(t1,t2), c='black', marker='*', linewidth=1)
    ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap='viridis', edgecolor='none', alpha=0.9)  # surface plot
    # ax.plot_wireframe(X, Y, Z, color='c')  # wireframe plot
    # ax.contour3D(X, Y, Z, 50, cmap='binary')  # contour plot
    # ax.scatter3D(t1, t2, ff(t1,t2), c='black', marker='o')
    ax.plot(t1, t2, ff(t1, t2), c='black', marker='o', markersize=5, zorder=5)
    # ax.plot_wireframe(t1, t2, ff(t1,t2))
    # ax.plot3D(t1, t2, ff(t1,t2), 'red')
    # adjust the elevation and azimuth of the view, e.g. elevation 60 degrees and azimuth 35 degrees
    ax.view_init(up, dirc)
    plt.savefig("./temp.png")

# the sliders can be adjusted at any time to see the effect: (min, max, step)
@interact(lr=(0, 2, 0.0002), epoch=(1, 100, 1), init_theta1=(-3, 3, 0.1), init_theta2=(-3, 3, 0.1), up=(-180, 180, 1), dirc=(-180, 180, 1), continuous_update=False)
# lr is the learning rate (step size), epoch the number of iterations, init_theta the initial parameters;
# up adjusts the vertical viewing angle and dirc the horizontal one
def visualize_gradient_descent(lr=0.05, epoch=10, init_theta1=-2, init_theta2=-3, up=60, dirc=60):
    train(lr, epoch, init_theta1, init_theta2, up, dirc)
3. git add new repo
4. Out of memory in PyTorch
During training, GPU memory usage keeps growing because tensors such as the loss and the model output stay referenced. Deleting them when they are no longer needed is a simple way to decrease memory usage; see the sketch after the snippet.
del cost, out
print("\nall", torch.cuda.memory_allocated())
5. Mac New Application "Damaged" Problem
- Disable Gatekeeper: sudo spctl --master-disable
- Enable it again: sudo spctl --master-enable
- Or remove the quarantine attribute: xattr -r -d com.apple.quarantine <path>
- Example: xattr -r -d com.apple.quarantine /Applications/PDF\ Expert.app
6. Display pictures from a remote Ubuntu machine on the local machine
- Install xquartz on the Mac from the link.
- Use ssh -X <remote server> to enable X11 forwarding.
- Run a Python script with matplotlib to test the connection.
OpenGL problems when enabling X11 forwarding:
libGL error: No matching fbConfigs or visuals found
libGL error: failed to load driver: swrast
set parameters:
export LIBGL_ALWAYS_INDIRECT=1
export LIBGL_DEBUG=verbose
export LIBGL_ALWAYS_SOFTWARE=1
- Open3D problem
[Open3D WARNING] GLFW Error: GLX: Forward compatibility requested but GLX_ARB_create_context_profile is unavailable
[Open3D WARNING] Failed to create window
[Open3D WARNING] [DrawGeometries] Failed creating OpenGL window.
Checking with clinfo / glxinfo gives OpenGL version string: 1.4 (2.1 INTEL-16.1.7), so the system needs to be switched to the NVIDIA driver with the following commands:
nvidia-settings
sudo prime-select nvidia
OpenGL version with the GPU: https://opengl.gpuinfo.org/displayreport.php?id=5738
7. How to install an editable Python package
Basically, setup.cfg and setup.py are configured as follows for the editable package.
# setup.cfg
[metadata]
name = local_structure
version = 0.1.0
[options]
packages = structure
# setup.py
import setuptools
setuptools.setup()
Then run the following command inside the package folder to install the package locally: python -m pip install -e . A quick check is sketched below.
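To confirm the editable install worked, the structure package from the setup.cfg above should import and resolve back to the source checkout rather than site-packages:
import structure

print(structure.__file__)   # for an editable ("develop") install this points into the source tree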
8. Information about file systems and the corresponding operating systems
Linux:
- Best recommendation: Ext4
- Not supported: exFAT, FAT
Mac:
- Not supported: Ext4
- Needs a kernel extension: NTFS
If Homebrew is not installed yet, install it first:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Install e2fsprogs:
brew install e2fsprogs
Plug the USB drive into the Mac and run:
diskutil list
Find the identifier of the USB drive; in my case it is /dev/disk2s1:
/dev/disk2 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:     FDisk_partition_scheme                        *31.0 GB    disk2
   1:                 DOS_FAT_32 KINGSTON                31.0 GB    disk2s1
Then format it:
diskutil unmountdisk /dev/disk2s1
sudo $(brew --prefix e2fsprogs)/sbin/mkfs.ext4 /dev/disk2s1
The command will ask for the user password; type y to confirm and wait a moment.
9. Notes on external disks on Ubuntu
When we mount an NTFS external disk on Ubuntu, all the files are owned by root and the permissions cannot be changed. That leads to permission problems when running software stored on it.
So the most suitable choice is an Ext4 external disk for Ubuntu.
10. LaTeX: one table spanning two columns
\usepackage{stfloats}
\begin{table*}[tp]
\centering
\begin{tabular}{c|ccc|ccc|ccc|ccc}
\hline
\multirow{2}{*}{Case} & \multicolumn{3}{c}{PointNet++(Random)} & \multicolumn{3}{c}{PointNet++(\Mname)} & \multicolumn{3}{c}{ResGCN-28(Random)} & \multicolumn{3}{c}{ResGCN-28(\Mname)} \\
& $L_2$ & Acc & mIoU & $L_2$ & Acc & mIoU & $L_2$ & Acc & mIoU & $L_2$ & Acc & mIoU \\
\hline
\hline
Best & \\
Average & \\
Worst & \\
\hline
\end{tabular}
\caption{The results of the non-targeted attack.}
\label{tab:nt-performance}
\end{table*}
11. Oh-my-zsh: completion commands with repeated words
12. Download Bilibili video automatically
you-get -l https://www.bilibili.com/video/BV1U7411a7xG\?p\=20 --debug
Use the you-get command; its options are listed as follows:
usage: you-get [OPTION]... URL...
A tiny downloader that scrapes the web
optional arguments:
-V, --version Print version and exit
-h, --help Print this help message and exit
Dry-run options:
(no actual downloading)
-i, --info Print extracted information
-u, --url Print extracted information with URLs
--json Print extracted URLs in JSON format
Download options:
-n, --no-merge Do not merge video parts
--no-caption Do not download captions (subtitles, lyrics, danmaku, ...)
-f, --force Force overwriting existing files
--skip-existing-file-size-check
Skip existing file without checking file size
-F STREAM_ID, --format STREAM_ID
Set video format to STREAM_ID
-O FILE, --output-filename FILE
Set output filename
-o DIR, --output-dir DIR
Set output directory
-p PLAYER, --player PLAYER
Stream extracted URL to a PLAYER
-c COOKIES_FILE, --cookies COOKIES_FILE
Load cookies.txt or cookies.sqlite
-t SECONDS, --timeout SECONDS
Set socket timeout
-d, --debug Show traceback and other debug info
-I FILE, --input-file FILE
Read non-playlist URLs from FILE
-P PASSWORD, --password PASSWORD
Set video visit password to PASSWORD
-l, --playlist Prefer to download a playlist
-a, --auto-rename Auto rename same name different files
-k, --insecure ignore ssl errors
Proxy options:
-x HOST:PORT, --http-proxy HOST:PORT
Use an HTTP proxy for downloading
-y HOST:PORT, --extractor-proxy HOST:PORT
Use an HTTP proxy for extracting only
--no-proxy Never use a proxy
-s HOST:PORT, --socks-proxy HOST:PORT
Use an SOCKS5 proxy for downloading
13. Recover deleted files on Ubuntu
14. How to make the local Mac proxy the Ubuntu server's network traffic
Step 1: Enable the Mac's SSH service and port-forwarding configuration.
Step 2: Set the Ubuntu proxy config:
# add the following two lines to ~/.bashrc
export https_proxy=127.0.0.1:1234
export http_proxy=127.0.0.1:1234
# set up the ssh tunnel; 7890 is the VPN proxy port
ssh -N -f -L localhost:1234:localhost:7890 jason@10.177.74.47  # <local machine ssh service>
# check the port-using process
lsof -ti:1234
# check the vpn service
curl -I https://google.com
15. A fast way to transfer files between remote servers with a progress bar
rsync -r --info=progress2 <files path> <username>@<remote server>:<destination path>
16. Jupyter cannot use a specific conda environment
A helpful link.
Basically, the main problem is that the system does not use the jupyter command from the conda environment; instead, it uses the system default version.
We can inspect sys.path to check whether we are running the correct interpreter, as in the sketch below.
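A minimal check, run inside a notebook cell; nothing here is specific to my setup.
import sys

print(sys.executable)   # should point into the conda environment, e.g. .../miniconda3/envs/<env>/bin/python
print(sys.path)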
If the result of sys.path looks like the following, the environment is correct.
['/home/jxu/random-fourier-features/examples',
'/home/jxu/miniconda3/envs/rf/lib/python37.zip',
'/home/jxu/miniconda3/envs/rf/lib/python3.7',
'/home/jxu/miniconda3/envs/rf/lib/python3.7/lib-dynload',
'',
'/home/jxu/miniconda3/envs/rf/lib/python3.7/site-packages',
'/home/jxu/miniconda3/envs/rf/lib/python3.7/site-packages/IPython/extensions',
'/home/jxu/.ipython']
Basically, I did not dig into the details of the problem this time, but I will list the solution here.
I had installed JupyterHub with the command conda install -c conda-forge jupyterhub from the link.
I reinstalled it with the commands from the link:
conda install -c conda-forge jupyterlab
conda install -c conda-forge nb_conda_kernels
# conda install -c conda-forge jupyter_contrib_nbextensions
Some useful Jupyter extensions can be found here.
17. Use Slack to receive signals or messages from commands
A helpful link.
Command for the direct message to the user:
curl -X POST --data-urlencode "payload={\"channel\": \"@memberid\", \"username\": \"webhookbot\", \"text\": \"The machine with GTX1080 has been rebooted:)\", \"icon_emoji\": \":ghost:\"}" <link>
Command for the channel message:
curl -X POST --data-urlencode "payload={\"channel\": \"#general\", \"username\": \"webhookbot\", \"text\": \"The machine with GTX1080 has been rebooted:)\", \"icon_emoji\": \":ghost:\"}" <link>
For example, we can send a message to the user when the machine reboots by writing the command into the file /etc/rc.local. A Python variant is sketched below.
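The same webhook can also be called from Python, for example at the end of a long-running script. A minimal sketch, assuming the requests package is installed and the placeholder webhook URL is replaced with the real one from Slack.
import json
import requests

webhook_url = "https://hooks.slack.com/services/XXX"   # hypothetical placeholder for the real webhook URL
payload = {
    "channel": "#general",
    "username": "webhookbot",
    "text": "The machine with GTX1080 has been rebooted:)",
    "icon_emoji": ":ghost:",
}
# Slack incoming webhooks accept a form field named "payload" containing the JSON message
requests.post(webhook_url, data={"payload": json.dumps(payload)})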
18. Set the default desktop on Ubuntu
sudo update-alternatives --config x-session-manager
sudo dpkg-reconfigure gdm3 # set the default desktop
Currently, I have tested this on Ubuntu 16.04 and found that gdm3 cannot be run while lightdm works well; I am not sure of the reason.