Using mcelog to detect hardware issues in Linux

mcelog is a tool for detecting hardware errors on Linux, especially memory, I/O, and CPU errors. It supports both the x86 and x86_64 platforms on kernel 2.6 and later.

From mcelog.org (the official website for mcelog): "The mcelog daemon accounts memory and some other errors in various ways. mcelog --client can be used to query a running daemon. The daemon can also execute triggers when configurable error thresholds are exceeded. This is used to implement a range of automatic predictive failure analysis algorithms, including bad page offlining and automatic cache error handling. User defined actions can also be configured."

All errors are logged to /var/log/mcelog, syslog, or the journal. mcelog used to be run as a cron job, but on modern systems it runs as a service. It is configured through /etc/mcelog.conf (newer versions use /etc/mcelog/mcelog.conf).

Installing mcelog

mcelog can be installed using yum ( yum install mcelog ), but building it from the git repository is recommended. Clone the repository:

git clone git://git.kernel.org/pub/scm/utils/cpu/mce/mcelog.git

Then compile it.
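
The build itself is not spelled out here, so the following is only a minimal sketch. It assumes the checkout ships the usual Makefile with a default target and an install target, and that git, make, and a C compiler are already installed. The function name build_mcelog and the default checkout directory are illustrative, not part of mcelog.

```shell
#!/bin/sh
# Sketch: clone and build mcelog from source.
# build_mcelog and the checkout directory are illustrative names, not part of mcelog.
build_mcelog() {
    src_dir="${1:-mcelog}"

    # Clone only if the checkout does not exist yet
    if [ ! -d "$src_dir" ]; then
        git clone git://git.kernel.org/pub/scm/utils/cpu/mce/mcelog.git "$src_dir" || return 1
    fi

    # Build in the checkout, then install (assumes an "install" target; needs root)
    make -C "$src_dir" || return 1
    sudo make -C "$src_dir" install
}
```

Calling build_mcelog with no argument clones into ./mcelog; pass a directory name to build somewhere else.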

On systems based on SysV init or Upstart, follow the steps below to start it at boot:

cp mcelog.init /etc/init.d/mcelog
chkconfig mcelog on

On systems based on systemd:

cp mcelog.service /usr/lib/systemd/system
systemctl enable mcelog.service

You can verify that the mcelog daemon is running with:

mcelog --client

This queries the running daemon for the error information it has collected. If it returns nothing, no errors have been logged yet.
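
Because an empty reply can be hard to tell apart from a daemon that is not there at all, it may help to check for the process first. The sketch below assumes pgrep is available and that the daemon's process name is mcelog. The helper name mcelog_status is illustrative, not part of mcelog.

```shell
#!/bin/sh
# Sketch: report whether an mcelog daemon process exists.
# mcelog_status is an illustrative helper, not part of mcelog.
mcelog_status() {
    # pgrep -x matches the process name exactly
    if pgrep -x mcelog >/dev/null 2>&1; then
        echo "running"
    else
        echo "not running"
    fi
}

# Only query the daemon when a process was found
if [ "$(mcelog_status)" = "running" ]; then
    mcelog --client
else
    echo "mcelog daemon is not running"
fi
```

With this check in place, empty output from mcelog --client can be read as "no errors accounted so far" rather than "no daemon".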

See the mcelog man page ( man mcelog ) for the full list of options.

An example of mcelog's log output

# /usr/sbin/mcelog --ignorenodev --filter
Hardware event. This is not a software error.
MCE 0
CPU 2 THERMAL EVENT TSC 16411ab616
TIME 1477257072 Sun Oct 23 17:11:12 2016
Processor 2 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 880003c3 MCGSTATUS 0
MCGCAP c09 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
Hardware event. This is not a software error.
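
Records like this are easy to skim in bulk with a little awk. The sketch below prints one summary line per record and assumes the field order shown in the sample (an MCE line, then a CPU line, then a TIME line); other record types or mcelog versions may be laid out differently. The helper name summarize_mce is illustrative, not part of mcelog.

```shell
#!/bin/sh
# Sketch: print one summary line per machine check record.
# summarize_mce is an illustrative helper, not part of mcelog.
summarize_mce() {
    awk '
        # A record starts with "MCE <n>"; remember its index
        $1 == "MCE"  { mce = $2 }
        # "CPU <n> ..." names the reporting CPU ("CPUID ..." lines do not match)
        $1 == "CPU"  { cpu = $2 }
        # "TIME <epoch> ..." is the last field we need; emit the summary
        $1 == "TIME" { printf "mce=%s cpu=%s epoch=%s\n", mce, cpu, $2 }
    ' "$@"
}

# Usage with three lines taken from the sample record
summarize_mce <<'EOF'
MCE 0
CPU 2 THERMAL EVENT TSC 16411ab616
TIME 1477257072 Sun Oct 23 17:11:12 2016
EOF
# prints: mce=0 cpu=2 epoch=1477257072
```

To summarize a real log, pass the file as an argument, for example summarize_mce /var/log/mcelog.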
