...making Linux just a little more fun!
Petr Vavřinec [pvavrinec at snop.cz]
Almighty TAG!
First, sorry for a longish post. I'm almost at my wit's end.
I have an industrial PC, equipped with a touch screen and a "VIA Nehemiah" CPU (which behaves like an i386). The PC has no fan, no keyboard, no hard disk, and no flash drive - only an Ethernet card. It boots via PXE from my database server. /proc/meminfo on the PC reports MemTotal: 452060 kB. I'm booting "thinstation" (https://thinstation.sourceforge.net) with kernel 2.6.21.1. The PC uses a ramdisk. I boot Xorg with the twm window manager. Then I start the "Opera" browser as a client on my database server and "-display" it on the X server on the PC. This setup has worked flawlessly for a couple of months.
Then my end users came with complaints that the PC no longer boots after flipping the mains switch. So far, I have found the following:
1. The PC sometimes doesn't boot at all. The boot process stops with the message:
Uncompressing linux... OK, booting the kernel.
...and that's all. Usually, when I switch the PC again off and on, it boots OK, I mean I don't change anything, just flip that big button.
2. When it does boot all the way into X, the Opera browser isn't able to start properly. I tried to investigate the matter further. I modified the setup so that I now boot only into X+twm. This works.
Then I ran the following test on my database server:
xlogo -display <ip_address_of_the_pc>:0
This works, the X logo is displayed on the screen of the PC.
Then I tried this test on the database server:
xterm -display <ip_address_of_the_pc>:0
The result is this error message on the client side (i.e. on the database server):
xterm: fatal IO error 104 (Connection reset by peer) or KillClient on X server "192.168.100.171:0.0"
...and the X server on the PC really is killed (I can't find it in the process list on the PC anymore).
This is what I found in /var/log/boot.log on the PC:
------------ /var/log/boot.log starts here ----------------------------
X connection to :0.0 broken (explicit kill or server shutdown).
/etc/init.d/twm: /etc/init.d/twm: 184: xwChoice: not found
twm_startup
twm: unable to open display ":0.0"
------------ /var/log/boot.log ends here ------------------------------
...and this is from /var/log/messages on the PC:
------------ /var/log/messages starts here ----------------------------
Nov 12 12:18:25 msweld03 daemon.info udevd[721]: udev_event_run: seq 790 forked, pid [2167], 'remove' 'vc', 0 seconds old
Nov 12 12:18:25 msweld03 daemon.info udevd[721]: udev_event_run: seq 791 forked, pid [2168], 'remove' 'vc', 0 seconds old
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2167]: udev_db_get_device: found a symlink as db file
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2167]: name_index: removing index: '/dev/.udev/names/vcs3/%2fclass%2fvc%2fvcs3'
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2167]: udev_node_remove: removing device node '/dev/vcs3'
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2167]: udev_event_run: seq 790 finished
Nov 12 12:18:25 msweld03 daemon.info udevd[721]: udev_done: seq 790, pid [2167] exit with 0, 0 seconds old
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2168]: udev_db_get_device: found a symlink as db file
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2168]: name_index: removing index: '/dev/.udev/names/vcsa3/%2fclass%2fvc%2fvcsa3'
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2168]: udev_node_remove: removing device node '/dev/vcsa3'
Nov 12 12:18:25 msweld03 daemon.info udevd-event[2168]: udev_event_run: seq 791 finished
Nov 12 12:18:25 msweld03 daemon.info udevd[721]: udev_done: seq 791, pid [2168] exit with 0, 0 seconds old
Nov 12 11:18:25 msweld03 user.debug kernel: unhashed dentry being revalidated: CMD2095
------------ /var/log/messages ends here ------------------------------
I tried fiddling with the BIOS parameters, i.e. I reverted to the "factory setup" and re-entered the PXE-booting settings. It didn't help.
I tried to limit the memory used by the ramdisk. My current parameters are:
append ramdisk_blocksize=1024 initrd=initrd.C0A864AB root=/dev/ram0 ramdisk_size=131072 console=ttyS3 acpi=off noapic nolapic
...and this doesn't help, either.
Does anyone on the honorable TAG staff have any clues that could help me? Thanks in advance for any help.
Petr
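For scale, the numbers in Petr's post can be put side by side. The sketch below assumes that ramdisk_size= is given in KiB (1024-byte units) and treats the kB figure from /proc/meminfo as the same unit; the function names are made up for illustration. With the values quoted above, the ramdisk comes to 128 MiB, or roughly 29% of the reported RAM.

```python
import re


def mem_total_kib(meminfo_text):
    """Return the MemTotal figure (in kB, as reported) from /proc/meminfo text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemTotal line found")
    return int(match.group(1))


def ramdisk_share(ramdisk_size_kib, total_kib):
    """Fraction of total RAM that a fully populated ramdisk of this size would occupy."""
    return ramdisk_size_kib / total_kib


if __name__ == "__main__":
    total = mem_total_kib("MemTotal: 452060 kB")  # value quoted in the post
    size = 131072  # ramdisk_size= from the append line, assumed to be in KiB
    print(f"ramdisk: {size // 1024} MiB, share of RAM: {ramdisk_share(size, total):.1%}")
```

On a live system, the text passed to mem_total_kib() would simply be the contents of /proc/meminfo.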
Paul Sephton [paul at inet.co.za]
On Thu, 2009-11-12 at 11:35 +0100, Petr Vavřinec wrote:
> 1. The PC sometimes doesn't boot at all. The boot process stops with
> the message:
>
> Uncompressing linux... OK, booting the kernel.
>
> ...and that's all. Usually, when I switch the PC again off and on, it
> boots OK, I mean I don't change anything, just flip that big button.
Just looking at this aspect of the problem, and for the moment disregarding the rest of the symptoms, it would appear that your problem is hardware related. I would suggest you start by checking the integrity of your RAM modules, or if you have ready access to replacement RAM, simply swap them out and see whether the problem has gone away.
My reasoning might be flawed, but based on the fact that the machine was working before, and the fact that you are using a RAM disk, together with the strange behaviour on bootup, I suspect that your RAM might not be refreshing properly. This failure to refresh might be caused by a faulty module, a loosely seated module (possible through vibration of the chassis over time), or a CPU memory bus problem.
Whether or not this is the case, it is easy enough to check.
Regards, Paul
René Pfeiffer [lynx at luchs.at]
On Nov 12, 2009 at 2151 +0200, Paul Sephton appeared and said:
> On Thu, 2009-11-12 at 11:35 +0100, Petr Vavřinec wrote:
> > 1. The PC sometimes doesn't boot at all. The boot process stops with
> > the message:
> >
> > Uncompressing linux... OK, booting the kernel.
> >
> > ...and that's all. Usually, when I switch the PC again off and on, it
> > boots OK, I mean I don't change anything, just flip that big button.
>
> Just looking at this aspect of the problem, and for the moment
> disregarding the rest of the symptoms, it would appear that your problem
> is hardware related. I would suggest you start by checking the
> integrity of your RAM modules, or if you have ready access to
> replacement RAM, simply swap them out and see whether the problem has
> gone away.
Running memtest86 for a couple of hours or even days is a good way of finding out if the memory works or not. You can find it here: https://www.memtest86.com/
Some live-CDs allow booting into memtest86, too.
Best, René.
Neil Youngman [ny at youngman.org.uk]
On Thursday 12 November 2009 19:55:17 René Pfeiffer wrote:
> Running memtest86 for a couple of hours or even days is a good way of
> finding out if the memory works or not. You can find it here:
> https://www.memtest86.com/
The last time I had flaky memory the system seized up about twice a day. Running memtest86 overnight didn't show any problems, but swapping in new memory solved the problem.
I would suggest that if memtest86 can find a problem you can rely on that, but a negative from memtest86 doesn't guarantee that your memory's OK.
Neil
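Neil's point, that a clean test run proves little, applies even more strongly to tests run from inside a booted system. As a rough illustration only, here is a minimal user-space pattern test in Python. The function name and the choice of patterns are assumptions for this sketch and say nothing about how memtest86 works internally. It can only touch pages the OS hands to the process, and CPU caches may hide faults, so a result of zero errors means very little, while a nonzero result is worth taking seriously.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill a buffer with each byte pattern, read it back, and count wrong bytes."""
    buf = bytearray(size_bytes)
    errors = 0
    for pattern in patterns:
        # Write pass: overwrite the whole buffer with the current pattern.
        buf[:] = bytes([pattern]) * size_bytes
        # Read pass: every byte should still equal the pattern.
        errors += size_bytes - buf.count(pattern)
    return errors
```

Calling pattern_test(16 * 1024 * 1024) exercises a 16 MiB buffer; the size is arbitrary, and a serious check still means booting memtest86 or swapping the modules.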
Petr Vavřinec [pvavrinec at snop.cz]
René Pfeiffer wrote:
[...snip...]
> Running memtest86 for a couple of hours or even days is a good way of
> finding out if the memory works or not. You can find it here:
> https://www.memtest86.com/
>
> Some live-CDs allow booting into memtest86, too.
>
> Best,
> René.
Thank you guys,
I ran memtest86; when it reached 5000 errors, I switched it off. Fortunately, I was able to find a suitable replacement for my memory. I swapped it, and now I'm running again without problems...
Thanks again, have a nice weekend,
Petr
Rick Moen [rick at linuxmafia.com]
Quoting Neil Youngman (ny@youngman.org.uk):
> The last time I had flaky memory the system seized up about twice a day.
> Running memtest86 overnight didn't show any problems, but swapping in new
> memory solved the problem.
Running memtest86 overnight will almost always pinpoint bad RAM. Running it for only a few hours and finding no errors doesn't mean much.
On one memorable occasion, bad RAM on my candidate replacement server got smoked out only through resorting to iterative kernel compiles using "make -j NN" tweaked to make sure I used all memory. (After a few tests, NN ended up being 256.)
In the referenced case, the situation really was sort of my own fault, because, trying to save money, I'd deployed some sticks of RAM that had a dubious history. The entire saga of how I tracked down the bad sticks is here, and I personally think it makes pretty good reading:
https://linuxmafia.com/pipermail/conspire/2006-December/002662.html
https://linuxmafia.com/pipermail/conspire/2006-December/002668.html
https://linuxmafia.com/pipermail/conspire/2007-January/002743.html
Also worth considering is the Cerberus Test Control System (CTCS), the hardware burn-in suite that we used to qualify hardware at VA Linux Systems.
Information links on CTCS:
https://linuxmafia.com/faq/Hardware/cerberus.html
https://va-ctcs.cvs.sourceforge.net/va-ctcs/ctcs/FAQ?view=markup
--
Rick Moen                 "If accuracy / Is what you crave /
rick@linuxmafia.com       Then you should call it / Myanmar Shave."
                          -- FakeAPStylebook
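The iterative-compile approach Rick describes could be scripted along these lines. This is a sketch, not Rick's actual procedure: the source directory, the iteration count, and the "make clean" step between builds are assumptions, and only "make -j NN" comes from the post. The run parameter exists purely so the loop can be tried out without starting a real build.

```python
import subprocess


def build_commands(jobs):
    """Commands for one iteration: a clean (assumed step), then a parallel build."""
    return [["make", "clean"], ["make", f"-j{jobs}"]]


def run_builds(source_dir, jobs, iterations, run=subprocess.run):
    """Repeat clean+build in source_dir; return the iteration numbers that failed."""
    failures = []
    for i in range(1, iterations + 1):
        for cmd in build_commands(jobs):
            result = run(cmd, cwd=source_dir)
            if result.returncode != 0:
                # Record the failed iteration and move on to the next one.
                failures.append(i)
                break
    return failures
```

For example, run_builds("/usr/src/linux", jobs=256, iterations=10) (the path is hypothetical) returns the list of iterations in which a command failed. On an unchanged source tree, a build that fails only some of the time is the suspicious sign.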
Ben Okopnik [ben at linuxgazette.net]
On Fri, Nov 13, 2009 at 03:04:30PM -0800, Rick Moen wrote:
> Quoting Neil Youngman (ny@youngman.org.uk):
>
> > The last time I had flaky memory the system seized up about twice a day.
> > Running memtest86 overnight didn't show any problems, but swapping in new
> > memory solved the problem.
>
> Running memtest86 overnight will almost always pinpoint bad RAM.
> Running it for only a few hours and finding no errors doesn't mean much.
True in my experience as well. Not specifically for memtest86, but when I was building systems, the only time I considered the memory to have been properly "burned in" was after I ran it through "BURNIN" (this was back in the days of DOS) for 24 hours. It was annoying, but since the alternative was to have systems that would come back and that I'd have to fix on my own dime - not to speak of the associated loss of reputation for my then very-young business - it was a requirement for any machines I sold. I've also had systems with memory problems that required either a spritz of Freon or a blast from a hair dryer set on 'high' to confess their faults.
I'm quite happy that a) these kinds of problems are quite rare nowadays, and b) I'm doing almost nothing but software these days.
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Raj Shekhar [rajlist2 at rajshekhar.net]
In infinite wisdom Ben Okopnik said the following On 11/13/09 3:24 PM:
> I've also had systems with memory problems that required either a spritz
> of Freon or a blast from a hair dryer set on 'high' to confess their
> faults.
I have never heard of this method for diagnosing RAM problems. What does it do?
Thomas Adam [thomas.adam22 at gmail.com]
2009/11/16 Raj Shekhar <rajlist2@rajshekhar.net>:
> In infinite wisdom Ben Okopnik said the following On 11/13/09 3:24 PM:
>
>> I've also had systems with memory problems that required either a spritz
>> of Freon or a blast from a hair dryer set on 'high' to confess their
>> faults.
>
> I have never heard of this method for diagnosing RAM problems. What does it
> do?
man 7 salonandpermset
-- Thomas Adam
Ben Okopnik [ben at linuxgazette.net]
On Mon, Nov 16, 2009 at 10:44:56PM +0000, Thomas Adam wrote:
> 2009/11/16 Raj Shekhar <rajlist2@rajshekhar.net>:
> > In infinite wisdom Ben Okopnik said the following On 11/13/09 3:24 PM:
> >
> >> I've also had systems with memory problems that required either a spritz
> >> of Freon or a blast from a hair dryer set on 'high' to confess their
> >> faults.
> >
> > I have never heard of this method for diagnosing RAM problems. What does it
> > do?
>
> man 7 salonandpermset
<Ahem> "man 8 salonandpermset", please. There's nothing "miscellaneous" about serious stuff like this.
Raj: In the past, some memory chips would become thermally sensitive as they degraded, and could be made to fail by heating them up or (less often) by cooling them during a memory test. You had to use this with some discretion - obviously, you could crack the chip packages if you flipped the temperature too rapidly - but it was a really good test that would usually suss out otherwise difficult-to-diagnose, intermittent memory faults.
Memory quality has increased greatly since those days, so I don't know how applicable this would be today - but it was a standard part of a techie's toolkit back then.
--
[ Okopnik Consulting | Putting computing solutions within easy reach ]
[ Expert-led Training | Dynamic, vital websites | Custom programming ]
[____________________________ https://okopnik.com ________________________]
Blaine Clark [thelight9 at comcast.net]
Just a note about temperature effects: failed hard drives can sometimes be put in your fridge or, for a short time, in your freezer. Let the drive cool, pull it out, connect it up, and have your recovery process ready to roll NOW. You probably have only one chance to recover as much of your data as can be recovered. This doesn't always work, of course, and it doesn't always let you retrieve all that you want or need, but hey, it's a shot when you're desperate.