red assassin wrote: kohlrak wrote:
Starting with 64bit, the processor can switch between 32bit and 64bit modes almost seamlessly, just like ARM can switch between instruction sets (which is why they recommend mixing Thumb and regular ARM code: Thumb code is smaller, but ARM code can be faster for tasks that need operations Thumb can only do in a roundabout way [like DIV]).
Furthermore, those CPU enhancements that 64bit CPUs have are still available to programs running 32bit code on 64bit machines (SSE3 and up, for example). The advantage of 32bit code comes from the fact that 64bit instructions are 125% or more of the size of their 32bit equivalents. I've actually seen coders (I haven't checked GCC on a 64bit machine lately [and GCC has improved a lot in the past couple of years with optimization], since my Linux boxes are 32bit [actually, one is 64bit, but the Android version is still 32bit, and I don't have much of a choice on that: Samsung Galaxy Tab E]) write "64bit code" where the only thing that bothered with 64bit was the pointers. The next kicker: the 64bit address space is actually only 40 bits in practice, so most programs would survive without that, even.
The reason for the slowdown is that when you compile code for 32bit, it assumes a Pentium 4. I don't know if it's Wirth's law, or if they just haven't gotten around to "32bit on 64bit compiling" yet. Naturally, you'll want to keep libraries in 64bit mode so they can handle programs without a "proxy address" or something. Intel also planned this out so that you could optimize processes by using 32bit mode. Unlike 16bit code running on a 32bit processor, 32bit code running on a 64bit processor is "sign extended" or "zero extended" so that you can easily switch back and forth (although Linux doesn't feel the need to allow this, since code should be open source for it anyway). What this translates to, for a simple optimization example: instead of turning "return 0" into:
which, IIRC, is 6 bytes, you can do:
which is 2 bytes. The general rule of optimization for the past few years (thanks to increases in execution speed and such) is that the best optimization you can do is making your code fit into cache, rather than picking faster operations (the fastest method is and eax, 0, but the code size is an extra byte). So the trick is: unless you're using "long int" in all your calculations, you gain a major speed boost from using 32bit code. If you have a copy of GCC, check for me whether it already uses the E-prefix registers for regular int calculations, instead of down-promoting after using the R-prefix registers.
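For reference, the comparison presumably being made above is between the usual mov-immediate and the xor zeroing idiom for "return 0" (a minimal sketch; byte counts are for the plain 32bit encodings):

    /* The C being compiled: */
    int return_zero(void) {
        return 0;
    }

    /* Typical code generation for the body (plus a 1-byte ret, C3):
     *   mov eax, 0      ; B8 00 00 00 00  -> 5 bytes
     *   xor eax, eax    ; 31 C0           -> 2 bytes
     */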
I think you have a misunderstanding about the difference between register size and processor mode going on here. In long mode (i.e. 64 bit), you can of course still touch eax etc (just as you can touch ax, al and ah from both 64 and 32 bit modes, for that matter) to work with 32 bit values, and it's good practice to use the size of variable you actually need. This is very much a separate thing from a mode transition to 32 bit protected mode, which allows you to run code written for 32 bit x86 - you can't just point a processor in long mode at 32 bit code and expect it to work. Reasons for this include: a) some of the opcodes have changed (several blocks of 32 bit instructions were moved to make room for the new 64 bit instructions), b) page table layout is necessarily different, and c) some architectural features have changed, e.g. segmentation is gone in long mode. If you have a process running in 32 bit compatibility mode and you want to call other code that's 64 bit (which includes syscalls to the kernel, since that will be 64 bit in this case), something needs to mode transition. This is handled on Windows by WOW64 and on Linux by the kernel when it receives a syscall, and it isn't a free operation as it's a context switch.
Which ones were moved? Last time I checked for x86, they used the "reserved" instructions and turned them into prefixes, which is a staple of x86's ability to have variable-length instructions. The context switching ends up being part of the general task switching anyway, so it's not as much overhead as you think. IIRC, it's a matter of pointing the segment registers in the right direction, since the format is basically the same. TBH, I don't have experience in this matter, just reading. I've stayed pretty close to 32bit transitions, which, agreed, are a pain. Intel, however, understood this would be a problem (since BIOSes stayed 16bit, many 32bit OSes constantly had to switch back and forth and it was a major pain, but even that switching isn't as bad as you think, and I do have experience with that), so they tried to simplify the mode switching between pmode and long mode. And, frankly, what these compilers do just to call a function is more complex than the switch between 32bit and 16bit code (only marginally, if it's not doing the useless mov operations that I've seen GCC do on 32bit x86 code).
Distros tend to build 32 bit versions to maintain some degree of backwards compatibility, so yes, some of the extensions are disabled, which explains some of the cases where 64 bit code is significantly faster (generally stuff which is particularly suited to vectorisation). It certainly doesn't make 64 bit code any slower than 32. If you're compiling yourself it's easy enough to turn all the extensions back on (it's -march=native for gcc) and test. I ran a quick test with xz (since it's easy to compile and very CPU-intensive) - compressing a 1GB random file (reading from a ramdisk and writing to /dev/null to avoid disk bandwidth being an issue), with identical compile options other than the architecture (i.e. optimisation and use of all available extensions are enabled), 32 bit takes ~7m40s and 64 bit takes ~7m10s (across a few repeated runs, variance is ~5s, on my Core i5-3350P). It isn't super significant, but it's also definitely faster on 64 bit!
But your example still says nothing, since the vectorization still isn't enabled. If it is, then GCC still has a hard time with 32bit x86 code (and I remember it having a hard time). The 64bit instructions are longer. If it's using the 32bit equivalents, as it should, the 64bit code will be marginally slower due to occasionally having to handle pointers, even if they're predictable in lower memory. There's nothing inherent about 32bit that makes it slower than 64bit, while the opposite is true.
Optimisation is also quite a bit more complicated than "just make the code as small as possible", or -Os and -O2 would be the same thing (or /Ot vs /Os on MSVC).
It is more complex, but it's a general rule — one Intel picked out and used to cream AMD a few years back, when AMD was spending all sorts of money on making individual instructions execute faster while Intel spent a comparatively small amount on making the caches larger. Intel figured out that the biggest bottleneck on CPUs today is cache misses. The fact that hello world programs compile to something so large should be enough to point out the source of the problem: Wirth's Law.
kohlrak wrote: Right, but those are superficial mechanisms. Even chroot gives a warning that you shouldn't use it for security purposes. I was actually going to use it for that purpose with a PHP app I was making, only to get that warning and change my mind. Frankly, the OS shouldn't be providing methods for programs to scan for other programs and then modify them. It's not necessary at all, and just serves as an extra hole. Use the things that our processors gave us for debugging, instead of some external solution that often has a hard time "connecting to the process." Heck, even within programming languages there are often constructs for debuggers (like try, throw, and catch in C++) built into the code. You can use defines to enable and disable this debugging code to optimize.
"Superficial"? You go ahead and write me a Chrome sandbox escape (that is, given native code execution in a Chrome renderer process, gain execution elsewhere on the system), for example, and tell me just how superficial those mechanisms are. I'll wait.
Just because it'd take a lot of work to pull off doesn't mean it's not superficial. It's more like a sandbag bunker as opposed to a steel one. Those are fairly superficial, yet have fun getting through one without explosives, which is a lot of money and work.
Actually, on second thoughts, if you write a Chrome sandbox escape, you probably just want to sell it, as it's worth at least tens of thousands of dollars.
So would a decent cryptor, but those things are open source, even. A fool and his money are soon parted.
As I said, if you're the same privilege level as another process there is no way of preventing data access. Modify the executable on disk to write the information you want out; just read the same files and do the same calculations as it does; modify one of the libraries it loads to do that; write your own process loader that injects code to read the information you want and use that to start the process; etc etc... You don't need to be able to attach a debugger/ReadProcessMemory/etc the obvious ways to get there.
No, you don't, but using "you can always break the store window to rob the place" as an excuse for putting the key under the front door mat is inexcusable.
kohlrak wrote: What piece of equipment is too critical to allow rebooting the whole system, yet can have a driver that crashes? As long as you have CPU and RAM, your system can restart the hardware (since all drivers can be stored on the HD, aside from the HDD driver, which should always be in RAM). As far as I can tell, this *IS* what Linux does. My video card has crashed already, for example, and Linux just restarted the GPU itself instead of flashing the caps lock light with a blank screen like it normally does with a kernel panic.
Your HDD driver crashes and corrupts itself in memory. Any driver crashes and sends junk data to whatever was reading from it when it went wrong, throwing random spanners into the works of the rest of your system. And so forth. The Linux kernel makes an attempt to decide whether it thinks a driver crash is recoverable, in which case it will try (it refers to this as an "oops"), or non-recoverable, in which case it will panic. However, it's definitely not right all the time - I've had my system do some very bizarre things after oopses that weren't as recoverable as it thought they were. (Writing kernel code is an exciting experience, believe me.) Windows chooses not to take that risk for a variety of reasons, including security.
Yeah, but that doesn't mean all drivers should have access to the crash function. The majority of Windows crashes have been GPU crashes, so if MS wants to limit the number of BSoDs, the first thing it should do is take that ability away from them. But that comes back to: you have RAM and CPU, with innate drivers, and the HDD driver. The HDD driver should not get corrupted in memory, but should it, it should cause a kernel panic. However, that same driver should be part of the kernel. The rest should not need to panic, as anything else can be restarted (or, at least, device makers should have the option to cut the power and re-establish it, instead of demanding everything go down).
At any rate, this isn't the only issue with microkernels - all of the extra ring transitions to talk to your drivers introduce extra latency, extra complexity (which means, ironically, more risk of things going wrong, among other disadvantages), and architectural issues introduced by inability to share state (which means more copying of things).
Have you ever written transition code? It's really a lot less than you think. Just read a tutorial on getting to protected mode from real mode, or, if you'd like, I'd just give you the code here, as I've actually written some. Long mode is a different story, but I've read that it is easier by design, due to how many issues came from the 16bit to 32bit transition. They realized that they didn't want to make the same mistake again.
Anyway, kernel design philosophy isn't really my area - macrokernels won, I just have to deal with them. If you want to know more about why, read some of Torvalds' thoughts about them.
I developed a microkernel. The short answer is, it's easier to convince companies to write drivers for macro-kernels, because they want as many options as possible, even if you're giving away your stability to them.
kohlrak wrote: Oh, I understand, but there's no reason we should assume developers can't write secure code, either. They should own up to their mistakes when they make them, instead of pointing fingers and blaming everyone else. But that's why we have patching. We can't blame a certain few "elites" like Intel for the mistakes of a dumb programmer. We could also work on improving coding education. A lot of people today leave college with less knowledge and experience than me, and I'm just a cowboy coder with no degree, yet I can code my own toy OS (entirely in assembly, without stealing code from Linux like GRUB or some other boot loader) and programming language, which baffles most people with degrees for some unknown reason. People these days seem to get programming degrees like black belts out of a McDojo. It's not that good education is too hard for people to understand, but we've simplified it to the degree that half the people don't even know half the language they're coding in. I had arguments with teachers over whether or not to teach & and |, when students were confused that their code compiled with & and | but didn't produce the results they expected (since they wanted && and ||). This is what creates these kinds of coders: they assume that since most errors produce a compiler error, if it compiles but has bugs, it's probably an off-by-one error or something like that, rather than a mistyped ==, &&, ||, or even a stray unary * (ending up with a pointer dereference instead of a multiply).
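A minimal illustration of the bug class described there (hypothetical values): the & version compiles fine but takes a different branch than the && one.

    #include <stdio.h>

    int main(void) {
        int a = 1, b = 2;

        if (a & b)             /* bitwise: 1 & 2 == 0, so this never prints */
            puts("bitwise says both are set");

        if (a && b)            /* logical: both non-zero, so this prints */
            puts("logical says both are true");

        return 0;
    }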
"There's no reason we should assume developers can't write secure code"? How about, like, all code ever written? Coding is hard to start with, but security is incredibly
hard. Understanding machine behaviour on a deep enough level to get exploits is hard; keeping track of all the classes of exploit is hard; not making any mistakes even when you know what you're doing is hard. Writing completely safe buffer handling code with no mistakes in is hard enough even when you're an expert who knows how all of the exploits for it work, and other classes of bug are harder to understand and harder to reason about. And besides, people come up with new exploitation approaches that nobody has had to deal with before pretty often, which means a program developed perfectly to the state of the art today might be laughably insecure tomorrow. If your code is complicated enough to do anything useful or to have security boundaries, I can pretty much guarantee it has security flaws.
KISS makes it easy, actually. Generally, stack busting means taking advantage of an injected callback or a stack overflow. If you assume incoming data could be malicious, it's much, much easier to avoid. If you don't mind a loss of efficiency, but are determined to use a stack, you make a simple stack jail. And, yes, there are often instructions specifically for pulling this off (Intel has a bound instruction). And is it easy to pull off? Absolutely.
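A minimal sketch of the "assume incoming data is malicious" rule in C (the helper and its name are mine): reject anything larger than the destination instead of letting it run off the end of the buffer.

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical helper: copy untrusted input into a fixed buffer,
     * refusing anything that doesn't fit instead of overflowing it. */
    int copy_input(char *dst, size_t dst_size, const char *src, size_t src_len)
    {
        if (dst_size == 0 || src_len >= dst_size)
            return -1;              /* oversized input: reject, don't overflow */
        memcpy(dst, src, src_len);
        dst[src_len] = '\0';        /* always leave a terminated string */
        return 0;
    }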
I don't disagree for a moment that the state of programming education is terrible, or that the average developer is terrible, but I don't think any of this is a fixable problem - what exactly are you going to do about it? You can't stop people programming. And even if it *was* fixable, as I say, the best developer in the world is still going to produce insecure code. But you can improve the tools.
And this is precisely where we made the mistake: once you assume the tools should nanny the programmer, you start falsely assuming that they will, even when they don't. Your failure to use my library securely is not my fault. I can choose to nanny you if I want, but if I'm not making any claims of nannying you, you can't call it my bug. It's your bug.
kohlrak wrote: If you can't shoot yourself in the foot, you often can't get anything done. I remember having this one really annoying issue where I tried to store an IPv4 address in a uint32 so I could encode it, but the only way I could do it was with a plethora of really inefficient shifts and such, because type-casting was an error instead of a warning. It took me days to finally write it; then I wrote it in 32bit x86 assembly in a few hours. People say you're not supposed to be able to do things faster in assembly, yet I did, because these protections were getting in the way. I was constantly fighting casts in that code.
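A minimal sketch of the shift-based packing described there (the helper name is mine):

    #include <stdint.h>

    /* Hypothetical helper: pack a dotted-quad a.b.c.d into one 32-bit
     * value with shifts and ORs, big-endian ("network") octet order. */
    static uint32_t ipv4_pack(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
    {
        return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
               ((uint32_t)c << 8)  |  (uint32_t)d;
    }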
The compiler probably optimises away anything particularly weird you do. Either way, sounds like you were either using something storing the address in a particularly odd way or just doing it wrong, because the only way I've ever seen IPv4 addresses represented is a string or something that's just a typedef for a uint32_t.
The irony here is that a more modern, safer language would likely have a type for addresses that handles all of this for you and saves trying to blindly cast things at all.
Which would actually make the problem worse, because instead of it being a pain in the rear to get around the type casting, it simply wouldn't let me try. That's a great idea.
kohlrak wrote: That's even worse. Speculative reads (aka speculative execution) are branch prediction. It's a staple of x86's pipeline optimization. I really miss ARM's original answer to it (conditional instructions, but they switched to branch prediction to save on instruction encodings so they could fit more instructions in a smaller space). Now I'm really skeptical of AMD not being affected.
I was probably inadequately clear here - I said "they do not do the speculative reads that are responsible", meaning it's the specific vulnerable cross-privilege read, not a general case of "never speculatively read anything, including instructions". But to save further argument I dug up the specific quote from AMD:
Well, thank you for clarifying.
The AMD microarchitecture does not allow memory references, including speculative references, that access higher privileged data when running in a lesser privileged mode when that access would result in a page fault.
Wait, so they only allow it when it's allowed? Do they think the Intel version wouldn't hit rock bottom on a page fault? They're confusing me, here, on how they're different from Intel.
kohlrak wrote: In what way can rings 1 and 2 jump into ring 0? I never read anything like that in the Intel manuals. Have you coded software implementing this stuff before?
The most obvious issue is that the page tables have a single bit for supervisor mode, there's no per-ring granularity. Anything less than ring 3 is supervisor mode. (See the paging documentation in Intel's manuals.) If you can read and write to all memory used by code in ring 0, it's clearly trivial to gain execution in ring 0.
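For reference, the low flag bits of an x86 page-table entry look like this (a sketch; the macro names are mine, the bit positions are from the Intel SDM paging chapter):

    #include <stdint.h>

    /* Low flag bits of an x86 page-table entry (Intel SDM Vol. 3, "Paging").
     * There is a single User/Supervisor bit: a page is either supervisor
     * (accessible from rings 0-2) or user (ring 3); no per-ring field. */
    #define PTE_PRESENT   (UINT64_C(1) << 0)
    #define PTE_WRITABLE  (UINT64_C(1) << 1)
    #define PTE_USER      (UINT64_C(1) << 2)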
Maybe we're thinking of different types of pages, because the ones I use have 2 bits, which allow 4 different rings (0-3).