Building the Super Computer: Indexing Data and Documents with 'Google Desktop'

MamesJay

Member
Joined
Dec 21, 2009
Messages
79
In another thread I was asking about long-term storage of large amounts of data. It has to be HDDs, because with the data consumption these days, DVDs are the new floppy discs.

This time it's about the XXL computer that sits at home and holds insane amounts of data, present and available at a click of the mouse. Some would call it a home server; I just don't like the term, it sounds too geeky to me. To me 'home server' sounds more like somebody is just celebrating the capabilities of his system instead of using them. But that's just me. I thought I'd mention the term so some people might get the idea.

If a user has about 100 or 200 gigabytes of data (most of it large and medium-sized files like movies, shows and music), there is no need for an extensive structure and order. Those are laptop numbers. That is the amount of data where a manual search is still a fun little adventure. With two, three or even more terabytes, it's a different story. It becomes a chore that is no fun anymore. The Windows internal search is a dry nightmare, even if the HDD has been indexed (which has to be a given anyway from a certain point on).

I discovered Google in the early 2000s; it was the greatest thing for me about discovering the Internet. And it's still my #1 for any search. I'm totally used to the Google look and all. A while ago I heard about 'Google Desktop'. It was interesting, but I didn't pay any more attention to it until I realized that this application can completely index and search the text of documents and/or PDF files. And I have lots of documents. It's like a gigantic knowledge base, and the ability to have it indexed, so I can search INSIDE of ALL the documents for a word, in a split second...?! Getting every result from every document at once?! No more searching each single document and getting the results one by one, but getting multiple results in a Google look, with the searched word and passages, to look at the context?! The thought of it blew my mind; it's like a giant leap towards the Super Computer I dreamed about. :idea: :happy: Does anyone have an iPad? I need a new coaster. :lol:

I just wonder how 'Google Desktop' handles large amounts of documents and PDFs. Just indexing file names should be easy: the name of the file, the end. But how about, let's say, 200 gigabytes of documents? Does that mean that the index would be just as large? Is there an amount of data where the whole idea runs into problems? Maybe there are some people with experience handling large amounts of data with 'Google Desktop'.
Maybe I gave some people a new idea or perspective, I don't know. :)


It would be cool to have a system like that with a 5TB HDD for data, and maybe a 320GB SSD for running applications (getting the speed and all). One thing: I would never let this system onto the Internet, though. :lock: Because of viruses and malware, and then having all my documents indexed? I might as well print fliers with my credit card number.
 
Google is evil and one should not allow it on his desktop.
 
My boss at work uses it for searching through his enormous inbox (he's an IT manager but doesn't lead by example :lol:). He finds it great, but I find it's just bloatware.
 
He finds it great, but I find it's just bloatware.

Becauuuuuuussssseeeee?



The issue is apparent: searching through documents and getting the results in the context of whole sentences; being able to search through gigabytes of text in different formats at once. For example, making a vast amount of technical literature accessible that would otherwise go unused, because it would take too long to go through all the documents manually. For example, finding a piece of information in book #500 on page 250.

And because the search engine doesn't care about formats, it's possible to make connections between the literature and my own work files/projects. That's the point.

Finding the latest episode of 'Top Gear' on the hard drive, or the theme song from 'Attack of the Killer Tomatoes', would just be a delightful extra. That's peanuts, no reason for a desktop search. It's about regaining access to a database of texts that has become too large to search manually. And about making new connections between texts from different sources, to come up with new results.

'Google Desktop' in general, just because I like the Google look, and I'm familiar with it.

Right now it doesn't sound exciting for the average Joe, but it is for somebody who deals with a lot of information. It will get interesting for normal people the bigger HDDs get. The biggest database is useless if there is no instant access. If you have to weed through large amounts of files to get anything, you're not going to use them. Picture it as a virtual basement. :|

---

At some point in the future it will be possible to search music for keywords. Some guy with a huge collection of MP3s is looking for a song about a hotel in California :hmm:, but he forgot the title. :blink: One search in his application that has indexed all the lyrics of the audio, and there it is!

BTW, Google is not evil, Facepage is. And I think I heard about an Apple Curse; that every male Apple user is doomed to look like Moby at some point. :lol:


Some links:

http://en.wikipedia.org/wiki/Desktop_search

http://www.copernic.com/en/products/desktop-search/home/index.html
 
i always wonder how it works. i have a mac, and spotlight can bring up shit tons of results in seconds and separates them out into file types for you. its genuinely great and i miss it at work when i need to search multiple 2TB drives
 
i always wonder how it works...

It's all about making an index of all the files on the computer. But you probably know this much. I don't know how it works exactly either. All I know is that, in order to search inside of documents, every single word of the text has to be indexed. But that would mean, 50Gb of indexed texts would create a 50Gb index. It would be worth using that much space because of the possibilities it enables. The question is whether the computer can handle a large index like that.

Indexing files like MP3s, videos or pictures (just as files, not concerning the actual content) is peanuts. The desktop search would collect the file names and metadata: ID3 tags, Exif and stuff like that. Plug-ins for the desktop search can extend the search options.
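To picture what that file-name-and-metadata side of the index could look like, here is a minimal sketch in Python (standard library only; the function names are made up for illustration, and a real desktop search would also parse ID3/Exif content with format-specific plug-ins):

```python
import os

def build_file_index(root):
    """Walk a directory tree and record basic per-file metadata.

    Only captures what the filesystem already knows (name, size,
    modification time); content metadata like ID3 tags would need
    extra parsers/plug-ins.
    """
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                stat = os.stat(path)
            except OSError:
                continue  # skip unreadable files
            index[path] = {
                "name": name,
                "size": stat.st_size,
                "mtime": stat.st_mtime,
            }
    return index

def search_by_name(index, term):
    """Case-insensitive substring match on the indexed file names."""
    term = term.lower()
    return [path for path, meta in index.items()
            if term in meta["name"].lower()]
```

Because this only stores a name and a couple of numbers per file, the index stays tiny no matter how big the files themselves are, which is why indexing a terabyte of movies really is "peanuts".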

---

For the future:

In order to search through audio files (or the audio from video and TV) for keywords, the search engine would need sophisticated speech recognition. That way political debates could be searched for a certain passage, or the news. People could search movies for quotes. And then they too would know why "nobody f--ks with the Jesus." :mrgreen:

Searching through scans and pictures for text is easier; it would just take good OCR (which already exists).
 
But that would mean, 50Gb of indexed texts would create a 50Gb index.

That's not how all indexes work...

Anyway, I have about 750,000 files (3-4TB of data) in my Windows search index currently, and it's searched in moments and doesn't use that much space at all. And yes, I have full-text searching enabled.


This thread makes no sense to me.
 
This thread makes no sense to me.

Want me to draw you a picture? I guess this one calls for a simplified graphic. Just follow the dainty treats that will lead you to the knowledge.

[attached image: simplified graphic]


-----

Around ten years ago the average HDD in a home computer was about 4GB, and about 25% of it was used for the OS and installed applications. Today an HDD of 500GB is nothing special anymore. That's more than 100x in ten years. And while this trend continues, searching the home computer becomes just as important as searching the Internet. All the collected data turns into an uncontrolled pile of rubble. And if there is no application to search and display the stored data, all those files are not going to be used. The display and presentation of the search results needs a GUI that is as intuitive as an XBMC media center (for example) or something similar. It cannot be some obscure presentation that only appeals to geeks. Of course it depends on the purpose and the intention of the search. A person searching inside documents, for study, business or research, has other priorities than somebody looking for video or music files.

Searching through a terabyte of file names, and maybe the indexed content of about 500 e-mails, is peanuts to me. For starters: think of 1,000 books with 500 pages each, and each page has about 200 words; then you get a slight idea. Each word has to be part of the index, and that makes the size of the index roughly equal to the original. On one side there might be 1,000 files like music or video that are only indexed with a file name. On the other side 1000x500x200 possible results (as in 'each word'). That is the difference between 1,000 and 100,000,000. DUH!!!

And why would anyone care about these horrendous numbers?
It could be people in a similar situation: somebody who is studying, or doing research of any sort. It's nearly impossible to go through thousands of documents or books manually (at the very least it's time-consuming). Or how about just wanting to look up one isolated aspect in all those documents? And then combining my own results of a project/research with the study material, to double-check it (for example). And all that has to be done in a manner that is easy to access and intuitive.

And the 'normal' people? Their numbers are piling up as storage capacity increases. Soon the average person will face similar circumstances, and lose the ability, or the interest, to put the collected data to use. Unless there is an easy way to access it all.
Data that is organized, structured and accessed with a GUI that has hardly changed in 15 years is heading toward becoming 'the virtual basement' as capacity increases.

I've had people burn CDs and DVDs for me, with files and documents. They all had one thing in common: they all looked like somebody threw up on the disc. One big mess. All those 1TB external USB HDDs that are popular these days; I don't want to know what kind of a zoo most of them are housing. Next year we might have 5TB HDDs, and in a few years we'll be going on 10TB. And the people keep storing, collecting and piling up. Apple Store books, music, movies, family photos, all in one big giant blender. The mess that I, and maybe people in a similar situation, am trying to sort out is like research for future standard computing. I don't know, it might give some people who are facing tons of collected data ideas for structuring it and possible ways to access it all.

The 'Super Computer' in the thread title is not one that can calculate nuclear fusion. It's a computer that gives me easy and instant access to my stored data. Or how about one that can combine and connect stored files, to generate results I didn't even think about? With multi-terabyte HDDs that are getting bigger every year, typing 'C: - Program Files - yadda yadda' won't do it anymore. Just as well, a Windows search that returns the results in the form of a spreadsheet won't spark any enthusiasm in the average user. It's like Windows 95 all over again. Think GUI.
I don't care for the Apple cult, but they have realized the key to future computing.

ACCESS.
 
I realize that you have no idea what you are talking about.
 
Also, by chance do you work for Google?

Damn, you got me. :blink: No really, I'm not. :D

The Google Desktop Search was just a choice because I'm used to the Google look. I don't work for Apple either. I don't own a Mac because I'm used to the PC environment, and the Apple gadgets are too restricted for me. But the success of the iPhone and the iPad shows me that people are looking for a different approach. These days nobody bothers with DOS anymore. And in the near future, the current interface of personal computing is going to be replaced by something that is far more intuitive. And the important part is how stored content is approached (or searched).
 
Each word has to be part of the index, and that makes the size of the index roughly equal to the original.

You have absolutely no idea what an index is, do you? All it has to be is a list of words, with each word followed by the files it's inside. You only need an entry for each *unique* word in your documents.

And Windows can search through documents, too. I use it to find text inside of PDFs all the time at work.
 
I like this explanation of what an index is:

Think of it as a table of contents for your computer

That's how search engines work, that's how Google Desktop Search works; it's the basis for all indexing in any area, be it files/folders or database-level indexing.
 
If you are serious about keeping track of citations etc. in scientific texts and are too lazy or too unorganized to just keep track of what you read where, use EndNote or the like.
 
I'm not exactly sure what this thread is about either. But from what I understand:
1) yes google is great but no i don't like the idea of them knowing what's in my documents either.

2) Designing a search index is a matter of compromises. Depending on how it's going to be used, you make a trade-off between thoroughness and speed/size.
I bet with Google's search index you could reproduce all the text of a document from the index alone, but does that mean the search index/database is the same size as the original document? Maybe.
There are neat tricks to filter out a lot of bloat that no one would ever search for. Think of the actual picture data in a document: you can't search based on that info (yet), and it takes up a lot of space, so don't "index" it.
But if you think about databases and fast searching, you can also keep extra tables of predefined searches, or previous queries, to speed up the actual searching. It gets complicated if you really want to look at the details, and I'm definitely no expert, so I'm sure I've left out a lot of details.
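One of those "filter out the bloat" tricks is stop-word removal, which is simple enough to sketch (the stop-word list and example sentence here are made up for illustration): words that appear in nearly every document add bulk without making searches better, so many indexers simply drop them.

```python
# Common words that appear in almost every document; indexing them
# bloats the index without improving search results, so skip them.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def index_terms(text):
    """Return the unique, non-stop-word terms worth indexing."""
    words = text.lower().split()
    return {w for w in words if w not in STOP_WORDS}

text = "the quick brown fox jumps over the lazy dog in the park"
print(sorted(index_terms(text)))
```

Twelve words of input shrink to eight index terms here; on real prose, where "the", "of" and friends dominate, the savings are much larger.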

An interesting read about this topic: http://en.wikipedia.org/wiki/Full_text_search
 