Unix [SOLVED]: how to take out only unique words from a file that are not matching with any other words in any files (two or more files)?

    #37032

    Anonymous

    Question

    #!/bin/sh
    # For every ordered pair of distinct files, strip out of each file the
    # lines it shares with the other one.
    for file1 in directorypath/*
    do
        for file2 in directorypath/*
        do
            if [ "$file1" = "$file2" ]; then
                echo "files are same"
            else
                # Collect the lines the two files have in common
                # (-F -x: match whole lines literally, not as regexes).
                grep -F -x -f "$file1" "$file2" > /home/common.txt

                # Keep only the lines of file1 that are not common,
                # then write them back over file1.
                grep -v -F -x -f /home/common.txt "$file1" > /home/temp.txt
                cat /home/temp.txt > "$file1"

                # Same for file2.
                grep -v -F -x -f /home/common.txt "$file2" > /home/temp.txt
                cat /home/temp.txt > "$file2"
            fi
        done
    done
    

    This code works fine for small files, but I have big text files to process and it takes far too long, even on a server machine.
    Please help!
    How do I achieve the same result efficiently?
    Thanks in advance.

    #37033

    Anonymous

    Accepted Answer

    Try this Python script (it takes the directory as an argument):

    import sys
    import os
    
    # Keeps a mapping of word => file that contains it
    # word => None means that that word exists in multiple files
    words = {}
    
    def process_line(file_name, line):
        try:
            other_file = words[line]
            if other_file is None or other_file == file_name:
                return
            words[line] = None
        except KeyError:
            words[line] = file_name
    
    file_dir = sys.argv[1]
    for file_name in os.listdir(file_dir):
        with open(os.path.join(file_dir, file_name)) as fd:
            for line in fd:
                # Strip the trailing newline and any surrounding
                # whitespace; skip lines that end up empty.
                line = line.strip()
                if len(line) == 0:
                    continue
                process_line(file_name, line)
    
    file_descriptors = {}
    # Empty all existing files before writing out the info we have
    for file_name in os.listdir(file_dir):
        file_descriptors[file_name] = open(os.path.join(file_dir, file_name), "w")
    
    for word in words:
        file_name = words[word]
        if file_name is None:
            continue
        fd = file_descriptors[file_name]
        fd.write("%s\n" % word)
    
    for fd in file_descriptors.values():
        fd.close()
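
    Assuming you save the script as, say, dedupe_words.py (the filename is arbitrary), you would run it against your directory like this:

    python dedupe_words.py directorypath

    When it finishes, each file in the directory contains only the words that appeared in no other file.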
    

    Memory requirement:

    You need to be able to hold all the unique words in memory at once. Assuming there are lots of duplicates between the files, this should be doable; as a very rough estimate, a CPython dict entry costs on the order of a hundred bytes or more per unique word, so tens of millions of unique words would land in the low gigabytes. Otherwise I honestly don't see an approach faster than what you have already.

    If you end up not being able to fit everything needed in memory, have a look at this answer for possible ways to use a disk-based dict instead of holding it all in memory. I have no idea how much that would affect performance, or whether it would still run fast enough at that point.
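
    If it comes to that, one disk-backed option in the Python standard library is shelve, which provides a persistent dict-like object stored in a file. A minimal sketch of the change, assuming the script above and an arbitrary store path (untested at scale):

    import shelve

    # Replace the in-memory dict with a persistent, dict-like store on disk.
    # Keys must be strings, which holds here since the keys are words.
    words = shelve.open("/tmp/words.db")

    # ... process_line() and the rest of the script work unchanged: they
    # read and assign words[line] exactly as they do with the plain dict ...

    words.close()  # flush and close the store when finished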

    Why is it faster? (In theory, untested)

    It makes only a single pass through each file and it's done. Your current approach is O(n^2), where n is the number of files: every pair of files is compared, and each comparison is itself a full grep -f scan of one file against the other's entire word list.

    Source: https://stackoverflow.com/questions/48020741/how-to-take-out-only-unique-words-from-a-file-that-are-not-matching-with-any-oth
    Author: entropy
    This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
