Does anyone know if there is a tool for collecting words from text files? What I need to do is to collect all unique words from a large text file and save them as a list: one word for each line.
You can use a combination of sed, sort and uniq, something like:
sed "s/ /\n/g" /RAM/FileToSort.txt | SDK:local/C/sort | uniq
sed substitutes each " " with a "\n", and the g tells sed to perform the substitution for all hits, not just the first: s/THIS/THAT/g. sort then sorts it all, and uniq keeps one copy of each entry. uniq only collapses adjacent duplicate lines, which is why the sort has to come first.
There is an Amiga command "sort" that clashes with the SDK sort, which is why you need to give the full path. I'm assuming you have OS4 and the SDK.
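A minimal sketch of the same idea as a reusable shell function, assuming a POSIX-style sh with tr, sed, sort and uniq on the path. The function name unique_words is made up for illustration, and tr is swapped in for the sed substitution so that tabs and repeated spaces are handled as well. On OS4 you would most likely have to spell sort as SDK:local/C/sort here too.

```shell
#!/bin/sh
# Sketch only: print each distinct whitespace-separated word from stdin once.
# unique_words is a hypothetical name, not an existing tool.
unique_words() {
    # tr turns spaces and tabs into newlines and squeezes repeated newlines,
    # sed drops any empty line that is left, sort groups equal words together,
    # and uniq then collapses each group to a single line.
    tr -s ' \t' '\n' | sed '/^$/d' | sort | uniq
}

# uniq on its own only merges adjacent repeats, so both "a" lines survive here:
printf 'a\nb\na\n' | uniq

# After sorting, the two "a" lines are adjacent and collapse into one:
printf 'a b a\n' | unique_words
```

To process a file you would redirect it in, for example unique_words </RAM/FileToSort.txt >words.txt (words.txt is just an example output name).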
As no one has answered you (I think because of the question itself: it is easy to solve, there are a lot of ways to do it, and Google can help with most of them), here is a detailed answer to avoid silence in the forum:
1. Any scripting language (Python, Perl, ARexx and so on) in a few lines, like:
Perl
aos4shell:> perl -nle "$w{$_}++ for grep /\w/, map { s/[\. ,]*$//g; lc($_) } split; sub END { print $w while (($w) = each(%w)) }" <input_file.txt >words.txt
You can of course save it as script.pl and write it as a small script rather than a single line.
2. A simple bash script using the Unix "sort" command line tool, like:
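#!/usr/bin/bash
# Prompt on stderr so it does not end up in redirected output
echo Enter the filename >&2
read fn
# Unquoted $(cat ...) is split on whitespace: one word per iteration
for WORD in $(cat "$fn")
do
echo "$WORD"
done | sdk:local/c/sort -u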
Running it will be something like:
aos4shell:> sh sort.sh
Then type the name of the file whose words you want to parse. Or redirect the output, like sh sort.sh >ready, and type the filename as before. By the way, don't confuse the AOS4 version of the "sort" binary, which is in system:c/, with the Unix "sort" binary, which is in sdk:local/c/.
3. Unix command line programs which AOS4 has, like sed, awk, gawk.
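If you would rather not type the file name at a prompt, the script can take it as an argument instead. This is only a sketch: words_from_file and the SORT variable are hypothetical names, not something AOS4 or the SDK defines. SORT defaults to plain sort; on AOS4 you would set it to sdk:local/c/sort inside the script so the system:c/ one is not picked up.

```shell
#!/bin/sh
# Sketch only. SORT is a made-up override for which sort binary to call;
# on AOS4 change the default below to sdk:local/c/sort.
SORT=${SORT:-sort}

# words_from_file FILE: print the unique words of FILE, one per line.
words_from_file() {
    # The unquoted $(cat ...) is split on whitespace, one word per iteration.
    # Note that the shell also applies filename globbing to it, so a "word"
    # such as * would be expanded to file names.
    for WORD in $(cat "$1")
    do
        printf '%s\n' "$WORD"
    done | "$SORT" -u
}

# Only act when a file name was given on the command line.
if [ $# -ge 1 ]; then
    words_from_file "$1"
fi
```

Saved as, say, words.sh (an example name), it would be run as sh words.sh input_file.txt >ready.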
An example of awk combined with sort:
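A minimal sketch of what such a pipeline could look like, assuming a POSIX-style awk and sort. The function name awk_unique_words is made up for illustration, and this is not necessarily the command originally posted. awk does the splitting and drops repeats with an associative array, and sort only has to put the result in order.

```shell
#!/bin/sh
# Sketch only: awk_unique_words is a hypothetical name.
# On AOS4, sort would likely need to be sdk:local/c/sort.
awk_unique_words() {
    # awk splits every line into fields on whitespace; seen[] remembers
    # which words were already printed, so each word is emitted only once.
    awk '{ for (i = 1; i <= NF; i++) if (!seen[$i]++) print $i }' | sort
}

# Example: reads text on stdin, writes one unique word per line.
printf 'pear apple\napple fig pear\n' | awk_unique_words
```

Because awk already removes the duplicates, sort does not need -u here, and the sort step can be left out entirely if the order does not matter.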
4. You can also write the same in C or anything else.
PS. Blah, two days of silence, and when I write my post I realise that jaokim answered at the same time :) But maybe my answer can be helpful too.
I managed to create a list. Thanks jaokim and kas1e!
You could do it easily in Gui4Cli too (using any of the suggested tools, or using xlistview events and the Gui4Cli command set).
You would then just as easily have a GUI that you can expand upon. Working bottom-up, you can integrate new ideas and have a working GUI all the time.
I am interested in the possibility of collecting words too. A lot of applications can use this: as a basis for document search, selecting the most useful tag words, spellchecking and translation.
Why not integrate it in the OS, as a background task working on the set of text files (directories) you indicate? Maybe an application for the X1000's second processor?
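For picking tag words you would probably want counts as well as the words themselves. A rough sketch, assuming the usual Unix tr, sed, sort, uniq and awk are available (word_counts is a made-up name, and on AOS4 sort would again be the sdk:local/c/ one):

```shell
#!/bin/sh
# Sketch only: word_counts is a hypothetical name.
# Prints "count word" lines, most frequent word first.
word_counts() {
    # One word per line, empty lines dropped, equal words grouped by sort,
    # uniq -c prefixes each distinct word with its number of occurrences,
    # sort -rn orders by that number (descending), and awk tidies the
    # padding that uniq -c puts in front of the count.
    tr -s ' \t' '\n' | sed '/^$/d' | sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}

# Example run on a small inline text
printf 'a b a\nc a b\n' | word_counts
```

Piping the result through head would give just the top few candidates.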
I just added a tool written today in Gui4Cli:
http://www.os4coding.net/source/211