
Need a script to remove duplicate lines from files larger than 100 GB

This seems like a pretty good solution.
You could try this.

Python:
import sys
import fileinput


path_to_file = r'your\path\to\file'

with fileinput.FileInput(path_to_file, inplace=True, backup='.bak', mode='rb') as file:
    seen = set()
    for line in file:
        if line not in seen:
            seen.add(line)
            sys.stdout.buffer.write(line)
    print('OK!')
 
You could try this.

This only works for files that fit in RAM. More precisely, you need roughly file size × 3 of RAM, because everything you push into the set gets hashed and takes up extra space in memory. Still, it is a very convenient way to remove duplicates. With Python I see only one option: split the 100 GB into chunks and deduplicate each chunk with a set. That removes >90% of the duplicates, although identical lines may still remain across different chunks.
 
Please note that this user is banned
The script declares a variable for the user's desktop path using the GetFolderPath method of the Environment class. This allows the script to access the path to the desktop regardless of the user's operating system or language settings.

The script declares variables for the input and output paths using the desktop path variable. The input file is located on the desktop and is named "LargeFile.txt", and the output file will also be saved to the desktop and is named "Output.txt".

The script reads the contents of the input file into a string array using the Get-Content cmdlet.

The script creates a new array to hold the unique lines, and initializes it using the @() syntax.

The script begins a loop that will iterate over each line in the input file.

For each line, the script checks if the line is already in the unique lines array using the -notcontains operator. If the line is not in the array, it is added to the array using the += operator.

Once the loop has completed, the script writes the unique lines to the output file using the Set-Content cmdlet and the | (pipe) operator.



# Declare a variable for the user's desktop path
$desktopPath = [Environment]::GetFolderPath("Desktop")

# Declare variables for the input and output paths
$inputPath = "$desktopPath\LargeFile.txt"
$outputPath = "$desktopPath\Output.txt"

# Read the contents of the input file into a string array
$lines = Get-Content $inputPath

# Create a new array to hold the unique lines
$uniqueLines = @()

# Loop through each line in the input file
foreach ($line in $lines) {
    # Check if the line is already in the unique lines array
    if ($uniqueLines -notcontains $line) {
        # If it is not, add it to the array
        $uniqueLines += $line
    }
}

# Write the unique lines to the output file
$uniqueLines | Set-Content $outputPath
 
Python:
# Context managers close the files; readlines() into a set drops duplicates
with open('Top353Million-probable-v2.txt', 'r', encoding='utf-8', errors='ignore') as src:
    uniqlines = set(src.readlines())
with open('uniq.txt', 'a', encoding='utf-8') as out:
    out.writelines(uniqlines)
 
or, if the file is really huge
Python:
from itertools import islice

data_arr = []
step = 1000
with open('Top353Million-probable-v2.txt', 'r', encoding='utf-8', errors='ignore') as data:
    while True:
        # islice consumes the file iterator, so just take the next `step`
        # lines each pass instead of re-slicing from an absolute position
        value = list(islice(data, step))
        if value:
            for line in value:
                if line == "":
                    break
                data_arr.append(line)
        else:
            print("Done reading!")
            break

def get_unique_numbers(numbers):
    print('Writing unique values')
    unique = set()  # a set makes the membership test O(1) instead of O(n)
    with open('uniq.txt', 'a') as out:
        for number in numbers:
            if number not in unique:
                unique.add(number)
                out.write(number)  # the original wrote f'{i}', a stale variable
    print('All written!')

get_unique_numbers(data_arr)
 
Python:
import os

def remove_duplicate_lines(file_path: str, output_file: str = None):
    # Check if file exists
    if not os.path.exists(file_path):
        raise Exception(f"File not found: {file_path}")

    # Sort file using the sort utility
    sort_cmd = f"sort {file_path} | uniq > {output_file}" if output_file else f"sort {file_path} | uniq > {file_path}"
    os.system(sort_cmd)

def count_duplicate_lines(file_path: str):
    # Check if file exists
    if not os.path.exists(file_path):
        raise Exception(f"File not found: {file_path}")

    # Sort and count duplicates using the sort and uniq utilities
    count_cmd = f"sort {file_path} | uniq -c"
    output = os.popen(count_cmd).read()

    # Extract duplicates from the output
    duplicates = {}
    for line in output.strip().split("\n"):
        count, value = line.strip().split()
        duplicates[value] = int(count)

    # Return the count of duplicates
    return {k: v for k, v in duplicates.items() if v > 1}

# Example usage
file_path = "sample.txt"
remove_duplicate_lines(file_path)
duplicates = count_duplicate_lines(file_path)
for line, count in duplicates.items():
    print(f"{line} appeared {count} times")
 
You could try this.

Python:
import sys
import fileinput


path_to_file = r'your\path\to\file'

with fileinput.FileInput(path_to_file, inplace=True, backup='.bak', mode='rb') as file:
    seen = set()
    for line in file:
        if line not in seen:
            seen.add(line)
            sys.stdout.buffer.write(line)
    print('OK!')
It doesn't work:
AttributeError: '_io.BufferedWriter' object has no attribute 'buffer'
 
Python:
import os

def remove_duplicate_lines(file_path: str, output_file: str = None):
    # Check if file exists
    if not os.path.exists(file_path):
        raise Exception(f"File not found: {file_path}")

    # Sort file using the sort utility
    sort_cmd = f"sort {file_path} | uniq > {output_file}" if output_file else f"sort {file_path} | uniq > {file_path}"
    os.system(sort_cmd)

def count_duplicate_lines(file_path: str):
    # Check if file exists
    if not os.path.exists(file_path):
        raise Exception(f"File not found: {file_path}")

    # Sort and count duplicates using the sort and uniq utilities
    count_cmd = f"sort {file_path} | uniq -c"
    output = os.popen(count_cmd).read()

    # Extract duplicates from the output
    duplicates = {}
    for line in output.strip().split("\n"):
        count, value = line.strip().split()
        duplicates[value] = int(count)

    # Return the count of duplicates
    return {k: v for k, v in duplicates.items() if v > 1}

# Example usage
file_path = "sample.txt"
remove_duplicate_lines(file_path)
duplicates = count_duplicate_lines(file_path)
for line, count in duplicates.items():
    print(f"{line} appeared {count} times")
ValueError: not enough values to unpack (expected 2, got 0)
-- does not work.
 
ValueError: not enough values to unpack (expected 2, got 0)
-- does not work.
Python:
import os
import csv
import xlrd

from collections import defaultdict

def remove_duplicate_lines(file_path: str, output_file: str = None):
    # Check if file exists
    if not os.path.exists(file_path):
        raise Exception(f"File not found: {file_path}")

    # Get the file extension
    file_extension = os.path.splitext(file_path)[1].lower()

    # .xlsx is a binary workbook, not a text file, so it cannot be
    # deduplicated line-by-line with readlines()
    if file_extension == ".xlsx":
        raise Exception(f"Cannot deduplicate a binary file: {file_path}")

    # Read the file into a list using UTF-8 encoding
    with open(file_path, "r", encoding="utf-8") as f:
        lines = f.readlines()

    # Remove duplicates from the list using a set (.csv, .sql, .txt)
    lines = list(set(lines))

    # Write the lines back to the file or a new file using UTF-8 encoding
    with open(output_file or file_path, "w", encoding="utf-8") as f:
        f.writelines(lines)

def count_duplicate_lines(file_path: str):
    # Check if file exists
    if not os.path.exists(file_path):
        raise Exception(f"File not found: {file_path}")

    # Get the file extension
    file_extension = os.path.splitext(file_path)[1].lower()

    if file_extension == ".csv":
        # Read the CSV file into a list
        lines = []
        with open(file_path, "r", encoding="utf-8") as f:
            reader = csv.reader(f)
            for row in reader:
                lines.append(",".join(row))

    elif file_extension == ".xlsx":
        # Read the first column of the XLSX file into a list
        # (note: xlrd 2.0+ dropped .xlsx support; use xlrd < 2.0 or openpyxl)
        workbook = xlrd.open_workbook(file_path)
        sheet = workbook.sheet_by_index(0)
        lines = [sheet.cell_value(row, 0) for row in range(sheet.nrows)]

    elif file_extension == ".sql":
        # Read the SQL file into a list using UTF-8 encoding
        with open(file_path, "r", encoding="utf-8") as f:
            lines = f.readlines()

    elif file_extension == ".txt":
        # Read the text file into a list using UTF-8 encoding
        with open(file_path, "r", encoding="utf-8") as f:
            lines = f.readlines()

    else:
        # Otherwise `lines` would be undefined below
        raise Exception(f"Unsupported file extension: {file_extension}")

    # Count duplicates using a defaultdict
    duplicates = defaultdict(int)
    for line in lines:
        duplicates[line] += 1

    # Return the count of duplicates
    return {k: v for k, v in duplicates.items() if v > 1}

# Example usage
file_path="sample.txt"
duplicate_lines = count_duplicate_lines(file_path)

if duplicate_lines:
    print(f"Found {len(duplicate_lines)} duplicate lines:")
    for line, count in duplicate_lines.items():
        print(f"{line} - {count} times")

    remove_duplicate_lines(file_path)

    print("Duplicates removed.")
else:
    print("No duplicates found.")
 
Programming master!
 
Пожалуйста, обратите внимание, что пользователь заблокирован
The thread died down a while ago, but I'll post anyway. Here it is.
Works great on Linux.
The main thing is to have enough free disk space for it to run.
 
Python:
with open('result.txt', "r", encoding="utf8") as result:
    uniqlines = set(result.readlines())
    with open('rmdup.txt', 'w', encoding="utf8") as rmdup:
        rmdup.writelines(uniqlines)  # already a set; no need to wrap it again
 

