
Need a script to remove duplicate lines from files larger than 100 GB

This seems like a pretty good solution.
You could try this.

Python:
import sys
import fileinput


path_to_file = r'your\path\to\file'

with fileinput.FileInput(path_to_file, inplace=True, backup='.bak', mode='rb') as file:
    seen = set()
    for line in file:
        if line not in seen:
            seen.add(line)
            sys.stdout.buffer.write(line)
    print('OK!')
 
You could try this.

This only works for files that fit in RAM. More precisely, you need roughly file size × 3 of RAM, because everything you push into the set gets hashed and takes up extra space in memory. Still, it is a very convenient way to remove duplicates. With Python I see only one option: split the 100 GB into chunks and deduplicate each chunk with a set. That removes >90% of the duplicates, although identical lines may still remain across different chunks.
 
Please note that this user is banned
The script declares a variable for the user's desktop path using the GetFolderPath method of the Environment class. This allows the script to access the path to the desktop regardless of the user's operating system or language settings.

The script declares variables for the input and output paths using the desktop path variable. The input file is located on the desktop and is named "LargeFile.txt", and the output file will also be saved to the desktop and is named "Output.txt".

The script reads the contents of the input file into a string array using the Get-Content cmdlet.

The script creates a new array to hold the unique lines, and initializes it using the @() syntax.

The script begins a loop that will iterate over each line in the input file.

For each line, the script checks if the line is already in the unique lines array using the -notcontains operator. If the line is not in the array, it is added to the array using the += operator.

Once the loop has completed, the script writes the unique lines to the output file using the Set-Content cmdlet and the | (pipe) operator.



# Declare a variable for the user's desktop path
$desktopPath = [Environment]::GetFolderPath("Desktop")

# Declare variables for the input and output paths
$inputPath = "$desktopPath\LargeFile.txt"
$outputPath = "$desktopPath\Output.txt"

# Read the contents of the input file into a string array
$lines = Get-Content $inputPath

# Create a new array to hold the unique lines
$uniqueLines = @()

# Loop through each line in the input file
foreach ($line in $lines) {
    # Check if the line is already in the unique lines array
    if ($uniqueLines -notcontains $line) {
        # If it is not, add it to the array
        $uniqueLines += $line
    }
}

# Write the unique lines to the output file
$uniqueLines | Set-Content $outputPath
 
Python:
# Context managers close the files; readlines() into a set drops duplicates
with open('Top353Million-probable-v2.txt', 'r', encoding='utf-8', errors='ignore') as src:
    uniqlines = set(src.readlines())
with open('uniq.txt', 'a', encoding='utf-8') as out:
    out.writelines(uniqlines)
 
or, if the file is really huge
Python:
from itertools import islice

data_arr = []
step = 1000
with open('Top353Million-probable-v2.txt', 'r', encoding='utf-8', errors='ignore') as data:
    while True:
        # islice consumes the file iterator, so just take the next `step`
        # lines each pass instead of re-slicing from an absolute position
        value = list(islice(data, step))
        if value:
            for line in value:
                if line == "":
                    break
                data_arr.append(line)
        else:
            print("Done reading!")
            break

def get_unique_numbers(numbers):
    print('Writing unique values')
    unique = set()  # a set makes the membership test O(1) instead of O(n)
    with open('uniq.txt', 'a') as out:
        for number in numbers:
            if number not in unique:
                unique.add(number)
                out.write(number)  # the original wrote f'{i}', a stale variable
    print('All written!')

get_unique_numbers(data_arr)
 
Python:
import os

def remove_duplicate_lines(file_path: str, output_file: str = None):
    # Check if file exists
    if not os.path.exists(file_path):
        raise Exception(f"File not found: {file_path}")

    # Sort file using the sort utility
    sort_cmd = f"sort {file_path} | uniq > {output_file}" if output_file else f"sort {file_path} | uniq > {file_path}"
    os.system(sort_cmd)

def count_duplicate_lines(file_path: str):
    # Check if file exists
    if not os.path.exists(file_path):
        raise Exception(f"File not found: {file_path}")

    # Sort and count duplicates using the sort and uniq utilities
    count_cmd = f"sort {file_path} | uniq -c"
    output = os.popen(count_cmd).read()

    # Extract duplicates from the output
    duplicates = {}
    for line in output.strip().split("\n"):
        count, value = line.strip().split()
        duplicates[value] = int(count)

    # Return the count of duplicates
    return {k: v for k, v in duplicates.items() if v > 1}

# Example usage
file_path = "sample.txt"
remove_duplicate_lines(file_path)
duplicates = count_duplicate_lines(file_path)
for line, count in duplicates.items():
    print(f"{line} appeared {count} times")
 
You could try this.

Python:
import sys
import fileinput


path_to_file = r'your\path\to\file'

with fileinput.FileInput(path_to_file, inplace=True, backup='.bak', mode='rb') as file:
    seen = set()
    for line in file:
        if line not in seen:
            seen.add(line)
            sys.stdout.buffer.write(line)
    print('OK!')
It doesn't work:
AttributeError: '_io.BufferedWriter' object has no attribute 'buffer'
 
Python:
import os

def remove_duplicate_lines(file_path: str, output_file: str = None):
    # Check if file exists
    if not os.path.exists(file_path):
        raise Exception(f"File not found: {file_path}")

    # Sort file using the sort utility
    sort_cmd = f"sort {file_path} | uniq > {output_file}" if output_file else f"sort {file_path} | uniq > {file_path}"
    os.system(sort_cmd)

def count_duplicate_lines(file_path: str):
    # Check if file exists
    if not os.path.exists(file_path):
        raise Exception(f"File not found: {file_path}")

    # Sort and count duplicates using the sort and uniq utilities
    count_cmd = f"sort {file_path} | uniq -c"
    output = os.popen(count_cmd).read()

    # Extract duplicates from the output
    duplicates = {}
    for line in output.strip().split("\n"):
        count, value = line.strip().split()
        duplicates[value] = int(count)

    # Return the count of duplicates
    return {k: v for k, v in duplicates.items() if v > 1}

# Example usage
file_path = "sample.txt"
remove_duplicate_lines(file_path)
duplicates = count_duplicate_lines(file_path)
for line, count in duplicates.items():
    print(f"{line} appeared {count} times")
ValueError: not enough values to unpack (expected 2, got 0)
-- does not work.
 
ValueError: not enough values to unpack (expected 2, got 0)
-- does not work.
Python:
import os
import csv
import xlrd

from collections import defaultdict

def remove_duplicate_lines(file_path: str, output_file: str = None):
    # Check if file exists
    if not os.path.exists(file_path):
        raise Exception(f"File not found: {file_path}")

    # Get the file extension
    file_extension = os.path.splitext(file_path)[1].lower()

    # .xlsx is a binary workbook, not a text file, so it cannot be
    # deduplicated line-by-line with readlines()
    if file_extension == ".xlsx":
        raise Exception(f"Cannot deduplicate a binary file: {file_path}")

    # Read the file into a list using UTF-8 encoding
    with open(file_path, "r", encoding="utf-8") as f:
        lines = f.readlines()

    # Remove duplicates from the list using a set (.csv, .sql, .txt)
    lines = list(set(lines))

    # Write the lines back to the file or a new file using UTF-8 encoding
    with open(output_file or file_path, "w", encoding="utf-8") as f:
        f.writelines(lines)

def count_duplicate_lines(file_path: str):
    # Check if file exists
    if not os.path.exists(file_path):
        raise Exception(f"File not found: {file_path}")

    # Get the file extension
    file_extension = os.path.splitext(file_path)[1].lower()

    if file_extension == ".csv":
        # Read the CSV file into a list
        lines = []
        with open(file_path, "r", encoding="utf-8") as f:
            reader = csv.reader(f)
            for row in reader:
                lines.append(",".join(row))

    elif file_extension == ".xlsx":
        # Read the first column of the XLSX file into a list
        # (note: xlrd 2.0+ dropped .xlsx support; use xlrd < 2.0 or openpyxl)
        workbook = xlrd.open_workbook(file_path)
        sheet = workbook.sheet_by_index(0)
        lines = [sheet.cell_value(row, 0) for row in range(sheet.nrows)]

    elif file_extension == ".sql":
        # Read the SQL file into a list using UTF-8 encoding
        with open(file_path, "r", encoding="utf-8") as f:
            lines = f.readlines()

    elif file_extension == ".txt":
        # Read the text file into a list using UTF-8 encoding
        with open(file_path, "r", encoding="utf-8") as f:
            lines = f.readlines()

    else:
        # Otherwise `lines` would be undefined below
        raise Exception(f"Unsupported file extension: {file_extension}")

    # Count duplicates using a defaultdict
    duplicates = defaultdict(int)
    for line in lines:
        duplicates[line] += 1

    # Return the count of duplicates
    return {k: v for k, v in duplicates.items() if v > 1}

# Example usage
file_path="sample.txt"
duplicate_lines = count_duplicate_lines(file_path)

if duplicate_lines:
    print(f"Found {len(duplicate_lines)} duplicate lines:")
    for line, count in duplicate_lines.items():
        print(f"{line} - {count} times")

    remove_duplicate_lines(file_path)

    print("Duplicates removed.")
else:
    print("No duplicates found.")
 
Programming master!
 
Пожалуйста, обратите внимание, что пользователь заблокирован
The thread died down a while ago, but I'll post anyway. Here it is.
Works great on Linux.
The main thing is to have enough free disk space for it to run.
 
Python:
with open('result.txt', "r", encoding="utf8") as result:
    uniqlines = set(result.readlines())
    with open('rmdup.txt', 'w', encoding="utf8") as rmdup:
        rmdup.writelines(uniqlines)  # already a set; no need to wrap it again
 

